Lessons: Porting a Streaming Pipeline from Scala to Rust
2023 Scale by the Bay
Evan Chan
Principal Engineer - Conviva
http://velvia.github.io/presentations/2023-conviva-scala-to-rust
1 / 38
Conviva
2 / 38
Massive Real-time Streaming Analytics
5 trillion events processed per day
800-2000GB/hour (not peak!!)
Started with custom Java code
went through Spark Streaming and Flink iterations
Most backend data components in production are written in Scala
Today: 420 pods running custom Akka Streams processors
3 / 38
Data World is Going Native and Rust
Going native: Python, end of Moore's Law, cloud compute
Safe, fast, and high-level abstractions
Functional data patterns - map, fold, pattern matching, etc.
Static dispatch and no allocations by default
PyO3 - Rust is the best way to write native Python extensions
JVM | Rust projects
Spark, Hive | DataFusion, Ballista, Amadeus
Flink | Arroyo, RisingWave, Materialize
Kafka/KSQL | Fluvio
ElasticSearch / Lucene | Toshi, MeiliDB
Cassandra, HBase | Skytable, Sled, Sanakirja...
Neo4J | TerminusDB, IndraDB
4 / 38
About our Architecture
graph LR;
SAE(Streaming Data Pipeline)
Sensors --> Gateways
Gateways --> Kafka
Kafka --> SAE
SAE --> DB[(Metrics Database)]
DB --> Dashboards
5 / 38
What We Are Porting to Rust
graph LR;
classDef highlighted fill:#99f,stroke:#333,stroke-width:4px
SAE(Streaming Data Pipeline)
Sensors:::highlighted --> Gateways:::highlighted
Gateways --> Kafka
Kafka --> SAE:::highlighted
SAE --> DB[(Metrics Database)]
DB --> Dashboards

graph LR;
Notes1(Sensors: consolidate fragmented code base)
Notes2(Gateway: Improve on JVM and Go)
Notes3(Pipeline: Improve efficiency, New operator architecture)
Notes1 ~~~ Notes2
Notes2 ~~~ Notes3
6 / 38
Our Journey to Rust
gantt
title From Hackathon to Multiple Teams
dateFormat YYYY-MM
axisFormat %y-%b
section Data Pipeline
Hackathon :Small Kafka ingestion project, 2022-11, 30d
Scala prototype :2023-02, 6w
Initial Rust Port : small team, 2023-04, 45d
Bring on more people :2023-07, 8w
20-25 people 4 teams :2023-11, 1w
section Gateway
Go port :2023-07, 6w
Rust port :2023-09, 4w
“I like that if it compiles, I know it will work, so it gives confidence.”
7 / 38
Promising Rust Hackathon
graph LR;
Kafka --> RustDeser(Rust Deserializer)
RustDeser --> RA(Rust Actors - Lightweight Processing)
Measurement | Improvement over Scala/Akka
Throughput (CPU) | 2.6x more
Memory used | 12x less
Mostly I/O-bound lightweight deserialization and processing workload
Found out Actix does not work well with Tokio
8 / 38
Performance Results - Gateway
9 / 38
Key Lessons or Questions
What matters for a Rust port?
The 4 P's?
People | How do we bring developers onboard?
Performance | How do I get performance? Data structures? Static dispatch?
Patterns | What coding patterns port well from Scala? Async?
Project | How do I build? Tooling, IDEs?
10 / 38
People
How do we bring developers onboard?
11 / 38
A Phased Rust Bringup
We ported our main data pipeline in two phases:
Phase | Team | Rust Expertise | Work
First | 3-5, very senior | 1-2 with significant Rust | Port core project components
Second | 10-15, mixed, distributed | Most with zero Rust | Smaller, broken down tasks
Have organized list of learning resources
2-3 weeks to learn Rust and come up to speed
12 / 38
Overcoming Challenges
Difficulties:
Lifetimes
Compiler errors
Porting previous patterns
Ownership and async
etc.
How we helped:
Good docs
Start with tests
ChatGPT!
Rust Book
Office hours
Lots of detailed reviews
Split project into async and sync cores
13 / 38
Performance
Data structures, static dispatch, etc.
"I enjoy the fact that the default route is performant. It makes you write
performant code, and if you go out the way, it becomes explicit (e.g., with dyn,
Boxed, or clone etc). "
14 / 38
Porting from Scala: Huge Performance Win
graph LR;
classDef highlighted fill:#99f,stroke:#333,stroke-width:4px
SAE(Streaming Data Pipeline)
Sensors --> Gateways
Gateways --> Kafka
Kafka --> SAE:::highlighted
SAE --> DB[(Metrics Database)]
DB --> Dashboards
CPU-bound, programmable, heavy data processing
Neither the Rust nor the Scala version is productionized or optimized
Same architecture and same input/outputs
Scala version was not designed for speed, lots of objects
Rust: we chose static dispatch and minimizing allocations
Type of comparison | Improvement over Scala
Throughput, end to end | 22x
Throughput, single-threaded microbenchmark | >= 40x
15 / 38
Building a Flexible Data Pipeline
graph LR;
RawEvents(Raw Events)
RawEvents -->| List of numbers | Extract1
RawEvents --> Extract2
Extract1 --> DoSomeMath
Extract2 --> TransformSomeFields
DoSomeMath --> Filter1
TransformSomeFields --> Filter1
Filter1 --> MoreProcessing
An interpreter passes time-ordered data between the operators of a flexible DAG.
Span1
Start time: 1000
End time: 1100
Events: ["start", "click"]
Span2
Start time: 1100
End time: 1300
Events: ["ad_load"]
16 / 38
Data Structures: Scala vs Rust
Scala: Object Graph on Heap
graph TB;
classDef default font-size:24px
ArraySpan["`Array[Span]`"]
TL(Timeline - Seq) --> ArraySpan
ArraySpan --> Span1["`Span(start, end, Payload)`"]
ArraySpan --> Span2["`Span(start, end, Payload)`"]
Span1 --> EventsAtSpanEnd("`Events(Seq[A])`")
EventsAtSpanEnd --> ArrayEvent["`Array[A]`"]
Rust: mostly stack based / 0 alloc:
flowchart TB;
subgraph Timeline
  subgraph OutputSpans
    subgraph Span1
      subgraph Events
        EvA ~~~ EvB
      end
      TimeInterval ~~~ Events
    end
    subgraph Span2
      Time2 ~~~ Events2
    end
    Span1 ~~~ Span2
  end
  DataType ~~~ OutputSpans
end
17 / 38
Rust: Using Enums and Avoiding Boxing
pub enum Timeline {
EventNumber(OutputSpans<EventsAtEnd<f64>>),
EventBoolean(OutputSpans<EventsAtEnd<bool>>),
EventString(OutputSpans<EventsAtEnd<DataString>>),
}
type OutputSpans<V> = SmallVec<[Span<V>; 2]>;
pub struct Span<SV: SpanValue> {
pub time: TimeInterval,
pub value: SV,
}
pub struct EventsAtEnd<V>(SmallVec<[V; 1]>);
In the above, the Timeline enum can fit entirely on the stack and avoid all
boxing and allocations, if:
The number of spans is very small, within the inline limit set in code (2 above)
The number of events in each span is very small (1 in this case, which is
the common case)
The base type is a primitive, or a string which is below a certain length
18 / 38
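To make the shape of the Timeline enum concrete, here is a minimal sketch that rebuilds the Span1/Span2 example from the pipeline slide. It is not the production code: plain Vec and String stand in for SmallVec and DataString (so this version does allocate), and TimeInterval's fields, event_count, and example_timeline are hypothetical names introduced only for illustration.

```rust
// Sketch only: Vec and String stand in for SmallVec and DataString,
// so unlike the version in the slides this one allocates on the heap.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TimeInterval {
    pub start: u64, // assumed field names
    pub end: u64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Span<V> {
    pub time: TimeInterval,
    pub value: V,
}

#[derive(Debug, Clone, PartialEq)]
pub struct EventsAtEnd<V>(pub Vec<V>);

type OutputSpans<V> = Vec<Span<V>>;

// One variant per payload type: no Box<dyn ...> and no Any.
#[derive(Debug, Clone, PartialEq)]
pub enum Timeline {
    EventNumber(OutputSpans<EventsAtEnd<f64>>),
    EventBoolean(OutputSpans<EventsAtEnd<bool>>),
    EventString(OutputSpans<EventsAtEnd<String>>),
}

impl Timeline {
    /// Total number of events across all spans (hypothetical helper).
    pub fn event_count(&self) -> usize {
        fn count<V>(spans: &[Span<EventsAtEnd<V>>]) -> usize {
            spans.iter().map(|s| s.value.0.len()).sum()
        }
        // The match is resolved statically for each variant.
        match self {
            Timeline::EventNumber(s) => count(s),
            Timeline::EventBoolean(s) => count(s),
            Timeline::EventString(s) => count(s),
        }
    }
}

/// Builds the two example spans shown on the pipeline slide.
pub fn example_timeline() -> Timeline {
    Timeline::EventString(vec![
        Span {
            time: TimeInterval { start: 1000, end: 1100 },
            value: EventsAtEnd(vec!["start".to_string(), "click".to_string()]),
        },
        Span {
            time: TimeInterval { start: 1100, end: 1300 },
            value: EventsAtEnd(vec!["ad_load".to_string()]),
        },
    ])
}
```

Calling example_timeline().event_count() walks both spans without any dynamic dispatch; swapping Vec for a small-vector type is what would remove the allocations.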
Avoiding Allocations using SmallVec and SmallString
SmallVec is something like this:
pub enum SmallVec<T, const N: usize> {
Stack([T; N]),
Heap(Vec<T>),
}
The enum can hold up to N items inline in an array with no allocations, but
switches to the Heap variant if the number of items exceeds N.
There are various crates for small strings and other data structures.
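The stack-then-heap idea can be sketched in safe, std-only Rust. This is a toy, not the smallvec crate: TinyVec and its methods are hypothetical names, and requiring T: Copy + Default is a simplification that keeps the example free of unsafe code.

```rust
/// Toy small-vector: holds up to N items inline, then spills to a heap Vec.
/// Hypothetical illustration; the real `smallvec` crate is more general.
#[derive(Debug)]
pub enum TinyVec<T: Copy + Default, const N: usize> {
    Stack { items: [T; N], len: usize },
    Heap(Vec<T>),
}

impl<T: Copy + Default, const N: usize> TinyVec<T, N> {
    /// Starts inline with no heap allocation.
    pub fn new() -> Self {
        TinyVec::Stack { items: [T::default(); N], len: 0 }
    }

    pub fn push(&mut self, value: T) {
        let spilled = match self {
            TinyVec::Heap(v) => {
                v.push(value);
                return;
            }
            TinyVec::Stack { items, len } => {
                if *len < N {
                    // Still room inline: no allocation.
                    items[*len] = value;
                    *len += 1;
                    return;
                }
                // Inline storage is full: copy everything into a Vec.
                let mut v = Vec::with_capacity(N + 1);
                v.extend_from_slice(&items[..]);
                v.push(value);
                v
            }
        };
        *self = TinyVec::Heap(spilled);
    }

    pub fn as_slice(&self) -> &[T] {
        match self {
            TinyVec::Stack { items, len } => &items[..*len],
            TinyVec::Heap(v) => v.as_slice(),
        }
    }

    /// True while the items still live inline.
    pub fn is_inline(&self) -> bool {
        matches!(self, TinyVec::Stack { .. })
    }
}
```

With N = 2, the first two pushes stay inline and only the third triggers a heap allocation, which is the behavior the Timeline types rely on for the common small case.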
19 / 38
Static vs Dynamic Dispatch
Often one will need to work with many different structs that implement a Trait
-- for us, different operator implementations supporting different types. Static
dispatch with inlined code is much faster than dynamic dispatch.
1. Monomorphisation using generics
fn execute_op<O: Operator>(op: O) -> Result<...>
Compiler creates a new instance of execute_op for every different O
Only works when you know in advance what Operator to pass in
2. Use Enums and enum_dispatch
fn execute_op(op: OperatorEnum) -> Result<...>
3. Dynamic dispatch
fn execute_op(op: Box<dyn Operator>) -> Result<...>
fn execute_op(op: &dyn Operator) -> Result<...> (avoids allocation)
4. Function wrapping
Embedding functions in a generic struct
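As a rough illustration of options 1 and 3 above, the sketch below puts generic (monomorphised) and dyn-based dispatch side by side. The Operator trait here is a simplified stand-in with a single apply method, and Double, AddOne, and run_pipeline are hypothetical names, not the pipeline's real operators.

```rust
// Simplified stand-in for the Operator trait; the real one is not shown in the slides.
pub trait Operator {
    fn apply(&self, x: f64) -> f64;
}

pub struct Double;
pub struct AddOne;

impl Operator for Double {
    fn apply(&self, x: f64) -> f64 {
        x * 2.0
    }
}

impl Operator for AddOne {
    fn apply(&self, x: f64) -> f64 {
        x + 1.0
    }
}

// 1. Static dispatch: the compiler emits one copy of this function per
//    concrete O, so the call to apply can be inlined.
pub fn execute_static<O: Operator>(op: &O, x: f64) -> f64 {
    op.apply(x)
}

// 3. Dynamic dispatch: a single copy, with apply looked up through a vtable.
//    Taking &dyn avoids the allocation a Box<dyn Operator> would need.
pub fn execute_dyn(op: &dyn Operator, x: f64) -> f64 {
    op.apply(x)
}

// Dynamic dispatch is what allows a pipeline whose operators are chosen at runtime.
pub fn run_pipeline(ops: &[Box<dyn Operator>], x: f64) -> f64 {
    ops.iter().fold(x, |acc, op| op.apply(acc))
}
```

The trade-off is the one named in the list: execute_static needs the operator type at compile time, while run_pipeline accepts any mix of operators at the cost of an indirect call per step.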
20 / 38
enum_dispatch
Suppose you have
trait KnobControl {
fn set_position(&mut self, value: f64);
fn get_value(&self) -> f64;
}
struct LinearKnob {
position: f64,
}
struct LogarithmicKnob {
position: f64,
}
impl KnobControl for LinearKnob...
enum_dispatch lets you do this:
#[enum_dispatch]
trait KnobControl {
//...
}
21 / 38
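For readers who have not used enum_dispatch, the following hand-written sketch shows roughly the kind of code the macro saves you from writing for the KnobControl example: an enum with one variant per implementor, a forwarding impl of the trait, and From conversions. It uses no external crate. The Knob enum name and the logarithmic formula are assumptions made for illustration, and the fields are made pub only to keep the example short.

```rust
pub trait KnobControl {
    fn set_position(&mut self, value: f64);
    fn get_value(&self) -> f64;
}

pub struct LinearKnob {
    pub position: f64,
}

pub struct LogarithmicKnob {
    pub position: f64,
}

impl KnobControl for LinearKnob {
    fn set_position(&mut self, value: f64) {
        self.position = value;
    }
    fn get_value(&self) -> f64 {
        self.position
    }
}

impl KnobControl for LogarithmicKnob {
    fn set_position(&mut self, value: f64) {
        self.position = value;
    }
    fn get_value(&self) -> f64 {
        // Assumed formula, chosen only so the two knobs behave differently.
        (self.position + 1.0).ln()
    }
}

// Hand-written equivalent of an enum-dispatched type (hypothetical name Knob).
pub enum Knob {
    LinearKnob(LinearKnob),
    LogarithmicKnob(LogarithmicKnob),
}

// Each trait method forwards to the active variant with a plain match,
// so there is no vtable lookup and no Box.
impl KnobControl for Knob {
    fn set_position(&mut self, value: f64) {
        match self {
            Knob::LinearKnob(k) => k.set_position(value),
            Knob::LogarithmicKnob(k) => k.set_position(value),
        }
    }
    fn get_value(&self) -> f64 {
        match self {
            Knob::LinearKnob(k) => k.get_value(),
            Knob::LogarithmicKnob(k) => k.get_value(),
        }
    }
}

impl From<LinearKnob> for Knob {
    fn from(k: LinearKnob) -> Self {
        Knob::LinearKnob(k)
    }
}

impl From<LogarithmicKnob> for Knob {
    fn from(k: LogarithmicKnob) -> Self {
        Knob::LogarithmicKnob(k)
    }
}
```

A Vec<Knob> can then hold both knob kinds inline and call the trait methods on each element, which is option 2 from the dispatch list.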
Function wrapping
Static function wrapping - no generics
pub struct OperatorWrapper {
name: String,
func: fn(input: &Data) -> Data,
}
Need a generic - but accepts closures
pub struct OperatorWrapper<F>
where F: Fn(&Data) -> Data {
name: String,
func: F,
}
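The two wrapper styles can be compared in a small self-contained sketch. Data is a hypothetical newtype over Vec<f64>, and FnOperator, ClosureOperator, and double_all are illustrative names; the slides do not show the real definitions.

```rust
// Hypothetical payload type for the example.
#[derive(Debug, Clone, PartialEq)]
pub struct Data(pub Vec<f64>);

/// No generics: stores a plain function pointer, so it cannot capture state.
pub struct FnOperator {
    name: String,
    func: fn(&Data) -> Data,
}

impl FnOperator {
    pub fn new(name: &str, func: fn(&Data) -> Data) -> Self {
        FnOperator { name: name.to_string(), func }
    }
    pub fn name(&self) -> &str {
        &self.name
    }
    pub fn run(&self, input: &Data) -> Data {
        (self.func)(input)
    }
}

/// Generic over F: accepts closures that capture state, and each F is monomorphised.
pub struct ClosureOperator<F>
where
    F: Fn(&Data) -> Data,
{
    name: String,
    func: F,
}

impl<F> ClosureOperator<F>
where
    F: Fn(&Data) -> Data,
{
    pub fn new(name: &str, func: F) -> Self {
        ClosureOperator { name: name.to_string(), func }
    }
    pub fn name(&self) -> &str {
        &self.name
    }
    pub fn run(&self, input: &Data) -> Data {
        (self.func)(input)
    }
}

/// A free function usable with either wrapper.
pub fn double_all(input: &Data) -> Data {
    Data(input.0.iter().map(|x| x * 2.0).collect())
}
```

A closure such as `move |d: &Data| ...` that captures a local offset only fits ClosureOperator; the price is that every distinct closure produces a distinct wrapper type.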
22 / 38
Patterns
Async, Type Classes, etc.
23 / 38
Rust Async: Different Paradigms
"Async: It is well designed... Yes, it is still pretty complicated piece of code, but
the logic or the framework is easier to grasp compared to other languages."
Having to use Arc: Data Structures are not Thread-safe by default!
Scala | Rust
Futures | futures, async functions
?? | async-await
Actors (Akka) | Actix, Bastion, etc.
Async streams | Tokio streams
Reactive (Akka streams, Monix, ZIO) | reactive_rs, rxRust, etc.
24 / 38
Replacing Akka: Actors in Rust
Actix threading model doesn't mix well with Tokio
We moved to tiny-tokio-actor, then wrote our own
pub struct AnomalyActor {}

#[async_trait]
impl ChannelActor<Anomaly, AnomalyActorError> for AnomalyActor {
    async fn handle(
        &mut self,
        msg: Anomaly,
        ctx: &mut ActorContext<Anomaly>,
    ) -> Result<(), Report<AnomalyActorError>> {
        use Anomaly::*;
        match msg {
            // Anomaly payload fields are ignored in this example
            QuantityOverflowAnomaly {
                ctx: _, ts: _, qual: _,
                qty: _, cnt: _, data: _,
            } => {}
            // PoisonPill stops the actor
            PoisonPill => {
                ctx.stop();
            }
        }
        Ok(())
    }
}
25 / 38
Other Patterns to Learn
Old Pattern | New Pattern
No inheritance | Use composition! - Compose data structures - Compose small Traits
No exceptions | Use Result and ?
Data structures are not Thread safe | Learn to use Arc etc.
Returning Iterators | Don't return things that borrow other things. This makes life difficult.
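Two of the rows above, Result with ? and Arc for shared state, can be shown in a short std-only sketch. The function names parse_pair and parallel_sum are hypothetical, and Mutex is just one possible way to guard the shared total.

```rust
use std::num::ParseFloatError;
use std::sync::{Arc, Mutex};
use std::thread;

/// No exceptions: each `?` returns early with the error instead of throwing.
pub fn parse_pair(a: &str, b: &str) -> Result<f64, ParseFloatError> {
    let x: f64 = a.trim().parse()?;
    let y: f64 = b.trim().parse()?;
    Ok(x + y)
}

/// Shared mutable state across threads must be wrapped explicitly,
/// here in Arc (shared ownership) plus Mutex (exclusive access).
pub fn parallel_sum(chunks: Vec<Vec<u64>>) -> u64 {
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                let partial: u64 = chunk.iter().sum();
                *total.lock().unwrap() += partial;
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}
```

Leaving out the Arc or the Mutex is a compile error rather than a data race, which is the sense in which data structures are "not thread safe by default".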
26 / 38
Type Classes
In Rust, type classes (Traits) are smaller and more compositional.
pub trait Inhale {
fn sniff(&self);
}
You can implement new Traits for existing types, and have different impl's for
different types.
impl Inhale for String {
fn sniff(&self) {
println!("I sniffed {}", self);
}
}
// Only implemented for specific N subtypes of MyStruct
impl<N: Numeric> Inhale for MyStruct<N> {
fn sniff(&self) {
....
}
}
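A compilable variant of the snippets above might look like the following. To keep it easy to check, sniff returns a String instead of printing, which differs from the slide; the Numeric trait and the value field of MyStruct are assumptions, since the slide elides them.

```rust
use std::fmt::Display;

// Variation on the slide: sniff returns a String rather than printing.
pub trait Inhale {
    fn sniff(&self) -> String;
}

// A new trait can be implemented for an existing standard-library type.
impl Inhale for String {
    fn sniff(&self) -> String {
        format!("I sniffed {}", self)
    }
}

// Hypothetical marker trait for numeric payloads.
pub trait Numeric: Display {}
impl Numeric for i64 {}
impl Numeric for f64 {}

pub struct MyStruct<N> {
    pub value: N, // assumed field
}

// Only MyStruct<N> with N: Numeric gets this impl;
// MyStruct<&str> would have no sniff method at all.
impl<N: Numeric> Inhale for MyStruct<N> {
    fn sniff(&self) -> String {
        format!("I sniffed the number {}", self.value)
    }
}
```

Calling sniff on a MyStruct holding a non-Numeric type is rejected at compile time, which is how the conditional impl restricts behavior to specific type parameters.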
27 / 38
Project
Build, IDE, Tooling
28 / 38
"Cargo is the best build tool ever"
Almost no dependency conflicts due to multiple dep versioning
Configuration by convention - common directory/file layouts for example
Really simple .toml - no need for XML, functional Scala, etc.
Rarely need code to build anything, even for large projects
[package]
name = "telemetry-subscribers"
version = "0.3.0"
license = "Apache-2.0"
description = "Library for common telemetry and observability functionality"
[dependencies]
console-subscriber = { version = "0.1.6", optional = true }
crossterm = "0.25.0"
once_cell = "1.13.0"
opentelemetry = { version = "0.18.0", features = ["rt-tokio"], optional = true }
29 / 38
IDEs, CI, and Tooling
IDEs/Editors | VSCode, RustRover (IntelliJ), vim/emacs/etc with Rust Analyzer
Code Coverage | VSCode inline, grcov/lcov, Tarpaulin (Linux only)
Slow build times | Caching: cargo-chef, rust-cache
Slow test times | cargo-nextest
Property Testing | proptest
Benchmarking | Criterion
https://blog.logrocket.com/optimizing-ci-cd-pipelines-rust-projects/
VSCode's "LiveShare" feature for distributed pair programming is TOP NOTCH.
30 / 38
Rust Resources and Projects
https://github.com/velvia/links/blob/main/rust.md - this is my list of Rust
projects and learning resources
https://github.com/rust-unofficial/awesome-rust
https://www.arewelearningyet.com - ML focused
31 / 38
What do we miss from Scala?
More mature libraries - in some cases: HDFS, etc.
Good streaming libraries - like Monix, Akka Streams etc.
I guess all of Akka
"Less misleading compiler messages"
Rust error messages read better from the CLI, IMO (not an IDE)
32 / 38
Takeaways
It's a long journey but Rust is worth it.
Structuring a project for successful onramp is really important
Think about data structure design early on
Allow plenty of time to ramp up on Rust patterns, tools
We are hiring across multiple roles/levels!
33 / 38
https://velvia.github.io/about
https://github.com/velvia
@evanfchan
IG: @platypus.arts
Thank You Very Much!
34 / 38
Extra slides
35 / 38
Data World is Going Native (from JVM)
The rise of Python and Data Science
Led to AnyScale, Dask, and many other Python-oriented data
frameworks
Rise of newer, developer-friendly native languages (Go, Swift, Rust, etc.)
Migration from Hadoop/HDFS to more cloud-based data architectures
Apache Arrow and other data interchange formats
Hardware architecture trends - end of Moore's Law, rise of GPUs etc
36 / 38
Why We Went with our Own Actors
1. Initial Hackathon prototype used Actix
Actix has its own event-loop / threading model, using Arbiters
Difficult to co-exist with Tokio and configure both
2. Moved to tiny-tokio-actor
Really thin layer on top of Tokio
25% improvement over rdkafka + Tokio + Actix
3. Ultimately wrote our own, 100-line mini Actor framework
tiny-tokio-actor required messages to be Clone so we could not, for
example, send OneShot channels for other actors to reply
Wanted ActorRef<MessageType> instead of ActorRef<ActorType,
MessageType>
supports tell() and ask() semantics
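The tell()/ask() idea with an ActorRef typed only by its message can be sketched in a few lines. This is not the Conviva framework: it swaps Tokio for std threads and std::sync::mpsc so it runs without external crates, and CounterMsg, spawn_counter, and ask_total are hypothetical names. Note that the message enum does not derive Clone and carries a reply channel, which is the kind of message the slide says tiny-tokio-actor could not accept.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Messages need not be Clone, so they can carry a reply channel.
pub enum CounterMsg {
    Add(i64),
    Get(Sender<i64>), // reply channel used for ask()
    PoisonPill,
}

/// Typed by message only: ActorRef<MessageType>.
pub struct ActorRef<M> {
    tx: Sender<M>,
}

impl<M> ActorRef<M> {
    /// Fire-and-forget send; returns false if the actor has stopped.
    pub fn tell(&self, msg: M) -> bool {
        self.tx.send(msg).is_ok()
    }
}

impl ActorRef<CounterMsg> {
    /// Request/response: send a reply channel and block on the answer.
    pub fn ask_total(&self) -> Option<i64> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx.send(CounterMsg::Get(reply_tx)).ok()?;
        reply_rx.recv().ok()
    }
}

/// Spawns the actor on its own thread (std threads stand in for Tokio tasks).
pub fn spawn_counter() -> (ActorRef<CounterMsg>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut total = 0i64;
        // Messages are processed one at a time, in arrival order.
        for msg in rx {
            match msg {
                CounterMsg::Add(n) => total += n,
                CounterMsg::Get(reply) => {
                    let _ = reply.send(total);
                }
                CounterMsg::PoisonPill => break,
            }
        }
    });
    (ActorRef { tx }, handle)
}
```

Because the mailbox is a single FIFO channel, an ask issued after two tell calls from the same sender observes both updates. An async version would replace the thread with a spawned task and the reply channel with a oneshot channel.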
37 / 38
Scala: Object Graphs and Any
class Timeline extends BufferedIterator[Span[Payload]]
final case class Span[+A](start: Timestamp, end: Timestamp, payload: A) {
def mapPayload[B](f: A => B): Span[B] = copy(payload = f(payload))
}
type Event[+A] = Span[EventsAtSpanEnd[A]]
@newtype final case class EventsAtSpanEnd[+A](events: Iterable[A])
BufferedIterator must be on the heap
Each Span Payload is also boxed and on the heap, even for numbers
To be dynamically interpretable, we need BufferedIterator[Span[Any]]
in many places :(
Yes, specialization is possible, at the cost of complexity
38 / 38

More Related Content

What's hot

Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introductionAlexey Grigorev
 
Présentation de Django @ Orange Labs (FR)
Présentation de Django @ Orange Labs (FR)Présentation de Django @ Orange Labs (FR)
Présentation de Django @ Orange Labs (FR)Martin Latrille
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...confluent
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Debugging concurrency programs in go
Debugging concurrency programs in goDebugging concurrency programs in go
Debugging concurrency programs in goAndrii Soldatenko
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Observability with Consul Connect
Observability with Consul ConnectObservability with Consul Connect
Observability with Consul ConnectBram Vogelaar
 
End-to-end Streaming Between gRPC Services Via Kafka with John Fallows
End-to-end Streaming Between gRPC Services Via Kafka with John FallowsEnd-to-end Streaming Between gRPC Services Via Kafka with John Fallows
End-to-end Streaming Between gRPC Services Via Kafka with John FallowsHostedbyConfluent
 
Dual write strategies for microservices
Dual write strategies for microservicesDual write strategies for microservices
Dual write strategies for microservicesBilgin Ibryam
 
Designing and Building a Graph Database Application – Architectural Choices, ...
Designing and Building a Graph Database Application – Architectural Choices, ...Designing and Building a Graph Database Application – Architectural Choices, ...
Designing and Building a Graph Database Application – Architectural Choices, ...Neo4j
 
Hexagonal architecture for java applications
Hexagonal architecture for java applicationsHexagonal architecture for java applications
Hexagonal architecture for java applicationsFabricio Epaminondas
 
Ingesting streaming data into Graph Database
Ingesting streaming data into Graph DatabaseIngesting streaming data into Graph Database
Ingesting streaming data into Graph DatabaseGuido Schmutz
 
An overview of Neo4j Internals
An overview of Neo4j InternalsAn overview of Neo4j Internals
An overview of Neo4j InternalsTobias Lindaaker
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Databricks
 

What's hot (20)

Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
 
Présentation de Django @ Orange Labs (FR)
Présentation de Django @ Orange Labs (FR)Présentation de Django @ Orange Labs (FR)
Présentation de Django @ Orange Labs (FR)
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Debugging concurrency programs in go
Debugging concurrency programs in goDebugging concurrency programs in go
Debugging concurrency programs in go
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Observability with Consul Connect
Observability with Consul ConnectObservability with Consul Connect
Observability with Consul Connect
 
End-to-end Streaming Between gRPC Services Via Kafka with John Fallows
End-to-end Streaming Between gRPC Services Via Kafka with John FallowsEnd-to-end Streaming Between gRPC Services Via Kafka with John Fallows
End-to-end Streaming Between gRPC Services Via Kafka with John Fallows
 
Dual write strategies for microservices
Dual write strategies for microservicesDual write strategies for microservices
Dual write strategies for microservices
 
Designing and Building a Graph Database Application – Architectural Choices, ...
Designing and Building a Graph Database Application – Architectural Choices, ...Designing and Building a Graph Database Application – Architectural Choices, ...
Designing and Building a Graph Database Application – Architectural Choices, ...
 
Hexagonal architecture for java applications
Hexagonal architecture for java applicationsHexagonal architecture for java applications
Hexagonal architecture for java applications
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Ingesting streaming data into Graph Database
Ingesting streaming data into Graph DatabaseIngesting streaming data into Graph Database
Ingesting streaming data into Graph Database
 
Support distributed computing and caching avec hazelcast
Support distributed computing and caching avec hazelcastSupport distributed computing and caching avec hazelcast
Support distributed computing and caching avec hazelcast
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
An overview of Neo4j Internals
An overview of Neo4j InternalsAn overview of Neo4j Internals
An overview of Neo4j Internals
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
 

Similar to Porting a Streaming Pipeline from Scala to Rust

End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Actor model in .NET - Akka.NET
Actor model in .NET - Akka.NETActor model in .NET - Akka.NET
Actor model in .NET - Akka.NETKonrad Dusza
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at ScaleSean Zhong
 
Server side JavaScript: going all the way
Server side JavaScript: going all the wayServer side JavaScript: going all the way
Server side JavaScript: going all the wayOleg Podsechin
 
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Thomas Weise
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleSri Ambati
 
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayQuantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayPhil Estes
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
Os Reindersfinal
Os ReindersfinalOs Reindersfinal
Os Reindersfinaloscon2007
 
Os Reindersfinal
Os ReindersfinalOs Reindersfinal
Os Reindersfinaloscon2007
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streamingphanleson
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterKernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterAnne Nicolas
 
Introduction to Real Time Java
Introduction to Real Time JavaIntroduction to Real Time Java
Introduction to Real Time JavaDeniz Oguz
 

Similar to Porting a Streaming Pipeline from Scala to Rust (20)

End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Actor model in .NET - Akka.NET
Actor model in .NET - Akka.NETActor model in .NET - Akka.NET
Actor model in .NET - Akka.NET
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Server side JavaScript: going all the way
Server side JavaScript: going all the wayServer side JavaScript: going all the way
Server side JavaScript: going all the way
 
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Postgres clusters
Postgres clustersPostgres clusters
Postgres clusters
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt Dowle
 
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayQuantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
Os Reindersfinal
Os ReindersfinalOs Reindersfinal
Os Reindersfinal
 
Os Reindersfinal
Os ReindersfinalOs Reindersfinal
Os Reindersfinal
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
 
An Optics Life
An Optics LifeAn Optics Life
An Optics Life
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterKernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
 
Introduction to Real Time Java
Introduction to Real Time JavaIntroduction to Real Time Java
Introduction to Real Time Java
 

More from Evan Chan

Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesEvan Chan
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Evan Chan
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkEvan Chan
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Porting a Streaming Pipeline from Scala to Rust

  • 1. Lessons: Porting a Streaming Pipeline from Scala to Rust 2023 Scale by the Bay Evan Chan Principal Engineer - Conviva http://velvia.github.io/presentations/2023-conviva-scala-to-rust 1 / 38
  • 3. Massive Real-time Streaming Analytics 5 trillion events processed per day 800-2000GB/hour (not peak!!) Started with custom Java code went through Spark Streaming and Flink iterations Most backend data components in production are written in Scala Today: 420 pods running custom Akka Streams processors 3 / 38
  • 4. Data World is Going Native and Rust Going native: Python, end of Moore's Law, cloud compute Safe, fast, and high-level abstractions Functional data patterns - map, fold, pattern matching, etc. Static dispatch and no allocations by default PyO3 - Rust is the best way to write native Python extensions JVM -> Rust projects: Spark, Hive -> DataFusion, Ballista, Amadeus; Flink -> Arroyo, RisingWave, Materialize; Kafka/KSQL -> Fluvio; ElasticSearch / Lucene -> Toshi, MeiliDB; Cassandra, HBase -> Skytable, Sled, Sanakirja...; Neo4J -> TerminusDB, IndraDB 4 / 38
  • 5. About our Architecture graph LR; SAE(Streaming Data Pipeline) Sensors --> Gateways Gateways --> Kafka Kafka --> SAE SAE --> DB[(Metrics Database)] DB --> Dashboards 5 / 38
  • 6. What We Are Porting to Rust graph LR; classDef highlighted fill:#99f,stroke:#333,stroke-width:4px SAE(Streaming Data Pipeline) Sensors:::highlighted --> Gateways:::highlighted Gateways --> Kafka Kafka --> SAE:::highlighted SAE --> DB[(Metrics Database)] DB --> Dashboards graph LR; Notes1(Sensors: consolidate fragmented code base) Notes2(Gateway: Improve on JVM and Go) Notes3(Pipeline: Improve efficiency New operator architecture) Notes1 ~~~ Notes2 Notes2 ~~~ Notes3 6 / 38
  • 7. Our Journey to Rust gantt title From Hackathon to Multiple Teams dateFormat YYYY-MM axisFormat %y-%b section Data Pipeline Hackathon :Small Kafka ingestion project, 2022-11, 30d Scala prototype :2023-02, 6w Initial Rust Port : small team, 2023-04, 45d Bring on more people :2023-07, 8w 20-25 people 4 teams :2023-11, 1w section Gateway Go port :2023-07, 6w Rust port :2023-09, 4w “I like that if it compiles, I know it will work, so it gives confidence.” 7 / 38
  • 8. Promising Rust Hackathon graph LR; Kafka --> RustDeser(Rust Deserializer) RustDeser --> RA(Rust Actors - Lightweight Processing) Measurement (improvement over Scala/Akka): Throughput (CPU): 2.6x more; Memory used: 12x less. Mostly I/O-bound lightweight deserialization and processing workload. Found out Actix does not work well with Tokio 8 / 38
  • 9. Performance Results - Gateway 9 / 38
  • 10. Key Lessons or Questions What matters for a Rust port? The 4 P's: People - How do we bring developers onboard? Performance - How do I get performance? Data structures? Static dispatch? Patterns - What coding patterns port well from Scala? Async? Project - How do I build? Tooling, IDEs? 10 / 38
  • 11. People How do we bring developers onboard? 11 / 38
  • 12. A Phased Rust Bringup We ported our main data pipeline in two phases (Phase: Team; Rust Expertise; Work). First: 3-5, very senior; 1-2 with significant Rust; Port core project components. Second: 10-15, mixed, distributed; Most with zero Rust; Smaller, broken down tasks. Have organized list of learning resources. 2-3 weeks to learn Rust and come up to speed 12 / 38
  • 13. Overcoming Challenges Difficulties: Lifetimes; Compiler errors; Porting previous patterns; Ownership and async; etc. How we helped: Good docs; Start with tests; ChatGPT!; Rust Book; Office hours; Lots of detailed reviews; Split project into async and sync cores 13 / 38
  • 14. Performance Data structures, static dispatch, etc. "I enjoy the fact that the default route is performant. It makes you write performant code, and if you go out the way, it becomes explicit (e.g., with dyn, Boxed, or clone etc). " 14 / 38
  • 15. Porting from Scala: Huge Performance Win graph LR; classDef highlighted fill:#99f,stroke:#333,stroke-width:4px SAE(Streaming Data Pipeline) Sensors --> Gateways Gateways --> Kafka Kafka --> SAE:::highlighted SAE --> DB[(Metrics Database)] DB --> Dashboards CPU-bound, programmable, heavy data processing. Neither the Rust nor the Scala version is productionized or optimized. Same architecture and same inputs/outputs. Scala version was not designed for speed, lots of objects. Rust: we chose static dispatch and minimizing allocations. Type of comparison (improvement over Scala): Throughput, end to end: 22x; Throughput, single-threaded microbenchmark: >= 40x 15 / 38
  • 16. Building a Flexible Data Pipeline graph LR; RawEvents(Raw Events) RawEvents -->| List of numbers | Extract1 RawEvents --> Extract2 Extract1 --> DoSomeMath Extract2 --> TransformSomeFields DoSomeMath --> Filter1 TransformSomeFields --> Filter1 Filter1 --> MoreProcessing An interpreter passes time-ordered data between flexible DAG of operators. Span1 Start time: 1000 End time: 1100 Events: ["start", "click"] Span2 Start time: 1100 End time: 1300 Events: ["ad_load"] 16 / 38
  • 17. Scala: Object Graph on Heap graph TB; classDef default font-size:24px ArraySpan["`Array[Span]`"] TL(Timeline - Seq) --> ArraySpan ArraySpan --> Span1["`Span(start, end, Payload)`"] ArraySpan --> Span2["`Span(start, end, Payload)`"] Span1 --> EventsAtSpanEnd("`Events(Seq[A])`") EventsAtSpanEnd --> ArrayEvent["`Array[A]`"] Rust: mostly stack based / 0 alloc: flowchart TB; subgraph Timeline subgraph OutputSpans subgraph Span1 subgraph Events EvA ~~~ EvB end TimeInterval ~~~ Events end subgraph Span2 Time2 ~~~ Events2 end Span1 ~~~ Span2 end DataType ~~~ OutputSpans end Data Structures: Scala vs Rust 17 / 38
  • 18. Rust: Using Enums and Avoiding Boxing pub enum Timeline { EventNumber(OutputSpans<EventsAtEnd<f64>>), EventBoolean(OutputSpans<EventsAtEnd<bool>>), EventString(OutputSpans<EventsAtEnd<DataString>>), } type OutputSpans<V> = SmallVec<[Span<V>; 2]>; pub struct Span<SV: SpanValue> { pub time: TimeInterval, pub value: SV, } pub struct EventsAtEnd<V>(SmallVec<[V; 1]>); In the above, the Timeline enum can fit entirely on the stack and avoid all boxing and allocations, if: The number of spans is very small, below the limit set in code; The number of events in each span is very small (1 in this case, which is the common case); The base type is a primitive, or a string which is below a certain length 18 / 38
  • 19. Avoiding Allocations using SmallVec and SmallString SmallVec is something like this: pub enum SmallVec<T, const N: usize> { Stack([T; N]), Heap(Vec<T>), } The enum can hold up to N items inline in an array with no allocations, but switches to the Heap variant if the number of items exceeds N. There are various crates for small strings and other data structures. 19 / 38
  • 20. Static vs Dynamic Dispatch Often one will need to work with many different structs that implement a Trait -- for us, different operator implementations supporting different types. Static dispatch and inlined code is much faster. 1. Monomorphisation using generics fn execute_op<O: Operator>(op: O) -> Result<...> Compiler creates a new instance of execute_op for every different O Only works when you know in advance what Operator to pass in 2. Use Enums and enum_dispatch fn execute_op(op: OperatorEnum) -> Result<...> 3. Dynamic dispatch fn execute_op(op: Box<dyn Operator>) -> Result<...> fn execute_op(op: &dyn Operator) -> Result<...> (avoids allocation) 4. Function wrapping Embedding functions in a generic struct 20 / 38
  • 21. enum_dispatch Suppose you have trait KnobControl { fn set_position(&mut self, value: f64); fn get_value(&self) -> f64; } struct LinearKnob { position: f64, } struct LogarithmicKnob { position: f64, } impl KnobControl for LinearKnob... enum_dispatch lets you do this: #[enum_dispatch] trait KnobControl { //... } 21 / 38
  • 22. Function wrapping Static function wrapping - no generics pub struct OperatorWrapper { name: String, func: fn(input: &Data) -> Data, } Need a generic - but accepts closures pub struct OperatorWrapper<F> where F: Fn(&Data) -> Data { name: String, func: F, } 22 / 38
  • 24. Rust Async: Different Paradigms "Async: It is well designed... Yes, it is still pretty complicated piece of code, but the logic or the framework is easier to grasp compared to other languages." Having to use Arc: Data Structures are not Thread-safe by default! Scala -> Rust: Futures -> futures, async functions; ?? -> async-await; Actors (Akka) -> Actix, Bastion, etc.; Async streams -> Tokio streams; Reactive (Akka streams, Monix, ZIO) -> reactive_rs, rxRust, etc. 24 / 38
  • 25. Replacing Akka: Actors in Rust Actix threading model doesn't mix well with Tokio We moved to tiny-tokio-actor, then wrote our own pub struct AnomalyActor {} #[async_trait] impl ChannelActor<Anomaly, AnomalyActorError> for AnomalyActor { async fn handle( &mut self, msg: Anomaly, ctx: &mut ActorContext<Anomaly>, ) -> Result<(), Report<AnomalyActorError>> { use Anomaly::*; match msg { QuantityOverflowAnomaly { ctx: _, ts: _, qual: _, qty: _, cnt: _, data: _, } => {} PoisonPill => { ctx.stop(); } } Ok(()) } } 25 / 38
  • 26. Other Patterns to Learn Old Pattern -> New Pattern: No inheritance -> Use composition! (Compose data structures; Compose small Traits); No exceptions -> Use Result and ?; Data structures are not thread-safe -> Learn to use Arc etc.; Returning Iterators -> Don't return things that borrow other things. This makes life difficult. 26 / 38
  • 27. Type Classes In Rust, type classes (Traits) are smaller and more compositional. pub trait Inhale { fn sniff(&self); } You can implement new Traits for existing types, and have different impl's for different types. impl Inhale for String { fn sniff(&self) { println!("I sniffed {}", self); } } // Only implemented for specific N subtypes of MyStruct impl<N: Numeric> Inhale for MyStruct<N> { fn sniff(&self) { .... } } 27 / 38
  • 29. "Cargo is the best build tool ever" Almost no dependency conflicts due to multiple dep versioning Configuration by convention - common directory/file layouts for example Really simple .toml - no need for XML, functional Scala, etc. Rarely need code to build anything, even for large projects [package] name = "telemetry-subscribers" version = "0.3.0" license = "Apache-2.0" description = "Library for common telemetry and observability functionality" [dependencies] console-subscriber = { version = "0.1.6", optional = true } crossterm = "0.25.0" once_cell = "1.13.0" opentelemetry = { version = "0.18.0", features = ["rt-tokio"], optional = true } 29 / 38
  • 30. IDEs, CI, and Tooling IDEs/Editors: VSCode, RustRover (IntelliJ), vim/emacs/etc with Rust Analyzer; Code Coverage: VSCode inline, grcov/lcov, Tarpaulin (Linux only); Slow build times: Caching with cargo-chef, rust-cache; Slow test times: cargo-nextest; Property Testing: proptest; Benchmarking: Criterion. https://blog.logrocket.com/optimizing-ci-cd-pipelines-rust-projects/ VSCode's "LiveShare" feature for distributed pair programming is TOP NOTCH. 30 / 38
  • 31. Rust Resources and Projects https://github.com/velvia/links/blob/main/rust.md - this is my list of Rust projects and learning resources https://github.com/rust-unofficial/awesome-rust https://www.arewelearningyet.com - ML focused 31 / 38
  • 32. What do we miss from Scala? More mature libraries - in some cases: HDFS, etc. Good streaming libraries - like Monix, Akka Streams etc. I guess all of Akka "Less misleading compiler messages" Rust error messages read better from the CLI, IMO (not an IDE) 32 / 38
  • 33. Takeaways It's a long journey but Rust is worth it. Structuring a project for successful onramp is really important Think about data structure design early on Allow plenty of time to ramp up on Rust patterns, tools We are hiring across multiple roles/levels! 33 / 38
  • 36. Data World is Going Native (from JVM) The rise of Python and Data Science Led to AnyScale, Dask, and many other Python-oriented data frameworks Rise of newer, developer-friendly native languages (Go, Swift, Rust, etc.) Migration from Hadoop/HDFS to more cloud-based data architectures Apache Arrow and other data interchange formats Hardware architecture trends - end of Moore's Law, rise of GPUs etc 36 / 38
  • 37. Why We Went with our Own Actors 1. Initial Hackathon prototype used Actix Actix has its own event-loop / threading model, using Arbiters Difficult to co-exist with Tokio and configure both 2. Moved to tiny-tokio-actor Really thin layer on top of Tokio 25% improvement over rdkafka + Tokio + Actix 3. Ultimately wrote our own, 100-line mini Actor framework tiny-tokio-actor required messages to be Clone so we could not, for example, send OneShot channels for other actors to reply Wanted ActorRef<MessageType> instead of ActorRef<ActorType, MessageType> supports tell() and ask() semantics 37 / 38
  • 38. Scala: Object Graphs and Any class Timeline extends BufferedIterator[Span[Payload]] final case class Span[+A](start: Timestamp, end: Timestamp, payload: A) { def mapPayload[B](f: A => B): Span[B] = copy(payload = f(payload)) } type Event[+A] = Span[EventsAtSpanEnd[A]] @newtype final case class EventsAtSpanEnd[+A](events: Iterable[A]) BufferedIterator must be on the heap Each Span Payload is also boxed and on the heap, even for numbers To be dynamically interpretable, we need BufferedIterator[Span[Any]] in many places :( Yes, specialization is possible, at the cost of complexity 38 / 38
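
The first three dispatch options from the "Static vs Dynamic Dispatch" slide (20) can be sketched side by side. This is only an illustrative sketch under stated assumptions, not the Conviva code: the `Operator` trait, the `Scale` and `Offset` operators, and the `execute_*` and `run_pipeline` functions are hypothetical names, and the enum forwarding is written by hand to show roughly the kind of `match` that the `enum_dispatch` crate is meant to generate.

```rust
// Hypothetical operator trait, standing in for the real pipeline operators.
pub trait Operator {
    fn apply(&self, x: f64) -> f64;
}

// Two toy operators (assumed for illustration only).
pub struct Scale(pub f64);
pub struct Offset(pub f64);

impl Operator for Scale {
    fn apply(&self, x: f64) -> f64 {
        x * self.0
    }
}

impl Operator for Offset {
    fn apply(&self, x: f64) -> f64 {
        x + self.0
    }
}

// 1. Monomorphisation: the compiler emits one copy of this function per
//    concrete O, so the call to `apply` is resolved statically.
pub fn execute_generic<O: Operator>(op: &O, x: f64) -> f64 {
    op.apply(x)
}

// 2. Enum dispatch: a closed set of operators wrapped in one enum.
pub enum OperatorEnum {
    Scale(Scale),
    Offset(Offset),
}

// Hand-written forwarding; a `match` on the variant instead of a vtable call.
impl Operator for OperatorEnum {
    fn apply(&self, x: f64) -> f64 {
        match self {
            OperatorEnum::Scale(op) => op.apply(x),
            OperatorEnum::Offset(op) => op.apply(x),
        }
    }
}

pub fn execute_enum(op: &OperatorEnum, x: f64) -> f64 {
    op.apply(x)
}

// 3. Dynamic dispatch: `&dyn Operator` goes through a vtable, but taking a
//    reference avoids the allocation that `Box<dyn Operator>` would need.
pub fn execute_dyn(op: &dyn Operator, x: f64) -> f64 {
    op.apply(x)
}

// A pipeline chosen at runtime can still avoid trait objects by storing the
// enum in a slice and folding the input through each operator in order.
pub fn run_pipeline(ops: &[OperatorEnum], input: f64) -> f64 {
    ops.iter().fold(input, |acc, op| op.apply(acc))
}
```

The enum variant keeps the operator set closed and known at compile time, which is what makes it a fit for an interpreter that picks operators at runtime but still wants to avoid `Box<dyn Operator>`. Adding a new operator then means adding a variant and a `match` arm.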