2. Google Cloud Platform 2
About Me
Product Manager at Google on Cloud Dataflow
Work closely with most of the Apache Beam committers
Project Management Committee (PMC) member for Alluxio
Contributed Google Compute Engine support to Alluxio
Originally from Salt Lake City, UT
Father of 3 kids!
Twitter: @ericmander
Intermediary API?
Jesse Anderson (formerly of Cloudera), in his blog post on Strata+Hadoop trends
I’m open to a better name if you have ideas
In the beginning...
There was only one approach to data processing
[Diagram: one stack on each side: Hadoop over HDFS (open source), MapReduce over GFS (Google)]
In the beginning...
And it required just two APIs: one for describing the job, one for storage
[Diagram: the Hadoop/MapReduce job APIs atop the HDFS/GFS storage APIs, open source beside Google]
Then there was an evolution
But MapReduce was really hard (the assembly language of data processing)
[Diagram: the MapReduce API atop the MR engine]
Flume (2010)
Flume was a programming model (API) innovation
(FlumeJava not Apache Flume)
[Diagram: Flume as a programming-model layer between the MapReduce API and the MR engine]
Higher-level abstractions:
- PCollections (analogous to Spark's RDDs)
- PTransforms
Directed acyclic graphs (DAGs)
Pipeline optimization (fusion)
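Flume's fusion optimization can be sketched in miniature: instead of materializing the output of each element-wise transform (as chained MapReduce stages would), an optimizer composes adjacent transforms into a single pass over the data. A minimal Python sketch with made-up function names, not FlumeJava's actual API:

```python
# Sketch of transform fusion: two adjacent element-wise transforms in a
# pipeline DAG are composed into one function, so the intermediate
# collection is never materialized.

def fuse(*fns):
    """Compose element-wise transforms into a single pass."""
    def fused(x):
        for fn in fns:
            x = fn(x)
        return x
    return fused

parse = lambda line: line.split(",")
first_field = lambda fields: fields[0]

lines = ["a,1", "b,2"]

# Unfused: two passes, intermediate list materialized.
intermediate = [parse(l) for l in lines]
result_unfused = [first_field(f) for f in intermediate]

# Fused: one pass, no intermediate collection.
fused_fn = fuse(parse, first_field)
result_fused = [fused_fn(l) for l in lines]

assert result_fused == result_unfused == ["a", "b"]
```

The same idea generalizes to fusing whole chains of ParDo-style stages in a DAG before execution.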
MillWheel (2013)
MillWheel was an execution-model innovation
It introduced a new API, as expected
3 APIs, 2 processing systems!
[Diagram: Flume atop the MapReduce API and MR, beside the MillWheel API atop MillWheel, two execution models]
Low-latency, exactly-once stream processing
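One ingredient of exactly-once processing can be sketched as deduplicating retried records by a stable record ID before state is updated. This is an illustrative Python sketch only, not MillWheel's actual mechanism (which also involves checkpointed state and sender acknowledgments):

```python
# Sketch: exactly-once effect on state, achieved by remembering which
# record IDs have already been applied, so redelivered records are no-ops.

class ExactlyOnceCounter:
    def __init__(self):
        self.seen_ids = set()  # IDs of records already applied
        self.count = 0

    def process(self, record_id, value):
        if record_id in self.seen_ids:
            return  # duplicate delivery (e.g. an upstream retry): ignore
        self.seen_ids.add(record_id)
        self.count += value

c = ExactlyOnceCounter()
c.process("r1", 5)
c.process("r2", 3)
c.process("r1", 5)  # retried record is not double-counted
assert c.count == 8
```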
Programming-model innovation: batch and streaming unified
Execution-model innovation: a managed batch and streaming service
Dataflow (2015)
[Diagram: the Dataflow SDK (evolved from Flume) as the programming model, Cloud Dataflow (evolved from MR and MillWheel) as the execution model]
“We believe that [...] the Beam model is the future reference programming
model for writing data applications in both stream and batch”
- Kostas Tzoumas, CEO of data Artisans and Flink co-creator
Apache Beam (2016)
[Diagram: Apache Beam: the Dataflow SDK running on Cloud Dataflow, Flink, Spark, or a local runner, beside the native Flink and Spark APIs]
Apache Beam
1. The Beam Programming Model (unifies streaming/batch)
a. Transformations
b. Windowing
c. Watermarks + Triggers
d. Accumulation
2. SDKs for writing Beam pipelines
a. Java (Scala thanks to Spotify)
b. Python
3. Runners for existing distributed processing backends
a. Apache Flink (thanks to data Artisans)
b. Apache Spark (thanks to Cloudera and PayPal)
c. Google Cloud Dataflow (fully managed service from Google)
d. Local runner for testing
e. Other runners in progress: Gear Pump, Apache Apex
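The windowing piece of the model can be illustrated with fixed (tumbling) windows: each timestamped element is assigned to a window, and aggregation happens per key and window. A Python sketch with invented names, not the Beam SDK itself:

```python
# Sketch of fixed (tumbling) windowing: elements carry event timestamps,
# are assigned to 60-second windows, and are counted per (key, window).

from collections import defaultdict

WINDOW_SIZE = 60  # window width in seconds

def window_for(timestamp):
    """Return the (start, end) of the fixed window containing timestamp."""
    start = (timestamp // WINDOW_SIZE) * WINDOW_SIZE
    return (start, start + WINDOW_SIZE)

# (key, event timestamp in seconds, value)
events = [("user1", 10, 1), ("user1", 65, 1), ("user2", 70, 1), ("user1", 75, 1)]

counts = defaultdict(int)
for key, ts, value in events:
    counts[(key, window_for(ts))] += value

assert counts[("user1", (0, 60))] == 1
assert counts[("user1", (60, 120))] == 2
assert counts[("user2", (60, 120))] == 1
```

Watermarks and triggers then govern when each (key, window) result is emitted, and accumulation governs how successive firings of the same window combine.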
There is once again only one library we need for data processing, except this time:
- It's easy and expressive
- And we can still choose from the best execution technologies
Apache Beam (2017?)
[Diagram: Apache Beam running on Dataflow, Flink, Spark, a local runner, Gear Pump, or Apache Apex, beside the native Flink and Spark APIs]
Coming full circle
There is once again only one library we need for data processing, except this time:
- It's easy and expressive
- And we can still choose from the best execution technologies
Yet we've tried this before...
[Diagram: the Hadoop API atop Hadoop, the MapReduce API atop MR]
Apache Crunch (2012)
Apache Crunch is an open-source, Flume-like API on Hadoop and now Spark.
[Diagram: Crunch as a Flume-like programming-model layer between the MapReduce API and Hadoop/MapReduce]
Apache Crunch (2012)
[Chart: interest in Apache Crunch vs. Apache Beam]
Why? Perhaps...
● Limited portability need / value
● Missed the streaming revolution
● Community support
What about storage?
And it required just two APIs: one for describing the job, one for storage
[Diagram: the Hadoop/MapReduce job APIs atop the HDFS/GFS storage APIs, open source beside Google]
Need for Intermediary Storage API
Again, an explosion of options
No reason to believe this will ever end: there will always be innovation in storage and file systems
[Diagram: a proliferation of storage APIs: the HDFS, Swift, GCS/S3, and GlusterFS APIs, each atop its own storage system]
Model for expressing storage lifecycle
There are patterns we want to express:
● Caching
● Retention policy
● ACLs
● Down-tiering old or stale data
Across storage systems:
● Unified namespace
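One of these patterns, down-tiering, can be sketched as a declarative rule over file metadata. The tier names and policy shape below are illustrative assumptions, not any particular system's API:

```python
# Sketch of a storage-lifecycle rule: files older than a cutoff are
# moved out of the in-memory tier down to disk. Tier names ("MEM",
# "HDD") and the FileMeta shape are made up for illustration.

from dataclasses import dataclass

@dataclass
class FileMeta:
    path: str
    age_days: int
    tier: str  # "MEM", "SSD", or "HDD"

def down_tier(files, max_age_days=30):
    """Apply the retention rule: demote stale in-memory files to disk."""
    for f in files:
        if f.tier == "MEM" and f.age_days > max_age_days:
            f.tier = "HDD"
    return files

files = [FileMeta("/logs/today", 1, "MEM"), FileMeta("/logs/old", 90, "MEM")]
down_tier(files)
assert [f.tier for f in files] == ["MEM", "HDD"]
```

Caching, ACLs, and retention can be expressed as similar rules evaluated against the same metadata, independently of which storage system holds the bytes.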
PRD: Intermediate Storage API
1. Model for expressing storage lifecycle
2. Write to the popular storage systems
3. Pluggable APIs extend to other systems
4. Read from the popular processing frameworks
Alluxio
1. Model for expressing storage lifecycle: unified namespace + tiered storage + lineage
2. Write to the popular storage systems: supports at least a half dozen (Alibaba OSS, Swift, HDFS, GCS/S3, GlusterFS, NFS)
3. Pluggable APIs extend to other systems: pluggable under storage
4. Read from the popular processing frameworks: supports at least a half dozen (Hadoop, Flink, Spark, local, HBase, Presto)
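The unified-namespace idea can be sketched as a mount table: logical paths resolve to URIs in different under-storage systems. The mount points and URIs below are invented for illustration; this is not Alluxio's actual API:

```python
# Sketch of a unified namespace: a mount table maps logical path
# prefixes to under-storage URIs, so one namespace spans HDFS, S3,
# and local disk. All names here are hypothetical.

MOUNTS = {
    "/warehouse": "hdfs://namenode:8020/warehouse",
    "/archive":   "s3://bucket/archive",
    "/scratch":   "file:///tmp/scratch",
}

def resolve(logical_path):
    """Resolve a logical path to its under-storage URI via the longest
    matching mount prefix."""
    for prefix, target in sorted(MOUNTS.items(),
                                 key=lambda kv: len(kv[0]), reverse=True):
        if logical_path.startswith(prefix):
            return target + logical_path[len(prefix):]
    raise KeyError(f"no mount for {logical_path}")

assert resolve("/archive/2016/data.csv") == "s3://bucket/archive/2016/data.csv"
```

A job then reads and writes logical paths only, and an operator can remount a prefix onto a different storage system without touching job code.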
Survival tests

| Survival test            | Apache Beam                                           | Alluxio                                                                   |
| Portability need / value | Lots of frameworks with varying performance profiles  | Lots of frameworks and storage systems with varying performance profiles  |
| Catch the technology wave| Leading the stream-processing revolution              | Leading the in-memory revolution                                          |
| Community support        | Top names in data processing                          | Fastest-growing contributor base                                          |
My particular excitement about Alluxio
It's a particularly interesting intermediary API because:
● Data has gravity; Alluxio lets enterprises adopt new technology alongside legacy storage.
● Alluxio's unification of sources is valuable within a single job; Beam is used with one framework at a time, so it's portable across jobs.
● Alluxio has standalone value from its built-in open-source in-memory filesystem; Beam requires an underlying execution engine like Dataflow.
Intermediary APIs = Data processing nirvana
Coder:
● Ability to express my data processing job or storage lifecycle logically, independent of
physical constraints.
Deployer:
● Code portability
● Swap in technology at will
System/Technology Creators:
● Easy path to adoption
● Focus on features and performance, not APIs/connectivity
Stack of the future?
[Diagram: Apache Beam (running on Dataflow, Flink, Spark, a local runner, Gear Pump, or Apache Apex) stacked atop Alluxio (backed by Alibaba OSS, Swift, HDFS, GCS/S3, GlusterFS, or NFS)]