2. Google Cloud Platform 2
About Me
Product Manager at Google on Cloud Dataflow
Work closely with most of the Apache Beam committers
Project Management Committee (PMC) member for Alluxio
Contributed Google Compute Engine support to Alluxio
Originally from Salt Lake City, UT
Father of 3 kids!
Twitter: @ericmander
Intermediary API?
Jesse Anderson (formerly of Cloudera), in his blog post on Strata+Hadoop trends
I’m open to a better name if you have ideas
In the beginning...
There was only one approach to data processing
[Diagram: one stack on each side: Hadoop over HDFS (open source), MapReduce over GFS (Google)]
In the beginning...
And it required just two APIs: one for describing the job, one for storage
[Diagram: the Hadoop/MapReduce job APIs atop the HDFS/GFS storage APIs, open source beside Google]
Then there was an evolution
But MapReduce was really hard (the assembly language of data processing)
[Diagram: the MapReduce API atop the MR engine]
Flume (2010)
Flume was a programming model (API) innovation
(FlumeJava not Apache Flume)
[Diagram: Flume as a programming-model layer between the MapReduce API and the MR engine]
Higher-level abstractions:
- PCollections (analogous to Spark's RDDs)
- PTransforms
Directed acyclic graphs (DAGs)
Pipeline optimization (fusion)
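Flume's fusion optimization can be sketched in miniature: instead of materializing the output of each element-wise transform (as chained MapReduce stages would), an optimizer composes adjacent transforms into a single pass over the data. A minimal Python sketch with made-up function names, not FlumeJava's actual API:

```python
# Sketch of transform fusion: two adjacent element-wise transforms in a
# pipeline DAG are composed into one function, so the intermediate
# collection is never materialized.

def fuse(*fns):
    """Compose element-wise transforms into a single pass."""
    def fused(x):
        for fn in fns:
            x = fn(x)
        return x
    return fused

parse = lambda line: line.split(",")
first_field = lambda fields: fields[0]

lines = ["a,1", "b,2"]

# Unfused: two passes, intermediate list materialized.
intermediate = [parse(l) for l in lines]
result_unfused = [first_field(f) for f in intermediate]

# Fused: one pass, no intermediate collection.
fused_fn = fuse(parse, first_field)
result_fused = [fused_fn(l) for l in lines]

assert result_fused == result_unfused == ["a", "b"]
```

The same idea generalizes to fusing whole chains of ParDo-style stages in a DAG before execution.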
MillWheel (2013)
MillWheel was an execution-model innovation
It introduced a new API, as expected
3 APIs, 2 processing systems!
[Diagram: Flume atop the MapReduce API and MR, beside the MillWheel API atop MillWheel, two execution models]
Low-latency, exactly-once stream processing
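One ingredient of exactly-once processing can be sketched as deduplicating retried records by a stable record ID before state is updated. This is an illustrative Python sketch only, not MillWheel's actual mechanism (which also involves checkpointed state and sender acknowledgments):

```python
# Sketch: exactly-once effect on state, achieved by remembering which
# record IDs have already been applied, so redelivered records are no-ops.

class ExactlyOnceCounter:
    def __init__(self):
        self.seen_ids = set()  # IDs of records already applied
        self.count = 0

    def process(self, record_id, value):
        if record_id in self.seen_ids:
            return  # duplicate delivery (e.g. an upstream retry): ignore
        self.seen_ids.add(record_id)
        self.count += value

c = ExactlyOnceCounter()
c.process("r1", 5)
c.process("r2", 3)
c.process("r1", 5)  # retried record is not double-counted
assert c.count == 8
```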
Programming-model innovation: batch and streaming unified
Execution-model innovation: a managed batch and streaming service
Dataflow (2015)
[Diagram: the Dataflow SDK (evolved from Flume) as the programming model, Cloud Dataflow (evolved from MR and MillWheel) as the execution model]
“We believe that [...] the Beam model is the future reference programming
model for writing data applications in both stream and batch”
- Kostas Tzoumas, CEO of data Artisans and Flink co-creator
Apache Beam (2016)
[Diagram: Apache Beam: the Dataflow SDK running on Cloud Dataflow, Flink, Spark, or a local runner, beside the native Flink and Spark APIs]
Apache Beam
1. The Beam Programming Model (unifies streaming/batch)
a. Transformations
b. Windowing
c. Watermarks + Triggers
d. Accumulation
2. SDKs for writing Beam pipelines
a. Java (Scala thanks to Spotify)
b. Python
3. Runners for existing distributed processing backends
a. Apache Flink (thanks to data Artisans)
b. Apache Spark (thanks to Cloudera and PayPal)
c. Google Cloud Dataflow (fully managed service from Google)
d. Local runner for testing
e. Other runners in progress: Gear Pump, Apache Apex
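The windowing piece of the model can be illustrated with fixed (tumbling) windows: each timestamped element is assigned to a window, and aggregation happens per key and window. A Python sketch with invented names, not the Beam SDK itself:

```python
# Sketch of fixed (tumbling) windowing: elements carry event timestamps,
# are assigned to 60-second windows, and are counted per (key, window).

from collections import defaultdict

WINDOW_SIZE = 60  # window width in seconds

def window_for(timestamp):
    """Return the (start, end) of the fixed window containing timestamp."""
    start = (timestamp // WINDOW_SIZE) * WINDOW_SIZE
    return (start, start + WINDOW_SIZE)

# (key, event timestamp in seconds, value)
events = [("user1", 10, 1), ("user1", 65, 1), ("user2", 70, 1), ("user1", 75, 1)]

counts = defaultdict(int)
for key, ts, value in events:
    counts[(key, window_for(ts))] += value

assert counts[("user1", (0, 60))] == 1
assert counts[("user1", (60, 120))] == 2
assert counts[("user2", (60, 120))] == 1
```

Watermarks and triggers then govern when each (key, window) result is emitted, and accumulation governs how successive firings of the same window combine.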
There is once again only one library we need for data processing, except this time:
- It's easy and expressive
- And we can still choose from the best execution technologies
Apache Beam (2017?)
[Diagram: Apache Beam running on Dataflow, Flink, Spark, a local runner, Gear Pump, or Apache Apex, beside the native Flink and Spark APIs]
Coming full circle
There is once again only one library we need for data processing, except this time:
- It's easy and expressive
- And we can still choose from the best execution technologies
Yet we've tried this before...
[Diagram: the Hadoop API atop Hadoop, the MapReduce API atop MR]
Apache Crunch (2012)
Apache Crunch is an open-source, Flume-like API on Hadoop and now Spark.
[Diagram: Crunch as a Flume-like programming-model layer between the MapReduce API and Hadoop/MapReduce]
Apache Crunch (2012)
[Chart: interest in Apache Crunch vs. Apache Beam]
Why? Perhaps...
● Limited portability need / value
● Missed the streaming revolution
● Community support
What about storage?
And it required just two APIs: one for describing the job, one for storage
[Diagram: the Hadoop/MapReduce job APIs atop the HDFS/GFS storage APIs, open source beside Google]
Need for Intermediary Storage API
Again, an explosion of options
No reason to believe this will ever end: there will always be innovation in storage and file systems
[Diagram: a proliferation of storage APIs: the HDFS, Swift, GCS/S3, and GlusterFS APIs, each atop its own storage system]
Model for expressing storage lifecycle
There are patterns we want to express:
● Caching
● Retention policy
● ACLs
● Down-tiering old or stale data
Across storage systems:
● Unified namespace
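One of these patterns, down-tiering, can be sketched as a declarative rule over file metadata. The tier names and policy shape below are illustrative assumptions, not any particular system's API:

```python
# Sketch of a storage-lifecycle rule: files older than a cutoff are
# moved out of the in-memory tier down to disk. Tier names ("MEM",
# "HDD") and the FileMeta shape are made up for illustration.

from dataclasses import dataclass

@dataclass
class FileMeta:
    path: str
    age_days: int
    tier: str  # "MEM", "SSD", or "HDD"

def down_tier(files, max_age_days=30):
    """Apply the retention rule: demote stale in-memory files to disk."""
    for f in files:
        if f.tier == "MEM" and f.age_days > max_age_days:
            f.tier = "HDD"
    return files

files = [FileMeta("/logs/today", 1, "MEM"), FileMeta("/logs/old", 90, "MEM")]
down_tier(files)
assert [f.tier for f in files] == ["MEM", "HDD"]
```

Caching, ACLs, and retention can be expressed as similar rules evaluated against the same metadata, independently of which storage system holds the bytes.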
PRD: Intermediate Storage API
1. Model for expressing storage lifecycle
2. Write to the popular storage systems
3. Pluggable APIs extend to other systems
4. Read from the popular processing frameworks
Alluxio
1. Model for expressing storage lifecycle: unified namespace + tiered storage + lineage
2. Write to the popular storage systems: supports at least a half dozen (Alibaba OSS, Swift, HDFS, GCS/S3, GlusterFS, NFS)
3. Pluggable APIs extend to other systems: pluggable under storage
4. Read from the popular processing frameworks: supports at least a half dozen (Hadoop, Flink, Spark, local, HBase, Presto)
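The unified-namespace idea can be sketched as a mount table: logical paths resolve to URIs in different under-storage systems. The mount points and URIs below are invented for illustration; this is not Alluxio's actual API:

```python
# Sketch of a unified namespace: a mount table maps logical path
# prefixes to under-storage URIs, so one namespace spans HDFS, S3,
# and local disk. All names here are hypothetical.

MOUNTS = {
    "/warehouse": "hdfs://namenode:8020/warehouse",
    "/archive":   "s3://bucket/archive",
    "/scratch":   "file:///tmp/scratch",
}

def resolve(logical_path):
    """Resolve a logical path to its under-storage URI via the longest
    matching mount prefix."""
    for prefix, target in sorted(MOUNTS.items(),
                                 key=lambda kv: len(kv[0]), reverse=True):
        if logical_path.startswith(prefix):
            return target + logical_path[len(prefix):]
    raise KeyError(f"no mount for {logical_path}")

assert resolve("/archive/2016/data.csv") == "s3://bucket/archive/2016/data.csv"
```

A job then reads and writes logical paths only, and an operator can remount a prefix onto a different storage system without touching job code.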
Survival tests

| Survival test            | Apache Beam                                           | Alluxio                                                                   |
| Portability need / value | Lots of frameworks with varying performance profiles  | Lots of frameworks and storage systems with varying performance profiles  |
| Catch the technology wave| Leading the stream-processing revolution              | Leading the in-memory revolution                                          |
| Community support        | Top names in data processing                          | Fastest-growing contributor base                                          |
My particular excitement about Alluxio
It's a particularly interesting intermediary API because:
● Data has gravity; Alluxio lets enterprises adopt new technology alongside legacy storage.
● Alluxio's unification of sources is valuable within a single job; Beam is used with one framework at a time, so it's portable across jobs.
● Alluxio has standalone value from its built-in open-source in-memory filesystem; Beam requires an underlying execution engine like Dataflow.
Intermediary APIs = Data processing nirvana
Coder:
● Ability to express my data processing job or storage lifecycle logically, independent of
physical constraints.
Deployer:
● Code portability
● Swap in technology at will
System/Technology Creators:
● Easy path to adoption
● Focus on features and performance, not APIs/connectivity
Stack of the future?
[Diagram: Apache Beam (running on Dataflow, Flink, Spark, a local runner, Gear Pump, or Apache Apex) stacked atop Alluxio (backed by Alibaba OSS, Swift, HDFS, GCS/S3, GlusterFS, or NFS)]