Spark Pipelines in the Cloud with Alluxio with Gene Pang

•

3 likes•1,353 views

Organizations commonly use Apache Spark to gain actionable insight from their large amounts of data. Often, these analytics are in the form of data processing pipelines, where there are a series of processing stages, and each stage performs a particular function, and the output of one stage is the input of the next stage. There are several examples of pipelines, such as log processing, IoT pipelines, and machine learning. The common attribute among different pipelines is the sharing of data between stages. It is also common for Spark pipelines to process data stored in the public cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running pipeline jobs while sharing data via cloud storage can be expensive in terms of increased network traffic, and slower data sharing and job completion times. Using Alluxio, a memory speed virtual distributed storage system, enables sharing data between different stages or jobs at memory speed. By reading and writing data in Alluxio, the data can stay in memory for the next stage of the pipeline, and this result in great performance gains. In this talk, we discuss how Alluxio can be deployed and used with a Spark data processing pipeline in the cloud. We show how pipeline stages can share data with Alluxio memory for improved performance benefits, and how Alluxio can improves completion times and reduces performance variability for Spark pipelines in the cloud.

Data & Analytics

Spark Pipelines in the Cloud
with Alluxio
Gene Pang,Alluxio, Inc.
Spark Summit EU - October 2017

About Me
•  Gene Pang
•  Software engineer @ Alluxio, Inc.
•  Alluxio open source PMC member
•  Ph.D. from AMPLab @ UC Berkeley
•  Worked at Google before UC Berkeley
•  Twitter: @unityxx
•  Github: @gpang
©2017 Alluxio, Inc.All Rights Reserved 2

Outline
©2017 Alluxio, Inc.All Rights Reserved 3
1
2
3
Alluxio Overview
Data Pipelines
Experiments

History of Alluxio
Started at UC Berkeley AMPLab In Summer 2012
•  Originally named as Tachyon
•  Rebranded to Alluxio in early 2016
Open Sourced in 2013
•  Apache License 2.0
•  Latest Release:Alluxio 1.6.0
©2017 Alluxio, Inc.All Rights Reserved 4

5©2017 Alluxio, Inc.All Rights Reserved
Alluxio: Unify Data at Memory Speed
Namespace Unification
Architecture Flexibility
IO Performance

Data Ecosystem with Alluxio
•  Apps only talk to
Alluxio
•  Simple Add/Remove
•  No App Changes
•  In-Memory
Performance
Native File System
Hadoop Compatible
File System
Native Key-Value
Interface
Fuse Compatible File
System
HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface
©2017 Alluxio, Inc.All Rights Reserved 6

Next Gen Analytics with Alluxio
Native File System
Hadoop Compatible
File System
Native Key-Value
Interface
Fuse Compatible File
System
HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface
Apps, Data & Storage
at Memory Speed
ü  Big Data/IoT
ü  AI/ML
ü  Deep Learning
ü  Cloud Migration
ü  Multi Platform
ü  Autonomous
©2017 Alluxio, Inc.All Rights Reserved 7

Fastest Growing Big Data
Open Source Projects
Fastest Growing open-
source project in the big
data ecosystem
Running in large
production clusters
Now: 600+ Contributors
from 100+ organizations
0
100
200
300
400
500
0 10 20 30 40 45
NumberofContributors
Github Open Source Contributors by Month
Alluxio
Spark
Kafka
Redis
HDFS
Cassandra
Hive
©2017 Alluxio, Inc.All Rights Reserved 8

Outline
©2017 Alluxio, Inc.All Rights Reserved 9
1
2
3
Alluxio Overview
Data Pipelines
Experiments

1 0©2017 Alluxio, Inc.All Rights Reserved
Data Processing Pipeline
Stage 1 Stage 2 Stage 3 Stage 4
Output of stage is input of next stage

1 1©2017 Alluxio, Inc.All Rights Reserved
Data Processing Pipeline
Sharing via common storage
Shared Storage
Stage 1 Stage 2 Stage 3 Stage 4

1 2©2017 Alluxio, Inc.All Rights Reserved
What about
pipelines in the cloud?

1 3©2017 Alluxio, Inc.All Rights Reserved
Data Processing Pipeline in the Cloud
S3 (object store)
Stage 1 Stage 2 Stage 3 Stage 4

1 4©2017 Alluxio, Inc.All Rights Reserved
Sharing data via cloud storage
slows down performance

1 5©2017 Alluxio, Inc.All Rights Reserved
Cloud Pipeline with Alluxio
Sharing via Alluxio memory
S3 (object store)
Stage 1 Stage 2 Stage 3 Stage 4

Sharing Data in the Cloud
Previous stage writes output to storage
Next stage reads input from storage
…
©2017 Alluxio, Inc.All Rights Reserved 1 6

Sharing Data in the Cloud with Alluxio
Previous stage writes output to storage memory
Next stage reads input from storage memory
…
©2017 Alluxio, Inc.All Rights Reserved 1 7

1 8©2017 Alluxio, Inc.All Rights Reserved
Alluxio enables
in-memory data sharing
Faster pipeline performance

Alluxio – Fast Durable Writes
Improves write performance,
without sacrificing fault tolerance
©2017 Alluxio, Inc.All Rights Reserved 1 9

Alluxio – Fast Durable Writes
Synchronously write to replicas in Alluxio memory
Asynchronously write to underlying storage
©2017 Alluxio, Inc.All Rights Reserved 2 0

Alluxio – Fast Durable Writes
©2017 Alluxio, Inc.All Rights Reserved 2 1
client
Storage
Write is completed

Alluxio – Fast Durable Writes
©2017 Alluxio, Inc.All Rights Reserved 2 2
client
Storage
Asynchronously
written to
storage

Outline
©2017 Alluxio, Inc.All Rights Reserved 2 3
1
2
3
Alluxio Overview
Data Pipelines
Experiments

Log Pipeline in Amazon Web Services
©2017 Alluxio, Inc.All Rights Reserved 2 4
Generate Parquet Transform Aggregate
Generate: [MapReduce] Create random csv log data
Parquet: [MapReduce] Convert csv to parquet format
Transform: [Spark] Update column values
Aggregate: [Spark] Compute group by / aggregate

Log Pipeline Environment
•  r4.2xlarge instances (61 GB ram, 8 CPUs)
•  1 master, 3 workers
•  Apache Spark 2.2.0
•  Apache Hadoop 2.7.2
•  Alluxio 1.6.0
•  Generate 12 GB of logs
•  Compare AWS S3 vs Alluxio w/ Fast Durable Writes
©2017 Alluxio, Inc.All Rights Reserved 2 5

2 6©2017 Alluxio, Inc.All Rights Reserved
Average Stage Completion Time

2 7©2017 Alluxio, Inc.All Rights Reserved
Pipeline Completion Time
Over 9x speedup!

Alluxio and Pipelines in the Cloud
Alluxio enables in-memory sharing for data
pipelines in the cloud
Alluxio’s Fast Durable Write feature increases
performance without sacrificing fault tolerance
©2017 Alluxio, Inc.All Rights Reserved 2 8

Thank you!
Gene Pang
gene@alluxio.com
Twitter: @unityxx
Twi$er.com/alluxio

Linkedin.com/alluxio

Website
www.alluxio.com
E-mail
info@alluxio.com
@
Social Media
á
™
©2017 Alluxio, Inc.All Rights Reserved 2 9

What's hot

Building a Business Logic Translation Engine with Spark Streaming for Communi...Spark Summit

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit

Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark Summit

Apache Spark on K8S Best Practice and Performance in the CloudDatabricks

Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks

Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouSpark Summit

Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Spark Summit

Spark Summit EU talk by Jorg SchadSpark Summit

Open Source Ingredients for Interactive Data Analysis in Spark by Maxim Lukiy...DataWorks Summit/Hadoop Summit

High Availability PostgreSQL on OpenShift...and more!Jonathan Katz

Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Spark Summit

Sparklyr: Recap, Updates, and Use Cases with Javier LuraschiDatabricks

APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkSpark Summit

Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Databricks

RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...InfluxData

Improving Data Locality for Spark Jobs on Kubernetes Using AlluxioAlluxio, Inc.

Spark and S3 with Ryan BlueDatabricks

Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Spark Summit

Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit

Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAlluxio, Inc.

What's hot (20)

Building a Business Logic Translation Engine with Spark Streaming for Communi...

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...

Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...

Apache Spark on K8S Best Practice and Performance in the Cloud

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen

Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou

Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...

Spark Summit EU talk by Jorg Schad

Open Source Ingredients for Interactive Data Analysis in Spark by Maxim Lukiy...

High Availability PostgreSQL on OpenShift...and more!

Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...

Sparklyr: Recap, Updates, and Use Cases with Javier Luraschi

APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk

Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...

RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...

Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio

Spark and S3 with Ryan Blue

Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...

Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma

Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio

Similar to Spark Pipelines in the Cloud with Alluxio with Gene Pang

Best Practices for Using Alluxio with SparkAlluxio, Inc.

Spark Pipelines in the Cloud with Alluxio by Bin FanData Con LA

Best Practices for Using Alluxio with SparkAlluxio, Inc.

Accelerating Spark Workloads in a Mesos Environment with AlluxioAlluxio, Inc.

Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017Alluxio, Inc.

Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio, Inc.

Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017 Alluxio, Inc.

Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio Ceph Community

Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...Alluxio, Inc.

Accelerating Spark with KubernetesAlluxio, Inc.

Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems IntegrationOleg Nenashev

Building Fast SQL Analytics on Anything with Presto, AlluxioAlluxio, Inc.

Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio, Inc.

Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017Alluxio, Inc.

Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.

Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsAlluxio, Inc.

Unified Big Data Analytics: Any Stack, Any CloudAlluxio, Inc.

Realizing the Promise of Portable Data Processing with Apache BeamDataWorks Summit

ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...Alluxio, Inc.

Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Spark Summit

Similar to Spark Pipelines in the Cloud with Alluxio with Gene Pang (20)

Best Practices for Using Alluxio with Spark

Spark Pipelines in the Cloud with Alluxio by Bin Fan

Best Practices for Using Alluxio with Spark

Accelerating Spark Workloads in a Mesos Environment with Alluxio

Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017

Alluxio+Presto: An Architecture for Fast SQL in the Cloud

Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017

Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio

Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...

Accelerating Spark with Kubernetes

Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration

Building Fast SQL Analytics on Anything with Presto, Alluxio

Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017

Iceberg + Alluxio for Fast Data Analytics

Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics

Unified Big Data Analytics: Any Stack, Any Cloud

Realizing the Promise of Portable Data Processing with Apache Beam

ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...

Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...

Recently uploaded

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics

Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy

Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml

Insurance Churn Prediction Data Analysis ProjectBoston Institute of Analytics

Networking Case Study prepared by teacher.pptxHimangsuNath

Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics

Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics

Principles and Practices of Data VisualizationKianJazayeri1

SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1

Learn How Data Science Changes Our WorldEduminds Learning

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone

Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics

Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics

World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia

NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali

What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)

Recently uploaded (20)

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...

Student Profile Sample report on improving academic performance by uniting gr...

Data Factory in Microsoft Fabric (MsBIP #82)

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf

Insurance Churn Prediction Data Analysis Project

Networking Case Study prepared by teacher.pptx

Bank Loan Approval Analysis: A Comprehensive Data Analysis Project

Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model

Principles and Practices of Data Visualization

SMOTE and K-Fold Cross Validation-Presentation.pptx

Learn How Data Science Changes Our World

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024

Decoding Patterns: Customer Churn Prediction Data Analysis Project

Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...

World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf

NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...

What To Do For World Nature Conservation Day by Slidesgo.pptx

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...

Spark Pipelines in the Cloud with Alluxio with Gene Pang

1. Spark Pipelines in the Cloud with Alluxio Gene Pang,Alluxio, Inc. Spark Summit EU - October 2017

2. About Me •  Gene Pang •  Software engineer @ Alluxio, Inc. •  Alluxio open source PMC member •  Ph.D. from AMPLab @ UC Berkeley •  Worked at Google before UC Berkeley •  Twitter: @unityxx •  Github: @gpang ©2017 Alluxio, Inc.All Rights Reserved 2

4. History of Alluxio Started at UC Berkeley AMPLab In Summer 2012 •  Originally named as Tachyon •  Rebranded to Alluxio in early 2016 Open Sourced in 2013 •  Apache License 2.0 •  Latest Release:Alluxio 1.6.0 ©2017 Alluxio, Inc.All Rights Reserved 4

6. Data Ecosystem with Alluxio •  Apps only talk to Alluxio •  Simple Add/Remove •  No App Changes •  In-Memory Performance Native File System Hadoop Compatible File System Native Key-Value Interface Fuse Compatible File System HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface ©2017 Alluxio, Inc.All Rights Reserved 6

7. Next Gen Analytics with Alluxio Native File System Hadoop Compatible File System Native Key-Value Interface Fuse Compatible File System HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface Apps, Data & Storage at Memory Speed ü  Big Data/IoT ü  AI/ML ü  Deep Learning ü  Cloud Migration ü  Multi Platform ü  Autonomous ©2017 Alluxio, Inc.All Rights Reserved 7

8. Fastest Growing Big Data Open Source Projects Fastest Growing open- source project in the big data ecosystem Running in large production clusters Now: 600+ Contributors from 100+ organizations 0 100 200 300 400 500 0 10 20 30 40 45 NumberofContributors Github Open Source Contributors by Month Alluxio Spark Kafka Redis HDFS Cassandra Hive ©2017 Alluxio, Inc.All Rights Reserved 8

24. Log Pipeline in Amazon Web Services ©2017 Alluxio, Inc.All Rights Reserved 2 4 Generate Parquet Transform Aggregate Generate: [MapReduce] Create random csv log data Parquet: [MapReduce] Convert csv to parquet format Transform: [Spark] Update column values Aggregate: [Spark] Compute group by / aggregate

25. Log Pipeline Environment •  r4.2xlarge instances (61 GB ram, 8 CPUs) •  1 master, 3 workers •  Apache Spark 2.2.0 •  Apache Hadoop 2.7.2 •  Alluxio 1.6.0 •  Generate 12 GB of logs •  Compare AWS S3 vs Alluxio w/ Fast Durable Writes ©2017 Alluxio, Inc.All Rights Reserved 2 5

28. Alluxio and Pipelines in the Cloud Alluxio enables in-memory sharing for data pipelines in the cloud Alluxio’s Fast Durable Write feature increases performance without sacrificing fault tolerance ©2017 Alluxio, Inc.All Rights Reserved 2 8

29. Thank you! Gene Pang gene@alluxio.com Twitter: @unityxx Twi$er.com/alluxio Linkedin.com/alluxio Website www.alluxio.com E-mail info@alluxio.com @ Social Media á ™ ©2017 Alluxio, Inc.All Rights Reserved 2 9

Spark Pipelines in the Cloud with Alluxio with Gene Pang

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Spark Pipelines in the Cloud with Alluxio with Gene Pang

Similar to Spark Pipelines in the Cloud with Alluxio with Gene Pang (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Spark Pipelines in the Cloud with Alluxio with Gene Pang