Bin Fan, Tachyon Nexus
July 19, 2015 @ Tachyon Workshop
tachyon-project.org
A Reliable Memory-Centric Distributed Storage System
• Founded by Tachyon creators and top contributors
• $7.5 million Series A from Andreessen Horowitz
• Committed to Tachyon Open Source
• www.tachyonnexus.com
Outline
• Overview
– Motivation
– Tachyon Architecture
– Using Tachyon
• Open Source
– Status
– Production Use Cases
• Roadmap
Started From UCB AMPLab
Berkeley Data Analytics Stack (BDAS)
• Cluster manager
• Parallel computation framework
• Reliable, distributed memory-centric storage system
Why Tachyon?
Memory is Fast
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
Memory-locality key to interactive response times
Memory is Cheaper
source: jcmit.com
Realized by many…
Is the Problem Solved?
Missing a Solution for the Storage Layer
An Example: Spark
• Fast, in-memory data processing framework
– Keep one in-memory copy inside JVM
– Track lineage of operations used to derive data
– Upon failure, use lineage to recompute data
[Figure: Lineage Tracking, a graph of map, filter, join, and reduce operations]
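The lineage bullets above can be illustrated with a small, self-contained Scala model. This is only a sketch under stated assumptions: `Source`, `Derived`, and `Cache` are hypothetical names, not Spark or Tachyon classes, and a real system tracks lineage across a cluster rather than in a single map inside one JVM.

```scala
// Toy model of lineage-based recovery. All names here are hypothetical;
// this is not Spark's or Tachyon's API.
object LineageSketch {
  // A dataset is either source data or derived from a parent by an operation.
  sealed trait Dataset { def compute(): Vector[Int] }

  final case class Source(data: Vector[Int]) extends Dataset {
    def compute(): Vector[Int] = data
  }

  // The recorded operation is the lineage: enough to rebuild the result.
  final case class Derived(parent: Dataset, op: Vector[Int] => Vector[Int]) extends Dataset {
    def compute(): Vector[Int] = op(parent.compute())
  }

  // Keeps one in-memory copy per dataset name and counts recomputations.
  final class Cache {
    private val store = scala.collection.mutable.Map.empty[String, Vector[Int]]
    var recomputations: Int = 0

    // Return the cached copy, or recompute it from lineage if it is missing.
    def get(name: String, lineage: Dataset): Vector[Int] =
      store.getOrElseUpdate(name, { recomputations += 1; lineage.compute() })

    // Simulate a failure that drops the in-memory copy.
    def lose(name: String): Unit = { store.remove(name); () }
  }
}
```

Calling `get` twice computes the data once; after `lose`, the next `get` recomputes it from the recorded operations instead of reading a replica.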
Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk
[Figure: Spark Job1 and Spark Job2 each hold blocks 1 and 3 in their own Spark memory (block manager) and share data through HDFS / Amazon S3 (blocks 1 to 4). Storage engine and execution engine run in the same process (slow writes).]
[Figure: a Spark job (blocks 1 and 3 in Spark memory, block manager) and a Hadoop MR job on YARN share data through HDFS / Amazon S3 (blocks 1 to 4). Storage engine and execution engine run in the same process (slow writes).]
Issue 2
Cache loss when process crashes
[Figure sequence: a Spark task holds blocks 1 and 3 in Spark memory (block manager), with blocks 1 to 4 in HDFS / Amazon S3. Execution engine and storage engine run in the same process, so when the process crashes the in-memory blocks are lost and only the copies in HDFS / Amazon S3 remain.]
Issue 3
In-memory Data Duplication & Java Garbage Collection
[Figure: Spark Task1 and Spark Task2 each hold their own copies of blocks 1 and 3 in Spark memory (block manager), alongside stored blocks 1 to 4. Execution engine and storage engine run in the same process (duplication & GC).]
Tachyon
Reliable data sharing at memory-speed within and across cluster frameworks/jobs
Technical Overview
Ideas
• A memory-centric storage architecture
• Push lineage down to storage layer
• Manage tiered storage
Facts
• One data copy in memory
• Re-computation for fault-tolerance
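The "manage tiered storage" idea can be sketched as a toy model in which new blocks land in the fastest tier and older blocks are demoted downward when a tier fills up. Everything here is an assumption made for illustration: the names `Tier` and `TieredStore`, the oldest-first demotion policy, and the tier capacities are hypothetical and do not describe Tachyon's actual implementation.

```scala
// Toy model of tiered block placement. Names and the demotion policy are
// assumptions for illustration, not Tachyon's implementation.
object TieredStoreSketch {
  final class Tier(val name: String, val capacity: Int) {
    require(capacity >= 1, "each tier must hold at least one block")
    // Insertion-ordered, so the head entry is always the oldest block.
    val blocks = scala.collection.mutable.LinkedHashMap.empty[Int, String]
  }

  // Tiers are ordered fastest first, e.g. MEM, SSD, HDD.
  final class TieredStore(tiers: Vector[Tier]) {
    // New blocks always enter the top (fastest) tier.
    def put(id: Int, data: String): Unit = insert(0, id, data)

    // If a tier is full, demote its oldest block one level down first.
    // Blocks pushed past the last tier are dropped.
    private def insert(level: Int, id: Int, data: String): Unit =
      if (level < tiers.length) {
        val tier = tiers(level)
        if (tier.blocks.size >= tier.capacity) {
          val (oldId, oldData) = tier.blocks.head
          tier.blocks.remove(oldId)
          insert(level + 1, oldId, oldData)
        }
        tier.blocks.put(id, data)
        ()
      }

    // Name of the tier currently holding the block, if any.
    def locate(id: Int): Option[String] =
      tiers.find(_.blocks.contains(id)).map(_.name)
  }
}
```

With tiers MEM (1 block), SSD (1 block), and HDD (2 blocks), writing blocks 1, 2, and 3 in order leaves the newest block in MEM and pushes the older ones down to SSD and HDD.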
Stack
• Frameworks: Apache Spark, Apache MR, Apache HBase, H2O, Apache Flink, Impala, ……
• Under Storage: S3, GlusterFS, HDFS, Swift, NFS, Ceph, ……
Tachyon Memory-Centric Architecture
Lineage in Tachyon
Issue 1 revisited
Memory-speed data sharing among jobs in different frameworks
[Figure: a Spark job and a Hadoop MR job on YARN share blocks 1, 3, and 4 through Tachyon in-memory storage, backed by blocks 1 to 4 on disk in HDFS / Amazon S3. Execution engine and storage engine in the same process (fast writes).]
Issue 2 revisited
Keep in-memory data safe, even when a job crashes.
[Figure sequence: a Spark task (Spark memory, block manager) crashes; blocks 1, 3, and 4 stay in Tachyon in-memory storage, backed by blocks 1 to 4 on disk in HDFS / Amazon S3.]
Issue 3 revisited
No in-memory data duplication, much less GC
[Figure: two Spark tasks read blocks 1, 3, and 4 from a single copy in Tachyon in-memory storage, backed by blocks 1 to 4 on disk in HDFS / Amazon S3 (no duplication & GC).]
Comparison with In-Memory HDFS
How easy / hard to use Tachyon?
Spark/MapReduce/Shark without Tachyon
• Spark
scala> val file = sc.textFile("hdfs://ip:port/path")
• Hadoop MapReduce
$ hadoop jar hadoop-examples-1.0.4.jar wordcount \
    hdfs://localhost:19998/input \
    hdfs://localhost:19998/output
• Shark
CREATE TABLE orders_cached AS SELECT * FROM orders;
Spark/MapReduce/Shark with Tachyon
• Spark
scala> val file = sc.textFile("tachyon://ip:port/path")
• Hadoop MapReduce
$ hadoop jar hadoop-examples-1.0.4.jar wordcount \
    tachyon://localhost:19998/input \
    tachyon://localhost:19998/output
• Shark
CREATE TABLE orders_tachyon AS SELECT * FROM orders;
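In the Spark and MapReduce examples, the only difference between the two slides is the URI scheme (and the address it points at). A minimal sketch of that swap, assuming a hypothetical helper `toTachyon` and placeholder host names and ports; it uses only `java.net.URI` from the JDK and is not part of Tachyon or Spark:

```scala
import java.net.URI

// Hypothetical helper for illustration: rewrites an existing path so that it
// points at a Tachyon master. Not part of Tachyon's or Spark's API.
object SchemeSwapSketch {
  // Replace scheme and authority (host:port); keep the file path unchanged.
  def toTachyon(path: String, tachyonAuthority: String): String = {
    val uri = new URI(path)
    new URI("tachyon", tachyonAuthority, uri.getPath, null, null).toString
  }
}
```

For example, `SchemeSwapSketch.toTachyon("hdfs://namenode:9000/data/input", "localhost:19998")` yields a `tachyon://` path with the same `/data/input` suffix, which could then be passed to `sc.textFile` as on the slide.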
Open Source Status
• Started at UC Berkeley AMPLab in Summer 2012
• Apache License 2.0, Version 0.7 (July 2015)
• Deployed at > 50 companies (July 2014)
• 30+ Companies Contributing
• Spark/MapReduce/Flink applications can run without code change
Contributors Growth
• v0.1 (Dec ’12): 1
• v0.2 (Apr ’13): 3
• v0.3 (Oct ’13): 15
• v0.4 (Feb ’14): 30
• v0.5 (Jul ’14): 46
• v0.6 (Mar ’15): 70
• v0.7 (Jul ’15): 100+
Codebase Growth
• v0.2 (Apr ’13): 465 commits
• v0.3 (Oct ’13): 696 commits
• v0.4 (Feb ’14): 1080 commits
• v0.5 (Jul ’14): 1610 commits
• v0.6 (Mar ’15): 2884 commits
• v0.7 (Jul ’15): 4969 commits
Open Community
[Figure: Berkeley contributors vs. non-Berkeley contributors]
Thanks to Our Contributors!
Aaron Davidson
Abhiraj Butala
Achal Soni
Albert Chu
Ali Ghodsi
Andrew Ash
Anurag Khandelwal
Aslan Bekirov
Bill Zhao
Bin Fan
Bradley Childs
Calvin Jia
Carson Wang
Chao Chen
Cheng Chang
Cheng Hao
Colin Patrick McCabe
Dan Crankshaw
Darion Yaphet
David Capwell
David Zhu
Dina Leventol
Du Li
Fei Wang
Gene Pang
Gerald Zhang
Grace Huang
Haoyuan Li
Henry Saputra
Hobin Yoon
Huamin Chen
Jacky Li
Jey Kottalam
Jingxin Feng
Joseph Tang
Juan Zhou
Jun Aoki
Kun Xu
Lukasz Jastrzebski
Luogan Kun
Manu Goyal
Mark Hamstra
Mingfei Shi
Mubarak Seyed
Nan Dun
Nick Lanham
Orcun Simsek
Pengfei Xuan
Qianhao Dong
Qifan Pu
Ramaraju Indukuri
Raymond Liu
Rob Vesse
Robert Metzger
Rong Gu
Sean Zhong
Seonghwan Moon
Shaoshan Liu
Shivaram Venkataraman
Shu Peng
Srinivas Parayya
Tao Wang
Thu Kyaw
Timothy St. Clair
Vaishnav Kovvuri
Vikram Sreekanti
Xi Liu
Xiaomeng Huang
Xiaomin Zhang
Xing Lin
Yi Liu
Zhao Zhang
Tachyon Usage
Under Filesystem Choices
(Big Data, Cloud, HPC, Enterprise)
Use Case: Baidu
• Framework: SparkSQL
• Under Storage: Baidu’s File System
• Storage Media: MEM + HDD
• 100+ nodes deployment
• 1PB+ managed space
• 30x Performance Improvement
More Details: www.meetup.com/Tachyon
Use Case: a SaaS Company
• Framework: Impala
• Under Storage: S3
• Storage Media: MEM + SSD
• 15x Performance Improvement
Use Case: an Oil Company
• Framework: Spark
• Under Storage: GlusterFS
• Storage Media: MEM only
• Analyzing data in traditional storage
Use Case: a SaaS Company
• Framework: Spark
• Under Storage: S3
• Storage Media: SSD only
• Elastic Tachyon deployment
New Features
• Lineage in Storage (alpha)
• Tiered Storage (alpha)
• Data Serving
• Support for New Hardware
• …
• Your New Feature!
Tachyon’s Goal?
Distributed Memory-Centric Storage:
Better Assist Other Components
Welcome Collaboration!
JIRA New Contributor Tasks
• Frameworks: Apache Spark, Apache MR, Apache HBase, H2O, Apache Flink, Impala, ……
• Under Storage: S3, GlusterFS, HDFS, Swift, NFS, Ceph, ……
• Website: http://tachyon-project.org
• GitHub: https://github.com/amplab/tachyon
• Meetup: http://www.meetup.com/Tachyon
• Training program: coming soon
• Tachyon Nexus is hiring
• Newsletter Subscription: http://goo.gl/mwB2sX
• Email: binfan@tachyonnexus.com
51

More Related Content

What's hot

Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Alluxio, Inc.
 
Accelerate Cloud Training with Alluxio
Accelerate Cloud Training with AlluxioAccelerate Cloud Training with Alluxio
Accelerate Cloud Training with AlluxioAlluxio, Inc.
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Alluxio, Inc.
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioAlluxio, Inc.
 
Tachyon meetup slides.
Tachyon meetup slides.Tachyon meetup slides.
Tachyon meetup slides.David Groozman
 
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.
 
Alluxio on AWS EMR Fast Storage Access & Sharing for Spark
Alluxio on AWS EMR Fast Storage Access & Sharing for SparkAlluxio on AWS EMR Fast Storage Access & Sharing for Spark
Alluxio on AWS EMR Fast Storage Access & Sharing for SparkAlluxio, Inc.
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataprocAlluxio, Inc.
 
Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit
 
Embracing hybrid cloud for data-intensive analytic workloads
Embracing hybrid cloud for data-intensive analytic workloadsEmbracing hybrid cloud for data-intensive analytic workloads
Embracing hybrid cloud for data-intensive analytic workloadsAlluxio, Inc.
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...Alluxio, Inc.
 
Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3
Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3
Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3Alluxio, Inc.
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
 
Improving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTokImproving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTokAlluxio, Inc.
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheAlluxio, Inc.
 
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsSimplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsAlluxio, Inc.
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAlluxio, Inc.
 
Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Alluxio, Inc.
 

What's hot (20)

Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
 
Accelerate Cloud Training with Alluxio
Accelerate Cloud Training with AlluxioAccelerate Cloud Training with Alluxio
Accelerate Cloud Training with Alluxio
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+Alluxio
 
Tachyon meetup slides.
Tachyon meetup slides.Tachyon meetup slides.
Tachyon meetup slides.
 
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
 
Alluxio on AWS EMR Fast Storage Access & Sharing for Spark
Alluxio on AWS EMR Fast Storage Access & Sharing for SparkAlluxio on AWS EMR Fast Storage Access & Sharing for Spark
Alluxio on AWS EMR Fast Storage Access & Sharing for Spark
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
 
Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri Simsa
 
Embracing hybrid cloud for data-intensive analytic workloads
Embracing hybrid cloud for data-intensive analytic workloadsEmbracing hybrid cloud for data-intensive analytic workloads
Embracing hybrid cloud for data-intensive analytic workloads
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...
Using Alluxio as a Fault-tolerant Pluggable Optimization Component of JD.com'...
 
Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3
Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3
Cybersecurity and fraud detection at ING Bank using Presto & Alluxio on S3
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Improving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTokImproving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTok
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
 
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi CloudsSimplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
Simplified Data Preparation for Machine Learning in Hybrid and Multi Clouds
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3
 

Viewers also liked

Open Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed StorageOpen Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed StorageAlluxio, Inc.
 
The Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand ClustersThe Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand ClustersAlluxio, Inc.
 
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed StorageAlluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed StorageAlluxio, Inc.
 
A Design of Distributed Storage System over HTTP for Collecting Sensor Data
A Design of Distributed Storage System over HTTP for Collecting Sensor DataA Design of Distributed Storage System over HTTP for Collecting Sensor Data
A Design of Distributed Storage System over HTTP for Collecting Sensor DataSayed Ahmad Naweed
 
Distributed storage performance for OpenStack clouds using small-file IO work...
Distributed storage performance for OpenStack clouds using small-file IO work...Distributed storage performance for OpenStack clouds using small-file IO work...
Distributed storage performance for OpenStack clouds using small-file IO work...Principled Technologies
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage systemItalo Santos
 
DumpFS - A Distributed Storage Solution
DumpFS - A Distributed Storage SolutionDumpFS - A Distributed Storage Solution
DumpFS - A Distributed Storage SolutionNuno Loureiro
 
7 distributed storage_open_stack
7 distributed storage_open_stack7 distributed storage_open_stack
7 distributed storage_open_stackopenstackindia
 
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017Alluxio, Inc.
 
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...Gluster.org
 
The shortest path is not always a straight line
The shortest path is not always a straight lineThe shortest path is not always a straight line
The shortest path is not always a straight lineVasia Kalavri
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Jiří Šimša
 
Accessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified NamespaceAccessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified NamespaceAlluxio, Inc.
 
Distributed Storage and Compute With Ceph's librados (Vault 2015)
Distributed Storage and Compute With Ceph's librados (Vault 2015)Distributed Storage and Compute With Ceph's librados (Vault 2015)
Distributed Storage and Compute With Ceph's librados (Vault 2015)Sage Weil
 
Strategies for Distributed Data Storage
Strategies for Distributed Data StorageStrategies for Distributed Data Storage
Strategies for Distributed Data Storagekakugawa
 
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio, Inc.
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connectKnoldus Inc.
 

Viewers also liked (20)

Open Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed StorageOpen Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed Storage
 
The Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand ClustersThe Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand Clusters
 
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed StorageAlluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
 
Torus
TorusTorus
Torus
 
A Design of Distributed Storage System over HTTP for Collecting Sensor Data
A Design of Distributed Storage System over HTTP for Collecting Sensor DataA Design of Distributed Storage System over HTTP for Collecting Sensor Data
A Design of Distributed Storage System over HTTP for Collecting Sensor Data
 
Distributed storage performance for OpenStack clouds using small-file IO work...
Distributed storage performance for OpenStack clouds using small-file IO work...Distributed storage performance for OpenStack clouds using small-file IO work...
Distributed storage performance for OpenStack clouds using small-file IO work...
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage system
 
DumpFS - A Distributed Storage Solution
DumpFS - A Distributed Storage SolutionDumpFS - A Distributed Storage Solution
DumpFS - A Distributed Storage Solution
 
7 distributed storage_open_stack
7 distributed storage_open_stack7 distributed storage_open_stack
7 distributed storage_open_stack
 
Distributed storage system
Distributed storage systemDistributed storage system
Distributed storage system
 
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
 
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...
 
The shortest path is not always a straight line
The shortest path is not always a straight lineThe shortest path is not always a straight line
The shortest path is not always a straight line
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016
 
Accessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified NamespaceAccessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified Namespace
 
Distributed Storage and Compute With Ceph's librados (Vault 2015)
Distributed Storage and Compute With Ceph's librados (Vault 2015)Distributed Storage and Compute With Ceph's librados (Vault 2015)
Distributed Storage and Compute With Ceph's librados (Vault 2015)
 
Strategies for Distributed Data Storage
Strategies for Distributed Data StorageStrategies for Distributed Data Storage
Strategies for Distributed Data Storage
 
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache FlinkThe Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 

Similar to Tachyon workshop 2015-07-19

Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Haoyuan Li
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 
Tachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBMTachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBMShaoshan Liu
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio, Inc.
 
Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaAlluxio, Inc.
 
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017Alluxio, Inc.
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...Data Con LA
 
Cache Rules Everything Around Me - DevIntersection - December 2022
Cache Rules Everything Around Me - DevIntersection - December 2022Cache Rules Everything Around Me - DevIntersection - December 2022
Cache Rules Everything Around Me - DevIntersection - December 2022Matthew Groves
 
CREAM - That Conference Austin - January 2024.pptx
CREAM - That Conference Austin - January 2024.pptxCREAM - That Conference Austin - January 2024.pptx
CREAM - That Conference Austin - January 2024.pptxMatthew Groves
 
Cache Rules Everything Around Me - Momentum - October 2022.pptx
Cache Rules Everything Around Me - Momentum - October 2022.pptxCache Rules Everything Around Me - Momentum - October 2022.pptx
Cache Rules Everything Around Me - Momentum - October 2022.pptxMatthew Groves
 
Caching Methodology & Strategies
Caching Methodology & StrategiesCaching Methodology & Strategies
Caching Methodology & StrategiesTiệp Vũ
 
Caching methodology and strategies
Caching methodology and strategiesCaching methodology and strategies
Caching methodology and strategiesTiep Vu
 
Introduction to Memory-Style Storage in Linux
Introduction to Memory-Style Storage in LinuxIntroduction to Memory-Style Storage in Linux
Introduction to Memory-Style Storage in LinuxClay (Chih-Hao) Chang
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Hadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologiesHadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologiesBigdata Meetup Kochi
 
Advanced caching techniques with ehcache, big memory, terracotta, and coldfusion
Advanced caching techniques with ehcache, big memory, terracotta, and coldfusionAdvanced caching techniques with ehcache, big memory, terracotta, and coldfusion
Advanced caching techniques with ehcache, big memory, terracotta, and coldfusionColdFusionConference
 
What's new in hadoop 3.0
What's new in hadoop 3.0What's new in hadoop 3.0
What's new in hadoop 3.0Heiko Loewe
 

Similar to Tachyon workshop 2015-07-19 (20)

Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Tachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBMTachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBM
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Tachyon workshop 2015-07-19

  • 1. Bin Fan, Tachyon Nexus July 19, 2015 @ Tachyon Workshop tachyon-project.org A Reliable Memory-Centric Distributed Storage System
  • 2. • Founded by Tachyon creators and top contributors • $7.5 million Series A from Andreessen Horowitz • Committed to Tachyon Open Source • www.tachyonnexus.com
  • 4. Outline • Overview – Motivation – Tachyon Architecture – Using Tachyon • Open Source – Status – Production Use Cases • Roadmap
  • 6. Started From UCB AMPLab • Berkeley Data Analytics Stack (BDAS): cluster manager; parallel computation framework; reliable, distributed memory-centric storage system
  • 8. Memory is Fast • RAM throughput increasing exponentially • Disk throughput increasing slowly • Memory-locality key to interactive response times
  • 12. Missing a Solution for the Storage Layer
  • 13. An Example: Spark • Fast, in-memory data processing framework – Keep one in-memory copy inside JVM – Track lineage of operations used to derive data – Upon failure, use lineage to recompute data [Diagram: lineage tracking across map, filter, map, join, and reduce operations]
  • 14. Issue 1: Data sharing is the bottleneck in the analytics pipeline (slow writes to disk) [Diagram: Spark Job1 and Spark Job2, each with blocks 1 and 3 in its own Spark memory block manager (storage engine and execution engine in the same process), over HDFS / Amazon S3 holding blocks 1 to 4; label: slow writes]
  • 15. Issue 1: Data sharing is the bottleneck in the analytics pipeline (slow writes to disk) [Diagram: a Spark job with blocks 1 and 3 in its Spark memory block manager (storage engine and execution engine in the same process) and a Hadoop MR job on YARN, over HDFS / Amazon S3 holding blocks 1 to 4; label: slow writes]
  • 16. Issue 2: Cache loss when process crashes [Diagram: a Spark task with blocks 1 and 3 in its Spark memory block manager (execution engine and storage engine in the same process), over HDFS / Amazon S3 holding blocks 1 to 4]
  • 17. Issue 2: Cache loss when process crashes [Diagram: the task is marked "crash"; Spark memory block manager with blocks 1 and 3, over HDFS / Amazon S3 holding blocks 1 to 4]
  • 18. Issue 2: Cache loss when process crashes [Diagram: marked "crash"; only HDFS / Amazon S3 with blocks 1 to 4 is shown]
  • 19. Issue 3: In-memory data duplication and Java garbage collection [Diagram: Spark Task1 and Spark Task2, each with blocks 1 and 3 in its own Spark memory block manager (execution engine and storage engine in the same process), over HDFS / Amazon S3 holding blocks 1 to 4; label: duplication & GC]
  • 20. Tachyon: Reliable data sharing at memory speed within and across cluster frameworks/jobs
  • 21. Technical Overview • Ideas: – A memory-centric storage architecture – Push lineage down to storage layer – Manage tiered storage • Facts: – One data copy in memory – Re-computation for fault-tolerance
  • 26. Issue 1 revisited: Memory-speed data sharing among jobs in different frameworks [Diagram: a Spark job (Spark mem) and a Hadoop MR job (YARN) over Tachyon in-memory holding blocks 1, 3, and 4, with HDFS / Amazon S3 holding blocks 1 to 4; label: fast writes]
  • 27. Issue 2 revisited: Keep in-memory data safe, even when a job crashes [Diagram: a Spark task (Spark memory block manager) over Tachyon in-memory holding blocks 1, 3, and 4, with HDFS / Amazon S3 holding blocks 1 to 4]
  • 28. Issue 2 revisited: Keep in-memory data safe, even when a job crashes [Diagram: marked "crash"; Tachyon in-memory still shows blocks 1, 3, and 4, with HDFS / Amazon S3 holding blocks 1 to 4]
  • 29. Issue 3 revisited: No in-memory data duplication, much less GC [Diagram: two Spark tasks (Spark mem) over one Tachyon in-memory set of blocks 1, 3, and 4, with HDFS / Amazon S3 holding blocks 1 to 4; label: no duplication & GC]
  • 31. How easy or hard is it to use Tachyon?
  • 32. Spark/MapReduce/Shark without Tachyon • Spark: scala> val file = sc.textFile("hdfs://ip:port/path") • Hadoop MapReduce: $ hadoop jar hadoop-examples-1.0.4.jar wordcount hdfs://localhost:19998/input hdfs://localhost:19998/output • Shark: CREATE TABLE orders_cached AS SELECT * FROM orders;
  • 33. Spark/MapReduce/Shark with Tachyon • Spark: scala> val file = sc.textFile("tachyon://ip:port/path") • Hadoop MapReduce: $ hadoop jar hadoop-examples-1.0.4.jar wordcount tachyon://localhost:19998/input tachyon://localhost:19998/output • Shark: CREATE TABLE orders_tachyon AS SELECT * FROM orders;
  • 34. Outline • Overview – Motivation – Tachyon Architecture – Using Tachyon • Open Source – Status – Production Use Cases • Roadmap
  • 35. Open Source Status • Started at UC Berkeley AMPLab in Summer 2012 • Apache License 2.0, Version 0.7 (July 2015) • Deployed at > 50 companies (July 2014) • 30+ Companies Contributing • Spark/MapReduce/Flink applications can run without code change
  • 36. Contributors Growth • v0.1 (Dec '12): 1 • v0.2 (Apr '13): 3 • v0.3 (Oct '13): 15 • v0.4 (Feb '14): 30 • v0.5 (Jul '14): 46 • v0.6 (Mar '15): 70 • v0.7 (Jul '15): 100+
  • 37. Codebase Growth • v0.2 (Apr '13): 465 commits • v0.3 (Oct '13): 696 commits • v0.4 (Feb '14): 1080 commits • v0.5 (Jul '14): 1610 commits • v0.6 (Mar '15): 2884 commits • v0.7 (Jul '15): 4969 commits
  • 39. Thanks to Our Contributors! Aaron Davidson, Abhiraj Butala, Achal Soni, Albert Chu, Ali Ghodsi, Andrew Ash, Anurag Khandelwal, Aslan Bekirov, Bill Zhao, Bin Fan, Bradley Childs, Calvin Jia, Carson Wang, Chao Chen, Cheng Chang, Cheng Hao, Colin Patrick McCabe, Dan Crankshaw, Darion Yaphet, David Capwell, David Zhu, Dina Leventol, Du Li, Fei Wang, Gene Pang, Gerald Zhang, Grace Huang, Haoyuan Li, Henry Saputra, Hobin Yoon, Huamin Chen, Jacky Li, Jey Kottalam, Jingxin Feng, Joseph Tang, Juan Zhou, Jun Aoki, Kun Xu, Lukasz Jastrzebski, Luogan Kun, Manu Goyal, Mark Hamstra, Mingfei Shi, Mubarak Seyed, Nan Dun, Nick Lanham, Orcun Simsek, Pengfei Xuan, Qianhao Dong, Qifan Pu, Ramaraju Indukuri, Raymond Liu, Rob Vesse, Robert Metzger, Rong Gu, Sean Zhong, Seonghwan Moon, Shaoshan Liu, Shivaram Venkataraman, Shu Peng, Srinivas Parayya, Tao Wang, Thu Kyaw, Timothy St. Clair, Vaishnav Kovvuri, Vikram Sreekanti, Xi Liu, Xiaomeng Huang, Xiaomin Zhang, Xing Lin, Yi Liu, Zhao Zhang
  • 41. Under Filesystem Choices (Big Data, Cloud, HPC, Enterprise)
  • 42. Use Case: Baidu • Framework: SparkSQL • Under Storage: Baidu's File System • Storage Media: MEM + HDD • 100+ nodes deployment • 1PB+ managed space • 30x Performance Improvement • More Details: www.meetup.com/Tachyon
  • 43. Use Case: a SaaS Company • Framework: Impala • Under Storage: S3 • Storage Media: MEM + SSD • 15x Performance Improvement
  • 44. Use Case: an Oil Company • Framework: Spark • Under Storage: GlusterFS • Storage Media: MEM only • Analyzing data in traditional storage
  • 45. Use Case: a SaaS Company • Framework: Spark • Under Storage: S3 • Storage Media: SSD only • Elastic Tachyon deployment
  • 46. Outline • Overview – Motivation – Tachyon Architecture – Using Tachyon • Open Source – Status – Production Use Cases • Roadmap
  • 47. New Features • Lineage in Storage (alpha) • Tiered Storage (alpha)
  • 48. New Features • Lineage in Storage (alpha) • Tiered Storage (alpha) • Data Serving • Support for New Hardware • … • Your New Feature!
  • 50. Distributed Memory-Centric Storage: Better Assist Other Components • Welcome Collaboration! • JIRA: New Contributor Tasks [Diagram labels: frameworks Apache Spark, Apache MR, Apache HBase, H2O, Apache Flink, Impala, …; storage systems S3, GlusterFS, HDFS, Swift, NFS, Ceph, …]
  • 51. • Website: http://tachyon-project.org • GitHub: https://github.com/amplab/tachyon • Meetup: http://www.meetup.com/Tachyon • Training program: coming soon • Tachyon Nexus is hiring • Newsletter subscription: http://goo.gl/mwB2sX • Email: binfan@tachyonnexus.com
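
Slides 32 and 33 differ only in the URI scheme handed to each framework: hdfs:// becomes tachyon://, and the rest of the job stays the same. The Scala sketch below illustrates that one change with a small path-rewriting helper. TachyonPaths and toTachyon are hypothetical names invented for this illustration; they are not part of Tachyon or Spark. The default port 19998 is taken from the slides' examples, and the host names are placeholders.

```scala
import java.net.URI

// Hypothetical helper (not a Tachyon or Spark API): rewrites an hdfs:// path
// so that it points at a Tachyon master, keeping the file path unchanged.
object TachyonPaths {
  // 19998 is the master port shown in the slides' examples (assumed default here).
  def toTachyon(path: String, masterHost: String, masterPort: Int = 19998): String = {
    val uri = new URI(path)
    require(uri.getScheme == "hdfs", s"expected an hdfs:// path, got: $path")
    // Build a new URI with the tachyon scheme and the same file path.
    new URI("tachyon", null, masterHost, masterPort, uri.getPath, null, null).toString
  }
}
```

In a Spark shell, the rewritten path would be passed to sc.textFile exactly as on slide 33, for example sc.textFile(TachyonPaths.toTachyon("hdfs://namenode:9000/data/input", "localhost")), where namenode and localhost are placeholder hosts.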