Alluxio Presentation at Strata San Jose 2016

Alluxio (formerly Tachyon):
Unified Namespace and Tiered Storage
Calvin Jia, Jiri Simsa

One of the Things to Watch at Strata
TechCrunch article:
“… An interesting item that made the top terms list is ‘alluxio,’ which is the recently renamed Tachyon project. Alluxio is a virtual distributed storage system, and it has a memory-centric architecture that enables data sharing across clusters at memory speed. …”

Who Are We?
• Calvin Jia
– SWE @ Alluxio, Inc.
– #1 Alluxio contributor
– Twitter: @JiaCalvin
• Jiri Simsa
– SWE @ Alluxio, Inc.
– CMU Ph.D. & Google
– Twitter: @jsimsa

Alluxio Inc.
• Founded by Alluxio creators and top
committers
• Formerly Tachyon Nexus, Inc.
• $7.5 million Series A by Andreessen Horowitz
• Committed to the Alluxio Open Source
Project
• Company Website: http://www.alluxio.com

Outline
• Alluxio Introduction
• Tiered Storage
• Unified Namespace

ALLUXIO:
Open Source Memory Speed
Virtual Distributed Storage

Memory Speed
• Memory-centric architecture designed for memory I/O
Virtual
• Abstracts persistent storage from applications
Distributed
• Designed to scale with nothing but commodity hardware
Open Source
• One of the fastest growing project communities

Contributor Growth
• Over 200 Contributors
– 3x growth over the last year

Organizations
• Over 50 Organizations

Simple Examples
• Data sharing between frameworks
• Data resilience during application crashes
• Consolidate memory usage and alleviate
GC issues

Data Sharing Between Frameworks
[Diagram: a Spark job holds block 1 and block 3 in Spark memory (storage engine and execution engine in the same process); a Hadoop MR job runs on YARN; blocks 1-4 reside in HDFS / Amazon S3.]
Inter-process sharing slowed down by network and/or disk I/O

Data Sharing Between Frameworks
[Diagram: a Spark job with Spark memory (storage engine and execution engine in the same process); a Hadoop MR job on YARN; Alluxio holding block 1, block 3, and block 4 in memory; blocks 1-4 on disk in HDFS / Amazon S3.]
Inter-process sharing can happen at memory speed

Data Resilience during Crashes
[Diagram, three frames: (1) a Spark task whose block manager holds block 1 and block 3 in Spark memory (storage engine and execution engine in the same process), with blocks 1-4 in HDFS / Amazon S3; (2) the process crashes; (3) after the crash only the blocks in HDFS / Amazon S3 remain.]
Process crash requires network and/or disk I/O to re-read the data

[Diagram, two frames: (1) a Spark task with Spark memory and block manager (storage engine and execution engine in the same process), Alluxio holding block 1, block 3, and block 4 in memory, and blocks 1-4 on disk in HDFS; (2) the process crashes while the blocks in Alluxio and HDFS remain.]
Process crash only needs memory I/O to re-read the data

Consolidating Memory
[Diagram: Spark Job1 and Spark Job2 each hold their own copies of block 1 and block 3 in Spark memory (storage engine and execution engine in the same process); blocks 1-4 reside in HDFS / Amazon S3.]
Data duplicated at memory-level

Consolidating Memory
[Diagram: Spark Job1 and Spark Job2 each keep only Spark memory (storage engine and execution engine in the same process); Alluxio holds block 1, block 3, and block 4 in memory; blocks 1-4 reside on disk in HDFS / Amazon S3.]
Data not duplicated at memory-level

Case Study: Barclays
Making the Impossible Possible with Tachyon: Accelerate Spark
Jobs from Hours to Seconds
• Application: SparkSQL + Spark RDDs
• Alluxio Storage Layer: MEM
• Backend Storage: None
• Result: Speeding up Spark jobs from hours to seconds

Common Questions
• Memory speed sharing among distributed applications
– HDFS-compatible interface
• GC overhead introduced by in-memory caching
– Off-heap memory management
• Data set could be larger than available memory
– Tiered storage

Outline
• Alluxio Introduction
• Tiered Storage

Motivation
• Memory resources are still constrained
• Alluxio data management logic is not
limited to memory
• Storage resources available on compute
clusters

Tiered Storage
• Extends Alluxio with support for SSD and/or
HDD storage
• Different tiers have different characteristics
– Keep hot data in fast but limited storage
– Keep warm data in slower but abundant storage
• Workers manage their own storage
• Data allocation and eviction are driven by
application access
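
As a rough illustration of the bullets above, the sketch below models one worker's storage as an ordered list of tiers and places each new block in the fastest tier that still has room. The names (Tier, allocate) and the capacities are hypothetical and chosen for this example; they are not Alluxio's API or its actual allocation logic.

```python
from dataclasses import dataclass, field


@dataclass
class Tier:
    """One storage tier on a worker (hypothetical model, not Alluxio's API)."""
    alias: str                                   # e.g. "MEM", "SSD", "HDD"
    capacity: int                                # capacity in bytes
    blocks: dict = field(default_factory=dict)   # block id -> size in bytes

    def free(self) -> int:
        return self.capacity - sum(self.blocks.values())


def allocate(tiers: list, block_id: str, size: int) -> str:
    """Place a block in the fastest tier with enough free space.

    `tiers` is ordered fastest first. Returns the alias of the chosen tier.
    """
    for tier in tiers:
        if tier.free() >= size:
            tier.blocks[block_id] = size
            return tier.alias
    raise RuntimeError(f"no tier has {size} bytes free for {block_id}")


# Illustrative capacities only.
worker = [Tier("MEM", 100), Tier("SSD", 200), Tier("HDD", 1000)]
placements = [allocate(worker, f"block{i}", 60) for i in range(1, 6)]
print(placements)
```

With these made-up sizes the first block lands in memory and later blocks spill to SSD and then HDD once the faster tiers fill up.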

Tiered Storage Architecture
Machine Type 1
Compute Client
Alluxio Master
Memory, SSD, HDD
Machine Type 2
Compute Client
Alluxio Worker
Memory, SSD, HDD

Tiered Storage Architecture
Machine Type 2
Compute Client
• Alluxio Client
Alluxio Worker
• Tiered Block Store
• Evictor
• Allocator
Memory, SSD, HDD

Automatic Data Migration
• Data can be evicted to lower layers if it is “cooling down”
• Data can be promoted to upper layers if it is “warming up”
[Diagram: evict stale data to lower tier; promote hot data to upper tier]
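
The migration described on this slide can be sketched as a toy block store in which a read promotes the block to the top tier and a full tier pushes its least recently used block one tier down. This is a minimal sketch under stated assumptions: the TieredStore class, the least-recently-used rule, and capacities counted in blocks are all choices made for this example, not a description of Alluxio's implementation.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered block store; capacities are counted in blocks."""

    def __init__(self, capacities):
        # capacities: list of (alias, max_blocks), fastest tier first.
        self.aliases = [alias for alias, _ in capacities]
        self.limits = [limit for _, limit in capacities]
        # Each tier keeps its blocks in least-recently-used-first order.
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level, block_id):
        """Put a block into `level`, cascading evictions to lower tiers."""
        if level >= len(self.tiers):
            return  # fell off the last tier; the block leaves the store
        tier = self.tiers[level]
        if len(tier) >= self.limits[level]:
            stale, _ = tier.popitem(last=False)  # least recently used block
            self._insert(level + 1, stale)       # evict it to the lower tier
        tier[block_id] = True

    def write(self, block_id):
        self._insert(0, block_id)

    def read(self, block_id):
        """Access a block and promote it to the top tier.

        Returns the alias of the tier the block was found in, or None.
        """
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                del tier[block_id]
                self._insert(0, block_id)  # promote hot data
                return self.aliases[level]
        return None

    def layout(self):
        return {a: list(t) for a, t in zip(self.aliases, self.tiers)}


store = TieredStore([("MEM", 2), ("SSD", 2)])
for block in ["b1", "b2", "b3"]:
    store.write(block)
before = store.layout()        # b1 has cooled down and sits on SSD
found_in = store.read("b1")    # reading b1 warms it up again
after = store.layout()
print(before, found_in, after)
```

Here writing a third block evicts b1 to SSD, and the later read of b1 promotes it back to memory while b2 moves down.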

Pluggable Policies
• Policies can be customized to suit
workloads
• Defaults provided for general scenarios
• Advanced users can optimize with
additional knowledge
– For example: Optimize for iterations
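
One way to picture a pluggable policy is a cache that takes its eviction rule as a parameter. The sketch below is an assumption-laden illustration: the Evictor signature, BlockCache, and both policies are hypothetical names, and reading "optimize for iterations" as a repeated scan over a data set larger than the cache is an interpretation, not something the slide states. Under that reading, a least-recently-used default thrashes on the loop, while a policy that evicts the most recently used block keeps part of the data set resident.

```python
from typing import Callable, List

# An evictor receives the cached block ids ordered from least to most
# recently used and returns the id to evict. (Hypothetical interface.)
Evictor = Callable[[List[str]], str]


def lru_evictor(blocks: List[str]) -> str:
    """General-purpose default: evict the least recently used block."""
    return blocks[0]


def mru_evictor(blocks: List[str]) -> str:
    """Loop-aware alternative: evict the most recently used block."""
    return blocks[-1]


class BlockCache:
    def __init__(self, capacity: int, evictor: Evictor):
        self.capacity = capacity
        self.evictor = evictor
        self.blocks: List[str] = []  # least recently used first
        self.hits = 0

    def access(self, block_id: str) -> None:
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.remove(block_id)
        elif len(self.blocks) >= self.capacity:
            self.blocks.remove(self.evictor(self.blocks))
        self.blocks.append(block_id)  # now the most recently used block


def run(evictor: Evictor) -> int:
    """Scan three blocks three times through a two-block cache."""
    cache = BlockCache(capacity=2, evictor=evictor)
    for _ in range(3):
        for block_id in ["a", "b", "c"]:
            cache.access(block_id)
    return cache.hits


lru_hits = run(lru_evictor)
mru_hits = run(mru_evictor)
print(lru_hits, mru_hits)
```

Swapping the policy changes only the function passed to the cache, which is the point of keeping the policy pluggable.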

Case Study: Baidu
Baidu Queries Data 30 Times Faster with Alluxio
• Application: Spark
• Alluxio Storage: MEM + HDD
• Backend Storage: Baidu’s File System
• 200+ nodes deployment, 2PB+ managed space
• Result: Speeding up data querying by 30x

Outline
• Alluxio Introduction
• Tiered Storage
• Unified Namespace

Motivation
• At large organizations, data spans many storage
systems (object storage, network / distributed file
systems, DBs)
• Application logic needs to integrate with different types
of storage systems
• Data needs to be moved around to work around
application limitations
• In-house storage layers are built to address limitations
of legacy storage systems

Transparent Naming
• Applications can transparently and efficiently interact
with remote storage through Alluxio.
• Applications do not need to use different APIs for
interacting with different storage systems.
[Diagram: the Alluxio namespace alluxio://host:port/ shows directories data (reports, sales) and users (alice, bob), backed by the storage system s3n://bucket/directory with directories data and users.]

Single Namespace
• Applications can read and write different storage
systems.
• Decouples data location from application
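[Diagram: the Alluxio namespace alluxio://host:port/ shows directories data and users; users (alice, bob) is stored under hdfs://host:port/, and reports and sales are stored under s3n://bucket/directory; the two back ends are labeled Storage System A and Storage System B.]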
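
The single namespace can be thought of as a mount table that translates one path space into URIs on different storage systems. The sketch below is a toy model under stated assumptions: MountTable and its methods are hypothetical names, and the mapping of /data to the S3 location and /users to the HDFS location is one reading of the slide's diagram.

```python
class MountTable:
    """Maps paths in one namespace onto under-storage URIs (toy model)."""

    def __init__(self):
        self._mounts = {}  # namespace path -> under-storage URI

    def mount(self, path: str, ufs_uri: str) -> None:
        self._mounts[path.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Translate a namespace path using the longest matching mount."""
        best = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point == "/" else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        suffix = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{suffix}" if suffix else base


table = MountTable()
table.mount("/data", "s3n://bucket/directory")   # assumed mapping
table.mount("/users", "hdfs://host:port/users")  # assumed mapping

reports = table.resolve("/data/reports")
alice = table.resolve("/users/alice")
print(reports)
print(alice)
```

The application only ever sees /data/reports and /users/alice; which storage system holds the bytes is decided by the table.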

Architecture
[Diagram: ALLUXIO exposes the Alluxio Interface to applications and reaches HDFS, S3, Swift, … through the UFS Interface via an HDFS adapter, an S3 adapter, and a Swift adapter.]
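
The adapter layering can be sketched as one abstract interface with an implementation per storage system, selected by URI scheme. Everything named here (UnderStorage, InMemoryAdapter, AdapterRegistry) is hypothetical, and the adapters are in-memory stand-ins so the example runs without any real HDFS, S3, or Swift client.

```python
from abc import ABC, abstractmethod


class UnderStorage(ABC):
    """Common interface every adapter implements (hypothetical names)."""

    @abstractmethod
    def read(self, uri: str) -> bytes: ...

    @abstractmethod
    def write(self, uri: str, data: bytes) -> None: ...


class InMemoryAdapter(UnderStorage):
    """Stand-in for a real HDFS/S3/Swift adapter; keeps data in a dict."""

    def __init__(self, name: str):
        self.name = name
        self._objects = {}

    def read(self, uri: str) -> bytes:
        return self._objects[uri]

    def write(self, uri: str, data: bytes) -> None:
        self._objects[uri] = data


class AdapterRegistry:
    """Picks the adapter for a URI based on its scheme."""

    def __init__(self):
        self._by_scheme = {}

    def register(self, scheme: str, adapter: UnderStorage) -> None:
        self._by_scheme[scheme] = adapter

    def for_uri(self, uri: str) -> UnderStorage:
        scheme = uri.split("://", 1)[0]
        return self._by_scheme[scheme]


registry = AdapterRegistry()
registry.register("hdfs", InMemoryAdapter("HDFS adapter"))
registry.register("s3n", InMemoryAdapter("S3 adapter"))
registry.register("swift", InMemoryAdapter("Swift adapter"))

# The caller uses the same two methods regardless of the back end.
uri = "s3n://bucket/directory/reports"
registry.for_uri(uri).write(uri, b"q1 numbers")
chosen = registry.for_uri(uri).name
payload = registry.for_uri(uri).read(uri)
print(chosen, payload)
```

Supporting another storage system in this model means registering one more adapter; the code above the interface does not change.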

Alluxio Benefits
• Enable new workloads across storage systems
• Work with the framework of your choice
• Scale storage and compute independently

Resources
• Alluxio Project: http://www.alluxio.org
• Development: https://github.com/Alluxio/alluxio
• Meet Friends: http://www.meetup.com/Alluxio
• Alluxio Inc: http://www.alluxio.com
• Contact us: info@alluxio.com
