SlideShare a Scribd company logo
1 of 44
Download to read offline
Interactive Data Analytics
with Spark on Tachyon
in Baidu
Bin Fan (Tachyon Nexus)
binfan@tachyonnexus.com
Xiang Wen (Baidu)
wenxiang@baidu.com
Dec 02 2015 @ Strata + Hadoop World, Singapore
1
Who Are We?
• Bin Fan
• Tachyon Project
Contributor
• Software Engineer
at Tachyon Nexus
• Xiang Wen
• From Baidu Big
Data Department
• Senior Software
Engineer at Baidu
2
• Team consists of Tachyon creators, top contributors
• Series A ($7.5 million) from Andreessen Horowitz
• Committed to Tachyon Open Source Project
• www.tachyonnexus.com
3
Agenda
• Tachyon Basics & New Features
• Motivation
• Building an interactive data service
–Spark + Tachyon
• Future Works
4
History of Tachyon
• Started at UC Berkeley AMPLab
– From summer 2012
• Open sourced
– April 2013 (two and half years ago)
– Apache License 2.0
– Latest Release: Version 0.8.2 (November 2015)
5
One of the Fastest Growing
Big Data Open Source Project
> 170 Contributors (v0.8)
3x increment over the last year
http://tachyon-project.org/community/
6
> 50 Organizations
One of the Fastest Growing
Big Data Open Source Project
7
An Open Source
Memory-Centric
Distributed Storage System
What is
8
Tachyon Stack
9
Memory-speed data sharing
across jobs and frameworks
Spark Job
Spark mem
Hadoop MR
Job
YARN
HDFS / Amazon S3
HDFS
disk
block 1
block 3
block 2
block 4
Tachyon
in-memoryData
Why Use Tachyon?
Data survive in
memory after
computation
crashes
crash
Off-heap
storage, no GC
10
Enable Faster Innovation in
Storage Layer
11
What if
data size exceeds
memory capacity?
12
Tiered Storage: Tachyon Manages More
Than DRAM
MEM
SSD
HDD
Faster
Higher
Capacity
13
Configurable Storage Tiers
MEM only
MEM + HDD
SSD only
14
Pluggable Data Management Policy
Evict stale data to
lower tier
Promote hot data to
upper tier
15
Pin Data in Memory
16
Transparent Naming
17
Unified Namespace Across Under
Storage Systems
18
More Features
• Remote Write Support
• Easy deployment with Mesos and YARN
• Initial Security Support
• One Command Cluster Deployment
• Metrics Reporting for Clients, Workers, and
Master
19
Rich Choice of Under Storage Supports
20
How Easy to Use Tachyon in
scala> val file = sc.textFile(“hdfs://foo”)
scala> val file = sc.textFile(“tachyon://foo”)
21
Use Case: a SAAS Company
• Framework: Impala
• Under Storage: S3
• Storage Media: MEM + SSD
• 15x Performance Improvement
22
Use Case: a Biotechnology Company
• Framework: Spark & MapReduce
• Under Storage: GlusterFS
• Storage Media: MEM and SSD
23
When Tachyon Meets Baidu
~ 100 nodes in deployment, > 1 PB storage space
30X Acceleration of our Big Data Analytics Workload
24
Agenda
• Tachyon Basics & New Features
• Motivation
• Building an interactive data service
–Spark + Tachyon
• Future Works
25
Background
Logs
Data Warehouse
Online Data Services26
Hours or Days
Hours
Frustrated data explorers
• Example:
– John is a PM and he needs to keep track of the top user
actions for a new feature
– Based on the top actions of the day, he will perform
additional analysis
– But John is very frustrated that each query takes tens of
minutes to finish
27
A dedicated service for data exploring
• Manages PBs of data
• Most queries within one minute
28
User Scenario
Web Site Client
Structured
Logs
Data
Warehouse
Data
Marts
Service Gate
Query Engine
select some_action, count(*)
from event_table
where event_day=‘20151123’
group by user_action
Have first try
Try another way
select some_action, event_hour, count(*)
from event_table
where event_day=‘20151123’
group by user_action, event_hour
Not as expected!
See what happens in original log
select *
from event_log
where event_day=‘20151123’ and event_hour=’01’
limit 10
29
Agenda
• Tachyon Basics & New Features
• Motivation
• Building an interactive data service
–Spark + Tachyon
• Future Works
30
Choose Spark as compute solution
Data Warehouse
BFS
Service Gate
Service Gate
Hive
Map Reduce
4X Improvement but
not good enough!
Compute Center
Data Center
31
Baidu File System (BFS)
Data Center
Choose Tachyon as storage solution
Spark Task
Spark mem
Spark Task
Spark mem
HDFS
disk
block 1
block 3
block 2
block 4
Tachyon
in-memory
block 1
block 3 block 4
Compute Center
• Read from remote data
center: ~ 100 ~ 150 seconds
• Read from Tachyon cluster
local node: 10 ~ 15 sec
• Read from Tachyon machine
local node: ~ 5 sec
Tachyon Brings 30X Speed-up !
32
Overall Performance
0
200
400
600
800
1000
1200
MR (sec) Spark (sec) Spark +
Tachyon (sec)
Setup:
1. Use MR to query 6 TB of data
2. Use Spark to query 6 TB of
data
3. Use Spark + Tachyon to query
6 TB of data
Results:
1. Spark + Tachyon achieves 50-
fold speedup compared to
MR
33
Architecture
SparkContext
Operation
Manager
View
Manager
Run Query
Build Cache
Data Warehouse
Cache
Scheduler
Ask & Profile
34
Catalyst helps to be ‘transparent’
lookupRelation
CacheableRelation
Tachyon HDFS
Union HiveTableScan
withUncachedPartitions
HiveTableScan
withCachedPartitions
35
Cache Policy
• Prefetch
– Fetch the views daily in advance when system is idle
– The views fetched are based on the pattern of the past
query history profiling, e.g. 3 months query logs
• On Demand caches
– Fetch the views at runtime when system is serving regular
queries.
– Using machine learning to generate policy file monthly for
views/tables
– When a query is accessing some views, and parts of views
match our pre-generated policy, those views will be cached
at that time.
36
Hot Query: Cache Hit
37
Spark
Tachyon
HDFS
Operation Manager
Query UI
View
Manager Cache
Meta
Cold Query: Cache Miss
38
Spark
Tachyon
HDFS
Operation Manager
Query UI
View
Manager Cache
Meta
Daily Stats with Cache
• Daily
– Table
• Queries: 100 – 300
• Hit Rate: ~40%
– Partition
• Queries: 80K – 120K
• Hit Rate: ~40 – 50%
• Performance with Cache
– avg 2 - 3 time faster than without Cache
39
Agenda
• Tachyon Basics & New Features
• Motivation
• Building an interactive data service
–Spark + Tachyon
• Future Works
40
Improve Caching System
• As Extended Meta Service
– Improve legacy schema/input-format
– Load block meta into cache layer
– Index / Materialized View
• Cost Based Caching/Optimizing
– Better performance, hit rate & execution
– Lower storage needs for cache layer
41
More User Scenario
• If John is data scientist
– Need a way to construct dataset conveniently
– Usually have many tries with same dataset
• An interactive system should help a lot
– Spark is an ideal solution
42
Hardware assisted
big data infrastructure
• Hardware
– GPU
– FPGA
• Applications
– Accelerate common SQL and ML operators
– TableScan && InputFormat && Serde
• Lower down the cost
– 10 dollars for big data
– 1 more dollar for interactive big data
43
Q&A
44

More Related Content

What's hot

Accelerate Cloud Training with Alluxio
Accelerate Cloud Training with AlluxioAccelerate Cloud Training with Alluxio
Accelerate Cloud Training with AlluxioAlluxio, Inc.
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataprocAlluxio, Inc.
 
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioSecurely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioAlluxio, Inc.
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
Speeding Up Spark Performance using Alluxio at China Unicom
Speeding Up Spark Performance using Alluxio at China UnicomSpeeding Up Spark Performance using Alluxio at China Unicom
Speeding Up Spark Performance using Alluxio at China UnicomAlluxio, Inc.
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.
 
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Alluxio, Inc.
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...Data Con LA
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkAlluxio, Inc.
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioAlluxio, Inc.
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit
 
The Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand ClustersThe Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand ClustersAlluxio, Inc.
 
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Alluxio, Inc.
 
Alluxio-FUSE as a data access layer for Dask
Alluxio-FUSE as a data access layer for DaskAlluxio-FUSE as a data access layer for Dask
Alluxio-FUSE as a data access layer for DaskAlluxio, Inc.
 
Tachyon meetup slides.
Tachyon meetup slides.Tachyon meetup slides.
Tachyon meetup slides.David Groozman
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheAlluxio, Inc.
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Alluxio, Inc.
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAlluxio, Inc.
 

What's hot (19)

Accelerate Cloud Training with Alluxio
Accelerate Cloud Training with AlluxioAccelerate Cloud Training with Alluxio
Accelerate Cloud Training with Alluxio
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
 
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioSecurely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Speeding Up Spark Performance using Alluxio at China Unicom
Speeding Up Spark Performance using Alluxio at China UnicomSpeeding Up Spark Performance using Alluxio at China Unicom
Speeding Up Spark Performance using Alluxio at China Unicom
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
 
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
Best Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+AlluxioBest Practice in Accelerating Data Applications with Spark+Alluxio
Best Practice in Accelerating Data Applications with Spark+Alluxio
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Spark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri SimsaSpark Summit EU talk by Jiri Simsa
Spark Summit EU talk by Jiri Simsa
 
The Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand ClustersThe Missing Piece of On-Demand Clusters
The Missing Piece of On-Demand Clusters
 
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
 
Alluxio-FUSE as a data access layer for Dask
Alluxio-FUSE as a data access layer for DaskAlluxio-FUSE as a data access layer for Dask
Alluxio-FUSE as a data access layer for Dask
 
Tachyon meetup slides.
Tachyon meetup slides.Tachyon meetup slides.
Tachyon meetup slides.
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 

Viewers also liked

Alluxio (formerly Tachyon): The Journey thus far and the Road Ahead
Alluxio (formerly Tachyon): The Journey thus far and the Road AheadAlluxio (formerly Tachyon): The Journey thus far and the Road Ahead
Alluxio (formerly Tachyon): The Journey thus far and the Road AheadAlluxio, Inc.
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Jiří Šimša
 
Accessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified NamespaceAccessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified NamespaceAlluxio, Inc.
 
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio, Inc.
 
Baidu
BaiduBaidu
BaiduNanor
 
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed StorageAlluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed StorageAlluxio, Inc.
 
OVERVIEW OF FACEBOOK SCALABLE ARCHITECTURE.
OVERVIEW  OF FACEBOOK SCALABLE ARCHITECTURE.OVERVIEW  OF FACEBOOK SCALABLE ARCHITECTURE.
OVERVIEW OF FACEBOOK SCALABLE ARCHITECTURE.Rishikese MR
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio, Inc.
 
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016Alluxio, Inc.
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio, Inc.
 
Open Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed StorageOpen Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed StorageAlluxio, Inc.
 
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016Alluxio, Inc.
 

Viewers also liked (12)

Alluxio (formerly Tachyon): The Journey thus far and the Road Ahead
Alluxio (formerly Tachyon): The Journey thus far and the Road AheadAlluxio (formerly Tachyon): The Journey thus far and the Road Ahead
Alluxio (formerly Tachyon): The Journey thus far and the Road Ahead
 
Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016
 
Accessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified NamespaceAccessing Data Anywhere with Unified Namespace
Accessing Data Anywhere with Unified Namespace
 
Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016Alluxio Presentation at AMPLab Summer Retreat 2016
Alluxio Presentation at AMPLab Summer Retreat 2016
 
Baidu
BaiduBaidu
Baidu
 
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed StorageAlluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
Alluxio (formerly Tachyon): Open Source Memory Speed Virtual Distributed Storage
 
OVERVIEW OF FACEBOOK SCALABLE ARCHITECTURE.
OVERVIEW  OF FACEBOOK SCALABLE ARCHITECTURE.OVERVIEW  OF FACEBOOK SCALABLE ARCHITECTURE.
OVERVIEW OF FACEBOOK SCALABLE ARCHITECTURE.
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016
 
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
Alluxio: The missing piece of on-demand clusters at Alluxio Meetup 2016
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
 
Open Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed StorageOpen Source Memory Speed Virtual Distributed Storage
Open Source Memory Speed Virtual Distributed Storage
 
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
 

Similar to Interactive Data Analytics with Spark on Tachyon in Baidu

Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
 
Tips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineTips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineAndries den Haan
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Nishant Gandhi
 
Fast Big Data Analytics with Spark on Tachyon
Fast Big Data Analytics with Spark on TachyonFast Big Data Analytics with Spark on Tachyon
Fast Big Data Analytics with Spark on TachyonAlluxio, Inc.
 
[161] 데이터사이언스팀 빌딩
[161] 데이터사이언스팀 빌딩[161] 데이터사이언스팀 빌딩
[161] 데이터사이언스팀 빌딩NAVER D2
 
Tachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBMTachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBMShaoshan Liu
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...NCCOMMS
 
Tips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineTips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineAndries den Haan
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
 
Gluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with HadoopGluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with Hadoopgluent.
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Codemotion
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesDATAVERSITY
 
SharePoint 2016 Beta 2 What's new (End users and IT Pros) Microsoft Innovat...
SharePoint 2016   Beta 2 What's new (End users and IT Pros) Microsoft Innovat...SharePoint 2016   Beta 2 What's new (End users and IT Pros) Microsoft Innovat...
SharePoint 2016 Beta 2 What's new (End users and IT Pros) Microsoft Innovat...serge luca
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeDATAVERSITY
 
Pacemaker hadoop infrastructure and soft serve experience
Pacemaker   hadoop infrastructure and soft serve experiencePacemaker   hadoop infrastructure and soft serve experience
Pacemaker hadoop infrastructure and soft serve experienceVitaliy Bashun
 
Finance month closing with HANA
Finance month closing with HANAFinance month closing with HANA
Finance month closing with HANADouglas Bernardini
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructureSimon Belak
 

Similar to Interactive Data Analytics with Spark on Tachyon in Baidu (20)

Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
Tips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineTips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint Online
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
 
Fast Big Data Analytics with Spark on Tachyon
Fast Big Data Analytics with Spark on TachyonFast Big Data Analytics with Spark on Tachyon
Fast Big Data Analytics with Spark on Tachyon
 
[161] 데이터사이언스팀 빌딩
[161] 데이터사이언스팀 빌딩[161] 데이터사이언스팀 빌딩
[161] 데이터사이언스팀 빌딩
 
Tachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBMTachyon_meetup_5-28-2015-IBM
Tachyon_meetup_5-28-2015-IBM
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...
 
Tips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineTips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint Online
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
Gluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with HadoopGluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with Hadoop
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
 
SharePoint 2016 Beta 2 What's new (End users and IT Pros) Microsoft Innovat...
SharePoint 2016   Beta 2 What's new (End users and IT Pros) Microsoft Innovat...SharePoint 2016   Beta 2 What's new (End users and IT Pros) Microsoft Innovat...
SharePoint 2016 Beta 2 What's new (End users and IT Pros) Microsoft Innovat...
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
 
Pacemaker hadoop infrastructure and soft serve experience
Pacemaker   hadoop infrastructure and soft serve experiencePacemaker   hadoop infrastructure and soft serve experience
Pacemaker hadoop infrastructure and soft serve experience
 
Finance month closing with HANA
Finance month closing with HANAFinance month closing with HANA
Finance month closing with HANA
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructure
 

Recently uploaded

Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneUiPathCommunity
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 

Recently uploaded (20)

Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyone
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.
 

Interactive Data Analytics with Spark on Tachyon in Baidu

  • 1. Interactive Data Analytics with Spark on Tachyon in Baidu Bin Fan (Tachyon Nexus) binfan@tachyonnexus.com Xiang Wen (Baidu) wenxiang@baidu.com Dec 02 2015 @ Strata + Hadoop World, Singapore 1
  • 2. Who Are We? • Bin Fan • Tachyon Project Contributor • Software Engineer at Tachyon Nexus • Xiang Wen • From Baidu Big Data Department • Senior Software Engineer at Baidu 2
  • 3. • Team consists of Tachyon creators, top contributors • Series A ($7.5 million) from Andreessen Horowitz • Committed to Tachyon Open Source Project • www.tachyonnexus.com 3
  • 4. Agenda • Tachyon Basics & New Features • Motivation • Building an interactive data service –Spark + Tachyon • Future Works 4
  • 5. History of Tachyon • Started at UC Berkeley AMPLab – From summer 2012 • Open sourced – April 2013 (two and half years ago) – Apache License 2.0 – Latest Release: Version 0.8.2 (November 2015) 5
  • 6. One of the Fastest Growing Big Data Open Source Project > 170 Contributors (v0.8) 3x increment over the last year http://tachyon-project.org/community/ 6
  • 7. > 50 Organizations One of the Fastest Growing Big Data Open Source Project 7
  • 8. An Open Source Memory-Centric Distributed Storage System What is 8
  • 10. Memory-speed data sharing across jobs and frameworks Spark Job Spark mem Hadoop MR Job YARN HDFS / Amazon S3 HDFS disk block 1 block 3 block 2 block 4 Tachyon in-memoryData Why Use Tachyon? Data survive in memory after computation crashes crash Off-heap storage, no GC 10
  • 11. Enable Faster Innovation in Storage Layer 11
  • 12. What if data size exceeds memory capacity? 12
  • 13. Tiered Storage: Tachyon Manages More Than DRAM MEM SSD HDD Faster Higher Capacity 13
  • 14. Configurable Storage Tiers MEM only MEM + HDD SSD only 14
  • 15. Pluggable Data Management Policy Evict stale data to lower tier Promote hot data to upper tier 15
  • 16. Pin Data in Memory 16
  • 18. Unified Namespace Across Under Storage Systems 18
  • 19. More Features • Remote Write Support • Easy deployment with Mesos and YARN • Initial Security Support • One Command Cluster Deployment • Metrics Reporting for Clients, Workers, and Master 19
  • 20. Rich Choice of Under Storage Supports 20
  • 21. How Easy to Use Tachyon in scala> val file = sc.textFile(“hdfs://foo”) scala> val file = sc.textFile(“tachyon://foo”) 21
  • 22. Use Case: a SAAS Company • Framework: Impala • Under Storage: S3 • Storage Media: MEM + SSD • 15x Performance Improvement 22
  • 23. Use Case: a Biotechnology Company • Framework: Spark & MapReduce • Under Storage: GlusterFS • Storage Media: MEM and SSD 23
  • 24. When Tachyon Meets Baidu ~ 100 nodes in deployment, > 1 PB storage space 30X Acceleration of our Big Data Analytics Workload 24
  • 25. Agenda • Tachyon Basics & New Features • Motivation • Building an interactive data service –Spark + Tachyon • Future Works 25
  • 26. Background Logs Data Warehouse Online Data Services26 Hours or Days Hours
  • 27. Frustrated data explorers • Example: – John is a PM and he needs to keep track of the top user actions for a new feature – Based on the top actions of the day, he will perform additional analysis – But John is very frustrated that each query takes tens of minutes to finish 27
  • 28. A dedicated service for data exploring • Manages PBs of data • Most queries within one minute 28
  • 29. User Scenario Web Site Client Structured Logs Data Warehouse Data Marts Service Gate Query Engine select some_action, count(*) from event_table where event_day=‘20151123’ group by user_action Have first try Try another way select some_action, event_hour, count(*) from event_table where event_day=‘20151123’ group by user_action, event_hour Not as expected! See what happens in original log select * from event_log where event_day=‘20151123’ and event_hour=’01’ limit 10 29
  • 30. Agenda • Tachyon Basics & New Features • Motivation • Building an interactive data service –Spark + Tachyon • Future Works 30
  • 31. Choose Spark as compute solution Data Warehouse BFS Service Gate Service Gate Hive Map Reduce 4X Improvement but not good enough! Compute Center Data Center 31
  • 32. Baidu File System (BFS) Data Center Choose Tachyon as storage solution Spark Task Spark mem Spark Task Spark mem HDFS disk block 1 block 3 block 2 block 4 Tachyon in-memory block 1 block 3 block 4 Compute Center • Read from remote data center: ~ 100 ~ 150 seconds • Read from Tachyon cluster local node: 10 ~ 15 sec • Read from Tachyon machine local node: ~ 5 sec Tachyon Brings 30X Speed-up ! 32
  • 33. Overall Performance 0 200 400 600 800 1000 1200 MR (sec) Spark (sec) Spark + Tachyon (sec) Setup: 1. Use MR to query 6 TB of data 2. Use Spark to query 6 TB of data 3. Use Spark + Tachyon to query 6 TB of data Results: 1. Spark + Tachyon achieves 50- fold speedup compared to MR 33
  • 35. Catalyst helps to be ‘transparent’ lookupRelation CacheableRelation Tachyon HDFS Union HiveTableScan withUncachedPartitions HiveTableScan withCachedPartitions 35
  • 36. Cache Policy • Prefetch – Fetch the views daily in advance when system is idle – The views fetched are based on the pattern of the past query history profiling, e.g. 3 months query logs • On Demand caches – Fetch the views at runtime when system is serving regular queries. – Using machine learning to generate policy file monthly for views/tables – When a query is accessing some views, and parts of views match our pre-generated policy, those views will be cached at that time. 36
  • 37. Hot Query: Cache Hit 37 Spark Tachyon HDFS Operation Manager Query UI View Manager Cache Meta
  • 38. Cold Query: Cache Miss 38 Spark Tachyon HDFS Operation Manager Query UI View Manager Cache Meta
  • 39. Daily Stats with Cache • Daily – Table • Queries: 100 – 300 • Hit Rate: ~40% – Partition • Queries: 80K – 120K • Hit Rate: ~40 – 50% • Performance with Cache – avg 2 - 3 time faster than without Cache 39
  • 40. Agenda • Tachyon Basics & New Features • Motivation • Building an interactive data service –Spark + Tachyon • Future Works 40
  • 41. Improve Caching System • As Extended Meta Service – Improve legacy schema/input-format – Load block meta into cache layer – Index / Materialized View • Cost Based Caching/Optimizing – Better performance, hit rate & execution – Lower storage needs for cache layer 41
  • 42. More User Scenario • If John is data scientist – Need a way to construct dataset conveniently – Usually have many tries with same dataset • An interactive system should help a lot – Spark is an ideal solution 42
  • 43. Hardware assisted big data infrastructure • Hardware – GPU – FPGA • Applications – Accelerate common SQL and ML operators – TableScan && InputFormat && Serde • Lower down the cost – 10 dollars for big data – 1 more dollar for interactive big data 43

Editor's Notes

  1. Logs -> Data Warehouse Transfer from online services to HDFS ETL, By Hive or MapReduce Take hours or even days Data Warehouse -> Data Mart Usually several steps Typically Hive Take hours on each step Data Mart -> Online Data Services End of the long journey Water drop from ocean
  2. http:/// http://calcite.apache.org/
  3. exploit