AI Infra Day
Oct. 25, 2023
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Adit Madan (Director of Product Management, @Alluxio)
In this session, Adit Madan, Director of Product Management at Alluxio, presents an overview of using distributed caching to accelerate model training and serving. He explores the data access patterns and requirements of the ML pipeline and offers practical best practices for using distributed caching in the cloud. The session features insights from real-world deployments, including AliPay and Zhihu.
3. Alluxio Technology Journey
Open source, started from UC Berkeley AMPLab in 2014.
Milestones (2014 → 2019 → 2023):
● Started from UC Berkeley AMPLab (2014)
● 1 billion files supported by Alluxio with the 2.0 release (2019)
● Largest deployment: 1000+ nodes, by Baidu
● 1000+ open source contributors
● 1000+ attendees at the Data Orchestration Summit
● 100% of Presto at Meta fully on-boarded to Alluxio
● From 7/10 to 9/10 of the top Internet companies powered by Alluxio
● AliPay: 80% of model training served by Alluxio
● Zhihu: LLM model training served by Alluxio
Industry drivers:
● EXPLOSION OF DATA: rise of big data & analytics
● CLOUD ADOPTION: single to hybrid cloud, multi-cloud, cross-region
● DEEP LEARNING & AI: large-scale model training and deployment
5. Critical infrastructure barriers to effective AI/ML adoption
● LOW PERFORMANCE: inefficient data I/O
● GPU SCARCITY: ability to leverage GPUs anywhere
● COST MANAGEMENT: specialized storage comes at a premium
6. What's New: Alluxio Enterprise AI
1. High performance I/O over commodity storage
○ New distributed system architecture, called DORA (Decentralized Object Repository Architecture)
2. Accelerating end-to-end ML pipelines (LLM, NLP & Computer Vision)
○ Optimized performance for model training and model serving
7. Decentralized Object Repository Architecture (DORA)
● No single point of failure: a new architecture that scales out horizontally without any central management
● Automatic fallback to data lake storage, masking failures due to capacity or other reasons
● Performance
○ Revamped single-node storage with 50 million objects per node
○ Workload-specific optimizations for ML training & analytics
Design goals: extremely stable, low maintenance overhead, scalability for ML
[Diagram: Alluxio Platform, revolutionary new architecture; Alluxio Client with an affinity location policy and a decentralized consistent hash]
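The decentralized consistent hash mentioned above is what lets every client pick a worker for a given path without consulting a central master. A minimal sketch of that idea follows; the worker names, virtual-node count, and ring layout are illustrative assumptions, not Alluxio's actual implementation.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so every client maps the same path to the same ring position.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, workers, vnodes=100):
        # Each worker gets `vnodes` points on the ring so load spreads evenly.
        self.ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def worker_for(self, path: str) -> str:
        # Walk clockwise to the first virtual node at or after the path's hash.
        idx = bisect.bisect(self.keys, _hash(path)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
owner = ring.worker_for("s3://bucket/train/part-0001.parquet")
print(owner in {"worker-1", "worker-2", "worker-3"})  # True
```

Because the mapping is deterministic, any client computes the same owner for a path, which is what removes the need for central management.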
8. Alluxio Enterprise AI
What's New on the Alluxio Platform for AI: Model Training
Scale to 10 billion+ objects to handle the demands of AI
POSIX & REST API for Python
● 2-8x performance improvements over commodity S3
● 1.5-2x over specialized storage systems with POSIX API
● Up to 95% API cost savings compared to direct access
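The POSIX API means training code reads cached data with ordinary file I/O instead of per-object S3 requests. A hedged sketch of that access pattern, where a temporary directory stands in for the FUSE mount point (the mount path and file names are assumptions):

```python
import os
import tempfile

def load_samples(mount_dir):
    """Read every file under the mount using plain POSIX calls."""
    samples = []
    for name in sorted(os.listdir(mount_dir)):
        with open(os.path.join(mount_dir, name), "rb") as f:
            samples.append(f.read())
    return samples

# Demo: a temp directory stands in for the real FUSE mount.
with tempfile.TemporaryDirectory() as mount:
    for i in range(3):
        with open(os.path.join(mount, f"part-{i}.bin"), "wb") as f:
            f.write(bytes([i]) * 4)
    data = load_samples(mount)

print(len(data))  # 3 files read through plain file I/O
```

Since the reads are ordinary `open`/`read` calls, existing data loaders work unchanged, and repeated reads hit the cache rather than the object store's API.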
9. Alluxio Enterprise AI
What's New on the Alluxio Platform for AI: Model Serving
Extreme concurrency for model serving, from training to inference clusters
Data preloading based on usage pattern
● 2-3x reduced deployment times in production
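"Data preloading based on usage pattern" can be sketched as tracking how often each model artifact is requested and warming the cache with the hottest items first. The tracker class, artifact names, and top-N policy below are illustrative assumptions, not the product's actual mechanism.

```python
from collections import Counter

class UsageTracker:
    """Counts artifact requests to decide what to preload first."""

    def __init__(self):
        self.hits = Counter()

    def record(self, path):
        self.hits[path] += 1

    def preload_plan(self, top_n=2):
        # Most-requested artifacts are warmed into the cache first.
        return [path for path, _ in self.hits.most_common(top_n)]

tracker = UsageTracker()
for path in ["models/llm-7b.bin", "models/resnet50.pt",
             "models/llm-7b.bin", "models/llm-7b.bin",
             "models/resnet50.pt", "models/bert.onnx"]:
    tracker.record(path)

plan = tracker.preload_plan()
print(plan)  # ['models/llm-7b.bin', 'models/resnet50.pt']
```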
10. CASE STUDY: APAC Quora, High Performance AI Platform for LLM
BUSINESS BENEFIT: 2-4x faster time-to-market
TECH BENEFIT: increase GPU utilization from 50% to 93%
[Diagram: ML pipeline spanning training clouds, an offline cloud, and an online cloud; training data and models flow through a file system into model training, model deployment, model inference, downstream applications, and model update]
11. Model Training: Increase GPU utilization with existing data lake
[Diagram: on-prem training cluster reading checkpoints and training data through Alluxio from the data lake (object store), which remains the source of truth]
● Increase utilization up to 90%
● Faster model training with more accurate, fresher models
● Save on API costs
● Runs on standard low-cost storage
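The API-cost saving above comes from the read-through pattern: training reads hit the cache first and only a miss reaches the data lake, so each file is fetched once rather than once per epoch. A minimal sketch, with a stand-in function for the data-lake fetch:

```python
class ReadThroughCache:
    """Serves reads from a local store, fetching from the lake only on a miss."""

    def __init__(self, fetch_from_lake):
        self.fetch = fetch_from_lake
        self.store = {}
        self.api_calls = 0

    def read(self, path):
        if path not in self.store:          # cache miss: one API call
            self.store[path] = self.fetch(path)
            self.api_calls += 1
        return self.store[path]             # cache hit: no API call

def fake_lake_fetch(path):
    # Stand-in for an object-store GET; the real source of truth is untouched.
    return f"bytes-of-{path}"

cache = ReadThroughCache(fake_lake_fetch)
dataset = ["train/a.bin", "train/b.bin"]
for epoch in range(3):                      # 3 epochs over the same data
    for path in dataset:
        cache.read(path)

print(cache.api_calls)  # 2 -- one fetch per file, not per epoch
```

With 3 epochs over 2 files, direct access would issue 6 GETs; the cache issues 2.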
12. Alluxio vs Directly Accessing S3
Alluxio: 17 min total training time (3 epochs), 93% GPU utilization (TensorBoard)
S3: 85 min total training time (3 epochs), 17% GPU utilization (TensorBoard)
Alluxio is 5 times faster than S3
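The "5 times faster" figure follows directly from the two run times quoted on the slide:

```python
# Speedup = direct-S3 training time / cached training time, same 3-epoch run.
s3_minutes = 85
alluxio_minutes = 17
speedup = s3_minutes / alluxio_minutes
print(speedup)  # 5.0 -- matching the "5 times faster" claim
```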
13. Model Training: Eliminate the cost and complexity of data copies
[Diagram: on-prem training cluster reading checkpoints and training data through Alluxio from the data lake (object store), the source of truth]
● Automatically load data from the existing data lake
● Faster access to training data
● Increased data engineering productivity
14. Model Training: Spin up GPUs where available
[Diagram: the on-prem data lake (object store, source of truth) feeds both a local training cluster and a remote training cluster; each cluster caches its own checkpoints and training data]
● Deploy GPUs anywhere based on availability and cost
● Eliminate data copies
● Unified access for all training data
● Reduced network and egress costs
16. [Diagram: training data and models flow (steps 1-5) between the data sources, the offline training platform's training cluster, and the online ML platform's inference cluster, through the Decentralized Object Repository Architecture]
● The consumer is the data scientist, focused on building models without having to worry about scaling to multiple servers or platform complexity
● Data sources can be in the same or a different region / cloud as the AI/ML infrastructure
17. AI Reference Architecture
[Diagram: the same reference architecture, with training data and models flowing (steps 1-5) between the offline training platform's training cluster and the online ML platform's inference cluster]
What's New: Alluxio Enterprise AI - under embargo until Wed, Oct 18 at 8:00 am ET
18. Before using Alluxio
More than 80% of total time is spent in the DataLoader, resulting in a low GPU utilization rate (<20%).
GPU Summary:
● Name: Tesla T4
● Memory: 14.62 GB
● Compute Capability: 7.5
● GPU Utilization: 16.96%
● Est. SM Efficiency: 16.91%
● Est. Achieved Occupancy: 68.75%
● Kernel Time using Tensor Cores: 0.0%

Category          | Time Duration (us) | Percentage (%)
Average Step Time | 1,763,649,145      | 100
Kernel            | 299,168,905        | 16.96
Memcpy            | 10,521,722         | 0.6
Memset            | 39,459             | 0
Runtime           | 3,043,169          | 0.17
DataLoader        | 1,446,068,956      | 81.99
CPU Exec          | 1,570,076          | 0.09
Other             | 3,245,858          | 0.18

(ResNet-50, 3 epochs, S3 via FUSE)
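The percentage column in the profiler table is each category's time divided by the average step time, which is what yields the ">80% in DataLoader" headline:

```python
# Sanity check on the profiler breakdown: DataLoader share of step time.
step_time_us = 1_763_649_145
dataloader_us = 1_446_068_956
share = round(100 * dataloader_us / step_time_us, 2)
print(share)  # 81.99 -- matching the table
```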
19. After using Alluxio
DataLoader share of step time drops from 82% to 1%; GPU utilization rate rises from 17% to 93%.
GPU Summary:
● Name: Tesla T4
● Memory: 14.62 GB
● Compute Capability: 7.5
● GPU Utilization: 93.29%
● Est. SM Efficiency: 92.98%
● Est. Achieved Occupancy: 68.03%
● Kernel Time using Tensor Cores: 0.0%

Category          | Time Duration (us) | Percentage (%)
Average Step Time | 334,274,946        | 100
Kernel            | 311,847,023        | 93.29
Memcpy            | 10,500,126         | 3.14
Memset            | 43,946             | 0.01
Runtime           | 3,899,241          | 1.17
DataLoader        | 3,343,301          | 1
CPU Exec          | 1,648,391          | 0.49
Other             | 2,992,918          | 0.9

(ResNet-50, 3 epochs, S3 via FUSE)
20. Model Serving: Faster model deployment times
[Diagram: the on-prem data lake (object store or HDFS, source of truth) feeds the training cluster; checkpoints and training data are cached at regional inference clusters]
● Deploy models to remote inference sites in minutes
● Reduced network bandwidth
● Offload the underlying object store or HDFS
21. Decentralized Object Repository Architecture (DORA)
● No single point of failure: a new architecture that scales out horizontally without any central management
● Automatic fallback to data lake storage, masking failures due to capacity or other reasons
● Performance
○ Revamped single-node storage with 50 million objects per node
○ Workload-specific optimizations for ML training & analytics
Design goals: extremely stable, low maintenance overhead, scalability for ML
[Diagram: Alluxio Platform, revolutionary new architecture; Alluxio Client with an affinity location policy and a decentralized consistent hash]
22. Alluxio System Architecture
[Diagram: a training node runs an AI/analytics application with the Alluxio Client, which uses an affinity block location policy and a client-side consistent hash (task info); the Alluxio cluster consists of workers plus a service registry, backed by under storage]
Read path, as labeled in the diagram:
1. The client gets cluster info from the service registry
2. The client finds the worker(s) for a task via the consistent hash
3. The chosen worker executes the task
4. The worker sends the result back to the application
5. On a cache miss, the worker issues a task against the under storage
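The read path can be sketched end to end: the client hashes the path to pick a worker, asks that worker for the data, and the worker falls back to the under storage on a cache miss. All class and method names below are illustrative assumptions, not Alluxio's actual API.

```python
import bisect
import hashlib

def _hash(key):
    # Stable hash so every client picks the same worker for a path.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Worker:
    """Caches data locally; falls back to under storage on a miss (step 5)."""

    def __init__(self, name, under_storage):
        self.name = name
        self.cache = {}
        self.under_storage = under_storage

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.under_storage[path]  # cache miss
        return self.cache[path]

class Client:
    """Decentralized lookup: no master holds the path-to-worker map."""

    def __init__(self, workers):
        points = sorted(((_hash(w.name), w) for w in workers),
                        key=lambda t: t[0])
        self.keys = [h for h, _ in points]
        self.workers = [w for _, w in points]

    def read(self, path):
        idx = bisect.bisect(self.keys, _hash(path)) % len(self.workers)  # step 2
        return self.workers[idx].read(path)                              # step 3

under = {"s3://bucket/model.bin": b"weights"}   # stand-in under storage
workers = [Worker(f"worker-{i}", under) for i in range(3)]
client = Client(workers)

first = client.read("s3://bucket/model.bin")    # miss: fetched from under storage
second = client.read("s3://bucket/model.bin")   # hit: served from the worker cache
print(first == second == b"weights")  # True
```

Only one worker ever caches a given path, so repeated reads stay on that worker and the under storage sees a single fetch.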