Accelerate Model Training with
Alluxio Enterprise AI
Adit Madan
adit@alluxio.com
Alluxio Data Platform
High Performance data access, unified global view
Alluxio Technology Journey
Open source, started at UC Berkeley AMPLab in 2014

Timeline: 2014, 2019, 2023
● EXPLOSION OF DATA: rise of big data & analytics
● CLOUD ADOPTION: single to hybrid cloud, multi-cloud, cross region
● DEEP LEARNING & AI: large-scale model training and deployment

Milestones:
● Started from UC Berkeley AMPLab
● 1000+ nodes: largest deployment, by Baidu
● 1 billion files supported by Alluxio with the 2.0 release
● 7/10 top Internet companies powered by Alluxio
● 1000+ open source contributors
● 1000+ attendees at Data Orchestration Summit
● AliPay: 80% of model training; Zhihu: LLM model training, served by Alluxio
● 100% of Presto @ Meta fully on-boarded to Alluxio
● 9/10 top Internet companies powered by Alluxio
Critical infrastructure barriers to effective AI/ML adoption
● LOW PERFORMANCE: Inefficient data I/O
● GPU SCARCITY: Ability to leverage GPUs anywhere
● COST MANAGEMENT: Specialized storage comes at a premium
Whatʼs New: Alluxio Enterprise AI
1. High performance I/O over commodity storage
○ New Distributed System Architecture, called DORA (Decentralized
Object Repository Architecture)
2. Accelerating end-to-end ML pipelines (LLM, NLP & Computer Vision)
○ Optimized Performance for Model Training and Model Serving
Decentralized Object Repository Architecture (DORA)
● No single point of failure with a new architecture that scales out
horizontally without any central management
● Automatic fallback to data lake storage to mask any failures due to
capacity or other reasons
● Performance
○ Revamped single-node storage with 50 million objects per node
○ Workload-specific optimizations for ML training & analytics
Design Goals:
Extremely Stable, Low Maintenance Overhead, Scalability for ML
Alluxio Platform: Revolutionary New Architecture
Alluxio Client: Affinity Location Policy, Consistent Hash (Decentralized)
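The "Consistent Hash (Decentralized)" label suggests that each client can work out which worker owns a path without asking a central service. The following is a minimal sketch of that general technique, not Alluxio's implementation: the class name, worker names, paths, and the choice of MD5 with 100 virtual nodes are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (MD5 only spreads keys here)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, for illustration)."""

    def __init__(self, workers, vnodes=100):
        # Each worker contributes `vnodes` points on the ring.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def worker_for(self, path: str) -> str:
        """Return the worker owning the first ring point at or after hash(path)."""
        idx = bisect.bisect_left(self._points, _hash(path)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["worker-a", "worker-b", "worker-c"])
paths = [f"s3://bucket/train/part-{i:05d}" for i in range(1000)]
before = {p: ring.worker_for(p) for p in paths}

# Dropping one worker only remaps the paths that worker owned.
smaller = ConsistentHashRing(["worker-a", "worker-b"])
moved = [p for p in paths if before[p] != smaller.worker_for(p)]
```

Because every client computes the same ring from the same worker list, no lookup against a central metadata service is needed, and a membership change moves only the departed worker's share of paths.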
Alluxio Enterprise AI
Whatʼs New on the Alluxio Platform for AI
1. Model Training
Scale to 10 billion+ objects to handle the demands of AI
POSIX & REST API for Python
● 2-8x performance improvements over commodity S3
● 1.5-2x over specialized storage systems with POSIX API
● Up to 95% API cost savings compared to direct access
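A POSIX API means training code can read cached data with ordinary file calls. The sketch below illustrates that idea using only the Python standard library; the helper `iter_samples` and the mount path mentioned in the comment are hypothetical, and a temporary directory stands in for a real mount.

```python
import tempfile
from pathlib import Path


def iter_samples(root):
    """Yield (relative_path, bytes) for every regular file under root, sorted."""
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        yield path.relative_to(root).as_posix(), path.read_bytes()


# Stand-in for a mounted dataset directory (e.g. a hypothetical /mnt/alluxio/train).
with tempfile.TemporaryDirectory() as mount:
    (Path(mount) / "class_a").mkdir()
    (Path(mount) / "class_a" / "0001.bin").write_bytes(b"\x00\x01")
    (Path(mount) / "class_a" / "0002.bin").write_bytes(b"\x02")
    samples = list(iter_samples(mount))
```

Nothing in the loader refers to the storage system, so the same code would run unchanged against a local disk or any POSIX mount.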
Alluxio Enterprise AI
Whatʼs New on the Alluxio Platform for AI
2. Model Serving
Extreme concurrency for model serving, from training to inference clusters
Data preloading based on usage pattern
● 2-3x reduced deployment times in production
APAC Quora CASE STUDY:
High Performance AI Platform for LLM
BUSINESS BENEFIT: 2 - 4X faster time-to-market
TECH BENEFIT: Increase GPU utilization (50% to 93%)
[Diagram labels: File System; Training Data; Models; Model Training; Model Deployment; Model Inference; Downstream Applications; Model Update; Training Clouds, Offline Cloud, Online Cloud]
Model Training:
Increase GPU utilization with Existing Data Lake
[Diagram labels: On Prem; Checkpoints; Training Data; Training Cluster; Data Lake (Source of Truth); Object Store]
● Increase utilization up to 90%
● Faster model training with more accurate, fresher models
● Save on API costs
● Runs on standard low-cost storage
Alluxio vs Directly Accessing S3

                                  Alluxio    S3
Total training time (3 epochs)    17 min     85 min
GPU utilization (TensorBoard)     93%        17%

Alluxio is 5 times faster than S3
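The "5 times faster" figure follows directly from the two training times on this slide. A short check of the arithmetic, with variable names chosen for illustration:

```python
# Total training time for 3 epochs, taken from the slide.
alluxio_minutes = 17
s3_minutes = 85

speedup = s3_minutes / alluxio_minutes          # how many times faster
time_saved = 1 - alluxio_minutes / s3_minutes   # fraction of wall-clock time saved

print(f"speedup: {speedup:.1f}x, time saved: {time_saved:.0%}")
```

The same ratio means the run with Alluxio takes one fifth of the wall-clock time, a saving of 80%.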
Model Training:
Eliminate the cost and complexity of data copies
[Diagram labels: On Prem; Checkpoints; Training Data; Training Cluster; Data Lake (Source of Truth); Object Store]
● Automatically load data from existing data lake
● Faster access to training data
● Increased data engineering productivity
Model Training:
Spin up GPUs where available
[Diagram labels: On Prem; Checkpoints; Training Data; Training Cluster; Data Lake (Source of Truth); Object Store; REMOTE TRAINING CLUSTER]
● Deploy GPUs anywhere based on availability and cost
● Eliminate data copies
● Unified access for all training data
● Reduced network and egress costs
Appendix
AI Reference Architecture
[Diagram labels: Training Cluster (Offline Training Platform); Inference Cluster (Online ML Platform); Training Data; Models; Decentralized Object Repository Architecture; numbered steps 1-5]
Consumer is the Data Scientist, with a focus on building models without having to worry about scaling to multiple servers or platform complexity.
Data sources are in the same or a different region / cloud as the AI/ML infrastructure.
What’s New: Alluxio Enterprise AI - under embargo until Wed, Oct 18 at 8:00 am ET
Before using Alluxio
> 80% of total time is spent in DataLoader
Results in a low GPU utilization rate (<20%)
Workload: ResNet-50, 3 epochs, S3 FUSE

GPU Summary
Name                              Tesla T4
Memory                            14.62 GB
Compute Capability                7.5
GPU Utilization                   16.96%
Est. SM Efficiency                16.91%
Est. Achieved Occupancy           68.75%
Kernel Time using Tensor Cores    0.0%

Category             Time Duration (us)    Percentage (%)
Average Step Time    1,763,649,145         100
Kernel               299,168,905           16.96
Memcpy               10,521,722            0.6
Memset               39,459                0
Runtime              3,043,169             0.17
DataLoader           1,446,068,956         81.99
CPU Exec             1,570,076             0.09
Other                3,245,858             0.18
After using Alluxio
Reduce DataLoader share of time from 82% to 1%
Increase GPU utilization rate from 17% to 93%
Workload: ResNet-50, 3 epochs, S3 FUSE

GPU Summary
Name                              Tesla T4
Memory                            14.62 GB
Compute Capability                7.5
GPU Utilization                   93.29%
Est. SM Efficiency                92.98%
Est. Achieved Occupancy           68.03%
Kernel Time using Tensor Cores    0.0%

Category             Time Duration (us)    Percentage (%)
Average Step Time    334,274,946           100
Kernel               311,847,023           93.29
Memcpy               10,500,126            3.14
Memset               43,946                0.01
Runtime              3,899,241             1.17
DataLoader           3,343,301             1
CPU Exec             1,648,391             0.49
Other                2,992,918             0.9
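The percentage columns in the two profiler tables can be recomputed from the raw durations, which also gives the end-to-end ratio between the two runs. The sketch below uses only numbers shown on the slides; the dictionary layout and the `share` helper are illustrative.

```python
# Durations in microseconds, copied from the before/after profiler tables.
before_us = {
    "Average Step Time": 1_763_649_145,
    "Kernel": 299_168_905,
    "DataLoader": 1_446_068_956,
}
after_us = {
    "Average Step Time": 334_274_946,
    "Kernel": 311_847_023,
    "DataLoader": 3_343_301,
}


def share(durations, category):
    """Percentage of total step time spent in one category, to 2 decimals."""
    return round(100 * durations[category] / durations["Average Step Time"], 2)


for label, run in (("before", before_us), ("after", after_us)):
    print(label, "Kernel %:", share(run, "Kernel"),
          "DataLoader %:", share(run, "DataLoader"))

# Ratio of total profiled time between the two runs.
step_time_ratio = before_us["Average Step Time"] / after_us["Average Step Time"]
print(f"step time ratio: {step_time_ratio:.2f}x")
```

Kernel time is nearly identical in both runs, so almost all of the improvement comes from time no longer spent waiting in the DataLoader.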
Model Serving:
Faster model deployment times
[Diagram labels: On Premise; Checkpoints; Training Data; Training Cluster; Data Lake (Source of Truth); Object Store or HDFS; REGIONAL INFERENCE CLUSTERS]
● Deploy models to remote inference sites in minutes
● Reduced network bandwidth
● Offload underlying object store or HDFS
Alluxio System Architecture
[Diagram labels: Training Node (AI/Analytics Applications, Alluxio Client with Affinity Block Location Policy and Client Consistent Hash); Service Registry; Alluxio Cluster (Alluxio Workers); Under Storage]
Flow labels (steps 1-5 in the diagram): Get Task Info; Get Cluster Info; Find Worker(s); Execute Task; Send Result
On cache miss: Under storage task
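The "Cache miss" to "Under Storage" edge in this diagram, together with the automatic fallback described for DORA, can be illustrated with a read-through cache. This is a toy sketch under stated assumptions, not Alluxio code: `Worker`, `client_read`, the `healthy` flag, and the dictionary standing in for under storage are all hypothetical.

```python
class Worker:
    """Toy cache worker: serves from its cache, fetching from under storage on a miss."""

    def __init__(self, under_storage):
        self.under_storage = under_storage
        self.cache = {}
        self.misses = 0
        self.healthy = True

    def read(self, path):
        if not self.healthy:
            raise ConnectionError("worker unavailable")
        if path not in self.cache:
            self.misses += 1  # cache miss: one read against under storage
            self.cache[path] = self.under_storage[path]
        return self.cache[path]


def client_read(worker, under_storage, path):
    """Read through the worker; fall back to under storage if the worker fails."""
    try:
        return worker.read(path)
    except ConnectionError:
        return under_storage[path]


under_storage = {"train/part-0": b"abc"}
worker = Worker(under_storage)
first = client_read(worker, under_storage, "train/part-0")   # miss, then cached
second = client_read(worker, under_storage, "train/part-0")  # served from cache
worker.healthy = False
third = client_read(worker, under_storage, "train/part-0")   # fallback path
```

Only the first read touches under storage while the worker is healthy, and a worker failure degrades to a direct read rather than an error.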

Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 

Recently uploaded (20)

Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 

AI Infra Day | Accelerate Your Model Training and Serving with Distributed Caching

  • 1. Accelerate Model Training with Alluxio Enterprise AI Adit Madan adit@alluxio.com
  • 2. Alluxio Data Platform: high-performance data access, unified global view
  • 3. Alluxio Technology Journey: open source, started from UC Berkeley AMPLab in 2014. Timeline: 2014, 2019, 2023. Milestones: 1000+ nodes in the largest deployment, by Baidu; 1 billion files supported by Alluxio with the 2.0 release; 7/10 top Internet Co powered by Alluxio; 9/10 top Internet Co powered by Alluxio; AliPay 80% Model Training; Zhihu LLM model training served by Alluxio; 100% Presto @ Meta, fully on-boarded to Alluxio; 1000+ open source contributors; 1000+ attendees at the Data Orchestration Summit. Drivers: explosion of data (rise of big data & analytics); cloud adoption (single to hybrid cloud, multi-cloud, cross region); deep learning & AI (large-scale model training and deployment).
  • 4.
  • 5. Critical infrastructure barriers to effective AI/ML adoption. LOW PERFORMANCE: inefficient data I/O. GPU SCARCITY: ability to leverage GPUs anywhere. COST MANAGEMENT: specialized storage comes at a premium.
  • 6. What’s New: Alluxio Enterprise AI. 1. High performance I/O over commodity storage ○ New distributed system architecture, called DORA (Decentralized Object Repository Architecture). 2. Accelerating end-to-end ML pipelines (LLM, NLP & Computer Vision) ○ Optimized performance for model training and model serving.
  • 7. Decentralized Object Repository Architecture (DORA) ● No single point of failure, with a new architecture that scales out horizontally without any central management ● Automatic fallback to data lake storage, masking any failures due to capacity or other reasons ● Performance ○ Revamped single-node storage with 50 million objects per node ○ Workload-specific optimizations for ML training & analytics. Design goals: extremely stable, low maintenance overhead, scalability for ML. Diagram labels: Alluxio Platform, Revolutionary New Architecture, Alluxio Client, Affinity Location Policy, Consistent Hash (Decentralized).
  • 8. Alluxio Enterprise AI: What’s New on the Alluxio Platform for AI. Model Training: scale to 10 billion+ objects to handle the demands of AI; POSIX & REST API for Python ● 2-8x performance improvements over commodity S3 ● 1.5-2x over specialized storage systems with POSIX API ● Up to 95% API cost savings compared to direct access
  • 9. Alluxio Enterprise AI: What’s New on the Alluxio Platform for AI. Model Serving: extreme concurrency for model serving, from training to inference clusters; data preloading based on usage pattern ● 2-3x reduced deployment times in production
  • 10. APAC Quora CASE STUDY: High Performance AI Platform for LLM. BUSINESS BENEFIT: 2-4X faster time-to-market. TECH BENEFIT: increase GPU utilization (50%, 93%). Diagram labels: File System, Training Data, Models, Model Training, Model Deployment, Model Inference, Downstream Applications, Model Update, Training Clouds, Offline Cloud, Online Cloud.
  • 11. Model Training: Increase GPU utilization with Existing Data Lake. Increase utilization up to 90%; faster model training with more accurate, fresher models; save on API costs; runs on standard low-cost storage. Diagram labels: On Prem, …, Checkpoints, Training Data, Data Lake (Source of Truth), Training Cluster, Object Store.
  • 12. Alluxio vs Directly Accessing S3. Alluxio: 17 min total training time (3 epochs), 93% GPU utilization (TensorBoard). S3: 85 min total training time (3 epochs), 17% GPU utilization (TensorBoard). Alluxio is 5 times faster than S3.
  • 13. Model Training: Eliminate cost/complexity with data copies. Automatically load data from existing data lake; faster access to training data; increased data engineering productivity. Diagram labels: On Prem, …, Checkpoints, Training Data, Data Lake (Source of Truth), Training Cluster, Object Store.
  • 14. Model Training: Spin up GPUs where available. Deploy GPUs anywhere based on availability and cost; eliminate data copies; unified access for all training data; reduced network and egress costs. Diagram labels: On Prem, …, Checkpoints, Training Data, Data Lake (Source of Truth), Training Cluster, Object Store, REMOTE TRAINING CLUSTER.
  • 16. Diagram labels (flow steps numbered 1-5): Training Cluster, Offline Training Platform, Training Data, Models, Inference Cluster, Online ML Platform. The consumer is the data scientist, with a focus on building models without having to worry about scaling to multiple servers and the platform complexity. Data sources are in the same or a different region/cloud as the AI/ML infrastructure. Decentralized Object Repository Architecture.
  • 17. AI Reference Architecture. Diagram labels (flow steps numbered 1-5): Training Cluster, Offline Training Platform, Training Data, Models, Inference Cluster, Online ML Platform. What’s New: Alluxio Enterprise AI - under embargo until Wed, Oct 18 at 8:00 am ET
  • 18. Before using Alluxio: more than 80% of total time is spent in DataLoader, resulting in a low GPU utilization rate (<20%). Workload: Resnet-50, 3 epochs, S3 Fuse. GPU Summary: Name: Tesla T4; Memory: 14.62 GB; Compute Capability: 7.5; GPU Utilization: 16.96%; Est. SM Efficiency: 16.91%; Est. Achieved Occupancy: 68.75%; Kernel Time using Tensor Cores: 0.0%. Step time by category, as time duration in us (percentage): Average Step Time 1,763,649,145 (100%); Kernel 299,168,905 (16.96%); Memcpy 10,521,722 (0.6%); Memset 39,459 (0%); Runtime 3,043,169 (0.17%); DataLoader 1,446,068,956 (81.99%); CPU Exec 1,570,076 (0.09%); Other 3,245,858 (0.18%).
  • 19. After using Alluxio: the DataLoader rate is reduced from 82% to 1% and the GPU utilization rate increases from 17% to 93%. Workload: Resnet-50, 3 epochs, S3 Fuse. GPU Summary: Name: Tesla T4; Memory: 14.62 GB; Compute Capability: 7.5; GPU Utilization: 93.29%; Est. SM Efficiency: 92.98%; Est. Achieved Occupancy: 68.03%; Kernel Time using Tensor Cores: 0.0%. Step time by category, as time duration in us (percentage): Average Step Time 334,274,946 (100%); Kernel 311,847,023 (93.29%); Memcpy 10,500,126 (3.14%); Memset 43,946 (0.01%); Runtime 3,899,241 (1.17%); DataLoader 3,343,301 (1%); CPU Exec 1,648,391 (0.49%); Other 2,992,918 (0.9%).
  • 20. Model Serving: Faster model deployment times. Deploy models to remote inference sites in minutes; reduced network bandwidth; offload underlying object store or HDFS. Diagram labels: On Prem, …, Checkpoints, Training Data, Object Store or HDFS, Data Lake (Source of Truth), Training Cluster, On Premise, REGIONAL INTERFACE CLUSTERS.
  • 21. Decentralized Object Repository Architecture (DORA) ● No single point of failure, with a new architecture that scales out horizontally without any central management ● Automatic fallback to data lake storage, masking any failures due to capacity or other reasons ● Performance ○ Revamped single-node storage with 50 million objects per node ○ Workload-specific optimizations for ML training & analytics. Design goals: extremely stable, low maintenance overhead, scalability for ML. Diagram labels: Alluxio Platform, Revolutionary New Architecture, Alluxio Client, Affinity Location Policy, Consistent Hash (Decentralized).
  • 22. Alluxio System Architecture. Diagram labels (flow steps numbered 1-5): AI/Analytics Applications, Alluxio Client, Affinity Block Location Policy, Client Consistent Hash, Service Registry, Alluxio Worker (x3), Training Node, Alluxio Cluster, Under Storage. Flow labels: Get Cluster Info, Find Worker(s), Get Task Info, Task Info, Execute Task, Send Result, Cache miss, Under storage task.
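
The slide 22 labels (Client Consistent Hash, Find Worker(s), Cache miss, Under storage task) suggest a client that picks a worker by hashing the file path and falls back to the under storage when the worker does not hold the data. The sketch below illustrates that general technique only; it is not Alluxio's implementation or API. Every name in it (`HashRing`, `Client`, `worker_for`, `read`, the virtual-node count, the use of plain dicts as caches and under storage) is a hypothetical stand-in chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (the built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical illustration)."""

    def __init__(self, workers, vnodes=100):
        # Each worker owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def worker_for(self, path: str) -> str:
        # First ring point clockwise from the path's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(path)) % len(self._ring)
        return self._ring[idx][1]


class Client:
    """Toy client: per-worker caches and the under storage are plain dicts."""

    def __init__(self, workers, under_storage):
        self.ring = HashRing(workers)
        self.caches = {w: {} for w in workers}
        self.under_storage = under_storage

    def read(self, path: str):
        worker = self.ring.worker_for(path)    # find the worker, no central lookup
        cache = self.caches[worker]
        if path in cache:
            return cache[path], f"hit:{worker}"
        data = self.under_storage[path]        # cache miss: read from under storage
        cache[path] = data                     # keep a copy on the chosen worker
        return data, f"miss:{worker}"


if __name__ == "__main__":
    client = Client(["worker-1", "worker-2", "worker-3"],
                    {"/data/train/part-0": b"example bytes"})
    print(client.read("/data/train/part-0")[1])  # first read is a miss
    print(client.read("/data/train/part-0")[1])  # second read is a hit
```

Because every client computes the same path-to-worker mapping locally, no central component has to be consulted on the read path, and removing one worker only remaps the paths that worker owned.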