Alluxio Monthly Webinar
Oct. 3, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Greg Palmer (Lead Solutions Engineer)
Model training requires extensive computational and GPU resources. When training models on AWS, loading data from S3 often becomes a major bottleneck, wasting valuable GPU cycles. Optimizing data loading can greatly reduce GPU idle time and increase GPU utilization.
In this webinar, Greg Palmer will discuss best practices for efficient data loading during model training on AWS. He will demonstrate how to use Alluxio on EKS as a distributed cache to accelerate PyTorch training jobs that read datasets from S3. This architecture significantly improves GPU utilization from 30% to 90%+, achieves ~5x faster training, and lowers cloud storage costs.
What you will learn:
- The challenges of feeding data-hungry GPUs in the cloud
- How to accelerate model training by optimizing data loading on AWS
- The reference architecture for running PyTorch jobs with Alluxio cache on EKS while reading data from S3, with benchmark results of training ResNet50 and BERT
- How to use TensorBoard to identify bottlenecks in GPU utilization
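The data-loading bottleneck the webinar addresses can be made concrete with a minimal, framework-free sketch (stdlib Python only, not the webinar's actual code); the data-wait vs. compute split it measures mirrors what TensorBoard's profiler reports for a real PyTorch loop:

```python
import time

def profile_epoch(batches, train_step):
    """Split one epoch's wall time into data-wait vs. compute time,
    to spot an I/O-bound training loop."""
    data_time = compute_time = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)            # waiting on the data loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)               # forward/backward work
        data_time += t1 - t0
        compute_time += time.perf_counter() - t1
    total = data_time + compute_time
    return data_time / total, compute_time / total

# Simulate a slow data source (e.g. reading from S3) and a fast step.
slow_batches = (time.sleep(0.02) or i for i in range(5))
data_frac, _ = profile_epoch(slow_batches, lambda b: time.sleep(0.002))
print(f"time waiting on data: {data_frac:.0%}")
```

In a real job, `batches` would be a PyTorch DataLoader and `train_step` the forward/backward pass; a high data-wait fraction signals an I/O-bound loop where the GPU sits idle.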
2. Adoption of Artificial Intelligence (AI)
● 49% of CIOs are using or plan to use AI[1]
● Recent boom of generative AI is accelerating adoption
● Successful AI projects require access to data
● As AI use cases grow more complex…
○ Understanding data access patterns becomes more important
[1] Gartner, “2023 Gartner CIO survey”
4. How Access to Data Hinders the Success of AI
● High-quality AI models require access to massive datasets
● Data access is slow and costly[3]
● Increasing size of models slows down application performance
● Limited availability of GPUs necessitates remote data transfer[4]
● GPUs left waiting for data sit underutilized
[3] AI and compute, https://openai.com/research/ai-and-compute
[4] Amazon EC2 P4 Instances, https://aws.amazon.com/ec2/instance-types/p4
11. Data Access Solutions Should Support:
● High performance and throughput for ML workloads
● Dataset management, including load/unload/update of data from the data lake
● Cloud-native capabilities, such as multi-tenancy, scalability, and elasticity
● Elimination of data redundancy, to avoid managing multiple copies of data
● Reduced dependency on specialized networking hardware
● Flexibility to place compute anywhere, regardless of the location of the data
● Cloud service provider agnosticism, to avoid vendor lock-in
● Future-proofing to adapt to advancements in storage and computation technologies
● Security, including consistent authentication and authorization
13. Data Access for Model Training on AWS[5]
[5] Amazon AWS Docs: https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html
14. Data Access for Model Training on AWS
S3 File Mode
[Diagram: the dataset is copied from S3 to the instance filesystem at /opt/ml/input/data/training-channel ahead of time; the training script process (train.py) then reads it locally on the training instance.]
15. Data Access for Model Training on AWS
S3 FastFile Mode
[Diagram: a FUSE process mounts the S3 dataset at /opt/ml/input/data/training-channel; the training script process (train.py) reads through the mount, streaming data in real time.]
16. Data Access for Model Training on AWS
S3 Pipe Mode
[Diagram: data is streamed from S3 in real time to the instance filesystem path /opt/ml/input/data/training-channel, where the training script process (train.py) reads it.]
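The three SageMaker input modes above are selected per input channel. A hedged sketch of how one channel might be configured (the dict shape follows the CreateTrainingJob API's InputDataConfig; the bucket URI and channel name are illustrative placeholders):

```python
def training_channel(name, s3_uri, input_mode):
    """Build one InputDataConfig channel. input_mode is one of:
    'File'     - copy the dataset to the instance ahead of time,
    'FastFile' - stream in real time through a FUSE-backed mount,
    'Pipe'     - stream in real time through a named pipe."""
    assert input_mode in ("File", "FastFile", "Pipe")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,  # illustrative URI, not from the webinar
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }

channel = training_channel("training-channel", "s3://my-bucket/dataset/", "FastFile")
```

The same dict would be passed in the `InputDataConfig` list of a `create_training_job` request; switching modes is a one-field change, which is why benchmarking all three against the same dataset is cheap.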
17. Data Access for Model Training on AWS
Amazon FSx for Lustre
[Diagram: an FSx for Lustre filesystem is mounted at /opt/ml/input/data/training-channel; the training script process (train.py) reads through the mount, and FSx for Lustre reads through to S3.]
18. Data Access for Model Training on AWS
Amazon EFS Filesystem
[Diagram: an EFS filesystem is mounted at /opt/ml/input/data/training-channel; the training script process (train.py) reads from the mount.]
20. Model Training
Alluxio on AWS - Reference Architecture
[Diagram: reference architecture (steps 1-5) - Alluxio caches training data from storage for the training cluster and distributes trained models to the inference cluster for model serving.]
21. Alluxio on AWS Provides:
● Automatic load / unload / update of data from your existing data lake
● Faster access to training data, informed by data access patterns
● Optimal data access with high data throughput to keep the GPU fully utilized
● Faster model deployment and high-concurrency model serving to inference nodes
● Increased productivity for the data engineering team by eliminating the need to manage data copies
● Reduced cloud storage API and egress costs, such as S3 GET request and data transfer costs
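One reason data copies disappear is that an Alluxio FUSE mount exposes S3-backed data as an ordinary POSIX path, so training code needs no S3 client at all. A minimal sketch (the mount point /mnt/alluxio is a hypothetical example, not from the webinar):

```python
import os

def list_training_files(dataset_root):
    """Enumerate sample files under a POSIX path. The same code works
    whether dataset_root is local disk or an Alluxio FUSE mount that
    caches data from S3 -- the training script cannot tell the difference."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(dataset_root):
        files.extend(os.path.join(dirpath, name) for name in sorted(filenames))
    return files

# e.g. list_training_files("/mnt/alluxio/datasets/train")  # hypothetical mount
```

Because the path is plain POSIX, an existing PyTorch Dataset that reads local files works unchanged when pointed at the mount.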
22. Model Training
Alluxio on AWS - Reference Architecture
[Diagram: the same reference architecture as slide 20, here shown with GCP storage as well, illustrating a cloud-agnostic deployment.]
24. Alluxio Demo Environment
[Diagram: demo environment - the Alluxio Operator deploys Alluxio on Kubernetes; a local folder/dataset in storage feeds GPU training, driven from an interactive notebook and monitored with a visualization dashboard.]
25. Alluxio Demo …
Alluxio AWS Model Training Demo - Recording
Alluxio FUSE vs AWS S3 FUSE Demo - Recording
Alluxio APIs Demo - Recording
26. Visualization Dashboard Results (w/o Alluxio)
Training Directly from Storage
- > 80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)
27. Visualization Dashboard Results (with Alluxio)
Training with Alluxio
- Reduced the DataLoader share of time from 82% to 1% (82x)
- Increased the GPU utilization rate from 17% to 93% (5x)
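The two reported numbers are consistent with each other: if the GPU compute work per step is fixed and only the data-wait share changes, the expected end-to-end speedup follows from a little arithmetic:

```python
def speedup(data_frac_before, data_frac_after):
    """Ideal end-to-end speedup when the data-wait share of step time
    drops and the compute work stays fixed."""
    compute = 1.0 - data_frac_before               # compute share of the old step
    new_total = compute / (1.0 - data_frac_after)  # new step time, same units
    return 1.0 / new_total

print(round(speedup(0.82, 0.01), 1))  # → 5.5, matching the reported ~5x
# Cross-check: GPU utilization rising from 17% to 93% is also ~5.5x (0.93/0.17).
```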