Alluxio Monthly Webinar
Oct. 3, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Greg Palmer (Lead Solutions Engineer)
Model training requires extensive computational and GPU resources. When training models on AWS, loading data from S3 often becomes a major bottleneck, wasting valuable GPU cycles. Optimizing data loading can greatly reduce GPU idle time and increase GPU utilization.
In this webinar, Greg Palmer will discuss best practices for efficient data loading during model training on AWS. He will demonstrate how to use Alluxio on EKS as a distributed cache to accelerate PyTorch training jobs that read datasets from S3. This architecture significantly improves GPU utilization from 30% to 90%+, achieves ~5x faster training, and lowers cloud storage costs.
What you will learn:
- The challenges of feeding data-hungry GPUs in the cloud
- How to accelerate model training by optimizing data loading on AWS
- The reference architecture for running PyTorch jobs with Alluxio cache on EKS while reading data from S3, with benchmark results of training ResNet50 and BERT
- How to use TensorBoard to identify bottlenecks in GPU utilization
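The data-loading bottleneck the webinar addresses can be made concrete with a minimal, framework-free sketch (stdlib Python only, not the webinar's actual code); the data-wait vs. compute split it measures mirrors what TensorBoard's profiler reports for a real PyTorch loop:

```python
import time

def profile_epoch(batches, train_step):
    """Split one epoch's wall time into data-wait vs. compute time,
    to spot an I/O-bound training loop."""
    data_time = compute_time = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)            # waiting on the data loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)               # forward/backward work
        data_time += t1 - t0
        compute_time += time.perf_counter() - t1
    total = data_time + compute_time
    return data_time / total, compute_time / total

# Simulate a slow data source (e.g. reading from S3) and a fast step.
slow_batches = (time.sleep(0.02) or i for i in range(5))
data_frac, _ = profile_epoch(slow_batches, lambda b: time.sleep(0.002))
print(f"time waiting on data: {data_frac:.0%}")
```

In a real job, `batches` would be a PyTorch DataLoader and `train_step` the forward/backward pass; a high data-wait fraction signals an I/O-bound loop where the GPU sits idle.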
2. Adoption of Artificial Intelligence (AI)
● 49% of CIOs are using or plan to use AI[1]
● Recent boom of generative AI is accelerating adoption
● Successful AI projects require access to data
● As AI use cases grow more complex…
○ Understanding data access patterns becomes more important
[1] Gartner, “2023 Gartner CIO survey”
4. How Access to Data Hinders the Success of AI
● High-quality AI models require access to massive datasets
● Data access is slow and costly[3]
● Increasing size of models slows down application performance
● Limited availability of GPUs necessitates remote data transfer[4]
● GPUs left waiting for data sit underutilized
[3] AI and compute, https://openai.com/research/ai-and-compute
[4] Amazon EC2 P4 Instances, https://aws.amazon.com/ec2/instance-types/p4
11. Data Access Solutions Should Support:
● High performance and throughput for ML workloads
● Dataset management, including load/unload/update of data from the data lake
● Cloud-native capabilities, such as multi-tenancy, scalability, and elasticity
● Elimination of data redundancy, to avoid managing multiple copies of data
● Reduced dependency on specialized networking hardware
● Flexibility to place compute anywhere, regardless of the location of the data
● Cloud service provider agnosticism, to avoid vendor lock-in
● Future-proofing to adapt to advancements in storage and computation technologies
● Security, including consistent authentication and authorization
13. Data Access for Model Training on AWS[5]
[5] Amazon AWS Docs: https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html
14. Data Access for Model Training on AWS
S3 File Mode
[Diagram: the dataset is copied from S3 to the instance filesystem at /opt/ml/input/data/training-channel ahead of time; the training script process (train.py) then reads it locally on the training instance.]
15. Data Access for Model Training on AWS
S3 FastFile Mode
[Diagram: a FUSE process mounts the S3 dataset at /opt/ml/input/data/training-channel; the training script process (train.py) reads through the mount, streaming data in real time.]
16. Data Access for Model Training on AWS
S3 Pipe Mode
[Diagram: data is streamed from S3 in real time to the instance filesystem path /opt/ml/input/data/training-channel, where the training script process (train.py) reads it.]
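The three SageMaker input modes above are selected per input channel. A hedged sketch of how one channel might be configured (the dict shape follows the CreateTrainingJob API's InputDataConfig; the bucket URI and channel name are illustrative placeholders):

```python
def training_channel(name, s3_uri, input_mode):
    """Build one InputDataConfig channel. input_mode is one of:
    'File'     - copy the dataset to the instance ahead of time,
    'FastFile' - stream in real time through a FUSE-backed mount,
    'Pipe'     - stream in real time through a named pipe."""
    assert input_mode in ("File", "FastFile", "Pipe")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,  # illustrative URI, not from the webinar
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }

channel = training_channel("training-channel", "s3://my-bucket/dataset/", "FastFile")
```

The same dict would be passed in the `InputDataConfig` list of a `create_training_job` request; switching modes is a one-field change, which is why benchmarking all three against the same dataset is cheap.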
17. Data Access for Model Training on AWS
Amazon FSx for Lustre
[Diagram: an FSx for Lustre filesystem is mounted at /opt/ml/input/data/training-channel; the training script process (train.py) reads through the mount, and FSx for Lustre reads through to S3.]
18. Data Access for Model Training on AWS
Amazon EFS Filesystem
[Diagram: an EFS filesystem is mounted at /opt/ml/input/data/training-channel; the training script process (train.py) reads from the mount.]
20. Model Training
Alluxio on AWS - Reference Architecture
[Diagram: reference architecture (steps 1-5) - Alluxio caches training data from storage for the training cluster and distributes trained models to the inference cluster for model serving.]
21. Alluxio on AWS Provides:
● Automatic load / unload / update of data from your existing data lake
● Faster access to training data, informed by data access patterns
● Optimal data access with high data throughput to keep the GPU fully utilized
● Faster model deployment and high-concurrency model serving to inference nodes
● Increased productivity for the data engineering team by eliminating the need to manage data copies
● Reduced cloud storage API and egress costs, such as S3 GET request and data transfer costs
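One reason data copies disappear is that an Alluxio FUSE mount exposes S3-backed data as an ordinary POSIX path, so training code needs no S3 client at all. A minimal sketch (the mount point /mnt/alluxio is a hypothetical example, not from the webinar):

```python
import os

def list_training_files(dataset_root):
    """Enumerate sample files under a POSIX path. The same code works
    whether dataset_root is local disk or an Alluxio FUSE mount that
    caches data from S3 -- the training script cannot tell the difference."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(dataset_root):
        files.extend(os.path.join(dirpath, name) for name in sorted(filenames))
    return files

# e.g. list_training_files("/mnt/alluxio/datasets/train")  # hypothetical mount
```

Because the path is plain POSIX, an existing PyTorch Dataset that reads local files works unchanged when pointed at the mount.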
22. Model Training
Alluxio on AWS - Reference Architecture
[Diagram: the same reference architecture as slide 20, here shown with GCP storage as well, illustrating a cloud-agnostic deployment.]
24. Alluxio Demo Environment
[Diagram: demo environment - the Alluxio Operator deploys Alluxio on Kubernetes; a local folder/dataset in storage feeds GPU training, driven from an interactive notebook and monitored with a visualization dashboard.]
25. Alluxio Demo …
Alluxio AWS Model Training Demo - Recording
Alluxio FUSE vs AWS S3 FUSE Demo - Recording
Alluxio APIs Demo - Recording
26. Visualization Dashboard Results (w/o Alluxio)
Training Directly from Storage
- > 80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)
27. Visualization Dashboard Results (with Alluxio)
Training with Alluxio
- Reduced the DataLoader share of time from 82% to 1% (82x)
- Increased the GPU utilization rate from 17% to 93% (5x)
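The two reported numbers are consistent with each other: if the GPU compute work per step is fixed and only the data-wait share changes, the expected end-to-end speedup follows from a little arithmetic:

```python
def speedup(data_frac_before, data_frac_after):
    """Ideal end-to-end speedup when the data-wait share of step time
    drops and the compute work stays fixed."""
    compute = 1.0 - data_frac_before               # compute share of the old step
    new_total = compute / (1.0 - data_frac_after)  # new step time, same units
    return 1.0 / new_total

print(round(speedup(0.82, 0.01), 1))  # → 5.5, matching the reported ~5x
# Cross-check: GPU utilization rising from 17% to 93% is also ~5.5x (0.93/0.17).
```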