Alluxio Webinar
June 26, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
Tarik Bennett (Senior Solutions Engineer, Alluxio)
Beinan Wang (Tech Lead, Alluxio)
When training models on ultra-large datasets, one of the biggest challenges is low GPU utilization. These powerful processors are often underutilized due to inefficient I/O and data access. This mismatch between computation and storage leads to wasted GPU resources, low performance, and high cloud storage costs. The rise of generative AI and GPU scarcity is only making this problem worse.
In this webinar, Tarik and Beinan discuss strategies for transforming idle GPUs into optimal powerhouses. They will focus on cost-effective management of ultra-large datasets for AI and analytics.
4. Always Increasing Expectations…
● High Scalability: training on billions of files (essential)
● High Availability: 99.99% uptime (essential)
● High Performance: higher GPU utilization (essential)
Icons created by kerismaker, HJ Studio - Flaticon
5. What Does Managing Data Involve?
● Data Preprocessing: improving the quality and reliability of the data for model training
● Feature Engineering: selecting relevant and informative features from raw data
● Model Training: reading training data, vision (image) or NLP/LLM (text), for deep learning using GPUs
● Model Deployment: consumption of trained models for online or offline inference
[Pipeline diagram: a compute stage (Spark | Trino | Presto) produces results and curated data; training stages (PyTorch | Tensorflow | Spark) read training data and produce models]
Not discussed today
- Security
- Privacy (PII)
- Data Cleaning
- Data Pipelines
- Data Governance
14. Architecture Overview
[Diagram: an online ML platform with an inference cluster consumes models produced by the offline training platform, where Alluxio serves training data to the training cluster; numbered arrows (1–5) trace the data and model flow]
15. AI Training Test with Alluxio
[Diagram: GPU training on Kubernetes reads the local folder /dataset, served by Alluxio (deployed via the Alluxio Operator) caching data from remote storage; an interactive notebook drives the run and a visualization dashboard reports metrics]
16. Test Setup
● Alluxio via Kubernetes - Provides caching for training data
● GPU server - AWS EC2/Kubernetes
● Deep learning algorithm (CV) - ResNet (one of the most popular CV algorithms)
● Deep learning framework - PyTorch
● Dataset - ImageNet (subset of ~35k images, each ~100–200 kB)
● Dataset storage - S3 (single region)
● Mounting - FUSE
● Visualization - TensorBoard
● Code execution - Jupyter notebook
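The setup above can be sketched as a minimal map-style dataset over the FUSE mount. This is a hypothetical illustration, not the code used in the test: the /dataset path and flat file layout are assumptions from the slide, and the class only follows the `__len__`/`__getitem__` protocol that PyTorch's `DataLoader` expects, so the sketch runs without torch installed.

```python
import os

class MountedImageDataset:
    """Map-style dataset over a directory (e.g. the Alluxio FUSE
    mount at /dataset), following the __len__/__getitem__ protocol
    that torch.utils.data.DataLoader expects."""

    def __init__(self, root):
        # Every file read below this root goes through FUSE, so
        # Alluxio serves cached blocks instead of hitting S3 directly.
        self.paths = sorted(
            os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(root)
            for name in names
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Real training code would decode the image and apply
        # torchvision transforms here; we return raw bytes.
        with open(self.paths[idx], "rb") as f:
            return f.read()

# Usage with PyTorch (path is an assumption from the test setup slide):
# ds = MountedImageDataset("/dataset")
# loader = torch.utils.data.DataLoader(ds, batch_size=128, num_workers=8)
```

Because `DataLoader` workers issue ordinary filesystem reads, no training code changes are needed when the backing store switches from direct S3 access to the Alluxio cache.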
17. Training Test Steps
1. Loading the dataset into Alluxio
2. Running the training job
3. Reading the dataset from Alluxio through the PyTorch DataLoader in each epoch
4. Visualizing the GPU utilization and other metrics
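One way to obtain the DataLoader-time percentages shown on the next slides is to time each batch fetch separately from each training step. A minimal, framework-free sketch (the batch iterator and `train_step` are hypothetical stand-ins for the PyTorch DataLoader iterator and the GPU forward/backward step):

```python
import time

def measure_loader_share(batches, train_step):
    """Return the fraction of epoch wall time spent fetching
    batches (the 'DataLoader' share) vs. computing on them."""
    load_time = compute_time = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)       # data loading (I/O-bound)
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)          # forward/backward (GPU-bound)
        t2 = time.perf_counter()
        load_time += t1 - t0
        compute_time += t2 - t1
    total = load_time + compute_time
    return load_time / total if total else 0.0

# Stand-ins: a slow "loader" and a fast "step" mimic the
# storage-bound control run from the slides.
slow_batches = (time.sleep(0.004) or i for i in range(10))
share = measure_loader_share(slow_batches, lambda b: time.sleep(0.001))
print(f"DataLoader share: {share:.0%}")
```

When reads come from a warm cache instead of remote storage, the fetch time collapses and the measured share drops accordingly.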
18. Visualization Dashboard Results (Control)
Training directly from storage:
- More than 80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)
19. Visualization Dashboard Results (Alluxio)
Training with Alluxio:
- Reduced the DataLoader share of epoch time from 82% to 1% (82X)
- Increased the GPU utilization rate from 17% to 93% (5X)
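The headline multipliers follow directly from the measured percentages; a quick arithmetic check (the exact ratio for GPU utilization is about 5.5x, which the slide rounds to 5X):

```python
# DataLoader time share: 82% (control) -> 1% (with Alluxio)
dataloader_reduction = 82 / 1     # 82x reduction, as on the slide

# GPU utilization: 17% (control) -> 93% (with Alluxio)
gpu_speedup = 93 / 17             # ~5.5x, reported as 5X

print(dataloader_reduction, round(gpu_speedup, 1))
```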