Alluxio Webinar
June 26, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
Tarik Bennett (Senior Solutions Engineer, Alluxio)
Beinan Wang (Tech Lead, Alluxio)
When training models on ultra-large datasets, one of the biggest challenges is low GPU utilization. These powerful processors are often underutilized due to inefficient I/O and data access. This mismatch between computation and storage leads to wasted GPU resources, low performance, and high cloud storage costs. The rise of generative AI and GPU scarcity is only making this problem worse.
In this webinar, Tarik and Beinan discuss strategies for transforming idle GPUs into optimal powerhouses. They will focus on cost-effective management of ultra-large datasets for AI and analytics.
4. Always Increasing Expectations…
● High Scalability: training on billions of files (essential)
● High Availability: 99.99% uptime (essential)
● High Performance: higher GPU utilization (essential)
Icons created by kerismaker, HJ Studio - Flaticon
5. What Does Managing Data Involve?
● Data Preprocessing: improving the quality and reliability of the data for model training
● Feature Engineering: selecting relevant and informative features from raw data
● Model Training: reading training data, vision (image) or NLP/LLM (text), for deep learning using GPUs
● Model Deployment: consumption of trained models for online or offline inference
[Pipeline diagram: a compute stage (Spark | Trino | Presto) produces results and curated data; training stages (PyTorch | Tensorflow | Spark) read training data and produce models]
Not discussed today
- Security
- Privacy (PII)
- Data Cleaning
- Data Pipelines
- Data Governance
14. Architecture Overview
[Diagram: an online ML platform with an inference cluster consumes models produced by the offline training platform, where Alluxio serves training data to the training cluster; numbered arrows (1–5) trace the data and model flow]
15. AI Training Test with Alluxio
[Diagram: GPU training on Kubernetes reads the local folder /dataset, served by Alluxio (deployed via the Alluxio Operator) caching data from remote storage; an interactive notebook drives the run and a visualization dashboard reports metrics]
16. Test Setup
● Alluxio via Kubernetes - Provides caching for training data
● GPU server - AWS EC2/Kubernetes
● Deep learning algorithm (CV) - ResNet (one of the most popular CV algorithms)
● Deep learning framework - PyTorch
● Dataset - ImageNet (subset of ~35k images, each ~100–200 kB)
● Dataset storage - S3 (single region)
● Mounting - FUSE
● Visualization - TensorBoard
● Code execution - Jupyter notebook
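The setup above can be sketched as a minimal map-style dataset over the FUSE mount. This is a hypothetical illustration, not the code used in the test: the /dataset path and flat file layout are assumptions from the slide, and the class only follows the `__len__`/`__getitem__` protocol that PyTorch's `DataLoader` expects, so the sketch runs without torch installed.

```python
import os

class MountedImageDataset:
    """Map-style dataset over a directory (e.g. the Alluxio FUSE
    mount at /dataset), following the __len__/__getitem__ protocol
    that torch.utils.data.DataLoader expects."""

    def __init__(self, root):
        # Every file read below this root goes through FUSE, so
        # Alluxio serves cached blocks instead of hitting S3 directly.
        self.paths = sorted(
            os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(root)
            for name in names
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Real training code would decode the image and apply
        # torchvision transforms here; we return raw bytes.
        with open(self.paths[idx], "rb") as f:
            return f.read()

# Usage with PyTorch (path is an assumption from the test setup slide):
# ds = MountedImageDataset("/dataset")
# loader = torch.utils.data.DataLoader(ds, batch_size=128, num_workers=8)
```

Because `DataLoader` workers issue ordinary filesystem reads, no training code changes are needed when the backing store switches from direct S3 access to the Alluxio cache.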
17. Training Test Steps
1. Loading the dataset into Alluxio
2. Running the training job
3. Reading the dataset from Alluxio through the PyTorch DataLoader in each epoch
4. Visualizing the GPU utilization and other metrics
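One way to obtain the DataLoader-time percentages shown on the next slides is to time each batch fetch separately from each training step. A minimal, framework-free sketch (the batch iterator and `train_step` are hypothetical stand-ins for the PyTorch DataLoader iterator and the GPU forward/backward step):

```python
import time

def measure_loader_share(batches, train_step):
    """Return the fraction of epoch wall time spent fetching
    batches (the 'DataLoader' share) vs. computing on them."""
    load_time = compute_time = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)       # data loading (I/O-bound)
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)          # forward/backward (GPU-bound)
        t2 = time.perf_counter()
        load_time += t1 - t0
        compute_time += t2 - t1
    total = load_time + compute_time
    return load_time / total if total else 0.0

# Stand-ins: a slow "loader" and a fast "step" mimic the
# storage-bound control run from the slides.
slow_batches = (time.sleep(0.004) or i for i in range(10))
share = measure_loader_share(slow_batches, lambda b: time.sleep(0.001))
print(f"DataLoader share: {share:.0%}")
```

When reads come from a warm cache instead of remote storage, the fetch time collapses and the measured share drops accordingly.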
18. Visualization Dashboard Results (Control)
Training directly from storage:
- More than 80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)
19. Visualization Dashboard Results (Alluxio)
Training with Alluxio:
- Reduced the DataLoader share of epoch time from 82% to 1% (82X)
- Increased the GPU utilization rate from 17% to 93% (5X)
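The headline multipliers follow directly from the measured percentages; a quick arithmetic check (the exact ratio for GPU utilization is about 5.5x, which the slide rounds to 5X):

```python
# DataLoader time share: 82% (control) -> 1% (with Alluxio)
dataloader_reduction = 82 / 1     # 82x reduction, as on the slide

# GPU utilization: 17% (control) -> 93% (with Alluxio)
gpu_speedup = 93 / 17             # ~5.5x, reported as 5X

print(dataloader_reduction, round(gpu_speedup, 1))
```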