Alluxio + Eckerson Webinar
Sep. 12, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speakers:
- Kevin Petrie (Vice President of Research)
- Sridhar Venkatesh (SVP of Product)
As enterprises race to roll out artificial intelligence, they often overlook the infrastructure needed to support scalable ML model development and deployment. Efforts to access and utilize GPUs effectively often lead to extensive data engineering to manage data copies, or to specialized storage, resulting in out-of-control cloud and infrastructure costs.
To address these challenges, enterprises need a new data access layer to connect compute engines to data stores wherever they reside in distributed environments.
Join this webinar with Kevin Petrie, Eckerson Group VP of Research, and Sridhar Venkatesh, Alluxio SVP of Product, to explore tools, techniques, and best practices to remove data access bottlenecks and accelerate AI/ML model training. You will learn:
- Modern requirements for AI/ML model training and data engineering
- The challenges of GPU utilization in machine learning and the need for specialized hardware
- How a new data access layer connects compute to data stores across environments
- Best practices for optimizing ML training and guiding principles for success
13. Retooling the enterprise data infrastructure
Legacy data centers can’t keep up
- High Performance Computing
- Specialized Hardware
- Varied Workloads
“We're seeing incredible orders to retool the world's data centers… a 10-year transition to basically recycle or reclaim the world's data centers and build it out as accelerated computing.”
Jensen Huang, Nvidia CEO
14. Challenges as you try to scale
“GPUs are this year’s toilet paper.”
Wall Street Journal
- GPUs are scarce
- GPUs are expensive
- Low GPU utilization
15. Business Pressures Complex & Costly Solutions
- GPUs are scarce
- GPUs are expensive
- Low GPU utilization
- Faster model development times
- Increased freshness
- Higher accuracy and traceability
- Rapidly growing datasets
- Extensive data engineering managing data copies
- Specialized storage
- Out-of-control cloud and infra costs
17. 1. Faster Time-to-Market
Cost Reduction, Performance Boost
50%
Hundreds of thousands of dollars saved annually compared to previous deployment.
2-3X Model Training Performance
International B2C with a multi-cloud, cross-region AI platform, serving LLMs and training models from object storage. They optimized their AI platform with Alluxio to speed data delivery to training clusters and facilitate faster model deployment in latency-sensitive production use cases.
Models Deployed in Minutes vs Days
Faster model deployment times
18. 2. Higher GPU Utilization
“In a cloud environment, where GPU hardware is paid for as a function of time, you need fast, performant, reliable, and cost-effective data for your model training pipelines to keep your GPU utilization close to 99%.”
20-30%: Average reported GPU utilization based on direct access from remote storage.
Chart: GPU utilization accessing commodity storage vs. GPU utilization accessing Alluxio.
Alluxio serves high-throughput data to K8s training workloads.
90%: GPU utilization from Alluxio serving data pulled from object storage, an increase from 50% utilization via s3fs-fuse.
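Because GPU time is billed whether or not the GPU is busy, the utilization figures above translate directly into cost per hour of useful work. The following is a minimal sketch of that arithmetic. The hourly price is a hypothetical placeholder, not a figure from the webinar, and 25% is used only as the midpoint of the 20-30% range quoted on the slide.

```python
def effective_cost_per_useful_hour(hourly_price: float, utilization: float) -> float:
    """Return the cost of one hour of actual GPU work.

    A GPU billed at `hourly_price` that is busy only a fraction
    `utilization` of the time costs hourly_price / utilization
    per hour of useful compute.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization


# Hypothetical on-demand price per GPU-hour (assumption, not from the source).
HOURLY_PRICE = 10.0

# Utilization levels taken from the slide text.
scenarios = [
    ("direct remote storage (midpoint of 20-30%)", 0.25),
    ("s3fs-fuse", 0.50),
    ("Alluxio", 0.90),
]

for label, utilization in scenarios:
    cost = effective_cost_per_useful_hour(HOURLY_PRICE, utilization)
    print(f"{label}: {utilization:.0%} busy, {cost:.2f} per useful GPU-hour")
```

Under these assumptions, the cost of a useful GPU-hour scales with the inverse of utilization, so moving from 50% to 90% utilization cuts it by a little under half regardless of the actual price.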
19. 3. Reduction in Personnel
Increase in Productivity
Data scientists send requests to AI platform teams. Platform teams set up individual data pipelines.
With Alluxio, data scientists just access their data. Alluxio consolidates many pipelines into an access layer.
[Diagram labels: Pre-Processed Data, Data Management, Pipeline or Scheduler, Training Clusters]
20. 4. Reduction in Infrastructure Spend
Alluxio optimizes data platforms to increase efficiency
Reduced or Eliminated
- Data Engineering Pipelines: Data workflows improved by on-demand access from Alluxio cache
- S3 Egress and API Fees: Fees significantly reduced via granular caching and data reuse
- High Performance Computing: Replaceable with low-cost hardware at comparable performance
- Network Congestion: Network congestion reduced by serving files locally
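The claim that caching and data reuse cut object-store request fees can be illustrated with a generic read-through cache. This is a sketch of the general idea only, not Alluxio's API or implementation: `ReadThroughCache`, `fake_object_store`, and `remote_requests_for` are hypothetical names, and the LRU eviction policy is an assumption chosen for simplicity.

```python
from collections import OrderedDict
from typing import Callable, Sequence


class ReadThroughCache:
    """Tiny LRU read-through cache that counts requests sent to the backing store."""

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity: int) -> None:
        self._fetch_remote = fetch_remote
        self._capacity = capacity
        self._entries = OrderedDict()  # key -> bytes, oldest entry first
        self.remote_requests = 0

    def read(self, key: str) -> bytes:
        if key in self._entries:
            # Cache hit: mark as most recently used, no remote request.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: one billable request to the remote store.
        self.remote_requests += 1
        data = self._fetch_remote(key)
        self._entries[key] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return data


def fake_object_store(key: str) -> bytes:
    """Stand-in for a remote object store GET (hypothetical)."""
    return f"contents of {key}".encode()


def remote_requests_for(epochs: int, files: Sequence[str], capacity: int) -> int:
    """Read every file once per epoch and return the number of remote requests."""
    cache = ReadThroughCache(fake_object_store, capacity)
    for _ in range(epochs):
        for name in files:
            cache.read(name)
    return cache.remote_requests


shards = ["shard-0", "shard-1", "shard-2", "shard-3"]
print("no cache:", remote_requests_for(epochs=3, files=shards, capacity=0))
print("cache fits dataset:", remote_requests_for(epochs=3, files=shards, capacity=4))
```

In this toy setup, three epochs over four files issue twelve remote requests with no cache and four when the cache holds the whole dataset, since every epoch after the first is served locally. A cache smaller than the working set would save less under a sequential LRU access pattern.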
21. 5. Cloud Vendor Leverage
Multi-cloud strategies with cost-effective benefits
- Respond to Limited GPU Availability: Demand for GPUs has exploded. Organizations use Alluxio to supply high-performance data access to remote GPU clusters wherever they find capacity.
- Increase Cloud Agility: Competing CSPs may provide attractive discounts. Alluxio empowers organizations to capitalize on hardware discounts or cost-effective storage in real time. Users access data wherever it resides.
- Avoid Vendor Lock-In: Negotiate with CSPs from a stronger position. Single-cloud deployments are convenient, but that convenience may become an obstacle in negotiations. Alluxio facilitates hybrid and multi-cloud deployments.