Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip
© 2024 Cerebras Systems Inc. All Rights Reserved
Andrew Feldman
CEO & Co-Founder Cerebras
© 2024 Cerebras Systems Inc. All Rights Reserved
AI Has Fundamentally Changed
Computing
AI Supercomputers
x86 Servers
© 2024 Cerebras Systems Inc. All Rights Reserved
There’s a vast chasm in
AI capabilities
AI Developers Are Struggling with
Distributed GPU Training
© 2024 Cerebras Systems Inc. All Rights Reserved
“It can be a frustrating daily life
experience of training large models…You're
there carefully monitoring the vital signs of
your run: loss spikes, numerical issues,
throughput, gradient norms, policy
entropy, etc... or 10,000 GPUs could be
idling.”
Co-Founder, OpenAI
© 2024 Cerebras Systems Inc. All Rights Reserved
Co-Founder, Reka AI; former Google Brain scientist
“Multi-node GPU training is more of
an afterthought as opposed to
distributed training as a first class
citizen…it’s a hardware lottery."
© 2024 Cerebras Systems Inc. All Rights Reserved
“Building large scale training
clusters from scratch and achieving
high MFU and reliability is damn
hard”
Senior Foundation Model Engineer, Uber
GPT-1
120M Parameters
4 Contributors
GPT-4
1.7T Parameters
240+ contributors
35 just for distributed training
& supercomputing
© 2024 Cerebras Systems Inc. All Rights Reserved
Large Models Simply Don’t Fit on GPUs
ChatGPT (28TB)
H100 (80GB)
© 2024 Cerebras Systems Inc. All Rights Reserved
Developers must cut the model into many pieces…
© 2024 Cerebras Systems Inc. All Rights Reserved
And spread them on hundreds of GPUs
© 2024 Cerebras Systems Inc. All Rights Reserved
An ML problem just turned into a parallel programming problem.
A hardware problem just became a supercomputer problem.
Then re-write the model to work across a cluster
© 2024 Cerebras Systems Inc. All Rights Reserved
This causes a code
explosion
nanoGPT
1B Parameters
639 lines of code
Megatron
100B Parameters
20,507 lines of code
© 2024 Cerebras Systems Inc. All Rights Reserved
You never have to do this on Cerebras
© 2024 Cerebras Systems Inc. All Rights Reserved
The Cerebras Way
Build a compute & memory system that’s vastly larger than the model
Cerebras CS-3 = 1,200 TB
ChatGPT
© 2024 Cerebras Systems Inc. All Rights Reserved
4 trillion transistors
46,225 mm² silicon
900,000 cores optimized for sparse
linear algebra
5nm TSMC process
125 Petaflops of AI compute
44 Gigabytes of on-chip memory
21 PByte/s memory bandwidth
214 Pbit/s fabric bandwidth
Cerebras
Wafer-Scale Engine
The fastest AI chip on Earth, again
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras Wafer Scale Engine 3 versus the H100
• Cerebras WSE-3: 4 trillion transistors, 46,225 mm² silicon
• Largest GPU: 80 billion transistors, 814 mm² silicon
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras CS-3
© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3
SwarmX
MemoryX
Wafer Scale Cluster: The World’s Most Scalable
AI Supercomputer
Scales from 1 CS-3 (125 petaflops) to 2,048 CS-3s (256 exaflops), from 1 terabyte to 1 petabyte of parameter memory, and from 1 billion to 24 trillion parameters.
© 2024 Cerebras Systems Inc. All Rights Reserved
Exa-scale Performance
© 2024 Cerebras Systems Inc. All Rights Reserved
Single Device Simplicity
MemoryX Memory Units
SwarmX Interconnect
Wafer Scale Engines
1 to 2048 CS-3s Look and Program Like a Single Device
© 2024 Cerebras Systems Inc. All Rights Reserved
Condor Galaxy 2
Stockton, California
© 2024 Cerebras Systems Inc. All Rights Reserved
Condor Galaxy 3 AI Supercomputer
Dallas, Texas
• 64 CS-3 nodes
• 58 million AI cores
• 8 exaFLOPS FP16 AI compute
• 108 TB parameter memory
• 388 Tbps on-chip bandwidth
© 2024 Cerebras Systems Inc. All Rights Reserved
AI Supercomputers
Built & Operated in the United States
• Condor Galaxy 1 — Santa Clara, CA: 4 ExaFLOPs, 64x CS-2s, 82 TB of memory (online)
• Condor Galaxy 2 — Stockton, CA: 4 ExaFLOPs, 64x CS-2s, 82 TB of memory (online)
• Condor Galaxy 3 — Dallas, TX: 8 ExaFLOPs, 64x CS-3s, 108 TB of memory (Q2 2024)
© 2024 Cerebras Systems Inc. All Rights Reserved
CEO of Microsoft
Satya Nadella
■ JAIS 30B parameter, bilingual
Arabic-English model
■ Microsoft’s core LLM offering
in the Middle East
■ Available on Azure
Cerebras & G42
World leading Arabic LLM
© 2024 Cerebras Systems Inc. All Rights Reserved
“Mayo Clinic selected Cerebras
as its first generative AI
collaborator for its large-scale,
domain-specific AI expertise to
accelerate breakthrough insights
for the benefit of patients.”
Cerebras & Mayo Clinic
Breakthrough insights for the
benefit of patients
Medical Director for Strategy at Mayo Clinic
Dr. Matthew Callstrom
© 2024 Cerebras Systems Inc. All Rights Reserved
“When the largest problem is
solved, a speedup of 228x is
achieved... Moreover…it is unlikely
that such a performance gap can
be closed… given the strong
scalability issues encountered by
this kind of algorithm when using a
large number of multi-GPU nodes
in HPC clusters.”
Cerebras & TotalEnergies
Diego Klahr, VP of Engineering at TotalEnergies
© 2024 Cerebras Systems Inc. All Rights Reserved
A Cerebras cluster with 48 systems exceeded the performance of ‘Frontier’, the world’s #1 supercomputer with 37,000 GPUs, at a 100x cost saving.
Cerebras & KAUST
Tony Chan
President, KAUST
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras CS-3 Architecture Deep Dive
Sean Lie, CTO and Co-Founder, Cerebras
© 2024 Cerebras Systems Inc. All Rights Reserved
• 2x performance
• Same power
• Same price
Cerebras CS-3: A Generational Leap for AI
LLM Training Performance
© 2024 Cerebras Systems Inc. All Rights Reserved
WSE-3 Core
• Building on the tried-and-true WSE-2 core…
WSE-2 core: 4-way 16b SIMD; registers: 16 general purpose, 44 data structure; memory: 48 kB SRAM with a 256 B cache; fabric interface.
© 2024 Cerebras Systems Inc. All Rights Reserved
Improved performance for AI compute
• New higher performance tensor operations
• New 8-way SIMD for 16b data (FP/BF16)
• New 16-way SIMD for 8b data (Fixed/INT8)
• New faster non-linear functions
• 2x higher compute performance core
High bandwidth memory and cache
• 48kB memory per core
• New 512B local cache per core
• Full bandwidth for full SIMD performance
WSE-3 Core: Continuing Distributed AI Architecture Leadership
WSE-3 core: 8-way 16b SIMD and 16-way 8b SIMD; registers: 16 general purpose, 48 data structure; memory: 48 kB SRAM with a 512 B cache; fabric interface.
© 2024 Cerebras Systems Inc. All Rights Reserved
From Small Core to Massive Wafer
Core → die (10.7k cores) → WSE-3 (84 die, 900k cores)
© 2024 Cerebras Systems Inc. All Rights Reserved
Uniquely capable of wafer-scale integration
• Invented process in first generation WSE
• Extended to 5nm in collaboration with
TSMC
Co-designed from ground up
• Uniform architecture with built-in
redundancy
• Extending uniform fabric across die
• Wafer behaves as single massive chip
WSE-3 Interconnect
Enabling the Only Wafer Scale Chip in the World
© 2024 Cerebras Systems Inc. All Rights Reserved
WSE-3 Interconnect
Enabling the Biggest Chip in the World
Traditional (GPU): serial links across connectors, PCBs, and cables
• Each H100: 900 GB/s (36x 100 Gb/s serial), 36 W, 5.0 pJ/bit
• 8x H100: 7.2 TB/s (288x 100 Gb/s serial), 288 W
Wafer Scale Engine: parallel links across <1 mm of silicon
• Each die: 2,880 GB/s (480x 24 Gb/s parallel), 1.1 W, 0.05 pJ/bit
• 84x die: 242 TB/s (40,320x 24 Gb/s parallel), 92 W
Result: 10x more die, 33x more bandwidth, 100x more power efficient
*GPU estimate uses 5nm 100G serdes power with Nvidia H100 NVLink bandwidth
© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 System: Purpose Built for Wafer-Scale
© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 vs. GPU: Orders of Magnitude Performance Advantage (Cerebras CS-3 vs. Nvidia H100)
• Chip size: 46,225 mm² vs. 814 mm² (57x)
• Cores: 900,000 vs. 16,896 FP32 + 528 Tensor (52x)
• On-chip memory: 44 gigabytes vs. 0.05 gigabytes (880x)
• Memory bandwidth: 21 petabytes/sec vs. 0.003 petabytes/sec (7,000x)
• Fabric bandwidth: 214 petabits/sec vs. 0.0576 petabits/sec (3,715x)
Enabling large scale training
Finetune LLaMA 70B on 1B tokens in a day
on a single chip
© 2024 Cerebras Systems Inc. All Rights Reserved
Cluster natively operates as single device
WSE-3 is big enough to run largest models
• Enables compute and memory
disaggregation
• Train with data-parallel only scaling
Architect cluster-level memory and compute
• External memory stores model weights
• Untangle memory and compute
dependency
CS-3 Cluster
Designed as Single ML Accelerator
…
SwarmX Interconnect
MemoryX Memory Units
Wafer Scale Engines
© 2024 Cerebras Systems Inc. All Rights Reserved
Model capacity not limited by device
• Weights streamed onto wafer to compute
layer
• Weights trigger compute using HW
dataflow
• Weights are never stored on wafer
Decoupling weight optimizer compute
• Gradients streamed out of wafer
• Weight update occurs in MemoryX
MemoryX External Memory
Virtually Unlimited Model Weight Capacity
Memory hierarchy capable of massive models on single device
(Diagram: weights stream from MemoryX onto the CS-3; gradients stream back out; optimizer compute and weight memory live in MemoryX.)
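To make the streaming execution model above concrete, here is a minimal, hedged sketch in plain Python/NumPy. The MemoryUnit and Wafer classes, shapes, and toy loss are illustrative stand-ins, not Cerebras APIs; the point is only the division of labor: weights stream in per layer, activations stay on the wafer, gradients stream out, and the optimizer step runs in the external memory unit.

# Illustrative sketch, not Cerebras APIs: hypothetical MemoryUnit / Wafer classes
# standing in for MemoryX and the WSE, with a toy 2-layer ReLU network.
import numpy as np

class MemoryUnit:
    """Holds all weights and optimizer state; the wafer never stores weights."""
    def __init__(self, layer_sizes, lr=1e-3):
        rng = np.random.default_rng(0)
        self.weights = [rng.standard_normal((m, n)) * 0.02 for m, n in layer_sizes]
        self.lr = lr
    def stream_weights(self, i):
        return self.weights[i]                 # weights streamed onto the wafer, one layer at a time
    def apply_gradient(self, i, grad):
        self.weights[i] -= self.lr * grad      # weight update happens here, not on the wafer

class Wafer:
    """Keeps only activations; computes with whatever weights are streamed in."""
    def forward_backward(self, x, memory):
        acts = [x]
        for i in range(len(memory.weights)):               # forward pass: stream weights in
            acts.append(np.maximum(acts[-1] @ memory.stream_weights(i), 0.0))
        grad = acts[-1]                                     # dummy loss gradient for illustration
        for i in reversed(range(len(memory.weights))):      # backward pass: stream gradients out
            dz = grad * (acts[i + 1] > 0)                   # ReLU derivative
            grad_w = acts[i].T @ dz
            grad = dz @ memory.stream_weights(i).T
            memory.apply_gradient(i, grad_w)

mem = MemoryUnit([(32, 64), (64, 16)])
Wafer().forward_backward(np.random.default_rng(1).standard_normal((8, 32)), mem)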
© 2024 Cerebras Systems Inc. All Rights Reserved
Data-parallel only training across CS-3s
• Weights are broadcast to all CS-3s
• Gradients are reduced on way back
Multi-system scaling with the same
execution model as single system
• Same system architecture
• Same network execution flow
• Same software user interface
SwarmX Fabric
Purpose Built Interconnect for Simple Scaling
(Diagram: MemoryX holds weight memory and optimizer compute; weights broadcast through SwarmX to the CS-3s, gradients are reduced on the way back.)
Scaling to cluster compute while operating like a single device
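A hedged sketch of the data-parallel-only flow described above, using plain NumPy and a toy linear model (all names and shapes are illustrative, not Cerebras software): the same weights are broadcast to every system, each system computes gradients on its own data shard, and the gradients are summed on the way back before a single optimizer step.

# Illustrative sketch, not Cerebras software: broadcast weights out, reduce gradients back.
import numpy as np

def train_step(weights, data_shards, lr=1e-3):
    replicas = [weights.copy() for _ in data_shards]        # broadcast: every CS-3 gets a full copy
    grads = [2.0 * shard.T @ (shard @ w) / len(shard)       # each system: gradients on its own shard
             for w, shard in zip(replicas, data_shards)]    # (toy quadratic loss on a linear model)
    reduced = sum(grads) / len(grads)                       # reduce: gradients summed on the way back
    return weights - lr * reduced                           # single optimizer step in external memory

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 4)) * 0.1
shards = np.split(rng.standard_normal((64, 16)), 4)         # 4 "CS-3s", one data shard each
w = train_step(w, shards)                                   # the same flow for 1 or 2048 systems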
© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 Cluster Compute
CS-2 Cluster
192 CS-2 systems
12 exaFLOPS AI Compute
© 2024 Cerebras Systems Inc. All Rights Reserved
• 2048 CS-3
in single cluster
• 256 exaFLOPS
AI Compute
• Programs like a
single device
CS-3 Cluster Compute
Supercomputer Performance, Single Device Experience
© 2024 Cerebras Systems Inc. All Rights Reserved
SwarmX
Scalable spine-leaf topology
• Standard-based 400/800G
Ethernet
• Performance and cost effective
• RDMA for low overhead and
latency
Scaling to 256 exaFLOPS
Purpose Built Scalable Network for AI Training
Cluster options (CS-2 cluster vs. CS-3 cluster):
• Cluster size: 192 systems vs. 2,048 systems
• Link speed: 100 Gb/s vs. 400/800 Gb/s
• Cluster bandwidth: 1 Pb/s vs. 10 Pb/s
© 2024 Cerebras Systems Inc. All Rights Reserved
Train Today’s SOTA Models in Hours or Days
LLaMA 70B training: Meta GPU cluster ~1 month vs. Cerebras CS-3 cluster ~1 day
But the CS-3 cluster operates like a single device
© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 Cluster Memory
Memory SKUs (CS-2 options): 1.5 TB (30 billion parameters) and 12 TB (240 billion parameters)
© 2024 Cerebras Systems Inc. All Rights Reserved
MemoryX: The First Petabyte-Scale AI Memory System
100x larger models, up to 24 trillion parameters
CS-3 MemoryX options (Enterprise to Hyperscale SKUs):
• 1.5 TB → 30B parameters
• 12 TB → 240B parameters
• 24 TB → 480B parameters
• 36 TB → 720B parameters
• 120 TB → 2,400B parameters
• 1,200 TB → 24,000B parameters
© 2024 Cerebras Systems Inc. All Rights Reserved
MemoryX
Compute
State
Efficient hybrid state store
• Weights stored in DDR5 and Flash
• Perf and power/cost efficiency
Flexible compute
• Optimizer and other ops run on
CPU
• General purpose and flexible
• Support for all common ML ops
Enabling Multi-Trillion Parameter Models
Most Scalable and Efficient Model Memory
Model weights live in DRAM and flash; the model optimizer and other operations run on the CPU.
CS-2 vs. CS-3 MemoryX:
• DRAM memory: 12 TB DDR4 (240B params) vs. 36 TB DDR5 (720B params)
• Flash memory (CS-3): 1.2 PB (24T params)
• CPU performance: 1x vs. 2x
© 2024 Cerebras Systems Inc. All Rights Reserved
Large Cluster Memory on a Single Device
© 2024 Cerebras Systems Inc. All Rights Reserved
Train Tomorrow’s Trillion+ Parameter Models
Imagine… LLaMA 1T training: 1000s of GPUs ~1.5 years vs. Cerebras CS-3 cluster ~3 weeks
And the CS-3 cluster still operates like a single device
© 2024 Cerebras Systems Inc. All Rights Reserved
You Program It Like a Single Device, No Matter the Cluster Size
1x CS-3, 4x CS-3, or 2048x CS-3: in every case the programmer sees one big device.
© 2024 Cerebras Systems Inc. All Rights Reserved
And Your Model Always Fits: 1B or 1T Parameters
Llama 7B on a 1.5 TB cluster, Llama 70B on 36 TB, Llama 700B on 1,200 TB; in every case the programmer still sees one big device.
© 2024 Cerebras Systems Inc. All Rights Reserved
Real world seamless cluster scaling
• User: G42
• Model: Jais30B
• Cluster: Condor Galaxy-1
• Experience: “It just worked”
• No complex distributed software
• No changes to parallelism model
• No changes to hyper-parameters
Training SOTA large models every day
• Unique capability enabled by wafer-scale
Jais30B measured training speedup on CG-1: relative speedup (x factor) vs. number of CS-2s, scaling near-linearly from 1 to 64 systems.
Resulting in Near Linear Scaling
Any Scale While Operating as a Single Device
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras Design Philosophy: Massive Compute + Memory for Large Scale Models
GPU:
• External chip interconnect
• Low-performance, high-power connections
• Custom proprietary switches
• Complex distributed software
• Hybrid model-parallel partitioning
Wafer Scale Engine:
• On-chip interconnect
• “Free” high-performance communication
• Big enough to run the largest models
• Simple data-parallel only scaling
• Disaggregated compute and memory
© 2024 Cerebras Systems Inc. All Rights Reserved
But we can and need to do even better…
© 2024 Cerebras Systems Inc. All Rights Reserved
40,000x more compute
In just 5 years
Current trajectory is unsustainable
We must find more efficient
methods
Sparsity is the key
But We Can and Need to Do Even Better
Sparsity Solves the Explosive Cost of Gen AI
Chart: exaFLOPs to train vs. year (2018-2024, log scale from 100 to 100,000,000 exaFLOPs) for BERT, GPT-2, Megatron-LM, T5, T-NLG, GPT-3, Jurassic, Gopher, MT-NLG, Chinchilla, LLaMA, and GPT-4.
© 2024 Cerebras Systems Inc. All Rights Reserved
Sparsity opportunities are everywhere
• Neural networks have native sparsity
• e.g. ReLU or Dropout
• Neural networks can be made sparse
• e.g. sparse weights
• Models are over parameterized by design
• Training is act of discovering important
weights
Training dense is wasteful and inefficient
• But not all hardware can take advantage of
all forms of sparsity
Neural Networks are Sparse
Sparsity
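As a small illustration of the native-sparsity point above (a sketch, not Cerebras code): simply counting zeros after a ReLU shows how much of the following matmul could be skipped.

# Minimal sketch: fraction of exact zeros produced by a ReLU layer (random inputs here;
# trained LLM FFNs can be far sparser, e.g. the ~95% figure cited later in this deck).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 1024))
w = rng.standard_normal((1024, 4096)) / np.sqrt(1024)
acts = np.maximum(x @ w, 0.0)                               # ReLU zeroes roughly half the entries
sparsity = 1.0 - np.count_nonzero(acts) / acts.size
print(f"activation sparsity after ReLU: {sparsity:.1%}")    # ~50% for random inputs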
© 2024 Cerebras Systems Inc. All Rights Reserved
Memory bandwidth built for sparsity
• Traditional hardware built for dense
• High data reuse → caching → low memory bandwidth
• Wafer-scale memory built for sparse
• Low data reuse → no caching → high memory bandwidth
• Enabled by orders of magnitude more mem
bw
CS-3 accelerates all forms of sparsity
• Static and dynamic sparsity
• Structured and unstructured sparsity
Sparsity Acceleration is Memory Bound
Memory bandwidth (Byte/FLOP), required vs. available:
• Required: dense MatMul ~0.001; sparse MatMul ~1
• Available: H100 0.003; WSE-3 2
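A back-of-envelope check on the required side of the table above (the available figures are quoted from the slide as given): with FP16 operands, a large dense GEMM needs on the order of 0.001 bytes per FLOP because every operand is reused many times, while a no-reuse sparse or batch-1 matvec needs about 1 byte per FLOP.

# Back-of-envelope arithmetic, FP16 (2 bytes per value).
M = N = K = 4096
dense_bytes = 2 * (M * K + K * N + M * N)                   # read A and B, write C, once each
dense_flops = 2 * M * N * K
print(f"dense GEMM:      {dense_bytes / dense_flops:.4f} Byte/FLOP")    # ~0.0007, i.e. order 0.001

matvec_bytes = 2 * K * N                                    # every weight fetched for a single use
matvec_flops = 2 * K * N
print(f"no-reuse matvec: {matvec_bytes / matvec_flops:.1f} Byte/FLOP")  # 1.0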
© 2024 Cerebras Systems Inc. All Rights Reserved
Examples of sparse training opportunities
• Dynamic activation sparsity
• e.g. Google: 95% sparse ReLU FFN in LLMs1
• Structured weight sparsity
• e.g. Mistral: 75% sparse FFN MoE 8x7B2
• Unstructured weight sparsity
• e.g. Cerebras: 75% sparse SPDF GPT3
Solving unsustainable scaling for training
• Only HW to accelerate all forms of sparsity
• Even future sparse techniques
Accelerating All Forms of Sparse Training
1 Li et al., The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers, 2023
2 Jiang et al., Mixtral of Experts, 2024
3 Thangarasa et al., SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models, 2023
FLOP reduction from sparsity (relative FLOPs, dense vs. sparse): ReLU 1.7x, MoE 2.0x, SPDF 2.8x
© 2024 Cerebras Systems Inc. All Rights Reserved
But sparsity can also transform inference
on a variety of hardware…
© 2024 Cerebras Systems Inc. All Rights Reserved
Neural Magic + Cerebras
Accelerated Inferencing for LLM Optimization
Mark Kurtz
CTO
Neural Magic
© 2024 Cerebras Systems Inc. All Rights Reserved
OUR LEADERSHIP
Who are we?
AI leader in model optimization and inference server acceleration
MIT Professor of Electrical Engineering
and Computer Science, ACM Fellow
Nir Shavit
Co-Founder
MIT Research Scientist of Multicore
Algorithms and Computational
Connectomes
Alex Matveev
Co-Founder
Chief Scientist
Former VP of Product and CTO of
Google Cloud, former CTO and EVP of
Worldwide Engineering for RedHat
Brian Stevens
CEO of Neural Magic
IST Austria Professor of Distributed
Computing and Machine Learning
Dan Alistarh
Principal Research Scientist
© 2024 Cerebras Systems Inc. All Rights Reserved
• 200+ accepted papers
• 60 patents
• GPTQ
• SparseGPT
• Sparse Fine-Tuning
• nm-vllm
• DeepSparse
• SparseML
As a software-delivered solution, we have deep expertise across AI model training and optimization. We invented many of the current AI industry’s state-of-the-art techniques for quantization and sparsification. Our solutions range from enterprise inference servers to open-source libraries and a sparsified models repo.
© 2024 Cerebras Systems Inc. All Rights Reserved
Challenges with LLM deployment
Deploying to production: issues include
• Requires lots of compute
• Requires lots of memory
• Increases latency
• Very demanding on inference serving infrastructure
• Expensive to operate and support
Options to resolve
• Decrease the size of the LLM
• Apply quantization to combat the accuracy issue when model size is reduced
Llama 2 Size vs. Accuracy (chart)
© 2024 Cerebras Systems Inc. All Rights Reserved
The solution: Sparsity
Unstructured sparsity (before pruning → after pruning):
• Preserves the model’s accuracy while reducing the size of the model
• Improves inference and training performance
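For intuition, here is a minimal sketch of unstructured magnitude pruning (illustrative only; production pipelines such as SparseGPT prune far more carefully and retrain to recover accuracy): keep the largest-magnitude weights, zero the rest, and the layer’s shape is unchanged.

# Minimal sketch of unstructured magnitude pruning (illustrative only).
import numpy as np

def magnitude_prune(w, sparsity=0.7):
    k = int(sparsity * w.size)                               # number of weights to zero out
    threshold = np.partition(np.abs(w), k, axis=None)[k]     # k-th smallest magnitude
    return np.where(np.abs(w) >= threshold, w, 0.0)          # keep only the largest-magnitude weights

rng = np.random.default_rng(0)
w_sparse = magnitude_prune(rng.standard_normal((1024, 1024)), sparsity=0.7)
print(f"sparsity: {1 - np.count_nonzero(w_sparse) / w_sparse.size:.0%}")   # ~70%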
© 2024 Cerebras Systems Inc. All Rights Reserved
Our research collaboration with Cerebras
Create open-source sparse foundational
models that organizations can easily
deploy and use with faster inference.
© 2024 Cerebras Systems Inc. All Rights Reserved
Our process
• Llama 2: 2T tokens, pretrained from Meta
• Sparse pretraining: SparseGPT plus sparse pretraining on Cerebras over 150B tokens (1.7-2.4x reduction in FLOPs)
• Sparse foundational models: Llama 2 7B at 50% and 70% sparsity, 90% accuracy recovery
• Off the shelf: sparse fine-tuning and quantization with GPTQ
• Fine-tuned variants: chat (50%, 70%) and code generation (50%, 70%)
© 2024 Cerebras Systems Inc. All Rights Reserved
Results
Full recovery with 50% and 70% sparse models.
Sparsity vs Accuracy for UltraChat 200k Sparsity vs Accuracy for Evol Code Alpaca
© 2024 Cerebras Systems Inc. All Rights Reserved
Results
4.3x memory reduction (memory usage vs. compression level, Llama 2 7B)
© 2024 Cerebras Systems Inc. All Rights Reserved
Our process (continued)
• Fine-tuning for your use case: sparse fine-tuning for a few hours, then quantization with GPTQ
• Deployment with DeepSparse
© 2024 Cerebras Systems Inc. All Rights Reserved
Local inference performance
With sparsity, real time chat is now possible on local CPUs.
Single Stream Token Generation - Llama 2 7B Single Stream Latency - Llama 2 7B
© 2024 Cerebras Systems Inc. All Rights Reserved
Server inference performance
With sparsity, CPU performance is competitive with GPUs.
Single Stream Decode Performance - Llama 2 7B Multi Stream Decode Performance - Llama 2 7B
© 2024 Cerebras Systems Inc. All Rights Reserved
Comparison (using Neural Magic DeepSparse on an 8-core AMD Genoa CPU):
• Unoptimized model, Llama 2 7B FP32: 2 tokens/second
• Sparse quantized model, Llama 2 7B 70% sparse INT8: 20 tokens/second
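For reference, serving such a model locally looks roughly like the snippet below. This is an assumed usage sketch written from memory of Neural Magic’s DeepSparse text-generation pipeline; the import, arguments, model stub, and output fields should be checked against the current DeepSparse documentation and Neural Magic’s Hugging Face organization.

# Assumed usage (verify against the DeepSparse docs); the model stub below is a placeholder.
from deepsparse import TextGeneration

pipeline = TextGeneration(model="hf:neuralmagic/<sparse-quantized-llama2-7b-stub>")  # hypothetical stub
result = pipeline(prompt="Explain weight sparsity in one sentence.")
print(result.generations[0].text)   # field names assumed from memory of the pipeline output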
© 2024 Cerebras Systems Inc. All Rights Reserved
Key takeaways
• Takeaway 1: Run SOTA models in real time on just a laptop with Neural Magic DeepSparse (up to 4x faster than llama.cpp)
• Takeaway 2: Transform your infrastructure with just software to support LLMs (up to 7x more inference streams per server than llama.cpp at the same performance level)
• Takeaway 3: Train sparse models faster with Cerebras (up to 2x faster sparse training)
© 2024 Cerebras Systems Inc. All Rights Reserved
Next steps
Neural Magic’s Hugging Face
Organization Cerebras Blog
• Arxiv paper with our current results
• Larger models
• Higher sparsities
• INT4 quantization support
• Combine with parameter efficient fine-tuning
Stay tuned for more collaboration with Cerebras
Neural Magic Docs
© 2024 Cerebras Systems Inc. All Rights Reserved
Thank you
• Follow us (@neuralmagic) to stay current on all things Neural Magic, including product updates, ML research developments, and more.
• Join our community: engage with fellow ML practitioners, ask questions, share feedback, and improve the way you use Neural Magic.
• Connect with Neural Magic (neural-magic) to stay up to date with #SoftwareDelivered AI.
© 2024 Cerebras Systems Inc. All Rights Reserved
Models & Product
Jessica Liu, VP of Product, Cerebras
© 2024 Cerebras Systems Inc. All Rights Reserved
The goal of AI training: make the loss curve go down
© 2024 Cerebras Systems Inc. All Rights Reserved
⚠ But it’s not so simple...
© 2024 Cerebras Systems Inc. All Rights Reserved
This happens all the time
© 2024 Cerebras Systems Inc. All Rights Reserved
Model performance can vary greatly
© 2024 Cerebras Systems Inc. All Rights Reserved
Lots of time and cost riding on "getting the big run right"
Challenges of large GenAI training & fine-tuning: out of memory, GPU failure, numerics bugs, low utilization
1. Distribution
2. ML complexity
3. Cost
© 2024 Cerebras Systems Inc. All Rights Reserved
How to get good model quality at scale
Design the experiments → run experiments → pick winners → scale up
Start with small models (500M, 1.3B, 3B), pick configs that carry to 7B, 13B, and 30B, and scale up to 100B; time and work grow with model size.
© 2024 Cerebras Systems Inc. All Rights Reserved
How to get good model quality at scale (on GPUs)
The same design → run → pick winners → scale up loop, but each model size needs a different GPU count and parallelism strategy:
• 0.5B: 1 GPU
• 3B: 8 GPUs, data parallelism
• 13B: 256 GPUs, data & tensor & pipeline parallelism
• 100B: 2,048 GPUs, data & tensor & pipeline & expert & sequence parallelism
© 2024 Cerebras Systems Inc. All Rights Reserved
You have to micromanage the
distribution strategy:
• Tensor or pipeline model parallelism
• Distributed data parallelism
• Expert parallelism
• Interleaved pipelining schedule
• Activation checkpointing &
recomputation
• Interplay among model size, cluster size,
connectivity between nodes, number of
nodes, etc.
Scaling frameworks still require tons of work
© 2024 Cerebras Systems Inc. All Rights Reserved
Lines of Code
----------------------------
Python 18395
C/C++ 1118
C++ 649
CUDA 220
HTML 107
Bourne Shell 9
make 7
Markdown 1
Text 1
----------------------------
Total 20507
----------------------------
Nvidia’s GPT-175B Model
20,000 lines of code, weeks to implement
Hard to debug
© 2024 Cerebras Systems Inc. All Rights Reserved
Cut experiment iteration time from weeks to a day
Lines of Code
----------------------------
Python 565
C/C++ 0
C++ 0
CUDA 0
HTML 0
Bourne Shell 0
make 0
Markdown 0
Text 0
----------------------------
Total 565
----------------------------
Cerebras’ GPT-175B Model
565 lines of code, 1 Day to implement
"GPT-3 in 565 lines of code" Blog
© 2024 Cerebras Systems Inc. All Rights Reserved
How to scale from 1B to 70B on Cerebras

gpt3_1b_params.yaml:
### GPT-3 XL 1.3B
hidden_size: 2048
num_hidden_layers: 24
num_heads: 16

Training:
python run.py \
  --params gpt3_1b_params.yaml \
  --num_steps=100 \
  --model_dir=model_dir

llama2_70b_params.yaml:
### Llama-2 70B
hidden_size: 8192
num_hidden_layers: 80
num_heads: 64

Training:
python run.py \
  --params llama2_70B_params.yaml \
  --num_steps=100 \
  --model_dir=model_dir
© 2024 Cerebras Systems Inc. All Rights Reserved
Scaling from one CS-3 to a cluster is a 1-line change
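As a hedged illustration (the --num_csx flag name is taken from Cerebras ModelZoo documentation and may differ by release), the launch command is the only thing that changes when moving from one CS-3 to a cluster:

python run.py \
  --params llama2_70B_params.yaml \
  --num_steps=100 \
  --model_dir=model_dir \
  --num_csx=16        # request 16 CS-3s instead of 1; the model code is untouched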
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras gets you to high-quality large models faster & more cheaply
On CS-3: data-parallel only, at any model size (0.5B, 3B, 13B, 100B), through the same design sweeps → run experiments → pick winners → scale up flow.
© 2024 Cerebras Systems Inc. All Rights Reserved
On GPUs, small models are the default;
large models take large engineering effort.
On CS-3s, large models are the default;
small models come for free.
© 2024 Cerebras Systems Inc. All Rights Reserved
Med42: Llama-70B Fine-tuned in <1 Week
to Pass the US Medical License Exam
• Scored 72% on USMLE, beating GPT-3.5
• With M42: global healthcare company
with over 450 hospitals and clinics
• Custom curated healthcare dataset of
peer-reviewed papers, medical
textbooks, international health agency
datasets.
• Run finished in 1 weekend
© 2024 Cerebras Systems Inc. All Rights Reserved
FLOR-6.3B State-of-the-Art Catalan,
Spanish, and English LLM
• Best Catalan model, beating BLOOM-7.3B
• Used latest language adaptation techniques
for languages with less training data
• Reduced inference cost by 10% vs. BLOOM,
incorporating a new, more efficient tokenizer
• Used to build RAG systems for specialized
domains
• Trained on 140B tokens in 2.5 days.
• Open Source: Downloaded over 3000 times
FLOR-6.3B
© 2024 Cerebras Systems Inc. All Rights Reserved
JAIS-30B: State-of-the-Art
Arabic-English Bilingual LLM
• SoTA Arabic: Outperforms all other Arabic models
• English: Llama-30B quality in English
• Co-developed with G42’s Core42 and MBZUAI
• Now on Azure AI Cloud as the foundation of their
Model-as-a-Service in the Middle East
Checkpoints on
HuggingFace
Paper available
on Arxiv
© 2024 Cerebras Systems Inc. All Rights Reserved
Challenges
(1) Few high-quality Arabic datasets and
preprocessing pipelines
(2) Tokenizers trained on English
corpora don’t extend well to Arabic
(3) Want highest quality model with
best cost and compute efficiency
Used the latest ML techniques: ALiBi, SwiGLU activation, muP, scaling laws
Ran many tuning experiments on models of
590M, 1.3B, 2.7B, 6.7B.
New vocab optimized for cross-lingual
alignment and trained custom tokenizer
Built new multi-lingual set, experimenting with
mixes of Arabic-only, and Arabic, English, and
code, to find optimal mix (1:2:0.4)
What we did
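As a small worked example of the mix mentioned above (illustrative arithmetic only): converting the 1 : 2 : 0.4 Arabic : English : code ratio into per-source sampling probabilities for batch construction.

# Illustrative arithmetic only: mixture ratios -> per-source sampling probabilities.
ratios = {"arabic": 1.0, "english": 2.0, "code": 0.4}
total = sum(ratios.values())
probs = {name: r / total for name, r in ratios.items()}
print(probs)   # arabic ~0.29, english ~0.59, code ~0.12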
© 2024 Cerebras Systems Inc. All Rights Reserved
"I’ve found it really easy to experiment at every model size and scale
on multiple CS systems, which we need to do to get the best results.
There’s no difference between running a job on a single CS versus
multiple ones. All it takes is a small config change, and everything just
works with observable linear speedup!
Launched my first distributed LLM training within the first hour of
logging into a CS cluster for the first time!”
Neha Sengupta, Core42
Principal Applied Scientist
© 2024 Cerebras Systems Inc. All Rights Reserved
Jais-30B-v3 sets a new record for open-source Arabic LLMs, finishing training on 1.3 trillion tokens
Jais-30B outperforms on all common NLP benchmarks in Arabic (MMLU / HellaSwag / ARC-C / TruthfulQA):
• Jais-30b-chat: 35.1 / 59.3 / 39.1 / 53.1
• acegpt-13b-chat: 31.2 / 49.2 / 35.1 / 48.2
• BLOOMz (7.1B): 31.0 / 38.1 / 30.2 / 48.4
• LLaMA (30B): 28.9 / 33.9 / 26.9 / 48.4
• falcon-40b_instruct: 28.6 / 32.1 / 26.4 / 49.3
© 2024 Cerebras Systems Inc. All Rights Reserved
The Future is Multimodal
An explosion of exploration in multimodality
Source: Recent advances in Multimodal LLMs
© 2024 Cerebras Systems Inc. All Rights Reserved
• Generalized support for Visual Q&A:
• Multiple vision encoders
• Multiple LLM backbones
• Cross-projection learning
• Multiple modalities to an LLM backbone
• Easy scaling for model size and context length
• Easy to configure many leading literature models
(e.g. LLaVA, AnyMAL, Eyes Wide Shut)
• Dataset: support for quick import of custom datasets
Multimodality is easy on Cerebras
Plug & play vision & LLM backbones: vision encoders (CLIP, SigLIP, DINOv2) paired with LLM backbones (Llama, Mistral, Zephyr) to produce multimodal output.
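A minimal sketch of the plug-and-play recipe above (illustrative shapes and names, not Cerebras ModelZoo code): a frozen vision encoder’s patch features are passed through a learned projection into the LLM’s embedding space and concatenated with the text tokens, which is what makes encoders and backbones swappable.

# Illustrative LLaVA-style cross-projection; shapes are placeholders.
import numpy as np

rng = np.random.default_rng(0)
vision_dim, llm_dim = 1024, 4096                                 # e.g. CLIP-like encoder, Llama-like LLM
patch_feats = rng.standard_normal((256, vision_dim))             # 256 image patches from the encoder
projector = rng.standard_normal((vision_dim, llm_dim)) * 0.02    # the only newly trained piece
image_tokens = patch_feats @ projector                           # now shaped like LLM token embeddings
text_tokens = rng.standard_normal((32, llm_dim))                 # embedded text prompt
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)  # fed to the LLM backbone
print(llm_input.shape)                                           # (288, 4096)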
© 2024 Cerebras Systems Inc. All Rights Reserved
Demo
© 2024 Cerebras Systems Inc. All Rights Reserved
Reproducing state-of-the-art results in just a couple of weeks
(Scores per model: GQA / VQA(t) / VQA(v2) / POPE)
7B parameter models:
• LLaVA 1.5 (7B): 62.0 / 58.2 / 78.5 / 85.9
• Cerebras-LLaVA 1.5 (7B): 62.3 / 58.2 / 78.5 / 85.3
• SGPT4V (7B): 63.3 / 60.4 / 80.6 / not reported
• Cerebras-SGPT4V (7B): 63.5 / 60.8 / 80.7 / 85.7
13B parameter models:
• LLaVA 1.5 (13B): 63.3 / 61.3 / 80.0 / 85.9
• Cerebras-LLaVA 1.5 (13B): 64.2 / 63.4 / 82.0 / 85.8
Improving: the 7B model is competitive with LLaVA 1.5 13B HD, a model 2x larger with 1.7x higher resolution image input that came out less than 2 months ago:
• CS3-LLaVA-7B: POPE 86.7, GQA 63.9, VQAt 61.5, MME 1573, VQAv2 81.4
• LLaVA 1.5 13B HD: POPE 86.3, GQA 64.7, VQAt 62.5, MME 1500, VQAv2 81.8
© 2024 Cerebras Systems Inc. All Rights Reserved
Get started quickly with Cerebras ModelZoo
Model code with flexible configuration setup
• Different image encoders:
• CLIP
• SigLIP
• Dino v2
• Different LLM backbones:
• LLaMA
• Mistral
• Zephyr
• Different training recipes:
• LLaMA Pro
• Eyes Wide Shut
• Freezing different parts of the model
Prepared Datasets
• LLAVA 1.5, ShareGPT4V, Instruct4V
• ChartQA, DocVQA, DVQA, ArxivQA, AI2Diagrams
Data pre-processing scripts
• HDF5 file generation support
• Handles mix of multimodal and text-only data
• Optimized for high-throughput training
Easy scaling for model and data
• LLM model size
• Long context lengths
• Image resolution and patch size
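As a hedged sketch of the HDF5 packing step mentioned above (h5py with illustrative dataset names; the real ModelZoo scripts differ): token sequences are packed once into fixed-length arrays so training can stream them at high throughput.

# Illustrative HDF5 packing with h5py; dataset names and shapes are assumptions.
import h5py
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.integers(0, 32000, size=(1000, 2048), dtype=np.int32)   # 1000 packed sequences

with h5py.File("train_shard_000.h5", "w") as f:
    f.create_dataset("input_ids", data=tokens, compression="gzip")
    f.create_dataset("attention_mask", data=np.ones_like(tokens), compression="gzip")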
© 2024 Cerebras Systems Inc. All Rights Reserved
Model Checkpoints Available on HuggingFace
7B – available now
13B – available now
70B – end of March!
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras’ goal is to bring
State-of-the-Art AI to
every organization
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras solutions meet you wherever you need
Cerebras Wafer Scale Clusters
Cerebras Cloud
Cerebras AI Solutions
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras AI Model Services
GenAI Success with Cerebras ML Experts on
the Fastest, Most Efficient Platform
• Speed: Multi-Billion param models in days to weeks.
• Tailored to you: Custom chatbots, VQA Systems,
Code Completion, Foundation models, and more
• All the latest ML Techniques: RAG, DPO, LoRA,
MuP, data augmentation, and more.
• Total Ownership: Your data, your model weights.
© 2024 Cerebras Systems Inc. All Rights Reserved
Models on Cerebras
From multi-lingual LLMs to healthcare chatbots to code models.
© 2024 Cerebras Systems Inc. All Rights Reserved
All the Latest ML Techniques & Recipes
Variable Seq Training
DPO
LL360 – Open data, models, scripts
Multi-lingual
Pre-training & IFT
Llama70B fine tuning
Domain Adaptation
GPT-3 in 565 lines
of code
Most FLOP efficient
LLM dataset
First family of open GPT models
and OSS use of muP
RAG
LoRA
MoE
Multi
Modal
Sparse
Models
© 2024 Cerebras Systems Inc. All Rights Reserved
The model belongs to you
Your data stays with you
© 2024 Cerebras Systems Inc. All Rights Reserved
Cloud
Cerebras AI Supercomputers
Exascale compute with the programmability of a single device
On-Prem
© 2024 Cerebras Systems Inc. All Rights Reserved
AI Applications & Research Panel
Andy Hock, SVP Product & Strategy, Cerebras
Cerebras AI Applications
& Research Panel
Praneetha Elugunti
Mayo Clinic
Jim Culver
GSK
Tim Bishop
Mayo Clinic
Irina Rish
University of Montreal
Andy Hock
Cerebras
Cerebras x
Qualcomm
Fireside Chat with
Rashid Attar, VP of Cloud Computing,
Qualcomm
Cerebras x Qualcomm Technology Partnership
Reducing Inference Cost by 10x
Cerebras CS-3
AI Training
Qualcomm Cloud AI100 Ultra
AI Inference
Jointly optimized software stack for cost-efficient LLMs (Cerebras stack → Qualcomm stack):
• Sparse training → sparse inference
• Train in FP16 → compile & run in MX6
• Train large + small models → apply speculative decoding
• Network architecture search → compile & run on Ultra AI 100
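To make the speculative-decoding row of the stack above concrete, here is a simplified, illustrative sketch (greedy acceptance with toy stand-in models; real systems verify all draft tokens in one batched target pass and use probabilistic acceptance):

# Simplified speculative decoding: a small draft model proposes k tokens, the large target
# model checks them and keeps the agreed prefix, so each expensive target step yields
# several tokens. The lambda "models" below are toy stand-ins.
def speculative_step(draft_next, target_next, prefix, k=4):
    draft = list(prefix)
    for _ in range(k):                                      # draft model proposes k tokens
        draft.append(draft_next(draft))
    proposed = draft[len(prefix):]
    accepted, ctx = [], list(prefix)
    for tok in proposed:                                    # verify proposals in order
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))               # first disagreement: take the target's token
            break
    return prefix + accepted

draft_next  = lambda seq: (sum(seq) + 1) % 7                # toy stand-ins for real models
target_next = lambda seq: (sum(seq) + 1) % 7 if len(seq) % 3 else (sum(seq) + 2) % 7
print(speculative_step(draft_next, target_next, [1, 2, 3]))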
Cerebras x Qualcomm: Up to 10x Inference Performance
Total tokens per dollar, relative to baseline:
• Baseline: 1x
• Speculative decoding: 1.8x
• MX6 compression: 2.2x
• Neural architecture search: 2.5x
• Sparsity: 2.5x
• Total: ~10x
Cerebras x G42
Fireside Chat with
Kiril Evtimov, Group CTO G42 & CEO
Core42
G42 across the Entire AI Value Chain
Customer &
Industry Tailored
Solutions
Data
Centers
Compute
Infrastructure
Cloud
Platforms
AI Model
Development
Cloud &
Enterprise AI
Deployment
Application
Development
476B Arabic tokens
1.63T Total tokens
The world’s largest
open-source Arabic LLM
30B parameter, bilingual
Arabic-English model
Trained on the Condor Galaxy 1 and 2 AI Supercomputers
Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip

More Related Content

Similar to Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip

Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptxQ1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptxMemory Fabric Forum
 
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...Big Data Week
 
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs:  Intelligent Connectivity for Cloud and AI InfrastructureAstera Labs:  Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs: Intelligent Connectivity for Cloud and AI InfrastructureMemory Fabric Forum
 
Designing memory controller for ddr5 and hbm2.0
Designing memory controller for ddr5 and hbm2.0Designing memory controller for ddr5 and hbm2.0
Designing memory controller for ddr5 and hbm2.0Deepak Shankar
 
Exploration of Radars and Software Defined Radios using VisualSim
Exploration of  Radars and Software Defined Radios using VisualSimExploration of  Radars and Software Defined Radios using VisualSim
Exploration of Radars and Software Defined Radios using VisualSimDeepak Shankar
 
Power 7 Overview
Power 7 OverviewPower 7 Overview
Power 7 Overviewlambertt
 
Build FAST Learning Apps with Docker and OpenPOWER
Build FAST Learning Apps with Docker and OpenPOWERBuild FAST Learning Apps with Docker and OpenPOWER
Build FAST Learning Apps with Docker and OpenPOWERIndrajit Poddar
 
MemVerge: Memory Expansion Without Breaking the Budget
MemVerge: Memory Expansion Without Breaking the BudgetMemVerge: Memory Expansion Without Breaking the Budget
MemVerge: Memory Expansion Without Breaking the BudgetMemory Fabric Forum
 
Ca lecture 03
Ca lecture 03Ca lecture 03
Ca lecture 03Haris456
 
April 2014 IBM announcement webcast
April 2014 IBM announcement webcastApril 2014 IBM announcement webcast
April 2014 IBM announcement webcastHELP400
 
AWS Summit Bogotá Track Avanzado: EC2 avanzado
AWS Summit Bogotá Track Avanzado: EC2 avanzadoAWS Summit Bogotá Track Avanzado: EC2 avanzado
AWS Summit Bogotá Track Avanzado: EC2 avanzadoAmazon Web Services
 
Exadata_X10M-Hardware-Overview.pdf
Exadata_X10M-Hardware-Overview.pdfExadata_X10M-Hardware-Overview.pdf
Exadata_X10M-Hardware-Overview.pdfKoko842772
 
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...Memory Fabric Forum
 
Optimizing elastic search on google compute engine
Optimizing elastic search on google compute engineOptimizing elastic search on google compute engine
Optimizing elastic search on google compute engineBhuvaneshwaran R
 
Running ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in ProductionRunning ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in ProductionSearce Inc
 
The Power of HPC with Next Generation Supermicro Systems
The Power of HPC with Next Generation Supermicro Systems The Power of HPC with Next Generation Supermicro Systems
The Power of HPC with Next Generation Supermicro Systems Rebekah Rodriguez
 

Similar to Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip (20)

Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptxQ1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
 
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...
 
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs:  Intelligent Connectivity for Cloud and AI InfrastructureAstera Labs:  Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure
 
Designing memory controller for ddr5 and hbm2.0
Designing memory controller for ddr5 and hbm2.0Designing memory controller for ddr5 and hbm2.0
Designing memory controller for ddr5 and hbm2.0
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
 
Exploration of Radars and Software Defined Radios using VisualSim
Exploration of  Radars and Software Defined Radios using VisualSimExploration of  Radars and Software Defined Radios using VisualSim
Exploration of Radars and Software Defined Radios using VisualSim
 
Power 7 Overview
Power 7 OverviewPower 7 Overview
Power 7 Overview
 
Build FAST Learning Apps with Docker and OpenPOWER
Build FAST Learning Apps with Docker and OpenPOWERBuild FAST Learning Apps with Docker and OpenPOWER
Build FAST Learning Apps with Docker and OpenPOWER
 
MemVerge: Memory Expansion Without Breaking the Budget
MemVerge: Memory Expansion Without Breaking the BudgetMemVerge: Memory Expansion Without Breaking the Budget
MemVerge: Memory Expansion Without Breaking the Budget
 
Ca lecture 03
Ca lecture 03Ca lecture 03
Ca lecture 03
 
April 2014 IBM announcement webcast
April 2014 IBM announcement webcastApril 2014 IBM announcement webcast
April 2014 IBM announcement webcast
 
Palestra IBM-Mack Zvm linux
Palestra  IBM-Mack Zvm linux  Palestra  IBM-Mack Zvm linux
Palestra IBM-Mack Zvm linux
 
AWS Summit Bogotá Track Avanzado: EC2 avanzado
AWS Summit Bogotá Track Avanzado: EC2 avanzadoAWS Summit Bogotá Track Avanzado: EC2 avanzado
AWS Summit Bogotá Track Avanzado: EC2 avanzado
 
Exadata_X10M-Hardware-Overview.pdf
Exadata_X10M-Hardware-Overview.pdfExadata_X10M-Hardware-Overview.pdf
Exadata_X10M-Hardware-Overview.pdf
 
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
 
Optimizing elastic search on google compute engine
Optimizing elastic search on google compute engineOptimizing elastic search on google compute engine
Optimizing elastic search on google compute engine
 
Running ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in ProductionRunning ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in Production
 
POWER9 for AI & HPC
POWER9 for AI & HPCPOWER9 for AI & HPC
POWER9 for AI & HPC
 
Power overview 2018 08-13b
Power overview 2018 08-13bPower overview 2018 08-13b
Power overview 2018 08-13b
 
The Power of HPC with Next Generation Supermicro Systems
The Power of HPC with Next Generation Supermicro Systems The Power of HPC with Next Generation Supermicro Systems
The Power of HPC with Next Generation Supermicro Systems
 

Recently uploaded

Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝soniya singh
 
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...Call Girls in Nagpur High Profile
 
VVIP Pune Call Girls Balaji Nagar (7001035870) Pune Escorts Nearby with Compl...
VVIP Pune Call Girls Balaji Nagar (7001035870) Pune Escorts Nearby with Compl...VVIP Pune Call Girls Balaji Nagar (7001035870) Pune Escorts Nearby with Compl...
VVIP Pune Call Girls Balaji Nagar (7001035870) Pune Escorts Nearby with Compl...Call Girls in Nagpur High Profile
 
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhi
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | DelhiFULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhi
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhisoniya singh
 
如何办理萨省大学毕业证(UofS毕业证)成绩单留信学历认证原版一比一
如何办理萨省大学毕业证(UofS毕业证)成绩单留信学历认证原版一比一如何办理萨省大学毕业证(UofS毕业证)成绩单留信学历认证原版一比一
如何办理萨省大学毕业证(UofS毕业证)成绩单留信学历认证原版一比一ga6c6bdl
 
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile serviceCall Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile servicerehmti665
 
(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Service
(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Service(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Service
(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Call Girls in Nagpur Bhavna Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Bhavna Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Bhavna Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Bhavna Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...ranjana rawat
 
VIP Call Girl Saharanpur Aashi 8250192130 Independent Escort Service Saharanpur
VIP Call Girl Saharanpur Aashi 8250192130 Independent Escort Service SaharanpurVIP Call Girl Saharanpur Aashi 8250192130 Independent Escort Service Saharanpur
VIP Call Girl Saharanpur Aashi 8250192130 Independent Escort Service SaharanpurSuhani Kapoor
 
Slim Call Girls Service Badshah Nagar * 9548273370 Naughty Call Girls Service...
Slim Call Girls Service Badshah Nagar * 9548273370 Naughty Call Girls Service...Slim Call Girls Service Badshah Nagar * 9548273370 Naughty Call Girls Service...
Slim Call Girls Service Badshah Nagar * 9548273370 Naughty Call Girls Service...nagunakhan
 
Dubai Call Girls O528786472 Call Girls In Dubai Wisteria
Dubai Call Girls O528786472 Call Girls In Dubai WisteriaDubai Call Girls O528786472 Call Girls In Dubai Wisteria
Dubai Call Girls O528786472 Call Girls In Dubai WisteriaUnited Arab Emirates
 
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311  Call Girls in Thane , Independent Escort Service ThanePallawi 9167673311  Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service ThanePooja Nehwal
 
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai Mumbai ...
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai  Mumbai ...High Profile Call Girls In Andheri 7738631006 Call girls in mumbai  Mumbai ...
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai Mumbai ...Pooja Nehwal
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单留信学历认证原版一比一
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单留信学历认证原版一比一如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单留信学历认证原版一比一
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单留信学历认证原版一比一ga6c6bdl
 
Thane Escorts, (Pooja 09892124323), Thane Call Girls
Thane Escorts, (Pooja 09892124323), Thane Call GirlsThane Escorts, (Pooja 09892124323), Thane Call Girls
Thane Escorts, (Pooja 09892124323), Thane Call GirlsPooja Nehwal
 
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一zul5vf0pq
 

Recently uploaded (20)

young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
 
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
 
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...
 

Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip

  • 30. © 2024 Cerebras Systems Inc. All Rights Reserved Condor Galaxy 2 Stockton, California
  • 31. © 2024 Cerebras Systems Inc. All Rights Reserved Condor Galaxy 3 AI Supercomputer 64 CS-3 nodes 58 million AI cores 8 exaFLOPS FP16 AI compute 108 TB Parameter memory 388 Tbps On-chip bandwidth Dallas, Texas
  • 32. © 2024 Cerebras Systems Inc. All Rights Reserved AI Supercomputers Built & Operated in the United States. Condor Galaxy 1 (Santa Clara, CA): 4 ExaFLOPs, 64 x CS-2s, 82 TB of memory, ONLINE. Condor Galaxy 2 (Stockton, CA): 4 ExaFLOPs, 64 x CS-2s, 82 TB of memory, ONLINE. Condor Galaxy 3 (Dallas, TX): 8 ExaFLOPs, 64 x CS-3s, 108 TB of memory, Q2 2024
  • 33. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras & G42: World leading Arabic LLM ■ JAIS: 30B parameter, bilingual Arabic-English model ■ Microsoft's core LLM offering in the Middle East ■ Available on Azure. Satya Nadella, CEO of Microsoft
  • 34. © 2024 Cerebras Systems Inc. All Rights Reserved “Mayo Clinic selected Cerebras as its first generative AI collaborator for its large-scale, domain-specific AI expertise to accelerate breakthrough insights for the benefit of patients.” Cerebras & Mayo Clinic Breakthrough insights for the benefit of patients Medical Director for Strategy at Mayo Clinic Dr. Matthew Callstrom
  • 35. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras & TotalEnergies: “When the largest problem is solved, a speedup of 228x is achieved... Moreover…it is unlikely that such a performance gap can be closed… given the strong scalability issues encountered by this kind of algorithm when using a large number of multi-GPU nodes in HPC clusters.” Diego Klahr, VP of Engineering at TotalEnergies
  • 36. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras & KAUST: A Cerebras cluster with 48 systems exceeded the performance of the world's #1 supercomputer 'Frontier' with 37,000 GPUs, a roughly 100x cost saving. Tony Chan, President, KAUST
  • 37. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras CS-3 Architecture Deep Dive Sean Lie, CTO and Co-Founder, Cerebras
  • 38. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras CS-3: A Generational Leap for AI LLM Training Performance • 2x performance • Same power • Same price
  • 39. © 2024 Cerebras Systems Inc. All Rights Reserved WSE-3 Core • Building on the tried-and-true WSE-2 core… WSE-2 core: 4-way 16b SIMD; 48kB SRAM memory; 256B cache; 16 general purpose and 44 data structure registers; fabric interface
  • 40. © 2024 Cerebras Systems Inc. All Rights Reserved WSE-3 Core: Continuing Distributed AI Architecture Leadership. Improved performance for AI compute • New higher performance tensor operations • New 8-way SIMD for 16b data (FP/BF16) • New 16-way SIMD for 8b data (Fixed/INT8) • New faster non-linear functions • 2x higher compute performance per core. High bandwidth memory and cache • 48kB memory per core • New 512B local cache per core • Full bandwidth for full SIMD performance. WSE-3 core summary: 8-way 16b SIMD, 16-way 8b SIMD, 48kB SRAM memory, 512B cache, 16 general purpose and 48 data structure registers, fabric interface
  • 41. © 2024 Cerebras Systems Inc. All Rights Reserved From Small Core to Massive Wafer: from a single core, to a die of 10.7k cores, to the WSE-3 with 84 die and 900k cores
  • 42. © 2024 Cerebras Systems Inc. All Rights Reserved Uniquely capable of wafer-scale integration • Invented process in first generation WSE • Extended to 5nm in collaboration with TSMC Co-designed from ground up • Uniform architecture with built-in redundancy • Extending uniform fabric across die • Wafer behaves as single massive chip WSE-3 Interconnect Enabling the Only Wafer Scale Chip in the World
  • 43. © 2024 Cerebras Systems Inc. All Rights Reserved WSE-3 Interconnect: Enabling the Biggest Chip in the World. Traditional GPU interconnect (serial across connectors, PCBs, cables): each H100: 900GB/s bandwidth (36x 100Gb/s serial), 36W at 5.0 pJ/bit; 8x H100: 7.2TB/s (288x 100Gb/s serial), 288W. Wafer Scale Engine (parallel across <1mm on silicon): each die: 2880GB/s (480x 24Gb/s parallel), 1.1W at 0.05 pJ/bit; 84x die: 242TB/s (40,320x 24Gb/s parallel), 92W. Result: 10x more die, 33x more bandwidth, 100x more power efficient. *GPU estimate uses 5nm 100G serdes power with Nvidia H100 NVLink bandwidth
  • 44. © 2024 Cerebras Systems Inc. All Rights Reserved CS-3 System: Purpose Built for Wafer-Scale
  • 45. © 2024 Cerebras Systems Inc. All Rights Reserved CS-3 vs. GPU: Orders of Magnitude Performance Advantage. Chip size: 46,225 mm2 vs. 814 mm2 (57x); Cores: 900,000 vs. 16,896 FP32 + 528 Tensor (52x); On-chip memory: 44 Gigabytes vs. 0.05 Gigabytes (880x); Memory bandwidth: 21 Petabytes/sec vs. 0.003 Petabytes/sec (7,000x); Fabric bandwidth: 214 Petabits/sec vs. 0.0576 Petabits/sec (3,715x). Enabling large scale training: fine-tune LLaMA 70B on 1B tokens in a day on a single chip
  • 46. © 2024 Cerebras Systems Inc. All Rights Reserved Cluster natively operates as single device WSE-3 is big enough to run largest models • Enables compute and memory disaggregation • Train with data-parallel only scaling Architect cluster-level memory and compute • External memory stores model weights • Untangle memory and compute dependency CS-3 Cluster Designed as Single ML Accelerator … SwarmX Interconnect MemoryX Memory Units Wafer Scale Engines
  • 47. © 2024 Cerebras Systems Inc. All Rights Reserved Model capacity not limited by device • Weights streamed onto wafer to compute layer • Weights trigger compute using HW dataflow • Weights are never stored on wafer Decoupling weight optimizer compute • Gradients streamed out of wafer • Weight update occurs in MemoryX MemoryX External Memory Virtually Unlimited Model Weight Capacity Memory hierarchy capable of massive models on single device Weights Gradients MemoryX Optimizer Compute Weight Memory CS-3
  • 48. © 2024 Cerebras Systems Inc. All Rights Reserved Data-parallel only training across CS-3s • Weights are broadcast to all CS-3s • Gradients are reduced on way back Multi-system scaling with the same execution model as single system • Same system architecture • Same network execution flow • Same software user interface SwarmX Fabric Purpose Built Interconnect for Simple Scaling MemoryX Optimizer Compute Weight Memory Weights Gradients Weights Gradients SwarmX CS-3s Scaling to cluster compute while operating like a single device
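The two slides above describe the execution model in prose: MemoryX holds the weights and runs the optimizer, SwarmX broadcasts weights to every CS-3 and reduces gradients on the way back, and the wafers only ever see a stream of weights per layer. The short numpy sketch below illustrates that flow; it is a toy model of the idea, not Cerebras software, and every name in it (MemoryX, swarmx_reduce, the gradient rule) is an illustrative assumption.

    import numpy as np

    class MemoryX:
        """Holds model weights off-wafer and performs the optimizer update there."""
        def __init__(self, layer_shapes, lr=1e-3):
            self.weights = [np.random.randn(*shape) * 0.01 for shape in layer_shapes]
            self.lr = lr

        def stream_weights(self, layer_idx):
            return self.weights[layer_idx]              # streamed out to the wafers

        def apply_gradients(self, layer_idx, grad):
            self.weights[layer_idx] -= self.lr * grad   # weight update happens here, not on-wafer

    def swarmx_reduce(per_system_grads):
        """SwarmX-style reduction: average the gradients coming back from each CS-3."""
        return np.mean(per_system_grads, axis=0)

    def train_step(memoryx, data_shards):
        # Data-parallel only: each CS-3 gets a different shard, the same weights are
        # broadcast layer by layer, and reduced gradients flow back to MemoryX.
        for layer_idx in range(len(memoryx.weights)):
            w = memoryx.stream_weights(layer_idx)
            grads = [shard.T @ (shard @ w) for shard in data_shards]   # toy per-system gradient
            memoryx.apply_gradients(layer_idx, swarmx_reduce(grads))

    memoryx = MemoryX(layer_shapes=[(16, 16), (16, 16)])
    shards = [np.random.randn(8, 16) for _ in range(4)]                # 4 systems, 4 data shards
    train_step(memoryx, shards)

The point the slides make is that nothing in this loop depends on how many systems sit behind swarmx_reduce, which is why model capacity and cluster size can grow independently.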
  • 49. © 2024 Cerebras Systems Inc. All Rights Reserved CS-3 Cluster Compute CS-2 Cluster 192 CS-2 systems 12 exaFLOPS AI Compute
  • 50. © 2024 Cerebras Systems Inc. All Rights Reserved • 2048 CS-3 in single cluster • 256 exaFLOPS AI Compute • Programs like a single device CS-3 Cluster Compute Supercomputer Performance, Single Device Experience
  • 51. © 2024 Cerebras Systems Inc. All Rights Reserved SwarmX: Purpose Built Scalable Network for AI Training, Scaling to 256 exaFLOPS. Scalable spine-leaf topology • Standards-based 400/800G Ethernet • Performance and cost effective • RDMA for low overhead and latency. Cluster options: CS-2 clusters: 192 systems, 100 Gb/s links, 1 Pb/s cluster bandwidth; CS-3 clusters: 2048 systems, 400/800 Gb/s links, 10 Pb/s cluster bandwidth
  • 52. © 2024 Cerebras Systems Inc. All Rights Reserved Train Today's SOTA Models in Hours or Days. LLaMA 70B training: ~1 month on Meta's GPU cluster vs. ~1 day on a Cerebras CS-3 cluster
  • 53. © 2024 Cerebras Systems Inc. All Rights Reserved Train Today's SOTA Models in Hours or Days. LLaMA 70B training: ~1 month on Meta's GPU cluster vs. ~1 day on a Cerebras CS-3 cluster. But the CS-3 cluster operates like a single device
  • 54. © 2024 Cerebras Systems Inc. All Rights Reserved CS-3 Cluster Memory. CS-2 MemoryX SKUs: 1.5 TB (30 billion parameters) and 12 TB (240 billion parameters)
  • 55. © 2024 Cerebras Systems Inc. All Rights Reserved MemoryX: The First Petabyte-Scale AI Memory System, enabling 100x larger models up to 24 trillion parameters. CS-3 MemoryX options, from enterprise to hyperscale SKUs: 1.5 TB (30B parameters), 12 TB (240B), 24 TB (480B), 36 TB (720B), 120 TB (2,400B), 1,200 TB (24,000B)
  • 56. © 2024 Cerebras Systems Inc. All Rights Reserved MemoryX Compute & State: Most Scalable and Efficient Model Memory, Enabling Multi-Trillion Parameter Models. Efficient hybrid state store • Weights stored in DDR5 and Flash • Perf and power/cost efficiency. Flexible compute • Optimizer and other ops run on CPU • General purpose and flexible • Support for all common ML ops. Cluster options: CS-2 MemoryX: 12 TB DDR4 DRAM (240B params), 1x CPU perf; CS-3 MemoryX: 36 TB DDR5 DRAM (720B params) plus 1.2 PB flash (24T params), 2x CPU perf
  • 57. © 2024 Cerebras Systems Inc. All Rights Reserved Large Cluster Memory on a Single Device
  • 58. © 2024 Cerebras Systems Inc. All Rights Reserved Train Tomorrow's Trillion+ Parameter Models. Imagine… LLaMA 1T training: ~1.5 years on 1000s of GPUs vs. ~3 weeks on a Cerebras CS-3 cluster. And the CS-3 cluster still operates like a single device
  • 59. © 2024 Cerebras Systems Inc. All Rights Reserved You Program It Like A Single Device No Matter The Cluster Size: whether a Wafer Scale Cluster has 1x CS-3, 4x CS-3, or 2048x CS-3 (each with its interconnect and memory), the user sees one big device
  • 60. © 2024 Cerebras Systems Inc. All Rights Reserved And Your Model Always Fits, at 1B or 1T Parameters: Llama 7B on a 1.5 TB cluster, Llama 70B on a 36 TB cluster, Llama 700B on a 1,200 TB cluster; in every case the user still sees one big device
  • 61. © 2024 Cerebras Systems Inc. All Rights Reserved Any Scale While Operating as a Single Device. Real world seamless cluster scaling • User: G42 • Model: Jais30B • Cluster: Condor Galaxy-1 • Experience: “It just worked” • No complex distributed software • No changes to parallelism model • No changes to hyper-parameters. Training SOTA large models everyday • Unique capability enabled by wafer-scale. Chart: Jais30B measured training speedup on CG-1, relative speedup vs. number of CS-2s (1 to 64), resulting in near linear scaling
  • 62. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras Design Philosophy: Massive Compute + Memory for Large Scale Models. GPU: external chip interconnect, low-perf high-power connections, custom proprietary switches, complex distributed software, hybrid model-parallel partitioning. Wafer Scale Engine: on-chip interconnect, “free” high-perf communication, big enough to run the largest models, simple data-parallel only scaling, disaggregated compute and memory
  • 63. © 2024 Cerebras Systems Inc. All Rights Reserved But we can and need to do even better…
  • 64. © 2024 Cerebras Systems Inc. All Rights Reserved But We Can and Need to Do Even Better: Sparsity Solves the Explosive Cost of Gen AI. 40,000x more compute in just 5 years. Current trajectory is unsustainable. We must find more efficient methods. Sparsity is the key. Chart: exaFLOPs to train, 2018-2024, for BERT, GPT-2, Megatron-LM, T5, T-NLG, GPT-3, Jurassic, Gopher, MT-NLG, Chinchilla, LLaMA, and GPT-4 (log scale, 100 to 100,000,000 exaFLOPs)
  • 65. © 2024 Cerebras Systems Inc. All Rights Reserved Neural Networks are Sparse. Sparsity opportunities are everywhere • Neural networks have native sparsity • e.g. ReLU or Dropout • Neural networks can be made sparse • e.g. sparse weights • Models are over-parameterized by design • Training is the act of discovering important weights. Training dense is wasteful and inefficient • But not all hardware can take advantage of all forms of sparsity
  • 66. © 2024 Cerebras Systems Inc. All Rights Reserved Sparsity Acceleration is Memory Bound. Memory bandwidth built for sparsity • Traditional hardware is built for dense: high data reuse, so caching works and little memory bandwidth is needed • Wafer-scale memory is built for sparse: low data reuse, so caching helps little and high memory bandwidth is needed • Enabled by orders of magnitude more memory bandwidth. Memory bandwidth (Byte/FLOP), required vs. available: dense MatMul requires ~0.001, H100 provides 0.003; sparse MatMul requires ~1, WSE-3 provides 2
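A small back-of-envelope sketch of why the Byte/FLOP figures above land where they do, assuming each matrix crosses the memory interface exactly once for a dense matmul (i.e. perfect on-chip reuse); the helper name and the 4096 sizes are illustrative assumptions.

    def dense_matmul_bytes_per_flop(m, k, n, bytes_per_elem=2):
        """Bytes moved per FLOP for C[m,n] = A[m,k] @ B[k,n] with perfect reuse."""
        flops = 2 * m * k * n                                   # one multiply + one add per MAC
        bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B once, write C once
        return bytes_moved / flops

    # A large FP16 matmul needs only ~0.0007 Byte/FLOP, so caches cover dense compute:
    print(round(dense_matmul_bytes_per_flop(4096, 4096, 4096), 4))

With unstructured sparsity that reuse largely disappears: a fetched weight may feed only a handful of FLOPs, which pushes the required intensity toward the ~1 Byte/FLOP figure on the slide and is why the comparison is H100 at 0.003 versus WSE-3 at 2.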
  • 67. © 2024 Cerebras Systems Inc. All Rights Reserved Accelerating All Forms of Sparse Training. Examples of sparse training opportunities • Dynamic activation sparsity • e.g. Google: 95% sparse ReLU FFN in LLMs1 • Structured weight sparsity • e.g. Mistral: 75% sparse FFN MoE 8x7B2 • Unstructured weight sparsity • e.g. Cerebras: 75% sparse SPDF GPT3. Solving unsustainable scaling for training • Only HW to accelerate all forms of sparsity • Even future sparse techniques. Chart: FLOP reduction from sparsity (dense vs. sparse relative FLOPs): ReLU 1.7x, MoE 2.0x, SPDF 2.8x. 1 Li et al., The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers, 2023. 2 Jiang et al., Mixtral of Experts, 2024. 3 Thangarasa et al., SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models, 2023
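For intuition on how FFN sparsity turns into the whole-model FLOP reductions charted above, here is a hypothetical back-of-envelope formula. The assumption that the FFN accounts for roughly two-thirds of transformer FLOPs is mine for illustration only; the slide's 1.7x / 2.0x / 2.8x figures come from the cited papers, not from this formula.

    def overall_flop_reduction(ffn_flop_share, ffn_sparsity):
        """Whole-model FLOP reduction when only the FFN blocks are sparsified.
        ffn_flop_share: assumed fraction of total FLOPs spent in the FFN.
        ffn_sparsity: fraction of FFN FLOPs removed by sparsity."""
        remaining = (1 - ffn_flop_share) + ffn_flop_share * (1 - ffn_sparsity)
        return 1 / remaining

    print(round(overall_flop_reduction(2 / 3, 0.75), 2))   # ~2.0x for a 75% sparse FFN
    print(round(overall_flop_reduction(2 / 3, 0.95), 2))   # ~2.73x for a 95% sparse FFN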
  • 68. © 2024 Cerebras Systems Inc. All Rights Reserved But sparsity can also transform inference on a variety of hardware…
  • 69. © 2024 Cerebras Systems Inc. All Rights Reserved Neural Magic + Cerebras Accelerated Inferencing for LLM Optimization Mark Kurtz CTO Neural Magic
  • 70. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved OUR LEADERSHIP Who are we? AI leader in model optimization and inference server acceleration MIT Professor of Electrical Engineering and Computer Science, ACM Fellow Nir Shavit Co-Founder MIT Research Scientist of Multicore Algorithms and Computational Connectomes Alex Matveev Co-Founder Chief Scientist Former VP of Product and CTO of Google Cloud, former CTO and EVP of Worldwide Engineering for RedHat Brian Stevens CEO of Neural Magic IST Austria Professor of Distributed Computing and Machine Learning Dan Alistarh Principal Research Scientist
  • 71. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Who are we? AI leader in model optimization and inference server acceleration • 200+ accepted papers • 60 patents • GPTQ • SparseGPT • Sparse Fine-Tuning • nm-vllm • DeepSparse • SparseML As a software-delivered solution, we have deep expertise across AI model training and optimization. We invented many of the current AI industry’s state-of- the-art techniques for quantization and sparsification. Our solutions include enterprise inference servers to open-source libraries and a sparsified models repo. OUR LEADERSHIP MIT Professor of Electrical Engineering and Computer Science, ACM Fellow Nir Shavit Co-Founder MIT Research Scientist of Multicore Algorithms and Computational Connectomes Alex Matveev Co-Founder Chief Scientist Former VP of Product and CTO of Google Cloud, former CTO and EVP of Worldwide Engineering for RedHat Brian Stevens CEO of Neural Magic IST Austria Professor of Distributed Computing and Machine Learning Dan Alistarh Principal Research Scientist
  • 72. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Who are we? AI leader in model optimization and inference server acceleration OUR LEADERSHIP MIT Professor of Electrical Engineering and Computer Science, ACM Fellow Nir Shavit Co-Founder MIT Research Scientist of Multicore Algorithms and Computational Connectomes Alex Matveev Co-Founder Chief Scientist Former VP of Product and CTO of Google Cloud, former CTO and EVP of Worldwide Engineering for RedHat Brian Stevens CEO of Neural Magic IST Austria Professor of Distributed Computing and Machine Learning Dan Alistarh Principal Research Scientist • 200+ accepted papers • 60 patents • GPTQ • SparseGPT • Sparse Fine-Tuning • nm-vllm • DeepSparse • SparseML As a software-delivered solution, we have deep expertise across AI model training and optimization. We invented many of the current AI industry’s state-of- the-art techniques for quantization and sparsification. Our solutions include enterprise inference servers to open-source libraries and a sparsified models repo.
  • 73. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Challenges with LLM deployment Deploying to production ! Issues include • Requires lots of compute • Requires lots of memory • Increases latency • Very demanding on inference serving infrastructure • Expensive to operate and support
  • 74. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Challenges with LLM deployment Deploying to production ! Issues include • Requires lots of compute • Requires lots of memory • Increases latency • Very demanding on inference serving infrastructure • Expensive to operate and support Options to resolve • Decrease the size of the LLM • Apply quantization to combat the accuracy issue when model size is reduced
  • 75. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Challenges with LLM deployment Deploying to production ! Issues include • Requires lots of compute • Requires lots of memory • Increases latency • Very demanding on inference serving infrastructure • Expensive to operate and support Options to resolve • Decrease the size of the LLM • Apply quantization to combat the accuracy issue when model size is reduced Llama 2 Size vs Accuracy
  • 76. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Challenges with LLM deployment Deploying to production ! Issues include • Requires lots of compute • Requires lots of memory • Increases latency • Very demanding on inference serving infrastructure • Expensive to operate and support Options to resolve • Decrease the size of the LLM • Apply quantization to combat the accuracy issue when model size is reduced Llama 2 Size vs Accuracy
  • 77. © 2024 Cerebras Systems Inc. All Rights Reserved Before Pruning The solution - Sparsity
  • 78. © 2024 Cerebras Systems Inc. All Rights Reserved The solution - Sparsity After Pruning Before Pruning
  • 79. © 2024 Cerebras Systems Inc. All Rights Reserved The solution - Sparsity • Preserves the model’s accuracy while reducing the size of the model Unstructured Sparsity: After Pruning • Improves inference and training performance Before Pruning
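The before/after pruning pictures on the last three slides correspond to unstructured sparsity: individual low-importance weights are zeroed wherever they sit. A minimal numpy sketch of magnitude pruning is below as a toy illustration; it is not SparseGPT, sparse pre-training, or any Cerebras/Neural Magic implementation.

    import numpy as np

    def magnitude_prune(weights, sparsity):
        """Zero out the smallest-magnitude fraction of weights (unstructured sparsity)."""
        k = int(round(sparsity * weights.size))
        if k == 0:
            return weights.copy(), np.ones_like(weights, dtype=bool)
        threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
        mask = np.abs(weights) > threshold        # keep only weights above the cutoff
        return weights * mask, mask

    w = np.random.randn(1024, 1024).astype(np.float32)
    w_sparse, mask = magnitude_prune(w, sparsity=0.70)
    print(f"non-zero weights remaining: {mask.mean():.1%}")   # ~30%

At 70% sparsity only about 30% of the values remain non-zero, which is where both the accuracy-recovery question and the inference and training speedups discussed on the following slides come from.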
  • 80. © 2024 Cerebras Systems Inc. All Rights Reserved Our research collaboration with Cerebras Create open-source sparse foundational models that organizations can easily deploy and use with faster inference.
  • 81. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta
  • 82. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens 1.7-2.4X Reduction in FLOPS
  • 83. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse 90% Accuracy Recovery
  • 84. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Off the Shelf Sparse Fine-Tuning Quantization with GPTQ Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse
  • 85. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse Off the Shelf Sparse Fine-Tuning Quantization with GPTQ Chat 50%, 70% Code Generation 50%, 70%
  • 86. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras Proprietary & Confidential Information Results Full recovery with 50% and 70% sparse models. Sparsity vs Accuracy for UltraChat 200k Sparsity vs Accuracy for Evol Code Alpaca
  • 87. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras Proprietary & Confidential Information Results 4.3X Memory Reduction Memory Usage vs Compression Level - Llama 2 7B
  • 88. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse Off the Shelf Sparse Fine-Tuning Quantization with GPTQ Chat 50%, 70% Code Generation 50%, 70%
  • 89. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Off the Shelf Sparse Fine-Tuning Quantization with GPTQ Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse Chat 50%, 70% Code Generation 50%, 70% Fine-Tuning Your Use Case Sparse Fine- Tuning for a few hours Quantization with GPTQ
  • 90. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Off the Shelf Sparse Fine-Tuning Quantization with GPTQ DeepSparse Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Fine-Tuning Your Use Case Sparse Fine- Tuning for a few hours Quantization with GPTQ Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse Chat 50%, 70% Code Generation 50%, 70%
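The last box in the pipeline above is quantization with GPTQ before deployment on DeepSparse. GPTQ itself uses second-order (Hessian-based) error correction when choosing quantized values; the sketch below shows only plain round-to-nearest symmetric INT8 quantization per output channel, so the storage arithmetic is visible. The function names and the per-channel choice are illustrative assumptions, not the GPTQ algorithm or a Neural Magic API.

    import numpy as np

    def quantize_int8_per_channel(w):
        """Symmetric per-output-channel INT8 quantization of a [out, in] weight matrix.
        Returns int8 weights plus the per-channel scale needed to dequantize."""
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        scale = np.where(scale == 0, 1.0, scale)          # avoid divide-by-zero rows
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale.astype(np.float32)

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 1024).astype(np.float32)
    q, scale = quantize_int8_per_channel(w)
    err = np.abs(dequantize(q, scale) - w).mean()
    print(q.dtype, f"mean abs error {err:.4f}", f"{w.nbytes / q.nbytes:.0f}x smaller")

Going from FP32 to INT8 alone is a 4x reduction in weight storage, in the same ballpark as the memory-reduction figure quoted a few slides earlier; sparse storage formats can push it further.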
  • 91. © 2024 Cerebras Systems Inc. All Rights Reserved Local inference performance With sparsity, real time chat is now possible on local CPUs. Single Stream Token Generation - Llama 2 7B Single Stream Latency - Llama 2 7B
  • 92. © 2024 Cerebras Systems Inc. All Rights Reserved Server inference performance With sparsity, CPU performance is competitive with GPUs. Single Stream Decode Performance - Llama 2 7B Multi Stream Decode Performance - Llama 2 7B
  • 93. © 2024 Cerebras Systems Inc. All Rights Reserved Comparison, using Neural Magic DeepSparse on an 8-core AMD Genoa CPU: unoptimized model (Llama 2 7B, FP32): 2 tokens/second; sparse quantized model (Llama 2 7B, 70% sparse, INT8): 20 tokens/second
  • 94. © 2024 Cerebras Systems Inc. All Rights Reserved Key takeaways. Takeaway 1: Run SOTA models in real time on just a laptop with Neural Magic DeepSparse, up to 4x faster than llama.cpp. Takeaway 2: Transform your infrastructure with just software to support LLMs, with up to 7x more inference streams per server than llama.cpp at the same performance level. Takeaway 3: Train sparse models faster with Cerebras, with 2x faster sparse training
  • 95. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Next steps Neural Magic’s Hugging Face Organization Cerebras Blog • Arxiv paper with our current results • Larger models • Higher sparsities • INT4 quantization support • Combine with parameter efficient fine-tuning Stay tuned for more collaboration with Cerebras Neural Magic Docs
  • 96. © 2024 Cerebras Systems Inc. All Rights Reserved Thank you Follow us to stay current on all things Neural Magic, including product updates, ML research developments, and more. @neuralmagic Join our Community Engage with fellow ML practitioners. Ask questions, share feedback, and improve the way you use Neural Magic. Connect with Neural Magic to stay up to date with #SoftwareDelivered AI. neural-magic
  • 97. © 2024 Cerebras Systems Inc. All Rights Reserved Models & Product Jessica Liu, VP of Product, Cerebras
  • 98. © 2024 Cerebras Systems Inc. All Rights Reserved The goal of AI training: make the loss curve go down
  • 99. © 2024 Cerebras Systems Inc. All Rights Reserved ⚠ But it’s not so simple...
  • 100. © 2024 Cerebras Systems Inc. All Rights Reserved This happens all the time
  • 101. © 2024 Cerebras Systems Inc. All Rights Reserved Model performance can vary greatly
  • 102. © 2024 Cerebras Systems Inc. All Rights Reserved Challenges of large GenAI training & fine-tuning: 1. Distribution, 2. ML complexity, 3. Cost. Failure modes include out of memory, GPU failures, numerics bugs, and low utilization. Lots of time and cost riding on "getting the big run right"
  • 103. © 2024 Cerebras Systems Inc. All Rights Reserved How to get good model quality at scale Run Experiments Pick Winners Scale Up Design the Experiments 1.3 B 500M
  • 104. © 2024 Cerebras Systems Inc. All Rights Reserved How to get good model quality at scale Run Experiments Pick Winners Scale Up Design the Experiments 1.3 B 3B
  • 105. © 2024 Cerebras Systems Inc. All Rights Reserved How to get good model quality at scale: Design the Experiments, Run Experiments, Pick Winners, Scale Up. 3B, 7B: good config for 13B, 30B
  • 106. © 2024 Cerebras Systems Inc. All Rights Reserved Run Experiments Pick Winners Scale Up Design the Experiments How to get good model quality at scale Time / Work .5 B 3 B 13 B 100B
  • 107. © 2024 Cerebras Systems Inc. All Rights Reserved How to get good model quality at scale (on GPUs): Design the Experiments, Run Experiments, Pick Winners, Scale Up (time / work). 0.5B on 1 GPU; 3B on 8 GPUs with data parallelism; 13B on 256 GPUs with data & tensor & pipeline parallelism; 100B on 2048 GPUs with data & tensor & pipeline & expert & sequence parallelism
  • 108. © 2024 Cerebras Systems Inc. All Rights Reserved You have to micromanage the distribution strategy: • Tensor or pipeline model parallelism • Distributed data parallelism • Expert parallelism • Interleaved pipelining schedule • Activation checkpointing & recomputation • Interplay among model size, cluster size, connectivity between nodes, number of nodes, etc. Scaling frameworks still require tons of work
  • 109. © 2024 Cerebras Systems Inc. All Rights Reserved Lines of Code ---------------------------- Python 18395 C/C++ 1118 C++ 649 CUDA 220 HTML 107 Bourne Shell 9 make 7 Markdown 1 Text 1 ---------------------------- Total 20507 ---------------------------- Nvidia’s GPT-175B Model 20,000 lines of code, weeks to implement Hard to debug You have to micromanage the distribution strategy: • Tensor or pipeline model parallelism • Distributed data parallelism • Expert parallelism • Interleaved pipelining schedule • Activation checkpointing & recomputation • Interplay among model size, cluster size, connectivity between nodes, number of nodes, etc. Scaling frameworks still require tons of work
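To make the "micromanage the distribution strategy" list concrete, here is a small, hypothetical helper that enumerates the (data, tensor, pipeline) degrees a GPU stack forces you to choose among for one model/cluster pair. Nothing here is Megatron's or any framework's real API, and the 18-bytes-per-parameter memory model is a deliberately crude assumption; the point is the combinatorial tuning burden, which does not exist in the Cerebras flow shown on the following slides.

    import itertools

    def candidate_parallelism_plans(num_gpus, num_layers, params_b, gpu_mem_gb=80,
                                    bytes_per_param=18):
        """Enumerate (data, tensor, pipeline) degrees whose product divides num_gpus and
        whose per-GPU share of weights + optimizer state fits in memory.
        bytes_per_param=18 is a rough mixed-precision Adam estimate (assumption)."""
        plans = []
        for tp, pp in itertools.product([1, 2, 4, 8], [1, 2, 4, 8, 16]):
            if num_gpus % (tp * pp) or num_layers % pp:
                continue
            dp = num_gpus // (tp * pp)
            per_gpu_gb = params_b * bytes_per_param / (tp * pp)
            if per_gpu_gb <= gpu_mem_gb * 0.8:            # leave room for activations
                plans.append((dp, tp, pp, round(per_gpu_gb, 1)))
        return plans

    # A 100B-parameter model on 256 GPUs still leaves several plans to benchmark,
    # and changing cluster size or sequence length reshuffles the ranking.
    for plan in candidate_parallelism_plans(256, num_layers=96, params_b=100):
        print("data x tensor x pipeline =", plan[:3], "~GB/GPU:", plan[3])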
  • 110. © 2024 Cerebras Systems Inc. All Rights Reserved Cut experiment iteration time from weeks to a day Lines of Code ---------------------------- Python 18395 C/C++ 1118 C++ 649 CUDA 220 HTML 107 Bourne Shell 9 make 7 Markdown 1 Text 1 ---------------------------- Total 20507 ---------------------------- Lines of Code ---------------------------- Python 565 C/C++ 0 C++ 0 CUDA 0 HTML 0 Bourne Shell 0 make 0 Markdown 0 Text 0 ---------------------------- Total 565 ---------------------------- Cerebras’ GPT-175B Model 565 lines of code, 1 Day to implement "GPT-3 in 565 lines of code" Blog Nvidia’s GPT-175B Model 20,000 lines of code, weeks to implement Hard to debug
  • 111. © 2024 Cerebras Systems Inc. All Rights Reserved How to scale from 1B to 70B on Cerebras
    gpt3_1b_params.yaml:
      ### GPT-3 XL 1.3B
      hidden_size: 2048
      num_hidden_layers: 24
      num_heads: 16
    Training: python run.py --params gpt3_1b_params.yaml --num_steps=100 --model_dir=model_dir
    llama2_70b_params.yaml:
      ### Llama-2 70B
      hidden_size: 8192
      num_hidden_layers: 80
      num_heads: 64
    Training: python run.py --params llama2_70B_params.yaml --num_steps=100 --model_dir=model_dir
  • 112. © 2024 Cerebras Systems Inc. All Rights Reserved Scaling from one CS-3 to a cluster is a 1-line change
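A sketch of what that one-line change can amount to. The num_csx key and the values below are assumptions for illustration (the exact knob in the Cerebras software stack may be named differently); the model definition, hyperparameters, and training script are identical in both cases.

    # Two hypothetical run configurations: same model, same hyperparameters,
    # differing only in how many CS-3 systems the job targets.
    single_system = {"model": "llama2_70b", "batch_size": 1024, "num_csx": 1}
    sixteen_systems = {"model": "llama2_70b", "batch_size": 1024, "num_csx": 16}

    changed_keys = [k for k in single_system if single_system[k] != sixteen_systems[k]]
    print(changed_keys)   # ['num_csx'] -- the only difference between the two runs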
  • 113. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras gets you to high-quality large models faster & more cheaply: on CS-3, data parallel only at any model size. Design Sweeps, Run Experiments, Pick Winners, Scale Up (time / work), from 0.5B to 3B to 13B to 100B
  • 114. © 2024 Cerebras Systems Inc. All Rights Reserved On GPUs, small models are the default; large models take large engineering effort. On CS-3s, large models are the default; small models come for free.
  • 115. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras Proprietary & Confidential Information Med42: Llama-70B Fine-tuned in <1 Week to Pass the US Medical License Exam • Scored 72% on USMLE, beating GPT-3.5 • With M42: global healthcare company with over 450 hospitals and clinics • Custom curated healthcare dataset of peer-reviewed papers, medical textbooks, international health agency datasets. • Run finished in 1 weekend
  • 116. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras Proprietary & Confidential Information FLOR-6.3B: State-of-the-Art Catalan, Spanish, and English LLM • Best Catalan model, beating BLOOM-7.3B • Used latest language adaptation techniques for languages with less training data • Reduced inference cost by 10% vs. BLOOM, incorporating a new, more efficient tokenizer • Used to build RAG systems for specialized domains • Trained on 140B tokens in 2.5 days • Open source: downloaded over 3,000 times
  • 117. © 2024 Cerebras Systems Inc. All Rights Reserved JAIS-30B: State-of-the-Art Arabic-English Bilingual LLM • SoTA Arabic: Outperforms all other Arabic models • English: Llama-30B quality in English • Co-developed with G42’s Core42 and MBZUAI • Now on Azure AI Cloud as the foundation of their Model-as-a-Service in the Middle East Checkpoints on HuggingFace Paper available on Arxiv
  • 118. © 2024 Cerebras Systems Inc. All Rights Reserved Challenges: (1) Few high-quality Arabic datasets and preprocessing pipelines; (2) Tokenizers trained on English corpora don't extend well to Arabic; (3) Want highest quality model with best cost and compute efficiency. What we did: used the latest ML techniques (ALiBi, SwiGLU activation, muP, scaling laws); ran many tuning experiments on models of 590M, 1.3B, 2.7B, 6.7B; trained a custom tokenizer with a new vocab optimized for cross-lingual alignment; built a new multi-lingual dataset, experimenting with mixes of Arabic-only and Arabic, English, and code, to find the optimal mix (1:2:0.4)
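As a concrete reading of the 1:2:0.4 Arabic:English:code mix mentioned above, the ratio normalizes into sampling proportions as follows (a toy illustration; the actual JAIS data pipeline involves much more than this).

    mix_ratio = {"arabic": 1.0, "english": 2.0, "code": 0.4}
    total = sum(mix_ratio.values())
    proportions = {name: round(weight / total, 3) for name, weight in mix_ratio.items()}
    print(proportions)   # {'arabic': 0.294, 'english': 0.588, 'code': 0.118}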
  • 119. © 2024 Cerebras Systems Inc. All Rights Reserved "I’ve found it really easy to experiment at every model size and scale on multiple CS systems, which we need to do to get the best results. There’s no difference between running a job on a single CS versus multiple ones. All it takes is a small config change, and everything just works with observable linear speedup! Launched my first distributed LLM training within the first hour of logging into a CS cluster for the first time!” Neha Sengupta, Principal Applied Scientist, Core42
  • 120. © 2024 Cerebras Systems Inc. All Rights Reserved Jais-30B-v3 sets new record for open-source Arabic LLMs, finishes training on 1.3 Trillion tokens. Jais-30B outperforms on all common NLP benchmarks in Arabic (MMLU / Hellaswag / ARC-C / TruthfulQA): Jais-30b-chat: 35.1 / 59.3 / 39.1 / 53.1; acegpt-13b-chat: 31.2 / 49.2 / 35.1 / 48.2; BLOOMz (7.1B): 31.0 / 38.1 / 30.2 / 48.4; LLaMA (30B): 28.9 / 33.9 / 26.9 / 48.4; falcon-40b_instruct: 28.6 / 32.1 / 26.4 / 49.3
  • 121. © 2024 Cerebras Systems Inc. All Rights Reserved The Future is Multimodal
  • 122. An explosion of exploration in multimodality Source: Recent advances in Multimodal LLMs
  • 123. © 2024 Cerebras Systems Inc. All Rights Reserved • Generalized support for Visual Q&A: • Multiple vision encoders • Multiple LLM backbones • Cross-projection learning • Multiple modalities to an LLM backbone • Easy scaling for model size and context length • Easy to configure many leading literature models (e.g. LLaVA, AnyMAL, Eyes Wide Shut) • Dataset: support for quick import of custom datasets Multimodality is easy on Cerebras Multimodal Output CLIP Llama SigLIP DinoV2 Mistral Zephyr Plug & play vision & LLM backbones
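The "plug & play" pairing of vision encoders with LLM backbones described above typically reduces to a small projection module between the two, as in LLaVA-style models: encode the image into patch embeddings, project them into the LLM's embedding space, and prepend them to the text tokens. A minimal numpy sketch is below; the dimensions, class name, and the plain linear projector are illustrative assumptions, not the ModelZoo implementation.

    import numpy as np

    class LinearProjector:
        """Maps vision-encoder patch embeddings into the LLM's embedding space,
        so image tokens can be concatenated with text tokens."""
        def __init__(self, vision_dim, llm_dim, rng):
            self.w = rng.standard_normal((vision_dim, llm_dim)) * 0.02
            self.b = np.zeros(llm_dim)

        def __call__(self, patch_embeddings):
            return patch_embeddings @ self.w + self.b

    rng = np.random.default_rng(0)
    vision_patches = rng.standard_normal((576, 1024))   # e.g. a CLIP-style patch grid
    text_tokens = rng.standard_normal((32, 4096))       # e.g. Llama-style text embeddings
    projector = LinearProjector(vision_dim=1024, llm_dim=4096, rng=rng)
    llm_input = np.concatenate([projector(vision_patches), text_tokens], axis=0)
    print(llm_input.shape)   # (608, 4096): image tokens followed by text tokens

Swapping the vision encoder or the LLM backbone then mostly means changing the two dimensions and the pretrained weights on either side of this glue.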
  • 124. © 2024 Cerebras Systems Inc. All Rights Reserved Demo
  • 125. © 2024 Cerebras Systems Inc. All Rights Reserved Demo
  • 126. © 2024 Cerebras Systems Inc. All Rights Reserved Reproducing state-of-the-art results in just a couple weeks. 7B parameter models (GQA / VQA(t) / VQA(v2) / POPE): LLaVA1.5 (7B): 62.0 / 58.2 / 78.5 / 85.9; Cerebras-LLaVA 1.5 (7B): 62.3 / 58.2 / 78.5 / 85.3; SGPT4V (7B): 63.3 / 60.4 / 80.6 / not reported; Cerebras-SGPT4V (7B): 63.5 / 60.8 / 80.7 / 85.7. 13B parameter models (GQA / VQA(t) / VQA(v2) / POPE): LLaVA1.5 (13B): 63.3 / 61.3 / 80.0 / 85.9; Cerebras-LLaVA 1.5 (13B): 64.2 / 63.4 / 82.0 / 85.8
  • 127. Reproducing state-of-the-art results in just a couple weeks. Improving (POPE / GQA / VQAt / MME / VQAv2): CS3-LLaVA-7B: 86.7 / 63.9 / 61.5 / 1573 / 81.4; LLaVA 1.5 13B HD: 86.3 / 64.7 / 62.5 / 1500 / 81.8. The 7B model is competitive with LLaVA 1.5 13 Billion HD, which is 2x larger and takes 1.7x higher resolution image input, and which came out <2 months ago
  • 128. © 2024 Cerebras Systems Inc. All Rights Reserved Get started quickly with Cerebras ModelZoo Model code with flexible configuration setup • Different image encoders: • CLIP • SigLIP • Dino v2 • Different LLM backbones: • LLaMA • Mistral • Zephyr • Different training recipes: • LLaMA Pro • Eyes Wide Shut • Freezing different parts of the model Prepared Datasets • LLAVA 1.5, ShareGPT4V, Instruct4V • ChartQA, DocVQA, DVQA, ArxivQA, AI2Diagrams Data pre-processing scripts • HDF5 file generation support • Handles mix of multimodal and text-only data • Optimized for high-throughput training Easy scaling for model and data • LLM model size • Long context lengths • Image resolution and patch size
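A small sketch of the kind of fixed-length HDF5 shard such pre-processing scripts produce, assuming sequences that are already tokenized; the dataset names, the loss-mask convention, and the packing scheme here are illustrative assumptions rather than the ModelZoo's actual schema.

    import h5py
    import numpy as np

    def write_hdf5_shard(path, token_sequences, max_seq_len=128, pad_id=0):
        """Pack variable-length token sequences into fixed-length rows with a loss mask."""
        n = len(token_sequences)
        input_ids = np.full((n, max_seq_len), pad_id, dtype=np.int32)
        loss_mask = np.zeros((n, max_seq_len), dtype=np.int8)
        for i, seq in enumerate(token_sequences):
            seq = seq[:max_seq_len]                      # truncate anything too long
            input_ids[i, :len(seq)] = seq
            loss_mask[i, :len(seq)] = 1                  # only real tokens contribute to the loss
        with h5py.File(path, "w") as f:
            f.create_dataset("input_ids", data=input_ids, compression="gzip")
            f.create_dataset("loss_mask", data=loss_mask, compression="gzip")

    write_hdf5_shard("shard_000.h5", [[5, 17, 902, 3], [8, 8, 24]])

Writing many such shards up front is what makes high-throughput, sequential reads possible during training, including when multimodal and text-only samples are mixed.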
  • 129. © 2024 Cerebras Systems Inc. All Rights Reserved Model Checkpoints Available on HuggingFace 7B – available now 13B – available now 70B – end of March!
  • 130. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras’ goal is to bring State-of-the-Art AI to every organization
  • 131. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras solutions meet you wherever you need Cerebras Wafer Scale Clusters Cerebras Cloud Cerebras AI Solutions
  • 132. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras AI Model Services GenAI Success with Cerebras ML Experts on the Fastest, Most Efficient Platform • Speed: Multi-Billion param models in days to weeks. • Tailored to you: Custom chatbots, VQA Systems, Code Completion, Foundation models, and more • All the latest ML Techniques: RAG, DPO, LoRA, MuP, data augmentation, and more. • Total Ownership: Your data, your model weights.
  • 133. © 2024 Cerebras Systems Inc. All Rights Reserved Models on Cerebras From multi-lingual LLMs to healthcare chatbots to code models.
  • 134. © 2024 Cerebras Systems Inc. All Rights Reserved All the Latest ML Techniques & Recipes Variable Seq Training DPO LL360 – Open data, models, scripts Multi-lingual Pre-training & IFT Llama70B fine tuning Domain Adaptation GPT-3 in 565 lines of code Most FLOP efficient LLM dataset First family of open GPT models and OSS use of muP RAG LoRA MoE Multi Modal Sparse Models
  • 135. © 2024 Cerebras Systems Inc. All Rights Reserved The model belongs to you Your data stays with you
  • 136. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras AI Supercomputers: Exascale compute with the programmability of a single device, available in the cloud or on-prem
  • 137. © 2024 Cerebras Systems Inc. All Rights Reserved AI Applications & Research Panel Andy Hock, SVP Product & Strategy, Cerebras
  • 138. Cerebras AI Applications & Research Panel: Praneetha Elugunti, Mayo Clinic; Jim Culver, GSK; Tim Bishop, Mayo Clinic; Irina Rish, University of Montreal; Andy Hock, Cerebras
  • 139. Cerebras x Qualcomm Fireside Chat with Rashid Attar, VP of Cloud Computing, Qualcomm
  • 140.
  • 141. Cerebras x Qualcomm Technology Partnership Reducing Inference Cost by 10x Cerebras CS-3 AI Training Qualcomm Cloud AI100 Ultra AI Inference
  • 142. Jointly optimized software stack for cost efficient LLMs Cerebras Stack Qualcomm Stack Sparse training Sparse inference Train in FP16 Compile & run in MX6 Train large + small models Apply speculative decoding Network Architecture Search Compile & run on Ultra AI 100
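Of the techniques in the joint stack above, speculative decoding is the one that changes the decoding loop itself: a small draft model proposes a few tokens and the large target model verifies them. Below is a greedy-acceptance sketch with stand-in next-token functions; real deployments verify the draft in one batched target pass and use a probabilistic acceptance rule, and none of the names here come from the Cerebras or Qualcomm stacks.

    def speculative_decode(target_next, draft_next, prompt, num_new_tokens, k=4):
        """Greedy speculative decoding sketch. target_next / draft_next map a token
        list to the next token id. The draft proposes k tokens; the target keeps the
        prefix it agrees with and always contributes at least one token itself."""
        tokens = list(prompt)
        goal = len(prompt) + num_new_tokens
        while len(tokens) < goal:
            draft_tokens = []
            for _ in range(k):                                # cheap autoregressive draft
                draft_tokens.append(draft_next(tokens + draft_tokens))
            accepted = []
            for i in range(k):                                # verification by the target
                t = target_next(tokens + accepted)
                accepted.append(t)                            # target's token is always kept
                if t != draft_tokens[i]:
                    break                                     # first disagreement ends the block
            tokens.extend(accepted)
        return tokens[:goal]

    # Toy stand-ins: the target counts up by one; the draft occasionally guesses wrong.
    target = lambda toks: toks[-1] + 1
    draft = lambda toks: toks[-1] + 1 if len(toks) % 7 else toks[-1] + 2
    print(speculative_decode(target, draft, prompt=[0], num_new_tokens=10))

The output matches plain greedy decoding with the target model; the speedup comes from batching the verification calls, not from changing what is generated.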
  • 143. Cerebras x Qualcomm: Up to 10x Inference Performance. Total tokens / $: Baseline 1x, Speculative Decoding 1.8x, MX6 Compression 2.2x, Neural Architecture Search 2.5x, Sparsity 2.5x, combined ~10x
  • 144. Cerebras x G42 Fireside Chat with Kiril Evtimov, Group CTO G42 & CEO Core42
  • 145. G42 across the Entire AI Value Chain Customer & Industry Tailored Solutions Data Centers Compute Infrastructure Cloud Platforms AI Model Development Cloud & Enterprise AI Deployment Application Development
  • 146. The world's largest open-source Arabic LLM: a 30B parameter, bilingual Arabic-English model, trained on 476B Arabic tokens and 1.63T total tokens on the Condor Galaxy 1 and 2 AI Supercomputers