Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip
© 2024 Cerebras Systems Inc. All Rights Reserved
Andrew Feldman
CEO & Co-Founder Cerebras
© 2024 Cerebras Systems Inc. All Rights Reserved
AI Has Fundamentally Changed
Computing
AI Supercomputers
x86 Servers
© 2024 Cerebras Systems Inc. All Rights Reserved
There’s a vast chasm in
AI capabilities
AI Developers Are Struggling with
Distributed GPU Training
© 2024 Cerebras Systems Inc. All Rights Reserved
“It can be a frustrating daily life
experience of training large models…You're
there carefully monitoring the vital signs of
your run: loss spikes, numerical issues,
throughput, gradient norms, policy
entropy, etc... or 10,000 GPUs could be
idling.”
Co-Founder, OpenAI
© 2024 Cerebras Systems Inc. All Rights Reserved
Co-Founder, Reka AI; former Google Brain scientist
“Multi-node GPU training is more of
an afterthought as opposed to
distributed training as a first class
citizen…it’s a hardware lottery."
© 2024 Cerebras Systems Inc. All Rights Reserved
“Building large scale training
clusters from scratch and achieving
high MFU and reliability is damn
hard”
Senior Foundation Model Engineer, Uber
GPT-1
120M Parameters
4 Contributors
GPT-4
1.7T Parameters
240+ contributors
35 just for distributed training
& supercomputing
© 2024 Cerebras Systems Inc. All Rights Reserved
Large Models Simply Don’t Fit on GPUs
ChatGPT (28TB)
H100 (80GB)
© 2024 Cerebras Systems Inc. All Rights Reserved
Developers must cut the model into many pieces…
© 2024 Cerebras Systems Inc. All Rights Reserved
And spread them on hundreds of GPUs
© 2024 Cerebras Systems Inc. All Rights Reserved
An ML problem just turned into a parallel programming problem.
A hardware problem just became a supercomputer problem.
Then re-write the model to work across a cluster
© 2024 Cerebras Systems Inc. All Rights Reserved
This causes a code
explosion
nanoGPT
1B Parameters
639 lines of code
Megatron
100B Parameters
20,507 lines of code
© 2024 Cerebras Systems Inc. All Rights Reserved
You never have to do this on Cerebras
© 2024 Cerebras Systems Inc. All Rights Reserved
The Cerebras Way
Build a compute & memory system that’s vastly larger than the model
Cerebras CS-3 = 1,200 TB
ChatGPT
© 2024 Cerebras Systems Inc. All Rights Reserved
4 trillion transistors
46,225 mm² silicon
900,000 cores optimized for sparse
linear algebra
5nm TSMC process
125 Petaflops of AI compute
44 Gigabytes of on-chip memory
21 PByte/s memory bandwidth
214 Pbit/s fabric bandwidth
Cerebras
Wafer-Scale Engine
The fastest AI chip on Earth, again
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras Wafer Scale Engine 3 versus the H100
• Cerebras WSE-3: 4 trillion transistors, 46,225 mm² silicon
• Largest GPU: 80 billion transistors, 814 mm² silicon
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras CS-3
© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3
SwarmX
MemoryX
Wafer Scale Cluster: The World’s Most Scalable
AI Supercomputer
Scales from 1 CS-3 (125 petaflops) to 2,048 CS-3s (256 exaflops), from 1 terabyte to 1 petabyte of parameter memory, and from 1 billion to 24 trillion parameters.
© 2024 Cerebras Systems Inc. All Rights Reserved
Exa-scale Performance
© 2024 Cerebras Systems Inc. All Rights Reserved
Single Device Simplicity
MemoryX Memory Units
SwarmX Interconnect
Wafer Scale Engines
1 to 2048 CS-3s Look and Program Like a Single Device
© 2024 Cerebras Systems Inc. All Rights Reserved
Condor Galaxy 2
Stockton, California
© 2024 Cerebras Systems Inc. All Rights Reserved
Condor Galaxy 3 AI Supercomputer
Dallas, Texas
• 64 CS-3 nodes
• 58 million AI cores
• 8 exaFLOPS FP16 AI compute
• 108 TB parameter memory
• 388 Tbps on-chip bandwidth
© 2024 Cerebras Systems Inc. All Rights Reserved
AI Supercomputers
Built & Operated in the United States
• Condor Galaxy 1 — Santa Clara, CA: 4 ExaFLOPs, 64x CS-2s, 82 TB of memory (online)
• Condor Galaxy 2 — Stockton, CA: 4 ExaFLOPs, 64x CS-2s, 82 TB of memory (online)
• Condor Galaxy 3 — Dallas, TX: 8 ExaFLOPs, 64x CS-3s, 108 TB of memory (Q2 2024)
© 2024 Cerebras Systems Inc. All Rights Reserved
CEO of Microsoft
Satya Nadella
■ JAIS 30B parameter, bilingual
Arabic-English model
■ Microsoft’s core LLM offering
in the Middle East
■ Available on Azure
Cerebras & G42
World leading Arabic LLM
© 2024 Cerebras Systems Inc. All Rights Reserved
“Mayo Clinic selected Cerebras
as its first generative AI
collaborator for its large-scale,
domain-specific AI expertise to
accelerate breakthrough insights
for the benefit of patients.”
Cerebras & Mayo Clinic
Breakthrough insights for the
benefit of patients
Medical Director for Strategy at Mayo Clinic
Dr. Matthew Callstrom
© 2024 Cerebras Systems Inc. All Rights Reserved
“When the largest problem is
solved, a speedup of 228x is
achieved... Moreover…it is unlikely
that such a performance gap can
be closed… given the strong
scalability issues encountered by
this kind of algorithm when using a
large number of multi-GPU nodes
in HPC clusters.”
Cerebras & TotalEnergies
Diego Klahr, VP of Engineering at TotalEnergies
© 2024 Cerebras Systems Inc. All Rights Reserved
A Cerebras cluster with 48 systems exceeded the performance of ‘Frontier’, the world’s #1 supercomputer with 37,000 GPUs, at a 100x cost saving.
Cerebras & KAUST
Tony Chan
President, KAUST
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras CS-3 Architecture Deep Dive
Sean Lie, CTO and Co-Founder, Cerebras
© 2024 Cerebras Systems Inc. All Rights Reserved
• 2x performance
• Same power
• Same price
Cerebras CS-3: A Generational Leap for AI
LLM Training Performance
© 2024 Cerebras Systems Inc. All Rights Reserved
WSE-3 Core
• Building on the tried-and-true WSE-2 core…
WSE-2 core: 4-way 16b SIMD; registers: 16 general purpose, 44 data structure; memory: 48 kB SRAM with a 256 B cache; fabric interface.
© 2024 Cerebras Systems Inc. All Rights Reserved
Improved performance for AI compute
• New higher performance tensor operations
• New 8-way SIMD for 16b data (FP/BF16)
• New 16-way SIMD for 8b data (Fixed/INT8)
• New faster non-linear functions
• 2x higher compute performance core
High bandwidth memory and cache
• 48kB memory per core
• New 512B local cache per core
• Full bandwidth for full SIMD performance
WSE-3 Core: Continuing Distributed AI Architecture Leadership
WSE-3 core: 8-way 16b SIMD and 16-way 8b SIMD; registers: 16 general purpose, 48 data structure; memory: 48 kB SRAM with a 512 B cache; fabric interface.
© 2024 Cerebras Systems Inc. All Rights Reserved
From Small Core to Massive Wafer
Core → die (10.7k cores) → WSE-3 (84 die, 900k cores)
© 2024 Cerebras Systems Inc. All Rights Reserved
Uniquely capable of wafer-scale integration
• Invented process in first generation WSE
• Extended to 5nm in collaboration with
TSMC
Co-designed from ground up
• Uniform architecture with built-in
redundancy
• Extending uniform fabric across die
• Wafer behaves as single massive chip
WSE-3 Interconnect
Enabling the Only Wafer Scale Chip in the World
© 2024 Cerebras Systems Inc. All Rights Reserved
WSE-3 Interconnect
Enabling the Biggest Chip in the World
Traditional (GPU): serial links across connectors, PCBs, and cables
• Each H100: 900 GB/s (36x 100 Gb/s serial), 36 W, 5.0 pJ/bit
• 8x H100: 7.2 TB/s (288x 100 Gb/s serial), 288 W
Wafer Scale Engine: parallel links across <1 mm of silicon
• Each die: 2,880 GB/s (480x 24 Gb/s parallel), 1.1 W, 0.05 pJ/bit
• 84x die: 242 TB/s (40,320x 24 Gb/s parallel), 92 W
Result: 10x more die, 33x more bandwidth, 100x more power efficient
*GPU estimate uses 5nm 100G serdes power with Nvidia H100 NVLink bandwidth
© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 System: Purpose Built for Wafer-Scale
© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 vs. GPU: Orders of Magnitude Performance Advantage (Cerebras CS-3 vs. Nvidia H100)
• Chip size: 46,225 mm² vs. 814 mm² (57x)
• Cores: 900,000 vs. 16,896 FP32 + 528 Tensor (52x)
• On-chip memory: 44 gigabytes vs. 0.05 gigabytes (880x)
• Memory bandwidth: 21 petabytes/sec vs. 0.003 petabytes/sec (7,000x)
• Fabric bandwidth: 214 petabits/sec vs. 0.0576 petabits/sec (3,715x)
Enabling large scale training
Finetune LLaMA 70B on 1B tokens in a day
on a single chip
© 2024 Cerebras Systems Inc. All Rights Reserved
Cluster natively operates as single device
WSE-3 is big enough to run largest models
• Enables compute and memory
disaggregation
• Train with data-parallel only scaling
Architect cluster-level memory and compute
• External memory stores model weights
• Untangle memory and compute
dependency
CS-3 Cluster
Designed as Single ML Accelerator
…
SwarmX Interconnect
MemoryX Memory Units
Wafer Scale Engines
© 2024 Cerebras Systems Inc. All Rights Reserved
Model capacity not limited by device
• Weights streamed onto wafer to compute
layer
• Weights trigger compute using HW
dataflow
• Weights are never stored on wafer
Decoupling weight optimizer compute
• Gradients streamed out of wafer
• Weight update occurs in MemoryX
MemoryX External Memory
Virtually Unlimited Model Weight Capacity
Memory hierarchy capable of massive models on single device
(Diagram: weights stream from MemoryX onto the CS-3; gradients stream back out; optimizer compute and weight memory live in MemoryX.)
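To make the streaming execution model above concrete, here is a minimal, hedged sketch in plain Python/NumPy. The MemoryUnit and Wafer classes, shapes, and toy loss are illustrative stand-ins, not Cerebras APIs; the point is only the division of labor: weights stream in per layer, activations stay on the wafer, gradients stream out, and the optimizer step runs in the external memory unit.

# Illustrative sketch, not Cerebras APIs: hypothetical MemoryUnit / Wafer classes
# standing in for MemoryX and the WSE, with a toy 2-layer ReLU network.
import numpy as np

class MemoryUnit:
    """Holds all weights and optimizer state; the wafer never stores weights."""
    def __init__(self, layer_sizes, lr=1e-3):
        rng = np.random.default_rng(0)
        self.weights = [rng.standard_normal((m, n)) * 0.02 for m, n in layer_sizes]
        self.lr = lr
    def stream_weights(self, i):
        return self.weights[i]                 # weights streamed onto the wafer, one layer at a time
    def apply_gradient(self, i, grad):
        self.weights[i] -= self.lr * grad      # weight update happens here, not on the wafer

class Wafer:
    """Keeps only activations; computes with whatever weights are streamed in."""
    def forward_backward(self, x, memory):
        acts = [x]
        for i in range(len(memory.weights)):               # forward pass: stream weights in
            acts.append(np.maximum(acts[-1] @ memory.stream_weights(i), 0.0))
        grad = acts[-1]                                     # dummy loss gradient for illustration
        for i in reversed(range(len(memory.weights))):      # backward pass: stream gradients out
            dz = grad * (acts[i + 1] > 0)                   # ReLU derivative
            grad_w = acts[i].T @ dz
            grad = dz @ memory.stream_weights(i).T
            memory.apply_gradient(i, grad_w)

mem = MemoryUnit([(32, 64), (64, 16)])
Wafer().forward_backward(np.random.default_rng(1).standard_normal((8, 32)), mem)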
© 2024 Cerebras Systems Inc. All Rights Reserved
Data-parallel only training across CS-3s
• Weights are broadcast to all CS-3s
• Gradients are reduced on way back
Multi-system scaling with the same
execution model as single system
• Same system architecture
• Same network execution flow
• Same software user interface
SwarmX Fabric
Purpose Built Interconnect for Simple Scaling
(Diagram: MemoryX holds weight memory and optimizer compute; weights broadcast through SwarmX to the CS-3s, gradients are reduced on the way back.)
Scaling to cluster compute while operating like a single device
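A hedged sketch of the data-parallel-only flow described above, using plain NumPy and a toy linear model (all names and shapes are illustrative, not Cerebras software): the same weights are broadcast to every system, each system computes gradients on its own data shard, and the gradients are summed on the way back before a single optimizer step.

# Illustrative sketch, not Cerebras software: broadcast weights out, reduce gradients back.
import numpy as np

def train_step(weights, data_shards, lr=1e-3):
    replicas = [weights.copy() for _ in data_shards]        # broadcast: every CS-3 gets a full copy
    grads = [2.0 * shard.T @ (shard @ w) / len(shard)       # each system: gradients on its own shard
             for w, shard in zip(replicas, data_shards)]    # (toy quadratic loss on a linear model)
    reduced = sum(grads) / len(grads)                       # reduce: gradients summed on the way back
    return weights - lr * reduced                           # single optimizer step in external memory

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 4)) * 0.1
shards = np.split(rng.standard_normal((64, 16)), 4)         # 4 "CS-3s", one data shard each
w = train_step(w, shards)                                   # the same flow for 1 or 2048 systems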
© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 Cluster Compute
CS-2 Cluster
192 CS-2 systems
12 exaFLOPS AI Compute
© 2024 Cerebras Systems Inc. All Rights Reserved
• 2048 CS-3
in single cluster
• 256 exaFLOPS
AI Compute
• Programs like a
single device
CS-3 Cluster Compute
Supercomputer Performance, Single Device Experience
© 2024 Cerebras Systems Inc. All Rights Reserved
SwarmX
Scalable spine-leaf topology
• Standard-based 400/800G
Ethernet
• Performance and cost effective
• RDMA for low overhead and
latency
Scaling to 256 exaFLOPS
Purpose Built Scalable Network for AI Training
Cluster options (CS-2 cluster vs. CS-3 cluster):
• Cluster size: 192 systems vs. 2,048 systems
• Link speed: 100 Gb/s vs. 400/800 Gb/s
• Cluster bandwidth: 1 Pb/s vs. 10 Pb/s
© 2024 Cerebras Systems Inc. All Rights Reserved
Train Today’s SOTA Models in Hours or Days
LLaMA 70B training: Meta GPU cluster ~1 month vs. Cerebras CS-3 cluster ~1 day
But the CS-3 cluster operates like a single device
© 2024 Cerebras Systems Inc. All Rights Reserved
CS-3 Cluster Memory
Memory SKUs (CS-2 options): 1.5 TB (30 billion parameters) and 12 TB (240 billion parameters)
© 2024 Cerebras Systems Inc. All Rights Reserved
MemoryX: The First Petabyte-Scale AI Memory System
100x larger models, up to 24 trillion parameters
CS-3 MemoryX options (Enterprise to Hyperscale SKUs):
• 1.5 TB → 30B parameters
• 12 TB → 240B parameters
• 24 TB → 480B parameters
• 36 TB → 720B parameters
• 120 TB → 2,400B parameters
• 1,200 TB → 24,000B parameters
© 2024 Cerebras Systems Inc. All Rights Reserved
MemoryX
Compute
State
Efficient hybrid state store
• Weights stored in DDR5 and Flash
• Perf and power/cost efficiency
Flexible compute
• Optimizer and other ops run on
CPU
• General purpose and flexible
• Support for all common ML ops
Enabling Multi-Trillion Parameter Models
Most Scalable and Efficient Model Memory
Model weights live in DRAM and flash; the model optimizer and other operations run on the CPU.
CS-2 vs. CS-3 MemoryX:
• DRAM memory: 12 TB DDR4 (240B params) vs. 36 TB DDR5 (720B params)
• Flash memory (CS-3): 1.2 PB (24T params)
• CPU performance: 1x vs. 2x
© 2024 Cerebras Systems Inc. All Rights Reserved
Large Cluster Memory on a Single Device
© 2024 Cerebras Systems Inc. All Rights Reserved
Train Tomorrow’s Trillion+ Parameter Models
Imagine… LLaMA 1T training: 1000s of GPUs ~1.5 years vs. Cerebras CS-3 cluster ~3 weeks
And the CS-3 cluster still operates like a single device
© 2024 Cerebras Systems Inc. All Rights Reserved
You Program It Like a Single Device, No Matter the Cluster Size
1x CS-3, 4x CS-3, or 2048x CS-3: in every case the programmer sees one big device.
© 2024 Cerebras Systems Inc. All Rights Reserved
And Your Model Always Fits: 1B or 1T Parameters
Llama 7B on a 1.5 TB cluster, Llama 70B on 36 TB, Llama 700B on 1,200 TB; in every case the programmer still sees one big device.
© 2024 Cerebras Systems Inc. All Rights Reserved
Real world seamless cluster scaling
• User: G42
• Model: Jais30B
• Cluster: Condor Galaxy-1
• Experience: “It just worked”
• No complex distributed software
• No changes to parallelism model
• No changes to hyper-parameters
Training SOTA large models every day
• Unique capability enabled by wafer-scale
Jais30B measured training speedup on CG-1: relative speedup (x factor) vs. number of CS-2s, scaling near-linearly from 1 to 64 systems.
Resulting in Near Linear Scaling
Any Scale While Operating as a Single Device
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras Design Philosophy: Massive Compute + Memory for Large Scale Models
GPU:
• External chip interconnect
• Low-performance, high-power connections
• Custom proprietary switches
• Complex distributed software
• Hybrid model-parallel partitioning
Wafer Scale Engine:
• On-chip interconnect
• “Free” high-performance communication
• Big enough to run the largest models
• Simple data-parallel only scaling
• Disaggregated compute and memory
© 2024 Cerebras Systems Inc. All Rights Reserved
But we can and need to do even better…
© 2024 Cerebras Systems Inc. All Rights Reserved
40,000x more compute
In just 5 years
Current trajectory is unsustainable
We must find more efficient
methods
Sparsity is the key
But We Can and Need to Do Even Better
Sparsity Solves the Explosive Cost of Gen AI
Chart: exaFLOPs to train vs. year (2018-2024, log scale from 100 to 100,000,000 exaFLOPs) for BERT, GPT-2, Megatron-LM, T5, T-NLG, GPT-3, Jurassic, Gopher, MT-NLG, Chinchilla, LLaMA, and GPT-4.
© 2024 Cerebras Systems Inc. All Rights Reserved
Sparsity opportunities are everywhere
• Neural networks have native sparsity
• e.g. ReLU or Dropout
• Neural networks can be made sparse
• e.g. sparse weights
• Models are over parameterized by design
• Training is act of discovering important
weights
Training dense is wasteful and inefficient
• But not all hardware can take advantage of
all forms of sparsity
Neural Networks are Sparse
Sparsity
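As a small illustration of the native-sparsity point above (a sketch, not Cerebras code): simply counting zeros after a ReLU shows how much of the following matmul could be skipped.

# Minimal sketch: fraction of exact zeros produced by a ReLU layer (random inputs here;
# trained LLM FFNs can be far sparser, e.g. the ~95% figure cited later in this deck).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 1024))
w = rng.standard_normal((1024, 4096)) / np.sqrt(1024)
acts = np.maximum(x @ w, 0.0)                               # ReLU zeroes roughly half the entries
sparsity = 1.0 - np.count_nonzero(acts) / acts.size
print(f"activation sparsity after ReLU: {sparsity:.1%}")    # ~50% for random inputs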
© 2024 Cerebras Systems Inc. All Rights Reserved
Memory bandwidth built for sparsity
• Traditional hardware built for dense
• High data reuse → caching → low memory bandwidth
• Wafer-scale memory built for sparse
• Low data reuse → no caching → high memory bandwidth
• Enabled by orders of magnitude more mem
bw
CS-3 accelerates all forms of sparsity
• Static and dynamic sparsity
• Structured and unstructured sparsity
Sparsity Acceleration is Memory Bound
Memory bandwidth (Byte/FLOP), required vs. available:
• Required: dense MatMul ~0.001; sparse MatMul ~1
• Available: H100 0.003; WSE-3 2
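A back-of-envelope check on the required side of the table above (the available figures are quoted from the slide as given): with FP16 operands, a large dense GEMM needs on the order of 0.001 bytes per FLOP because every operand is reused many times, while a no-reuse sparse or batch-1 matvec needs about 1 byte per FLOP.

# Back-of-envelope arithmetic, FP16 (2 bytes per value).
M = N = K = 4096
dense_bytes = 2 * (M * K + K * N + M * N)                   # read A and B, write C, once each
dense_flops = 2 * M * N * K
print(f"dense GEMM:      {dense_bytes / dense_flops:.4f} Byte/FLOP")    # ~0.0007, i.e. order 0.001

matvec_bytes = 2 * K * N                                    # every weight fetched for a single use
matvec_flops = 2 * K * N
print(f"no-reuse matvec: {matvec_bytes / matvec_flops:.1f} Byte/FLOP")  # 1.0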
© 2024 Cerebras Systems Inc. All Rights Reserved
Examples of sparse training opportunities
• Dynamic activation sparsity
• e.g. Google: 95% sparse ReLU FFN in LLMs1
• Structured weight sparsity
• e.g. Mistral: 75% sparse FFN MoE 8x7B2
• Unstructured weight sparsity
• e.g. Cerebras: 75% sparse SPDF GPT3
Solving unsustainable scaling for training
• Only HW to accelerate all forms of sparsity
• Even future sparse techniques
Accelerating All Forms of Sparse Training
1 Li et al., The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers, 2023
2 Jiang et al., Mixtral of Experts, 2024
3 Thangarasa et al., SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models, 2023
FLOP reduction from sparsity (relative FLOPs, dense vs. sparse): ReLU 1.7x, MoE 2.0x, SPDF 2.8x
© 2024 Cerebras Systems Inc. All Rights Reserved
But sparsity can also transform inference
on a variety of hardware…
© 2024 Cerebras Systems Inc. All Rights Reserved
Neural Magic + Cerebras
Accelerated Inferencing for LLM Optimization
Mark Kurtz
CTO
Neural Magic
© 2024 Cerebras Systems Inc. All Rights Reserved
OUR LEADERSHIP
Who are we?
AI leader in model optimization and inference server acceleration
MIT Professor of Electrical Engineering
and Computer Science, ACM Fellow
Nir Shavit
Co-Founder
MIT Research Scientist of Multicore
Algorithms and Computational
Connectomes
Alex Matveev
Co-Founder
Chief Scientist
Former VP of Product and CTO of
Google Cloud, former CTO and EVP of
Worldwide Engineering for RedHat
Brian Stevens
CEO of Neural Magic
IST Austria Professor of Distributed
Computing and Machine Learning
Dan Alistarh
Principal Research Scientist
© 2024 Cerebras Systems Inc. All Rights Reserved
• 200+ accepted papers
• 60 patents
• GPTQ
• SparseGPT
• Sparse Fine-Tuning
• nm-vllm
• DeepSparse
• SparseML
As a software-delivered solution, we have deep expertise across AI model training and optimization. We invented many of the current AI industry’s state-of-the-art techniques for quantization and sparsification. Our solutions range from enterprise inference servers to open-source libraries and a sparsified models repo.
© 2024 Cerebras Systems Inc. All Rights Reserved
Challenges with LLM deployment
Deploying to production: issues include
• Requires lots of compute
• Requires lots of memory
• Increases latency
• Very demanding on inference serving infrastructure
• Expensive to operate and support
Options to resolve
• Decrease the size of the LLM
• Apply quantization to combat the accuracy issue when model size is reduced
Llama 2 Size vs. Accuracy (chart)
© 2024 Cerebras Systems Inc. All Rights Reserved
The solution: Sparsity
Unstructured sparsity (before pruning → after pruning):
• Preserves the model’s accuracy while reducing the size of the model
• Improves inference and training performance
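For intuition, here is a minimal sketch of unstructured magnitude pruning (illustrative only; production pipelines such as SparseGPT prune far more carefully and retrain to recover accuracy): keep the largest-magnitude weights, zero the rest, and the layer’s shape is unchanged.

# Minimal sketch of unstructured magnitude pruning (illustrative only).
import numpy as np

def magnitude_prune(w, sparsity=0.7):
    k = int(sparsity * w.size)                               # number of weights to zero out
    threshold = np.partition(np.abs(w), k, axis=None)[k]     # k-th smallest magnitude
    return np.where(np.abs(w) >= threshold, w, 0.0)          # keep only the largest-magnitude weights

rng = np.random.default_rng(0)
w_sparse = magnitude_prune(rng.standard_normal((1024, 1024)), sparsity=0.7)
print(f"sparsity: {1 - np.count_nonzero(w_sparse) / w_sparse.size:.0%}")   # ~70%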
© 2024 Cerebras Systems Inc. All Rights Reserved
Our research collaboration with Cerebras
Create open-source sparse foundational
models that organizations can easily
deploy and use with faster inference.
© 2024 Cerebras Systems Inc. All Rights Reserved
Our process
• Llama 2: 2T tokens, pretrained from Meta
• Sparse pretraining: SparseGPT plus sparse pretraining on Cerebras over 150B tokens (1.7-2.4x reduction in FLOPs)
• Sparse foundational models: Llama 2 7B at 50% and 70% sparsity, 90% accuracy recovery
• Off the shelf: sparse fine-tuning and quantization with GPTQ
• Fine-tuned variants: chat (50%, 70%) and code generation (50%, 70%)
© 2024 Cerebras Systems Inc. All Rights Reserved
Results
Full recovery with 50% and 70% sparse models.
Sparsity vs Accuracy for UltraChat 200k Sparsity vs Accuracy for Evol Code Alpaca
© 2024 Cerebras Systems Inc. All Rights Reserved
Results
4.3x memory reduction (memory usage vs. compression level, Llama 2 7B)
© 2024 Cerebras Systems Inc. All Rights Reserved
Our process (continued)
• Fine-tuning for your use case: sparse fine-tuning for a few hours, then quantization with GPTQ
• Deployment with DeepSparse
© 2024 Cerebras Systems Inc. All Rights Reserved
Local inference performance
With sparsity, real time chat is now possible on local CPUs.
Single Stream Token Generation - Llama 2 7B Single Stream Latency - Llama 2 7B
© 2024 Cerebras Systems Inc. All Rights Reserved
Server inference performance
With sparsity, CPU performance is competitive with GPUs.
Single Stream Decode Performance - Llama 2 7B Multi Stream Decode Performance - Llama 2 7B
© 2024 Cerebras Systems Inc. All Rights Reserved
Comparison (using Neural Magic DeepSparse on an 8-core AMD Genoa CPU):
• Unoptimized model, Llama 2 7B FP32: 2 tokens/second
• Sparse quantized model, Llama 2 7B 70% sparse INT8: 20 tokens/second
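For reference, serving such a model locally looks roughly like the snippet below. This is an assumed usage sketch written from memory of Neural Magic’s DeepSparse text-generation pipeline; the import, arguments, model stub, and output fields should be checked against the current DeepSparse documentation and Neural Magic’s Hugging Face organization.

# Assumed usage (verify against the DeepSparse docs); the model stub below is a placeholder.
from deepsparse import TextGeneration

pipeline = TextGeneration(model="hf:neuralmagic/<sparse-quantized-llama2-7b-stub>")  # hypothetical stub
result = pipeline(prompt="Explain weight sparsity in one sentence.")
print(result.generations[0].text)   # field names assumed from memory of the pipeline output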
© 2024 Cerebras Systems Inc. All Rights Reserved
Key takeaways
• Takeaway 1: Run SOTA models in real time on just a laptop with Neural Magic DeepSparse (up to 4x faster than llama.cpp)
• Takeaway 2: Transform your infrastructure with just software to support LLMs (up to 7x more inference streams per server than llama.cpp at the same performance level)
• Takeaway 3: Train sparse models faster with Cerebras (up to 2x faster sparse training)
© 2024 Cerebras Systems Inc. All Rights Reserved
Next steps
Neural Magic’s Hugging Face
Organization Cerebras Blog
• Arxiv paper with our current results
• Larger models
• Higher sparsities
• INT4 quantization support
• Combine with parameter efficient fine-tuning
Stay tuned for more collaboration with Cerebras
Neural Magic Docs
© 2024 Cerebras Systems Inc. All Rights Reserved
Thank you
• Follow us (@neuralmagic) to stay current on all things Neural Magic, including product updates, ML research developments, and more.
• Join our community: engage with fellow ML practitioners, ask questions, share feedback, and improve the way you use Neural Magic.
• Connect with Neural Magic (neural-magic) to stay up to date with #SoftwareDelivered AI.
© 2024 Cerebras Systems Inc. All Rights Reserved
Models & Product
Jessica Liu, VP of Product, Cerebras
© 2024 Cerebras Systems Inc. All Rights Reserved
The goal of AI training: make the loss curve go down
© 2024 Cerebras Systems Inc. All Rights Reserved
⚠ But it’s not so simple...
© 2024 Cerebras Systems Inc. All Rights Reserved
This happens all the time
© 2024 Cerebras Systems Inc. All Rights Reserved
Model performance can vary greatly
© 2024 Cerebras Systems Inc. All Rights Reserved
Lots of time and cost riding on "getting the big run right"
Challenges of large GenAI training & fine-tuning: out of memory, GPU failure, numerics bugs, low utilization
1. Distribution
2. ML complexity
3. Cost
© 2024 Cerebras Systems Inc. All Rights Reserved
How to get good model quality at scale
Design the experiments → run experiments → pick winners → scale up
Start with small models (500M, 1.3B, 3B), pick configs that carry to 7B, 13B, and 30B, and scale up to 100B; time and work grow with model size.
© 2024 Cerebras Systems Inc. All Rights Reserved
How to get good model quality at scale (on GPUs)
The same design → run → pick winners → scale up loop, but each model size needs a different GPU count and parallelism strategy:
• 0.5B: 1 GPU
• 3B: 8 GPUs, data parallelism
• 13B: 256 GPUs, data & tensor & pipeline parallelism
• 100B: 2,048 GPUs, data & tensor & pipeline & expert & sequence parallelism
© 2024 Cerebras Systems Inc. All Rights Reserved
You have to micromanage the
distribution strategy:
• Tensor or pipeline model parallelism
• Distributed data parallelism
• Expert parallelism
• Interleaved pipelining schedule
• Activation checkpointing &
recomputation
• Interplay among model size, cluster size,
connectivity between nodes, number of
nodes, etc.
Scaling frameworks still require tons of work
© 2024 Cerebras Systems Inc. All Rights Reserved
Lines of Code
----------------------------
Python 18395
C/C++ 1118
C++ 649
CUDA 220
HTML 107
Bourne Shell 9
make 7
Markdown 1
Text 1
----------------------------
Total 20507
----------------------------
Nvidia’s GPT-175B Model
20,000 lines of code, weeks to implement
Hard to debug
© 2024 Cerebras Systems Inc. All Rights Reserved
Cut experiment iteration time from weeks to a day
Lines of Code
----------------------------
Python 565
C/C++ 0
C++ 0
CUDA 0
HTML 0
Bourne Shell 0
make 0
Markdown 0
Text 0
----------------------------
Total 565
----------------------------
Cerebras’ GPT-175B Model
565 lines of code, 1 Day to implement
"GPT-3 in 565 lines of code" Blog
© 2024 Cerebras Systems Inc. All Rights Reserved
How to scale from 1B to 70B on Cerebras

gpt3_1b_params.yaml:
### GPT-3 XL 1.3B
hidden_size: 2048
num_hidden_layers: 24
num_heads: 16

Training:
python run.py \
  --params gpt3_1b_params.yaml \
  --num_steps=100 \
  --model_dir=model_dir

llama2_70b_params.yaml:
### Llama-2 70B
hidden_size: 8192
num_hidden_layers: 80
num_heads: 64

Training:
python run.py \
  --params llama2_70B_params.yaml \
  --num_steps=100 \
  --model_dir=model_dir
© 2024 Cerebras Systems Inc. All Rights Reserved
Scaling from one CS-3 to a cluster is a 1-line change
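As a hedged illustration (the --num_csx flag name is taken from Cerebras ModelZoo documentation and may differ by release), the launch command is the only thing that changes when moving from one CS-3 to a cluster:

python run.py \
  --params llama2_70B_params.yaml \
  --num_steps=100 \
  --model_dir=model_dir \
  --num_csx=16        # request 16 CS-3s instead of 1; the model code is untouched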
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras gets you to high-quality large models faster & more cheaply
On CS-3: data-parallel only, at any model size (0.5B, 3B, 13B, 100B), through the same design sweeps → run experiments → pick winners → scale up flow.
© 2024 Cerebras Systems Inc. All Rights Reserved
On GPUs, small models are the default;
large models take large engineering effort.
On CS-3s, large models are the default;
small models come for free.
© 2024 Cerebras Systems Inc. All Rights Reserved
Med42: Llama-70B Fine-tuned in <1 Week
to Pass the US Medical License Exam
• Scored 72% on USMLE, beating GPT-3.5
• With M42: global healthcare company
with over 450 hospitals and clinics
• Custom curated healthcare dataset of
peer-reviewed papers, medical
textbooks, international health agency
datasets.
• Run finished in 1 weekend
© 2024 Cerebras Systems Inc. All Rights Reserved
FLOR-6.3B State-of-the-Art Catalan,
Spanish, and English LLM
• Best Catalan model, beating BLOOM-7.3B
• Used latest language adaptation techniques
for languages with less training data
• Reduced inference cost by 10% vs. BLOOM,
incorporating a new, more efficient tokenizer
• Used to build RAG systems for specialized
domains
• Trained on 140B tokens in 2.5 days.
• Open Source: Downloaded over 3000 times
FLOR-6.3B
© 2024 Cerebras Systems Inc. All Rights Reserved
JAIS-30B: State-of-the-Art
Arabic-English Bilingual LLM
• SoTA Arabic: Outperforms all other Arabic models
• English: Llama-30B quality in English
• Co-developed with G42’s Core42 and MBZUAI
• Now on Azure AI Cloud as the foundation of their
Model-as-a-Service in the Middle East
Checkpoints on
HuggingFace
Paper available
on Arxiv
© 2024 Cerebras Systems Inc. All Rights Reserved
Challenges
(1) Few high-quality Arabic datasets and
preprocessing pipelines
(2) Tokenizers trained on English
corpora don’t extend well to Arabic
(3) Want highest quality model with
best cost and compute efficiency
Used the latest ML techniques: ALiBi, SwiGLU activation, muP, scaling laws
Ran many tuning experiments on models of
590M, 1.3B, 2.7B, 6.7B.
New vocab optimized for cross-lingual
alignment and trained custom tokenizer
Built new multi-lingual set, experimenting with
mixes of Arabic-only, and Arabic, English, and
code, to find optimal mix (1:2:0.4)
What we did
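As a small worked example of the mix mentioned above (illustrative arithmetic only): converting the 1 : 2 : 0.4 Arabic : English : code ratio into per-source sampling probabilities for batch construction.

# Illustrative arithmetic only: mixture ratios -> per-source sampling probabilities.
ratios = {"arabic": 1.0, "english": 2.0, "code": 0.4}
total = sum(ratios.values())
probs = {name: r / total for name, r in ratios.items()}
print(probs)   # arabic ~0.29, english ~0.59, code ~0.12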
© 2024 Cerebras Systems Inc. All Rights Reserved
"I’ve found it really easy to experiment at every model size and scale
on multiple CS systems, which we need to do to get the best results.
There’s no difference between running a job on a single CS versus
multiple ones. All it takes is a small config change, and everything just
works with observable linear speedup!
Launched my first distributed LLM training within the first hour of
logging into a CS cluster for the first time!”
Neha Sengupta, Core42
Principal Applied Scientist
© 2024 Cerebras Systems Inc. All Rights Reserved
Jais-30B-v3 sets a new record for open-source Arabic LLMs, finishing training on 1.3 trillion tokens
Jais-30B outperforms on all common NLP benchmarks in Arabic (MMLU / HellaSwag / ARC-C / TruthfulQA):
• Jais-30b-chat: 35.1 / 59.3 / 39.1 / 53.1
• acegpt-13b-chat: 31.2 / 49.2 / 35.1 / 48.2
• BLOOMz (7.1B): 31.0 / 38.1 / 30.2 / 48.4
• LLaMA (30B): 28.9 / 33.9 / 26.9 / 48.4
• falcon-40b_instruct: 28.6 / 32.1 / 26.4 / 49.3
© 2024 Cerebras Systems Inc. All Rights Reserved
The Future is Multimodal
An explosion of exploration in multimodality
Source: Recent advances in Multimodal LLMs
© 2024 Cerebras Systems Inc. All Rights Reserved
• Generalized support for Visual Q&A:
• Multiple vision encoders
• Multiple LLM backbones
• Cross-projection learning
• Multiple modalities to an LLM backbone
• Easy scaling for model size and context length
• Easy to configure many leading literature models
(e.g. LLaVA, AnyMAL, Eyes Wide Shut)
• Dataset: support for quick import of custom datasets
Multimodality is easy on Cerebras
Plug & play vision & LLM backbones: vision encoders (CLIP, SigLIP, DINOv2) paired with LLM backbones (Llama, Mistral, Zephyr) to produce multimodal output.
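A minimal sketch of the plug-and-play recipe above (illustrative shapes and names, not Cerebras ModelZoo code): a frozen vision encoder’s patch features are passed through a learned projection into the LLM’s embedding space and concatenated with the text tokens, which is what makes encoders and backbones swappable.

# Illustrative LLaVA-style cross-projection; shapes are placeholders.
import numpy as np

rng = np.random.default_rng(0)
vision_dim, llm_dim = 1024, 4096                                 # e.g. CLIP-like encoder, Llama-like LLM
patch_feats = rng.standard_normal((256, vision_dim))             # 256 image patches from the encoder
projector = rng.standard_normal((vision_dim, llm_dim)) * 0.02    # the only newly trained piece
image_tokens = patch_feats @ projector                           # now shaped like LLM token embeddings
text_tokens = rng.standard_normal((32, llm_dim))                 # embedded text prompt
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)  # fed to the LLM backbone
print(llm_input.shape)                                           # (288, 4096)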
© 2024 Cerebras Systems Inc. All Rights Reserved
Demo
© 2024 Cerebras Systems Inc. All Rights Reserved
Reproducing state-of-the-art results in just a couple of weeks
(Scores per model: GQA / VQA(t) / VQA(v2) / POPE)
7B parameter models:
• LLaVA 1.5 (7B): 62.0 / 58.2 / 78.5 / 85.9
• Cerebras-LLaVA 1.5 (7B): 62.3 / 58.2 / 78.5 / 85.3
• SGPT4V (7B): 63.3 / 60.4 / 80.6 / not reported
• Cerebras-SGPT4V (7B): 63.5 / 60.8 / 80.7 / 85.7
13B parameter models:
• LLaVA 1.5 (13B): 63.3 / 61.3 / 80.0 / 85.9
• Cerebras-LLaVA 1.5 (13B): 64.2 / 63.4 / 82.0 / 85.8
Improving: the 7B model is competitive with LLaVA 1.5 13B HD, a model 2x larger with 1.7x higher resolution image input that came out less than 2 months ago:
• CS3-LLaVA-7B: POPE 86.7, GQA 63.9, VQAt 61.5, MME 1573, VQAv2 81.4
• LLaVA 1.5 13B HD: POPE 86.3, GQA 64.7, VQAt 62.5, MME 1500, VQAv2 81.8
© 2024 Cerebras Systems Inc. All Rights Reserved
Get started quickly with Cerebras ModelZoo
Model code with flexible configuration setup
• Different image encoders:
• CLIP
• SigLIP
• Dino v2
• Different LLM backbones:
• LLaMA
• Mistral
• Zephyr
• Different training recipes:
• LLaMA Pro
• Eyes Wide Shut
• Freezing different parts of the model
Prepared Datasets
• LLAVA 1.5, ShareGPT4V, Instruct4V
• ChartQA, DocVQA, DVQA, ArxivQA, AI2Diagrams
Data pre-processing scripts
• HDF5 file generation support
• Handles mix of multimodal and text-only data
• Optimized for high-throughput training
Easy scaling for model and data
• LLM model size
• Long context lengths
• Image resolution and patch size
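As a hedged sketch of the HDF5 packing step mentioned above (h5py with illustrative dataset names; the real ModelZoo scripts differ): token sequences are packed once into fixed-length arrays so training can stream them at high throughput.

# Illustrative HDF5 packing with h5py; dataset names and shapes are assumptions.
import h5py
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.integers(0, 32000, size=(1000, 2048), dtype=np.int32)   # 1000 packed sequences

with h5py.File("train_shard_000.h5", "w") as f:
    f.create_dataset("input_ids", data=tokens, compression="gzip")
    f.create_dataset("attention_mask", data=np.ones_like(tokens), compression="gzip")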
© 2024 Cerebras Systems Inc. All Rights Reserved
Model Checkpoints Available on HuggingFace
7B – available now
13B – available now
70B – end of March!
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras’ goal is to bring
State-of-the-Art AI to
every organization
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras solutions meet you wherever you need
Cerebras Wafer Scale Clusters
Cerebras Cloud
Cerebras AI Solutions
© 2024 Cerebras Systems Inc. All Rights Reserved
Cerebras AI Model Services
GenAI Success with Cerebras ML Experts on
the Fastest, Most Efficient Platform
• Speed: Multi-Billion param models in days to weeks.
• Tailored to you: Custom chatbots, VQA Systems,
Code Completion, Foundation models, and more
• All the latest ML Techniques: RAG, DPO, LoRA,
MuP, data augmentation, and more.
• Total Ownership: Your data, your model weights.
© 2024 Cerebras Systems Inc. All Rights Reserved
Models on Cerebras
From multi-lingual LLMs to healthcare chatbots to code models.
© 2024 Cerebras Systems Inc. All Rights Reserved
All the Latest ML Techniques & Recipes
Variable Seq Training
DPO
LL360 – Open data, models, scripts
Multi-lingual
Pre-training & IFT
Llama70B fine tuning
Domain Adaptation
GPT-3 in 565 lines
of code
Most FLOP efficient
LLM dataset
First family of open GPT models
and OSS use of muP
RAG
LoRA
MoE
Multi
Modal
Sparse
Models
© 2024 Cerebras Systems Inc. All Rights Reserved
The model belongs to you
Your data stays with you
© 2024 Cerebras Systems Inc. All Rights Reserved
Cloud
Cerebras AI Supercomputers
Exascale compute with the programmability of a single device
On-Prem
© 2024 Cerebras Systems Inc. All Rights Reserved
AI Applications & Research Panel
Andy Hock, SVP Product & Strategy, Cerebras
Cerebras AI Applications
& Research Panel
Praneetha Elugunti
Mayo Clinic
Jim Culver
GSK
Tim Bishop
Mayo Clinic
Irina Rish
University of Montreal
Andy Hock
Cerebras
Cerebras x
Qualcomm
Fireside Chat with
Rashid Attar, VP of Cloud Computing,
Qualcomm
Cerebras x Qualcomm Technology Partnership
Reducing Inference Cost by 10x
Cerebras CS-3
AI Training
Qualcomm Cloud AI100 Ultra
AI Inference
Jointly optimized software stack for cost-efficient LLMs (Cerebras stack → Qualcomm stack):
• Sparse training → sparse inference
• Train in FP16 → compile & run in MX6
• Train large + small models → apply speculative decoding
• Network architecture search → compile & run on Ultra AI 100
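To make the speculative-decoding row of the stack above concrete, here is a simplified, illustrative sketch (greedy acceptance with toy stand-in models; real systems verify all draft tokens in one batched target pass and use probabilistic acceptance):

# Simplified speculative decoding: a small draft model proposes k tokens, the large target
# model checks them and keeps the agreed prefix, so each expensive target step yields
# several tokens. The lambda "models" below are toy stand-ins.
def speculative_step(draft_next, target_next, prefix, k=4):
    draft = list(prefix)
    for _ in range(k):                                      # draft model proposes k tokens
        draft.append(draft_next(draft))
    proposed = draft[len(prefix):]
    accepted, ctx = [], list(prefix)
    for tok in proposed:                                    # verify proposals in order
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))               # first disagreement: take the target's token
            break
    return prefix + accepted

draft_next  = lambda seq: (sum(seq) + 1) % 7                # toy stand-ins for real models
target_next = lambda seq: (sum(seq) + 1) % 7 if len(seq) % 3 else (sum(seq) + 2) % 7
print(speculative_step(draft_next, target_next, [1, 2, 3]))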
Cerebras x Qualcomm: Up to 10x Inference Performance
Total tokens per dollar, relative to baseline:
• Baseline: 1x
• Speculative decoding: 1.8x
• MX6 compression: 2.2x
• Neural architecture search: 2.5x
• Sparsity: 2.5x
• Total: ~10x
Cerebras x G42
Fireside Chat with
Kiril Evtimov, Group CTO G42 & CEO
Core42
G42 across the Entire AI Value Chain
Customer &
Industry Tailored
Solutions
Data
Centers
Compute
Infrastructure
Cloud
Platforms
AI Model
Development
Cloud &
Enterprise AI
Deployment
Application
Development
476B Arabic tokens
1.63T Total tokens
The world’s largest
open-source Arabic LLM
30B parameter, bilingual
Arabic-English model
Trained on the Condor Galaxy 1 and 2 AI Supercomputers
Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip

More Related Content

Similar to Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip

Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptxQ1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptxMemory Fabric Forum
 
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...Big Data Week
 
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs:  Intelligent Connectivity for Cloud and AI InfrastructureAstera Labs:  Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs: Intelligent Connectivity for Cloud and AI InfrastructureMemory Fabric Forum
 
Designing memory controller for ddr5 and hbm2.0
Designing memory controller for ddr5 and hbm2.0Designing memory controller for ddr5 and hbm2.0
Designing memory controller for ddr5 and hbm2.0Deepak Shankar
 
Exploration of Radars and Software Defined Radios using VisualSim
Exploration of  Radars and Software Defined Radios using VisualSimExploration of  Radars and Software Defined Radios using VisualSim
Exploration of Radars and Software Defined Radios using VisualSimDeepak Shankar
 
Power 7 Overview
Power 7 OverviewPower 7 Overview
Power 7 Overviewlambertt
 
Build FAST Learning Apps with Docker and OpenPOWER
Build FAST Learning Apps with Docker and OpenPOWERBuild FAST Learning Apps with Docker and OpenPOWER
Build FAST Learning Apps with Docker and OpenPOWERIndrajit Poddar
 
MemVerge: Memory Expansion Without Breaking the Budget
MemVerge: Memory Expansion Without Breaking the BudgetMemVerge: Memory Expansion Without Breaking the Budget
MemVerge: Memory Expansion Without Breaking the BudgetMemory Fabric Forum
 
Ca lecture 03
Ca lecture 03Ca lecture 03
Ca lecture 03Haris456
 
April 2014 IBM announcement webcast
April 2014 IBM announcement webcastApril 2014 IBM announcement webcast
April 2014 IBM announcement webcastHELP400
 
AWS Summit Bogotá Track Avanzado: EC2 avanzado
AWS Summit Bogotá Track Avanzado: EC2 avanzadoAWS Summit Bogotá Track Avanzado: EC2 avanzado
AWS Summit Bogotá Track Avanzado: EC2 avanzadoAmazon Web Services
 
Exadata_X10M-Hardware-Overview.pdf
Exadata_X10M-Hardware-Overview.pdfExadata_X10M-Hardware-Overview.pdf
Exadata_X10M-Hardware-Overview.pdfKoko842772
 
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...Memory Fabric Forum
 
Optimizing elastic search on google compute engine
Optimizing elastic search on google compute engineOptimizing elastic search on google compute engine
Optimizing elastic search on google compute engineBhuvaneshwaran R
 
Running ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in ProductionRunning ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in ProductionSearce Inc
 
The Power of HPC with Next Generation Supermicro Systems
The Power of HPC with Next Generation Supermicro Systems The Power of HPC with Next Generation Supermicro Systems
The Power of HPC with Next Generation Supermicro Systems Rebekah Rodriguez
 

Similar to Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip (20)

Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptxQ1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
 
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...
BDW Chicago 2016 - Manny Puentes, CTO, Altitude digital - How We Built a Data...
 
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs:  Intelligent Connectivity for Cloud and AI InfrastructureAstera Labs:  Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure
 
Designing memory controller for ddr5 and hbm2.0
Designing memory controller for ddr5 and hbm2.0Designing memory controller for ddr5 and hbm2.0
Designing memory controller for ddr5 and hbm2.0
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
 
Exploration of Radars and Software Defined Radios using VisualSim
Exploration of  Radars and Software Defined Radios using VisualSimExploration of  Radars and Software Defined Radios using VisualSim
Exploration of Radars and Software Defined Radios using VisualSim
 
Power 7 Overview
Power 7 OverviewPower 7 Overview
Power 7 Overview
 
Build FAST Learning Apps with Docker and OpenPOWER
Build FAST Learning Apps with Docker and OpenPOWERBuild FAST Learning Apps with Docker and OpenPOWER
Build FAST Learning Apps with Docker and OpenPOWER
 
MemVerge: Memory Expansion Without Breaking the Budget
MemVerge: Memory Expansion Without Breaking the BudgetMemVerge: Memory Expansion Without Breaking the Budget
MemVerge: Memory Expansion Without Breaking the Budget
 
Ca lecture 03
Ca lecture 03Ca lecture 03
Ca lecture 03
 
April 2014 IBM announcement webcast
April 2014 IBM announcement webcastApril 2014 IBM announcement webcast
April 2014 IBM announcement webcast
 
Palestra IBM-Mack Zvm linux
Palestra  IBM-Mack Zvm linux  Palestra  IBM-Mack Zvm linux
Palestra IBM-Mack Zvm linux
 
AWS Summit Bogotá Track Avanzado: EC2 avanzado
AWS Summit Bogotá Track Avanzado: EC2 avanzadoAWS Summit Bogotá Track Avanzado: EC2 avanzado
AWS Summit Bogotá Track Avanzado: EC2 avanzado
 
Exadata_X10M-Hardware-Overview.pdf
Exadata_X10M-Hardware-Overview.pdfExadata_X10M-Hardware-Overview.pdf
Exadata_X10M-Hardware-Overview.pdf
 
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
 
Optimizing elastic search on google compute engine
Optimizing elastic search on google compute engineOptimizing elastic search on google compute engine
Optimizing elastic search on google compute engine
 
Running ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in ProductionRunning ElasticSearch on Google Compute Engine in Production
Running ElasticSearch on Google Compute Engine in Production
 
POWER9 for AI & HPC
POWER9 for AI & HPCPOWER9 for AI & HPC
POWER9 for AI & HPC
 
Power overview 2018 08-13b
Power overview 2018 08-13bPower overview 2018 08-13b
Power overview 2018 08-13b
 
The Power of HPC with Next Generation Supermicro Systems
The Power of HPC with Next Generation Supermicro Systems The Power of HPC with Next Generation Supermicro Systems
The Power of HPC with Next Generation Supermicro Systems
 

Recently uploaded

Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝soniya singh
 
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...Call Girls in Nagpur High Profile
 
VVIP Pune Call Girls Balaji Nagar (7001035870) Pune Escorts Nearby with Compl...
VVIP Pune Call Girls Balaji Nagar (7001035870) Pune Escorts Nearby with Compl...VVIP Pune Call Girls Balaji Nagar (7001035870) Pune Escorts Nearby with Compl...
VVIP Pune Call Girls Balaji Nagar (7001035870) Pune Escorts Nearby with Compl...Call Girls in Nagpur High Profile
 
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhi
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | DelhiFULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhi
FULL ENJOY - 8264348440 Call Girls in Hauz Khas | Delhisoniya singh
 
如何办理萨省大学毕业证(UofS毕业证)成绩单留信学历认证原版一比一
如何办理萨省大学毕业证(UofS毕业证)成绩单留信学历认证原版一比一如何办理萨省大学毕业证(UofS毕业证)成绩单留信学历认证原版一比一
如何办理萨省大学毕业证(UofS毕业证)成绩单留信学历认证原版一比一ga6c6bdl
 
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile serviceCall Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile servicerehmti665
 
(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Service
(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Service(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Service
(SANA) Call Girls Landewadi ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Call Girls in Nagpur Bhavna Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Bhavna Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Bhavna Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Bhavna Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...ranjana rawat
 
VIP Call Girl Saharanpur Aashi 8250192130 Independent Escort Service Saharanpur
VIP Call Girl Saharanpur Aashi 8250192130 Independent Escort Service SaharanpurVIP Call Girl Saharanpur Aashi 8250192130 Independent Escort Service Saharanpur
VIP Call Girl Saharanpur Aashi 8250192130 Independent Escort Service SaharanpurSuhani Kapoor
 
Slim Call Girls Service Badshah Nagar * 9548273370 Naughty Call Girls Service...
Slim Call Girls Service Badshah Nagar * 9548273370 Naughty Call Girls Service...Slim Call Girls Service Badshah Nagar * 9548273370 Naughty Call Girls Service...
Slim Call Girls Service Badshah Nagar * 9548273370 Naughty Call Girls Service...nagunakhan
 
Dubai Call Girls O528786472 Call Girls In Dubai Wisteria
Dubai Call Girls O528786472 Call Girls In Dubai WisteriaDubai Call Girls O528786472 Call Girls In Dubai Wisteria
Dubai Call Girls O528786472 Call Girls In Dubai WisteriaUnited Arab Emirates
 
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311  Call Girls in Thane , Independent Escort Service ThanePallawi 9167673311  Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service ThanePooja Nehwal
 
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai Mumbai ...
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai  Mumbai ...High Profile Call Girls In Andheri 7738631006 Call girls in mumbai  Mumbai ...
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai Mumbai ...Pooja Nehwal
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单留信学历认证原版一比一
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单留信学历认证原版一比一如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单留信学历认证原版一比一
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单留信学历认证原版一比一ga6c6bdl
 
Thane Escorts, (Pooja 09892124323), Thane Call Girls
Thane Escorts, (Pooja 09892124323), Thane Call GirlsThane Escorts, (Pooja 09892124323), Thane Call Girls
Thane Escorts, (Pooja 09892124323), Thane Call GirlsPooja Nehwal
 
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一
定制加拿大滑铁卢大学毕业证(Waterloo毕业证书)成绩单(文凭)原版一比一zul5vf0pq
 

Recently uploaded (20)

young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
 
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
Call Girls in Dwarka Sub City 💯Call Us 🔝8264348440🔝
 
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...
VVIP Pune Call Girls Warje (7001035870) Pune Escorts Nearby with Complete Sat...
 

Cerebras AI Day Deck :: A closer look at the world’s fastest AI Chip

  • 30. © 2024 Cerebras Systems Inc. All Rights Reserved Condor Galaxy 2 Stockton, California
  • 31. © 2024 Cerebras Systems Inc. All Rights Reserved Condor Galaxy 3 AI Supercomputer 64 CS-3 nodes 58 million AI cores 8 exaFLOPS FP16 AI compute 108 TB Parameter memory 388 Tbps On-chip bandwidth Dallas, Texas
  • 32. © 2024 Cerebras Systems Inc. All Rights Reserved AI Supercomputers Built & Operated in the United States. Condor Galaxy 1 (Santa Clara, CA): 4 ExaFLOPs, 64 x CS-2s, 82 TB of memory, ONLINE. Condor Galaxy 2 (Stockton, CA): 4 ExaFLOPs, 64 x CS-2s, 82 TB of memory, ONLINE. Condor Galaxy 3 (Dallas, TX): 8 ExaFLOPs, 64 x CS-3s, 108 TB of memory, Q2 2024
  • 33. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras & G42: World leading Arabic LLM ■ JAIS: 30B parameter, bilingual Arabic-English model ■ Microsoft's core LLM offering in the Middle East ■ Available on Azure. Satya Nadella, CEO of Microsoft
  • 34. © 2024 Cerebras Systems Inc. All Rights Reserved “Mayo Clinic selected Cerebras as its first generative AI collaborator for its large-scale, domain-specific AI expertise to accelerate breakthrough insights for the benefit of patients.” Cerebras & Mayo Clinic Breakthrough insights for the benefit of patients Medical Director for Strategy at Mayo Clinic Dr. Matthew Callstrom
  • 35. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras & TotalEnergies: “When the largest problem is solved, a speedup of 228x is achieved... Moreover…it is unlikely that such a performance gap can be closed… given the strong scalability issues encountered by this kind of algorithm when using a large number of multi-GPU nodes in HPC clusters.” Diego Klahr, VP of Engineering at TotalEnergies
  • 36. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras & KAUST: A Cerebras cluster with 48 systems exceeded the performance of the world's #1 supercomputer 'Frontier' with 37,000 GPUs, a roughly 100x cost saving. Tony Chan, President, KAUST
  • 37. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras CS-3 Architecture Deep Dive Sean Lie, CTO and Co-Founder, Cerebras
  • 38. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras CS-3: A Generational Leap for AI LLM Training Performance • 2x performance • Same power • Same price
  • 39. © 2024 Cerebras Systems Inc. All Rights Reserved WSE-3 Core • Building on the tried-and-true WSE-2 core… WSE-2 core: 4-way 16b SIMD; 48kB SRAM memory; 256B cache; 16 general purpose and 44 data structure registers; fabric interface
  • 40. © 2024 Cerebras Systems Inc. All Rights Reserved WSE-3 Core: Continuing Distributed AI Architecture Leadership. Improved performance for AI compute • New higher performance tensor operations • New 8-way SIMD for 16b data (FP/BF16) • New 16-way SIMD for 8b data (Fixed/INT8) • New faster non-linear functions • 2x higher compute performance per core. High bandwidth memory and cache • 48kB memory per core • New 512B local cache per core • Full bandwidth for full SIMD performance. WSE-3 core summary: 8-way 16b SIMD, 16-way 8b SIMD, 48kB SRAM memory, 512B cache, 16 general purpose and 48 data structure registers, fabric interface
  • 41. © 2024 Cerebras Systems Inc. All Rights Reserved From Small Core to Massive Wafer: from a single core, to a die of 10.7k cores, to the WSE-3 with 84 die and 900k cores
  • 42. © 2024 Cerebras Systems Inc. All Rights Reserved Uniquely capable of wafer-scale integration • Invented process in first generation WSE • Extended to 5nm in collaboration with TSMC Co-designed from ground up • Uniform architecture with built-in redundancy • Extending uniform fabric across die • Wafer behaves as single massive chip WSE-3 Interconnect Enabling the Only Wafer Scale Chip in the World
  • 43. © 2024 Cerebras Systems Inc. All Rights Reserved WSE-3 Interconnect: Enabling the Biggest Chip in the World. Traditional GPU interconnect (serial across connectors, PCBs, cables): each H100: 900GB/s bandwidth (36x 100Gb/s serial), 36W at 5.0 pJ/bit; 8x H100: 7.2TB/s (288x 100Gb/s serial), 288W. Wafer Scale Engine (parallel across <1mm on silicon): each die: 2880GB/s (480x 24Gb/s parallel), 1.1W at 0.05 pJ/bit; 84x die: 242TB/s (40,320x 24Gb/s parallel), 92W. Result: 10x more die, 33x more bandwidth, 100x more power efficient. *GPU estimate uses 5nm 100G serdes power with Nvidia H100 NVLink bandwidth
  • 44. © 2024 Cerebras Systems Inc. All Rights Reserved CS-3 System: Purpose Built for Wafer-Scale
  • 45. © 2024 Cerebras Systems Inc. All Rights Reserved CS-3 vs. GPU: Orders of Magnitude Performance Advantage. Chip size: 46,225 mm2 vs. 814 mm2 (57x); Cores: 900,000 vs. 16,896 FP32 + 528 Tensor (52x); On-chip memory: 44 Gigabytes vs. 0.05 Gigabytes (880x); Memory bandwidth: 21 Petabytes/sec vs. 0.003 Petabytes/sec (7,000x); Fabric bandwidth: 214 Petabits/sec vs. 0.0576 Petabits/sec (3,715x). Enabling large scale training: fine-tune LLaMA 70B on 1B tokens in a day on a single chip
  • 46. © 2024 Cerebras Systems Inc. All Rights Reserved Cluster natively operates as single device WSE-3 is big enough to run largest models • Enables compute and memory disaggregation • Train with data-parallel only scaling Architect cluster-level memory and compute • External memory stores model weights • Untangle memory and compute dependency CS-3 Cluster Designed as Single ML Accelerator … SwarmX Interconnect MemoryX Memory Units Wafer Scale Engines
  • 47. © 2024 Cerebras Systems Inc. All Rights Reserved Model capacity not limited by device • Weights streamed onto wafer to compute layer • Weights trigger compute using HW dataflow • Weights are never stored on wafer Decoupling weight optimizer compute • Gradients streamed out of wafer • Weight update occurs in MemoryX MemoryX External Memory Virtually Unlimited Model Weight Capacity Memory hierarchy capable of massive models on single device Weights Gradients MemoryX Optimizer Compute Weight Memory CS-3
  • 48. © 2024 Cerebras Systems Inc. All Rights Reserved Data-parallel only training across CS-3s • Weights are broadcast to all CS-3s • Gradients are reduced on way back Multi-system scaling with the same execution model as single system • Same system architecture • Same network execution flow • Same software user interface SwarmX Fabric Purpose Built Interconnect for Simple Scaling MemoryX Optimizer Compute Weight Memory Weights Gradients Weights Gradients SwarmX CS-3s Scaling to cluster compute while operating like a single device
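The two slides above describe the execution model in prose: MemoryX holds the weights and runs the optimizer, SwarmX broadcasts weights to every CS-3 and reduces gradients on the way back, and the wafers only ever see a stream of weights per layer. The short numpy sketch below illustrates that flow; it is a toy model of the idea, not Cerebras software, and every name in it (MemoryX, swarmx_reduce, the gradient rule) is an illustrative assumption.

    import numpy as np

    class MemoryX:
        """Holds model weights off-wafer and performs the optimizer update there."""
        def __init__(self, layer_shapes, lr=1e-3):
            self.weights = [np.random.randn(*shape) * 0.01 for shape in layer_shapes]
            self.lr = lr

        def stream_weights(self, layer_idx):
            return self.weights[layer_idx]              # streamed out to the wafers

        def apply_gradients(self, layer_idx, grad):
            self.weights[layer_idx] -= self.lr * grad   # weight update happens here, not on-wafer

    def swarmx_reduce(per_system_grads):
        """SwarmX-style reduction: average the gradients coming back from each CS-3."""
        return np.mean(per_system_grads, axis=0)

    def train_step(memoryx, data_shards):
        # Data-parallel only: each CS-3 gets a different shard, the same weights are
        # broadcast layer by layer, and reduced gradients flow back to MemoryX.
        for layer_idx in range(len(memoryx.weights)):
            w = memoryx.stream_weights(layer_idx)
            grads = [shard.T @ (shard @ w) for shard in data_shards]   # toy per-system gradient
            memoryx.apply_gradients(layer_idx, swarmx_reduce(grads))

    memoryx = MemoryX(layer_shapes=[(16, 16), (16, 16)])
    shards = [np.random.randn(8, 16) for _ in range(4)]                # 4 systems, 4 data shards
    train_step(memoryx, shards)

The point the slides make is that nothing in this loop depends on how many systems sit behind swarmx_reduce, which is why model capacity and cluster size can grow independently.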
  • 49. © 2024 Cerebras Systems Inc. All Rights Reserved CS-3 Cluster Compute CS-2 Cluster 192 CS-2 systems 12 exaFLOPS AI Compute
  • 50. © 2024 Cerebras Systems Inc. All Rights Reserved • 2048 CS-3 in single cluster • 256 exaFLOPS AI Compute • Programs like a single device CS-3 Cluster Compute Supercomputer Performance, Single Device Experience
  • 51. © 2024 Cerebras Systems Inc. All Rights Reserved SwarmX: Purpose Built Scalable Network for AI Training, Scaling to 256 exaFLOPS. Scalable spine-leaf topology • Standards-based 400/800G Ethernet • Performance and cost effective • RDMA for low overhead and latency. Cluster options: CS-2 clusters: 192 systems, 100 Gb/s links, 1 Pb/s cluster bandwidth; CS-3 clusters: 2048 systems, 400/800 Gb/s links, 10 Pb/s cluster bandwidth
  • 52. © 2024 Cerebras Systems Inc. All Rights Reserved Train Today's SOTA Models in Hours or Days. LLaMA 70B training: ~1 month on Meta's GPU cluster vs. ~1 day on a Cerebras CS-3 cluster
  • 53. © 2024 Cerebras Systems Inc. All Rights Reserved Train Today's SOTA Models in Hours or Days. LLaMA 70B training: ~1 month on Meta's GPU cluster vs. ~1 day on a Cerebras CS-3 cluster. But the CS-3 cluster operates like a single device
  • 54. © 2024 Cerebras Systems Inc. All Rights Reserved CS-3 Cluster Memory. CS-2 MemoryX SKUs: 1.5 TB (30 billion parameters) and 12 TB (240 billion parameters)
  • 55. © 2024 Cerebras Systems Inc. All Rights Reserved MemoryX: The First Petabyte-Scale AI Memory System, enabling 100x larger models up to 24 trillion parameters. CS-3 MemoryX options, from enterprise to hyperscale SKUs: 1.5 TB (30B parameters), 12 TB (240B), 24 TB (480B), 36 TB (720B), 120 TB (2,400B), 1,200 TB (24,000B)
  • 56. © 2024 Cerebras Systems Inc. All Rights Reserved MemoryX Compute & State: Most Scalable and Efficient Model Memory, Enabling Multi-Trillion Parameter Models. Efficient hybrid state store • Weights stored in DDR5 and Flash • Perf and power/cost efficiency. Flexible compute • Optimizer and other ops run on CPU • General purpose and flexible • Support for all common ML ops. Cluster options: CS-2 MemoryX: 12 TB DDR4 DRAM (240B params), 1x CPU perf; CS-3 MemoryX: 36 TB DDR5 DRAM (720B params) plus 1.2 PB flash (24T params), 2x CPU perf
  • 57. © 2024 Cerebras Systems Inc. All Rights Reserved Large Cluster Memory on a Single Device
  • 58. © 2024 Cerebras Systems Inc. All Rights Reserved Train Tomorrow's Trillion+ Parameter Models. Imagine… LLaMA 1T training: ~1.5 years on 1000s of GPUs vs. ~3 weeks on a Cerebras CS-3 cluster. And the CS-3 cluster still operates like a single device
  • 59. © 2024 Cerebras Systems Inc. All Rights Reserved You Program It Like A Single Device No Matter The Cluster Size: whether a Wafer Scale Cluster has 1x CS-3, 4x CS-3, or 2048x CS-3 (each with its interconnect and memory), the user sees one big device
  • 60. © 2024 Cerebras Systems Inc. All Rights Reserved And Your Model Always Fits, at 1B or 1T Parameters: Llama 7B on a 1.5 TB cluster, Llama 70B on a 36 TB cluster, Llama 700B on a 1,200 TB cluster; in every case the user still sees one big device
  • 61. © 2024 Cerebras Systems Inc. All Rights Reserved Any Scale While Operating as a Single Device. Real world seamless cluster scaling • User: G42 • Model: Jais30B • Cluster: Condor Galaxy-1 • Experience: “It just worked” • No complex distributed software • No changes to parallelism model • No changes to hyper-parameters. Training SOTA large models everyday • Unique capability enabled by wafer-scale. Chart: Jais30B measured training speedup on CG-1, relative speedup vs. number of CS-2s (1 to 64), resulting in near linear scaling
  • 62. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras Design Philosophy: Massive Compute + Memory for Large Scale Models. GPU: external chip interconnect, low-perf high-power connections, custom proprietary switches, complex distributed software, hybrid model-parallel partitioning. Wafer Scale Engine: on-chip interconnect, “free” high-perf communication, big enough to run the largest models, simple data-parallel only scaling, disaggregated compute and memory
  • 63. © 2024 Cerebras Systems Inc. All Rights Reserved But we can and need to do even better…
  • 64. © 2024 Cerebras Systems Inc. All Rights Reserved But We Can and Need to Do Even Better: Sparsity Solves the Explosive Cost of Gen AI. 40,000x more compute in just 5 years. Current trajectory is unsustainable. We must find more efficient methods. Sparsity is the key. Chart: exaFLOPs to train, 2018-2024, for BERT, GPT-2, Megatron-LM, T5, T-NLG, GPT-3, Jurassic, Gopher, MT-NLG, Chinchilla, LLaMA, and GPT-4 (log scale, 100 to 100,000,000 exaFLOPs)
  • 65. © 2024 Cerebras Systems Inc. All Rights Reserved Neural Networks are Sparse. Sparsity opportunities are everywhere • Neural networks have native sparsity • e.g. ReLU or Dropout • Neural networks can be made sparse • e.g. sparse weights • Models are over-parameterized by design • Training is the act of discovering important weights. Training dense is wasteful and inefficient • But not all hardware can take advantage of all forms of sparsity
  • 66. © 2024 Cerebras Systems Inc. All Rights Reserved Sparsity Acceleration is Memory Bound. Memory bandwidth built for sparsity • Traditional hardware is built for dense: high data reuse, so caching works and little memory bandwidth is needed • Wafer-scale memory is built for sparse: low data reuse, so caching helps little and high memory bandwidth is needed • Enabled by orders of magnitude more memory bandwidth. Memory bandwidth (Byte/FLOP), required vs. available: dense MatMul requires ~0.001, H100 provides 0.003; sparse MatMul requires ~1, WSE-3 provides 2
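A small back-of-envelope sketch of why the Byte/FLOP figures above land where they do, assuming each matrix crosses the memory interface exactly once for a dense matmul (i.e. perfect on-chip reuse); the helper name and the 4096 sizes are illustrative assumptions.

    def dense_matmul_bytes_per_flop(m, k, n, bytes_per_elem=2):
        """Bytes moved per FLOP for C[m,n] = A[m,k] @ B[k,n] with perfect reuse."""
        flops = 2 * m * k * n                                   # one multiply + one add per MAC
        bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B once, write C once
        return bytes_moved / flops

    # A large FP16 matmul needs only ~0.0007 Byte/FLOP, so caches cover dense compute:
    print(round(dense_matmul_bytes_per_flop(4096, 4096, 4096), 4))

With unstructured sparsity that reuse largely disappears: a fetched weight may feed only a handful of FLOPs, which pushes the required intensity toward the ~1 Byte/FLOP figure on the slide and is why the comparison is H100 at 0.003 versus WSE-3 at 2.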
  • 67. © 2024 Cerebras Systems Inc. All Rights Reserved Accelerating All Forms of Sparse Training. Examples of sparse training opportunities • Dynamic activation sparsity • e.g. Google: 95% sparse ReLU FFN in LLMs1 • Structured weight sparsity • e.g. Mistral: 75% sparse FFN MoE 8x7B2 • Unstructured weight sparsity • e.g. Cerebras: 75% sparse SPDF GPT3. Solving unsustainable scaling for training • Only HW to accelerate all forms of sparsity • Even future sparse techniques. Chart: FLOP reduction from sparsity (dense vs. sparse relative FLOPs): ReLU 1.7x, MoE 2.0x, SPDF 2.8x. 1 Li et al., The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers, 2023. 2 Jiang et al., Mixtral of Experts, 2024. 3 Thangarasa et al., SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models, 2023
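For intuition on how FFN sparsity turns into the whole-model FLOP reductions charted above, here is a hypothetical back-of-envelope formula. The assumption that the FFN accounts for roughly two-thirds of transformer FLOPs is mine for illustration only; the slide's 1.7x / 2.0x / 2.8x figures come from the cited papers, not from this formula.

    def overall_flop_reduction(ffn_flop_share, ffn_sparsity):
        """Whole-model FLOP reduction when only the FFN blocks are sparsified.
        ffn_flop_share: assumed fraction of total FLOPs spent in the FFN.
        ffn_sparsity: fraction of FFN FLOPs removed by sparsity."""
        remaining = (1 - ffn_flop_share) + ffn_flop_share * (1 - ffn_sparsity)
        return 1 / remaining

    print(round(overall_flop_reduction(2 / 3, 0.75), 2))   # ~2.0x for a 75% sparse FFN
    print(round(overall_flop_reduction(2 / 3, 0.95), 2))   # ~2.73x for a 95% sparse FFN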
  • 68. © 2024 Cerebras Systems Inc. All Rights Reserved But sparsity can also transform inference on a variety of hardware…
  • 69. © 2024 Cerebras Systems Inc. All Rights Reserved Neural Magic + Cerebras Accelerated Inferencing for LLM Optimization Mark Kurtz CTO Neural Magic
  • 70. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved OUR LEADERSHIP Who are we? AI leader in model optimization and inference server acceleration MIT Professor of Electrical Engineering and Computer Science, ACM Fellow Nir Shavit Co-Founder MIT Research Scientist of Multicore Algorithms and Computational Connectomes Alex Matveev Co-Founder Chief Scientist Former VP of Product and CTO of Google Cloud, former CTO and EVP of Worldwide Engineering for RedHat Brian Stevens CEO of Neural Magic IST Austria Professor of Distributed Computing and Machine Learning Dan Alistarh Principal Research Scientist
  • 71. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Who are we? AI leader in model optimization and inference server acceleration • 200+ accepted papers • 60 patents • GPTQ • SparseGPT • Sparse Fine-Tuning • nm-vllm • DeepSparse • SparseML As a software-delivered solution, we have deep expertise across AI model training and optimization. We invented many of the current AI industry’s state-of- the-art techniques for quantization and sparsification. Our solutions include enterprise inference servers to open-source libraries and a sparsified models repo. OUR LEADERSHIP MIT Professor of Electrical Engineering and Computer Science, ACM Fellow Nir Shavit Co-Founder MIT Research Scientist of Multicore Algorithms and Computational Connectomes Alex Matveev Co-Founder Chief Scientist Former VP of Product and CTO of Google Cloud, former CTO and EVP of Worldwide Engineering for RedHat Brian Stevens CEO of Neural Magic IST Austria Professor of Distributed Computing and Machine Learning Dan Alistarh Principal Research Scientist
  • 72. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Who are we? AI leader in model optimization and inference server acceleration OUR LEADERSHIP MIT Professor of Electrical Engineering and Computer Science, ACM Fellow Nir Shavit Co-Founder MIT Research Scientist of Multicore Algorithms and Computational Connectomes Alex Matveev Co-Founder Chief Scientist Former VP of Product and CTO of Google Cloud, former CTO and EVP of Worldwide Engineering for RedHat Brian Stevens CEO of Neural Magic IST Austria Professor of Distributed Computing and Machine Learning Dan Alistarh Principal Research Scientist • 200+ accepted papers • 60 patents • GPTQ • SparseGPT • Sparse Fine-Tuning • nm-vllm • DeepSparse • SparseML As a software-delivered solution, we have deep expertise across AI model training and optimization. We invented many of the current AI industry’s state-of- the-art techniques for quantization and sparsification. Our solutions include enterprise inference servers to open-source libraries and a sparsified models repo.
  • 73. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Challenges with LLM deployment Deploying to production ! Issues include • Requires lots of compute • Requires lots of memory • Increases latency • Very demanding on inference serving infrastructure • Expensive to operate and support
  • 74. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Challenges with LLM deployment Deploying to production ! Issues include • Requires lots of compute • Requires lots of memory • Increases latency • Very demanding on inference serving infrastructure • Expensive to operate and support Options to resolve • Decrease the size of the LLM • Apply quantization to combat the accuracy issue when model size is reduced
  • 75. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Challenges with LLM deployment Deploying to production ! Issues include • Requires lots of compute • Requires lots of memory • Increases latency • Very demanding on inference serving infrastructure • Expensive to operate and support Options to resolve • Decrease the size of the LLM • Apply quantization to combat the accuracy issue when model size is reduced Llama 2 Size vs Accuracy
  • 76. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Challenges with LLM deployment Deploying to production ! Issues include • Requires lots of compute • Requires lots of memory • Increases latency • Very demanding on inference serving infrastructure • Expensive to operate and support Options to resolve • Decrease the size of the LLM • Apply quantization to combat the accuracy issue when model size is reduced Llama 2 Size vs Accuracy
  • 77. © 2024 Cerebras Systems Inc. All Rights Reserved Before Pruning The solution - Sparsity
  • 78. © 2024 Cerebras Systems Inc. All Rights Reserved The solution - Sparsity After Pruning Before Pruning
  • 79. © 2024 Cerebras Systems Inc. All Rights Reserved The solution - Sparsity • Preserves the model’s accuracy while reducing the size of the model Unstructured Sparsity: After Pruning • Improves inference and training performance Before Pruning
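The before/after pruning pictures on the last three slides correspond to unstructured sparsity: individual low-importance weights are zeroed wherever they sit. A minimal numpy sketch of magnitude pruning is below as a toy illustration; it is not SparseGPT, sparse pre-training, or any Cerebras/Neural Magic implementation.

    import numpy as np

    def magnitude_prune(weights, sparsity):
        """Zero out the smallest-magnitude fraction of weights (unstructured sparsity)."""
        k = int(round(sparsity * weights.size))
        if k == 0:
            return weights.copy(), np.ones_like(weights, dtype=bool)
        threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
        mask = np.abs(weights) > threshold        # keep only weights above the cutoff
        return weights * mask, mask

    w = np.random.randn(1024, 1024).astype(np.float32)
    w_sparse, mask = magnitude_prune(w, sparsity=0.70)
    print(f"non-zero weights remaining: {mask.mean():.1%}")   # ~30%

At 70% sparsity only about 30% of the values remain non-zero, which is where both the accuracy-recovery question and the inference and training speedups discussed on the following slides come from.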
  • 80. © 2024 Cerebras Systems Inc. All Rights Reserved Our research collaboration with Cerebras Create open-source sparse foundational models that organizations can easily deploy and use with faster inference.
  • 81. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta
  • 82. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens 1.7-2.4X Reduction in FLOPS
  • 83. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse 90% Accuracy Recovery
  • 84. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Off the Shelf Sparse Fine-Tuning Quantization with GPTQ Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse
  • 85. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse Off the Shelf Sparse Fine-Tuning Quantization with GPTQ Chat 50%, 70% Code Generation 50%, 70%
  • 86. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras Proprietary & Confidential Information Results Full recovery with 50% and 70% sparse models. Sparsity vs Accuracy for UltraChat 200k Sparsity vs Accuracy for Evol Code Alpaca
  • 87. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras Proprietary & Confidential Information Results 4.3X Memory Reduction Memory Usage vs Compression Level - Llama 2 7B
  • 88. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse Off the Shelf Sparse Fine-Tuning Quantization with GPTQ Chat 50%, 70% Code Generation 50%, 70%
  • 89. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Off the Shelf Sparse Fine-Tuning Quantization with GPTQ Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse Chat 50%, 70% Code Generation 50%, 70% Fine-Tuning Your Use Case Sparse Fine- Tuning for a few hours Quantization with GPTQ
  • 90. © 2024 Cerebras Systems Inc. All Rights Reserved Our process Off the Shelf Sparse Fine-Tuning Quantization with GPTQ DeepSparse Llama 2 2T Tokens Pretrained from Meta Sparse Pretraining Sparse GPT Sparse Pretraining on Cerebras 150B Tokens Fine-Tuning Your Use Case Sparse Fine- Tuning for a few hours Quantization with GPTQ Sparse Foundational Models Llama 2 7B Llama 2 7B 70% Sparse 50% Sparse Chat 50%, 70% Code Generation 50%, 70%
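The last box in the pipeline above is quantization with GPTQ before deployment on DeepSparse. GPTQ itself uses second-order (Hessian-based) error correction when choosing quantized values; the sketch below shows only plain round-to-nearest symmetric INT8 quantization per output channel, so the storage arithmetic is visible. The function names and the per-channel choice are illustrative assumptions, not the GPTQ algorithm or a Neural Magic API.

    import numpy as np

    def quantize_int8_per_channel(w):
        """Symmetric per-output-channel INT8 quantization of a [out, in] weight matrix.
        Returns int8 weights plus the per-channel scale needed to dequantize."""
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        scale = np.where(scale == 0, 1.0, scale)          # avoid divide-by-zero rows
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale.astype(np.float32)

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 1024).astype(np.float32)
    q, scale = quantize_int8_per_channel(w)
    err = np.abs(dequantize(q, scale) - w).mean()
    print(q.dtype, f"mean abs error {err:.4f}", f"{w.nbytes / q.nbytes:.0f}x smaller")

Going from FP32 to INT8 alone is a 4x reduction in weight storage, in the same ballpark as the memory-reduction figure quoted a few slides earlier; sparse storage formats can push it further.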
  • 91. © 2024 Cerebras Systems Inc. All Rights Reserved Local inference performance With sparsity, real time chat is now possible on local CPUs. Single Stream Token Generation - Llama 2 7B Single Stream Latency - Llama 2 7B
  • 92. © 2024 Cerebras Systems Inc. All Rights Reserved Server inference performance With sparsity, CPU performance is competitive with GPUs. Single Stream Decode Performance - Llama 2 7B Multi Stream Decode Performance - Llama 2 7B
  • 93. © 2024 Cerebras Systems Inc. All Rights Reserved Comparison, using Neural Magic DeepSparse on an 8-core AMD Genoa CPU: unoptimized model (Llama 2 7B, FP32): 2 tokens/second; sparse quantized model (Llama 2 7B, 70% sparse, INT8): 20 tokens/second
  • 94. © 2024 Cerebras Systems Inc. All Rights Reserved Key takeaways. Takeaway 1: Run SOTA models in real time on just a laptop with Neural Magic DeepSparse, up to 4x faster than llama.cpp. Takeaway 2: Transform your infrastructure with just software to support LLMs, with up to 7x more inference streams per server than llama.cpp at the same performance level. Takeaway 3: Train sparse models faster with Cerebras, with 2x faster sparse training
  • 95. © 2024 Cerebras Systems Inc. All Rights Reserved © 2024 Cerebras Systems Inc. All Rights Reserved Next steps Neural Magic’s Hugging Face Organization Cerebras Blog • Arxiv paper with our current results • Larger models • Higher sparsities • INT4 quantization support • Combine with parameter efficient fine-tuning Stay tuned for more collaboration with Cerebras Neural Magic Docs
  • 96. © 2024 Cerebras Systems Inc. All Rights Reserved Thank you Follow us to stay current on all things Neural Magic, including product updates, ML research developments, and more. @neuralmagic Join our Community Engage with fellow ML practitioners. Ask questions, share feedback, and improve the way you use Neural Magic. Connect with Neural Magic to stay up to date with #SoftwareDelivered AI. neural-magic
  • 97. © 2024 Cerebras Systems Inc. All Rights Reserved Models & Product Jessica Liu, VP of Product, Cerebras
  • 98. © 2024 Cerebras Systems Inc. All Rights Reserved The goal of AI training: make the loss curve go down
  • 99. © 2024 Cerebras Systems Inc. All Rights Reserved ⚠ But it’s not so simple...
  • 100. © 2024 Cerebras Systems Inc. All Rights Reserved This happens all the time
  • 101. © 2024 Cerebras Systems Inc. All Rights Reserved Model performance can vary greatly
  • 102. © 2024 Cerebras Systems Inc. All Rights Reserved Challenges of large GenAI training & fine-tuning: 1. Distribution, 2. ML complexity, 3. Cost. Failure modes include out of memory, GPU failures, numerics bugs, and low utilization. Lots of time and cost riding on "getting the big run right"
  • 103. © 2024 Cerebras Systems Inc. All Rights Reserved How to get good model quality at scale Run Experiments Pick Winners Scale Up Design the Experiments 1.3 B 500M
  • 104. © 2024 Cerebras Systems Inc. All Rights Reserved How to get good model quality at scale Run Experiments Pick Winners Scale Up Design the Experiments 1.3 B 3B
  • 105. © 2024 Cerebras Systems Inc. All Rights Reserved How to get good model quality at scale: Design the Experiments, Run Experiments, Pick Winners, Scale Up. 3B, 7B: good config for 13B, 30B
  • 106. © 2024 Cerebras Systems Inc. All Rights Reserved Run Experiments Pick Winners Scale Up Design the Experiments How to get good model quality at scale Time / Work .5 B 3 B 13 B 100B
  • 107. © 2024 Cerebras Systems Inc. All Rights Reserved How to get good model quality at scale (on GPUs): Design the Experiments, Run Experiments, Pick Winners, Scale Up (time / work). 0.5B on 1 GPU; 3B on 8 GPUs with data parallelism; 13B on 256 GPUs with data & tensor & pipeline parallelism; 100B on 2048 GPUs with data & tensor & pipeline & expert & sequence parallelism
  • 108. © 2024 Cerebras Systems Inc. All Rights Reserved You have to micromanage the distribution strategy: • Tensor or pipeline model parallelism • Distributed data parallelism • Expert parallelism • Interleaved pipelining schedule • Activation checkpointing & recomputation • Interplay among model size, cluster size, connectivity between nodes, number of nodes, etc. Scaling frameworks still require tons of work
  • 109. © 2024 Cerebras Systems Inc. All Rights Reserved Lines of Code ---------------------------- Python 18395 C/C++ 1118 C++ 649 CUDA 220 HTML 107 Bourne Shell 9 make 7 Markdown 1 Text 1 ---------------------------- Total 20507 ---------------------------- Nvidia’s GPT-175B Model 20,000 lines of code, weeks to implement Hard to debug You have to micromanage the distribution strategy: • Tensor or pipeline model parallelism • Distributed data parallelism • Expert parallelism • Interleaved pipelining schedule • Activation checkpointing & recomputation • Interplay among model size, cluster size, connectivity between nodes, number of nodes, etc. Scaling frameworks still require tons of work
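To make the "micromanage the distribution strategy" list concrete, here is a small, hypothetical helper that enumerates the (data, tensor, pipeline) degrees a GPU stack forces you to choose among for one model/cluster pair. Nothing here is Megatron's or any framework's real API, and the 18-bytes-per-parameter memory model is a deliberately crude assumption; the point is the combinatorial tuning burden, which does not exist in the Cerebras flow shown on the following slides.

    import itertools

    def candidate_parallelism_plans(num_gpus, num_layers, params_b, gpu_mem_gb=80,
                                    bytes_per_param=18):
        """Enumerate (data, tensor, pipeline) degrees whose product divides num_gpus and
        whose per-GPU share of weights + optimizer state fits in memory.
        bytes_per_param=18 is a rough mixed-precision Adam estimate (assumption)."""
        plans = []
        for tp, pp in itertools.product([1, 2, 4, 8], [1, 2, 4, 8, 16]):
            if num_gpus % (tp * pp) or num_layers % pp:
                continue
            dp = num_gpus // (tp * pp)
            per_gpu_gb = params_b * bytes_per_param / (tp * pp)
            if per_gpu_gb <= gpu_mem_gb * 0.8:            # leave room for activations
                plans.append((dp, tp, pp, round(per_gpu_gb, 1)))
        return plans

    # A 100B-parameter model on 256 GPUs still leaves several plans to benchmark,
    # and changing cluster size or sequence length reshuffles the ranking.
    for plan in candidate_parallelism_plans(256, num_layers=96, params_b=100):
        print("data x tensor x pipeline =", plan[:3], "~GB/GPU:", plan[3])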
  • 110. © 2024 Cerebras Systems Inc. All Rights Reserved Cut experiment iteration time from weeks to a day Lines of Code ---------------------------- Python 18395 C/C++ 1118 C++ 649 CUDA 220 HTML 107 Bourne Shell 9 make 7 Markdown 1 Text 1 ---------------------------- Total 20507 ---------------------------- Lines of Code ---------------------------- Python 565 C/C++ 0 C++ 0 CUDA 0 HTML 0 Bourne Shell 0 make 0 Markdown 0 Text 0 ---------------------------- Total 565 ---------------------------- Cerebras’ GPT-175B Model 565 lines of code, 1 Day to implement "GPT-3 in 565 lines of code" Blog Nvidia’s GPT-175B Model 20,000 lines of code, weeks to implement Hard to debug
  • 111. © 2024 Cerebras Systems Inc. All Rights Reserved How to scale from 1B to 70B on Cerebras
    gpt3_1b_params.yaml:
      ### GPT-3 XL 1.3B
      hidden_size: 2048
      num_hidden_layers: 24
      num_heads: 16
    Training: python run.py --params gpt3_1b_params.yaml --num_steps=100 --model_dir=model_dir
    llama2_70b_params.yaml:
      ### Llama-2 70B
      hidden_size: 8192
      num_hidden_layers: 80
      num_heads: 64
    Training: python run.py --params llama2_70B_params.yaml --num_steps=100 --model_dir=model_dir
  • 112. © 2024 Cerebras Systems Inc. All Rights Reserved Scaling from one CS-3 to a cluster is a 1-line change
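A sketch of what that one-line change can amount to. The num_csx key and the values below are assumptions for illustration (the exact knob in the Cerebras software stack may be named differently); the model definition, hyperparameters, and training script are identical in both cases.

    # Two hypothetical run configurations: same model, same hyperparameters,
    # differing only in how many CS-3 systems the job targets.
    single_system = {"model": "llama2_70b", "batch_size": 1024, "num_csx": 1}
    sixteen_systems = {"model": "llama2_70b", "batch_size": 1024, "num_csx": 16}

    changed_keys = [k for k in single_system if single_system[k] != sixteen_systems[k]]
    print(changed_keys)   # ['num_csx'] -- the only difference between the two runs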
  • 113. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras gets you to high-quality large models faster & more cheaply: on CS-3, data parallel only at any model size. Design Sweeps, Run Experiments, Pick Winners, Scale Up (time / work), from 0.5B to 3B to 13B to 100B
  • 114. © 2024 Cerebras Systems Inc. All Rights Reserved On GPUs, small models are the default; large models take large engineering effort. On CS-3s, large models are the default; small models come for free.
  • 115. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras Proprietary & Confidential Information Med42: Llama-70B Fine-tuned in <1 Week to Pass the US Medical License Exam • Scored 72% on USMLE, beating GPT-3.5 • With M42: global healthcare company with over 450 hospitals and clinics • Custom curated healthcare dataset of peer-reviewed papers, medical textbooks, international health agency datasets. • Run finished in 1 weekend
  • 116. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras Proprietary & Confidential Information FLOR-6.3B: State-of-the-Art Catalan, Spanish, and English LLM • Best Catalan model, beating BLOOM-7.3B • Used latest language adaptation techniques for languages with less training data • Reduced inference cost by 10% vs. BLOOM, incorporating a new, more efficient tokenizer • Used to build RAG systems for specialized domains • Trained on 140B tokens in 2.5 days • Open source: downloaded over 3,000 times
  • 117. © 2024 Cerebras Systems Inc. All Rights Reserved JAIS-30B: State-of-the-Art Arabic-English Bilingual LLM • SoTA Arabic: Outperforms all other Arabic models • English: Llama-30B quality in English • Co-developed with G42’s Core42 and MBZUAI • Now on Azure AI Cloud as the foundation of their Model-as-a-Service in the Middle East Checkpoints on HuggingFace Paper available on Arxiv
  • 118. © 2024 Cerebras Systems Inc. All Rights Reserved Challenges: (1) Few high-quality Arabic datasets and preprocessing pipelines; (2) Tokenizers trained on English corpora don't extend well to Arabic; (3) Want highest quality model with best cost and compute efficiency. What we did: used the latest ML techniques (ALiBi, SwiGLU activation, muP, scaling laws); ran many tuning experiments on models of 590M, 1.3B, 2.7B, 6.7B; trained a custom tokenizer with a new vocab optimized for cross-lingual alignment; built a new multi-lingual dataset, experimenting with mixes of Arabic-only and Arabic, English, and code, to find the optimal mix (1:2:0.4)
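As a concrete reading of the 1:2:0.4 Arabic:English:code mix mentioned above, the ratio normalizes into sampling proportions as follows (a toy illustration; the actual JAIS data pipeline involves much more than this).

    mix_ratio = {"arabic": 1.0, "english": 2.0, "code": 0.4}
    total = sum(mix_ratio.values())
    proportions = {name: round(weight / total, 3) for name, weight in mix_ratio.items()}
    print(proportions)   # {'arabic': 0.294, 'english': 0.588, 'code': 0.118}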
  • 119. © 2024 Cerebras Systems Inc. All Rights Reserved "I’ve found it really easy to experiment at every model size and scale on multiple CS systems, which we need to do to get the best results. There’s no difference between running a job on a single CS versus multiple ones. All it takes is a small config change, and everything just works with observable linear speedup! Launched my first distributed LLM training within the first hour of logging into a CS cluster for the first time!” Neha Sengupta, Principal Applied Scientist, Core42
  • 120. © 2024 Cerebras Systems Inc. All Rights Reserved Jais-30B-v3 sets new record for open-source Arabic LLMs, finishes training on 1.3 Trillion tokens. Jais-30B outperforms on all common NLP benchmarks in Arabic (MMLU / Hellaswag / ARC-C / TruthfulQA): Jais-30b-chat: 35.1 / 59.3 / 39.1 / 53.1; acegpt-13b-chat: 31.2 / 49.2 / 35.1 / 48.2; BLOOMz (7.1B): 31.0 / 38.1 / 30.2 / 48.4; LLaMA (30B): 28.9 / 33.9 / 26.9 / 48.4; falcon-40b_instruct: 28.6 / 32.1 / 26.4 / 49.3
  • 121. © 2024 Cerebras Systems Inc. All Rights Reserved The Future is Multimodal
  • 122. An explosion of exploration in multimodality Source: Recent advances in Multimodal LLMs
  • 123. © 2024 Cerebras Systems Inc. All Rights Reserved • Generalized support for Visual Q&A: • Multiple vision encoders • Multiple LLM backbones • Cross-projection learning • Multiple modalities to an LLM backbone • Easy scaling for model size and context length • Easy to configure many leading literature models (e.g. LLaVA, AnyMAL, Eyes Wide Shut) • Dataset: support for quick import of custom datasets Multimodality is easy on Cerebras Multimodal Output CLIP Llama SigLIP DinoV2 Mistral Zephyr Plug & play vision & LLM backbones
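The "plug & play" pairing of vision encoders with LLM backbones described above typically reduces to a small projection module between the two, as in LLaVA-style models: encode the image into patch embeddings, project them into the LLM's embedding space, and prepend them to the text tokens. A minimal numpy sketch is below; the dimensions, class name, and the plain linear projector are illustrative assumptions, not the ModelZoo implementation.

    import numpy as np

    class LinearProjector:
        """Maps vision-encoder patch embeddings into the LLM's embedding space,
        so image tokens can be concatenated with text tokens."""
        def __init__(self, vision_dim, llm_dim, rng):
            self.w = rng.standard_normal((vision_dim, llm_dim)) * 0.02
            self.b = np.zeros(llm_dim)

        def __call__(self, patch_embeddings):
            return patch_embeddings @ self.w + self.b

    rng = np.random.default_rng(0)
    vision_patches = rng.standard_normal((576, 1024))   # e.g. a CLIP-style patch grid
    text_tokens = rng.standard_normal((32, 4096))       # e.g. Llama-style text embeddings
    projector = LinearProjector(vision_dim=1024, llm_dim=4096, rng=rng)
    llm_input = np.concatenate([projector(vision_patches), text_tokens], axis=0)
    print(llm_input.shape)   # (608, 4096): image tokens followed by text tokens

Swapping the vision encoder or the LLM backbone then mostly means changing the two dimensions and the pretrained weights on either side of this glue.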
  • 124. © 2024 Cerebras Systems Inc. All Rights Reserved Demo
  • 125. © 2024 Cerebras Systems Inc. All Rights Reserved Demo
  • 126. © 2024 Cerebras Systems Inc. All Rights Reserved Reproducing state-of-the-art results in just a couple weeks. 7B parameter models (GQA / VQA(t) / VQA(v2) / POPE): LLaVA1.5 (7B): 62.0 / 58.2 / 78.5 / 85.9; Cerebras-LLaVA 1.5 (7B): 62.3 / 58.2 / 78.5 / 85.3; SGPT4V (7B): 63.3 / 60.4 / 80.6 / not reported; Cerebras-SGPT4V (7B): 63.5 / 60.8 / 80.7 / 85.7. 13B parameter models (GQA / VQA(t) / VQA(v2) / POPE): LLaVA1.5 (13B): 63.3 / 61.3 / 80.0 / 85.9; Cerebras-LLaVA 1.5 (13B): 64.2 / 63.4 / 82.0 / 85.8
  • 127. Reproducing state-of-the-art results in just a couple weeks. Improving (POPE / GQA / VQAt / MME / VQAv2): CS3-LLaVA-7B: 86.7 / 63.9 / 61.5 / 1573 / 81.4; LLaVA 1.5 13B HD: 86.3 / 64.7 / 62.5 / 1500 / 81.8. The 7B model is competitive with LLaVA 1.5 13 Billion HD, which is 2x larger and takes 1.7x higher resolution image input, and which came out <2 months ago
  • 128. © 2024 Cerebras Systems Inc. All Rights Reserved Get started quickly with Cerebras ModelZoo Model code with flexible configuration setup • Different image encoders: • CLIP • SigLIP • Dino v2 • Different LLM backbones: • LLaMA • Mistral • Zephyr • Different training recipes: • LLaMA Pro • Eyes Wide Shut • Freezing different parts of the model Prepared Datasets • LLAVA 1.5, ShareGPT4V, Instruct4V • ChartQA, DocVQA, DVQA, ArxivQA, AI2Diagrams Data pre-processing scripts • HDF5 file generation support • Handles mix of multimodal and text-only data • Optimized for high-throughput training Easy scaling for model and data • LLM model size • Long context lengths • Image resolution and patch size
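A small sketch of the kind of fixed-length HDF5 shard such pre-processing scripts produce, assuming sequences that are already tokenized; the dataset names, the loss-mask convention, and the packing scheme here are illustrative assumptions rather than the ModelZoo's actual schema.

    import h5py
    import numpy as np

    def write_hdf5_shard(path, token_sequences, max_seq_len=128, pad_id=0):
        """Pack variable-length token sequences into fixed-length rows with a loss mask."""
        n = len(token_sequences)
        input_ids = np.full((n, max_seq_len), pad_id, dtype=np.int32)
        loss_mask = np.zeros((n, max_seq_len), dtype=np.int8)
        for i, seq in enumerate(token_sequences):
            seq = seq[:max_seq_len]                      # truncate anything too long
            input_ids[i, :len(seq)] = seq
            loss_mask[i, :len(seq)] = 1                  # only real tokens contribute to the loss
        with h5py.File(path, "w") as f:
            f.create_dataset("input_ids", data=input_ids, compression="gzip")
            f.create_dataset("loss_mask", data=loss_mask, compression="gzip")

    write_hdf5_shard("shard_000.h5", [[5, 17, 902, 3], [8, 8, 24]])

Writing many such shards up front is what makes high-throughput, sequential reads possible during training, including when multimodal and text-only samples are mixed.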
  • 129. © 2024 Cerebras Systems Inc. All Rights Reserved Model Checkpoints Available on HuggingFace 7B – available now 13B – available now 70B – end of March!
  • 130. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras’ goal is to bring State-of-the-Art AI to every organization
  • 131. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras solutions meet you wherever you need Cerebras Wafer Scale Clusters Cerebras Cloud Cerebras AI Solutions
  • 132. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras AI Model Services GenAI Success with Cerebras ML Experts on the Fastest, Most Efficient Platform • Speed: Multi-Billion param models in days to weeks. • Tailored to you: Custom chatbots, VQA Systems, Code Completion, Foundation models, and more • All the latest ML Techniques: RAG, DPO, LoRA, MuP, data augmentation, and more. • Total Ownership: Your data, your model weights.
  • 133. © 2024 Cerebras Systems Inc. All Rights Reserved Models on Cerebras From multi-lingual LLMs to healthcare chatbots to code models.
  • 134. © 2024 Cerebras Systems Inc. All Rights Reserved All the Latest ML Techniques & Recipes Variable Seq Training DPO LL360 – Open data, models, scripts Multi-lingual Pre-training & IFT Llama70B fine tuning Domain Adaptation GPT-3 in 565 lines of code Most FLOP efficient LLM dataset First family of open GPT models and OSS use of muP RAG LoRA MoE Multi Modal Sparse Models
  • 135. © 2024 Cerebras Systems Inc. All Rights Reserved The model belongs to you Your data stays with you
  • 136. © 2024 Cerebras Systems Inc. All Rights Reserved Cerebras AI Supercomputers: Exascale compute with the programmability of a single device, available in the cloud or on-prem
  • 137. © 2024 Cerebras Systems Inc. All Rights Reserved AI Applications & Research Panel Andy Hock, SVP Product & Strategy, Cerebras
  • 138. Cerebras AI Applications & Research Panel: Praneetha Elugunti, Mayo Clinic; Jim Culver, GSK; Tim Bishop, Mayo Clinic; Irina Rish, University of Montreal; Andy Hock, Cerebras
  • 139. Cerebras x Qualcomm Fireside Chat with Rashid Attar, VP of Cloud Computing, Qualcomm
  • 140.
  • 141. Cerebras x Qualcomm Technology Partnership Reducing Inference Cost by 10x Cerebras CS-3 AI Training Qualcomm Cloud AI100 Ultra AI Inference
  • 142. Jointly optimized software stack for cost efficient LLMs Cerebras Stack Qualcomm Stack Sparse training Sparse inference Train in FP16 Compile & run in MX6 Train large + small models Apply speculative decoding Network Architecture Search Compile & run on Ultra AI 100
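Of the techniques in the joint stack above, speculative decoding is the one that changes the decoding loop itself: a small draft model proposes a few tokens and the large target model verifies them. Below is a greedy-acceptance sketch with stand-in next-token functions; real deployments verify the draft in one batched target pass and use a probabilistic acceptance rule, and none of the names here come from the Cerebras or Qualcomm stacks.

    def speculative_decode(target_next, draft_next, prompt, num_new_tokens, k=4):
        """Greedy speculative decoding sketch. target_next / draft_next map a token
        list to the next token id. The draft proposes k tokens; the target keeps the
        prefix it agrees with and always contributes at least one token itself."""
        tokens = list(prompt)
        goal = len(prompt) + num_new_tokens
        while len(tokens) < goal:
            draft_tokens = []
            for _ in range(k):                                # cheap autoregressive draft
                draft_tokens.append(draft_next(tokens + draft_tokens))
            accepted = []
            for i in range(k):                                # verification by the target
                t = target_next(tokens + accepted)
                accepted.append(t)                            # target's token is always kept
                if t != draft_tokens[i]:
                    break                                     # first disagreement ends the block
            tokens.extend(accepted)
        return tokens[:goal]

    # Toy stand-ins: the target counts up by one; the draft occasionally guesses wrong.
    target = lambda toks: toks[-1] + 1
    draft = lambda toks: toks[-1] + 1 if len(toks) % 7 else toks[-1] + 2
    print(speculative_decode(target, draft, prompt=[0], num_new_tokens=10))

The output matches plain greedy decoding with the target model; the speedup comes from batching the verification calls, not from changing what is generated.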
  • 143. Cerebras x Qualcomm: Up to 10x Inference Performance. Total tokens / $: Baseline 1x, Speculative Decoding 1.8x, MX6 Compression 2.2x, Neural Architecture Search 2.5x, Sparsity 2.5x, combined ~10x
  • 144. Cerebras x G42 Fireside Chat with Kiril Evtimov, Group CTO G42 & CEO Core42
  • 145. G42 across the Entire AI Value Chain Customer & Industry Tailored Solutions Data Centers Compute Infrastructure Cloud Platforms AI Model Development Cloud & Enterprise AI Deployment Application Development
  • 146. The world's largest open-source Arabic LLM: a 30B parameter, bilingual Arabic-English model, trained on 476B Arabic tokens and 1.63T total tokens on the Condor Galaxy 1 and 2 AI Supercomputers