The document summarizes a full-day forum on CXL hosted by the CXL Consortium and MemVerge at Flash Memory Summit. The morning agenda includes presentations on CXL from representatives of Google, Intel, PCI-SIG, Marvell, Samsung, SK hynix, and Micron. The afternoon agenda includes talks and panels on CXL usage models from VMware, AMD, MemVerge, Meta, and OCP. A keynote presentation provides an update on the CXL Consortium and the recently released CXL 3.0 specification, including its expanded fabric capabilities and management features. The specification aims to enable new usage models for memory sharing and expansion, addressing the industry trend toward ever-increasing data-processing demands.
1. CXL™: Getting Ready for Take-Off
Full-Day Forum at Flash Memory Summit
Hosted by The CXL Consortium and MemVerge
Slides and video now available at https://memverge.com/cxl-forum/
2. Morning Agenda
| Start | End   | Name             | Title                                                                      | Organization            |
|-------|-------|------------------|----------------------------------------------------------------------------|-------------------------|
| 8:35  | 8:50  | Siamak Tavallaei | President, CXL Consortium; Chief Systems Architect, Google Infrastructure  | Google / CXL Consortium |
| 8:50  | 9:10  | Willie Nelson    | Technology Enabling Architect                                              | Intel                   |
| 9:10  | 9:30  | Steve Glaser     | Principal Engineer; PCI-SIG Board Member                                   | NVIDIA / PCI-SIG        |
| 9:30  | 9:50  | Shalesh Thusoo   | VP, CXL Product Development                                                | Marvell                 |
| 9:50  | 10:10 | Jonathan Prout   | Sr. Manager, Memory Product Planning                                       | Samsung                 |
| 10:10 | 10:30 | Uksong Kang      | Vice President, DRAM Product Planning                                      | SK hynix                |
| 10:30 | 10:50 | Ryan Baxter      | Sr. Director of Marketing                                                  | Micron                  |

Session SPOS-101-1 on the FMS program
3. Afternoon Agenda
| Start | End  | Name             | Title                                                                                      | Organization            |
|-------|------|------------------|--------------------------------------------------------------------------------------------|-------------------------|
| 3:25  | 3:45 | Arvind Jagannath | Cloud Platform Product Management                                                          | VMware                  |
| 3:45  | 4:05 | Mahesh Wagh      | Senior Fellow                                                                              | AMD                     |
| 4:05  | 4:25 | Charles Fan      | CEO & Co-founder                                                                           | MemVerge                |
| 4:25  | 4:45 | Manoj Wadekar    | SW-Defined Memory Workstream Lead, OCP; Storage Architect, Meta                            | Meta / OCP              |
| 4:45  | 5:10 | Siamak Tavallaei | Panel Moderator; President, CXL Consortium; Chief Systems Architect, Google Infrastructure | Google / CXL Consortium |
| 5:10  | 5:35 | Chris Mellor     | Panel Moderator; Editor                                                                    | Blocks and Files        |

Session SPOS-102-1 on the FMS program
4. Update from the CXL Consortium
Siamak Tavallaei
President, CXL Consortium
Chief Systems Architect, Google Systems Infrastructure
40. AGENDA
Cache Coherence for Accelerators
Expansion Memory for CPUs
Flexible Tiered Memory Configurations
Security
41. CPU-GPU CACHE COHERENCE
Unified programming model across CPU architectures
CPU-GPU coherence provides programmability benefits: ease of porting applications to GPU, and rapid development for new applications
Grace + Hopper Superchip introduces cache-coherent programming to GPUs
CXL enables the same programming benefits for our GPUs in systems based on 3rd-party CPUs
[Diagram: Grace CPU paired with Hopper GPU over coherent NVLink C2C; x86/Arm CPU paired with NVIDIA GPU over a coherent CXL link]
42. PROGRAMMABILITY BENEFITS
CXL CPU-GPU cache coherence lowers the barrier to entry
Without Shared Virtual Memory (SVM) + coherence, nothing works until everything works
Enables a single allocator for all types of memory: host, host-accelerator coherent, accelerator-only
Eases porting complicated pipelines in stages; many SW layers exist between frameworks and drivers
Example: start with malloc, and keep using malloc until you choose otherwise
Vendor-provided allocators remain fully supported and functional
Workloads are pushing scaling boundaries; fine-grained synchronization is on the rise
Synchronization latency matters: avoid setup latency and synchronize in memory when possible
Host/device synchronization can live in the device's memory
Concurrent algorithms and data structures become available
Example: full C++ atomics support across host and device (see the sketch after this slide)
Locks: any suballocation can be used for synchronization, regardless of placement
[Chart: application performance vs. programming effort, comparing a v1 port with SVM + coherence (working sooner, tuned incrementally) against a v1 port without SVM or coherence]
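The "full C++ atomics across host and device" point can be illustrated with a short sketch. This is not vendor code: the hardware coherence CXL provides is what lets plain `std::atomic` operations work on memory visible to both sides, and the "device" below is simulated with a second host thread so the example is runnable anywhere; on a real system the `SharedRegion` would be placed in coherent shared (e.g., CXL) memory.

```cpp
// Minimal sketch: lock-free handoff between a host thread and a device-side
// agent, relying on hardware cache coherence so ordinary C++ atomics work.
#include <atomic>
#include <cstdio>
#include <thread>

struct SharedRegion {
    std::atomic<bool> ready{false};  // synchronization flag in shared memory
    int payload = 0;                 // data handed from producer to consumer
};

int main() {
    SharedRegion region;  // assumption: allocated from coherent CXL memory

    std::thread device_side([&] {
        // Consumer: spin until the producer publishes, then read payload.
        while (!region.ready.load(std::memory_order_acquire)) { /* spin */ }
        std::printf("device saw payload = %d\n", region.payload);
    });

    region.payload = 42;                                  // write the data
    region.ready.store(true, std::memory_order_release);  // publish it

    device_side.join();
    return 0;
}
```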
43. CXL FOR CPU MEMORY EXPANSION
SOC DDR channel count is becoming constrained
CXL-enabled PCIe ports can be used for additional memory capacity
Flexibility in the underlying media choice, trading off capacity/latency/persistence:
DRAM
DRAM + cache
Storage-class memory
DDR/SCM + NVMe
[Diagram: two host SOCs, each with multiple DDR channels plus several CXL memory devices attached through CXL-enabled PCIe ports]
44. CXL FOR MEMORY DISAGGREGATION
Currently, data center servers are often over-provisioned with memory
All hosts must have enough DRAM to handle the demands of worst-case workloads
Under less memory-intensive workloads, that DRAM sits unused and wasted
DRAM is very expensive at data center scale
Large banks of CXL memory can be distributed among several hosts
Memory pools may be attached to hosts via CXL switches, or directly attached using multi-port memory devices
Pooling: each host is allocated a portion of the disaggregated memory; pools can be reallocated as needed, reducing over-provisioning on each host while retaining flexibility for workloads with differing memory demands
Sharing: address ranges that may be accessed by multiple hosts simultaneously; coherence may be provided in hardware by the CXL device or may be software-managed (a sketch of the software-managed case follows below)
[Diagram: three hosts attached through a CXL switch fabric to two CXL memory pools]
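For the software-managed coherence case mentioned above, one common pattern is for the writing host to explicitly write back its cache lines before signaling peers. The sketch below shows that idea with x86 flush intrinsics; it is an illustration, not the CXL specification's mechanism, and it assumes `base` already points into a shared CXL region mapped elsewhere.

```cpp
// Sketch of software-managed coherence for a shared CXL range on x86: the
// writer flushes its cached lines so a peer host (outside this host's
// hardware coherence domain) can observe the data.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;

void publish_range(void* base, std::size_t bytes) {
    auto addr = reinterpret_cast<std::uintptr_t>(base);
    // Flush every cache line covering [base, base + bytes).
    for (std::uintptr_t p = addr & ~std::uintptr_t{kCacheLine - 1};
         p < addr + bytes; p += kCacheLine) {
        _mm_clflush(reinterpret_cast<void*>(p));  // write back + invalidate
    }
    _mm_sfence();  // order the flushes before any subsequent "ready" signal
}
```

A "ready" flag (itself flushed, or exchanged over another channel) would follow the flush, which is exactly the bookkeeping that hardware-coherent sharing removes.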
45. CXL FOR GPU EXPANSION MEMORY
Tackling AI with very large memory capacity demands
Accelerator workloads with large memory footprints are currently challenged: constrained by the bandwidth available to the host over PCIe, and contending with host SW for memory bandwidth
CXL memory expanders may be directly attached to accelerators for private use
Tiered memory for GPUs: HBM and CXL trade off bandwidth, capacity, cost, and flexibility
[Diagram: GPU with HBM, a coherent CPU-GPU CXL link to the host, and private GPU-memory CXL link(s) to a CXL memory expander]
46. CXL FOR GPU MEMORY POOLING
Streamlined accelerator data sharing
Memory pools may provide the flexibility to apportion memory to individual GPUs as needed
Provides a solution for workloads where capacity is important and bandwidth is secondary
Large data sets can be stored in CXL memory and shared as needed among accelerators, without burdening the interface to the host
[Diagram: three hosts and three HBM-equipped GPUs attached through a CXL switch fabric to two CXL memory pools]
47. SHARED EXPANSION MEMORY
CPU-GPU shared memory pools
CXL enables sharing of expansion memory between host and GPU
Future capabilities may allow expansion memory to be shared simultaneously among hosts, and between hosts and accelerators
Flexibility in provisioning under varying demands; ease of programming model
The CXL switch could be a local physical switch, or a virtual switch over another physical transport enabling remote disaggregated memory
[Diagrams: a host and GPU sharing CXL memory through a CXL switch; multiple hosts and GPUs sharing several CXL memory pools]
48. CXL FOR CONFIDENTIAL COMPUTING
Vision for secure accelerated computing
Confidential computing components will be partitionable and assignable to Trusted Execution Environment Virtual Machines (TVMs)
TVMs can create their own secure virtual environments including host resources, an accelerator partition, and shared memory partitions
Data transfers are encrypted and integrity protected
Components are securely authenticated
Partitions are secure from accesses by untrusted entities: other VMs/TVMs, firmware, and the VMM
All CXL capabilities are enabled in secure domains
[Diagram: a confidential-compute host and confidential-compute GPU connected over CXL, with TVMs each owning a GPU partition and memory partitions drawn from shared memory pools]
63. Jonathan Prout
Senior Manager, Memory Product Planning
Samsung Electronics

Uksong Kang
Vice President, DRAM Product Planning
SK Hynix

Ryan Baxter
Sr. Director of Marketing
Micron
65. Agenda
Industry Trends and Challenges
CXL™ (Compute Express Link) Introduction
CXL™ Memory Use Cases
Samsung's CXL™-based Memory Expander and SMDK (Scalable Memory Development Kit)
66. Industry Trends and Challenges
Massive demand for data-centric technologies and applications: artificial intelligence, big data, edge, cloud, 5G
Memory bandwidth and density are not keeping up with increasing CPU core count
A next-generation interconnect is needed for heterogeneous computing and server disaggregation
67. Industry Trends and Challenges
[Chart: normalized growth rate, 2012-2021, of CPU core count vs. memory channel bandwidth per core; the widening gap shows that a new memory scaling solution is needed]
68. CXL™ Introduction
CXL is a high-performance, low-latency protocol that leverages the PCIe physical layer: a CXL card uses the same processor, PCIe connector, and PCIe channel as a PCIe card
CXL is an open industry standard with broad industry support

| Device class | Type 1: Caching Devices / Accelerators | Type 2: Accelerators with Memory | Type 3: Memory Buffers |
|---|---|---|---|
| Protocols | CXL.io, CXL.cache | CXL.io, CXL.cache, CXL.memory | CXL.io, CXL.memory |
| Device | Accelerator / NIC with cache | Accelerator with cache and HBM | Memory buffer with DDR memory |
| Usages | PGAS NIC, NIC atomics | GP GPU, dense computation | Memory BW expansion, memory capacity expansion, storage-class memory |
69. CXL™ Type 3 Device
[Diagram: a host/CPU with home agent and DDR channels connected over CXL.io + CXL.mem to a CXL memory expander containing a memory controller and device memory]
CXL is a cache-coherent standard, meaning the host and the CXL device see the same data seamlessly (a minimal allocation sketch follows below)
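To make the Type 3 model concrete: on Linux, a CXL memory expander is typically surfaced as a CPU-less NUMA node, so existing NUMA tooling can place data on it. The sketch below (not from the slides) uses libnuma (link with `-lnuma`) and assumes the expander is the highest-numbered node, which is system-specific.

```cpp
// Sketch: allocate a buffer explicitly on a CXL-backed NUMA node via libnuma.
#include <numa.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    int cxl_node = numa_max_node();   // assumption: expander is the last node
    std::size_t len = 1ull << 30;     // 1 GiB
    void* buf = numa_alloc_onnode(len, cxl_node);  // backed by CXL memory
    if (!buf) return 1;
    std::memset(buf, 0, len);         // touch pages so they are actually placed
    numa_free(buf, len);
    return 0;
}
```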
70. CXL™ Type 3 Device - Memory Expansion
CXL enables systems to significantly scale memory capacity and bandwidth
DDR only: 8 channels at 2 DIMMs per channel (2DPC) of 512 GB DDRx gives a maximum of 8 TB for one CPU
DDR + CXL: the same 8x 2DPC configuration plus four 1 TB CXL memory expanders raises the maximum to 12 TB for one CPU
71. Current Use Cases: Capacity / Bandwidth Expansion
Capacity expansion - TCO reduction: in-memory database (IMDB) servers with x TB DRAM per CPU are consolidated into fewer servers with y TB DRAM plus z TB of CXL-attached memory per CPU
Bandwidth expansion - performance improvement: in-memory computing (IMC) servers with x GB DRAM per CPU are consolidated into fewer servers with y GB DRAM plus z GB of CXL-attached memory per CPU
[Diagrams: before/after IMDB and IMC server configurations with CXL memory added]
73. Future Use Cases: Tiering and Pooling
Memory tiering* - efficient expansion: IMC servers combine x TB DRAM per CPU with z TB of CXL-attached memory
Memory pooling - increased utilization: multiple IMC servers share z TB of CXL memory housed in an external MEMORY BOX
*Hot data on DRAM; warm data on cost-optimized, CXL-attached media
[Diagrams: tiered IMC servers with local CXL memory, and pooled IMC servers sharing an external memory box]
74. Samsung CXL™ Proof of Concept
Supporting ecosystem growth with a CXL-based memory functional sample
Product features:
Form factor - EDSFF (E3.S) / AIC
Media - DDR4
Module capacity - 128 GB
CXL link width - x16
Specification - CXL 2.0
Ecosystem enablement success:
Shipped 100+ samples since availability in 3Q '21
Successfully tested with a broad range of server, system, and software providers across the industry
75. Samsung CXL™ Solution
Leading the industry toward mainstream adoption of CXL-based memory
Product features:
Form factor - EDSFF (E3.S)
Media - DDR5
Module capacity - 512 GB
CXL link width - x8
Maximum CXL bandwidth - 32 GB/s
Specification - CXL 2.0
Other features - RAS, interleaving, diagnostics, and more
Availability - Q3 '22 for evaluation/testing
76. SMDK - Scalable Memory Development Kit
[Diagram: SMDK stack for data center to edge applications (IMDB, DLRM, ML/AI, etc.): a compatible API and an optimization API sit above an intelligent tiering engine and memory pool management; a CXL-aware kernel exposes a normal zone and a CXL.mem zone; the CXL allocator spans CPU DRAM on the server main board and the CXL memory expander]
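The two API paths in the diagram can be sketched as follows. This is illustrative only: the "compatible path" in SMDK's model means unmodified `malloc` keeps working while the tiering engine steers placement, and the "optimization path" means explicit placement. The `MemTier`/`s_malloc` names below are hypothetical stand-ins, not SMDK's actual symbols; the real API lives in the SMDK repository.

```cpp
// Illustrative sketch of SMDK's two allocation paths (names hypothetical).
#include <cstddef>
#include <cstdlib>

enum MemTier { MEM_NORMAL, MEM_CXL };  // hypothetical tier tags

void* s_malloc(MemTier /*tier*/, std::size_t n) {
    // Stub: a real implementation would steer the allocation to the
    // requested zone (normal vs. CXL.mem) through the CXL allocator.
    return std::malloc(n);
}

int main() {
    // Compatible path: existing code keeps calling malloc unchanged; the
    // intelligent tiering engine decides whether pages land on DRAM or CXL.
    void* a = std::malloc(4096);

    // Optimization path: explicitly place a large, colder structure in the
    // CXL.mem zone.
    void* b = s_malloc(MEM_CXL, std::size_t{1} << 20);

    std::free(a);
    std::free(b);  // stub counterpart; the real API would pair its own free
    return 0;
}
```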
77. Application Benchmark Test
Test scenario: a single Redis node with 32 GB DDR5 plus 64 GB of CXL-attached memory (scale-up) is compared against a Redis cluster of DDR5-only nodes (32 GB each) connected over Ethernet (scale-out). Clients Set and Get 60 GB of data.

Test result, throughput in MB/s:

| Configuration | Set 128B | Set 4KB | Set 1MB | Get 128B | Get 4KB | Get 1MB |
|---|---|---|---|---|---|---|
| Single node (DRAM + CXL) | 30 | 455 | 699 | 27 | 496 | 659 |
| Cluster (DRAM only) | 49 | 172 | 186 | 66 | 173 | 189 |

Scale-up performance is 2.7x better than scale-out at 4 KB chunk size.
78. Key Takeaways
Memory capacity and bandwidth per core are lagging industry demand, and conventional scaling technologies are unable to meet the challenge
CXL is the most promising technology to address the gap: capacity/bandwidth expansion, tiering, and pooling use cases
Samsung is leading the advancement of CXL-based memory solutions: PoC, ASIC-based module, and SMDK
The PoC has been tested with a broad range of partners for more than a year
Samsung enthusiastically welcomes further collaboration with the industry; visit the Samsung booth to learn more about Samsung's Memory Expander and SMDK
80. Adding New Value to Memory Subsystems through CXL
Uksong Kang
VP, DRAM Product Planning
August 2, 2022
98. Data centers = memory centers
Memory and storage growth will never be as slow as before, and possibly never as fast as now.
Hyperscale adoption of AI roughly doubles (2x) from CY20 to CY25, driving memory and storage growth
An AI-optimized server carries roughly 6x the memory and 7x the storage of a compute-optimized server (DDR memory, HBM memory, NAND storage)
Micron's global data center market is projected to grow at a 16% CAGR through CY30 (chart scale $20B-$180B)
Sources: 1. Hyperscale AI adoption: internal Bain research. 2. Server content referencing two published AWS EC2 hardware configs (AWS instance types, 3/1/22): standard server = 256 GB DRAM, 0 GB HBM, 1.2 TB SSD storage; AI server = 1152 GB DRAM + 320 GB HBM, 8 TB SSD storage. 3. Global data center market: Micron MI market model.
99. Memory-centric innovations in the data center
Applying the power of software-defined infrastructure
[Diagram: a memory/storage hierarchy - in-package and direct-attach near memory for hot data, CXL-attached far memory for warm data, then fast, capacity, and archival storage for cold data; latency and capacity increase going down the hierarchy, cost and bandwidth increase going up. Disaggregated compute/memory/storage building blocks make the data center modular, composable, performant, and efficient]
100. CXL Use Cases
Alternative to stacking: stacking drives non-linear cost/bit
Provide ultra-high capacity: expansion beyond 4H 3DS TSV
Add memory bandwidth: CXL enables additional memory attach points
Balance memory capacity/BW: DRAM capacity/BW on demand; balances GB/core and BW
Reduce system complexity: fewer memory channels; thermally optimized solutions
Enablement after 2DPC: future 50% slot reduction
102. Micron's "data centric" portfolio
A complete portfolio built on silicon technology, world-class manufacturing, and a diversified supply chain
Segments: hyperscale, enterprise & government, communication, edge
Workloads: compute, storage, networking, acceleration
Products: HBM, GDDR, LPDDR, DDR, TLC NAND, QLC NAND
Foundations: silicon technology, emerging memory, advanced packaging, tech node leadership
Deep customer relationships, ecosystem enablement, and standards body leadership
103. Micron is committed to partnering with the industry, ultimately serving and delighting our customers
Strategic ecosystem partnerships: define, develop, and prove technologies (DDR, LP, & GDDR; GPU Direct Storage); enable differentiated solutions; extend the ecosystem
Industry organizations: provide leadership in industry organizations to enable scalable advancement
106. Afternoon Agenda
| Start | End  | Name             | Title                                                                                      | Organization            |
|-------|------|------------------|--------------------------------------------------------------------------------------------|-------------------------|
| 3:25  | 3:45 | Arvind Jagannath | Cloud Platform Product Management                                                          | VMware                  |
| 3:45  | 4:05 | Mahesh Wagh      | Senior Fellow                                                                              | AMD                     |
| 4:05  | 4:25 | Charles Fan      | CEO & Co-founder                                                                           | MemVerge                |
| 4:25  | 4:45 | Manoj Wadekar    | SW-Defined Memory Workstream Lead, OCP; Storage Architect, Meta                            | Meta / OCP              |
| 4:45  | 5:10 | Siamak Tavallaei | Panel Moderator; President, CXL Consortium; Chief Systems Architect, Google Infrastructure | Google / CXL Consortium |
| 5:10  | 5:35 | Chris Mellor     | Panel Moderator; Editor                                                                    | Blocks and Files        |

Session SPOS-102-1 on the FMS program
122. AGENDA
◢ Paradigm Shift and Memory Composability Progression
◢ Runtime Memory Management
◢ Tiered Memory
◢ NUMA domains and Page Migration
◢ Runtime Memory Pooling
123. PARADIGM SHIFT
◢ Scalable, high-speed CXL™ interconnect and PIM (Processing in Memory) contribute to the paradigm shift in memory-intensive computations
◢ Efficiency boost for the next-generation data center
◢ Management of the host/accelerator subsystems combined with terabytes of fabric-attached memory
◢ Reduced complexity of the SW stack combined with direct access to multiple memory technologies
124. MEMORY COMPOSABILITY PROGRESSION
[Diagram: progression from direct-attach memory, to memory scale-out through a buffer behind the host root port, to memory pooling & disaggregation with fabric-attached end points]
• Addresses the cost and underutilization of memory
• Multi-domain pooled memory - memory in the pool is allocated/released when required
• Workloads/applications benefit from memory capacity
• Design optimization for {BW/$, memory capacity/$, BW/core}
125. RUNTIME MEMORY MANAGEMENT
126. TIERED MEMORY
NUMA Domains
Page Migration
127. TIERED MEMORY: NUMA DOMAINS
• Exposed to the HV, guest OS, and apps
• OS-assisted optimization of the memory subsystem
• Based on ACPI objects - SRAT/SLIT/HMAT (see the sketch below)
128. TIERED MEMORY: PAGE MIGRATION
[Diagrams: processor CCDs/IODs with near memory (DRAM, shorter latency) and CXL-attached far memory (longer latency), shown both as separate NUMA domains for memory expansion and as a HW-managed near-memory cache where a near-mem miss is redirected to far mem]
SW-assisted page migration:
‒ Active page migration between far and near memories
‒ HV/guest migrates hot pages into near mem and retires cold pages into far mem
‒ Focused DMA transfers required datasets from far to near mem
DRAM-as-a-cache optimization:
‒ HW-managed hot dataset
‒ Near-mem miss redirected to the far mem
‒ App/HV unawareness
129. TIERED MEMORY: SW-ASSISTED PAGE MIGRATION
Combined HW/SW tracking of memory page activity ("hotness")
Detecting page(s) that are candidates for migration
Requesting HV/guest permission to migrate
HV/guest API to the Security Processor to migrate the page(s)
Migration: stalling accesses to the specific pages and copying the data
Key elements: page "hotness" determined by combined HW and SW tracking; HV/guest authorization of the migration; the Security Processor as a root of trust for performing the migration (a minimal migration sketch follows below)
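As a point of reference for the migration step itself: on stock Linux, software-assisted movement of a hot page from a far (CXL) node to a near (DRAM) node can be done with the `move_pages(2)` syscall, transparently to the application. This is a generic kernel facility, not AMD's HV/Security-Processor flow; node ids are system-specific assumptions. Link with `-lnuma`.

```cpp
// Sketch: migrate one identified-hot page to the near (DRAM) NUMA node.
#include <numaif.h>
#include <cstdio>

bool migrate_hot_page(void* page_addr, int near_node) {
    void* pages[1] = {page_addr};
    int nodes[1] = {near_node};  // destination: near (DRAM) node
    int status[1] = {-1};
    // pid 0 = current process; MPOL_MF_MOVE moves only pages it owns.
    long rc = move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
    if (rc != 0 || status[0] != near_node) {
        std::fprintf(stderr, "migration failed (status %d)\n", status[0]);
        return false;
    }
    return true;
}
```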
130. RUNTIME MEMORY ALLOCATION/POOLING: FABRIC ATTACHED MEMORY
[Diagram: two hosts attached to tier-2 memory through a multi-headed CXL controller]
Multiple structures serve for fabric-level memory pooling
Combination of private (dedicated to a specific host) and shareable memory ranges
Protection of the memory regions from unauthorized guests and the hypervisor
Allocation/pooling of memory ranges between hosts is regulated by a fabric-aware SW layer (i.e., the Fabric Manager)
131. RUNTIME MEMORY ALLOCATION/POOLING: FABRIC ATTACHED MEMORY
Memory Allocation Layer - communicates new memory allocations per host based on system/application needs
Fabric Manager - adjusts the fabric settings and communicates new memory allocations to the host SW
Host SW - invokes the hot-add/hot-removal method to increase/reduce (or offline) the amount of memory allocated to the host (see the sketch below); in some instances, host SW can directly invoke the SP to adjust the memory size allocated to the host
On-die Security Processor (root of trust) - involved in securing exclusive access to the memory range
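For the "Host SW" step above, Linux exposes hot-plugged memory blocks through sysfs, and onlining/offlining them is a plain write to the block's `state` file (standard kernel memory-hotplug ABI; requires root, and block numbering is system-specific). A minimal sketch, not tied to any particular fabric manager:

```cpp
// Sketch: online or offline one hot-plugged memory block via sysfs.
#include <fstream>
#include <string>

bool set_memory_block_state(int block, const std::string& state) {
    // state is "online" or "offline" per the kernel memory-hotplug ABI.
    std::ofstream f("/sys/devices/system/memory/memory" +
                    std::to_string(block) + "/state");
    if (!f) return false;
    f << state;
    return static_cast<bool>(f);
}

// Usage after the fabric manager grants a new range:
//   set_memory_block_state(42, "online");
```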
132. SUMMARY
Composable disaggregated memory is the key approach to addressing the cost and underutilization of system memory
Further investment in runtime management of composable, multi-type memory structures is required to maximize system-level performance across multiple use cases
Application transparency is another goal of efficient runtime management, achieved by abstracting away the underlying fabric/memory infrastructure
134. CXL: The Dawn of Big Memory
Charles Fan
Co-founder & CEO
MemVerge
135. The Rise of Modern Data-Centric Applications
EDA simulation, AI/ML, video rendering, geophysical, genomics, risk analysis, CFD, financial analytics
136. Opening the Door to the Era of Big Memory
Abundant - Composable - Available
137. What happened 30+ years ago
Just a bunch of disks → Storage Area Network (SAN) via Fibre Channel → advanced storage services via storage data services
138. Where We Are Going…
Storage then: just a bunch of disks → Storage Area Network (SAN) via Fibre Channel → advanced storage services via storage data services
Memory now: new memory → pooled memory via CXL → Memory-as-a-Service via CXL memory data services
140. Use Case #1: Dynamic Memory Expansion Reduces Stranded Memory
Before CXL [diagram: per-server used vs. unused memory]
Azure paper*:
• Up to 50% of server costs is from DRAM alone
• Up to 25% of memory is stranded
• 50% of all VMs never touch 50% of their rented memory
* H. Li et al. First-generation Memory Disaggregation for Cloud Platforms. arXiv:2203.00241v2 [cs.OS], March 5, 2022
141. Use Case #1: Dynamic Memory Expansion Reduces Stranded Memory
After CXL [diagram: pooled memory shared across servers, shrinking the unused share]
Memory disaggregation can save billions of dollars per year.
142. Use Case #2: Memory Auto-Healing with Transparent Migration
1. A memory module is going bad: its error rate is rising
2. Provision a new memory region from the pool
3. Transparently migrate the memory data
4. Memory auto-healing is complete
144. Use Case #3: Shared Memory Read
After CXL [diagram: multiple hosts reading a shared CXL memory region; a minimal mapping sketch follows below]
S. Chen et al. Optimizing Performance and Computing Resource Management of in-memory Big Data Analytics with Disaggregated Persistent Memory. CCGRID '19
Project Splash is open source: https://github.com/MemVerge/splash
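One way the shared-read use case can look from a single host, sketched below: each host maps the same CXL region read-only and consumes the dataset in place rather than copying it over the network. The device path is an assumption (a DAX character device is one common way such a region is exposed), and coherence/versioning of the shared data is assumed to be software-managed, e.g. published once and then treated as immutable.

```cpp
// Sketch: map a shared CXL region read-only and consume it in place.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "/dev/dax0.0";  // system-specific assumption
    const size_t len = 1ull << 30;     // 1 GiB window
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }
    void* base = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { std::perror("mmap"); return 1; }
    // ... run analytics directly over the shared, read-only dataset ...
    munmap(base, len);
    close(fd);
    return 0;
}
```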
145. Key Software Components
[Diagram: computing hosts run apps over a Transparent Memory Service (memory snapshot, memory tiering, memory sharing, resource management) in the operating system; a CXL switch connects them to a pool server running the Memory Machine Pool Manager (memory provisioning & sharing, capacity optimization, data protection, security, global insights), alongside Memory Viewer, app profiler, and hardware API integration]
147. Announcing Memory Machine™ Cloud Edition
Memory capacity expansion:
• Software-defined memory pool with intelligent auto-tiering
• No application change required
Accelerate time-to-discovery:
• Transparent checkpointing
• Roll back, restore, and clone anywhere, any time
Reduce cloud cost by up to 70%:
• Enables long-running applications to use low-cost Spot instances
• Integration with cloud automation and schedulers to auto-recover from CSP preemptions
[Diagram: Memory Machine™ stack - Memory Snapshot, Memory Tiering, and System & Cloud Orchestration services over a Transparent Memory Service on Linux, spanning compute (CPU/GPU/xPU), memory (HBM/DDR/CXL), and storage (SSD/HDD), serving genomics, EDA, geophysics, risk analysis, video rendering, and other workloads]
148. Early Results Running Memory Machine on CXL
Test platform: a next-gen server with 64 GB of DDR5 DRAM and a 64 GB CXL DRAM expander card (Montage Technologies), running Memory Machine™
Workloads: MLC (Memory Latency Checker) and Stream microbenchmarks, plus Redis as the application
149. Early Results Running Memory Machine on CXL
[Charts: throughput (GB/s) for MLC workloads (all reads; 3:1, 2:1, and 1:1 reads-writes; stream-triad-like) and Stream workloads (copy, scale, add, triad), comparing DDR5 only, CXL only, and DDR+CXL with Memory Machine auto-tiering]
154. Software Partner to the CXL Ecosystem
Founded in 2017 to develop Big Memory software
[Diagram: MemVerge Big Memory software (Transparent Memory Service and Memory Machine Pool Manager components) at the center of an ecosystem of processors, servers, switches, memory systems, clouds, Big Memory apps, and standards bodies]
157. Agenda
• SDM Workstream within OCP
• Hyperscale Infra - Needs
• Memory Hierarchy to address the needs
• SDM Use cases
• SDM Activities and Status
158. SDM Team Charter
SDM (Software Defined Memory) is a workstream within Future Technology Initiatives within OCP
Charter:
- Identify key applications driving adoption of hierarchical/hybrid memory solutions
- Establish architecture and nomenclature for such systems
- Offer benchmarks that enable validation of novel ideas for HW/SW solutions for such systems
159. Hyperscale Infrastructure
• Application performance and growth depend on
⎻ DC, system, and component performance and growth
⎻ Compute, memory, storage, network…
• Focusing here on the memory discussion
[Diagram: hyperscale services - ads, FE web, database/cache, inference, data warehouse, storage, training]
160. Memory Challenges
Bandwidth and capacity:
• The gap between bandwidth and capacity is widening
• Applications are ready to trade between bandwidth and capacity
Power:
• DIMMs consume a significant share of rack power (DDR5 exacerbates this)
• Applications co-design to achieve higher capacity at optimized power
TCO:
• Cost impact of minimum capacity increments and die/ECC overheads
• Applications can trade performance/capacity to achieve optimal TCO
162. Use Case Examples
• Caching (e.g., Memcache/Memtier (CacheLib), Redis, etc.)
⎻ Need to achieve higher QPS while satisfying "retention time"; higher memory capacity needed
⎻ Current solutions include "tiered memory" with DRAM+NAND, but need load/store
• Databases (e.g., RocksDB/MongoDB, etc.)
⎻ Need to achieve efficient storage capacity per node and deliver the QPS SLA
⎻ More memory enables more storage per node
• Inference (e.g., DLRM)
⎻ Petaflops and parameter counts are increasing rapidly; AI models are scaling faster than the underlying memory technology
⎻ Current solutions include "tiered memory" with DRAM+NAND, but need load/store
163. AI at Meta
● Across many applications/services and at scale → driving a portion of our overall infrastructure (both HW and SW)
● From data centers to the edge
[Examples: keypoint segmentation; augmented reality with smart camera]
164. Problem Statement: AI Workloads Scale Rapidly
● Compute, memory BW, and memory capacity all scale for frontier models
○ Workload scaling is typically faster than technology scaling
● This rapid scaling requires more vertical integration from SW requirements to HW design
165. DLRM Memory Requirements
● Bandwidth
1. A considerable portion of capacity needs high-BW accelerator memory
2. Inference has a bigger portion of its capacity at low bandwidth, more so than training
● Latency
3. Inference has a tight latency requirement, even on the low-BW end
166. System Implications of DLRM Requirements
● A tier of memory beyond HBM and DRAM can be leveraged, particularly for inference
○ Higher latency than main memory, but still a tight latency profile (e.g., TLC NAND flash does not work)
○ Trades off performance for density
○ This does not negate the capacity and BW demand for HBM and DRAM
167. "Tiered Memory" Pyramid with CXL
[Diagram: memory pyramid from bandwidth-driven to capacity-driven tiers - HBM (GP compute, training), DRAM cache, CXL-attached BW memory (inference, caching), CXL-attached capacity memory (databases, caching), NAND SSD]
CXL properties: load/store interface; cache-line reads/writes; scalable; heterogeneous; standard interfaces
169. OCP SDM Activity and Progress
SDM - enabling memory solutions from emerging memory technologies
• SDM's focus: apply emerging memory technologies in the development of use cases
• The OCP SDM group has three real-world memory focus areas:
⎻ Databases/caching
⎻ AI/ML & HPC
⎻ Virtualized servers
• SDM team members: AMD, Arm, Intel, Meta, Micron, Microsoft, Omdia, Samsung, VMware
• Vendors are demonstrating CXL-capable CPUs and devices
• Meta and others are investigating solutions to real-world memory problems
170. Composable Memory Panel
Siamak Tavallaei - Panel Moderator; President, CXL Consortium
Ben Bolles - Executive Director, Product Management, Liqid
Gerry Fan - Founder & CEO, Xconn Technologies
George Apostol - CEO, Elastics.cloud
Christopher Cox - VP Technology, Montage
171. Big Memory App Panel
Chris Mellor - Panel Moderator; Editor, Blocks and Files
Manoj Wadekar - SW-Defined Memory Workstream Lead, OCP; Storage Architect, Meta
Richard Solomon - Tech Mktg Mgr., PCIe/CXL, Synopsys
Bernie Wu - VP Strategic Alliances, MemVerge
James Cuff - Distinguished Engineer, Harvard University (retired); Industry Expert, HPC & AI