3. Super Computing, Denver, Colorado, 2023
Memory Tiering with AutoNUMA for Caching: Redis with Memtier
Memory capacity expansion through CXL improves the caching application's throughput by ~10X and latency by ~1.8X over the DRAM + SSD configuration
[Figure: Normalized throughput (higher is better), shown for Ops/sec and Total Bandwidth. Series: DRAM + SSD; 75% DRAM + 25% CXL w/ AutoNUMA; 50% DRAM + 50% CXL w/ AutoNUMA. Data labels as extracted: 1, 1, 2.8, 3.3, 10.3, 10.2.]
[Figure: Normalized latency (lower is better), shown for Cache Allocate P99 Latency and Cache Find P99 Latency. Series: DRAM + SSD; 75% DRAM + 25% CXL w/ AutoNUMA; 50% DRAM + 50% CXL w/ AutoNUMA. Data labels as extracted: 1, 1, 0.5, 0.6, 0.7, 0.8.]
System Configuration: EPYC CPU with 6 x DDR5 DIMM + 6 x DDR5 DIMM; 4 x DDR4 DIMM on an Astera Labs Leo CXL card, attached over CXL at 32GT/s
Workload: 50 threads, each creating 20 Redis clients, randomly access cache objects from a total working set of 1TB
AutoNUMA is used as the page placement policy for tiered memory
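As a minimal sketch of how one might check which AutoNUMA mode a Linux host is running, the snippet below decodes the `kernel.numa_balancing` sysctl. The slide does not say which mode was used in the experiment; the helper names `decode_numa_balancing` and `read_numa_balancing` are hypothetical, and the flag values follow the Linux kernel sysctl documentation for recent kernels.

```python
from pathlib import Path

# Bit flags of the kernel.numa_balancing sysctl, per the Linux kernel
# sysctl documentation (the value is the OR of the enabled modes).
NUMA_BALANCING_NORMAL = 0x1          # classic AutoNUMA locality balancing
NUMA_BALANCING_MEMORY_TIERING = 0x2  # hot-page promotion across memory tiers


def decode_numa_balancing(value: int) -> list[str]:
    """Translate the sysctl bitmask into human-readable mode names."""
    modes = []
    if value & NUMA_BALANCING_NORMAL:
        modes.append("normal")
    if value & NUMA_BALANCING_MEMORY_TIERING:
        modes.append("memory_tiering")
    return modes or ["disabled"]


def read_numa_balancing(path: str = "/proc/sys/kernel/numa_balancing"):
    """Return the active modes, or None if the sysctl file is absent."""
    sysctl = Path(path)
    if not sysctl.exists():
        return None
    return decode_numa_balancing(int(sysctl.read_text().strip()))


if __name__ == "__main__":
    print(read_numa_balancing())
```

On a host without NUMA balancing support the function returns `None` rather than raising, so it can be run safely on a development machine.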
Memory Tiering for In-Memory Database: MSSQL with TPC-H
[Figure: Normalized execution speed w.r.t. 1 stream on 768GB DRAM (12 * 64GB), plotted against the number of streams. Series: DRAM Only (12 * 64GB); DRAM Only (12 * 96GB); DRAM + CXL (12 * 64GB + 1TB); DRAM + CXL (12 * 96GB + 1TB). Data labels as extracted: 3.15, 2.32, 2.18, 2.25, 2.25, 2.27, 2.39, 4.82, 4.71, 4.82, 4.57, 4.69, 4.54, 4.86, 1, 6.17, 6.84, 7.28, 7.25, 7.28, 6.55, 6.57, 1.32, 5.91, 6.84, 7.31, 7.72, 7.82, 8.24, 8.09.]
1.6X Speedup
CXL memory expansion can improve performance over a DDR module upgrade at a reduced TCO
in collaboration with Micron
System Configuration: EPYC CPU with 6 x DDR5 DIMM + 6 x DDR5 DIMM; 4 x Micron CMM DDR4, attached over CXL at 32GT/s
Total working set size is 3TB. For storage, 8 x Micron 7450 NVMe SSDs are used.
Linux’s default page placement policy is used for tiered memory management.
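To make the capacity argument concrete, the following rough sketch computes how much of the 3TB working set each memory configuration could hold at most. It assumes binary units (1TB = 1024GB) and ignores memory used by the OS and the database engine itself; the function name `resident_fraction` is hypothetical.

```python
GB_PER_TB = 1024  # assumption: binary terabytes

WORKING_SET_GB = 3 * GB_PER_TB  # total working set stated on the slide
CXL_GB = 1 * GB_PER_TB          # CXL expansion capacity stated on the slide


def resident_fraction(dimm_count: int, dimm_gb: int, cxl_gb: int = 0) -> float:
    """Upper bound on the share of the working set that fits in memory."""
    capacity_gb = dimm_count * dimm_gb + cxl_gb
    return min(1.0, capacity_gb / WORKING_SET_GB)


# (DIMM count, DIMM size in GB, CXL capacity in GB) for the four series.
configs = {
    "DRAM Only (12 * 64GB)": (12, 64, 0),
    "DRAM Only (12 * 96GB)": (12, 96, 0),
    "DRAM + CXL (12 * 64GB + 1TB)": (12, 64, CXL_GB),
    "DRAM + CXL (12 * 96GB + 1TB)": (12, 96, CXL_GB),
}

if __name__ == "__main__":
    for name, (count, size_gb, cxl_gb) in configs.items():
        share = resident_fraction(count, size_gb, cxl_gb)
        print(f"{name}: {share:.1%} of the working set")
```

Under these assumptions, none of the four configurations holds the full working set, so some data is always served from the NVMe SSDs; the CXL configurations simply reduce how much.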
SW-Defined CXL Memory Bandwidth Expansion: CloverLeaf
CXL memory benefits HPC applications through bandwidth expansion over the DDR modules
[Figure: Normalized execution speedup by SW-defined interleaved page allocation ratio, relative to the DRAM baseline (1.0)]
50% DRAM, 50% CXL: 0.63
64% DRAM, 34% CXL: 0.79
80% DRAM, 20% CXL: 1.17 (17% speedup)
in collaboration with Micron
System Configuration: EPYC CPU with 6 x DDR5 DIMM + 6 x DDR5 DIMM; 4 x Micron CMM DDR4, attached over CXL at 32GT/s. The same hardware is shown under three NUMA layouts:
NPS1: NUMA 0, NUMA 1 (50% DRAM, 50% CXL)
NPS2: NUMA 0, NUMA 1, NUMA 2 (64% DRAM, 34% CXL)
NPS4: NUMA 0, NUMA 1, NUMA 2, NUMA 3, NUMA 4 (80% DRAM, 20% CXL)
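One way to read the three ratios is as the result of equal-weight page interleaving across all NUMA nodes, for example with `numactl --interleave=all`. The sketch below works out the split under that assumption; the slide does not state the exact mechanism, the CXL memory is assumed to appear as a single NUMA node, and `interleave_split` is a hypothetical helper.

```python
def interleave_split(dram_nodes: int, cxl_nodes: int = 1) -> tuple[float, float]:
    """Share of pages on DRAM and on CXL when pages are interleaved
    round-robin with equal weight across every NUMA node."""
    total_nodes = dram_nodes + cxl_nodes
    return dram_nodes / total_nodes, cxl_nodes / total_nodes


if __name__ == "__main__":
    # NPS1, NPS2 and NPS4 expose 1, 2 and 4 DRAM NUMA nodes per socket.
    for nps in (1, 2, 4):
        dram_share, cxl_share = interleave_split(nps)
        print(f"NPS{nps}: {dram_share:.0%} DRAM, {cxl_share:.0%} CXL")
```

Under this assumption NPS1 and NPS4 give exactly 50/50 and 80/20, while NPS2 gives roughly 67/33, close to the middle ratio labeled on the slide.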