CPU Memory Hierarchy and Caching Techniques
1. Memory Hierarchy
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by
John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan
Kaufmann Publishers
3. Why Memory Hierarchy?
Applications want unlimited amounts of memory
with low latency
Fast memory is more expensive per bit
Solution
Organize memory system into a hierarchy
Entire addressable memory space available in largest,
slowest memory
Incrementally smaller & faster memories
Temporal & spatial locality ensure that nearly all
references can be found in the smaller memories
Gives the illusion of a large, fast memory presented
to the processor
5. Why Hierarchical Design?
Becomes more crucial with multi-core
processors
Aggregate peak bandwidth grows with no of cores
Intel Core i7 can generate 2 data references per core per clock
4 cores and 3.2 GHz clock
25.6 billion 64-bit data references/second +
12.8 billion 128-bit instruction references
= 409.6 GB/s
DRAM bandwidth is only 6% of this (25 GB/s); see the worked check below
Requires
Multi-port, pipelined caches
2 levels of cache per core
Shared third-level cache on chip
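As a sanity check of the arithmetic above, a minimal sketch using only the figures from this slide (nothing else is assumed):

#include <stdio.h>

int main(void) {
    double clock = 3.2e9;                       /* 3.2 GHz */
    int cores = 4;
    double data_refs = cores * 2.0 * clock;     /* 25.6e9 64-bit refs/s  */
    double inst_refs = cores * 1.0 * clock;     /* 12.8e9 128-bit refs/s */
    double peak = data_refs * 8 + inst_refs * 16;   /* bytes/s */
    printf("Peak demand: %.1f GB/s\n", peak / 1e9);           /* 409.6  */
    printf("25 GB/s DRAM covers %.1f%%\n", 25e9 / peak * 100); /* ~6.1  */
    return 0;
}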
7. Performance vs. Power
High-end microprocessors have >10 MB on-chip
cache
Consumes large amount of chip area & power budget
Leakage current – when not operating
Active current – when operating
Major limiting factor for processors used in
mobile devices
8. Definitions – Blocks
Data is moved between levels in the
hierarchy in fixed-size blocks
Exploits spatial locality
Blocks are tagged with their memory address
Tags are searched in parallel
Source: http://archive.arstechnica.com/paedia/c/caching/m-caching-5.html
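A minimal sketch of that tagging, assuming an illustrative 64 B block and 128 sets (these parameters are not from the slide):

#include <stdint.h>

#define BLOCK_BITS 6   /* 64 B block -> low 6 bits = offset within block */
#define SET_BITS   7   /* 128 sets   -> next 7 bits select the set       */

/* The remaining high bits form the tag stored alongside each block;
   on a lookup, every tag in the indexed set is compared in parallel. */
static inline uint64_t blk_offset(uint64_t a) { return a & ((1ULL << BLOCK_BITS) - 1); }
static inline uint64_t set_index(uint64_t a)  { return (a >> BLOCK_BITS) & ((1ULL << SET_BITS) - 1); }
static inline uint64_t blk_tag(uint64_t a)    { return a >> (BLOCK_BITS + SET_BITS); }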
10. Pentium 4 vs. Opteron Memory Hierarchy

CPU                Pentium 4 (3.2 GHz)          Opteron (2.8 GHz)
Instruction cache  Trace cache (8K micro-ops)   2-way associative, 64 KB, 64B block
Data cache         8-way associative, 16 KB,    2-way associative, 64 KB,
                   64B block, inclusive in L2   64B block, exclusive to L2
L2 cache           8-way associative,           16-way associative,
                   2 MB, 128B block             1 MB, 64B block
Prefetch           8 streams to L2              1 stream to L2
Memory             200 MHz x 64 bits            200 MHz x 128 bits
11. Definitions – Updating Cache
Write-through
Update cache block & the lower-level memory
Use write buffers to speed up
Write-back
Update cache block
Update lower level only when a modified block is replaced
Use write buffers to speed up
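A minimal sketch contrasting the two policies; cache_line_t and lower_level_write() are hypothetical stand-ins, not from the slides:

#include <stdint.h>
#include <string.h>

typedef struct {
    int      valid;
    int      dirty;          /* used only by write-back */
    uint64_t tag;
    uint8_t  data[64];
} cache_line_t;

/* Stand-in for writing a block to the next lower level. */
static void lower_level_write(uint64_t addr, const uint8_t *data) {
    (void)addr; (void)data;
}

/* Write-through: every store also updates the level below. */
void store_write_through(cache_line_t *l, uint64_t addr, const uint8_t *src) {
    memcpy(l->data, src, 64);
    lower_level_write(addr, src);   /* a write buffer hides this latency */
}

/* Write-back: mark the block dirty and defer the write... */
void store_write_back(cache_line_t *l, const uint8_t *src) {
    memcpy(l->data, src, 64);
    l->dirty = 1;
}

/* ...until the block is replaced. */
void evict(cache_line_t *l, uint64_t addr) {
    if (l->dirty)
        lower_level_write(addr, l->data);
    l->valid = l->dirty = 0;
}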
12. Definitions – Replacing Cached Blocks
Cache replacement policies
Random
Least Recently Used (LRU)
Need to track last access time
Least Frequently Used (LFU)
Need to track no of accesses
First In First Out (FIFO)
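As an illustration, a minimal LRU sketch for one set of a 4-way cache (associativity and field names are illustrative):

#include <stdint.h>

#define WAYS 4

typedef struct { int valid; uint64_t tag; uint64_t last_used; } way_t;

/* Pick the victim: an invalid way if one exists,
   otherwise the way with the oldest last-access time. */
int lru_victim(const way_t set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    }
    return victim;
}
/* On every hit, set[w].last_used = current_time++ maintains the ordering. */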
13. Definitions – Cache Misses
When a required item is not found in the cache
Miss rate – fraction of cache accesses that result
in a failure
Types of misses
Compulsory – 1st access to a block
Capacity – limited cache capacity forces blocks to be
removed from the cache & later retrieved
Conflict – if placement strategy is not fully associative
Average memory access time
= Hit time + Miss rate x Miss penalty
14. Definitions – Cache Misses (Cont.)
Memory stall cycles
= Instruction count x Memory accesses per
instruction x Miss rate x Miss penalty
Memory accesses per instruction
= Instruction memory accesses per instruction + Data memory
accesses per instruction
Example
50% of instructions are loads & stores. Miss rate is 2% &
miss penalty is 25 clock cycles. Suppose CPI is 1. How much
faster would this be if all instructions were cache hits?
Memory accesses per instruction = 1 + 0.5 = 1.5
Stall cycles per instruction = 1.5 x 0.02 x 25 = 0.75
Speedup = [IC x (1 + 0.75) x CC] / [IC x 1 x CC] = 1.75
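The same example as a runnable check (numbers straight from the slide):

#include <stdio.h>

int main(void) {
    double accesses = 1.0 + 0.5;              /* fetch + loads/stores */
    double stall_cpi = accesses * 0.02 * 25;  /* = 0.75 */
    double cpi = 1.0 + stall_cpi;             /* = 1.75 */
    printf("CPI with misses: %.2f -> perfect cache is %.2fx faster\n",
           cpi, cpi / 1.0);
    return 0;
}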
16. 6 Basic Cache Optimization Techniques
1. Larger block sizes
Reduce compulsory misses
Increase capacity & conflict misses
Increase miss penalty
Choosing a correct block size is challenging
2. Larger total cache capacity to reduce miss rate
Reduce misses
Increase hit time
Increase power consumption & cost
3. Higher no of cache levels
Reduce overall memory access time
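One way to see the benefit of an extra level (a general formula following the average-memory-access-time definition above; no machine-specific numbers assumed): the L1 miss penalty becomes the average access time of L2, so

Average memory access time
= Hit time(L1) + Miss rate(L1)
x [Hit time(L2) + Miss rate(L2) x Miss penalty(L2)]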
17. 6 Basic Cache Optimization Techniques
(Cont.)
4. Higher associativity
Reduce conflict misses
Increase hit time
Increase power consumption
5. Giving priority to read misses over writes
Allow reads to check write buffer
Reduce miss penalty
6. Avoiding address translation in cache indexing
Use page-offset bits (identical in virtual & physical addresses) to index the cache
Reduce hit time
18. 10 Advanced Cache Optimization
Techniques
5 categories
1. Reducing hit time
2. Increasing cache bandwidth
3. Reducing miss penalty
4. Reducing miss rate
5. Reducing miss penalty or miss rate via parallelism
19. Advanced Optimizations 1
Small & simple 1st level caches
Recently, L1 cache size has increased either slightly or
not at all
Critical timing path in a cache hit
addressing tag memory, then
comparing tags, then
selecting correct way (if set associative)
Direct-mapped caches can overlap tag compare &
transmission of data
Improve hit time
Lower associativity reduces power because fewer
cache lines are accessed
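A minimal sketch of the direct-mapped overlap mentioned above; sizes and names are illustrative:

#include <stdint.h>
#include <string.h>

#define LINES 512
#define BLOCK 64

typedef struct { int valid; uint64_t tag; uint8_t data[BLOCK]; } line_t;

/* Only one line can match, so the data read can start immediately
   (in parallel with the tag compare) and be discarded on a miss. */
int dm_lookup(const line_t cache[LINES], uint64_t addr, uint8_t out[BLOCK]) {
    uint64_t idx = (addr / BLOCK) % LINES;
    uint64_t tag = (addr / BLOCK) / LINES;
    memcpy(out, cache[idx].data, BLOCK);              /* speculative read */
    return cache[idx].valid && cache[idx].tag == tag; /* hit? */
}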
22. Advanced Optimizations 2
Way Prediction
Extra bits in the cache predict the way (block within
the set) of the next cache access
Improve hit time
Mis-prediction increases hit time
Prediction accuracy
> 90% for 2-way
> 80% for 4-way
Instruction cache has better accuracy than Data cache
First used on MIPS R10000 in mid-90s
Used on ARM Cortex-A8
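A minimal sketch of the idea for a 2-way set-associative cache (hypothetical structures; real hardware sets a multiplexor early rather than looping):

#include <stdint.h>

#define SETS 256
#define WAYS 2

typedef struct { int valid; uint64_t tag; } line_t;

static line_t  cache[SETS][WAYS];
static uint8_t predicted[SETS];        /* one prediction per set */

/* Returns 1 on hit; *extra counts the added cycle on a mis-predict. */
int lookup(uint32_t set, uint64_t tag, int *extra) {
    int p = predicted[set];
    *extra = 0;
    if (cache[set][p].valid && cache[set][p].tag == tag)
        return 1;                      /* fast hit in the predicted way */
    *extra = 1;                        /* mis-predict: probe the other way */
    int o = 1 - p;
    if (cache[set][o].valid && cache[set][o].tag == tag) {
        predicted[set] = (uint8_t)o;   /* retrain the predictor */
        return 1;
    }
    return 0;                          /* genuine miss */
}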
23. Advanced Optimizations 3
Pipelined Cache Access
Enable L1 cache access to be multiple cycles
Examples
Pentium – 1 cycle
Pentium Pro to Pentium III – 2 cycles
Pentium 4 to Core i7 – 4 cycles
Improve bandwidth
Makes it easier to increase associativity
Increase hit time
Increase branch mis-prediction penalty
24. Advanced Optimizations 4
Nonblocking Caches
Allow hits before previous
misses complete
“Hit under miss”
“Hit under multiple miss”
L2 must support this
In general, processors can
hide L1 miss penalty but
not L2 miss penalty
Increase bandwidth
25. Advanced Optimizations 5
Multibanked Caches
Organize cache as independent banks to support
simultaneous access
Examples
ARM Cortex-A8 supports 1-4 banks for L2
Intel i7 supports 4 banks for L1 & 8 banks for L2
Interleave banks according to block address
Increase bandwidth
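The interleaving rule itself is simple block-address arithmetic; a minimal sketch with illustrative parameters:

#include <stdint.h>

#define BLOCK_SIZE 64
#define N_BANKS     4

/* Consecutive blocks land in consecutive banks, so sequential
   accesses can proceed in up to N_BANKS banks simultaneously. */
static inline unsigned bank_of(uint64_t addr) {
    return (unsigned)((addr / BLOCK_SIZE) % N_BANKS);
}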
26. Advanced Optimizations 6
Critical Word First, Early Restart
Critical word first
Request missed word from memory first
Send it to processor as soon as it arrives
Early restart
Request words in normal order
Send missed word to processor as soon as it arrives
Reduce miss penalty
Effectiveness depends on block size & likelihood of
another access to portion of the block that has not yet
been fetched
27. Advanced Optimizations 7 - 10
Merging Write Buffer
Combine writes to a block already in the write buffer
instead of allocating a new entry
Reduce miss penalty
Compiler Optimizations
Examples
Loop Interchange – Swap nested loops to access memory in
sequential order (see the sketch after this list)
Blocking – Instead of accessing entire rows or columns,
subdivide matrices into blocks that fit in cache
Reduce miss rate
Hardware Prefetching
Fetch 2 blocks on a miss – requested block & next sequential block
Reduce miss penalty or miss rate
Compiler Prefetching
Reduce miss penalty or miss rate
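The loop-interchange sketch referenced above, for a row-major C array (size illustrative; this shows the transformation, not any particular compiler's output):

#define N 1024

/* Before: inner loop walks down a column, striding N doubles
   between accesses, so spatial locality is poor. */
double sum_column_major(double a[N][N]) {
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* After interchange: inner loop walks along a row, touching
   consecutive addresses, so each cache block is fully used. */
double sum_row_major(double a[N][N]) {
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

Blocking applies the same idea in two dimensions at once, sizing the tiles so the working set fits in cache.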
29. Memory Technologies
Performance metrics
Latency is the concern of caches
Bandwidth is the concern of multiprocessors & I/O
Access time
Time between read request & when desired word arrives
Cycle time
Minimum time between unrelated requests to memory
DRAM used for main memory
SRAM used for cache
30. Memory Technologies (Cont.)
Amdahl's rule of thumb
Memory capacity should grow linearly with processor
speed
Unfortunately, memory capacity & speed haven't kept
pace with processors
Some optimizations
Multiple accesses to same row
Synchronous DRAM (SDRAM)
Added clock to DRAM interface
Burst mode with critical word first
Wider interfaces
Double data rate (DDR)
Multiple banks on each DRAM device
32. DRAM Power Consumption
Reducing power in DRAMs
Lower voltage
Low power mode (ignores clock, continues to refresh)
33. Flash Memory
Type of EEPROM
Must be erased (in blocks) before being
overwritten
Non-volatile
Limited no of write cycles
Cheaper than DRAM, more expensive than disk
Slower than DRAM, faster than disk
37. Virtual Memory
Each process has its own address space
Protection via virtual memory
Keeps processes in their own memory space
Role of architecture
Provide user mode & supervisor mode
Protect certain aspects of CPU state
Provide mechanisms for switching between user
mode & supervisor mode
Provide mechanisms to limit memory accesses
Provide TLB to translate addresses
38. Paging Hardware With TLB
Parallel search on TLB
Address translation (p, d)
If p is in TLB (associative registers),
get frame # out
Otherwise get frame # from
page table in memory
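A minimal software analogue of that flow (the hardware searches all TLB entries in parallel; structures and sizes are illustrative):

#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGES (1u << 20)

typedef struct { int valid; uint64_t p, frame; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static uint64_t page_table[PAGES];      /* page # -> frame #, simplified */

/* Translate page number p; the offset d passes through unchanged. */
uint64_t frame_of(uint64_t p) {
    for (int i = 0; i < TLB_ENTRIES; i++)   /* parallel in hardware */
        if (tlb[i].valid && tlb[i].p == p)
            return tlb[i].frame;            /* TLB hit */
    return page_table[p];                   /* TLB miss: memory access */
}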
39. Summary
Caching techniques are continuing to evolve
Multiple techniques are combined in practice
Cache sizes are unlikely to increase significantly
Better performance when programs are
optimized based on cache architecture