CPU Memory Hierarchy and Caching Techniques
1. Memory Hierarchy
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by
John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan
Kaufmann Publishers
3. Why Memory Hierarchy?
Applications want unlimited amounts of memory
with low latency
Fast memory is more expensive per bit
Solution
Organize memory system into a hierarchy
Entire addressable memory space available in largest,
slowest memory
Incrementally smaller & faster memories
Temporal & spatial locality ensure that nearly all
references can be found in the smaller memories
Gives the illusion of a large, fast memory presented
to the processor
5. Why Hierarchical Design?
Becomes more crucial with multi-core
processors
Aggregate peak bandwidth grows with no of cores
Intel Core i7 can generate 2 data references per core per clock
4 cores and 3.2 GHz clock
25.6 billion 64-bit data references/second +
12.8 billion 128-bit instruction references
= 409.6 GB/s
DRAM bandwidth is only 6% of this (25 GB/s); see the worked check below
Requires
Multi-port, pipelined caches
2 levels of cache per core
Shared third-level cache on chip
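As a sanity check of the arithmetic above, a minimal sketch using only the figures from this slide (nothing else is assumed):

#include <stdio.h>

int main(void) {
    double clock = 3.2e9;                       /* 3.2 GHz */
    int cores = 4;
    double data_refs = cores * 2.0 * clock;     /* 25.6e9 64-bit refs/s  */
    double inst_refs = cores * 1.0 * clock;     /* 12.8e9 128-bit refs/s */
    double peak = data_refs * 8 + inst_refs * 16;   /* bytes/s */
    printf("Peak demand: %.1f GB/s\n", peak / 1e9);           /* 409.6  */
    printf("25 GB/s DRAM covers %.1f%%\n", 25e9 / peak * 100); /* ~6.1  */
    return 0;
}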
7. Performance vs. Power
High-end microprocessors have >10 MB on-chip
cache
Consumes large amount of chip area & power budget
Leakage current – when not operating
Active current – when operating
Major limiting factor for processors used in
mobile devices
8. Definitions – Blocks
Data is moved between levels in the
hierarchy in fixed-size blocks
Exploits spatial locality
Blocks are tagged with their memory address
Tags are searched in parallel
Source: http://archive.arstechnica.com/paedia/c/caching/m-caching-5.html
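A minimal sketch of that tagging, assuming an illustrative 64 B block and 128 sets (these parameters are not from the slide):

#include <stdint.h>

#define BLOCK_BITS 6   /* 64 B block -> low 6 bits = offset within block */
#define SET_BITS   7   /* 128 sets   -> next 7 bits select the set       */

/* The remaining high bits form the tag stored alongside each block;
   on a lookup, every tag in the indexed set is compared in parallel. */
static inline uint64_t blk_offset(uint64_t a) { return a & ((1ULL << BLOCK_BITS) - 1); }
static inline uint64_t set_index(uint64_t a)  { return (a >> BLOCK_BITS) & ((1ULL << SET_BITS) - 1); }
static inline uint64_t blk_tag(uint64_t a)    { return a >> (BLOCK_BITS + SET_BITS); }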
10. Pentium 4 vs. Opteron Memory Hierarchy

CPU                Pentium 4 (3.2 GHz)          Opteron (2.8 GHz)
Instruction cache  Trace cache (8K micro-ops)   2-way associative, 64 KB, 64B block
Data cache         8-way associative, 16 KB,    2-way associative, 64 KB,
                   64B block, inclusive in L2   64B block, exclusive to L2
L2 cache           8-way associative,           16-way associative,
                   2 MB, 128B block             1 MB, 64B block
Prefetch           8 streams to L2              1 stream to L2
Memory             200 MHz x 64 bits            200 MHz x 128 bits
11. Definitions – Updating Cache
Write-through
Update cache block & the lower-level memory
Use write buffers to speed up
Write-back
Update cache block
Update lower level only when a modified block is replaced
Use write buffers to speed up
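A minimal sketch contrasting the two policies; cache_line_t and lower_level_write() are hypothetical stand-ins, not from the slides:

#include <stdint.h>
#include <string.h>

typedef struct {
    int      valid;
    int      dirty;          /* used only by write-back */
    uint64_t tag;
    uint8_t  data[64];
} cache_line_t;

/* Stand-in for writing a block to the next lower level. */
static void lower_level_write(uint64_t addr, const uint8_t *data) {
    (void)addr; (void)data;
}

/* Write-through: every store also updates the level below. */
void store_write_through(cache_line_t *l, uint64_t addr, const uint8_t *src) {
    memcpy(l->data, src, 64);
    lower_level_write(addr, src);   /* a write buffer hides this latency */
}

/* Write-back: mark the block dirty and defer the write... */
void store_write_back(cache_line_t *l, const uint8_t *src) {
    memcpy(l->data, src, 64);
    l->dirty = 1;
}

/* ...until the block is replaced. */
void evict(cache_line_t *l, uint64_t addr) {
    if (l->dirty)
        lower_level_write(addr, l->data);
    l->valid = l->dirty = 0;
}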
12. Definitions – Replacing Cached Blocks
Cache replacement policies
Random
Least Recently Used (LRU)
Need to track last access time
Least Frequently Used (LFU)
Need to track no of accesses
First In First Out (FIFO)
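As an illustration, a minimal LRU sketch for one set of a 4-way cache (associativity and field names are illustrative):

#include <stdint.h>

#define WAYS 4

typedef struct { int valid; uint64_t tag; uint64_t last_used; } way_t;

/* Pick the victim: an invalid way if one exists,
   otherwise the way with the oldest last-access time. */
int lru_victim(const way_t set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    }
    return victim;
}
/* On every hit, set[w].last_used = current_time++ maintains the ordering. */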
13. Definitions – Cache Misses
When a required item is not found in the cache
Miss rate – fraction of cache accesses that result
in a failure
Types of misses
Compulsory – 1st access to a block
Capacity – limited cache capacity forces blocks to be
removed from the cache & later retrieved
Conflict – if placement strategy is not fully associative
Average memory access time
= Hit time + Miss rate x Miss penalty
14. Definitions – Cache Misses (Cont.)
Memory stall cycles
= Instruction count x Memory accesses per
instruction x Miss rate x Miss penalty
Memory accesses per instruction
= Instruction memory accesses per instruction + Data memory
accesses per instruction
Example
50% of instructions are loads & stores. Miss rate is 2% &
miss penalty is 25 clock cycles. Suppose CPI is 1. How much
faster would this be if all instructions were cache hits?
Memory accesses per instruction = 1 + 0.5 = 1.5
Stall cycles per instruction = 1.5 x 0.02 x 25 = 0.75
Speedup = [IC x (1 + 0.75) x CC] / [IC x 1 x CC] = 1.75
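The same example as a runnable check (numbers straight from the slide):

#include <stdio.h>

int main(void) {
    double accesses = 1.0 + 0.5;              /* fetch + loads/stores */
    double stall_cpi = accesses * 0.02 * 25;  /* = 0.75 */
    double cpi = 1.0 + stall_cpi;             /* = 1.75 */
    printf("CPI with misses: %.2f -> perfect cache is %.2fx faster\n",
           cpi, cpi / 1.0);
    return 0;
}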
16. 6 Basic Cache Optimization Techniques
1. Larger block sizes
Reduce compulsory misses
Increase capacity & conflict misses
Increase miss penalty
Choosing a correct block size is challenging
2. Larger total cache capacity to reduce miss rate
Reduce misses
Increase hit time
Increase power consumption & cost
3. Higher no of cache levels
Reduce overall memory access time
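One way to see the benefit of an extra level (a general formula following the average-memory-access-time definition above; no machine-specific numbers assumed): the L1 miss penalty becomes the average access time of L2, so

Average memory access time
= Hit time(L1) + Miss rate(L1)
x [Hit time(L2) + Miss rate(L2) x Miss penalty(L2)]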
17. 6 Basic Cache Optimization Techniques
(Cont.)
4. Higher associativity
Reduce conflict misses
Increase hit time
Increase power consumption
5. Giving priority to read misses over writes
Allow reads to check write buffer
Reduce miss penalty
6. Avoiding address translation in cache indexing
Use page-offset bits (identical in virtual & physical addresses) to index the cache
Reduce hit time
18. 10 Advanced Cache Optimization
Techniques
5 categories
1. Reducing hit time
2. Increasing cache bandwidth
3. Reducing miss penalty
4. Reducing miss rate
5. Reducing miss penalty or miss rate via parallelism
19. Advanced Optimizations 1
Small & simple 1st level caches
Recently, L1 cache size has increased either slightly or
not at all
Critical timing path in a cache hit
addressing tag memory, then
comparing tags, then
selecting correct way (if set associative)
Direct-mapped caches can overlap tag compare &
transmission of data
Improve hit time
Lower associativity reduces power because fewer
cache lines are accessed
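A minimal sketch of the direct-mapped overlap mentioned above; sizes and names are illustrative:

#include <stdint.h>
#include <string.h>

#define LINES 512
#define BLOCK 64

typedef struct { int valid; uint64_t tag; uint8_t data[BLOCK]; } line_t;

/* Only one line can match, so the data read can start immediately
   (in parallel with the tag compare) and be discarded on a miss. */
int dm_lookup(const line_t cache[LINES], uint64_t addr, uint8_t out[BLOCK]) {
    uint64_t idx = (addr / BLOCK) % LINES;
    uint64_t tag = (addr / BLOCK) / LINES;
    memcpy(out, cache[idx].data, BLOCK);              /* speculative read */
    return cache[idx].valid && cache[idx].tag == tag; /* hit? */
}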
22. Advanced Optimizations 2
Way Prediction
Extra bits in the cache predict the way (block within
the set) of the next cache access
Improve hit time
Mis-prediction increases hit time
Prediction accuracy
> 90% for 2-way
> 80% for 4-way
Instruction cache has better accuracy than Data cache
First used on MIPS R10000 in mid-90s
Used on ARM Cortex-A8
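A minimal sketch of the idea for a 2-way set-associative cache (hypothetical structures; real hardware sets a multiplexor early rather than looping):

#include <stdint.h>

#define SETS 256
#define WAYS 2

typedef struct { int valid; uint64_t tag; } line_t;

static line_t  cache[SETS][WAYS];
static uint8_t predicted[SETS];        /* one prediction per set */

/* Returns 1 on hit; *extra counts the added cycle on a mis-predict. */
int lookup(uint32_t set, uint64_t tag, int *extra) {
    int p = predicted[set];
    *extra = 0;
    if (cache[set][p].valid && cache[set][p].tag == tag)
        return 1;                      /* fast hit in the predicted way */
    *extra = 1;                        /* mis-predict: probe the other way */
    int o = 1 - p;
    if (cache[set][o].valid && cache[set][o].tag == tag) {
        predicted[set] = (uint8_t)o;   /* retrain the predictor */
        return 1;
    }
    return 0;                          /* genuine miss */
}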
23. Advanced Optimizations 3
Pipelined Cache Access
Enable L1 cache access to be multiple cycles
Examples
Pentium – 1 cycle
Pentium Pro to Pentium III – 2 cycles
Pentium 4 to Core i7 – 4 cycles
Improve bandwidth
Makes it easier to increase associativity
Increase hit time
Increase branch mis-prediction penalty
24. Advanced Optimizations 4
Nonblocking Caches
Allow hits before previous
misses complete
“Hit under miss”
“Hit under multiple miss”
L2 must support this
In general, processors can
hide L1 miss penalty but
not L2 miss penalty
Increase bandwidth
25. Advanced Optimizations 5
Multibanked Caches
Organize cache as independent banks to support
simultaneous access
Examples
ARM Cortex-A8 supports 1-4 banks for L2
Intel i7 supports 4 banks for L1 & 8 banks for L2
Interleave banks according to block address
Increase bandwidth
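The interleaving rule itself is simple block-address arithmetic; a minimal sketch with illustrative parameters:

#include <stdint.h>

#define BLOCK_SIZE 64
#define N_BANKS     4

/* Consecutive blocks land in consecutive banks, so sequential
   accesses can proceed in up to N_BANKS banks simultaneously. */
static inline unsigned bank_of(uint64_t addr) {
    return (unsigned)((addr / BLOCK_SIZE) % N_BANKS);
}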
26. Advanced Optimizations 6
Critical Word First, Early Restart
Critical word first
Request missed word from memory first
Send it to processor as soon as it arrives
Early restart
Request words in normal order
Send missed word to processor as soon as it arrives
Reduce miss penalty
Effectiveness depends on block size & likelihood of
another access to portion of the block that has not yet
been fetched
27. Advanced Optimizations 7 - 10
Merging Write Buffer
Combine writes to a block already in the write buffer
instead of allocating a new entry
Reduce miss penalty
Compiler Optimizations
Examples
Loop Interchange – Swap nested loops to access memory in
sequential order (see the sketch after this list)
Blocking – Instead of accessing entire rows or columns,
subdivide matrices into blocks that fit in cache
Reduce miss rate
Hardware Prefetching
Fetch 2 blocks on a miss – requested block & next sequential block
Reduce miss penalty or miss rate
Compiler Prefetching
Reduce miss penalty or miss rate
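The loop-interchange sketch referenced above, for a row-major C array (size illustrative; this shows the transformation, not any particular compiler's output):

#define N 1024

/* Before: inner loop walks down a column, striding N doubles
   between accesses, so spatial locality is poor. */
double sum_column_major(double a[N][N]) {
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* After interchange: inner loop walks along a row, touching
   consecutive addresses, so each cache block is fully used. */
double sum_row_major(double a[N][N]) {
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

Blocking applies the same idea in two dimensions at once, sizing the tiles so the working set fits in cache.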
29. Memory Technologies
Performance metrics
Latency is the concern of caches
Bandwidth is the concern of multiprocessors & I/O
Access time
Time between read request & when desired word arrives
Cycle time
Minimum time between unrelated requests to memory
DRAM used for main memory
SRAM used for cache
30. Memory Technologies (Cont.)
Amdahl's rule of thumb
Memory capacity should grow linearly with processor
speed
Unfortunately, memory capacity & speed haven't kept
pace with processors
Some optimizations
Multiple accesses to same row
Synchronous DRAM (SDRAM)
Added clock to DRAM interface
Burst mode with critical word first
Wider interfaces
Double data rate (DDR)
Multiple banks on each DRAM device
32. DRAM Power Consumption
Reducing power in DRAMs
Lower voltage
Low power mode (ignores clock, continues to refresh)
33. Flash Memory
Type of EEPROM
Must be erased (in blocks) before being
overwritten
Non-volatile
Limited no of write cycles
Cheaper than DRAM, more expensive than disk
Slower than DRAM, faster than disk
37. Virtual Memory
Each process has its own address space
Protection via virtual memory
Keeps processes in their own memory space
Role of architecture
Provide user mode & supervisor mode
Protect certain aspects of CPU state
Provide mechanisms for switching between user
mode & supervisor mode
Provide mechanisms to limit memory accesses
Provide TLB to translate addresses
38. Paging Hardware With TLB
Parallel search on TLB
Address translation (p, d)
If p is in TLB (associative registers),
get frame # out
Otherwise get frame # from
page table in memory
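A minimal software analogue of that flow (the hardware searches all TLB entries in parallel; structures and sizes are illustrative):

#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGES (1u << 20)

typedef struct { int valid; uint64_t p, frame; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static uint64_t page_table[PAGES];      /* page # -> frame #, simplified */

/* Translate page number p; the offset d passes through unchanged. */
uint64_t frame_of(uint64_t p) {
    for (int i = 0; i < TLB_ENTRIES; i++)   /* parallel in hardware */
        if (tlb[i].valid && tlb[i].p == p)
            return tlb[i].frame;            /* TLB hit */
    return page_table[p];                   /* TLB miss: memory access */
}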
39. Summary
Caching techniques are continuing to evolve
Multiple techniques are combined in practice
Cache sizes are unlikely to increase significantly
Better performance when programs are
optimized based on cache architecture