This document discusses memory hierarchy design and cache optimizations for improving performance. It reviews the four basic cache design questions (block placement, identification, replacement, and write strategy), develops the average memory access time (AMAT) and CPU-time equations, compares split and unified caches, and surveys six basic optimizations: multi-level caches, read priority over writes, larger blocks, bigger caches, higher associativity, and avoiding address translation during cache indexing. Worked examples show how these techniques lower AMAT and stalls per instruction.
Memory Hierarchy Optimization Techniques for Enhancing Performance
1. 1
Memory Hierarchy Design and Optimizations
School of Computer Engineering
KIIT University
24-11-2023
2. 2
Introduction
Even a sophisticated processor may
perform well below an ordinary one
unless it is supported by a memory system
of matching performance.
The focus of this module:
Study how memory system performance has
been enhanced through various innovations
and optimizations.
3. 3
Here we focus on L1/L2/L3 caches, virtual
memory and main memory
Typical Memory Hierarchy
[Figure: the hierarchy from Proc/Regs through L1-Cache, L2-Cache, an optional L3-Cache, and Memory, down to Disk, Tape, etc.; capacity grows and speed falls moving away from the processor.]
4. 4
What is the Role of a Cache?
A small, fast storage used to improve average
access time to a slow memory.
Caches have no “inherent value”;
they only close the processor-memory
performance gap.
Improves memory system performance:
Exploits spatial and temporal locality
Temporal locality refers to the reuse of specific
data, and/or resources, within a relatively small
time duration. Spatial locality (also termed data
locality) refers to the use of data elements
within relatively close storage locations.
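For instance, a simple array sum exhibits both kinds of locality; this C sketch is illustrative only:

#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++) a[i] = i;

    int sum = 0;                 /* sum and i are reused every iteration: temporal locality */
    for (int i = 0; i < 1024; i++)
        sum += a[i];             /* consecutive elements share cache blocks: spatial locality */
    printf("sum = %d\n", sum);
    return 0;
}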
5. 5
Four Basic Questions
Q1: Where can a block be placed in the cache?
(Block placement)
Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the cache?
(Block identification)
Tag/Block
Q3: Which block should be replaced on a
miss?
(Block replacement)
Random, LRU
Q4: What happens on a write?
(Write strategy)
Write Back or Write Through (with Write Buffer)
6. 6
Block Placement
If a block has only one possible place
in the cache: direct mapped
If a block can be placed anywhere:
fully associative
If a block can be placed in a
restricted subset of the possible
places: set associative
If there are n blocks in each subset: n-
way set associative
Note that direct-mapped = 1-way set
associative
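A minimal C sketch of the placement rule (the frame count and example address are illustrative assumptions, not values from the slides):

#include <stdio.h>

#define NUM_FRAMES 1024   /* assumed total number of block frames */

/* Set that a block may occupy under n-way set associativity. */
unsigned set_index(unsigned long block_addr, unsigned n_ways) {
    unsigned num_sets = NUM_FRAMES / n_ways;  /* n = 1: direct mapped; n = NUM_FRAMES: fully associative */
    return (unsigned)(block_addr % num_sets);
}

int main(void) {
    unsigned long block = 123456;
    printf("direct mapped     -> set %u\n", set_index(block, 1));
    printf("4-way associative -> set %u\n", set_index(block, 4));
    printf("fully associative -> set %u\n", set_index(block, NUM_FRAMES)); /* always set 0 */
    return 0;
}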
7. 7
Trade-offs
n-way set associative becomes
increasingly difficult (costly) to
implement for large n
Most caches today are either 1-way (direct
mapped), 2-way, or 4-way set associative
The larger n the lower the likelihood of
thrashing
e.g., two blocks competing for the same
block frame and being accessed in sequence
over and over
8. 8
Block Identification
cont…
Given an address, how do we find
where it goes in the cache?
This is done by first breaking
down an address into three parts
offset of the address in
the cache block
Index of the set
tag used for
identifying a match
[Figure: address layout | tag | set index | block offset |; the tag and set index together form the block address.]
9. 9
Block Identification
cont…
Consider the following system
Addresses are on 64 bits
Memory is byte-addressable
Block frame size is 2^6 = 64 bytes
Cache is 64 MByte (2^26 bytes)
Consists of 2^20 block frames
Direct-mapped
For each cache block brought in from memory,
there is a single possible frame among the 2^20
available
A 64-bit address can be decomposed as
follows (the low 6 bits select a byte within the
block, leaving a 64 - 6 = 58-bit block address):
58-bit block address (= 38-bit tag + 20-bit set index)
6-bit block offset
20-bit set index
38-bit tag
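A minimal C sketch of this 38/20/6 decomposition (the example address is arbitrary; the shift and mask constants follow from the field widths above):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t addr   = 0xFEEDFACE40ULL;        /* arbitrary 64-bit byte address */
    uint64_t offset = addr & 0x3F;            /* low 6 bits: block offset */
    uint64_t index  = (addr >> 6) & 0xFFFFF;  /* next 20 bits: set index */
    uint64_t tag    = addr >> 26;             /* remaining 38 bits: tag */
    printf("tag=0x%llx index=0x%llx offset=0x%llx\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}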
10. 10
Block Identification
cont…
[Figure: the cache as a column of 2^20 block frames selected by the 20-bit set index; each frame holds a 38-bit tag and a 2^6-byte cache block. All addresses with the same 20 set-index bits “compete” for a single block frame.]
12. 12
Block Identification
[Figure: the same 2^20-frame cache; an address arrives from the CPU, split into a 38-bit tag, a 20-bit set index, and a 6-bit block offset.]
Find the set: use the 20-bit set index to select the block frame.
Compare the tag:
If no match: miss.
If match: hit; access the byte at
the desired offset.
13. 13
Cache Write Policies
Write-through: Information is written
to both the block in the cache and the
block in memory
Write-back: Information is written
back to memory only when a block
frame is replaced:
Uses a “dirty” bit to indicate whether a
block was actually written to,
Saves unnecessary writes to memory
when a block is “clean”
14. 14
Trade-offs
Write back
Faster because writes occur at the
speed of the cache, not the memory.
Faster because multiple writes to
the same block are written back to
memory only once, using less memory
bandwidth.
Write through
Easier to implement
15. 15
Memory System
Performance
Memory system performance is largely
captured by three parameters:
Latency, Bandwidth, and Average Memory
Access Time (AMAT).
Latency:
The time it takes from the issue of a memory
request to the time the data is available at the
processor.
Bandwidth:
The rate at which data can be pumped to the
processor by the memory system.
16. 16
Average Memory Access
Time (AMAT)
• AMAT: The average time it takes for
the processor to get a data item it
requests.
• The time it takes to get requested
data to the processor can vary:
– due to the memory hierarchy.
• AMAT can be expressed as:
AMAT = Cache hit time + Miss rate x Miss penalty
17. 17
Cache Performance
Parameters
Performance of a cache is largely
determined by:
Cache miss rate: number of cache
misses divided by number of accesses.
Cache hit time: the time between
sending address and data returning from
cache.
Cache miss penalty: the extra processor
stall cycles incurred to fetch the block
from the next-level cache or memory.
18. 18
If a direct mapped cache has a hit rate of 95%,
a hit time of 4 ns, and a miss penalty of 100 ns,
what is the AMAT?
AMAT = Hit time + Miss rate x Miss penalty =
4 + 0.05 x 100 = 9 ns
If replacing the cache with a 2-way set
associative increases the hit rate to 97%, but
increases the hit time to 5 ns, what is the new
AMAT?
AMAT = Hit time + Miss rate x Miss penalty =
5 + 0.03 x 100 = 8 ns
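The same two calculations as a C sketch (the helper name amat_ns is illustrative):

#include <stdio.h>

/* AMAT = Hit time + Miss rate x Miss penalty */
double amat_ns(double hit_ns, double miss_rate, double penalty_ns) {
    return hit_ns + miss_rate * penalty_ns;
}

int main(void) {
    printf("direct mapped: %.1f ns\n", amat_ns(4.0, 0.05, 100.0)); /* 9.0 ns */
    printf("2-way:         %.1f ns\n", amat_ns(5.0, 0.03, 100.0)); /* 8.0 ns */
    return 0;
}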
20. 20
Unified cache (mixed cache): Data and
instructions are stored together (von
Neumann architecture)
Split cache: Data and instructions are
stored separately (Harvard
architecture)
Why do instruction caches have a
lower miss ratio?
Unified vs Split Caches
21. 21
Unified vs Split Caches
A Load or Store instruction requires two
memory accesses:
One for the instruction itself
One for the data
Therefore, unified cache causes a structural
hazard!
Modern processors use separate data and
instruction L1 caches:
As opposed to “unified” or “mixed” caches
The CPU simultaneously sends the
instruction address and the data address
to the two caches' ports.
Both caches can be configured differently
Size, associativity, etc.
22. 22
Unified vs Split Caches
Separate Instruction and Data caches:
Avoids structural hazard
Also, each cache can be tailored to its specific needs.
[Figure: two organizations. Unified cache: Processor -> Unified Cache-1 -> Unified Cache-2. Split cache: Processor -> I-Cache-1 and D-Cache-1 -> Unified Cache-2.]
24. 24
Which has the lower average memory access time?
Split cache : 16 KB instructions + 16 KB data
Unified cache: 32 KB (instructions + data)
Assumptions
Use miss rates from previous chart
Miss penalty is 50 cycles
Hit time is 1 cycle
75% of the total memory accesses for
instructions and 25% of the total memory
accesses for data
On the unified cache, a load or store hit takes
an extra cycle, since there is only one port for
instructions and data
25. 25
Average memory-access time = Hit time + Miss rate x Miss penalty
AMAT = %instr x (instr hit time + instr miss rate x instr miss penalty) +
%data x (data hit time + data miss rate x data miss penalty)
For the split cache:
AMAT = 75% x (1 + 0.64% x 50) + 25% (1 + 6.47% x 50) = 2.05
For the unified cache
AMAT = 75% x (1 + 1.99% x 50) + 25% x (2 + 1.99% x 50) = 2.24
The unified cache has a longer AMAT, even though its miss rate is
lower, because of the structural hazard: every load or store hit pays
an extra cycle contending with instruction fetches for the single port.
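A C sketch of the weighted AMAT computation above, using the example's miss rates:

#include <stdio.h>

int main(void) {
    double penalty = 50.0;
    /* split cache: 1-cycle hit on both the I- and D-cache */
    double split   = 0.75 * (1 + 0.0064 * penalty) + 0.25 * (1 + 0.0647 * penalty);
    /* unified cache: a load/store hit pays one extra cycle (single port) */
    double unified = 0.75 * (1 + 0.0199 * penalty) + 0.25 * (2 + 0.0199 * penalty);
    printf("split:   %.3f cycles\n", split);    /* 2.049, i.e. the 2.05 above */
    printf("unified: %.3f cycles\n", unified);  /* 2.245, i.e. the 2.24 above */
    return 0;
}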
26. 26
Cache Performance: Princeton (Unified) Architecture
CPUtime = Instruction count x CPI x Clock cycle time
CPIexecution = CPI with ideal memory
CPI = CPIexecution + Mem Stall cycles per instruction
(Memory stall cycles: Number of cycles during which processor is stalled waiting for a
memory access.)
CPUtime = Instruction Count x (CPIexecution + Mem Stall cycles per instruction) x Clock
cycle time
Mem Stall cycles per instruction = Mem accesses per instruction x Miss rate x
Miss penalty
CPUtime = IC x (CPIexecution + Mem accesses per instruction x Miss rate x Miss penalty)
x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty) x Clock cycle time
27. 27
Assuming the following execution and cache parameters:
Cache miss penalty = 50 cycles
Normal instruction execution CPI ignoring memory stalls = 2.0 cycles
Miss rate = 2%
Average memory references/instruction = 1.33
CPU time =
IC x [CPI execution + Memory accesses/instruction x Miss rate x
Miss penalty ] x Clock cycle time
CPUtime with cache = IC x (2.0 + (1.33 x 2% x 50)) x clock cycle time
= IC x 3.33 x Clock cycle time
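The same computation as a C sketch (the helper name is illustrative):

#include <stdio.h>

/* CPI = CPI_execution + Mem accesses per instruction x Miss rate x Miss penalty */
double cpi_with_stalls(double cpi_exec, double accesses_per_instr,
                       double miss_rate, double miss_penalty) {
    return cpi_exec + accesses_per_instr * miss_rate * miss_penalty;
}

int main(void) {
    /* parameters above: CPI_execution = 2.0, 1.33 refs/instr, 2% miss rate, 50-cycle penalty */
    printf("CPI = %.2f\n", cpi_with_stalls(2.0, 1.33, 0.02, 50.0)); /* 3.33 */
    return 0;
}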
28. 28
Suppose a CPU executes at Clock Rate = 200 MHz (5 ns per
cycle) with a single level of cache.
CPIexecution = 1.1
Instruction mix: 50% arith/logic, 30% load/store, 20%
control
Assume a cache miss rate of 1.5% and a miss penalty of 50
cycles.
CPI = CPIexecution + mem stalls per instruction
Mem Stalls per instruction = Mem accesses per instruction
x Miss rate x Miss penalty
Mem accesses per instruction = 1 + .3 = 1.3
Mem Stalls per instruction = 1.3 x .015 x 50 = 0.975
CPI = 1.1 + .975 = 2.075
A CPU with ideal memory (no misses) would be 2.075/1.1 ≈ 1.89 times
faster.
29. 29
Memory Stall CPI
= Misses per instruction x Miss penalty
= Memory accesses per instruction x Miss rate x Miss
Penalty
Example: Assume 0.2 memory accesses per instruction, a 2%
miss rate, and a 400-cycle miss penalty. How much is the
memory stall CPI?
Memory Stall CPI = 0.2 x 0.02 x 400 = 1.6 cycles
30. 30
Performance Example
Suppose:
Clock Rate = 200 MHz (5 ns per cycle), Ideal (no
misses) CPI = 1.1
50% arith/logic, 30% load/store, 20% control
10% of data memory operations get 50 cycles
miss penalty
1% of instruction memory operations also get 50
cycles miss penalty
Compute the effective CPI.
31. 31
Performance Example
CPI = ideal CPI + average stalls per instruction
= 1.1 (cycles/instr)
+ [0.30 (data mem ops/instr) x 0.10 (misses/data op) x 50 (cycles/miss)]
+ [1 (instr fetch/instr) x 0.01 (misses/fetch) x 50 (cycles/miss)]
= (1.1 + 1.5 + 0.5) cycles/instr = 3.1
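A C sketch of the same breakdown, separating the data and instruction stall terms:

#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.1;
    double data_stalls  = 0.30 * 0.10 * 50.0; /* 0.30 data ops/instr x 10% miss x 50 cycles = 1.5 */
    double instr_stalls = 1.00 * 0.01 * 50.0; /* 1 fetch/instr x 1% miss x 50 cycles = 0.5 */
    printf("CPI = %.1f\n", ideal_cpi + data_stalls + instr_stalls); /* 3.1 */
    return 0;
}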
33. 33
How to Improve Cache
Performance?
Average memory access time (AMAT) is a common metric
to analyze memory system performance. AMAT uses hit
time, miss penalty, and miss rate to measure memory
performance. It accounts for the fact that hits and
misses affect memory system performance differently.
1. Reduce miss rate.
2. Reduce miss penalty.
3. Reduce hit time.
Hit latency (H) is the time to hit in the cache. Miss rate
(MR) is the frequency of cache misses, while average miss
penalty (AMP) is the cost of a cache miss in terms of time:
AMAT = H + MR x AMP
34. 34
Six Basic Cache Optimization Techniques
• Reducing miss penalty
• Multilevel caches to reduce miss penalty
• Giving priority to read misses over writes
to reduce miss penalty
• Reducing miss rate
• Larger block size to reduce miss rate
• Bigger caches to reduce miss rate
• Higher associativity to reduce miss rate
• Reducing cache hit time
• Avoiding address translation during
indexing of the cache to reduce hit time
37. 37
Reducing Miss Penalty: Multi-
Level Cache
Add a second-level cache.
L2 Equations:
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss
PenaltyL2
AMAT = Hit TimeL1 + Miss RateL1 x (Hit TimeL2 + Miss
RateL2 × Miss PenaltyL2)
AMAT can be extended recursively to multiple layers
of the memory hierarchy.
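Because the equation nests, a short C sketch can evaluate AMAT recursively over any number of levels (the arrays below use illustrative values; the last level stands in for main memory):

#include <stdio.h>

/* AMAT at level i = hit_time[i] + miss_rate[i] x AMAT at level i+1. */
double amat(const double *hit, const double *miss, int levels) {
    if (levels == 1)
        return hit[0];                      /* final level: memory access time */
    return hit[0] + miss[0] * amat(hit + 1, miss + 1, levels - 1);
}

int main(void) {
    double hit[]  = {1.0, 10.0, 100.0};     /* L1, L2, memory (cycles) */
    double miss[] = {0.04, 0.50, 0.0};      /* local miss rates */
    printf("AMAT = %.1f cycles\n", amat(hit, miss, 3)); /* 1 + 4% x (10 + 50% x 100) = 3.4 */
    return 0;
}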
38. 38
Multi-Level Cache: Some
Definitions
Local miss rate— misses in this cache
divided by the total number of memory
accesses to this cache (Miss rateL2)
Global miss rate—misses in this cache
divided by the total number of memory
accesses generated by the CPU
Global Miss RateL2 = Local Miss RateL1 x Local Miss RateL2
For L1, the global miss rate equals the local miss rate.
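A one-line C sketch of the relationship (the 4% and 50% values are illustrative):

#include <stdio.h>

int main(void) {
    double local_l1 = 0.04, local_l2 = 0.50; /* e.g., 40 L1 misses and 20 L2 misses per 1000 references */
    /* Global Miss RateL2 = Local Miss RateL1 x Local Miss RateL2 */
    printf("global L2 miss rate = %.1f%%\n", 100.0 * local_l1 * local_l2); /* 2.0% */
    return 0;
}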
40. 40
Performance Improvement Due
to L2 Cache
Assume:
• For 1000 memory references:
– 40 misses in L1,
– 20 misses in L2
• L1 hit time: 1 cycle,
• L2 hit time: 10 cycles,
• L2 miss penalty=100
• 1.5 memory references per instruction
• Assume ideal CPI=1.0
Find: Local miss rate, AMAT, stall cycles per
instruction, and those without L2 cache.
41. 41
Example : Solution
With L2 cache:
Miss rateL1 = 40/1000 = 4%; local miss rateL2 = 20/40 = 50%
AMAT = 1 + 4% x (10 + 50% x 100) = 3.4 cycles
Average memory stalls per instruction =
(3.4 - 1.0) x 1.5 = 3.6
Without L2 cache:
AMAT = 1 + 4% x 100 = 5 cycles
Average memory stalls per instruction = (5 - 1.0) x 1.5 = 6
Performance improvement with L2 = (6 + 1)/(3.6 + 1) = 1.52,
i.e., about 52% faster
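The whole comparison as a C sketch, using the example's values:

#include <stdio.h>

int main(void) {
    double refs_per_instr = 1.5, ideal_cpi = 1.0;
    double amat_with_l2    = 1 + 0.04 * (10 + 0.50 * 100); /* 3.4 cycles */
    double amat_without_l2 = 1 + 0.04 * 100;               /* 5.0 cycles */
    double stalls_with     = (amat_with_l2 - 1.0) * refs_per_instr;    /* 3.6 */
    double stalls_without  = (amat_without_l2 - 1.0) * refs_per_instr; /* 6.0 */
    printf("speedup with L2 = %.2f\n",
           (stalls_without + ideal_cpi) / (stalls_with + ideal_cpi)); /* ~1.52 */
    return 0;
}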
42. 42
• Reducing miss penalty
• Giving priority to read misses
over writes to reduce miss
penalty
43. 43
Reducing Miss Penalty : Read
Priority over Write on Miss
In a write-back scheme:
Normally a dirty block is stored in a
write buffer temporarily.
Usual approach:
Write all blocks from the write buffer
to memory, and then do the read.
Instead:
Check the write buffer first; if the block is
not found there, initiate the read immediately.
This reduces CPU stall cycles.
44. 44
Reducing Miss Penalty :
Read Priority over Write on Miss
A write buffer with a write through:
Allows cache writes to occur at the
speed of the cache.
Write buffers, however, complicate
memory accesses:
they may hold the updated value of a
location needed on a read miss.
45. 45
Reducing Miss Penalty :
Read Priority over Write on Miss
Write-through with write buffers:
Read priority over write: Check write
buffer contents before read;
if no conflicts, let the memory access
continue.
Write priority over read: waiting for the
write buffer to empty first can
increase the read miss penalty.
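A C sketch of the read-priority policy: on a read miss the write buffer is searched first, and memory is accessed only if there is no conflict. The structures and names here are illustrative, not a real cache interface:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4

/* Pending (address, value) writes not yet drained to memory. */
struct wb_entry { uint64_t addr; uint32_t value; bool valid; };
static struct wb_entry write_buffer[WB_ENTRIES];

static uint32_t memory_read(uint64_t addr) { (void)addr; return 0; } /* stand-in for DRAM */

/* Read priority over write: check the write buffer before memory. */
uint32_t read_miss(uint64_t addr) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].addr == addr)
            return write_buffer[i].value;  /* forwarded from the buffer; no need to drain it first */
    return memory_read(addr);              /* no conflict: let the memory access continue */
}

int main(void) {
    write_buffer[0] = (struct wb_entry){ .addr = 0x40, .value = 42, .valid = true };
    printf("read 0x40 -> %u\n", (unsigned)read_miss(0x40)); /* hits the write buffer */
    printf("read 0x80 -> %u\n", (unsigned)read_miss(0x80)); /* goes to memory */
    return 0;
}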
46. 46
• Reducing miss rate
• Larger block size to reduce miss rate
• Bigger caches to reduce miss rate
• Higher associativity to reduce miss
rate
47. Reducing Misses
Classifying Misses: 3 Cs
Compulsory—The first access to a block is not in the cache, so
the block must be brought into the cache. These are also
called cold start misses or first reference misses.
(Misses in infinite cache)
Capacity—If the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur due
to blocks being discarded and later retrieved.
(Misses due to size of cache)
Conflict—If the block-placement strategy is set associative
or direct mapped, conflict misses (in addition to compulsory
and capacity misses) will occur because a block can be
discarded and later retrieved if too many blocks map to its
set. These are also called collision misses or interference
misses.
(Misses due to the associativity and size of the cache)
48. Bigger caches
One way to decrease misses is to increase the
cache size
Reduces capacity and conflict misses
No effect on compulsory misses
However, a larger cache may increase the hit time:
larger cache => larger access time
If the cache is too large, it can't fit on the
same chip as the processor.
49. Larger block size
Take advantage of spatial locality:
Decreases compulsory misses
However, larger blocks have disadvantages:
May increase the miss penalty
(need to fetch more data)
May increase hit time (need to
read more data from cache)
Increasing the block size can
help, but don't overdo it.
50. Increasing associativity
Increasing associativity helps reduce conflict
misses
2:1 Cache Rule:
The miss rate of a direct mapped cache of size N is
about equal to the miss rate of a 2-way set
associative cache of size N/2
For example, the miss rate of a 32 KByte direct
mapped cache is about equal to the miss rate of a 16
KByte 2-way set associative cache
Disadvantages of higher associativity:
Need to do a large number of tag comparisons
Need an n-to-1 multiplexor for n-way set associative
Could increase hit time
54. 54
Hit Time Reduction: Simultaneous
Tag Comparison and Data Reading
• After indexing:
– Tag can be compared
and at the same time
block can be fetched.
– If it’s a miss --- then no
harm done, miss must be
dealt with.
[Figure: a set-associative cache in which each way pairs a tag-and-comparator with one cache line of data; tags and data are read simultaneously, and on a miss the request is forwarded to the next lower level in the hierarchy.]
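A C sketch of the idea for a direct-mapped cache: the data is read out in parallel with the tag, and the comparison afterwards decides whether the speculatively read data may be used (all structures and sizes illustrative):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 1024  /* assumed: 10 index bits above a 6-bit block offset */

struct line { uint64_t tag; uint32_t data; bool valid; };
static struct line cache[NUM_SETS];

/* Index once, then read tag and data simultaneously. */
bool lookup(uint64_t addr, uint32_t *out) {
    uint64_t index = (addr >> 6) % NUM_SETS;
    uint64_t tag   = addr >> 16;            /* bits above offset + index */
    uint32_t data  = cache[index].data;     /* fetched in parallel with the tag */
    bool hit = cache[index].valid && cache[index].tag == tag;
    if (hit) *out = data;                   /* on a miss, no harm done: data is discarded */
    return hit;
}

int main(void) {
    cache[1] = (struct line){ .tag = 0, .data = 7, .valid = true };
    uint32_t v = 0;
    bool hit = lookup(0x40, &v);            /* index 1, tag 0 */
    printf("hit=%d data=%u\n", (int)hit, (unsigned)v);
    return 0;
}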