This document discusses memory hierarchy design and cache optimizations for improving performance. It reviews the four basic cache design questions (block placement, identification, replacement, and write strategy), develops the average memory access time (AMAT) and CPU-time equations, compares split and unified caches, and surveys six basic optimizations: multi-level caches, read priority over writes, larger blocks, bigger caches, higher associativity, and avoiding address translation during cache indexing. Worked examples show how these techniques lower AMAT and stalls per instruction.
Memory Hierarchy Optimization Techniques for Enhancing Performance
1. 1
Memory Hierarchy Design and Optimizations
School of Computer Engineering
KIIT University
24-11-2023
2. 2
Introduction
Even a sophisticated processor may
perform well below an ordinary one
unless it is supported by a memory system
of matching performance.
The focus of this module:
Study how memory system performance has
been enhanced through various innovations
and optimizations.
3. 3
Here we focus on L1/L2/L3 caches, virtual
memory and main memory
Typical Memory Hierarchy
[Figure: the hierarchy from Proc/Regs through L1-Cache, L2-Cache, an optional L3-Cache, and Memory, down to Disk, Tape, etc.; capacity grows and speed falls moving away from the processor.]
4. 4
What is the Role of a Cache?
A small, fast storage used to improve average
access time to a slow memory.
Caches have no “inherent value”;
they only close the processor-memory
performance gap.
Improves memory system performance:
Exploits spatial and temporal locality
Temporal locality refers to the reuse of specific
data, and/or resources, within a relatively small
time duration. Spatial locality (also termed data
locality) refers to the use of data elements
within relatively close storage locations.
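For instance, a simple array sum exhibits both kinds of locality; this C sketch is illustrative only:

#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++) a[i] = i;

    int sum = 0;                 /* sum and i are reused every iteration: temporal locality */
    for (int i = 0; i < 1024; i++)
        sum += a[i];             /* consecutive elements share cache blocks: spatial locality */
    printf("sum = %d\n", sum);
    return 0;
}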
5. 5
Four Basic Questions
Q1: Where can a block be placed in the cache?
(Block placement)
Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the cache?
(Block identification)
Tag/Block
Q3: Which block should be replaced on a
miss?
(Block replacement)
Random, LRU
Q4: What happens on a write?
(Write strategy)
Write Back or Write Through (with Write Buffer)
6. 6
Block Placement
If a block has only one possible place
in the cache: direct mapped
If a block can be placed anywhere:
fully associative
If a block can be placed in a
restricted subset of the possible
places: set associative
If there are n blocks in each subset: n-
way set associative
Note that direct-mapped = 1-way set
associative
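A minimal C sketch of the placement rule (the frame count and example address are illustrative assumptions, not values from the slides):

#include <stdio.h>

#define NUM_FRAMES 1024   /* assumed total number of block frames */

/* Set that a block may occupy under n-way set associativity. */
unsigned set_index(unsigned long block_addr, unsigned n_ways) {
    unsigned num_sets = NUM_FRAMES / n_ways;  /* n = 1: direct mapped; n = NUM_FRAMES: fully associative */
    return (unsigned)(block_addr % num_sets);
}

int main(void) {
    unsigned long block = 123456;
    printf("direct mapped     -> set %u\n", set_index(block, 1));
    printf("4-way associative -> set %u\n", set_index(block, 4));
    printf("fully associative -> set %u\n", set_index(block, NUM_FRAMES)); /* always set 0 */
    return 0;
}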
7. 7
Trade-offs
n-way set associative becomes
increasingly difficult (costly) to
implement for large n
Most caches today are either 1-way (direct
mapped), 2-way, or 4-way set associative
The larger n the lower the likelihood of
thrashing
e.g., two blocks competing for the same
block frame and being accessed in sequence
over and over
8. 8
Block Identification
cont…
Given an address, how do we find
where it goes in the cache?
This is done by first breaking
down an address into three parts
offset of the address in
the cache block
Index of the set
tag used for
identifying a match
[Figure: address layout | tag | set index | block offset |; the tag and set index together form the block address.]
9. 9
Block Identification
cont…
Consider the following system
Addresses are on 64 bits
Memory is byte-addressable
Block frame size is 2^6 = 64 bytes
Cache is 64 MByte (2^26 bytes)
Consists of 2^20 block frames
Direct-mapped
For each cache block brought in from memory,
there is a single possible frame among the 2^20
available
A 64-bit address can be decomposed as
follows (the low 6 bits select a byte within the
block, leaving a 64 - 6 = 58-bit block address):
58-bit block address (= 38-bit tag + 20-bit set index)
6-bit block offset
20-bit set index
38-bit tag
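A minimal C sketch of this 38/20/6 decomposition (the example address is arbitrary; the shift and mask constants follow from the field widths above):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t addr   = 0xFEEDFACE40ULL;        /* arbitrary 64-bit byte address */
    uint64_t offset = addr & 0x3F;            /* low 6 bits: block offset */
    uint64_t index  = (addr >> 6) & 0xFFFFF;  /* next 20 bits: set index */
    uint64_t tag    = addr >> 26;             /* remaining 38 bits: tag */
    printf("tag=0x%llx index=0x%llx offset=0x%llx\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}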
10. 10
Block Identification
cont…
[Figure: the cache as a column of 2^20 block frames selected by the 20-bit set index; each frame holds a 38-bit tag and a 2^6-byte cache block. All addresses with the same 20 set-index bits “compete” for a single block frame.]
12. 12
Block Identification
[Figure: the same 2^20-frame cache; an address arrives from the CPU, split into a 38-bit tag, a 20-bit set index, and a 6-bit block offset.]
Find the set: use the 20-bit set index to select the block frame.
Compare the tag:
If no match: miss.
If match: hit; access the byte at
the desired offset.
13. 13
Cache Write Policies
Write-through: Information is written
to both the block in the cache and the
block in memory
Write-back: Information is written
back to memory only when a block
frame is replaced:
Uses a “dirty” bit to indicate whether a
block was actually written to,
Saves unnecessary writes to memory
when a block is “clean”
14. 14
Trade-offs
Write back
Faster because writes occur at the
speed of the cache, not the memory.
Faster because multiple writes to
the same block are written back to
memory only once, using less memory
bandwidth.
Write through
Easier to implement
15. 15
Memory System
Performance
Memory system performance is largely
captured by three parameters:
Latency, Bandwidth, and Average Memory
Access Time (AMAT).
Latency:
The time it takes from the issue of a memory
request to the time the data is available at the
processor.
Bandwidth:
The rate at which data can be pumped to the
processor by the memory system.
16. 16
Average Memory Access
Time (AMAT)
• AMAT: The average time it takes for
the processor to get a data item it
requests.
• The time it takes to get requested
data to the processor can vary:
– due to the memory hierarchy.
• AMAT can be expressed as:
AMAT = Cache hit time + Miss rate x Miss penalty
17. 17
Cache Performance
Parameters
Performance of a cache is largely
determined by:
Cache miss rate: number of cache
misses divided by number of accesses.
Cache hit time: the time between
sending address and data returning from
cache.
Cache miss penalty: the extra processor
stall cycles incurred to fetch the block
from the next-level cache or memory.
18. 18
If a direct mapped cache has a hit rate of 95%,
a hit time of 4 ns, and a miss penalty of 100 ns,
what is the AMAT?
AMAT = Hit time + Miss rate x Miss penalty =
4 + 0.05 x 100 = 9 ns
If replacing the cache with a 2-way set
associative increases the hit rate to 97%, but
increases the hit time to 5 ns, what is the new
AMAT?
AMAT = Hit time + Miss rate x Miss penalty =
5 + 0.03 x 100 = 8 ns
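The same two calculations as a C sketch (the helper name amat_ns is illustrative):

#include <stdio.h>

/* AMAT = Hit time + Miss rate x Miss penalty */
double amat_ns(double hit_ns, double miss_rate, double penalty_ns) {
    return hit_ns + miss_rate * penalty_ns;
}

int main(void) {
    printf("direct mapped: %.1f ns\n", amat_ns(4.0, 0.05, 100.0)); /* 9.0 ns */
    printf("2-way:         %.1f ns\n", amat_ns(5.0, 0.03, 100.0)); /* 8.0 ns */
    return 0;
}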
20. 20
Unified cache (mixed cache): Data and
instructions are stored together (von
Neumann architecture)
Split cache: Data and instructions are
stored separately (Harvard
architecture)
Why do instruction caches have a
lower miss ratio?
Unified vs Split Caches
21. 21
Unified vs Split Caches
A Load or Store instruction requires two
memory accesses:
One for the instruction itself
One for the data
Therefore, unified cache causes a structural
hazard!
Modern processors use separate data and
instruction L1 caches:
As opposed to “unified” or “mixed” caches
The CPU simultaneously sends the
instruction address and the data address
to the two caches' ports.
Both caches can be configured differently
Size, associativity, etc.
22. 22
Unified vs Split Caches
Separate Instruction and Data caches:
Avoids structural hazard
Also, each cache can be tailored to its specific needs.
[Figure: two organizations. Unified cache: Processor -> Unified Cache-1 -> Unified Cache-2. Split cache: Processor -> I-Cache-1 and D-Cache-1 -> Unified Cache-2.]
24. 24
Which has the lower average memory access time?
Split cache : 16 KB instructions + 16 KB data
Unified cache: 32 KB (instructions + data)
Assumptions
Use miss rates from previous chart
Miss penalty is 50 cycles
Hit time is 1 cycle
75% of the total memory accesses for
instructions and 25% of the total memory
accesses for data
On the unified cache, a load or store hit takes
an extra cycle, since there is only one port for
instructions and data
25. 25
Average memory-access time = Hit time + Miss rate x Miss penalty
AMAT = %instr x (instr hit time + instr miss rate x instr miss penalty) +
%data x (data hit time + data miss rate x data miss penalty)
For the split cache:
AMAT = 75% x (1 + 0.64% x 50) + 25% (1 + 6.47% x 50) = 2.05
For the unified cache
AMAT = 75% x (1 + 1.99% x 50) + 25% x (2 + 1.99% x 50) = 2.24
The unified cache has a longer AMAT, even though its miss rate is
lower, because of the structural hazard: every load or store hit pays
an extra cycle contending with instruction fetches for the single port.
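A C sketch of the weighted AMAT computation above, using the example's miss rates:

#include <stdio.h>

int main(void) {
    double penalty = 50.0;
    /* split cache: 1-cycle hit on both the I- and D-cache */
    double split   = 0.75 * (1 + 0.0064 * penalty) + 0.25 * (1 + 0.0647 * penalty);
    /* unified cache: a load/store hit pays one extra cycle (single port) */
    double unified = 0.75 * (1 + 0.0199 * penalty) + 0.25 * (2 + 0.0199 * penalty);
    printf("split:   %.3f cycles\n", split);    /* 2.049, i.e. the 2.05 above */
    printf("unified: %.3f cycles\n", unified);  /* 2.245, i.e. the 2.24 above */
    return 0;
}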
26. 26
Cache Performance: Princeton (Unified) Architecture
CPUtime = Instruction count x CPI x Clock cycle time
CPIexecution = CPI with ideal memory
CPI = CPIexecution + Mem Stall cycles per instruction
(Memory stall cycles: Number of cycles during which processor is stalled waiting for a
memory access.)
CPUtime = Instruction Count x (CPIexecution + Mem Stall cycles per instruction) x Clock
cycle time
Mem Stall cycles per instruction = Mem accesses per instruction x Miss rate x
Miss penalty
CPUtime = IC x (CPIexecution + Mem accesses per instruction x Miss rate x Miss penalty)
x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty) x Clock cycle time
27. 27
Assuming the following execution and cache parameters:
Cache miss penalty = 50 cycles
Normal instruction execution CPI ignoring memory stalls = 2.0 cycles
Miss rate = 2%
Average memory references/instruction = 1.33
CPU time =
IC x [CPI execution + Memory accesses/instruction x Miss rate x
Miss penalty ] x Clock cycle time
CPUtime with cache = IC x (2.0 + (1.33 x 2% x 50)) x clock cycle time
= IC x 3.33 x Clock cycle time
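The same computation as a C sketch (the helper name is illustrative):

#include <stdio.h>

/* CPI = CPI_execution + Mem accesses per instruction x Miss rate x Miss penalty */
double cpi_with_stalls(double cpi_exec, double accesses_per_instr,
                       double miss_rate, double miss_penalty) {
    return cpi_exec + accesses_per_instr * miss_rate * miss_penalty;
}

int main(void) {
    /* parameters above: CPI_execution = 2.0, 1.33 refs/instr, 2% miss rate, 50-cycle penalty */
    printf("CPI = %.2f\n", cpi_with_stalls(2.0, 1.33, 0.02, 50.0)); /* 3.33 */
    return 0;
}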
28. 28
Suppose a CPU executes at Clock Rate = 200 MHz (5 ns per
cycle) with a single level of cache.
CPIexecution = 1.1
Instruction mix: 50% arith/logic, 30% load/store, 20%
control
Assume a cache miss rate of 1.5% and a miss penalty of 50
cycles.
CPI = CPIexecution + mem stalls per instruction
Mem Stalls per instruction = Mem accesses per instruction
x Miss rate x Miss penalty
Mem accesses per instruction = 1 + .3 = 1.3
Mem Stalls per instruction = 1.3 x .015 x 50 = 0.975
CPI = 1.1 + .975 = 2.075
A CPU with ideal memory (no misses) would be 2.075/1.1 ≈ 1.89 times
faster.
29. 29
Memory Stall CPI
= Misses per instruction x Miss penalty
= Memory accesses per instruction x Miss rate x Miss
Penalty
Example: Assume 0.2 memory accesses per instruction, a 2%
miss rate, and a 400-cycle miss penalty. How much is the
memory stall CPI?
Memory Stall CPI = 0.2 x 0.02 x 400 = 1.6 cycles
30. 30
Performance Example
Suppose:
Clock Rate = 200 MHz (5 ns per cycle), Ideal (no
misses) CPI = 1.1
50% arith/logic, 30% load/store, 20% control
10% of data memory operations get 50 cycles
miss penalty
1% of instruction memory operations also get 50
cycles miss penalty
Compute the effective CPI.
31. 31
Performance Example
CPI = ideal CPI + average stalls per instruction
= 1.1 (cycles/instr)
+ [0.30 (data mem ops/instr) x 0.10 (misses/data op) x 50 (cycles/miss)]
+ [1 (instr fetch/instr) x 0.01 (misses/fetch) x 50 (cycles/miss)]
= (1.1 + 1.5 + 0.5) cycles/instr = 3.1
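A C sketch of the same breakdown, separating the data and instruction stall terms:

#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.1;
    double data_stalls  = 0.30 * 0.10 * 50.0; /* 0.30 data ops/instr x 10% miss x 50 cycles = 1.5 */
    double instr_stalls = 1.00 * 0.01 * 50.0; /* 1 fetch/instr x 1% miss x 50 cycles = 0.5 */
    printf("CPI = %.1f\n", ideal_cpi + data_stalls + instr_stalls); /* 3.1 */
    return 0;
}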
33. 33
How to Improve Cache
Performance?
Average memory access time (AMAT) is a common metric
to analyze memory system performance. AMAT uses hit
time, miss penalty, and miss rate to measure memory
performance. It accounts for the fact that hits and
misses affect memory system performance differently.
1. Reduce miss rate.
2. Reduce miss penalty.
3. Reduce hit time.
Hit latency (H) is the time to hit in the cache. Miss rate
(MR) is the frequency of cache misses, while average miss
penalty (AMP) is the cost of a cache miss in terms of time:
AMAT = H + MR x AMP
34. 34
Six Basic Cache Optimization Techniques
• Reducing miss penalty
• Multilevel caches to reduce miss penalty
• Giving priority to read misses over writes
to reduce miss penalty
• Reducing miss rate
• Larger block size to reduce miss rate
• Bigger caches to reduce miss rate
• Higher associativity to reduce miss rate
• Reducing cache hit time
• Avoiding address translation during
indexing of the cache to reduce hit time
37. 37
Reducing Miss Penalty: Multi-
Level Cache
Add a second-level cache.
L2 Equations:
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss
PenaltyL2
AMAT = Hit TimeL1 + Miss RateL1 x (Hit TimeL2 + Miss
RateL2 × Miss PenaltyL2)
AMAT can be extended recursively to multiple layers
of the memory hierarchy.
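Because the equation nests, a short C sketch can evaluate AMAT recursively over any number of levels (the arrays below use illustrative values; the last level stands in for main memory):

#include <stdio.h>

/* AMAT at level i = hit_time[i] + miss_rate[i] x AMAT at level i+1. */
double amat(const double *hit, const double *miss, int levels) {
    if (levels == 1)
        return hit[0];                      /* final level: memory access time */
    return hit[0] + miss[0] * amat(hit + 1, miss + 1, levels - 1);
}

int main(void) {
    double hit[]  = {1.0, 10.0, 100.0};     /* L1, L2, memory (cycles) */
    double miss[] = {0.04, 0.50, 0.0};      /* local miss rates */
    printf("AMAT = %.1f cycles\n", amat(hit, miss, 3)); /* 1 + 4% x (10 + 50% x 100) = 3.4 */
    return 0;
}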
38. 38
Multi-Level Cache: Some
Definitions
Local miss rate— misses in this cache
divided by the total number of memory
accesses to this cache (Miss rateL2)
Global miss rate—misses in this cache
divided by the total number of memory
accesses generated by the CPU
Global Miss RateL2 = Local Miss RateL1 x Local Miss RateL2
For L1, the global miss rate equals the local miss rate.
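A one-line C sketch of the relationship (the 4% and 50% values are illustrative):

#include <stdio.h>

int main(void) {
    double local_l1 = 0.04, local_l2 = 0.50; /* e.g., 40 L1 misses and 20 L2 misses per 1000 references */
    /* Global Miss RateL2 = Local Miss RateL1 x Local Miss RateL2 */
    printf("global L2 miss rate = %.1f%%\n", 100.0 * local_l1 * local_l2); /* 2.0% */
    return 0;
}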
40. 40
Performance Improvement Due
to L2 Cache
Assume:
• For 1000 memory references:
– 40 misses in L1,
– 20 misses in L2
• L1 hit time: 1 cycle,
• L2 hit time: 10 cycles,
• L2 miss penalty=100
• 1.5 memory references per instruction
• Assume ideal CPI=1.0
Find: Local miss rate, AMAT, stall cycles per
instruction, and those without L2 cache.
41. 41
Example : Solution
With L2 cache:
Miss rateL1 = 40/1000 = 4%; local miss rateL2 = 20/40 = 50%
AMAT = 1 + 4% x (10 + 50% x 100) = 3.4 cycles
Average memory stalls per instruction =
(3.4 - 1.0) x 1.5 = 3.6
Without L2 cache:
AMAT = 1 + 4% x 100 = 5 cycles
Average memory stalls per instruction = (5 - 1.0) x 1.5 = 6
Performance improvement with L2 = (6 + 1)/(3.6 + 1) = 1.52,
i.e., about 52% faster
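The whole comparison as a C sketch, using the example's values:

#include <stdio.h>

int main(void) {
    double refs_per_instr = 1.5, ideal_cpi = 1.0;
    double amat_with_l2    = 1 + 0.04 * (10 + 0.50 * 100); /* 3.4 cycles */
    double amat_without_l2 = 1 + 0.04 * 100;               /* 5.0 cycles */
    double stalls_with     = (amat_with_l2 - 1.0) * refs_per_instr;    /* 3.6 */
    double stalls_without  = (amat_without_l2 - 1.0) * refs_per_instr; /* 6.0 */
    printf("speedup with L2 = %.2f\n",
           (stalls_without + ideal_cpi) / (stalls_with + ideal_cpi)); /* ~1.52 */
    return 0;
}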
42. 42
• Reducing miss penalty
• Giving priority to read misses
over writes to reduce miss
penalty
43. 43
Reducing Miss Penalty : Read
Priority over Write on Miss
In a write-back scheme:
Normally a dirty block is stored in a
write buffer temporarily.
Usual approach:
Write all blocks from the write buffer
to memory, and then do the read.
Instead:
Check the write buffer first; if the block is
not found there, initiate the read immediately.
This reduces CPU stall cycles.
44. 44
Reducing Miss Penalty :
Read Priority over Write on Miss
A write buffer with a write through:
Allows cache writes to occur at the
speed of the cache.
Write buffers, however, complicate
memory accesses:
they may hold the updated value of a
location needed on a read miss.
45. 45
Reducing Miss Penalty :
Read Priority over Write on Miss
Write-through with write buffers:
Read priority over write: Check write
buffer contents before read;
if no conflicts, let the memory access
continue.
Write priority over read: waiting for the
write buffer to empty first can
increase the read miss penalty.
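A C sketch of the read-priority policy: on a read miss the write buffer is searched first, and memory is accessed only if there is no conflict. The structures and names here are illustrative, not a real cache interface:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4

/* Pending (address, value) writes not yet drained to memory. */
struct wb_entry { uint64_t addr; uint32_t value; bool valid; };
static struct wb_entry write_buffer[WB_ENTRIES];

static uint32_t memory_read(uint64_t addr) { (void)addr; return 0; } /* stand-in for DRAM */

/* Read priority over write: check the write buffer before memory. */
uint32_t read_miss(uint64_t addr) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].addr == addr)
            return write_buffer[i].value;  /* forwarded from the buffer; no need to drain it first */
    return memory_read(addr);              /* no conflict: let the memory access continue */
}

int main(void) {
    write_buffer[0] = (struct wb_entry){ .addr = 0x40, .value = 42, .valid = true };
    printf("read 0x40 -> %u\n", (unsigned)read_miss(0x40)); /* hits the write buffer */
    printf("read 0x80 -> %u\n", (unsigned)read_miss(0x80)); /* goes to memory */
    return 0;
}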
46. 46
• Reducing miss rate
• Larger block size to reduce miss rate
• Bigger caches to reduce miss rate
• Higher associativity to reduce miss
rate
47. Reducing Misses
Classifying Misses: 3 Cs
Compulsory—The first access to a block is not in the cache, so
the block must be brought into the cache. These are also
called cold start misses or first reference misses.
(Misses in infinite cache)
Capacity—If the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur due
to blocks being discarded and later retrieved.
(Misses due to size of cache)
Conflict—If the block-placement strategy is set associative
or direct mapped, conflict misses (in addition to compulsory
and capacity misses) will occur because a block can be
discarded and later retrieved if too many blocks map to its
set. These are also called collision misses or interference
misses.
(Misses due to the associativity and size of the cache)
48. Bigger caches
One way to decrease misses is to increase the
cache size
Reduces capacity and conflict misses
No effect on compulsory misses
However, a larger cache may increase the hit time:
larger cache => larger access time
If the cache is too large, it can't fit on the
same chip as the processor.
49. Larger block size
Take advantage of spatial locality:
Decreases compulsory misses
However, larger blocks have disadvantages:
May increase the miss penalty
(need to fetch more data)
May increase hit time (need to
read more data from cache)
Increasing the block size can
help, but don't overdo it.
50. Increasing associativity
Increasing associativity helps reduce conflict
misses
2:1 Cache Rule:
The miss rate of a direct mapped cache of size N is
about equal to the miss rate of a 2-way set
associative cache of size N/2
For example, the miss rate of a 32 KByte direct
mapped cache is about equal to the miss rate of a 16
KByte 2-way set associative cache
Disadvantages of higher associativity:
Need to do a large number of tag comparisons
Need an n-to-1 multiplexor for n-way set associative
Could increase hit time
54. 54
Hit Time Reduction: Simultaneous
Tag Comparison and Data Reading
• After indexing:
– Tag can be compared
and at the same time
block can be fetched.
– If it’s a miss --- then no
harm done, miss must be
dealt with.
[Figure: a set-associative cache in which each way pairs a tag-and-comparator with one cache line of data; tags and data are read simultaneously, and on a miss the request is forwarded to the next lower level in the hierarchy.]
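A C sketch of the idea for a direct-mapped cache: the data is read out in parallel with the tag, and the comparison afterwards decides whether the speculatively read data may be used (all structures and sizes illustrative):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 1024  /* assumed: 10 index bits above a 6-bit block offset */

struct line { uint64_t tag; uint32_t data; bool valid; };
static struct line cache[NUM_SETS];

/* Index once, then read tag and data simultaneously. */
bool lookup(uint64_t addr, uint32_t *out) {
    uint64_t index = (addr >> 6) % NUM_SETS;
    uint64_t tag   = addr >> 16;            /* bits above offset + index */
    uint32_t data  = cache[index].data;     /* fetched in parallel with the tag */
    bool hit = cache[index].valid && cache[index].tag == tag;
    if (hit) *out = data;                   /* on a miss, no harm done: data is discarded */
    return hit;
}

int main(void) {
    cache[1] = (struct line){ .tag = 0, .data = 7, .valid = true };
    uint32_t v = 0;
    bool hit = lookup(0x40, &v);            /* index 1, tag 0 */
    printf("hit=%d data=%u\n", (int)hit, (unsigned)v);
    return 0;
}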