1
Memory Hierarchy Design and Optimizations
School of Computer Engineering
KIIT University
24-11-2023
2
Introduction
 Even a sophisticated processor may perform well below an ordinary one:
 unless it is supported by a memory system of matching performance.
 The focus of this module:
 Study how memory system performance has
been enhanced through various innovations
and optimizations.
3
Typical Memory Hierarchy
 Here we focus on L1/L2/L3 caches, virtual memory and main memory
[Figure: the memory hierarchy pyramid, from Proc/Regs through L1-Cache, L2-Cache, L3-Cache (optional), Memory, and Disk/Tape; levels get faster toward the processor and bigger toward the bottom.]
4
What is the Role of a Cache?
 A small, fast storage used to improve average
access time to a slow memory.
 Caches have no “inherent value”; they only narrow the processor-memory performance gap
 Improves memory system performance:
 Exploits spatial and temporal locality
 Temporal locality refers to the reuse of specific
data, and/or resources, within a relatively small
time duration. Spatial locality (also termed data
locality) refers to the use of data elements
within relatively close storage locations.
5
Four Basic Questions
 Q1: Where can a block be placed in the cache?
(Block placement)
 Fully Associative, Set Associative, Direct Mapped
 Q2: How is a block found if it is in the cache?
(Block identification)
 Tag/Block
 Q3: Which block should be replaced on a
miss?
(Block replacement)
 Random, LRU
 Q4: What happens on a write?
(Write strategy)
 Write Back or Write Through (with Write Buffer)
6
Block Placement
 If a block has only one possible place
in the cache: direct mapped
 If a block can be placed anywhere:
fully associative
 If a block can be placed in a
restricted subset of the possible
places: set associative
 If there are n blocks in each subset: n-way set associative
 Note that direct-mapped = 1-way set
associative
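To make the three placement policies concrete, here is a minimal Python sketch (with hypothetical cache parameters, not taken from the slides) of which frames a given block may occupy under each policy; note how direct-mapped falls out as the 1-way special case.

```python
NUM_FRAMES = 8          # total block frames in the cache (hypothetical)
ASSOC = 2               # n for n-way set associative
NUM_SETS = NUM_FRAMES // ASSOC

def candidate_frames(block_addr, policy):
    """Return the frame indices where this block may be placed."""
    if policy == "direct-mapped":          # exactly one possible frame
        return [block_addr % NUM_FRAMES]
    if policy == "set-associative":        # any frame within one set
        s = block_addr % NUM_SETS
        return list(range(s * ASSOC, (s + 1) * ASSOC))
    if policy == "fully-associative":      # anywhere in the cache
        return list(range(NUM_FRAMES))

print(candidate_frames(13, "direct-mapped"))     # [5]
print(candidate_frames(13, "set-associative"))   # [2, 3] (set 1, 2-way)
print(candidate_frames(13, "fully-associative")) # [0, 1, ..., 7]
```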
7
Trade-offs
 n-way set associative becomes
increasingly difficult (costly) to
implement for large n
 Most caches today are either 1-way (direct
mapped), 2-way, or 4-way set associative
 The larger n the lower the likelihood of
thrashing
 e.g., two blocks competing for the same
block frame and being accessed in sequence
over and over
8
Block Identification
cont…
 Given an address, how do we find where it goes in the cache?
 This is done by first breaking the address into three parts:
 tag: used for identifying a match
 set index: index of the set
 block offset: offset of the address within the cache block
[Figure: the block address consists of the tag followed by the set index; the block offset completes the address.]
9
Block Identification
cont…
 Consider the following system
 Addresses are on 64 bits
 Memory is byte-addressable
 Block frame size is 2^6 = 64 bytes
 Cache is 64 MByte (2^26 bytes)
 Consists of 2^20 block frames
 Direct-mapped
 For each cache block brought in from memory, there is a single possible frame among the 2^20 available
 A 64-bit address can then be decomposed as follows (64 - 6 = 58 bits of block address, plus the 6-bit block offset):
58-bit block address
6-bit block offset
20-bit set index
38-bit tag
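As a quick sanity check, the following Python sketch extracts the three fields from a 64-bit address using the slide's parameters (6-bit offset, 20-bit set index, 38-bit tag); the example address is arbitrary.

```python
OFFSET_BITS, INDEX_BITS = 6, 20
TAG_BITS = 64 - OFFSET_BITS - INDEX_BITS    # 38

def decompose(addr):
    """Split a 64-bit byte address into (tag, set index, block offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = decompose(0x0000_7FAB_CDEF_1234)
print(f"tag={tag:#x}  index={index:#x}  offset={offset:#x}")
```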
10
Block Identification
cont…
[Figure: a direct-mapped cache drawn as 2^20 block frames indexed by the 20-bit set index (0...00 through 1...11); each frame holds a 38-bit tag and a 2^6-byte cache block.]
All addresses with the same 20 set-index bits “compete” for a single block frame
11
Block Identification
cont…
[Figure: the same cache; the 20-bit set index of the address arriving from the CPU selects one of the 2^20 block frames (“Find the set”).]
12
Block Identification
[Figure: the address arriving from the CPU first selects one of the 2^20 frames via its 20-bit set index (“Find the set”), then the frame's 38-bit stored tag is compared against the address tag (“Compare the tag”).]
 If no match: miss
 If match: hit; access the byte at the desired offset
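Putting the steps together, here is a toy direct-mapped lookup in Python; the `cache` dictionary, mapping set index to a (tag, block) pair, is an assumption of this sketch standing in for the hardware arrays.

```python
OFFSET_BITS, INDEX_BITS = 6, 20

def lookup(cache, addr):
    """cache: dict mapping set index -> (tag, 64-byte block)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    entry = cache.get(index)                   # find the set (one frame here)
    if entry is not None and entry[0] == tag:  # compare the tag
        return entry[1][offset]                # hit: byte at the desired offset
    return None                                # miss: block must be fetched
```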
13
Cache Write Policies
 Write-through: Information is written
to both the block in the cache and the
block in memory
 Write-back: Information is written
back to memory only when a block
frame is replaced:
 Uses a “dirty” bit to indicate whether a
block was actually written to,
 Saves unnecessary writes to memory
when a block is “clean”
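A minimal model of the two policies might look like the sketch below, assuming a hypothetical `memory` object with a `write` method; the dirty bit is what lets write-back skip the memory write for clean blocks.

```python
class Line:
    """One cache line: a tag, its data, and a dirty bit."""
    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False

def store(line, offset, byte, memory, policy):
    line.data[offset] = byte
    if policy == "write-through":
        memory.write(line.tag, line.data)   # update cache AND memory
    else:                                   # write-back
        line.dirty = True                   # memory updated only on eviction

def evict(line, memory):
    if line.dirty:                          # "clean" blocks skip this write
        memory.write(line.tag, line.data)
```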
14
Trade-offs
 Write back
 Faster because writes occur at the
speed of the cache, not the memory.
 Faster because multiple writes to the same block are written back to memory only once, using less memory bandwidth.
 Write through
 Easier to implement
15
Memory System
Performance
 Memory system performance is largely
captured by three parameters,
 Latency, Bandwidth, Average memory access time
(AMAT).
 Latency:
 The time it takes from the issue of a memory
request to the time the data is available at the
processor.
 Bandwidth:
 The rate at which data can be pumped to the
processor by the memory system.
16
Average Memory Access
Time (AMAT)
• AMAT: The average time it takes for
the processor to get a data item it
requests.
• The time it takes to get requested
data to the processor can vary:
– due to the memory hierarchy.
• AMAT can be expressed as:
AMAT = Cache hit time + Miss rate x Miss penalty
17
Cache Performance
Parameters
 Performance of a cache is largely
determined by:
 Cache miss rate: number of cache
misses divided by number of accesses.
 Cache hit time: the time between
sending address and data returning from
cache.
 Cache miss penalty: the extra processor
stall cycles caused by access to the
next-level cache.
18
 If a direct mapped cache has a hit rate of 95%,
a hit time of 4 ns, and a miss penalty of 100 ns,
what is the AMAT?
AMAT = Hit time + Miss rate x Miss penalty =
4 + 0.05 x 100 = 9 ns
 If replacing the cache with a 2-way set
associative increases the hit rate to 97%, but
increases the hit time to 5 ns, what is the new
AMAT?
AMAT = Hit time + Miss rate x Miss penalty =
5 + 0.03 x 100 = 8 ns
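The two calculations above drop straight into a one-line helper; a quick sketch in Python:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same units as the inputs."""
    return hit_time + miss_rate * miss_penalty

print(amat(4, 0.05, 100))   # direct mapped: 9.0 ns
print(amat(5, 0.03, 100))   # 2-way set associative: 8.0 ns
```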
19
Memory Hierarchy
Optimizations
20
Unified vs Split Caches
 Unified cache (mixed cache): Data and instructions are stored together (von Neumann architecture)
 Split cache: Data and instructions are stored separately (Harvard architecture)
 Why do instruction caches have a lower miss ratio?
21
Unified vs Split Caches
 A Load or Store instruction requires two
memory accesses:
 One for the instruction itself
 One for the data
 Therefore, a unified cache causes a structural hazard!
 Modern processors use separate data and
instruction L1 caches:
 As opposed to “unified” or “mixed” caches
 The CPU simultaneously sends:
 the instruction address and the data address to the two caches' ports.
 Both caches can be configured differently
 Size, associativity, etc.
22
Unified vs Split Caches
 Separate Instruction and Data caches:
 Avoids structural hazard
 Also, each cache can be tailored to its specific needs.
[Figure: the unified organization (Processor, Unified Cache-1, Unified Cache-2) next to the split organization (Processor, I-Cache-1 and D-Cache-1, Unified Cache-2).]
23
Typical Cache Performance Data Using SPEC92
24
 Which has the lower average memory access time?
 Split cache : 16 KB instructions + 16 KB data
 Unified cache: 32 KB (instructions + data)
 Assumptions
 Use miss rates from previous chart
 Miss penalty is 50 cycles
 Hit time is 1 cycle
 75% of the total memory accesses for
instructions and 25% of the total memory
accesses for data
 On the unified cache, a load or store hit takes
an extra cycle, since there is only one port for
instructions and data
25
Average memory-access time = Hit time + Miss rate x Miss penalty
AMAT = %instr x (instr hit time + instr miss rate x instr miss penalty) +
%data x (data hit time + data miss rate x data miss penalty)
For the split cache:
AMAT = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
For the unified cache
AMAT = 75% x (1 + 1.99% x 50) + 25% x (2 + 1.99% x 50) = 2.24
The unified cache has a longer AMAT, even though its miss rate is lower, because load/store hits pay an extra cycle on the single shared port (the structural hazard).
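The same comparison, reproduced as a short Python check; the miss rates and the extra unified-port cycle are taken from the slide:

```python
def weighted_amat(frac_instr, amat_instr, frac_data, amat_data):
    """Blend instruction and data access times by their access fractions."""
    return frac_instr * amat_instr + frac_data * amat_data

split = weighted_amat(0.75, 1 + 0.0064 * 50, 0.25, 1 + 0.0647 * 50)
unified = weighted_amat(0.75, 1 + 0.0199 * 50, 0.25, 2 + 0.0199 * 50)
# split ~ 2.049, unified ~ 2.245 (the slide rounds to 2.05 and 2.24)
print(f"split: {split:.3f}  unified: {unified:.3f}")
```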
26
26
Cache Performance: Princeton (Unified) Architecture
CPUtime = Instruction count x CPI x Clock cycle time
CPIexecution = CPI with ideal memory
CPI = CPIexecution + Mem Stall cycles per instruction
(Memory stall cycles: Number of cycles during which processor is stalled waiting for a
memory access.)
CPUtime = Instruction Count x (CPIexecution + Mem Stall cycles per instruction) x Clock
cycle time
Mem Stall cycles per instruction = Mem accesses per instruction x Miss rate x
Miss penalty
CPUtime = IC x (CPIexecution + Mem accesses per instruction x Miss rate x Miss penalty)
x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty) x Clock cycle time
27
Assuming the following execution and cache parameters:
Cache miss penalty = 50 cycles
Normal instruction execution CPI ignoring memory stalls = 2.0 cycles
Miss rate = 2%
Average memory references/instruction = 1.33
CPU time =
IC x [CPI execution + Memory accesses/instruction x Miss rate x
Miss penalty ] x Clock cycle time
CPUtime with cache = IC x (2.0 + (1.33 x 2% x 50)) x clock cycle time
= IC x 3.33 x Clock cycle time
28
Suppose a CPU executes at Clock Rate = 200 MHz (5 ns per
cycle) with a single level of cache.
CPIexecution = 1.1
Instruction mix: 50% arith/logic, 30% load/store, 20%
control
Assume a cache miss rate of 1.5% and a miss penalty of 50
cycles.
CPI = CPIexecution + mem stalls per instruction
Mem Stalls per instruction = Mem accesses per instruction
x Miss rate x Miss penalty
Mem accesses per instruction = 1 + .3 = 1.3
Mem Stalls per instruction = 1.3 x .015 x 50 = 0.975
CPI = 1.1 + .975 = 2.075
A CPU with ideal memory (no misses) would be 2.075/1.1 = 1.88 times faster.
29
Memory Stall CPI
= Miss per inst × miss penalty
= % Memory Access/Instr × Miss rate × Miss
Penalty
Example: Assume 20% memory acc/instruction, 2%
miss rate, 400-cycle miss penalty. How much is
memory stall CPI?
Memory Stall CPI= 0.2*0.02*400=1.6 cycles
30
Performance Example
 Suppose:
 Clock Rate = 200 MHz (5 ns per cycle), Ideal (no
misses) CPI = 1.1
 50% arith/logic, 30% load/store, 20% control
 10% of data memory operations get 50 cycles
miss penalty
 1% of instruction memory operations also get 50
cycles miss penalty
 Compute the effective CPI.
31
Performance Example
 CPI = ideal CPI + average stalls per instruction
= 1.1 (cycles/ins)
+ [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
+ [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
= (1.1 + 1.5 + 0.5) cycles/ins = 3.1
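The arithmetic, spelled out as a quick check in Python:

```python
ideal_cpi = 1.1
data_stalls = 0.30 * 0.10 * 50   # load/store fraction x miss rate x penalty = 1.5
instr_stalls = 1.0 * 0.01 * 50   # one fetch per instruction x miss rate x penalty = 0.5
print(f"{ideal_cpi + data_stalls + instr_stalls:.1f}")   # 3.1
```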
32
Cache memory
Optimizations
33
How to Improve Cache
Performance?
Average memory access time (AMAT) is a common metric
to analyze memory system performance. AMAT uses hit
time, miss penalty, and miss rate to measure memory
performance. It accounts for the fact that hits and
misses affect memory system performance differently.
1. Reduce miss rate.
2. Reduce miss penalty.
3. Reduce hit time.
Hit latency (H) is the time to hit in the cache. Miss rate (MR) is the frequency of cache misses, while average miss penalty (AMP) is the cost of a cache miss in terms of time:
AMAT = H + MR x AMP
34
Six basic cache optimization techniques
• Reducing miss penalty
• Multilevel caches to reduce miss penalty
• Giving priority to read misses over writes
to reduce miss penalty
• Reducing miss rate
• Larger block size to reduce miss rate
• Bigger caches to reduce miss rate
• Higher associativity to reduce miss rate
• Reducing cache hit time
• Avoiding address translation during
indexing of the cache to reduce hit time
35
Multi-Level Cache
36
37
Reducing Miss Penalty: Multi-
Level Cache
 Add a second-level cache.
 L2 Equations:
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2
AMAT = Hit TimeL1 + Miss RateL1 x (Hit TimeL2 + Miss RateL2 x Miss PenaltyL2)
AMAT can be extended recursively to multiple levels of the memory hierarchy.
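Because each level's miss penalty is just the AMAT of the level below it, the recursion is natural to write down; a sketch in Python:

```python
def amat(levels, memory_time):
    """levels: list of (hit_time, miss_rate) pairs, outermost (L1) first."""
    if not levels:
        return memory_time
    hit_time, miss_rate = levels[0]
    return hit_time + miss_rate * amat(levels[1:], memory_time)

# e.g. an L1 (1 cycle, 4% misses) over an L2 (10 cycles, 50% local
# misses) over a 100-cycle main memory:
print(f"{amat([(1, 0.04), (10, 0.50)], 100):.1f}")   # 3.4 cycles
```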
38
Multi-Level Cache: Some
Definitions
 Local miss rate— misses in this cache
divided by the total number of memory
accesses to this cache (Miss rateL2)
 Global miss rate—misses in this cache
divided by the total number of memory
accesses generated by the CPU
 L2 Global miss rate = Local Miss RateL1 x Local Miss RateL2
 L1 Global miss rate = L1 Local miss rate
39
40
Performance Improvement Due
to L2 Cache
Assume:
• For 1000 memory references:
– 40 misses in L1,
– 20 misses in L2
• L1 hit time: 1 cycle,
• L2 hit time: 10 cycles,
• L2 miss penalty=100
• 1.5 memory references per instruction
• Assume ideal CPI=1.0
Find: Local miss rate, AMAT, stall cycles per
instruction, and those without L2 cache.
41
Example : Solution
 With L2 cache:
 Local miss rate = 20/40 = 50%
 AMAT = 1 + 4% x (10 + 50% x 100) = 3.4
 Average memory stalls per instruction = (3.4 - 1.0) x 1.5 = 3.6
 Without L2 cache:
 AMAT = 1 + 4% x 100 = 5
 Average memory stalls per instruction = (5 - 1.0) x 1.5 = 6
 Performance improvement with L2 = (6 + 1)/(3.6 + 1) = 1.52, i.e., 52%
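The whole example can be replayed numerically; a Python check, using the 40 and 20 misses per 1000 memory references from above:

```python
l1_miss = 40 / 1000                 # 4% L1 miss rate
l2_local = 20 / 40                  # 50% local L2 miss rate
amat_l2 = 1 + l1_miss * (10 + l2_local * 100)   # with L2
amat_no_l2 = 1 + l1_miss * 100                  # without L2
stalls_l2 = (amat_l2 - 1.0) * 1.5               # 1.5 references/instruction
stalls_no_l2 = (amat_no_l2 - 1.0) * 1.5
print(f"AMAT with L2: {amat_l2:.1f}, without: {amat_no_l2:.1f}")              # 3.4, 5.0
print(f"stalls/instr with L2: {stalls_l2:.1f}, without: {stalls_no_l2:.1f}")  # 3.6, 6.0
print(f"speedup: {(1.0 + stalls_no_l2) / (1.0 + stalls_l2):.2f}x")            # 1.52x
```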
42
• Reducing miss penalty
• Giving priority to read misses
over writes to reduce miss
penalty
43
Reducing Miss Penalty : Read
Priority over Write on Miss
 In a write-back scheme:
 Normally a dirty block is stored in a
write buffer temporarily.
 Usual:
 Write all blocks from the write buffer
to memory, and then do the read.
 Instead:
 Check the write buffer first; if the block is not found there, initiate the read.
 This reduces CPU stall cycles.
44
Reducing Miss Penalty :
Read Priority over Write on Miss
 A write buffer with write-through:
 Allows cache writes to occur at the speed of the cache.
 The write buffer, however, complicates memory accesses:
 It may hold the updated value of a location needed on a read miss.
45
Reducing Miss Penalty :
Read Priority over Write on Miss
 Write-through with write buffers:
Read priority over write: Check write
buffer contents before read;
if no conflicts, let the memory access
continue.
Write priority over read: waiting for the write buffer to empty first can increase the read miss penalty.
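A sketch of the read-priority policy in Python; the `memory_read` callback and the buffer layout are assumptions of this model, not part of the slides:

```python
from collections import deque

write_buffer = deque()   # pending writes: (block_addr, data) awaiting memory

def read_miss(block_addr, memory_read):
    """Service a read miss, giving it priority over buffered writes."""
    for addr, data in write_buffer:   # check the write buffer first
        if addr == block_addr:
            return data               # conflict: forward the buffered value
    return memory_read(block_addr)    # no conflict: read proceeds at once,
                                      # without waiting for the buffer to drain
```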
46
• Reducing miss rate
• Larger block size to reduce miss rate
• Bigger caches to reduce miss rate
• Higher associativity to reduce miss
rate
Reducing Misses
 Classifying Misses: 3 Cs
 Compulsory—The first access to a block is not in the cache, so
the block must be brought into the cache. These are also
called cold start misses or first reference misses.
(Misses in infinite cache)
 Capacity—If the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur due
to blocks being discarded and later retrieved.
(Misses due to size of cache)
 Conflict—If the block-placement strategy is set associative
or direct mapped, conflict misses (in addition to compulsory
and capacity misses) will occur because a block can be
discarded and later retrieved if too many blocks map to its
set. These are also called collision misses or interference
misses.
(Misses due to associativity and size of cache)
Bigger caches
 One way to decrease misses is to increase the
cache size
 Reduces capacity and conflict misses
 No effect on compulsory misses
 However a larger cache may increase the hit time
 larger cache => larger access time
 If the cache is too large, it can't fit on the same chip as the processor.
Larger block size
 Take advantage of spatial locality
 Decreases compulsory misses
 However, there are disadvantages:
 May increase the miss penalty (need to fetch more data)
 May increase hit time (need to read more data from the cache)
 Increasing the block size can help, but don't overdo it.
Increasing associativity
 Increasing associativity helps reduce conflict misses
 2:1 Cache Rule:
 The miss rate of a direct-mapped cache of size N is about equal to the miss rate of a 2-way set-associative cache of size N/2
 For example, the miss rate of a 32 KByte direct-mapped cache is about equal to the miss rate of a 16 KByte 2-way set-associative cache
 Disadvantages of higher associativity:
 Need to do a large number of comparisons
 Need an n-to-1 multiplexor for n-way set associative
 Could increase hit time
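The thrashing example from the earlier trade-offs slide can be reproduced with a tiny simulator; a sketch in Python (4 block frames, LRU within a set; the trace alternates two blocks that collide in a direct-mapped cache):

```python
def misses(trace, num_frames, assoc):
    """Count misses for a trace of block addresses, LRU within each set."""
    num_sets = num_frames // assoc
    sets = [[] for _ in range(num_sets)]   # each set: tags in LRU order
    count = 0
    for block in trace:
        s, tag = block % num_sets, block // num_sets
        if tag in sets[s]:
            sets[s].remove(tag)            # hit: refresh LRU position
        else:
            count += 1                     # miss
            if len(sets[s]) == assoc:
                sets[s].pop(0)             # evict least recently used
        sets[s].append(tag)
    return count

trace = [0, 4, 0, 4, 0, 4]     # blocks 0 and 4 map to the same frame mod 4
print(misses(trace, 4, 1))     # 6: every access misses (thrashing)
print(misses(trace, 4, 2))     # 2: only the two compulsory misses remain
```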
51
52
• Reducing cache hit time
• Avoiding address translation during
indexing of the cache to reduce hit
time
53
54
Hit Time Reduction: Simultaneous Tag Comparison and Data Reading
• After indexing:
– Tag can be compared
and at the same time
block can be fetched.
– If it's a miss, no harm is done; the miss must still be handled.
[Figure: a cache organized as parallel TAGS and DATA arrays; each cache line of data is paired with its tag and a comparator, and misses go to the next lower level in the hierarchy.]
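In software terms, the trick looks like the sketch below: the data word is read out in the same step as the tag comparison and simply discarded on a mismatch. The parallel `tags`/`data` array layout is an assumption of this model.

```python
OFFSET_BITS, INDEX_BITS = 6, 20

def lookup(tags, data, addr):
    """tags[i] and data[i] are the stored tag and block for frame i."""
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    word = data[index][addr & ((1 << OFFSET_BITS) - 1)]  # read the data...
    hit = tags[index] == tag                             # ...while comparing tags
    return word if hit else None   # on a miss, the speculative read is discarded
```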