Memory Hierarchy
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by
John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan
Kaufmann Publishers
Processor-Memory Performance Gap
 Gap grew ~50% per year
Why Memory Hierarchy?
 Applications want unlimited amounts of memory
with low latency
 Fast memory is more expensive per bit
 Solution
 Organize memory system into a hierarchy
 Entire addressable memory space available in largest,
slowest memory
 Incrementally smaller & faster memories
 Temporal & spatial locality ensures that nearly all
references can be found in smaller memories
 Gives illusion of a large, fast memory being presented
to processor
Memory Hierarchy
Why Hierarchical Design?
 Becomes more crucial with multi-core
processors
 Aggregate peak bandwidth grows with no of cores
 Intel Core i7 can generate 2 references per core per clock
 4 cores and 3.2 GHz clock
 25.6 billion 64-bit data references/second +
12.8 billion 128-bit instruction references
= 409.6 GB/s
 DRAM bandwidth is only 6% of this (25 GB/s)
 Requires
 Multi-port, pipelined caches
 2 levels of cache per core
 Shared third-level cache on chip
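The bandwidth arithmetic above can be checked directly; a quick sketch:

```python
# Recomputing the Core i7 figures above (4 cores, 3.2 GHz, 2 data
# references per core per clock; 64-bit data, 128-bit instruction fetches).
cores, clock_hz = 4, 3.2e9

data_refs = cores * clock_hz * 2        # 25.6 billion 64-bit data refs/s
inst_refs = cores * clock_hz            # 12.8 billion 128-bit instruction refs/s

bandwidth = data_refs * 8 + inst_refs * 16   # bytes/s
print(bandwidth / 1e9)                       # 409.6 GB/s

dram_bw = 25e9
print(round(dram_bw / bandwidth * 100))      # DRAM covers only ~6%
```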
Core i7 Die & Major Components
Source: Intel
Performance vs. Power
 High-end microprocessors have >10 MB on-chip
cache
 Consumes large amount of chip area & power budget
 Leakage current – when not operating
 Active current – when operating
 Major limiting factor for processors used in
mobile devices
Definitions – Blocks
 Data moves between levels of the hierarchy in fixed-size blocks
 Spatial locality improves efficiency
 Blocks are tagged with their memory address
 Tags are searched in parallel
Source: http://archive.arstechnica.com/paedia/c/caching/m-caching-5.html
Definitions – Associativity
 Defines where blocks can be placed in a cache
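As an illustration of placement, here is a sketch of how a byte address splits into tag, set index, and block offset; the cache parameters (64 KB, 64 B blocks, 4-way set associative) are hypothetical:

```python
# Sketch: where a block can go in a set-associative cache.
CACHE_BYTES, BLOCK_BYTES, WAYS = 64 * 1024, 64, 4

num_blocks = CACHE_BYTES // BLOCK_BYTES        # 1024 blocks
num_sets = num_blocks // WAYS                  # 256 sets

def placement(addr):
    """Split a byte address into tag / set index / block offset."""
    offset = addr % BLOCK_BYTES                # byte within the block
    block_addr = addr // BLOCK_BYTES
    index = block_addr % num_sets              # the one set the block may go in
    tag = block_addr // num_sets               # compared against stored tags
    return tag, index, offset

print(placement(0x12345))   # (4, 141, 5)
```

The block can be placed in any of the 4 ways of set 141; the tag identifies which block currently occupies a way.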
Pentium 4 vs. Opteron Memory Hierarchy

CPU: Pentium 4 (3.2 GHz) | Opteron (2.8 GHz)
Instruction cache: Trace cache (8K micro-ops) | 2-way associative, 64 KB, 64B block
Data cache: 8-way associative, 16 KB, 64B block, inclusive in L2 | 2-way associative, 64 KB, 64B block, exclusive to L2
L2 cache: 8-way associative, 2 MB, 128B block | 16-way associative, 1 MB, 64B block
Prefetch: 8 streams to L2 | 1 stream to L2
Memory: 200 MHz x 64 bits | 200 MHz x 128 bits
Definitions – Updating Cache
 Write-through
 Update cache block & all other levels below
 Use write buffers to speed up
 Write-back
 Update cache block
 Update lower level when replacing cached block
 Use write buffers to speed up
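The difference between the two policies can be sketched with a toy single-block cache; the names here (`WriteBackBlock`, `memory`) are illustrative only, with `memory` standing in for the next-lower level:

```python
memory = {}

class WriteBackBlock:
    """Toy write-back cache holding a single block with a dirty bit."""
    def __init__(self):
        self.addr, self.data, self.dirty = None, None, False

    def write(self, addr, data):
        if self.addr not in (None, addr) and self.dirty:
            memory[self.addr] = self.data   # lower level updated only on replacement
        self.addr, self.data, self.dirty = addr, data, True

def write_through(addr, data):
    memory[addr] = data                     # lower level updated on every write

wb = WriteBackBlock()
wb.write(0x10, 'a')
wb.write(0x10, 'b')        # repeated writes stay in the cache
print(0x10 in memory)      # False: memory not yet updated
wb.write(0x20, 'c')        # replacement forces the write-back
print(memory[0x10])        # 'b'
```

Write-back sends only the final value to the lower level, which is why repeated writes to the same block are cheaper under that policy.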
Definitions – Replacing Cached Blocks
 Cache replacement policies
 Random
 Least Recently Used (LRU)
 Need to track last access time
 Least Frequently Used (LFU)
 Need to track no of accesses
 First In First Out (FIFO)
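An LRU policy can be sketched with Python's `OrderedDict`; this toy fully associative, 3-block cache is hypothetical:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU replacement sketch for a small fully associative cache."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.blocks = OrderedDict()          # least recently used first

    def access(self, block_addr):
        if block_addr in self.blocks:        # hit: mark most recently used
            self.blocks.move_to_end(block_addr)
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[block_addr] = None       # fetch block into the cache
        return False

cache = LRUCache()
hits = [cache.access(b) for b in ['A', 'B', 'C', 'A', 'D', 'B']]
print(hits)   # [False, False, False, True, False, False]
```

The second access to B misses because D's arrival evicted B, the least recently used block at that point.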
Definitions – Cache Misses
 When a required item is not found in the cache
 Miss rate – fraction of cache accesses that result in a miss
 Types of misses
 Compulsory – 1st access to a block
 Capacity – limited cache capacity forces blocks to be removed from the cache & later retrieved
 Conflict – if placement strategy is not fully associative
 Average memory access time
= Hit time + Miss rate x Miss penalty
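Plugging hypothetical numbers into the formula above (1-cycle hit, 2% miss rate, 25-cycle miss penalty):

```python
# Average memory access time (AMAT) from the formula above.
hit_time, miss_rate, miss_penalty = 1, 0.02, 25

amat = hit_time + miss_rate * miss_penalty
print(amat)   # 1.5 cycles
```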
Definitions – Cache Misses (Cont.)
 Memory stall cycles
= Instruction count x Memory accesses per instruction x Miss rate x Miss penalty
 Memory accesses per instruction
= Instruction memory accesses per instruction + Data memory
accesses per instruction
 Example
 50% of instructions are loads & stores. Miss rate is 2% & penalty is 25 clock cycles. Base CPI is 1. How much faster would the processor be if all accesses were cache hits?
 Memory accesses per instruction = 1 (fetch) + 0.5 (data) = 1.5
 Memory stall cycles per instruction = 1.5 x 0.02 x 25 = 0.75
 Speedup = IC x (1 + 0.75) x CC / (IC x 1 x CC) = 1.75
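The worked example above can be reproduced numerically:

```python
# 50% of instructions are loads/stores, so 1 fetch + 0.5 data accesses
# per instruction; 2% miss rate, 25-cycle penalty, base CPI of 1.
cpi_base = 1.0
mem_accesses_per_inst = 1 + 0.5
miss_rate, miss_penalty = 0.02, 25

stalls_per_inst = mem_accesses_per_inst * miss_rate * miss_penalty   # 0.75
speedup = (cpi_base + stalls_per_inst) / cpi_base
print(round(speedup, 2))   # 1.75
```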
Cache Performance Metrics
 Hit time
 Miss rate
 Miss penalty
 Cache bandwidth
 Power consumption
6 Basic Cache Optimization Techniques
1. Larger block sizes
 Reduce compulsory misses
 Increase capacity & conflict misses
 Increase miss penalty
 Choosing a correct block size is challenging
2. Larger total cache capacity to reduce miss rate
 Reduce misses
 Increase hit time
 Increase power consumption & cost
3. Higher no of cache levels
 Reduce overall memory access time
6 Basic Cache Optimization Techniques
(Cont.)
4. Higher associativity
 Reduce conflict misses
 Increase hit time
 Increase power consumption
5. Giving priority to read misses over writes
 Allow reads to check write buffer
 Reduce miss penalty
6. Avoiding address translation in cache indexing
 Virtual to physical address mapping
 Reduce hit time
10 Advanced Cache Optimization
Techniques
 5 categories
1. Reducing hit time
2. Increasing cache bandwidth
3. Reducing miss penalty
4. Reducing miss rate
5. Reducing miss penalty or miss rate via parallelism
Advanced Optimizations 1
 Small & simple 1st-level caches
 Recently, L1 cache sizes have increased only slightly or not at all
 Critical timing path in a cache hit
 addressing tag memory, then
 comparing tags, then
 selecting the correct data item
 Direct-mapped caches can overlap tag compare & transmission of data
 Improves hit time
 Lower associativity reduces power because fewer cache lines are accessed
L1 Size & Associativity – Access Time
L1 Size & Associativity – Energy
Advanced Optimizations 2
 Way Prediction
 Keep extra bits in the cache to predict which way (block within the set) the next access will hit, so only one tag compare is tried first
 Improves hit time
 Mis-prediction increases hit time
 Prediction accuracy
 > 90% for 2-way
 > 80% for 4-way
 Instruction cache has better accuracy than data cache
 First used on MIPS R10000 in mid-90s
 Used on ARM Cortex-A8
Advanced Optimizations 3
 Pipeline cache access
 Enable L1 cache access to be multiple cycles
 Examples
 Pentium – 1 cycle
 Pentium Pro to Pentium III – 2 cycles
 Pentium 4 to Core i7 – 4 cycles
 Improve bandwidth
 Makes it easier to increase associativity
 Increase hit time
 Increases branch mis-prediction penalty
Advanced Optimizations 4
 Nonblocking Caches
 Allow hits before previous
misses complete
 “Hit under miss”
 “Hit under multiple miss”
 L2 must support this
 In general, processors can
hide L1 miss penalty but
not L2 miss penalty
 Increase bandwidth
Advanced Optimizations 5
 Multibanked Caches
 Organize cache as independent banks to support
simultaneous access
 Examples
 ARM Cortex-A8 supports 1-4 banks for L2
 Intel i7 supports 4 banks for L1 & 8 banks for L2
 Interleave banks according to block address
 Increase bandwidth
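Sequential interleaving can be sketched as a simple modulo on the block address; the 4-bank figure here is illustrative:

```python
# Consecutive block addresses map to consecutive banks, so a burst of
# sequential accesses can proceed in parallel across banks.
NUM_BANKS = 4

def bank_of(block_addr):
    return block_addr % NUM_BANKS

print([bank_of(b) for b in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]
```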
Advanced Optimizations 6
 Critical Word First, Early Restart
 Critical word first
 Request missed word from memory first
 Send it to processor as soon as it arrives
 Early restart
 Request words in normal order
 Send missed word to processor as soon as it arrives
 Reduce miss penalty
 Effectiveness depends on block size & likelihood of
another access to portion of the block that has not yet
been fetched
Advanced Optimizations 7 - 10
 Merging Write Buffer
 Reduce miss penalty
 Compiler Optimizations
 Examples
 Loop Interchange – Swap nested loops to access memory in sequential order
 Blocking – Instead of operating on entire rows or columns, subdivide matrices into blocks
 Reduce miss rate
 Hardware Prefetching
 Fetch 2 blocks on a miss (requested block & the next)
 Reduce miss penalty or miss rate
 Compiler Prefetching
 Reduce miss penalty or miss rate
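Loop interchange can be sketched as follows; the matrix size is arbitrary, and in Python the effect is illustrative of access order rather than measured cache behaviour:

```python
# Both loops compute the same sum, but the second visits x row by row,
# matching its row-major layout in memory (good spatial locality).
N = 64
x = [[1.0] * N for _ in range(N)]

def column_major_sum():
    s = 0.0
    for j in range(N):            # strides down columns: poor locality
        for i in range(N):
            s += x[i][j]
    return s

def row_major_sum():              # after interchange: sequential accesses
    s = 0.0
    for i in range(N):
        for j in range(N):
            s += x[i][j]
    return s

print(column_major_sum() == row_major_sum())   # same result, different access order
```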
Summary of Techniques
Memory Technologies
 Performance metrics
 Latency is the concern of caches
 Bandwidth is the concern of multiprocessors & I/O
 Access time
 Time between read request & when desired word arrives
 Cycle time
 Minimum time between unrelated requests to memory
 DRAM used for main memory
 SRAM used for cache
Memory Technology (Cont.)
 Amdahl's rule of thumb
 Memory capacity should grow linearly with processor speed
 Unfortunately, memory capacity & speed haven't kept pace with processors
 Some optimizations
 Multiple accesses to same row
 Synchronous DRAM (SDRAM)
 Added clock to DRAM interface
 Burst mode with critical word first
 Wider interfaces
 Double data rate (DDR)
 Multiple banks on each DRAM device
DRAM Optimizations
 Peak DDR transfer rate: MB/sec = Clock rate x 2 (both clock edges) x 8 bytes (64-bit bus)
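Applying the formula above to a hypothetical 200 MHz DDR bus clock:

```python
# Peak transfer rate: two transfers per clock (DDR) over an 8-byte bus.
clock_mhz = 200
mb_per_sec = clock_mhz * 2 * 8
print(mb_per_sec)   # 3200 MB/s
```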
DRAM Power Consumption
 Reducing power in DRAMs
 Lower voltage
 Low power mode (ignores clock, continues to refresh)
Flash Memory
 Type of EEPROM
 Must be erased (in blocks) before being overwritten
 Non-volatile
 Limited no of write cycles
 Cheaper than DRAM, more expensive than disk
 Slower than SRAM, faster than disk
Modern Memory Hierarchy
Source: http://blog.teachbook.com.au/index.php/2012/02/memory-hierarchy/
Intel Optane Non-volatile Memory
Source: www.forbes.com/sites/tomcoughlin/2018/06/11/intel-optane-finally-on-dimms/#5792e114190b
Intel Optane (Cont.)
Source: www.anandtech.com/show/9541/intel-announces-optane-storage-brand-for-3d-xpoint-products
Virtual Memory
 Each process has its own address space
 Protection via virtual memory
 Keeps processes in their own memory space
 Role of architecture
 Provide user mode & supervisor mode
 Protect certain aspects of CPU state
 Provide mechanisms for switching between user
mode & supervisor mode
 Provide mechanisms to limit memory accesses
 Provide TLB to translate addresses
Paging Hardware With TLB
 Parallel search on TLB
 Address translation (p, d)
 If p is in the TLB (associative memory), get frame # out
 Otherwise get frame # from page table in memory
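The lookup above can be sketched as follows; the page table contents and TLB entries are made up:

```python
# Try the TLB first; fall back to the in-memory page table on a TLB miss.
PAGE_SIZE = 4096
page_table = {0: 7, 1: 3, 2: 9}     # page # -> frame #
tlb = {0: 7}                        # small subset cached in the TLB

def translate(vaddr):
    p, d = divmod(vaddr, PAGE_SIZE)      # (p, d) = page number, offset
    frame = tlb.get(p)
    if frame is None:                    # TLB miss: walk the page table
        frame = page_table[p]
        tlb[p] = frame                   # cache the translation for next time
    return frame * PAGE_SIZE + d

print(hex(translate(0x1234)))   # page 1 -> frame 3
```

A real TLB searches all its entries in parallel in hardware; the dictionary lookup here only models the hit/miss behaviour.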
Summary
 Caching techniques are continuing to evolve
 Multiple techniques are often combined
 Cache sizes are unlikely to increase significantly
 Better performance when programs are optimized for the cache architecture

More Related Content

Similar to CPU Memory Hierarchy and Caching Techniques

Study of various factors affecting performance of multi core processors
Study of various factors affecting performance of multi core processorsStudy of various factors affecting performance of multi core processors
Study of various factors affecting performance of multi core processorsateeq ateeq
 
Computer architecture for HNDIT
Computer architecture for HNDITComputer architecture for HNDIT
Computer architecture for HNDITtjunicornfx
 
Kiến trúc máy tính - COE 301 - Memory.ppt
Kiến trúc máy tính - COE 301 - Memory.pptKiến trúc máy tính - COE 301 - Memory.ppt
Kiến trúc máy tính - COE 301 - Memory.pptTriTrang4
 
Ways to reduce misses
Ways to reduce missesWays to reduce misses
Ways to reduce missesnellins
 
Computer System Architecture Lecture Note 8.1 primary Memory
Computer System Architecture Lecture Note 8.1 primary MemoryComputer System Architecture Lecture Note 8.1 primary Memory
Computer System Architecture Lecture Note 8.1 primary MemoryBudditha Hettige
 
Memory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer OrganizationMemory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer Organization2022002857mbit
 
Multicore Computers
Multicore ComputersMulticore Computers
Multicore ComputersA B Shinde
 
Cache memory and cache
Cache memory and cacheCache memory and cache
Cache memory and cacheVISHAL DONGA
 
PPT_on_Cache_Partitioning_Techniques.pdf
PPT_on_Cache_Partitioning_Techniques.pdfPPT_on_Cache_Partitioning_Techniques.pdf
PPT_on_Cache_Partitioning_Techniques.pdfGnanavi2
 
lecture_11.pptx
lecture_11.pptxlecture_11.pptx
lecture_11.pptxyewid98102
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Q1 Memory Fabric Forum: Breaking Through the Memory Wall
Q1 Memory Fabric Forum: Breaking Through the Memory WallQ1 Memory Fabric Forum: Breaking Through the Memory Wall
Q1 Memory Fabric Forum: Breaking Through the Memory WallMemory Fabric Forum
 

Similar to CPU Memory Hierarchy and Caching Techniques (20)

Memory Mapping Cache
Memory Mapping CacheMemory Mapping Cache
Memory Mapping Cache
 
Study of various factors affecting performance of multi core processors
Study of various factors affecting performance of multi core processorsStudy of various factors affecting performance of multi core processors
Study of various factors affecting performance of multi core processors
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
 
Intelligent ram
Intelligent ramIntelligent ram
Intelligent ram
 
Computer architecture for HNDIT
Computer architecture for HNDITComputer architecture for HNDIT
Computer architecture for HNDIT
 
Chapter 5 a
Chapter 5 aChapter 5 a
Chapter 5 a
 
Kiến trúc máy tính - COE 301 - Memory.ppt
Kiến trúc máy tính - COE 301 - Memory.pptKiến trúc máy tính - COE 301 - Memory.ppt
Kiến trúc máy tính - COE 301 - Memory.ppt
 
Ways to reduce misses
Ways to reduce missesWays to reduce misses
Ways to reduce misses
 
Computer System Architecture Lecture Note 8.1 primary Memory
Computer System Architecture Lecture Note 8.1 primary MemoryComputer System Architecture Lecture Note 8.1 primary Memory
Computer System Architecture Lecture Note 8.1 primary Memory
 
computer-memory
computer-memorycomputer-memory
computer-memory
 
Memory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer OrganizationMemory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer Organization
 
Cache memory
Cache memoryCache memory
Cache memory
 
Coa presentation3
Coa presentation3Coa presentation3
Coa presentation3
 
Multicore Computers
Multicore ComputersMulticore Computers
Multicore Computers
 
Cache memory and cache
Cache memory and cacheCache memory and cache
Cache memory and cache
 
PPT_on_Cache_Partitioning_Techniques.pdf
PPT_on_Cache_Partitioning_Techniques.pdfPPT_on_Cache_Partitioning_Techniques.pdf
PPT_on_Cache_Partitioning_Techniques.pdf
 
lecture_11.pptx
lecture_11.pptxlecture_11.pptx
lecture_11.pptx
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Q1 Memory Fabric Forum: Breaking Through the Memory Wall
Q1 Memory Fabric Forum: Breaking Through the Memory WallQ1 Memory Fabric Forum: Breaking Through the Memory Wall
Q1 Memory Fabric Forum: Breaking Through the Memory Wall
 
Chapter 5 b
Chapter 5  bChapter 5  b
Chapter 5 b
 

More from Dilum Bandara

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningDilum Bandara
 
Time Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeTime Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeDilum Bandara
 
Introduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCAIntroduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCADilum Bandara
 
Introduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsIntroduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsDilum Bandara
 
Introduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresIntroduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresDilum Bandara
 
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixHard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixDilum Bandara
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopDilum Bandara
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsDilum Bandara
 
Introduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersIntroduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersDilum Bandara
 
Introduction to Thread Level Parallelism
Introduction to Thread Level ParallelismIntroduction to Thread Level Parallelism
Introduction to Thread Level ParallelismDilum Bandara
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsDilum Bandara
 
Instruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesInstruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesDilum Bandara
 
Instruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesInstruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesDilum Bandara
 
CPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionCPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionDilum Bandara
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
High Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPHigh Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPDilum Bandara
 
Introduction to Content Delivery Networks
Introduction to Content Delivery NetworksIntroduction to Content Delivery Networks
Introduction to Content Delivery NetworksDilum Bandara
 
Peer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingPeer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingDilum Bandara
 
Wired Broadband Communication
Wired Broadband CommunicationWired Broadband Communication
Wired Broadband CommunicationDilum Bandara
 

More from Dilum Bandara (20)

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Time Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeTime Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in Practice
 
Introduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCAIntroduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCA
 
Introduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsIntroduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive Analytics
 
Introduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresIntroduction to Concurrent Data Structures
Introduction to Concurrent Data Structures
 
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixHard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel Problems
 
Introduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersIntroduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale Computers
 
Introduction to Thread Level Parallelism
Introduction to Thread Level ParallelismIntroduction to Thread Level Parallelism
Introduction to Thread Level Parallelism
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
 
Instruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesInstruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware Techniques
 
Instruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesInstruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler Techniques
 
CPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionCPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An Introduction
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
High Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPHigh Performance Networking with Advanced TCP
High Performance Networking with Advanced TCP
 
Introduction to Content Delivery Networks
Introduction to Content Delivery NetworksIntroduction to Content Delivery Networks
Introduction to Content Delivery Networks
 
Peer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingPeer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and Streaming
 
Mobile Services
Mobile ServicesMobile Services
Mobile Services
 
Wired Broadband Communication
Wired Broadband CommunicationWired Broadband Communication
Wired Broadband Communication
 

Recently uploaded

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Recently uploaded (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

CPU Memory Hierarchy and Caching Techniques

  • 1. Memory Hierarchy CS4342 Advanced Computer Architecture Dilum Bandara Dilum.Bandara@uom.lk Slides adapted from “Computer Architecture, A Quantitative Approach” by John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
  • 3. Why Memory Hierarchy?  Applications want unlimited amounts of memory with low latency  Fast memory is more expensive per bit  Solution  Organize memory system into a hierarchy  Entire addressable memory space available in largest, slowest memory  Incrementally smaller & faster memories  Temporal & spatial locality ensures that nearly all references can be found in smaller memories  Gives illusion of a large, fast memory being presented to processor 3
  • 5. Why Hierarchical Design?  Becomes more crucial with multi-core processors  Aggregate peak bandwidth grows with no of cores  Intel Core i7 can generate 2 references per core per clock  4 cores and 3.2 GHz clock  25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references = 409.6 GB/s  DRAM bandwidth is only 6% of this (25 GB/s)  Requires  Multi-port, pipelined caches  2 levels of cache per core  Shared third-level cache on chip 5
  • 6. Core i7 Die & Major Components 6 Source: Intel
  • 7. Performance vs. Power  High-end microprocessors have >10 MB on-chip cache  Consumes large amount of chip area & power budget  Leakage current – when not operating  Active current – when operating  Major limiting factor for processors used in mobile devices 7
  • 8. Definitions – Blocks  Multiple blocks are moved between levels in the hierarchy  Spatial locality improves efficiency  Blocks are tagged with their memory address  Tags are searched in parallel 8 Source: http://archive.arstechnica.com/paedia/c/caching/m-caching-5.html
  • 9. Definitions – Associativity  Defines where blocks can be placed in a cache 9
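The placement rule can be written as a set-index computation. This is an illustration, not from the slides; the cache size and block address are hypothetical:

```python
def cache_set(block_addr, num_blocks, assoc):
    """Which set a block maps to: direct-mapped (assoc=1),
    n-way set associative, or fully associative (assoc=num_blocks)."""
    num_sets = num_blocks // assoc
    return block_addr % num_sets

# Block 12 in an 8-block cache:
print(cache_set(12, 8, 1))  # direct-mapped -> set 4 (only one candidate)
print(cache_set(12, 8, 2))  # 2-way -> set 0 (either of 2 ways)
print(cache_set(12, 8, 8))  # fully associative -> set 0 (any block)
```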
  • 10. Pentium 4 vs. Opteron Memory Hierarchy 10
    CPU: Pentium 4 (3.2 GHz) vs. Opteron (2.8 GHz)
    Instruction cache: Trace cache (8K micro-ops) vs. 2-way associative, 64 KB, 64B block
    Data cache: 8-way associative, 16 KB, 64B block, inclusive in L2 vs. 2-way associative, 64 KB, 64B block, exclusive to L2
    L2 cache: 8-way associative, 2 MB, 128B block vs. 16-way associative, 1 MB, 64B block
    Prefetch: 8 streams to L2 vs. 1 stream to L2
    Memory: 200 MHz x 64 bits vs. 200 MHz x 128 bits
  • 11. Definitions – Updating Cache  Write-through  Update cache block & all other levels below  Use write buffers to speed up  Write-back  Update cache block  Update lower level when replacing cached block  Use write buffers to speed up 11
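A minimal sketch of the two write policies, contrasting how many writes reach the level below (a hypothetical one-block cache, for illustration only):

```python
class TinyCache:
    """One-block 'cache' contrasting write-through and write-back.
    Counts how many writes reach the lower memory level."""
    def __init__(self, write_back):
        self.write_back = write_back
        self.block = None        # currently cached (address, value)
        self.dirty = False
        self.lower_writes = 0

    def write(self, addr, value):
        if self.block is not None and self.block[0] != addr and self.dirty:
            self.lower_writes += 1   # write-back: flush only on replacement
            self.dirty = False
        self.block = (addr, value)
        if self.write_back:
            self.dirty = True
        else:
            self.lower_writes += 1   # write-through: every write goes below

wt, wb = TinyCache(write_back=False), TinyCache(write_back=True)
for c in (wt, wb):
    for i in range(10):
        c.write(0x40, i)             # 10 writes to the same block
print(wt.lower_writes)  # 10 - write-through updates the level below each time
print(wb.lower_writes)  # 0  - write-back defers until the block is replaced
```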
  • 12. Definitions – Replacing Cached Blocks  Cache replacement policies  Random  Least Recently Used (LRU)  Need to track last access time  Least Frequently Used (LFU)  Need to track no of accesses  First In First Out (FIFO) 12
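Of these policies, LRU is the most widely used; a minimal sketch using Python's OrderedDict to track recency:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used block."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # insertion order doubles as recency order

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # hit: mark most recently used
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # miss: evict the LRU block
        self.blocks[tag] = None
        return False

c = LRUCache(2)
hits = [c.access(t) for t in ["A", "B", "A", "C", "B"]]
print(hits)  # [False, False, True, False, False] - C evicts B, then B evicts A
```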
  • 13. Definitions – Cache Misses  When a required item is not found in the cache  Miss rate – fraction of cache accesses that result in a miss  Types of misses  Compulsory – 1st access to a block  Capacity – limited cache capacity forces blocks to be removed from the cache & later retrieved  Conflict – occurs when the placement strategy is not fully associative  Average memory access time = Hit time + Miss rate x Miss penalty 13
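The AMAT formula on this slide as a calculation (the numbers plugged in are illustrative):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g. 1-cycle hit, 2% miss rate, 25-cycle miss penalty:
print(amat(1, 0.02, 25))  # 1.5 cycles on average
```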
  • 14. Definitions – Cache Misses (Cont.)  Memory stall cycles = Instruction count x Memory accesses per instruction x Miss rate x Miss penalty  Memory accesses per instruction = Instruction memory accesses per instruction + Data memory accesses per instruction  Example  50% of instructions are loads & stores, miss rate is 2%, miss penalty is 25 clock cycles, & base CPI is 1. How much faster would the processor be if all accesses were cache hits? Memory accesses per instruction = 1 + 0.5 = 1.5, so stall cycles per instruction = 1.5 x 0.02 x 25 = 0.75. CPU time = IC x (1 + 0.75) x CC = 1.75 x IC x CC vs. IC x 1 x CC with all hits, so the perfect-cache machine is 1.75 times faster 14
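The worked example on this slide, step by step:

```python
miss_rate, miss_penalty, base_cpi = 0.02, 25, 1
accesses_per_inst = 1 + 0.5      # 1 instruction fetch + 0.5 data accesses
stalls_per_inst = accesses_per_inst * miss_penalty * miss_rate
print(stalls_per_inst)           # 0.75 stall cycles per instruction
slowdown = (base_cpi + stalls_per_inst) / base_cpi
print(slowdown)                  # 1.75 - a perfect cache would be 1.75x faster
```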
  • 15. Cache Performance Metrics  Hit time  Miss rate  Miss penalty  Cache bandwidth  Power consumption 15
  • 16. 6 Basic Cache Optimization Techniques 1. Larger block sizes  Reduce compulsory misses  Increase capacity & conflict misses  Increase miss penalty  Choosing a correct block size is challenging 2. Larger total cache capacity to reduce miss rate  Reduce misses  Increase hit time  Increase power consumption & cost 3. Higher no of cache levels  Reduce overall memory access time 16
  • 17. 6 Basic Cache Optimization Techniques (Cont.) 4. Higher associativity  Reduce conflict misses  Increase hit time  Increase power consumption 5. Giving priority to read misses over writes  Allow reads to check write buffer  Reduce miss penalty 6. Avoiding address translation in cache indexing  Virtual to physical address mapping  Reduce hit time 17
  • 18. 10 Advanced Cache Optimization Techniques  5 categories 1. Reducing hit time 2. Increasing cache bandwidth 3. Reducing miss penalty 4. Reducing miss rate 5. Reducing miss penalty or miss rate via parallelism 18
  • 19. Advanced Optimizations 1  Small & simple 1st level caches  Recently size of L1 cache increased either slightly or not at all  Critical timing path in a cache hit  addressing tag memory, then  comparing tags, then  selecting correct set  Direct-mapped caches can overlap tag compare & transmission of data  Improve hit time  Lower associativity reduces power because fewer cache lines are accessed 19
  • 20. L1 Size & Associativity – Access Time 20
  • 21. L1 Size & Associativity – Energy 21
  • 22. Advanced Optimizations 2  Way Prediction  Keep extra bits in the cache to predict the way (block within the set) of the next access  Improve hit time  Mis-prediction increases hit time  Prediction accuracy  > 90% for 2-way  > 80% for 4-way  Instruction cache has better accuracy than data cache  First used on MIPS R10000 in mid-90s  Used on ARM Cortex-A8 22
  • 23. Advanced Optimizations 3  Pipeline cache access  Enable L1 cache access to be multiple cycles  Examples  Pentium – 1 cycle  Pentium Pro to Pentium III – 2 cycles  Pentium 4 to Core i7 – 4 cycles  Improve bandwidth  Makes it easier to increase associativity  Increase hit time  Increases branch mis-prediction penalty 23
  • 24. Advanced Optimizations 4  Nonblocking Caches  Allow hits before previous misses complete  “Hit under miss”  “Hit under multiple miss”  L2 must support this  In general, processors can hide L1 miss penalty but not L2 miss penalty  Increase bandwidth 24
  • 25. Advanced Optimizations 5  Multibanked Caches  Organize cache as independent banks to support simultaneous access  Examples  ARM Cortex-A8 supports 1-4 banks for L2  Intel i7 supports 4 banks for L1 & 8 banks for L2  Interleave banks according to block address  Increase bandwidth 25
  • 26. Advanced Optimizations 6  Critical Word First, Early Restart  Critical word first  Request missed word from memory first  Send it to processor as soon as it arrives  Early restart  Request words in normal order  Send missed word to processor as soon as it arrives  Reduce miss penalty  Effectiveness depends on block size & likelihood of another access to the portion of the block that has not yet been fetched 26
  • 27. Advanced Optimizations 7 - 10  Merging Write Buffer  Reduce miss penalty  Compiler Optimizations  Examples  Loop Interchange – Swap nested loops to access memory in sequential order  Blocking – Instead of accessing entire rows or columns, subdivide matrices into blocks  Reduce miss rate  Hardware Prefetching  Fetch 2 blocks on miss  Reduce miss penalty or miss rate  Compiler Prefetching  Reduce miss penalty or miss rate 27
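Loop interchange can be sketched as follows. With a row-major 2-D array, an inner loop that walks down a column makes strided accesses; swapping the loops makes the inner loop walk along a row, touching memory sequentially (in Python the cache effect is muted, so this only illustrates the access order; in C the difference is dramatic):

```python
N = 4
A = [[i * N + j for j in range(N)] for i in range(N)]  # row-major 2-D array

# Before interchange: inner loop walks down a column (stride-N pattern)
col_sum = sum(A[i][j] for j in range(N) for i in range(N))

# After interchange: inner loop walks along a row (sequential pattern)
row_sum = sum(A[i][j] for i in range(N) for j in range(N))

print(col_sum == row_sum)  # True - same result, better cache behavior
```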
  • 29. Memory Technologies  Performance metrics  Latency is concern of cache  Bandwidth is concern of multiprocessors & I/O  Access time  Time between read request & when desired word arrives  Cycle time  Minimum time between unrelated requests to memory  DRAM used for main memory  SRAM used for cache 29
  • 30. Memory Technology (Cont.)  Amdahl  Memory capacity should grow linearly with processor speed  Unfortunately, memory capacity & speed haven't kept pace with processors  Some optimizations  Multiple accesses to same row  Synchronous DRAM (SDRAM)  Added clock to DRAM interface  Burst mode with critical word first  Wider interfaces  Double data rate (DDR)  Multiple banks on each DRAM device 30
  • 31. DRAM Optimizations 31 MB/sec = Clock rate x 2 x 8 bytes
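The formula on this slide reflects DDR transferring on both clock edges over an 8-byte (64-bit) bus. As an illustrative example, a 200 MHz SDRAM clock (DDR-400):

```python
def ddr_bandwidth_mb_s(clock_mhz, bus_bytes=8):
    """MB/sec = clock rate x 2 (transfers on both edges) x bus width in bytes."""
    return clock_mhz * 2 * bus_bytes

print(ddr_bandwidth_mb_s(200))  # 3200 MB/s (DDR-400, sold as PC-3200)
```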
  • 32. DRAM Power Consumption  Reducing power in DRAMs  Lower voltage  Low power mode (ignores clock, continues to refresh) 32
  • 33. Flash Memory  Type of EEPROM  Must be erased (in blocks) before being overwritten  Non-volatile  Limited no of write cycles  Cheaper than DRAM, more expensive than disk  Slower than SRAM, faster than disk 33
  • 34. Modern Memory Hierarchy 34 Source: http://blog.teachbook.com.au/index.php/2012/02/memory-hierarchy/
  • 35. Intel Optane Non-volatile Memory 35 Source: www.forbes.com/sites/tomcoughlin/2018/06/11/intel-optane-finally-on-dimms/#5792e114190b
  • 36. Intel Optane (Cont.) 36 Source: www.anandtech.com/show/9541/intel-announces-optane-storage-brand-for-3d-xpoint-products
  • 37. Virtual Memory  Each process has its own address space  Protection via virtual memory  Keeps processes in their own memory space  Role of architecture  Provide user mode & supervisor mode  Protect certain aspects of CPU state  Provide mechanisms for switching between user mode & supervisor mode  Provide mechanisms to limit memory accesses  Provide TLB to translate addresses 37
  • 38. Paging Hardware With TLB  Parallel search on TLB  Address translation (p, d)  If p is in associative register, get frame # out  Otherwise get frame # from page table in memory
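The lookup above can be sketched with a dictionary standing in for the associative TLB and another for the in-memory page table (all page/frame numbers are hypothetical):

```python
page_table = {0: 5, 1: 9, 2: 3}   # page # -> frame #, held in memory (slow)
tlb = {0: 5}                      # small associative cache of translations

def translate(p, d, page_size=4096):
    """Translate virtual address (page p, offset d) to a physical address."""
    if p in tlb:                  # TLB hit: frame # comes straight out
        frame = tlb[p]
    else:                         # TLB miss: read page table, refill TLB
        frame = page_table[p]
        tlb[p] = frame
    return frame * page_size + d

print(translate(0, 100))  # 5*4096 + 100 = 20580 (TLB hit)
print(translate(2, 7))    # 3*4096 + 7  = 12295 (miss, then cached in TLB)
```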
  • 39. Summary  Caching techniques are continuing to evolve  Multiple techniques are combined in practice  Cache sizes are unlikely to increase significantly  Better performance when programs are optimized based on the cache architecture 39