Thread-Level Parallelism
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by John L.
Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
Outline
 Multi-processors
 Shared-memory architectures
 Memory synchronization
Increasing Importance of
Multiprocessing
 Inability to exploit more ILP
 Power & silicon costs grow faster than performance
 Growing interest in
 High-end servers for cloud computing & software as a
service
 Data-intensive applications
 Lower interest in increasing desktop performance
 Better understanding of how to use multiprocessors
effectively
 Replicating a core is relatively easier than creating a
completely new design
Vectors, MMX, GPUs vs.
Multiprocessors
 Vectors, MMX, GPUs
 SIMD
 Multiprocessors
 MIMD
Multiprocessor Architecture
 MIMD multiprocessor with n processors
 Need n threads or processes to keep it fully utilized
 Thread-Level parallelism
 Uses MIMD model
 Have multiple program counters
 Targeted for tightly-coupled shared-memory
multiprocessors
 Communication among threads through shared
memory
Definition – Grain Size
 Amount of computation assigned to each thread
 Threads can be used for data-level parallelism,
but overheads may outweigh benefits
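The trade-off between grain size and overhead can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the work divides evenly, every thread pays one fixed overhead (creation, synchronization), and the function name `speedup` and its parameters are hypothetical, not taken from the slides.

```python
def speedup(total_work, n_threads, overhead_per_thread):
    """Estimated speedup over a single thread (toy model, see assumptions above).

    Each thread runs a grain of total_work / n_threads units and
    additionally pays a fixed overhead, in the same time units.
    """
    grain = total_work / n_threads
    return total_work / (grain + overhead_per_thread)


# Coarse grain: the overhead is small next to each thread's share of work.
coarse = speedup(total_work=1_000_000, n_threads=4, overhead_per_thread=1_000)

# Fine grain: the overhead exceeds the work itself, so threading is a loss.
fine = speedup(total_work=100, n_threads=4, overhead_per_thread=200)
```

In this model a speedup below 1 means the threaded version is slower than the sequential one, which is the case where overheads outweigh benefits.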
Symmetric Multiprocessors (SMP)
 Small number of cores
 Share single memory
with uniform memory
latency
 A.k.a. Uniform Memory
Access (UMA)
Distributed Shared Memory (DSM)
 Memory distributed among processors
 Processors connected via direct (switched) & non-direct
(multi-hop) interconnection networks
 A.k.a. Non-Uniform Memory Access/latency (NUMA)
Cache Coherence
[Figure: Core A and Core B, each with its own cache, sharing a single RAM]
Cache Coherence (Cont.)
 Processors may see different values through
their caches
 Caching shared data introduces new problems
 In a coherent memory system, a read of a data
item returns the most recently written value
 2 aspects
 Coherence
 Consistency
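The problem can be reproduced with a small simulation. The sketch below assumes two private write-through caches with no coherence mechanism at all; the class `IncoherentCache` and the function `stale_read_demo` are hypothetical names used only for illustration.

```python
class IncoherentCache:
    """Private write-through cache that never hears about other caches' writes."""

    def __init__(self, memory):
        self.memory = memory   # shared dict standing in for RAM
        self.lines = {}        # this core's cached copies

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]              # hit: may be stale

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value            # write-through to memory


def stale_read_demo():
    memory = {"X": 1}
    core_a = IncoherentCache(memory)
    core_b = IncoherentCache(memory)
    core_a.read("X")
    before = core_b.read("X")   # B caches X = 1
    core_a.write("X", 0)        # memory is updated, B's copy is not
    after = core_b.read("X")    # B still sees the old value
    return before, memory["X"], after
```

Core B keeps returning its old copy even though memory already holds the new value, which is exactly the situation a coherence protocol has to prevent.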
1. Coherence
 Defines behavior of reads & writes to the same
memory location
 What value can be returned by a read
 All reads by any processor must return most recently
written value
 Writes to same location by any 2 processors are seen
in same order by all processors
 Serialized writes to same location
2. Consistency
 Defines behavior of reads & writes with respect
to access to other memory locations
 When a written value will be returned by a read
 If a processor writes location x followed by
location y, any processor that sees new value of
y must also see new value of x
 Serialized writes to different locations
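The x-then-y rule can be checked mechanically. The following sketch assumes one writer that writes x = 1 and then y = 1, and an observer whose view is updated in whatever order the writes are delivered; `observe` and `violates_write_order` are hypothetical helper names.

```python
def observe(deliveries):
    """Apply writes in delivery order and record the observer's view after each."""
    view = {"x": 0, "y": 0}
    snapshots = []
    for location, value in deliveries:
        view[location] = value
        snapshots.append(dict(view))
    return snapshots


def violates_write_order(snapshots):
    """True if the observer ever saw the new y together with the old x."""
    return any(s["y"] == 1 and s["x"] == 0 for s in snapshots)


in_order = observe([("x", 1), ("y", 1)])     # writes arrive in program order
reordered = observe([("y", 1), ("x", 1)])    # the write to y overtakes the write to x
```

With in-order delivery the rule holds at every step; when the write to y overtakes the write to x, there is a moment where the observer sees the new y but the old x.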
Enforcing Coherence
 Coherent caches provide
 Migration – movement of data
 Replication – multiple copies of data
 Cache coherence protocols
 Snooping
 Each core tracks sharing status of each of its blocks
 Distributed
 Directory based
 Sharing status of each block kept in a single location
 Centralized
Snooping Coherence Protocol
 Write invalidate protocol
 On write, invalidate all other cached copies
 Use bus itself to serialize
 Write can’t complete until bus access is obtained
 Concurrent writes?
 Whichever processor obtains the bus first wins
 Most common implementation
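A write-invalidate scheme can be sketched as a tiny simulation. This is a simplified model under stated assumptions: caches are write-through, the bus handles one write at a time (a plain method call stands in for bus arbitration), and the `Bus` and `Cache` classes are hypothetical.

```python
class Bus:
    """Shared bus: serializes writes and lets every other cache snoop them."""

    def __init__(self):
        self.caches = []
        self.memory = {}
        self.log = []          # order in which writes won the bus

    def attach(self, cache):
        self.caches.append(cache)

    def write(self, writer, addr, value):
        self.log.append((writer.name, addr, value))
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)   # snooped write: invalidate copy
        self.memory[addr] = value             # write-through (assumption)

    def read(self, addr):
        return self.memory.get(addr, 0)


class Cache:
    def __init__(self, name, bus):
        self.name, self.bus, self.lines = name, bus, {}
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:            # miss after invalidation
            self.lines[addr] = self.bus.read(addr)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.write(self, addr, value)     # must obtain the bus first
        self.lines[addr] = value


bus = Bus()
a, b = Cache("A", bus), Cache("B", bus)
a.read(16), b.read(16)     # both cores cache the block
a.write(16, 7)             # B's copy is invalidated
```

After the write by A, the next read by B misses and fetches the new value. When two cores write the same address, `bus.log` records the single order that every cache observes.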
Snooping Coherence Protocol (Cont.)
 Write update protocol
 A.k.a. Write broadcast protocol
 On write, update all copies
 Need more bandwidth
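The bandwidth difference between the two protocols can be estimated by counting bus messages for an access trace on a single block. This is a rough sketch with simplifying assumptions: write-back caches for invalidate, every write under update is broadcast even when no other core holds a copy, and a write miss counts as one message. The function `count_bus_messages` is hypothetical.

```python
def count_bus_messages(trace, protocol):
    """Count bus messages for a trace of (core, "R" or "W") accesses to one block."""
    holders = set()     # cores with a valid copy
    exclusive = None    # core holding the only writable copy (invalidate only)
    messages = 0
    for core, op in trace:
        if op == "R":
            if core not in holders:
                messages += 1          # read miss goes to the bus
                holders.add(core)
                exclusive = None       # block is shared again
        elif protocol == "update":
            messages += 1              # every write is broadcast to all copies
            holders.add(core)
        elif exclusive != core:        # protocol == "invalidate"
            messages += 1              # invalidate other copies once
            holders = {core}
            exclusive = core
    return messages


# One core writes repeatedly while the other never touches the block again.
repeated = [("A", "R"), ("B", "R")] + [("A", "W")] * 10
# Producer/consumer: every write by A is followed by a read by B.
ping_pong = [("B", "R")] + [("A", "W"), ("B", "R")] * 5
```

Under these assumptions the repeated-write trace costs 3 messages with invalidate and 12 with update, which is the extra bandwidth noted above. The ping-pong trace goes the other way (11 versus 6), so the comparison depends on the sharing pattern.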
Snooping Coherence Protocol –
Implementation Techniques
 Locating an item when a read miss occurs
 Write-through cache
 Recent copy in memory
 Write-back cache
 Every processor snoops every address placed on shared bus
 If a processor finds it has a dirty block, updated block is sent to
requesting processor
 Cache lines marked as shared or
exclusive/modified
 Only writes to shared lines need an invalidate
broadcast
 After this, line is marked as exclusive
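The line states above can be written as a small state machine. The sketch below assumes a basic three-state write-back protocol with states I (invalid), S (shared) and M (exclusive/modified); the table layout and the names `PROCESSOR`, `SNOOP`, `on_processor` and `on_snoop` are illustrative choices, not the exact diagram from the slides.

```python
# (current state, processor request) -> (next state, action placed on the bus)
PROCESSOR = {
    ("I", "read"):  ("S", "read miss"),
    ("I", "write"): ("M", "write miss"),
    ("S", "read"):  ("S", None),
    ("S", "write"): ("M", "invalidate"),   # only shared lines need a broadcast
    ("M", "read"):  ("M", None),
    ("M", "write"): ("M", None),           # already exclusive: no bus traffic
}

# (current state, snooped bus event) -> (next state, response by this cache)
SNOOP = {
    ("S", "read miss"):  ("S", None),
    ("S", "write miss"): ("I", None),
    ("S", "invalidate"): ("I", None),
    ("M", "read miss"):  ("S", "supply block"),   # dirty block goes to requester
    ("M", "write miss"): ("I", "supply block"),
}


def on_processor(state, request):
    return PROCESSOR[(state, request)]


def on_snoop(state, bus_event):
    # Events that do not concern this line leave its state unchanged.
    return SNOOP.get((state, bus_event), (state, None))
```

For example, `on_processor("S", "write")` yields `("M", "invalidate")`, while a second write in state M produces no bus action.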
Snoopy Coherence Protocol – State
Transition Example
Snoopy Coherence Protocol – Issues
 Operations are not atomic
 e.g., detect miss, acquire bus, receive a response
 Applies to both reads & writes
 Creates possibility of deadlock & races
 Actual Snoopy Coherence Protocols are more
complicated
Coherence Protocols – Extensions
 Shared memory bus & snooping bandwidth are the
bottleneck for scaling symmetric
multiprocessors
 Duplicating tags
 Place directory in
outermost cache
 Use crossbars or point-to-point networks with
banked memory
Coherence Protocols – Example
 AMD Opteron
 Memory directly connected
to each multicore chip in
NUMA-like organization
 Implement coherence
protocol using point-to-point links
 Use explicit
acknowledgements to
order operations
Source: www.qdpma.com/systemarchitecture/SystemArchitecture_Opteron.html
Cache Coherence – Performance
 Coherence influences cache miss rate
 Coherence misses
 True sharing misses
 Write to shared block (transmission of invalidation)
 Read an invalidated block
 False sharing misses
 Read an unmodified word in an invalidated block
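The distinction between true and false sharing can be sketched by tracking, for each core, which words other cores wrote while its copy was invalid. This is a simplified model of a single block under an invalidate protocol: every core is assumed to start with a valid copy, and a write to a block that others still hold is only labeled `"invalidation"` rather than classified further. The function `classify_accesses` is hypothetical.

```python
def classify_accesses(trace, cores):
    """Label each (core, "R" or "W", word) access to one shared block."""
    valid = {c: True for c in cores}       # assumption: all start with a copy
    stale = {c: set() for c in cores}      # words others wrote since c lost its copy
    labels = []
    for core, op, word in trace:
        others_hold_copy = any(valid[o] for o in cores if o != core)
        if not valid[core]:
            # Miss on an invalidated block: did someone change the word we want?
            labels.append("true sharing miss" if word in stale[core]
                          else "false sharing miss")
        elif op == "W" and others_hold_copy:
            labels.append("invalidation")  # write to a shared block
        else:
            labels.append("hit")
        valid[core], stale[core] = True, set()   # block is (re)fetched
        if op == "W":
            for other in cores:
                if other != core:
                    valid[other] = False
                    stale[other].add(word)
    return labels


trace = [
    ("P1", "W", "x1"),   # P2's copy is invalidated
    ("P2", "R", "x2"),   # x2 was not modified: false sharing
    ("P1", "W", "x1"),   # P2 is invalidated again
    ("P2", "R", "x1"),   # x1 was modified: true sharing
]
labels = classify_accesses(trace, ["P1", "P2"])
```

The second access misses only because x2 shares a block with x1, whereas the fourth access misses because the requested word itself was changed.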
Performance Study – Commercial
Workload
Directory-Based Cache Coherence
 Sharing status of each physical memory block
kept in a single location
 Approaches
 Central directory for memory or common cache
 For Symmetric Multiprocessors (SMPs)
 Distributed directory
 For Distributed Shared Memory (DSM) systems
 Overcomes single point of contention in SMPs
Source: www.icsa.inf.ed.ac.uk/cgi-bin/hase/dir-cache-m.pl?cd-t.html,cd-f.html,menu1.html
Directory Protocols
 For each block, maintain state
 Shared
 1 or more nodes have the block cached, value in memory is
up-to-date
 Set of node IDs
 Uncached
 Modified
 Exactly 1 node has a copy of the cache block, value in
memory is out-of-date
 Owner node ID
 Directory maintains block states & sends
invalidation messages
Directory Protocols (Cont.)
Directory Protocols (Cont.)
 For uncached block
 Read miss
 Requesting node is sent requested data & is made the only
sharing node, block is now shared
 Write miss
 Requesting node is sent requested data & becomes the sharing
node, block is now exclusive (modified)
 For shared block
 Read miss
 Requesting node is sent requested data from memory, node is
added to sharing set
 Write miss
 Requesting node is sent value, all nodes in sharing set are sent
invalidate messages, sharing set only contains requesting
node, block is now exclusive
Directory Protocols (Cont.)
 For exclusive block
 Read miss
 Owner is sent a data fetch message, block becomes shared,
owner sends data to directory, data written back to memory,
sharers set contains old owner & requestor
 Data write back
 Block becomes uncached, sharer set is empty
 Write miss
 Message is sent to old owner to invalidate & send value to
directory, requestor becomes new owner, block remains
exclusive
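The transitions for uncached, shared and exclusive blocks can be collected into one directory entry. The sketch below is a minimal model of a single block: the message names (`"fetch"`, `"fetch_invalidate"`, `"invalidate"`, `"data_reply"`) and the class `DirectoryEntry` are assumptions for illustration, and data values are not tracked.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    state: str = "uncached"                    # "uncached", "shared" or "exclusive"
    sharers: set = field(default_factory=set)  # node IDs (the owner if exclusive)

    def read_miss(self, node):
        messages = []
        if self.state == "exclusive":
            (owner,) = self.sharers
            messages.append(("fetch", owner))  # owner writes the block back
        messages.append(("data_reply", node))
        self.sharers.add(node)                 # old owner stays a sharer
        self.state = "shared"
        return messages

    def write_miss(self, node):
        messages = []
        if self.state == "shared":
            messages += [("invalidate", s) for s in sorted(self.sharers - {node})]
        elif self.state == "exclusive":
            (owner,) = self.sharers
            messages.append(("fetch_invalidate", owner))
        messages.append(("data_reply", node))
        self.sharers = {node}                  # requester becomes the owner
        self.state = "exclusive"
        return messages

    def write_back(self, node):
        if self.state != "exclusive" or self.sharers != {node}:
            raise ValueError("only the owner can write the block back")
        self.sharers = set()
        self.state = "uncached"
        return []


entry = DirectoryEntry()
entry.read_miss(1)                 # uncached -> shared, sharers {1}
entry.read_miss(2)                 # sharers {1, 2}
messages = entry.write_miss(3)     # invalidate 1 and 2, node 3 becomes owner
```

Each handler returns the messages the directory would send, so the state changes and the resulting traffic can be inspected step by step.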
Directory Protocols (Cont.)
Synchronization Operations
 Basic building blocks
 Atomic exchange
 Swaps register with memory location
 Test-and-set
 Tests a value & sets it if the value passes the test
 Fetch-and-increment
 Reads original value from memory & increments it in memory
 Requires memory read & write in uninterruptible
instruction
 load linked/store conditional
 If contents of memory location specified by load linked are
changed before store conditional to same address, store
conditional fails
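The load linked/store conditional pair can be modeled in software to show how the other primitives are built from it. This is a single-threaded sketch, not real hardware behavior: the `Memory` class keeps one link per core, any store to a linked address breaks the link, and all names are hypothetical.

```python
class Memory:
    def __init__(self):
        self.cells = {}
        self.links = {}    # core id -> address named by its last load linked

    def load_linked(self, core, addr):
        self.links[core] = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        self.cells[addr] = value
        # Any write to addr breaks every outstanding link on it.
        for core in [c for c, a in self.links.items() if a == addr]:
            del self.links[core]

    def store_conditional(self, core, addr, value):
        if self.links.get(core) != addr:
            return False               # location changed since load linked
        self.store(addr, value)
        return True


def atomic_exchange(mem, core, addr, new_value):
    while True:                        # retry until the pair succeeds
        old = mem.load_linked(core, addr)
        if mem.store_conditional(core, addr, new_value):
            return old


def fetch_and_increment(mem, core, addr):
    while True:
        old = mem.load_linked(core, addr)
        if mem.store_conditional(core, addr, old + 1):
            return old


mem = Memory()
first = fetch_and_increment(mem, core=0, addr=100)   # returns the original value
value = mem.load_linked(0, 100)
mem.store(100, 5)                                    # another core intervenes
failed = not mem.store_conditional(0, 100, value + 1)
```

The intervening store makes the store conditional fail, so a retry loop would reload the value rather than overwrite the other core's update.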
Implementing Locks Using Coherence
 Spin lock
 If no coherence
        DADDUI R2,R0,#1    ;R2 = 1 (locked value)
lockit: EXCH   R2,0(R1)    ;atomic exchange
        BNEZ   R2,lockit   ;already locked?
 If coherence
lockit: LD     R2,0(R1)    ;load of lock
        BNEZ   R2,lockit   ;not available, spin
        DADDUI R2,R0,#1    ;load locked value
        EXCH   R2,0(R1)    ;swap
        BNEZ   R2,lockit   ;branch if lock wasn't 0
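The second loop (test first, then exchange) can be mirrored in a high-level sketch. Python has no atomic exchange instruction, so the `LockWord` class below uses an internal `threading.Lock` purely to model the atomicity that EXCH gets from hardware; `LockWord`, `SpinLock` and `run_workers` are hypothetical names.

```python
import threading
import time


class LockWord:
    """One memory word: 0 = free, 1 = locked."""

    def __init__(self):
        self.value = 0
        self._atomic = threading.Lock()   # stands in for hardware atomicity

    def load(self):
        return self.value

    def exchange(self, new_value):
        with self._atomic:
            old, self.value = self.value, new_value
        return old


class SpinLock:
    def __init__(self):
        self.word = LockWord()

    def acquire(self):
        while True:
            while self.word.load() != 0:     # LD / BNEZ: spin on plain reads
                time.sleep(0)                # let other Python threads run
            if self.word.exchange(1) == 0:   # EXCH / BNEZ: 0 means we won
                return

    def release(self):
        self.word.exchange(0)


def run_workers(n_threads, n_iterations):
    """Increment a shared counter under the spin lock and return its final value."""
    lock, total = SpinLock(), [0]

    def worker():
        for _ in range(n_iterations):
            lock.acquire()
            total[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

A call such as `run_workers(4, 200)` should return 800, since every increment happens while the lock is held.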
Implementing Locks – Advantages
 Reduces memory traffic
 During each iteration, current lock value can be read from cache
 Locality of accessing lock
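The traffic reduction can be put into numbers with a rough count of bus transactions while the lock stays held. This is only a back-of-the-envelope model: it assumes every exchange is a write that goes to the bus, and that a plain load misses once and then hits in the cache until the lock is released. The function `spin_bus_transactions` is hypothetical.

```python
def spin_bus_transactions(waiters, spins_per_waiter, test_first):
    """Bus transactions generated by waiting cores while the lock is held."""
    if test_first:
        # The first load misses and caches the lock; later loads hit locally.
        return waiters * min(spins_per_waiter, 1)
    # Spinning on the exchange: every iteration is a write on the bus.
    return waiters * spins_per_waiter


exchange_only = spin_bus_transactions(waiters=3, spins_per_waiter=1000, test_first=False)
load_then_exchange = spin_bus_transactions(waiters=3, spins_per_waiter=1000, test_first=True)
```

Under these assumptions three waiters spinning 1000 times each generate 3000 bus transactions with the exchange-only loop and 3 with the load-first loop.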
Summary
 Multi-processors
 Create new problems, e.g., cache coherency
 Aspects of cache coherence
 Coherence
 Consistency
 Shared-memory architectures
 Snooping
 Directory based
 Memory synchronization

More Related Content

Similar to Introduction to Thread Level Parallelism

Similar to Introduction to Thread Level Parallelism (20)

Distributed Shared Memory Systems
Distributed Shared Memory SystemsDistributed Shared Memory Systems
Distributed Shared Memory Systems
 
Bab 4
Bab 4Bab 4
Bab 4
 
Topic 4- processes.pptx
Topic 4- processes.pptxTopic 4- processes.pptx
Topic 4- processes.pptx
 
Talon systems - Distributed multi master replication strategy
Talon systems - Distributed multi master replication strategyTalon systems - Distributed multi master replication strategy
Talon systems - Distributed multi master replication strategy
 
Chapter 9 OS
Chapter 9 OSChapter 9 OS
Chapter 9 OS
 
SoC-2012-pres-2
SoC-2012-pres-2SoC-2012-pres-2
SoC-2012-pres-2
 
Sinfonia
Sinfonia Sinfonia
Sinfonia
 
Chorus - Distributed Operating System [ case study ]
Chorus - Distributed Operating System [ case study ]Chorus - Distributed Operating System [ case study ]
Chorus - Distributed Operating System [ case study ]
 
22CS201 COA
22CS201 COA22CS201 COA
22CS201 COA
 
Google file system
Google file systemGoogle file system
Google file system
 
Google
GoogleGoogle
Google
 
Distributed shared memory ch 5
Distributed shared memory ch 5Distributed shared memory ch 5
Distributed shared memory ch 5
 
Chapter04 new
Chapter04 newChapter04 new
Chapter04 new
 
CH08.pdf
CH08.pdfCH08.pdf
CH08.pdf
 
Cs8493 unit 3
Cs8493 unit 3Cs8493 unit 3
Cs8493 unit 3
 
CS6401 OPERATING SYSTEMS Unit 3
CS6401 OPERATING SYSTEMS Unit 3CS6401 OPERATING SYSTEMS Unit 3
CS6401 OPERATING SYSTEMS Unit 3
 
Distributed operating system
Distributed operating systemDistributed operating system
Distributed operating system
 
Cs8493 unit 3
Cs8493 unit 3Cs8493 unit 3
Cs8493 unit 3
 
GFS - Google File System
GFS - Google File SystemGFS - Google File System
GFS - Google File System
 
6.distributed shared memory
6.distributed shared memory6.distributed shared memory
6.distributed shared memory
 

More from Dilum Bandara

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningDilum Bandara
 
Time Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeTime Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeDilum Bandara
 
Introduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCAIntroduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCADilum Bandara
 
Introduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsIntroduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsDilum Bandara
 
Introduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresIntroduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresDilum Bandara
 
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixHard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixDilum Bandara
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopDilum Bandara
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsDilum Bandara
 
Introduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersIntroduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersDilum Bandara
 
CPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesCPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesDilum Bandara
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsDilum Bandara
 
Instruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesInstruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesDilum Bandara
 
Instruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesInstruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesDilum Bandara
 
CPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionCPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionDilum Bandara
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
High Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPHigh Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPDilum Bandara
 
Introduction to Content Delivery Networks
Introduction to Content Delivery NetworksIntroduction to Content Delivery Networks
Introduction to Content Delivery NetworksDilum Bandara
 
Peer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingPeer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingDilum Bandara
 
Wired Broadband Communication
Wired Broadband CommunicationWired Broadband Communication
Wired Broadband CommunicationDilum Bandara
 

More from Dilum Bandara (20)

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Time Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeTime Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in Practice
 
Introduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCAIntroduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCA
 
Introduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsIntroduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive Analytics
 
Introduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresIntroduction to Concurrent Data Structures
Introduction to Concurrent Data Structures
 
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixHard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel Problems
 
Introduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersIntroduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale Computers
 
CPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesCPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching Techniques
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
 
Instruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesInstruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware Techniques
 
Instruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesInstruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler Techniques
 
CPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionCPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An Introduction
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
High Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPHigh Performance Networking with Advanced TCP
High Performance Networking with Advanced TCP
 
Introduction to Content Delivery Networks
Introduction to Content Delivery NetworksIntroduction to Content Delivery Networks
Introduction to Content Delivery Networks
 
Peer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingPeer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and Streaming
 
Mobile Services
Mobile ServicesMobile Services
Mobile Services
 
Wired Broadband Communication
Wired Broadband CommunicationWired Broadband Communication
Wired Broadband Communication
 

Recently uploaded

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Introduction to Thread Level Parallelism

  • 1. Thread-Level Parallelism CS4342 Advanced Computer Architecture Dilum Bandara Dilum.Bandara@uom.lk Slides adapted from “Computer Architecture, A Quantitative Approach” by John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
  • 2. Outline  Multi-processors  Shared-memory architectures  Memory synchronization 2
  • 3. Increasing Importance of Multiprocessing  Inability to exploit more ILP  Power & silicon costs grow faster than performance  Growing interest in  High-end servers as cloud computing & software as a service  Data-intensive applications  Lower interest in increasing desktop performance  Better understanding of how to use multiprocessors effectively  Replication of a core is relatively easy than a completely new design 3
  • 4. Vectors, MMX, GPUs vs. Multiprocessors  Vectors, MMX, GPUs  SIMD  Multiprocessors  MIMD 4
  • 5. Multiprocessor Architecture  MIMD multiprocessor with n processors  Need n threads or processors to keep it fully utilized  Thread-Level parallelism  Uses MIMD model  Have multiple program counters  Targeted for tightly-coupled shared-memory multiprocessors  Communication among threads through shared memory 5
  • 6. Definition – Grain Size  Amount of computation assigned to each thread  Threads can be used for data-level parallelism, but overheads may outweigh benefits 6
  • 7. Symmetric Multiprocessors (SMP)  Small no of cores  Share single memory with uniform memory latency  A.k.a. Uniform Memory Access (UMA) 7
  • 8. Distributed Shared Memory (DSM)  Memory distributed among processors  Processors connected via direct (switched) & non-direct (multi-hop) interconnection networks  A.k.a. Non-Uniform Memory Access/latency (NUMA) 8
  • 9. Cache Coherence 9 Core A Core B Cache Cache RAM
  • 10. Cache Coherence (Cont.)  Processors may see different values through their caches  Caching shared data introduces new problems  In a coherent memory system read of a data item returns the most recently written value  2 aspects  Coherence  Consistency 10
  • 11. 1. Coherence  Defines behavior of reads & writes to the same memory location  What value can be returned by a read  All reads by any processor must return most recently written value  Writes to same location by any 2 processors are seen in same order by all processors  Serialized writes to same location 11
  • 12. 2. Consistency  Defines behavior of reads & writes with respect to access to other memory locations  When a written value will be returned by a read  If a processor writes location x followed by location y, any processor that sees new value of y must also see new value of x  Serialized writes different locations 12
  • 13. Enforcing Coherence  Coherent caches provide  Migration – movement of data  Replication – multiple copies of data  Cache coherence protocols  Snooping  Each core tracks sharing status of each of its blocks  Distributed  Directory based  Sharing status of each block kept in a single location  Centralized 13
  • 14. Snooping Coherence Protocol  Write invalidate protocol  On write, invalidate all other cached copies  Use bus itself to serialize  Write can’t complete until bus access is obtained  Concurrent writes?  One who obtains bus wins  Most common implementation 14
  • 15. Snooping Coherence Protocol (Cont.)  Write update protocol  A.k.a. Write broadcast protocol  On write, update all copies  Need more bandwidth 15
  • 16. Snooping Coherence Protocol – Implementation Techniques  Locating an item when a read miss occurs  Write-through cache  Recent copy in memory  Write-back cache  Every processor snoops every address placed on shared bus  If a processor finds it has a dirty block, updated block is sent to requesting processor  Cache lines marked as shared or exclusive/modified  Only writes to shared lines need an invalidate broadcast  After this, line is marked as exclusive 16
  • 17. Snoopy Coherence Protocol – State Transition Example 17
  • 18. Snoopy Coherence Protocol – State Transition Example 18
  • 19. Snoopy Coherence Protocol – Issues  Operations are not atomic  e.g., detect miss, acquire bus, receive a response  Applies to both reads & writes  Creates possibility of deadlock & races  Actual Snoopy Coherence Protocols are more complicated 19
  • 20. Coherence Protocols – Extensions  Shared memory bus & snooping bandwidth is bottleneck for scaling symmetric multiprocessors  Duplicating tags  Place directory in outermost cache  Use crossbars or point-to- point networks with banked memory 20
  • 21. Coherence Protocols – Example
      • AMD Opteron
          • Memory directly connected to each multicore chip in a NUMA-like organization
          • Implements coherence protocol using point-to-point links
          • Uses explicit acknowledgements to order operations
      • Source: www.qdpma.com/systemarchitecture/SystemArchitecture_Opteron.html
  • 22. Cache Coherence – Performance
      • Coherence influences cache miss rate
      • Coherence misses
          • True sharing misses
              • Write to shared block (transmission of invalidation)
              • Read a modified word in an invalidated block
          • False sharing misses
              • Read an unmodified word in an invalidated block
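The distinction between true and false sharing misses can be made concrete with a small classifier over an access trace to a single block. This is a sketch under assumptions: write invalidate is assumed, upgrade writes by a core that already holds the block are not counted as misses, and the function name and trace format are hypothetical.

```python
def coherence_misses(trace, n_cores=2):
    """Classify misses on ONE block. trace: list of (core, 'R' or 'W', word index)."""
    valid = [False] * n_cores                  # does the core hold a valid copy?
    ever_cached = [False] * n_cores
    # Words written by other cores since this core's copy was invalidated.
    stale_words = [set() for _ in range(n_cores)]
    misses = []
    for i, (core, op, word) in enumerate(trace):
        if not valid[core]:
            if not ever_cached[core]:
                kind = "cold"
            elif word in stale_words[core]:
                kind = "true sharing"          # the accessed word really changed
            else:
                kind = "false sharing"         # only a neighbouring word changed
            misses.append((i, kind))
            valid[core] = True
            ever_cached[core] = True
            stale_words[core] = set()
        if op == "W":
            for other in range(n_cores):
                if other != core:
                    valid[other] = False       # write invalidate
                    stale_words[other].add(word)
    return misses

trace = [(0, "R", 0), (1, "R", 1),   # cold misses
         (0, "W", 0), (1, "R", 1),   # core 1 re-reads an untouched word
         (0, "W", 0), (1, "R", 0)]   # core 1 reads the word core 0 wrote
print(coherence_misses(trace))
```

Access 3 is classified as false sharing because core 1 only ever needed word 1, while access 5 is true sharing because core 1 reads the word that core 0 modified.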
  • 23. Performance Study – Commercial Workload
  • 24. Directory-Based Cache Coherence
      • Sharing status of each physical memory block kept in a single location
      • Approaches
          • Central directory for memory or common cache
              • For Symmetric Multiprocessors (SMPs)
          • Distributed directory
              • For Distributed Shared Memory (DSM) systems
              • Overcomes single point of contention in SMPs
      • Source: www.icsa.inf.ed.ac.uk/cgi-bin/hase/dir-cache-m.pl?cd-t.html,cd-f.html,menu1.html
  • 25. Directory Protocols
      • For each block, maintain state
          • Shared
              • 1 or more nodes have the block cached, value in memory is up-to-date
              • Directory keeps set of node IDs
          • Uncached
              • No node has a copy of the block
          • Modified (exclusive)
              • Exactly 1 node has a copy of the cache block, value in memory is out-of-date
              • Directory keeps owner node ID
      • Directory maintains block states & sends invalidation messages
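One common way to store the set of node IDs is a bit vector with one bit per node. The slides do not say which representation is used, so the sketch below is an assumption, and the helper names are hypothetical.

```python
def add_sharer(bits, node):
    """Set the bit for `node` in the sharer bit vector."""
    return bits | (1 << node)

def remove_sharer(bits, node):
    """Clear the bit for `node` in the sharer bit vector."""
    return bits & ~(1 << node)

def sharers(bits, n_nodes):
    """List the node IDs whose bit is set."""
    return [node for node in range(n_nodes) if (bits >> node) & 1]

bits = 0                        # uncached: no node holds the block
bits = add_sharer(bits, 1)
bits = add_sharer(bits, 3)
print(sharers(bits, 4))         # [1, 3]
bits = remove_sharer(bits, 1)
print(sharers(bits, 4))         # [3]
```

With this encoding, a modified block is simply a vector with exactly one bit set, so the owner node ID needs no separate field.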
  • 27. Directory Protocols (Cont.)
      • For uncached block
          • Read miss
              • Requesting node is sent requested data & is made the only sharing node, block is now shared
          • Write miss
              • Requesting node is sent requested data & becomes the only sharing node (owner), block is now exclusive (modified)
      • For shared block
          • Read miss
              • Requesting node is sent requested data from memory, node is added to sharing set
          • Write miss
              • Requesting node is sent value, all nodes in sharing set are sent invalidate messages, sharing set then contains only requesting node, block is now exclusive
  • 28. Directory Protocols (Cont.)
      • For exclusive block
          • Read miss
              • Owner is sent a data fetch message, owner sends data to directory, data is written back to memory, block becomes shared, sharer set contains old owner & requestor
          • Data write back
              • Block becomes uncached, sharer set is empty
          • Write miss
              • Message is sent to old owner to invalidate its copy & send value to directory, requestor becomes new owner, block remains exclusive
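The directory transitions for uncached, shared, and exclusive blocks can be sketched as a small state machine for one block. This is an illustrative model under assumptions: messages are only recorded, not delivered, and the class name, method names, and message labels are hypothetical.

```python
class DirectoryEntry:
    """Directory state for ONE block: 'uncached', 'shared', or 'exclusive'."""

    def __init__(self):
        self.state = "uncached"
        self.sharers = set()       # node IDs; exactly one (the owner) when exclusive
        self.messages = []         # messages the directory would send

    def read_miss(self, node):
        if self.state == "exclusive":
            (owner,) = self.sharers
            # Owner returns the data, which is also written back to memory.
            self.messages.append(("fetch", owner))
        self.messages.append(("data_reply", node))
        self.sharers.add(node)     # old owner (if any) stays in the sharer set
        self.state = "shared"

    def write_miss(self, node):
        if self.state == "shared":
            for sharer in sorted(self.sharers - {node}):
                self.messages.append(("invalidate", sharer))
        elif self.state == "exclusive":
            (owner,) = self.sharers
            self.messages.append(("fetch_invalidate", owner))
        self.messages.append(("data_reply", node))
        self.sharers = {node}      # requestor becomes the only sharer (owner)
        self.state = "exclusive"

    def write_back(self, node):
        assert self.state == "exclusive" and self.sharers == {node}
        self.sharers = set()
        self.state = "uncached"

entry = DirectoryEntry()
entry.read_miss(0)      # uncached -> shared, sharers {0}
entry.read_miss(1)      # shared, sharers {0, 1}
entry.write_miss(2)     # invalidates 0 and 1 -> exclusive, owner 2
entry.read_miss(0)      # fetch from owner 2 -> shared, sharers {0, 2}
print(entry.state, sorted(entry.sharers))   # shared [0, 2]
```

Keeping the owner in the same set as the sharers mirrors the slides, where the sharer set of an exclusive block contains only the owner.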
  • 30. Synchronization Operations
      • Basic building blocks
          • Atomic exchange
              • Swaps register with memory location
          • Test-and-set
              • Tests a value & sets it if the test passes
          • Fetch-and-increment
              • Reads original value from memory & increments it in memory
          • These require a memory read & write in a single uninterruptible instruction
          • Load linked/store conditional
              • If contents of memory location specified by load linked are changed before store conditional to same address, store conditional fails
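The load linked/store conditional pair can be used to build the other primitives. The sketch below simulates that in software to show a fetch-and-increment retry loop. It assumes any store to the linked address breaks the link (even one that writes the same value), and the `Memory` class, its methods, and the `interfere` hook are hypothetical.

```python
class Memory:
    """Toy memory with load linked / store conditional, one link per core."""

    def __init__(self):
        self.cells = {}
        self.links = {}                        # core -> address it has linked

    def load_linked(self, core, addr):
        self.links[core] = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        self.cells[addr] = value
        # Any store to addr breaks every outstanding link on it.
        self.links = {c: a for c, a in self.links.items() if a != addr}

    def store_conditional(self, core, addr, value):
        if self.links.get(core) != addr:
            return False                       # link was broken, store fails
        self.store(addr, value)
        return True

def fetch_and_increment(mem, core, addr, interfere=None):
    """Retry until the store conditional succeeds; return the original value."""
    while True:
        old = mem.load_linked(core, addr)
        if interfere is not None:
            interfere()                        # simulate another core's write, once
            interfere = None
        if mem.store_conditional(core, addr, old + 1):
            return old

mem = Memory()
mem.store(0x10, 5)
print(fetch_and_increment(mem, 0, 0x10))       # 5, memory now holds 6
# Another core overwrites the location between the LL and the SC:
print(fetch_and_increment(mem, 0, 0x10, interfere=lambda: mem.store(0x10, 40)))  # 40
print(mem.cells[0x10])                         # 41
```

In the second call the first store conditional fails because of the intervening store, so the loop re-reads 40 and the increment is applied to the up-to-date value.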
  • 31. Implementing Locks Using Coherence
      • Spin lock
          • If no coherence
                    DADDUI R2,R0,#1
            lockit: EXCH   R2,0(R1)   ;atomic exchange
                    BNEZ   R2,lockit  ;already locked?
          • If coherence
            lockit: LD     R2,0(R1)   ;load of lock
                    BNEZ   R2,lockit  ;not available-spin
                    DADDUI R2,R0,#1   ;load locked value
                    EXCH   R2,0(R1)   ;swap
                    BNEZ   R2,lockit  ;branch if lock wasn't 0
  • 32. Implementing Locks – Advantages
      • Reduces memory traffic
          • During each iteration, current lock value can be read from cache
      • Locality of accessing lock
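The traffic reduction can be estimated with a rough model of several cores spinning on a lock that stays held. This is a sketch under assumptions: write-invalidate coherence, spinners taking turns in round-robin order, one bus transaction per miss or invalidate, and hypothetical function and strategy names.

```python
def spin_traffic(strategy, n_spinners, rounds):
    """Bus transactions while n_spinners spin for `rounds` iterations on a held lock."""
    state = ["I"] * n_spinners                 # per-core state of the lock's block
    bus = 0
    for _ in range(rounds):
        for core in range(n_spinners):
            if strategy == "exchange":
                # EXCH is a write: the core needs the block exclusively.
                if state[core] != "M":
                    bus += 1
                    state = ["I"] * n_spinners
                    state[core] = "M"
            else:  # "test_then_exchange"
                # Plain load while the lock is held: a shared copy is enough.
                if state[core] == "I":
                    bus += 1
                    state = ["S" if s == "M" else s for s in state]
                    state[core] = "S"
    return bus

print(spin_traffic("exchange", 3, 10))            # 30
print(spin_traffic("test_then_exchange", 3, 10))  # 3
```

With three spinners, every EXCH steals the block from the previous spinner, so each iteration goes to the bus. Spinning on a load costs one miss per core, after which each core reads its cached copy until the lock is released.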
  • 34. Summary
      • Multi-processors
          • Create new problems, e.g., cache coherence
      • Aspects of cache coherence
          • Coherence
          • Consistency
      • Shared-memory architectures
          • Snooping
          • Directory based
      • Memory synchronization

Editor's Notes

  1. If no coherence: the lock is in memory, so keep reading it. If coherence: the lock is in cache, so keep reading the local cached copy until it changes. If we do EXCH on the 1st line, it requires getting exclusive access to the cache block.