1
Modern Computer
Architectures
Module-4:
Thread and Process-
Level Parallelism
Ms. Nibedita Adhikari
Dept. of CSE, PIET, Rourkela
2
3
Introduction
● Initial computer performance improvements
came from use of:
– Innovative manufacturing techniques.
● In later years,
– Most improvements came from exploitation of ILP.
– Both software and hardware techniques are being
used.
– Pipelining, dynamic instruction scheduling, out of
order execution, VLIW, vector processing, etc.
● ILP now appears fully exploited:
– Further performance improvements from ILP appear limited.
4
Thread and Process-
Level Parallelism
● The way to achieve higher performance:
– Of late, the focus has shifted to exploiting thread- and process-level parallelism.
● Exploit parallelism existing across
multiple processes or threads:
– Cannot be exploited by any ILP processor.
● Consider a banking application:
– Individual transactions can be executed in
parallel.
5
Processes versus Threads
(Oh my!)
● Processes:
– A process is a program in execution.
– An application normally consists of
multiple processes.
● Threads:
– A process consists of one or more threads.
– Threads belonging to the same process share data and code space.
6
Single and Multithreaded
Processes
7
How can Threads be
Created?
● By using any of the popular
thread libraries:
– POSIX Pthreads
– Win32 threads
– Java threads, etc.
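For concreteness, a minimal POSIX Pthreads sketch of thread creation (the worker body and thread count are illustrative assumptions, not part of the lecture):

#include <pthread.h>
#include <stdio.h>

/* Work done by each thread; threads share the process's code and data. */
static void *worker(void *arg)
{
    long id = (long)arg;
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);  /* create the threads */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);                        /* wait for both to finish */
    return 0;
}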
8
User Threads
● Thread management done in user
space.
● User threads are supported and
managed without kernel support.
– Invisible to the kernel.
– If one thread blocks, entire
process blocks.
– Limited benefits of threading.
9
Kernel Threads
● Kernel threads supported and
managed directly by the OS.
– Kernel creates Light Weight Processes
(LWPs).
● Most modern OS support kernel
threads:
– Windows XP/2000
– Solaris
– Linux
– Mac OS, etc.
10
Benefits of Threading
● Responsiveness:
– Threads share code and data.
– Thread creation and switching are therefore much more efficient than for processes.
● As an example in Solaris:
– Creating threads 30x less costly
than processes.
– Context switching about 5x faster
than processes.
11
Benefits of Threading
cont…
● Truly concurrent execution:
–Possible with processors
supporting concurrent
execution of threads: SMP,
multi-core, SMT
(hyperthreading), etc.
12
High Performance Computer
Architectures
Lecture 28: Thread-
Level Parallelism
13
A Case for Processor Support
for Thread-level Parallelism
• Using pure ILP, execution unit
utilization is only about 20%-25%:
– Utilization is limited by control dependencies, cache misses during memory accesses, etc.
– It is rare for units to be even
reasonably busy on the average.
● In pure ILP:
– At any time only one thread is under
execution.
14
A Case for Processor Support
for Thread-level Parallelism
● Utilization of execution units can be
improved:
– Have several threads under execution:
● called active threads in PIII.
– Execute several threads at the same
time:
● SMP, SMT, and Multi-core processors.
15
Threads in Applications
● Threads are natural to a wide
ranging set of applications:
– Often more or less independent.
– Though share data among
themselves to some extent.
– Also, synchronize sometimes
among themselves.
16
A Few Thread Examples
● Independent threads occur
naturally in several applications:
– Web server: different http
requests are the threads.
– File server
– Name server
– Banking: independent transactions
– Desktop applications: file loading,
display, computations, etc. can be
threads.
17
Reflection on Threading
● To think of it:
– Threading is inherent to any
server application.
● Threads are also easily
identifiable in traditional
applications:
– Banking, Scientific computations,
etc.
18
Thread-level Parallelism
--- Cons
● Threads have to be identified by
the programmer:
– No rules exist as to what can be a
meaningful thread.
– Threads cannot, in general, be identified by any automatic static or dynamic analysis of code.
– Burden on programmer: requires
careful thinking and programming.
19
Thread-level Parallelism
--- Cons cont…
● Threads with severe
dependencies:
– May make multithreading an
exercise in futility.
● Also not as “programmer
friendly” as ILP.
20
Thread Vs. Process-
Level Parallelism
● Threads are lightweight (or fine-grained):
– Threads share address space, data, files etc.
– Even when extent of data sharing and
synchronization is low: Exploitation of
thread-level parallelism meaningful only when
communication latency is low.
– Consequently, shared memory architectures
(UMA) are a popular way to exploit thread-
level parallelism.
21
Thread Vs. Process-
Level Parallelism cont…
● Processes are coarse-grained:
– The communication-to-computation ratio is lower.
– DSM (Distributed Shared
Memory), Clusters, Grids, etc.
are meaningful.
22
Focus of Next Few
Lectures
● Shared memory multiprocessors
– Cache coherency
– Synchronization: spin locks
● The recent phenomenon of threading
support in uniprocessors.
● Distributed memory multiprocessors
– DSM
– Clusters (Discussed in Module 6)
– Grids (Discussed in Module 6)
23
A Broad Classification of
Computers
● Shared-memory multiprocessors
– Also called UMA
● Distributed memory computers
– Also called NUMA:
● Distributed Shared-memory (DSM)
architectures
● Clusters
● Grids, etc.
24
UMA vs. NUMA
Computers
[Figure: (a) UMA model – processors P1 … Pn, each with a private cache, share the main memory modules over a common bus; latency is on the order of 100s of ns. (b) NUMA model – each processor/cache pair has its own main memory and the nodes communicate over a network; latency ranges from several milliseconds to seconds.]
25
Distributed Memory
Computers
● Distributed memory computers use:
– Message Passing Model
● Explicit message send and receive
instructions have to be written by the
programmer.
– Send: specifies local buffer + receiving
process (id) on remote computer (address).
– Receive: specifies sending process on
remote computer + local buffer to place
data.
26
Advantages of Message-
Passing Communication
● Hardware for communication and synchronization is much simpler:
– Compared to communication in a shared memory
model.
● Explicit communication:
– Programs simpler to understand, helps to reduce
maintenance and development costs.
● Synchronization is implicit:
– Naturally associated with sending/receiving
messages.
– Easier to debug.
27
Disadvantages of Message-
Passing Communication
● Programmer has to write explicit
message passing constructs.
– Also, precisely identify the
processes (or threads) with which
communication is to occur.
● Explicit calls to operating
system:
– Higher overhead.
28
MPI: A Message Passing
Standard
● A de facto standard developed by a group of industry and academic professionals:
– Aim is to foster portability and
widespread use.
● Defines routines, and not
implementations:
– Several free implementations exist.
– Synchronous and asynchronous modes.
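As a sketch of the explicit send/receive style described earlier, using standard MPI point-to-point calls (the ranks, tag and buffer are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Send: local buffer + id of the receiving process. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive: id of the sending process + local buffer to place the data. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}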
29
DSM
● Physically separate memories are
accessed as one logical address space.
● Processors running on a multi-
computer system share their memory.
– Implemented by operating system.
● DSM multiprocessors are NUMA:
– Access time depends on the exact
location of the data.
30
Distributed Shared-Memory
Architecture (DSM)
● Underlying mechanism is message
passing:
– Shared memory convenience provided to
the programmer by the operating system.
– Basically, an operating system facility
takes care of message passing implicitly.
● Advantage of DSM:
– Ease of programming
31
Disadvantage of DSM
● High communication cost:
– A program not specifically optimized
for DSM by the programmer shall
perform extremely poorly.
– Data (variables) accessed by specific
program segments have to be
collocated.
– Useful only for process-level (coarse-
grained) parallelism.
32
SVM: Shared Virtual
Memory
● Supporting DSM on top of an
inherently message passing
system is inefficient.
● A possible solution is SVM.
● Virtual memory mechanism is used
to share objects at the page level.
33
Communication Overhead:
Example 1
● An application is running on a 32 node
multiprocessor
● It incurs a latency of 400ns to handle a
reference (read/write) to memory.
● Processor clock rate is 1 GHz; IPC
(instructions per cycle) = 2.
● How much faster will be a computation,
if there is no communication versus if
0.2% of the instructions involve
reference to a memory location?
34
Communication Overhead:
Solution 1
● CPI = 1/IPC = 0.5
● Memory request cost = 400 ns × 1 GHz = 400 cycles
● Effective CPI with 0.2% memory references = Base CPI + memory request rate * memory request cost = 0.5 + 0.002 * 400 = 0.5 + 0.8 = 1.3
● A program having no such memory references would therefore be 1.3/0.5 = 2.6 times faster.
35
Communication Overhead:
Example 2
● An application running on a 32 node DSM.
● It incurs a latency of 400 µs to handle a reference (read/write) to a remote memory.
● Processor clock rate is 1 GHz; IPC
(instructions per cycle) = 2.
● How much faster will be a computation, on a
multiprocessor system compared to the DSM
if 0.2% of the instructions involve reference
to a remote memory? Assume no local
memory references.
36
Communication Overhead:
Solution 2
● CPI = 0.5
● Remote request cost = 400 µs × 1 GHz = 400,000 cycles
● Effective CPI with 0.2% remote references = Base CPI + remote request rate * remote request cost = 0.5 + 0.002 * 400,000 = 0.5 + 800 = 800.5
● The multiprocessor of Example 1 would be 800.5/1.3 ≈ 616 times faster.
● Performance figures of NUMA may be worse:
– If we take data dependency and synchronization
aspects into consideration.
37
Modern Computer
Architectures
Lecture 29: Symmetric
Multiprocessors
(SMPs)
38
Symmetric Multiprocessors
(SMPs)
● SMPs are a popular shared memory
multiprocessor architecture:
– Processors share Memory and I/O
– Bus based: access time for all memory locations is
equal --- “Symmetric MP”
[Figure: four processors P, each with a private cache, connected by a shared bus to the main memory and the I/O system.]
39
SMPs: Some Insights
● In any multiprocessor, main memory
access is a bottleneck:
– Multilevel caches reduce the memory demand
of a processor.
– Multilevel caches in fact make it possible for
more than one processor to meaningfully
share the memory bus.
– Hence multilevel caches are a must in a
multiprocessor!
40
Different SMP
Organizations
● Processor and cache on separate
extension boards (1980s):
– Plugged on to the backplane.
● Integrated on the main board (1990s):
– 4 or 6 processors placed per board.
● Integrated on the same chip (multi-core)
(2000s):
– Dual core (IBM, Intel, AMD)
– Quad core
41
Pros of SMPs
● Ease of programming:
–Especially when communication
patterns are complex or vary
dynamically during execution.
42
Cons of SMPs
● As the number of processors increases,
contention for the bus increases.
– Scalability of the SMP model restricted.
– One way out may be to use switches
(crossbar, multistage networks, etc.)
instead of a bus.
– Switches set up parallel point-to-point
connections.
– Again switches are not without any
disadvantages: make implementation of
cache coherence difficult.
43
SMPs
● Even programs not using multithreading
(conventional programs):
– Experience a performance increase on SMPs
– Reason: Kernel routines handling interrupts
etc. run on a separate processor.
● Multicore processors are now commonplace:
– Pentium 4 Extreme Edition, Xeon, Athlon64,
DEC Alpha, UltraSparc…
44
Why Multicores?
● Can you recollect the constraints on
further increase in circuit complexity:
– Clock skew and temperature.
● Use of more complex techniques to
improve single-thread performance is
limited.
● Any additional transistors have to be
used in a different core.
45
Why Multicores?
Cont…
● Multiple cores on the same
physical packaging:
– Execute different threads.
– Switched off, if no thread to
execute (power saving).
– Dual core, quad core, etc.
46
Cache Organizations for
Multicores
● L1 caches are always private to a core
● L2 caches can be private or shared
– which is better?
[Figure: left – four cores P1–P4, each with a private L1 and a private L2; right – four cores with private L1s sharing a single L2.]
47
L2 Organizations
● Advantages of a shared L2 cache:
– Efficient dynamic use of space by each core
– Data shared by multiple cores is not
replicated.
– Every block has a fixed “home” – hence, easy
to find the latest copy.
● Advantages of a private L2 cache:
– Quick access to private L2
– Private bus to private L2, less contention.
48
An Important Problem with
Shared-Memory: Coherence
● When shared data are cached:
– These are replicated in multiple
caches.
– The data in the caches of different
processors may become inconsistent.
● How to enforce cache coherency?
– How does a processor know changes in
the caches of other processors?
49
The Cache Coherency
Problem
[Figure: U = 5 is cached by P1, P2 and P3; P3 then writes U = 7, leaving stale copies (U:?) in the other caches. What value will P1 and P2 read?]
50
Cache Coherence Solutions
(Protocols)
● The key to maintain cache coherence:
– Track the state of sharing of every
data block.
● Based on this idea, following can be
an overall solution:
– Dynamically recognize any potential
inconsistency at run-time and carry out
preventive action.
51
Basic Idea Behind Cache
Coherency Protocols
[Figure: bus-based SMP – processors with private caches connected by a shared bus to main memory and the I/O system; the coherence protocol acts at the caches and the bus.]
52
Pros and Cons of the
Solution
● Pro:
–Consistency maintenance becomes
transparent to programmers,
compilers, as well as to the
operating system.
● Con:
–Increased hardware complexity .
53
Two Important Cache
Coherency Protocols
● Snooping protocol:
– Each cache “snoops” the bus to find out
which data is being used by whom.
● Directory-based protocol:
– Keep track of the sharing state of each
data block using a directory.
– A directory is a centralized record of the sharing state of all memory blocks.
– Allows coherency protocol to avoid
broadcasts.
54
Snoopy and Directory-
Based Protocols
[Figure: the same bus-based SMP organization (processors, private caches, shared bus, main memory, I/O).]
55
Snooping vs. Directory-
based Protocols
● Snooping protocol reduces memory
traffic.
– More efficient.
● Snooping protocol requires broadcasts:
– Can meaningfully be implemented only when
there is a shared bus.
– Even when there is a shared bus, scalability
is a problem.
– Some workarounds have been tried: the Sun Enterprise server has up to 4 buses.
56
Snooping Protocol
● As soon as a request for any data block
by a processor is put out on the bus:
– Other processors “snoop” to check if they
have a copy and respond accordingly.
● Works well with bus interconnection:
– All transmissions on a bus are essentially
broadcast:
● Snooping is therefore effortless.
– Dominates almost all small scale machines.
57
Categories of Snoopy
Protocols
● Essentially two types:
– Write Invalidate Protocol
– Write Broadcast Protocol
● Write invalidate protocol:
– When one processor writes to its cache, all
other processors having a copy of that
data block invalidate that block.
● Write broadcast:
– When one processor writes to its cache, all
other processors having a copy of that
data block update that block with the most recently written value.
58
Write Invalidate Vs.
Write Update Protocols
[Figure: the same bus-based SMP organization (processors, private caches, shared bus, main memory, I/O).]
59
Write Invalidate Protocol
● Handling a write to shared data:
– An invalidate command is sent on bus ---
all caches snoop and invalidate any copies
they have.
● Handling a read Miss:
– Write-through: memory is always up-to-
date.
– Write-back: snooping finds most recent
copy.
60
Write Broadcast (Update) in Write-Through Caches
● Simple implementation.
● Writes:
– Write to shared data: broadcast on bus,
processors snoop, and update any copies.
– Read miss: memory is always up-to-date.
● Concurrent writes:
– Write serialization automatically achieved
since bus serializes requests.
– Bus provides the basic arbitration support.
61
Write Invalidate versus
Broadcast cont…
● Invalidate exploits spatial locality:
–Only one bus transaction for any
number of writes to the same block.
–Obviously, more efficient.
● Broadcast has lower latency for
writes and reads:
–As compared to invalidate.
62
An Example Snoopy
Protocol
● Assume:
– Invalidation protocol, write-back cache.
● Each block of memory is in one of the
following states:
– Shared: Clean in all caches and up-to-date
in memory, block can be read.
– Exclusive: cache has the only copy, it is
writeable, and dirty.
– Invalid: Data present in the block obsolete,
cannot be used.
63
Modern Computer
Architectures
Lecture 30: Cache
Coherence Protocols
64
Implementation of the
Snooping Protocol
● A cache controller at every processor
would implement the protocol:
– Has to perform specific actions:
● When the local processor requests certain
things.
● Also, certain actions are required when certain addresses appear on the bus.
– The exact actions of the cache controller depend on the state of the cache block.
– Two FSMs can show the different types of
actions to be performed by a controller.
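As an illustration of the first of these FSMs (CPU requests only), a minimal C sketch of the per-block transitions; the enum and function names are assumptions, and bus-side actions such as responding to remote misses are omitted:

/* Three block states of the write-back invalidation protocol described earlier. */
typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum { CPU_READ, CPU_WRITE } CpuRequest;

/* Returns the block's next state; comments show the bus transaction generated. */
BlockState cpu_request(BlockState state, CpuRequest req)
{
    switch (state) {
    case INVALID:
        if (req == CPU_READ)  { /* place read miss on bus  */ return SHARED; }
        else                  { /* place write miss on bus */ return EXCLUSIVE; }
    case SHARED:
        if (req == CPU_READ)  { /* read hit, no bus action */ return SHARED; }
        else                  { /* place write miss on bus; other copies invalidated */ return EXCLUSIVE; }
    case EXCLUSIVE:
        return EXCLUSIVE;     /* read or write hit: block already writable and dirty */
    }
    return state;             /* not reached */
}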
65
Snoopy-Cache State
Machine-I
● State machine considering only CPU requests for each cache block.
[Figure: FSM over the states Invalid, Shared (read only) and Exclusive (read/write). CPU reads and writes from Invalid place a read miss or a write miss on the bus, moving to Shared or Exclusive respectively; a CPU write to a Shared block places a write miss on the bus and moves to Exclusive; misses from the Exclusive state first write the dirty block back and then place the read/write miss on the bus; CPU read and write hits stay in the current state.]
66
Snoopy-Cache State
Machine-II
● State machine considering only bus requests for each cache block.
[Figure: FSM over Invalid, Shared (read only) and Exclusive (read/write). A write miss on the bus for this block invalidates it; if the block was Exclusive, it is first written back (aborting the memory access). A read miss on the bus for an Exclusive block also forces a write back (aborting the memory access) and downgrades the block to Shared.]
67
Combined Snoopy-Cache State Machine
● State machine considering both CPU requests and bus requests for each cache block.
[Figure: the two FSMs above superimposed on the same three states – Invalid, Shared (read only) and Exclusive (read/write).]
68
Example
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes A1 and A2 map to same cache block,
initial cache state is invalid
69
Example
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes A1 and A2 map to same cache block
70
Example
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1
P1: Read A1 Excl. A1 10
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes A1 and A2 map to same cache block
71
Example
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1
P1: Read A1 Excl. A1 10
P2: Read A1 Shar. A1 RdMs P2 A1
Shar. A1 10 WrBk P1 A1 10 A1 10
Shar. A1 10 RdDa P2 A1 10 A1 10
P2: Write 20 to A1
P2: Write 40 to A2
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes A1 and A2 map to same cache block
72
Example
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1
P1: Read A1 Excl. A1 10
P2: Read A1 Shar. A1 RdMs P2 A1
Shar. A1 10 WrBk P1 A1 10 A1 10
Shar. A1 10 RdDa P2 A1 10 A1 10
P2: Write 20 to A1 Inv. Excl. A1 20 WrMs P2 A1 A1 10
P2: Write 40 to A2
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes A1 and A2 map to same cache block
73
Example
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1
P1: Read A1 Excl. A1 10
P2: Read A1 Shar. A1 RdMs P2 A1
Shar. A1 10 WrBk P1 A1 10 A1 10
Shar. A1 10 RdDa P2 A1 10 A1 10
P2: Write 20 to A1 Inv. Excl. A1 20 WrMs P2 A1 A1 10
P2: Write 40 to A2 WrMs P2 A2 A1 10
Excl. A2 40 WrBk P2 A1 20 A1 20
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes A1 and A2 map to same cache block,
but A1 != A2
74
Cache Misses in SMPs
● Overall cache performance is a
combination of:
– Uniprocessor cache misses
– Misses due to invalidations caused by
coherency protocols (coherency misses).
● Changes to some parameters can affect
the two types of misses in different ways:
– Processor count
– Cache size
– Block size
75
Coherence Misses
● The 4th C: Misses occurring due
to coherency protocols.
● Example:
– First write by a processor to a
shared cache block.
– Causes invalidation to establish
ownership of the block.
76
Coherence Misses
● Coherence misses:
– True sharing
– False sharing
● False sharing misses occur
because an entire cache block has
a single valid bit.
– False sharing misses can be avoided
if the unit of sharing is a word.
77
Coherence Miss: Examples
Time | P1 | P2 | Miss classification
1 | Write X1 | | True sharing
2 | | Read X2 | False sharing
3 | Write X1 | | False sharing
4 | | Write X2 | False sharing
5 | Read X2 | | True sharing
X1 and X2 belong to the same cache block.
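A minimal C sketch of how false sharing arises, and how it can be avoided by giving each datum its own cache block (the 64-byte line size, names and iteration count are assumptions):

#include <pthread.h>
#include <stdalign.h>

/* Without alignas(64), c0 and c1 would typically sit in the same cache block,
 * so each core's writes would keep invalidating the other's copy (false sharing). */
struct counter { alignas(64) long value; };

static struct counter c0, c1;

static void *bump0(void *arg) { (void)arg; for (long i = 0; i < 100000000; i++) c0.value++; return NULL; }
static void *bump1(void *arg) { (void)arg; for (long i = 0; i < 100000000; i++) c1.value++; return NULL; }

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump0, NULL);
    pthread_create(&t1, NULL, bump1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}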
78
Increase in Number of
Processors
● Coherence misses (both True
and False) increase.
● Capacity misses decrease.
● Overall increase in miss rate:
– Resulting in increase in AMAT.
79
Increase in Block Size
● True sharing misses decrease.
– Increase in block size from 32B to
256B reduces true sharing misses
by half.
– Cause: Spatial locality in access.
● Compulsory misses decrease.
● False sharing misses increase.
● Conflict misses increase.
80
Some Issues in Implementing
Snooping Caches
● Additional circuitry needed in a cache
controller.
● Controller continuously snoops on address
bus:
– If address matches tag, either invalidate or
update.
● Since every bus transaction checks cache
tags, could interfere with CPU activities:
– Solution 1: Duplicate set of tags for L1
caches to allow checks in parallel with CPU.
– Solution 2: Duplicate tags on L2 cache.
81
A Commercial
Implementation
● Intel Pentium Xeon (PIII and PIV)
are cache coherent multiprocessors:
– Implements snooping protocol.
– Larger on chip caches to reduce bus
contentions.
– The chipset contains an external
memory controller that connects the
shared processor memory bus with the
memory chips.
82
Modern Computer
Architectures
Lecture 31: Cache
Coherence Protocols
(Cont…)
83
NUMA Computers:
Directory-Based Solution
[Figure: multiple nodes, each consisting of a processor + cache, a local memory with its directory, and I/O, connected by an interconnection network.]
84
Shared Virtual Memory in
DSMs
● In SVM processes appear as if they
are sharing their entire virtual
address space:
– Great convenience to the programmers.
– In effect, the operating system takes
care of moving around the pages
transparently.
– Pages are the unit of sharing.
– Pages are the units of coherence.
85
Shared Virtual Memory in
DSMs
● OS can easily allow pages to be
replicated in read-only fashion:
– Virtual memory can protect pages from
being written.
● When a process writes to a page:
– Traps to OS
– Pages in read-only state at other nodes are
invalidated.
● False sharing can be high:
– Leads to lower performance.
86
Directory-based Solution
● In NUMA computers:
– Messages have long latency.
– Also, broadcast is inefficient --- all
messages have explicit responses.
● Main memory controller to keep track of:
– Which processors have cached copies of which memory locations.
● On a write,
– Only need to inform users, not everyone
● On a dirty read,
– Forward to owner
87
Directory Protocol
● Three states as in Snoopy Protocol
– Shared: 1 or more processors have data,
memory is up-to-date.
– Uncached: No processor has the block.
– Exclusive: 1 processor (owner) has the block.
● In addition to cache state,
– Must track which processors have data when
in the shared state.
– Usually implemented using bit vector, 1 if
processor has copy.
88
Directory Behavior
● On a read:
– Unused:
● give (exclusive) copy to requester
● record owner
– Exclusive or shared:
● send share message to current exclusive
owner
● record owner
● return value
– Exclusive dirty:
● forward read request to exclusive owner.
89
Directory Behavior
● On Write
– Send invalidate messages to all
hosts caching values.
● On Write-Thru/Write-back
– Update value.
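A minimal C sketch of a directory entry with a sharers bit vector handling a read miss along these lines (the structure, names and print-only message stubs are assumptions, not an actual implementation):

#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, SHARED_STATE, EXCLUSIVE_STATE } DirState;

typedef struct {
    DirState state;
    uint64_t sharers;   /* bit i set => processor i holds a copy */
} DirEntry;

/* Stand-ins for interconnect messages (print-only here). */
static void send_data_reply(int proc, long block) { printf("data reply -> P%d, block %ld\n", proc, block); }
static void send_fetch(int owner, long block)     { printf("fetch      -> P%d, block %ld\n", owner, block); }

static void handle_read_miss(DirEntry *d, int proc, long block)
{
    if (d->state == EXCLUSIVE_STATE) {            /* dirty copy held by a single owner */
        int owner = 0;
        while (!((d->sharers >> owner) & 1)) owner++;
        send_fetch(owner, block);                 /* owner supplies/writes back the block */
    }
    d->sharers |= 1ull << proc;                   /* record the requester as a sharer */
    d->state = SHARED_STATE;
    send_data_reply(proc, block);
}

int main(void)
{
    DirEntry e = { UNCACHED, 0 };
    handle_read_miss(&e, 2, 7);                   /* P2 read-misses on block 7 */
    handle_read_miss(&e, 5, 7);                   /* P5 read-misses; both now share it */
    return 0;
}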
90
CPU-Cache State Machine
● State machine for CPU requests for each memory block.
● Invalid state if in memory.
[Figure: FSM over Invalid, Shared (read only) and Exclusive (read/write). A CPU read sends a read-miss message to the home directory and moves the block to Shared; a CPU write sends a write-miss message to the home directory and moves it to Exclusive; a fetch or invalidate from the directory, or a miss due to an address conflict, causes a data write-back message to the home directory; CPU read and write hits stay in the current state.]
91
State Transition Diagram
for the Directory
● Tracks all copies of memory block.
● Same states as the transition diagram
for an individual cache.
● Memory controller actions:
– Update of directory state
– Send messages to satisfy requests.
– Also indicates an action that updates the
sharing set, Sharers, as well as sending a
message.
92
Directory State Machine
● State machine for directory requests for each memory block.
● Uncached state if in memory.
[Figure: FSM over Uncached, Shared (read only) and Exclusive (read/write). A read miss adds the requester to Sharers and sends a data value reply (first sending a fetch to the current owner if the block is Exclusive). A write miss sends invalidates to the Sharers (or a fetch/invalidate to the owner), sets Sharers = {P}, and sends a data value reply. A data write-back empties Sharers and returns the block to Uncached.]
93
Example
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Directory (Addr, State, {Procs}) | Memory (Value)
P1: Write 10 to A1
P2: Read A1
P2: Write 40 to A2
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
A1 and A2 map to the same cache block
94
Example
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Directory (Addr, State, {Procs}) | Memory (Value)
P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1}
Excl. A1 10 DaRp P1 A1 0
P1: Read A1
P2: Read A1
P2: Write 40 to A2
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
A1 and A2 map to the same cache block
95
Example
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Directory (Addr, State, {Procs}) | Memory (Value)
P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1}
Excl. A1 10 DaRp P1 A1 0
P1: Read A1 Excl. A1 10
P2: Read A1
P2: Write 40 to A2
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
A1 and A2 map to the same cache block
96
Example
A1 and A2 map to the same cache block
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Directory (Addr, State, {Procs}) | Memory (Value)
P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1}
Excl. A1 10 DaRp P1 A1 0
P1: Read A1 Excl. A1 10
P2: Read A1 Shar.A1 RdMs P2 A1
Shar.A1 10 Ftch P1 A1 10 10
Shar.A1 10 DaRp P2 A1 10 A1Shar.
{P1,P2} 10
10
10
P2: Write 40 to A2 10
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
97
Example
P2: Write 20 to A1
A1 and A2 map to the same cache block
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Directory (Addr, State, {Procs}) | Memory (Value)
P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1}
Excl. A1 10 DaRp P1 A1 0
P1: Read A1 Excl. A1 10
P2: Read A1 Shar.A1 RdMs P2 A1
Shar. A1 10 Ftch P1 A1 10 10
Shar.A1 10 DaRp P2 A1 10 A1Shar.
{P1,P2} 10
Excl. A1 20 WrMs P2 A1 10
Inv. Inval. P1 A1 A1 Excl. {P2} 10
P2: Write 40 to A2 10
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
98
Example
P2: Write 20 to A1
A1 and A2 map to the same cache block
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Directory (Addr, State, {Procs}) | Memory (Value)
P1: Write 10 to A1 WrMsP1 A1 A1 Ex {P1}
Excl. A1 10 DaRp P1 A1 0
P1: Read A1 Excl. A1 10
P2: Read A1 Shar.A1 RdMsP2 A1
Shar.A1 10 Ftch P1 A1 10 10
Shar.A1 10 DaRp P2 A1 10 A1Shar.
{P1,P2} 10
Excl.A1 20 WrMsP2 A1 10
Inv. Inval. P1 A1 A1 Excl. {P2} 10
P2: Write 40 to A2 WrMsP2 A2 A2 Excl. {P2} 0
WrBk P2 A1 20 A1Unca. {} 20
Excl.A2 40 DaRp P2 A2 0 A2 Excl.
{P2} 0
P1: Read A1
P2: Read A1
P1 Write 10 to A1
P2: Write 20 to A1
P2: Write 40 to A2
99
Implementation Issues in
Directory-Based Protocols
● When the number of processors is large:
– Directory can become a bottleneck.
– Directories can be distributed
among different memory modules.
– Different directory accesses go to
different locations.
100
Modern Computer
Architectures
Lecture 32:
MultiProcessor
Programming Support
101
Multiprocessor
Programming Support
● A key programming support provided by a
processor:
– Synchronization.
● Why Synchronize?
– To let different processes use shared data
when it is safe.
● In uniprocessors, synchronization
support is provided through:
– Atomic “fetch and update” instructions.
102
Objectives of
Synchronization Algorithms
● Reduce latency:
– How quickly can an application get the
lock in the absence of competition?
● Reduce waiting time
● Reduce contention:
– How to design a scheme to reduce the
contention?
103
Synchronization
● Other atomic operations to
read-modify-write a memory
location:
● test-and-set
● fetch-and-store(R, <mem>)
● fetch-and-add(<mem>, <value>)
● compare-and-swap(<mem>,
<cmp-val>, <stor-val>)
104
Popular Atomic
Synchronization Primitives
● Atomic exchange: interchange a value in a
register for a value in memory:
0 => synchronization variable is free
1 => synchronization variable is locked and
unavailable
– Set register to 1 & swap
● Can be used to implement other
synchronization primitives.
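For illustration, a minimal C11 sketch of a lock built from an atomic exchange, following the 0 = free / 1 = locked convention above (<stdatomic.h> is standard C11; the names are illustrative):

#include <stdatomic.h>

static atomic_int lock_var = 0;        /* 0 => free, 1 => locked */

void acquire(void)
{
    /* Atomically swap in 1; the old value tells us whether the lock was free. */
    while (atomic_exchange(&lock_var, 1) == 1)
        ;                              /* someone else holds it: spin */
}

void release(void)
{
    atomic_store(&lock_var, 0);        /* mark the lock free again */
}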
105
Synchronization
Primitives cont...
● Test-and-set: Tests a value and
sets it:
– Only if the value passes the test.
● Fetch-and-increment: It returns
the value of a memory location
and atomically increments it.
– 0 => synchronization variable is free
106
Synchronization Issues
● In multiprocessors:
–Traditional atomic “fetch and update”
operations are inefficient.
–One of the culprits is a coherent cache.
● As a result, for SMPs:
–Synchronization can become a
bottleneck.
–Techniques to reduce contention and
latency of synchronization required.
107
Synchronization
Algorithms
● Locks
– Exclusive locks
– Shared locks
● Barriers
108
Synchronization
Primitive for SMPs
● Atomic exchange in SMP:
– Very inefficient to have both read and
write in the same instruction.
– Use separate instructions instead.
● Load linked (LL) + store conditional (SC)
– Load linked returns the initial value.
– Store conditional returns:
● 1 if it succeeds (no other store to the same memory location since the load linked)
● 0 otherwise
109
Atomic Exchange using LL
and SC
● The value returned in the register by the exchange indicates success in getting the lock:
– 0 if the processor succeeded in setting the lock (it was the first processor to set the lock).
– 1 if another processor had already claimed access.
– The central idea is to make the exchange operation indivisible.
110
LL and SC
● LL and SC should execute “effectively
atomically”:
– As if the load and store together are
completed atomically.
– No other store to the same location, no
context switch, interrupts, etc.
● Implemented through a link register:
– Stores address of instruction doing LL
– Invalidated if any other instruction does SC
– Invalidated under process switch.
111
Permitted Instructions
Between LL and SC
● Care must be taken about which
instructions can be placed between
an LL SC pair:
– Only a few register-register instructions can safely be used.
– It is otherwise possible to have a
starvation situation where an SC
instruction is never successful.
112
Other Synchronization
Primitives Using LL and SC
● Atomic swap:
try: or R3,R4,R0 ; mov exchange value
ll R2,0(R1) ; load linked
sc R3,0(R1) ; store conditional
beqz R3,try ; branch store fails (R3 = 0)
mov R4,R2 ; put load value in R4
● Fetch & increment:
try: ll R2,0(R1); load linked
daddui R2,R2,#1; increment
sc R2,0(R1); store conditional
beqz R2,try; branch store fails (R2 = 0)
113
Spin Locks
● Some times a process or thread needs a
certain data item for a very short time:
– E.g. updating a counter value.
● If a traditional lock is used in this case:
– The contending processes would suffer
context switch.
– This would be more expensive (1000s of cycles) than if the contending processes had simply busy-waited (10s of cycles).
114
Spin Lock Illustration
● A spin lock leads to busy wait for P2:
– Prevents context switch.
[Figure: P1 holds the spin lock and executes its critical section while P2 busy-waits (spins) on the lock.]
115
Spin Lock Implementation
daddui R2,R0, #1
lockit: exch R2,0(R1); atomic exchange
bnez R2,lockit; already locked?
● Not very efficient:
– On each write attempt to 0(R1), a memory
transaction is generated.
● Can be made more efficient by using
cache coherency:
– Spin on a locally cached copy of the lock variable until its value changes from 1 to 0.
– Bus transactions can be avoided.
116
Problem With the Spin
Lock Algorithm
● Frequent polling gets you the
lock faster:
– But slows everyone else down.
● An efficient scheme:
– Poll on a cached copy.
117
An Efficient Spin Lock
Implementation
● Problem: Every exchange includes a write,
– Invalidates all other copies;
– Generates considerable bus traffic.
● Solution: start by simply repeatedly
reading the variable; when it changes, then
try storing:
● lockit: ll R2,0(R1) ;load the lock variable
bnez R2,lockit ;not free => spin
daddui R2,R0,#1 ;locked value
sc R2,0(R1) ;store conditional
beqz R2,lockit ;branch if the store fails
118
Example
● 10 processors simultaneously
try to set a spin lock at time 0:
– Determine the number of bus
transactions required for all
processors to acquire the lock
once.
119
Solution
● Assume that at a certain time, i processes are contending:
– i ll operations
– i sc operations
– 1 store to release the lock.
● Total of 2i + 1 bus transactions for that round.
● Summing over i = 1 … n: Σ (2i + 1) = n² + 2n; for n = 10 this is 120 bus transactions.
120
Barriers
● A set of n processes all leave the
synchronization region at once:
– “at the same time” is hard in a parallel system.
– Sufficient if no process leaves until all processes arrive.
● Can be achieved by a busy wait on shared
memory:
– Creates large number of bus transactions
● To avoid this:
– Use a cache update protocol.
– Processors spin on the cached value.
121
Barrier Implementation
● Can be implemented using 2 spin
locks:
– One to protect the counter.
– One to hold the processes until
the last process arrives at the
barrier.
● Assume that lock and unlock
provide the basic spin locks.
122
Barrier Implementation
● lock(counterlock);
● if(count==0) release=0;
● count++;
● unlock(counterlock);
● if(count==total){
– count=0; release=1;}
● else spin(release==1);
123
Barrier Implementation:
Loop Hole
● It is possible that one process
races ahead:
– The fast process resets the
release flag and traps the
remaining processes.
● Solution:
– Sense-reversing barrier (read
up).
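A minimal C11 sketch of such a sense-reversing barrier (the thread count and names are assumptions): each thread flips a private sense flag on entry, so a fast thread starting the next barrier episode cannot trap the others on a stale release value.

#include <stdatomic.h>
#include <stdbool.h>

#define TOTAL 8                               /* assumed number of participating threads */

static atomic_int  count = 0;
static atomic_bool release_flag = false;

void barrier(bool *local_sense)
{
    *local_sense = !*local_sense;             /* this episode waits for a new flag value */
    if (atomic_fetch_add(&count, 1) + 1 == TOTAL) {
        atomic_store(&count, 0);              /* last arriver resets the counter ...     */
        atomic_store(&release_flag, *local_sense);  /* ... and releases everyone         */
    } else {
        while (atomic_load(&release_flag) != *local_sense)
            ;                                 /* spin (busy wait) on the cached flag     */
    }
}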
124
Example
● Assume 10 processes try to
synchronize by executing a
barrier simultaneously.
– Determine the number of bus
transactions required for the
processes to reach and leave the
barrier.
125
Solution
● For the ith process:
– The number of bus transactions is 3i + 4.
● For n processes:
– Σ (3i + 4) = (3n² + 11n)/2 − 1; for n = 10 this is 204 bus transactions.
126
Efficient Implementation
of Barrier
● We need a primitive to
efficiently increment the
barrier count.
– Queuing locks can be used for
improving the performance of a
barrier.
127
Queuing Locks
● Each arriving processor is kept
track of in a queue structure.
– Signal next waiter when a process
is done.
[Figure: array-based queuing lock – a flags array with p entries, each marked has-lock (hl) or must-wait (mw), a pointer to the current lock holder's slot, and a queuelast counter giving the next slot to hand out.]
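A minimal C11 sketch of an array-based queuing lock matching the flags-array picture above (the slot count, names and modulo wrap-around are assumptions; it assumes at most P processes contend at once):

#include <stdatomic.h>

#define P 8                                    /* assumed maximum number of contenders */

enum { MUST_WAIT = 0, HAS_LOCK = 1 };

static atomic_int flags[P] = { HAS_LOCK };     /* slot 0 starts as the lock holder */
static atomic_int queuelast = 0;               /* next slot to hand out            */

int acquire(void)                              /* returns my slot; release() needs it */
{
    int my_slot = atomic_fetch_add(&queuelast, 1) % P;
    while (atomic_load(&flags[my_slot]) == MUST_WAIT)
        ;                                      /* spin only on my own flag entry   */
    atomic_store(&flags[my_slot], MUST_WAIT);  /* re-arm this slot for later reuse */
    return my_slot;
}

void release(int my_slot)
{
    atomic_store(&flags[(my_slot + 1) % P], HAS_LOCK);  /* signal the next waiter */
}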
128
Modern Computer
Architectures
Lecture 33:
Multithreading in
Uniprocessors
129
Introduction
● If you were plowing a field, which of
the following would you rather use:
A strong ox or 1024 chickens?
--- Seymour Cray
● The answer would be different if
you are considering a computing
problem.
130
Multithreading Within a
Single Processor
● Until now, we considered multiple
threads of an application running on
different processors:
– Can multiple threads execute
concurrently on the same processor?
Yes
● Why is this desirable?
– Inexpensive – one CPU.
– Faster communication among threads.
131
Why Does Multithreading
within a Processor Make
Sense?
● Superscalar processors are now common
place.
● Most of a processor’s functional units
can’t find enough work on an average:
– Peak IPC is 6, average IPC is 1.5!
● Threads share resources:
– We can execute a number of threads
without a corresponding linear increase in
chip area.
132
Analysis of Idle Cycles in a
Superscalar Processor
● Issues multiple instructions every cycle.
– Typically 4.
● Several functional units of each type:
– Adders, Multipliers, Floating Point units, etc.
– Many functional units are idle in many cycles.
– Especially true when there is a cache miss.
● Dispatcher reads instructions, decides which
can run in parallel:
– Number of instructions limited by instruction
dependencies and long-latency operations
133
Analysis of Processor Inefficiency
• Vertical waste is
introduced when the
processor issues no
instructions in a cycle.
• Horizontal waste is
introduced when not
all issue slots can be
filled in a cycle.
• 61% of the wasted
cycles are vertical
waste on avg.
[Figure: issue slots (horizontal) versus cycles (vertical); X marks a full issue slot. Empty slots within a partially used cycle are horizontal waste (9 slots in the example); completely empty cycles are vertical waste (12 slots).]
134
Multithreading: A Pictorial
Explanation
• Rather than enlarging the
depth of the instruction
window (more speculation
with lowering confidence !):
– Enlarge its “width”.
• Fetch from multiple
threads.
[Figure: a single thread must speculate past a long chain of branches to fill a deep instruction window, whereas fetching from multiple threads widens the pool of instructions available for issue.]
135
Multithreading
● Essentially a latency hiding
technique:
– Hides stalls due to cache misses.
– Hides stalls due to data dependency.
● Under cache miss or data
dependency stalls:
– Multithreading provides work to
functional units, keeps them busy.
136
Basic Support for
Multithreading
● Multiple states (contexts) required to
be maintained at the same time.
● One set per each thread:
– Program Counter
– Register File (and Flags)
– Per thread renaming table.
● Since register renaming provides unique
register identifiers:
– Instructions from multiple threads can be
mixed in the data path.
137
Multithreading Support in
Uniprocessors
● In the most basic form:
– Processor interleaves execution of
instructions from different
threads.
● Three types of thread scheduling:
– Coarse-grained multithreading
– Fine-grained multithreading
– Simultaneous multithreading
138
Coarse-Grained
Multithreading
● A selected thread continues to run:
– Thread switch occurs only when an
active thread undergoes long stall (L2
cache miss etc.)
– This form of multithreading only hides
long latency events.
● Easy to implement:
– But, requirement of pipeline flushing on
thread switch makes it inefficient.
139
Coarse-Grained
Multithreading
140
Coarse-grained
Multithreaded Processors
● Example: Sun SPARC II Processor
– Provides hardware context for 4 threads
– One thread reserved for interrupt handling
– Register windows provide fast switching
between 4 sets of 32 GPRs.
● Used in cache-coherent DSMs:
– On a cache miss to a remote memory (takes
100s of cycles) switch to a different thread.
– Network messages etc are handled by the
interrupt handler thread.
141
Fine-Grained
Multithreading
● Few active threads:
– Context switch among the active
threads on every clock cycle.
– Occupancy of the execution core would
be much higher.
● Issue instructions only from a single
thread in a cycle:
– Again may not find enough work every
cycle, but cache misses can be
tolerated.
142
Fine-Grained
Multithreading
● Hides both long and short latency
events.
● Vertical waste is eliminated but horizontal waste is not.
– If a thread has little or no
operations to execute, its issue
slot will be underutilized.
143
Fine-Grained
Multithreading
144
Simultaneous Multithreading
(SMT): An Overview
• Converts thread-level parallelism:
–Into instruction-level parallelism.
• Issues instructions from multiple
threads in the same cycle.
–Has the highest probability of finding
work for every issue slot.
• Called Hyper-threading by Intel.
145
Simultaneous Multithreading:
A Conceptual Understanding
● 4-way superscalar: Peak throughput 4 ipc
[Figure: issue-slot occupancy over time for a superscalar, a fine-grained multithreaded, and a simultaneous multithreaded processor, with each slot either idle or filled by one of threads 1–4.]
146
Differences Among
Multithreaded Architectures
Multithreading type | Shared resources | Context switch mechanism
Fine-grained | All but register file and control logic/state | Every cycle
Coarse-grained | All but instruction fetch buffers, register files and control logic/state | On long stalls
SMT | All but instruction fetch buffer, return address stack, register files, control logic/state, reorder buffer, store queue | No switching, all active
147
SMT-Advantages
● Two main performance limitations
of multithreading:
– Memory stalls
– Pipeline flushes due to incorrect
speculation.
● In SMTs:
– Multiple threads are simultaneously
executed, can hide both these
problems.
148
Anatomy of an SMT
Processor
• Multiple “logical” CPUs.
• One physical CPU:
– ~5% extra silicon to duplicate thread state
information.
• Better than single threading:
– Increased thread-level parallelism.
– Improved processor utilization when one
thread blocks.
• Not as good as two physical CPUs:
– CPU resources are shared, not replicated.
149
Simultaneous Multithreading
(SMT)
[Figure: a shared instruction scheduler issuing instructions from Thread 1 and Thread 2 to common functional units. Example: Pentium 4.]
150
Some Issues in SMT
● To achieve multithreading:
– Extend, replicate, and redesign some units of
a superscalar processor.
● Resources replicated:
– States of hardware contexts (registers, PCs)
– Per thread mechanisms for Pipeline flushing
and subroutine returns.
– Per thread branch target buffer and
translation lookaside buffer.
151
SMT Issues
● Resources to be redesigned:
– Instruction fetch unit.
– Processor pipeline.
● Instruction Scheduling:
– Does not require additional
hardware.
– Register renaming same as in
superscalar processors.
152
Superscalar Architecture
[Figure: an instruction fetch & decode unit (multiple instructions per cycle) with a single PC feeds, over multiple buses, reservation stations in front of ALU 1, ALU 2, the FP unit, the branch unit and the load/store unit; results flow through the register file to a commit unit that retires multiple instructions per cycle.]
153
Simultaneous Multithreading: Block Diagram
[Figure: the same superscalar datapath, but with multiple PCs, register file(s) and commit logic replicated per hardware context, so instructions from several threads can be fetched, issued and committed together.]
154
Simultaneous
Multithreading: A Model
● Instruction Fetch Unit:
– Fetch instructions from 2 threads.
– Decode 1 thread till branch/end of
cache line, then jump to the other.
– Highest priority to threads with
fewest instructions in the decode,
renaming, and queue pipeline stages.
– Small hardware addition to track
queue lengths.
155
Simultaneous
Multithreading: Model
● Register File:
– Each thread has 32 registers.
– Register File: 32 * #threads +
rename registers.
● Con: Large register file → longer access time.
156
Simultaneous Multithreading:
Model Pipeline Format
[Figure: the base superscalar pipeline compared with the SMT model's pipeline.]
157
Simultaneous Multithreading:
Model Pipeline Format
● To avoid increase in clock
cycle time:
–SMT pipeline extended to allow
2 cycle register reads and
writes.
● 2 cycle reads/writes increase
branch misprediction penalty.
158
Simultaneous Multithreading:
What to Issue?
● Not exactly the same as superscalars:
– In a superscalar: oldest is the best:
least speculation.
– In SMT not so clear:
● Branch-speculation optimism may vary
across threads.
● Based on this the selection strategies:
– Oldest first.
– Branch speculated last etc…
159
Simultaneous Multithreading:
Compiler Optimizations
● Should try to minimize cache
interference.
● Latency hiding techniques like
speculation should be enhanced.
● Sharing optimization techniques
from multiprocessors changed:
– Data sharing is now good.
160
Caching in SMT
● Same cache shared
among threads:
–Performance degradation
due to cache sharing.
–Possibility of cache
thrashing.
161
Performance Implications
of SMT
● Single thread performance is likely to go
down:
– Caches, branch predictors, registers, etc.
are shared.
● This effect can be mitigated by trying
to prioritize one thread.
● With eight threads in a processor with
many resources:
– SMT can yield throughput improvements of roughly 2–4×.
162
Commercial Examples
● Compaq Alpha 21464 (EV8)
– 4T SMT, June 2001
● Intel Pentium IV (Xeon)
– 2T SMT, 2002
– 10-30% gains reported
● SUN Ultra IV
– 2-core, 2T SMT
● IBM POWER5
– Dual processor core
– 8-way superscalar, SMT
– 24% area growth per core for SMT
163
Pentium4: Hyper-
Threading
● Two threads:
– The operating system operates as if it
is executing on a two-processor system.
● When only one available thread:
– Pentium 4 behaves like a regular single-
threaded superscalar processor.
● Intel claims 30% performance
improvements.
164
Intel MultiCore Architecture
● Improving execution rate of a single-
thread is still considered important:
– Out-of-order execution and speculation.
● MultiCore architecture:
– Can reduce power consumption.
– Its pipeline (14 stages) is closer to the Pentium M's (12 stages) than to the P4's (30 stages).
● Many transistors invested in large
branch predictors:
– To reduce wasted work (power).
165
Processor Power
Consumption
“Surpassed hot-plate power density in 0.5 µ; not too long to reach nuclear reactor.”
– Former Intel Fellow Fred Pollack
166
Intel’s Dual Core Architectures
● The Pentium D is simply two Pentium 4 CPUs:
– Inefficiently paired together to run as dual core.
● The Core Duo is Intel's first generation dual core
processor based upon the Pentium M (a Pentium III-4
hybrid):
– Made mostly for laptops and is much more efficient than
Pentium D.
● The Core 2 Duo is Intel's second generation (hence,
Core 2) processor:
– Made for desktops and laptops designed to be fast while
not consuming nearly as much power as previous CPUs.
● Intel has now dropped the Pentium name in favor of
the Core architecture.
167
Intel Core Processor
168
Intel Core 2 Duo
• Code named “conroe”
• Homogeneous cores
• Bus based on chip
interconnect.
• Shared on-die Cache
Memory.
[Figure: two cores, each a classic out-of-order engine (reservation stations, issue ports, schedulers, etc.), sharing a large set-associative on-die L2 cache with prefetching. Source: Intel Corp.]
169
Intel’s Core 2 Duo
● Launched in July 2006.
● Replacement for Pentium 4 and Pentium D
CPUs.
● Intel claims:
– Conroe provides 40% more performance at
40% less power compared to the Pentium D.
● All Conroe processors are manufactured
with 4 MB L2 cache:
– Due to manufacturing defects, the E6300 and
E6400 versions based on this core have half
their cache disabled, leaving them with only
2 MB of usable L2 cache.
170
Intel Core Processor
Specification
● Speeds: 1.06 GHz to 3 GHz
● FSB speeds: 533 MT/s to 1333 MT/s
● Process: 0.065 µm (MOSFET channel length)
● Instruction set: x86, MMX, SSE, SSE2, SSE3, SSSE3 (Streaming SIMD Extensions), x86-64
● Microarchitecture: Intel Core microarchitecture
171
Core 2 Duo
Microarchitecture
172
Why Sharing On-Die L2?
• What happens when L2 is too large?
173
Commercial Examples: IBM
POWER5
174
Commercial Examples: IBM
POWER5
● SMT added to Superscalar Micro-
architecture.
● Additional Program Counter (PC).
● GPR/FPR rename mapper expanded
to map second set of registers .
● Completion logic replicated to track
two threads.
175
Commercial Examples: IBM
POWER5
● Includes:
1. Thread Priority Mechanism: 8
levels.
2. Dynamic Thread Switching:
● Used if no instruction ready to run.
● Allocates all machine resources to
one thread at any time.
176
Sun’s Niagara
● Commercial servers require high
thread-level throughput:
– Suffer from cache misses.
● Sun’s Niagara focuses on:
– Simple cores (low power, design
complexity, can accommodate more
cores)
– Fine-grain multi-threading (to
tolerate long memory latencies)
177
Sun’s Niagara
178
Xeon and Opteron
[Figure: platform diagram with dual-core CPUs, a memory controller hub, I/O hubs and PCI-E bridges sharing the front-side bus.]
Legacy x86 Architecture
• 20-year old front-side bus architecture
• CPUs, Memory, I/O all share a bus
• A bottleneck to performance
• Faster CPUs or more cores ≠ performance
AMD64
• Direct Connect Architecture eliminates FSB
bottleneck.
179
Xeon Vs. Opteron
[Figure: side-by-side platform diagrams – dual-core Xeon CPUs behind a memory controller hub with XMB memory buffers, versus dual-core Opteron CPUs, each platform with its own I/O hubs and PCI-E bridges.]
180
Reducing Power and Cooling Requirements
with Processor Performance States
AMD PowerNow!™ Technology with Optimized Power Management:
● Multiple performance states for optimized power management.
● Dynamically reduces processor power based on workload (the P-state tracks processor utilization).
● Lowers power consumption without compromising performance.
● Up to 75% processor power savings.
Example: AMD Opteron™ processor 2218 series P-states (HIGH to LOW):
P-state | Frequency | Voltage | Power
P0 | 2600 MHz | 1.35 V | ~95 watts
P1 | 2400 MHz | 1.30 V | ~80 watts
P2 | 2200 MHz | 1.25 V | ~66 watts
P3 | 2000 MHz | 1.20 V | ~55 watts
P4 | 1800 MHz | 1.15 V | ~51 watts
P5 | 1000 MHz | 1.10 V | ~34 watts
181
Summary
• ILP now appears fully exploited:
–For the last decade or so, the focus has
been on thread-and process-level
parallelism.
• Multiprocessors progressed from
add-on cards, to chips on the mother
board:
–Now available as multicore.
182
Summary
Cont…
● Major issues in multiprocessors:
– Cache coherency and
synchronization.
● Cache coherency:
– The copies of data blocks in individual caches may become inconsistent.
183
Summary
Cont…
● Cache coherency:Two popular
protocols:
– Snooping: suitable in SMPs
– Directory-based: Suitable in NUMA
processors.
● Multithreading in uniprocessors is
another promising approach:
– Simultaneous multithreading (SMT)
184
Future Trends
● Simultaneous and Redundantly
Threaded Processors(SRT):
– Increase reliability with fault
detection and correction.
– Run multiple copies of the same program simultaneously.
185
Future Trends
● Software Pre-Execution in SMT:
– In some cases the data address is extremely hard to predict.
– Use an idle thread of SMT for pre-
execution.
● Speculation:
– More advanced techniques for
speculation.
 
The history of music videos a level presentation
The history of music videos a level presentationThe history of music videos a level presentation
The history of music videos a level presentationamedia6
 
3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdf3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdfSwaraliBorhade
 
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779Delhi Call girls
 

Recently uploaded (20)

VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130
VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130
VIP Call Girls Service Kukatpally Hyderabad Call +91-8250192130
 
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...
Kala jadu for love marriage | Real amil baba | Famous amil baba | kala jadu n...
 
Cosumer Willingness to Pay for Sustainable Bricks
Cosumer Willingness to Pay for Sustainable BricksCosumer Willingness to Pay for Sustainable Bricks
Cosumer Willingness to Pay for Sustainable Bricks
 
Revit Understanding Reference Planes and Reference lines in Revit for Family ...
Revit Understanding Reference Planes and Reference lines in Revit for Family ...Revit Understanding Reference Planes and Reference lines in Revit for Family ...
Revit Understanding Reference Planes and Reference lines in Revit for Family ...
 
SD_The MATATAG Curriculum Training Design.pptx
SD_The MATATAG Curriculum Training Design.pptxSD_The MATATAG Curriculum Training Design.pptx
SD_The MATATAG Curriculum Training Design.pptx
 
Captivating Charm: Exploring Marseille's Hillside Villas with Our 3D Architec...
Captivating Charm: Exploring Marseille's Hillside Villas with Our 3D Architec...Captivating Charm: Exploring Marseille's Hillside Villas with Our 3D Architec...
Captivating Charm: Exploring Marseille's Hillside Villas with Our 3D Architec...
 
VIP Kolkata Call Girl Gariahat 👉 8250192130 Available With Room
VIP Kolkata Call Girl Gariahat 👉 8250192130  Available With RoomVIP Kolkata Call Girl Gariahat 👉 8250192130  Available With Room
VIP Kolkata Call Girl Gariahat 👉 8250192130 Available With Room
 
Design Portfolio - 2024 - William Vickery
Design Portfolio - 2024 - William VickeryDesign Portfolio - 2024 - William Vickery
Design Portfolio - 2024 - William Vickery
 
Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...
Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...
Recommendable # 971589162217 # philippine Young Call Girls in Dubai By Marina...
 
Kindergarten Assessment Questions Via LessonUp
Kindergarten Assessment Questions Via LessonUpKindergarten Assessment Questions Via LessonUp
Kindergarten Assessment Questions Via LessonUp
 
Cheap Rate Call girls Malviya Nagar 9205541914 shot 1500 night
Cheap Rate Call girls Malviya Nagar 9205541914 shot 1500 nightCheap Rate Call girls Malviya Nagar 9205541914 shot 1500 night
Cheap Rate Call girls Malviya Nagar 9205541914 shot 1500 night
 
VIP Russian Call Girls in Saharanpur Deepika 8250192130 Independent Escort Se...
VIP Russian Call Girls in Saharanpur Deepika 8250192130 Independent Escort Se...VIP Russian Call Girls in Saharanpur Deepika 8250192130 Independent Escort Se...
VIP Russian Call Girls in Saharanpur Deepika 8250192130 Independent Escort Se...
 
Fashion trends before and after covid.pptx
Fashion trends before and after covid.pptxFashion trends before and after covid.pptx
Fashion trends before and after covid.pptx
 
NO1 Famous Amil Baba In Karachi Kala Jadu In Karachi Amil baba In Karachi Add...
NO1 Famous Amil Baba In Karachi Kala Jadu In Karachi Amil baba In Karachi Add...NO1 Famous Amil Baba In Karachi Kala Jadu In Karachi Amil baba In Karachi Add...
NO1 Famous Amil Baba In Karachi Kala Jadu In Karachi Amil baba In Karachi Add...
 
Abu Dhabi Call Girls O58993O4O2 Call Girls in Abu Dhabi`
Abu Dhabi Call Girls O58993O4O2 Call Girls in Abu Dhabi`Abu Dhabi Call Girls O58993O4O2 Call Girls in Abu Dhabi`
Abu Dhabi Call Girls O58993O4O2 Call Girls in Abu Dhabi`
 
CALL ON ➥8923113531 🔝Call Girls Aminabad Lucknow best Night Fun service
CALL ON ➥8923113531 🔝Call Girls Aminabad Lucknow best Night Fun serviceCALL ON ➥8923113531 🔝Call Girls Aminabad Lucknow best Night Fun service
CALL ON ➥8923113531 🔝Call Girls Aminabad Lucknow best Night Fun service
 
NO1 Trending kala jadu Love Marriage Black Magic Punjab Powerful Black Magic ...
NO1 Trending kala jadu Love Marriage Black Magic Punjab Powerful Black Magic ...NO1 Trending kala jadu Love Marriage Black Magic Punjab Powerful Black Magic ...
NO1 Trending kala jadu Love Marriage Black Magic Punjab Powerful Black Magic ...
 
The history of music videos a level presentation
The history of music videos a level presentationThe history of music videos a level presentation
The history of music videos a level presentation
 
3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdf3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdf
 
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
Best VIP Call Girls Noida Sector 47 Call Me: 8448380779
 

  • 17. 17 Reflection on Threading ● Come to think of it: – Threading is inherent to any server application. ● Threads are also easily identifiable in traditional applications: – Banking, scientific computations, etc.
  • 18. 18 Thread-level Parallelism --- Cons ● Threads have to be identified by the programmer: – No rules exist as to what constitutes a meaningful thread. – Threads cannot, in general, be identified by automatic static or dynamic analysis of code. – Burden on programmer: requires careful thinking and programming.
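To make the programmer's burden concrete, the following is a minimal POSIX-threads sketch of the kind of explicit decomposition the slide describes: the programmer decides that each unit of work (here, a hypothetical bank transaction) becomes a thread. process_transaction() and the txn[] array are illustrative placeholders, not part of any real application.

    /* Minimal POSIX-threads sketch: the programmer explicitly identifies each
     * unit of work (one bank transaction) and hands it to a thread.
     * process_transaction() and txn[] are illustrative placeholders. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTXN 4

    static void *process_transaction(void *arg)
    {
        int id = *(int *)arg;
        printf("thread for transaction %d running\n", id);  /* real work would go here */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTXN];
        int txn[NTXN] = {0, 1, 2, 3};

        for (int i = 0; i < NTXN; i++)                       /* one thread per transaction */
            pthread_create(&tid[i], NULL, process_transaction, &txn[i]);
        for (int i = 0; i < NTXN; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }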
  • 19. 19 Thread-level Parallelism --- Cons cont… ● Threads with severe dependencies: – May make multithreading an exercise in futility. ● Also not as “programmer friendly” as ILP.
  • 20. 20 Thread Vs. Process-Level Parallelism ● Threads are lightweight (or fine-grained): – Threads share address space, data, files, etc. – Even when the extent of data sharing and synchronization is low, exploiting thread-level parallelism is meaningful only when communication latency is low. – Consequently, shared memory architectures (UMA) are a popular way to exploit thread-level parallelism.
  • 21. 21 Thread Vs. Process- Level Parallelism cont… ● Processes are coarse-grained: – Communication to computation requirement is lower. – DSM (Distributed Shared Memory), Clusters, Grids, etc. are meaningful.
  • 22. 22 Focus of Next Few Lectures ● Shared memory multiprocessors – Cache coherency – Synchronization: spin locks ● The recent phenomenon of threading support in uniprocessors. ● Distributed memory multiprocessors – DSM – Clusters (Discussed in Module 6) – Grids (Discussed in Module 6)
  • 23. 23 A Broad Classification of Computers ● Shared-memory multiprocessors – Also called UMA ● Distributed memory computers – Also called NUMA: ● Distributed Shared-memory (DSM) architectures ● Clusters ● Grids, etc.
  • 25. 25 Distributed Memory Computers ● Distributed memory computers use: – Message Passing Model ● Explicit message send and receive instructions have to be written by the programmer. – Send: specifies local buffer + receiving process (id) on remote computer (address). – Receive: specifies sending process on remote computer + local buffer to place data.
  • 26. 26 Advantages of Message- Passing Communication ● Hardware for communication and synchronization are much simpler: – Compared to communication in a shared memory model. ● Explicit communication: – Programs simpler to understand, helps to reduce maintenance and development costs. ● Synchronization is implicit: – Naturally associated with sending/receiving messages. – Easier to debug.
  • 27. 27 Disadvantages of Message- Passing Communication ● Programmer has to write explicit message passing constructs. – Also, precisely identify the processes (or threads) with which communication is to occur. ● Explicit calls to operating system: – Higher overhead.
  • 28. 28 MPI: A Message Passing Standard ● A de facto standard developed by a group of industry and academic professionals: – Aim is to foster portability and widespread use. ● Defines routines, not implementations: – Several free implementations exist. – Synchronous and asynchronous modes.
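Because MPI specifies routines rather than an implementation, the flavour of the send/receive model is easy to show with a tiny point-to-point exchange. A minimal sketch assuming some MPI implementation (e.g. MPICH or Open MPI) is installed; the ranks, tag, and payload are arbitrary.

    /* Minimal MPI point-to-point sketch: rank 0 sends an integer to rank 1.
     * Compile with mpicc and run with mpirun -np 2. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* dest = 1, tag = 0 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }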
  • 29. 29 DSM ● Physically separate memories are accessed as one logical address space. ● Processors running on a multi- computer system share their memory. – Implemented by operating system. ● DSM multiprocessors are NUMA: – Access time depends on the exact location of the data.
  • 30. 30 Distributed Shared-Memory Architecture (DSM) ● Underlying mechanism is message passing: – Shared memory convenience provided to the programmer by the operating system. – Basically, an operating system facility takes care of message passing implicitly. ● Advantage of DSM: – Ease of programming
  • 31. 31 Disadvantage of DSM ● High communication cost: – A program not specifically optimized for DSM by the programmer shall perform extremely poorly. – Data (variables) accessed by specific program segments have to be collocated. – Useful only for process-level (coarse- grained) parallelism.
  • 32. 32 SVM: Shared Virtual Memory ● Supporting DSM on top of an inherently message passing system is inefficient. ● A possible solution is SVM. ● Virtual memory mechanism is used to share objects at the page level.
  • 33. 33 Communication Overhead: Example 1 ● An application is running on a 32 node multiprocessor. ● It incurs a latency of 400 ns to handle a reference (read/write) to memory. ● Processor clock rate is 1 GHz; IPC (instructions per cycle) = 2. ● How much faster will a computation be if there is no communication, compared to the case where 0.2% of the instructions involve a reference to memory?
  • 34. 34 Communication Overhead: Solution 1 ● Base CPI = 0.5 ● Effective CPI with 0.2% memory references = Base CPI + memory request rate * memory request cost ● At 1 GHz, 400 ns = 400 cycles, so effective CPI = 0.5 + 0.002 * 400 = 0.5 + 0.8 = 1.3 ● A program having no memory references would therefore be 1.3/0.5 = 2.6 times faster.
  • 35. 35 Communication Overhead: Example 2 ● An application running on a 32 node DSM. ● It incurs a latency of 400 µs to handle a reference (read/write) to a remote memory. ● Processor clock rate is 1 GHz; IPC (instructions per cycle) = 2. ● How much faster will a computation be on the multiprocessor of Example 1 compared to the DSM, if 0.2% of the instructions involve a reference to remote memory? Assume no local memory references.
  • 36. 36 Communication Overhead: Solution 2 ● Base CPI = 0.5 ● Effective CPI with 0.2% remote references = Base CPI + remote request rate * remote request cost ● At 1 GHz, 400 µs = 400,000 cycles, so effective CPI = 0.5 + 0.002 * 400,000 = 0.5 + 800 = 800.5 ● The multiprocessor would thus be 800.5/1.3 ≈ 616 times faster. ● Performance figures of NUMA may be worse: – If we take data dependency and synchronization aspects into consideration.
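Both solutions use the same formula, effective CPI = base CPI + reference rate * access cost in cycles. The snippet below simply reproduces the arithmetic so the numbers above can be checked; all figures are the ones assumed in the two examples.

    /* Reproduces the arithmetic of Examples 1 and 2 (1 GHz clock, IPC = 2). */
    #include <stdio.h>

    int main(void)
    {
        double base_cpi = 0.5;                  /* IPC = 2                            */
        double rate = 0.002;                    /* 0.2% of instructions               */
        double mem_cycles = 400.0;              /* 400 ns at 1 GHz = 400 cycles       */
        double remote_cycles = 400.0 * 1000.0;  /* 400 us at 1 GHz = 400,000 cycles   */

        double cpi_smp = base_cpi + rate * mem_cycles;     /* 1.3   */
        double cpi_dsm = base_cpi + rate * remote_cycles;  /* 800.5 */

        printf("Example 1: no-communication case is %.1fx faster\n", cpi_smp / base_cpi);
        printf("Example 2: multiprocessor is %.0fx faster than the DSM\n", cpi_dsm / cpi_smp);
        return 0;
    }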
  • 37. 37 Modern Computer Architectures Lecture 29: Symmetric Multiprocessors (SMPs)
  • 38. 38 Symmetric Multiprocessors (SMPs) ● SMPs are a popular shared memory multiprocessor architecture: – Processors share Memory and I/O – Bus based: access time for all memory locations is equal --- “Symmetric MP” [Figure: four processors, each with a private cache, connected by a shared bus to main memory and the I/O system]
  • 39. 39 SMPs: Some Insights ● In any multiprocessor, main memory access is a bottleneck: – Multilevel caches reduce the memory demand of a processor. – Multilevel caches in fact make it possible for more than one processor to meaningfully share the memory bus. – Hence multilevel caches are a must in a multiprocessor!
  • 40. 40 Different SMP Organizations ● Processor and cache on separate extension boards (1980s): – Plugged on to the backplane. ● Integrated on the main board (1990s): – 4 or 6 processors placed per board. ● Integrated on the same chip (multi-core) (2000s): – Dual core (IBM, Intel, AMD) – Quad core
  • 41. 41 Pros of SMPs ● Ease of programming: –Especially when communication patterns are complex or vary dynamically during execution.
  • 42. 42 Cons of SMPs ● As the number of processors increases, contention for the bus increases. – Scalability of the SMP model restricted. – One way out may be to use switches (crossbar, multistage networks, etc.) instead of a bus. – Switches set up parallel point-to-point connections. – Switches, however, are not without disadvantages: they make implementation of cache coherence difficult.
  • 43. 43 SMPs ● Even programs not using multithreading (conventional programs): – Experience a performance increase on SMPs – Reason: Kernel routines handling interrupts etc. run on a separate processor. ● Multicore processors are now commonplace: – Pentium 4 Extreme Edition, Xeon, Athlon64, DEC Alpha, UltraSparc…
  • 44. 44 Why Multicores? ● Can you recollect the constraints on further increase in circuit complexity: – Clock skew and temperature. ● Use of more complex techniques to improve single-thread performance is limited. ● Any additional transistors have to be used in a different core.
  • 45. 45 Why Multicores? Cont… ● Multiple cores on the same physical packaging: – Execute different threads. – Switched off, if no thread to execute (power saving). – Dual core, quad core, etc.
  • 46. 46 Cache Organizations for Multicores ● L1 caches are always private to a core ● L2 caches can be private or shared – which is better? [Figure: left, each of the four cores has a private L1 and a private L2; right, each core has a private L1 but all four cores share a single L2]
  • 47. 47 L2 Organizations ● Advantages of a shared L2 cache: – Efficient dynamic use of space by each core – Data shared by multiple cores is not replicated. – Every block has a fixed “home” – hence, easy to find the latest copy. ● Advantages of a private L2 cache: – Quick access to private L2 – Private bus to private L2, less contention.
  • 48. 48 An Important Problem with Shared-Memory: Coherence ● When shared data are cached: – These are replicated in multiple caches. – The data in the caches of different processors may become inconsistent. ● How to enforce cache coherency? – How does a processor know changes in the caches of other processors?
  • 49. 49 The Cache Coherency Problem [Figure: P1, P2, and P3 initially cache U = 5; P3 then writes U = 7, leaving the copies at P1 and P2 stale. What value will P1 and P2 read?]
  • 50. 50 Cache Coherence Solutions (Protocols) ● The key to maintain cache coherence: – Track the state of sharing of every data block. ● Based on this idea, following can be an overall solution: – Dynamically recognize any potential inconsistency at run-time and carry out preventive action.
  • 51. 51 Basic Idea Behind Cache Coherency Protocols [Figure: four processors with private caches on a shared bus to main memory and the I/O system]
  • 52. 52 Pros and Cons of the Solution ● Pro: –Consistency maintenance becomes transparent to programmers, compilers, as well as to the operating system. ● Con: –Increased hardware complexity .
  • 53. 53 Two Important Cache Coherency Protocols ● Snooping protocol: – Each cache “snoops” the bus to find out which data is being used by whom. ● Directory-based protocol: – Keep track of the sharing state of each data block using a directory. – A directory is a centralized register for all memory blocks. – Allows coherency protocol to avoid broadcasts.
  • 54. 54 Snoopy and Directory-Based Protocols [Figure: four processors with private caches on a shared bus to main memory and the I/O system]
  • 55. 55 Snooping vs. Directory-based Protocols ● Snooping protocol reduces memory traffic. – More efficient. ● Snooping protocol requires broadcasts: – Can meaningfully be implemented only when there is a shared bus. – Even when there is a shared bus, scalability is a problem. – Some workarounds have been tried: the Sun Enterprise server has up to 4 buses.
  • 56. 56 Snooping Protocol ● As soon as a request for any data block by a processor is put out on the bus: – Other processors “snoop” to check if they have a copy and respond accordingly. ● Works well with bus interconnection: – All transmissions on a bus are essentially broadcast: ● Snooping is therefore effortless. – Dominates almost all small scale machines.
  • 57. 57 Categories of Snoopy Protocols ● Essentially two types: – Write Invalidate Protocol – Write Broadcast Protocol ● Write invalidate protocol: – When one processor writes to its cache, all other processors having a copy of that data block invalidate that block. ● Write broadcast: – When one processor writes to its cache, all other processors having a copy of that data block update that block with the recent written value.
  • 58. 58 Write Invalidate Vs. Write Update Protocols [Figure: four processors with private caches on a shared bus to main memory and the I/O system]
  • 59. 59 Write Invalidate Protocol ● Handling a write to shared data: – An invalidate command is sent on bus --- all caches snoop and invalidate any copies they have. ● Handling a read Miss: – Write-through: memory is always up-to- date. – Write-back: snooping finds most recent copy.
  • 60. 60 Write Invalidate in Write Through Caches ● Simple implementation. ● Writes: – Write to shared data: an invalidate is broadcast on the bus; processors snoop and invalidate any copies. – Read miss: memory is always up-to-date. ● Concurrent writes: – Write serialization automatically achieved since the bus serializes requests. – Bus provides the basic arbitration support.
  • 61. 61 Write Invalidate versus Broadcast cont… ● Invalidate exploits spatial locality: –Only one bus transaction for any number of writes to the same block. –Obviously, more efficient. ● Broadcast has lower latency for writes and reads: –As compared to invalidate.
  • 62. 62 An Example Snoopy Protocol ● Assume: – Invalidation protocol, write-back cache. ● Each block of memory is in one of the following states: – Shared: Clean in all caches and up-to-date in memory, block can be read. – Exclusive: cache has the only copy, it is writeable, and dirty. – Invalid: Data present in the block obsolete, cannot be used.
  • 63. 63 Modern Computer Architectures Lecture 30: Cache Coherence Protocols
  • 64. 64 Implementation of the Snooping Protocol ● A cache controller at every processor would implement the protocol: – Has to perform specific actions: ● When the local processor requests certain things. ● Also, certain actions are required when certain address appears on the bus. – Exact actions of the cache controller depends on the state of the cache block. – Two FSMs can show the different types of actions to be performed by a controller.
  • 65. 65 Snoopy-Cache State Machine-I ● State machine considering only CPU requests for each cache block. [State diagram: Invalid moves to Shared on a CPU read miss (place read miss on bus) and to Exclusive on a CPU write (place write miss on bus); Shared stays Shared on a CPU read hit or read miss (read miss placed on bus) and moves to Exclusive on a CPU write (write miss placed on bus); Exclusive stays Exclusive on CPU read/write hits, moves to Shared on a CPU read miss (write back block, place read miss on bus), and on a CPU write miss writes the block back and places a write miss on the bus.]
  • 66. 66 Snoopy-Cache State Machine-II ● State machine considering only bus requests for each cache block. [State diagram: a write miss for this block observed on the bus moves Shared or Exclusive to Invalid; a read miss for this block moves Exclusive to Shared; leaving Exclusive causes the block to be written back and the memory access to be aborted.]
  • 67. 67 Combined Snoopy-Cache State Machine ● State machine considering both CPU requests and bus requests for each cache block. [State diagram: the union of the transitions shown on the previous two slides.]
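One way to read the diagrams is as a transition function over the three block states. The sketch below encodes only the CPU-request side of the write-back invalidation protocol described on the last few slides; bus actions are reduced to print statements and the bus-request (snooping) side is omitted.

    /* Sketch of the CPU-request side of the write-back invalidation protocol
     * above. Bus actions are just printed; snooping is not modelled. */
    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } block_state;
    typedef enum { CPU_READ_HIT, CPU_READ_MISS, CPU_WRITE_HIT, CPU_WRITE_MISS } cpu_event;

    block_state cpu_transition(block_state s, cpu_event e)
    {
        switch (s) {
        case INVALID:
            if (e == CPU_READ_MISS)  { puts("place read miss on bus");  return SHARED; }
            if (e == CPU_WRITE_MISS) { puts("place write miss on bus"); return EXCLUSIVE; }
            break;
        case SHARED:
            if (e == CPU_READ_HIT)   return SHARED;
            if (e == CPU_READ_MISS)  { puts("place read miss on bus");  return SHARED; }
            if (e == CPU_WRITE_HIT || e == CPU_WRITE_MISS) {
                puts("place write miss on bus");
                return EXCLUSIVE;
            }
            break;
        case EXCLUSIVE:
            if (e == CPU_READ_HIT || e == CPU_WRITE_HIT) return EXCLUSIVE;
            if (e == CPU_READ_MISS)  { puts("write back block; place read miss on bus");  return SHARED; }
            if (e == CPU_WRITE_MISS) { puts("write back block; place write miss on bus"); return EXCLUSIVE; }
            break;
        }
        return s;
    }

    int main(void)
    {
        block_state s = INVALID;
        s = cpu_transition(s, CPU_READ_MISS);   /* -> SHARED    */
        s = cpu_transition(s, CPU_WRITE_HIT);   /* -> EXCLUSIVE */
        return s == EXCLUSIVE ? 0 : 1;
    }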
  • 68. 68 Example P1 P2 Bus Memory step State AddrValueState AddrValue Action Proc.Addr ValueAddr Value P1: Write 10 to A1 P1: Read A1 P2: Read A1 P2: Write 20 to A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Assumes A1 and A2 map to same cache block, initial cache state is invalid
  • 69. 69 Example P1 P2 Bus Memory step State AddrValueState AddrValue Action Proc.Addr ValueAddr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 P2: Read A1 P2: Write 20 to A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Assumes A1 and A2 map to same cache block
  • 70. 70 Example P1 P2 Bus Memory step State AddrValueState AddrValue Action Proc.Addr ValueAddr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 P2: Write 20 to A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Assumes A1 and A2 map to same cache block
  • 71. 71 Example P1 P2 Bus Memory step State AddrValueState AddrValue Action Proc.Addr ValueAddr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 Shar. A1 RdMs P2 A1 Shar. A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Assumes A1 and A2 map to same cache block
  • 72. 72 Example P1 P2 Bus Memor step State AddrValueState AddrValue Action Proc.Addr ValueAddr Valu P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 Shar. A1 RdMs P2 A1 Shar. A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 Inv. Excl. A1 20 WrMs P2 A1 A1 10 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Assumes A1 and A2 map to same cache block
  • 73. 73 Example P1 P2 Bus Memory step State AddrValueState AddrValue Action Proc.Addr ValueAddr Value P1: Write 10 to A1 Excl. A1 10 WrMs P1 A1 P1: Read A1 Excl. A1 10 P2: Read A1 Shar. A1 RdMs P2 A1 Shar. A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 Inv. Excl. A1 20 WrMs P2 A1 A1 10 P2: Write 40 to A2 WrMs P2 A2 A1 10 Excl. A2 40 WrBk P2 A1 20 A1 20 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Assumes A1 and A2 map to same cache block, but A1 != A2
  • 74. 74 Cache Misses in SMPs ● Overall cache performance is a combination of: – Uniprocessor cache misses – Misses due to invalidations caused by coherency protocols (coherency misses). ● Changes to some parameters can affect the two types of misses in different ways: – Processor count – Cache size – Block size
  • 75. 75 Coherence Misses ● The 4th C: Misses occurring due to coherency protocols. ● Example: – First write by a processor to a shared cache block. – Causes invalidation to establish ownership of the block.
  • 76. 76 Coherence Misses ● Coherence misses: – True sharing – False sharing ● False sharing misses occur because an entire cache block has a single valid bit. – False sharing misses can be avoided if the unit of sharing is a word.
  • 77. 77 Coherence Miss: Examples (accesses by P1 and P2; X1 and X2 belong to the same block) – Time 1: Write X1 (true sharing) – Time 2: Read X2 (false sharing) – Time 3: Write X1 (false sharing) – Time 4: Write X2 (false sharing) – Time 5: Read X2 (true sharing)
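False sharing is easy to provoke deliberately. In the sketch below, two threads update two different counters that happen to sit in the same cache block, so every write invalidates the other core's copy; padding the first counter out to its own cache line (the commented-out field) removes the coherence misses. Assumes POSIX threads and a 64-byte line size.

    /* Two threads update *different* words that share one cache line: every
     * write invalidates the other core's copy (false sharing). Padding each
     * counter to its own 64-byte line (see the commented field) avoids it. */
    #include <pthread.h>
    #include <stdio.h>

    struct counters {
        long x1;
        /* char pad[64 - sizeof(long)];   <-- uncomment to put x2 on its own line */
        long x2;
    } c;

    static void *bump_x1(void *arg) { (void)arg; for (long i = 0; i < 50000000; i++) c.x1++; return NULL; }
    static void *bump_x2(void *arg) { (void)arg; for (long i = 0; i < 50000000; i++) c.x2++; return NULL; }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_x1, NULL);
        pthread_create(&t2, NULL, bump_x2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("x1=%ld x2=%ld\n", c.x1, c.x2);
        return 0;
    }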
  • 78. 78 Increase in Number of Processors ● Coherence misses (both True and False) increase. ● Capacity misses decrease. ● Overall increase in miss rate: – Resulting in increase in AMAT.
  • 79. 79 Increase in Block Size ● True sharing misses decrease. – Increase in block size from 32B to 256B reduces true sharing misses by half. – Cause: Spatial locality in access. ● Compulsory misses decrease. ● False sharing misses increase. ● Conflict misses increase.
  • 80. 80 Some Issues in Implementing Snooping Caches ● Additional circuitry needed in a cache controller. ● Controller continuously snoops on address bus: – If address matches tag, either invalidate or update. ● Since every bus transaction checks cache tags, could interfere with CPU activities: – Solution 1: Duplicate set of tags for L1 caches to allow checks in parallel with CPU. – Solution 2: Duplicate tags on L2 cache.
  • 81. 81 A Commercial Implementation ● Intel Pentium Xeon (PIII and PIV) are cache coherent multiprocessors: – Implements snooping protocol. – Larger on chip caches to reduce bus contentions. – The chipset contains an external memory controller that connects the shared processor memory bus with the memory chips.
  • 82. 82 Modern Computer Architectures Lecture 31: Cache Coherence Protocols (Cont…)
  • 83. 83 NUMA Computers: Directory-Based Solution [Figure: several nodes, each containing a processor + cache, local memory, a directory, and I/O, connected by an interconnection network]
  • 84. 84 Shared Virtual Memory in DSMs ● In SVM processes appear as if they are sharing their entire virtual address space: – Great convenience to the programmers. – In effect, the operating system takes care of moving around the pages transparently. – Pages are the unit of sharing. – Pages are the units of coherence.
  • 85. 85 Shared Virtual Memory in DSMs ● OS can easily allow pages to be replicated in read-only fashion: – Virtual memory can protect pages from being written. ● When a process writes to a page: – Traps to OS – Pages in read-only state at other nodes are invalidated. ● False sharing can be high: – Leads to lower performance.
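The trap-on-write mechanism the slide describes can be sketched at user level with mprotect() and a SIGSEGV handler, which is roughly how page-based SVM systems detect writes to replicated read-only pages. This is only an illustrative, Linux-oriented sketch; svm_notify_write() is a placeholder for the runtime that would invalidate remote copies, and error handling and async-signal-safety are ignored.

    /* One "shared" page starts read-only; the first write traps, the handler
     * notifies the (placeholder) SVM runtime and upgrades the page to writable. */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static long page_size;
    static char *shared_page;

    static void svm_notify_write(void *addr)      /* placeholder: would invalidate remote copies */
    {
        fprintf(stderr, "write detected at %p\n", addr);
    }

    static void fault_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(page_size - 1));
        svm_notify_write(info->si_addr);
        mprotect(page, page_size, PROT_READ | PROT_WRITE);   /* faulting store retries and succeeds */
    }

    int main(void)
    {
        page_size = sysconf(_SC_PAGESIZE);
        shared_page = mmap(NULL, page_size, PROT_READ,       /* replicated page starts read-only */
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa = {0};
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        shared_page[0] = 42;                                 /* traps once, then completes */
        printf("value = %d\n", shared_page[0]);
        return 0;
    }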
  • 86. 86 Directory-based Solution ● In NUMA computers: – Messages have long latency. – Also, broadcast is inefficient --- all messages have explicit responses. ● Main memory controller to keep track of: – Which processors are having cached copies of which memory locations. ● On a write, – Only need to inform users, not everyone ● On a dirty read, – Forward to owner
  • 87. 87 Directory Protocol ● Three states as in Snoopy Protocol – Shared: 1 or more processors have data, memory is up-to-date. – Uncached: No processor has the block. – Exclusive: 1 processor (owner) has the block. ● In addition to cache state, – Must track which processors have data when in the shared state. – Usually implemented using bit vector, 1 if processor has copy.
  • 88. 88 Directory Behavior ● On a read: – Unused: ● give (exclusive) copy to requester ● record owner – Exclusive or shared: ● send share message to current exclusive owner ● record owner ● return value – Exclusive dirty: ● forward read request to exclusive owner.
  • 89. 89 Directory Behavior ● On Write – Send invalidate messages to all hosts caching values. ● On Write-Thru/Write-back – Update value.
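The bookkeeping behind this behaviour is commonly a per-block state plus a bit vector of sharers, as slide 87 notes. The sketch below shows only that bookkeeping for a read miss and a write miss; send_data(), send_invalidate(), and the fetch message are stand-ins for real interconnect traffic, and it assumes GCC builtins and at most 64 processors.

    /* Directory-entry bookkeeping sketch: state + sharer bit vector per block. */
    #include <stdio.h>
    #include <stdint.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } dir_state;

    struct dir_entry {
        dir_state state;
        uint64_t  sharers;        /* bit i set => processor i holds a copy */
    };

    static void send_data(int p)       { printf("data value reply to P%d\n", p); }
    static void send_invalidate(int p) { printf("invalidate to P%d\n", p); }

    void dir_read_miss(struct dir_entry *e, int p)
    {
        if (e->state == EXCLUSIVE)                 /* fetch the latest copy from the owner */
            printf("fetch from owner P%d\n", (int)__builtin_ctzll(e->sharers));
        e->state = SHARED;
        e->sharers |= 1ull << p;
        send_data(p);
    }

    void dir_write_miss(struct dir_entry *e, int p)
    {
        for (int i = 0; i < 64; i++)               /* invalidate every other sharer */
            if (((e->sharers >> i) & 1ull) && i != p)
                send_invalidate(i);
        e->state = EXCLUSIVE;
        e->sharers = 1ull << p;
        send_data(p);
    }

    int main(void)
    {
        struct dir_entry e = { UNCACHED, 0 };
        dir_read_miss(&e, 1);
        dir_write_miss(&e, 2);
        return 0;
    }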
  • 90. 90 CPU-Cache State Machine ● State machine for CPU requests for each memory block; a block is Invalid if it resides only in memory. [State diagram: a CPU read miss sends a read miss message and moves the block from Invalid to Shared; a CPU write sends a write miss message to the home directory and moves Invalid or Shared to Exclusive; CPU read and write hits stay in place; an invalidate (or a miss due to an address conflict) returns the block to Invalid; a fetch or fetch/invalidate from the directory causes the cache to send a data write back message to the home directory.]
  • 91. 91 State Transition Diagram for the Directory ● Tracks all copies of a memory block. ● Same states as the transition diagram for an individual cache. ● Memory controller actions: – Update of directory state – Send messages to satisfy requests. – Also indicates an action that updates the sharing set, Sharers, as well as sending a message.
  • 92. 92 Directory State Machine ● State machine for directory requests for each memory block; a block is Uncached if it resides only in memory. [State diagram: a read miss on an Uncached or Shared block adds the requester to Sharers and sends a data value reply; a write miss sends invalidates to the current Sharers (or a fetch/invalidate to the owner), sets Sharers = {P}, sends a data value reply, and moves the block to Exclusive; a read miss on an Exclusive block sends a fetch to the owner, adds the requester to Sharers, and moves the block to Shared; a data write back empties Sharers and returns the block to Uncached.]
  • 93. 93 Example P1 P2 Bus Directory Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value P1: Write 10 to A1 P2: Read A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 A1 and A2 map to the same cache block Processor 1 Processor 2 Interconnect Memory Directory
  • 94. 94 Example P1 P2 Bus Directory Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1} Excl. A1 10 DaRp P1 A1 0 P1: Read A1 P2: Read A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 A1 and A2 map to the same cache block Processor 1 Processor 2 Interconnect Memory Directory
  • 95. 95 Example P1 P2 Bus Directory Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1} Excl. A1 10 DaRp P1 A1 0 P1: Read A1 Excl. A1 10 P2: Read A1 P2: Write 40 to A2 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 A1 and A2 map to the same cache block Processor 1 Processor 2 Interconnect Memory Directory
  • 96. 96 Example A1 and A2 map to the same cache block P1 P2 Bus Directory Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1} Excl. A1 10 DaRp P1 A1 0 P1: Read A1 Excl. A1 10 P2: Read A1 Shar.A1 RdMs P2 A1 Shar.A1 10 Ftch P1 A1 10 10 Shar.A1 10 DaRp P2 A1 10 A1Shar. {P1,P2} 10 10 10 P2: Write 40 to A2 10 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Interconnect Memory Directory Write Back
  • 97. 97 Example P2: Write 20 to A1 A1 and A2 map to the same cache block P1 P2 Bus Directory Memory step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1} Excl. A1 10 DaRp P1 A1 0 P1: Read A1 Excl. A1 10 P2: Read A1 Shar.A1 RdMs P2 A1 Shar. A1 10 Ftch P1 A1 10 10 Shar.A1 10 DaRp P2 A1 10 A1Shar. {P1,P2} 10 Excl. A1 20 WrMs P2 A1 10 Inv. Inval. P1 A1 A1 Excl. {P2} 10 P2: Write 40 to A2 10 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Interconnect Memory Directory A1
  • 98. 98 Example P2: Write 20 to A1 A1 and A2 map to the same cache block P1 P2 Bus Directory Memo step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value P1: Write 10 to A1 WrMsP1 A1 A1 Ex {P1} Excl. A1 10 DaRp P1 A1 0 P1: Read A1 Excl. A1 10 P2: Read A1 Shar.A1 RdMsP2 A1 Shar.A1 10 Ftch P1 A1 10 10 Shar.A1 10 DaRp P2 A1 10 A1Shar. {P1,P2} 10 Excl.A1 20 WrMsP2 A1 10 Inv. Inval. P1 A1 A1 Excl. {P2} 10 P2: Write 40 to A2 WrMsP2 A2 A2 Excl. {P2} 0 WrBk P2 A1 20 A1Unca. {} 20 Excl.A2 40 DaRp P2 A2 0 A2 Excl. {P2} 0 P1: Read A1 P2: Read A1 P1 Write 10 to A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Interconnect Memory Directory A1
  • 99. 99 Implementation Issues in Directory-Based Protocols ● When the number of processors is large: – The directory can become a bottleneck. – Directories can be distributed among different memory modules. – Different directory accesses then go to different locations.
  • 101. 101 Multiprocessor Programming Support ● A key programming support provided by a processor: – Synchronization. ● Why Synchronize? – To let different processes use shared data when it is safe. ● In uniprocessors, synchronization support is provided through: – Atomic “fetch and update” instructions.
  • 102. 102 Objectives of Synchronization Algorithms ● Reduce latency: – How quickly can an application get the lock in the absence of competition? ● Reduce waiting time ● Reduce contention: – How to design a scheme to reduce the contention?
  • 103. 103 Synchronization ● Other atomic operations to read-modify-write a memory location: ● test-and-set ● fetch-and-store(R, <mem>) ● fetch-and-add(<mem>, <value>) ● compare-and-swap(<mem>, <cmp-val>, <stor-val>)
  • 104. 104 Popular Atomic Synchronization Primitives ● Atomic exchange: interchange a value in a register for a value in memory: 0 => synchronization variable is free 1 => synchronization variable is locked and unavailable – Set register to 1 & swap ● Can be used to implement other synchronization primitives.
  • 105. 105 Synchronization Primitives cont... ● Test-and-set: Tests a value and sets it: – Only if the value passes the test. ● Fetch-and-increment: It returns the value of a memory location and atomically increments it. – 0 => synchronization variable is free
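On current processors these primitives map almost directly onto compiler atomic builtins. The sketch below expresses atomic exchange, test-and-set, fetch-and-increment, and compare-and-swap with GCC/Clang __atomic builtins, following the slides' convention that 0 means free and 1 means locked; it is illustrative, not a recommendation of a particular memory-ordering choice.

    /* The primitives above, expressed with GCC/Clang __atomic builtins. */
    static inline int atomic_exchange_int(volatile int *mem, int newval)
    {
        return __atomic_exchange_n(mem, newval, __ATOMIC_SEQ_CST);
    }

    static inline int test_and_set(volatile int *mem)
    {
        /* returns the old value; the lock is acquired iff it was 0 */
        return __atomic_exchange_n(mem, 1, __ATOMIC_SEQ_CST);
    }

    static inline int fetch_and_increment(volatile int *mem)
    {
        return __atomic_fetch_add(mem, 1, __ATOMIC_SEQ_CST);
    }

    static inline int compare_and_swap(volatile int *mem, int cmp, int stor)
    {
        /* returns nonzero on success, as with the CAS primitive listed earlier */
        return __atomic_compare_exchange_n(mem, &cmp, stor, 0,
                                           __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }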
  • 106. 106 Synchronization Issues ● In multiprocessors: –Traditional atomic “fetch and update” operations are inefficient. –One of the culprits is a coherent cache. ● As a result, for SMPs: –Synchronization can become a bottleneck. –Techniques to reduce contention and latency of synchronization required.
  • 107. 107 Synchronization Algorithms ● Locks – Exclusive locks – Shared locks ● Barriers
  • 108. 108 Synchronization Primitive for SMPs ● Atomic exchange in SMP: – Very inefficient to have both read and write in the same instruction. – Use separate instructions instead. ● Load linked (LL) + store conditional (SC) – Load linked returns the initial value. – Store conditional returns: ● 1 if it succeeds (no other store to same memory location since writing) ● 0 otherwise
  • 109. 109 Atomic Exchange using LL and SC ● New value in the store register indicates success in getting lock: – 0 if the processor succeeded in setting the lock (first processor to set lock). – 1 if other processor had already claimed access. – The central idea is to make exchange operation indivisible.
  • 110. 110 LL and SC ● LL and SC should execute “effectively atomically”: – As if the load and store together are completed atomically. – No other store to the same location, no context switch, interrupts, etc. ● Implemented through a link register: – Stores address of instruction doing LL – Invalidated if any other instruction does SC – Invalidated under process switch.
  • 111. 111 Permitted Instructions Between LL and SC ● Care must be taken about which instructions can be placed between an LL/SC pair: – Only a few register-register instructions can safely be used. – Otherwise it is possible to have a starvation situation where an SC instruction is never successful.
  • 112. 112 Other Synchronization Primitives Using LL and SC ● Atomic swap: try: or R3,R4,R0 ; mov exchange value ll R2,0(R1) ; load linked sc R3,0(R1) ; store conditional beqz R3,try ; branch store fails (R3 = 0) mov R4,R2 ; put load value in R4 ● Fetch & increment: try: ll R2,0(R1); load linked daddui R2,R2,#1; increment sc R2,0(R1); store conditional beqz R2,try; branch store fails (R2 = 0)
  • 113. 113 Spin Locks ● Sometimes a process or thread needs a certain data item for a very short time: – E.g. updating a counter value. ● If a traditional lock is used in this case: – The contending processes would suffer a context switch. – This would be far more expensive (1000s of cycles) than if the contending processes had simply busy-waited (10s of cycles).
  • 114. 114 Spin Lock Illustration ● A spin lock leads to busy wait for P2: – Prevents context switch. [Figure: P1 holds the spin lock and executes the critical section while P2 busy-waits on the lock]
  • 115. 115 Spin Lock Implementation daddui R2,R0,#1 lockit: exch R2,0(R1) ; atomic exchange bnez R2,lockit ; already locked? ● Not very efficient: – On each write attempt to 0(R1), a memory transaction is generated. ● Can be made more efficient by using cache coherency: – Spin on a cached copy of the lock variable until the value changes from 1 to 0. – Bus transactions can be avoided.
  • 116. 116 Problem With the Spin Lock Algorithm ● Frequent polling gets you the lock faster: – But slows everyone else down. ● An efficient scheme: – Poll on a cached copy.
  • 117. 117 An Efficient Spin Lock Implementation ● Problem: Every exchange includes a write, – Invalidates all other copies; – Generates considerable bus traffic. ● Solution: start by simply repeatedly reading the variable; when it changes, then try storing: ● lockit: ll R2,0(R1) ; load var bnez R2,lockit ; not free => spin daddui R2,R0,#1 ; locked value sc R2,0(R1) ; store beqz R2,lockit ; store failed => retry
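The same spin-then-try idea can be written portably with the builtins used earlier: spin on an ordinary (cached) read and attempt the atomic exchange only when the lock looks free, i.e. a test-and-test-and-set lock. A minimal sketch with no backoff or fairness.

    /* Test-and-test-and-set spin lock: spin on a plain (cached) read and only
     * issue the atomic exchange when the lock appears free. GCC builtins assumed. */
    void spin_lock(volatile int *lock)
    {
        for (;;) {
            while (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0)
                ;                                            /* spin on the cached copy */
            if (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) == 0)
                return;                                      /* got the lock */
        }
    }

    void spin_unlock(volatile int *lock)
    {
        __atomic_store_n(lock, 0, __ATOMIC_RELEASE);         /* a single write releases it */
    }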
  • 118. 118 Example ● 10 processors simultaneously try to set a spin lock at time 0: – Determine the number of bus transactions required for all processors to acquire the lock once.
  • 119. 119 Solution ● Assume that at a certain time, i processes are contending: – i ll operations – i sc operations – 1 store to release the lock. ● Total of 2*i+1 bus transactions ● Summing over i = 1..n: Σ(2*i+1) = n*n + 2*n
  • 120. 120 Barriers ● A set of n processes all leave the synchronization region at once: – “at the same time” is hard in a parallel system. – Sufficient if no process leaves until all processes arrive. ● Can be achieved by a busy wait on shared memory: – Creates a large number of bus transactions ● To avoid this: – Use a cache update protocol. – Processors spin on the cached value.
  • 121. 121 Barrier Implementation ● Can be implemented using 2 spin locks: – One to protect the counter. – One to hold the processes until the last process arrives at the barrier. ● Assume that lock and unlock provide the basic spin locks.
  • 122. 122 Barrier Implementation ● lock(counterlock); ● if(count==0) release=0; ● count++; ● unlock(counterlock); ● if(count==total){ – count=0; release=1;} ● else spin(release==1);
  • 123. 123 Barrier Implementation: Loop Hole ● It is possible that one process races ahead: – The fast process resets the release flag and traps the remaining processes. ● Solution: – Sense-reversing barrier (read up).
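The fix referred to above is the sense-reversing barrier: each barrier episode uses the opposite value of a per-process sense flag, so a fast process that re-enters the barrier cannot trap the others. A minimal sketch reusing the spin_lock()/spin_unlock() routines from the earlier sketch; counterlock, count, and release mirror the names on slide 122.

    /* Sense-reversing barrier sketch. local_sense is per process. */
    static volatile int counterlock;   /* protects count               */
    static volatile int count;         /* processes that have arrived  */
    static volatile int release;       /* current barrier "sense"      */

    void barrier(int *local_sense, int total)
    {
        *local_sense = !*local_sense;          /* each episode flips the expected sense */

        spin_lock(&counterlock);
        count++;
        int arrived = count;
        spin_unlock(&counterlock);

        if (arrived == total) {                /* last arrival resets and releases all  */
            count = 0;
            release = *local_sense;
        } else {
            while (release != *local_sense)    /* spin until the last process arrives   */
                ;
        }
    }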
  • 124. 124 Example ● Assume 10 processes try to synchronize by executing a barrier simultaneously. – Determine the number of bus transactions required for the processes to reach and leave the barrier.
  • 125. 125 Solution ● For the ith process: – The number of bus transactions is 3*i+4 ● For n processes: – Σ(3*i+4), i = 1..n, = (3*n*n+11*n)/2 - 1
  • 126. 126 Efficient Implementation of Barrier ● We need a primitive to efficiently increment the barrier count. – Queuing locks can be used for improving the performance of a barrier.
  • 127. 127 Queuing Locks ● Each arriving processor is kept track of in a queue structure. – Signal the next waiter when a process is done. [Figure: a flags array indexed 0..p-1, each entry holding has-lock or must-wait; queuelast points to the next free slot; one entry marks the current lock holder]
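A minimal array-based queuing lock along the lines of the figure: each contender claims a slot with an atomic fetch-and-increment of queuelast and spins only on its own flag, and the releaser hands the lock to the next slot. This assumes GCC atomic builtins and a fixed maximum number of contenders; in practice each flag would also be padded to its own cache line.

    /* Array-based queuing lock sketch (has-lock / must-wait flags per slot). */
    #define NPROC 64
    enum { MUST_WAIT = 0, HAS_LOCK = 1 };

    struct qlock {
        volatile int flags[NPROC];      /* one slot per possible waiter       */
        volatile unsigned queuelast;    /* next slot to hand out              */
    };

    void qlock_init(struct qlock *l) {
        for (int i = 0; i < NPROC; i++) l->flags[i] = MUST_WAIT;
        l->flags[0] = HAS_LOCK;
        l->queuelast = 0;
    }

    unsigned qlock_acquire(struct qlock *l) {
        unsigned me = __atomic_fetch_add(&l->queuelast, 1, __ATOMIC_ACQ_REL) % NPROC;
        while (l->flags[me] == MUST_WAIT)       /* spin only on my own flag */
            ;
        return me;                              /* caller passes this slot to release */
    }

    void qlock_release(struct qlock *l, unsigned me) {
        l->flags[me] = MUST_WAIT;                   /* reset my slot for reuse        */
        l->flags[(me + 1) % NPROC] = HAS_LOCK;      /* signal the next waiter in turn */
    }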
  • 129. 129 Introduction ● If you were plowing a field, which of the following would you rather use: A strong ox or 1024 chickens? --- Seymour Cray ● The answer would be different if you are considering a computing problem.
  • 130. 130 Multithreading Within a Single Processor ● Until now, we considered multiple threads of an application running on different processors: – Can multiple threads execute concurrently on the same processor? Yes ● Why is this desirable? – Inexpensive – one CPU. – Faster communication among threads.
  • 131. 131 Why Does Multithreading within a Processor Make Sense? ● Superscalar processors are now common place. ● Most of a processor’s functional units can’t find enough work on an average: – Peak IPC is 6, average IPC is 1.5! ● Threads share resources: – We can execute a number of threads without a corresponding linear increase in chip area.
  • 132. 132 Analysis of Idle Cycles in a Superscalar Processor ● Issues multiple instructions every cycle. – Typically 4. ● Several functional units of each type: – Adders, Multipliers, Floating Point units, etc. – Many functional units are idle in many cycles. – Especially true when there is a cache miss. ● Dispatcher reads instructions, decides which can run in parallel: – Number of instructions limited by instruction dependencies and long-latency operations
  • 133. 133 Analysis of Processor Inefficiency • Vertical waste is introduced when the processor issues no instructions in a cycle. • Horizontal waste is introduced when not all issue slots can be filled in a cycle. • 61% of the wasted cycles are vertical waste on average. [Figure: grid of issue slots over successive cycles, with X marking a filled slot; in the example, vertical waste = 12 slots and horizontal waste = 9 slots]
  • 134. 134 Multithreading: A Pictorial Explanation • Rather than enlarging the depth of the instruction window (more speculation with lower confidence!): – Enlarge its “width”. • Fetch from multiple threads. [Figure: issuing ever deeper past a chain of branches in one thread vs. widening issue by fetching from multiple threads]
  • 135. 135 Multithreading ● Essentially a latency hiding technique: – Hides stalls due to cache misses. – Hides stalls due to data dependency. ● Under cache miss or data dependency stalls: – Multithreading provides work to functional units, keeps them busy.
  • 136. 136 Basic Support for Multithreading ● Multiple states (contexts) required to be maintained at the same time. ● One set per each thread: – Program Counter – Register File (and Flags) – Per thread renaming table. ● Since register renaming provides unique register identifiers: – Instructions from multiple threads can be mixed in the data path.
  • 137. 137 Multithreading Support in Uniprocessors ● In the most basic form: – Processor interleaves execution of instructions from different threads. ● Three types of thread scheduling: – Coarse-grained multithreading – Fine-grained multithreading – Simultaneous multithreading
  • 138. 138 Coarse-Grained Multithreading ● A selected thread continues to run: – Thread switch occurs only when an active thread undergoes long stall (L2 cache miss etc.) – This form of multithreading only hides long latency events. ● Easy to implement: – But, requirement of pipeline flushing on thread switch makes it inefficient.
  • 140. 140 Coarse-grained Multithreaded Processors ● Example: Sun SPARC II Processor – Provides hardware context for 4 threads – One thread reserved for interrupt handling – Register windows provide fast switching between 4 sets of 32 GPRs. ● Used in cache-coherent DSMs: – On a cache miss to a remote memory (takes 100s of cycles) switch to a different thread. – Network messages etc are handled by the interrupt handler thread.
  • 141. 141 Fine-Grained Multithreading ● Few active threads: – Context switch among the active threads on every clock cycle. – Occupancy of the execution core would be much higher. ● Issue instructions only from a single thread in a cycle: – Again may not find enough work every cycle, but cache misses can be tolerated.
  • 142. 142 Fine-Grained Multithreading ● Hides both long and short latency events. ● Vertical waste is eliminated but horizontal waste is not. – If a thread has few or no operations to execute, its issue slots will be underutilized.
  • 144. 144 Simultaneous Multithreading (SMT): An Overview • Converts thread-level parallelism: –Into instruction-level parallelism. • Issues instructions from multiple threads in the same cycle. –Has the highest probability of finding work for every issue slot. • Called Hyper-threading by Intel.
  • 145. 145 Simultaneous Multithreading: A Conceptual Understanding ● 4-way superscalar: peak throughput 4 IPC. [Figure: issue-slot diagrams comparing a plain superscalar, fine-grained multithreading, and simultaneous multithreading, with each slot shaded by thread (threads 1-4) or idle]
  • 146. 146 Differences Among Multithreaded Architectures – Fine-grained: shares all resources except the register file and control logic/state; context switch every cycle. – Coarse-grained: shares all but the instruction fetch buffers, register files and control logic/state; context switch on long stalls. – SMT: shares all but the instruction fetch buffer, return address stack, register files, control logic/state, reorder buffer and store queue; no switching, all contexts active.
  • 147. 147 SMT-Advantages ● Two main performance limitations of multithreading: – Memory stalls – Pipeline flushes due to incorrect speculation. ● In SMTs: – Multiple threads are simultaneously executed, can hide both these problems.
  • 148. 148 Anatomy of an SMT Processor • Multiple “logical” CPUs. • One physical CPU: – ~5% extra silicon to duplicate thread state information. • Better than single threading: – Increased thread-level parallelism. – Improved processor utilization when one thread blocks. • Not as good as two physical CPUs: – CPU resources are shared, not replicated.
  • 150. 150 Some Issues in SMT ● To achieve multithreading: – Extend, replicate, and redesign some units of a superscalar processor. ● Resources replicated: – States of hardware contexts (registers, PCs) – Per-thread mechanisms for pipeline flushing and subroutine returns. – Per-thread branch target buffer and translation lookaside buffer.
  • 151. 151 SMT Issues ● Resources to be redesigned: – Instruction fetch unit. – Processor pipeline. ● Instruction Scheduling: – Does not require additional hardware. – Register renaming same as in superscalar processors.
  • 152. 152 Superscalar Architecture [Block diagram: an instruction fetch & decode unit (multiple instructions per cycle), a PC, a register file, reservation stations feeding an FP unit, two ALUs, a branch unit and a load/store unit over multiple buses, and a commit unit retiring multiple instructions per cycle]
  • 153. 153 Simultaneous Multithreading: Block Diagram [Block diagram: the same superscalar datapath, but with multiple PCs and register files, one per hardware thread context]
  • 154. 154 Simultaneous Multithreading: A Model ● Instruction Fetch Unit: – Fetch instructions for 2 threads each cycle. – Decode 1 thread till a branch/end of cache line, then jump to the other. – Highest priority to threads with the fewest instructions in the decode, renaming, and queue pipeline stages. – Small hardware addition to track queue lengths.
  • 155. 155 Simultaneous Multithreading: Model ● Register File: – Each thread has 32 registers. – Register file size: 32 * #threads + rename registers. ● Con: Large register file => longer access time.
  • 156. 156 Simultaneous Multithreading: Model Pipeline Format • Superscalar • SMT
  • 157. 157 Simultaneous Multithreading: Model Pipeline Format ● To avoid increase in clock cycle time: –SMT pipeline extended to allow 2 cycle register reads and writes. ● 2 cycle reads/writes increase branch misprediction penalty.
  • 158. 158 Simultaneous Multithreading: What to Issue? ● Not exactly the same as superscalars: – In a superscalar: oldest is the best: least speculation. – In SMT not so clear: ● Branch-speculation optimism may vary across threads. ● Based on this the selection strategies: – Oldest first. – Branch speculated last etc…
  • 159. 159 Simultaneous Multithreading: Compiler Optimizations ● Should try to minimize cache interference. ● Latency hiding techniques like speculation should be enhanced. ● Sharing optimization techniques from multiprocessors changed: – Data sharing is now good.
  • 160. 160 Caching in SMT ● Same cache shared among threads: –Performance degradation due to cache sharing. –Possibility of cache thrashing.
  • 161. 161 Performance Implications of SMT ● Single thread performance is likely to go down: – Caches, branch predictors, registers, etc. are shared. ● This effect can be mitigated by trying to prioritize one thread. ● With eight threads in a processor with many resources: – SMT can yield throughput improvements of roughly 2-4.
  • 162. 162 Commercial Examples ● Compaq Alpha 21464 (EV8) – 4T SMT, June 2001 ● Intel Pentium IV (Xeon) – 2T SMT, 2002 – 10-30% gains reported ● SUN Ultra IV – 2-core, 2T SMT ● IBM POWER5 – Dual processor core – 8-way superscalar, SMT – 24% area growth per core for SMT
  • 163. 163 Pentium4: Hyper- Threading ● Two threads: – The operating system operates as if it is executing on a two-processor system. ● When only one available thread: – Pentium 4 behaves like a regular single- threaded superscalar processor. ● Intel claims 30% performance improvements.
  • 164. 164 Intel MultiCore Architecture ● Improving the execution rate of a single thread is still considered important: – Out-of-order execution and speculation. ● MultiCore architecture: – Can reduce power consumption. – Its pipeline (14 stages) is closer to the Pentium M (12 stages) than to the P4 (30 stages). ● Many transistors invested in large branch predictors: – To reduce wasted work (power).
  • 165. 165 Processor Power Consumption “Surpassed hot-plate power density in 0.5µm; Not too long to reach nuclear reactor,” - Former Intel Fellow Fred Pollack.
  • 166. 166 Intel’s Dual Core Architectures ● The Pentium D is simply two Pentium 4 cpus: – Inefficiently paired together to run as dual core. ● The Core Duo is Intel's first generation dual core processor based upon the Pentium M (a Pentium III-4 hybrid): – Made mostly for laptops and is much more efficient than Pentium D. ● The Core 2 Duo is Intel's second generation (hence, Core 2) processor: – Made for desktops and laptops designed to be fast while not consuming nearly as much power as previous CPUs. ● Intel has now dropped the Pentium name in favor of the Core architecture.
  • 168. 168 Intel Core 2 Duo • Code-named “Conroe”. • Homogeneous cores. • Bus-based on-chip interconnect. • Shared on-die cache memory. • Classic OOO core: reservation stations, issue ports, schedulers, etc. • Large, shared, set-associative L2 cache with prefetch, etc. (Source: Intel Corp.)
  • 169. 169 Intel’s Core 2 Duo ● Launched in July 2006. ● Replacement for Pentium 4 and Pentium D CPUs. ● Intel claims: – Conroe provides 40% more performance at 40% less power compared to the Pentium D. ● All Conroe processors are manufactured with 4 MB L2 cache: – Due to manufacturing defects, the E6300 and E6400 versions based on this core have half their cache disabled, leaving them with only 2 MB of usable L2 cache.
  • 170. 170 Intel Core Processor Specification ● Speeds: 1.06 GHz to 3 GHz ● FSB speeds: 533 MT/s to 1333 MT/s ● Process: 0.065 µm (65 nm MOSFET channel length) ● Instruction sets: x86, MMX, SSE, SSE2, SSE3, SSSE3, x86-64 (SSE = Streaming SIMD Extensions) ● Microarchitecture: Intel Core microarchitecture
  • 172. 172 Why Share the On-Die L2? • What happens when the L2 is too large?
  • 174. 174 Commercial Examples: IBM POWER5 ● SMT added to the superscalar microarchitecture. ● Additional Program Counter (PC). ● GPR/FPR rename mapper expanded to map a second set of registers. ● Completion logic replicated to track two threads.
  • 175. 175 Commercial Examples: IBM POWER5 ● Includes: 1. Thread Priority Mechanism: 8 levels. 2. Dynamic Thread Switching: ● Used if no instruction ready to run. ● Allocates all machine resources to one thread at any time.
  • 176. 176 Sun’s Niagara ● Commercial servers require high thread-level throughput: – They suffer from frequent cache misses. ● Sun’s Niagara focuses on: – Simple cores (low power, low design complexity, can accommodate more cores). – Fine-grain multithreading (to tolerate long memory latencies); see the sketch below.
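A toy model of the fine-grain multithreading idea: each cycle the core issues from the next ready thread in round-robin order, so threads stalled on a memory miss are simply skipped. The thread count and stall pattern below are invented for illustration and do not describe the actual Niagara pipeline.

```c
#include <stdio.h>

/* Toy model of fine-grain multithreading: each cycle, issue from the next
 * ready thread in round-robin order; a thread stalled on a miss is skipped.
 * The stall pattern is invented purely for illustration. */
#define THREADS 4

int main(void) {
    int ready_at[THREADS] = {0, 3, 0, 6};  /* cycle at which each thread becomes ready */
    int next = 0;
    for (int cycle = 0; cycle < 8; cycle++) {
        int issued = -1;
        for (int k = 0; k < THREADS; k++) {
            int t = (next + k) % THREADS;
            if (cycle >= ready_at[t]) { issued = t; next = (t + 1) % THREADS; break; }
        }
        if (issued >= 0)
            printf("cycle %d: issue from thread %d\n", cycle, issued);
        else
            printf("cycle %d: all threads stalled\n", cycle);
    }
    return 0;
}
```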
  • 178. 178 Xeon and Opteron [Block diagrams: dual-core Xeon CPUs sharing a front-side bus and memory controller hub vs. dual-core Opteron CPUs with Direct Connect links] ● Legacy x86 architecture: – 20-year-old front-side bus architecture. – CPUs, memory, and I/O all share a bus. – A bottleneck to performance: faster CPUs or more cores ≠ more performance. ● AMD64: – Direct Connect Architecture eliminates the FSB bottleneck.
  • 179. 179 Xeon vs. Opteron [Block diagram comparing a four-socket dual-core Xeon system (shared front-side bus, memory controller hub, XMB memory buffers, PCI-E bridges) with a four-socket dual-core Opteron system (Direct Connect)]
  • 180. 180 Reducing Power and Cooling Requirements with Processor Performance States ● AMD PowerNow!™ Technology with Optimized Power Management: – Multiple performance states (P-states) for optimized power management. – Dynamically reduces processor power based on workload. – Lowers power consumption without compromising performance. – Up to 75% processor power savings. ● Example: AMD Opteron™ processor 2218 series (highest to lowest P-state):
P0: 2600 MHz, 1.35 V, ~95 W
P1: 2400 MHz, 1.30 V, ~80 W
P2: 2200 MHz, 1.25 V, ~66 W
P3: 2000 MHz, 1.20 V, ~55 W
P4: 1800 MHz, 1.15 V, ~51 W
P5: 1000 MHz, 1.10 V, ~34 W
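The "up to 75%" figure can be sanity-checked against the table above. The sketch below computes the saving between the highest and lowest P-states listed; with these rounded numbers it comes out to roughly 64%, so the 75% claim presumably refers to a different workload or measurement point.

```c
#include <stdio.h>

/* Approximate figures taken from the P-state table above. */
struct pstate { const char *name; int mhz; double watts; };

int main(void) {
    struct pstate p0 = {"P0", 2600, 95.0};
    struct pstate p5 = {"P5", 1000, 34.0};
    double saving = 100.0 * (p0.watts - p5.watts) / p0.watts;
    /* About 64% with these rounded numbers; the slide quotes "up to 75%". */
    printf("dropping from %s to %s saves ~%.0f%% processor power\n",
           p0.name, p5.name, saving);
    return 0;
}
```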
  • 181. 181 Summary Cont… • ILP now appears fully exploited: – For the last decade or so, the focus has been on thread- and process-level parallelism. • Multiprocessors progressed from add-on cards to chips on the motherboard: – Now available as multicore.
  • 182. 182 Summary Cont… ● Major issues in multiprocessors: – Cache coherency and synchronization. ● Cache coherency: – The copies of data blocks in individual caches may become inconsistent.
  • 183. 183 Summary Cont… ● Cache coherency: two popular protocols: – Snooping: suitable for SMPs. – Directory-based: suitable for NUMA machines. ● Multithreading in uniprocessors is another promising approach: – Simultaneous multithreading (SMT).
  • 184. 184 Future Trends ● Simultaneous and Redundantly Threaded Processors (SRT): – Increase reliability with fault detection and correction. – Run multiple copies of the same program simultaneously (a sketch follows below).
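A user-level sketch of the redundant-execution idea behind SRT: run the same computation twice and compare the results to detect a transient fault. Real SRT does this in hardware with leading and trailing threads; the pthreads code below only demonstrates the principle, and the workload is arbitrary.

```c
#include <pthread.h>
#include <stdio.h>

/* Run the same computation in two threads and compare the results;
 * a mismatch would indicate a transient fault in one of the copies. */
static long work(long n) {
    long sum = 0;
    for (long i = 1; i <= n; i++) sum += i * i;
    return sum;
}

static void *run(void *arg) {
    *(long *)arg = work(100000);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    long r1 = 0, r2 = 0;
    pthread_create(&t1, NULL, run, &r1);
    pthread_create(&t2, NULL, run, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    if (r1 != r2)
        printf("fault detected: %ld != %ld\n", r1, r2);
    else
        printf("results agree: %ld\n", r1);
    return 0;
}
```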
  • 185. 185 Future Trends ● Software Pre-Execution in SMT: – In some cases the data address is extremely hard to predict. – Use an idle thread of an SMT processor for pre-execution (a sketch follows below). ● Speculation: – More advanced techniques for speculation.
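A minimal sketch of software pre-execution: a helper thread runs ahead of the main thread along a pointer chain and issues prefetches so the main thread's loads are more likely to hit in the cache. The data structure and sizes are illustrative, and __builtin_prefetch is a GCC/Clang builtin, not part of standard C.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Helper thread walks the list ahead of the main thread, issuing
 * prefetches for the upcoming nodes (structure and sizes are illustrative). */
struct node { struct node *next; long payload; };

static void *prefetch_thread(void *arg) {
    for (struct node *p = arg; p != NULL; p = p->next)
        __builtin_prefetch(p->next);   /* GCC/Clang builtin */
    return NULL;
}

int main(void) {
    enum { N = 1000000 };
    struct node *nodes = malloc(N * sizeof *nodes);
    for (int i = 0; i < N; i++) {
        nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
        nodes[i].payload = i;
    }

    pthread_t helper;
    pthread_create(&helper, NULL, prefetch_thread, nodes);

    long sum = 0;                      /* main thread consumes the list */
    for (struct node *p = nodes; p != NULL; p = p->next)
        sum += p->payload;

    pthread_join(helper, NULL);
    printf("sum = %ld\n", sum);
    free(nodes);
    return 0;
}
```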
  • 186. 186 References [1] J.L. Hennessy and D.A. Patterson, “Computer Architecture: A Quantitative Approach,” 3rd Edition, Morgan Kaufmann Publishers, 2003. [2] John Paul Shen and Mikko Lipasti, “Modern Processor Design,” Tata McGraw-Hill, 2005. [3] S. McFarling, “Program Optimization for Instruction Caches,” Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 183-191, April 1989.
  • 187. 187 References ● D. Tullsen, S. Eggers, and H. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA, 1995. ● Miquel Peric, “Simultaneous Multithreading: Present Developments and Future Directions,” June 2003. ● IBM, “Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor,” Aug 2004. ● S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen, “Simultaneous Multithreading: A Platform for Next-Generation Processors,” IEEE Micro, October 1997.