3. 3
Introduction
● Initial computer performance improvements
came from use of:
– Innovative manufacturing techniques.
● In later years,
– Most improvements came from exploitation of ILP.
– Both software and hardware techniques are being
used.
– Pipelining, dynamic instruction scheduling, out-of-order
execution, VLIW, vector processing, etc.
● ILP now appears fully exploited:
– Further performance improvements from ILP
appear limited.
4. 4
Thread and Process-
Level Parallelism
● The way to achieve higher performance:
– Of late, the focus has shifted to exploiting
thread- and process-level parallelism.
● Exploit parallelism existing across
multiple processes or threads:
– Such parallelism cannot be exploited by any ILP processor.
● Consider a banking application:
– Individual transactions can be executed in
parallel.
5. 5
Processes versus Threads
(Oh my!)
● Processes:
– A process is a program in execution.
– An application normally consists of
multiple processes.
● Threads:
– A process consists of one or more
threads.
– Threads belonging to the same process
share data and code space.
7. 7
How can Threads be
Created?
● By using any of the popular
thread libraries:
– POSIX Pthreads
– Win32 threads
– Java threads, etc.
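As an illustration, here is a minimal POSIX Pthreads sketch; the worker function and the thread count are made up for the example:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4              /* assumed number of worker threads */

/* Each thread runs this function; arg carries its id. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL); /* wait for every worker to finish */
    return 0;
}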
8. 8
User Threads
● Thread management done in user
space.
● User threads are supported and
managed without kernel support.
– Invisible to the kernel.
– If one thread blocks, entire
process blocks.
– Limited benefits of threading.
9. 9
Kernel Threads
● Kernel threads supported and
managed directly by the OS.
– Kernel creates Light Weight Processes
(LWPs).
● Most modern OSs support kernel
threads:
– Windows XP/2000
– Solaris
– Linux
– Mac OS, etc.
10. 10
Benefits of Threading
● Responsiveness:
– Threads share code and data.
– Thread creation and switching are
therefore much more efficient than
for processes.
● As an example, in Solaris:
– Creating a thread is about 30x less costly
than creating a process.
– Context switching between threads is about 5x
faster than between processes.
11. 11
Benefits of Threading
cont…
● Truly concurrent execution:
–Possible with processors
supporting concurrent
execution of threads: SMP,
multi-core, SMT
(hyperthreading), etc.
13. 13
A Case for Processor Support
for Thread-level Parallelism
• Using pure ILP, execution unit
utilization is only about 20%-25%:
– Utilization is limited by control dependencies,
cache misses during memory access, etc.
– It is rare for the units to be even
reasonably busy on average.
● In pure ILP:
– At any time only one thread is under
execution.
14. 14
A Case for Processor Support
for Thread-level Parallelism
● Utilization of execution units can be
improved:
– Have several threads under execution:
● called active threads in PIII.
– Execute several threads at the same
time:
● SMP, SMT, and Multi-core processors.
15. 15
Threads in Applications
● Threads are natural to a wide-ranging
set of applications:
– They are often more or less independent.
– Though they do share data among
themselves to some extent.
– They also synchronize among
themselves at times.
16. 16
A Few Thread Examples
● Independent threads occur
naturally in several applications:
– Web server: different HTTP
requests are the threads.
– File server
– Name server
– Banking: independent transactions
– Desktop applications: file loading,
display, computations, etc. can be
threads.
17. 17
Reflection on Threading
● To think of it:
– Threading is inherent to any
server application.
● Threads are also easily
identifiable in traditional
applications:
– Banking, Scientific computations,
etc.
18. 18
Thread-level Parallelism
--- Cons
● Threads have to be identified by
the programmer:
– No rules exist as to what can be a
meaningful thread.
– Threads cannot, in general, be
identified by automatic static or
dynamic analysis of code.
– Burden on programmer: requires
careful thinking and programming.
19. 19
Thread-level Parallelism
--- Cons cont…
● Threads with severe
dependencies:
– May make multithreading an
exercise in futility.
● Also not as “programmer
friendly” as ILP.
20. 20
Thread Vs. Process-
Level Parallelism
● Threads are lightweight (or fine-
grained):
– Threads share address space, data, files, etc.
– Even when the extent of data sharing and
synchronization is low, exploitation of
thread-level parallelism is meaningful only when
communication latency is low.
– Consequently, shared memory architectures
(UMA) are a popular way to exploit thread-
level parallelism.
21. 21
Thread Vs. Process-
Level Parallelism cont…
● Processes are coarse-grained:
– The communication-to-computation
ratio is lower.
– DSM (Distributed Shared
Memory), Clusters, Grids, etc.
are meaningful.
22. 22
Focus of Next Few
Lectures
● Shared memory multiprocessors
– Cache coherency
– Synchronization: spin locks
● The recent phenomenon of threading
support in uniprocessors.
● Distributed memory multiprocessors
– DSM
– Clusters (Discussed in Module 6)
– Grids (Discussed in Module 6)
23. 23
A Broad Classification of
Computers
● Shared-memory multiprocessors
– Also called UMA
● Distributed memory computers
– Also called NUMA:
● Distributed Shared-memory (DSM)
architectures
● Clusters
● Grids, etc.
25. 25
Distributed Memory
Computers
● Distributed memory computers use:
– Message Passing Model
● Explicit message send and receive
instructions have to be written by the
programmer.
– Send: specifies local buffer + receiving
process (id) on remote computer (address).
– Receive: specifies sending process on
remote computer + local buffer to place
data.
26. 26
Advantages of Message-
Passing Communication
● Hardware for communication and
synchronization is much simpler:
– Compared to communication in a shared memory
model.
● Explicit communication:
– Programs simpler to understand, helps to reduce
maintenance and development costs.
● Synchronization is implicit:
– Naturally associated with sending/receiving
messages.
– Easier to debug.
27. 27
Disadvantages of Message-
Passing Communication
● Programmer has to write explicit
message passing constructs.
– Also, precisely identify the
processes (or threads) with which
communication is to occur.
● Explicit calls to operating
system:
– Higher overhead.
28. 28
MPI: A Message Passing
Standard
● A de facto standard developed
by a group of industry and academic
professionals:
– Aim is to foster portability and
widespread use.
● Defines routines, and not
implementations:
– Several free implementations exist.
– Synchronous and asynchronous modes.
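A minimal sketch of the explicit send/receive style described above, using MPI point-to-point calls; the ranks, tag, and value are arbitrary choices for the example:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;   /* local buffer to send */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        /* local buffer to place data, from rank 0, tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}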
29. 29
DSM
● Physically separate memories are
accessed as one logical address space.
● Processors in a multicomputer
system share their memory.
– Implemented by operating system.
● DSM multiprocessors are NUMA:
– Access time depends on the exact
location of the data.
30. 30
Distributed Shared-Memory
Architecture (DSM)
● Underlying mechanism is message
passing:
– Shared memory convenience provided to
the programmer by the operating system.
– Basically, an operating system facility
takes care of message passing implicitly.
● Advantage of DSM:
– Ease of programming
31. 31
Disadvantage of DSM
● High communication cost:
– A program not specifically optimized
for DSM by the programmer will
perform extremely poorly.
– Data (variables) accessed by specific
program segments have to be
collocated.
– Useful only for process-level (coarse-
grained) parallelism.
32. 32
SVM: Shared Virtual
Memory
● Supporting DSM on top of an
inherently message passing
system is inefficient.
● A possible solution is SVM.
● Virtual memory mechanism is used
to share objects at the page level.
33. 33
Communication Overhead:
Example 1
● An application is running on a 32-node
multiprocessor.
● It incurs a latency of 400 ns to handle a
reference (read/write) to remote memory.
● Processor clock rate is 1 GHz; IPC
(instructions per cycle) = 2.
● How much faster will the computation be
if there is no communication, versus if
0.2% of the instructions involve a
reference to a remote memory location?
34. 34
Communication Overhead:
Solution 1
● Base CPI = 1/IPC = 0.5
● Effective CPI with 0.2% remote
references = Base CPI + remote
request rate * remote request cost
● Remote request cost = 400 ns = 400 cycles at 1 GHz, so
effective CPI = 0.5 + 0.002 * 400
= 0.5 + 0.8 = 1.3
● A program having no remote references
will therefore be 1.3/0.5 = 2.6 times faster.
35. 35
Communication Overhead:
Example 2
● An application running on a 32-node DSM.
● It incurs a latency of 400 µs to handle a
reference (read/write) to a remote memory.
● Processor clock rate is 1 GHz; IPC
(instructions per cycle) = 2.
● How much faster will a computation be on the
multiprocessor system of Example 1 compared to the DSM,
if 0.2% of the instructions involve a reference
to a remote memory location? Assume there are no
local memory references.
36. 36
Communication Overhead:
Solution 2
● Base CPI = 0.5
● Effective CPI with 0.2% remote references =
Base CPI + remote request rate * remote
request cost
● Remote request cost = 400 µs = 400,000 cycles at 1 GHz, so
effective CPI = 0.5 + 0.002 * 400,000 = 0.5 + 800 =
800.5
● The multiprocessor would thus be 800.5/1.3 ≈ 616
times faster.
● Performance figures of NUMA may be worse:
– If we take data dependency and synchronization
aspects into consideration.
38. 38
Symmetric Multiprocessors
(SMPs)
● SMPs are a popular shared memory
multiprocessor architecture:
– Processors share Memory and I/O
– Bus based: access time for all memory locations is
equal --- “Symmetric MP”
[Figure: four processors (P), each with a private cache, connected by a shared bus to main memory and the I/O system]
39. 39
SMPs: Some Insights
● In any multiprocessor, main memory
access is a bottleneck:
– Multilevel caches reduce the memory demand
of a processor.
– Multilevel caches in fact make it possible for
more than one processor to meaningfully
share the memory bus.
– Hence multilevel caches are a must in a
multiprocessor!
40. 40
Different SMP
Organizations
● Processor and cache on separate
extension boards (1980s):
– Plugged onto the backplane.
● Integrated on the main board (1990s):
– 4 or 6 processors placed per board.
● Integrated on the same chip (multi-core)
(2000s):
– Dual core (IBM, Intel, AMD)
– Quad core
41. 41
Pros of SMPs
● Ease of programming:
–Especially when communication
patterns are complex or vary
dynamically during execution.
42. 42
Cons of SMPs
● As the number of processors increases,
contention for the bus increases.
– Scalability of the SMP model restricted.
– One way out may be to use switches
(crossbar, multistage networks, etc.)
instead of a bus.
– Switches set up parallel point-to-point
connections.
– Switches, however, are not without
disadvantages: they make implementing
cache coherence difficult.
43. 43
SMPs
● Even programs not using multithreading
(conventional programs):
– Experience a performance increase on SMPs
– Reason: Kernel routines handling interrupts
etc. run on a separate processor.
● Multicore processors are now
commonplace:
– Pentium 4 Extreme Edition, Xeon, Athlon64,
DEC Alpha, UltraSparc…
44. 44
Why Multicores?
● Can you recollect the constraints on
further increase in circuit complexity:
– Clock skew and temperature.
● Use of more complex techniques to
improve single-thread performance is
limited.
● Any additional transistors have to be
used in a different core.
45. 45
Why Multicores?
Cont…
● Multiple cores on the same
physical packaging:
– Execute different threads.
– Switched off, if no thread to
execute (power saving).
– Dual core, quad core, etc.
46. 46
Cache Organizations for
Multicores
● L1 caches are always private to a core
● L2 caches can be private or shared
– which is better?
[Figure: two four-core cache organizations: in the first, each core (P1-P4) has a private L1 and a private L2 cache; in the second, each core has a private L1 but all four cores share a single L2 cache]
47. 47
L2 Organizations
● Advantages of a shared L2 cache:
– Efficient dynamic use of space by each core
– Data shared by multiple cores is not
replicated.
– Every block has a fixed “home” – hence, easy
to find the latest copy.
● Advantages of a private L2 cache:
– Quick access to private L2
– Private bus to private L2, less contention.
48. 48
An Important Problem with
Shared-Memory: Coherence
● When shared data are cached:
– These are replicated in multiple
caches.
– The data in the caches of different
processors may become inconsistent.
● How to enforce cache coherency?
– How does a processor know changes in
the caches of other processors?
50. 50
Cache Coherence Solutions
(Protocols)
● The key to maintain cache coherence:
– Track the state of sharing of every
data block.
● Based on this idea, following can be
an overall solution:
– Dynamically recognize any potential
inconsistency at run-time and carry out
preventive action.
51. 51
Basic Idea Behind Cache
Coherency Protocols
[Figure: bus-based SMP with four processors, each with a private cache, sharing main memory and the I/O system over the bus; the protocol must keep the cached copies consistent]
52. 52
Pros and Cons of the
Solution
● Pro:
–Consistency maintenance becomes
transparent to programmers,
compilers, as well as to the
operating system.
● Con:
–Increased hardware complexity.
53. 53
Two Important Cache
Coherency Protocols
● Snooping protocol:
– Each cache “snoops” the bus to find out
which data is being used by whom.
● Directory-based protocol:
– Keep track of the sharing state of each
data block using a directory.
– A directory is a centralized record of the
sharing state of all memory blocks.
– Allows coherency protocol to avoid
broadcasts.
55. 55
Snooping vs. Directory-
based Protocols
● Snooping protocol reduces memory
traffic.
– More efficient.
● Snooping protocol requires broadcasts:
– Can meaningfully be implemented only when
there is a shared bus.
– Even when there is a shared bus, scalability
is a problem.
– Some workarounds have been tried: the Sun
Enterprise server has up to 4 buses.
56. 56
Snooping Protocol
● As soon as a request for any data block
by a processor is put out on the bus:
– Other processors “snoop” to check if they
have a copy and respond accordingly.
● Works well with bus interconnection:
– All transmissions on a bus are essentially
broadcast:
● Snooping is therefore effortless.
– Dominates almost all small scale machines.
57. 57
Categories of Snoopy
Protocols
● Essentially two types:
– Write Invalidate Protocol
– Write Broadcast Protocol
● Write invalidate protocol:
– When one processor writes to its cache, all
other processors having a copy of that
data block invalidate that block.
● Write broadcast:
– When one processor writes to its cache, all
other processors having a copy of that
data block update that block with the
newly written value.
59. 59
Write Invalidate Protocol
● Handling a write to shared data:
– An invalidate command is sent on bus ---
all caches snoop and invalidate any copies
they have.
● Handling a read Miss:
– Write-through: memory is always up-to-
date.
– Write-back: snooping finds most recent
copy.
60. 60
Write Invalidate in Write
Through Caches
● Simple implementation.
● Writes:
– Write to shared data: broadcast on bus,
processors snoop, and update any copies.
– Read miss: memory is always up-to-date.
● Concurrent writes:
– Write serialization automatically achieved
since bus serializes requests.
– Bus provides the basic arbitration support.
61. 61
Write Invalidate versus
Broadcast cont…
● Invalidate exploits spatial locality:
–Only one bus transaction for any
number of writes to the same block.
–Obviously, more efficient.
● Broadcast has lower latency for
writes and reads:
–As compared to invalidate.
62. 62
An Example Snoopy
Protocol
● Assume:
– Invalidation protocol, write-back cache.
● Each block of memory is in one of the
following states:
– Shared: Clean in all caches and up-to-date
in memory, block can be read.
– Exclusive: cache has the only copy, it is
writeable, and dirty.
– Invalid: Data present in the block obsolete,
cannot be used.
64. 64
Implementation of the
Snooping Protocol
● A cache controller at every processor
would implement the protocol:
– It has to perform specific actions:
● When the local processor requests certain
operations.
● Also, when certain addresses appear on
the bus.
– The exact actions of the cache controller
depend on the state of the cache block.
– Two FSMs can show the different types of
actions to be performed by a controller.
65. 65
Snoopy-Cache State
Machine-I
● State machine considering only CPU
requests, for each cache block.
[Figure: states Invalid, Shared (read only), and Exclusive (read/write)]
– Invalid → Shared: CPU read; place read miss on the bus.
– Invalid → Exclusive: CPU write; place write miss on the bus.
– Shared: CPU read hit; no bus action.
– Shared → Shared: CPU read miss; place read miss on the bus.
– Shared → Exclusive: CPU write; place write miss on the bus.
– Exclusive: CPU read hit or CPU write hit; no bus action.
– Exclusive → Shared: CPU read miss; write back the block, place read miss on the bus.
– Exclusive → Exclusive: CPU write miss; write back the cache block, place write miss on the bus.
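The CPU-side transitions above can be summarized in a small C sketch (purely illustrative: the real controller is hardware, and bus actions are reduced to enum values here):

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum { CPU_READ, CPU_WRITE } CpuRequest;
typedef enum { NO_ACTION, READ_MISS_ON_BUS, WRITE_MISS_ON_BUS,
               WRITE_BACK_THEN_READ_MISS, WRITE_BACK_THEN_WRITE_MISS } BusAction;

/* Returns the bus action and updates the block state for one CPU request. */
BusAction cpu_request(BlockState *state, CpuRequest req, int hit) {
    switch (*state) {
    case INVALID:                               /* always a miss */
        *state = (req == CPU_READ) ? SHARED : EXCLUSIVE;
        return (req == CPU_READ) ? READ_MISS_ON_BUS : WRITE_MISS_ON_BUS;
    case SHARED:
        if (req == CPU_READ)
            return hit ? NO_ACTION : READ_MISS_ON_BUS;  /* read miss: stay Shared */
        *state = EXCLUSIVE;                     /* write: gain ownership */
        return WRITE_MISS_ON_BUS;
    case EXCLUSIVE:
        if (hit)
            return NO_ACTION;                   /* read hit or write hit */
        if (req == CPU_READ) {                  /* conflict miss: evict dirty block */
            *state = SHARED;
            return WRITE_BACK_THEN_READ_MISS;
        }
        return WRITE_BACK_THEN_WRITE_MISS;      /* write miss: stays Exclusive */
    }
    return NO_ACTION;
}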
66. 66
Snoopy-Cache State
Machine-II
● State machine considering only bus
requests, for each cache block.
[Figure: states Invalid, Shared (read only), and Exclusive (read/write)]
– Shared → Invalid: write miss for this block.
– Exclusive → Invalid: write miss for this block; write back the block (abort memory access).
– Exclusive → Shared: read miss for this block; write back the block (abort memory access).
67. 67
Combined Snoopy-Cache
State Machine
● State machine considering both
CPU requests and bus requests for
each cache block.
[Figure: states Invalid, Shared (read only), and Exclusive (read/write); the CPU-initiated transitions of Machine I and the bus-initiated transitions of Machine II (write miss for this block, read miss for this block, with a write back of the block where needed) are superimposed on the same three states]
68. 68
Example
● Operation sequence to be traced (table columns: P1 state/addr/value, P2 state/addr/value, bus action/processor/addr/value, memory addr/value):
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes A1 and A2 map to the same cache block;
the initial cache state is Invalid.
69. 69
Example
● Step 1, P1: Write 10 to A1:
– P1 places a write miss (WrMs) for A1 on the bus.
– P1's block becomes Exclusive with A1 = 10.
Assumes A1 and A2 map to the same cache block.
70. 70
Example
● Step 2, P1: Read A1 (continuing the trace):
– Read hit in P1's cache; the block stays Exclusive with A1 = 10; no bus action.
Assumes A1 and A2 map to the same cache block.
71. 71
Example
● Step 3, P2: Read A1 (continuing the trace):
– P2 places a read miss (RdMs) for A1 on the bus.
– P1 writes back the block (WrBk, A1 = 10); memory now holds A1 = 10; P1's copy becomes Shared.
– P2 receives the data (RdDa) and caches A1 = 10 in the Shared state.
Assumes A1 and A2 map to the same cache block.
72. 72
Example
● Step 4, P2: Write 20 to A1 (continuing the trace):
– P2 places a write miss (WrMs) for A1 on the bus; P1 invalidates its copy.
– P2's block becomes Exclusive with A1 = 20; memory still holds A1 = 10.
Assumes A1 and A2 map to the same cache block.
73. 73
Example
● Step 5, P2: Write 40 to A2 (continuing the trace):
– P2 places a write miss (WrMs) for A2 on the bus.
– Since A2 maps to the same cache block, P2 first writes back A1 = 20 (WrBk); memory now holds A1 = 20.
– P2's block becomes Exclusive with A2 = 40.
Assumes A1 and A2 map to the same cache block,
but A1 != A2.
74. 74
Cache Misses in SMPs
● Overall cache performance is a
combination of:
– Uniprocessor cache misses
– Misses due to invalidations caused by
coherency protocols (coherency misses).
● Changes to some parameters can affect
the two types of misses in different ways:
– Processor count
– Cache size
– Block size
75. 75
Coherence Misses
● The 4th C: Misses occurring due
to coherency protocols.
● Example:
– First write by a processor to a
shared cache block.
– Causes invalidation to establish
ownership of the block.
76. 76
Coherence Misses
● Coherence misses:
– True sharing
– False sharing
● False sharing misses occur
because an entire cache block has
a single valid bit.
– False sharing misses can be avoided
if the unit of sharing is a word.
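A small C sketch of how false sharing arises; the structure names and the 64-byte block size are assumptions for the example, and padding the fields onto separate cache blocks removes these misses:

#include <pthread.h>

/* a and b land in the same cache block: each thread's writes keep
   invalidating the other thread's copy although no data is shared. */
struct { long a; long b; } same_block;

/* Padding (assuming 64-byte blocks and 8-byte longs) puts a and b in
   different blocks; using this struct instead removes the false sharing. */
struct { long a; char pad[56]; long b; } padded;

static void *bump_a(void *arg) {
    for (long i = 0; i < 1000000; i++) same_block.a++;
    return arg;
}
static void *bump_b(void *arg) {
    for (long i = 0; i < 1000000; i++) same_block.b++;
    return arg;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}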
77. 77
Coherence Miss: Examples
Time 1: P1 writes X1 (true sharing miss)
Time 2: P2 reads X2 (false sharing miss)
Time 3: P1 writes X1 (false sharing miss)
Time 4: P2 writes X2 (false sharing miss)
Time 5: P1 reads X2 (true sharing miss)
X1 and X2 belong to the same block
78. 78
Increase in Number of
Processors
● Coherence misses (both True
and False) increase.
● Capacity misses decrease.
● Overall increase in miss rate:
– Resulting in increase in AMAT.
79. 79
Increase in Block Size
● True sharing misses decrease.
– Increase in block size from 32B to
256B reduces true sharing misses
by half.
– Cause: Spatial locality in access.
● Compulsory misses decrease.
● False sharing misses increase.
● Conflict misses increase.
80. 80
Some Issues in Implementing
Snooping Caches
● Additional circuitry needed in a cache
controller.
● Controller continuously snoops on address
bus:
– If address matches tag, either invalidate or
update.
● Since every bus transaction checks cache
tags, could interfere with CPU activities:
– Solution 1: Duplicate set of tags for L1
caches to allow checks in parallel with CPU.
– Solution 2: Duplicate tags on L2 cache.
81. 81
A Commercial
Implementation
● Intel Pentium Xeon processors (PIII and PIV based)
are cache-coherent multiprocessors:
– They implement a snooping protocol.
– Larger on-chip caches reduce bus
contention.
– The chipset contains an external
memory controller that connects the
shared processor memory bus with the
memory chips.
84. 84
Shared Virtual Memory in
DSMs
● In SVM, processes appear to share their
entire virtual
address space:
– Great convenience to the programmers.
– In effect, the operating system takes
care of moving around the pages
transparently.
– Pages are the unit of sharing.
– Pages are the units of coherence.
85. 85
Shared Virtual Memory in
DSMs
● OS can easily allow pages to be
replicated in read-only fashion:
– Virtual memory can protect pages from
being written.
● When a process writes to a page:
– Traps to OS
– Pages in read-only state at other nodes are
invalidated.
● False sharing can be high:
– Leads to lower performance.
86. 86
Directory-based Solution
● In NUMA computers:
– Messages have long latency.
– Also, broadcast is inefficient --- all
messages have explicit responses.
● The main memory controller keeps track of:
– Which processors have cached copies
of which memory locations.
● On a write,
– Only need to inform users, not everyone
● On a dirty read,
– Forward to owner
87. 87
Directory Protocol
● Three states as in Snoopy Protocol
– Shared: 1 or more processors have data,
memory is up-to-date.
– Uncached: No processor has the block.
– Exclusive: 1 processor (owner) has the block.
● In addition to cache state,
– Must track which processors have data when
in the shared state.
– Usually implemented using bit vector, 1 if
processor has copy.
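A sketch of what one directory entry might look like, assuming at most 64 processors so the sharer set fits in a single 64-bit vector:

#include <stdint.h>

typedef enum { UNCACHED, SHARED, EXCLUSIVE } DirState;

/* One entry per memory block: the sharing state plus one bit per processor. */
typedef struct {
    DirState state;
    uint64_t sharers;               /* bit i set => processor i has a copy */
} DirEntry;

static void add_sharer(DirEntry *e, int proc)      { e->sharers |= (uint64_t)1 << proc; }
static void clear_sharers(DirEntry *e)             { e->sharers = 0; }
static int  is_sharer(const DirEntry *e, int proc) { return (int)((e->sharers >> proc) & 1); }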
88. 88
Directory Behavior
● On a read:
– Uncached:
● give (exclusive) copy to requester
● record owner
– Exclusive or shared:
● send share message to current exclusive
owner
● record owner
● return value
– Exclusive dirty:
● forward read request to exclusive owner.
89. 89
Directory Behavior
● On Write
– Send invalidate messages to all
hosts caching values.
● On Write-Thru/Write-back
– Update value.
90. 90
CPU-Cache State Machine
● State machine for CPU requests,
for each memory block.
● A block is in the Invalid state
if it resides only in memory.
[Figure: states Invalid, Shared (read only), and Exclusive (read/write)]
– Invalid → Shared: CPU read; send a read miss message to the home directory.
– Invalid → Exclusive: CPU write; send a write miss message to the home directory.
– Shared: CPU read hit; no message.
– Shared → Exclusive: CPU write; send a write miss message to the home directory.
– Shared → Invalid: invalidate, or a miss due to an address conflict.
– Exclusive: CPU read hit or CPU write hit; no message.
– Exclusive → Shared: fetch; send a data write back message to the home directory.
– Exclusive → Invalid: fetch/invalidate, or a miss due to an address conflict; send a data write back message to the home directory.
91. 91
State Transition Diagram
for the Directory
● Tracks all copies of memory block.
● Same states as the transition diagram
for an individual cache.
● Memory controller actions:
– Update the directory state.
– Send messages to satisfy requests.
– The diagram also indicates actions that update the
sharing set, Sharers, as well as sending a
message.
92. 92
Directory State Machine
● State machine for directory
requests, for each memory block.
● A block is in the Uncached state
if it resides only in memory.
[Figure: states Uncached, Shared (read only), and Exclusive (read/write)]
– Uncached → Shared: read miss; Sharers = {P}; send a data value reply.
– Uncached → Exclusive: write miss; Sharers = {P}; send a data value reply message.
– Shared → Shared: read miss; Sharers += {P}; send a data value reply.
– Shared → Exclusive: write miss; send invalidate to Sharers, then Sharers = {P}; send a data value reply message.
– Exclusive → Uncached: data write back; Sharers = {}; write back the block.
– Exclusive → Shared: read miss; Sharers += {P}; send a fetch to the owner; send a data value reply message to the remote cache; write back the block.
– Exclusive → Exclusive: write miss; Sharers = {P}; send fetch/invalidate to the owner; send a data value reply message to the remote cache.
93. 93
Example
● Operation sequence to be traced (table columns: P1 state/addr/value, P2 state/addr/value, bus action/processor/addr/value, directory addr/state/{sharers}/value, memory value):
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
A1 and A2 map to the same cache block.
[Figure: Processor 1 and Processor 2 connected through the interconnect to the memory and its directory]
94. 94
Example
● Step 1, P1: Write 10 to A1:
– P1 sends a write miss (WrMs) for A1 to the directory; the directory records A1 as Exclusive with Sharers = {P1}.
– The directory sends a data value reply (DaRp) with the old value 0; P1's block becomes Exclusive with A1 = 10.
A1 and A2 map to the same cache block.
95. 95
Example
● Step 2, P1: Read A1 (continuing the trace):
– Read hit in P1's cache; the block stays Exclusive with A1 = 10; no messages are sent.
A1 and A2 map to the same cache block.
96. 96
Example
● Step 3, P2: Read A1 (continuing the trace):
– P2 sends a read miss (RdMs) for A1 to the directory.
– The directory sends a fetch (Ftch) to P1; P1 writes the block back (A1 = 10 reaches memory) and its copy becomes Shared.
– The directory sends a data value reply (DaRp) with A1 = 10 to P2; P2 caches A1 = 10 in the Shared state.
– Directory state: A1 Shared, Sharers = {P1, P2}, value 10.
A1 and A2 map to the same cache block.
97. 97
Example
● Step 4, P2: Write 20 to A1 (continuing the trace):
– P2 sends a write miss (WrMs) for A1 to the directory.
– The directory sends an invalidate (Inval.) to P1; P1's copy becomes Invalid.
– Directory state: A1 Exclusive, Sharers = {P2}; P2's block becomes Exclusive with A1 = 20; memory still holds A1 = 10.
A1 and A2 map to the same cache block.
98. 98
Example
● Step 5, P2: Write 40 to A2 (continuing the trace):
– P2 sends a write miss (WrMs) for A2 to the directory; the directory records A2 as Exclusive with Sharers = {P2}.
– Since A2 maps to the same cache block, P2 writes back A1 = 20 (WrBk); the directory marks A1 Uncached with Sharers = {}, and memory now holds A1 = 20.
– The directory sends a data value reply (DaRp) for A2 with value 0; P2's block becomes Exclusive with A2 = 40.
A1 and A2 map to the same cache block.
99. 99
Implementation Issues in
Directory-Based Protocols
● When the number of processors
is large:
– Directory can become a bottleneck.
– Directories can be distributed
among different memory modules.
– Different directory accesses go to
different locations.
101. 101
Multiprocessor
Programming Support
● A key programming support provided by a
processor:
– Synchronization.
● Why Synchronize?
– To let different processes use shared data
when it is safe.
● In uniprocessors, synchronization
support is provided through:
– Atomic “fetch and update” instructions.
102. 102
Objectives of
Synchronization Algorithms
● Reduce latency:
– How quickly can an application get the
lock in the absence of competition?
● Reduce waiting time
● Reduce contention:
– How to design a scheme to reduce the
contention?
103. 103
Synchronization
● Other atomic operations to
read-modify-write a memory
location:
● test-and-set
● fetch-and-store(R, <mem>)
● fetch-and-add(<mem>, <value>)
● compare-and-swap(<mem>,
<cmp-val>, <stor-val>)
104. 104
Popular Atomic
Synchronization Primitives
● Atomic exchange: interchange a value in a
register for a value in memory:
0 => synchronization variable is free
1 => synchronization variable is locked and
unavailable
– Set register to 1 & swap
● Can be used to implement other
synchronization primitives.
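A hedged C11 sketch of the exchange-based lock just described (0 means free, 1 means locked); the function names are illustrative:

#include <stdatomic.h>

/* Acquire: atomically swap in 1; if the old value was 1, someone else
   already holds the lock, so keep trying. */
void acquire(atomic_int *lock) {
    while (atomic_exchange(lock, 1) == 1)
        ;   /* spin */
}

/* Release: store 0 to mark the synchronization variable free again. */
void release(atomic_int *lock) {
    atomic_store(lock, 0);
}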
105. 105
Synchronization
Primitives cont...
● Test-and-set: Tests a value and
sets it:
– Only if the value passes the test.
● Fetch-and-increment: It returns
the value of a memory location
and atomically increments it.
– 0 => synchronization variable is free
106. 106
Synchronization Issues
● In multiprocessors:
–Traditional atomic “fetch and update”
operations are inefficient.
–One of the culprits is a coherent cache.
● As a result, for SMPs:
–Synchronization can become a
bottleneck.
–Techniques to reduce contention and
latency of synchronization required.
108. 108
Synchronization
Primitive for SMPs
● Atomic exchange in SMP:
– Very inefficient to have both read and
write in the same instruction.
– Use separate instructions instead.
● Load linked (LL) + store conditional (SC)
– Load linked returns the initial value.
– Store conditional returns:
● 1 if it succeeds (no other store to the same memory
location since the load linked)
● 0 otherwise
109. 109
Atomic Exchange using LL
and SC
● The value returned by the exchange
indicates success in getting the lock:
– 0 if the processor succeeded in setting
the lock (it was the first to claim it).
– 1 if another processor had already
claimed access.
– The central idea is to make exchange
operation indivisible.
110. 110
LL and SC
● LL and SC should execute “effectively
atomically”:
– As if the load and store together are
completed atomically.
– No other store to the same location, no
context switch, interrupts, etc.
● Implemented through a link register:
– Stores the address of the memory location specified by the LL.
– The link is broken (and the SC fails) if that location is written by another store or SC.
– Also invalidated on a process switch.
111. 111
Permitted Instructions
Between LL and SC
● Care must be taken about which
instructions can be placed between
an LL SC pair:
– Only a few register-register instructions
can safely be used.
– It is otherwise possible to have a
starvation situation where an SC
instruction is never successful.
112. 112
Other Synchronization
Primitives Using LL and SC
● Atomic swap (exchange R4 with memory location 0(R1)):
try: or R3,R4,R0 ; move exchange value
ll R2,0(R1) ; load linked
sc R3,0(R1) ; store conditional
beqz R3,try ; branch if store fails (R3 = 0)
mov R4,R2 ; put loaded value in R4
● Fetch & increment:
try: ll R2,0(R1) ; load linked
daddui R2,R2,#1 ; increment
sc R2,0(R1) ; store conditional
beqz R2,try ; branch if store fails (R2 = 0)
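For comparison, the same fetch-and-increment effect expressed with C11 atomics; a compiler typically lowers this to an LL/SC loop or a locked add, depending on the ISA:

#include <stdatomic.h>

/* Returns the old value and atomically adds 1, like the LL/SC loop above. */
long fetch_and_increment(atomic_long *p) {
    return atomic_fetch_add(p, 1);
}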
113. 113
Spin Locks
● Sometimes a process or thread needs a
certain data item for a very short time:
– E.g., updating a counter value.
● If a traditional lock is used in this case:
– The contending processes would suffer a
context switch.
– This is far more expensive (1000s of
cycles) than if the contending
processes had simply busy-waited (10s of
cycles).
114. 114
Spin Lock Illustration
● A spin lock leads to busy wait for P2:
– Prevents context switch.
[Figure: P1 holds the spin lock and executes the critical section while P2 busy-waits on the lock]
115. 115
Spin Lock Implementation
daddui R2,R0,#1 ; locked value
lockit: exch R2,0(R1) ; atomic exchange
bnez R2,lockit ; already locked?
● Not very efficient:
– Every attempted exchange to 0(R1) generates a
memory (bus) transaction, even while the lock is held.
● Can be made more efficient by using
cache coherency:
– Spin on a cached copy of the lock variable until its
value changes from 1 to 0.
– Bus transactions can then be avoided while spinning.
116. 116
Problem With the Spin
Lock Algorithm
● Frequent polling gets you the
lock faster:
– But slows everyone else down.
● An efficient scheme:
– Poll on a cached copy.
117. 117
An Efficient Spin Lock
Implementation
● Problem: Every exchange includes a write,
– Invalidates all other copies;
– Generates considerable bus traffic.
● Solution: start by simply repeatedly
reading the variable; when it changes, then
try storing:
● lockit: ll R2,0(R1) ; load the lock variable
bnez R2,lockit ; not free => spin
daddui R2,R0,#1 ; locked value
sc R2,0(R1) ; store conditional
beqz R2,lockit ; branch if store fails
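The same idea in a C11 sketch: spin on an ordinary read of the (cached) lock variable, and attempt the atomic exchange only when the lock looks free:

#include <stdatomic.h>

void spin_acquire(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock) != 0)
            ;                               /* read the cached copy: no bus write */
        if (atomic_exchange(lock, 1) == 0)  /* looks free: try to claim it */
            return;
    }
}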
118. 118
Example
● 10 processors simultaneously
try to set a spin lock at time 0:
– Determine the number of bus
transactions required for all
processors to acquire the lock
once.
119. 119
Solution
● Assume that at a certain time, i
processes are contending for the lock:
– i LL operations
– i SC operations
– 1 store to release the lock.
● That is a total of 2i + 1 bus transactions per hand-off.
● Summed over i = 1 to n: Σ(2i + 1) = n² + 2n.
120. 120
Barriers
● A barrier makes a set of n processes leave the
synchronization point together:
– "At the same time" is hard to realize in a parallel system.
– It is sufficient that no process leaves until all
processes have arrived.
● Can be achieved by a busy wait on shared
memory:
– Creates a large number of bus transactions.
● To avoid this:
– Use a cache update protocol.
– Processors spin on the cached value.
121. 121
Barrier Implementation
● Can be implemented using 2 spin
locks:
– One to protect the counter.
– One to hold the processes until
the last process arrives at the
barrier.
● Assume that lock and unlock
provide the basic spin locks.
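A C sketch of that scheme (lock()/unlock() stand for the basic spin locks assumed above; total is the number of participating processes). Note that it still has the race described on the next slide:

#include <stdatomic.h>

static void lock(atomic_int *l)   { while (atomic_exchange(l, 1)) ; }
static void unlock(atomic_int *l) { atomic_store(l, 0); }

static atomic_int counterlock = 0;   /* protects count */
static int count = 0;                /* number of processes that have arrived */
static atomic_int release_flag = 0;  /* holds processes until the last one arrives */

void barrier(int total) {
    lock(&counterlock);
    if (count == 0)
        atomic_store(&release_flag, 0);   /* first arrival resets the release flag */
    count = count + 1;
    unlock(&counterlock);
    if (count == total) {                 /* last arrival lets everyone go */
        count = 0;
        atomic_store(&release_flag, 1);
    } else {
        while (atomic_load(&release_flag) == 0)
            ;                             /* spin until released */
    }
}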
123. 123
Barrier Implementation:
Loop Hole
● It is possible that one process
races ahead:
– The fast process resets the
release flag and traps the
remaining processes.
● Solution:
– Sense-reversing barrier (read
up).
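A sketch of the sense-reversing fix: each process keeps a private sense flag that flips on every barrier episode, so a fast process that re-enters the barrier cannot trap the others (lock()/unlock() are again the assumed basic spin locks):

#include <stdatomic.h>

static void lock(atomic_int *l)   { while (atomic_exchange(l, 1)) ; }
static void unlock(atomic_int *l) { atomic_store(l, 0); }

static atomic_int counterlock = 0;
static int count = 0;
static atomic_int release_flag = 0;

/* local_sense is private to each process (e.g., a local or thread-local
   variable) and starts at 0. */
void barrier_sense(int total, int *local_sense) {
    *local_sense = !*local_sense;        /* flip the sense for this episode */
    lock(&counterlock);
    count = count + 1;
    if (count == total) {                /* last arrival releases everyone */
        count = 0;
        atomic_store(&release_flag, *local_sense);
    }
    unlock(&counterlock);
    while (atomic_load(&release_flag) != *local_sense)
        ;                                /* wait for this episode's release */
}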
124. 124
Example
● Assume 10 processes try to
synchronize by executing a
barrier simultaneously.
– Determine the number of bus
transactions required for the
processes to reach and leave the
barrier.
125. 125
Solution
● For the ith process:
– The number of bus transactions
is 3i + 4.
● For n processes:
– Σ(3i + 4), i = 1 to n, = (3n² + 11n)/2.
126. 126
Efficient Implementation
of Barrier
● We need a primitive to
efficiently increment the
barrier count.
– Queuing locks can be used for
improving the performance of a
barrier.
127. 127
Queuing Locks
● Each arriving processor is kept
track of in a queue structure.
– Signal next waiter when a process
is done.
[Figure: a flags array with slots 0 to p-1, each marked has-lock (hl) or must-wait (mw); a pointer marks the current lock holder's slot, and queuelast points to the next free slot]
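A C sketch of the array-based queuing lock pictured above: each arriving processor claims a slot with fetch-and-increment and spins only on its own flag, and the releaser signals just the next waiter (MAX_PROCS and the modular slot numbering are assumptions of the sketch):

#include <stdatomic.h>

#define MAX_PROCS 64                      /* assumed upper bound on processors */

typedef struct {
    atomic_int  flags[MAX_PROCS];         /* 1 = has-lock, 0 = must-wait */
    atomic_uint queuelast;                /* next free slot in the queue */
} QueueLock;

void qlock_init(QueueLock *q) {
    for (int i = 0; i < MAX_PROCS; i++)
        atomic_store(&q->flags[i], 0);
    atomic_store(&q->flags[0], 1);        /* slot 0 starts out holding the lock */
    atomic_store(&q->queuelast, 0);
}

/* Returns this processor's slot so it can pass the lock on at release time. */
unsigned qlock_acquire(QueueLock *q) {
    unsigned my_slot = atomic_fetch_add(&q->queuelast, 1) % MAX_PROCS;
    while (atomic_load(&q->flags[my_slot]) == 0)
        ;                                 /* spin on this processor's own flag */
    return my_slot;
}

void qlock_release(QueueLock *q, unsigned my_slot) {
    atomic_store(&q->flags[my_slot], 0);                   /* reset own slot */
    atomic_store(&q->flags[(my_slot + 1) % MAX_PROCS], 1); /* signal the next waiter */
}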
129. 129
Introduction
● If you were plowing a field, which of
the following would you rather use:
A strong ox or 1024 chickens?
--- Seymour Cray
● The answer would be different if
you are considering a computing
problem.
130. 130
Multithreading Within a
Single Processor
● Until now, we considered multiple
threads of an application running on
different processors:
– Can multiple threads execute
concurrently on the same processor?
Yes
● Why is this desirable?
– Inexpensive – one CPU.
– Faster communication among threads.
131. 131
Why Does Multithreading
within a Processor Make
Sense?
● Superscalar processors are now
commonplace.
● Most of a processor's functional units
cannot find enough work on average:
– Peak IPC is 6, average IPC is 1.5!
● Threads share resources:
– We can execute a number of threads
without a corresponding linear increase in
chip area.
132. 132
Analysis of Idle Cycles in a
Superscalar Processor
● Issues multiple instructions every cycle.
– Typically 4.
● Several functional units of each type:
– Adders, Multipliers, Floating Point units, etc.
– Many functional units are idle in many cycles.
– Especially true when there is a cache miss.
● Dispatcher reads instructions, decides which
can run in parallel:
– Number of instructions limited by instruction
dependencies and long-latency operations
133. 133
Analysis of Processor Inefficiency
• Vertical waste is
introduced when the
processor issues no
instructions in a cycle.
• Horizontal waste is
introduced when not
all issue slots can be
filled in a cycle.
• 61% of the wasted
cycles are vertical
waste on avg.
[Figure: grid of issue slots (columns) versus cycles (rows); X marks a full issue slot, blank an empty one; completely empty cycles account for 12 slots of vertical waste and partially filled cycles for 9 slots of horizontal waste in the example]
134. 134
Multithreading: A Pictorial
Explanation
• Rather than enlarging the
depth of the instruction
window (more speculation
with lowering confidence !):
– Enlarge its “width”.
• Fetch from multiple
threads.
[Figure: deepening a single thread's instruction window means issuing past more and more unresolved branches, while fetching from multiple threads widens the window without extra speculation]
135. 135
Multithreading
● Essentially a latency hiding
technique:
– Hides stalls due to cache misses.
– Hides stalls due to data dependency.
● Under cache miss or data
dependency stalls:
– Multithreading provides work to
functional units, keeps them busy.
136. 136
Basic Support for
Multithreading
● Multiple states (contexts) required to
be maintained at the same time.
● One set per thread:
– Program Counter
– Register File (and Flags)
– Per thread renaming table.
● Since register renaming provides unique
register identifiers:
– Instructions from multiple threads can be
mixed in the data path.
137. 137
Multithreading Support in
Uniprocessors
● In the most basic form:
– Processor interleaves execution of
instructions from different
threads.
● Three types of thread scheduling:
– Coarse-grained multithreading
– Fine-grained multithreading
– Simultaneous multithreading
138. 138
Coarse-Grained
Multithreading
● A selected thread continues to run:
– A thread switch occurs only when the
active thread undergoes a long stall (an L2
cache miss, etc.).
– This form of multithreading only hides
long latency events.
● Easy to implement:
– But, requirement of pipeline flushing on
thread switch makes it inefficient.
140. 140
Coarse-grained
Multithreaded Processors
● Example: Sun SPARC II Processor
– Provides hardware context for 4 threads
– One thread reserved for interrupt handling
– Register windows provide fast switching
between 4 sets of 32 GPRs.
● Used in cache-coherent DSMs:
– On a cache miss to a remote memory (takes
100s of cycles) switch to a different thread.
– Network messages, etc., are handled by the
interrupt handler thread.
141. 141
Fine-Grained
Multithreading
● A few threads are kept active:
– Context switch among the active
threads on every clock cycle.
– Occupancy of the execution core would
be much higher.
● Issue instructions only from a single
thread in a cycle:
– Again may not find enough work every
cycle, but cache misses can be
tolerated.
142. 142
Fine-Grained
Multithreading
● Hides both long and short latency
events.
● Vertical waste is eliminated but
horizontal waste is not.
– If a thread has little or no
operations to execute, its issue
slot will be underutilized.
144. 144
Simultaneous Multithreading
(SMT): An Overview
• Converts thread-level parallelism:
–Into instruction-level parallelism.
• Issues instructions from multiple
threads in the same cycle.
–Has the highest probability of finding
work for every issue slot.
• Called Hyper-threading by Intel.
146. 146
Differences Among
Multithreaded Architectures
Multithreading type / shared resources / context-switch mechanism:
– Fine-grained: all resources shared except the register file and
control logic/state; context switch every cycle.
– Coarse-grained: all shared except instruction fetch
buffers, register files, and
control logic/state; context switch on long stalls.
– SMT: all shared except the instruction fetch
buffer, return address
stack, register files, control
logic/state, reorder buffer, and
store queue; no switching, all contexts
active simultaneously.
147. 147
SMT-Advantages
● Two main performance limitations
of multithreading:
– Memory stalls
– Pipeline flushes due to incorrect
speculation.
● In SMTs:
– Multiple threads are simultaneously
executed, can hide both these
problems.
148. 148
Anatomy of an SMT
Processor
• Multiple “logical” CPUs.
• One physical CPU:
– ~5% extra silicon to duplicate thread state
information.
• Better than single threading:
– Increased thread-level parallelism.
– Improved processor utilization when one
thread blocks.
• Not as good as two physical CPUs:
– CPU resources are shared, not replicated.
150. 150
Some Issues in SMT
● To achieve multithreading:
– Extend, replicate, and redesign some units of
a superscalar processor.
● Resources replicated:
– States of hardware contexts (registers, PCs)
– Per thread mechanisms for Pipeline flushing
and subroutine returns.
– Per thread branch target buffer and
translation lookaside buffer.
151. 151
SMT Issues
● Resources to be redesigned:
– Instruction fetch unit.
– Processor pipeline.
● Instruction Scheduling:
– Does not require additional
hardware.
– Register renaming same as in
superscalar processors.
152. 152
Superscalar Architecture
[Figure: block diagram of a superscalar datapath: a single PC drives an instruction fetch & decode unit (multiple instructions per cycle); instructions flow over multiple buses to reservation stations feeding ALU 1, ALU 2, the FP unit, the branch unit, and the load/store unit; results return over multiple buses to the register file and a commit unit that retires multiple instructions per cycle]
153. 153
Simultaneous Multithreading:
Block Diagram
[Figure: the same superscalar datapath extended with multiple PCs and per-thread commit/control state and register files, so the fetch & decode unit can feed instructions from several threads into the shared reservation stations and functional units each cycle]
154. 154
Simultaneous
Multithreading: A Model
● Instruction fetch unit:
– Fetches instructions for 2 threads per cycle.
– Decodes one thread until a branch or the end of a
cache line, then jumps to the other.
– Highest priority goes to the threads with the
fewest instructions in the decode,
renaming, and queue pipeline stages.
– A small hardware addition tracks the
queue lengths.
157. 157
Simultaneous Multithreading:
Model Pipeline Format
● To avoid increase in clock
cycle time:
–SMT pipeline extended to allow
2 cycle register reads and
writes.
● 2 cycle reads/writes increase
branch misprediction penalty.
158. 158
Simultaneous Multithreading:
What to Issue?
● Not exactly the same as superscalars:
– In a superscalar: oldest is the best:
least speculation.
– In an SMT the choice is not so clear:
● The degree of branch speculation may vary
across threads.
● This leads to several selection strategies:
– Oldest instructions first.
– Least-speculated (branch-speculated last), etc.
159. 159
Simultaneous Multithreading:
Compiler Optimizations
● Should try to minimize cache
interference.
● Latency hiding techniques like
speculation should be enhanced.
● Sharing optimization techniques
from multiprocessors changed:
– Data sharing is now good.
160. 160
Caching in SMT
● Same cache shared
among threads:
–Performance degradation
due to cache sharing.
–Possibility of cache
thrashing.
161. 161
Performance Implications
of SMT
● Single thread performance is likely to go
down:
– Caches, branch predictors, registers, etc.
are shared.
● This effect can be mitigated by trying
to prioritize one thread.
● With eight threads in a processor with
many resources:
– SMT can yield throughput improvements of
roughly a factor of 2 to 4.
162. 162
Commercial Examples
● Compaq Alpha 21464 (EV8)
– 4T SMT, June 2001
● Intel Pentium IV (Xeon)
– 2T SMT, 2002
– 10-30% gains reported
● SUN Ultra IV
– 2-core, 2T SMT
● IBM POWER5
– Dual processor core
– 8-way superscalar, SMT
– 24% area growth per core for SMT
163. 163
Pentium4: Hyper-
Threading
● Two threads:
– The operating system operates as if it
is executing on a two-processor system.
● When only one available thread:
– Pentium 4 behaves like a regular single-
threaded superscalar processor.
● Intel claims 30% performance
improvements.
164. 164
Intel MultiCore Architecture
● Improving execution rate of a single-
thread is still considered important:
– Out-of-order execution and speculation.
● Multicore (Core) architecture:
– Can reduce power consumption.
– Its pipeline (14 stages) is closer to the Pentium
M (12 stages) than to the P4 (30 stages).
● Many transistors invested in large
branch predictors:
– To reduce wasted work (power).
166. 166
Intel’s Dual Core Architectures
● The Pentium D is simply two Pentium 4 CPUs:
– Inefficiently paired together to run as dual core.
● The Core Duo is Intel's first generation dual core
processor based upon the Pentium M (a Pentium III-4
hybrid):
– Made mostly for laptops and is much more efficient than
Pentium D.
● The Core 2 Duo is Intel's second generation (hence,
Core 2) processor:
– Made for desktops and laptops designed to be fast while
not consuming nearly as much power as previous CPUs.
● Intel has now dropped the Pentium name in favor of
the Core architecture.
168. 168
Intel Core 2 Duo
• Code named “Conroe”
• Homogeneous cores
• Bus based on chip
interconnect.
• Shared on-die Cache
Memory.
[Figure: Core 2 Duo die organization: two classic out-of-order cores (reservation stations, issue ports, schedulers, etc.) sharing a large set-associative L2 cache with prefetching. Source: Intel Corp.]
169. 169
Intel’s Core 2 Duo
● Launched in July 2006.
● Replacement for Pentium 4 and Pentium D
CPUs.
● Intel claims:
– Conroe provides 40% more performance at
40% less power compared to the Pentium D.
● All Conroe processors are manufactured
with 4 MB L2 cache:
– Due to manufacturing defects, the E6300 and
E6400 versions based on this core have half
their cache disabled, leaving them with only
2 MB of usable L2 cache.
174. 174
Commercial Examples: IBM
POWER5
● SMT added to Superscalar Micro-
architecture.
● Additional Program Counter (PC).
● GPR/FPR rename mapper expanded
to map second set of registers .
● Completion logic replicated to track
two threads.
175. 175
Commercial Examples: IBM
POWER5
● Includes:
1. Thread Priority Mechanism: 8
levels.
2. Dynamic Thread Switching:
● Used if no instruction ready to run.
● Allocates all machine resources to
one thread at any time.
176. 176
Sun’s Niagara
● Commercial servers require high
thread-level throughput:
– Suffer from cache misses.
● Sun’s Niagara focuses on:
– Simple cores (low power, design
complexity, can accommodate more
cores)
– Fine-grain multi-threading (to
tolerate long memory latencies)
180. 180
Reducing Power and Cooling Requirements
with Processor Performance States
AMD PowerNow!™ Technology with Optimized Power Management
(example: AMD Opteron™ processor 2218 series)
– Multiple performance states (P-states) for optimized power management.
– Dynamically reduces processor power based on workload.
– Lowers power consumption without compromising performance.
– Up to 75% processor power savings.
P-states, from high to low processor utilization:
– P0: 2600 MHz, 1.35 V, ~95 W
– P1: 2400 MHz, 1.30 V, ~80 W
– P2: 2200 MHz, 1.25 V, ~66 W
– P3: 2000 MHz, 1.20 V, ~55 W
– P4: 1800 MHz, 1.15 V, ~51 W
– P5: 1000 MHz, 1.10 V, ~34 W
181. 181
Summary
Cont…
• ILP now appears fully exploited:
–For the last decade or so, the focus has
been on thread- and process-level
parallelism.
• Multiprocessors progressed from
add-on cards, to chips on the mother
board:
–Now available as multicore.
182. 182
Summary
Cont…
● Major issues in multiprocessors:
– Cache coherency and
synchronization.
● Cache coherency:
– The copies of data blocks in individual
caches may become inconsistent.
183. 183
Summary
Cont…
● Cache coherency: two popular
protocols:
– Snooping: suitable for SMPs.
– Directory-based: suitable for NUMA
multiprocessors.
● Multithreading in uniprocessors is
another promising approach:
– Simultaneous multithreading (SMT)
184. 184
Future Trends
● Simultaneous and Redundantly
Threaded Processors(SRT):
– Increase reliability with fault
detection and correction.
– Run multiple copies of the same
program simultaneously.
185. 185
Future Trends
● Software Pre-Execution in SMT:
– In some cases the data address is
extremely hard to predict.
– Use an idle thread of SMT for pre-
execution.
● Speculation:
– More advanced techniques for
speculation.