1. Data-Level Parallelism
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by
John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan
Kaufmann Publishers
3. SIMD Architectures
Exploit significant data-level parallelism for
Matrix-oriented scientific computing
Media-oriented image & sound processors
SIMD is more energy efficient than MIMD
Only needs to fetch 1 instruction per data operation
Makes SIMD attractive for personal mobile devices
SIMD allows programmer to continue to think sequentially
4. SIMD Parallelism
1. Vector architectures
2. SIMD extensions
3. Graphics Processor Units (GPUs)
For x86 processors
Expect 2 additional cores per chip per year
SIMD width to double every 4 years
Potential speedup from SIMD to be 2x that from MIMD
6. 1. Vector Architectures
Idea
Read sets of data elements into “vector registers”
Operate on those registers
Disperse results back into memory
Registers are controlled by compiler
Used to hide memory latency
Leverage memory bandwidth
8. VMIPS
Loosely based on Cray-1
Vector registers
Each register holds a 64-element, 64 bits/element vector
Register file has 16 read ports & 8 write ports
Vector functional units
Fully pipelined
Data & control hazards are detected
Vector load-store unit
Fully pipelined
1 word per clock cycle after initial latency
Scalar registers
32 general-purpose registers
32 floating-point registers
9. VMIPS Instructions
ADDVV.D – add 2 vectors
ADDVS.D – add vector to a scalar
LV/SV – vector load & vector store from address
10. Example – DAXPY
Y = aX + Y
L.D F0,a ; load scalar a
LV V1,Rx ; load vector X
MULVS.D V2,V1,F0 ; vector-scalar multiply
LV V3,Ry ; load vector Y
ADDVV.D V4,V2,V3 ; add
SV Ry,V4 ; store result
Requires 6 instructions vs. almost 600 for MIPS
Assuming 64 elements per vector
DAXPY – Double precision a×X plus Y
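For comparison, a minimal C sketch of the scalar loop a plain MIPS machine would execute (function name is illustrative, not from the slides); on a scalar ISA the loop body is roughly 9 instructions per element, hence the "almost 600" for 64 elements:

```c
#include <stddef.h>

/* Scalar DAXPY: Y = a*X + Y, one element per loop iteration,
   versus the 6 vector instructions on VMIPS for n = 64 */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```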
11. Vector Execution Time
Execution time depends on 3 factors
Length of operand vectors
Structural hazards
Data dependencies
VMIPS functional units consume 1 element per clock cycle
Execution time ≈ vector length
12. Definitions
Convoy
Set of vector instructions that could potentially execute together
Free of structural hazards
Sequences with read-after-write hazards can be in same convoy via chaining
Chaining
Allows a vector operation to start as soon as individual elements of its vector source operand become available
13. Convoys
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV Ry,V4 ;store the sum
Convoys
1 LV MULVS.D
2 LV ADDVV.D
3 SV
14. Definitions – Chime
Unit of time to execute 1 convoy
m convoys execute in m chimes
For vector length of n, requires m × n clock cycles
15. Convoys
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV Ry,V4 ;store the sum
Convoys
1 LV MULVS.D
2 LV ADDVV.D
3 SV
3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 64-element vectors, requires 64 × 3 = 192 clock cycles
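The chime arithmetic above can be sketched as (helper names are illustrative):

```c
/* Chime model: m convoys over vectors of length n take about
   m * n clock cycles, ignoring start-up latency */
int chime_cycles(int m_convoys, int n) {
    return m_convoys * n;
}

/* Cycles per FLOP: DAXPY performs 2 FP ops (multiply + add) per result */
double cycles_per_flop(int cycles, int fp_ops_per_result, int n) {
    return (double)cycles / (fp_ops_per_result * n);
}
```

For the DAXPY sequence: 3 convoys × 64 elements = 192 cycles, and 192 / 128 FLOPs = 1.5 cycles per FLOP.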
16. Challenges
Start-up time
Latency of vector functional unit
Assume same as Cray-1
Floating-point add 6 clock cycles
Floating-point multiply 7 clock cycles
Floating-point divide 20 clock cycles
Vector load 12 clock cycles
17. Improvements
1. > 1 element per clock cycle
Multiple Lanes
2. Non-64 wide vectors
Vector Length Register
3. IF statements in vector code
Vector Mask Registers
4. Memory system optimizations to support vector processors
Memory Banks
5. Multi-dimensional matrices
Stride
6. Sparse matrices
7. Programming a vector computer
18. 1. Multiple Lanes
Multiple hardware lanes
Element n of vector register A is “hardwired” to element n of vector register B
19. 2. Vector Length Register
Vector length not known at compile time?
Vector Length Register (VLR)
Use strip mining for vectors over maximum length
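Strip mining can be sketched in C (an illustrative sketch assuming a maximum hardware vector length MVL; names are not from the slides):

```c
#define MVL 64  /* assumed maximum hardware vector length */

/* Strip-mined DAXPY: the first strip handles n % MVL elements (the value
   a compiler would place in the VLR); remaining strips use full length */
void daxpy_strip_mined(int n, double a, const double *x, double *y) {
    int low = 0;
    int vl = n % MVL;               /* odd-size first strip */
    if (vl == 0) vl = MVL;
    while (low < n) {
        for (int i = low; i < low + vl; i++)  /* one vector op of length vl */
            y[i] = a * x[i] + y[i];
        low += vl;
        vl = MVL;
    }
}
```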
20. 3. Vector Mask Registers
for (i = 0; i < 64; i = i+1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];
Use vector mask register to “disable” elements
LV V1,Rx ;load vector X into V1
LV V2,Ry ;load vector Y
L.D F0,#0 ;load FP zero into F0
SNEVS.D V1,F0 ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV Rx,V1 ;store the result in X
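The mask semantics can be emulated in C (an illustrative sketch: every element is evaluated, but only masked-on elements commit their result):

```c
/* Vector-mask emulation of: if (X[i] != 0) X[i] = X[i] - Y[i];
   Assumes n <= 64, matching a 64-element vector register */
void masked_sub(int n, double *x, const double *y) {
    unsigned char vm[64];                 /* vector mask register */
    for (int i = 0; i < n; i++)
        vm[i] = (x[i] != 0.0);            /* SNEVS.D: set mask where X[i] != 0 */
    for (int i = 0; i < n; i++)
        if (vm[i])
            x[i] = x[i] - y[i];           /* SUBVV.D commits under mask */
}
```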
21. 4. Memory Banks
Designed to support high bandwidth for vector loads & stores
Spread accesses across multiple banks
Control bank addresses independently
Load or store non-sequential words
Support multiple vector processors sharing the same memory
22. 5. Stride
for (i = 0; i < 100; i = i+1)
  for (j = 0; j < 100; j = j+1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k = k+1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }
Row-major vs. column-major
Vectorize multiplication of rows of B with columns of D
Use non-unit stride
Bank conflict (stall) occurs when same bank is hit faster than bank busy time
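The stride and bank-conflict rules can be sketched as follows (illustrative helpers: a stall occurs when #banks / gcd(stride, #banks) is less than the bank busy time):

```c
/* Stepping down a column of a row-major N x N matrix (D[k][j] for
   successive k) advances the address by N elements: a non-unit stride */
int column_stride_elements(int n) { return n; }

/* Accesses revisit the same bank every num_banks / gcd(stride, num_banks)
   accesses; if that interval is shorter than the bank busy time, stall */
int has_bank_conflict(int stride, int num_banks, int bank_busy_time) {
    int a = stride, b = num_banks;
    while (b != 0) { int t = a % b; a = b; b = t; }  /* a = gcd */
    return (num_banks / a) < bank_busy_time;
}
```

For example, with 8 banks, a bank busy time of 6 cycles, and stride 32, every access hits the same bank, so the processor stalls; stride 1 spreads accesses across all 8 banks and avoids the conflict.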
23. 2. SIMD Extensions
Media applications operate on data types narrower than native word size
E.g., four 8-bit operations on a 32-bit system
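As a sketch of what such a partitioned operation does, the following C adds four unsigned 8-bit lanes packed into one 32-bit word without letting carries cross lane boundaries (illustrative code in the style of MMX-era extensions, not an actual intrinsic):

```c
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed in a 32-bit word: add the low
   7 bits of each lane first, then fold in the top bit with XOR so no
   carry can propagate into the neighbouring lane */
uint32_t add4_u8(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low ^ ((a ^ b) & 0x80808080u);
}
```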
24. Why SIMD?
Need minor changes to hardware
Require little extra state to save while context switching, compared to vector architectures
No need for high memory bandwidth, compared to vector processors
No need to deal with page faults when used with virtual memory
25. SIMD Extensions
Limitations compared to vector instructions
Number of data operands encoded into opcode
Leads to many instructions
No sophisticated addressing modes
Strided, scatter-gather
No mask registers for conditional execution
26. SIMD Implementations
Intel MMX (1996)
Eight 8-bit integer ops or four 16-bit integer ops
Streaming SIMD Extensions (SSE) (1999)
Eight 16-bit integer ops
Four 32-bit integer/FP ops or two 64-bit integer/FP ops
Advanced Vector Extensions (AVX) (2010)
Four 64-bit integer/FP ops
Rely on programmer & libraries rather than compiler
Operands must be in consecutive & aligned memory locations
27. Graphical Processing Units (GPUs)
Originally designed to accelerate the large number of computations performed in graphics rendering
Offloaded numerically intensive computation from CPU
GPUs grew with demand for high performance
graphics
Eventually GPUs became much more powerful, even more than CPUs for many computations
Cost-power-performance advantage
28. GPUs
GPU is typically a separate card, installed into a PCI Express slot
Market leaders: NVIDIA, Intel, AMD (ATI)
Example NVIDIA GPUs at UoM
GeForce GTX 480 Tesla 2070
29. Example Specifications

                          GTX 480        Tesla 2070      Tesla K80
Peak double precision FP  650 Gigaflops  515 Gigaflops   2.91 Teraflops
Peak single precision FP  1.3 Teraflops  1.03 Teraflops  8.74 Teraflops
CUDA cores                480            448             4992
Frequency of CUDA cores   1.40 GHz       1.15 GHz        560/875 MHz
Memory size (GDDR5)       1536 MB        6 GB            24 GB
Memory bandwidth          177.4 GB/sec   150 GB/sec      480 GB/sec
ECC memory                No             Yes             Yes
30. CPU vs. GPU Architecture
GPU devotes more transistors to computation
31. Basic Idea
Heterogeneous execution model
CPU – host
GPU – device
Develop a C-like programming language for GPU
Unify all forms of GPU parallelism as threads
Handle each data element in a separate thread, so no synchronization is needed
Programming model is “Single Instruction Multiple Thread (SIMT)”
32. Programming GPUs
CUDA language for Nvidia GPU products
Compute Unified Device Architecture
Based on C
nvcc compiler
Lots of tools for analysis, debug, profile, …
OpenCL - Open Computing Language
Based on C
Supports GPU & CPU programming
Lots of active research
e.g., automatic code generation for GPUs
33. Threads & Blocks
A thread is associated with each data element
Threads are organized into blocks
Blocks are organized into a grid
GPU hardware handles thread management, not applications or OS
34. Grids, Blocks, & Threads
July-Aug 2011
Grid of size 6 (3×2 blocks)
Each block has 12 threads (4×3)
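The thread/block/grid hierarchy boils down to simple index arithmetic, sketched here in plain C with CUDA-style names (illustrative host-side emulation, not device code):

```c
/* 1-D global thread index, as CUDA computes it:
   blockIdx.x * blockDim.x + threadIdx.x */
int global_index(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

/* Total threads launched by a grid */
int total_threads(int gridDim, int blockDim) {
    return gridDim * blockDim;
}
```

For the grid above (6 blocks of 12 threads), thread 5 of block 2 has global index 2 × 12 + 5 = 29, out of 72 threads in total.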
35. NVIDIA GPU Architecture
Similarities to vector machines
Works well with data-level parallel problems
Scatter-gather transfers
Mask registers
Large register files
Differences
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor
36. Example – Multiply 2 Vectors of Length 8192
Code that works over all elements is the grid
Thread blocks break this down into manageable sizes
512/1024 threads per block
SIMD instruction executes 32 elements at a time
Thus, grid size = 16/8 blocks
Block is analogous to a strip-mined vector loop with vector length of 32
Block is assigned to a multithreaded SIMD processor by thread block scheduler
Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors (a.k.a. streaming multiprocessors, SMX)
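The grid-size arithmetic above can be sketched as (illustrative helper):

```c
/* Number of thread blocks needed: ceil(n / threads_per_block) */
int num_blocks(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With 8192 elements, 512-thread blocks give 16 blocks and 1024-thread blocks give 8, matching the 16/8 grid sizes on the slide.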
38. Threads of SIMD Instructions
Each has its own PC
Thread scheduler uses scoreboard to dispatch
Keeps track of up to 48 threads of SIMD instructions
Hides memory latency
No data dependencies between threads!
Thread block scheduler schedules blocks to SIMD processors
Within each SIMD processor
32 SIMD lanes
Wide & shallow compared to vector processors
39. Registers
NVIDIA GPU has 32,768 registers
Divided into lanes
Each SIMD thread is limited to 64 registers
SIMD thread has up to
64 vector registers of 32 32-bit elements
32 vector registers of 32 64-bit elements
Fermi has 16 physical SIMD lanes, each containing 2,048 registers
40. Conditional Branching
Like vector architectures, GPU branch hardware uses internal masks
Also uses
Branch synchronization stack
Entries consist of masks for each SIMD lane
i.e. which threads commit their results (all threads execute)
Instruction markers to manage when a branch diverges into multiple execution paths
Push on divergent branch
…and when paths converge
Act as barriers
Pops stack
Per-thread-lane 1-bit predicate register, specified by programmer
41. NVIDIA GPU Memory Structure
GRID Architecture
Grid
• A group of threads all running the same kernel
• Can run multiple grids at once
Block
• Grids are composed of blocks
• Each block is a logical unit containing a number of coordinating threads & some amount of shared memory
43. NVIDIA GPU Memory Structures (Cont.)
Each SIMD lane has a private section of off-chip DRAM
Private memory
Contains stack frame, spilling registers, & private variables
Each multithreaded SIMD processor also has local memory
Shared by SIMD lanes / threads within a block
Memory shared by SIMD processors is GPU Memory
Host can read & write GPU memory
44. Fermi Architecture Innovations
Each SIMD processor has
2 SIMD thread schedulers, 2 instruction dispatch units
16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
Thus, 2 threads of SIMD instructions are scheduled every 2 clock cycles
Fast double precision
Caches for GPU memory
64-bit addressing & unified address space
Error correcting codes
Faster context switching
Faster atomic instructions
46. Summary
Vector architectures
High performance
High cost
SIMD instruction set for multimedia
Simple extensions
Low cost
Low performance
GPUs
High performance
Low cost
SIMD only
Editor's Notes
192/128 = 1.5
Predicate register – can be used to tell whether to execute an instruction, given a condition
Spilling registers – a "spilled variable" is a variable kept in main memory rather than in a CPU register. Moving a variable from a register to memory is called spilling.