"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
1. Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Koichi Shirahata (1), Jun Doi (2), Mikio Takeuchi (2)
1: Tokyo Institute of Technology
2: IBM Research - Tokyo
2. Programming Models for Exascale Computing
• GPU-based heterogeneous clusters
  – e.g., TSUBAME 2.5 (3 GPUs x 1408 nodes)
  – Acceleration using GPUs
    • High peak performance and memory bandwidth
• Programming models for GPU-based clusters
  – Message passing (e.g., MPI)
    • High tuning efficiency
    • High programming cost
  – APGAS (Asynchronous Partitioned Global Address Space)
    • Abstracts distributed memory and the deep memory hierarchy
    • X10: an instance of APGAS programming languages
→ Highly scalable and productive computing on GPUs using APGAS (a minimal X10 sketch follows)
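For readers new to the model, here is a minimal X10 sketch of the APGAS style (our illustration, not from the slides): one asynchronous activity is spawned at every place, and the enclosing finish waits for all of them.

```x10
// Minimal APGAS example in X10: distributed "hello" across places.
public class HelloPlaces {
    public static def main(args:Rail[String]) {
        finish for (p in Place.places()) {
            at (p) async {
                // "here" is the place this activity runs at
                Console.OUT.println("Hello from place " + here.id);
            }
        }
    }
}
```

The same at/async/finish constructs carry over to GPU child places in X10's CUDA support, which slide 5 introduces.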
3. Problem Statement
• How much do GPUs accelerate applications using APGAS?
  – Tradeoff between performance and productivity
• Performance
  – The abstraction of the memory hierarchy may limit performance
  – Scalability on multiple GPUs
• Productivity
  – Can we use GPUs with little programming cost?
4. Goal and Contributions
• Goal
  – Scalable and productive computing on GPUs
• Approach
  – Performance analysis of lattice QCD in X10 CUDA
    • Implement lattice QCD in X10 CUDA
    • Comparative performance analysis of X10 on GPUs
• Confirm acceleration using GPUs in X10
  – 19.4x speedup over plain X10 by using X10 CUDA on 32 nodes of TSUBAME 2.5
5. The APGAS Model using GPUs
[Diagram: CPU places (Place 0 ... Place N-1), each with GPU child places (Child Place 0 ... Child Place M-1); activities and objects move between places via at, async, and asyncCopy.]
• X10 provides CUDA support for GPUs [1] (see the staging sketch below)

[1] Cunningham, D., et al. "GPU programming in a high level language: compiling X10 to CUDA." Proceedings of the 2011 ACM SIGPLAN X10 Workshop. ACM, 2011.
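A hedged sketch of how a CPU place might stage data into a GPU child place, written in the style of the X10 CUDA samples that accompany [1]; CUDAUtilities.makeGlobalRail and all buffer names here are assumptions for illustration, not the talk's code.

```x10
import x10.util.CUDAUtilities;

class GpuStaging {
    // Allocate a Rail in GPU memory at the child place and copy
    // host data into it; finish awaits the asynchronous copy.
    // (makeGlobalRail's signature follows the X10 CUDA samples
    // and is an assumption here.)
    static def stageToGpu(gpu:Place, host:Rail[Float]):GlobalRail[Float] {
        val dev = CUDAUtilities.makeGlobalRail[Float](gpu, host.size, 0.0f);
        finish Rail.asyncCopy(host, 0, dev, 0, host.size);
        return dev;
    }
}
```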
6. Lattice QCD
• Lattice QCD
  – Simulation of Quantum Chromodynamics (QCD) of quarks and gluons on a 4D grid of points in space and time
  – A grand challenge in high-performance computing
    • Requires high memory/network bandwidth and computational power
• Computing lattice QCD
  – Monte Carlo simulations on the 4D grid
  – Dominated by solving linear systems via matrix-vector multiplication with iterative methods (e.g., the CG method); the system is written out below
  – Parallelizable by dividing the 4D grid into partial grids, one per place
    • Boundary exchanges are required between places in each direction
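For reference, the linear system being solved is the discretized Dirac equation (our summary in standard notation, not a slide):

```latex
% Discretized Dirac equation on the lattice:
\[
  D\,\psi = \eta ,
\]
% where D is the (sparse) Wilson--Dirac matrix, \eta a source vector,
% and \psi the unknown quark field. CG-type solvers work on the
% normal equations
\[
  D^{\dagger} D\,\psi = D^{\dagger}\eta ,
\]
% so the dominant per-iteration cost is the matrix--vector product
% D v over the 4D space-time grid.
```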
7. Implementation of Lattice QCD in X10 CUDA
• We extend our lattice QCD in X10 into X10 CUDA
  – Port the whole solver into CUDA kernels
    • Wilson-Dirac operator, BLAS level-1 operations
    • Avoids wasteful memory-copy overheads between CPU and GPU, except for boundary data transfers
  – Implement boundary data transfer among GPUs (see the sketch below)
    • Add memory-copy operations: (1) GPU → CPU, (2) CPU → CPU, (3) CPU → GPU
  – Optimizations
    • Data layout transformation
    • Communication overlapping
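The three copy steps can be phrased with X10's asyncCopy; a minimal sketch, assuming a staging Rail on each CPU place and GlobalRail handles for the GPU and neighbor buffers (all names are illustrative):

```x10
class BoundaryExchange {
    // One exchange in one direction. Step (3), CPU -> GPU, runs
    // symmetrically at the neighbor once its data has arrived.
    static def exchange(gpuBnd:GlobalRail[Float],    // boundary slice on the local GPU
                        hostBnd:Rail[Float],         // staging buffer on the local CPU
                        remoteBnd:GlobalRail[Float], // staging buffer on the neighbor CPU
                        n:Long) {
        finish Rail.asyncCopy(gpuBnd, 0, hostBnd, 0, n);    // (1) GPU -> CPU
        finish Rail.asyncCopy(hostBnd, 0, remoteBnd, 0, n); // (2) CPU -> CPU
    }
}
```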
8. Data Layout Transformation
• Two types of data layouts for quarks and gluons
  – AoS (Array of Structures)
    • Non-contiguous data
    • Used in our original CPU implementation
  – SoA (Structure of Arrays)
    • Contiguous data
    • Suitable for vectorization
• We translate from AoS to SoA (both layouts are sketched below)
  – GPUs favor coalesced memory access
[Diagram: AoS keeps all spins (Spin 1 ... Spin m) of each site (x, y, z, t) together, from (0, 0, 0, 0) to (n-1, n-1, n-1, n-1); SoA keeps each spin component contiguous across sites.]
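Sketched as X10 types (the spinor field names are our assumptions): with AoS, a warp of GPU threads reading spin1 of consecutive sites strides across whole structures, whereas with SoA it reads consecutive words, which coalesces.

```x10
// AoS: all components of one lattice site are adjacent in memory.
class SiteAoS {
    var spin1:Float = 0.0f;
    var spin2:Float = 0.0f; // ... further spin/color components
}

// SoA: one Rail per component; component k of neighboring sites is
// contiguous, matching the GPU's coalesced-access pattern.
class FieldSoA {
    val spin1:Rail[Float];
    val spin2:Rail[Float];
    def this(nSites:Long) {
        spin1 = new Rail[Float](nSites);
        spin2 = new Rail[Float](nSites);
    }
}
```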
9. Communication Optimizations in X10 CUDA
• Two communication optimizations
  – Multi-dimensional partitioning
  – Communication overlapping (see the sketch below)
    • Overlap memory copies between GPU and CPU in addition to those between CPU and CPU
    • The overlappable region is limited by finish-based synchronization
[Timeline diagram: computation (GPU kernels) and communication (transfers between CPU and GPU, exchanges between CPU and CPU) for the T, X, Y, Z boundaries and the inner domain, with finish-based synchronization points.]
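A minimal sketch of this overlap structure and of the finish limitation (function bodies elided; names illustrative):

```x10
class OverlapStep {
    static def computeBulk() { /* GPU kernel on interior sites */ }
    static def exchangeBoundaries() { /* GPU<->CPU and CPU<->CPU copies */ }
    static def applyBoundaries() { /* apply received halos on the GPU */ }

    static def step() {
        // The two activities run concurrently, but the enclosing
        // finish is a full barrier: applyBoundaries() cannot start
        // until every activity completes, even ones it does not
        // depend on. This is the overlap limit noted on the slide.
        finish {
            async computeBulk();
            async exchangeBoundaries();
        }
        applyBoundaries();
    }
}
```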
10. Experimental Setup
• Performance comparison with other implementations
  – X10 C++, MPI C, MPI CUDA
  – One GPU per node
• Measurements
  – Weak scaling
  – Strong scaling
• Configuration
  – Measure the average iteration time of one convergence of the CG method
    • Typically 300-500 iterations
  – Problem size
    • (x, y, z, t) = (24, 24, 24, 96) (unless specified)
    • Fits on one Tesla K20X GPU
11. Experimental Environments
• TSUBAME 2.5 supercomputer (unless specified)
  – Up to 32 nodes (32 GPUs)
  – CPU-GPU: PCI-E 2.0 x16 (8 GB/sec)
  – Internode: QDR IB dual rail (10 GB/sec)
• Setup
  – 1 place per node
  – 1 GPU per place (X10 CUDA)
  – 12 threads per place (X10 C++, MPI C)
• Software
  – X10 version 2.4.3.2
  – CUDA version 6.0
  – OpenMPI 1.6.5

              CPU (2 per node)    GPU (3 per node)
  Model       Intel Xeon X5670    Tesla K20X
  # Cores     6                   2688
  Frequency   2.93 GHz            0.732 GHz
  Memory      54 GB               6 GB
  Memory BW   32 GB/sec           250 GB/sec
  Compiler    gcc 4.3.4           nvcc 6.0
12. Comparison of Weak Scaling with CPU-based Implementations
• X10 CUDA exhibits good weak scaling
  – 19.4x and 11.0x faster than X10 C++ and MPI C, respectively, on 32 nodes
  – X10 CUDA incurs no communication penalty even with large amounts of data on the GPUs
[Charts: performance (Gflops, log scale) and elapsed time per iteration (ms) vs. number of nodes (1-32) for MPI C, X10 C++, X10 CUDA DP, and X10 CUDA SP.]
13. Comparison of Weak Scaling with MPI CUDA
• Comparison on TSUBAME-KFC
  – X10 2.5.1, CUDA 5.5, OpenMPI 1.7.2
  – Problem size: 24 x 24 x 24 x 24 per node
• X10 CUDA shows scalability similar to MPI CUDA
• The performance gap widens as the number of nodes grows
[Charts: performance (Gflops, log scale) and elapsed time per iteration (msec) vs. number of nodes (1-32) for MPI CUDA DP and X10 CUDA DP.]
14. Strong Scaling Compared with CPU-based Implementations
• X10 CUDA outperforms both X10 C++ and MPI C
  – 8.27x and 4.57x faster, respectively, on 16 places
  – The scalability of X10 CUDA degrades as the number of nodes increases
[Chart: performance (Gflops, log scale) vs. number of nodes (1-32) for MPI C, X10 C++, X10 CUDA DP, and X10 CUDA SP.]
15. Comparison of Strong Scaling with MPI CUDA
• Comparison on TSUBAME-KFC
  – X10 2.5.1, CUDA 5.5, OpenMPI 1.7.2
• X10 CUDA exhibits comparable performance up to 4 nodes
• X10 CUDA suffers heavy overheads beyond 8 nodes
[Charts: performance (Gflops, log scale) and elapsed time per iteration (msec) vs. number of nodes (1-32) for MPI CUDA DP and X10 CUDA DP.]
16. Performance Breakdown of Lattice QCD in X10 CUDA
• Communication overhead increases with the number of nodes
  – Both boundary communication and MPI Allreduce
• The computation parts also do not scale linearly
[Stacked chart: elapsed time per iteration (ms) and communication overhead (%) vs. number of nodes (1-32), broken down into boundary communication, Allreduce, BLAS, and Wilson-Dirac, with the communication ratio overlaid.]
17. Comparison with MPI CUDA using Different Problem Sizes
• Comparison on TSUBAME-KFC
  – X10 2.5.1, CUDA 5.5
• X10 CUDA suffers heavy overhead on small problem sizes
  – 6.02x slower on 4 x 4 x 4 x 8
  – We attribute this to a constant per-iteration overhead in X10 CUDA
[Chart: performance (Gflops, log scale) for problem sizes (X x Y x Z x T) of 4x4x4x8, 8x8x8x16, 12x12x12x48, and 24x24x24x96 for MPI CUDA DP and X10 CUDA DP.]
18. Comparison of Productivity with MPI CUDA
• Lines of code
  – The X10 CUDA version contains 1.92x more lines of code than MPI CUDA in total
    • Mainly because X10 CUDA currently cannot call device functions inside CUDA kernels
• Compile time
  – X10 CUDA takes 11.3x longer to compile

  Lines of code         MPI CUDA   X10 CUDA
  Total                 4667       8942
  Wilson-Dirac          1590       6512

  Compile time [sec]    15.19      171.50
19. Pros/Cons of X10 CUDA from Our Study
• Advantages of X10 CUDA
  – Straightforward porting from X10 to X10 CUDA
    • Simply port the computation kernels into CUDA
    • Insert memory-copy operations between CPU and GPU
  – X10 CUDA exhibits good weak scaling
• Drawbacks of the current version of X10 CUDA
  – Limitations of programmability
    • X10 CUDA cannot call a function inside a kernel
  – Limitations of performance
    • finish-based synchronization incurs overhead
    • X10 CUDA does not support creating CUDA streams
20. Related Work
• High-performance large-scale lattice QCD computation
  – Peta-scale lattice QCD on a Blue Gene/Q supercomputer [Doi et al. 2012]
    • Fully overlaps communication and applies node-mapping optimization for BG/Q
• Lattice QCD using many-core accelerators
  – QUDA: a QCD library on GPUs [Clark et al. 2010]
    • Invokes multiple CUDA streams for overlapping
  – Lattice QCD on Intel Xeon Phi [Joo et al. 2013]
• PGAS language extensions for multi-node GPU clusters
  – XcalableMP extension for GPUs [Lee et al. 2012]
    • Demonstrated that their N-body implementation scales well
21. Conclusion
• Conclusion
  – GPUs accelerate lattice QCD significantly in the APGAS programming model
    • X10 CUDA exhibits good scalability in weak scaling
    • 19.4x speedup over plain X10 by using X10 CUDA on 32 nodes of TSUBAME 2.5
  – We reveal limitations of the current X10 CUDA
    • Performance overheads in strong scaling
    • Increased lines of code
• Future work
  – Performance improvement in strong scaling
    • More detailed analysis of overheads in X10 CUDA
23. Breakdown of Wilson-Dirac Computation in X10 CUDA
• Communication becomes dominant when using more than 16 places
  – A cause of the limited strong scaling
• Possible ways to improve scalability (see the sketch below)
  – Applying one-to-one synchronization
  – Improving the communication and synchronization operations themselves in X10
[Stacked chart: elapsed time (ms) vs. number of places (1-32), broken down into Gamma5, Bnd Set, Copy CPU to GPU, max(Bulk, Bnd), Bulk, and Bnd (Make + Comm.).]
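One-to-one synchronization might look like the following sketch (our assumption of such a scheme, not the talk's implementation): each direction waits only on its own neighbor exchange rather than on one global finish covering all directions.

```x10
class PerDirectionSync {
    static def exchangeWithNeighbor(dir:Long) { /* boundary copies for one direction */ }
    static def updateBoundary(dir:Long) { /* apply the received halo for one direction */ }

    static def step() {
        finish for (dir in 0..3) async {
            // Each direction synchronizes only with its own exchange,
            // so a fast neighbor is not held up by a slow one.
            finish exchangeWithNeighbor(dir);
            updateBoundary(dir);
        }
    }
}
```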
24. Comparison with Different Dimensions of Lattice Partitioning
• Comparison between 4D and 1D partitioning
  – 1D partitioning exhibits better strong scalability
  – 1.35x better on 32 GPUs
  – Still saturates beyond 16 GPUs
[Chart: strong scaling (Gflops) on 24x24x24x96 vs. number of nodes (= number of GPUs, 1-32) for X10 CUDA SP and X10 CUDA SP (div t).]
25. Comparison with QUDA
• QUDA [1]: a QCD library on GPUs
  – Highly optimized for GPUs
• QUDA exhibits better strong scalability
  – X10 CUDA is currently 30.4x slower
[Chart: Gflops (log scale) vs. number of places (= number of GPUs, 1-32) for X10 CUDA DP (div t), X10 CUDA SP (div t), QUDA SP recon 18, and QUDA SP recon 12.]

[1] Clark, Michael A., et al. "Solving lattice QCD systems of equations using mixed precision solvers on GPUs." Computer Physics Communications 181.9 (2010): 1517-1528.