Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Koichi Shirahata (1), Jun Doi (2), Mikio Takeuchi (2)
1: Tokyo Institute of Technology
2: IBM Research - Tokyo
Programming Models for Exascale Computing
•  GPU-based heterogeneous clusters
   –  e.g., TSUBAME 2.5 (3 GPUs x 1408 nodes)
   –  Acceleration using GPUs
      •  High peak performance and memory bandwidth
•  Programming models for GPU-based clusters
   –  Message passing (e.g., MPI)
      •  High tuning efficiency
      •  High programming cost
   –  APGAS (Asynchronous Partitioned Global Address Space)
      •  Abstracts distributed memory and the deep memory hierarchy
      •  X10: an instance of APGAS programming languages
→  Highly scalable and productive computing on GPUs using APGAS
Problem Statement
•  How much do GPUs accelerate applications using APGAS?
   –  Tradeoff between performance and productivity
•  Performance
   –  The abstraction of the memory hierarchy may limit performance
   –  Scalability on multiple GPUs
•  Productivity
   –  Can we use GPUs with little programming cost?
Goal and Contributions
•  Goal
   –  Scalable and productive computing on GPUs
•  Approach
   –  Performance analysis of lattice QCD in X10 CUDA
      •  Implement lattice QCD in X10 CUDA
      •  Comparative performance analysis of X10 on GPUs
      •  Confirm acceleration using GPUs in X10
   –  19.4x speedup over plain X10 by using X10 CUDA on 32 nodes of TSUBAME 2.5
The APGAS Model using GPUs
[Figure: CPU places (Place 0, Place 1, ..., Place N-1) hold activities and objects; each GPU is attached as a child place (Child Place 0, ..., Child Place M-1); work and data move between places via at, async, and asyncCopy]
X10 provides CUDA support for GPUs [1]
[1] Cunningham, D. et al. "GPU programming in a high level language: compiling X10 to CUDA." Proceedings of the 2011 ACM SIGPLAN X10 Workshop. ACM, 2011.
Lattice QCD
•  Lattice QCD
   –  Simulation of Quantum Chromodynamics (QCD) of quarks and gluons on a 4D grid of points in space and time
   –  A grand challenge in high-performance computing
      •  Requires high memory/network bandwidth and computational power
•  Computing lattice QCD
   –  Monte-Carlo simulations on the 4D grid
   –  Dominated by solving linear systems via matrix-vector multiplication with iterative methods (e.g., the CG method); a sketch of this solver structure follows below
   –  Parallelizable by dividing the 4D grid into partial grids, one per place
      •  Boundary exchanges are required between places in each direction
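As a rough illustration of that solver structure, the following is a minimal CG skeleton in host C++. The names applyA, dot, and axpy are hypothetical placeholders standing in for the Wilson-Dirac operator and the BLAS level 1 operations mentioned later in the deck; this is a sketch of the general algorithm, not the authors' implementation.

```cpp
// Minimal CG skeleton: the dominant cost is applyA (the Wilson-Dirac
// matrix-vector product), plus a few BLAS level-1 operations per iteration.
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void axpy(double alpha, const Vec& x, Vec& y) {  // y += alpha * x
    for (size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

// Solve A x = b; x is assumed zero-initialized, so the initial residual is b.
int cg(const std::function<void(const Vec&, Vec&)>& applyA,
       const Vec& b, Vec& x, double tol, int max_iter) {
    Vec r = b, p = b, Ap(b.size());
    double rr = dot(r, r);
    for (int k = 0; k < max_iter; ++k) {
        applyA(p, Ap);                      // dominant cost: operator apply
        double alpha = rr / dot(p, Ap);
        axpy(alpha, p, x);                  // BLAS-1 updates
        axpy(-alpha, Ap, r);
        double rr_new = dot(r, r);
        if (std::sqrt(rr_new) < tol) return k + 1;
        double beta = rr_new / rr;
        for (size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return max_iter;
}
```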
  
Implementation of Lattice QCD in X10 CUDA
•  We extend our lattice QCD code in X10 to X10 CUDA
   –  Port the whole solver into CUDA kernels
      •  Wilson-Dirac operator, BLAS level 1 operations
      •  Avoids wasteful memory copy overheads between CPU and GPU, except for boundary data transfers
   –  Implement boundary data transfer among GPUs
      •  Add memory copy operations (see the sketch after this list)
         –  (1) GPU → CPU, (2) CPU → CPU, (3) CPU → GPU
   –  Optimizations
      •  Data layout transformation
      •  Communication overlapping
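Since the deck benchmarks X10 CUDA against MPI CUDA, here is a hedged MPI + CUDA sketch of the three-step boundary (halo) exchange listed above, for one lattice direction. The buffer names and the single MPI_Sendrecv shape are illustrative assumptions, not the authors' code; the X10 CUDA version expresses the same copies with asyncCopy between places.

```cpp
// Three-step halo exchange: (1) GPU -> CPU, (2) CPU -> CPU over the
// network, (3) CPU -> GPU, as listed on the slide above.
#include <cuda_runtime.h>
#include <mpi.h>

void exchange_boundary(const double* d_send, double* d_recv,  // device buffers
                       double* h_send, double* h_recv,        // host staging buffers
                       int count, int fwd_rank, int bwd_rank)
{
    size_t bytes = (size_t)count * sizeof(double);

    // (1) copy the boundary slice from GPU memory to the host staging buffer
    cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);

    // (2) exchange staging buffers with the neighboring node
    MPI_Sendrecv(h_send, count, MPI_DOUBLE, fwd_rank, 0,
                 h_recv, count, MPI_DOUBLE, bwd_rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // (3) copy the received halo from the host buffer into GPU memory
    cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);
}
```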
  
Data Layout Transformation
•  Two types of data layouts for quarks and gluons
   –  AoS (Array of Structures)
      •  Non-contiguous data
      •  Used in our original CPU implementations
   –  SoA (Structure of Arrays)
      •  Contiguous data
      •  Suitable for vectorization
•  We translate from AoS to SoA (illustrated by the sketch below)
   –  GPUs are best suited to coalesced memory access
[Figure: AoS vs. SoA layouts of spin components Spin 1 ... Spin m over lattice sites (x, y, z, t) = (0,0,0,0) ... (n-1, n-1, n-1, n-1)]
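A minimal CUDA sketch of the AoS-to-SoA motivation: with the SoA layout, neighboring threads touch neighboring addresses, so global-memory loads coalesce. The three-component struct is a simplified stand-in for the real multi-component spinor fields and is illustrative only.

```cpp
#include <cuda_runtime.h>

struct SpinAoS { double s0, s1, s2; };             // array of structures

__global__ void scale_aos(SpinAoS* f, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                                    // strided accesses: thread i
        f[i].s0 *= a;                               // touches scattered words
        f[i].s1 *= a;
        f[i].s2 *= a;
    }
}

// structure of arrays: one contiguous array per component
__global__ void scale_soa(double* s0, double* s1, double* s2, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                                    // coalesced: neighboring
        s0[i] *= a;                                 // threads read neighboring
        s1[i] *= a;                                 // elements of each array
        s2[i] *= a;
    }
}
```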
Communication Optimizations in X10 CUDA
•  Two communication optimizations
   –  Multi-dimensional partitioning
   –  Communication overlapping (contrast with the stream-based sketch below)
      •  Overlap the memory copy between GPU and CPU in addition to the exchange between CPU and CPU
      •  The overlap window is limited by finish-based synchronization
[Figure: timeline of computation (GPU kernels for the T, X, Y, Z boundaries and the inner domain) overlapped with communication (CPU-GPU transfers and CPU-CPU exchanges), separated by finish-based synchronization points]
Experimental Setup
•  Performance comparison with other implementations
   –  X10 C++, MPI C, MPI CUDA
   –  Use one GPU per node
•  Measurements
   –  Weak scaling
   –  Strong scaling
•  Configuration
   –  Measure the average iteration time over one convergence of the CG method
      •  Typically 300 - 500 iterations
   –  Problem size
      •  (x, y, z, t) = (24, 24, 24, 96) (unless specified otherwise)
      •  Fits on one Tesla K20X GPU
Experimental Environments
•  TSUBAME2.5 supercomputer (unless specified otherwise)
   –  Use up to 32 nodes (32 GPUs)
   –  CPU-GPU: PCI-E 2.0 x16 (8 GB/sec)
   –  Internode: QDR IB dual rail (10 GB/sec)
•  Setup
   –  1 place per node
   –  1 GPU per place (X10 CUDA)
   –  12 threads per place (X10 C++, MPI C)
•  Software
   –  X10 version 2.4.3.2
   –  CUDA version 6.0
   –  OpenMPI 1.6.5

                 2 CPUs / node         3 GPUs / node
  Model          Intel® Xeon® X5670    Tesla K20X
  # Cores        6                     2688
  Frequency      2.93 GHz              0.732 GHz
  Memory         54 GB                 6 GB
  Memory BW      32 GB/sec             250 GB/sec
  Compiler       gcc 4.3.4             nvcc 6.0
Comparison of Weak Scaling with CPU-based Implementations
[Figure: weak scaling on 1-32 nodes; performance in Gflops (log scale) and elapsed time per iteration in ms; series: MPI C, X10 C++, X10 CUDA DP, X10 CUDA SP]
•  X10 CUDA exhibits good weak scaling
   –  19.4x and 11.0x faster than X10 C++ and MPI C, respectively, on 32 nodes
   –  X10 CUDA does not incur a communication penalty even with a large amount of data on the GPUs
Comparison of Weak Scaling with MPI CUDA
•  Comparison on TSUBAME-KFC
   –  X10 2.5.1, CUDA 5.5, OpenMPI 1.7.2
   –  Problem size: 24 x 24 x 24 x 24 per node
•  X10 CUDA shows scalability similar to MPI CUDA
•  The performance gap widens as the number of nodes grows
[Figure: weak scaling on 1-32 nodes; performance in Gflops (log scale) and elapsed time per iteration in msec; series: MPI CUDA DP, X10 CUDA DP]
Strong Scaling Compared with CPU-based Implementations
•  X10 CUDA outperforms both X10 C++ and MPI C
   –  8.27x and 4.57x faster, respectively, on 16 places
   –  The scalability of X10 CUDA gets poorer as the number of nodes increases
[Figure: strong scaling on 1-32 nodes; performance in Gflops (log scale); series: MPI C, X10 C++, X10 CUDA DP, X10 CUDA SP]
Comparison of Strong Scaling with MPI CUDA
•  Comparison on TSUBAME-KFC
   –  X10 2.5.1, CUDA 5.5, OpenMPI 1.7.2
•  X10 CUDA achieves comparable performance up to 4 nodes
•  X10 CUDA suffers heavy overheads on 8 or more nodes
[Figure: strong scaling on 1-32 nodes; performance in Gflops (log scale) and elapsed time per iteration in msec; series: MPI CUDA DP, X10 CUDA DP]
Performance Breakdown of Lattice QCD in X10 CUDA
•  Communication overhead increases with the number of nodes
   –  Both boundary communication and MPI Allreduce
•  The computation parts also do not scale linearly
[Figure: elapsed time per iteration in ms on 1-32 nodes, broken down into boundary communication, Allreduce, BLAS, and Wilson-Dirac, with the communication ratio in % on a secondary axis]
Comparison with MPI CUDA using Different Problem Sizes
•  Comparison on TSUBAME-KFC
   –  X10 2.5.1, CUDA 5.5
•  X10 CUDA suffers heavy overhead on small problem sizes
   –  6.02x slower on 4 x 4 x 4 x 8
   –  We consider that X10 CUDA suffers a constant overhead
[Figure: performance in Gflops (log scale) for problem sizes 4x4x4x8, 8x8x8x16, 12x12x12x48, 24x24x24x96; series: MPI CUDA DP, X10 CUDA DP]
Comparison of Productivity with MPI CUDA
•  Lines of code
   –  The X10 CUDA version contains 1.92x more lines of code than MPI CUDA in total
      •  Currently X10 CUDA cannot call device functions inside CUDA kernels (see the sketch below)
•  Compile time
   –  X10 CUDA takes 11.3x longer to compile

                          MPI CUDA    X10 CUDA
  Total                   4667        8942
  Wilson-Dirac            1590        6512
  Compile time [sec]      15.19       171.50
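To make the "no device functions" limitation concrete, here is a small CUDA sketch of the kind of factoring that becomes impossible when kernels cannot call functions: in CUDA C a __device__ helper can be reused from a kernel, whereas in the current X10 CUDA the equivalent body has to be written inline at every use, which is where the code-size gap comes from. The helper and kernel here are toy illustrations, not the authors' Wilson-Dirac code.

```cpp
#include <cuda_runtime.h>

// reusable helper; X10 CUDA kernels cannot call such a function,
// so its body would have to be duplicated inline at each use
__device__ double row_dot3(const double* row, const double* v) {
    return row[0] * v[0] + row[1] * v[1] + row[2] * v[2];
}

__global__ void apply_links(const double* links, const double* in,
                            double* out, int n_sites) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_sites) {
        const double* row = &links[3 * i];   // toy per-site data layout
        const double* v   = &in[3 * i];
        out[i] = row_dot3(row, v);           // factored call site
    }
}
```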
Pros/Cons of X10 CUDA from Our Study
•  Advantages of X10 CUDA
   –  Straightforward porting from X10 to X10 CUDA
      •  Simply port the computation kernels to CUDA
      •  Insert memory copy operations between CPU and GPU
   –  X10 CUDA exhibits good weak scaling
•  Drawbacks of the current version of X10 CUDA
   –  Limitations of programmability
      •  X10 CUDA cannot call a function inside a kernel
   –  Limitations of performance
      •  finish-based synchronization incurs overhead
      •  X10 CUDA does not support creating CUDA streams
Related Work
•  High performance large-scale lattice QCD computation
   –  Peta-scale lattice QCD on a Blue Gene/Q supercomputer [Doi et al. 2012]
      •  Fully overlapping communication and applying node-mapping optimization for BG/Q
•  Lattice QCD using many-core accelerators
   –  QUDA: a QCD library on GPUs [Clark et al. 2010]
      •  Invokes multiple CUDA streams for overlapping
   –  Lattice QCD on Intel Xeon Phi [Joo et al. 2013]
•  PGAS language extensions for multi-node GPU clusters
   –  XcalableMP extension for GPUs [Lee et al. 2012]
      •  Demonstrated that their N-body implementation scales well
Conclusion
•  Conclusion
   –  GPUs accelerate lattice QCD significantly in the APGAS programming model
      •  X10 CUDA exhibits good scalability in weak scaling
      •  19.4x speedup over plain X10 by using X10 CUDA on 32 nodes of TSUBAME 2.5
   –  We reveal limitations in the current X10 CUDA
      •  Performance overheads in strong scaling
      •  Increase in lines of code
•  Future work
   –  Performance improvement in strong scaling
      •  More detailed analysis of overheads in X10 CUDA
•  Backup
Breakdown of Wilson-Dirac Computation in X10 CUDA
•  Communication becomes dominant when using more than 16 places
   –  A cause of the limit on strong scaling
•  Possible ways to improve the scalability
   –  Applying one-to-one synchronization
   –  Improving the communication and synchronization operations themselves in X10
[Figure: elapsed time in ms on 1-32 places, broken down into Gamma5, Bnd Set, Copy CPU to GPU, max(Bulk, Bnd), Bulk, and Bnd (Make + Comm.)]
Comparison with Different Dimensions of Lattice Partitioning
•  Comparison between 4D and 1D partitioning
   –  1D partitioning exhibits better strong scalability
   –  1.35x better on 32 GPUs
   –  Still saturates when using over 16 GPUs
[Figure: strong scaling (24x24x24x96) in Gflops on 1-32 GPUs; series: X10 CUDA SP (4D partitioning), X10 CUDA SP (div t, 1D partitioning)]
Comparison with QUDA
•  QUDA [1]: a QCD library on GPUs
   –  Highly optimized for GPUs
•  QUDA exhibits better strong scalability
   –  X10 CUDA is currently 30.4x slower
[Figure: strong scaling in Gflops (log scale) on 1-32 places (GPUs); series: X10 CUDA DP (div t), X10 CUDA SP (div t), QUDA SP recon 18, QUDA SP recon 12]
[1] Clark, Michael A., et al. "Solving Lattice QCD systems of equations using mixed precision solvers on GPUs." Computer Physics Communications 181.9 (2010): 1517-1528.
