Performance Analysis of
Lattice QCD Application with APGAS
Programming Model
Koichi Shirahata1, Jun Doi2, Mikio Takeuchi2
1: Tokyo Institute of Technology
2: IBM Research - Tokyo
Programming Models for Exascale
Computing
•  Extremely parallel supercomputers
–  It is expected that the first exascale supercomputer will be deployed
by 2020
–  It is still unknown which programming model will allow easy development and high performance
•  Programming models for extremely parallel supercomputers
–  Partitioned Global Address Space (PGAS)
•  Global view of distributed memory
•  Asynchronous PGAS (APGAS)
Highly Scalable and Productive Computing using APGAS
Programming Model
Problem Statement
•  How does the performance of the APGAS programming model compare with the existing message passing model?
–  Message Passing (MPI)
•  Good tuning efficiency
•  High programming complexity
–  Asynchronous Partitioned Global Address Space
(APGAS)
•  High programming productivity, Good scalability
•  Limited tuning efficiency
Approach
•  Performance analysis of lattice QCD application with
APGAS programming model
–  Lattice QCD
•  one of the most challenging applications for supercomputers
–  Implement lattice QCD in X10
•  Port C++ lattice QCD to X10
•  Parallelize using APGAS programming model
–  Performance analysis of lattice QCD in X10
•  Analyze parallel efficiency of X10
•  Compare the performance of X10 with MPI
Goal and Contributions
•  Goal
–  Highly scalable computing using APGAS
programming model
•  Contributions
–  Implementation of lattice QCD application in X10
•  Several optimizations on lattice QCD in X10
–  Detailed performance analysis on lattice QCD in X10
•  102.8x speedup in strong scaling
•  MPI performs 2.26x – 2.58x faster, due to the limited
communication overlapping in X10
Table of Contents
•  Introduction
•  Implementation of lattice QCD in X10
–  Lattice QCD application
–  Lattice QCD with APGAS programming model
•  Evaluation
–  Performance of multi-threaded lattice QCD
–  Performance of distributed lattice QCD
•  Related Work
•  Conclusion
Lattice QCD
•  Lattice QCD
–  Common technique to simulate a field theory (e.g. the Big Bang) of Quantum ChromoDynamics (QCD) of quarks and gluons on a 4D grid of points in space and time
–  A grand challenge in high-performance computing
•  Requires high memory/network bandwidth and computational power
•  Computing lattice QCD
–  Monte-Carlo simulations on the 4D grid
–  Dominated by solving a system of linear equations of matrix-vector multiplication using iterative methods (e.g. the CG method; see the sketch below)
–  Parallelizable by dividing the 4D grid into partial grids, one per place
•  Boundary exchanges are required between places in each direction
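To make the solver step concrete, here is a minimal X10 sketch of a plain conjugate gradient loop for a real-valued system A x = b. The applyA parameter stands in for the lattice QCD matrix-vector multiplication (the real operator acts on complex quark fields); the class name, the real-valued simplification, and the convergence test are illustrative assumptions, not the authors' code.

    public class CGSketch {
        // dot product of two vectors stored as 1-D Rails
        static def dot(a:Rail[Double], b:Rail[Double]):Double {
            var s:Double = 0.0;
            for (i in 0..(a.size-1)) s += a(i) * b(i);
            return s;
        }

        // plain CG for A x = b; applyA(v, out) writes A*v into out
        static def cg(applyA:(Rail[Double],Rail[Double])=>void,
                      b:Rail[Double], x:Rail[Double], tol:Double, maxIter:Int) {
            val n = b.size;
            val r = new Rail[Double](n);
            val p = new Rail[Double](n);
            val ap = new Rail[Double](n);
            applyA(x, ap);                                     // ap = A x
            for (i in 0..(n-1)) { r(i) = b(i) - ap(i); p(i) = r(i); }
            var rr:Double = dot(r, r);
            for (iter in 1..maxIter) {
                if (rr < tol * tol) break;                     // ||r|| < tol: converged
                applyA(p, ap);                                 // ap = A p
                val alpha = rr / dot(p, ap);
                for (i in 0..(n-1)) {
                    x(i) = x(i) + alpha * p(i);
                    r(i) = r(i) - alpha * ap(i);
                }
                val rrNew = dot(r, r);
                for (i in 0..(n-1)) p(i) = r(i) + (rrNew / rr) * p(i);
                rr = rrNew;
            }
        }
    }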
Implementation of lattice QCD in X10
•  Fully ported from sequential C++ implementation
•  Data structure
–  Use the Rail class (1D array) to store the 4D arrays of quarks and gluons (see the indexing sketch below)
•  Parallelization
–  Partition 4D grid into places
•  Calculate memory offsets on each place at initialization
•  Boundary exchanges use the asynchronous copy function
•  Optimizations
–  Communication optimizations
•  Overlap boundary exchange and bulk computations
–  Hybrid parallelization
•  Places and threads
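A minimal sketch of the data-layout idea, assuming a row-major linearization and an even split of the T dimension across places; the lattice sizes, tPerPlace, and field names are illustrative, not the authors' actual layout code.

    public class GridSketch {
        static val NX = 16L; static val NY = 16L; static val NZ = 16L;

        // row-major linearization of a 4-D site (x,y,z,t) into a 1-D Rail index
        static def siteIndex(x:Long, y:Long, z:Long, t:Long):Long {
            return ((t * NZ + z) * NY + y) * NX + x;
        }

        public static def main(args:Rail[String]) {
            val tPerPlace = 8L;              // assumed: T = 32 split over 4 places
            finish for (p in Place.places()) at (p) async {
                // each place owns a slab of tPerPlace time slices of the quark field
                val local = new Rail[Double](NX * NY * NZ * tPerPlace);
                // touch every local site through the linearized index
                for (t in 0..(tPerPlace-1)) for (z in 0..(NZ-1))
                    for (y in 0..(NY-1)) for (x in 0..(NX-1))
                        local(siteIndex(x, y, z, t)) = 0.0;
            }
        }
    }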
Communication Optimizations
•  Communication overlapping using the “asyncCopy” function
–  “asyncCopy” creates a new thread and then copies asynchronously
–  Completion of “asyncCopy” is awaited with the “finish” construct (see the sketch after the figure)
•  Communication through Put-wise operations
–  Put-wise communication uses one-sided communication, while Get-wise communication uses two-sided communication
•  Communication is not fully overlapped in the current implementation
–  “finish” requires all the places to synchronize
[Figure: execution timeline of computation (boundary data creation, bulk multiplication, boundary reconstruct) and communication (boundary exchange) in the T, X, Y, and Z directions, separated by barrier synchronizations.]
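A minimal sketch of the overlap pattern described above, assuming X10 2.4's GlobalRail and Rail.asyncCopy API; the buffers and the two kernel stubs are placeholders rather than the authors' code.

    class OverlapSketch {
        // placeholders for the real kernels
        static def bulkMultiplication() { /* multiply interior sites */ }
        static def reconstructBoundary() { /* apply the received boundary */ }

        // overlap the boundary exchange with the bulk computation
        static def exchangeAndCompute(sendBuf:Rail[Double],
                                      remoteRecv:GlobalRail[Double],
                                      nBoundary:Long) {
            finish {
                // asyncCopy starts the transfer and returns immediately;
                // the enclosing finish tracks its completion
                Rail.asyncCopy(sendBuf, 0L, remoteRecv, 0L, nBoundary);
                bulkMultiplication();          // runs while the copy is in flight
            } // blocks until both the copy and the bulk work are done
            reconstructBoundary();             // needs the boundary, so it runs after finish
        }

        public static def main(args:Rail[String]) {
            val n = 64L;
            val send = new Rail[Double](n);
            // receive buffer allocated here for the demo; in the real code it lives on the neighbour
            val recv = GlobalRail[Double](new Rail[Double](n));
            exchangeAndCompute(send, recv, n);
        }
    }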
Hybrid Parallelization
•  Hybrid parallelization over places and threads (activities)
•  Parallelization strategies for places
–  (1) Activate places for each parallelizable part of computation
–  (2) Barrier-based synchronization
•  Call “finish” for places at the beginning of each CG iteration
→ We adopt (2), since calling “finish” for each parallelizable part of computation increases the synchronization overhead
•  Parallelization strategies for threads
–  (1) Activate threads for each parallelizable part of computation
–  (2) Clock-based synchronization
•  Call “finish” for threads at the beginning of each CG iteration
→ We adopt (1), since we observed that “finish” scales better than clock-based synchronization (see the sketch below)
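A minimal sketch of the adopted combination, strategy (2) for places and strategy (1) for threads; the kernel names, activity count, and iteration count are illustrative placeholders, not the authors' code.

    class HybridSketch {
        static val NTHREADS = 2L;                  // activities per place (illustrative)

        // strategy (1) for threads: activate activities per parallelizable part
        static def parallelPart() {
            finish for (t in 0..(NTHREADS-1)) async {
                // each activity processes its chunk of the local sites
            }
        }

        // one CG iteration: a sequence of parallelizable parts
        static def cgIteration() {
            parallelPart();    // e.g. matrix-vector multiplication
            parallelPart();    // e.g. dot products
            parallelPart();    // e.g. vector updates
        }

        public static def main(args:Rail[String]) {
            for (iter in 1..10) {                  // illustrative iteration count
                // strategy (2) for places: one finish over all places per CG iteration
                finish for (p in Place.places()) at (p) async {
                    cgIteration();
                }
            }
        }
    }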
Table of Contents
•  Introduction
•  Implementation of lattice QCD in X10
–  Lattice QCD application
–  Lattice QCD with APGAS programming model
•  Evaluation
–  Performance of multi-threaded lattice QCD
–  Performance of distributed lattice QCD
•  Related Work
•  Conclusion
Evaluation
•  Objective
–  Analyze parallel efficiency of our lattice QCD in X10
–  Comparison with lattice QCD in MPI
•  Measurements
–  Effect of multi-threading
•  Comparison of multi-threaded X10 with OpenMP on a single
node
•  Comparison of hybrid parallelization with MPI+OpenMP
–  Scalability on multiple nodes
•  Comparison of our distributed X10 implementation with MPI
•  Measure strong/weak scaling up to 256 places
•  Configuration
–  Measure the elapsed time of one convergence of the CG method
•  Typically 300 to 500 CG iterations
–  Compare native X10 (C++) and MPI C
Experimental Environments
•  IBM BladeCenter HS23 (Use 1 node for multi-threaded performance)
–  CPU: Xeon E5 2680 (2.70GHz, L1=32KB, L2=256KB, L3=20MB, 8 cores) x2
sockets, SMT enabled
–  Memory: 32 GB
–  MPI: MPICH2 1.2.1
–  g++: v4.4.6
–  X10: 2.4.0 trunk r25972 (built with “-Doptimize=true -DNO_CHECKS=true”)
–  Compile option
•  Native X10: -x10rt mpi -O -NO_CHECKS
•  MPI C: -O2 -finline-functions -fopenmp
•  IBM Power 775 (Use up to 13 nodes for scalability study)
–  CPU: Power7 (3.84 GHz, 32 cores), SMT Enabled
–  Memory: 128 GB
–  xlC_r: v12.1
–  X10: 2.4.0 trunk r26346 (built with “-Doptimize=true -DNO_CHECKS=true”)
–  Compile option
•  Native X10: -x10rt pami -O -NO_CHECKS
•  MPI C: -O3 -qsmp=omp
Performance on Single Place
•  Multi-thread parallelization (on 1 Place)
–  Create multiple threads (activities) for each parallelizable part of
computation
–  Problem size: (x, y, z, t) = (16, 16, 16, 32)
•  Results
–  Native X10 with 8 threads exhibits 4.01x speedup over 1 thread
–  Performance of X10 is 71.7% of OpenMP on 8 threads
–  Comparable scalability with OpenMP
[Figures: Elapsed Time (lower is better) and Strong Scaling relative to 1 thread of each implementation (higher is better); 4.01x speedup, 71.7% of OpenMP on 8 threads.]
Performance on Different Problem Sizes
•  Performance on (x,y,z,t) = (8,8,8,16)
–  Poor scalability on Native X10 (2.18x on 8 threads, 33.4% of OpenMP)
•  Performance on (x,y,z,t) = (16,16,16,32)
–  Good scalability on Native X10 (4.01x on 8 threads, 71.7% of OpenMP)
Performance on Different Problem Sizes (Breakdown)
•  Performance on (x,y,z,t) = (8,8,8,16)
–  Poor scalability on Native X10 (2.18x on 8 threads, 33.4% of OpenMP)
•  Performance on (x,y,z,t) = (16,16,16,32)
–  Good scalability on Native X10 (4.01x on 8 threads, 71.7% of OpenMP)
[Figure: same charts as the previous slide, with a callout to the breakdown of the (8,8,8,16) case on the next slide.]
Performance Breakdown on Single Place
•  Running Native X10 with multiple threads suffers from significant overheads:
–  Thread activation (20.5% overhead on 8 threads)
–  Thread synchronization (19.2% overhead on 8 threads)
–  Computation is also slower than on OpenMP (36.3% slower)
[Figure: time breakdown on 8 threads (consecutive elapsed time of 460 CG steps, xp-way computation): synchronization time 151.4 for Native X10 vs. 68.41 for OpenMP; activation overhead 20.5%, synchronization overhead 19.2%, computation 36.3% slower than OpenMP.]
Comparison of Hybrid Parallelization
with MPI + OpenMP
•  Hybrid Parallelization
–  Comparison with MPI
•  Measurement
–  Use 1 node (2 sockets of
8 cores, SMT enabled)
–  Vary # Processes and # Threads
s.t. (#Processes) x (# Threads)
is constant
•  Best performance when
(# Processes, # Threads) =
(4, 4) in MPI, and (16, 2) in X10
–  2 threads per node exhibits the best performance
–  1 thread per node also exhibits performance similar to 2 threads
[Figures: elapsed time (lower is better) with (#Processes) x (#Threads) fixed at 16 and at 32.]
Table of Contents
•  Introduction
•  Implementation of lattice QCD in X10
–  Lattice QCD application
–  Lattice QCD with APGAS programming model
•  Evaluation
–  Performance of multi-threaded lattice QCD
–  Performance of distributed lattice QCD
•  Related Work
•  Conclusion
Comparison with MPI
[Figure: execution timelines of the X10 implementation (with barrier synchronizations via “finish”) and the MPI implementation (no barrier synchronizations), showing boundary data creation, boundary exchange, bulk multiplication, and boundary reconstruct in the T, X, Y, and Z directions.]
•  Compare the performance of X10 lattice QCD with our MPI-based lattice QCD
–  Point-to-point communication using MPI_Isend/Irecv (no barrier synchronizations)
Strong Scaling: Comparison with MPI
•  Measurement on IBM Power 775 (using up to 13 nodes)
–  Increase #Places up to 256 places (19-20 places / node)
•  Not using hybrid parallelization
–  Problem size: (x, y, z, t) = (32, 32, 32, 64)
•  Results
–  102.8x speedup on 256 places compared to 1 place
–  MPI exhibits better scalability
•  2.58x faster than X10 on 256 places
[Figures: Strong Scaling (higher is better) and Elapsed Time (lower is better) for problem size (x, y, z, t) = (32, 32, 32, 64); 102.8x speedup.]
Effect of Communication Optimization
[Figure: execution timelines of the simple X10 version (Put, Get) and the overlapped X10 version (Get overlap), showing boundary data creation, boundary exchange, bulk multiplication, boundary reconstruct, and barrier synchronizations in the T, X, Y, and Z directions.]
•  Comparison of Put-wise and Get-wise communications (see the sketch below)
–  Put: “at” to the source place, then copy data to the destination place
–  Get: “at” to the destination place, then copy data from the source place
–  Apply communication overlapping (in Get-wise communication)
–  Multiple copies in a finish
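A minimal sketch of the two directions, assuming X10 2.4's GlobalRail and Rail.asyncCopy API; the buffer size and the neighbour selection are illustrative, and in the real code each place exchanges boundaries with its neighbour in every direction.

    class PutGetSketch {
        public static def main(args:Rail[String]) {
            val n = 1024L;
            val local = new Rail[Double](n);
            // pick some other place as the neighbour (falls back to here on 1 place)
            var nb:Place = here;
            for (p in Place.places()) if (p != here) nb = p;
            val neighbour = nb;

            // Put-wise: stay at the source place and push into a remote buffer
            val remoteRecv = at (neighbour) GlobalRail[Double](new Rail[Double](n));
            finish Rail.asyncCopy(local, 0L, remoteRecv, 0L, n);

            // Get-wise: move to the destination place and pull from the source
            val src = GlobalRail[Double](local);
            at (neighbour) {
                val dst = new Rail[Double](n);
                finish Rail.asyncCopy(src, 0L, dst, 0L, n);
            }
        }
    }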
Results:
Effect of Communication Optimization
•  Comparison of PUT with GET in communication
–  PUT performs better strong scaling
•  Underlying PUT implementation in PAMI uses one-sided communication
while GET implementation uses two-sided communication
[Figures: Strong Scaling (higher is better) and Elapsed Time (lower is better) for problem size (x, y, z, t) = (32, 32, 32, 64).]
Weak Scaling: Comparison with MPI
•  Measurement on IBM Power 775
–  Increase #Places up to 256 places (19-20 places / node)
–  Problem size per Place: 131072 ((x, y, z, t) = (16, 16, 16, 32))
•  Results
–  97.5x speedup on 256 places
–  MPI exhibits better scalability
•  2.26x faster than X10 on 256 places
•  The MPI implementation performs more overlapping of communication
[Figures: Weak Scaling (higher is better) and Elapsed Time (lower is better) for a per-place problem size of 131072 sites; 97.5x speedup.]
Table of Contents
•  Introduction
•  Implementation of lattice QCD in X10
–  Lattice QCD application
–  Lattice QCD with APGAS programming model
•  Evaluation
–  Performance of multi-threaded lattice QCD
–  Performance of distributed lattice QCD
•  Related Work
•  Conclusion
Related work
•  Peta-scale lattice QCD on a Blue Gene/Q supercomputer [1]
–  Fully overlapping communication and applying node-mapping
optimization for BG/Q
•  Performance comparison of PGAS with MPI [2]
–  Compare the performance of Co-Array Fortran (CAF) with MPI on
micro benchmarks
•  Hybrid programming model of PGAS and MPI [3]
–  Hybrid programming of Unified Parallel C (UPC) and MPI, which allows MPI programmers incremental access to a greater amount of memory by aggregating the memory of several nodes into a global address space
[1]: Doi, J.: Peta-scale Lattice Quantum Chromodynamics on a Blue Gene/Q supercomputer
[2]: Shan, H. et al.: A preliminary evaluation of the hardware acceleration of the Cray Gemini Interconnect for PGAS languages and comparison with MPI
[3]: Dinan, J. et al.: Hybrid parallel programming with MPI and Unified Parallel C
Conclusion
•  Summary
–  Towards highly scalable computing with APGAS programming model
–  Implementation of lattice QCD application in X10
•  Include several optimizations
–  Detailed performance analysis on lattice QCD in X10
•  102.8x speedup in strong scaling, 97.5x speedup in weak scaling
•  MPI performs 2.26x – 2.58x faster, due to the limited
communication overlapping in X10
•  Future work
–  Further optimizations for lattice QCD in X10
•  Further overlapping by using point-to-point synchronizations
•  Accelerate computational parts using GPUs
–  Performance analysis on supercomputers
•  IBM BG/Q, TSUBAME 2.5
Source Code of Lattice QCD in X10
and MPI
•  Available from the following URL
–  https://svn.code.sourceforge.net/p/x10/code/applications/trunk/LatticeQCD/
Backup
APGAS Programming Model
•  PGAS (Partitioned Global Address Space)
programming model
–  Global Address Space
•  Every thread sees entire data set
–  Partitioned
•  Global address space is divided into multiple memories
•  APGAS programming model
–  Asynchronous PGAS (APGAS)
•  Threads can be created dynamically under programmer control
–  X10 is a language implementing the APGAS model
•  A new activity (thread) is created dynamically using the “async” construct
•  PGAS memories are called Places (processes)
–  Code moves to another memory using the “at” construct
•  Threads and places are synchronized by calling “finish” (see the sketch below)
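A minimal "hello" sketch of these constructs, close to the standard X10 HelloWholeWorld example; the printed messages are illustrative.

    import x10.io.Console;

    class ApgasSketch {
        public static def main(args:Rail[String]) {
            finish for (p in Place.places()) {     // wait for every activity below
                at (p) async {                     // run an activity at place p
                    Console.OUT.println(here + " says hello");
                }
            }
            Console.OUT.println("all places have answered");
        }
    }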
Existing Implementation of Lattice
QCD in MPI
•  Partition 4D grid into MPI processes
– Boundary data creation => Boundary
exchange => Bulk update
•  Computation and communication overlap
– Overlap Boundary exchange and Boundary/
Bulk update
•  Hybrid parallelization
– MPI + OpenMP
Effect of Communication Optimization
[Figure: execution timelines of the simple X10 version (Put-wise), the overlapped X10 version (Get-wise overlap), and the overlapped MPI version, showing boundary data creation, boundary exchange, bulk multiplication, boundary reconstruct, and barrier synchronizations in the T, X, Y, and Z directions.]
Hybrid Performance on Multiple Nodes
•  Hybrid parallelization on multiple nodes
•  Measurement
–  Use up to 4 nodes
–  (# Places, # Threads) = (32, 2) shows best performance
•  (# Places, # Threads) = (8, 2) per node
[Figure: elapsed time on multiple threads (lower is better).]
Performance of Hybrid Parallelization on Single Node
•  Hybrid Parallelization
–  Places and Threads
•  Measurement
–  Use 1 node (2 sockets of
8 cores, HT enabled)
–  Vary # Places and #
Threads from 1 to 32 for
each
•  Best performance when
(# Places, # Threads) = (16, 2)
•  Best balance when
(# Places) x (# Threads) = 32
[Figures: thread scalability and place scalability, elapsed time (lower is better).]
Lattice QCD (Koichi)
•  Time breakdown of Lattice QCD (non-overlapping version)
•  Communication overhead causes the performance saturation
–  Communication overhead increases in proportion to the number of nodes
–  Communication ratio increases in proportion to the number of places
[Figure: elapsed time (lower is better).]
Xeon(Sandy bridge) E5 2680 (2.70GHz, L1=32KB, L2=256KB, L3=20MB, 8 cores, HT ON) x2
DDR-3 32GB, Red Hat Enterprise Linux Server 6.3 (2.6.32-279.el6.x86_64)
X10 trunk r25972 (built with “-Doptimize=true -DNO_CHECKS=true”)
g++: 4.4.7
Compile option for native x10: -x10rt mpi -O -NO_CHECKS
Lattice QCD (Koichi)
•  Time breakdown of a part of computation in a single finish (64, 128 places)
–  Significant degradation on 128 places compared to 64 places
–  Similar behavior on each place between using 64 places and 128 places
–  Hypothesis: invocation overhead of “at async” and/or synchronization overhead of “finish”
[Figure: elapsed time of a single “finish” on 64 places (3,564,331 ns) vs. 128 places (19,565,270 ns): a 5.49x degradation.]
Lattice QCD (Koichi)
•  “at” vs. “asyncCopy” for data transfer
–  asyncCopy performs 36.2% better when using 2 places (1 place per node)
[Figure: elapsed time (lower is better).]
