TERRESTRIAL SYSTEMS MODELLING PLATFORM
J. BENKE, D. CAVIEDES VOULLIEME, S. POLL, G. TASHAKOR, I. ZHUKOV
JÜLICH SUPERCOMPUTING CENTRE (JSC)
PORTING A COUPLED MULTISCALE AND MULTIPHYSICS EARTH SYSTEM MODEL TO
HETEROGENEOUS ARCHITECTURES (BENCHMARKING AND PERFORMANCE
ANALYSIS)
j.benke@fz-juelich.de, d.caviedes.voullieme@fz-juelich.de, g.tashakor@fz-juelich.de, s.poll@fz-juelich.de, i.zhukov@fz-juelich.de
http://www.fz-juelich.de/ias/jsc/slts
http://www.hpsc-terrsys.de
@HPSCTerrSys
HPSC TerrSys
TERRESTRIAL SYSTEM MODELLING PLATFORM (TSMP)
• Represents processes for soil, land, vegetation and
atmosphere
• Numerical modelling system coupling COSMO / ICON,
Community Land Model (CLM) and ParFlow
Fully modular
• Physically-based representation of transport processes
of mass, energy and momentum
• Component models can have different spatio-temporal
resolution; explicit feedbacks between compartments
• Parallel Data Assimilation Framework (TSMP-PDAF)
• Multiple Program Multiple Data (MPMD) execution model; OASIS provides a common MPI_COMM_WORLD (see the sketch below)
• https://www.terrsysmp.org/
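Because all three executables are launched as one MPMD job, they start inside a single MPI_COMM_WORLD, and each component then works on its own sub-communicator. The following is a minimal sketch of that idea, assuming mpi4py and a hypothetical 8-1-2 rank layout (48 ranks per node, as in the benchmarks later); the explicit color assignment is illustrative and not the actual OASIS3-MCT API, which performs the equivalent split internally:

```python
# Minimal sketch (mpi4py) of the MPMD communicator split. The rank-to-
# component mapping below is a hypothetical 8-1-2 layout with 48 ranks
# per node; OASIS3-MCT performs the equivalent split internally at
# component initialization.
from mpi4py import MPI

world = MPI.COMM_WORLD              # shared by COSMO, CLM and ParFlow ranks
rank = world.Get_rank()

if rank < 8 * 48:                   # first 384 ranks: COSMO
    color, name = 0, "COSMO"
elif rank < 9 * 48:                 # next 48 ranks: CLM
    color, name = 1, "CLM"
else:                               # remaining 96 ranks: ParFlow
    color, name = 2, "ParFlow"

# Each component gets its own communicator for internal communication;
# the common 'world' communicator remains available to the coupler.
comp_comm = world.Split(color, key=rank)
print(f"{name}: local rank {comp_comm.Get_rank()} of {comp_comm.Get_size()}")
```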
Solving the terrestrial water and energy cycle from groundwater to the atmosphere
[Figure: coupled component models ICON/COSMO (atmosphere), CLM (land surface) and ParFlow (hydrology), embedded in PDAF]
TERRESTRIAL SYSTEM MODELLING PLATFORM (TSMP)
Software features
Source codes used:
TSMP (v1.3.3): Interface and modelling framework
COSMO (v5.01): Atmospheric model
CLM (v3.5): Land surface model (1D column model)
ParFlow (v3.7): Surface and subsurface hydrological model
OASIS3-MCT: MPI-based coupler for all submodels
PDAF: Parallel Data Assimilation Framework (not used here)
Parallelism:
Hybrid (MPI, MPI-CUDA, MPI-OpenMP), depending on the component model
Heterogeneous computing enabled (ParFlow on GPU + COSMO/CLM on CPU)
Performance analysis (profiling/tracing): first results exist (shown later)
TSMP COUPLING SCHEME
One coupling step:
1) COSMO and ParFlow send coupling fields to CLM
2) COSMO and ParFlow are idle while CLM is running
3) CLM sends coupling fields back to COSMO and ParFlow
4) COSMO and ParFlow run simultaneously (CLM is idle)
Communication pattern between the components (programs); a schematic sketch follows below.
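As a plain-Python schematic of this cycle (the step and exchange functions below are trivial placeholders, not real model or OASIS3-MCT interfaces):

```python
# Schematic of one TSMP coupling cycle. All step and exchange functions
# are trivial placeholders, not real model or OASIS3-MCT interfaces.
def cosmo_step(s, fields):   return s + 1        # atmosphere update
def clm_step(s, fields):     return s + 1        # land surface update
def parflow_step(s, fields): return s + 1        # hydrology update
def to_clm(*states):         return states       # fields sent to CLM
def from_clm(state):         return state, state # CLM feedback fields

def coupling_step(state):
    # 1) COSMO and ParFlow send coupling fields to CLM.
    clm_in = to_clm(state["cosmo"], state["parflow"])
    # 2) CLM runs while COSMO and ParFlow are idle.
    state["clm"] = clm_step(state["clm"], clm_in)
    # 3) CLM sends coupling fields back to COSMO and ParFlow.
    fb_cosmo, fb_parflow = from_clm(state["clm"])
    # 4) COSMO and ParFlow run simultaneously (CLM is idle); sequential
    #    here for simplicity, concurrent across the MPMD programs in
    #    the real run.
    state["cosmo"] = cosmo_step(state["cosmo"], fb_cosmo)
    state["parflow"] = parflow_step(state["parflow"], fb_parflow)
    return state

state = {"cosmo": 0, "clm": 0, "parflow": 0}
state = coupling_step(state)   # one 18 s coupling step in the benchmarks
```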
The Benchmark Scenario (SimDiCyPBL)
• Motivation:
Systematic scaling tests and performance analyses were already performed (and published) more than 8 years ago (on JUQUEEN)
But since that time TSMP and its submodels have evolved
The same is true for HPC architectures (especially accelerator architectures)
As a consequence, new scaling tests and performance analysis results are needed on the new machines
• SimDiCyPBL: Simulation of the Diurnal Cycle of the Planetary Boundary Layer
Synthetic scenario of limited complexity, of interest only for checking computational correctness, for computational benchmarks and for performance analysis tests
Adaptation of the TSMP Fall School 2019 scenario
Predefined scenario in the TSMP Data GitHub (idealRTD)
Advantage: a very adaptable problem, which can be run as a very small configuration or scaled up arbitrarily for performance and scalability studies
Among other things, the geometry, mesh size and step widths (in time and space) can easily be adapted
The Benchmark Scenario (SimDiCyPBL)
• The setup of the SimDiCyPBL scenario
All three models are used (but not PDAF)
Atmosphere (COSMO 5.01)
Surface/Land/Vegetation (CLM 3.5)
Hydrology/Hydrogeology (ParFlow 3.7)
Area size (example case): 600 x 600 km
Atmosphere height: 22 km, ground depth: 30 m
Flat ground with height 0 m (ParFlow)
Constant initial aquifer head: 5 m below sea level
(see blue line with arrow)
Spatially homogeneous and constant
unsaturated zone
Homogeneous ground/soil (with initial constant
temperature of 287 K), radiative forcing
Periodic boundary conditions in the x and y directions for COSMO
Graphic adapted from TSMP FallSchool 2019, Day 2, p 6
The Benchmark Scenario (SimDiCyPBL) (cont’d)
• All job runs were performed on the JUWELS Cluster
CPU and GPU partitions (for heterogeneous runs)
Nodes are not shared (exclusive allocation)
Cluster-Booster runs are planned
• Every model component gets a predefined number of processes (nodes)
In our test cases every node is fully utilized, i.e. 48 processes on 48 cores
Example (CPU only):
8 COSMO nodes, 1 CLM node, 2 ParFlow nodes (8-1-2 scenario)
This results in COSMO = 384, CLM = 48 and ParFlow = 96 processes (see the sketch below)
Graphic adapted from TSMP FallSchool 2019, Day 2, p 6
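The mapping from node layout to process counts is simple arithmetic; a minimal helper, assuming the 48 cores per node stated above (the helper name is ours, not part of TSMP):

```python
# Node layout to process counts, assuming the 48 cores per JUWELS
# Cluster node stated above (the helper name is ours, not part of TSMP).
CORES_PER_NODE = 48

def processes(n_cosmo, n_clm, n_parflow, cores=CORES_PER_NODE):
    return {"COSMO": n_cosmo * cores,
            "CLM": n_clm * cores,
            "ParFlow": n_parflow * cores}

# 8-1-2 scenario -> {'COSMO': 384, 'CLM': 48, 'ParFlow': 96}
print(processes(8, 1, 2))
```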
Strong Scaling Experiment 300x300 surface mesh (config)
• Model domain: 300 x 300 km
• Model mesh size (nx x ny x nz):
COSMO: 306 x 306 x 50 (approx. 4.7 × 10⁶ nodes)
CLM: 300 x 300 x 10 (9.0 × 10⁵ nodes)
ParFlow: 300 x 300 x 30 (2.7 × 10⁶ nodes)
Step width (space): Δx = Δy = 1 km
• Model simulation time: 6 hours
• Model time step (all models): Δt = 18 seconds (200 time steps per model hour)
Different time steps per model are possible
• Coupling frequency (OASIS3-MCT): 18 seconds
• I/O interval: 6 hours (1 output of every model at the end of the benchmark)
• Used range of number of nodes in experiments: COSMO=1-16, CLM=1, ParFlow=1-8
• Process pinning and distribution (MPI only)
Pinning: by core; Distribution: block : cyclic : cyclic
• The runtime measurement interval is from the start to the end of the job
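The quoted mesh sizes and step counts can be cross-checked in a few lines; all numbers are taken directly from the configuration above:

```python
# Cross-check of the configuration numbers quoted above.
meshes = {                        # (nx, ny, nz) per component model
    "COSMO":   (306, 306, 50),
    "CLM":     (300, 300, 10),
    "ParFlow": (300, 300, 30),
}
for model, (nx, ny, nz) in meshes.items():
    print(f"{model}: {nx * ny * nz:.1e} mesh nodes")  # 4.7e+06, 9.0e+05, 2.7e+06

dt = 18                           # time step in seconds (all models)
sim_hours = 6
steps = sim_hours * 3600 // dt    # 3600 s / 18 s = 200 steps per model hour
print(f"{steps} time steps = {steps} coupling exchanges at 18 s frequency")
```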
Strong Scaling Experiment 300x300 surface mesh (results)
[Figure: runtime (minutes) vs. number of COSMO nodes, CPU-only runs (left) and CPU-GPU runs (right), one graph per #ParFlow-node count]
Strong Scaling Experiment 300x300 surface mesh (results)
• Explanation of the function graphs on the previous slide:
Function graphs on the left-hand side: runtime measurements for CPU-only runs (in minutes)
Function graphs on the right-hand side: runtime measurements for CPU-GPU runs (in minutes)
For every measurement: #CLM nodes = 1
• Every node is fully utilized with 48 processes (CPU only) or with 4 processes (for ParFlow running on GPUs)
• Every function graph of one colour shows a discrete runtime function of the number of COSMO nodes (with #ParFlow nodes constant)
E.g. the red discrete function is the runtime graph of the runs with 1 CLM node, 1 ParFlow node and 1, 2, 4, 8 and 16 COSMO nodes (in both the CPU-only and the CPU-GPU case)
Every dot marks one measurement point
x axis: number of COSMO nodes (#Nodes COSMO)
y axis: runtime of a measurement (job run) in minutes
How to read it: to find the runtime of an 8-1-2 job (8 COSMO, 1 CLM, 2 ParFlow nodes), locate 8 on the x axis and read off the intersection of the vertical line through 8 with the green graph (#CLM nodes is always 1)
Strong Scaling Experiment 300x300 surface mesh (results)
• Both families of graphs (CPU only and CPU-GPU) are strictly monotonically decreasing
Except for #Nodes COSMO > 8
• COSMO-limited parts of a graph
The graph is nearly constant or increasing
Interpretation: COSMO is waiting for ParFlow
• ParFlow-limited parts of a graph
The graph is (strictly) monotonically decreasing
Interpretation: ParFlow is waiting for COSMO
• A (quasi) load-balanced state corresponds to the “elbow” of a graph
Example, CPU-only case: 8 COSMO, 1 CLM, 2 ParFlow nodes (8-1-2 scenario)
• In all cases the runtime remains constant or increases with 16 COSMO nodes (no further speed-up)
The reason is the small mesh size of this scenario
• Interesting to observe: in the CPU-GPU case the fastest runs are those with #Nodes ParFlow = 1
Regarding runtime and resource usage/energy efficiency, the best choice would be 8-1-1 (CPU-GPU)
• Regarding runtime only, the optimal point would be 16-1-2 (CPU only); a small speed-up/efficiency helper follows below
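To quantify statements like "no further speed-up", strong-scaling runs are usually reduced to speed-up and parallel efficiency relative to the smallest configuration. A minimal helper; the runtimes in the example call are placeholders, not measured values:

```python
# Speed-up and parallel efficiency relative to the smallest run. The
# runtimes in the example call are placeholders, not measured values.
def scaling(nodes, runtimes):
    base_n, base_t = nodes[0], runtimes[0]
    for n, t in zip(nodes, runtimes):
        speedup = base_t / t
        efficiency = speedup / (n / base_n)
        print(f"{n:3d} nodes: speed-up {speedup:5.2f}, "
              f"efficiency {efficiency:5.2f}")

# Hypothetical runtimes that stagnate beyond 8 nodes, mimicking the
# observed "no further speed-up" at 16 COSMO nodes on the small mesh.
scaling([1, 2, 4, 8, 16], [60.0, 32.0, 18.0, 12.0, 12.5])
```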
Strong Scaling Experiment 1200x600 surface mesh (config)
• Model domain: 1200 x 600 km
• Model mesh size (nx x ny x nz):
COSMO: 1206 x 606 x 50 (approx. 3.7 × 10⁷ nodes)
CLM: 1200 x 600 x 10 (7.2 × 10⁶ nodes)
ParFlow: 1200 x 600 x 30 (approx. 2.2 × 10⁷ nodes)
Step width (space): Δx = Δy = 1 km
• Model simulation time: 6 hours
• Model time step (all models): Δt = 18 seconds (200 time steps per model hour)
Different time steps per model are possible
• Coupling frequency (OASIS3-MCT): 18 seconds
• I/O interval: 6 hours (1 output of every model at the end of the benchmark)
• Used range of number of nodes in experiments: COSMO=1-16, CLM=1, ParFlow=1-8
• Process pinning and distribution (MPI only)
Pinning: by core; Distribution: block : cyclic : cyclic
• The runtime measurement interval is from the start to the end of the job
Strong Scaling Experiment 1200x600 surface mesh (results)
[Figure: runtime (minutes) vs. number of COSMO nodes for the 1200x600 mesh, CPU-only and CPU-GPU runs]
Strong Scaling Experiment 1200x600 surface mesh (results)
• COSMO-limited parts of a graph
The graph is nearly constant or increasing
Interpretation: COSMO is waiting for ParFlow
• ParFlow-limited parts of a graph
The graph is (strictly) monotonically decreasing
Interpretation: ParFlow is waiting for COSMO
• A (quasi) load-balanced state corresponds to the “elbow” of a graph
Example, CPU-only case: 4 COSMO, 1 CLM, 1 ParFlow nodes (4-1-1 scenario)
• CPU-only case:
#Nodes ParFlow = 1 is COSMO limited for #Nodes COSMO = 8, 16
#Nodes ParFlow = 2 is COSMO limited for #Nodes COSMO = 16
• CPU-GPU case:
All runs are ParFlow limited (waiting for COSMO)
• In the CPU-GPU case the fastest runs are again those with #Nodes ParFlow = 1 (16-1-1)
This is the best choice regarding runtime and resource consumption/energy efficiency
Strong Scaling Experiments (Summary)
• CPU-only case:
In some cases (especially #Nodes ParFlow = 1, 2) the runtime is COSMO limited
• CPU-GPU case:
Most of the runs are ParFlow limited
In all CPU-GPU cases (scenarios) the best performance is reached with #Nodes ParFlow = 1
With only a minor runtime loss compared to the best CPU-only case
Regarding energy efficiency and runtime, the best CPU-GPU configuration should therefore be used
• To investigate this further, larger meshes are needed
In preparation, but problems arose when creating the appropriate OASIS3 rmp* (remapping) files
• CPU-only and CPU-GPU runs do not differ much in runtime because the models are coupled: after each time step they synchronize by sending their data to CLM and waiting for the scattered feedback data
Possible ways to optimize this (under investigation):
Different time step sizes for the individual models (first results are available)
A larger coupling interval, i.e. coupling less frequently (first results are available, but there is no large runtime gain, and cases with correctness problems in the results exist)
A sketch of how the coupling period is constrained by the component time steps follows below.
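In OASIS-style coupling, exchanges have to fall on time-step boundaries of every component, so when the models use different Δt the coupling period must be a common multiple of all of them; the smallest valid choice is the least common multiple. A minimal sketch with illustrative step sizes (the benchmark itself uses Δt = 18 s for all models):

```python
# Choosing a coupling period for differing component time steps. The
# step sizes below are illustrative, NOT the benchmark values (the
# benchmark uses dt = 18 s for all models). Exchanges must fall on step
# boundaries of every component, so the period must be a common multiple
# of all dt; the smallest valid choice is the least common multiple.
from math import lcm          # lcm with multiple args needs Python >= 3.9

dt = {"COSMO": 18, "CLM": 36, "ParFlow": 90}   # seconds, illustrative
coupling_period = lcm(*dt.values())
print(f"smallest valid coupling period: {coupling_period} s")   # 180 s
```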
Performance Analysis (Tracing)
• The following slides illustrate, by example, the load balancing problems of TSMP
• Instrumentation for tracing was done with Score-P, visualization via Vampir
• Explanation of the next 3 slides:
Presentation of Traces and Profiling of a 1-1-1 CPU only job run of TSMP
300x300 surface mesh
Structure of the tracing pictures (slide 1 and 2)
COSMO: The upper block (48 procs)
ParFlow: Centered block (48 procs)
CLM: Lower block (48 procs)
This structure of the traces shows the MPMD nature of TSMP, in particular the concurrent execution of the three component models
Every row shows the runtime behaviour of one process (for a better overview all processes are shown,
but it’s possible to zoom in)
Performance Analysis (Tracing) (cont’d)
• Explanation of the next 3 slides (cont’d):
Coloring:
Red bars are MPI operations (P2P communication, collective communication, MPI requests, MPI
initialization, etc)
Vertical black lines indicate MPI communication; black dots mark MPI bursts resulting in MPI communication
Green areas denote user functions
Slide 3 shows the accumulated exclusive time per function (a kind of profiling); a small sketch of this metric follows below
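"Accumulated exclusive time per function" simply sums each function's self time over all processes. A sketch of the aggregation, with hypothetical stand-in records rather than real Score-P output:

```python
# Sketch of the "accumulated exclusive time per function" metric: sum
# the exclusive (self) time of every function over all processes. The
# records below are hypothetical stand-ins, not real Score-P output.
from collections import defaultdict

# (process rank, function name, exclusive time in seconds)
events = [
    (0, "MPI_Waitall", 410.0), (0, "cosmo_dynamics", 210.0),
    (1, "MPI_Waitall", 600.0), (1, "clm_driver",      80.0),
    (2, "MPI_Waitall", 350.0), (2, "parflow_solver", 260.0),
]

accumulated = defaultdict(float)
for _, function, t_exclusive in events:
    accumulated[function] += t_exclusive

for function, total in sorted(accumulated.items(), key=lambda kv: -kv[1]):
    print(f"{function:16s} {total:8.1f} s")   # MPI_Waitall dominates
```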
Performance Analysis of TSMP using Score-P/Vampir (CPU only)
[Figure: Vampir trace overview of all 144 processes over the full runtime]
Performance Analysis of TSMP using Score-P/Vampir (CPU only)
[Figure: Vampir trace, zoom into the first time steps (0–33 s)]
Performance Analysis (Tracing) (results)
• CPU only (slides 1 and 2)
Slide 1 shows an overview of all 144 processes of the 3 component models over the full runtime (approx. 750 s)
Most of the area is red (MPI), and a closer look shows that most of this time is spent in MPI requests (MPI_Waitall) (see also slide 3)
CLM spends most of its time waiting for COSMO and ParFlow, since it is the fastest model (among other things, it has the smallest mesh)
Slide 2 shows a zoom into the first time steps (0 – 33 seconds)
Seconds 0 to 14 cover the initialization interval
Computation begins at second 14
CLM starts computing (approx. second 14.2), while ParFlow and COSMO are waiting
After finishing, CLM scatters its results to COSMO and ParFlow (black vertical lines) and then waits (second 14.2 to 15.5)
ParFlow computes from second 14.3 to 14.5, sends its data to CLM and waits
COSMO computes from second 14.3 to 15.5 and then sends its data to CLM
After sending their data, COSMO and ParFlow wait for CLM, which has started to compute
The cycle starts again
Performance Analysis (Tracing) (results; cont’d)
• CPU – GPU (no slide)
The same mechanism as on the previous slides
CLM waits for almost all of its runtime
Load balancing issues due to the different model complexities!
• Profiling (slide 3)
The slide shows the accumulated exclusive time per function
MPI_Waitall dominates all other functions
But:
Much of the MPI_Waitall time is spent in CLM, because it is by far the smallest and fastest model
COSMO is the most complex model (with the largest mesh)
Important: reduction of the load imbalance to an “optimal” point
The share of MPI communication (in per cent) relative to other regions decreases with increasing mesh size and well load-balanced models
Performance Analysis of TSMP with Scalasca
Slides adapted from Goergen et al., JSC SAC 2019, 16/17 Sep 2019, Jülich
Scalasca allows trace analysis to detect bottlenecks and patterns/regions of poor performance
The results can be displayed with Cube
An example scenario is shown on the left-hand side
300x300 TSMP CPU-only scenario (1-1-1)
Only the metric tree of Cube is shown here
The accumulated total time of the program can be seen in line 1 and the number of visits in line 2
Additionally, Scalasca extracts, for example, events like “Late Sender” or “Late Receiver” (lines 7 and 8)
This appears to be a problem here, since it accounts for a significant share of the runtime
To locate the problems more accurately it is possible to select different objects
This gives a better overview of the part of TSMP in which the problems occur
Further bottlenecks can be detected with Scalasca (see box on the left-hand side)
Parallel Performance Analysis with Scalasca (Late Sender)
[Figure: Scalasca/Cube trace analysis highlighting the Late Sender pattern]
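The Late Sender pattern means a receive is posted before the matching send has started, so the receiver blocks. A minimal sketch of how the attributed waiting time can be computed from event timestamps (the timestamps here are hypothetical, not taken from the trace):

```python
# Sketch of the Late Sender pattern: the receiver enters the receive
# before the matching send has started, so the positive part of the
# enter-time difference is attributed as waiting time. The timestamps
# below are hypothetical, not taken from the trace.
def late_sender_wait(recv_enter, send_enter):
    """Waiting time charged to Late Sender (0 if the send started first)."""
    return max(0.0, send_enter - recv_enter)

# Receive posted at t = 14.2 s, matching send only starts at t = 15.5 s:
print(f"{late_sender_wait(recv_enter=14.2, send_enter=15.5):.1f} s waiting")
```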
ONGOING DEVELOPMENTS (SELECTION)
• ICON-CLM5 coupling via OASIS3-MCT
• (ICON + ParFlow) on GPUs + CLM on CPUs
• Flexible/adaptive grids for handling streams/rivers
• Enlarging the mesh size and further strong and weak scaling
experiments on JUWELS and DEEP Cluster and Booster
Both follow the Modular Supercomputing Architecture (MSA) approach
• Continuing performance analysis with Scalasca/Score-P and Vampir/Cube
• In-depth performance analysis of all components of the models
• Best practices performance analysis for TSMP (MPMD)
• Best practices optimization for TSMP
Especially regarding load balancing (heterogeneous and MSA)
Different time steps of models and coupling frequencies
(systematic tests)