TERRESTRIAL SYSTEMS MODELLING PLATFORM
J. BENKE, D. CAVIEDES VOULLIEME, S. POLL, G. TASHAKOR, I. ZHUKOV
JÜLICH SUPERCOMPUTING CENTRE (JSC)
PORTING A COUPLED MULTISCALE AND MULTIPHYSICS EARTH SYSTEM MODEL TO
HETEROGENEOUS ARCHITECTURES (BENCHMARKING AND PERFORMANCE
ANALYSIS)
j.benke@fz-juelich.de, d.caviedes.voullieme@fz-juelich.de, g.tashakor@fz-juelich.de, s.poll@fz-juelich.de, i.zhukov@fz-juelich.de
http://www.fz-juelich.de/ias/jsc/slts
http://www.hpsc-terrsys.de
@HPSCTerrSys
HPSC TerrSys
TERRESTRIAL SYSTEM MODELLING PLATFORM (TSMP)
• Represents processes for soil, land, vegetation and
atmosphere
• Numerical modelling system coupling COSMO / ICON,
Community Land Model (CLM) and ParFlow
Fully modular
• Physically-based representation of transport processes
of mass, energy and momentum
• Component models can have different spatio-temporal
resolution; explicit feedbacks between compartments
• Parallel Data Assimilation Framework (TSMP-PDAF)
• Multiple Program Multiple Data (MPMD) execution model; OASIS provides a common MPI_COMM_WORLD (see the sketch below)
• https://www.terrsysmp.org/
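Because all three executables are launched as one MPMD job, they start inside a single MPI_COMM_WORLD, and each component then works on its own sub-communicator. The following is a minimal sketch of that idea, assuming mpi4py and a hypothetical 8-1-2 rank layout (48 ranks per node, as in the benchmarks later); the explicit color assignment is illustrative and not the actual OASIS3-MCT API, which performs the equivalent split internally:

```python
# Minimal sketch (mpi4py) of the MPMD communicator split. The rank-to-
# component mapping below is a hypothetical 8-1-2 layout with 48 ranks
# per node; OASIS3-MCT performs the equivalent split internally at
# component initialization.
from mpi4py import MPI

world = MPI.COMM_WORLD              # shared by COSMO, CLM and ParFlow ranks
rank = world.Get_rank()

if rank < 8 * 48:                   # first 384 ranks: COSMO
    color, name = 0, "COSMO"
elif rank < 9 * 48:                 # next 48 ranks: CLM
    color, name = 1, "CLM"
else:                               # remaining 96 ranks: ParFlow
    color, name = 2, "ParFlow"

# Each component gets its own communicator for internal communication;
# the common 'world' communicator remains available to the coupler.
comp_comm = world.Split(color, key=rank)
print(f"{name}: local rank {comp_comm.Get_rank()} of {comp_comm.Get_size()}")
```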
Solving the terrestrial water and energy cycle from groundwater to the atmosphere
[Figure: coupled component models ICON/COSMO (atmosphere), CLM (land surface) and ParFlow (hydrology), embedded in PDAF]
TERRESTRIAL SYSTEM MODELLING PLATFORM (TSMP)
Software features
Source codes used:
TSMP (v1.3.3): Interface and modelling framework
COSMO (v5.01): Atmospheric model
CLM (v3.5): Land surface model (1D column model)
ParFlow (v3.7): Surface and subsurface hydrological model
OASIS3-MCT: MPI-based coupler for all submodels
PDAF: Parallel Data Assimilation Framework (not used here)
Parallelism:
Hybrid (MPI, MPI-CUDA, MPI-OpenMP), depending on the component model
Heterogeneous computing enabled (ParFlow on GPU + COSMO/CLM on CPU)
Performance analysis (profiling/tracing): first results exist (shown later)
TSMP COUPLING SCHEME
One coupling step:
1) COSMO and ParFlow send coupling fields to CLM
2) COSMO and ParFlow are idle while CLM is running
3) CLM sends coupling fields back to COSMO and ParFlow
4) COSMO and ParFlow run simultaneously (CLM is idle)
Communication pattern between the components (programs); a schematic sketch follows below.
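As a plain-Python schematic of this cycle (the step and exchange functions below are trivial placeholders, not real model or OASIS3-MCT interfaces):

```python
# Schematic of one TSMP coupling cycle. All step and exchange functions
# are trivial placeholders, not real model or OASIS3-MCT interfaces.
def cosmo_step(s, fields):   return s + 1        # atmosphere update
def clm_step(s, fields):     return s + 1        # land surface update
def parflow_step(s, fields): return s + 1        # hydrology update
def to_clm(*states):         return states       # fields sent to CLM
def from_clm(state):         return state, state # CLM feedback fields

def coupling_step(state):
    # 1) COSMO and ParFlow send coupling fields to CLM.
    clm_in = to_clm(state["cosmo"], state["parflow"])
    # 2) CLM runs while COSMO and ParFlow are idle.
    state["clm"] = clm_step(state["clm"], clm_in)
    # 3) CLM sends coupling fields back to COSMO and ParFlow.
    fb_cosmo, fb_parflow = from_clm(state["clm"])
    # 4) COSMO and ParFlow run simultaneously (CLM is idle); sequential
    #    here for simplicity, concurrent across the MPMD programs in
    #    the real run.
    state["cosmo"] = cosmo_step(state["cosmo"], fb_cosmo)
    state["parflow"] = parflow_step(state["parflow"], fb_parflow)
    return state

state = {"cosmo": 0, "clm": 0, "parflow": 0}
state = coupling_step(state)   # one 18 s coupling step in the benchmarks
```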
The Benchmark Scenario (SimDiCyPBL)
• Motivation:
Systematic scaling tests and performance analyses were already performed (and published) more than 8 years ago (on JUQUEEN)
But since that time TSMP and its submodels have evolved
The same is true for HPC architectures (especially accelerator architectures)
As a consequence, new scaling tests and performance analysis results are needed on the new machines
• SimDiCyPBL: Simulation of the Diurnal Cycle of the Planetary Boundary Layer
Synthetic scenario of limited complexity, of interest only for checking computational correctness, for computational benchmarks and for performance analysis tests
Adaptation of the TSMP Fall School 2019 scenario
Predefined scenario in the TSMP Data GitHub (idealRTD)
Advantage: a very adaptable problem, which can be run as a very small configuration or scaled up arbitrarily for performance and scalability studies
Among other things, the geometry, mesh size and step widths (in time and space) can easily be adapted
The Benchmark Scenario (SimDiCyPBL)
• The setup of the SimDiCyPBL scenario
All three models are used (but not PDAF)
Atmosphere (COSMO 5.01)
Surface/Land/Vegetation (CLM 3.5)
Hydrology/Hydrogeology (ParFlow 3.7)
Area size (example case): 600 x 600 km
Atmosphere height: 22 km, ground depth: 30 m
Flat ground with height 0 m (ParFlow)
Constant initial aquifer head: 5 m below sea level
(see blue line with arrow)
Spatially homogeneous and constant
unsaturated zone
Homogeneous ground/soil (with initial constant
temperature of 287 K), radiative forcing
Periodic boundary conditions in the x and y directions for COSMO
Graphic adapted from TSMP FallSchool 2019, Day 2, p 6
The Benchmark Scenario (SimDiCyPBL) (cont’d)
• All job runs were performed on the JUWELS Cluster
CPU and GPU partitions (for heterogeneous runs)
Nodes are not shared (exclusive allocation)
Cluster-Booster runs are planned
• Every model component gets a predefined number of processes (nodes)
In our test cases every node is fully utilized, i.e. 48 processes on 48 cores
Example (CPU only):
8 COSMO nodes, 1 CLM node, 2 ParFlow nodes (8-1-2 scenario)
This results in COSMO = 384, CLM = 48 and ParFlow = 96 processes (see the sketch below)
Graphic adapted from TSMP FallSchool 2019, Day 2, p 6
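The mapping from node layout to process counts is simple arithmetic; a minimal helper, assuming the 48 cores per node stated above (the helper name is ours, not part of TSMP):

```python
# Node layout to process counts, assuming the 48 cores per JUWELS
# Cluster node stated above (the helper name is ours, not part of TSMP).
CORES_PER_NODE = 48

def processes(n_cosmo, n_clm, n_parflow, cores=CORES_PER_NODE):
    return {"COSMO": n_cosmo * cores,
            "CLM": n_clm * cores,
            "ParFlow": n_parflow * cores}

# 8-1-2 scenario -> {'COSMO': 384, 'CLM': 48, 'ParFlow': 96}
print(processes(8, 1, 2))
```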
Strong Scaling Experiment 300x300 surface mesh (config)
• Model domain: 300 x 300 km
• Model mesh size (nx x ny x nz):
COSMO: 306 x 306 x 50 (approx. 4.7 × 10⁶ nodes)
CLM: 300 x 300 x 10 (9.0 × 10⁵ nodes)
ParFlow: 300 x 300 x 30 (2.7 × 10⁶ nodes)
Step width (space): Δx = Δy = 1 km
• Model simulation time: 6 hours
• Model time step (all models): Δt = 18 seconds (200 time steps per model hour)
Different time steps per model are possible
• Coupling frequency (OASIS3-MCT): 18 seconds
• I/O interval: 6 hours (1 output of every model at the end of the benchmark)
• Used range of number of nodes in experiments: COSMO=1-16, CLM=1, ParFlow=1-8
• Process pinning and distribution (MPI only)
Pinning: by core; Distribution: block : cyclic : cyclic
• The runtime measurement interval is from the start to the end of the job
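The quoted mesh sizes and step counts can be cross-checked in a few lines; all numbers are taken directly from the configuration above:

```python
# Cross-check of the configuration numbers quoted above.
meshes = {                        # (nx, ny, nz) per component model
    "COSMO":   (306, 306, 50),
    "CLM":     (300, 300, 10),
    "ParFlow": (300, 300, 30),
}
for model, (nx, ny, nz) in meshes.items():
    print(f"{model}: {nx * ny * nz:.1e} mesh nodes")  # 4.7e+06, 9.0e+05, 2.7e+06

dt = 18                           # time step in seconds (all models)
sim_hours = 6
steps = sim_hours * 3600 // dt    # 3600 s / 18 s = 200 steps per model hour
print(f"{steps} time steps = {steps} coupling exchanges at 18 s frequency")
```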
Strong Scaling Experiment 300x300 surface mesh (results)
[Figure: runtime (minutes) vs. number of COSMO nodes, CPU-only runs (left) and CPU-GPU runs (right), one graph per #ParFlow-node count]
Strong Scaling Experiment 300x300 surface mesh (results)
• Explanation of the function graphs on the previous slide:
Function graphs on the left-hand side: runtime measurements for CPU-only runs (in minutes)
Function graphs on the right-hand side: runtime measurements for CPU-GPU runs (in minutes)
For every measurement: #CLM nodes = 1
• Every node is fully utilized with 48 processes (CPU only) or with 4 processes (for ParFlow running on GPUs)
• Every function graph of one colour shows a discrete runtime function of the number of COSMO nodes (with #ParFlow nodes constant)
E.g. the red discrete function is the runtime graph of the runs with 1 CLM node, 1 ParFlow node and 1, 2, 4, 8 and 16 COSMO nodes (in both the CPU-only and the CPU-GPU case)
Every dot marks one measurement point
x axis: number of COSMO nodes (#Nodes COSMO)
y axis: runtime of a measurement (job run) in minutes
How to read it: to find the runtime of an 8-1-2 job (8 COSMO, 1 CLM, 2 ParFlow nodes), locate 8 on the x axis and read off the intersection of the vertical line through 8 with the green graph (#CLM nodes is always 1)
Strong Scaling Experiment 300x300 surface mesh (results)
• Both families of graphs (CPU only and CPU-GPU) are strictly monotonically decreasing
Except for #Nodes COSMO > 8
• COSMO-limited parts of a graph
The graph is nearly constant or increasing
Interpretation: COSMO is waiting for ParFlow
• ParFlow-limited parts of a graph
The graph is (strictly) monotonically decreasing
Interpretation: ParFlow is waiting for COSMO
• A (quasi) load-balanced state corresponds to the “elbow” of a graph
Example, CPU-only case: 8 COSMO, 1 CLM, 2 ParFlow nodes (8-1-2 scenario)
• In all cases the runtime remains constant or increases with 16 COSMO nodes (no further speed-up)
The reason is the small mesh size of this scenario
• Interesting to observe: in the CPU-GPU case the fastest runs are those with #Nodes ParFlow = 1
Regarding runtime and resource usage/energy efficiency, the best choice would be 8-1-1 (CPU-GPU)
• Regarding runtime only, the optimal point would be 16-1-2 (CPU only); a small speed-up/efficiency helper follows below
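To quantify statements like "no further speed-up", strong-scaling runs are usually reduced to speed-up and parallel efficiency relative to the smallest configuration. A minimal helper; the runtimes in the example call are placeholders, not measured values:

```python
# Speed-up and parallel efficiency relative to the smallest run. The
# runtimes in the example call are placeholders, not measured values.
def scaling(nodes, runtimes):
    base_n, base_t = nodes[0], runtimes[0]
    for n, t in zip(nodes, runtimes):
        speedup = base_t / t
        efficiency = speedup / (n / base_n)
        print(f"{n:3d} nodes: speed-up {speedup:5.2f}, "
              f"efficiency {efficiency:5.2f}")

# Hypothetical runtimes that stagnate beyond 8 nodes, mimicking the
# observed "no further speed-up" at 16 COSMO nodes on the small mesh.
scaling([1, 2, 4, 8, 16], [60.0, 32.0, 18.0, 12.0, 12.5])
```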
Strong Scaling Experiment 1200x600 surface mesh (config)
• Model domain: 1200 x 600 km
• Model mesh size (nx x ny x nz):
COSMO: 1206 x 606 x 50 (approx. 3.7 × 10⁷ nodes)
CLM: 1200 x 600 x 10 (7.2 × 10⁶ nodes)
ParFlow: 1200 x 600 x 30 (approx. 2.2 × 10⁷ nodes)
Step width (space): Δx = Δy = 1 km
• Model simulation time: 6 hours
• Model time step (all models): Δt = 18 seconds (200 time steps per model hour)
Different time steps per model are possible
• Coupling frequency (OASIS3-MCT): 18 seconds
• I/O interval: 6 hours (1 output of every model at the end of the benchmark)
• Used range of number of nodes in experiments: COSMO=1-16, CLM=1, ParFlow=1-8
• Process pinning and distribution (MPI only)
Pinning: by core; Distribution: block : cyclic : cyclic
• The runtime measurement interval is from the start to the end of the job
Strong Scaling Experiment 1200x600 surface mesh (results)
[Figure: runtime (minutes) vs. number of COSMO nodes for the 1200x600 mesh, CPU-only and CPU-GPU runs]
Strong Scaling Experiment 1200x600 surface mesh (results)
• COSMO-limited parts of a graph
The graph is nearly constant or increasing
Interpretation: COSMO is waiting for ParFlow
• ParFlow-limited parts of a graph
The graph is (strictly) monotonically decreasing
Interpretation: ParFlow is waiting for COSMO
• A (quasi) load-balanced state corresponds to the “elbow” of a graph
Example, CPU-only case: 4 COSMO, 1 CLM, 1 ParFlow nodes (4-1-1 scenario)
• CPU-only case:
#Nodes ParFlow = 1 is COSMO limited for #Nodes COSMO = 8, 16
#Nodes ParFlow = 2 is COSMO limited for #Nodes COSMO = 16
• CPU-GPU case:
All runs are ParFlow limited (waiting for COSMO)
• In the CPU-GPU case the fastest runs are again those with #Nodes ParFlow = 1 (16-1-1)
This is the best choice regarding runtime and resource consumption/energy efficiency
Strong Scaling Experiments (Summary)
• CPU-only case:
In some cases (especially #Nodes ParFlow = 1, 2) the runtime is COSMO limited
• CPU-GPU case:
Most of the runs are ParFlow limited
In all CPU-GPU cases (scenarios) the best performance is reached with #Nodes ParFlow = 1
With only a minor runtime loss compared to the best CPU-only case
Regarding energy efficiency and runtime, the best CPU-GPU configuration should therefore be used
• To investigate this further, larger meshes are needed
In preparation, but problems arose when creating the appropriate OASIS3 rmp* (remapping) files
• CPU-only and CPU-GPU runs do not differ much in runtime because the models are coupled: after each time step they synchronize by sending their data to CLM and waiting for the scattered feedback data
Possible ways to optimize this (under investigation):
Different time step sizes for the individual models (first results are available)
A larger coupling interval, i.e. coupling less frequently (first results are available, but there is no large runtime gain, and cases with correctness problems in the results exist)
A sketch of how the coupling period is constrained by the component time steps follows below.
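In OASIS-style coupling, exchanges have to fall on time-step boundaries of every component, so when the models use different Δt the coupling period must be a common multiple of all of them; the smallest valid choice is the least common multiple. A minimal sketch with illustrative step sizes (the benchmark itself uses Δt = 18 s for all models):

```python
# Choosing a coupling period for differing component time steps. The
# step sizes below are illustrative, NOT the benchmark values (the
# benchmark uses dt = 18 s for all models). Exchanges must fall on step
# boundaries of every component, so the period must be a common multiple
# of all dt; the smallest valid choice is the least common multiple.
from math import lcm          # lcm with multiple args needs Python >= 3.9

dt = {"COSMO": 18, "CLM": 36, "ParFlow": 90}   # seconds, illustrative
coupling_period = lcm(*dt.values())
print(f"smallest valid coupling period: {coupling_period} s")   # 180 s
```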
Performance Analysis (Tracing)
• The following slides illustrate, by example, the load balancing problems of TSMP
• Instrumentation for tracing was done with Score-P, visualization via Vampir
• Explanation of the next 3 slides:
Presentation of Traces and Profiling of a 1-1-1 CPU only job run of TSMP
300x300 surface mesh
Structure of the tracing pictures (slide 1 and 2)
COSMO: The upper block (48 procs)
ParFlow: Centered block (48 procs)
CLM: Lower block (48 procs)
This structure of the traces shows the MPMD nature of TSMP, in particular the concurrent execution of the three component models
Every row shows the runtime behaviour of one process (for a better overview all processes are shown,
but it’s possible to zoom in)
Performance Analysis (Tracing) (cont’d)
• Explanation of the next 3 slides (cont’d):
Coloring:
Red bars are MPI operations (P2P communication, collective communication, MPI requests, MPI
initialization, etc)
Vertical black lines indicate MPI communication; black dots mark MPI bursts resulting in MPI communication
Green areas denote user functions
Slide 3 shows the accumulated exclusive time per function (a kind of profiling); a small sketch of this metric follows below
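"Accumulated exclusive time per function" simply sums each function's self time over all processes. A sketch of the aggregation, with hypothetical stand-in records rather than real Score-P output:

```python
# Sketch of the "accumulated exclusive time per function" metric: sum
# the exclusive (self) time of every function over all processes. The
# records below are hypothetical stand-ins, not real Score-P output.
from collections import defaultdict

# (process rank, function name, exclusive time in seconds)
events = [
    (0, "MPI_Waitall", 410.0), (0, "cosmo_dynamics", 210.0),
    (1, "MPI_Waitall", 600.0), (1, "clm_driver",      80.0),
    (2, "MPI_Waitall", 350.0), (2, "parflow_solver", 260.0),
]

accumulated = defaultdict(float)
for _, function, t_exclusive in events:
    accumulated[function] += t_exclusive

for function, total in sorted(accumulated.items(), key=lambda kv: -kv[1]):
    print(f"{function:16s} {total:8.1f} s")   # MPI_Waitall dominates
```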
Performance Analysis of TSMP using Score-P/Vampir (CPU only)
[Figure: Vampir trace overview of all 144 processes over the full runtime]
Performance Analysis of TSMP using Score-P/Vampir (CPU only)
[Figure: Vampir trace, zoom into the first time steps (0–33 s)]
Performance Analysis (Tracing) (results)
• CPU only (slides 1 and 2)
Slide 1 shows an overview of all 144 processes of the 3 component models over the full runtime (approx. 750 s)
Most of the area is red (MPI), and a closer look shows that most of this time is spent in MPI requests (MPI_Waitall) (see also slide 3)
CLM spends most of its time waiting for COSMO and ParFlow, since it is the fastest model (among other things, it has the smallest mesh)
Slide 2 shows a zoom into the first time steps (0 – 33 seconds)
Seconds 0 to 14 cover the initialization interval
Computation begins at second 14
CLM starts computing (approx. second 14.2), while ParFlow and COSMO are waiting
After finishing, CLM scatters its results to COSMO and ParFlow (black vertical lines) and then waits (second 14.2 to 15.5)
ParFlow computes from second 14.3 to 14.5, sends its data to CLM and waits
COSMO computes from second 14.3 to 15.5 and then sends its data to CLM
After sending their data, COSMO and ParFlow wait for CLM, which has started to compute
The cycle starts again
Performance Analysis (Tracing) (results; cont’d)
• CPU – GPU (no slide)
The same mechanism as on the previous slides
CLM waits for almost all of its runtime
Load balancing issues due to the different model complexities!
• Profiling (slide 3)
The slide shows the accumulated exclusive time per function
MPI_Waitall dominates all other functions
But:
Much of the MPI_Waitall time is spent in CLM, because it is by far the smallest and fastest model
COSMO is the most complex model (with the largest mesh)
Important: reduction of the load imbalance to an “optimal” point
The share of MPI communication (in per cent) relative to other regions decreases with increasing mesh size and well load-balanced models
Performance Analysis of TSMP with Scalasca
Slides adapted from Goergen et al., JSC SAC 2019, 16/17 Sep 2019, Jülich
Scalasca allows trace analysis to detect bottlenecks and patterns/regions of poor performance
The results can be displayed with Cube
An example scenario is shown on the left-hand side
300x300 TSMP CPU-only scenario (1-1-1)
Only the metric tree of Cube is shown here
The accumulated total time of the program can be seen in line 1 and the number of visits in line 2
Additionally, Scalasca extracts, for example, events like “Late Sender” or “Late Receiver” (lines 7 and 8)
This appears to be a problem here, since it accounts for a significant share of the runtime
To locate the problems more accurately it is possible to select different objects
This gives a better overview of the part of TSMP in which the problems occur
Further bottlenecks can be detected with Scalasca (see box on the left-hand side)
Parallel Performance Analysis with Scalasca (Late Sender)
[Figure: Scalasca/Cube trace analysis highlighting the Late Sender pattern]
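The Late Sender pattern means a receive is posted before the matching send has started, so the receiver blocks. A minimal sketch of how the attributed waiting time can be computed from event timestamps (the timestamps here are hypothetical, not taken from the trace):

```python
# Sketch of the Late Sender pattern: the receiver enters the receive
# before the matching send has started, so the positive part of the
# enter-time difference is attributed as waiting time. The timestamps
# below are hypothetical, not taken from the trace.
def late_sender_wait(recv_enter, send_enter):
    """Waiting time charged to Late Sender (0 if the send started first)."""
    return max(0.0, send_enter - recv_enter)

# Receive posted at t = 14.2 s, matching send only starts at t = 15.5 s:
print(f"{late_sender_wait(recv_enter=14.2, send_enter=15.5):.1f} s waiting")
```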
ONGOING DEVELOPMENTS (SELECTION)
• ICON-CLM5 coupling via OASIS3-MCT
• (ICON + ParFlow) on GPUs + CLM on CPUs
• Flexible/adaptive grids for handling streams/rivers
• Enlarging the mesh size and further strong and weak scaling
experiments on JUWELS and DEEP Cluster and Booster
Both follow the Modular Supercomputing Architecture (MSA) approach
• Continuing performance analysis with Scalasca/Score-P and Vampir/Cube
• In-depth performance analysis of all components of the models
• Best practices performance analysis for TSMP (MPMD)
• Best practices optimization for TSMP
Especially regarding load balancing (heterogeneous and MSA)
Different time steps of models and coupling frequencies
(systematic tests)