2. What is new in 7.0?
Full Intel KNL support
HBM memory debugging
PAPI and Lustre metrics
Cross-platform hardware counter information
Extend IO profiling capabilities
Create your own metric!
Export to JSON
Integration to CI tools
3. As of December 2016, Allinea is part of ARM
Our objective:
Remain the trusted leader in cross platform HPC tools
• We will continue to work with our customers, partners and you!
The same successful team…
• We can now respond quicker and deliver our roadmap faster
… is stronger than ever…
• We remain 100% committed to providing cross-platform tools for HPC
… as committed as ever…
• We are working with vendors to support the next generations of systems.
… and looking forward to the future.
4. The Road to Performance
https://youtu.be/l55w3Fy_J0Y
5. Analyze on Different Platforms
Very simple start-up
Low overhead
Powerful data analysis
For x86_64 and KNL
6. Understand IO/Lustre Efficiency
• Client requests information about the file from the MDS
• Client reads/writes the file on the OSTs in parallel
• Not designed to handle a large number of small files
[Diagram: a Lustre Client sends "Open" requests to the Metadata Server (MDS) and "Read/Write" requests to four Object Storage Servers (OSS), each with an Object Storage Target (OST)]
8. Improve Cache Usage Using PAPI
• Run “papi_install.sh” from the Allinea Forge installation folder
• Select the metrics to collect in a configuration file
9. Code Optimization for All Platforms
• PAPI metrics are portable:
– x86_64, ARMv8 and OpenPower
• Allinea Forge 7.0 extends support to IBM Spectrum MPI
• Data from MAP files can be exported in JSON
10. Make the Most of the HBM on KNL
• The high bandwidth memory (HBM) is on the same CPU chip, next to processing cores
• 3 modes: Cache, Flat, Hybrid
[Diagram: arrangement of Cores, HBM and DDR4 in each of the three modes]
11. Make the Most of the HBM on KNL
• Use the memory debugging feature of Allinea DDT to track the DDR4 / HBM usage
12. Create your Own Metrics
System introspection
• Specific counters / files
Application introspection
• Tracking application characteristics
[Diagram labels: Application, Library, XML, MAP, Profile, Application Profiler]
13. Update Your Tools Now!
Portability
• Debug
• Profile
• For x86_64, KNL, ARMv8, OpenPower
In-depth
• Hardware counters with PAPI metrics
• IO Lustre Metrics
Flexibility
• Application/system specific custom metrics
• Export data to integrate to your workflow
• Download our latest version 7.0.2
– https://www.allinea.com/products/forge/download
– https://www.allinea.com/products/reports/download
• Request a trial!
– https://www.allinea.com/trials
14. Products and Licensing
Performance Report: Workstation / Supercomputing
Forge: Workstation / Supercomputing
Forge Professional: Workstation / Supercomputing
GPU, accelerators metrics, energy metrics, PAPI metrics, Lustre metrics
15. See our full menu of live or recorded webinars:
https://www.allinea.com/performance-webinars-menu
To take a trial visit:
https://www.allinea.com/get-your-free-allinea-forge-and-allinea-performance-reports-trial
To contact sales email:
sales@allinea.com
Editor's Notes
Hello everyone, thank you very much for joining this webinar. I am Florent Lebeau, Application Engineer at Allinea, now part of ARM. During this webinar I will tell you more about the latest features of version 7 of our tools: Allinea Forge, which includes Allinea MAP for profiling and Allinea DDT for debugging, and Allinea Performance Reports for application behaviour analysis. If you have any questions during the presentation, please ask them to my colleague Mark Clarke using the webex chat window. Mark is here to assist me today: he will collect questions and we will answer a few of them at the end of the presentation.
Version 7 was released a few months ago and it is a major step forward in our support for the latest architectures, including Intel KNL and OpenPower. This version also extends the capabilities of the tools in order to provide more insight into hardware usage on different platforms and into IO bottlenecks. Not only are we providing the HPC community with ready-to-use solutions, but we are also opening our tools for customisation so that our users can create their own! Finally, version 7.0 is also an evolution towards software development best practices by facilitating integration into your testing workflows.
But before we start, I would like to go back in time… In December 2016, Allinea joined ARM. With 95 billion ARM-based chips shipped to date and more than 15 billion chips shipped every year, ARM is one of the largest semiconductor companies in the world. By joining ARM, Allinea is bringing 15 years of experience in HPC and leading-edge tools for high performance code development. Our tools are used every day, on the largest supercomputers, on all architectures, and this will remain the same. Being part of ARM means that we have stronger support than ever to deliver our roadmap for cross-platform tools. We are committed to helping the community overcome performance challenges and look forward to facing the challenges to come.
On our YouTube channel, you will be able to find our “Performance Roadmap”. This video illustrates how to optimise your applications step by step and gives you guidelines for doing so. Because the reason for inefficiency is sometimes hidden by the symptoms and concentrated in small portions of code, premature optimisation is never a good thing. A good method combined with the right tools is required in order to achieve your performance goals.
The first step is to analyse before you optimise. By prefixing the original command with “perf-report” in your job script, Allinea Performance Reports lets you describe the application behaviour quickly and easily. The tool has very little overhead, so the report that you get, as you can see on the left, is as close as possible to the application running in production. Version 7.0 supports x86_64 and KNL architectures to provide a rich set of data and powerful analysis of the bottlenecks of the application running on these systems. For example, the report displayed on the slide clearly shows that this application is IO bound. How can we go further?
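As a rough illustration of the prefixing step, the sketch below builds a launch command by putting “perf-report” in front of an existing one. The `mpirun -n 4 ./my_app` launch line is a made-up placeholder, not something taken from the slides, and the code only builds and prints the command without running it.

```python
import shlex


def with_perf_report(original_command):
    """Return the original command, as an argument list, prefixed with perf-report."""
    # shlex.split tokenises the string the way a POSIX shell would.
    return ["perf-report"] + shlex.split(original_command)


# Hypothetical launch line: replace it with the one from your own job script.
command = with_perf_report("mpirun -n 4 ./my_app")
print(shlex.join(command))
```

In a real job script the same effect is obtained by simply writing “perf-report” in front of the existing launch line.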
On parallel Lustre filesystems, the data is split across different hard disk drives or solid state devices, referred to as OSTs in the model displayed on the screen. This is where the parallel IO performance comes from. However, in order to perform the IO on the OSTs, the file information has to be retrieved from the metadata server or MDS. As a result, the MDS can sometimes be responsible for the bottleneck, especially if a large number of small files are being accessed. To detect and understand this kind of situation, version 7.0 of the Allinea MAP profiler provides Lustre metrics. Allinea MAP will be able to display graphs of read and write transfer rates on the disks, but also of the number of metadata operations and the number of file open operations per second.
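To give a feel for why many small files put pressure on the MDS, here is a back-of-the-envelope sketch. It assumes, as a simplification, that every file costs at least one open request to the metadata server; the 1 GiB volume and the 4 KiB file size are arbitrary illustrative numbers.

```python
def open_requests(total_bytes, file_size_bytes):
    """Files (and so, at minimum, open requests) needed to hold total_bytes."""
    # Ceiling division: a partly filled last file still has to be opened.
    return -(-total_bytes // file_size_bytes)


GIB = 1024 ** 3
KIB = 1024

# Same 1 GiB of data, stored as many small files or as one large file.
small_files = open_requests(1 * GIB, 4 * KIB)
one_file = open_requests(1 * GIB, 1 * GIB)

print(small_files, one_file)
```

The amount of data moved to the OSTs is identical in both cases; only the number of metadata requests changes.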
There are many more steps in the performance roadmap after improving the IO, but at some point memory access patterns need to be investigated. Performance Reports can help you identify such inefficiencies thanks to the “Memory access” metric. If this value is high, this is a sign that the compute kernels are memory-bound. To optimise this, it is important to take advantage of the levels of cache memory available on the CPU before accessing the main memory. This is a difficult thing to understand and optimise, especially if the same application runs on different architectures or different generations of the same platform, but some tools can help. PAPI is one of them: this API, developed at the University of Tennessee, makes it possible to collect hardware counter data about floating point operations per second, vectorisation and cache usage. However, using PAPI requires significant code changes to collect and display this information.
Instead of instrumenting your application with PAPI, Allinea Forge 7.0 allows you to rely on the profiler's sampler to collect this information automatically with very little overhead. Furthermore, you can now rely on the profiler’s user-friendly interface to display the data over time. To install the additional metrics, run the “papi_install.sh” script from the installation folder and follow the instructions displayed to specify the metrics you are interested in, for example:
Overview – with FLOPS and cycles per instruction.
CachesMisses – with L1, L2 and L3 cache misses.
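To make these two families of metrics a little more concrete, the sketch below derives cycles per instruction and cache misses per thousand instructions from raw counter totals. The counter names are standard PAPI preset events, but the values are invented for illustration and the way the numbers are obtained here has nothing to do with how Allinea MAP samples them.

```python
def cycles_per_instruction(counters):
    """CPI: total cycles divided by total instructions completed."""
    return counters["PAPI_TOT_CYC"] / counters["PAPI_TOT_INS"]


def misses_per_kilo_instruction(counters, miss_counter):
    """Cache misses per 1000 instructions for the given miss counter."""
    return 1000.0 * counters[miss_counter] / counters["PAPI_TOT_INS"]


# Made-up counter totals for one sampling interval (assumption, not real data).
sample = {
    "PAPI_TOT_CYC": 3_000_000,  # total cycles
    "PAPI_TOT_INS": 2_000_000,  # instructions completed
    "PAPI_L1_DCM": 50_000,      # level 1 data cache misses
}

cpi = cycles_per_instruction(sample)
l1_mpki = misses_per_kilo_instruction(sample, "PAPI_L1_DCM")
print(cpi, l1_mpki)
```

A rising CPI together with a rising miss rate over the same period is the kind of pattern that points at memory-bound kernels.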
PAPI metrics are cross-platform …
… and make it possible to understand cache usage, for instance, on ARM and Power architectures as well as on x86.
In addition, as part of our cross-platform support, Allinea Forge 7.0 extends support to the new IBM Spectrum MPI for Power and x86_64 platforms.
Finally, Allinea Forge profiling data can be exported to JSON files in order to facilitate continuous integration. Thanks to this, performance regressions on multiple platforms can be tracked using Jenkins or Bamboo for example.
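As a minimal sketch of how exported data could feed a CI check, the code below compares two sets of metrics and reports those that grew by more than a tolerance. The JSON layout and the metric names (`runtime_s`, `io_time_s`) are hypothetical placeholders and do not reflect the actual structure of the files exported by Allinea MAP.

```python
import json


def find_regressions(baseline, current, tolerance=0.10):
    """Return {metric: (old, new)} for metrics that grew by more than tolerance."""
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        # Skip metrics missing from the new run or without a usable baseline.
        if new is None or old <= 0:
            continue
        if (new - old) / old > tolerance:
            regressions[name] = (old, new)
    return regressions


# Hypothetical, flattened metric summaries for a reference run and a new run.
baseline = json.loads('{"runtime_s": 100.0, "io_time_s": 20.0}')
current = json.loads('{"runtime_s": 104.0, "io_time_s": 30.0}')

regressions = find_regressions(baseline, current)
print(regressions)
```

A Jenkins or Bamboo job could run such a script after each build and fail the build when the returned dictionary is not empty.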
Optimisation of memory access patterns sometimes requires the use of a debugger to catch errors or understand how and where the data are allocated in the source code of the application. This is particularly critical on the latest Intel KNL architecture …
… whose performance comes from the high bandwidth memory located on the same chip as the 72 compute cores.
The KNL can be configured to use the HBM as a cache between the cores and the system memory (cache mode), or as a separate, distinct memory accessed explicitly by the programmer: this is the flat mode. The KNL can also be configured to use part of the HBM as a cache and part as a distinct memory (hybrid mode).
Giving programmers the key to understanding where allocations are performed on the system is necessary, and that’s why
Allinea Forge 7.0 extends the support of the KNL architecture to HBM memory debugging. HPC developers are now able to detect memory errors on the HBM but also track usage, as the screenshot shows, where 2 MB of data are allocated by process 0 on the main memory and 2 MB on the HBM.
The Lustre and PAPI metrics presented earlier are just an example of the flexibility of Allinea tools…
… Version 7.0 actually allows you to create your own metrics, whether you are interested in the evolution of very specific hardware metrics when running your application or in the evolution of the application's internal parameters. The idea behind writing your own metrics is the same in both cases: Allinea Forge’s profiler, Allinea MAP, requires a library to collect the samples and an XML descriptor to specify how the metric should be displayed in the GUI. Here are a few examples that have been developed for some projects…
This concludes this webinar. We have presented 7.0’s latest features, which illustrate Allinea tools’ capabilities for all HPC architectures: x86, KNL, ARM and OpenPower. Not only does this strengthen our position as the cross-platform tool of choice, but it also shows our commitment to providing a broad solution that covers all aspects of development, from debugging to I/O, MPI, fine-grain CPU and application-specific profiling. We also help enforce best practice by designing our tools to be easily integrated into development and testing workflows.
You can download and update your version of our tools from our website. The current revision is 7.0.2. If you don’t have a licence, feel free to request a trial licence from our website to try the latest features!
Thank you very much for attending this webinar. We now have time for a few questions, please