JLESC@GSI: Developer tools for porting & tuning parallel ...
2015-12-04 |
Member of the Helmholtz Association
Developer tools for porting & tuning parallel applications
Brian J. N. Wylie
Jülich Supercomputing Centre
b.wylie @ fz-juelich.de
Overview
JLESC Collaborative Project (start May 2015)
■ JSC: Brian Wylie, Christian Feld – Scalasca/Score-P/CUBE
■ RIKEN AICS: Miwako Tsuji, Hitoshi Murai – XcalableMP
■ BSC: Judit Gimenez – Paraver/Extrae/Dimemas, OmpSs
Goal
■ Integration and improvement of respective developer tools for porting and tuning parallel applications on large-scale computer systems
■ Joint training activities applying tools
Tools training
XcalableMP tutorial at JSC (1 Dec 2016)
Scalasca & Paraver tools training offered through VI-HPS
■ Virtual Institute – High-Productivity Supercomputing [vi-hps.org]
■ focus on parallel performance, correctness & debugging tools
■ VI-HPS Tuning Workshops several times each year
■ 3-5 days for application developers to get introduced to tools and receive assistance applying tools to their own codes
■ RIKEN AICS hosting VI-HPS-TW20 (24-26 Feb 2016) for users of K computer and related Fujitsu FX10/100 systems
■ additional workshops planned in Germany & France
Scalasca
Developed to support scalable performance analysis of large-scale parallel applications
■ available under open-source license from www.scalasca.org
■ offers flexible runtime summarization/profiling and event tracing
■ based on Score-P instrumentation & measurement infrastructure and CUBE analysis report utilities & explorer GUI
■ MPI + OpenMP, recently extended to support a variety of other threading paradigms (Pthreads, OpenCL, Qt, etc.), CUDA, etc.
■ e.g., JUQUEEN: Nekbone 28,672 MPI x 64 OMP = 1.8 M threads
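The thread count quoted for the Nekbone run is the product of MPI ranks and OpenMP threads per rank. A minimal C sketch of that arithmetic follows; the function name `total_threads` is hypothetical and used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Total threads in a hybrid MPI+OpenMP run:
 * every MPI rank hosts one OpenMP team of omp_threads_per_rank threads.
 * (total_threads is a hypothetical helper, not part of any tool.) */
uint64_t total_threads(uint64_t mpi_ranks, uint64_t omp_threads_per_rank)
{
    assert(mpi_ranks > 0 && omp_threads_per_rank > 0);
    return mpi_ranks * omp_threads_per_rank;
}
```

For the figures on the slide, `total_threads(28672, 64)` gives 1,835,008, which the slide rounds to 1.8 M.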
Scalability to over 1.8M threads (MPI+OMP)
Score-P architecture overview
■ Analysis tools: Vampir, Scalasca, Periscope, TAU, CUBE, TAUdb
■ Measurement results: event traces (OTF2), call-path profiles (CUBE4, TAU), online interface
■ Score-P measurement infrastructure, with hardware counters (PAPI, rusage, [PERF])
■ Instrumentation wrapper
  ■ Process-level parallelism (MPI, SHMEM)
  ■ Thread-level parallelism (OpenMP, Pthreads, [OmpSs])
  ■ Accelerator-based parallelism (CUDA, OpenCL, [OpenACC])
  ■ Source code instrumentation (OPARI2, PDT)
  ■ User instrumentation
■ Application
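The "User instrumentation" component of the architecture can be illustrated with Score-P's user-region macros. This is a minimal sketch, not taken from the slides: the kernel `sum_of_squares` and the region name are hypothetical, and the fallback macro definitions are added here only so that the file also builds when Score-P is not installed.

```c
#include <assert.h>

#ifdef SCOREP_USER_ENABLE
#include <scorep/SCOREP_User.h>   /* provided by a Score-P installation */
#else
/* Assumption for this sketch: without Score-P the macros expand to nothing. */
#define SCOREP_USER_REGION_DEFINE(handle)
#define SCOREP_USER_REGION_BEGIN(handle, name, type)
#define SCOREP_USER_REGION_END(handle)
#endif

/* Hypothetical kernel: returns 0^2 + 1^2 + ... + (n-1)^2. */
long sum_of_squares(long n)
{
    SCOREP_USER_REGION_DEFINE(sum_region)
    long total = 0;

    assert(n >= 0);

    /* Mark the loop as a named region so it appears in profiles and traces. */
    SCOREP_USER_REGION_BEGIN(sum_region, "sum_of_squares_loop",
                             SCOREP_USER_REGION_TYPE_COMMON)
    for (long i = 0; i < n; i++)
        total += i * i;
    SCOREP_USER_REGION_END(sum_region)

    return total;
}
```

With Score-P installed, user regions are typically activated by building through the instrumenter with its `--user` option, which defines `SCOREP_USER_ENABLE`; a plain compiler build leaves the kernel uninstrumented.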
PEPC 32 MPI ranks with 13 pthreads per rank on Juropa
Lock contention (pthreads)
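As an illustration of the kind of code in which Pthreads lock contention arises, the following sketch has several threads repeatedly acquire one mutex to update a shared counter. It is not the PEPC code from the slides; the thread count, iteration count and all names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4        /* hypothetical thread count */
#define INCREMENTS  100000   /* hypothetical work per thread */

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

/* Each worker serialises on the same mutex for every increment,
 * so threads spend much of their time waiting to acquire the lock. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);
        shared_counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs all workers to completion and returns the final counter value. */
long run_contended_counter(void)
{
    pthread_t threads[NUM_THREADS];

    shared_counter = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&threads[t], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    return shared_counter;
}
```

The result is correct, but each critical section is so short that waiting for the lock can take longer than the work inside it. A common remedy is to accumulate into a thread-local variable and take the lock once per thread.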
MPI+OpenCL: SPECFEM3D_GLOBE
Tool integration prototypes
Analysis
■ Scalasca can direct Paraver or Vampir to show sections of event trace with severest instances of communication/synchronisation inefficiencies
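To make the idea of a "severest instance" concrete, the sketch below picks, from a list of point-to-point messages, the one in which the receiver waited longest for a sender that started late. It is a simplified illustration under stated assumptions, not Scalasca's implementation: the `message_event` record and both function names are hypothetical, and waiting time is taken as the gap between entering the receive and entering the matching send.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one matched send/receive pair. */
typedef struct {
    double send_enter;  /* time the sender entered its send call */
    double recv_enter;  /* time the receiver entered the matching receive */
} message_event;

/* Waiting time when the sender is late: the receiver is blocked from
 * entering the receive until the send starts. Zero if the sender was early. */
double late_sender_wait(const message_event *m)
{
    double wait = m->send_enter - m->recv_enter;
    return wait > 0.0 ? wait : 0.0;
}

/* Index of the message with the largest waiting time,
 * or -1 if no receiver had to wait. */
long most_severe_instance(const message_event *msgs, size_t count)
{
    long worst = -1;
    double worst_wait = 0.0;

    assert(msgs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        double wait = late_sender_wait(&msgs[i]);
        if (wait > worst_wait) {
            worst_wait = wait;
            worst = (long)i;
        }
    }
    return worst;
}
```

The timestamps of the selected instance define the time window that a trace browser would then be asked to display.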
Measurement
■ OmpSs runtime can generate events for Extrae/Score-P measurements of tasks
■ XMP compiler can include instrumentation for Scalasca measurements of parallel regions
Score-P MPI+OmpSs profile
XcalableMP-Scalasca instrumentation
XMP code

    #pragma xmp loop profile
    for(i = 0; i < 100; i++) array[i] = func(i);
    #pragma xmp bcast (a)

xmpcc -profile: profile (specific) pragma-blocks

    EPIK_USER_REG(r_loop#, "loop#");
    EPIK_USER_START(r_loop#);
    #pragma xmp loop profile
    for(i = 0; i < 100; i++) array[i] = func(i);
    EPIK_USER_END(r_loop#);
    EPIK_GEN_OFF();
    #pragma xmp bcast (a)
    EPIK_GEN_ON();
*Masahiro NAKAO, Center for Computational Sciences, University of Tsukuba
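The effect of this selective instrumentation can be mimicked in plain C. The sketch below assumes, from their names, that `EPIK_GEN_OFF()` and `EPIK_GEN_ON()` suspend and resume event generation. It uses hypothetical stand-ins (`demo_record`, `demo_enabled`, `broadcast_stub`) instead of the real measurement library or an XMP compiler, and simply counts the events that would be recorded.

```c
#include <assert.h>

/* Hypothetical stand-ins for the measurement system. */
static long demo_events  = 0;  /* number of events recorded */
static int  demo_enabled = 1;  /* 1: events are generated, 0: suppressed */

static void demo_record(const char *what)
{
    (void)what;
    if (demo_enabled)
        demo_events++;
}

static int func(int i) { return 2 * i; }  /* hypothetical loop body */

/* Stand-in for the unprofiled broadcast: it would emit enter/exit events,
 * but they are dropped while generation is switched off. */
static void broadcast_stub(void)
{
    demo_record("bcast enter");
    demo_record("bcast exit");
}

/* Mirrors the instrumented code on the slide; returns the recorded events. */
long run_instrumented(int *array, int n)
{
    demo_events  = 0;
    demo_enabled = 1;

    demo_record("loop enter");            /* role of EPIK_USER_START */
    for (int i = 0; i < n; i++)
        array[i] = func(i);
    demo_record("loop exit");             /* role of EPIK_USER_END */

    demo_enabled = 0;                     /* role of EPIK_GEN_OFF (assumed) */
    broadcast_stub();
    demo_enabled = 1;                     /* role of EPIK_GEN_ON (assumed) */

    return demo_events;
}
```

Only the loop marked with `profile` contributes events; the broadcast runs but leaves no trace in the measurement, which keeps overhead and trace size focused on the regions the developer selected.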
XMP -allprofile
XMP -profile
Score-P on-going developments
Support for MPI_THREAD_MULTIPLE
Support for MPI-3
Support for OMPT
Support for OpenACC
Support for additional architectures & platforms
Support for binary instrumentation
Support for I/O and memory events
Support for orphan (parentless) threads
Support for sampling and callstack walking
...