

D4.2 “Final report about the porting of the full-scale scientific applications”

Version 1.0

Document Information Contract Number 288777

Project Website www.montblanc-project.eu

Contractual Deadline M24

Dissemination Level PU

Nature Report

Author S. Requena (GENCI)

Contributors

B. Videau (IMAG), D. Brayford (LRZ), P. Lanucara (CINECA), X. Saez (BSC), R. Halver (JSC), D. Broemmel (JSC), S. Mohanty (JSC), N. Sanna (CINECA), C. Cavazzoni (CINECA), JH. Meincke (JSC), D. Komatitsch (Université de Marseille) and V. Moureau (CORIA)

Reviewer Petar Radojkovic (BSC)

Keywords Exascale, scientific applications, porting, profiling, optimisation

Notices: The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.

2011 Mont-Blanc Consortium Partners. All rights reserved.


Change Log

Version Description of Change

V0.1 Initial draft released to the WP4 contributors

V0.2 Version released to internal reviewer

V0.3 Comments of the internal reviewer

V1.0 Final version sent to EC


Table of Contents

1 Introduction
2 Platforms used by WP4
3 Report on the WP4 applications
  3.1 Summary
  3.2 BigDFT
    3.2.1 Description of the code
    3.2.2 Report of the progress on the porting of the code
  3.3 BQCD
    3.3.1 Description of the code
    3.3.2 Report on the progress of the porting of the code
  3.4 COSMO
    3.4.1 Description of the code
    3.4.2 Details on the COSMO benchmark version
    3.4.3 Report on the progress of the porting of the code
  3.5 EUTERPE
    3.5.1 Description of the code
    3.5.2 Report on the progress of the porting of the code
  3.6 MP2C
    3.6.1 Description of the code
    3.6.2 Report on the progress of the porting of the code
  3.7 PEPC
    3.7.1 Description of the code
    3.7.2 Report of the progress of the porting
  3.8 ProFASI
    3.8.1 Description of the code
    3.8.2 Report on the progress of the porting of the code
  3.9 Quantum Espresso
    3.9.1 Description of the code
    3.9.2 Report on the progress of the porting of the code
  3.10 SMMP
    3.10.1 Description of the code
    3.10.2 Report on the progress of the porting of the code
  3.11 SPECFEM3D
    3.11.1 Description of the code
    3.11.2 Report of the progress of the porting
  3.12 YALES2
    3.12.1 Description of the code
    3.12.2 Report of the progress of the porting of the YALES2 code
    3.12.3 Interactions with other WPs
4 Conclusions and next steps
List of figures
List of tables
Acronyms and Abbreviations
List of references


Executive Summary

The Mont-Blanc project aims to assess the potential of clusters based on low-power embedded components to address future Exascale HPC needs. The role of work package 4 (WP4, “Exascale applications”) is to port, co-design and optimise up to 11 real exascale-class scientific applications on the different generations of platforms available, in order to assess the global programmability and the performance of such systems. Following the first report, D4.1 “Preliminary report of progress about the porting of the full-scale scientific applications” [1], this report gives an overview of the results of the final porting of all 11 applications on the different systems made available by the project or by its partners.


1 Introduction

As a complement to the activities of work package 3 (WP3, “Optimized application kernels”), part of the activities of Mont-Blanc is to assess, on the different generations of platforms made available by the project, the behaviour of up to 11 “real” exascale-class scientific applications. The objective of work package 4 (WP4, “Exascale applications”) is to evaluate the global programmability and the performance (in terms of time and energy to solution) of the architecture and to assess the efficiency of the hybrid OmpSs/MPI programming model. These eleven real scientific applications, used by academia and industry and running daily in production on existing European (PRACE Tier-0) or national HPC facilities, have been selected by the different partners in order to cover a wide range of scientific domains (geophysics, fusion, materials, particle physics, life sciences, combustion, weather forecast) as well as of hardware and software needs. Some of these applications are also part of the 2010 PRACE benchmark suite (flagged with the P symbol after the name of the code):

Table 1 - List of the 11 WP4 scientific applications

Code                  Sc. Domain                      Contact                  Institution
YALES2                Combustion                      V. Moureau               CNRS/CORIA
EUTERPE (P)           Fusion                          X. Saez                  BSC
SPECFEM3D (P)         Geophysics                      D. Komatitsch            Univ. Marseille
MP2C                  Multi-particle collision        G. Sutmann, A. Schiller  JSC
BigDFT                Elect. structure                B. Videau                IMAG
Quantum Espresso (P)  Elect. structure                C. Cavazzoni, N. Sanna   CINECA
PEPC (P)              Coulomb + gravitational forces  P. Gibbon, L. Arnold     JSC
SMMP                  Protein folding                 J. Meinke                JSC
ProFASI               Protein folding                 S. Mohanty               JSC
COSMO                 Weather forecast                P. Lanucara              CINECA
BQCD (P)              Particle physics                D. Brayford              LRZ


This report refers to the activities planned in WP4 under Task 4.1 and Task 4.2:

T4.1. Porting of the applications (m6:m24)

The 11 applications will first be ported to the prototypes made available by WP7, and several benchmarks will be conducted in order to evaluate the maturity and performance of the software stack and the time to port. These porting activities will be conducted in strong collaboration with WP3, benefiting from the porting of the kernels to MPI/OmpSs, and with WP5, benefiting from the different components of the software stack made available.

T4.2. Profiling, benchmarking and optimization (m6:m36)

Following the work performed in Task 4.1, a subset of applications offering the best potential for exploiting the hardware and software characteristics of these prototypes will be selected. This selection will take into account the results of the WP3 and WP5 activities in terms of kernel and software library availability and performance, as well as all the other components of the software stack. On this subset of applications, dedicated optimisation efforts will focus on the effective usage of SIMD vector units and on hybridisation with potential accelerators using portable programming models like OpenCL, since some of the proposed codes already have OpenCL versions.

2 Platforms used by WP4

During the second year of Mont-Blanc, the WP4 team worked on:

Finalising the porting of the applications on Tibidabo, the first low power system based on ARM architecture made available by Mont-Blanc partners.

Evaluating the performance of novel low-power architectures like the Arndale board, with one dual-core ARM Cortex-A15, a Mali T604 GPU and 2 GB of DDR3L memory with a bandwidth of 12.8 GB/s.

Figure 1- Picture of a single Arndale board

Providing scalability curves on traditional large scale HPC systems like the PRACE Tier0 systems:

o SuperMUC: a 3.2 PFlops IBM iDataPlex supercomputer installed at the Leibniz Supercomputing Centre (Germany) with 18,432 Intel Xeon Sandy Bridge-EP processors (147,456 cores) and 288 TB of distributed memory. The compute nodes of SuperMUC are interconnected through an InfiniBand FDR network and access a shared GPFS parallel filesystem of 12 PB.

o JUQUEEN: a 5.9 PFlops IBM BlueGene/Q supercomputer installed at the Juelich Supercomputing Centre (Germany) with 28 racks (28,672 nodes for a total of 458,752 cores) and 448 TB of distributed memory. The compute nodes of JUQUEEN are interconnected through a proprietary 5D torus network and access a shared GPFS parallel filesystem.

o CURIE: a 2 PFlops BULL Bullx supercomputer installed at TGCC/GENCI (France) with 3 different partitions (thin nodes, fat nodes and hybrid nodes). The biggest partition, the thin nodes, is composed of 5,040 compute blades for a total of 80,640 Intel Xeon Sandy Bridge-EP cores and 322 TB of main memory. The 3 partitions are interconnected by an InfiniBand QDR fat-tree network and access a 15 PB two-level shared Lustre parallel filesystem.

3 Report on the WP4 applications

3.1 Summary

The following table shows, for each application, the current status of the porting across the different programming models:

Code              MPI     MPI+OpenMP               CUDA         OpenCL       OpenACC      OmpSs
BigDFT            PORTED  NOT PLANNED              PORTED       PORTED       NOT PLANNED  NOT PLANNED
BQCD              PORTED  PORTED                   NOT PLANNED  ONGOING      NOT PLANNED  PORTED
COSMO             PORTED  PORTED                   PORTED       PLANNED      PLANNED      ONGOING
EUTERPE           PORTED  PORTED                   PLANNED      ONGOING      NOT PLANNED  PORTED
MP2C              PORTED  PLANNED                  PLANNED      PLANNED      NOT PLANNED  ONGOING
PEPC              PORTED  PORTED (using pthreads)  PLANNED      PLANNED      ONGOING      ONGOING
ProFASI           PORTED  NOT PLANNED              NOT PLANNED  PLANNED      NOT PLANNED  NOT PLANNED
Quantum Espresso  PORTED  PORTED                   PORTED       ONGOING      NOT PLANNED  ONGOING
SMMP              PORTED  NOT PLANNED              PORTED       PORTED       NOT PLANNED  ONGOING
SPECFEM3D         PORTED  NOT PLANNED              PORTED       ONGOING      NOT PLANNED  ONGOING
YALES2            PORTED  ONGOING                  NOT PLANNED  NOT PLANNED  NOT PLANNED  NOT PLANNED

Table 2 - Status of the porting for each MB application


In parallel with the porting, a significant effort has been made by the WP4 partners, in collaboration with BSC, to analyse the scaling of all 11 applications on various HPC configurations, including PRACE large-scale systems (like CURIE, SuperMUC, JUQUEEN or FERMI) and national systems (Cray XC30 at CSCS). The following figure shows the strong scaling results (speedup versus number of cores) for 9 of the 11 MB applications. For the SMMP application only weak scaling results were available, and during the period the ProFASI code underwent many modifications, so no performance evaluation could be carried out for it.

Figure 2 - Strong scaling of the Mont-Blanc applications

[Figure 2 legend: YALES2, SPECFEM3D, MP2C (N=100000/core), MP2C (N=10000/core), MP2C (N=2x10^7), COSMO RAPS (no I/O), EUTERPE, BigDFT, PEPC (32B particles), BQCD, QE; speedup versus number of cores, with linear scaling shown for reference.]

The following sections describe in detail the conclusions of the porting for each of the applications. The overall conclusions are:

The software stack provided by the Linux distributions and by WP5 is sufficient to port quite easily all the applications and their external dependencies (meshers, post-processing tools, I/O and numerical libraries).


The Tibidabo cluster, which was designed primarily for porting purposes, has allowed us not only to port but also to profile and to scale out some of the applications.

After the initial results presented one year ago in D4.1, some improvements have been achieved by changing the job scheduler of Tibidabo, rebooting the Cisco network switches more often, and using larger datasets.

The partners also worked on updating performance numbers on PRACE petascale systems like CURIE, SuperMUC or JUQUEEN, as well as on Arndale boards, which prefigure the next Mont-Blanc prototype. Most of the applications show good strong scaling when running on large-scale HPC systems.

3.2 BigDFT

3.2.1 Description of the code

BigDFT (http://inac.cea.fr/L_Sim/BigDFT/) is an ab-initio simulation software based on the Daubechies wavelet family. The software computes the electronic orbital occupations and energies. Several execution modes are available, depending on the problem under investigation. Cases can be periodic in various dimensions and use K-points to increase the accuracy along the periodic dimensions. Test cases can also be isolated.

Figure 3 - Nitrogen (N2) electronic orbitals

The BigDFT project was initiated during a European project (FP6-NEST) from 2005 to 2008.



Four institutions were involved in the BigDFT project at that time:

Commissariat à l'Énergie Atomique (T. Deutsch),

University of Basel (S. Goedecker),

Université catholique de Louvain (X. Gonze),

Christian-Albrechts Universität zu Kiel (R. Schneider).

BigDFT is an Open Source project and the code is available at: http://inac.cea.fr/L_Sim/BigDFT/.

Since 2010, four laboratories have been contributing to the development of BigDFT: L_Sim (CEA), UNIBAS, LIG and ESRF. BigDFT is mainly used by academics.

The code is written mainly in Fortran (121k lines) with parts in C/C++ (20k lines); it is parallelised using MPI, OpenMP and OpenCL. It also uses the BLAS and LAPACK libraries.

BigDFT scalability is good: multiple runs using more than 4,096 cores of an x86 cluster (CURIE) have been conducted, and hybrid runs using up to 288 GPUs of the CURIE hybrid partition have also been performed.

3.2.2 Report of the progress on the porting of the code

Porting BigDFT to ARM using the GNU tool chain was straightforward. As BigDFT possesses a large non-regression test library, asserting its proper behaviour was simple.

In order to have points of comparison and to conduct some experiments before the availability of the Mont-Blanc prototypes, we also used a Snowball platform (ARM Cortex-A9) for testing. Results obtained on the Snowball cards are similar to those of Tibidabo, with about a 5% advantage for Tibidabo. We also recently benchmarked BigDFT on the Exynos 5 Dual processor (Arndale board), the reference SoC chosen for the Mont-Blanc prototype. The energy results shown in Table 3 refer to a worst-case energy consumption scenario: the consumption of the Xeon platform is set at 95 watts (the TDP of the chip), while the Snowball and the Exynos are set at 2.5 and 5 watts respectively (the TDP of each platform).

The test used in this case is a non-regression test of BigDFT; it computes the electronic density surrounding a silane molecule (SiH4):

                    Snowball board  Xeon X550   Exynos 5 Dual  Exynos 5 Dual
                    (1 core)        (4 cores)   (1 core)       (2 cores)
Execution time (s)  420.4           18.1        159            97.4
Energy (J)          1050            1720        795            487

Table 3 - BigDFT execution time and energy on ARM vs x86

As can be seen, the energy ratio favours the ARM processors.
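For reference, the worst-case energy figures in Table 3 are simply the product of the assumed TDP and the measured execution time. The short C sketch below reproduces them; the platform labels and the rounding are ours.

/* Worst-case energy estimate used for Table 3: energy [J] = assumed TDP [W] x time [s]. */
#include <stdio.h>

int main(void)
{
    const struct { const char *platform; double tdp_w; double time_s; } runs[] = {
        { "Snowball board (1 core)",   2.5, 420.4 },  /* platform TDP */
        { "Xeon (4 cores)",           95.0,  18.1 },  /* chip TDP     */
        { "Exynos 5 Dual (1 core)",    5.0, 159.0 },  /* platform TDP */
        { "Exynos 5 Dual (2 cores)",   5.0,  97.4 },
    };
    for (unsigned i = 0; i < sizeof runs / sizeof runs[0]; ++i)
        printf("%-26s %6.0f J\n", runs[i].platform, runs[i].tdp_w * runs[i].time_s);
    return 0;
}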

The benchmarking of BigDFT revealed a communication issue with the network infrastructure of the prototype. Without network problems, the Tibidabo prototype was found to be 10 times slower than a Xeon processor but 4 times more power efficient, using the worst-case energy consumption scenario.

A network issue, due to congestion observed at the level of the Ethernet switches, limits the scaling of the application on Tibidabo from 16 to 36 cores (8 to 18 nodes):

Figure 4 - BigDFT initial scaling on Tibidabo

We investigated this problem and found it to be due to collective communication problems, where some nodes participating in the communication lost packets. Those packets had to be resent, adding an additional delay to the communication. Sometimes even the resent packets would be lost and the communication suffered yet another delay. The delay is 10 times longer than the communication it hindered, and the problem is more likely to occur when the number of nodes is high. This explains the bad scaling in the previous figure. In order to confirm the scalability of the code we ran it on 4,096 cores of the CURIE CEA/GENCI cluster. BigDFT shows an efficiency (in blue) of 95% and a speedup of 15 between 256 and 4,096 cores when simulating a Co-metalloporphyrin on top of charged graphene. The bars show the scalability of each of the major parts of BigDFT: numerical algorithms (LinAlg), convolutions (Conv), potential computations (Potential), MPI communications (Comms) and the rest of the code (Other).


Figure 5 - BigDFT strong scaling on an x86 system (CURIE)

In order to investigate those problems, we instrumented the code using Extrae and PAPI. The results we obtained are shown in the next figure:

Figure 6 - Traces of BigDFT on Tibidabo using 36 cores (18 boards)

After these observations were made, some optimisation was done on the network, providing better load balancing on the system and less contention when running experiments. New experiments were performed on Tibidabo, scaling up to 128 nodes, and showed better results.

In parallel, a simulation of the platform was made. Calibration scripts were executed on the prototype to observe the behaviour of MPI benchmarks, and to inject these details in the simulation. These simulation efforts will be continued in the future to perform energy consumption simulation, and to anticipate scaling to larger platforms.

Results of these new runs are shown in the figure below. Small scaling issues still arise at 128 nodes, when the links between the different switches become saturated by the collective calls BigDFT uses. Overall scaling on the Tibidabo prototype is now satisfactory, however, with a speedup of up to 12.5 when moving from 8 to 128 cores.

Figure 7 - New BigDFT scaling on Tibidabo. Real vs simulated

(Speedup reference: 1 for 8 cores)

3.3 BQCD

3.3.1 Description of the code

BQCD is used in benchmarks for supercomputer procurement at LRZ, as well as in the DEISA and PRACE projects, and it is a code basis in the QPACE project [2]. The benchmark code is written in Fortran 90. BQCD is a program that simulates QCD with the Hybrid Monte-Carlo algorithm. QCD is the theory of strongly interacting elementary particles; it describes particle properties like masses and decay constants from first principles. The starting point of QCD is an infinite-dimensional integral. In order to study the theory on a computer, the space-time continuum is replaced by a four-dimensional regular finite lattice with (anti-)periodic boundary conditions. After this discretisation the integral is finite-dimensional, but still rather high-dimensional, and it is solved by Monte-Carlo methods. Hybrid Monte-Carlo programs have a compute-intensive kernel, which is an iterative solver of a large system of linear equations. In BQCD we use a standard conjugate gradient (CG) solver. Depending on the physical parameters, between 80% and more than 95% of the execution time is spent in this solver.
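To illustrate the structure of this kernel only, the sketch below implements a textbook unpreconditioned conjugate gradient for a small dense symmetric positive-definite system in C; BQCD's production solver applies the same iteration to the sparse lattice Dirac operator, distributed over MPI ranks.

/* Textbook unpreconditioned conjugate gradient, x = A^{-1} b, for a small dense
 * SPD matrix. Illustrative only; not taken from BQCD. */
#include <math.h>
#include <stdio.h>
#define N 4

static double dot(const double *u, const double *v)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += u[i] * v[i];
    return s;
}

static void matvec(const double A[N][N], const double *x, double *y)
{
    for (int i = 0; i < N; ++i) {
        y[i] = 0.0;
        for (int j = 0; j < N; ++j) y[i] += A[i][j] * x[j];
    }
}

int main(void)
{
    const double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};  /* SPD test matrix */
    const double b[N] = {1,2,3,4};
    double x[N] = {0}, r[N], p[N], Ap[N];

    matvec(A, x, r);                                   /* r = b - A*x, p = r */
    for (int i = 0; i < N; ++i) { r[i] = b[i] - r[i]; p[i] = r[i]; }
    double rr = dot(r, r);

    for (int k = 0; k < 100 && sqrt(rr) > 1e-12; ++k) {
        matvec(A, p, Ap);
        double alpha = rr / dot(p, Ap);                /* step length along p */
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        for (int i = 0; i < N; ++i) p[i] = r[i] + (rr_new / rr) * p[i];
        rr = rr_new;
    }
    for (int i = 0; i < N; ++i) printf("x[%d] = %f\n", i, x[i]);
    return 0;
}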

3.3.2 Report on the progress of the porting of the code

Regarding the first conclusions of the porting of BQCD on Tibidabo, the LRZ team worked with BSC experts in order to understand the limitations of scalability shown in D4.1.


One of the reasons found was a poor process placement policy of the Torque job scheduler used at the beginning of the exploitation of Tibidabo. By moving to the SLURM scheduler, providing it with knowledge of the network topology, and using bigger datasets (bigger lattices), the overall performance of BQCD on Tibidabo has been improved on up to 128 cores. The following figure illustrates, with 2 different datasets (8x8x8x16 in blue and a bigger one of 32x32x32x48 in green), the scaling in performance of BQCD on up to 128 cores (64 boards) of Tibidabo:

Figure 8 - Increased scalability of BQCD using SLURM on Tibidabo

In July 2013, the Leibniz Supercomputing Centre held the first workshop to test extreme scaling on SuperMUC, a 3.2 PFlops system with 147,456 Intel Sandy Bridge CPU cores. Groups from 15 international projects came to LRZ with codes that could already scale up to 4 islands (32,768 cores). During the workshop, the participants tested the scaling capabilities on the whole system, and BQCD was one of the 15 applications that benefited from this scaling-out exercise. The following figures show the strong scaling in time of the BQCD conjugate gradient solver (which consumes around 95% of the total time) with a big dataset (lattice of 96x96x96x192), on up to 16,384 cores for the MPI version and on up to 131,072 cores for the hybrid (MPI+OpenMP) version of BQCD on SuperMUC.


Figure 9 - Strong scaling of the MPI solver of BQCD

Figure 10 - Strong scaling of the MPI/OpenMP solver of BQCD

The BQCD code has also been fully ported and compiled on JUDGE (a small x86 system at JSC) using the Mercurium Fortran compiler provided by BSC. Then, as part of the joint WP3/WP4 activity, the solver has been ported to OmpSs with strong support from BSC, and first performance results have been obtained recently:


Compiler        Processors  Lattice size  Time (s)  Max MFlops (per process)  Total GFlops
OmpSs compiler  16          8x8x8x16      68        20.92                     0.33
OmpSs compiler  32          8x8x8x16      90.75     7.82                      0.25
Intel compiler  16          8x8x8x16      35        39.51                     0.64
Intel compiler  32          8x8x8x16      93        7.65                      0.24

Table 4 - Performance of the CG solver using Intel and OmpSs compiler

Comparing the conjugate gradient solver performance of the standard hybrid MPI/OpenMP version of BQCD, it appears that the binary generated by the Intel ifort compiler has approximately twice the performance of the binary generated by the OmpSs (Mercurium) compiler. This difference in performance is not surprising, given the different stages of evolution of the two compilers. The behaviour observed between 16 and 32 cores needs further investigation by the developers. What is interesting, however, is that with relatively small input lattices the OmpSs-generated binary performs slightly better than the Intel-compiled binary. The reason is that the conjugate gradient solver employs a domain decomposition method, and if the lattice (matrix) is small, communication between processors becomes the dominant factor. This is also the reason why larger input lattices can reach higher performance than smaller ones: each processor performs more arithmetic operations and less inter-processor communication. Finally, specific work was conducted to improve the vectorisation of the code and to port the solver to OpenCL (see deliverable D3.3 for more information).
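The communication-versus-computation argument can be made concrete with a surface-to-volume estimate: when the local sub-lattice is small, the halo that has to be exchanged with the neighbours is comparable to, or larger than, the number of interior sites each process updates. The C sketch below illustrates this for the 8x8x8x16 lattice of Table 4; the process grids chosen here are assumptions for illustration, not BQCD's actual decomposition.

/* Surface-to-volume estimate for a 4-D lattice domain decomposition.
 * Halo sites grow relative to interior sites as the local lattice shrinks,
 * which is why small lattices become communication bound. Illustrative only. */
#include <stdio.h>

static void report(const int g[4], const int p[4])
{
    int local[4], volume = 1, surface = 0;
    for (int d = 0; d < 4; ++d) { local[d] = g[d] / p[d]; volume *= local[d]; }
    for (int d = 0; d < 4; ++d) {
        if (p[d] == 1) continue;               /* no halo in undivided dimensions */
        int face = 1;
        for (int e = 0; e < 4; ++e) if (e != d) face *= local[e];
        surface += 2 * face;                   /* two halo faces per divided dimension */
    }
    printf("%dx%dx%dx%d lattice on %dx%dx%dx%d ranks: %d interior sites, %d halo sites per rank\n",
           g[0], g[1], g[2], g[3], p[0], p[1], p[2], p[3], volume, surface);
}

int main(void)
{
    const int lattice[4] = {8, 8, 8, 16};
    const int ranks16[4] = {2, 2, 2, 2};       /* assumed layout for 16 processes */
    const int ranks32[4] = {2, 2, 2, 4};       /* assumed layout for 32 processes */
    report(lattice, ranks16);
    report(lattice, ranks32);
    return 0;
}

With these assumed layouts the halo exceeds the interior volume per rank, so the halo exchange dominates; a much larger lattice such as the 96x96x96x192 one used on SuperMUC reverses the ratio.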

3.4 COSMO

3.4.1 Description of the code

The principal objective of the COnsortium for Small-scale MOdeling (COSMO) is the creation of a meso-to-micro scale prediction and simulation system. This system is intended to be used as a flexible tool for specific tasks of weather services as well as for various scientific applications on a broad range of spatial scales.

COSMO is distributed under a specific licence, is used by academic and industrial users, and is written in Fortran 90 and parallelised using MPI. All the I/O operations are managed through external linkage with the NetCDF library. The code has been ported to standard Linux clusters (PPC, x86 and ARM cores) and some efforts have been undertaken outside of this project to port it to GPUs. Scalability curves on the PLX hybrid cluster at CINECA (http://www.hpc.cineca.it/content/ibm-plx-gpu-user-guide-0) are reported in D3.1 of the Mont-Blanc project.


3.4.2 Details on the COSMO benchmark version(*)

(*) For the sake of readability, we report in these sections part of the contribution already given in D3.2

The COnsortium for Small-scale MOdeling (COSMO [3]) was formed in October 1998 with the objective to develop, improve and maintain a non-hydrostatic limited-area atmospheric model, to be used both for operational and for research applications by the members of the consortium. To meet the computational requirements of the model, the program has been coded in standard Fortran 90 and parallelised using the MPI library for message passing on distributed-memory machines. Several codes are part of the general model (COSMO-ART, COSMO-CCLM, etc.), suitable for specific purposes. Among these, COSMO RAPS is a reduced version of the COSMO code and is used mainly for benchmarking purposes by vendors, research communities and consortium members. The release first used in this project was COSMO_RAPS 5.0.

Together with RAPS, in order to better address the "operational" environment of the COSMO code, the OPCODE testbed was established in 2011 in the framework of the Swiss HP2C project. The OPCODE [4] project (and testbed) is a sort of demonstrator of the entire operational suite of MeteoSwiss, ranging from IFS (Integrated Forecast System) boundary conditions to post-processing and presentation of results. Its numerical core is based on the COSMO operational release 4.19.2.2.

Both RAPS and OPCODE have been selected for the initial porting to the ARM architecture. The main reason for this choice is that both are of interest to the COSMO community (RAPS for benchmarking and OPCODE for operational use) and are the versions best suited to be implemented on the Mont-Blanc prototype architecture. The RAPS code was the first ported to the Tibidabo machine during T3.1 and T4.1. While the porting of RAPS to the ARM prototype has been completed, some problems occurred during the simulation step. In order to fix those problems and to finalise the activity within WP3 and WP4, the OPCODE toolchain was considered for the initial porting instead of RAPS. Among the advantages of OPCODE with respect to RAPS we highlight the following:

The chance to easily change the COSMO code structure, for example the "dynamical core" or the communication library ("stencil"), with minimal effort.

The advanced status of the implementation of the COSMO model on GPU architectures. The GPU implementation of OPCODE is still at a prototype level; nevertheless, the main computational parts (Dynamical core and Physics) have already been ported to GPU. In particular, the porting of Physics to GPU was carried out using the OpenACC tool, which could represent a great advantage for the porting to the OmpSs toolkit (see the next section for details).

3.4.3 Report on the progress of the porting of the code

As already reported in D4.1, the COSMO RAPS code has been ported to Tibidabo using the GNU toolchain and no particular issues were encountered in the process. However, as of M18 the reference version in WP4 has been switched to OPCODE and, as also reported in D3.2, OPCODE was successfully ported to the Tibidabo cluster. The code was built using the GNU 4.6.2 gfortran compiler with the options:


-ftree-vectorize -mcpu=cortex-a9 -mtune=cortex-a9

During the activities in WP4, a comprehensive set of benchmarks has been carried out on the ARM version of the code using the Tibidabo prototype. As input dataset, the "artificial" configuration of OPCODE was chosen, in order to guarantee an easy setup of the simulation with minimal data movement. Up to now, strong scaling up to 128 Tibidabo nodes has been reached without difficulty. The following figure illustrates the scaling in time of COSMO on a mesh of 256x256x60 elements on up to 128 cores of Tibidabo:

Figure 11 - Strong scaling of COSMO OpCode on Tibidabo

A limitation encountered with asynchronous I/O (nprocio>0) was solved on Tibidabo, and OPCODE is now able to use both synchronous and asynchronous I/O. In the previous figure the CPU-bound part of the code (DYN: Dynamical core + PHY: Physics) is reported in blue, while the communication and I/O timing is given in light green. It is clearly evident from the figure that the almost linear scalability of the CPU-bound code sections has a steady counterpart that remains constant as the number of processors increases. This behaviour is typical of this sort of input-dependent application, and the optimisation guidelines were oriented along two directions:

1. optimising the input dataset, using a larger, more realistic (x,y,z) grid;
2. relying on the OmpSs tool to overlap the computation and communication phases of the run.

For the optimisation of the input dataset, once the input is configured as "artificial" it is enough to have a sufficient number of nodes so that the full grid can be kept in memory. Of course such an optimisation requires a sufficiently large computing system, and at present we were not able to carry it out on any Mont-Blanc prototype. To show a feasible benchmark workout, we report in the following figure a test case carried out by MeteoSwiss on a Cray XC machine with up to 2304 physical cores (blue: measured speedup; red: linear speedup):

Figure 12 - Strong scaling of COSMO on a Cray XC30 system from 288 to 2304 physical cores (576 to 4608 virtual cores)

As detailed at the joint CRESTA/DEEP/Mont-Blanc meeting in Barcelona in June 2013 [5], this trend of super-linear speedup for the strong scalability of any COSMO (MPI) version can be obtained by switching the I/O off and by using a sufficiently large (x, y, z) grid. Taking into account that the grid used in this benchmark was more than 40 times larger than the one we used in our benchmark on Tibidabo, we are confident that similar or even larger grids could eventually be used on the final Mont-Blanc prototype, thus reaching the best performance with respect to the input dataset.

Regarding the porting to the OmpSs toolchain, and in order to benefit from the overlap of the computation and communication phases of the run, we first carried out a comprehensive profiling of OPCODE. As a whole, the Dynamical core and Physics are responsible for the most computationally demanding parts of OPCODE. The porting of the Dynamical core to GPU involved a complete rewrite in C++ carried out by the COSMO consortium and the use of the so-called "stencil library" in order to execute it; this part appears to be at a more prototypal ("experimental") stage than the Physics part. On the other hand, Physics accounts for approximately 20% of the total execution time of an OPCODE simulation and is more compute bound than the other parts of the code.

Dynamical core porting to OmpSs

We recall that the Dynamical core of OPCODE, together with Physics, is responsible for most of the computing time of a given simulation. YUTIMING analysis shows that within the Dynamical core most of the computing time is spent in the fast_waves solver, which is used in OPCODE to compute the fast-wave terms related to the update of the prognostic variables in the Dynamical core. For the sake of simplicity, simulations were carried out using this routine in the simplest way, leaving numerical discussions to further studies. Porting to OmpSs was done by working on the source code directly. For example, a typical loop in fast_waves:

DO k = 2, ke
  DO j = jstart, jend
    DO i = istart, iend
      zrofac = ( rho0(i,j,k)*dp0(i,j,k-1) + rho0(i,j,k-1)*dp0(i,j,k) ) &
             / ( rho(i,j,k)*dp0(i,j,k-1)  + rho(i,j,k-1)*dp0(i,j,k) )
      zcw1(i,j,k) = 2.0_ireals*g*dts*zrofac/(dp0(i,j,k-1)+dp0(i,j,k))
    ENDDO
  ENDDO
ENDDO

has been analysed and taskified, identifying the scope of each of the variables involved in the task as well as the task dependencies:

!$OMP TARGET DEVICE(OPENCL) NDRANGE(3, MII,MJJ,MKK, 32,8,4) FILE(fast_waves_1.cl) COPY_DEPS
!$OMP TASK IN(T1,T2,T3) OUT(T4)
SUBROUTINE FAST_WAVES_1(MII,MJJ,MKK,K1,K2,J1,J2,I1,I2,A,B,T1,T2,T3,T4)
  INTEGER :: MII,MJJ,MKK,K1,K2,J1,J2,I1,I2
  REAL*8, DIMENSION(MII,MJJ,MKK) :: T1,T2,T3,T4
  REAL*8 :: A,B
END SUBROUTINE FAST_WAVES_1

After that, as OmpSs greatly simplifies the memory allocation, the data copies to/from the device, etc., the work has been to create the kernels (CUDA and then OpenCL) starting from the corresponding Fortran 90 function(s):

#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#elif defined(cl_amd_fp64)
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#else
#error "Double precision floating point not supported by OpenCL implementation."
#endif

__kernel void fast_waves_1(int n, int p, int q, int lk, int uk, int lj, int uj,
                           int li, int ui, double g, double dts,
                           __global double* a, __global double* b,
                           __global double* c, __global double* d)
{
#define idxyz(I,J,K) ((I)+n*((J)+p*(K)))
  const int I = get_global_id(0);
  const int J = get_global_id(1);
  const int K = get_global_id(2);
  if (I>=li-1 && I<=ui-1 && J>=lj-1 && J<=uj-1 && K>=lk-1 && K<=uk-1) {
    double zrofac = ( a[idxyz(I,J,K)]*b[idxyz(I,J,K-1)] + a[idxyz(I,J,K-1)]*b[idxyz(I,J,K)] )
                  / ( c[idxyz(I,J,K)]*b[idxyz(I,J,K-1)] + c[idxyz(I,J,K-1)]*b[idxyz(I,J,K)] );
    d[idxyz(I,J,K)] = 2.0*g*dts*zrofac/(b[idxyz(I,J,K-1)]+b[idxyz(I,J,K)]);
  }
}


At this stage of development we are finalising the inclusion of the new OmpSs structure into the fast_waves solver routine. The porting of fast_waves will be carried out by inserting into the DYN part of OPCODE the code structure developed for the Himeno porting to OmpSs (a structure coupling the Himeno-style solver with the Dynamical core and Physics parts of COSMO OpCode).

At the end of T4.2 we expect to have a preliminary version of the new DYN part of OPCODE suitable to begin the numerical assessment of this part of the code. Nonetheless, the performance evaluation phase expected to start in this task will be carried out in T3.4, when the first parallel Exynos prototype of Mont-Blanc is released.

Physics porting to OmpSs+OpenCL

Physics has already been ported to GPU using the OpenACC directives toolchain, and the use of directives leaves the code almost unchanged. In fact, as reported in D3.2, a typical loop in PHY:

do j = 1, niter
  do i = 1, nwork
    c(i) = a(i) + b(i) * ( a(i+1) - 2.0d0*a(i) + a(i-1) )
  end do
end do

can be almost straightforwardly ported to OpenACC as:

!$acc update device(a,b)
do j = 1, niter
  !$acc region do kernel
  do i = 1, nwork
    c(i) = a(i) + b(i) * ( a(i+1) - 2.0d0*a(i) + a(i-1) )
  end do
  !acc update host(c)
end do
!acc update host(c)

Thus, as OmpSs greatly simplifies the memory allocation, the data copies to/from the device, etc., the work will be to create the kernels (CUDA and then OpenCL) starting from the corresponding Fortran 90 function(s) and to translate the OpenACC directives to OmpSs. This activity will also be supported by a tool able to translate OpenACC directives into CUDA and/or OpenCL kernels; an evaluation of one such tool is in progress and we are very confident about the outcome of this investigation [6]. At present, the PHY part of the OPCODE version of the COSMO code is completely ported to GPU with OpenACC and therefore already taskified for developing the OmpSs version. Nonetheless, we have to underline that, as for the DYN part of OPCODE, the performance evaluation phase expected to start in this task will be carried out in T3.4, when the first parallel Exynos prototype of Mont-Blanc is released.
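Purely as an illustration of the planned OpenACC-to-OmpSs translation, the sketch below expresses the 1-D PHY stencil shown above as an OmpSs task in C with explicit data dependences. It assumes the task-function syntax accepted by the Mercurium/Nanos++ toolchain; the function and variable names are ours and this is not the actual OPCODE port.

/* Hypothetical OmpSs rendering of the PHY stencil above (sketch only).
 * The in/out array sections let the runtime schedule the task and overlap it
 * with other work; compile with Mercurium (OmpSs) to activate the pragmas. */
#include <stdio.h>
#define NWORK 1024

#pragma omp task in(a[0;NWORK+2], b[0;NWORK+2]) out(c[0;NWORK+2])
void phy_stencil(const double *a, const double *b, double *c)
{
    for (int i = 1; i <= NWORK; ++i)
        c[i] = a[i] + b[i] * (a[i+1] - 2.0*a[i] + a[i-1]);
}

int main(void)
{
    static double a[NWORK + 2], b[NWORK + 2], c[NWORK + 2];
    for (int i = 0; i < NWORK + 2; ++i) { a[i] = (double)i; b[i] = 0.5; }

    phy_stencil(a, b, c);       /* creates the task ...            */
    #pragma omp taskwait        /* ... and waits for its result    */
    printf("c[1] = %f\n", c[1]);
    return 0;
}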

3.5 EUTERPE

3.5.1 Description of the code

EUTERPE is a code for the simulation of micro-turbulence in fusion plasmas, so it is focused on the plasma physics area. EUTERPE solves the gyro-averaged Vlasov equation for the distribution function of each kinetically treated species (ions, electrons and a third species). The code follows the particle-in-cell (PIC) scheme, where the distribution function is discretised using markers. EUTERPE is mainly written in Fortran 90 with a few C preprocessor directives. It has been parallelised using MPI, and the Barcelona Supercomputing Center (BSC) introduced OpenMP in version 2.61. The application uses the following free libraries: FFTW (for computing Fast Fourier Transforms) and PETSc (for solving sparse linear systems of equations). Moreover, the application includes a tool to generate the electrostatic equilibrium used as input.

3.5.2 Report on the progress of the porting of the code

The activity of porting EUTERPE to OmpSs in SMP environments is now finished. Since the previous deliverable, all the tickets reported to the Mercurium compiler team have been solved. Some of the most recently found problems were:

variables not detected inside parallel loop,

wrong translation of optional arguments in subroutines,

loop labels do not work inside parallel loops.

The profiling results indicated that the most time-consuming sections of the code were the particle push and the charge/current calculation on the grid. We therefore focused the work on these routines; since they are parallelised using parallel loops, the porting to OmpSs was almost direct (after identifying the dependences of the variables).
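For illustration, the structure of these two hot loops is what makes the port straightforward: each particle is pushed independently, so the loop parallelises directly. The minimal C sketch below (OpenMP here; OmpSs parallel loops are analogous) reduces the field interpolation to a constant acceleration, and all names are ours rather than EUTERPE's.

/* Minimal particle-push sketch: no loop-carried dependences, so the loop can be
 * annotated directly. A constant acceleration stands in for the gyro-averaged
 * field interpolation of the real code. Illustrative only. */
#include <stdio.h>
#define NP 100000

int main(void)
{
    static double x[NP], v[NP];
    const double dt = 1.0e-3, accel = 2.0;

    for (int i = 0; i < NP; ++i) { x[i] = 0.0; v[i] = 1.0; }

    for (int step = 0; step < 100; ++step) {
        #pragma omp parallel for
        for (int i = 0; i < NP; ++i) {   /* particle push: fully independent */
            v[i] += accel * dt;
            x[i] += v[i] * dt;
        }
        /* The charge/current deposition onto the grid would follow here; several
         * particles update the same grid cell, which is why that part needs more
         * care (reductions or atomics) in the OmpSs/OpenCL ports. */
    }
    printf("x[0] = %f, v[0] = %f\n", x[0], v[0]);
    return 0;
}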

EUTERPE has been compiled using Mercurium and it has been executed on Tibidabo. The results have been validated using a small test case based on a 32x32x16 grid.

The porting to OpenCL is still in progress; it is taking more time than expected because the code has to be translated from Fortran to C, which implies changes to the data structures. Some parts of the code have been tested and others are partially finished. We therefore cannot yet predict the GPU performance accurately, but we expect it to be more efficient.


Once the OpenCL code is completed, the kernels will be translated to CUDA. We expect this task will take less time, because the code in C will be similar.

On a different topic, EUTERPE has changed its distribution policy. The authors now want to control the distribution of the code, so it is no longer Open Source: their authorisation is required to obtain it.


3.6 MP2C

3.6.1 Description of the code

MP2C (Massively Parallel Multi-Particle Collision) is a highly scalable parallel program which couples Multi-Particle Collision dynamics (MPC) to Molecular Dynamics (MD). MPC is a particle-based method to model hydrodynamics on a mesoscopic scale. The code is developed by the Simulation Laboratory Molecular Systems at the Juelich Supercomputing Centre (Forschungszentrum Juelich). As of now, no final licence policy has been agreed upon. Its current usage is academic and it is used within different national and international projects. MP2C is written in Fortran 90 and uses MPI for parallelisation. There are approaches to include other parallelisation models, like OpenMP or CUDA, but these versions are still experimental and under development. It is possible to use the parallel SionLib I/O library, which on the one hand greatly improves the I/O performance but on the other hand restricts the use to a specific file format.

3.6.2 Report on the progress of the porting of the code

Since D4.1 the MP2C code has not been ported to other types of architectures. Since JSC had no access to the new ARM prototypes due to legal issues, no tests or porting work were done concerning the Exynos environment. As these legal issues now seem to be resolved, these work steps will be tackled in the near future. The code was successfully ported to the Tibidabo cluster and showed a similar scaling behaviour as on the JUDGE machine at JSC.

During the last few months we steadily reduced the memory consumption of the code and worked on improving our data structures for better support of multi-threaded approaches. For example, we changed arrays of structures to structures of arrays to facilitate data transport to OmpSs tasks and GPU kernels (considered later on). Since we are currently at the stage of finally including OmpSs tasks into the code and still have to solve some problems with our data structures, we cannot make any prediction at present about the efficiency of using GPUs. We expect efficient GPU support for the CPU-intensive parts of the code, e.g. for the propagation of particle coordinates and velocities.

While we did not port the code to any new architecture, the limited scaling behaviour of the code was investigated in more detail on BG/Q and Tibidabo. In order to characterise the scaling behaviour of the code, some scaling tests on a BG/Q system were conducted. After some memory optimisations, the code is now able to handle 10^6 MPC particles per MPI process and to scale up to 1,048,576 MPI processes, running 64 MPI processes per IBM BG/Q node. After the initial results provided in D4.1, and in order to understand the low scalability of MP2C, some more scaling tests were conducted for the MPC part of MP2C on Tibidabo with the support of the BSC team; these show that the application scales well on the machine. This evaluation showed that the low-cost Cisco SLM2048T switches were responsible for network performance and stability that steadily decrease over time. These network performance problems can be solved by rebooting the network switches, which is done periodically. The next figure shows the difference in execution time for the MP2C application before (in green, upper part of the figure) and after (in blue) rebooting the switches. A 32-node run after the switches are rebooted shows a performance improvement of 100x.


Figure 13 - Performance of MP2C on Tibidabo before and after rebooting network switches

Therefore the porting of MP2C to Tibidabo is considered successful. The results shown here are based on the version of MP2C from April 2013, which is still purely MPI-parallelised.

In the field of interactions with WP3, the MP2C team worked on the porting of one of the kernels of MP2C, thermostat(). Since it is an important part of the code, the routine was updated in the same way as the rest of the code. The main concern for the thermostat module is its high demand for random numbers. This requires one of two solutions: (i) a serial pre-calculation of a large amount of random numbers which could then be distributed to the different tasks; or (ii) defining a unique seed value on each thread, allowing independent random numbers to be computed within the threads. Currently the second approach is favoured because of its better perspective for highly parallel scaling. When porting the code to GPUs, another approach could be to calculate the random numbers on the GPU itself. Besides calculating uniform random numbers, this might induce some additional work, as the random numbers are used within an acceptance-rejection method, which has to be efficiently implemented on the GPU as well.
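A minimal sketch of option (ii) is shown below. It assumes nothing about MP2C's actual generator: each OpenMP thread derives its own seed from its thread id and draws an independent, reproducible stream, so neither serial pre-computation nor synchronisation is needed.

/* Option (ii) sketched with a plain 64-bit LCG: one independent, reproducible
 * random stream per thread, seeded from the thread id. Illustrative only; MP2C
 * uses its own generator and an acceptance-rejection step on top of this. */
#include <stdint.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

static double next_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;  /* LCG step */
    return (double)(*state >> 11) / (double)(1ULL << 53);               /* in [0,1) */
}

int main(void)
{
    double sum = 0.0;
    long   n   = 0;

    #pragma omp parallel reduction(+:sum, n)
    {
        int tid = 0;
    #ifdef _OPENMP
        tid = omp_get_thread_num();
    #endif
        uint64_t state = 0x9E3779B97F4A7C15ULL ^ ((uint64_t)tid << 32);  /* per-thread seed */
        for (int i = 0; i < 1000000; ++i) { sum += next_uniform(&state); ++n; }
    }
    printf("mean of %ld draws = %f (expected ~0.5)\n", n, sum / (double)n);
    return 0;
}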

3.7 PEPC

3.7.1 Description of the code

PEPC is an N-body solver for Coulomb or gravitational systems. It is used by diverse user communities in areas such as warm dense matter, magnetic fusion, astrophysics, complex atomistic systems and vortex fluid dynamics. PEPC has also formed part of the extended PRACE benchmark suite used to evaluate Petaflops computer architectures. Current projects use PEPC for laser- or particle beam-plasma interactions as well as plasma-wall interactions in tokamaks, for simulating fluid turbulence using the vortex particle method, and for investigating planet formation in circumstellar discs consisting of gas and dust.


The code is open source and developed at Juelich [7] within the Simulation Laboratory Plasma Physics [8] under GPL license. PEPC is written in Fortran 2003 with some C wrappers to enable pthreads support, thus making use of a hybrid MPI/pthreads programming model. There is also a branch of PEPC that uses the hybrid MPI/SMPSs programming model instead. The only external dependency is a library for parallel sorting written in C that is included in the source tree.

3.7.2 Report of the progress of the porting

Following the D4.1 report, where some scalability limitations were encountered, a joint action between the PEPC team and BSC has increased the scalability of the code by using bigger datasets (512k in blue and 1024k in green in the following strong scaling figure) on Tibidabo on up to 128 cores (64 nodes):

Figure 14 - Improved speedups of PEPC on Tibidabo using bigger datasets

Since the last D4.1 report, the main development branch of PEPC has undergone some changes to increase performance in a multi-threaded environment. Those changes enter the GPU fork of PEPC directly, as we explain later. The latest improvements include an optimised tree traversal for a large number of concurrent threads. The highly dynamic nature of the algorithm made fine-grained load balancing between the different threads necessary to achieve high performance. This required improved synchronisation operations between the worker threads; for this, we switched from the previously used, very general but poorly performing locking mechanism to lightweight atomic operations.


Figure 15 - PEPC scaling on a single BG/Q node

The previous figure shows the parallel efficiency for a strong scaling of the algorithm on a single BlueGene/Q compute chip. One chip consists of 16 CPU cores and can support up to 64 hardware threads. However, since each core has only two arithmetic units, in theory any code already exhausts the hardware resources with 32 threads, and the ideal parallel efficiency drops to only 0.5 for 64 threads, which is indicated by the dashed line. The improved tree traversal in PEPC comes close to this ideal behaviour. The increased performance led to successful runs on the full JUQUEEN supercomputer and to PEPC qualifying for the High-Q Club [9]. The following figure shows the strong scaling of PEPC executed on JUGENE (a BG/P system at the Juelich Supercomputing Centre) from 4 to 256k cores, using 8 different datasets from the smallest one (0.125x10^6 particles) to the biggest one (2048x10^6 particles):

Figure 16 - Overall strong scaling of PEPC on up to 256k cores on JUGENE (IBM BG/P)

The developments of PEPC for the Mont-Blanc project focused on creating a GPU version of PEPC.


Since PEPC has a modular structure, with the tree-code itself, multiple front-end parts (user code tailored for the specific N-body simulation) and interaction-specific back-ends, integrating simple, proof-of-concept GPU kernels is not too cumbersome. We picked the Coulomb interaction as a trial example (it is used for the benchmarking and testing front-end used for Mont-Blanc) and started adding a GPU version of it. Changes to the main tree-code were minimal, so that all recent improvements to the hybrid scaling were automatically retained. The strategy for the GPU kernels is as follows. During the tree traversal, PEPC identifies either individual particles or tree nodes that a local particle interacts with and computes this interaction on the fly, irrespective of the type of interaction or how it is computed. These single interactions are too short and of scalar nature and thus not suited for GPU kernels or tasks. To run GPU kernels efficiently, we instead fill interaction lists holding information about the interaction partners. This frees the tree traversal from a substantial part of the floating-point operations and hence reduces the computations on the CPU. The interactions collected in those lists can then be computed on the GPU, independently of the tree traversal still ongoing on the CPU. We have thus created potential for asynchronous tasks on a GPU with minor changes to the interaction back-end. To ease the porting and to have wider applicability, the GPU kernels are implemented in OpenACC for the time being. Since PEPC is multi-threaded via POSIX threads, there are multiple threads generating interaction lists. We have added an additional thread handling the GPU kernels, the necessary data transfers and the thread synchronisation; launching this thread is the only difference to the 'standard' PEPC version. Filling the interaction lists is transparently handled in the newly created back-end. The extra thread handling the GPU also takes care of collecting and converting the data to and from the GPU.
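As an illustration of the interaction-list approach, the following minimal OpenACC sketch in C (hypothetical names and data layout, not the actual PEPC back-end) computes the Coulomb contributions of one list on the device in a single parallel loop:

    #include <math.h>

    /* Hedged sketch: Coulomb potential and field acting on one local particle,
     * accumulated from an interaction list prepared by the tree traversal. */
    void coulomb_from_list(int n, const double *restrict xp, const double *restrict yp,
                           const double *restrict zp, const double *restrict q,
                           double x, double y, double z,
                           double *pot, double *ex, double *ey, double *ez)
    {
        double p = 0.0, fx = 0.0, fy = 0.0, fz = 0.0;
        #pragma acc parallel loop reduction(+:p,fx,fy,fz) \
                    copyin(xp[0:n], yp[0:n], zp[0:n], q[0:n])
        for (int i = 0; i < n; ++i) {
            double dx = x - xp[i], dy = y - yp[i], dz = z - zp[i];
            double rinv  = 1.0 / sqrt(dx*dx + dy*dy + dz*dz);
            double r3inv = rinv * rinv * rinv;
            p  += q[i] * rinv;
            fx += q[i] * dx * r3inv;
            fy += q[i] * dy * r3inv;
            fz += q[i] * dz * r3inv;
        }
        *pot = p; *ex = fx; *ey = fy; *ez = fz;
    }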

Figure 17 - Profiling of the initial OpenACC version of PEPC (using the NVVP tool)

The first version is shown in the previous figure, which displays an NVVP profile at two different time resolutions. The reduction of the interactions computed from the list is performed on the GPU. However, there is no overlap of memory transfers and computation, since only a single stream is present. There are also many synchronisation points, enforced by the reduction. To overcome this, we changed the structure of the OpenACC part so that multiple streams are possible on the GPU (see the next figure and the sketch below). In principle this should allow data transfers and computation to overlap on the GPU, which however still does not seem to be the case.
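A hedged sketch of the multi-stream idea in OpenACC (hypothetical names; the placeholder loop body stands in for the per-interaction work): each interaction list is launched on its own asynchronous queue so that, in principle, the transfer of one list can overlap with the computation on another.

    #include <stddef.h>

    #define N_QUEUES 4

    /* Hedged sketch: launch each list on a different async queue; all queues
     * are joined at the end with a single wait. */
    void process_lists(int n_lists, int list_len, const double *in, double *out)
    {
        for (int l = 0; l < n_lists; ++l) {
            const double *src = &in[(size_t)l * list_len];
            double       *dst = &out[(size_t)l * list_len];
            int queue = l % N_QUEUES;            /* cycle over async queues */

            #pragma acc parallel loop async(queue) \
                        copyin(src[0:list_len]) copyout(dst[0:list_len])
            for (int i = 0; i < list_len; ++i)
                dst[i] = 2.0 * src[i];           /* placeholder interaction work */
        }
        #pragma acc wait                          /* join all queues */
    }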


Our final aim for the ARM/OpenCL version of PEPC is to overcome this via the shared memory between the ARM CPU and GPU and by adjusting the list sizes.

Figure 18 - Profiling of the optimised OpenACC version of PEPC

To get an idea of the performance of the GPU version of PEPC, we first ran simple experiments on a single node (Intel Xeon X5650) with an NVIDIA Tesla M2070 GPU. Without tuning the list sizes or the number of lists, we executed an example run with 10 time steps and 4 worker threads. The pure MPI/pthreads version (without interaction lists) was then compared to the version of PEPC with the extra GPU thread and the GPU kernels implemented:

Wall-clock [s]      CPU version    OpenACC version
5000 particles      27.2           11.2
50000 particles     739.2          289.2

There is a considerable speed-up for the GPU version of PEPC, albeit limited for now. Reasons for this are the still-missing overlap between data transfers and computation, and the extra overhead of collecting and storing the interaction lists. We expect the GPU version of PEPC to perform much better on the ARM prototypes when compared to a pure MPI/pthreads run: the GPU is expected to be much more powerful than the CPU, so the floating-point part will be accelerated much more, and in addition we hope to profit from the shared memory between GPU and CPU. Further speed-up could also be expected for tree-code front-ends with more complex interaction kernels, such as higher-order multipole expansions or vector potentials, which entail more floating-point operations per interaction. Once the OpenACC implementation of the GPU kernels is stable, we will port them to OpenCL and convert the GPU thread to an OmpSs task (see the sketch below); the necessary synchronisations and transfers should then be handled transparently by OmpSs. At this point we will also tune the number and size of the interaction lists to achieve the best possible performance on the Mont-Blanc prototype.
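For illustration only, a hedged C sketch of what such a taskified OpenCL kernel could look like, loosely modelled on published OmpSs/OpenCL examples; all names are hypothetical and the exact directive syntax may differ in the toolchain version finally used.

    /* Hypothetical header: an OpenCL kernel (implemented in a separate .cl
     * file) declared as an OmpSs task.  The Nanos++ runtime would then handle
     * the host/device transfers implied by the in/out clauses and the
     * synchronisation, replacing the dedicated GPU pthread of the OpenACC
     * version. */
    #pragma omp target device(opencl) ndrange(1, n, 128) copy_deps
    #pragma omp task in([n]list) out([n]result)
    __kernel void interact_from_list(__global double *list,
                                     __global double *result, int n);

    /* Host-side usage sketch: the call spawns an asynchronous task. */
    void compute_interactions(double *list, double *result, int n)
    {
        interact_from_list(list, result, n);
        #pragma omp taskwait    /* wait only when the results are needed */
    }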


3.8 ProFASI

3.8.1 Description of the code

PROFASI (Protein Folding and Aggregation Simulator) is a C++ application for the Monte Carlo study of large-scale conformational transitions in systems of one or more protein molecules. In the context of the Mont-Blanc project, effort has been made to port and optimise the application to run well on the Mont-Blanc platform, while continuing the scientific development of the theoretical model used in the code. The PROFASI code base has been largely rewritten in C++11, the latest standard of the C++ language. Following frequent requests from the user community, PROFASI has been extended to simulations of mixed systems containing proteins as well as non-polypeptide molecules. The program can now be used for research involving the interaction of proteins with other entities (ions, small molecules of therapeutic value, etc.). PROFASI is a highly hand-optimised code aiming at the shortest application return time. Quite often this means work-avoiding computational tricks, which are seemingly counterproductive from the point of view of achieving high numerical intensity but end up saving execution time. It was therefore necessary to carry over as many of the performance tricks of PROFASI as possible while adding the benefits of accelerator technologies.

3.8.2 Report on the progress of the porting of the code

Efforts towards making the PROFASI energy classes more accelerator-friendly continue. More concretely, the following has been achieved in the last quarter of the second year: (a) the excluded volume calculation has been refactored into a series of nearly independent calculations, while still retaining the cell-list technique and the delta calculations. The delta calculations are the crucial differentiating advantage of PROFASI, enabling it to gather more statistics in less time; hitherto they have been the part of PROFASI most resistant to any form of parallelisation. (b) A method has been found to eliminate all conditionals in the core region of the excluded volume calculation, enabling automatic vectorisation and better cache usage (a generic sketch of the idea is given below). The partitioning of the excluded volume into non-overlapping, vector-friendly calculations is an important step towards a fast implementation on an accelerator using OmpSs and OpenCL. The above technique has now also been applied to the terms evaluating hydrophobic and charge interactions in PROFASI. This remains to be repeated for hydrogen bonds, because profiling shows that PROFASI spends roughly equal amounts of time in these three calculations.
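The conditional-elimination idea can be illustrated with a small generic C sketch (not ProFASI's actual code): the cut-off test in the inner pair loop is replaced by an arithmetic mask, so the loop body becomes straight-line code that the compiler can auto-vectorise.

    /* Generic sketch: branch-free accumulation of a repulsive pair term.
     * dx, dy, dz hold precomputed coordinate differences for one row of the
     * cell list; rc2 is the squared cut-off radius. */
    double excluded_volume_row(int n, const double *restrict dx,
                               const double *restrict dy,
                               const double *restrict dz, double rc2)
    {
        double e = 0.0;
        for (int i = 0; i < n; ++i) {
            double r2   = dx[i]*dx[i] + dy[i]*dy[i] + dz[i]*dz[i];
            double mask = (r2 < rc2) ? 1.0 : 0.0;  /* compiles to a compare+blend, not a branch */
            double r6   = r2 * r2 * r2;
            e += mask / (r6 * r6 + 1e-12);         /* ~r^-12 term; small constant avoids division by zero */
        }
        return e;
    }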


3.9 Quantum Espresso

3.9.1 Description of the code

Quantum ESPRESSO [7] is an integrated suite of computer codes for electronic-structure calculations and materials modelling at the nanoscale, based on density-functional theory, plane waves, and pseudo-potentials (norm-conserving, ultrasoft, and PAW). Quantum ESPRESSO stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization. Quantum ESPRESSO is an initiative of the DEMOCRITOS National Simulation Center (Trieste) and SISSA (Trieste), in collaboration with the CINECA National Supercomputing Center, the Ecole Polytechnique Fédérale de Lausanne, Université Pierre et Marie Curie, Princeton University, and Oxford University. Courses on modern electronic-structure theory with hands-on tutorials on the Quantum ESPRESSO codes are offered on a regular basis in collaboration with the Abdus Salam International Centre for Theoretical Physics in Trieste. The code is released under the GPL licence and can be downloaded from the following links:

http://www.quantum-espresso.org

http://www.qe-forge.org

Both academic and industrial users use Quantum ESPRESSO (QE). It is mainly written in Fortran 90, but it contains some auxiliary libraries written in C and Fortran 77. The whole distribution is approximately 500K lines of code, while the core computational kernels (CP and PW) are roughly 50K lines each. Both data and computations are distributed in a hierarchical way across the available processors, ending up with multiple parallelisation levels that can be tuned to the specific application and to the specific architecture. More in detail, the various parallelisation levels are organised into a hierarchy of processor groups, identified by different MPI communicators. A single task can take advantage both of shared-memory nodes, using OpenMP parallelisation, and of NVIDIA accelerators, thanks to the CUDA drivers implemented for the most time-consuming subroutines. The QE distribution is self-contained by default: all that is needed is a working Fortran and C compiler. Nevertheless, it can be linked with the most common external libraries, such as FFTW, MKL, ACML, ESSL, ScaLAPACK and many others; external libraries for the FFT and linear algebra kernels are necessary to obtain optimal performance, and QE contains dedicated drivers for the FFTW, ACML, MKL, ESSL, SCSL and SUNPERF FFT subroutines. Quantum ESPRESSO is not an I/O-intensive application: it performs significant I/O only at the end of a simulation, to save the electronic wave functions used both for post-processing and as a checkpoint for restarts; consequently, I/O activity is also expected at the beginning of a restart run. Each task saves its own bunch of data using Fortran direct I/O primitives. The code has been ported to almost all platforms. Its scalability depends very much on the simulated system.


Usually, on architectures with a high-performance interconnect, the code displays strong scalability over two orders of magnitude of processor counts (e.g. between 1 and 100), considering a dataset size that saturates the memory of the nodes used as the baseline for the computation of the relative speed-up. On the other hand, the code displays good weak scalability. Recently, on a large simulation, good scalability up to 65K cores has been obtained. The figure below shows the performance of QE while running a significant dataset on a BG/Q system (FERMI at CINECA) from 2,048 cores (with 4,096 threads) to 32,768 cores (with 65,536 virtual cores). Each colour bar corresponds to one of the major subroutines of the code.

Figure 19 - Scalability of the CP kernel of QE on a BG/Q system using the CNT10POR8 benchmark

3.9.2 Report on the progress of the porting of the code

To reduce possible porting problems, we configured Quantum ESPRESSO to use all internal libraries rather than external ones; in fact, Quantum ESPRESSO is self-contained, and external libraries can be used as an optimisation step. We therefore did not link the code with external BLAS, FFT, LAPACK or ScaLAPACK. The source code of Quantum ESPRESSO is mainly Fortran 90 with a small subset of C source code, and the compilation of the whole package with gfortran and gcc is routinely checked, so we did not encounter any problem related to the compilation of the code. The code has been compiled using MPI and OpenMP. To validate the porting we selected a well-known test case (a water molecule), already used on many other systems and with different codes. To profile the code we used the internal profiling feature of Quantum ESPRESSO, which allowed us to monitor the performance of the most time-consuming subroutines of the application and to compare them with the behaviour on other machines, in order to check that, apart from the absolute performance, there were no unexpected relative differences. The overall performance on Tibidabo against x86 or BG/P systems is, as expected, moderate, since Tibidabo is mainly a porting platform and not a performance-oriented platform.

[Figure 19 plot: seconds/step (0-400) of the main CP subroutines (calphi, dforce, rhoofr, updatc, ortho) for the CNT10POR8 benchmark on BG/Q, against 4096-65536 virtual cores, 2048-32768 real cores and 1-16 band groups]


We performed several tests using different combinations of tasks/threads per node and, for the sake of readability, we report here some results preliminarily sketched in D3.1. In Figure 20, an overall profiling of cp.x runs on Tibidabo (time to solution in seconds against number of cores) is given, where it is clearly evident that the subroutines involving parallel linear algebra (updatc, ortho, calbec) scale reasonably well up to 8 nodes, whereas the subroutines involving 3D FFTs (dforce, vofrho, rhoofr) start to saturate between 4 and 8 processors.

Figure 20 - Profiling of cp.x varying the number of tasks on Tibidabo

Furthermore, as reported in D3.1, the 3D FFT is computed in parallel by distributing the z-axis, i.e. each processor takes a subset of the total number of planes. This well-known distribution scheme implies the presence of a global MPI_ALLTOALL collective operation, making the quality of the communication subsystem a key factor for performance. To this end, we report in Figure 21 the wall-clock time of the communication and computation phases versus the number of tasks for cp.x running on Tibidabo.
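The communication pattern at stake can be illustrated with a generic C sketch of the slab transpose (this is not the QE FFT driver itself; names and the even divisibility of the buffers are assumptions): every task exchanges one block of the distributed 3D array with every other task.

    #include <complex.h>
    #include <mpi.h>

    /* Generic sketch of the transpose step of a slab-decomposed 3D FFT: each
     * of the P tasks owns nz/P planes and sends one block to every other task
     * so that the next 1D FFT direction becomes local.  block_len is the
     * number of complex values exchanged with each peer and is assumed to be
     * the same for all tasks. */
    void slab_transpose(double complex *slab, double complex *work,
                        int block_len, MPI_Comm comm)
    {
        MPI_Alltoall(slab, block_len, MPI_C_DOUBLE_COMPLEX,
                     work, block_len, MPI_C_DOUBLE_COMPLEX, comm);
    }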


Figure 21 - Time spent in 3D-FFTs varying the number of tasks (time to solution in seconds against number of cores of Tibidabo)

The trend of the timings shows that the on-core computations of the 3D FFTs scale almost linearly (with small deviations from linearity possibly due to the limited number of planes per processor), while the communication time rises steadily as the number of nodes (and hence of messages) increases. This behaviour, common to many similar applications, nonetheless paves the way for the most suitable application of the OmpSs compiler which, relying on the Nanos++ runtime, may be able to optimise the overlap of the communication and computation phases of cp.x. This particular behaviour of cp.x has also been revealed by the activities carried out in WP3, where the most used low-level routines were identified as being related to GEMM and 3D-FFT operations. Thus, in accordance with the conclusions given in D3.3, we started the porting to OmpSs of the DGEMM() function in the phiGEMM library [8] as well as of the main FFT driver in CP. Following the best practices for porting that we have gained after two years of experience with the Mont-Blanc development environment, we first identified the code of the two drivers (GEMM and FFT) as the zones where the calls should be taskified. Then, after the call structure has been wrapped in an F90 interface with the definition of the CPU and GPU subroutines, the porting is finalised by translating the .cu code into the OpenCL language. Together with the work done so far on the consolidated QE kernels and datasets, we are also tracking the behaviour and interests of the QE user community. In this respect, thanks to the increasing computational power available on high-end HPC systems, we observe increasing attention to the so-called exact-exchange functionals implemented in the pw.x QE kernel (this code is also under study in WP3). This kind of computation offers the potential to exploit massively parallel machines, since the evaluation of the functional (much heavier than the others implemented in QE and in other codes as well) can be implemented as the sum of many independent wave-function products that can be performed in parallel. We believe this new kind of computation, which is becoming more and more common in the community, may benefit from architectures such as the one foreseen by the Mont-Blanc project and deserves a closer look for a possible implementation using OmpSs.


At present, the zones of the source code considered eligible to be ported to OmpSs have been identified, and we are currently porting the driver routines for the GEMM and FFT operations. We expect to conclude this porting by the release of the BSC/PRACE prototype Pedraforca, on which we will test the OmpSs interface with the CUDA kernels already distributed with QE. Soon after, we will start the translation of the CUDA kernels so as to have the final version of the application ready for the general availability of the final Mont-Blanc prototype.

3.10 SMMP

3.10.1 Description of the code

SMMP is used to study the thermodynamics of peptides and small proteins using Monte Carlo methods. It uses a protein model with fixed bond lengths and bond angles, reducing the number of degrees of freedom significantly while maintaining a complete atomistic description of the protein under investigation. Currently, four different force fields, which describe the interactions between atoms, are available. The interaction with water is approximated with the help of implicit solvent models. SMMP is written in Fortran and includes Python bindings (PySMMP). The code is parallelised using MPI and OpenCL (or CUDA); the parallelisation of the energy function (using MPI, OpenCL, or CUDA) is often combined with parallel tempering, leading to a two-level parallelisation. The program is written in standard Fortran and has been ported to a large number of platforms including Intel x86 and Xeon Phi, IBM Blue Gene L/P, POWER7, and CellBE.
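For illustration, the outer parallelisation level (parallel tempering) can be sketched with the generic MPI replica-exchange step below; this is not SMMP's actual implementation, and all names are hypothetical.

    #include <math.h>
    #include <mpi.h>
    #include <stdlib.h>

    /* Generic sketch of one parallel-tempering exchange between two replicas.
     * Each MPI rank runs one replica at inverse temperature *beta; the swap of
     * temperatures is accepted with the usual Metropolis criterion
     * min(1, exp[(beta_i - beta_j)(E_i - E_j)]). */
    int attempt_swap(double *beta, double energy, int partner, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        double local[2] = { *beta, energy }, remote[2];
        MPI_Sendrecv(local,  2, MPI_DOUBLE, partner, 0,
                     remote, 2, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);

        /* The lower rank draws the random number and sends the decision. */
        int accept;
        if (rank < partner) {
            double delta = (local[0] - remote[0]) * (local[1] - remote[1]);
            accept = (delta >= 0.0) || ((double)rand() / RAND_MAX < exp(delta));
            MPI_Send(&accept, 1, MPI_INT, partner, 1, comm);
        } else {
            MPI_Recv(&accept, 1, MPI_INT, partner, 1, comm, MPI_STATUS_IGNORE);
        }
        if (accept)
            *beta = remote[0];   /* swap temperatures, keep configurations */
        return accept;
    }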

3.10.2 Report on the progress of the porting of the code

Performance analysis on Tibidabo of the initial port of SMMP made clear that there are two performance bottlenecks: the speed of MPI_Allreduce and the speed of the energy calculation. The focus of the following work was on porting the calculation of the energy to OpenCL, since it had been announced that the target platform would use a Samsung Exynos 5 Dual (Cortex-A15) with a Mali T604 GPU; the T604 can be used for general-purpose programming but cannot be programmed using CUDA. The energy function was implemented in several ways, with increasing attention to vectorisation and local memory usage (see Table 5 and Figure 22). Interestingly, the choice of OpenCL platform can change the behaviour of an application significantly, even on the same hardware: on an Intel Xeon E5-2650, for example, the same reduction code runs several times slower using the Intel SDK than using the AMD SDK, while the energy kernel is two times faster. Unfortunately, it turned out that the time needed for summing up the partial energies (reduction) nearly wipes out the time gained from the faster kernel execution on the GPU (see Figure 22):


Kernel ID    Description
0            Simple kernel with local memory reduction
1            Use local memory to store coordinates
2            Use float4 for distance calculation
3            Calculate several interactions per work item
4            Unrolled loop and vectorized calculation of interactions

Table 5 - Short description of kernel variants implemented to calculate the potential energy of a protein.
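As an illustration of kernel variant 0, the following OpenCL C sketch (hypothetical names, not the actual SMMP kernel) accumulates a partial energy per work item, reduces the partial sums in local memory, and writes one value per work group back for the final summation on the CPU.

    /* Hedged OpenCL C sketch of a local-memory reduction.  pos holds packed
     * coordinates plus a weight in .w; the energy term is a placeholder. */
    __kernel void energy_reduce(__global const float4 *pos,
                                __global float *partial, const int n,
                                __local float *scratch)
    {
        const int gid = get_global_id(0);
        const int lid = get_local_id(0);

        float e = 0.0f;
        if (gid < n) {
            float4 p = pos[gid];
            e = p.w / (p.x * p.x + p.y * p.y + p.z * p.z + 1.0f);  /* placeholder */
        }
        scratch[lid] = e;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree reduction in local memory. */
        for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }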

To avoid the data transfer, it is necessary to move an entire Monte Carlo sweep, consisting of n elemental updates (where n is the number of degrees of freedom), to the GPU. This work is ongoing. The first step was to move the calculation of the implicit solvent interaction to the GPU as well, which turned out to be much more difficult than expected. The porting to OpenCL has been done using PySMMP and PyOpenCL. The energy kernel has also been integrated into the Fortran code using a C wrapper. This is being ported to OmpSs, as it provides an easier mechanism to call the kernel and deal with the data transfer.

Figure 22 - Break down of the time needed for various parts of the calculation of the potential energy. Times (in ms) shown are for a dual socket Intel Xeon E5-2650 running at 2.0 GHz using the Intel OpenCL driver version. The initialization time (blue) is the same for all kernels. The computational kernels (green) become faster with increasing kernel ID. The reduction kernels (yellow) are rather slow if done in OpenCL (0—2). For kernel 3 and 4 the data transfer becomes dominant. The summation is done on the CPU.


Figure 23 - Comparison of OpenCL devices

3.11 SPECFEM3D

3.11.1 Description of the code

SPECFEM3D is an application that models seismic wave propagation in complex 3D geological models using the spectral element method (SEM). This approach, combining finite elements and pseudo-spectral methods, allows the formulation of the seismic wave equations with greater accuracy and flexibility compared to more traditional methodologies. SPECFEM3D is a Fortran application, but a subset of the globe version has been ported to C to experiment with CUDA, and with StarSs for the TEXT project. This subset contains the main computation loop of the main application. The full application consists of 50k lines of Fortran, while the subset contains 3k lines of C. SPECFEM3D scalability is excellent: it has shown strong scaling up to 896 GPUs and on more than 21,675 Cray XE nodes with 693,600 MPI ranks, sustaining over 1 PFlop/s on the NCSA Blue Waters petascale system.

3.11.2 Report of the progress of the porting

After the first scalability results consolidated in the previous D4.1 document, the BSC team performed more tests with bigger datasets on more cores of Tibidabo. The following picture shows very good scalability of SPECFEM3D on up to 192 cores (96 nodes) with 2 different datasets:


Figure 24 - Strong scaling of SPECFEM3D on Tibidabo

The work of the SPECFEM3D team is now focused on generating an OpenCL version through a code-generation approach. The kernels created in this way should be easy to import into the OmpSs mini-app, whose development progressed during the last year. The following figure shows the reduction in time to solution between 9,216 and 36,846 cores on the Hector Cray system in the UK using the OmpSs version of SPECFEM3D.

Figure 25 - Scaling of the OmpSs version of SPECFEM3D on Hector


3.12 YALES2

3.12.1 Description of the code

YALES2 is a research code that aims at solving two-phase combustion problems, from primary atomisation to pollutant prediction, on massive complex meshes. It is able to handle efficiently unstructured meshes with several billion elements, thus enabling the Direct Numerical Simulation of laboratory and semi-industrial configurations. The solvers of YALES2 cover a wide range of phenomena and applications, and they may be assembled to address multi-physics problems. YALES2 solves the low-Mach-number Navier-Stokes equations with a projection method for constant- and variable-density flows. These equations are discretised with a 4th-order central scheme in space and a 4th-order Runge-Kutta-like scheme in time. The efficiency of projection approaches is usually driven by the performance of the Poisson solver. In YALES2, the linear solver is a highly efficient Deflated Preconditioned Conjugate Gradient (DPCG) with two mesh levels.
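For reference, a bare preconditioned conjugate gradient iteration is sketched below in C; deflation and the two-level mesh hierarchy of YALES2's DPCG are deliberately omitted, and apply_A / apply_Minv are assumed user-supplied callbacks.

    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    /* Minimal preconditioned CG sketch (not the YALES2 solver). */
    typedef void (*op_t)(const double *in, double *out, int n, void *ctx);

    static double dot(const double *a, const double *b, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }

    /* Solves A x = b; returns the iteration count on convergence, -1 otherwise. */
    int pcg(op_t apply_A, op_t apply_Minv, void *ctx,
            const double *b, double *x, int n, double tol, int maxit)
    {
        double *buf = malloc(4 * (size_t)n * sizeof *buf);
        if (!buf) return -2;
        double *r = buf, *z = buf + n, *p = buf + 2 * n, *Ap = buf + 3 * n;
        int ret = -1;

        apply_A(x, Ap, n, ctx);
        for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
        apply_Minv(r, z, n, ctx);
        memcpy(p, z, (size_t)n * sizeof(double));
        double rz = dot(r, z, n);

        for (int it = 0; it < maxit; ++it) {
            apply_A(p, Ap, n, ctx);
            double alpha = rz / dot(p, Ap, n);
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            if (sqrt(dot(r, r, n)) < tol) { ret = it + 1; break; }   /* converged */
            apply_Minv(r, z, n, ctx);
            double rz_new = dot(r, z, n);
            double beta = rz_new / rz;
            for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
            rz = rz_new;
        }
        free(buf);
        return ret;
    }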

YALES2 is distributed among a large number of research laboratories through the Scientific Group SUCCESS (SUper-Computing for the modeling of Combustion, mixing and complex fluids in rEal SyStems, http://success.coria-cfd.fr). YALES2 has a free academic licence for French labs and specific licences for industrial users.

Figure 26 - Small vortices of the turbulent flow in an industrial swirl burner


Figure 27 - Example of primary atomisation simulations on CURIE using YALES2 and comparison with experiment (on the right)

YALES2 is written in Fortran 90 and parallelised using MPI-1; it has external dependencies on the HDF5, PETSc, FFTW, METIS, SCOTCH and SUNDIALS libraries. The I/O operations are handled with in-house parallel I/O with checkpoint and restart. The code has been ported to various platforms including IBM Blue Gene/P (Babel at IDRIS, Jugene at JUELICH), IBM Blue Gene/Q (Turing at IDRIS), BullX Intel clusters (Curie and Airain at TGCC) and IBM POWER6 (Vargas at IDRIS).

Figure 28 - Strong scaling of YALES2 on an IBM BG/P system with a 2.2 billion elements mesh

Figure 28 shows the scalability of YALES2 on a BlueGene/P system at IDRIS (France) on up to 16,384 cores using 2 different solvers: A-DEF2 (triangles) and RA-DEF2(d) (circles), the latter implementing mesh reduction to coarse grains.

3.12.2 Report of the progress of the porting of the code

The code was ported to Tibidabo with the gcc toolchain. After the compilation of the missing required external libraries, the porting of the code itself went smoothly. The validation was performed for the simulation of the flow around a 2D cylinder.


Figure 29 - Validation test case: simulation of the wake behind a 2D cylinder at Re=100. The color represents the velocity magnitude and the white dots are Lagrangian particles emitted from the cylinder

Profiling and thermal efficiency measurements were then carried out, comparing Tibidabo with a 6-core x86 Xeon X5675 3.06 GHz processor and assuming a 95 W power consumption for the latter. The reduced thermal efficiency, i.e. the amount of energy required to perform the simulation of one time step (iteration) for one control volume (node) on a single core, is 50 to 60% better for the ARM cores than for the x86 Xeon (see the formulas below and Table 6).
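In formula form (our reading of the units used in Table 6; t_wall is the wall-clock time of the run, P_core the assumed per-core power, nb_core, nb_ite and nb_nodes the number of cores, iterations and mesh nodes):

    reduced time efficiency    = t_wall * nb_core / (nb_ite * nb_nodes)            [s]
    reduced thermal efficiency = P_core * t_wall * nb_core / (nb_ite * nb_nodes)   [J]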

Number of cores    X86 Xeon reduced time efficiency      ARM reduced time efficiency       Ratio (ARM/X86)
                   (s*nb_core/nb_ite/nb_nodes)           (s*nb_core/nb_ite/nb_nodes)
1                  11.8                                  295.2                             25.1
2                  11.4                                  300.3                             26.3
4                  14.3                                  442.5                             31.0

Number of cores    X86 Xeon reduced thermal efficiency   ARM reduced thermal efficiency    Ratio (ARM/X86)
                   (J*nb_core/nb_ite/nb_nodes)           (J*nb_core/nb_ite/nb_nodes)
1                  186.4                                 73.8                              0.396
2                  181.1                                 75.1                              0.414
4                  225.9                                 110.6                             0.490

Table 6 - Performance and power profiling of YALES2 - Xeon vs ARM


Figure 30 - Strong scaling of YALES2 on up to 4 cores of Tibidabo

3.12.3 Interactions with other WPs

In parallel to the Mont-Blanc project, several developments have been performed in the YALES2 code. Among these new features, the novel complex chemistry solver is interesting for the Mont-Blanc project because it may benefit from a hybrid CPU/GPU approach such as the one pursued in the WP3 kernel assessment and optimisation. The complex chemistry solver of YALES2 relies on solving the full set of Navier-Stokes equations with a large number of species. Transport properties of the species are based on the Hirschfelder-Curtiss approximation and the reaction rates are based on Arrhenius laws. The time integration of the equations is based on a splitting approach where the reaction rates are integrated with a stiff integrator (the CVODE package from the SUNDIALS library, http://computation.llnl.gov/casc/sundials), diffusion is sub-stepped to comply with the stability constraint, and transport is solved explicitly with a Runge-Kutta approach. The resulting solver has been validated for a wide range of fuels and chemical mechanisms. The main issue with the complex chemistry solver comes from the fact that the stiff integration of the source term is very costly in the flame front while being inexpensive away from it. As a result, strong load imbalance appears when using usual domain decomposition techniques. The calculation of a 2D methane/air Bunsen burner is presented in the following figure. The chemical mechanism is from T.P. Coffee (1983), with 14 species and 38 reactions.

[Figure 30 plot: speed-up versus number of cores (1-4), with the ideal scaling line and the x86 Xeon and ARM curves]


Figure 31 - Calculation of a 2D methane/air Bunsen burner (left) and measure of the load imbalance (right)

A first attempt to overcome this difficulty consisted in implementing an MPI-based dynamic scheduler on CPUs. Its principle is very simple: i) a master is designated; ii) the master gives chunks of source term calculations to a number of slaves; iii) available slaves designate a new master to continue the process, and the former master gives the rank of the new master to the slaves that have finished their tasks. This approach is very effective, as shown in the next figure: linear speed-up is recovered with the dynamic scheduler. A simplified sketch of the scheme is given below.
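A simplified, generic C/MPI sketch of the master/worker idea follows (the rotating-master refinement described above is omitted, and all names are hypothetical): rank 0 hands out chunk indices on demand, and a negative index tells a worker to stop.

    #include <mpi.h>

    #define TAG_REQ  1
    #define TAG_WORK 2

    /* Generic dynamic scheduler sketch: workers request chunks of stiff source
     * term integrations from rank 0 until all chunks have been handed out. */
    void schedule_chunks(int n_chunks, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == 0) {                                  /* master */
            int next = 0, finished = 0;
            while (finished < size - 1) {
                int dummy;
                MPI_Status st;
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ, comm, &st);
                int chunk = (next < n_chunks) ? next++ : -1;
                if (chunk < 0) finished++;
                MPI_Send(&chunk, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, comm);
            }
        } else {                                          /* worker */
            for (;;) {
                int dummy = 0, chunk;
                MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQ, comm);
                MPI_Recv(&chunk, 1, MPI_INT, 0, TAG_WORK, comm, MPI_STATUS_IGNORE);
                if (chunk < 0) break;
                /* ... integrate the stiff source terms of this chunk (e.g. with CVODE) ... */
            }
        }
    }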

Figure 32 - Strong scaling of complex chemistry solver for the 2D Bunsen burner with and without dynamic scheduler on up to 1000 Intel processors

The objective of the coming work in the Mont-Blanc project may be to perform the stiff integration of the source terms on GPUs while benefiting at the same time from the MPI dynamic scheduler.


4 Conclusions and next steps

During the first two years of the Mont-Blanc project, a total of 11 real large-scale applications, used daily in national and European HPC centres by wide scientific and industrial communities, have been ported to the low-power architectures made available by the Mont-Blanc partners. This deliverable describes in detail the conclusions of the porting for each of the applications. The overall conclusions are:

- The software stack provided by the Linux distributions and by WP5 is sufficient to easily port all the applications and their external dependencies (meshers, post-processing tools, I/O and numerical libraries).

- The Tibidabo cluster, which was designed primarily for porting work, has allowed us not only to port but also to profile and to scale out some applications.

- After the initial results presented one year ago in D4.1, some improvements have been possible by changing the job scheduler of Tibidabo, rebooting the CISCO network switches more often, or using bigger datasets.

- The partners also worked on updating performance numbers on PRACE petascale systems like CURIE, SuperMUC or JUQUEEN, as well as on some Arndale boards, which prefigure the next Mont-Blanc prototype. Most of the applications show strong scaling when running on large-scale HPC systems.

Thanks to a strong collaboration with WP3, the next and last year of the Mont-Blanc project will be dedicated to integrating the WP3 OmpSs/OpenCL kernels into a subset of the real applications and to porting these full applications to the Mont-Blanc prototype. The choice of this subset is detailed in D4.3 “Preliminary report about the choice and the first profiling and optimisation efforts on a subset of application”.


List of figures

Figure 1- Picture of a single Arndale board .................................................................................6

Figure 2- Strong scaling of the Mont-Blanc applications .............................................................8

Figure 3 - Nitrogen (N2) electronics orbitals ................................................................................9

Figure 4 - BigDFT initial scaling on Tibidabo ............................................................................. 11

Figure 5 - BigDFT strong scaling on an x86 system (CURIE) ................................................... 12

Figure 6 - Traces of BigDFT on Tibidabo using 36 cores (18 boards) ....................................... 12

Figure 7 - New BigDFT scaling on Tibidabo. Real vs simulated ................................................ 13

Figure 8 - Increased scalability of BQCD using SLURM on Tibidabo ....................................... 14

Figure 9 - Strong scaling of the MPI solver of BQCD ................................................................ 15

Figure 10 - Strong scaling of the MPI/OpenMP solver of BQCD ............................................... 15

Figure 11 - Strong scaling of COSMO OpCode on Tibidabo ..................................................... 18

Figure 12 - Strong scaling of COSMO on a Cray XC30 system from 288 to 2304 physical cores using from 576 to 4608 virtual cores ................................................................................. 19

Figure 13 - Performance of MP2C on Tibidabo before and after rebooting network switches ... 25

Figure 14 - Improved speedups of PEPC on Tibidabo using bigger datasets ........................... 26

Figure 15 - PEPC scaling on a single BG/Q node ..................................................................... 27

Figure 16 - Overall strong scaling of PEPC on up to 256k cores on JUGENE (IBM BG/P) ....... 27

Figure 17 - Profiling of the initial OpenACC version of PEPC (using the NVVP tool) ................ 28

Figure 18 - Profiling of the optimised OpenACC version of PEPC ............................................ 29

Figure 19 - Scalability of the CP kernel of QE on BG/Q system using the CNT10POR8 benchmark ........................................................................................................................ 32

Figure 20 - Profiling of cp.x varying the number of tasks on Tibidabo ....................................... 33

Figure 21 - Time spent in 3D-FFTs varying the number of tasks (time to solution in seconds against number of cores of Tibidabo) ............................................................................... 34

Figure 22 - Break down of the time needed for various parts of the calculation of the potential energy. Times (in ms) shown are for a dual socket Intel Xeon E5-2650 running at 2.0 GHz using the Intel OpenCL driver version. The initialization time (blue) is the same for all kernels. The computational kernels (green) become faster with increasing kernel ID. The reduction kernels (yellow) are rather slow if done in OpenCL (0—2). For kernel 3 and 4 the data transfer becomes dominant. The summation is done on the CPU. ..................... 36

Figure 23 - Comparison of OpenCL devices ............................................................................. 37

Figure 24 - Strong scaling of SPECFEM3D on Tibidabo ........................................................... 38

Figure 25 - Scaling of the OmpSs version of SPECFEM3D on Hector ...................................... 38

Figure 26 - Small vortices of the turbulent flow in an industrial swirl burner .............................. 39

Figure 27 - Example of primary atomisation simulations on CURIE using YALES2 and comparison with experiment (on the right) ........................................................................ 40

Figure 28 - Strong scaling of YALES2 on an IBM BG/P system with a 2.2 billion elements mesh ......................................................................................................................................... 40

Figure 29 - Validation test case: simulation of the wake behind a 2D cylinder at Re=100. The color represents the velocity magnitude and the white dots are Lagrangian particles emitted from the cylinder .................................................................................................. 41

Figure 30 - Strong scaling of YALES2 on up to 4 cores of Tibidabo ......................................... 42

Figure 31 - Calculation of a 2D methane/air Bunsen burner (left) and measure of the load imbalance (right) ............................................................................................................... 43


Figure 32 - Strong scaling of complex chemistry solver for the 2D Bunsen burner with and without dynamic scheduler on up to 1000 Intel processors .............................................. 43

List of tables

Table 1 - List of the 11 WP4 scientific applications .....................................................................5

Table 2 - Status of the porting for each MB application ...............................................................7

Table 3 - BigDFT execution time and energy on ARM vs x86 ................................................... 10

Table 4 - Performance of the CG solver using Intel and OmpSs compiler ................................ 16

Table 5 - Short description of kernel variants implemented to calculate the potential energy of a protein. .......................................................................................................................... 36

Table 6 - Performance and power profiling of YALES2 - Xeon vs ARM .................................... 41

Acronyms and Abbreviations

- DEISA Distributed European Infrastructure for Supercomputing Applications
- GbE Gigabit Ethernet
- GPL General Public Licence
- GPU Graphics Processing Unit
- HPC High Performance Computing
- I/O Input (read), Output (write) operations on memory or on disks/tapes
- MD Molecular Dynamics
- PRACE Partnership for Advanced Computing in Europe (http://www.prace-ri.eu)
- SoC System On Chip
- TDP Thermal Design Power
- WP Work Package
- WP2 Work Package 2 (“Dissemination and Exploitation”)
- WP3 Work Package 3 (“Optimized application kernels”)
- WP4 Work Package 4 (“Exascale applications”)
- WP5 Work Package 5 (“System software”)
- WP6 Work Package 6 (“Next-generation system architecture”)
- WP7 Work Package 7 (“Prototype system architecture”)
- WPL Work Package Leader


List of references

[1] Mont-Blanc D4.1, “Preliminary report of progress about the porting of the full-scale scientific applications”.
[2] http://www.fz-juelich.de/portal/EN/Research/InformationTechnology/Supercomputer/QPACE.html
[3] A. Montani, D. Cesari, C. Marsigli, and T. Paccagnella, “Seven years of activity in the field of mesoscale ensemble forecasting by the COSMO-LEPS system: main achievements and open challenges”, Technical report, Deutscher Wetterdienst, 2010.
[4] http://www.hp2c.ch/projects/opcode/
[5] Joint CRESTA/DEEP/Mont-Blanc Workshop, 10th-11th June 2013 (BSC, Barcelona, Spain), http://www.hp2c.ch/projects/opcode/
[6] E.g., the CAPS compiler: http://www.caps-entreprise.com/products/caps-compilers/
[7] P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, G. L. Chiarotti, M. Cococcioni, I. Dabo, A. Dal Corso, S. de Gironcoli, S. Fabris, G. Fratesi, R. Gebauer, U. Gerstmann, C. Gougoussis, A. Kokalj, M. Lazzeri, L. Martin-Samos, N. Marzari, F. Mauri, R. Mazzarello, S. Paolini, A. Pasquarello, L. Paulatto, C. Sbraccia, S. Scandolo, G. Sclauzero, A. P. Seitsonen, A. Smogunov, P. Umari, and R. M. Wentzcovitch, “QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials”, Journal of Physics: Condensed Matter, 21(39):395502 (19pp), 2009.
[8] F. Spiga and I. Girotto, “phiGEMM: a CPU-GPU library for porting Quantum ESPRESSO on hybrid systems”, 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (IEEE 2012), DOI: 10.1109/PDP.2012.72.
[9] http://www.fz-juelich.de/ias/jsc/pepc
[10] http://www.fz-juelich.de/ias/jsc/slpp
[11] http://www.fz-juelich.de/ias/jsc/high-q-club