
Page 1: HPC Benchmark Project (part I)

Copyright © ESI Group, 2019. All rights reserved. www.esi-group.com

Ivan Spisso, [email protected]
Simone Bna, [email protected]
Giorgio Amati, g.amati [email protected]
Giacomo Rossi, [email protected]

Fabrizio Magugliani, [email protected]

Page 2: Outline of the presentation

First Part: OpenFOAM HPC Benchmark Project

• HPC performance of OpenFOAM (3 slides)

• OpenFOAM HPC Technical Committee (TC)

• Code repository for HPC TC, https://develop.openfoam.com/committees/hpc

• Initial (first?) benchmark test-case: 3-D Lid Driven Cavity
  • Set-up
  • Metrics/KPIs (Key Performance Indicators) to measure performance
  • Results

• Further benchmark test-cases (with other TCs): ask the Community

Page 3: OpenFOAM: algorithmic overview and Pstream

● OpenFOAM is first and foremost a C++ library used to solve, in discretized form, systems of Partial Differential Equations (PDEs). OpenFOAM stands for Open Field Operation and Manipulation.

● The engine of OpenFOAM is the numerical method, with the following features: segregated, iterative solution (PCG/GAMG), unstructured finite-volume method, co-located variables, equation coupling.

● The method of parallel computing used by OpenFOAM is based on standard MPI, using a domain decomposition strategy (zero-halo-layer domain decomposition).

● A convenient interface, Pstream, is used to plug any MPI library into OpenFOAM; it is a light wrapper around the selected MPI interface.

Page 4: OpenFOAM: HPC performance

● OpenFOAM scales reasonably well up to thousands of cores; the upper limit is of the order of thousands of cores [1]. We are looking at achieving radical scalability for cases of hundreds of millions to billions of cells on 10K-100K cores.

● A custom version by Shimizu Corp., Fujitsu Limited and RIKEN on the K computer (SPARC64 VIIIfx 2.0 GHz, Tofu interconnect) was able to achieve high performance with 100 thousand MPI tasks on a large-scale transient CFD simulation, up to a 100-billion-cell mesh [2].

Page 5: OpenFOAM HPC bottlenecks towards the exascale: open issues

Up to now, the well-known bottlenecks limiting OpenFOAM performance on massively parallel clusters are:

● Scalability of the linear solvers and their limits in the parallelism paradigm.

○ In most cases the memory bandwidth is a limiting factor for the solver.
○ Additionally, global reductions are frequently required in the solvers (see the sketch after this list).
The linear algebra core libraries are the main communication bottleneck for scalability.

● Sparse matrix storage format: the LDU sparse matrix storage format used internally does not enable any cache-blocking mechanism (SIMD, vectorization).

● The I/O data storage system: when running in parallel, the data for decomposed fields and mesh(es) has historically been stored in multiple files within separate directories, one per processor, which is a bottleneck for big simulations.
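As a schematic sketch of why global reductions matter (textbook preconditioned CG, not the exact OpenFOAM implementation), every PCG iteration contains at least two dot products, each of which is a global reduction (MPI_Allreduce) across all ranks, on top of the halo exchange inside the matrix-vector product:

$$
\begin{aligned}
z_k &= M^{-1} r_k, \qquad \rho_k = r_k^{T} z_k && \text{(global reduction 1)}\\
p_k &= z_k + (\rho_k/\rho_{k-1})\, p_{k-1} && \\
q_k &= A p_k && \text{(halo exchange)}\\
\alpha_k &= \rho_k / (p_k^{T} q_k) && \text{(global reduction 2)}\\
x_{k+1} &= x_k + \alpha_k p_k, \qquad r_{k+1} = r_k - \alpha_k q_k &&
\end{aligned}
$$

At large core counts the latency of these reductions, rather than the local floating-point work, tends to dominate the solver time.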

Page 6: OpenFOAM HPC bottlenecks towards the exascale: open issues

Profiling with APS (Application Performance Snapshot) from Intel

● https://software.intel.com/sites/products/snapshots/application-snapshot/

Page 7: HPC roofline model and computational intensity

● The roofline model: performance bound (y-axis) ordered according to computational intensity.
● Computational intensity: ratio of total floating-point operations to total data movement (bytes), i.e. flops/byte.
● What is the OpenFOAM (CFD/FEM) arithmetic intensity? About 0.1, maybe less… (see the worked bound below)
● [Figure: TOP500 List - June 2019, with "CFD range" and "Linpack range" annotations]
  ○ CFD range: bandwidth-limited, roughly two orders of magnitude below peak flops
  ○ Linpack: dense linear algebra
  ○ OpenFOAM: sparse linear algebra
  ○ HPCG benchmark
● Technological trend: towards the exascale
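As a hedged worked example of the roofline bound (the 141 GB/s per-node memory bandwidth is the Galileo figure quoted later in this deck; the ~1 TFLOP/s double-precision peak for a dual-socket Broadwell node is an estimate, not a number from the slides):

$$
P_\text{attainable} = \min\bigl(P_\text{peak},\; I \times BW\bigr)
\approx \min\Bigl(\sim 1\,\text{TFLOP/s},\; 0.1\,\tfrac{\text{flop}}{\text{byte}} \times 141\,\tfrac{\text{GB}}{\text{s}}\Bigr)
\approx 14\ \text{GFLOP/s per node,}
$$

i.e. roughly two orders of magnitude below peak, which is why OpenFOAM sits in the bandwidth-limited "CFD range" of the plot.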

Page 8: OpenFOAM HPC Technical Committee (TC)

• The Technical Committees cover all the key focus areas of OpenFOAM development; they assess the state of the art and the need and status for validation, documentation and further development.

• Remits of the HPC TC
  • OpenFOAM recommendations to the Steering Committee in respect of the HPC technical area
  • Work together with the Community to overcome the current HPC bottlenecks of OpenFOAM:
    • Scalability of linear solvers
    • Adapt/modify the data structures of the sparse linear system to enable vectorization / hybridization
    • Improve memory access on new architectures
    • Improve memory bandwidth
    • Parallel pre- and post-processing, parallel I/O
  • Strong co-design approach
  • Identify algorithm improvements to enhance HPC scalability
  • Interaction with the other Technical Committees (Numerics, Documentation)

• Priorities of the HPC TC:
  • HPC benchmark
  • GPU enabling of OpenFOAM (see the later presentation by Nvidia's Stan Posey)
  • Parallel I/O: the ADIOS2 I/O library, as a function object, is now a git submodule in the OpenFOAM develop branch

https://www.openfoam.com/governance/technical-committees.php#tm-hpc

Page 9: HPC Technical Committee Membership

Nominations for Technical Committee Membership

Members of the Technical Committee
• Chair: Dr. Ivan Spisso, CINECA
• S. Bnà, HPC Developer, CINECA
• Deputy-Chair: Dr. Mark Olesen, Principal Eng., ESI-OpenCFD (Release and Maintenance)
• Dr. Henrik Rusche, Wikki Ltd. (Release Authority)
• Dr. M. Klemm, Principal Eng., and Dr. G. Rossi, App. Software Eng., Intel (Hardware OEM)
• Dr. Olly Perks / Dr. F. Spiga, Research Eng., ARM (Hardware OEM)
• F. Magugliani, Strategic Planning, E4 (HPC System Integrator)
• Dr. William F. Godoy, Scientific Data Group, Oak Ridge National Lab (HPC Computer Sc.)
• Dr. Pham Van Phuc, Institute of Technology, Shimizu Corporation (HPC/OEM)
• Dr. Stefano Zampini, PETSc Team, KAUST (HPC center / PETSc dev.)
• Mr. Axel Koehler, Sr. Solutions Architect, Nvidia (Hardware OEM)

Page 10: HPC Technical Committee: first indications of tasks

Technical Committee Tasks

• Commitment
  • Work together with the Community to overcome the current HPC bottlenecks of OpenFOAM.
  • Demonstrate improvements in performance and scalability, to move forward from the current near-petascale performance to pre-exascale and exascale class performance.

• Remits
  • OpenFOAM recommendations to the Steering Committee in respect of the HPC technical area
  • Work together with the Community to overcome the current HPC bottlenecks of OpenFOAM:
    • Scalability of linear solvers
    • Adapt/modify the data structures of the sparse linear system to enable vectorization / hybridization
    • Improve memory access on new architectures
    • Improve memory bandwidth
    • Parallel pre- and post-processing, parallel I/O
  • Strong co-design approach
  • Identify algorithm improvements to enhance HPC scalability
  • Interaction with the other Technical Committees (Numerics, Documentation)

• Tasks
  • Test and benchmark third-party linear solver algebra packages (ESI, Intel, PETSc Team)
  • Test parallel I/O (ADIOS2) (ESI, ORNL)
  • Review of HPC benchmarks and establishment of common HPC benchmarks among different architectures (Intel, ARM, EPCC, E4)
  • Review of documentation in the HPC & OpenFOAM technical area
  • …

Page 11: Code repository for the HPC Technical Committee

HPC hardware technological trend towards the exascale: co-design is required!

https://develop.openfoam.com/committees/hpc

● Create an open and shared repository with relevant data-sets and information

● Provide a user guide and initial scripts to set up and run the different data-sets

● Provide the community with a homogeneous term of reference to compare different HW architectures, configurations and SW environments

● Define a common set of metrics/KPIs (Key Performance Indicators) to measure performance

● Data-sets are in the patch-1 branch, right now

Page 12: Initial benchmark test-case: 3-D Lid Driven Cavity

• Data-set: 3-D Lid Driven Cavity flow
• Problem size(s): S (100^3), M (200^3), XL (400^3)
• Physical features: incompressible laminar flow, regular and uniform grid
• Relation with HW/SW infrastructure: CPU intensive: yes; memory intensive: yes; GPU intensive: no; I/O intensive: no
• Tasks per node: full (36), half (18)
• KPIs: time to solution, energy to solution (?)
• Bottleneck(s): memory-bandwidth bound, linear solver algebra, data structure, …

● Derived from the 2-D lid-driven cavity flow of the OpenFOAM tutorial, https://www.openfoam.com/documentation/tutorial-guide/tutorialse2.php

● Simple geometry (see the illustrative mesh sketch after this list). B.C.: a moving lid and no-slip walls
● Stress test for the linear solver algebra, mainly the pressure equation
● KPIs to be monitored: wall time, memory bandwidth, Bound
● Bound = the metric represents the percentage of elapsed time spent heavily utilizing system bandwidth (available only for Intel architectures)
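The case files live in the repository linked above; purely as an illustrative sketch of the geometry (the 0.1 m cube side comes from the d value on a later slide, the 100^3 resolution is the S size, and the patch layout follows the standard cavity tutorial; the repository's actual dictionaries may differ):

```
// blockMeshDict sketch (hypothetical): S case, 100^3 cells, 0.1 m cube
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      blockMeshDict;
}

convertToMeters 0.1;

vertices
(
    (0 0 0) (1 0 0) (1 1 0) (0 1 0)
    (0 0 1) (1 0 1) (1 1 1) (0 1 1)
);

blocks
(
    hex (0 1 2 3 4 5 6 7) (100 100 100) simpleGrading (1 1 1)
);

edges ();

boundary
(
    movingWall          // lid at y = max; driven lid, |U| = 1 m/s (direction as in the tutorial)
    {
        type wall;
        faces ((3 7 6 2));
    }
    fixedWalls          // remaining five no-slip walls
    {
        type wall;
        faces
        (
            (0 4 7 3)
            (2 6 5 1)
            (1 5 4 0)
            (0 3 2 1)
            (4 5 6 7)
        );
    }
);

mergePatchPairs ();
```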

Page 13: Initial benchmark test-case: profiling of the S test-case

Profiling with VTune:

• Vectorization: 3.9 %
• Top loops/functions by CPU time: lduMatrix::Amul, only FP scalar ops, no vectorization (see the sketch below)
• Time spent in the pressure solver: 67.9 %
• Time spent in the velocity solver: 27.3 %
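Schematically (a sketch of the LDU matrix-vector product, not the exact OpenFOAM source), Amul loops over the internal faces with indirect owner/neighbour addressing, which is what defeats straightforward SIMD vectorization:

$$
(Ax)_i = d_i\, x_i + \sum_{f\,:\,\mathrm{owner}(f)=i} u_f\, x_{\mathrm{neighbour}(f)} + \sum_{f\,:\,\mathrm{neighbour}(f)=i} l_f\, x_{\mathrm{owner}(f)},
$$

where d, u and l are the diagonal, upper and lower coefficient arrays of the LDU storage and f runs over the internal faces of the mesh.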

Page 14: Initial benchmark test-case: 3-D Lid Driven Cavity - geometrical and physical properties

Test-case                          S            M            XL
delta x (m)                        0.001        0.0005       0.00025
N. of cells                        1,000,000    8,000,000    64,000,000
N. of cells (linear, per side)     100          200          400
nu (m^2/s)                         0.01         0.01         0.01
d (m)                              0.1          0.1          0.1
Co                                 1            0.5          0.25
T final (s)                        0.5          0.5          0.5 (*) (0.07275)
delta t (s)                        0.001        0.00025      0.0000625
Reynolds                           10           10           10
U (m/s)                            1            1            1
num. of iterations (time steps)    500          2000         8000

● Final physical time T = 0.5 s, to reach steady state in laminar flow (* if feasible)
● delta x is halved when moving to the next bigger case
● Courant number kept under the stability limit, and reduced for the bigger cases
● delta t is reduced by a factor of 4 when moving to the next bigger case

Co ≅ (U · delta t) / delta x
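For example, plugging the S-case values from the table into the Courant relation above gives

$$
\mathrm{Co} = \frac{U\,\Delta t}{\Delta x} = \frac{1\ \mathrm{m/s} \times 0.001\ \mathrm{s}}{0.001\ \mathrm{m}} = 1,
$$

and since Δx is halved while Δt is divided by 4 for each bigger case, Co is halved each time (0.5 for M, 0.25 for XL).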

Page 15: Initial benchmark test-case: 3-D Lid Driven Cavity - geometrical and physical properties

Page 16: Initial benchmark test-case: 3-D Lid Driven Cavity - solver set-up

● The set-up has been chosen to have a constant computational load at each time-step

● maxIter is reached by the number of iterations at each time step, both for p and U (the tolerances are set to zero, so the solvers always run exactly maxIter iterations; see the dictionary sketch after the table)

● An “acceptable” level of convergence of the residuals is reached

● The set-up has NOT been chosen to minimize the execution time or to maximize the convergence rate

Total num. of iterations = num. of time steps × maxIter

system/fvSolution

Test-case                     S            M            XL
pressure (p)
  solver                      PCG          PCG          PCG
  tolerance                   0            0            0
  relTol                      0            0            0
  maxIter                     250          650          1300
  total num. of iterations    125,000      1,300,000    10,400,000
pFinal
  relTol                      0            0            0
velocity (U)
  solver                      PBiCGStab    PBiCGStab    PBiCGStab
  preconditioner              DILU         DILU         DILU
  tolerance                   0            0            0
  relTol                      0            0            0
  maxIter                     5            5            5
  total num. of iterations    2,500        10,000       40,000
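Written out as an OpenFOAM dictionary, the S-case column of this table corresponds roughly to the sketch below (a sketch, not the repository's exact file: the DIC preconditioner for PCG and the PISO corrector settings are assumptions, since the slides do not specify them):

```
// system/fvSolution -- sketch for the S case
solvers
{
    p
    {
        solver          PCG;
        preconditioner  DIC;    // assumption: not given on the slide
        tolerance       0;      // zero tolerances, so the solver ...
        relTol          0;      // ... always runs exactly maxIter iterations
        maxIter         250;    // 650 for M, 1300 for XL
    }

    pFinal
    {
        $p;
        relTol          0;
    }

    U
    {
        solver          PBiCGStab;
        preconditioner  DILU;
        tolerance       0;
        relTol          0;
        maxIter         5;
    }
}

PISO
{
    nCorrectors              2;    // assumption: icoFoam tutorial default
    nNonOrthogonalCorrectors 0;    // assumption: orthogonal, uniform mesh
}
```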

Page 17: Initial benchmark test-case: residuals and number of iterations comparison

● Comparison of pressure (left) and Ux residuals (right)
● Solver tolerance vs. solver tolerance + maxIter
● Left vertical axis: residual trend with and without a fixed number of iterations
● Right vertical axis: number of iterations

Page 18: Initial benchmark test-case: solver set-up, S case example

system/fvSolution

Test-case                     S            M            XL
pressure (p)
  solver                      PCG          PCG          PCG
  tolerance                   0            0            0
  relTol                      0            0            0
  maxIter                     250          650          1300
  total num. of iterations    125,000      1,300,000    10,400,000
pFinal
  relTol                      0            0            0
velocity (U)
  solver                      PBiCGStab    PBiCGStab    PBiCGStab
  preconditioner              DILU         DILU         DILU
  tolerance                   0            0            0
  relTol                      0            0            0
  maxIter                     5            5            5
  total num. of iterations    2,500        10,000       40,000

Page 19: Initial benchmark test-case: computational set-up and domain decomposition

● Full computational density: using all the cores/tasks available per node
● Half computational density: using 1/2 of the cores/tasks available per node
● Quarter density: using 1/4 of the cores/tasks available per node

● For the Broadwell processor, respectively:
  ○ --tasks-per-node=36
  ○ --tasks-per-node=18
  ○ --tasks-per-node=9

● KPI to measure = memory bandwidth

● Domain decomposition = scotch (see the sketch below)
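As a hedged sketch of the corresponding decomposition dictionary (the repository scripts define the actual values; here a single full-density Broadwell node is assumed):

```
// system/decomposeParDict -- sketch: one Broadwell node at full density
numberOfSubdomains  36;      // tasks-per-node x number of nodes (36 x 1 here)

method              scotch;  // graph-based decomposition; no manual coefficients needed
```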

Page 20: HPC resources specs

Example of a table with the specifications of the platforms used

Page 21: Results: S test-case (1 M cells)

● Galileo cluster: Intel Broadwell based cluster, --tasks-per-node=36
● Left figure: inter-node scalability
● Too few cells per core to get good scalability (see the estimate below)
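A rough estimate of why the S case stops scaling (the cells-per-rank rule of thumb is general OpenFOAM experience, not a number taken from this deck):

$$
\frac{\text{cells}}{\text{rank}} = \frac{10^{6}}{n_\text{nodes} \times 36} \approx 6900 \;\; (4\ \text{nodes}), \qquad \approx 3500 \;\; (8\ \text{nodes}),
$$

well below the few tens of thousands of cells per MPI rank usually needed to amortize communication costs.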

Page 22: HPC resources specs - comparison of ARMIDIA, GALILEO, etc.

• Galileo (CINECA): Intel Broadwell E5-2697, 36 cores per node, 1022 nodes, no accelerators, 128 GB/node, Intel OmniPath (100 Gb/s) network, memory bandwidth 141 GB/s

• ARMIDIA (E4): ARM Marvell ThunderX2 @ 2.2/2.5 GHz, 64 cores per node, 8 nodes, no accelerators, 256 GB/node, Mellanox IB (100 Gb/s) network

• …

Page 23: Results: M and XL test-cases - Armidia cluster

HPC comparison: Armidia cluster - execution time, speed-up, number of cells per process

● Armidia cluster: 8 compute nodes, 64 cores per node, Marvell ThunderX2 @ 2.2/2.5 GHz, 256 GB/node, Mellanox IB (100 Gb/s)
● --tasks-per-node=64

Page 24: Results: M and XL test-cases - Armidia cluster

HPC comparison: Armidia cluster - execution time, speed-up, number of cells per process

Page 25: Results: XL test-case - Galileo cluster

HPC comparison: Galileo cluster - execution time, speed-up and number of cells per core

● Galileo cluster: Intel Broadwell based cluster, --tasks-per-node=36
● Superlinear scalability due to memory bandwidth; see the next presentation

Page 26: Results: XL test-case - HPC comparison of Armidia and Galileo clusters

Execution time vs. number of cores (XL test-case)

number of cores                128       256       288       512       576       1152
Intel Broadwell (Galileo)      -         -         103.505   -         43.664    15.867
ARM ThunderX2 (Armidia)        189.930   90.914    -         47.276    -         -

● Comparison by using the same number of cores
● Runs use the full number of tasks per node: 36 for Broadwell, 64 for ARM
● Continuous line: trend line
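From the Galileo row of the table, the relative speed-up between successive core counts is just the ratio of execution times:

$$
S_{288 \to 576} = \frac{103.505}{43.664} \approx 2.37, \qquad
S_{576 \to 1152} = \frac{43.664}{15.867} \approx 2.75,
$$

both above the ideal factor of 2 for a doubling of cores, consistent with the superlinear, memory-bandwidth-driven scalability noted on the previous slide.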

Page 27: Conclusions / Further work / Acknowledgements

● Conclusions

  ○ Presented the OpenFOAM HPC Benchmark project
  ○ Created an open and shared repository with relevant data-sets and information
  ○ Presented preliminary results on HPC architectures
  ○ Started to provide the community with a homogeneous term of reference to compare different HW architectures, configurations and SW environments

● Further work
  ○ Run the M test-case
  ○ Populate the repo web site with instructions on how to run and post-process the benchmark (with the Documentation TC)
  ○ Add runs at half the number of cores per node
  ○ Add several architectures
  ○ Add weak-scalability tests at the optimal number of cells per core
  ○ Add energy to solution?
  ○ Add different (optimal?) decomposition methods
  ○ Add further test-cases

● Acknowledgements
  ○ C. Latini and A. Memmolo (CINECA) for post-processing, scripting and figure templates
  ○ Andy Heather and Joseph Nagzy for the initial set-up of the wiki repo and web page (work in progress)
  ○ Useful discussions with S. Zampini (PETSc Team)
  ○ The people / companies / OEMs who give (or will give) access to different architectures (E4, Intel, ARM, AMD, etc.)
