OpenFOAM on BG/Q porting and performance
Paride Dagna, SCAI Department, CINECA
OpenFOAM: a selected application within the PRACE project
SYSTEM OVERVIEW

Fermi: PRACE Tier-0 system
Model: IBM BlueGene/Q
Architecture: 10 BG/Q frames with 2 midplanes each
Front-end nodes OS: Red Hat EL 6.2
Compute node kernel: lightweight Linux-like kernel
Processor type: IBM PowerA2, 16 cores, 1.6 GHz
Computing nodes: 10,240
Computing cores: 163,840
RAM: 16 GB/node
Internal network: network interface with 11 links -> 5D torus
Disk space: more than 2 PB of scratch space
Peak performance: 2.1 PFlop/s
SYSTEM OVERVIEW

Compute card: one single-chip module with 16 GB DDR3 memory

Compute node (back-end):
• Each compute node comprises 17 cores on a single chip with 16 GB of dedicated physical memory.
• Applications run on 16 of the cores; the 17th core is reserved for system software.
• Nearly the full 16 GB of physical memory is available for application use.
• Each core can run up to 4 processes/threads, for a total of 64 processes/threads per node.

Applications:
• Applications are submitted to the compute nodes by the batch scheduler.
• To run on the compute nodes (back-end), applications must be cross-compiled.
Porting of OpenFOAM on BG/Q
Compiling OpenFOAM for the back-end nodes on BG/Q requires some system-specific changes to the configuration scripts of OpenFOAM and of the ThirdParty packages.
The ThirdParty MPI cannot be used; rules for the BG/Q MPI must be added.
Environment configuration:
• Configure the environment with compilers and zlib using modules:
  module load bgq-gnu
  module load zlib
OpenFOAM configuration scripts and rules:
• The files "bashrc" and "settings.sh" must be edited to insert the rules for the BG/Q MPI (a sketch follows this list).
• The c and c++ files in the wmake/rules folders must be modified for dynamic linking.
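As a rough illustration, the kind of MPI entry to add is sketched below, modelled on the existing cases in "settings.sh"; the label BGQMPI and the driver path are assumptions for a typical BG/Q installation, not the exact values used on Fermi.

  # In etc/bashrc, select the new MPI rules (assumed label):
  export WM_MPLIB=BGQMPI

  # In etc/settings.sh, add a matching case to the WM_MPLIB switch
  # (paths are assumptions for a standard BG/Q driver tree):
  BGQMPI)
      export MPI_ARCH_PATH=/bgsys/drivers/ppcfloor/comm/gcc
      export MPI_HOME=$MPI_ARCH_PATH
      _foamAddPath $MPI_ARCH_PATH/bin
      _foamAddLib  $MPI_ARCH_PATH/lib
      ;;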
Scotch library build:
• Before running "Allwmake" in the OpenFOAM main folder, some changes must be made to the compiling and dynamic-linking rules in the file "Makefile.inc" of the scotch library (see the sketch below).
• Cross-compile the "dummysizes" scotch utility and execute it on the back-end to properly build the header files scotch.h and scotchf.h.
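A minimal sketch of these changes, assuming the GNU cross-toolchain provided by the bgq-gnu module; the cross-compiler name and the runjob invocations are illustrative rather than the exact Fermi settings.

  # Makefile.inc: point the compilers at the back-end toolchain and
  # keep flags compatible with dynamic linking (illustrative values):
  #   CCS     = powerpc64-bgq-linux-gnu-gcc
  #   CCP     = mpicc
  #   CFLAGS  = -O2 -fPIC -DCOMMON_FILE_COMPRESS_GZ -DSCOTCH_RENAME
  #   LDFLAGS = -lz -lm -lrt

  # dummysizes is cross-compiled for the back-end, so it must be run
  # there (e.g. through the BG/Q launcher) to generate the headers:
  runjob --np 1 : ./dummysizes library.h   scotch.h
  runjob --np 1 : ./dummysizes library_f.h scotchf.h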
Compile:
• Go to $WM_PROJECT/$WM_PROJECT_VERSION and compile with ./Allwmake (a job-script sketch for running the result follows).
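Once built, solvers run on the back-end through the batch scheduler. Below is a minimal job-script sketch, assuming LoadLeveler and the runjob launcher; job name, node count and solver are illustrative.

  #!/bin/bash
  # Illustrative LoadLeveler job for a parallel OpenFOAM run on BG/Q.
  # @ job_name         = cavity3d
  # @ job_type         = bluegene
  # @ bg_size          = 64              # 64 nodes x 16 ranks/node = 1024 MPI tasks
  # @ wall_clock_limit = 01:00:00
  # @ queue
  runjob --ranks-per-node 16 --np 1024 : $FOAM_APPBIN/icoFoam -parallel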
Performance of OpenFOAM on BG/Q

Test cases:
• Cavity 3D – isothermal, incompressible flow. Solver: icoFoam
• BoxTurb 3D – homogeneous isotropic turbulence, compressible flow. Solver: sonicFoam
• Airfoil (wing section) – external aerodynamics. Solver: simpleFoam
• DTMB hull – marine hydrodynamics. Solver: interFoam
Performance of OpenFOAM on BG/Q
Systems
Model: IBM-BlueGene /Q (Fermi)
Processor Type: IBM PowerA2, 1.6 GHz
Computing Node: 16 cores
RAM: 16GB / node; 1GB/core
Internal Network: Network interface
with 11 links ->5D Torus
Model: Hewlett Packard C7000 (Lagrange)
Processor Type: Intel, Xeon Westmere,
2.8 GHz
Computing Node: 12 cores
RAM: 24GB / node; 2GB/core
Internal Network: Infiniband QDR/DDR Voltaire, Fat Tree
Cavity – 3D

Flow: laminar, isothermal, incompressible
Mesh: fully structured 3D
Mesh elements: cubes
Elements: 10,000,000 and 20,000,000
Partition methods: scotch, simple (see the decomposeParDict sketch below)
Solver: icoFoam
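The partition method is selected through the case's decomposeParDict; a minimal sketch follows, where the subdomain count (64) and the cubic simpleCoeffs split are illustrative values, not the ones used in these benchmarks.

  # Minimal system/decomposeParDict, written here via a shell heredoc:
  cat > system/decomposeParDict <<'EOF'
  FoamFile
  {
      version 2.0;
      format  ascii;
      class   dictionary;
      object  decomposeParDict;
  }

  numberOfSubdomains 64;

  method scotch;            // or: simple

  simpleCoeffs              // read only when method is simple
  {
      n     (4 4 4);        // 4 x 4 x 4 = 64 subdomains
      delta 0.001;
  }
  EOF

  decomposePar              # run on the front-end before submitting the solver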
Cavity – 3D – Speed-up and efficiency

Mesh: 10,000,000 cells
Solution saved at the final time step
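The plots report speed-up and parallel efficiency; assuming the usual definitions normalized to the smallest run (here n_0 = 64 cores), so that the ideal speed-up equals the core count:

  \[ S(n) = n_0 \, \frac{T(n_0)}{T(n)}, \qquad E(n) = \frac{S(n)}{n} \]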
[Figure: speed-up and efficiency vs. number of cores (64-4096), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Cavity – 3D – Speed-up and efficiency

Mesh: 10,000,000 cells
Solution saved every 10 time steps
[Figure: speed-up and efficiency vs. number of cores (64-1024), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Cavity – 3D – Profiling
[Figure: I/O overhead on simulation time, % increase vs. number of cores (64-1024); series: Fermi, Lagrange]
# Cores    Cumulative I/O (GB)    File size per core (MB)
     64           13.0                    5.10
    128           14.0                    2.50
    256           14.0                    1.33
    512           15.0                    0.75
   1024           22.0                    0.40
Number of iterations: 100
Files per core: 3
MPI_Allreduce average message size per core (B): 8 -- # cores: 1024
Average message size sent and received per core (KB): 4.6 -- # cores: 1024
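The I/O overhead plotted above is, presumably, the relative increase in simulation time when intermediate solutions are written instead of only the final one; under that assumption:

  \[ \mathrm{overhead}(n) = \frac{T_{\text{with I/O}}(n) - T_{\text{final only}}(n)}{T_{\text{final only}}(n)} \times 100\% \]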
[Figures: MPI and I/O profiling at 512 and 1024 cores]
Cavity – 3D – Speed-up and efficiency

Mesh: 20,000,000 cells
Solution saved at the final time step
[Figure: speed-up and efficiency vs. number of cores (64-4096), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Cavity – 3D – Speed-up and efficiency

Mesh: 20,000,000 cells
Solution saved every 10 time steps
[Figure: speed-up and efficiency vs. number of cores (64-1024), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Cavity – 3D – Profiling
# Cores    Cumulative I/O (GB)    File size per core (MB)
     64           18.1                    9.46
    128           18.1                    4.73
    256           18.5                    2.42
    512           22.5                    1.27
   1024           23.1                    0.63
[Figures: MPI and I/O profiling at 512 and 1024 cores]
[Figure: I/O overhead on simulation time, % increase vs. number of cores (64-1024); series: Fermi, Lagrange]
Number of iterations: 100
Files per core: 3
MPI_Allreduce average message size per core (B): 8 -- # cores: 1024
Average message size sent and received per core (KB): 6.4 -- # cores: 1024
BoxTurb – 3D

Flow: compressible
Case study: homogeneous, isotropic turbulence
Mesh: uniform 3D
Number of cells: ≈ 17,000,000
Solver: sonicFoam
Partition method: simple
Courtesy of: Matteo Cerminara (INGV), Pisa
BoxTurb – 3D – Speed-up and efficiency

Solution saved at the final time step
[Figure: speed-up and efficiency vs. number of cores (64-2048), simple partition method; series: Fermi, Lagrange, Ideal]
BoxTurb – 3D – Speed-up and efficiency

Solution saved every 10 time steps
[Figure: speed-up and efficiency vs. number of cores (64-1024), simple partition method; series: Fermi, Lagrange, Ideal]
BoxTurb – 3D – Profiling
[Figure: I/O overhead on simulation time, % increase vs. number of cores (64-1024); series: Fermi, Lagrange]
# Cores    Cumulative I/O (GB)    File size per core (MB)
     64           18.4                    4.50
    128           18.4                    2.25
    256           18.6                    1.14
    512           19.6                    0.60
   1024           21.2                    0.32
[Figures: MPI and I/O profiling at 512 and 1024 cores]
Number of iterations: 180
Files per core: 4
MPI_Allreduce average message size per core (B): 8 -- # cores: 1024
Average message size sent and received per core (KB): 9.3 -- # cores: 1024
Airfoil – wing section

Flow: turbulent, incompressible
Case study: steady state, extruded NACA airfoil
Mesh: fully structured 3D
Number of cells: ≈ 9,000,000
Solver: simpleFoam
Partition methods: simple, scotch
Airfoil – wing section – Speed-up and efficiency

Solution saved at the final time step
[Figure: speed-up and efficiency vs. number of cores (64-1024), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Airfoil – wing section – Profiling

[Figures: MPI profiling with simple and scotch partitioning, 512 cores]
Airfoil – wing section – Speed-up and efficiency

Solution saved every 100 time steps
[Figure: speed-up and efficiency vs. number of cores (64-1024), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Airfoil – wing section – Profiling

[Figures: MPI and I/O profiling at 512 and 1024 cores]
# Cores    Cumulative I/O (GB)    File size per core (MB)
     64            5.6                    1.46
    128            5.8                    0.76
    256            6.6                    0.43
    512            7.9                    0.26
   1024           12.0                    0.20
Number of iterations: 1000
Files per core: 6
MPI_Allreduce average message size per core (B): 8 -- # cores: 512
Average message size sent and received per core (KB): 4.2 -- # cores: 512
[Figure: I/O overhead on simulation time, % increase vs. number of cores (64-1024), scotch decomposition; series: Fermi, Lagrange]
Free surface – DTMB hull – 3D

Flow: turbulent, incompressible
Case study: unsteady, multiphase
Mesh: unstructured 3D
Number of cells: ≈ 5,500,000
Solver: interFoam
Partition methods: simple, scotch
Free surface – DTMB hull – 3D – Speed-up and efficiency

Solution saved at the final time step
[Figure: speed-up and efficiency vs. number of cores (32-512), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Free surface – DTMB hull – 3D – Speed-up and efficiency

Solution saved every 10 time steps
[Figure: speed-up and efficiency vs. number of cores (32-512), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Free surface – DTMB hull – 3D – Profiling
# Cores    Cumulative I/O (GB)    File size per core (MB)
     64           18.4                    4.50
    128           18.4                    2.25
    256           18.6                    1.14
    512           19.6                    0.60
Number of iterations: 100
Files per core: 8
MPI_Allreduce average message size per core (B): 8 -- # cores: 512
Average message size sent and received per core (KB): 29.4 -- # cores: 512
[Figure: I/O overhead on simulation time, % increase vs. number of cores (32-512); series: Fermi, Lagrange]
[Figures: MPI and I/O profiling at 256 and 512 cores]
Conclusions
OpenFOAM scaling and efficiency on Fermi and on classic HPC systems are comparable; for well-suited case studies with a good balance between computation, I/O and MPI communication, the larger number of cores available on Fermi can be exploited.
OpenFOAM efficiency and scaling are constrained by its poor I/O design and by inter-process communication.
A new I/O scheme based on MPI parallel I/O routines, or on available parallel I/O libraries able to use the parallel file system efficiently, should dramatically reduce the I/O overhead.
A hybrid MPI/OpenMP multi-threaded version of the solvers would mitigate the time spent in MPI routines as the number of cores increases.
Acknowledgements

Bob Danani, VLSCI, Carlton, Melbourne
Matteo Cerminara, INGV
Massimiliano Culpo, CINECA
Piero Lanucara, CINECA
Andrea Penza, CINECA
Francesco Salvadore, CINECA
Ivan Spisso, CINECA
Questions?