OpenFOAM on BG/Q porting and performance
Paride Dagna, SCAI Department, CINECA
OpenFOAM: a selected application within the PRACE project
SYSTEM OVERVIEW

Fermi: PRACE Tier-0 system
Model: IBM BlueGene/Q
Architecture: 10 BG/Q frames with 2 midplanes each
Front-end nodes OS: Red Hat EL 6.2
Compute node kernel: lightweight Linux-like kernel
Processor type: IBM PowerA2, 16 cores, 1.6 GHz
Computing nodes: 10,240
Computing cores: 163,840
RAM: 16 GB/node
Internal network: network interface with 11 links -> 5D torus
Disk space: more than 2 PB of scratch space
Peak performance: 2.1 PFlop/s
SYSTEM OVERVIEW

Compute card: one single-chip module with 16 GB DDR3 memory

Compute node (back-end):
• Each compute node comprises 17 cores on a single chip with 16 GB of dedicated physical memory.
• Applications run on 16 of the cores; the 17th core is reserved for system software.
• Nearly the full 16 GB of physical memory is available for application use.
• Each core can run up to 4 processes/threads, for a total of 64 processes/threads per node.

Applications:
• Applications are submitted to the compute nodes by the batch scheduler.
• To run on the compute nodes (back-end), applications must be cross-compiled.
Porting of OpenFOAM on BG/Q
Compiling OpenFOAM for the back-end nodes on BG/Q requires some system-specific changes to the configuration scripts of OpenFOAM and of the ThirdParty packages.
The ThirdParty MPI cannot be used; rules for the BG/Q MPI must be added.
Environment configuration:
• Configure the environment with compilers and zlib using modules:
  module load bgq-gnu
  module load zlib
OpenFOAM configuration scripts and rules:
• The files "bashrc" and "settings.sh" must be edited to insert the rules for the BG/Q MPI (a sketch follows this list).
• The c and c++ files in the wmake/rules folders must be modified for dynamic linking.
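As a rough illustration, the kind of MPI entry to add is sketched below, modelled on the existing cases in "settings.sh"; the label BGQMPI and the driver path are assumptions for a typical BG/Q installation, not the exact values used on Fermi.

  # In etc/bashrc, select the new MPI rules (assumed label):
  export WM_MPLIB=BGQMPI

  # In etc/settings.sh, add a matching case to the WM_MPLIB switch
  # (paths are assumptions for a standard BG/Q driver tree):
  BGQMPI)
      export MPI_ARCH_PATH=/bgsys/drivers/ppcfloor/comm/gcc
      export MPI_HOME=$MPI_ARCH_PATH
      _foamAddPath $MPI_ARCH_PATH/bin
      _foamAddLib  $MPI_ARCH_PATH/lib
      ;;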
Scotch library build:
• Before running "Allwmake" in the OpenFOAM main folder, some changes must be made to the compiling and dynamic-linking rules in the file "Makefile.inc" of the scotch library (see the sketch below).
• Cross-compile the "dummysizes" scotch utility and execute it on the back-end to properly build the header files scotch.h and scotchf.h.
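A minimal sketch of these changes, assuming the GNU cross-toolchain provided by the bgq-gnu module; the cross-compiler name and the runjob invocations are illustrative rather than the exact Fermi settings.

  # Makefile.inc: point the compilers at the back-end toolchain and
  # keep flags compatible with dynamic linking (illustrative values):
  #   CCS     = powerpc64-bgq-linux-gnu-gcc
  #   CCP     = mpicc
  #   CFLAGS  = -O2 -fPIC -DCOMMON_FILE_COMPRESS_GZ -DSCOTCH_RENAME
  #   LDFLAGS = -lz -lm -lrt

  # dummysizes is cross-compiled for the back-end, so it must be run
  # there (e.g. through the BG/Q launcher) to generate the headers:
  runjob --np 1 : ./dummysizes library.h   scotch.h
  runjob --np 1 : ./dummysizes library_f.h scotchf.h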
Compile:
• Go to $WM_PROJECT/$WM_PROJECT_VERSION and compile with ./Allwmake (a job-script sketch for running the result follows).
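Once built, solvers run on the back-end through the batch scheduler. Below is a minimal job-script sketch, assuming LoadLeveler and the runjob launcher; job name, node count and solver are illustrative.

  #!/bin/bash
  # Illustrative LoadLeveler job for a parallel OpenFOAM run on BG/Q.
  # @ job_name         = cavity3d
  # @ job_type         = bluegene
  # @ bg_size          = 64              # 64 nodes x 16 ranks/node = 1024 MPI tasks
  # @ wall_clock_limit = 01:00:00
  # @ queue
  runjob --ranks-per-node 16 --np 1024 : $FOAM_APPBIN/icoFoam -parallel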
Performance of OpenFOAM on BG/Q

Test cases:
• Cavity 3D – isothermal, incompressible flow. Solver: icoFoam
• BoxTurb 3D – homogeneous isotropic turbulence, compressible flow. Solver: sonicFoam
• Airfoil (wing section) – external aerodynamics. Solver: simpleFoam
• DTMB hull – marine hydrodynamics. Solver: interFoam
Performance of OpenFOAM on BG/Q
Systems
Model: IBM-BlueGene /Q (Fermi)
Processor Type: IBM PowerA2, 1.6 GHz
Computing Node: 16 cores
RAM: 16GB / node; 1GB/core
Internal Network: Network interface
with 11 links ->5D Torus
Model: Hewlett Packard C7000 (Lagrange)
Processor Type: Intel, Xeon Westmere,
2.8 GHz
Computing Node: 12 cores
RAM: 24GB / node; 2GB/core
Internal Network: Infiniband QDR/DDR Voltaire, Fat Tree
Cavity – 3D

Flow: laminar, isothermal, incompressible
Mesh: fully structured 3D
Mesh elements: cubes
Elements: 10,000,000 and 20,000,000
Partition methods: scotch, simple (see the decomposeParDict sketch below)
Solver: icoFoam
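The partition method is selected through the case's decomposeParDict; a minimal sketch follows, where the subdomain count (64) and the cubic simpleCoeffs split are illustrative values, not the ones used in these benchmarks.

  # Minimal system/decomposeParDict, written here via a shell heredoc:
  cat > system/decomposeParDict <<'EOF'
  FoamFile
  {
      version 2.0;
      format  ascii;
      class   dictionary;
      object  decomposeParDict;
  }

  numberOfSubdomains 64;

  method scotch;            // or: simple

  simpleCoeffs              // read only when method is simple
  {
      n     (4 4 4);        // 4 x 4 x 4 = 64 subdomains
      delta 0.001;
  }
  EOF

  decomposePar              # run on the front-end before submitting the solver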
Cavity – 3D – Speed-up and efficiency

Mesh: 10,000,000 cells
Solution saved at the final time step
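The plots report speed-up and parallel efficiency; assuming the usual definitions normalized to the smallest run (here n_0 = 64 cores), so that the ideal speed-up equals the core count:

  \[ S(n) = n_0 \, \frac{T(n_0)}{T(n)}, \qquad E(n) = \frac{S(n)}{n} \]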
[Figure: speed-up and efficiency vs. number of cores (64-4096), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Cavity – 3D – Speed-up and efficiency

Mesh: 10,000,000 cells
Solution saved every 10 time steps
[Figure: speed-up and efficiency vs. number of cores (64-1024), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Cavity – 3D – Profiling
[Figure: I/O overhead on simulation time, % increase vs. number of cores (64-1024); series: Fermi, Lagrange]
# Cores    Cumulative I/O (GB)    File size per core (MB)
     64           13.0                    5.10
    128           14.0                    2.50
    256           14.0                    1.33
    512           15.0                    0.75
   1024           22.0                    0.40
Number of iterations: 100
Files per core: 3
MPI_Allreduce average message size per core (B): 8 -- # cores: 1024
Average message size sent and received per core (KB): 4.6 -- # cores: 1024
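The I/O overhead plotted above is, presumably, the relative increase in simulation time when intermediate solutions are written instead of only the final one; under that assumption:

  \[ \mathrm{overhead}(n) = \frac{T_{\text{with I/O}}(n) - T_{\text{final only}}(n)}{T_{\text{final only}}(n)} \times 100\% \]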
[Figures: MPI and I/O profiling at 512 and 1024 cores]
Cavity – 3D – Speed-up and efficiency

Mesh: 20,000,000 cells
Solution saved at the final time step
[Figure: speed-up and efficiency vs. number of cores (64-4096), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Cavity – 3D – Speed-up and efficiency

Mesh: 20,000,000 cells
Solution saved every 10 time steps
[Figure: speed-up and efficiency vs. number of cores (64-1024), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Cavity – 3D – Profiling
# Cores    Cumulative I/O (GB)    File size per core (MB)
     64           18.1                    9.46
    128           18.1                    4.73
    256           18.5                    2.42
    512           22.5                    1.27
   1024           23.1                    0.63
[Figures: MPI and I/O profiling at 512 and 1024 cores]
[Figure: I/O overhead on simulation time, % increase vs. number of cores (64-1024); series: Fermi, Lagrange]
Number of iterations: 100
Files per core: 3
MPI_Allreduce average message size per core (B): 8 -- # cores: 1024
Average message size sent and received per core (KB): 6.4 -- # cores: 1024
BoxTurb – 3D

Flow: compressible
Case study: homogeneous, isotropic turbulence
Mesh: uniform 3D
Number of cells: ≈ 17,000,000
Solver: sonicFoam
Partition method: simple
Courtesy of: Matteo Cerminara (INGV), Pisa
BoxTurb – 3D – Speed-up and efficiency

Solution saved at the final time step
[Figure: speed-up and efficiency vs. number of cores (64-2048), simple partition method; series: Fermi, Lagrange, Ideal]
BoxTurb – 3D – Speed-up and efficiency

Solution saved every 10 time steps
[Figure: speed-up and efficiency vs. number of cores (64-1024), simple partition method; series: Fermi, Lagrange, Ideal]
BoxTurb – 3D – Profiling
[Figure: I/O overhead on simulation time, % increase vs. number of cores (64-1024); series: Fermi, Lagrange]
# Cores    Cumulative I/O (GB)    File size per core (MB)
     64           18.4                    4.50
    128           18.4                    2.25
    256           18.6                    1.14
    512           19.6                    0.60
   1024           21.2                    0.32
[Figures: MPI and I/O profiling at 512 and 1024 cores]
Number of iterations: 180
Files per core: 4
MPI_Allreduce average message size per core (B): 8 -- # cores: 1024
Average message size sent and received per core (KB): 9.3 -- # cores: 1024
Airfoil – wing section

Flow: turbulent, incompressible
Case study: steady state, extruded NACA airfoil
Mesh: fully structured 3D
Number of cells: ≈ 9,000,000
Solver: simpleFoam
Partition methods: simple, scotch
Airfoil – wing section – Speed-up and efficiency

Solution saved at the final time step
[Figure: speed-up and efficiency vs. number of cores (64-1024), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Airfoil – wing section – Profiling

[Figures: MPI profiling with simple and scotch partitioning, 512 cores]
Airfoil – wing section – Speed-up and efficiency

Solution saved every 100 time steps
[Figure: speed-up and efficiency vs. number of cores (64-1024), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Airfoil – wing section – Profiling

[Figures: MPI and I/O profiling at 512 and 1024 cores]
# Cores    Cumulative I/O (GB)    File size per core (MB)
     64            5.6                    1.46
    128            5.8                    0.76
    256            6.6                    0.43
    512            7.9                    0.26
   1024           12.0                    0.20
Number of iterations: 1000
Files per core: 6
MPI_Allreduce average message size per core (B): 8 -- # cores: 512
Average message size sent and received per core (KB): 4.2 -- # cores: 512
[Figure: I/O overhead on simulation time, % increase vs. number of cores (64-1024), scotch decomposition; series: Fermi, Lagrange]
Free surface – DTMB hull – 3D

Flow: turbulent, incompressible
Case study: unsteady, multiphase
Mesh: unstructured 3D
Number of cells: ≈ 5,500,000
Solver: interFoam
Partition methods: simple, scotch
Free surface – DTMB hull – 3D – Speed-up and efficiency

Solution saved at the final time step
[Figure: speed-up and efficiency vs. number of cores (32-512), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Free surface – DTMB hull – 3D – Speed-up and efficiency

Solution saved every 10 time steps
[Figure: speed-up and efficiency vs. number of cores (32-512), simple and scotch partition methods; series: Fermi, Lagrange, Ideal]
Free surface – DTMB hull – 3D – Profiling
# Cores    Cumulative I/O (GB)    File size per core (MB)
     64           18.4                    4.50
    128           18.4                    2.25
    256           18.6                    1.14
    512           19.6                    0.60
Number of iterations: 100
Files per core: 8
MPI_Allreduce average message size per core (B): 8 -- # cores: 512
Average message size sent and received per core (KB): 29.4 -- # cores: 512
[Figure: I/O overhead on simulation time, % increase vs. number of cores (32-512); series: Fermi, Lagrange]
[Figures: MPI and I/O profiling at 256 and 512 cores]
Conclusions
OpenFOAM scaling and efficiency on Fermi and on classic HPC systems are comparable; for well-suited case studies with a good balance between computation, I/O and MPI communication, the larger number of cores available on Fermi can be exploited.
OpenFOAM efficiency and scaling are constrained by its poor I/O design and by inter-process communication.
A new I/O scheme based on MPI parallel I/O routines, or on available parallel I/O libraries able to use the parallel file system efficiently, should dramatically reduce the I/O overhead.
A hybrid MPI/OpenMP multi-threaded version of the solvers would mitigate the time spent in MPI routines as the number of cores increases.
Acknowledgements

Bob Danani, VLSCI, Carlton, Melbourne
Matteo Cerminara, INGV
Massimiliano Culpo, CINECA
Piero Lanucara, CINECA
Andrea Penza, CINECA
Francesco Salvadore, CINECA
Ivan Spisso, CINECA
Questions?