DUNE on Current and Next Generation HPC Platforms
Dr. Markus Blatt
HPC-Simulation-Software & Services
Forschungszentrum Jülich, Germany, March 8, 2012
M. Blatt (HPC-Sim-Soft, Heidelberg) DUNE Blue Gene Jülich 2012 1 / 33
Outline
1 DUNE
2 Parallelization
  Parallel Grid Interface
  Parallel Iterative Solvers
  Parallel Algebraic Multigrid
  Scalability
  A Glimpse at other DUNE projects
3 Trends and Outlook for HPC and DUNE
4 HPC-Simulation-Software & Services
DUNE
Why we created DUNE!
Problems with most PDE software
• Most packages support only one special set of features:
  • IPARS: block-structured, parallel, multiphysics.
  • Alberta: simplicial, unstructured, bisection refinement.
  • UG: unstructured, multi-element, red-green refinement, parallel.
  • QuocMesh: fast, on-the-fly structured grids.
• Other features are either unsupported or supported inefficiently.
The idea of DUNE
• Separation of data structures and algorithms.
• Easy exchange of interface implementations.
• Reuse of legacy software.
• Fine grained interfaces.
• C++ with templates.
DUNE
Modularity
Figure: Module dependencies. dune-common is the base; dune-grid, dune-istl, and dune-localfunctions build on it; dune-fem and dune-pdelab (with dune-grid-howto and dune-pdelab-howto) sit on top. External packages: ALU, UG, Alberta, NeuronGrid, VTK, Gmsh, SuperLU, Metis.
• Grid interface: (non-)conforming, hierarchical grids.
• Iterative Solver Template Library: Dense and sparse linear algebra.
• PDELab: Discretization module based on residual formulation.
DUNE
PDELab: Plug, Code and Play Simulation Software
• Choose:
  • Grid
  • Finite element
  • Maybe implement a local operator.
  • Time stepping scheme.
  • (Non-)linear solvers.
• Recompile the application.
• Run efficiently.
DUNE
Sample Simulations
Transport in porous media, Density-driven flow, Flow around Root networks, Neuron network simulations
• Electromagnetics
• Computational neuroscience: biophysically realistic networks of neurons
• Geostatistical inversion: coping with uncertain parameters
• Linear acoustics
• Multiphase flow and transport in porous media
• Density-driven flow
Parallelization
Parallelization
Parallel Grids
• Domain decomposition (overlapping or non-overlapping).
• Load-balancing
• Message passing based on MPI is handled by grid manager.
Parallel linear algebra
• Message passing decoupled from grid.
• Abstraction: Parallel index sets to identify data globally.
• Reuse of efficient sequential linear algebra.
• Minimize communication.
Parallelization Parallel Grid Interface
An Example of a Parallel Grid
Figure: A grid distributed over three processes c = 0, 1, 2. First row: with overlap and ghosts; second row: with overlap only; third row: with ghosts only. Legend: interior, overlap, ghost, border, front, not stored.
Parallelization Parallel Grid Interface
Parallel Grids in Dune
• YaspGrid
  • structured
  • 2D/3D
  • arbitrary overlap
• UGGrid
  • unstructured
  • 2D/3D
  • multi-element
  • one layer of ghost cells
  • (conforming) red-green refinement
  • (non-free!)
• ALUGrid
  • unstructured
  • 3D
  • tetrahedral or hexahedral elements
  • ghost cells
  • (non-conforming) bisection refinement
Parallelization Parallel Iterative Solvers
Index Sets
Index Set
• Distributed overlapping index set I = ⋃_{p=0}^{P−1} I_p.
• Process p manages the mapping I_p → [0, n_p).
• It might store information about the mapping only for shared indices.
Global Index
• Identifies a position (index) globally.
• Arbitrary and not consecutive (to support adaptivity).
• Persistent.
• On JUGENE this is not an int, to get rid of the 32-bit limit!
Local Index
• Addresses a position in the local container.
• Convertible to an integral type.
• Consecutive index starting from 0.
• Non-persistent.
• Provides an attribute to identify ghost region.
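The index-set idea above can be sketched in plain C++. This is a simplified illustration with hypothetical names, not the actual dune-istl ParallelIndexSet API: each process maps an arbitrary, persistent 64-bit global index to a consecutive local index plus an attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Simplified sketch of the index-set idea; names are hypothetical and do
// not match the real dune-istl ParallelIndexSet API.
enum class Attribute { owner, ghost };

struct LocalIndex {
  std::size_t local;    // consecutive, starting from 0, non-persistent
  Attribute attribute;  // identifies e.g. the ghost region
};

class ParallelIndexSet {
  // The global index is 64-bit: on machines the size of JUGENE a 32-bit
  // int would overflow. Global indices may be arbitrary and non-consecutive.
  std::map<std::uint64_t, LocalIndex> indices_;

public:
  void add(std::uint64_t global, Attribute a) {
    // assign the next consecutive local index in insertion order
    indices_.emplace(global, LocalIndex{indices_.size(), a});
  }
  const LocalIndex& operator[](std::uint64_t global) const {
    return indices_.at(global);
  }
  std::size_t size() const { return indices_.size(); }
};
```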
Parallelization Parallel Iterative Solvers
Remote Information and Communication
Remote Index Information
• For each process q, process p knows all common global indices together with their attribute on q.
Communication
• Target and source partitions of the indices are chosen using attribute flags, e.g. from ghost to owner and ghost.
• If remote index information about q is available on p, then p sends all the data in one message.
• All communication takes place asynchronously and concurrently.
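The attribute-driven selection step can be sketched as follows. The names are hypothetical; the real communicator additionally uses the remote index information per neighbor and posts the buffer with an asynchronous MPI send.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of attribute-driven message packing, not the real
// dune-istl communicator: values whose attribute matches the source flag
// are packed, in index order, into one buffer per neighboring process.
enum class Attr { owner, ghost };

std::vector<double> packForSend(const std::vector<double>& data,
                                const std::vector<Attr>& attr,
                                Attr source) {
  std::vector<double> buffer;
  for (std::size_t i = 0; i < data.size(); ++i)
    if (attr[i] == source)  // e.g. owner values destined for ghost copies
      buffer.push_back(data[i]);
  return buffer;  // all selected values travel in one message
}
```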
Parallelization Parallel Iterative Solvers
Parallel Matrix Representation
• Let I_i be a nonoverlapping decomposition of our index set I.
• Ĩ_i is the augmented index set such that for all j ∈ I_i with |a_jk| + |a_kj| ≠ 0, also k ∈ Ĩ_i holds.
• With rows and columns ordered so that I_i comes first, the locally stored matrix has the block form

      ( A_{I_i,I_i}  * )
      (      0       I )

• Therefore Av can be computed locally for the entries associated with I_i if v is known on Ĩ_i.
• A communication step ensures consistent ghost values.
• The matrix can be used for hybrid preconditioners.
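The block structure above means a local matrix-vector product already yields correct interior entries. A toy dense example (assumed layout: local indices 0 and 1 are interior, index 2 is a ghost row stored as an identity row):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec = std::array<double, 3>;
using Mat = std::array<Vec, 3>;

// Dense stand-in for the locally stored sparse matrix: rows 0 and 1 are
// interior (I_i) rows, row 2 is a ghost row kept as an identity row.
Vec localMatVec(const Mat& A, const Vec& x) {
  Vec y{};
  for (std::size_t i = 0; i < 3; ++i)
    for (std::size_t j = 0; j < 3; ++j)
      y[i] += A[i][j] * x[j];
  return y;  // correct on interior entries; y[2] just copies x[2]
}
```

With A = ((2, −1, −1), (−1, 2, 0), (0, 0, 1)) and x = (1, 2, 3) the interior entries y[0] = −3 and y[1] = 3 equal the true Av entries, while the ghost entry keeps x[2] until the subsequent communication step makes it consistent.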
Parallelization Parallel Algebraic Multigrid
Algebraic Multigrid (AMG)
Stationary Iterative Methods
• Error reduction stagnates with an increasing number of iterations and unknowns.
• Only high-frequency errors are reduced.
Algebraic Multigrid
• Approximate the smooth residual on a coarser grid and solve there.
• Calculate a correction on that level.
• Interpolate the correction to the fine grid and add it to the current guess.
• Use the algebraic nature of the problem to define the coarse level.
• Coarsening adapts to the problem and grid automatically.
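The steps above can be sketched as a two-grid cycle for the 1D Poisson matrix tridiag(−1, 2, −1). This is a self-contained illustration with a damped Jacobi smoother and pairwise piecewise-constant coarsening, not DUNE's AMG implementation:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Residual r = f - A u for the 1D Poisson matrix A = tridiag(-1, 2, -1).
std::vector<double> residual(const std::vector<double>& u,
                             const std::vector<double>& f) {
  std::vector<double> r(u.size());
  for (std::size_t i = 0; i < u.size(); ++i) {
    double Au = 2.0 * u[i];
    if (i > 0) Au -= u[i - 1];
    if (i + 1 < u.size()) Au -= u[i + 1];
    r[i] = f[i] - Au;
  }
  return r;
}

// Damped Jacobi: reduces high-frequency error, stagnates on smooth error.
void jacobi(std::vector<double>& u, const std::vector<double>& f, int sweeps) {
  for (int s = 0; s < sweeps; ++s) {
    std::vector<double> r = residual(u, f);
    for (std::size_t i = 0; i < u.size(); ++i)
      u[i] += (2.0 / 3.0) * r[i] / 2.0;  // omega = 2/3, diagonal entry = 2
  }
}

double norm(const std::vector<double>& v) {
  double s = 0.0;
  for (double x : v) s += x * x;
  return std::sqrt(s);
}

// One two-grid cycle with pairwise piecewise-constant aggregation; for this
// matrix the Galerkin coarse operator is again tridiag(-1, 2, -1), so the
// fine-level routines can be reused on the coarse level.
void twoGrid(std::vector<double>& u, const std::vector<double>& f) {
  jacobi(u, f, 3);                         // pre-smoothing
  std::vector<double> r = residual(u, f);  // smooth residual
  std::size_t nc = r.size() / 2;
  std::vector<double> rc(nc), ec(nc, 0.0);
  for (std::size_t j = 0; j < nc; ++j)     // restrict: sum over each aggregate
    rc[j] = r[2 * j] + r[2 * j + 1];
  jacobi(ec, rc, 200);                     // coarse solve (iterative stand-in)
  for (std::size_t j = 0; j < nc; ++j) {   // interpolate correction, add it
    u[2 * j] += ec[j];
    u[2 * j + 1] += ec[j];
  }
  jacobi(u, f, 3);                         // post-smoothing
}
```

One cycle on an oscillatory initial guess already reduces the residual norm noticeably, which is exactly the division of labor described above: the smoother removes high frequencies, the coarse-grid correction the smooth remainder.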
Parallelization Parallel Algebraic Multigrid
Aggregation AMG
Simple, non-smoothed version
• Piecewise constant prolongators Pl .
• Heuristic and greedy aggregation algorithm.
• A_{l-1} = P_l^T A_l P_l
• Proposed by Raw, Vanek et al., Braess
• Preconditioner for Krylov methods.
Observations
• Reasonable coarse-grid operator for systems.
• Preserves FV discretization.
• Very memory efficient.
• Fast and scalable V-cycle.
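With piecewise constant prolongators, column I of P_l is the indicator vector of aggregate I, so the Galerkin product A_{l-1} = P_l^T A_l P_l reduces to summing fine-level entries over aggregate pairs. A small dense sketch (hypothetical helper, not the dune-istl code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Galerkin product for piecewise constant prolongators:
// (P^T A P)_{IJ} = sum of A_{ij} over i in aggregate I, j in aggregate J.
Matrix galerkinProduct(const Matrix& A, const std::vector<int>& aggregate,
                       int nCoarse) {
  Matrix Ac(nCoarse, std::vector<double>(nCoarse, 0.0));
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < A.size(); ++j)
      Ac[aggregate[i]][aggregate[j]] += A[i][j];
  return Ac;
}
```

For tridiag(−1, 2, −1) on four unknowns with aggregates {0, 1} and {2, 3} this yields the 2×2 matrix ((2, −1), (−1, 2)).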
Parallelization Parallel Algebraic Multigrid
Illustration Parallel Setup
Decoupled Aggregation
Communicate Ghost Aggregates
Parallelization Parallel Algebraic Multigrid
Parallel Setup Phase
• Every process builds aggregates in its owner region.
• One communication with next neighbors updates the aggregate information in the ghost region.
• Coarse level index sets are a subset of the fine level.
• Remote index information can be deduced locally.
• Galerkin product can be calculated locally.
Parallelization Parallel Algebraic Multigrid
Data Agglomeration on Coarse Levels
Figure: Number of vertices per processor versus number of non-idle processors n1, for levels L down to L−7; the repartitioned levels (L−2)', (L−4)', (L−6)' move the data toward the coarsen target.
• Repartition the data onto fewer processes.
• Use METIS on the graph of the communication pattern. (ParMETIS cannot handle the full machine!)
Parallelization Scalability
Weak Scalability Results (Poisson)
procs   1/H   lev.  TB     TS     It  TIt    TT
1       80    5     19.86  31.91  8   3.989  51.77
8       160   6     27.7   46.4   10  4.64   74.2
64      320   7     74.1   49.3   10  4.93   123
512     640   8     76.91  60.2   12  5.017  137.1
4096    1280  10    81.31  64.45  13  4.958  145.8
32768   2560  11    92.75  65.55  13  5.042  158.3
262144  5120  12    188.5  67.66  13  5.205  256.2
Parallelization Scalability
Clipped Log-Random Problem
• −∇ · (k(x)∇u) = f in Ω
• κ(x): realization of a log-random field with variance σ², mean 0, and correlation length λ.
• k(x): binary medium constructed from κ(x).
• Weak scaling: λ scales with mesh width h.
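Constructing the binary medium from the realization κ(x) can be sketched as a simple clipping step. The contrast values below are illustrative, and the correlated sampling of κ itself is omitted:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative clipping step: since kappa has mean 0, thresholding at 0
// yields a binary medium. kLow and kHigh are hypothetical permeability
// values, not the ones used in the benchmark.
std::vector<double> clipToBinary(const std::vector<double>& kappa,
                                 double kLow, double kHigh) {
  std::vector<double> k(kappa.size());
  for (std::size_t i = 0; i < kappa.size(); ++i)
    k[i] = kappa[i] >= 0.0 ? kHigh : kLow;
  return k;
}
```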
Parallelization Scalability
Weak Scalability Results (Poisson vs. Clipped)
Parallelization Scalability
Weak Scalability Results (Clipped Log-Random Problem)
• σ² = 8, λ = 4h
procs   1/h   lev.  TB     TS     It  TIt    TT
1       80    5     19.93  49.39  12  4.116  69.32
8       160   6     28.1   73.7   15  4.91   102
64      320   7     75.1   105    20  5.26   180
512     640   8     80.11  134    25  5.362  214.1
4096    1280  10    84.71  171.7  33  5.203  256.4
32768   2560  11    93.24  189.5  36  5.264  282.7
262144  5120  12    195.9  386.5  72  5.368  582.5
Parallelization Scalability
Parallel Groundwater Simulation
Figure: Cut through the ground beneath an acre
• Highly discontinuous permeability of the ground.
• 3D simulations with high resolution.
• Efficient and robust parallel iterative solvers.
−∇ · (K (x)∇u) = f in Ω (1)
Parallelization Scalability
Weak Scalability Results II
• Richards equation.
• 64 × 64 × 128 unknowns per process.
• 1.25 · 10^11 unknowns on the full JUGENE.
• One time step in the simulation.
Parallelization Scalability
Efficiency Solver Components
Parallelization Scalability
Efficiency IO
• Highly tuned by Olaf Ippisch
• SionLib from Jülich rocks!
• Still: IO not very scalable!
Parallelization A Glimpse at other DUNE projects
Robert Klöfkorn
Projects using DUNE-FEM on JUGENE
DFG SPP MetStröm: Adaptive Numerics for Multiscale Phenomena
· D. Kröner, S. Brdar (Freiburg)
· M. Baldauf, D. Schuster (DWD)
· R. Klöfkorn (Stuttgart)
· A. Dedner (Warwick)
Mountain wave test case: work by S. Brdar
BW Stiftung HPC-11: Simulation of 2-stroke engines with detailed combustion
· D. Kröner, D. Lebiedz, M. Nolte, M. Fein (Freiburg)
· R. Klöfkorn (Stuttgart)
· A. Dedner (Warwick)
Work by D. Trescher
Parallelization A Glimpse at other DUNE projects
Robert Klöfkorn
Strong scaling on Blue Gene/P
Table: Strong scaling and efficiency on the supercomputer JUGENE (Jülich, Germany)
#cores  #cells/core¹  #DOFs/core  time (ms)²  speed-up  efficiency
512     474           296250      46216       —         —
4096    59            36875       6294        7.34      0.91
32768   7             4375        949         48.71     0.76
65536   3             1875        504         91.70     0.72
Navier-Stokes equations solved with CDG2 (k = 4, 3D)
Overall number of cells: 243,000 (#DOFs ≈ 1.52 · 10^8) on a Cartesian grid
≈ 6.1 GB memory consumption on a desktop machine
Explicit Runge-Kutta method of order 3
Programming techniques for performance:
· template metaprogramming
· automated code generation of DG kernels
· hybrid parallelization (MPI/pthreads, work in progress)
· overlap of computation and communication (DCMF_INTERRUPT=1)
¹ average #cells/core
² average run time per time step
Trends and Outlook for HPC and DUNE
(My) Parallel Machines
Helics I (2003)
• 256 nodes
• Dual AMD Athlon, 1.4 GHz
• 5.9 GFLOPS peak/node
• 1 GB main memory/node
• Myrinet 2 Gbit
Helics II (2007)
• 156+4 nodes
• 2× dual-core AMD Opteron 2220, 2.8 GHz
• 18.8 GFLOPS peak/node
• 8 GB RAM/node (21.4 GB/s)
• Myricom 10 Gbit
Blue Gene/P (2009)
• 73728 nodes
• PowerPC 450 quad-core, 850 MHz
• 13.6 GFLOPS peak/node
• 2 GB RAM/node (13.6 GB/s)
Trends and Outlook for HPC and DUNE
Observations in Parallel Computing
Software
• Solution of time-dependent (nonlinear) equations with implicit time stepping schemes.
• Most time consuming: Solution of linear system.
• Peak GFLOPS out of reach!
• Methods are limited by memory bandwidth.
Hardware
• Costs for compute power drop fast (2002: 12 USD/MFLOP; 2011: 0.01 USD/MFLOP).
• Costs for main memory drop only slightly.
• Main memory not power efficient.
Trends and Outlook for HPC and DUNE
Current Hardware/Software Trends
The hardware manufacturers' solution (Green Computing)
• More cores per node
• Less main memory per core.
• SIMD (Blue Gene/Q, GPGPU)
• Increase GFLOPS per GB/s of main memory bandwidth.
• Faster network interconnects.
How does DUNE cope?
• Memory efficiency!
• Minimize communication!
• Favor cheaper iterative methods even if they converge more slowly!
• Time to solution / scalability matters most!
• Ability to compute bigger problems faster.
Software is always two steps behind.
Trends and Outlook for HPC and DUNE
Current Work and Future Plans
Software Point of View
• Hybrid parallelization in ALUGrid (Klöfkorn, University of Stuttgart)
• Borrow ideas from GPGPU computing (coalesced memory access).
• Check out cache-oblivious algorithms.
• Check out parallel-in-time algorithms.
Application Point of View
• Inverse modeling
  • Geostatistical inversion: University of Heidelberg (Ippisch, Ngo) and Tübingen (Cirpka, Schwede)
  • Run several parallel forward simulations in parallel.
Trends and Outlook for HPC and DUNE
DUNE on Blue Gene/Q?
Advantages of a central installation:
• Saves scientists a lot of time.
• Would help to optimize DUNE for the platform.
• Possibility of professional installation and user support.
• Closer cooperation of DUNE and IBM brings benefits to all.
HPC-Simulation-Software & Services
What can we do for you?
Efficient simulation software made to measure.
Hans-Bunte-Str. 8-10, 69123 Heidelberg, Germany
http://www.dr-blatt.de