How We Use MPI: A Naïve Curmudgeon's View

Bronson Messer
Scientific Computing Group, Leadership Computing Facility, National Center for Computational Sciences, Oak Ridge National Laboratory
Theoretical Astrophysics Group, Oak Ridge National Laboratory
Department of Physics & Astronomy, University of Tennessee, Knoxville
Why do we (I and other idiot astrophysicists) use MPI?
• It is ubiquitous!
• … and everywhere it exists, it performs
  – OK, 'performs' connotes good performance, but 'poor' performance on a given platform is always met with alarm
  – AND, we have now figured out how to ameliorate some shortcomings in performance through avoidance
• … and it's pretty darn easy to use
  – Even 'modern' (i.e., grew up with their notion of 'computer' meaning 'information appliance') grad students can figure out how to program poorly using MPI in a matter of days
That's it!
Importantly, right now our need for expressiveness is close to being met.
Selected Petascale Science Drivers: We Have Worked with Science Teams to Understand and Define Specific Science Objectives

Combustion (S3D)
  Science driver: Predictive engineering: a simulation tool for new engine design
  Science objective: Understanding flame stabilization in lifted, autoigniting diesel fuel jets relevant to low-temperature combustion for engine design at realistic operating conditions
  Impact: Potential for a 50% increase in efficiency and 20% savings in petroleum consumption with lower-emission, leaner-burning engines

Fusion (GTC)
  Science driver: Understand and quantify physics and properties of ITER scaling and H-mode confinement
  Science objective: Strongly coupled and consistent wall-to-edge-to-core modeling of ITER plasmas; attain a realistic assessment of ignition margins
  Impact: ITER design and operation

Chemistry (MADNESS)
  Science driver: Computational catalysis
  Science objective: Describe large systems accurately with modern hybrid and meta density functional theory functionals
  Impact: Generate quantitative catalytic reaction rates and guide small-system calibration

Nanoscale Science (DCA++)
  Science driver: Material-specific understanding of high-temperature superconductivity theory
  Science objective: Understand the quantitative differences in the transition temperatures of high-temperature superconductors
  Impact: Macroscopic quantum effects at elevated temperatures (>150 K); new materials for power transmission and oxide electronics

Climate (POP)
  Science driver: Accurate representation of ocean circulation
  Science objective: Fully coupled eddy-resolving ocean and sea ice model to reduce the coupled-model biases where ice and deep-water parameters are governed by the accurate representation of current systems
  Impact: Reduce current uncertainties in the coupled ocean-sea ice system model

Geoscience (PFLOTRAN)
  Science driver: Perform multiscale, multiphase, multi-component modeling of a 3-D field CO2 injection scenario
  Science objective: Include an oil phase and a four-phase liquid-gas-aqueous-oil system to describe dissipation of the supercritical CO2 phase and escape of CO2 to the surface
  Impact: Demonstrate the viability of and potential for sequestration of anthropogenic CO2 in deep geologic formations

Astrophysics (CHIMERA)
  Science driver: Understand the core-collapse supernova mechanism for a range of progenitor star masses
  Science objective: Perform core-collapse simulations with sophisticated spectral neutrino transport, detailed nuclear burning, and general relativistic gravity
  Impact: Understand the origin of many elements in the Periodic Table and the creation of neutron stars and black holes
Science Workload: Job Sizes and Resource Usage of Key Applications

Code | 2007 Resource Utilization (M core-hours) | Projected 2008 Resource Utilization (M core-hours) | Typical Job Size in 2006-2007 (K cores) | Anticipated Job Size in 2008 (K cores)
CHIMERA | 2 (under development) | 16 | 0.25 (under development) | >10
GTC | 8 | 7 | 8 | 12
S3D | 6.5 | 18 | 8-12 | >15
POP | 4.8 | 4.7 | 4 | 8
MADNESS | 1 (under development) | 4 | 0.25 (under development) | >8
DCA++ | N/A (under development) | 3-8 | N/A (under development) | 4-16 (w/o disorder), >40 (with disorder)
PFLOTRAN | 0.37 (under development) | >2 | 1-2 (under development) | >10
AORSA | 0.61 | 1 | 15-20 | >20
Total aggregate allocation for CHIMERA production & GenASiS development this FY: 38 million CPU-hours (16M INCITE, 18M NSF, 4M NERSC)
Current Planned Pioneering Application Runs: Simulation Specs on the 250 TF Jaguar System*

Code | Quad-Core Nodes | Global Memory Reqm (TB) | Wall-Clock Time Reqm (hours) | Number of Runs | Local Storage Reqms (TB) | Archival Storage Reqms (TB) | Resolution and Fidelity
MADNESS 7824 48122 1012 5 50 | 600B coefficients
CHIMERA 78244045 168 100100 11 13 50 | 256x128x256 or 256x90x180; 20 energy groups, 14 alpha nuclei
GTC-S | 3900 | 40 | 36 | 2 | 3 | 50 | 600M grid points, 60B particles
GTC-C | 3900 | 60 | 36 | 2 | 5 | 50 | 400M grid points, 250B particles
DCA++ | 2000-6000 | 16-48 | 12 to 24 | 20 | 1 | 1 | Lattices of 16 to 32 sites; 80 to 120 time slices; O(10^2-10^3) disorder realizations
S3D | 7824 | 10 | 140 | 1 | 50 | 100 | 1B grid points, 15 μm grid spacing, 4 ns time step, 23 transport vars
POP | 2500 | 1 | 400 | 1 | 1 | 2 | 3600x2400x42 tripole grid (0.1°); 20-yr run; partial bottom cells; first with biogeochemistry at this scale
Multi-physics applications are very good present-day laboratories for multi-core ideas.
Current workhorse: mCHIMERA
  – Ray-by-ray MGFLD transport (E)
  – 3D (magneto)hydrodynamics
  – 150-species nuclear network

Possible future workhorse: bCHIMERA
  – Ray-by-ray Boltzmann transport (E)
  – 3D (magneto)hydrodynamics
  – 150-300-species nuclear network

The "Ultimate Goal"
  – Full 3D Boltzmann transport (E, φ)
  – 3D (magneto)hydrodynamics
  – 150-300-species nuclear network

Bruenn et al. (2006); Messer et al. (2007)
Pioneering Application: CHIMERA*
Physical Models and Algorithms

Physical models
• A "chimera" of three separate yet mature codes
  – Coupled into a single executable
• Three primary modules ("heads")
  – MVH3: stellar gasdynamics
  – MGFLD-TRANS: "ray-by-ray-plus" neutrino transport
  – XNET: thermonuclear kinetics
• The heads are augmented by
  – A sophisticated equation of state for nuclear matter
  – A self-gravity solver capable of an approximation to general-relativistic gravity

Numerical algorithms
• Directionally split hydrodynamics with a standard Riemann solver for shock capturing
• Solutions for ray-by-ray neutrino transport and thermonuclear kinetics are obtained during the radial hydro sweep
  – All necessary data for those modules is local to a processor during the radial sweep
  – Computed along each radial ray using only data that is local to that ray
• Physics modules are coupled with standard operator splitting (a toy sketch follows below)
  – Valid because the characteristic time scales for each module are widely disparate
• Neutrino transport solution: sparse linear solve, local to a ray
• Nuclear burning solution: dense linear solve, local to a zone

[Figure: early-time distribution of entropy in a 2D exploding core-collapse simulation]

* Conservative Hydrodynamics Including Multi-Energy Ray-by-ray Transport
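To make the coupling concrete, here is a toy, self-contained sketch of first-order (Lie) operator splitting, one standard form of the operator splitting mentioned above. The two stand-in operators (think "hydro" and "burn"), their rates, and the step size are invented for the illustration; this is not CHIMERA code.

```fortran
! Toy sketch of first-order (Lie) operator splitting: du/dt = -(a+b)*u is
! advanced by applying the "a" operator and then the "b" operator in
! sequence over each step. The rates a and b are deliberately disparate.
program splitting_sketch
  implicit none
  real(kind=8), parameter :: a = 1.0d0, b = 50.0d0   ! stand-in, widely separated rates
  real(kind=8) :: u, dt
  integer :: n
  u  = 1.0d0
  dt = 1.0d-3
  do n = 1, 1000
     u = u*exp(-a*dt)   ! sub-step 1: exact update for operator A alone
     u = u*exp(-b*dt)   ! sub-step 2: exact update for operator B alone
  end do
  ! For commuting linear operators the split result is exact; in general the
  ! splitting error is first order in dt, which is acceptable when the
  ! operators act on widely separated time scales.
  print '(a,es12.4,a,es12.4)', 'split u(t=1) = ', u, '  exact = ', exp(-(a+b)*1.0d0)
end program splitting_sketch
```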
CHIMERA is: a "chimera" of 3 separate, mature codes

VH1 (MVH3)
• Multidimensional hydrodynamics
• http://wonka.physics.ncsu.edu/pub/VH-1/
• Non-polytropic EOS
• 3D domain decomposition
  – Uses directional sweeps to define subcommunicators for the data transpose (MPI_Alltoall)
  – Results in all processes performing a 'several_to_several'
MVH3: Dicing instead of slicing

Using M*N processors; X data starts local to each proc:
  jcol = mod(mype, N)
  krow = mype / N
  mpi_comm_split(mpi_comm_world, krow, mype, mpi_comm_row)
  mpi_comm_split(mpi_comm_world, jcol, mype, mpi_comm_col)

[Diagram: the M*N ranks laid out as a logical grid in Y and Z; mype+1 runs 1 .. M*N, jcol = 0 .. N-1 across each row, krow = 0 .. M-1 down each column.]

Y hydro is done after transposing data only with processors having the same value of krow, via MPI_ALLTOALL( MPI_COMM_ROW ): I and J are transposed while K is kept constant. The Z sweep uses MPI_ALLTOALL( MPI_COMM_COL ) analogously.

zro(imax,js,ks): local data includes all of the X domain, but only portions of Y and Z. (A minimal code sketch of this pattern follows below.)
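Below is a minimal, self-contained sketch, assuming an M*N rank grid, of the communicator split and MPI_ALLTOALL transpose just described. It illustrates the pattern only and is not the MVH3 source; the grid dimensions (nrow_M, ncol_N) and block length (nloc) are made-up placeholders.

```fortran
! Sketch of the "dicing" communicator setup: row/column subcommunicators
! from MPI_COMM_SPLIT, then an all-to-all transpose within one row.
program dicing_sketch
  use mpi
  implicit none
  integer, parameter :: nrow_M = 4, ncol_N = 4     ! hypothetical M and N
  integer, parameter :: nloc = 8                   ! stand-in block length per partner
  integer :: ierr, mype, npes, jcol, krow
  integer :: mpi_comm_row, mpi_comm_col
  real(kind=8) :: sendbuf(nloc*ncol_N), recvbuf(nloc*ncol_N)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
  if (npes /= nrow_M*ncol_N) then
     if (mype == 0) print *, 'run with ', nrow_M*ncol_N, ' ranks'
     call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
  end if

  ! Rank layout as described above: jcol indexes within a row of the
  ! processor grid, krow selects the row.
  jcol = mod(mype, ncol_N)
  krow = mype / ncol_N

  ! All ranks sharing krow form a "row" communicator; all ranks sharing
  ! jcol form a "column" communicator.
  call MPI_COMM_SPLIT(MPI_COMM_WORLD, krow, mype, mpi_comm_row, ierr)
  call MPI_COMM_SPLIT(MPI_COMM_WORLD, jcol, mype, mpi_comm_col, ierr)

  sendbuf = real(mype, kind=8)

  ! Transpose among the N ranks of one row before the Y sweep: every rank
  ! exchanges an equal-sized block with every other rank in MPI_COMM_ROW
  ! (the "several_to_several" pattern).
  call MPI_ALLTOALL(sendbuf, nloc, MPI_DOUBLE_PRECISION, &
                    recvbuf, nloc, MPI_DOUBLE_PRECISION, mpi_comm_row, ierr)

  ! The Z sweep would make the analogous call on MPI_COMM_COL.
  call MPI_FINALIZE(ierr)
end program dicing_sketch
```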
MGFLD-TRANS
• Multi-group (energy) neutrino radiation hydro solver
• GR corrections
• 4 neutrino flavors with many modern interactions included
• Flux limiter is "tuned" from Boltzmann transport simulations
XNET
• Nuclear kinetics solver
• Currently have implemented only an α network
• 150 species to be included in future simulations
• Custom interface routine written for CHIMERA
• All else is 'stock'
CHIMERA
How does CHIMERA work?
[Diagram: the r, ϑ, φ, and ν dimensions of the computational domain and the three modules, VH1/MVH3, MGFLD-TRANS, and XNET, that operate on them]
Example: XNET performance and implementation
• XNET runs at ~50% of peak on a single XT4 processor
  – Roughly 50% Jacobian build / 50% dense solve
• 1 XNET solve is required per SPATIAL ZONE (i.e. hundreds per ray)
• Best load balancing on a node, with OpenMP or a subcommunicator, is interleaved (a toy sketch follows below)

[Diagram: along a ray from r = 0 (hot, lots of burning) to r = rmax (cool, little burning), zones are assigned to workers in the interleaved pattern 1 2 3 4 1 2 3 4 …]
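The effect of interleaving can be seen with a toy serial sketch (not CHIMERA/XNET code): per-zone burn cost falls off with radius, so round-robin assignment of zones to workers evens out the work that blocked assignment piles onto whoever owns the hot inner zones. The zone count, worker count, and cost model below are invented for the illustration.

```fortran
! Toy comparison of blocked vs. interleaved assignment of radial zones.
program interleave_sketch
  implicit none
  integer, parameter :: nzones = 256, nworkers = 4   ! made-up sizes
  real :: cost(nzones), load(nworkers)
  integer :: iz, iw, blk

  ! Hypothetical per-zone burn cost: hot inner zones are expensive.
  do iz = 1, nzones
     cost(iz) = exp(-4.0*real(iz-1)/real(nzones-1))
  end do

  ! Blocked assignment: worker iw gets a contiguous slab of zones,
  ! so worker 1 ends up with nearly all of the burning work.
  blk = nzones/nworkers
  load = 0.0
  do iz = 1, nzones
     iw = (iz-1)/blk + 1
     load(iw) = load(iw) + cost(iz)
  end do
  print '(a,f8.3)', 'blocked     max/mean load: ', maxval(load)/(sum(load)/nworkers)

  ! Interleaved assignment: zone iz goes to worker mod(iz-1,nworkers)+1,
  ! i.e. the 1 2 3 4 1 2 3 4 ... pattern shown above.
  load = 0.0
  do iz = 1, nzones
     iw = mod(iz-1, nworkers) + 1
     load(iw) = load(iw) + cost(iz)
  end do
  print '(a,f8.3)', 'interleaved max/mean load: ', maxval(load)/(sum(load)/nworkers)
end program interleave_sketch
```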
Communication Patterns

Application | Collectives | Point-to-point | Asynchronous | Other
POP | 45% | 10% | 45% (MPI_Waitall)
GTC | 66% | 34%
PFLOTRAN | 95% | 5% (MPI_Barrier)
CHIMERA | 96% | 4%
AORSA | 65% | 35% (MPI_Wait)
S3D | 15% | 85%
LSMS (with one-sided comm.) | 5% | 15% | 85% (MPI-2 one-sided comm.: MPI_Win, MPI_Put, etc.)
LSMS (w/o one-sided comm.) | 45% | 55%

NOTE: absolute time for collectives is identical for both LSMS versions.
A lot of “big” codes don’t really stress the XT network
[Pie chart, "2007 INCITE": allocation by domain. Solar Physics 3.3%, Accelerator Physics 3.1%, Astrophysics 14.1%, Biology 4.8%, Chemistry 7.4%, Climate 13.6%, Computer Science 2.8%, Engineering 0.56%, Combustion 14.4%, Nuclear Physics 5.2%, Atomic Physics 1.4%, QCD 4.9%, Geosciences 1.2%, Fusion 7.2%, Materials Science 16.0%. Highlighted codes: MADNESS, DCA++, S3D, CHIMERA, POP, PFLOTRAN.]

[Scatter plot: GTC, S3D, POP, CHIMERA, DCA++, MADNESS, and PFLOTRAN placed on axes of Communication (0-100%) vs. Computation (0-100%).]

Distribution in this space depends upon the applications and the problem being simulated for a given application.
Relative Per Core Performance
Code | XT4 | 2 Socket F / SeaStar2 | 2 Socket F / Gemini
POP | 1.0 | 1.01 | 1.47
CHIMERA* | 1.0 | 0.89 | 1.03
GTC | 1.0 | 1.03 | 1.04
S3D | 1.0 | 0.89 | 1.01
PFLOTRAN | 1.0 | 1.23 | 1.39
MADNESS | 1.0 | 1.04 | 1.04

* Only hydrodynamics module used in benchmark
GenASiS development
GenASiS is not completely "wed" to a programming model yet
  – Lots of abstraction
  – Function overloading is used everywhere in the code
  – Many implementations are possible 'under the hood' (a small sketch of this pattern follows below)
Full, 3D rad-hydro simulations will require an exascale computer in any event, so we have time…
(Why, they couldn’t hit an elephant at this dis… [Gen. John Sedgwick, 1864])
Opinions and questions
• Ubiquity and performance are go/no-go metrics for any future methods/languages/ideas.
  – Does this present a chicken/egg conundrum: must things be built and tested on architectures not yet ready to exhibit the expected performance?
• Are the users of an exascale machine the present users of petascale-ish platforms? Is the mapping one-to-one?
• Writing code from scratch is not anathema, but you're lucky if you can afford to do it.
  – Even then, design decisions are often made during this process based not on wise reflection, but on attempts to snag the proverbial (but elusive) low-hanging fruit.