
ExaFMM: An open source library for Fast Multipole Methods aimed towards Exascale systems

Lorena A. Barba (1), Rio Yokota (2)

(1) Boston University, (2) KAUST

Left — The simulation of homogeneous isotropic turbulence is one of the most challenging benchmarks for computational fluid dynamics. Here, we show results of a 2048^3 system, computed with a fast multipole vortex method on 2048 GPUs. This simulation achieved 0.5 petaflop/s on Tsubame 2.0.

Below — Scaling on massively parallel CPU systems is also excellent. The plot shows MPI weak scaling (with SIMD within each core) from 1 to 32,768 processes, and the timing breakdown of the different kernels, tree construction, and communications. The test uses 10 million points per process, and results in 72% parallel efficiency on 32 thousand processes (Kraken system).

GTC Express @ SC'11, November 2011

What’s new? An open source FMM

Want more? Papers and software are online

Figure: GPU performance (Gflop/s, 0 to 350) of the P2P, M2L, and M2P kernels, for N = 10^4, 10^5, 10^6, 10^7.

Left — Actual performance on the GPU of three core kernels, for four different values of N. The M2P kernel of the treecode is able to deliver more flop/s on GPU than the M2L kernel of the FMM.

Kernel legend (information moves from source particles, red, to target particles, blue):

P2M (particle to multipole): treecode & FMM
M2M (multipole to multipole): treecode & FMM
M2L (multipole to local): FMM
M2P (multipole to particle): treecode
L2L (local to local): FMM
L2P (local to particle): FMM
P2P (particle to particle): treecode & FMM

Results: Parallel, multi-GPU performance

Besides the results shown above, strong scaling runs on CPUs for 10 million bodies per core have shown 93% parallel efficiency at 2048 cores without SIMD, and 54% parallel efficiency at 2048 cores with SIMD kernels.

Strong scaling runs on GPUs for 100 million bodies show 65% parallel efficiency on 512 GPUs, while weak scaling runs on GPUs with 4 million bodies per GPU show 72% parallel efficiency on 2048 GPUs.

The calculation of isotropic turbulence on a 2048^3 grid (as shown in the figure above) uses 4 million bodies per GPU on 2048 GPUs.

The same calculation using FFTW on 2048 CPU cores of the same machine obtains 27% parallel efficiency. The good scalability of FMMs becomes an advantage on massively parallel systems.

All the codes developed in our group are free (like free beer) and open source. To download them, follow the links from our group website:

http://barbagroup.bu.edu

Also on the website are up-to-date bibliographic references, and papers for download. Please visit!

Figure: Dual tree traversal of the target tree and source tree. Initial step: push the root pair to the stack. Iterative step: pop from the stack, split a cell, apply the multipole acceptance criterion, then either calculate the interaction or push the new pair to the stack.

Figure: MPI weak scaling from 1 to 32,768 processes. Horizontal axis: Nprocs; vertical axis: time [s] (0 to 40). Breakdown: tree construction, mpisendp2p, mpisendm2l, and the P2P, P2M, M2M, M2L, L2L, L2P kernels.

Figure: Roofline model. Horizontal axis: operational intensity (flop/byte, 1/16 to 256); vertical axis: attainable flop/s (Gflop/s, 16 to 2048). Rooflines: no SFU, no FMA; +SFU; +FMA (single-precision peak). Kernels plotted: fast N-body (particle-particle), fast N-body (cell-cell), 3-D FFT, stencil, SpMV.

The fast multipole method (FMM) is a numerical engine used in many applications, including acoustics, electrostatics, fluid simulations, wave scattering, and more.

Despite its importance, there is a lack of open community code, which arguably has affected its wider adoption. It is also a difficult algorithm to understand and to program, making availability of open-source implementations even more desirable.

Method: Multipole-based, hierarchical

There are two common variants: treecodes and FMMs. Both use tree data structures to cluster ‘source particles’ into a hierarchy of cells. See Figure to the right.

Multipole expansions are used to represent the effect of a cluster of particles, and are built for parent cells by an easy transformation (M2M). Local expansions are used to evaluate the effect of multipole expansions on many target points locally. The multipole-to-local (M2L) transformation takes most of the runtime.

Treecodes do not use local expansions, and compute the effect of multipole expansions directly on target points. They thus scale as O(N log N), for N particles. The FMM has the ideal O(N) scaling.
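As an illustration (this is the standard truncated expansion for the Laplace kernel; the series used inside exaFMM may differ), the potential at a target point due to sources clustered around a cell center can be approximated by a multipole expansion of order p:

$$\Phi(\mathbf{x}) \;=\; \sum_j \frac{q_j}{|\mathbf{x}-\mathbf{y}_j|} \;\approx\; \sum_{n=0}^{p}\sum_{m=-n}^{n} \frac{M_n^m}{r^{n+1}}\, Y_n^m(\theta,\phi), \qquad M_n^m \;=\; \sum_j q_j\, \rho_j^{\,n}\, Y_n^{-m}(\alpha_j,\beta_j),$$

where $(r,\theta,\phi)$ and $(\rho_j,\alpha_j,\beta_j)$ are spherical coordinates of the target and of the sources relative to the cell center, $Y_n^m$ are spherical harmonics, and the truncation error decays roughly like $(a/r)^{p+1}$ for a cell of radius $a$ well separated from the target.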

Right — an actual computation of a bioelectrostatics problem, where a protein-solvent interface is represented by a surface mesh. One molecule needs 100 thousand triangles on its surface.

We modeled a system with more than 10 thousand such molecules—this results in one billion unknowns!

We developed a novel treecode-FMM hybrid algorithm with auto-tuning capabilities that is O(N) and chooses the most efficient type of interaction. It is highly parallel and GPU-capable.

The algorithm uses a dual tree traversal for the target and source trees. See Figure below. In the initial step, both trees are formed and their root cells are paired and pushed to an empty stack.

Features: Parallel, multi-GPU, with auto-tuning (did we say open source?)

Figure: Flow of one time step of the parallel FMM (OpenMP parallel sections overlap communication and computation):
1. Partition domain: orthogonal recursive bisection (my rank, rank 0, rank 2, rank 3).
2. Communicate the local essential tree (LET) with MPI, while evaluating FMM kernels for the local tree on the GPU.
3. Evaluate FMM kernels for the remaining local essential tree.
4. Update position, vortex strength, and core radius locally.
5. Reinitialize particle position and core radius every few steps.

Further iterations consist of popping a pair of cells from the stack, splitting the larger one, and applying an acceptance criterion to determine if multipole interaction is valid or not. If not, a new pair is pushed to the stack.

The figure shows three cases: the ones on the left and right satisfy the multipole acceptance criterion (MAC), while the one in the center does not and is pushed to the stack.
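A minimal sketch of this traversal is shown below, assuming a simple Cell struct, a distance-based MAC with opening angle theta, and stubbed-out M2L and P2P calls; the names and the acceptance test are illustrative and do not reflect the actual exaFMM interface.

```cpp
// Sketch of a dual tree traversal with an explicit stack and a multipole
// acceptance criterion (MAC). Cell layout, theta, and the kernel stubs
// are assumptions for illustration only.
#include <stack>
#include <utility>
#include <vector>

struct Cell {
  double x[3];                 // cell center
  double r;                    // cell radius
  std::vector<Cell> children;  // empty for a leaf cell
  bool isLeaf() const { return children.empty(); }
};

const double theta = 0.5;      // MAC opening angle (illustrative value)

// MAC: the pair is well separated if the summed radii are small
// compared to the center-to-center distance.
bool accept(const Cell& a, const Cell& b) {
  double d2 = 0;
  for (int i = 0; i < 3; ++i) {
    double dx = a.x[i] - b.x[i];
    d2 += dx * dx;
  }
  double R = a.r + b.r;
  return R * R < theta * theta * d2;
}

void M2L(Cell&, const Cell&) { /* placeholder: multipole-to-local kernel */ }
void P2P(Cell&, const Cell&) { /* placeholder: direct particle-particle kernel */ }

void dualTreeTraversal(Cell& targetRoot, Cell& sourceRoot) {
  std::stack<std::pair<Cell*, Cell*>> pairs;
  pairs.push({&targetRoot, &sourceRoot});           // initial step: push root pair
  while (!pairs.empty()) {
    auto [target, source] = pairs.top();
    pairs.pop();
    if (accept(*target, *source)) {
      M2L(*target, *source);                        // MAC satisfied: cell-cell interaction
    } else if (target->isLeaf() && source->isLeaf()) {
      P2P(*target, *source);                        // both leaves: direct interaction
    } else if (target->isLeaf() ||
               (!source->isLeaf() && source->r > target->r)) {
      for (Cell& child : source->children)          // split the larger cell (source)
        pairs.push({target, &child});
    } else {
      for (Cell& child : target->children)          // split the larger cell (target)
        pairs.push({&child, source});
    }
  }
}
```

Driving the traversal with an explicit stack rather than recursion keeps every pending cell pair in one place, which also makes it natural to choose, pair by pair, whichever interaction type is cheapest (the auto-tuning mentioned above).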

Above — The flow of the vortex method calculation using parallel FMM, where the domain is partitioned by an orthogonal recursive bisection and communication of the local essential tree (LET) is overlapped with the GPU calculation of the local tree. The remaining parts of the LET are calculated and updates are performed locally.
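The communication/computation overlap can be expressed with OpenMP parallel sections, as sketched below; the three functions are placeholders standing in for the MPI exchange and the kernel evaluations, not the actual exaFMM API.

```cpp
// Sketch of overlapping LET communication with local GPU work via OpenMP
// parallel sections. The three functions are illustrative placeholders.
#include <omp.h>

void exchangeLET() {
  // placeholder: MPI send/recv of local essential tree cells with other ranks
}
void evaluateLocalTree() {
  // placeholder: launch FMM kernels for the local tree on the GPU
}
void evaluateRemainingLET() {
  // placeholder: evaluate FMM kernels for the received LET cells
}

void fmmTimeStep() {
#pragma omp parallel sections num_threads(2)
  {
#pragma omp section
    exchangeLET();            // communication section ...
#pragma omp section
    evaluateLocalTree();      // ... overlapped with local GPU evaluation
  }
  evaluateRemainingLET();     // finish once the whole LET is available
}
```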

Right — This roofline model shows the high operational intensity of FMM kernels compared to SpMV, Stencil, and 3-D FFT kernels. The model is for an NVIDIA Tesla C2050 GPU, where the single-precision peak can only be achieved when special function units (SFU) and fused multiply-add (FMA) operations are fully utilized.
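For reference, the roofs plotted here follow the standard roofline bound,

$$P_{\text{attainable}} \;=\; \min\bigl(P_{\text{peak}},\; I \times B_{\text{peak}}\bigr),$$

where $I$ is the operational intensity in flop/byte and $B_{\text{peak}}$ is the peak memory bandwidth. High-intensity kernels such as the fast N-body cell-cell and particle-particle evaluations land toward the flat, compute-limited part of the roof, whereas lower-intensity kernels like SpMV, stencil, and 3-D FFT are bandwidth-limited.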