domain decomposition methods for the finite element...


Domain decomposition methods for the Finite Element Approximation of partial differential equations

ECAR Workshop 2012

Santiago Badia (1,2), Alberto F. Martín (2) and Javier Principe (1,2)

(1) Universitat Politècnica de Catalunya
(2) International Center for Numerical Methods in Engineering (CIMNE)

Buenos Aires, July 2012


Outline

Introduction

Domain decomposition methods

Numerical experiments
  • Coarse problem solution strategies
  • Weak scalability for 3d problems

Conclusions


Introduction

Finite element method (FEM)
• A method for the numerical solution of partial differential equations.
• Widely used in engineering analysis (users).
• Conferences of 1K-3K people each year (researchers).
• Handles arbitrary geometries.
• Sound mathematical theory.
• “Easy” implementation (local structure).

Domain decomposition method (DDM)
• A divide and conquer method that solves subproblems concurrently.
• A small community, with conferences of 200-300 people each year (www.ddm.org).
• Mainly mathematically oriented; some HPC implementations, but not many:
  • TRILINOS (AztecOO, ML)
  • PETSC (BNN?, BDDC?)


Computational Methods in Fusion Technology (COMFUS)

EU Starting Grant awarded to S. Badia (5-year project, since 2011)

ITER (Experimental Fusion Reactor)

Magnetically confined plasma

• Very high temperature (150,000,000 °C)

• Fusion reaction ⇒ Heat + n

Test Blanket Module (TBM)

• Absorb heat and extract it

• Absorb n + Li → Tr (self-sustainment)


Blanket Modules


Overall strategy

• Stabilized FEM solvers
  • Based on multiscale concepts (model the subscales, the part of the unknown not captured by the grid).
  • Allow different problems to be treated with the same discretization (multiphysics).
  • Optimal a priori error estimates.

• Fully implicit schemes
  • Unconditional stability regardless of the time step size (multiple time scales).
  • Require the solution of a (very large) linear system per time step.

• Block iterative preconditioners
  • Built from positive definite operators (Laplace, CDR).
  • Algorithmically scalable (independent of the discretization).

• Domain decomposition methods for positive definite problems
  • Hybrid (direct-iterative) robust methods.
  • Yield weakly scalable algorithms.


Problem statement

Given a bounded domain $\Omega$ and a FE partition $\mathcal{T}$, we build a conforming (nodal) finite element (FE) space $W \subset H^1_0(\Omega)$.

• Strong problem: find $u \in W$ such that
  $$L u = f,$$
  where (as a model problem) $L = -\nabla^2$.

• Variational problem: find $u \in W$ such that
  $$a(u, v) = (f, v) \quad \text{for any } v \in W,$$
  where $f \in W'$ and (as a model problem) $a(u, v) = \int_\Omega \nabla u \cdot \nabla v \, d\Omega$.

• FEM approximation: expand the unknown $u$ and the test function $v$ in terms of basis functions $N^a(x)$:
  $$u(x) = \sum_{j=1}^{n} N^j(x) U_j, \qquad v(x) = \sum_{i=1}^{n} N^i(x) V_i.$$


• Algebraic problem: find $u \in \mathbb{R}^N$ such that
  $$A u = b,$$
  where $A_{ij} = a(N^i, N^j)$ is a symmetric positive definite matrix whose sparsity depends on the mesh, and $b_i = (N^i, f)$.
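As a small illustration of the algebraic problem, the sketch below assembles $A$ and $b$ for the model problem in one dimension. The 1D setting, the unit interval, the linear elements and the load $f = 1$ are assumptions made for this example only; the function name `assemble_poisson_1d` is hypothetical and is not part of the software described in the talk.

```python
import numpy as np

def assemble_poisson_1d(n_elements):
    """Assemble A_ij = a(N^i, N^j) and b_i = (N^i, f) for -u'' = f on (0, 1),
    u(0) = u(1) = 0, with linear elements and f = 1 (illustrative choices)."""
    h = 1.0 / n_elements
    n = n_elements - 1                      # interior nodes only
    A = np.zeros((n, n))
    b = np.zeros(n)
    k_local = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])  # element stiffness
    f_local = (h / 2.0) * np.array([1.0, 1.0])                  # element load for f = 1
    for e in range(n_elements):
        # Interior indices of the two end nodes of element e;
        # the values -1 and n correspond to boundary nodes and are skipped.
        nodes = (e - 1, e)
        for a_loc, i in enumerate(nodes):
            if not 0 <= i < n:
                continue
            b[i] += f_local[a_loc]
            for b_loc, j in enumerate(nodes):
                if 0 <= j < n:
                    A[i, j] += k_local[a_loc, b_loc]
    return A, b

A, b = assemble_poisson_1d(8)
u = np.linalg.solve(A, b)
```

The local structure of the loop (each element touches only its own nodes) is what makes the matrix sparse; a dense array is used here only to keep the sketch short.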


Contents of the talk

• Describe preconditioners of balancing type, which yield weakly scalable algorithms:
  • Balancing Neumann-Neumann (BNN)
  • Balancing DD by Constraints (BDDC)

• Mention our rehabilitation of the BNN method:
  • Easy/cheap treatment of singular Neumann problems.
  • Spares one Dirichlet solve per iteration (like additive BDDC).

• Describe our hybrid implementation of domain decomposition methods (but pure MPI results).

• Present weak scalability results up to 4K cores for structured meshes.
• Present a comparison of two strategies for the solution of the coarse problem.




Domain partition

The global problem on $(\mathcal{T}_h, \Omega)$ is partitioned into $P$ local problems on $(\mathcal{T}_h^i, \Omega_i)$.

[Figures: global mesh of size h, partitioned into subdomains of size H]

Local (internal) interfaces $\Gamma_i = \partial\Omega_i \setminus \partial\Omega$ generate a global (internal) interface $\Gamma = \bigcup_{i=1}^{n_{sbd}} \Gamma_i$.

Now, we define local FE spaces $V_i$ related to $\Gamma_i$ and a global FE space $V$ related to $\Gamma$.


Mesh partition

• Represent the mesh by a graph.
• Use a graph partition tool (e.g. METIS).
• Generate interface matching information.
• Done as a preprocessing step.


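Interface (Schur complement) problem

• The partition induces a block structure
  $$A u = \begin{bmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{bmatrix} \begin{bmatrix} u_I \\ u_\Gamma \end{bmatrix} = \begin{bmatrix} b_I \\ b_\Gamma \end{bmatrix} = b,$$
  where $A_{II} = \mathrm{diag}\left(A_{II}^{(1)}, A_{II}^{(2)}, \ldots, A_{II}^{(P)}\right)$.

• After static condensation of the interior (bubble) unknowns $u_I$,
  $$S u_\Gamma = g,$$
  where the Schur complement $S \in \mathbb{R}^{n_i \times n_i}$ and $g \in \mathbb{R}^{n_i}$ read
  $$S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}, \qquad g = b_\Gamma - A_{\Gamma I} A_{II}^{-1} b_I.$$

• Local solvers are based on external cutting-edge multi-threaded sparse direct libraries (e.g., PARDISO).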
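The static condensation step can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: a small dense tridiagonal matrix stands in for $A$, a single node plays the role of the interface, and the dense `numpy.linalg.solve` calls replace the sparse direct solvers mentioned on the slide. The function name `schur_solve` is hypothetical.

```python
import numpy as np

def schur_solve(A, b, interior, interface):
    """Solve A u = b by eliminating the interior unknowns (static condensation)."""
    A_II = A[np.ix_(interior, interior)]
    A_IG = A[np.ix_(interior, interface)]
    A_GI = A[np.ix_(interface, interior)]
    A_GG = A[np.ix_(interface, interface)]
    # Schur complement and condensed right-hand side
    S = A_GG - A_GI @ np.linalg.solve(A_II, A_IG)
    g = b[interface] - A_GI @ np.linalg.solve(A_II, b[interior])
    u = np.zeros_like(b)
    u[interface] = np.linalg.solve(S, g)                # interface problem S u_G = g
    u[interior] = np.linalg.solve(A_II, b[interior] - A_IG @ u[interface])
    return u, S

# Toy 1D problem: 9 unknowns, node 4 separates two subdomains of 4 interior nodes.
n = 9
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
interface = [4]
interior = [0, 1, 2, 3, 5, 6, 7, 8]
u, S = schur_solve(A, b, interior, interface)
```

With this ordering, `A_II` is block diagonal (one block per subdomain), so in a parallel code each block can be factorized independently by the process that owns the subdomain.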


Solution methods

The conjugate gradient method requires $O(\sqrt{\kappa})$ iterations and can be applied to

• the whole problem ($Au = b$):
  $$\kappa(A) \le C h^{-2} = C N^{2/d}$$

• the interface problem ($Sx = g$):
  $$\kappa(S) \le C H^{-1} h^{-1} = C P^{1/d} N^{1/d}$$

• the interface problem ($Sy = g$) preconditioned using inverses of the local matrices $S_i$ (Neumann-Neumann):
  $$\kappa(B_{NN}^{-1} S) \le C H^{-2} \left[1 + \log^2\left(\frac{H}{h}\right)\right] = C P^{2/d} \left[1 + d^{-2} \log^2\left(\frac{N}{P}\right)\right]$$

• the interface problem ($Sy = g$) preconditioned by balancing methods:
  $$\kappa(B_{BDD}^{-1} S) \le C \left[1 + \log^2\left(\frac{H}{h}\right)\right] = C \left[1 + d^{-2} \log^2\left(\frac{N}{P}\right)\right]$$
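The first estimate, $\kappa(A) \le C h^{-2}$, can be observed on a small example: halving $h$ should multiply the condition number by roughly 4. The sketch below assumes a 1D linear finite element Laplacian on the unit interval, which is an illustrative choice and not one of the experiments of the talk; the function names are hypothetical.

```python
import numpy as np

def laplacian_1d(n):
    """Stiffness matrix of -u'' on (0, 1) with n interior nodes, h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h

def condition_number(A):
    """Spectral condition number of a symmetric positive definite matrix."""
    eig = np.linalg.eigvalsh(A)   # eigenvalues in ascending order
    return eig[-1] / eig[0]

# n = 15, 31, 63 correspond to h = 1/16, 1/32, 1/64.
kappas = {n: condition_number(laplacian_1d(n)) for n in (15, 31, 63)}
ratios = [kappas[31] / kappas[15], kappas[63] / kappas[31]]
print(ratios)  # both ratios are expected to be close to 4
```

Since CG needs $O(\sqrt{\kappa})$ iterations, this growth translates into an iteration count that roughly doubles with each halving of $h$ when no preconditioner is used.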


Balancing Neumann-Neumann (BNN)

• Introduce a global (coarse) approximation (balancing) $B_C^{-1}$ and define the multiplicative preconditioner
  $$B_{BNN}^{-1} = B_C^{-1} + (I - B_C^{-1} S) B_{NN}^{-1}$$

• The coarse space is $H_0 = \mathrm{span}\{\phi_i : i = 1, \ldots, n_{sbd}\}$, where $\phi_i = I_i 1_{\Gamma_i}$.
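One way to read the word "multiplicative" is through the error propagation operator. The short derivation below uses only the definition of $B_{BNN}^{-1}$ given on this slide; the factorized interpretation is added here as a reading aid and is not a statement taken from the talk.

```latex
\begin{align*}
I - B_{BNN}^{-1} S
  &= I - B_C^{-1} S - (I - B_C^{-1} S)\, B_{NN}^{-1} S \\
  &= (I - B_C^{-1} S)\,(I - B_{NN}^{-1} S).
\end{align*}
```

Under this reading, one application of the preconditioner acts on the error as a Neumann-Neumann step followed by a coarse correction, which is the product structure typical of multiplicative methods.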


BNN rehabilitation

Currently
• Drawback 1: dealing with singular matrices $S_i$ (pseudo-inverses).
• Drawback 2: requires 2 Dirichlet solves per iteration (compared with one in the additive BDDC).
• Deprecated; outperformed by BDDC...

but we propose a rehabilitation
• Definite matrices are obtained by fixing appropriately chosen degrees of freedom.
• Reusing preconditioner computations in the Schur complement multiplication, we can spare one Dirichlet solve.
• This also implies a reduction of nearest neighbor communications (more scalable).

and there are advantages
• Smaller coarse problem in 3d (not in 2d).
• Easier implementation.


Balancing DD by Constraints (BDDC)

• A discontinuous coarse space is proposed (not Galerkin).
• Basis functions are defined locally as the solution of a problem that enforces continuity of values at corners and of mean values on edges (faces).
• These constraints lead to positive definite local problems (IF properly applied), so external libraries can be used.
• It is defined as an additive correction (1DS + 1NS + 1CS).


PCG algorithm

BNN_PCG (Input: (S, B_BNN, g, x_0), Output: x)
  z_0 := B_BNN^{-1} (I − S B_C^{-1}) r_0
  p_0 := z_0
  for j = 0, 1, . . . , till convergence do
    α_j := (r_j, z_j) / (S p_j, p_j)          (GR + LDS*)
    x_{j+1} := x_j + α_j p_j
    r_{j+1} := r_j − α_j S p_j
    z_{j+1} := B_NN^{-1} r_{j+1}               (LC+LNS+LC)
    s := r_{j+1} − S z_{j+1}                   (LDS)
    z_{j+1} := z_{j+1} + B_C^{-1} s            (GC+GCS)
    β_j := (r_{j+1}, z_{j+1}) / (r_j, z_j)     (GR)
    p_{j+1} := z_{j+1} + β_j p_j
  end for

• Local operations (LC: communication, LNS: Neumann solver, LDS: Dirichlet solver)

• Global operations (GR: reduction, GC: communication, GCS: coarse solver)
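The loop above has the shape of a standard preconditioned conjugate gradient iteration. The sketch below is a generic serial PCG, given only to make that structure concrete: the operator and the preconditioner are passed as callables, and a simple Jacobi (diagonal) preconditioner is used as a stand-in for the BNN preconditioner, which is not implemented here. The names `pcg`, `apply_S` and `apply_prec` are hypothetical.

```python
import numpy as np

def pcg(apply_S, apply_prec, g, x0, tol=1e-10, max_iter=200):
    """Generic preconditioned CG for S x = g with S symmetric positive definite."""
    x = x0.copy()
    r = g - apply_S(x)            # initial residual
    z = apply_prec(r)             # preconditioned residual
    p = z.copy()
    rz = r @ z
    for j in range(max_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(g):
            return x, j
        Sp = apply_S(p)
        alpha = rz / (Sp @ p)     # a global reduction in a parallel setting
        x = x + alpha * p
        r = r - alpha * Sp
        z = apply_prec(r)         # where the two-stage BNN application would go
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x, max_iter

# Toy SPD system with a Jacobi preconditioner (assumption: stand-in for B_BNN).
n = 50
S = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
g = np.ones(n)
diag = np.diag(S)
x, iterations = pcg(lambda v: S @ v, lambda r: r / diag, g, np.zeros(n))
```

In the slide's algorithm the single call to `apply_prec` is split into a Neumann-Neumann solve, a residual update and a coarse correction, each with the communication pattern annotated in parentheses.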


Coarse problem solution strategies

• Serial gather (SG): MPI rank 0 is responsible for the serial computation of the coarse-grid correction.

• All gather (AG): all MPI ranks are responsible for the serial computation of the coarse-grid correction.

[Diagram, SG: N Solver, Nearest Neighbor Comm., Updates and D Solver on every rank; Global Reduction, Global Gather, a single C Solver, Global Scatter]

[Diagram, AG: N Solver, Updates and D Solver on every rank; Global Reduction, Global AllGather, a C Solver on every rank]




Experimental framework (Software + Platform)

FEMPAR: Finite Element Multiphysics PARallel software (in-house)

• MPI implementation of sub-structuring DDMs.
• Relies on highly efficient vendor implementations of the BLAS (Intel MKL, IBM ESSL, etc.).
• Provides interfaces to external multi-threaded sparse direct solvers (PARDISO, WSMP, etc.).
• Although the codes are hybrid MPI/OpenMP, the focus is on the pure MPI model, with a one-to-one mapping among subdomains/MPI tasks/physical cores.

running on
• Marenostrum@BSC (2560 JS21 blades, 10240 cores)



[Figure: total wall clock time (secs.) vs. #cores (16, 64, 144, 256, 400, 576, 784, 1024) for BNN SG and BNN AG]


Serial gather (SG) strategy

Using the PARAVER tool developed at BSC

[Figures: PARAVER traces for P = 16, 64, 144, 256, 400, 576, 784 and 1024]


Global collective communications

Using IBM XL Compiler and MPICH 1.2.7 with variable and fixed size collectives

[Figure: Scatter wall clock time [µs] vs. cores (16, 64, 144, 256, 400, 576, 784, 1024); series MS = 16, 64, 128, 256, 512]

[Figure: Scatterv wall clock time [µs] vs. cores (16, 64, 144, 256, 400, 576, 784, 1024); series MS = 16, 64, 128, 256, 512]



[Figure: total wall clock time (secs.) vs. #cores (16, 256, 576, 1024, 1296, 1600, 1936, 2304, 2704, 3136, 3600, 4096) for BNN SG and BNN AG]


Weak scalability for 3D problems

• Target problem: $-\Delta u = f$ on a rectangular prism $\Omega = [0, 2] \times [0, 2] \times [0, 1]$.
• Uniform global mesh of hexahedral Q1 finite elements.
• Uniform domain partition into a $2m \times 2m \times m$ rectangular prism grid of hexahedral local meshes.
• We use 4 cores/blade and $m = 2, 3, 4, \ldots, 10$.
• Gradually larger local problem sizes $H/h = 10, 20, 30, 40$.
• Weak scalability: at which rate does a given magnitude evolve as the number of cores increases while $H/h$ is kept constant?
• Focus on the total computation time and the number of PCG iterations for the interface problem.
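The problem sizes implied by this setup can be tabulated with a few lines of arithmetic. The sketch below assumes that $H/h$ is the number of elements along each edge of a (cubic) subdomain and that there is one subdomain per core, in line with the one-to-one mapping mentioned for the software; the function name `weak_scaling_sizes` is hypothetical.

```python
def weak_scaling_sizes(m, H_over_h):
    """Core count and mesh sizes for a 2m x 2m x m partition of [0,2]x[0,2]x[0,1].

    Assumptions: one subdomain per core, and H/h elements per subdomain edge.
    """
    subdomains = (2 * m, 2 * m, m)
    cores = subdomains[0] * subdomains[1] * subdomains[2]      # P = 4 m^3
    elements = tuple(s * H_over_h for s in subdomains)         # elements per direction
    global_nodes = (elements[0] + 1) * (elements[1] + 1) * (elements[2] + 1)
    local_nodes = (H_over_h + 1) ** 3                          # Q1 nodes per subdomain
    return cores, global_nodes, local_nodes

cores_per_m = [weak_scaling_sizes(m, 10)[0] for m in range(2, 11)]
print(cores_per_m)  # [32, 108, 256, 500, 864, 1372, 2048, 2916, 4000]
```

In a weak scaling test the local size stays fixed while the core count grows with $m$, so the global number of nodes grows roughly in proportion to the number of cores.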


BNN vs. BDDC (PCG iterations)

[Figure: number of PCG iterations vs. #cores (32, 256, 500, 864, 1372, 2048, 2916, 4000) for BNN, BDDC.CE and BDDC.CEF; panels H/h = 10 and H/h = 20]

• BNN ∼ BDDC(ce)
• BDDC(cef): small reduction from BDDC(ce)


BNN vs. BDDC (PCG iterations)

[Figure: number of PCG iterations vs. #cores (32, 256, 500, 864, 1372, 2048, 2916, 4000) for BNN, BDDC.CE and BDDC.CEF; panels H/h = 30 and H/h = 40]

• BNN ∼ BDDC(ce)
• BDDC(cef): small reduction from BDDC(ce)


BNN vs. BDDC (total computation time)

[Figure: total wall clock time (secs.) vs. #cores (32, 256, 500, 864, 1372, 2048, 2916, 4000) for BNN, BDDC.CE and BDDC.CEF; panels H/h = 10 and H/h = 20]

• Enhanced BNN outperforms BDDC(ce) as p ↑ or H/h ↓ (dominant coarse solver)
• Almost identical as H/h ↑ (BUT enhancement basic)
• BDDC(cef) not competitive


BNN vs. BDDC (total computation time)

[Figure: total wall clock time (secs.) vs. #cores (32, 256, 500, 864, 1372, 2048, 2916, 4000) for BNN, BDDC.CE and BDDC.CEF; panels H/h = 30 and H/h = 40]

• Enhanced BNN outperforms BDDC(ce) as p ↑ or H/h ↓ (dominant coarse solver)
• Almost identical as H/h ↑ (BUT enhancement basic)
• BDDC(cef) not competitive




Conclusions

• Enhanced BNN:
  • Sparse direct solvers for definite matrices (PARDISO).
  • Spares one Dirichlet solve per iteration (= additive BDDC).

• Hybrid implementation of BDD methods:
  • Weakly scalable in terms of iteration count (in accordance with the $1 + \log^2(H/h)$ estimate of the condition number).
  • Weakly scalable in terms of CPU time when p ↓ or H/h ↑ (dominant fine solver).
  • The coarse problem is solved faster using the serial gather strategy.

• BDD comparison:
  • 2d: BNN and BDDC-(c) similar; BNN vs. BDDC-(ce) depends on H/h (BNN does not outperform BDDC(*)).
  • 3d: BNN very competitive (superior as p ↑ or H/h ↓).


Current and future work

• Porting our code to other platforms (CURIE, JUROPA/HPC-FF).
• Extension to the (monolithic) elasticity problem.
• Comprehensive hybrid tests.
• Comprehensive unstructured tests.
• Other strategies for the treatment of the coarse problem (multilevel, additive).

S. Badia, A. F. Martín and J. Principe, Enhanced balancing Neumann-Neumann preconditioning in computational fluid and solid mechanics, in preparation.

S. Badia, A. F. Martín and J. Principe, Implementation and weak scalability study of domain decomposition methods of balancing type, in preparation.


Acknowledgements:
• European Research Council (ERC) (funding)
• Spanish supercomputing network (RES) (computer resources, technical expertise and assistance)

Thank you!

Javier Principe
http://principe.rmee.upc.edu/
http://www.cimne.com/comfus/
