robust memory-aware mappings for parallel multifrontal...

57
Robust memory-aware mappings for parallel multifrontal factorizations Joint work with P. R. Amestoy, E. Agullo, A. Buttari, A. Guermouche, J.-Y. L’Excellent François-Henry Rouet, Lawrence Berkeley National Laboratory MUMPS Users Group Meeting, May 29-30, 2013

Upload: tranmien

Post on 22-Oct-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Robust memory-aware mappings forparallel multifrontal factorizationsJoint work with P. R. Amestoy, E. Agullo, A. Buttari, A. Guermouche, J.-Y. L’Excellent

François-Henry Rouet, Lawrence Berkeley National Laboratory

MUMPS Users Group Meeting, May 29-30, 2013

Page 2: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Introduction

Motivation

• Memory is often a bottleneck for direct solvers.• Physical memory per core tends to decrease.• Estimating memory consumption in a dynamic scheduling contextis difficult (cf. the infamous error code “-9” in MUMPS).

Context

• We want to design mapping algorithms that enforce some memoryconstraints and provide better memory estimates.

• We focus on multifrontal methods (e.g., MUMPS).

2/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 3: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

FactorsActivefrontalmatrix

Stack ofcontributionblocks

Active storage

5

54

145

34

23

Elimination tree

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 4: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 5: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 6: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 7: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 8: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 9: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 10: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 11: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 12: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 13: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 14: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 15: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

1 2 3 4 512345

1 2 3 4 512345

A= L+U-I=

Storage is divided into two parts:• Factors• Active memory

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 16: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The multifrontal method [Duff & Reid ’83]

• Factors are incompressible and usuallyscale fairly; they can optionally bewritten on disk.

• In sequential, the traversal thatminimizes active memory is known[Liu’86].

• In parallel, active memory becomesdominant.Example: share of active storage onthe AUDI matrix using MUMPS 4.10.01 processor: 11%256 processors: 59%

5

54

145

34

23

Active memory

3/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 17: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

The problem

Metric: memory efficiency

e(p) = Sseqp × Smax (p)

We would like e(p) ' 1, i.e. Sseq/p on each processor.

Common mappings/schedulings provide a poor memory efficiency:• Proportional mapping: lim

p→∞e(p) = 0 on regular problems.

• MUMPS relies primarily on a relaxed proportional mapping; typicalefficiency: between 0.10 and 0.40.Memory estimates are unreliable.

4/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 18: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Proportional mapping

Proportional mapping: top-down traversal of the tree, where processors as-signed to a node are distributed among its children proportionally to the weightof their respective subtrees.

5/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 19: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Proportional mapping

Proportional mapping: top-down traversal of the tree, where processors as-signed to a node are distributed among its children proportionally to the weightof their respective subtrees.

64

5/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 20: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Proportional mapping

Proportional mapping: top-down traversal of the tree, where processors as-signed to a node are distributed among its children proportionally to the weightof their respective subtrees.

64

3262640% 10% 50%

5/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 21: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Proportional mapping

Proportional mapping: top-down traversal of the tree, where processors as-signed to a node are distributed among its children proportionally to the weightof their respective subtrees.

64

32626

26

13 13

2 2 221 11

11 10

• Targets run time but poor memory efficiency.• Usually a relaxed version is used: more memory-friendly but unreliable

estimates.5/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 22: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Proportional mapping

Proportional mapping: assuming that the sequential peak is 5 GB,

Smax (p) > max{4GB26 ,

1GB6 ,

5GB32

}= 0.16 GB⇒ e(p) 6 5

64× 0.16 6 0.5

64

326264GB 1GB 5GB

5GB

6/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 23: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

A more memory-friendly strategy. . .

All-to-all mapping: postorder traversal of the tree, where all the processorswork at every node:

64

646464

64

64 64

64 64

6464

646464

7/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 24: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

A more memory-friendly strategy. . .

All-to-all mapping: postorder traversal of the tree, where all the processorswork at every node:

64

646464

64

64 64

64 64

6464

646464

Optimal memory scalability (Smax (p) = Sseq/p) but no tree parallelism andprohibitive amounts of communications.

7/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 25: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

A class of “memory-aware” mappings

“Memory-aware” mapping [Agullo ’08]: aims at enforcing a given memory con-straint (M0, maximum memory per processor):

64

326264GB 1GB 5GB

5GB

1. Try to apply proportional mapping.

2. Enough memory for each subtree?

If not, serialize them, and update M0:processors stack equal shares of the CBs from previous nodes.

8/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 26: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

A class of “memory-aware” mappings

“Memory-aware” mapping [Agullo ’08]: aims at enforcing a given memory con-straint (M0, maximum memory per processor):

64

326264GB 1GB 5GB

5GB

!

1. Try to apply proportional mapping.2. Enough memory for each subtree?

If not, serialize them, and update M0:processors stack equal shares of the CBs from previous nodes.

8/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 27: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

A class of “memory-aware” mappings

“Memory-aware” mapping [Agullo ’08]: aims at enforcing a given memory con-straint (M0, maximum memory per processor):

64

64 6464

1. Try to apply proportional mapping.2. Enough memory for each subtree? If not, serialize them, and update M0:

processors stack equal shares of the CBs from previous nodes.8/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 28: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

A class of “memory-aware” mappings

“Memory-aware” mapping [Agullo ’08]: aims at enforcing a given memory con-straint (M0, maximum memory per processor):

64

64 6464

64 6464

32 32

22 21 21

32 32

• Ensures the given memory constraint and provides reliable estimates.• Tends to assign many processors on nodes at the top of the tree

=⇒ performance issues on parallel nodes.8/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 29: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

A class of “memory-aware” mappings

A finer “memory-aware” mapping? Serializing all the children at once is veryconstraining: more tree parallelism can be found.

64

51 641364

Find groups on which proportional mapping works, and serialize these groups.Heuristic: follow a given order (e.g. the serial postorder) and form groups aslarge as possible.

9/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 30: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Memory-aware with groups

The algorithm maps each child i on pi processors so that, on everyprocessor working on i , two conditions are enforced:

(Cstk) there is enough memory to hold Pseq(i)/pi .(Casm) the assembly of the CB of i into its parent is feasible.

while not all children collected doi= current nodeif i cannot be alone then

Reset previous children to anall-to-all mapping (this shouldimprove balance)

end ifStarting from i : collect as manynodes as possible as long as (Cstk)and (Casm) are ensuredUse proportional mapping on thegroup, serialize with previous ones

end while

pf

p2

p1

10/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 31: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Memory-aware with groups

The algorithm maps each child i on pi processors so that, on everyprocessor working on i , two conditions are enforced:

(Cstk) there is enough memory to hold Pseq(i)/pi .(Casm) the assembly of the CB of i into its parent is feasible.

while not all children collected doi= current nodeif i cannot be alone then

Reset previous children to anall-to-all mapping (this shouldimprove balance)

end ifStarting from i : collect as manynodes as possible as long as (Cstk)and (Casm) are ensuredUse proportional mapping on thegroup, serialize with previous ones

end while

pf

p2

p1

10/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 32: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Memory-aware with groups

The algorithm maps each child i on pi processors so that, on everyprocessor working on i , two conditions are enforced:

(Cstk) there is enough memory to hold Pseq(i)/pi .(Casm) the assembly of the CB of i into its parent is feasible.

while not all children collected doi= current nodeif i cannot be alone then

Reset previous children to anall-to-all mapping (this shouldimprove balance)

end ifStarting from i : collect as manynodes as possible as long as (Cstk)and (Casm) are ensuredUse proportional mapping on thegroup, serialize with previous ones

end while

pf

pf

pf

10/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 33: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Memory-aware with groups

The algorithm maps each child i on pi processors so that, on everyprocessor working on i , two conditions are enforced:

(Cstk) there is enough memory to hold Pseq(i)/pi .(Casm) the assembly of the CB of i into its parent is feasible.

while not all children collected doi= current nodeif i cannot be alone then

Reset previous children to anall-to-all mapping (this shouldimprove balance)

end ifStarting from i : collect as manynodes as possible as long as (Cstk)and (Casm) are ensuredUse proportional mapping on thegroup, serialize with previous ones

end while

pf

pf

pf

10/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 34: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Memory-aware with groups

The algorithm maps each child i on pi processors so that, on everyprocessor working on i , two conditions are enforced:

(Cstk) there is enough memory to hold Pseq(i)/pi .(Casm) the assembly of the CB of i into its parent is feasible.

while not all children collected doi= current nodeif i cannot be alone then

Reset previous children to anall-to-all mapping (this shouldimprove balance)

end ifStarting from i : collect as manynodes as possible as long as (Cstk)and (Casm) are ensuredUse proportional mapping on thegroup, serialize with previous ones

end while

pf

pf

pf

p3

p4

p5

10/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 35: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Experiments

• Matrix: finite-difference model of acoustic wave propagation,27-point stencil, 192× 192× 192 grid; Seiscope consortium.N = 7 M, nnz = 189 M, factors=152 GB (METIS).Sequential peak of active memory: 46 GB.

• Machine: 64 nodes with two quad-core Xeon X5560 per node.We use 256 MPI processes.

• Perfect memory scalability: 46 GB/256=180MB.

11/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 36: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Experiments

Prop Memory-aware mappingMap M0 = 225 MB M0 = 380 MB

w/o groups w/ groupsMax stack peak (MB) 1932 227 330 381Avg stack peak (MB) 626 122 177 228Time (s) 1323 2690 2079 1779

The root node has 3 children that cannot bemapped using a proportional mapping:

46 GB+ 32 GB+ 30 GB > 380 MB× 256

• The “basic” memory-aware mapping serializesthe three children.

• The variant with groups puts two childrentogether and one alone.

46GB

45GB 32GB 30GB

12/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 37: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Experiments

We assess how the different strategies map nodes, compared to theproportional mapping.

Dept

h

0

2

1

34

00

0.5

1.5

2

2.5

3

3.5

4

2 4 6 8 10 12 14

Average #procs/#procs PM per node, for each layer

Depth

Ratio

#pr

ocs/

#pr

ocs P

M

MA, 225MBMA, 380MB, w/o groupsMA, 380MB, w/ groups

1

• Relaxing M0 enhances tree parallelism.• The variant with groups enhances tree parallelism.

13/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 38: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Performance issues

The new mapping tends to assign many processes to nodes at the topof the tree. One of the communication patterns became a bottleneck.

M

S1S2S3

(unsymmetric case)

Non-blocking broadcast of a block of factorsfrom the master process.A kind of ibcast.

Baseline implementation: a loop of non-blocking sends from themaster to the processors. Communication speed is constant.Hyperion (CICT): 1.7 GB/s; Hopper (NERSC): 4.5 GB/s.Tree-based implementation: communication speed is almostproportional to the number of processors.Hyperion, 128 cores: 28 GB/s; Hopper, 768 cores: 122 GB/s.

14/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 39: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Performance issues

The new mapping tends to assign many processes to nodes at the topof the tree. One of the communication patterns became a bottleneck.

M

S1S2S3

(unsymmetric case)

M

S1

Sp

... p

Baseline implementation: a loop of non-blocking sends from themaster to the processors. Communication speed is constant.Hyperion (CICT): 1.7 GB/s; Hopper (NERSC): 4.5 GB/s.

Tree-based implementation: communication speed is almostproportional to the number of processors.Hyperion, 128 cores: 28 GB/s; Hopper, 768 cores: 122 GB/s.

14/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 40: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Performance issues

The new mapping tends to assign many processes to nodes at the topof the tree. One of the communication patterns became a bottleneck.

M

S1S2S3

(unsymmetric case)

M

S...

S...... w

S...

S...

S...

S...

... w... w

Baseline implementation: a loop of non-blocking sends from themaster to the processors. Communication speed is constant.Hyperion (CICT): 1.7 GB/s; Hopper (NERSC): 4.5 GB/s.Tree-based implementation: communication speed is almostproportional to the number of processors.Hyperion, 128 cores: 28 GB/s; Hopper, 768 cores: 122 GB/s.

14/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 41: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 42: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

BLAS

Task 1

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 43: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

BLASSEND

Task 1

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 44: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

BLASSEND

Task 1

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 45: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

BLAS

BLASSEND

Task 1 Task 2

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 46: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

BLASSEND

BLASSEND

Task 1 Task 2

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 47: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

BLAS

BLASSEND

BLASSEND

Task 1 Task 2 Task 3

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 48: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

FREE

BLASSEND

BLASSEND

BLASSEND

Task 1 Task 2 Task 3

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 49: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

FREE

BLASSEND

FREE

BLASSEND

BLASSEND

Task 1 Task 2

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 50: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

FREE

BLASSEND

FREE

BLASSEND

FREE

BLASSEND

Task 1

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 51: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.

Main loop:while . . . do

Probe for messageif message received then

Read tag, execute associated task(e.g., Facto panel)

end ifend while

Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then

Call Main loopend ifFree stuff

FREE

BLASSEND

FREE

BLASSEND

FREE

BLASSEND

15/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 52: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – implementation

Difficulties:• Messages representing different blocks of factors can overtake.• Deadlocks. . .

Example of messages overtaking (broadcast tree is a chain):

P0

P1

P2

P0P1P2

recursive call

tta tb tc

P2 receives the second panel before the first one!

16/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 53: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Non-blocking broadcast – experiments

Experiments on a single node of size 64000 with 64 processors:

Strategy Time (s)1-way threaded BLAS 4-way threaded BLAS

Baseline 596 468Tree-based 401 167

Better overlapping between communications and computations⇒ more potential for exploiting multithreaded BLAS / faster cores.

Quite critical in a memory-aware context since the algorithm tends toincrease the number of processes per subtree.

17/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 54: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Memory-aware mapping – more results

Back to experiments with the memory-aware mapping; matrices:

Matrix name Order Entries Factors Sseq Description; origin(millions) (millions) (GB) (GB)

cage13 0.4 7.5 30.7 17.9 Directed weighted graph; Utrecht U.as-Skitter 1.7 23.9 17.7 6.2 Internet topology graph; SNAPHV15R 2.0 283.1 366.4 87.8 CFD, 3D engine fan; FLUOREMMORANSYS1 2.7 81.3 63.5 19.1 Model Order Reduction; CADFEMmeca_raff6 3.3 130.2 63.5 8.8 Thermo-mechanical coupling; EDF

18/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 55: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Memory-aware mapping – more results

Memory-aware mapping vs. default strategy (MUMPS 4.10.0):

Matrix Mapping Smax (MB) Savg(MB) Time(s)

cage13 Default 1069 727 285MA, M0 = 380MB 305 302 498

as-Skitter Default 1484 584 578MA, M0 = 130MB 135 70 285

HV15R Default 9063∗ 8773∗ N/AMA, M0 = 2600MB 2225 1803 7169

MORANSYS1 Default 1246 834 373MA, M0 = 380MB 379 305 651

meca_raff6 Default 1078 796 324MA, M0 = 320MB 329 209 367

19/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 56: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

Conclusion – on-going work

Conclusion• The memory-aware mapping ensures a given constraint. . .• . . . and its variant with groups tries to add more parallelism.• Increasing the number of processes per node led to performanceproblems =⇒ new (dense!) algorithms.

Future work• Memory-aware mapping with groups: assess some heuristics.• We have performance models that provide the optimal number ofprocessors to use on a node, and we are working on how to injectthese models into the mapping.

• More dynamic scheduling.• Compressing CBs with low-rank approximation techniques (cf workby Clément Weisbecker and the MUMPS team).

20/21 MUMPS Users Group Meeting, May 29-30, 2013

Page 57: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations

End

Thank you for your attention!

Any questions?

Acknowledgements

• Matrices: Stéphane Operto (SEISCOPE), Olivier Boiteau (EDF).• Computational resources: Nicolas Renon, CICT, Toulouse.

21/21 MUMPS Users Group Meeting, May 29-30, 2013