domain decomposition in parallel computing ashok srinivasan asriniva florida state university cot...

Domain decomposition in parallel computing

Ashok Srinivasan

www.cs.fsu.edu/~asriniva

Florida State University

COT 5410 – Spring 2004

Outline

• Background

• Geometric partitioning

• Graph partitioning– Static– Dynamic

• Important points

Background• Tasks in a parallel computation need access to

certain data• Same datum may be needed by multiple tasks

– Example: In matrix-vector multiplication, b2 is needed for the computation of all ci2, 1 < i < n

– If a process does not “own” a datum needed by its task, then it has to get it from a process that has it

• This communication is expensive

– Aims of domain decomposition• Distribute the data in such a manner that the communication

required is minimized

• Ensure that the computational loads on processes are balanced

Domain decomposition example

• Finite difference computation– New value of a node depends on old values of its

neighbors

• We want to divide the nodes amongst the processes so that – Communication is minimized

• Measure of partition quality

– Computational load is evenly balanced

Geometric partitioning

• Partition a set of points– Uses only coordinate information

• Balances the load– The heuristic tries to ensure that communication costs

are low

• Algorithms are typically fast, but partition not of high quality

• Examples– Orthogonal recursive bisection– Inertial– Space filling curves

Orthogonal recursive bisection

• Recursively bisect orthogonal to the longest dimension– Assume communication is proportional to the surface area of

the domain, and aligned with coordinate axes– Recursive bisection

• Divide into two pieces, keeping load balanced• Apply recursively, until desired number of partitions obtained

Inertial

• ORB may not be effective if cuts along the x, y, or z directions are not good ones

• Inertial– Recursively bisect

orthogonal to the inertial axis

Space filling curves

• Space filling curves– A continuous curve that fills the space– Order the points based on their relative

position on the curve– Choose a curve that preserves proximity

• Points that are close in space should be close in the ordering too

• Example– Hilbert curve

Hilbert curve

• Sources– http://www.dcs.napier.ac.uk/~andrew/hilbert.html– http://www.fractalus.com/kerry/tutorials/hilbert/hilbert-tutorial.html

H1

H2

Hi

Hi+1 Hilbert curve = lim Hn

n

http://www.dcs.napier.ac.uk/~andrew/hilbert.html

Domain decomposition with a space filling curve

• Order points based on their position on the curve

• Divide into P parts– P is the number of processes

• Space filling curves can be used in adaptive computations too

• They can be extended to higher dimensions too

Graph partitioning

• Model as graph partitioning– Graph G = (V, E)– Each task is represented by a vertex

• A weight can be used to represent the computational effort

– An edge exists between tasks if one needs data owned by the other

• Weights can be associated with edges too

– Goal• Partition vertices into P parts such that each partition has equal

vertex weights• Minimize the weights of edges cut• Problem is NP hard

– Edge cut metric• Judge the quality of the partitioning by the number of edges cut

Static graph partitioning

• Combinatorial– Levelized nested dissection – Kernighan-Lin/Feduccia-Matheyses

• Spectral partitioning

• Multi-level methods

Combinatorial partitioning

• Use only connectivity information

• Examples– Levelized nested dissection – Kernighan-Lin/Feduccia-Matheyses

Levelized nested dissection (LND)

• Idea is similar to the geometric methods– But cannot use coordinate information– Instead of projecting vertices along the longest

axis, order them based on distance from a vertex that may be one extreme of the longest dimension of a graph

• Pseudo-peripheral vertex– Perform a breadth-first search, starting from an arbitrary

vertex

– The vertex that is encountered last might be a good approximation to a peripheral vertex

LND example Finding a pseudoperipheral vertex

Initial vertex

1

1

1

2

2

2

33

3

34

Pseudoperipheral vertex

LND example – Partitioning

5

3

5

6

4

2

53

2

1

4

Initial vertexPartition

Recursively bisect the subgraphs

Kernighan-Lin/Fiduccia-Matheyses

• Refines an existing partition• Kernighan-Lin

– Consider pairs of vertices from different partitions– Choose a pair whose swapping will result in the best improvement

in partition quality• The best improvement may actually be a worsening

– Perform several passes• Choose best partition among those encountered

• Fiduccia-Matheyses– Similar but more efficient

• Boundary Kernighan-Lin– Consider only boundary vertices to swap

• ... and many other variants

Kernighan-Lin example

Better partition

Edge cut = 3

Existing partition

Edge cut = 4

Swap these

Spectral method

• Based on the observation that a Fiedler vector of a graph contains connectivity information

• Laplacian of a graph: L– lii = di (degree of vertex i)– lij = -1 if edge {i,j} exists, otherwise 0

• Smallest eigenvalue of L is 0 with eigenvector all 1• All other eigenvalues are positive for a connected graph

• Fiedler vector– Eigenvector corresponding to the second smallest

eigenvalue

Fiedler vector

• Consider a partitioning of V into A and B– Let yi = 1 if vi A, and yi = -1 if vi B– For load balance, i yi = 0

– Also eij E (yi-yj)2 = 4 x number of edges

across partitions

– Also, yTLy = i di yi2 – 2 eij E

yiyj

= eij E (yi-yj)2

Optimization problem

• The optimal partition is obtain by solving– Minimize yTLy– Constraints:

• yi {-1,1}• i yi = 0

– This is NP hard

• Relaxed problem– Minimize yTLy– Constraints:

• i yi = 0• Add a constraint on a norm of y, example, ||y||2 = n0.5

– Note• (1, 1, ..., 1)T is an eigenvector with eigenvalue 0• For a connected graph, all other eigenvalues are positive and orthogonal

to this eigenvector, which implies i yi = 0• The objective function is minimized by a Fiedler vector

Spectral algorithm

• Find a Fiedler vector of the Laplacian of the graph– Note that the Fiedler value (the second smallest eigenvalue)

yields a lower bound on the communication cost, when the load is balanced

• From the Fiedler vector, bisect the graph– Let all vertices with components in the Fiedler vector greater

than the median be in one component, and the rest in the other

• Recursively apply this to each partition• Note: Finding the Fiedler vector of a large graph can

be time consuming

Multilevel methods

• Idea– It takes time to partition a large graph– So partition a small graph instead!

• Three phases– Graph coarsening

• Combine vertices to create a smaller graph– Example: Find a suitable matching

• Apply this recursively until a suitably small graph is obtained

– Partitioning• Use spectral or another partitioning algorithm to partition the

small graph

– Multilevel refinement• Uncoarsen the graph to get a partitioning of the original graph• At each level, perform some graph refinement

Multilevel example(without refinement)

126

11

107

4

53

2

18

9

1

1315

16

14


126

11

107

4

53

2

18

9

2

11

11

2

2

1

1

11

1315

16

14


126

11

107

4

53

2

18

9

2

11

11

2

2

1315

16

14

1

1

1


126

11

107

4

53

2

18

9

2

11

11

2

2

2

1

1

13 14

15

16 1

1

1

2

Dynamic partitioning

• We have an initial partitioning– Now, the graph changes– Determine a good partition, fast– Also minimize the number of vertices

that need to be moved

• Examples– PLUM– Jostle– Diffusion

PLUM

• Partition based on the initial mesh– Vertex and edge weights alone changed

• Map partitions to processors– Use more partitions than processors

• Ensures finer granularity

– Compute a similarity matrix based on data already on a process• Measures savings on data redistribution cost for each (process,

partition) pair• Choose assignment of partitions to processors

– Example: Maximum weight matching» Duplicate each processor: # of partitions/P times

– Alternative: Greedy approximation algorithm » Assign in order of maximum similarity value

• http://citeseer.nj.nec.com/oliker98plum.html

JOSTLE• Use Hu and Blake’s scheme for load balancing

– Solve Lx = b using Conjugate Gradient• L = Laplacian of processor graph, bi = Weight on process Pi

– Average weight

– Move max(xi-xj, 0) weight between Pi and Pj

• Leads to balanced load– Equivalent to Pi sending xi load to each neighbor j, and each

neighbor Pj sending xj to Pi

– Net loss in load for Pi = di xi - neighborj xj = L(i)x = bi

» where L(i) is row i of L, and di is degree of i– New load for Pi = weight on Pi - bi = average weight

• Leads to minimum L2 norm of load moved – Using max(xi-xj, 0)

• Select vertices to move, based on relative gain– http://citeseer.nj.nec.com/walshaw97parallel.html

Diffusion

• Involves only communication with neighbors• A simple scheme

– Processor Pi repeatedly sends wi weight to each neighbor

• wi = weight on Pi

• wk = (I – L) wk-1 , wk = weight vector at iteration k– Simple criteria exist for choosing to ensure convergence

» Example: = 0.5/(maxi di),

• More sophisticated schemes exist

Important points• Goals of domain decomposition

– Balance the load– Minimize communication

• Space filling curves• Graph partitioning model

– Spectral method• Relax NP hard integer optimization to floating point, and then

discretize to get approximate integer solution

– Multilevel methods• Three phases

• Dynamic partitioning – additional requirements– Use old solution to find new one fast– Minimize number of vertices moved

domain decomposition in parallel computing ashok srinivasan asriniva florida state university cot...

Documents

balanced slide

example hilbert curve

inertial axis slide

curve divide

continuous curve

space order

lim h n n slide

static graph