Domain decomposition in parallel computing
Ashok Srinivasan
www.cs.fsu.edu/~asriniva
Florida State University
COT 5410 – Spring 2004
Outline
• Background
• Geometric partitioning
• Graph partitioning
  – Static
  – Dynamic
• Important points
Background
• Tasks in a parallel computation need access to certain data
• The same datum may be needed by multiple tasks
  – Example: In matrix-vector multiplication $c = Ab$, $b_2$ is needed for the computation of all $c_i$, $1 \le i \le n$
  – If a process does not “own” a datum needed by its task, then it has to get it from a process that has it
    • This communication is expensive
  – Aims of domain decomposition
    • Distribute the data in such a manner that the communication required is minimized
    • Ensure that the computational loads on processes are balanced
Domain decomposition example
• Finite difference computation
  – New value of a node depends on old values of its neighbors
• We want to divide the nodes amongst the processes so that
  – Communication is minimized
    • Measure of partition quality
  – Computational load is evenly balanced
Geometric partitioning
• Partition a set of points
  – Uses only coordinate information
• Balances the load
  – The heuristic tries to ensure that communication costs are low
• Algorithms are typically fast, but the partition is not of high quality
• Examples
  – Orthogonal recursive bisection
  – Inertial
  – Space filling curves
Orthogonal recursive bisection
• Recursively bisect orthogonal to the longest dimension
  – Assume communication is proportional to the surface area of the domain, and aligned with coordinate axes
  – Recursive bisection
    • Divide into two pieces, keeping load balanced
    • Apply recursively, until the desired number of partitions is obtained
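The recursive bisection above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `orb` is hypothetical, and every point is assumed to carry unit load.

```python
def orb(points, num_parts):
    """Orthogonal recursive bisection of a list of coordinate tuples.

    Returns a list of num_parts lists of points.
    Assumption: every point carries unit load.
    """
    if num_parts == 1:
        return [points]
    if not points:
        return [[] for _ in range(num_parts)]
    dims = range(len(points[0]))

    def extent(d):
        # Length of the bounding box along dimension d.
        return max(p[d] for p in points) - min(p[d] for p in points)

    axis = max(dims, key=extent)  # cut orthogonal to the longest dimension
    ordered = sorted(points, key=lambda p: p[axis])
    # Split the points in proportion to the parts assigned to each side.
    left_parts = num_parts // 2
    cut = len(ordered) * left_parts // num_parts
    return (orb(ordered[:cut], left_parts)
            + orb(ordered[cut:], num_parts - left_parts))
```

For example, `orb([(x, y) for x in range(8) for y in range(2)], 4)` cuts the long x direction twice and returns four parts of four points each.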
Inertial
• ORB may not be effective if cuts along the x, y, or z directions are not good ones
• Inertial
  – Recursively bisect orthogonal to the inertial axis
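One bisection step of the inertial method might look as follows in 2D. This is a sketch under stated assumptions: unit point weights, two dimensions only, and the inertial axis taken as the direction of largest spread about the centroid, computed in closed form from the second moments. The name `inertial_bisect` is hypothetical.

```python
import math

def inertial_bisect(points):
    """Bisect 2D points with a cut orthogonal to the inertial (principal) axis."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Second moments about the centroid.
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Angle of the direction of largest spread (principal axis of the
    # 2x2 moment matrix).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # Project each point onto the axis and split at the median projection.
    ordered = sorted(points, key=lambda p: (p[0] - cx) * ux + (p[1] - cy) * uy)
    return ordered[: n // 2], ordered[n // 2:]
```

For points lying along a diagonal, the cut is made across the diagonal rather than along x or y, which is the case ORB handles poorly.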
Space filling curves
• Space filling curves
  – A continuous curve that fills the space
  – Order the points based on their relative position on the curve
  – Choose a curve that preserves proximity
    • Points that are close in space should be close in the ordering too
• Example
  – Hilbert curve
Hilbert curve
• Sources
  – http://www.dcs.napier.ac.uk/~andrew/hilbert.html
  – http://www.fractalus.com/kerry/tutorials/hilbert/hilbert-tutorial.html
[Figure: successive approximations $H_1$, $H_2$, ..., $H_i$, $H_{i+1}$ of the Hilbert curve]
Hilbert curve $= \lim_{n \to \infty} H_n$
Domain decomposition with a space filling curve
• Order points based on their position on the curve
• Divide into P parts
  – P is the number of processes
• Space filling curves can be used in adaptive computations too
• They can be extended to higher dimensions too
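The ordering step can be sketched with the widely used bit-manipulation formula for the 2D Hilbert index. The code assumes points are integer grid coordinates in [0, 2^order) and that each point carries unit load. The names `hilbert_index` and `sfc_partition` are hypothetical.

```python
def hilbert_index(order, x, y):
    """Position of grid cell (x, y) along a Hilbert curve on a 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def sfc_partition(points, order, num_parts):
    """Order points along the curve, then cut the ordering into num_parts chunks."""
    ordered = sorted(points, key=lambda p: hilbert_index(order, p[0], p[1]))
    n = len(ordered)
    bounds = [n * k // num_parts for k in range(num_parts + 1)]
    return [ordered[bounds[k]:bounds[k + 1]] for k in range(num_parts)]
```

Because consecutive cells on the curve are spatial neighbors, each contiguous chunk of the ordering tends to be a compact region.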
Graph partitioning
• Model as graph partitioning
  – Graph G = (V, E)
  – Each task is represented by a vertex
    • A weight can be used to represent the computational effort
  – An edge exists between tasks if one needs data owned by the other
    • Weights can be associated with edges too
  – Goal
    • Partition vertices into P parts such that each partition has equal vertex weights
    • Minimize the weights of edges cut
    • Problem is NP hard
  – Edge cut metric
    • Judge the quality of the partitioning by the number of edges cut
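The two quantities in the goal can be computed directly from a partition. The sketch below assumes edges are given as (u, v, weight) triples and that `part[v]` is the index of the part holding vertex v. The function names are hypothetical.

```python
def edge_cut(edges, part):
    """Total weight of edges whose endpoints lie in different parts."""
    return sum(w for u, v, w in edges if part[u] != part[v])

def load_imbalance(vertex_weight, part, num_parts):
    """Ratio of the heaviest part's load to the average load (1.0 is perfect)."""
    loads = [0.0] * num_parts
    for v, w in enumerate(vertex_weight):
        loads[part[v]] += w
    return max(loads) / (sum(loads) / num_parts)
```

With unit edge weights, `edge_cut` reduces to the number of edges cut.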
Static graph partitioning
• Combinatorial
  – Levelized nested dissection
  – Kernighan-Lin/Fiduccia-Mattheyses
• Spectral partitioning
• Multi-level methods
Combinatorial partitioning
• Use only connectivity information
• Examples
  – Levelized nested dissection
  – Kernighan-Lin/Fiduccia-Mattheyses
Levelized nested dissection (LND)
• Idea is similar to the geometric methods
  – But cannot use coordinate information
  – Instead of projecting vertices along the longest axis, order them based on distance from a vertex that may be one extreme of the longest dimension of a graph
• Pseudo-peripheral vertex
  – Perform a breadth-first search, starting from an arbitrary vertex
  – The vertex that is encountered last might be a good approximation to a peripheral vertex
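A minimal sketch of the idea, assuming a connected, unweighted graph stored as a dict of adjacency lists. The helper names are hypothetical. One breadth-first search is used to find the pseudo-peripheral vertex, and a second one from that vertex orders the vertices by distance before the ordering is split in half.

```python
from collections import deque

def bfs_order(adj, start):
    """Vertices in breadth-first visit order from start (connected graph assumed)."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
    return order

def pseudo_peripheral(adj, start):
    """Last vertex reached by a BFS from an arbitrary start vertex."""
    return bfs_order(adj, start)[-1]

def lnd_bisect(adj, start):
    """Order vertices by BFS from a pseudo-peripheral vertex; split in half."""
    root = pseudo_peripheral(adj, start)
    order = bfs_order(adj, root)
    half = len(order) // 2
    return set(order[:half]), set(order[half:])
```

Applying `lnd_bisect` again to each half (restricted to its induced subgraph) gives the recursive scheme.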
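LND example: finding a pseudoperipheral vertex
[Figure: vertices labeled with their breadth-first level, counted from the initial vertex; the vertex reached last is marked as the pseudoperipheral vertex]
LND example: partitioning
[Figure: vertices labeled with their breadth-first level from the initial vertex, with the resulting partition marked]
• Recursively bisect the subgraphs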
Kernighan-Lin/Fiduccia-Mattheyses
• Refines an existing partition
• Kernighan-Lin
  – Consider pairs of vertices from different partitions
  – Choose a pair whose swapping will result in the best improvement in partition quality
    • The best improvement may actually be a worsening
  – Perform several passes
    • Choose the best partition among those encountered
• Fiduccia-Mattheyses
  – Similar but more efficient
• Boundary Kernighan-Lin
  – Consider only boundary vertices to swap
• ... and many other variants
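One Kernighan-Lin pass can be sketched as below for an unweighted graph stored as a dict of neighbor sets. This is a plain quadratic-time illustration, not an optimized implementation. The names `cut_size` and `kl_pass` are hypothetical. The gain of swapping u and v is D(u) + D(v) - 2 if they are adjacent, and D(u) + D(v) otherwise, where D is the external minus internal degree.

```python
def cut_size(adj, part_a):
    """Number of edges with exactly one endpoint in part_a."""
    return sum(1 for u in part_a for v in adj[u] if v not in part_a)

def kl_pass(adj, part_a, part_b):
    """One Kernighan-Lin pass; returns the best partition encountered."""
    a, b = set(part_a), set(part_b)
    side = {v: 0 for v in a}
    side.update({v: 1 for v in b})

    def d_value(v):
        external = sum(1 for u in adj[v] if side[u] != side[v])
        return external - (len(adj[v]) - external)

    locked, swaps, gains = set(), [], []
    for _ in range(min(len(a), len(b))):
        best = None
        for u in sorted(a - locked):
            for v in sorted(b - locked):
                gain = d_value(u) + d_value(v) - 2 * (v in adj[u])
                if best is None or gain > best[0]:
                    best = (gain, u, v)
        gain, u, v = best
        # Tentatively swap the best pair, even if its gain is negative.
        side[u], side[v] = 1, 0
        locked.update((u, v))
        swaps.append((u, v))
        gains.append(gain)
    # Keep the prefix of swaps with the largest cumulative gain.
    best_k, best_total, total = 0, 0, 0
    for k, gain in enumerate(gains, start=1):
        total += gain
        if total > best_total:
            best_k, best_total = k, total
    for u, v in swaps[:best_k]:
        a.remove(u)
        a.add(v)
        b.remove(v)
        b.add(u)
    return a, b
```

Allowing negative-gain swaps inside a pass is what lets the method climb out of some local minima; only the best prefix is finally kept.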
Kernighan-Lin example
[Figure: an existing partition with edge cut = 4; swapping the marked pair of vertices gives a better partition with edge cut = 3]
Spectral method
• Based on the observation that a Fiedler vector of a graph contains connectivity information
• Laplacian of a graph: $L$
  – $l_{ii} = d_i$ (degree of vertex $i$)
  – $l_{ij} = -1$ if edge $\{i,j\}$ exists, otherwise 0
• Smallest eigenvalue of $L$ is 0, with the all-ones vector as eigenvector
• All other eigenvalues are positive for a connected graph
• Fiedler vector
  – Eigenvector corresponding to the second smallest eigenvalue
Fiedler vector
• Consider a partitioning of $V$ into $A$ and $B$
  – Let $y_i = 1$ if $v_i \in A$, and $y_i = -1$ if $v_i \in B$
  – For load balance, $\sum_i y_i = 0$
  – Also, $\sum_{e_{ij} \in E} (y_i - y_j)^2 = 4 \times$ (number of edges across partitions)
  – Also, $y^T L y = \sum_i d_i y_i^2 - 2 \sum_{e_{ij} \in E} y_i y_j = \sum_{e_{ij} \in E} (y_i - y_j)^2$
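The identity can be checked numerically on a small graph. The sketch below builds the Laplacian of a 4-cycle as a dense list of lists, which is an assumption made for readability, and evaluates the quadratic form for a balanced ±1 vector. The helper names are hypothetical.

```python
def laplacian(n, edges):
    """Dense graph Laplacian: degree on the diagonal, -1 for each edge."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L

def quadratic_form(L, y):
    """Evaluate y^T L y."""
    n = len(y)
    return sum(y[i] * L[i][j] * y[j] for i in range(n) for j in range(n))

def edges_across(edges, y):
    """Number of edges whose endpoints have different signs in y."""
    return sum(1 for i, j in edges if y[i] != y[j])
```

For the cycle 0-1-2-3-0 with y = (1, 1, -1, -1), two edges cross the partition and the quadratic form evaluates to 8, which is 4 times the cut.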
Optimization problem
• The optimal partition is obtained by solving
  – Minimize $y^T L y$
  – Constraints:
    • $y_i \in \{-1, 1\}$
    • $\sum_i y_i = 0$
  – This is NP hard
• Relaxed problem
  – Minimize $y^T L y$
  – Constraints:
    • $\sum_i y_i = 0$
    • Add a constraint on a norm of $y$, for example $\|y\|_2 = n^{0.5}$
  – Note
    • $(1, 1, \ldots, 1)^T$ is an eigenvector with eigenvalue 0
    • For a connected graph, all other eigenvalues are positive, and their eigenvectors are orthogonal to this eigenvector, which implies $\sum_i y_i = 0$
    • The objective function is minimized by a Fiedler vector
Spectral algorithm
• Find a Fiedler vector of the Laplacian of the graph
  – Note that the Fiedler value (the second smallest eigenvalue) yields a lower bound on the communication cost, when the load is balanced
• From the Fiedler vector, bisect the graph
  – Let all vertices with components in the Fiedler vector greater than the median be in one partition, and the rest in the other
• Recursively apply this to each partition
• Note: Finding the Fiedler vector of a large graph can be time consuming
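A single spectral bisection can be sketched without any library as follows. The eigenvector solver here is a swapped-in simplification: plain power iteration on the shifted matrix (c I - L) with the all-ones direction projected out, where c is twice the maximum degree. Practical codes use Lanczos-type solvers instead. The graph is assumed connected, the deterministic start vector is an arbitrary choice (a random one is more robust), and the function names are hypothetical.

```python
import math

def fiedler_vector(n, edges, iters=2000):
    """Approximate Fiedler vector by power iteration on (shift*I - L)."""
    deg = [0] * n
    adj = [[] for _ in range(n)]
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
        adj[i].append(j)
        adj[j].append(i)
    shift = 2.0 * max(deg)  # upper bound on the largest eigenvalue of L
    x = [i - (n - 1) / 2.0 for i in range(n)]  # sums to zero
    for _ in range(iters):
        # y = (shift*I - L) x, computed row by row.
        y = [(shift - deg[i]) * x[i] + sum(x[j] for j in adj[i])
             for i in range(n)]
        mean = sum(y) / n
        y = [v - mean for v in y]  # stay orthogonal to (1, ..., 1)
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

def spectral_bisect(n, edges):
    """Split vertices at the median of their Fiedler vector components."""
    f = fiedler_vector(n, edges)
    order = sorted(range(n), key=lambda i: f[i])
    return set(order[: n // 2]), set(order[n // 2:])
```

Sorting by component and cutting at the middle position is one way to apply the median rule while keeping the two parts equal in size.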
Multilevel methods
• Idea
  – It takes time to partition a large graph
  – So partition a small graph instead!
• Three phases
  – Graph coarsening
    • Combine vertices to create a smaller graph
      – Example: Find a suitable matching
    • Apply this recursively until a suitably small graph is obtained
  – Partitioning
    • Use spectral or another partitioning algorithm to partition the small graph
  – Multilevel refinement
    • Uncoarsen the graph to get a partitioning of the original graph
    • At each level, perform some graph refinement
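One coarsening level and the matching projection step can be sketched as below. The matching rule used here, visiting edges heaviest first, is one common choice and is an assumption; the slides only ask for "a suitable matching". The function names and the data layout (a list of vertex weights plus a dict from vertex pairs to edge weights) are hypothetical.

```python
def coarsen(vertex_weight, edges):
    """One coarsening level: greedy matching, then merge matched pairs.

    vertex_weight: list of weights; edges: dict {(u, v): weight}.
    Returns (coarse_weight, coarse_edges, coarse_of), where coarse_of[v]
    is the coarse vertex that fine vertex v was merged into.
    """
    n = len(vertex_weight)
    mate = [None] * n
    # Greedy matching, heaviest edges first (assumed heuristic).
    for (u, v), _ in sorted(edges.items(), key=lambda e: -e[1]):
        if mate[u] is None and mate[v] is None:
            mate[u], mate[v] = v, u
    coarse_of = [None] * n
    coarse_weight = []
    for u in range(n):
        if coarse_of[u] is None:
            coarse_of[u] = len(coarse_weight)
            w = vertex_weight[u]
            if mate[u] is not None:
                coarse_of[mate[u]] = coarse_of[u]
                w += vertex_weight[mate[u]]
            coarse_weight.append(w)
    # Sum the weights of fine edges that join different coarse vertices.
    coarse_edges = {}
    for (u, v), w in edges.items():
        cu, cv = coarse_of[u], coarse_of[v]
        if cu != cv:
            key = (min(cu, cv), max(cu, cv))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return coarse_weight, coarse_edges, coarse_of

def project(coarse_part, coarse_of):
    """Uncoarsen: each fine vertex inherits the part of its coarse vertex."""
    return [coarse_part[c] for c in coarse_of]
```

Summing vertex and edge weights during merging means that the balance and edge cut of a coarse partition equal those of its projection onto the finer graph.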
Multilevel example (without refinement)
[Figure sequence, five slides: an example graph with vertices labeled 1 to 16; later slides add weights of 1 and 2. The drawings themselves are not recoverable from the transcript.]
Dynamic partitioning
• We have an initial partitioning
  – Now, the graph changes
  – Determine a good partition, fast
  – Also minimize the number of vertices that need to be moved
• Examples
  – PLUM
  – Jostle
  – Diffusion
PLUM
• Partition based on the initial mesh
  – Vertex and edge weights alone changed
• Map partitions to processors
  – Use more partitions than processors
    • Ensures finer granularity
  – Compute a similarity matrix based on data already on a process
    • Measures savings on data redistribution cost for each (process, partition) pair
    • Choose assignment of partitions to processors
      – Example: Maximum weight matching
        » Duplicate each processor: # of partitions/P times
      – Alternative: Greedy approximation algorithm
        » Assign in order of maximum similarity value
• http://citeseer.nj.nec.com/oliker98plum.html
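The greedy alternative can be illustrated as follows. This is a sketch of the idea on the slide, not PLUM's actual code: `similarity[p][j]` is assumed to hold the amount of partition j's data already on processor p, each processor is assumed to receive exactly `parts_per_proc` partitions, and the function name is hypothetical.

```python
def greedy_assign(similarity, parts_per_proc):
    """Assign partitions to processors in order of decreasing similarity.

    Assumes number of partitions == number of processors * parts_per_proc.
    Returns owner, where owner[j] is the processor assigned partition j.
    """
    num_procs = len(similarity)
    num_parts = len(similarity[0])
    entries = sorted(
        ((similarity[p][j], p, j)
         for p in range(num_procs) for j in range(num_parts)),
        key=lambda e: -e[0],
    )
    owner = [None] * num_parts
    count = [0] * num_procs
    for _, p, j in entries:
        # Take the entry if the partition is still free and the
        # processor has room for another partition.
        if owner[j] is None and count[p] < parts_per_proc:
            owner[j] = p
            count[p] += 1
    return owner
```

For `similarity = [[5, 1, 0, 2], [4, 6, 3, 1]]` and two partitions per processor, the result is `[0, 1, 1, 0]`: the largest entries are honored first, and later entries are skipped once a partition is taken or a processor is full.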
JOSTLE
• Use Hu and Blake’s scheme for load balancing
  – Solve $Lx = b$ using Conjugate Gradient
    • $L$ = Laplacian of processor graph, $b_i$ = weight on process $P_i$ minus the average weight
  – Move $\max(x_i - x_j, 0)$ weight from $P_i$ to $P_j$
    • Leads to balanced load
      – Equivalent to $P_i$ sending $x_i$ load to each neighbor $P_j$, and each neighbor $P_j$ sending $x_j$ to $P_i$
      – Net loss in load for $P_i$ = $d_i x_i - \sum_{\text{neighbor } j} x_j = L_{(i)} x = b_i$
        » where $L_{(i)}$ is row $i$ of $L$, and $d_i$ is the degree of $i$
      – New load for $P_i$ = weight on $P_i$ $- b_i$ = average weight
    • Leads to minimum L2 norm of load moved
      – Using $\max(x_i - x_j, 0)$
• Select vertices to move, based on relative gain
  – http://citeseer.nj.nec.com/walshaw97parallel.html
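The load-balancing step can be sketched as below. This illustrates the scheme as described on the slide and is not JOSTLE's code: the processor graph is assumed connected and stored as a list of adjacency lists, the conjugate gradient loop is hand-written and started from zero, and the function names are hypothetical.

```python
def laplacian_apply(adj, x):
    """Compute L x for the graph Laplacian without forming L."""
    return [len(adj[i]) * x[i] - sum(x[j] for j in adj[i])
            for i in range(len(x))]

def conjugate_gradient(adj, b, tol=1e-20, max_iter=1000):
    """Solve L x = b for a connected graph; b must sum to zero."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(b)
    rr = sum(v * v for v in r)
    for _ in range(max_iter):
        if rr < tol:
            break
        lp = laplacian_apply(adj, p)
        alpha = rr / sum(pi * qi for pi, qi in zip(p, lp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, lp)]
        rr_new = sum(v * v for v in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def hu_blake_flows(adj, load):
    """Return {(i, j): amount} of load to move from P_i to neighbor P_j."""
    avg = sum(load) / len(load)
    b = [w - avg for w in load]
    x = conjugate_gradient(adj, b)
    flows = {}
    for i, nbrs in enumerate(adj):
        for j in nbrs:
            if x[i] > x[j]:
                flows[(i, j)] = x[i] - x[j]
    return flows
```

On a chain of three processors with loads 6, 3, 0, the result moves 3 units from the first to the second and 3 from the second to the third, leaving every processor with the average load of 3.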
Diffusion
• Involves only communication with neighbors
• A simple scheme
  – Processor $P_i$ repeatedly sends $\alpha w_i$ weight to each neighbor
    • $w_i$ = weight on $P_i$
    • $w^k = (I - \alpha L) w^{k-1}$, where $w^k$ is the weight vector at iteration $k$
  – Simple criteria exist for choosing $\alpha$ to ensure convergence
    » Example: $\alpha = 0.5/(\max_i d_i)$
• More sophisticated schemes exist
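The simple scheme amounts to repeated local averaging. The sketch below applies the update once per step, with the step size chosen as in the example criterion above. The processor graph is assumed connected and stored as a list of adjacency lists, and the function name is hypothetical.

```python
def diffuse(adj, load, steps):
    """Apply w <- (I - alpha*L) w for the given number of steps."""
    alpha = 0.5 / max(len(nbrs) for nbrs in adj)
    w = list(load)
    for _ in range(steps):
        # Each processor exchanges alpha * (difference in load)
        # with every neighbor.
        w = [w[i] - alpha * sum(w[i] - w[j] for j in adj[i])
             for i in range(len(w))]
    return w
```

The total load is conserved at every step, and on a connected graph the loads converge to the average; the number of steps needed depends on the graph.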
Important points
• Goals of domain decomposition
  – Balance the load
  – Minimize communication
• Space filling curves
• Graph partitioning model
  – Spectral method
    • Relax NP hard integer optimization to floating point, and then discretize to get approximate integer solution
  – Multilevel methods
    • Three phases
• Dynamic partitioning – additional requirements
  – Use old solution to find new one fast
  – Minimize number of vertices moved