
Page 1: Parallel Numerical Algorithms - Edgar Solomonik


Parallel Numerical Algorithms
Chapter 4 – Sparse Linear Systems

Section 4.1 – Direct Methods

Michael T. Heath and Edgar Solomonik

Department of Computer Science
University of Illinois at Urbana-Champaign

CS 554 / CSE 512

Michael T. Heath and Edgar Solomonik Parallel Numerical Algorithms 1 / 63


Outline

1 Sparse Matrices

2 Sparse Triangular Solve

3 Cholesky Factorization

4 Sparse Cholesky Factorization


Sparse Matrices

Matrix is sparse if most of its entries are zero

For efficiency, store and operate on only nonzero entries, e.g., the product ajk · xk need not be computed if ajk = 0

But the more complicated data structures required incur extra overhead in storage and arithmetic operations

Matrix is “usefully” sparse if it contains enough zero entries to be worth taking advantage of them to reduce the storage and work required

In practice, grid discretizations often yield matrices with Θ(n) nonzero entries, i.e., a (small) constant number of nonzeros per row or column


Graph Model

Adjacency graph G(A) of symmetric n × n matrix A is undirected graph having n vertices, with an edge between vertices i and j if aij ≠ 0

Number of edges in G(A) is the number of nonzeros in the strict lower (or upper) triangle of A

For a grid-based discretization, G(A) is the grid

Adjacency graph provides visual representation of algorithms and highlights connections between numerical and combinatorial algorithms

For nonsymmetric A, G(A) would be directed

Often convenient to think of aij as the weight of edge (i, j)


Sparse Matrix Representations

Coordinate (COO) format (naive) – store each nonzero along with its row and column index

Compressed-sparse-row (CSR) format

Store value and column index for each nonzero

Store index of first nonzero for each row

Storage for CSR is less than for COO, and the CSR ordering is often convenient

CSC (compressed-sparse-column), blocked versions (e.g., CSB), and other storage formats are also used
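As a concrete illustration (a sketch in plain Python, not code from the slides), the same matrix can be held in COO form as triples and converted to the three CSR arrays described above:

```python
# Sketch (not from the slides): COO and CSR representations of the same
# 4 x 4 sparse matrix, using plain Python lists.

# COO: one (row, col, value) triple per nonzero
coo = [(0, 0, 4.0), (1, 1, 3.0), (2, 0, 1.0), (2, 2, 5.0), (3, 3, 2.0)]

# CSR: values and column indices in row order, plus row_ptr giving the
# index of the first nonzero of each row (row_ptr has n + 1 entries)
n = 4
vals, cols, row_ptr = [], [], [0]
for i in range(n):
    for (r, c, v) in coo:
        if r == i:
            vals.append(v)
            cols.append(c)
    row_ptr.append(len(vals))

print(vals)     # [4.0, 3.0, 1.0, 5.0, 2.0]
print(cols)     # [0, 1, 0, 2, 3]
print(row_ptr)  # [0, 1, 2, 4, 5]
```

COO stores three numbers per nonzero; CSR stores two per nonzero plus n + 1 row pointers, which is where the storage saving comes from.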


Sparse Matrix Distributions

Dense matrix mappings can be adapted to sparse matrices

1-D blocked mapping – store all nonzeros in n/p consecutive rows on each processor

1-D cyclic or randomized mapping – store all nonzeros in some subset of n/p rows on each processor

2-D blocked mapping – store all nonzeros in an (n/√p) × (n/√p) block of the matrix

1-D blocked mappings are best for exploiting locality in the graph, especially when there are Θ(n) nonzeros

Row ordering matters for all mappings: randomization and cyclicity yield load balance, while blocking can yield locality


Sparse Matrix Vector Multiplication

Sparse matrix–vector multiplication (SpMV) is

y = Ax

where A is sparse and x is dense

CSR-based matrix–vector product: for all i (in parallel) do

yi = ∑j ai,c(j) xc(j) = ∑_{j=1}^{n} aij xj

where c(j) is the column index of the jth nonzero in row i

For random 1-D or 2-D mapping, cost of vector communication is same as in corresponding dense case
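The CSR product above can be sketched directly (assuming the vals/cols/row_ptr layout introduced earlier; this is an illustrative sketch, not slide code):

```python
# Sketch: y = A x with A in CSR form; only stored nonzeros contribute.

def spmv_csr(n, vals, cols, row_ptr, x):
    y = [0.0] * n
    for i in range(n):                               # each row is independent,
        for k in range(row_ptr[i], row_ptr[i + 1]):  # so rows can run in parallel
            y[i] += vals[k] * x[cols[k]]             # a_{i,c(k)} * x_{c(k)}
    return y

# 3 x 3 example: A = [[2, 0, 1], [0, 3, 0], [0, 0, 4]]
vals, cols, row_ptr = [2.0, 1.0, 3.0, 4.0], [0, 2, 1, 2], [0, 2, 3, 4]
print(spmv_csr(3, vals, cols, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 4.0]
```

The outer loop over rows carries no dependence, which is exactly the parallelism the slide refers to.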


SpMV with 1-D Mapping

For 1-D blocking (each processor owns n/p rows), the number of elements of x needed by a processor is the number of columns with a nonzero in the rows it owns

In general, want to order rows to minimize the maximum number of vector elements needed on any processor

Graphically, we want to partition the graph into p subsets of n/p nodes each, minimizing the maximum number of nodes to which any subset is connected, i.e., for G(A) = (V, E), the partition

V = V1 ∪ · · · ∪ Vp,  |Vi| = n/p

is selected to minimize

max_i |{v : v ∈ V \ Vi, ∃w ∈ Vi, (v, w) ∈ E}|
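This objective can be computed directly for a small graph (a sketch; the helper name and graph encoding are mine, not from the slides):

```python
# Sketch (hypothetical helper): the quantity being minimized above, i.e.,
# for each part V_i, count the external vertices adjacent to V_i.

def max_boundary(parts, edges):
    """parts: list of vertex sets; edges: set of undirected (u, v) pairs."""
    worst = 0
    for vi in parts:
        ext = {v for (u, v) in edges if u in vi and v not in vi}
        ext |= {u for (u, v) in edges if v in vi and u not in vi}
        worst = max(worst, len(ext))
    return worst

# Path graph 0-1-2-3: split {0,1} | {2,3} touches 1 external vertex per part,
# while the interleaved split {0,2} | {1,3} touches 2.
edges = {(0, 1), (1, 2), (2, 3)}
print(max_boundary([{0, 1}, {2, 3}], edges))  # 1
print(max_boundary([{0, 2}, {1, 3}], edges))  # 2
```

The contiguous split is better here, matching the intuition that blocked row orderings exploit graph locality.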


Surface Area to Volume Ratio in SpMV

The number of external vertices the maximum partition is adjacent to depends on the expansion of the graph

Expansion can be interpreted as a measure of the surface-area-to-volume ratio of the subgraphs

For example, for a k × k × k grid, a subvolume of size (k/p^{1/3}) × (k/p^{1/3}) × (k/p^{1/3}) has surface area Θ(k²/p^{2/3})

Communication for this case becomes a neighbor halo exchange on a 3-D processor mesh

Thus, finding the best 1-D partitioning for SpMV often corresponds to domain partitioning and depends on the physical geometry of the problem


Other Sparse Matrix Products

SpMV is of critical importance to many numerical methods, but suffers from a low flop-to-byte ratio and a potentially high communication bandwidth cost

In graph algorithms, SpMSpV (x and y are sparse) is prevalent, which is even harder to perform efficiently (e.g., to minimize work, need a layout other than CSR, like CSC)

SpMM (x becomes dense matrix X) provides a higher flop-to-byte ratio and is much easier to do efficiently

SpGEMM (SpMSpM, matrix multiplication where all matrices are sparse) arises in, e.g., algebraic multigrid and graph algorithms; efficiency is highly dependent on sparsity


Solving Triangular Sparse Linear Systems

Given sparse lower-triangular matrix L and vector b, solve

Lx = b

(all nonzeros of L must be in its lower-triangular part)

Sequential algorithm: for i = 1 to n, take xi = bi/lii, then update

bj = bj − lji xi  for all j ∈ {i + 1, . . . , n} with lji ≠ 0

If L has m > n nonzeros, this requires Q1 ≈ 2m operations
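The sequential algorithm above can be sketched with L stored by columns (the storage convention here, diagonal entry first in each column list, is an assumption for illustration, not from the slides):

```python
# Sketch: forward substitution Lx = b touching only stored nonzeros,
# so roughly 2m operations for m nonzeros.

def sparse_lower_solve(n, cols, b):
    """cols[i]: nonzeros of column i as (row j, l_ji) pairs, diagonal first."""
    b = list(b)                              # work on a copy of the RHS
    x = [0.0] * n
    for i in range(n):
        x[i] = b[i] / cols[i][0][1]          # x_i = b_i / l_ii
        for (j, lji) in cols[i][1:]:
            b[j] -= lji * x[i]               # b_j = b_j - l_ji x_i
    return x

# L = [[2, 0, 0], [0, 1, 0], [4, 0, 2]], b = [2, 3, 6]  ->  x = [1, 3, 1]
cols = [[(0, 2.0), (2, 4.0)], [(1, 1.0)], [(2, 2.0)]]
print(sparse_lower_solve(3, cols, [2.0, 3.0, 6.0]))  # [1.0, 3.0, 1.0]
```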


Parallelism in Sparse Triangular Solve

We can adapt any dense parallel triangular solve algorithm to the sparse case

Again have fan-in (left-looking) and fan-out (right-looking) variants

Communication cost stays the same, while computational cost decreases

In fact, there may be additional sources of parallelism, e.g., if l21 = 0, we can solve for x1 and x2 concurrently

More generally, can concurrently prune leaves of the directed acyclic adjacency graph (DAG) G(L) = (V, E), where (i, j) ∈ E if lij ≠ 0

Depth of algorithm corresponds to the longest path in this DAG
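One standard way to expose this extra parallelism is level scheduling: group the unknowns into "levels" so that everything in a level depends only on earlier levels. A sketch (the dependence encoding and function name are mine, not slide code):

```python
# Sketch: group unknowns of Lx = b into levels; all unknowns in a level
# can be solved concurrently, and the number of levels is the DAG depth.

def solve_levels(n, deps):
    """deps[i]: set of j < i with l_ij != 0 (x_i depends on those x_j)."""
    level = [0] * n
    for i in range(n):
        level[i] = 1 + max((level[j] for j in deps[i]), default=0)
    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]

# l_21 = 0 (0-based: deps[1] is empty), so x_0 and x_1 form one level
deps = [set(), set(), {0, 1}, {2}]
print(solve_levels(4, deps))  # [[0, 1], [2], [3]]
```

Here the first two unknowns are independent, exactly the l21 = 0 situation from the slide.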


Parallel Algorithm for Sparse Triangular Solve

Partition: associate fine-grain tasks with each (i, j) such that lij ≠ 0

Communicate: task (i, i) communicates with tasks (j, i) and (i, j) for all possible j

Agglomerate: form coarse-grain tasks for each column of L, i.e., do 1-D agglomeration, combining fine-grain tasks (⋆, i) into agglomerated task i

Map: assign coarse-grain tasks (columns of L) to processors with blocking (for locality) and/or cyclicity (for load balance and concurrency)


Analysis of 1-D Parallel Sparse Triangular Solve

Cost of 1-D algorithm will clearly be less than that of the corresponding algorithm for the dense case

Load balance depends on distribution of nonzeros; cyclicity can help distribute dense blocks

Naive algorithm with 1-D column blocking exploits concurrency only in fan-out updates

Communication bandwidth cost depends on surface-to-volume ratio of each subset of vertices associated with a block of columns

Higher concurrency and better performance possible with dynamic/adaptive algorithms


Cholesky Factorization

Symmetric positive definite matrix A has Cholesky factorization

A = LLT

where L is a lower triangular matrix with positive diagonal entries

Linear system

Ax = b

can then be solved by forward-substitution in lower triangular system Ly = b, followed by back-substitution in upper triangular system LTx = y


Computing Cholesky Factorization

Algorithm for computing Cholesky factorization can be derived by equating corresponding entries of A and LLT and generating them in correct order

For example, in the 2 × 2 case,

[ a11 a21 ]   [ ℓ11  0  ] [ ℓ11 ℓ21 ]
[ a21 a22 ] = [ ℓ21 ℓ22 ] [  0  ℓ22 ]

so we have

ℓ11 = √a11,  ℓ21 = a21/ℓ11,  ℓ22 = √(a22 − ℓ21²)


Cholesky Factorization Algorithm

for k = 1 to n
    akk = √akk
    for i = k + 1 to n
        aik = aik/akk
    end
    for j = k + 1 to n
        for i = j to n
            aij = aij − aik ajk
        end
    end
end
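The loop nest transcribes directly into Python (a dense, in-place sketch for illustration):

```python
# Sketch: a direct transcription of the loop nest above.
# Only the lower triangle of a is read, and it is overwritten by L.

import math

def cholesky_inplace(a):
    n = len(a)
    for k in range(n):
        a[k][k] = math.sqrt(a[k][k])
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
        for j in range(k + 1, n):
            for i in range(j, n):
                a[i][j] -= a[i][k] * a[j][k]
    return a

# A = L L^T with L = [[2, 0], [1, 3]]  ->  A = [[4, 2], [2, 10]]
a = [[4.0, 2.0], [2.0, 10.0]]
cholesky_inplace(a)
print(a[0][0], a[1][0], a[1][1])  # 2.0 1.0 3.0
```

Note that the strict upper triangle is never touched, matching the observation on the next slide that it need not be stored.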


Cholesky Factorization Algorithm

All n square roots are of positive numbers, so algorithm is well defined

Only lower triangle of A is accessed, so strict upper triangular portion need not be stored

Factor L computed in place, overwriting lower triangle of A

Pivoting is not required for numerical stability

About n³/6 multiplications and a similar number of additions are required (about half as many as for LU)


Parallel Algorithm

Partition

For i, j = 1, . . . , n, fine-grain task (i, j) stores aij and computes and stores

ℓij if i ≥ j,  ℓji if i < j

yielding a 2-D array of n² fine-grain tasks

Zero entries in upper triangle of L need not be computed or stored, so for convenience in using a 2-D mesh network, ℓij can be redundantly computed by both task (i, j) and task (j, i) for i > j


Fine-Grain Tasks and Communication

[Figure: 6 × 6 array of fine-grain tasks, one per entry aij/ℓij, with tasks for symmetric pairs (i, j) and (j, i) duplicated, and communication between them.]

Fine-Grain Parallel Algorithm

for k = 1 to min(i, j) − 1
    recv broadcast of akj from task (k, j)
    recv broadcast of aik from task (i, k)
    aij = aij − aik akj
end
if i = j then
    aii = √aii
    broadcast aii to tasks (k, i) and (i, k), k = i + 1, . . . , n
else if i < j then
    recv broadcast of aii from task (i, i)
    aij = aij/aii
    broadcast aij to tasks (k, j), k = i + 1, . . . , n
else
    recv broadcast of ajj from task (j, j)
    aij = aij/ajj
    broadcast aij to tasks (i, k), k = j + 1, . . . , n
end


Agglomeration Schemes

Agglomerate

Agglomeration of fine-grain tasks produces

2-D
1-D column
1-D row

parallel algorithms analogous to those for LU factorization, with similar performance and scalability


Loop Orderings for Cholesky

Each choice of i, j, or k index in outer loop yields a different Cholesky algorithm, named for the portion of the matrix updated by the basic operation in the inner loops

Submatrix-Cholesky (fan-out): with k in outer loop, inner loops perform rank-1 update of remaining unreduced submatrix using current column

Column-Cholesky (fan-in): with j in outer loop, inner loops compute current column using matrix-vector product that accumulates effects of previous columns

Row-Cholesky (fan-in): with i in outer loop, inner loops compute current row by solving triangular system involving previous rows


Memory Access Patterns

[Figure: memory access patterns (read-only vs. read-and-write regions of the matrix) for Submatrix-Cholesky, Column-Cholesky, and Row-Cholesky.]


Column-Oriented Cholesky Algorithms

Submatrix-Cholesky

for k = 1 to n
    akk = √akk
    for i = k + 1 to n
        aik = aik/akk
    end
    for j = k + 1 to n
        for i = j to n
            aij = aij − aik ajk
        end
    end
end

Column-Cholesky

for j = 1 to n
    for k = 1 to j − 1
        for i = j to n
            aij = aij − aik ajk
        end
    end
    ajj = √ajj
    for i = j + 1 to n
        aij = aij/ajj
    end
end


Column Operations

Column-oriented algorithms can be stated more compactly by introducing column operations

cdiv( j ): column j is divided by square root of its diagonal entry

ajj = √ajj
for i = j + 1 to n
    aij = aij/ajj
end

cmod( j, k ): column j is modified by multiple of column k, with k < j

for i = j to n
    aij = aij − aik ajk
end
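These two operations can be sketched as Python functions on a dense lower triangle, with column-Cholesky then written in terms of them (illustrative, not slide code):

```python
# Sketch: cdiv/cmod as functions, then left-looking column-Cholesky.

import math

def cdiv(a, j):
    """Divide column j by the square root of its diagonal entry."""
    n = len(a)
    a[j][j] = math.sqrt(a[j][j])
    for i in range(j + 1, n):
        a[i][j] /= a[j][j]

def cmod(a, j, k):
    """Modify column j by a multiple of column k, with k < j."""
    n = len(a)
    for i in range(j, n):
        a[i][j] -= a[i][k] * a[j][k]

def column_cholesky(a):
    n = len(a)
    for j in range(n):
        for k in range(j):
            cmod(a, j, k)
        cdiv(a, j)
    return a

a = [[4.0, 0.0], [2.0, 10.0]]
column_cholesky(a)
print(a[0][0], a[1][0], a[1][1])  # 2.0 1.0 3.0
```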


Column-Oriented Cholesky Algorithms

Submatrix-Cholesky

for k = 1 to n
    cdiv(k)
    for j = k + 1 to n
        cmod( j, k)
    end
end

right-looking, immediate-update, data-driven, fan-out

Column-Cholesky

for j = 1 to n
    for k = 1 to j − 1
        cmod( j, k)
    end
    cdiv( j)
end

left-looking, delayed-update, demand-driven, fan-in


Data Dependences

[Figure: data dependences of cdiv(k): it consumes cmod(k, 1), cmod(k, 2), . . . , cmod(k, k − 1) and enables cmod(k + 1, k), cmod(k + 2, k), . . . , cmod(n, k).]


Data Dependences

cmod(k, ⋆) operations along the bottom can be done in any order, but they all have the same target column, so updating must be coordinated to preserve data integrity

cmod(⋆, k) operations along the top can be done in any order, and they all have different target columns, so updating can be done simultaneously

Performing cmods concurrently is the most important source of parallelism in column-oriented factorization algorithms

For a dense matrix, each cdiv(k) depends on the immediately preceding column, so cdivs must be done sequentially


Sparsity Structure

For sparse matrix M, let Mi⋆ denote its ith row and M⋆j its jth column

Define Struct(Mi⋆) = {k < i | mik ≠ 0}, the nonzero structure of row i of the strict lower triangle of M

Define Struct(M⋆j) = {k > j | mkj ≠ 0}, the nonzero structure of column j of the strict lower triangle of M


Sparse Cholesky Algorithms

Submatrix-Cholesky

for k = 1 to n
    cdiv(k)
    for j ∈ Struct(L⋆k)
        cmod( j, k)
    end
end

right-looking, immediate-update, data-driven, fan-out

Column-Cholesky

for j = 1 to n
    for k ∈ Struct(Lj⋆)
        cmod( j, k)
    end
    cdiv( j)
end

left-looking, delayed-update, demand-driven, fan-in


Graph Model

Recall that adjacency graph G(A) of symmetric n × n matrix A is undirected graph with an edge between vertices i and j if aij ≠ 0

At each step of Cholesky factorization algorithm, corresponding vertex is eliminated from graph

Neighbors of eliminated vertex in previous graph become clique (fully connected subgraph) in modified graph

Entries of A that were initially zero may become nonzero entries, called fill
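This elimination-graph process can be simulated directly to count fill (a sketch of the standard simulation, not slide code; the "arrow" example anticipates the ordering discussion later in the chapter):

```python
# Sketch: eliminate vertices in order; the remaining neighbors of each
# eliminated vertex become a clique, and each edge so created is fill.

def count_fill(n, edges, order):
    adj = {v: set() for v in range(n)}
    for (u, v) in edges:
        adj[u].add(v)
        adj[v].add(u)
    fill = 0
    eliminated = set()
    for v in order:
        nbrs = [w for w in adj[v] if w not in eliminated]
        for a in nbrs:                 # make remaining neighbors a clique
            for b in nbrs:
                if a < b and b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        eliminated.add(v)
    return fill

# "Arrow" matrix on 4 vertices: hub vertex connected to all others
star = {(0, 1), (0, 2), (0, 3)}
print(count_fill(4, star, [0, 1, 2, 3]))  # 3  (hub first: complete fill)
print(count_fill(4, star, [1, 2, 3, 0]))  # 0  (hub last: no fill)
```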


Example: Graph Model of Elimination

[Figure: nonzero patterns of A and its factor L (fill entries marked +) for a 9-vertex grid example, together with the sequence of elimination graphs as vertices 1, 2, 3, . . . are eliminated.]

Elimination Tree

parent( j) is the row index of the first offdiagonal nonzero in column j of L, if any, and j otherwise

Elimination tree T(A) is graph having n vertices, with an edge between vertices i and j, for i > j, if i = parent( j)

If matrix is irreducible, then elimination tree is single tree with root at vertex n; otherwise, it is more accurately termed an elimination forest

T(A) is spanning tree for filled graph F(A), which is G(A) with all fill edges added

Each column of Cholesky factor L depends only on its descendants in elimination tree
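Given the structure of L, the parent function above is a one-liner per column (a sketch; the structure encoding is an assumption for illustration):

```python
# Sketch: parent(j) = row of the first off-diagonal nonzero in column j
# of L, or j itself if column j has no subdiagonal nonzeros (a root).

def etree_parents(n, col_struct):
    """col_struct[j]: sorted row indices i > j with l_ij != 0."""
    return [col_struct[j][0] if col_struct[j] else j for j in range(n)]

# L (0-based) with subdiagonal nonzeros l_20 and l_21:
# columns 0 and 1 both have their first subdiagonal nonzero in row 2
col_struct = [[2], [2], []]
print(etree_parents(3, col_struct))  # [2, 2, 2]  (single tree rooted at 2)
```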


Example: Elimination Tree

[Figure: for the same 9-vertex example, nonzero patterns of A and L (fill marked +), with graph G(A), filled graph F(A), and elimination tree T(A).]

Effect of Matrix Ordering

Amount of fill depends on order in which variables are eliminated

Example: “arrow” matrix — if first row and column are dense, then the factor fills in completely, but if last row and column are dense, then they cause no fill

[Figure: nonzero patterns of the two arrow matrices and their Cholesky factors.]

Ordering Heuristics

General problem of finding an ordering that minimizes fill is NP-complete, but there are relatively cheap heuristics that limit fill effectively

Bandwidth or profile reduction: reduce distance of nonzero diagonals from main diagonal (e.g., RCM)

Minimum degree: eliminate node having fewest neighbors first

Nested dissection: recursively split graph into pieces using a vertex separator, numbering separator vertices last


Symbolic Factorization

For symmetric positive definite (SPD) matrices, ordering can be determined in advance of numeric factorization

Only locations of nonzeros matter, not their numerical values, since pivoting is not required for numerical stability

Once ordering is selected, locations of all fill entries in L can be anticipated, and an efficient static data structure can be set up to accommodate them prior to numeric factorization

Structure of column j of L is given by union of structures of lower triangular portion of column j of A and of prior columns of L whose first nonzero below the diagonal is in row j
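The union rule above can be sketched directly (an illustrative sketch; the encoding of the structures is an assumption, not slide code):

```python
# Sketch: Struct(L_*j) = Struct(A_*j) merged with Struct(L_*k) \ {j} for
# each prior column k whose first subdiagonal nonzero is in row j.

def symbolic_cholesky(n, a_struct):
    """a_struct[j]: set of rows i > j with a_ij != 0 (lower triangle of A)."""
    l_struct = [set() for _ in range(n)]
    children = [[] for _ in range(n)]
    for j in range(n):
        s = set(a_struct[j])
        for k in children[j]:
            s |= l_struct[k] - {j}
        l_struct[j] = s
        if s:
            children[min(s)].append(j)   # parent(j) = first subdiagonal row
    return l_struct

# A (0-based) with subdiagonal nonzeros a_10, a_30, a_21, a_32:
# column 1 of L gains a fill entry l_31 inherited from column 0
a_struct = [{1, 3}, {2}, {3}, set()]
print(symbolic_cholesky(4, a_struct))
```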


Solving Sparse SPD Systems

Basic steps in solving sparse SPD systems by Cholesky factorization:

1 Ordering: Symmetrically reorder rows and columns of matrix so Cholesky factor suffers relatively little fill

2 Symbolic factorization: Determine locations of all fill entries and allocate data structures in advance to accommodate them

3 Numeric factorization: Compute numeric values of entries of Cholesky factor

4 Triangular solve: Compute solution by forward- and back-substitution


Parallel Sparse Cholesky

In sparse submatrix- or column-Cholesky, if ajk = 0, then cmod( j, k) is omitted

Sparse factorization thus has an additional source of parallelism, since “missing” cmods may permit multiple cdivs to be done simultaneously

Elimination tree shows data dependences among columns of Cholesky factor L, and hence identifies potential parallelism

At any point in factorization process, all factor columns corresponding to leaves of the elimination tree can be computed simultaneously


Parallel Sparse Cholesky

Height of elimination tree determines longest serial path through computation, and hence parallel execution time

Width of elimination tree determines degree of parallelism available

Short, wide, well-balanced elimination tree is desirable for parallel factorization

Structure of elimination tree depends on ordering of matrix

So ordering should be chosen both to preserve sparsity and to enhance parallelism


Levels of Parallelism in Sparse Cholesky

Fine-grain
Task is one multiply-add pair
Available in either dense or sparse case
Difficult to exploit effectively in practice

Medium-grain
Task is one cmod or cdiv
Available in either dense or sparse case
Accounts for most of speedup in dense case

Large-grain
Task computes entire set of columns in subtree of elimination tree
Available only in sparse case


Example: Band Ordering, 1-D Grid

[Figure: band ordering of a 7-vertex 1-D grid: nonzero patterns of A and L, graph G(A), and elimination tree T(A).]

Example: Minimum Degree, 1-D Grid

[Figure: minimum degree ordering of the 7-vertex 1-D grid: nonzero patterns of A and L, graph G(A), and elimination tree T(A).]

Example: Nested Dissection, 1-D Grid

[Figure: nested dissection ordering of the 7-vertex 1-D grid: nonzero patterns of A and L (fill marked +), graph G(A), and elimination tree T(A).]

Example: Band Ordering, 2-D Grid

[Figure: band ordering of a 3 × 3 grid: nonzero patterns of A and L (fill marked +), graph G(A), and elimination tree T(A).]

Example: Minimum Degree, 2-D Grid

[Figure: minimum degree ordering of the 3 × 3 grid: nonzero patterns of A and L (fill marked +), graph G(A), and elimination tree T(A).]

Example: Nested Dissection, 2-D Grid

[Figure: nested dissection ordering of the 3 × 3 grid: nonzero patterns of A and L (fill marked +), graph G(A), and elimination tree T(A).]

Mapping

Cyclic mapping of columns to processors works well for dense problems, because it balances load and communication is global anyway

To exploit locality in communication for sparse factorization, better approach is to map columns in subtree of elimination tree onto local subset of processors

Still use cyclic mapping within dense submatrices (“supernodes”)
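The elimination tree T(A) that drives the subtree mapping can be derived from the nonzero structure of L via the characterization parent(j) = min{ i > j : ℓij ≠ 0 }. A minimal sketch (hypothetical helper, not from the slides):

```python
# Hypothetical helper (not from the slides): build the elimination tree
# T(A) from the nonzero structure of the Cholesky factor L, using
# parent(j) = min{ i > j : l_ij != 0 }.
def elimination_tree(struct_L):
    """struct_L[j] = set of row indices i > j with l_ij nonzero."""
    return [min(s) if s else None for s in struct_L]

# Arrow-shaped factor on 4 unknowns: each column's only subdiagonal
# nonzero lies in the last row, so the tree is a star rooted at node 3.
parents = elimination_tree([{3}, {3}, {3}, set()])
# parents == [3, 3, 3, None]
```

Subtrees of the resulting parent array can then be assigned to disjoint processor subsets, with only the path to the root shared among processors.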


Example: Subtree Mapping

[Figure: grid points and elimination tree labeled with owning processes 0–3 under subtree mapping; each subtree is assigned to one process, with processes shared only near the root]


Fan-Out Sparse Cholesky

for j ∈ mycols
    if j is leaf node in T(A) then
        cdiv(j)
        send L∗j to processes in map(Struct(L∗j))
        mycols = mycols − {j}
    end
end
while mycols ≠ ∅
    receive any column of L, say L∗k
    for j ∈ mycols ∩ Struct(L∗k)
        cmod(j, k)
        if column j requires no more cmods then
            cdiv(j)
            send L∗j to processes in map(Struct(L∗j))
            mycols = mycols − {j}
        end
    end
end


Fan-In Sparse Cholesky

for j = 1 to n
    if j ∈ mycols or mycols ∩ Struct(Lj∗) ≠ ∅ then
        u = 0
        for k ∈ mycols ∩ Struct(Lj∗)
            u = u + ℓjk L∗k
        end
        if j ∈ mycols then
            incorporate u into factor column j
            while any aggregated update column for column j remains,
                receive one and incorporate it into factor column j
            end
            cdiv(j)
        else
            send u to process map(j)
        end
    end
end


Multifrontal Sparse Cholesky

Multifrontal algorithm operates recursively, starting from root of elimination tree for A

Dense frontal matrix Fj is initialized to have nonzero entries from corresponding row and column of A as its first row and column, and zeros elsewhere

Fj is then updated by extend_add operations with update matrices from its children in elimination tree

extend_add operation, denoted by ⊕, merges matrices by taking union of their subscript sets and summing entries for any common subscripts

After updating of Fj is complete, its partial Cholesky factorization is computed, producing corresponding row and column of L as well as update matrix Uj


Example: extend_add

[ a11 a13 a15 a18 ]   [ b11 b12 b15 b17 ]
[ a31 a33 a35 a38 ] ⊕ [ b21 b22 b25 b27 ]
[ a51 a53 a55 a58 ]   [ b51 b52 b55 b57 ]
[ a81 a83 a85 a88 ]   [ b71 b72 b75 b77 ]

    [ a11+b11  b12  a13  a15+b15  b17  a18 ]
    [ b21      b22  0    b25      b27  0   ]
  = [ a31      0    a33  a35      0    a38 ]
    [ a51+b51  b52  a53  a55+b55  b57  a58 ]
    [ b71      b72  0    b75      b77  0   ]
    [ a81      0    a83  a85      0    a88 ]
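The merge above can be sketched in code. This is an illustrative implementation, not the lecture's: matrices are stored as {(row, col): value} dicts keyed by their global subscripts, and extend_add takes the union of the index sets and sums entries with common subscripts.

```python
# Sketch of the extend_add operation (⊕): merge two matrices indexed by
# global subscript sets, summing entries that share subscripts and
# filling absent subscripts with zero.
def extend_add(idx_a, mat_a, idx_b, mat_b):
    idx = sorted(set(idx_a) | set(idx_b))  # union of subscript sets
    out = {}
    for (i, j), v in list(mat_a.items()) + list(mat_b.items()):
        out[(i, j)] = out.get((i, j), 0.0) + v  # sum common entries
    # densify over the merged index set
    dense = [[out.get((i, j), 0.0) for j in idx] for i in idx]
    return idx, dense

# tiny example matching the slide's index sets {1,3,5,8} and {1,2,5,7}
a = {(i, j): 1.0 for i in (1, 3, 5, 8) for j in (1, 3, 5, 8)}
b = {(i, j): 10.0 for i in (1, 2, 5, 7) for j in (1, 2, 5, 7)}
idx, f = extend_add([1, 3, 5, 8], a, [1, 2, 5, 7], b)
# idx == [1, 2, 3, 5, 7, 8]; f[0][0] == 11.0 since (1,1) appears in both
```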


Multifrontal Sparse Cholesky

Factor(j)
    Let {i1, . . . , ir} = Struct(L∗j)
    Let

        Fj = [ aj,j   aj,i1  . . .  aj,ir ]
             [ ai1,j    0    . . .    0   ]
             [  ...    ...   . . .   ...  ]
             [ air,j    0    . . .    0   ]

    for each child i of j in elimination tree
        Factor(i)
        Fj = Fj ⊕ Ui
    end
    Perform one step of dense Cholesky:

        Fj = [ ℓj,j   0 ] [ 1   0  ] [ ℓj,j  ℓi1,j  . . .  ℓir,j ]
             [ ℓi1,j    ] [ 0   Uj ] [  0           I           ]
             [  ...   I ]
             [ ℓir,j    ]
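The single dense elimination step at the heart of Factor(j) can be sketched as follows; this is a minimal illustration (not the lecture's code), assuming the frontal matrix is a dense list of lists with the pivot in position (0, 0).

```python
import math

# One elimination step of dense Cholesky on an (r+1) x (r+1) frontal
# matrix F: compute column j of L and the r x r update (Schur
# complement) matrix U_j = F22 - l l^T.
def partial_cholesky_step(F):
    ljj = math.sqrt(F[0][0])
    col = [F[i][0] / ljj for i in range(len(F))]  # (l_jj, l_i1j, ..., l_irj)
    r = len(F) - 1
    U = [[F[1 + i][1 + k] - col[1 + i] * col[1 + k] for k in range(r)]
         for i in range(r)]
    return col, U

# frontal matrix F = [[4, 2], [2, 3]]: pivot 4, one remaining index
col, U = partial_cholesky_step([[4.0, 2.0], [2.0, 3.0]])
# col == [2.0, 1.0]; U == [[2.0]] since 3 - 1*1 = 2
```

U is then passed to the parent front via extend_add, which is exactly the Fj = Fj ⊕ Ui step in the recursion.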


Advantages of Multifrontal Method

Most arithmetic operations performed on dense matrices, which reduces indexing overhead and indirect addressing

Can take advantage of loop unrolling, vectorization, and optimized BLAS to run at near peak speed on many types of processors

Data locality good for memory hierarchies, such as cache, virtual memory with paging, or explicit out-of-core solvers

Naturally adaptable to parallel implementation by processing multiple independent fronts simultaneously on different processors

Parallelism can also be exploited in dense matrix computations within each front


Summary for Parallel Sparse Cholesky

Principal ingredients in efficient parallel algorithm for sparse Cholesky factorization

Reordering matrix to obtain relatively short and well balanced elimination tree while also limiting fill

Multifrontal or supernodal approach to exploit dense subproblems effectively

Subtree mapping to localize communication

Cyclic mapping of dense subproblems to achieve good load balance

2-D algorithm for dense subproblems to enhance scalability


Scalability of Sparse Cholesky

Performance and scalability of sparse Cholesky depend on sparsity structure of particular matrix

Sparse factorization with nested dissection requires factorization of dense matrix of dimension Θ(√n) for 2-D grid problem with n grid points (√n is the size of the root vertex separator), for which unconditional weak scalability is possible

However, efficiency often deteriorates as a result of the rest of the sparse factorization taking more time


References – Dense Cholesky

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz, Communication-optimal parallel and sequential Cholesky decomposition, SIAM J. Sci. Comput. 32:3495-3523, 2010

J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica 2:111-197, 1993

D. O’Leary and G. W. Stewart, Data-flow algorithms for parallel matrix computations, Comm. ACM 28:840-853, 1985

D. O’Leary and G. W. Stewart, Assignment and scheduling in parallel matrix factorization, Linear Algebra Appl. 77:275-299, 1986


References – Sparse Cholesky

M. T. Heath, Parallel direct methods for sparse linear systems, D. E. Keyes, A. Sameh, and V. Venkatakrishnan, eds., Parallel Numerical Algorithms, pp. 55-90, Kluwer, 1997

M. T. Heath, E. Ng, and B. W. Peyton, Parallel algorithms for sparse linear systems, SIAM Review 33:420-460, 1991

J. Liu, Computational models and task scheduling for parallel sparse Cholesky factorization, Parallel Computing 3:327-342, 1986

J. Liu, Reordering sparse matrices for parallel elimination, Parallel Computing 11:73-91, 1989

J. Liu, The role of elimination trees in sparse factorization, SIAM J. Matrix Anal. Appl. 11:134-172, 1990


References – Multifrontal Methods

I. S. Duff, Parallel implementation of multifrontal schemes, Parallel Computing 3:193-204, 1986

A. Gupta, Parallel sparse direct methods: a short tutorial, IBM Research Report RC 25076, November 2010

J. Liu, The multifrontal method for sparse matrix solution: theory and practice, SIAM Review 34:82-109, 1992

J. A. Scott, Parallel frontal solvers for large sparse linear systems, ACM Trans. Math. Software 29:395-417, 2003


References – Scalability

A. George, J. Liu, and E. Ng, Communication results for parallel sparse Cholesky factorization on a hypercube, Parallel Computing 10:287-298, 1989

A. Gupta, G. Karypis, and V. Kumar, Highly scalable parallel algorithms for sparse matrix factorization, IEEE Trans. Parallel Distrib. Systems 8:502-520, 1997

T. Rauber, G. Rünger, and C. Scholtes, Scalability of sparse Cholesky factorization, Internat. J. High Speed Computing 10:19-52, 1999

R. Schreiber, Scalability of sparse direct solvers, A. George, J. R. Gilbert, and J. Liu, eds., Graph Theory and Sparse Matrix Computation, pp. 191-209, Springer-Verlag, 1993


References – Nonsymmetric Sparse Systems

I. S. Duff and J. A. Scott, A parallel direct solver for large sparse highly unsymmetric linear systems, ACM Trans. Math. Software 30:95-117, 2004

A. Gupta, A shared- and distributed-memory parallel general sparse direct solver, Appl. Algebra Engrg. Commun. Comput. 18(3):263-277, 2007

X. S. Li and J. W. Demmel, SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems, ACM Trans. Math. Software 29:110-140, 2003

K. Shen, T. Yang, and X. Jiao, S+: Efficient 2D sparse LU factorization on parallel machines, SIAM J. Matrix Anal. Appl. 22:282-305, 2000
