Minimizing Communication in Numerical Linear Algebra: Optimizing Krylov Subspace Methods
Jim Demmel, EECS & Math Departments, UC Berkeley
[email protected]
www.cs.berkeley.edu/~demmel

Page 1: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Minimizing Communication in Numerical Linear Algebra

www.cs.berkeley.edu/~demmel

Optimizing Krylov Subspace Methods

Jim Demmel, EECS & Math Departments, UC Berkeley

[email protected]

Page 2: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Outline of Lecture 8: Optimizing Krylov Subspace Methods

• k steps of a typical iterative solver for Ax=b or Ax=λx
  – Does k SpMVs with a starting vector (e.g., with b, if solving Ax=b)
  – Finds the “best” solution among all linear combinations of these k+1 vectors
  – Many such “Krylov Subspace Methods”: Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
  – Assume the matrix is “well-partitioned,” with modest surface-to-volume ratio
  – Parallel implementation
    • Conventional: O(k log p) messages, because of k calls to SpMV
    • New: O(log p) messages - optimal
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data - optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation


Page 3: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors
[Figure: x, Ax, A2x, …, A8x split between Proc 1 and Proc 2, showing which entries of [x, Ax] can be computed without communication.]

Page 4: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Locally Dependent Entries for [x, Ax, A2x], A tridiagonal, 2 processors
[Figure: x, Ax, A2x, …, A8x split between Proc 1 and Proc 2, showing which entries of [x, Ax, A2x] can be computed without communication.]

Page 5: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Locally Dependent Entries for [x, Ax, …, A3x], A tridiagonal, 2 processors
[Figure: x, Ax, A2x, …, A8x split between Proc 1 and Proc 2, showing which entries of [x, Ax, …, A3x] can be computed without communication.]

Page 6: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Locally Dependent Entries for [x, Ax, …, A4x], A tridiagonal, 2 processors
[Figure: x, Ax, A2x, …, A8x split between Proc 1 and Proc 2, showing which entries of [x, Ax, …, A4x] can be computed without communication.]

Page 7: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Locally Dependent Entries for [x, Ax, …, A8x], A tridiagonal, 2 processors
[Figure: x, Ax, A2x, …, A8x split between Proc 1 and Proc 2, showing which entries can be computed without communication; k=8-fold reuse of A.]

Page 8: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Remotely Dependent Entries for [x,Ax,…,A8x], A tridiagonal, 2 processors

[Figure: x, Ax, A2x, …, A8x split between Proc 1 and Proc 2, showing the remotely dependent entries near the processor boundary.]
One message to get the data needed to compute the remotely dependent entries, not k=8. Minimizes the number of messages = latency cost.
Price: redundant work (“surface/volume ratio”).
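To make the one-message idea concrete, here is a minimal numpy sketch (my illustration, not the implementation from the lecture): a processor owning rows lo..hi-1 of a tridiagonal A grabs a halo of width k from each neighbor up front, then computes its entries of every power locally, redundantly recomputing the halo entries of the intermediate powers. The function name, sizes, and the dense test matrix are all made up for the demo.

import numpy as np

def akx_local(A, x, lo, hi, k):
    # Entries lo..hi-1 of A^i x for i = 0..k from one up-front halo exchange.
    # A is tridiagonal, so each extra power widens the dependency by one index;
    # a halo of width k therefore suffices for all k powers.
    n = A.shape[0]
    glo, ghi = max(lo - k, 0), min(hi + k, n)   # owned block plus halo; in the
                                                # parallel setting the halo entries
                                                # are the single message per neighbor
    w = x[glo:ghi].copy()
    out = [x[lo:hi].copy()]
    for i in range(1, k + 1):
        # multiply with the truncated local block; entries at least k-i away from
        # a halo edge are still exact, which always covers the owned range
        w = A[glo:ghi, glo:ghi] @ w
        out.append(w[lo - glo:hi - glo].copy())
    return out

# Check against a straightforward global computation ("Proc 2" owns the right half):
n, k = 40, 8
rng = np.random.default_rng(0)
A = (np.diag(rng.standard_normal(n)) +
     np.diag(rng.standard_normal(n - 1), 1) +
     np.diag(rng.standard_normal(n - 1), -1))
x = rng.standard_normal(n)
local = akx_local(A, x, n // 2, n, k)
v = x.copy()
for i in range(1, k + 1):
    v = A @ v
    assert np.allclose(local[i], v[n // 2:])

The same call with lo=0, hi=n//2 plays the role of Proc 1; only the halo on its right side is then nonempty.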

Page 9: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Spyplot of A with sparsity pattern of a 5 point stencil, natural order


Page 10: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Spyplot of A with sparsity pattern of a 5 point stencil, natural order


Page 11: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Spyplot of A with sparsity pattern of a 5 point stencil, natural order, assigned to p=9 processors


For an n x n mesh assigned to p processors, each processor needs 2n data from neighbors

Page 12: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Spyplot of A with sparsity pattern of a 5 point stencil, nested dissection order


Page 13: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Spyplot of A with sparsity pattern of a 5 point stencil, nested dissection order


For an n x n mesh assigned to p processors, each processor needs 4n/p^(1/2) data from neighbors

Page 14: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Remotely Dependent Entries for [x,Ax,…,A3x], A with sparsity pattern of a 5 point stencil


Page 15: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Remotely Dependent Entries for [x,Ax,…,A3x], A with sparsity pattern of a 5 point stencil


Page 16: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Remotely Dependent Entries for [x,Ax,A2x,A3x], A irregular, multiple processors


Page 17: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Remotely Dependent Entries for [x,Ax,…,A8x], A tridiagonal, 2 processors

[Figure: x, Ax, A2x, …, A8x split between Proc 1 and Proc 2, showing the remotely dependent entries near the processor boundary.]
One message to get the data needed to compute the remotely dependent entries, not k=8. Minimizes the number of messages = latency cost.
Price: redundant work (“surface/volume ratio”).

Page 18: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel


Reducing redundant work for a tridiagonal matrix


Page 19: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Summary of Parallel Optimizations so far, for the “Akx kernel”: (A, x, k) → [Ax, A2x, …, Akx]
• Possible to reduce #messages from O(k) to O(1)
• Depends on the matrix being “well-partitioned”
  – Each processor only needs data from a few neighbors
• Price: redundant computation of some entries of Ajx
  – The amount of redundant work depends on the “surface-to-volume ratio” of the partition: ideally each processor needs only a little data from its neighbors, compared to what it owns itself
  • Best case: tridiagonal (1D mesh): need O(1) data from neighbors, versus O(n/p) data locally; #flops grows by a factor 1 + O(k/(n/p)) = 1 + O(k/data_per_proc) ≈ 1
  • Pretty good: 2D mesh (n x n): need O(n/p^(1/2)) data from neighbors, versus O(n^2/p) locally; #flops grows by 1 + O(k/(data_per_proc)^(1/2))
  • Less good: 3D mesh (n x n x n): need O(n^2/p^(2/3)) data from neighbors, versus O(n^3/p) locally; #flops grows by 1 + O(k/(data_per_proc)^(1/3)) (see the small numerical illustration below)
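As a quick numerical illustration of those growth factors (with made-up example sizes, and ignoring the constants hidden in the O(·)):

# Illustrative only: redundant-flop growth factor 1 + k/(data_per_proc)^(1/d)
# for the 1D/2D/3D mesh cases above, with example values k = 8, data_per_proc = 10^6.
k, data_per_proc = 8, 10**6
for d, label in [(1, "1D (tridiagonal)"), (2, "2D mesh"), (3, "3D mesh")]:
    print(label, 1 + k / data_per_proc ** (1.0 / d))
# Prints roughly 1.000008, 1.008, and 1.08: the larger the surface-to-volume
# ratio, the more redundant work the k-fold blocking costs.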


Page 20: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Predicted speedup for a model Petascale machine of [Ax, A2x, …, Akx] for a 2D (n x n) mesh
[Plot over k and log2 n.]

Page 21: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Predicted fraction of time spent doing arithmetic for a model Petascale machine of [Ax, A2x, …, Akx] for a 2D (n x n) mesh
[Plot over k and log2 n.]

Page 22: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Predicted ratio of extra arithmetic for a model Petascale machine of [Ax, A2x, …, Akx] for a 2D (n x n) mesh
[Plot over k and log2 n.]

Page 23: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Predicted speedup for a model Petascale machine of [Ax, A2x, …, Akx] for a 3D (n x n x n) mesh
[Plot over k and log2 n.]

Page 24: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Predicted fraction of time spent doing arithmetic for a model Petascale machine of [Ax, A2x, …, Akx] for a 3D (n x n x n) mesh
[Plot over k and log2 n.]

Page 25: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Predicted ratio of extra arithmetic for a model Petascale machine of [Ax, A2x, …, Akx] for a 3D (n x n x n) mesh
[Plot over k and log2 n.]

Page 26: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Minimizing #messages versus #words_moved

• Parallel case
  – Can’t reduce #words needed from other processors
  – Required to get the right answer
• Sequential case
  – Can reduce #words needed from slow memory
  – Assume A, x too large to fit in fast memory
  – Naïve implementation of [Ax, A2x, …, Akx] by k calls to SpMV needs to move A, x between fast and slow memory k times
  – We will move them one time (a sketch of this one-pass approach follows below) – clearly optimal

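A minimal sketch of the sequential idea for a banded matrix (my illustration, not the implementation from the lecture): sweep over A once, left to right, and within each block of rows advance every power as far as its dependencies allow, so each row of A is brought into fast memory only once. For clarity the sketch keeps full-length arrays; a real implementation would hold only a sliding window of each level in fast memory. Names and sizes are made up.

import numpy as np

def streamed_akx(A, x, k, block=32, b=1):
    # V[i] = A^i x for i = 0..k, computed in one left-to-right sweep over a
    # banded A with half-bandwidth b. done[i] is the "frontier" of level i:
    # how many leading entries of A^i x have been computed so far.
    n = A.shape[0]
    V = [np.zeros(n) for _ in range(k + 1)]
    V[0][:] = x
    done = [n] + [0] * k
    for hi in range(block, n + block, block):
        hi = min(hi, n)
        for i in range(1, k + 1):
            # entry m of A^i x needs entries m-b..m+b of A^(i-1) x, so level i
            # can only advance to hi - i*b until the final block arrives
            frontier = hi if hi == n else max(done[i], hi - i * b)
            if frontier > done[i]:
                V[i][done[i]:frontier] = A[done[i]:frontier, :] @ V[i - 1]
                done[i] = frontier
    return V

# Check against k separate full matrix-vector products:
n, k = 200, 4
rng = np.random.default_rng(0)
A = (np.diag(rng.standard_normal(n)) +
     np.diag(rng.standard_normal(n - 1), 1) +
     np.diag(rng.standard_normal(n - 1), -1))
x = rng.standard_normal(n)
V = streamed_akx(A, x, k)
v = x.copy()
for i in range(1, k + 1):
    v = A @ v
    assert np.allclose(V[i], v)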

Page 27: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Sequential [x, Ax, …, A4x], with memory hierarchy
One read of the matrix from slow memory, not k=4. Minimizes words moved = bandwidth cost.
No redundant work.

Page 28: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Remotely Dependent Entries for [x,Ax,…,A3x], A 100x100 with bandwidth 2

Only ~25% of A and the vectors fit in fast memory


Page 29: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

In what order should the sequential algorithm process a general sparse matrix?

• For a band matrix, we obviously want to process the matrix “from left to right”, since the data we need for the next partition is already in fast memory
• Not obvious what the best order is for a general matrix
• We can formulate the question of finding the best order as a Traveling Salesman Problem (TSP); a tiny heuristic sketch follows below
  – One vertex per matrix partition
  – Weight of edge (j, k) is the memory cost of processing partition k right after partition j
  – TSP: find the lowest-cost “tour” visiting all vertices
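A small illustration of that formulation (mine, not from the lecture): given a user-supplied cost matrix where cost[j, k] is the number of words that must be re-read from slow memory when partition k is processed right after partition j, a nearest-neighbour heuristic already gives a reasonable ordering; a real TSP solver or a better heuristic could be dropped in instead. The cost values below are made up.

import numpy as np

def greedy_partition_order(cost, start=0):
    # Nearest-neighbour heuristic for the partition-ordering TSP:
    # repeatedly move to the cheapest not-yet-visited partition.
    nparts = cost.shape[0]
    order = [start]
    remaining = set(range(nparts)) - {start}
    while remaining:
        j = order[-1]
        nxt = min(remaining, key=lambda c: cost[j, c])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Example with a made-up 4-partition cost matrix (banded-like structure):
cost = np.array([[0, 1, 5, 9],
                 [1, 0, 2, 6],
                 [5, 2, 0, 3],
                 [9, 6, 3, 0]])
print(greedy_partition_order(cost))   # [0, 1, 2, 3]: process neighbouring partitions in sequence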

Page 30: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

What about multicore?

• Two kinds of communication to minimize
  – Between processors on the chip
  – Between on-chip cache and off-chip DRAM
• Use a hybrid of both techniques described so far
  – Use the parallel optimization so each core can work independently
  – Use the sequential optimization to minimize the off-chip DRAM traffic of each core


Page 31: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Speedups on Intel Clovertown (8 core)

Test matrices include stencils and practical matrices. See the SC09 paper on bebop.cs.berkeley.edu for details.


Page 32: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Akb} minimizing ||Ax - b||_2
• Cost of k steps of standard GMRES vs. new GMRES

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)
    MGS(w, v(0), …, v(i-1))   ... orthogonalize w by Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H
Sequential: #words_moved = O(k·nnz) from SpMV + O(k^2·n) from MGS
Parallel: #messages = O(k) from SpMV + O(k^2·log p) from MGS

Communication-Avoiding GMRES (CA-GMRES):
  W = [v, Av, A2v, …, Akv]
  [Q, R] = TSQR(W)   ... “Tall Skinny QR”
  build H from R, solve LSQ problem
Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR
Parallel: #messages = O(1) from computing W + O(log p) from TSQR

• Oops – W is built by the power method, so precision is lost! (A small sketch of the CA-GMRES structure follows below.)
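Here is a structural sketch in numpy (my own illustration, not the talk’s or Hoemmen’s implementation) of how one CA-GMRES cycle replaces k interleaved SpMV/MGS steps: build all k+1 basis vectors with a single matrix-powers pass, then orthogonalize them all at once. np.linalg.qr stands in for TSQR, and forming the Hessenberg matrix H from R and solving the least-squares problem are elided.

import numpy as np

def ca_gmres_cycle(A, v, k):
    # One CA-GMRES "cycle", structure only.
    n = A.shape[0]
    W = np.empty((n, k + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for j in range(1, k + 1):
        W[:, j] = A @ W[:, j - 1]   # the Akx kernel: one read of A / O(1) messages when blocked
    Q, R = np.linalg.qr(W)          # stand-in for TSQR: O(log p) messages in parallel
    return Q, R

The monomial basis built this way is exactly the power method, which is why the columns of W quickly become numerically dependent; the next slides show the fix.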

Page 33: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel


Page 34: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

How to make CA-GMRES Stable?

• Use a different polynomial basis for the Krylov subspace
• Not the monomial basis W = [v, Av, A2v, …]; instead [v, p1(A)v, p2(A)v, …]
• Possible choices (compared numerically in the sketch below):
  – Newton basis: WN = [v, (A – θ1 I)v, (A – θ2 I)(A – θ1 I)v, …], where the shifts θi are chosen as approximate eigenvalues from Arnoldi (using the same Krylov subspace, so “free”)
  – Chebyshev basis: WC = [v, T1(A)v, T2(A)v, …], where the polynomials Ti(z) are chosen to be small in a region of the complex plane containing the large eigenvalues

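To see the effect, here is a small numpy comparison (my illustration; the matrix, shift choice, and sizes are made up): build the monomial and Newton bases for the same Krylov subspace and compare their condition numbers after normalizing the columns. In CA-GMRES the shifts would be Ritz values from a first Arnoldi cycle; here they are simply spread across the known spectrum of a toy diagonal matrix.

import numpy as np

def monomial_basis(A, v, k):
    W = np.empty((A.shape[0], k + 1))
    W[:, 0] = v
    for j in range(1, k + 1):
        W[:, j] = A @ W[:, j - 1]
    return W

def newton_basis(A, v, shifts):
    # WN = [v, (A - th1 I)v, (A - th2 I)(A - th1 I)v, ...]
    W = np.empty((A.shape[0], len(shifts) + 1))
    W[:, 0] = v
    for j, theta in enumerate(shifts, start=1):
        W[:, j] = A @ W[:, j - 1] - theta * W[:, j - 1]
    return W

def cond_of_normalized(W):
    # 2-norm condition number after scaling each column to unit length
    s = np.linalg.svd(W / np.linalg.norm(W, axis=0), compute_uv=False)
    return s[0] / s[-1]

n, k = 200, 8
rng = np.random.default_rng(0)
A = np.diag(np.linspace(1.0, 100.0, n))      # toy matrix with known spectrum
v = rng.standard_normal(n)
shifts = np.linspace(1.0, 100.0, k)          # crude stand-in for Ritz values
print(cond_of_normalized(monomial_basis(A, v, k)))     # huge: columns collapse onto the dominant eigenvector
print(cond_of_normalized(newton_basis(A, v, shifts)))  # typically orders of magnitude smaller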

Page 35: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

“Monomial” basis [Ax,…,Akx] fails to converge

Newton polynomial basis does converge


Page 36: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Speedups on 8-core Clovertown


Page 37: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Summary of what is known (1/3), open

• GMRES
  – Can independently choose k to optimize speed, and restart length r to optimize convergence
  – Need to “co-tune” the Akx kernel and TSQR
  – Know how to use more stable polynomial bases
  – Proven speedups
• Can similarly reorganize other Krylov methods
  – Arnoldi and Lanczos, for Ax = λx and for Ax = λMx
  – Conjugate Gradients, for Ax = b
  – Biconjugate Gradients, for Ax = b
  – BiCGStab?
  – Other Krylov methods?


Page 38: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Summary of what is known (2/3), open

• Preconditioning: solve MAx = Mb
  – Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …] (a sketch of this kernel follows below)
  – Can think of this as the union of two of the earlier kernels: [x, (MA)x, (MA)2x, …, (MA)kx] and [x, Ax, (AM)Ax, (AM)2Ax, …, (AM)kAx]
  – For which preconditioners M can we minimize communication?
    • Easy: diagonal M
    • How about block-diagonal M?
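A minimal sketch of what that interleaved kernel computes (just the sequence of products, not the communication-avoiding reorganization); A and M here stand for any matrices or sparse operators supporting @, and the function name is made up.

def precond_powers(A, M, x, k):
    # Builds [x, Ax, MAx, AMAx, MAMAx, ...]: 2k+1 vectors, alternately applying A and M.
    V = [x]
    for j in range(1, 2 * k + 1):
        B = A if j % 2 == 1 else M
        V.append(B @ V[-1])
    return V

For example, precond_powers(A, M, x, 3) returns [x, Ax, MAx, AMAx, MAMAx, AMAMAx, MAMAMAx].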


Page 39: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal

Page 40: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal

Page 41: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal

Page 42: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal

Page 43: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Examine [x, Ax, MAx, AMAx, MAMAx, …] for A tridiagonal, M block-diagonal

Page 44: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Summary of what is known (3/3), open

• Preconditioning: solve MAx = Mb
  – Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …]
  – For block-diagonal M, the matrix powers rapidly become dense, but the ranks of the off-diagonal blocks grow slowly
  – Can take advantage of the low rank to minimize communication:
      yi = ((AM)^k)ij · xj = (U·V)·xj = U·(V·xj)
    where V·xj is computed on proc j, the small vector is sent to proc i, and yi is reconstructed on proc i (see the sketch below)
  – Works (in principle) for Hierarchically Semi-Separable (HSS) M
  – How does it work in practice?
• For details (and open problems) see M. Hoemmen’s PhD thesis
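A tiny numpy illustration of that factored product (made-up sizes and a made-up rank-r block; the point is only the message size): if the off-diagonal block Bij of (AM)^k is stored as U·V with small rank r, proc j sends only the r-vector V·xj instead of an O(block-size) contribution.

import numpy as np

rng = np.random.default_rng(1)
m, nj, r = 500, 500, 4                  # block dimensions and (assumed small) rank
U = rng.standard_normal((m, r))         # low-rank factors of the off-diagonal block B_ij = U @ V
V = rng.standard_normal((r, nj))
x_j = rng.standard_normal(nj)

t = V @ x_j                             # computed on proc j: only r = 4 words to send
y_i_contrib = U @ t                     # reconstructed on proc i
assert np.allclose(y_i_contrib, (U @ V) @ x_j)   # same result as the full product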

Page 45: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Of things not said (much about) …

• Other Communication-Avoiding (CA) dense factorizations
  – QR with column pivoting (tournament pivoting)
  – Cholesky with diagonal pivoting
  – GE with “complete pivoting”
  – LDL^T? Maybe with complete pivoting?
• CA sparse factorizations
  – Cholesky, assuming good separators
  – Lower bounds from Lipton/Rose/Tarjan
  – Matrix multiplication, anything else?
    • Strassen, and all linear algebra with #words_moved = O(n^ω / M^(ω/2 - 1))
    • Parallel?
• Extending CA-Krylov methods to the “bottom solver” in multigrid with Adaptive Mesh Refinement (AMR)


Page 46: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

Summary

Don’t Communic…


Time to redesign all dense and sparse linear algebra


Page 47: Minimizing Communication in Numerical Linear Algebra cs.berkeley/~demmel

EXTRA SLIDES
