Minimizing Communication in Linear Algebra
James Demmel, 9 Nov 2009

Page 1: Minimizing Communication  in Linear Algebra

1

Minimizing Communication in Linear Algebra

James Demmel9 Nov 2009

Page 2: Minimizing Communication  in Linear Algebra

Outline
• Background – 2 Big Events
• Why avoiding communication is important
• Communication-optimal direct linear algebra
• Autotuning of sparse matrix codes
• Communication-optimal iterative linear algebra

Page 3: Minimizing Communication  in Linear Algebra

Collaborators
• Grey Ballard, UCB EECS
• Ioana Dumitriu, U. Washington
• Laura Grigori, INRIA
• Ming Gu, UCB Math
• Mark Hoemmen, UCB EECS
• Olga Holtz, UCB Math & TU Berlin
• Julien Langou, U. Colorado Denver
• Marghoob Mohiyuddin, UCB EECS
• Oded Schwartz, TU Berlin
• Hua Xiang, INRIA
• Sam Williams, LBL
• Vasily Volkov, UCB EECS
• Kathy Yelick, UCB EECS & NERSC

• BeBOP group, bebop.cs.berkeley.edu
• ParLab, parlab.eecs.berkeley.edu

Thanks to Intel, Microsoft, UC Discovery, NSF, DOE, …

Page 4: Minimizing Communication  in Linear Algebra

2 Big Events

1. “Multicore revolution”, requiring all software (where performance matters!) to change
  • ParLab – Industry/State funded center with 15 faculty, ~50 grad students (parlab.eecs.berkeley.edu)
  • Rare synergy between commercial and scientific computing

2. Establishment of a new graduate program in Computational Science and Engineering (CSE)
  • 117 Faculty from 22 Departments
  • 60 existing courses, more being developed
  • CS267 – Applications of Parallel Computers – www.cs.berkeley.edu/~demmel/cs267_Spr09
  • 3-Day Parallel “Boot Camp” in August – parlab.eecs.berkeley.edu/2009bootcamp

Page 5: Minimizing Communication  in Linear Algebra

Parallelism Revolution is Happening Now

• Chip density continuing to increase ~2x every 2 years
• Clock speed is not
• Number of processor cores may double instead
• There is little or no more hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Page 6: Minimizing Communication  in Linear Algebra

The “7 Dwarfs” of High Performance Computing
• Phil Colella (LBL) identified 7 kernels out of which most large-scale simulation and data-analysis programs are composed:

1. Dense Linear Algebra
   • Ex: Solve Ax=b or Ax = λx where A is a dense matrix
2. Sparse Linear Algebra
   • Ex: Solve Ax=b or Ax = λx where A is a sparse matrix (mostly zero)
3. Operations on Structured Grids
   • Ex: Anew(i,j) = 4*A(i,j) – A(i-1,j) – A(i+1,j) – A(i,j-1) – A(i,j+1)
4. Operations on Unstructured Grids
   • Ex: Similar, but the list of neighbors varies from entry to entry
5. Spectral Methods
   • Ex: Fast Fourier Transform (FFT)
6. N-Body (aka Particle Methods)
   • Ex: Compute electrostatic forces using the Fast Multipole Method
7. Monte Carlo (aka MapReduce)
   • Ex: Many independent simulations using different inputs

Page 7: Minimizing Communication  in Linear Algebra

Motif/Dwarf: Common Computational Methods (Red = Hot, Blue = Cool)

[Heatmap figure: rows are the 13 motifs – 1 Finite State Mach., 2 Combinational, 3 Graph Traversal, 4 Structured Grid, 5 Dense Matrix, 6 Sparse Matrix, 7 Spectral (FFT), 8 Dynamic Prog, 9 N-Body, 10 MapReduce, 11 Backtrack/B&B, 12 Graphical Models, 13 Unstructured Grid; columns are application areas – Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser.]

www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf

Page 8: Minimizing Communication  in Linear Algebra

2 Big Events

1. “Multicore revolution”, requiring all software (where performance matters!) to change
  • ParLab – Industry/State funded center with 15 faculty, ~50 grad students (parlab.eecs.berkeley.edu)
  • Rare synergy between commercial and scientific computing

2. Establishment of a new graduate program in Computational Science and Engineering (CSE) at UCB
  • 117 Faculty from 22 Departments
  • 60 existing courses, more being developed
  • CS267 – Applications of Parallel Computers – www.cs.berkeley.edu/~demmel/cs267_Spr09
  • 3-Day Parallel “Boot Camp” in August – parlab.eecs.berkeley.edu/2009bootcamp

Page 9: Minimizing Communication  in Linear Algebra

Why avoiding communication is important (1/2)

Algorithms have two costs:
1. Arithmetic (FLOPS)
2. Communication: moving data between
   • levels of a memory hierarchy (sequential case)
   • processors over a network (parallel case)

[Figure: sequential case – CPU / Cache / DRAM; parallel case – several CPU+DRAM nodes connected over a network.]

Page 10: Minimizing Communication  in Linear Algebra

Why avoiding communication is important (2/2)
• Running time of an algorithm is the sum of 3 terms:
  • #flops * time_per_flop
  • #words moved / bandwidth        (communication)
  • #messages * latency             (communication)
• time_per_flop << 1/bandwidth << latency
  • Gaps growing exponentially with time

Annual improvements: time_per_flop 59%; network bandwidth 26%, latency 15%; DRAM bandwidth 23%, latency 5%

• Goal: reorganize linear algebra to avoid communication
  • Between all memory hierarchy levels: L1, L2, DRAM, network, etc.
  • Not just hiding communication (speedup ≤ 2x)
  • Arbitrary speedups possible

Page 11: Minimizing Communication  in Linear Algebra

Naïve Sequential Matrix Multiply: C = C + A*B

for i = 1 to n
   {read row i of A into fast memory}            … n^2 reads
   for j = 1 to n
      {read C(i,j) into fast memory}             … n^2 reads
      {read column j of B into fast memory}      … n^3 reads
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}         … n^2 writes

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

n^3 + O(n^2) reads/writes altogether

Page 12: Minimizing Communication  in Linear Algebra

Less Communication with Blocked Matrix Multiply
• Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size

… Break A_{n×n}, B_{n×n}, C_{n×n} into b×b blocks labeled A(i,j), etc.
… b chosen so 3 b×b blocks fit in cache
for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
   C(i,j) = C(i,j) + A(i,k)·B(k,j)    … b×b matmul, 4b^2 reads/writes

• (n/b)^3 · 4b^2 = 4n^3/b reads/writes altogether
• Minimized when 3b^2 = cache size = M, yielding O(n^3/M^(1/2)) reads/writes
• What if we had more levels of memory (L1, L2, cache, etc.)? Would need 3 more nested loops per level
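A minimal NumPy sketch of the blocked loop above (the function name and the explicit copy-in/copy-out of the C block are illustrative, meant to mirror the fast-memory model rather than to be fast in Python):

import numpy as np

def blocked_matmul(A, B, C, b):
    """C += A*B using b-by-b blocks; 3 blocks are assumed to fit in fast memory."""
    n = A.shape[0]
    assert n % b == 0, "sketch assumes b divides n"
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b].copy()              # read block C(i,j)
            for k in range(0, n, b):
                # read blocks A(i,k) and B(k,j): 4*b^2 reads/writes per b-by-b matmul
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
            C[i:i+b, j:j+b] = Cij                     # write block C(i,j) back
    return C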

Page 13: Minimizing Communication  in Linear Algebra

Blocked vs Cache-Oblivious Algorithms
• Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size

… Break A_{n×n}, B_{n×n}, C_{n×n} into b×b blocks labeled A(i,j), etc.
… b chosen so 3 b×b blocks fit in cache
for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
   C(i,j) = C(i,j) + A(i,k)·B(k,j)    … b×b matmul
… another level of memory would need 3 more loops

• Cache-oblivious Matmul C = A·B is independent of cache size

Function C = MM(A,B)
   If A and B are 1×1
      C = A · B
   else
      … Break A_{n×n}, B_{n×n}, C_{n×n} into (n/2)×(n/2) blocks labeled A(i,j), etc.
      for i = 1 to 2, for j = 1 to 2, for k = 1 to 2
         C(i,j) = C(i,j) + MM( A(i,k), B(k,j) )    … n/2 × n/2 matmul
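The cache-oblivious recursion above, as a NumPy sketch (assumes n is a power of 2; C is updated in place through array views, so no block size ever appears):

import numpy as np

def mm(A, B, C):
    """Cache-oblivious C += A*B by recursive halving."""
    n = A.shape[0]
    if n == 1:
        C[0, 0] += A[0, 0] * B[0, 0]
        return
    h = n // 2
    for i in range(2):                 # loop over the 2x2 grid of subblocks
        for j in range(2):
            for k in range(2):
                mm(A[i*h:(i+1)*h, k*h:(k+1)*h],
                   B[k*h:(k+1)*h, j*h:(j+1)*h],
                   C[i*h:(i+1)*h, j*h:(j+1)*h])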

Page 14: Minimizing Communication  in Linear Algebra

Direct linear algebra: Prior Work on Matmul
• Assume an n^3 algorithm (i.e. not Strassen-like)
• Sequential case, with fast memory of size M
  • Lower bound on #words moved to/from slow memory = Ω(n^3 / M^(1/2)) [Hong, Kung, 81]
  • Attained using blocked or cache-oblivious algorithms

• Parallel case on P processors:
  • Let NNZ be the total memory needed; assume load balanced
  • Lower bound on #words communicated = Ω(n^3 / (P · NNZ)^(1/2)) [Irony, Tiskin, Toledo, 04]

NNZ (name of alg)         Lower bound on #words     Attained by
3n^2 (“2D alg”)           Ω(n^2 / P^(1/2))          [Cannon, 69]
3n^2 P^(1/3) (“3D alg”)   Ω(n^2 / P^(2/3))          [Johnson, 93]

Page 15: Minimizing Communication  in Linear Algebra

New lower bound for all “direct” linear algebra

• Holds for
  • BLAS, LU, QR, eig, SVD, …
  • Some whole programs (sequences of these operations, no matter how they are interleaved, e.g. computing A^k)
  • Dense and sparse matrices (where #flops << n^3)
  • Sequential and parallel algorithms
  • Some graph-theoretic algorithms (e.g. Floyd-Warshall)
  • See [BDHS09] – proof sketch if time!

Let M = “fast” memory size per processor

#words_moved by at least one processor = Ω(#flops / M^(1/2))
#messages_sent by at least one processor = Ω(#flops / M^(3/2))

Page 16: Minimizing Communication  in Linear Algebra

Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  • Mostly not
• If not, are there other algorithms that do?
  • Yes
• Goals for algorithms:
  • Minimize #words = Ω(#flops / M^(1/2))
  • Minimize #messages = Ω(#flops / M^(3/2))
    • Need new data structures
  • Minimize for multiple memory hierarchy levels
    • Cache-oblivious algorithms would be simplest
  • Fewest flops when the matrix fits in fastest memory
    • Cache-oblivious algorithms don't always attain this
• Only a few sparse algorithms so far

Page 17: Minimizing Communication  in Linear Algebra

Minimizing #messages requires new data structures
• Need to load/store a whole rectangular subblock of the matrix with one “message”
  • Incompatible with conventional columnwise (rowwise) storage
  • Ex: rows (columns) not in contiguous memory locations
• Blocked storage: store as a matrix of b×b blocks, each block stored contiguously
  • Ok for one level of memory hierarchy; what if more?
• Recursive blocked storage: store each block using subblocks

Page 18: Minimizing Communication  in Linear Algebra

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages use (recursive) block layout
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work

• BLAS-3: 2 levels of memory (#words moved and #messages): usual blocked or recursive algorithms; multiple levels of memory: usual blocked algorithms (nested), or recursive [Gustavson,97]
• Cholesky: 2 levels #words: LAPACK (with b = M^(1/2)), [Gustavson 97], [BDHS09]; 2 levels #messages: [Gustavson,97], [Ahmed,Pingali,00], [BDHS09]; multiple levels: (←same)
• LU with pivoting: 2 levels #words: LAPACK (rarely), [Toledo,97], [GDX 08] (not partial pivoting); 2 levels #messages: [GDX 08]; multiple levels #words: [Toledo, 97], [GDX 08]?; multiple levels #messages: [GDX 08]?
• QR: 2 levels #words: LAPACK (rarely), [Elmroth,Gustavson,98], [DGHL08]; 2 levels #messages: [Frens,Wise,03] (but 3x flops), [DGHL08]; multiple levels #words: [Elmroth,Gustavson,98], [DGHL08]?; multiple levels #messages: [Frens,Wise,03], [DGHL08]?
• Eig, SVD: 2 levels #words: Not LAPACK, [BDD09] (randomized, but more flops); 2 levels #messages and multiple levels: [BDD09]

Page 19: Minimizing Communication  in Linear Algebra

Summary of dense 2D parallel algorithms attaining communication lower bounds

• Assume n×n matrices on P processors, memory per processor = O(n^2 / P)
• ScaLAPACK assumes the best block size b is chosen
• Many references (see reports), Green are ours
• Recall lower bounds: #words_moved = Ω(n^2 / P^(1/2)) and #messages = Ω(P^(1/2))

Algorithm (reference): factor exceeding lower bound for #words_moved; factor exceeding lower bound for #messages
• Matrix multiply ([Cannon, 69]): 1; 1
• Cholesky (ScaLAPACK): log P; log P
• LU ([GDX08]): log P; log P
  LU (ScaLAPACK): log P; (N / P^(1/2)) · log P
• QR ([DGHL08]): log P; log^3 P
  QR (ScaLAPACK): log P; (N / P^(1/2)) · log P
• Sym Eig, SVD ([BDD09]): log P; log^3 P
  Sym Eig, SVD (ScaLAPACK): log P; N / P^(1/2)
• Nonsym Eig ([BDD09]): log P; log^3 P
  Nonsym Eig (ScaLAPACK): P^(1/2) · log P; N · log P

Page 20: Minimizing Communication  in Linear Algebra

QR of a tall, skinny matrix is the bottleneck; use TSQR instead:

[Figure: W = [W0; W1; W2; W3]. Each block is factored locally, W_i = Q_{i0} · R_{i0}; pairs of R factors are stacked and refactored, [R_{00}; R_{10}] = Q_{01} · R_{01} and [R_{20}; R_{30}] = Q_{11} · R_{11}; finally [R_{01}; R_{11}] = Q_{02} · R_{02}.]

Q is represented implicitly as a product (tree of factors) [DGHL08]

Page 21: Minimizing Communication  in Linear Algebra

Minimizing Communication in TSQR

Parallel: W = [W0; W1; W2; W3]; each W_i is reduced to R_{i0}, then a binary tree combines R_{00}, R_{10} → R_{01} and R_{20}, R_{30} → R_{11}, and finally R_{01}, R_{11} → R_{02}.
Sequential: W = [W0; W1; W2; W3]; a flat tree: W0 → R_{00}, then [R_{00}; W1] → R_{01}, [R_{01}; W2] → R_{02}, [R_{02}; W3] → R_{03}.
Dual Core: a hybrid of the two trees.

Can choose the reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
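A rough NumPy sketch of TSQR on a binary reduction tree, as described above; it returns the final R and keeps the per-node Q factors as the implicit (tree) representation of Q. The block count and names are illustrative:

import numpy as np

def tsqr(W_blocks):
    """Binary-tree TSQR of a tall-skinny matrix given as a list of block rows."""
    Rs = [np.linalg.qr(Wi)[1] for Wi in W_blocks]     # local QR at each leaf
    tree = []                                         # per-level Q factors (implicit Q)
    while len(Rs) > 1:
        next_Rs, level_Qs = [], []
        for i in range(0, len(Rs), 2):
            if i + 1 < len(Rs):
                Q, R = np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]))
            else:                                     # odd block out: pass it up unchanged
                Q, R = np.eye(Rs[i].shape[0]), Rs[i]
            level_Qs.append(Q)
            next_Rs.append(R)
        tree.append(level_Qs)
        Rs = next_Rs
    return Rs[0], tree

# sanity check: the tree R matches the R of one big QR, up to row signs
W = np.random.randn(1000, 10)
R_tree, _ = tsqr(np.array_split(W, 4))
assert np.allclose(np.abs(R_tree), np.abs(np.linalg.qr(W)[1]))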

Page 22: Minimizing Communication  in Linear Algebra

Performance of TSQR vs Sca/LAPACK
• Parallel
  – Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  – Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  – Both use Elmroth-Gustavson locally – enabled by TSQR
• Sequential
  – OOC on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• See UC Berkeley EECS Tech Report 2008-89
• General CAQR = Communication-Avoiding QR in progress

Page 23: Minimizing Communication  in Linear Algebra

Modeled Speedups of CAQR vs ScaLAPACK

• Petascale: up to 22.9x
• IBM Power 5: up to 9.7x
• “Grid”: up to 11x

Petascale machine modeled with 8192 procs, each at 500 GFlops/s, and a bandwidth of 4 GB/s (time per flop ≈ 2·10^-12 s, time per word ≈ 2·10^-9 s, latency ≈ 10^-5 s).

Page 24: Minimizing Communication  in Linear Algebra

Randomized Rank-Revealing QR
• Needed for eigenvalue and SVD algorithms
• func [Q,R,V] = RRQR(A)
  • [V,R1] = qr(randn(n,n)) … V random with Haar measure
  • [Q,R] = qr(A · V^T) … so A = Q · R · V
• Theorem (DDH). Provided that σ_{r+1} << σ_r, RRQR produces a rank-revealing decomposition with high probability (when r = O(n), the probability is 1 – O(n^-c)).
• Can apply to A = A_1^{±1} · A_2^{±1} · … · A_k^{±1}
  • Do RRQR, followed by a series of (k-1) QR / RQ factorizations to propagate the unitary/orthogonal matrix to the front
  • Analogous to [Stewart, 94]
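A direct NumPy transcription of the RRQR pseudocode above (QR of a Gaussian matrix stands in for a Haar-distributed orthogonal V, exactly as on the slide):

import numpy as np

def randomized_rrqr(A):
    """Return Q, R, V with A = Q · R · V and V random orthogonal."""
    n = A.shape[0]
    V, _ = np.linalg.qr(np.random.randn(n, n))   # V random with Haar measure
    Q, R = np.linalg.qr(A @ V.T)                 # A · V^T = Q · R, hence A = Q · R · V
    return Q, R, V

A = np.random.randn(6, 6)
Q, R, V = randomized_rrqr(A)
assert np.allclose(Q @ R @ V, A)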

Page 25: Minimizing Communication  in Linear Algebra

What about o(n^3) algorithms, like Strassen?

• Thm (DDHK07)
  • Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^{ω+η}) ops, for any η > 0
  • Current record: ω ≈ 2.38
• Thm (DDH08)
  • Given any stable matmul running in O(n^{ω+η}) ops, it is possible to do backward stable dense linear algebra in O(n^{ω+η}) ops
  • Also reduces communication to O(n^{ω+η})
  • But can do much better (work in progress…)

Page 26: Minimizing Communication  in Linear Algebra

Direct Linear Algebra: summary and future work
• Communication lower bounds on #words_moved and #messages
  • BLAS, LU, Cholesky, QR, eig, SVD, …
  • Some whole programs (compositions of these operations, no matter how they are interleaved, e.g. computing A^k)
  • Dense and sparse matrices (where #flops << n^3)
  • Sequential and parallel algorithms
  • Some graph-theoretic algorithms (e.g. Floyd-Warshall)
• Algorithms that attain these lower bounds
  • Nearly all dense problems (some open problems)
  • A few sparse problems
• Speedups in theory and practice
• Future work
  • Lots to implement, tune
  • Next generation of Sca/LAPACK on heterogeneous architectures (MAGMA)
  • Few algorithms in the sparse case
  • Strassen-like algorithms
  • 3D algorithms (only for Matmul so far), may be important for scaling

Page 27: Minimizing Communication  in Linear Algebra

How hard is hand-tuning matmul, anyway?

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given “blocked” code to start with
• Still hard to get close to vendor-tuned performance (ACML)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

Page 28: Minimizing Communication  in Linear Algebra

How hard is hand-tuning matmul, anyway?


Page 29: Minimizing Communication  in Linear Algebra

What part of the Matmul Search Space Looks Like

[Figure: a 2-D slice of a 3-D register-tile search space; x-axis: number of rows in the register block, y-axis: number of columns in the register block. The dark blue region was pruned. Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler.]

Page 30: Minimizing Communication  in Linear Algebra

Autotuning Matmul with ATLAS (n = 500)

[Bar chart: MFLOPS across architectures for Vendor BLAS, ATLAS BLAS, and F77 BLAS (y-axis 0–900 MFLOPS). Source: Jack Dongarra]

• ATLAS is faster than all other portable BLAS implementations and it is comparable with the machine-specific libraries provided by the vendor.
• ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes, Chin, D.

Page 31: Minimizing Communication  in Linear Algebra

Automatic Performance Tuning

• Goal: let the machine do the hard work of writing fast code
• What does tuning of Matmul, dense BLAS, FFTs, signal processing, etc. have in common?
  • Can do the tuning off-line: once per architecture and algorithm
  • Can take as much time as necessary (hours, a week…)
  • At run-time, algorithm choice may depend only on a few parameters (matrix dimensions, size of FFT, etc.)
• Can't always do tuning off-line
  • Algorithm and implementation may strongly depend on data known only at run-time
  • Ex: a sparse matrix's nonzero pattern determines both the best data structure and the implementation of sparse matrix-vector multiplication (SpMV)
  • Part of the search for the best algorithm must be done (very quickly!) at run-time

Page 32: Minimizing Communication  in Linear Algebra

Source: Accelerator Cavity Design Problem (from Ko via Husbands)

Page 33: Minimizing Communication  in Linear Algebra

Linear Programming Matrix

Page 34: Minimizing Communication  in Linear Algebra

A Sparse Matrix You Use Every Day

Page 35: Minimizing Communication  in Linear Algebra

SpMV with Compressed Sparse Row (CSR) Storage

Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)

for each row i
   for k = ptr[i] to ptr[i+1]-1 do
      y[i] = y[i] + val[k]*x[ind[k]]

Only 2 flops per 2 mem_refs: limited by getting data from memory
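A line-for-line Python version of the CSR kernel above (array names ptr/ind/val follow the slide):

def spmv_csr(ptr, ind, val, x, y):
    """y += A*x with A in Compressed Sparse Row format: the nonzeros of row i
    sit at positions ptr[i]..ptr[i+1]-1 of val, with column indices in ind."""
    n = len(ptr) - 1
    for i in range(n):                       # for each row i
        for k in range(ptr[i], ptr[i + 1]):
            y[i] += val[k] * x[ind[k]]       # 2 flops per 2 memory references
    return y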

Page 36: Minimizing Communication  in Linear Algebra

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• kernel: SpMV

• Source: NASA structural analysis problem

Page 37: Minimizing Communication  in Linear Algebra

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• kernel: SpMV

• Source: NASA structural analysis problem

• 8x8 dense substructure: exploit this to limit #mem_refs

Page 38: Minimizing Communication  in Linear Algebra

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking performance profiles for SpMV on Itanium 2 (Mflop/s), marking the reference implementation, the 8x8 blocking suggested by the matrix structure, and the best block size found by search, 4x2.]

Page 39: Minimizing Communication  in Linear Algebra

Register Profile: Itanium 2

[Figure: register-blocking profile; performance ranges from 190 Mflop/s to 1190 Mflop/s.]

Page 40: Minimizing Communication  in Linear Algebra

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles and best fraction of peak achieved: Power3 – 17%, Power4 – 16%, Itanium 1 – 8%, Itanium 2 – 33%. Per-machine performance ranges roughly 122–252 Mflop/s (Power3), 459–820 Mflop/s (Power4), 107–247 Mflop/s (Itanium 1), and 190 Mflop/s – 1.2 Gflop/s (Itanium 2).]

Page 41: Minimizing Communication  in Linear Algebra

Register Profiles: Sun and Intel x86

[Figure: register-blocking profiles and best fraction of peak achieved: Ultra 2i – 11%, Ultra 3 – 5%, Pentium III – 21%, Pentium III-M – 15%. Per-machine performance ranges span roughly 35–72, 50–90, 42–108, and 58–122 Mflop/s.]

Page 42: Minimizing Communication  in Linear Algebra

Another example of tuning challenges

• More complicated non-zero structure in general

• N = 16614
• NNZ = 1.1 M

Page 43: Minimizing Communication  in Linear Algebra

Zoom in to top corner

• More complicated non-zero structure in general

• N = 16614
• NNZ = 1.1 M

Page 44: Minimizing Communication  in Linear Algebra

3x3 blocks look natural, but…

• More complicated non-zero structure in general

• Example: 3x3 blocking
  • Logical grid of 3x3 cells

• But would lead to lots of “fill-in”

Page 45: Minimizing Communication  in Linear Algebra

Extra Work Can Improve Efficiency!

• More complicated non-zero structure in general

• Example: 3x3 blocking
  • Logical grid of 3x3 cells
  • Fill in explicit zeros
  • Unroll 3x3 block multiplies
  • “Fill ratio” = 1.5

• On Pentium III: 1.5x speedup!
  • Actual Mflop rate is 1.5^2 = 2.25x higher

Page 46: Minimizing Communication  in Linear Algebra

Extra Work Can Improve Efficiency!

• More complicated non-zero structure in general

• Example: 3x3 blocking
  • Logical grid of 3x3 cells
  • Fill in explicit zeros
  • Unroll 3x3 block multiplies
  • “Fill ratio” = 1.5

• On Pentium III: 1.5x speedup!
  • Actual Mflop rate is 1.5^2 = 2.25x higher

• In general: sample the matrix to choose an r×c block structure using a heuristic (see the sketch below)
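A small Python sketch of the fill-ratio computation behind choosing an r×c block structure (exhaustive over a dense array for clarity, whereas the heuristic above only samples the matrix; the function name is illustrative):

import numpy as np

def fill_ratio(A, r, c):
    """Entries stored after padding every nonempty r-by-c block with explicit
    zeros, divided by the true nnz. Assumes r and c divide A's dimensions."""
    nnz = np.count_nonzero(A)
    stored = 0
    for i in range(0, A.shape[0], r):
        for j in range(0, A.shape[1], c):
            if np.count_nonzero(A[i:i+r, j:j+c]):
                stored += r * c              # the whole block is stored, zeros included
    return stored / nnz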

Page 47: Minimizing Communication  in Linear Algebra

Source: Accelerator Cavity Design Problem (from Ko via Husbands)

Page 48: Minimizing Communication  in Linear Algebra

Post-RCM Reordering

Page 49: Minimizing Communication  in Linear Algebra

100x100 Submatrix Along Diagonal

Page 50: Minimizing Communication  in Linear Algebra

Before: Green + Red. After: Green + Blue.

“Microscopic” Effect of RCM Reordering

Page 51: Minimizing Communication  in Linear Algebra

“Microscopic” Effect of Combined RCM+TSP Reordering

Before: Green + Red. After: Green + Blue.

Speedups from 1.6x to 3.3x

Page 52: Minimizing Communication  in Linear Algebra

Summary of SpMV Performance Optimizations
• Sequential SpMV
  • Register blocking (RB): up to 4x over CSR
  • Symmetry: 2.8x over CSR, 2.6x over RB
  • Reordering to create dense structure + splitting: 2x over CSR
  • Variable block splitting: 2.1x over CSR, 1.8x over RB
  • Diagonals: 2x over CSR
  • Cache blocking: 2.8x over CSR
  • Multiple vectors (SpMM): 7x over CSR
  • And combinations…
• Multicore SpMV
  • NUMA, affinity, SW prefetch, cache/LS/TLB blocking: 4x over naïve Pthreads
• Sequential sparse triangular solve
  • Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  • A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  • More later …
• All optimizations (being) automated in OSKI (bebop.cs.berkeley.edu)
  • Being adopted by Cray; Python interface in progress

Page 53: Minimizing Communication  in Linear Algebra

Avoiding Communication in Iterative Linear Algebra
• k steps of a typical iterative solver for sparse Ax=b or Ax=λx
  • Does k SpMVs with the starting vector
  • Finds the “best” solution among all linear combinations of these k+1 vectors
  • Many such “Krylov Subspace Methods”: Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
  • Assume matrix “well-partitioned,” with modest surface-to-volume ratio
  • Parallel implementation
    • Conventional: O(k log p) messages, because of k calls to SpMV
    • New: O(log p) messages – optimal
  • Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
• Lots of speedup possible (modeled and measured)
  • Price: some redundant computation
• Much prior work
  • See bebop.cs.berkeley.edu
  • CG: [van Rosendale, 83], [Chronopoulos and Gear, 89]
  • GMRES: [Walker, 88], [Joubert and Carey, 92], [Bai et al., 94]

Page 54: Minimizing Communication  in Linear Algebra

Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax-b||_2
• Cost of k steps of standard GMRES vs new GMRES

Standard GMRES:
   for i = 1 to k
      w = A · v(i-1)
      MGS(w, v(0), …, v(i-1))
      update v(i), H
   endfor
   solve LSQ problem with H

Sequential: #words_moved = O(k·nnz) from SpMV + O(k^2·n) from MGS
Parallel: #messages = O(k) from SpMV + O(k^2 · log p) from MGS

Communication-avoiding GMRES:
   W = [v, Av, A^2 v, …, A^k v]
   [Q,R] = TSQR(W)   … “Tall Skinny QR”
   Build H from R, solve LSQ problem

Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR
Parallel: #messages = O(1) from computing W + O(log p) from TSQR

• Oops – W from the power method, precision lost!
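A NumPy sketch of the communication-avoiding kernel above: build the k+1 basis vectors with k SpMVs (the part the matrix-powers kernel does in one pass over A), then orthogonalize with a single QR standing in for TSQR. Building H from R and solving the least-squares problem are omitted, and the monomial basis shown is exactly the one that loses precision, as the last bullet notes:

import numpy as np

def ca_gmres_basis(A, v, k):
    """Return Q, R with Q·R = [v, Av, ..., A^k v] (monomial basis)."""
    n = len(v)
    W = np.empty((n, k + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        W[:, j + 1] = A @ W[:, j]   # k SpMVs; A may be a dense array or a scipy.sparse matrix
    Q, R = np.linalg.qr(W)          # one QR (TSQR in the communication-avoiding version)
    return Q, R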

Page 55: Minimizing Communication  in Linear Algebra

“Monomial” basis [Ax, …, A^k x] fails to converge

A different polynomial basis does converge

Page 56: Minimizing Communication  in Linear Algebra

Speedups of GMRES on 8-core Intel Clovertown [MHDY09]

Page 57: Minimizing Communication  in Linear Algebra

Communication-Avoiding Iterative Linear Algebra: Future Work

• Lots of algorithms to implement, autotune
  • Make widely available via OSKI, Trilinos, PETSc, Python, …
• Extensions to other Krylov subspace methods
  • So far just Lanczos/CG and Arnoldi/GMRES
  • BiCGStab, CGS, QMR, …
• Add preconditioning
  • Block diagonal are easiest
  • Hierarchical, semiseparable, …
  • Works if interactions between distant points can be “compressed”

Page 58: Minimizing Communication  in Linear Algebra

Summary

Don’t Communic…

58

Page 59: Minimizing Communication  in Linear Algebra

1/3

Page 60: Minimizing Communication  in Linear Algebra

2/3

Page 61: Minimizing Communication  in Linear Algebra

3/3

Page 62: Minimizing Communication  in Linear Algebra

Extra Slides


Page 63: Minimizing Communication  in Linear Algebra

Proof of Communication Lower Bound on C = A·B (1/5)

• Proof from Irony/Toledo/Tiskin (2004)
  • Original proof, then generalization
• Think of the instruction stream being executed
  • Looks like “… add, load, multiply, store, load, add, …”
  • Each load/store moves a word between fast and slow memory
• We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly reordered assuming addition is commutative/associative
• Assume that at most M words can be stored in fast memory
• Outline:
  • Break the instruction stream into segments, each with M loads and stores
  • Somehow bound the maximum number of flops that can be done in each segment, call it F
  • So F · #segments ≥ T = total flops = 2n^3, so #segments ≥ T / F
  • So #loads & stores = M · #segments ≥ M · T / F

Page 64: Minimizing Communication  in Linear Algebra

Illustrating Segments, for M=3

[Figure: an instruction stream of Loads, Stores, and FLOPs laid out over time and broken into Segment 1, Segment 2, Segment 3, …, each segment containing M=3 loads and stores.]

Page 65: Minimizing Communication  in Linear Algebra

Proof of Communication Lower Bound on C = A·B (1/5)

• Proof from Irony/Toledo/Tiskin (2004)
  • Original proof, then generalization
• Think of the instruction stream being executed
  • Looks like “… add, load, multiply, store, load, add, …”
  • Each load/store moves a word between fast and slow memory
• We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly reordered assuming addition is commutative/associative
• Assume that at most M words can be stored in fast memory
• Outline:
  • Break the instruction stream into segments, each with M loads and stores
  • Somehow bound the maximum number of flops that can be done in each segment, call it F
  • So F · #segments ≥ T = total flops = 2n^3, so #segments ≥ T / F
  • So #loads & stores = M · #segments ≥ M · T / F

Page 66: Minimizing Communication  in Linear Algebra

Proof of Communication Lower Bound on C = A·B (2/5)

• Given a segment of the instruction stream with M loads & stores, how many adds & multiplies (F) can we do?
  • At most 2M entries of C, 2M entries of A and/or 2M entries of B can be accessed
• Use geometry:
  • Represent the n^3 multiplications by an n x n x n cube
  • One n x n face represents A
    • each 1 x 1 subsquare represents one A(i,k)
  • One n x n face represents B
    • each 1 x 1 subsquare represents one B(k,j)
  • One n x n face represents C
    • each 1 x 1 subsquare represents one C(i,j)
  • Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)
    • May be added directly to C(i,j), or to a temporary accumulator

Page 67: Minimizing Communication  in Linear Algebra

Proof of Communication Lower Bound on C = A·B (3/5)

[Figure: the n x n x n cube of multiplications with axes i, j, k. The “A face”, “B face”, and “C face” are labeled, with example squares A(1,1), A(1,2), A(1,3), A(2,1), B(1,1), B(2,1), B(1,3), B(3,1), C(1,1), C(2,3); the highlighted subcube represents C(1,1) += A(1,3)·B(3,1).]

• If we have at most 2M “A squares”, 2M “B squares”, and 2M “C squares” on the faces, how many cubes can we have?

Page 68: Minimizing Communication  in Linear Algebra

Proof of Communication Lower Bound on C = A·B (4/5)

[Figure: a black box with side lengths x, y, z inside the i-j-k cube; its “A shadow”, “B shadow”, and “C shadow” are its projections onto the three faces.]

# cubes in black box with side lengths x, y and z
  = volume of black box
  = x·y·z
  = (xz · zy · yx)^(1/2)
  = (#A□s · #B□s · #C□s)^(1/2)

(i,k) is in the “A shadow” if (i,j,k) is in the 3D set
(j,k) is in the “B shadow” if (i,j,k) is in the 3D set
(i,j) is in the “C shadow” if (i,j,k) is in the 3D set

Thm (Loomis & Whitney, 1949): # cubes in 3D set = Volume of 3D set ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)

Page 69: Minimizing Communication  in Linear Algebra

Proof of Communication Lower Bound on C = A·B (5/5)

• Consider one “segment” of instructions with M loads and stores
• There can be at most 2M entries of A, B, C available in one segment
• Volume of the set of cubes representing possible multiply/adds in one segment is ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) ≡ F
• # Segments ≥ 2n^3 / F
• # Loads & Stores = M · #Segments ≥ M · 2n^3 / F = n^3 / (2M)^(1/2)

• Parallel case: apply the reasoning to one processor out of P
  • # Adds and Muls ≥ 2n^3 / P (at least one proc does this)
  • M = n^2 / P (each processor gets an equal fraction of the matrix)
  • # “Loads & Stores” = # words moved from or to other procs ≥ M · (2n^3/P) / F = M · (2n^3/P) / (2M)^(3/2) = n^2 / (2P)^(1/2)

Page 70: Minimizing Communication  in Linear Algebra

How to generalize this lower bound (1/4)
• It doesn't depend on C(i,j) being a matrix entry, just a unique memory location (same for A(i,k) and B(k,j))
• It doesn't depend on C and A not overlapping (or C and B, or A and B)
  • Some reorderings may change the answer; we still get a lower bound
• It doesn't depend on C(i,j) just being a sum of products, C(i,j) = Σ_k A(i,k)*B(k,j); it holds for the more general model

  (P)  For all (i,j) ∈ S: Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

• It doesn't depend on doing n^3 multiply/adds

Page 71: Minimizing Communication  in Linear Algebra

How to generalize this lower bound (2/4)

  (P)  For all (i,j) ∈ S: Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

• It does assume that the presence of an operand generates a load and/or store; how could this not happen?
  • Mem(b(k,j)) could be reused in many more g_ijk than (P) allows
    • Ex: Compute C(m) = A * B.^m (Matlab notation) for m = 1 to t
    • Can move t^(1/2) times fewer words than Ω(#flops / M^(1/2))
  • We might generate a result during a segment, use it, and discard it, without generating any memory traffic
    • Turns out QR, eig, SVD all may do this
    • Need a different analysis for them later…

Page 72: Minimizing Communication  in Linear Algebra

How to generalize this lower bound (3/4)

  (P)  For all (i,j) ∈ S: Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

• Need to distinguish Sources and Destinations of each operand in fast memory during a segment:
  • Possible Sources:
    S1: Already in fast memory at the start of the segment, or read; at most 2M
    S2: Created during the segment; no bound without more information
  • Possible Destinations:
    D1: Left in fast memory at the end of the segment, or written; at most 2M
    D2: Discarded; no bound without more information
• Need to assume no S2/D2 arguments; at most 4M of the others

Page 73: Minimizing Communication  in Linear Algebra

How to generalize this lower bound (4/4)

  (P)  For all (i,j) ∈ S: Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

• Theorem: Evaluating (P) with memory of size M, where
  • f_ij and g_ijk are “nontrivial” functions of their arguments,
  • G is the total number of g_ijk's, and
  • there are no S2/D2 arguments,
  requires at least G / (8M)^(1/2) – M slow memory references
  • Simpler: #words_moved = Ω(#flops / M^(1/2))
• Corollary: Evaluating (P) requires at least G / (8^(1/2) M^(3/2)) – 1 messages (simpler: Ω(#flops / M^(3/2)))
  • Proof: the maximum message size is M

Page 74: Minimizing Communication  in Linear Algebra

Some Corollaries of this lower bound (1/2)

  (P)  For all (i,j) ∈ S: Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
  #words_moved = Ω(#flops / M^(1/2))

• The theorem applies to dense or sparse, parallel or sequential:
  • MatMul, including A^T A or A^2
  • Triangular solve C = A^(-1)·X
    • C(i,j) = ( X(i,j) – Σ_{k<i} A(i,k)*C(k,j) ) / A(i,i)   … A lower triangular
    • C plays the double role of b and c in Model (P)
  • LU factorization (any pivoting, LU or “ILU”)
    • L(i,j) = ( A(i,j) – Σ_{k<j} L(i,k)*U(k,j) ) / U(j,j),   U(i,j) = A(i,j) – Σ_{k<i} L(i,k)*U(k,j)
    • L (and U) plays the double role of c and a (c and b) in Model (P)
  • Cholesky (any diagonal pivoting, C or “IC”)
    • L(i,j) = ( A(i,j) – Σ_{k<j} L(i,k)*L^T(k,j) ) / L(j,j),   L(j,j) = ( A(j,j) – Σ_{k<j} L(j,k)*L^T(k,j) )^(1/2)
    • L (and L^T) plays the triple role of a, b and c

Page 75: Minimizing Communication  in Linear Algebra

Some Corollaries of this lower bound (2/2)

  (P)  For all (i,j) ∈ S: Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
  #words_moved = Ω(#flops / M^(1/2))

• Applies to “simplified” operations
  • Ex: Compute the determinant of A(i,j) = func(i,j) using LU, where each func(i,j) is called just once; so no inputs and 1 output
  • Still get Ω(#flops / M^(1/2) – 2n^2) words_moved, by “imposing” loads/stores
• Applies to compositions of operations
  • Ex: Compute A^k by repeated squaring, with only input A and output A^k
  • Still get Ω(#flops / M^(1/2) – n^2 log k) by “imposing” loads/stores
  • Holds for any interleaving of operations
• Applies to some graph algorithms
  • Ex: All-Pairs-Shortest-Path by Floyd-Warshall: g_ijk = “+” and f_ij = “min”
  • Get Ω(F / M^(1/2)) words_moved, where F ∈ {n^4, n^3 log n, n^3}

Page 76: Minimizing Communication  in Linear Algebra

Lower bounds for Orthogonal Transformations (1/4)
• Needed for QR, eig, SVD, …
• Analyze blocked Householder transformations:
  Π_{j=1..b} (I – 2 u_j u_j^T) = I – U T U^T, where U = [u_1, …, u_b]
• Treat Givens as 2x2 Householder
• Model details and assumptions
  • Write (I – U T U^T) A = A – U (T U^T A) = A – UZ where Z = T (U^T A)
  • Only count operations in all the A – UZ operations
    • “Generically” a large fraction of the work
  • Assume “Forward Progress”: each successive Householder transformation leaves previously created zeros zero; OK for
    • QR decomposition
    • Reductions to condensed forms (Hessenberg, tri/bidiagonal)
      – Possible exception: bulge chasing in the banded case
    • One sweep of (block) Hessenberg QR iteration

Page 77: Minimizing Communication  in Linear Algebra

Lower bounds for Orthogonal Transformations (2/4)
• Perform many A – UZ where Z = T (U^T A)
• First challenge to applying the theorem: need to collect all the A – UZ updates into one big set to which model (P) applies
  • Write them all as { A(i,j) = A(i,j) – Σ_k U(i,k) Z(k,j) } where k = index of the k-th transformation
  • k is not necessarily the index of the column of A it comes from
• Second challenge: all Z(k,j) may be S2/D2
  • Recall: S2/D2 means computed on the fly and discarded
  • Ex: A_{n×2n} = Q_{n×n} · R_{n×2n} where 2n^2 = M, so A fits in cache
    • Represent Q as n(n-1)/2 2x2 Householder (Givens) transforms
    • There are n^2(n-1)/2 = Θ(M^(3/2)) nonzero Z(k,j), not O(M)
  • Still only O(M^(3/2)) flops are done during a segment
  • But can't use Loomis-Whitney to prove it!
  • Need a new idea…

Page 78: Minimizing Communication  in Linear Algebra

Lower bounds for Orthogonal Transformations (3/4)
• Dealing with Z(k,j) being S2/D2
• How to bound #flops in A(i,j) = A(i,j) – Σ_k U(i,k) Z(k,j)?
  • Neither A nor U is S2/D2
    • A is either turned into R or U, which are output
  • So at most 2M of each during a segment
• #flops ≤ (#U(i,k)) · (#columns A(:,j) present)
         ≤ (#U(i,k)) · (#A(i,j) / min #nonzeros per column of A)
         ≤ h · O(M) / r, where h = O(M) … how small can r be?
• To store h ≡ #U(i,k) Householder vector entries in r rows, there can be at most r in the first column, r–1 in the second, etc. (to maintain “Forward Progress”), so r(r–1)/2 ≥ h, hence r ≥ h^(1/2)
• #flops ≤ h · O(M) / r ≤ O(M) · h^(1/2) = O(M^(3/2)), as desired

Page 79: Minimizing Communication  in Linear Algebra

Lower bounds for Orthogonal Transformations (4/4)
• Theorem: #words_moved by QR decomposition using (blocked) Householder transformations = Ω(#flops / M^(1/2))
• Theorem: #words_moved by reduction to Hessenberg form, tridiagonal form, bidiagonal form, or one sweep of QR iteration (or block versions of any of these) = Ω(#flops / M^(1/2))
  • Assuming Forward Progress (created zeros remain zero)
  • Model: merge left and right orthogonal transformations:
    A(i,j) = A(i,j) – Σ_{kL} UL(i,kL)·ZL(kL,j) – Σ_{kR} ZR(i,kR)·UR(j,kR)

Page 80: Minimizing Communication  in Linear Algebra

Summary of dense sequential Cholesky algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages assume (recursive) block layout
• Many references (see reports), only some shown, plus ours
  • Oldest reference may or may not include analysis
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work
• b = block size, M = fast memory size, n = dimension

Cholesky:
• 2 levels of memory, #Words Moved: LAPACK (with b = M^(1/2)); [Gustavson, 97] recursive, assumed good MatMul available; [BDHS09]
• 2 levels of memory, #Messages: [BDHS09], like [Gustavson,97] & [Ahmed,Pingali,00]; uses recursive Cholesky, TRSM and Matmul
• Multiple levels of memory, #Words Moved: (←same); [BDHS09] has analysis
• Multiple levels of memory, #Messages: (←same); [BDHS09] has analysis

Page 81: Minimizing Communication  in Linear Algebra

Summary of dense sequential QR algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages assume (recursive) block layout
• Many references (see reports), only some shown, plus ours
  • Oldest reference may or may not include analysis
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work
• b = block size, M = fast memory size, n = dimension

QR:
• 2 levels of memory, #Words Moved: [Elmroth, Gustavson, 98] recursive, may do more flops; LAPACK (only for n ≥ M and b = M^(1/2), or n^2 ≤ M); [DGHL08]
• 2 levels of memory, #Messages: [Frens, Wise, 03] explicit Q only, 3x more flops; [DGHL08] implicit Q, but represented differently, ~same flop count
• Multiple levels of memory, #Words Moved: [Elmroth, Gustavson, 98]; [DGHL08]?
• Multiple levels of memory, #Messages: [Frens, Wise, 03]; [DGHL08]? Can we get the same flop count?

Page 82: Minimizing Communication  in Linear Algebra

Summary of dense sequential LU algorithms attaining communication lower bounds

• Algorithms shown minimizing #Messages assume (recursive) block layout
• Many references (see reports), only some shown, plus ours
  • Oldest reference may or may not include analysis
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work
• b = block size, M = fast memory size, n = dimension

LU with pivoting:
• 2 levels of memory, #Words Moved: [Toledo, 97] recursive, assumes good TRSM and Matmul available; LAPACK (only for n ≥ M and b = M^(1/2), or n^2 ≤ M); [GDX08] different than partial pivoting, still stable
• 2 levels of memory, #Messages: [GDX08]
• Multiple levels of memory, #Words Moved: [Toledo, 97]; [GDX08]?
• Multiple levels of memory, #Messages: [GDX08]?

Page 83: Minimizing Communication  in Linear Algebra

Same idea for LU, almost [GDX08]
• First idea: use the same reduction tree as in QR, with LU at each node
  • Numerically unstable!
• Need a different pivoting scheme:
  • At each node in the tree, TSLU selects b pivot rows from the 2b candidates coming from its 2 child nodes
  • At each node, do LU on the 2b original rows selected by the child nodes, not on the U factors from the child nodes
  • When TSLU is done, permute the b selected rows to the top of the original matrix, then redo b steps of LU without pivoting
  • Thm: same results as partial pivoting on a different input matrix whose entries have the same magnitudes as the original
• CALU – Communication-Avoiding LU for general A (pivot selection sketched below)
  – Use TSLU for panel factorizations
  – Apply to the rest of the matrix
  – Cost: redundant panel factorizations
• Benefit:
  – Stable in practice, but not the same pivot choice as GEPP
  – b times fewer messages overall - faster
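A rough Python sketch of the tournament pivoting used by TSLU, as described above: each leaf proposes b pivot rows of its own block, and pairs of proposals are merged by running GEPP on the 2b original rows. The GEPP row selection is written out by hand, the binary combining tree is illustrative, and the final "redo LU without pivoting" step is omitted:

import numpy as np

def gepp_pivot_rows(block):
    """Indices (into block) of the rows chosen as pivots by LU with partial
    pivoting on a tall block of width b; assumes full column rank."""
    B = block.astype(float)
    m, b = B.shape
    order = list(range(m))
    for j in range(b):
        p = j + int(np.argmax(np.abs(B[j:, j])))           # pivot row for column j
        B[[j, p]] = B[[p, j]]
        order[j], order[p] = order[p], order[j]
        B[j+1:, j] /= B[j, j]
        B[j+1:, j+1:] -= np.outer(B[j+1:, j], B[j, j+1:])  # eliminate below the pivot
    return order[:b]

def tslu_select(W_blocks):
    """Tournament pivoting: return b candidate pivot rows (their values; a real
    implementation would also track their global row indices)."""
    cands = [Wi[gepp_pivot_rows(Wi)] for Wi in W_blocks]   # b rows proposed per leaf
    while len(cands) > 1:
        merged = []
        for i in range(0, len(cands), 2):
            if i + 1 < len(cands):
                stack = np.vstack([cands[i], cands[i + 1]])   # 2b original rows
                merged.append(stack[gepp_pivot_rows(stack)])
            else:
                merged.append(cands[i])
        cands = merged
    return cands[0]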

Page 84: Minimizing Communication  in Linear Algebra

TSLU: LU factorization of a tall skinny matrix

First try the obvious generalization of TSQR:

[Figure: W = [W0; W1; W2; W3]. Each block is factored locally, W_i = Π_{i0} · L_{i0} · U_{i0}; pairs of U factors are stacked and refactored up a binary tree, producing U_{01} and U_{11}, and finally U_{02}.]

Page 85: Minimizing Communication  in Linear Algebra

Growth factor for TSLU based factorization

Unstable for large P and large matrices. When P = # rows, TSLU is equivalent to parallel pivoting.

Courtesy of H. Xiang

Page 86: Minimizing Communication  in Linear Algebra

Growth factor for better CALU approach

Like threshold pivoting with worst-case threshold = 0.33, so |L| ≤ 3. Testing shows about the same residual as GEPP.

Page 87: Minimizing Communication  in Linear Algebra

Performance vs ScaLAPACK

• TSLU (tall-skinny)
  – IBM Power 5: up to 4.37x faster (16 procs, 1M x 150)
  – Cray XT4: up to 5.52x faster (8 procs, 1M x 150)
• CALU (square)
  – IBM Power 5: up to 2.29x faster (64 procs, 1000 x 1000)
  – Cray XT4: up to 1.81x faster (64 procs, 1000 x 1000)
• See INRIA Tech Report 6523 (2008), paper at SC08

Page 88: Minimizing Communication  in Linear Algebra

CALU speedup prediction for a Petascale machine - up to 81x faster (P = 8192)

Petascale machine modeled with 8192 procs, each at 500 GFlops/s, and a bandwidth of 4 GB/s (time per flop ≈ 2·10^-12 s, time per word ≈ 2·10^-9 s, latency ≈ 10^-5 s).

Page 89: Minimizing Communication  in Linear Algebra

Eigenvalue problems and SVD
• Uses spectral divide and conquer
  • Idea goes back to Bulgakov, Godunov, Malyshev
  • Roughly: if A^(2^k) = QR, the leading columns of Q converge to span the invariant subspace for the eigenvalues of A outside the unit circle, so Q*AQ becomes block upper triangular
  • Repeated squaring done implicitly
• [Bai, D., Gu, 97]
  • Reduced just to Matmul, QR, and Generalized QR with pivoting
  • Eliminated the need for exp(A), other improvements
• [DDH, 07]
  • Introduced randomized rank-revealing QR
    • [Q',R'] = qr(randn(n,n)), [Q,R] = qr(A·Q')
  • Just need Matmul and QR
• [BDD, 09]
  • Shows that randomized GQR is also rank revealing
  • Shows the limits of divide-and-conquer, depending on the pseudospectrum
  • Cache-oblivious, using [Frens, Wise, 03]

Page 90: Minimizing Communication  in Linear Algebra

Implicit Repeated Squaring

Repeat, until R_j converges:
   [ B_j ; A_j ] = [ Q11 Q12 ; Q21 Q22 ] · [ R_j ; 0 ]     … QR factorization
   A_{j+1} = Q12^H · A_j
   B_{j+1} = Q22^H · B_j

Then for each j:  A_j^{-1} B_j = (A_{j-1}^{-1} B_{j-1})^2 = … = (A_0^{-1} B_0)^(2^j)

Proof: from the 1st step, Q12^H B_j and Q22^H A_j cancel (orthogonality of Q), so Q12^{-H} Q22^H equals B_j A_j^{-1} up to sign; from the 2nd/3rd steps, A_{j+1}^{-1} B_{j+1} = A_j^{-1} Q12^{-H} Q22^H B_j = (A_j^{-1} B_j)^2.

Page 91: Minimizing Communication  in Linear Algebra

Sparse Linear Algebra
• Thm (George 73, Hoffman/Martin/Rose 73, Eisenstat/Schultz/Sherman 76): Cholesky on an n^s x n^s sparse matrix whose graph is an s-dimensional nearest-neighbor grid does Θ(n^(3(s-1))) flops; attainable by nested dissection ordering
  • Proof: most work is in dense Cholesky on a final submatrix of size Θ(n^(s-1))
• Corollary [BDHS09, G09]: Sequential sparse Cholesky on such a matrix moves Ω(n^(3(s-1)) / M^(1/2)) words, and parallel sparse Cholesky moves Ω(n^(2(s-1)) / (P log n)^(1/2)) words on P procs
  • Assumes a “2D” parallel algorithm, i.e. minimal memory
• Thm [G09]: Attainable in the parallel case by nested dissection, e.g. with PSPASES.

Page 92: Minimizing Communication  in Linear Algebra

Avoiding Communication in Iterative Linear Algebra
• k steps of a typical iterative solver for sparse Ax=b or Ax=λx
  • Does k SpMVs with the starting vector (e.g. with b, if solving Ax=b)
  • Finds the “best” solution among all linear combinations of these k+1 vectors
  • Many such “Krylov Subspace Methods”: Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
  • Assume matrix “well-partitioned,” with modest surface-to-volume ratio
  • Parallel implementation
    • Conventional: O(k log p) messages, because of k calls to SpMV
    • New: O(log p) messages – optimal
  • Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
• Lots of speedup possible (modeled and measured)
  • Price: some redundant computation

Page 93: Minimizing Communication  in Linear Algebra

Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors

[Figure: the entries of x, Ax, …, A^8x laid out across the index ranges of Proc 1 and Proc 2; the entries of [x, Ax] that can be computed without communication are highlighted.]

Page 94: Minimizing Communication  in Linear Algebra

Locally Dependent Entries for [x, Ax, A^2x], A tridiagonal, 2 processors

[Figure: same layout; the entries of [x, Ax, A^2x] that can be computed without communication are highlighted.]

Page 95: Minimizing Communication  in Linear Algebra

Locally Dependent Entries for [x, Ax, …, A^3x], A tridiagonal, 2 processors

[Figure: same layout; the entries computable without communication are highlighted.]

Page 96: Minimizing Communication  in Linear Algebra

Locally Dependent Entries for [x, Ax, …, A^4x], A tridiagonal, 2 processors

[Figure: same layout; the entries computable without communication are highlighted.]

Page 97: Minimizing Communication  in Linear Algebra

Locally Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors

[Figure: same layout; the entries computable without communication are highlighted, giving k=8-fold reuse of A.]

Page 98: Minimizing Communication  in Linear Algebra

Remotely Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors

[Figure: same layout; the remotely dependent entries near the processor boundary are highlighted.]

One message to get the data needed to compute the remotely dependent entries, not k=8: minimizes the number of messages = latency cost.
Price: redundant work proportional to the “surface/volume ratio” (see the sketch below).
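A serial NumPy sketch of the same idea for a tridiagonal A: fetch x once over a range padded by k entries on each side, then compute the local rows of x, Ax, …, A^k x with no further access to x, at the price of redundant work in the shrinking halo. Dense storage and the function name are illustrative:

import numpy as np

def local_matrix_powers(A, x, lo, hi, k):
    """Rows lo:hi of [x, A x, ..., A^k x] for a tridiagonal A, using x only on
    the padded range [lo-k, hi+k); rows lo:hi stay exact because the error made
    by truncating A to the pad moves inward one index per product."""
    n = len(x)
    pad_lo, pad_hi = max(lo - k, 0), min(hi + k, n)
    cur = x[pad_lo:pad_hi].copy()                 # the single fetch / message
    Asub = A[pad_lo:pad_hi, pad_lo:pad_hi]        # local piece of A
    out = [x[lo:hi].copy()]
    for _ in range(k):
        cur = Asub @ cur                          # redundant flops in the halo
        out.append(cur[lo - pad_lo:hi - pad_lo].copy())
    return out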

Page 99: Minimizing Communication  in Linear Algebra

Remotely Dependent Entries for [x,Ax,A2x,A3x], A irregular, multiple processors

Page 100: Minimizing Communication  in Linear Algebra

Sequential [x,Ax,…,A4x], with memory hierarchy

[Figure: computing [x, Ax, …, A^4x] one block of the matrix at a time, so the matrix is read from slow memory only once.]

One read of the matrix from slow memory, not k=4: minimizes words moved = bandwidth cost.
No redundant work.

Page 101: Minimizing Communication  in Linear Algebra

Performance results on 8-Core Clovertown