
Page 1: Minimizing Communication in Numerical Linear Algebra
Communication Lower Bounds for Direct Linear Algebra

Jim Demmel
EECS & Math Departments, UC Berkeley
demmel@cs.berkeley.edu
www.cs.berkeley.edu/~demmel

Page 2: Outline

• Recall motivation
• Lower bound proof for matmul, by Irony/Toledo/Tiskin
• How to generalize it to linear algebra without orthogonal transforms
• How to generalize it to linear algebra with orthogonal transforms
• Summary of linear algebra problems for which the lower bound is attainable
• Summary of extensions to Strassen

Page 3: Collaborators

• Grey Ballard, UCB EECS
• Ioana Dumitriu, U. Washington
• Laura Grigori, INRIA
• Ming Gu, UCB Math
• Mark Hoemmen, UCB EECS
• Olga Holtz, UCB Math & TU Berlin
• Julien Langou, U. Colorado Denver
• Marghoob Mohiyuddin, UCB EECS
• Oded Schwartz, TU Berlin
• Hua Xiang, INRIA
• Kathy Yelick, UCB EECS & NERSC
• BeBOP group, bebop.cs.berkeley.edu

Page 4: Motivation (1/2)

Two kinds of costs:
• Arithmetic (flops)
• Communication: moving data between
  • levels of a memory hierarchy (sequential case)
  • processors over a network connecting them (parallel case)

[Figure: sequential case shown as one CPU with a cache and DRAM; parallel case shown as several CPU+DRAM nodes connected by a network]

Page 5: Motivation (2/2)

• Running time of an algorithm is the sum of 3 terms (see the sketch after this slide):
  • #flops · time_per_flop
  • #words_moved / bandwidth    ← communication
  • #messages · latency         ← communication

• Exponentially growing gaps between:
  • time_per_flop << 1/network_bandwidth << network_latency
    • improving 59%/year vs 26%/year vs 15%/year
  • time_per_flop << 1/memory_bandwidth << memory_latency
    • improving 59%/year vs 23%/year vs 5.5%/year

• Goal: reorganize linear algebra to avoid communication
  • between all memory hierarchy levels: L1, L2, DRAM, network, etc.
  • not just hiding communication (speedup ≤ 2x)
  • arbitrary speedups possible
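The three-term model above can be made concrete in a few lines. The following Python sketch is illustrative only: the machine parameters (time_per_flop, bandwidth, latency) are invented values, not measurements from the lecture.

def run_time(n_flops, n_words, n_messages,
             time_per_flop=1e-11,  # seconds per flop (made up)
             bandwidth=1e10,       # words per second (made up)
             latency=1e-6):        # seconds per message (made up)
    return (n_flops * time_per_flop      # arithmetic term
            + n_words / bandwidth        # bandwidth term (communication)
            + n_messages * latency)      # latency term (communication)

# Example: classical O(n^3) matmul, n = 4096, fast memory of M words,
# assuming communication matches the lower bounds discussed later.
n, M = 4096, 2**20
flops = 2 * n**3
print(run_time(flops, flops / M**0.5, flops / M**1.5))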

Page 6: Direct linear algebra: Prior work on Matmul

• Assume an n^3 algorithm (i.e. not Strassen-like)
• Sequential case, with fast memory of size M:
  • Lower bound on #words moved to/from slow memory = Ω(n^3 / M^(1/2)) [Hong & Kung (1981)]
  • Attained using a blocked or recursive ("cache-oblivious") algorithm
• Parallel case on P processors:
  • Let NNZ be the total memory needed; assume load balance
  • Lower bound on #words communicated = Ω(n^3 / (P · NNZ)^(1/2)) [Irony, Tiskin & Toledo (2004)]
  • When NNZ = 3n^2 (the "2D algorithm"), get Ω(n^2 / P^(1/2))
    • Attained by Cannon's algorithm (simulated in the sketch after this slide)
  • For the "3D algorithm" (NNZ = O(n^2 P^(1/3))), get Ω(n^2 / P^(2/3))
    • Attainable too (details later)
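As a concrete illustration of the 2D algorithm cited above, here is a sequential NumPy simulation of Cannon's algorithm (our sketch; function name and test sizes are assumptions). Each np.roll models one round of nearest-neighbor shifts, in which every processor sends and receives one (n/√P) x (n/√P) block; √P such rounds move Θ(n^2 / P^(1/2)) words per processor, matching the bound.

import numpy as np

def cannon_matmul(A, B, q):           # q = P^(1/2), the processor grid dimension
    n = A.shape[0]; b = n // q
    A = A.copy(); B = B.copy()
    # Initial skew: shift block-row i of A left by i blocks,
    # block-column j of B up by j blocks.
    for i in range(q):
        A[i*b:(i+1)*b, :] = np.roll(A[i*b:(i+1)*b, :], -i*b, axis=1)
    for j in range(q):
        B[:, j*b:(j+1)*b] = np.roll(B[:, j*b:(j+1)*b], -j*b, axis=0)
    C = np.zeros_like(A)
    for _ in range(q):                # q rounds
        for i in range(q):            # each "processor" (i,j) multiplies
            for j in range(q):        # its current local blocks
                C[i*b:(i+1)*b, j*b:(j+1)*b] += (
                    A[i*b:(i+1)*b, j*b:(j+1)*b] @ B[i*b:(i+1)*b, j*b:(j+1)*b])
        A = np.roll(A, -b, axis=1)    # shift all A blocks one step left
        B = np.roll(B, -b, axis=0)    # shift all B blocks one step up
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(cannon_matmul(A, B, 4), A @ B)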

Page 7: Direct linear algebra: Generalizing prior work

• Sequential case:
  #words_moved = Ω(n^3 / M^(1/2)) = Ω(#flops / (fast_memory_size)^(1/2))
• Parallel case:
  #words_moved = Ω(n^3 / (P · NNZ)^(1/2)) = Ω((n^3/P) / (NNZ/P)^(1/2))
               = Ω(#flops_per_proc / (memory_per_processor)^(1/2))
  (true for at least one processor, assuming "balance" in either flops or in memory)

• In both cases, we let M = memory size and write:
  #words_moved by at least one processor = Ω(#flops / M^(1/2))

Page 8: Lower bound for all "direct" linear algebra

#words_moved by at least one processor = Ω(#flops / M^(1/2))
#messages_sent by at least one processor = Ω(#flops / M^(3/2))

• Holds for
  • Matmul, BLAS, LU, QR, eig, SVD
    • Need to explain the model of computation
  • Some whole programs (sequences of these operations, no matter how they are interleaved)
  • Dense and sparse matrices (where #flops << n^3)
  • Sequential and parallel algorithms
  • Some graph-theoretic algorithms

Page 9: Proof of Communication Lower Bound on C = A·B (1/5)

• Proof from Irony/Toledo/Tiskin (2004)
  • Original proof, then generalization
• Think of the instruction stream being executed
  • Looks like "… add, load, multiply, store, load, add, …"
  • Each load/store moves a word between fast and slow memory, or between local memory and remote memory (another processor)
• We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly reordered assuming addition is commutative/associative
• Assume at most M words can be stored in fast memory
• Outline:
  • Break the instruction stream into segments, each with M loads and stores
  • Somehow bound the maximum number of flops that can be done in each segment; call it F
  • Then F · #segments ≥ T = total flops = 2n^3, so #segments ≥ T / F
  • So #loads & stores = M · #segments ≥ M · T / F

Page 10: Illustrating Segments, for M = 3

[Figure: the instruction stream (loads, stores, FLOPs) laid out along a time axis and broken into Segment 1, Segment 2, Segment 3, …; each segment contains M = 3 loads/stores interleaved with FLOPs]


Page 12: Proof of Communication Lower Bound on C = A·B (2/5)

• Given a segment of the instruction stream with M loads & stores, how many adds & multiplies (F) can we do?
  • At most 2M entries of C, 2M entries of A, and/or 2M entries of B can be accessed
• Use geometry:
  • Represent the n^3 multiplications by an n x n x n cube
  • One n x n face represents A
    • each 1 x 1 subsquare represents one A(i,k)
  • One n x n face represents B
    • each 1 x 1 subsquare represents one B(k,j)
  • One n x n face represents C
    • each 1 x 1 subsquare represents one C(i,j)
  • Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)
    • May be added directly to C(i,j), or to a temporary accumulator

Page 13: Proof of Communication Lower Bound on C = A·B (3/5)

• If we have at most 2M "A squares", 2M "B squares", and 2M "C squares" on the faces, how many cubes can we have?

[Figure: the n x n x n cube with axes i, j, k; the "A face", "B face", and "C face"; a highlighted 1 x 1 x 1 subcube represents C(1,1) += A(1,3)·B(3,1)]

Page 14: Proof of Communication Lower Bound on C = A·B (4/5)

[Figure: a black box with side lengths x, y, z inside the cube; its "A shadow", "B shadow", and "C shadow" are its projections onto the three faces]

• #cubes in a black box with side lengths x, y, and z
  = volume of the black box = x·y·z = (xz · zy · yx)^(1/2) = (#A□s · #B□s · #C□s)^(1/2)

• (i,k) is in the "A shadow" if (i,j,k) is in the 3D set
  (j,k) is in the "B shadow" if (i,j,k) is in the 3D set
  (i,j) is in the "C shadow" if (i,j,k) is in the 3D set

• Thm (Loomis & Whitney, 1949): #cubes in a 3D set = volume of the 3D set
  ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)

Page 15: Proof of Communication Lower Bound on C = A·B (5/5)

• Consider one "segment" of instructions with M loads and stores
• At most 2M entries each of A, B, and C are available in one segment
• Volume of the set of cubes representing possible multiply/adds in one segment is ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) ≡ F
• #segments ≥ 2n^3 / F
• #loads & stores = M · #segments ≥ M · 2n^3 / F ≥ n^3 / (2M)^(1/2) - M = Ω(n^3 / M^(1/2))
  (a numeric sanity check follows this slide)

• Parallel case: apply the reasoning to one processor out of P
  • #adds and muls ≥ 2n^3 / P (at least one processor does this many)
  • M = n^2 / P (each processor gets an equal fraction of the matrix)
  • #"loads & stores" = #words moved from or to other processors
    ≥ M · (2n^3/P) / F = M · (2n^3/P) / (2M)^(3/2) = n^2 / (2P)^(1/2)
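The sequential bound just derived can be sanity-checked against a simple traffic count for the standard blocked algorithm. The counting model below is our own simplification (every block is loaded or stored in full; b is chosen so three b x b blocks fit in fast memory):

def blocked_matmul_words_moved(n, b):
    q = n // b
    words = 0
    for _ in range(q * q):        # for each b x b block C(i,j):
        words += 2 * b * b        #   load and finally store C(i,j)
        words += 2 * b * b * q    #   load A(i,k) and B(k,j) for k = 1..q
    return words

n, M = 1024, 3 * 64**2            # fast memory holds three 64 x 64 blocks
b = int((M / 3) ** 0.5)
lower_bound = n**3 / (2 * M) ** 0.5 - M
print(blocked_matmul_words_moved(n, b), ">=", lower_bound)   # ~5x the bound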

Page 16: Proof of Loomis-Whitney inequality

• T = 3D set of 1x1x1 cubes on a lattice
  • N = |T| = #cubes
• Tx = projection of T onto the x = 0 plane
  • Nx = |Tx| = #squares in Tx; same for Ty, Ny, etc.
• Goal: N ≤ (Nx · Ny · Nz)^(1/2) (checked numerically after this slide)

• T(x=i) = subset of T with x = i
• T(x=i | y) = projection of T(x=i) onto the y = 0 plane
• N(x=i) = |T(x=i)|, etc.
• Key observation: N(x=i) ≤ N(x=i | y) · N(x=i | z)

• N = Σ_i N(x=i) = Σ_i (N(x=i))^(1/2) · (N(x=i))^(1/2)
    ≤ Σ_i (Nx)^(1/2) · (N(x=i))^(1/2)                               … since N(x=i) ≤ Nx
    ≤ (Nx)^(1/2) · Σ_i (N(x=i | y) · N(x=i | z))^(1/2)
    = (Nx)^(1/2) · Σ_i (N(x=i | y))^(1/2) · (N(x=i | z))^(1/2)
    ≤ (Nx)^(1/2) · (Σ_i N(x=i | y))^(1/2) · (Σ_i N(x=i | z))^(1/2)  … by Cauchy-Schwarz
    = (Nx)^(1/2) · (Ny)^(1/2) · (Nz)^(1/2)

[Figure: the slab T(x=i) of T at x = i and its projections T(x=i | y) and T(x=i | z), illustrating N(x=i) ≤ N(x=i | y) · N(x=i | z)]
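The numerical check promised above (our sketch, not from the slides): generate a random set T of lattice cubes, compute the three projections, and verify N ≤ (Nx · Ny · Nz)^(1/2).

import random

random.seed(0)
T = {(random.randrange(8), random.randrange(8), random.randrange(8))
     for _ in range(200)}

N = len(T)
Nx = len({(y, z) for (x, y, z) in T})   # squares in the projection onto x = 0
Ny = len({(x, z) for (x, y, z) in T})   # projection onto y = 0
Nz = len({(x, y) for (x, y, z) in T})   # projection onto z = 0

assert N <= (Nx * Ny * Nz) ** 0.5
print(N, "<=", (Nx * Ny * Nz) ** 0.5)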

Page 17: Homework

• Prove a more general statement of Loomis-Whitney:
  • Suppose T is d-dimensional
  • N = |T| = #d-dimensional cubes in T
  • T(i) is the projection of T onto the hyperplane x(i) = 0
  • N(i) = (d-1)-dimensional volume of T(i)
  • Show: N ≤ ∏_{i=1 to d} (N(i))^(1/(d-1))

Page 18: How to generalize this lower bound (1/4)

Starting point: C(i,j) = Σ_k A(i,k)·B(k,j)

• It doesn't depend on C(i,j) being a matrix entry, just a unique memory location (same for A(i,k) and B(k,j))
• It doesn't depend on C and A not overlapping (or C and B, or A and B)
  • Some reorderings may change the answer; we still get a lower bound over all reorderings
• It doesn't depend on C(i,j) just being a sum of products; generalize to model (P):

  (P)  For all (i,j) ∈ S, Mem(c(i,j)) =
         f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

• It doesn't depend on doing n^3 multiply/adds

Page 19: How to generalize this lower bound (2/4)

  (P)  For all (i,j) ∈ S, Mem(c(i,j)) =
         f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

• It does assume each operand generates a load and/or store; how could this not happen?
  • Mem(b(k,j)) could be reused in many more g_ijk than (P) allows
    • Ex: Compute C(m) = A * (B.^m) (Matlab notation) for m = 1 to t
    • Can move many fewer words than Ω(#flops / M^(1/2))
  • We might generate a result during a segment, use it, and discard it, without generating any memory traffic
    • Turns out QR, eig, and SVD all may do this
    • Need a different analysis for them (later…)

Page 20: How to generalize this lower bound (3/4)

  (P)  For all (i,j) ∈ S, Mem(c(i,j)) =
         f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

• Need to distinguish the sources and destinations of each operand in fast memory during a segment:
• Possible sources:
  • S1: already in fast memory at the start of the segment, or read; at most 2M
  • S2: created during the segment; no bound without more information
• Possible destinations:
  • D1: left in fast memory at the end of the segment, or written; at most 2M
  • D2: discarded; no bound without more information
• Need to assume no S2/D2 arguments; at most 4M of the others

Page 21: How to generalize this lower bound (4/4)

  (P)  For all (i,j) ∈ S, Mem(c(i,j)) =
         f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

• Theorem: Evaluating (P) with memory of size M, where
  • f_ij and g_ijk are "nontrivial" functions of their arguments,
  • G is the total number of g_ijk's, and
  • there are no S2/D2 arguments,
  requires at least G / (8M)^(1/2) - M slow memory references
  • Simpler: #words_moved = Ω(#flops / M^(1/2))

• Corollary: Evaluating (P) requires at least G / (8^(1/2) · M^(3/2)) - 1 messages (simpler: Ω(#flops / M^(3/2)))
  • Proof: the maximum message size is M

Page 22: Some Corollaries of this lower bound (1/4)

  (P)  For all (i,j) ∈ S, Mem(c(i,j)) =
         f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

  #words_moved = Ω(#flops / M^(1/2))

• The theorem applies to dense or sparse, parallel or sequential:
  • Matmul, including A^T·A or A^2
  • Triangular solve C = A^(-1)·X
    • C(i,j) = (X(i,j) - Σ_{k<i} A(i,k)·C(k,j)) / A(i,i)   … A lower triangular
    • C plays the double role of b and c in model (P)
  • LU factorization (any pivoting, LU or "ILU")
    • L(i,j) = (A(i,j) - Σ_{k<j} L(i,k)·U(k,j)) / U(j,j);  U(i,j) = A(i,j) - Σ_{k<i} L(i,k)·U(k,j)
    • L (and U) plays the double role of c and a (c and b) in model (P)
  • Cholesky (any diagonal pivoting, C or "IC"); the recurrence is transcribed into code after this slide
    • L(i,j) = (A(i,j) - Σ_{k<j} L(i,k)·L^T(k,j)) / L(j,j);  L(j,j) = (A(j,j) - Σ_{k<j} L(j,k)·L^T(k,j))^(1/2)
    • L (and L^T) plays the triple role of a, b, and c
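To make the Cholesky entry above concrete, here is a direct transcription of the slide's recurrence (unblocked and unpivoted; a sketch of the arithmetic only, not a communication-optimal implementation). Each product L(i,k)·L(j,k) plays the role of one g_ijk, and the final divide or square root is the f_ij.

import math

def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # L(j,j) = (A(j,j) - sum_{k<j} L(j,k)^2)^(1/2)
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            # L(i,j) = (A(i,j) - sum_{k<j} L(i,k)*L(j,k)) / L(j,j)
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

print(cholesky([[4.0, 2.0], [2.0, 5.0]]))   # [[2.0, 0.0], [1.0, 2.0]]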

Page 23: Some Corollaries of this lower bound (2/4)

  (P)  For all (i,j) ∈ S, Mem(c(i,j)) =
         f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

  #words_moved = Ω(#flops / M^(1/2))

• Applies to "simplified" operations:
  • Ex 1: Compute ||A·B||_F where A(i,k) = f(i,k) and B(k,j) = g(k,j), so no inputs and 1 output; assume each f(i,k) and g(k,j) is evaluated once
  • Ex 2: Compute the determinant of A(i,j) = f(i,j) using LU
• Apply the lower bound by "imposing" reads and writes:
  • Ex 1: Every time a final value of (A·B)(i,j) is computed, write it; every time f(i,k) or g(k,j) is evaluated, insert a read
  • Ex 2: Every time a final value of L(i,j) or U(i,j) is computed, write it; every time f(i,j) is evaluated, insert a read
• Still get Ω(#flops / M^(1/2) - 3n^2) words_moved, by subtracting the "imposed" reads/writes (the sparse case is analogous)

Page 24: Some Corollaries of this lower bound (3/4)

  (P)  For all (i,j) ∈ S, Mem(c(i,j)) =
         f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

  #words_moved = Ω(#flops / M^(1/2))

• Applies to compositions of operations:
  • Ex: Compute A^t by repeated multiplication; the only input is A, the only output A^t
• Reorganize the composition to match (P):
  • Impose writes of the intermediate results A^2, …, A^(t-1)
  • Still get #words_moved = Ω(#flops / M^(1/2) - (t-2)·n^2)
• Holds for any interleaving of the operations
• Homework: apply this to repeated squaring

[Figure: the chain A^2 = A·A, A^3 = A^2·A, …, A^t = A^(t-1)·A]

Page 25: Some Corollaries of this lower bound (4/4)

  (P)  For all (i,j) ∈ S, Mem(c(i,j)) =
         f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

  #words_moved = Ω(#flops / M^(1/2))

• Applies to some graph algorithms
• Ex: All-Pairs Shortest Paths by Floyd-Warshall (a runnable sketch follows this slide):

  initialize all path(i,j) = length of edge from node i to j (or ∞ if none)
  for k := 1 to n
    for i := 1 to n
      for j := 1 to n
        path(i,j) = min( path(i,j), path(i,k) + path(k,j) )

• Matches (P) with g_ijk = "+" and f_ij = "min"
• Get #words_moved = Ω(n^3 / M^(1/2)) for dense graphs
• Homework: state the result for sparse graphs; how does n^3 change?
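The runnable transcription promised above (the dict-of-edges input format and the zero diagonal are our own conventions; the triple loop is exactly the slide's):

INF = float("inf")

def floyd_warshall(edges, n):
    # edges: {(i, j): length}, nodes numbered 0..n-1
    path = [[0.0 if i == j else edges.get((i, j), INF) for j in range(n)]
            for i in range(n)]
    for k in range(n):                # g_ijk is "+", f_ij is "min"
        for i in range(n):
            for j in range(n):
                path[i][j] = min(path[i][j], path[i][k] + path[k][j])
    return path

print(floyd_warshall({(0, 1): 1.0, (1, 2): 2.0}, 3))   # path[0][2] == 3.0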

Page 26: "3D" parallel matrix multiplication C = C + A·B

• p processors arranged in a p^(1/3) x p^(1/3) x p^(1/3) grid, with tree networks along the "fibers"
• Each A(i,k), B(k,j), C(i,j) is an n/p^(1/3) x n/p^(1/3) block
• Initially:
  • Proc (i,j,0) owns C(i,j)
  • Proc (i,0,k) owns A(i,k)
  • Proc (0,j,k) owns B(k,j)
• Algorithm:
  • For all (i,k), broadcast A(i,k) to procs (i,j,k) ∀j
    … comm. cost = log(p^(1/3)) · α + log(p^(1/3)) · n^2/p^(2/3) · β
  • For all (j,k), broadcast B(k,j) to procs (i,j,k) ∀i … same comm. cost
  • For all (i,j,k): Tmp(i,j,k) = A(i,k) · B(k,j) … cost = 2(n/p^(1/3))^3 = 2n^3/p flops
  • For all (i,j), reduce C(i,j) = Σ_k Tmp(i,j,k) … same comm. cost
• Total comm. cost = O(log(p) · n^2/p^(2/3) · β + log(p) · α) (compared numerically in the sketch after this slide)
• Lower bound = Ω((n^3/p)/(n^2/p^(2/3))^(1/2) · β + (n^3/p)/(n^2/p^(2/3))^(3/2) · α) = Ω(n^2/p^(2/3) · β + α)
• Lower bound also ~attainable for all n^2/p^(2/3) ≥ M = n^2/p^x ≥ n^2/p [Toledo et al]

[Figure: the p^(1/3) x p^(1/3) x p^(1/3) processor grid, with the blocks of A, B, and C laid out on its faces]
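The cost comparison promised above: plug each algorithm's word and message counts into the α-β model. The 3D counts come from this slide; the 2D (Cannon) counts are the standard ones; the machine parameters are invented, so treat this purely as a sketch.

from math import log2

def cost_2d(n, p, alpha, beta):    # Cannon: ~2*sqrt(p) shifts of n^2/p words each
    return alpha * 2 * p**0.5 + beta * 2 * n**2 / p**0.5

def cost_3d(n, p, alpha, beta):    # this slide: O(log p) broadcast/reduce steps
    return alpha * log2(p) + beta * log2(p) * n**2 / p**(2/3)

n, p, alpha, beta = 2**14, 4096, 1e-6, 1e-9   # made-up machine parameters
print(cost_2d(n, p, alpha, beta), cost_3d(n, p, alpha, beta))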

Page 27: So when doesn't Model (P) apply?

  (P)  For all (i,j) ∈ S, Mem(c(i,j)) =
         f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )

  #words_moved = Ω(#flops / M^(1/2))

• Ex: for r = 1 to t, C(r) = A * (B.^(1/r)) … Matlab notation
• (P) does not apply:
  • With A(i,k) and B(k,j) in memory, we can do t operations, not just 1 g_ijk
  • Can't apply (P) with arguments b(k,j) indexed in a one-to-one fashion
• Can still analyze using segments and Loomis-Whitney:
  • #flops per segment ≤ 8(tM^3)^(1/2)
  • #segments ≥ t·n^3 / (8(tM^3)^(1/2))
  • #words_moved = Ω(t^(1/2) · n^3 / M^(1/2)), not Ω(t · n^3 / M^(1/2))
• Attainable, using a variation of the usual blocked matmul algorithm
  • Homework! (Hint: need to recompute (B(k,j))^(1/r) when needed)

Page 28: Lower bounds for Orthogonal Transformations (1/4)

• Needed for QR, eig, SVD, …
• Analyze blocked Householder transformations:
  ∏_{j=1 to b} (I - 2 u_j u_j^T) = I - U T U^T, where U = [u_1, …, u_b]
  • Treat Givens rotations as 2x2 Householder transformations
• Model details and assumptions:
  • Write (I - U T U^T)·A = A - U·(T·U^T·A) = A - U·Z, where Z = T·(U^T·A) (verified numerically in the sketch after this slide)
  • Only count operations in the A - U·Z updates
    • "Generically" a large fraction of the work
  • Assume "forward progress": each successive Householder transformation leaves previously created zeros zero; OK for
    • QR decomposition
    • Reductions to condensed forms (Hessenberg, tri/bidiagonal)
      • Possible exception: bulge chasing in the banded case
    • One sweep of (block) Hessenberg QR iteration
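The blocked form above can be checked numerically. The sketch below builds the T factor column by column for unit reflectors (I - 2 u_j u_j^T); this compact-WY construction is a standard one that we assume here, since the slide does not spell it out. It verifies both the product identity and the A - U·Z update.

import numpy as np

rng = np.random.default_rng(0)
n, b = 6, 3
U = rng.standard_normal((n, b))
U /= np.linalg.norm(U, axis=0)            # unit Householder vectors u_j

T = np.zeros((b, b))
for j in range(b):                        # standard compact-WY recurrence
    T[j, j] = 2.0
    T[:j, j] = -2.0 * T[:j, :j] @ (U[:, :j].T @ U[:, j])

Q_blocked = np.eye(n) - U @ T @ U.T       # I - U T U^T
Q_product = np.eye(n)
for j in range(b):                        # prod_j (I - 2 u_j u_j^T)
    u = U[:, j:j+1]
    Q_product = Q_product @ (np.eye(n) - 2.0 * u @ u.T)
assert np.allclose(Q_blocked, Q_product)

A = rng.standard_normal((n, n))
Z = T @ (U.T @ A)                         # the slide's Z = T (U^T A)
assert np.allclose(A - U @ Z, Q_blocked @ A)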

Page 29: Lower bounds for Orthogonal Transformations (2/4)

• Perform many updates A - U·Z, where Z = T·(U^T·A)
• First challenge in applying the theorem: need to collect all the A - U·Z updates into one big set to which model (P) applies
  • Write them all as { A(i,j) = A(i,j) - Σ_k U(i,k)·Z(k,j) }, where k = index of the k-th transformation
  • k is not necessarily the index of the column of A it comes from
• Second challenge: all the Z(k,j) may be S2/D2
  • Recall: S2/D2 means computed on the fly and discarded
  • Ex: A_{n x 2n} = Q_{n x n} · R_{n x 2n}, where 2n^2 = M, so A fits in cache
    • Represent Q as n(n-1)/2 2x2 Householder (Givens) transforms
    • There are n^2(n-1)/2 = Θ(M^(3/2)) nonzero Z(k,j), not O(M)
    • Still only Θ(M^(3/2)) flops are done during the segment
    • But we can't use Loomis-Whitney to prove it!
    • Need a new idea…

Page 30: Lower bounds for Orthogonal Transformations (3/4)

• Dealing with Z(k,j) being S2/D2: how to bound #flops in A(i,j) = A(i,j) - Σ_k U(i,k)·Z(k,j)?
• Neither A nor U is S2/D2:
  • A is either turned into R or into U, which are output
  • So at most 2M of each during a segment
• #flops ≤ (#U(i,k)) · (#columns A(:,j) present)
         ≤ (#U(i,k)) · (#A(i,j) / min #nonzeros per column of A)
         ≤ h · O(M) / r, where h ≡ #U(i,k) = O(M) and r = min #nonzeros per column of A
  … how small can r be?
• To store h ≡ #U(i,k) Householder vector entries in r rows, there can be at most r in the first column, r-1 in the second, etc. (to maintain "forward progress"), so r(r-1)/2 ≥ h, i.e. r ≥ h^(1/2)
• #flops ≤ h · O(M) / r ≤ O(M) · h^(1/2) = O(M^(3/2)), as desired

Page 31: Lower bounds for Orthogonal Transformations (4/4)

• Theorem: #words_moved by QR decomposition using (blocked) Householder transformations = Ω(#flops / M^(1/2))
• Theorem: #words_moved by reduction to Hessenberg, tridiagonal, or bidiagonal form, or by one sweep of QR iteration (or block versions of any of these) = Ω(#flops / M^(1/2))
  • Assuming forward progress (created zeros remain zero)
  • Model: merge left and right orthogonal transformations:
    A(i,j) = A(i,j) - Σ_{kL} UL(i,kL)·ZL(kL,j) - Σ_{kR} ZR(i,kR)·UR(j,kR)

Page 32: Robust PCA (Candès, 2009)

• Decompose a matrix into a low-rank component plus a sparse component:
  M = L + S
• Homework: apply the lower bound result

Page 33: Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  • Mostly not
• If not, are there other algorithms that do?
  • Yes
• Several goals for algorithms:
  • Minimize bandwidth (Ω(#flops / M^(1/2))) and latency (Ω(#flops / M^(3/2)))
  • Multiple memory hierarchy levels (attain the bound for each level?)
  • Explicit dependence on (multiple) M(s), vs. "cache oblivious"
  • Fewest flops when the problem fits in the smallest memory
• What about sparse algorithms?

Page 34: Recall: Minimizing latency requires new data structures

• To minimize latency, we need to load/store a whole rectangular subblock of the matrix with one "message"
  • Incompatible with conventional columnwise (rowwise) storage
  • Ex: rows (columns) are not in contiguous memory locations
• Blocked storage: store the matrix as a matrix of b x b blocks, each block stored contiguously (see the sketch after this slide)
  • OK for one level of memory hierarchy; what if there are more?
• Recursive blocked storage: store each block using subblocks
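The sketch referenced above: a hypothetical address computation for blocked storage (blocks in row-major order, row-major inside each block), showing why a b x b block occupies one contiguous range and can therefore be moved with one message.

def blocked_index(i, j, n, b):
    bi, bj = i // b, j // b       # which block
    ii, jj = i % b, j % b         # offset inside the block
    return (bi * (n // b) + bj) * b * b + ii * b + jj

# n = 4, b = 2: element (0, 2) is the first entry of block (0, 1),
# so it lives at offset 4; the whole block is offsets 4..7, contiguous.
assert blocked_index(0, 2, 4, 2) == 4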

Page 35: Recall: Blocked vs Cache-Oblivious Algorithms

• Blocked matmul C = A·B explicitly refers to subblocks of A, B, and C of dimensions that depend on the cache size:

  … break A_nxn, B_nxn, C_nxn into b x b blocks labeled A(i,j), etc.
  … b chosen so 3 b x b blocks fit in cache
  for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
    C(i,j) = C(i,j) + A(i,k)·B(k,j)   … b x b matmul
  … another level of memory would need 3 more loops

• Cache-oblivious matmul C = A·B is independent of the cache (a runnable version follows this slide):

  function C = MM(A, B)
    if A and B are 1 x 1
      C = A · B
    else
      … break A_nxn, B_nxn, C_nxn into (n/2) x (n/2) blocks labeled A(i,j), etc.
      for i = 1 to 2, for j = 1 to 2, for k = 1 to 2
        C(i,j) = C(i,j) + MM( A(i,k), B(k,j) )   … n/2 x n/2 matmul
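A runnable NumPy version of the cache-oblivious recursion above (our sketch: it assumes n is a power of two and recurses to a small base case instead of 1 x 1, purely to keep the Python overhead down).

import numpy as np

def mm(A, B, base=16):
    n = A.shape[0]
    if n <= base:
        return A @ B                       # base case
    h = n // 2
    C = np.empty_like(A)
    for i in (0, 1):
        for j in (0, 1):                   # C(i,j) = sum_k MM(A(i,k), B(k,j))
            C[i*h:(i+1)*h, j*h:(j+1)*h] = (
                mm(A[i*h:(i+1)*h, :h], B[:h, j*h:(j+1)*h], base) +
                mm(A[i*h:(i+1)*h, h:], B[h:, j*h:(j+1)*h], base))
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(mm(A, B), A @ B)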

Page 36: Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #messages assume a (recursive) block layout
• Many references (see reports); only some are shown, plus ours
  • Older references may or may not include analysis
• In the original slide, cache-oblivious algorithms are underlined, green entries are ours, and "?" marks unknown/future work

Algorithm | 2 Levels of Memory (#words moved and #messages) | Multiple Levels of Memory (#words moved and #messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested), or recursive [Gustavson,97]
Cholesky | #words: LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09]; #messages: [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] | (same)
LU with pivoting | #words: LAPACK (rarely), [Toledo,97], [GDX08]; #messages: [GDX08] | #words: [Toledo,97], [GDX08]?; #messages: [GDX08]?
QR | #words: LAPACK (rarely), [Elmroth,Gustavson,98], [DGHL08]; #messages: [Frens,Wise,03], [DGHL08] | #words: [Elmroth,Gustavson,98], [DGHL08]?; #messages: [Frens,Wise,03], [DGHL08]?
Eig, SVD | Not LAPACK; #words and #messages: [BDD10] | #words and #messages: [BDD10]

Page 37: Summary of dense 2D parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors, memory per processor = O(n^2 / P)
• Many references (see reports); green entries in the original slide are ours
• ScaLAPACK assumes the best block size b is chosen
• Recall the lower bounds: #words_moved = Ω(n^2 / P^(1/2)) and #messages = Ω(P^(1/2))

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX08] | log P | log P
LU | ScaLAPACK | log P | log P · N / P^(1/2)
QR | [DGHL08] | log P | log^3 P
QR | ScaLAPACK | log P | log P · N / P^(1/2)
Sym Eig, SVD | [BDD10] | log P | log^3 P
Sym Eig, SVD | ScaLAPACK | log P | N / P^(1/2)
Nonsym Eig | [BDD10] | log P | log^3 P
Nonsym Eig | ScaLAPACK | log P · P^(1/2) | log P · N

Page 38: Recent Communication-Optimal Algorithms

• QR with column pivoting
• Cholesky with diagonal pivoting
• LU with "complete" pivoting
• LDL' with "complete" pivoting
• Sparse Cholesky
  • For matrices with "good separators"
  • For "most sparse matrices" (as hard as the dense case)

Page 39: Homework: Extend the lower bound

• To algorithms using (block) Givens rotations instead of (block) Householder transformations
• To QR done with Gram-Schmidt orthogonalization (CGS and MGS)
• To heterogeneous collections of processors
  • Suppose processor k
    • does φ(k) flops per second
    • has reciprocal bandwidth β(k)
    • has latency α(k)
  • What is a lower bound on the solution time, for any fractions of the work assigned to each processor (i.e., not all equal)?
• To a homogeneous parallel shared-memory machine
  • All data initially resides in a large shared memory
  • Processors communicate via data written to / read from slow memory
  • Each processor has a local fast memory of size M

Page 40: EXTRA SLIDES