Communication Avoiding Algorithms for Dense Linear Algebra
Jim Demmel, CS294, Lecture #4, Fall 2011


Page 1:

Communication Avoiding Algorithms for Dense Linear Algebra

Jim Demmel

CS294, Lecture #4, Fall 2011 – Communication-Avoiding Algorithms
www.cs.berkeley.edu/~odedsc/CS294

Based on: Bootcamp 2011, CS267, Summer-school 2010, many papers

Page 2:

2

Outline for today

• Recall communication lower bounds
• What a "matching upper bound", i.e. a CA algorithm, might have to do
• Summary of what is known so far
• Case studies of CA algorithms
  • Case study I: Matrix multiplication
  • Case study II: LU, QR
  • Case study III: Cholesky decomposition

Page 3:

3

Communication lower bounds

• Applies to algorithms satisfying the technical conditions discussed last time
  • "smells like 3-nested loops"
• For each processor:
  • M = size of fast memory of that processor
  • G = #operations (multiplies) performed by that processor

  #words_moved to/from fast memory = Ω( max( G / M^(1/2) , #inputs + #outputs ) )

  #messages to/from fast memory ≥ #words_moved / max_message_size = Ω( G / M^(3/2) )
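As a concrete illustration, here is a small Python sketch (the function name and the choice of classical O(n^3) matmul as the example are ours, not from the slides) that evaluates both bounds for a single processor:

    import math

    def matmul_lower_bounds(n, M, max_message_size=None):
        """Evaluate the per-processor communication lower bounds for classical O(n^3) matmul."""
        G = n**3                                   # multiplies performed by this processor
        inputs_outputs = 3 * n**2                  # read A and B, write C
        words = max(G / math.sqrt(M), inputs_outputs)    # Omega( G / M^(1/2) )
        if max_message_size is None:
            max_message_size = M                   # assume a message is at most the fast memory size
        messages = words / max_message_size        # >= #words_moved / max_message_size = Omega( G / M^(3/2) )
        return words, messages

    # Example: n = 4096 with a fast memory of 2^20 words
    print(matmul_lower_bounds(4096, 2**20))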

Page 4:

4

Attaining these lower bounds

• Depends on what "processor" and "memory" refer to
• Sequential vs. parallel distributed memory vs. parallel shared memory
• Simple vs. hierarchical vs. messier…
  • DRAM + cache vs. multiple levels of cache
  • Uniprocessors connected over a network vs. multicore processors connected over a network vs. multicore/multiboard/multirack/multisite
  • Uniform communication in the network vs. slower if farther away
  • Parallel machine with local memory hierarchies
  • Register file vs. shared memory on GPUs
• Homogeneous vs. heterogeneous
  • All flop rates / bandwidths / latencies / memory sizes the same vs. different
• Use minimum fast memory vs. all available fast memory
• Big design space (even just for matmul)

Page 5:

Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #messages use a (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some are shown, plus ours
• Cache-oblivious algorithms are underlined, green entries are ours, ? means unknown / future work

Algorithm | 2 Levels of Memory (#words moved / #messages) | Multiple Levels of Memory (#words moved / #messages)
BLAS-3 | usual blocked or recursive algorithms | usual blocked algorithms (nested), or recursive [Gustavson,97]
Cholesky | LAPACK (with b = M^(1/2)), [Gustavson,97], [BDHS09] / [Gustavson,97], [Ahmed,Pingali,00], [BDHS09] | (same) / (same)
LU with pivoting | LAPACK (sometimes), [Toledo,97], [GDX08] / [GDX08] (not partial pivoting) | [Toledo,97], [GDX08]? / [GDX08]?
QR, rank-revealing | LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08], [Frens,Wise,03] (but 3x flops?) / [DGHL08] | [Elmroth,Gustavson,98], [DGHL08]? / [Frens,Wise,03], [DGHL08]?
Eig, SVD | Not LAPACK; [BDD11] (randomized, but more flops), [BDK11] / [BDD11, BDK11?] | [BDD11, BDK11?]

Page 6:

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | | |
Cholesky | | |
LU | | |
QR | | |
Sym Eig, SVD | | |
Nonsym Eig | | |
(entries filled in on the next slides)

Page 7:

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 |
Cholesky | ScaLAPACK | log P |
LU | ScaLAPACK | log P |
QR | ScaLAPACK | log P |
Sym Eig, SVD | ScaLAPACK | log P |
Nonsym Eig | ScaLAPACK | P^(1/2) log P |

Page 8:

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | ScaLAPACK | log P | n log P / P^(1/2)
QR | ScaLAPACK | log P | n log P / P^(1/2)
Sym Eig, SVD | ScaLAPACK | log P | n / P^(1/2)
Nonsym Eig | ScaLAPACK | P^(1/2) log P | n log P

Page 9:

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better)
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P

Page 10:

Can we do even better?

• Assume n x n matrices on P processors
• Why just one copy of data: M = O(n^2 / P) per processor?
• Increase M to reduce the lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix Multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | log P | log P
LU | [GDX10] | log P | log P
QR | [DGHL08] | log P | log^3 P
Sym Eig, SVD | [BDD11] | log P | log^3 P
Nonsym Eig | [BDD11] | log P | log^3 P

Page 11:

12

Other algorithms of interest

• Variations on pivoting to find the "best" rows and/or columns
  • Best = most independent
• PAP^T = LDL^T with symmetric pivoting
  • A symmetric, P a permutation, L lower triangular, D (block) diagonal
  • Best row/column pairs first
• PAP^T = LTL^T with symmetric pivoting
  • A symmetric, P a permutation, L lower triangular, T tridiagonal
  • Best row/column pairs first
• AP = QR with column pivoting
  • Best columns first
• P_r A P_c^T = LU with complete pivoting
  • Best rows and columns first

Page 12:

Case Study I: Matrix Multiply

Page 13:

14

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

The algorithm has 2*n^3 = O(n^3) flops and operates on 3*n^2 words of memory, so q (flops per memory reference) is potentially as large as 2*n^3 / (3*n^2) = O(n).
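The same triple loop as a minimal runnable Python function (plain lists, no libraries; a direct transcription of the pseudocode above, not tuned code):

    def naive_matmul(A, B, C):
        """C = C + A*B for n x n matrices stored as lists of lists."""
        n = len(A)
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i][j] += A[i][k] * B[k][j]
        return C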

Page 14:

15

Naïve Matrix Multiply

{implements C = C + A*B}
for i = 1 to n
  {read row i of A into fast memory}
  for j = 1 to n
    {read C(i,j) into fast memory}
    {read column j of B into fast memory}
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    {write C(i,j) back to slow memory}

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

Page 15:

16

Naïve Matrix Multiply

Number of slow memory references for unblocked matrix multiply:

  m = n^3    to read each column of B n times
    + n^2    to read each row of A once
    + 2n^2   to read and write each element of C once
    = n^3 + 3n^2

So q = f / m = 2n^3 / (n^3 + 3n^2) ≈ 2 for large n: no improvement over matrix-vector multiply.

The inner two loops are just a matrix-vector multiply of row i of A times B. Similar for any other order of the 3 loops.

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

Page 16:

17

Matrix-multiply, optimized several ways

Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops

Page 17:

18

Recall: finding implementations for an algorithm that reduce communication

• Allowed:
  • Reordering arithmetic operations, preserving the dependencies.
  • Reordering summations, using associativity.
  • Reordering multiplications, using commutativity (irrelevant for matrix multiply and other bilinear algorithms).
• Not allowed (i.e., considered as different algorithms):
  • Other reorderings, even if by doing so we preserve the arithmetic correctness (e.g., using distributivity, as in Strassen's algorithm).

Page 18:

19

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size:

for i = 1 to n/b
  for j = 1 to n/b
    {read block C(i,j) into fast memory}
    for k = 1 to n/b
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}

[Figure: block C(i,j) = C(i,j) + A(i,k) * B(k,j)]

Page 19:

20

Blocked (Tiled) Matrix Multiply

Recall: m is #words_moved between slow and fast memory; the matrix has n x n elements and (n/b) x (n/b) blocks, each of size b x b. So:

  m = n^3/b   to read each block of B (n/b)^3 times ((n/b)^3 * b^2 = n^3/b)
    + n^3/b   to read each block of A (n/b)^3 times
    + 2n^2    to read and write each block of C once
    = 2n^3/b + 2n^2

So we can reduce communication by increasing the block size b.
Limit: 3b^2 ≤ M = fast memory size, so m ≥ 2 * 3^(1/2) * n^3 / M^(1/2).
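A minimal Python sketch of the tiled loop nest (the helper name and the rule b = floor(sqrt(M/3)), which enforces 3*b^2 <= M, are our own illustrative choices):

    import math

    def blocked_matmul(A, B, C, M):
        """C = C + A*B using b x b tiles sized so three tiles fit in a fast memory of M words."""
        n = len(A)
        b = max(1, math.isqrt(M // 3))             # 3*b^2 <= M
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    # C(i,j) block += A(i,k) block * B(k,j) block
                    for ii in range(i, min(i + b, n)):
                        for kk in range(k, min(k + b, n)):
                            a = A[ii][kk]
                            for jj in range(j, min(j + b, n)):
                                C[ii][jj] += a * B[kk][jj]
        return C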

Page 20:

21

What if there are more than 2 levels of memory?

• Recall: the goal is to minimize communication between all levels
• The tiled algorithm requires finding a good block size
  • Machine dependent
  • Need to "block" the b x b matrix multiply in the innermost loop
    • 1 level of memory: 3 nested loops (naïve algorithm)
    • 2 levels of memory: 6 nested loops
    • 3 levels of memory: 9 nested loops …
• Cache-oblivious algorithms offer an alternative
  • Treat the n x n matrix multiply as a set of smaller problems
  • Eventually, these will fit in cache
  • Will minimize #words moved between every level of the memory hierarchy (between L1 and L2 cache, L2 and L3, L3 and main memory, etc.), at least asymptotically

Page 21:

Recursive Matrix Multiplication (RMM) (1/2)

• For simplicity: square matrices with n = 2^m

• C = [ C11 C12 ; C21 C22 ] = A · B = [ A11 A12 ; A21 A22 ] · [ B11 B12 ; B21 B22 ]
    = [ A11·B11 + A12·B21   A11·B12 + A12·B22 ;
        A21·B11 + A22·B21   A21·B12 + A22·B22 ]

• True whether each Aij etc. is 1x1 or n/2 x n/2

func C = RMM (A, B, n)

if n = 1, C = A * B, else

{ C11 = RMM (A11 , B11 , n/2) + RMM (A12 , B21 , n/2)

C12 = RMM (A11 , B12 , n/2) + RMM (A12 , B22 , n/2)

C21 = RMM (A21 , B11 , n/2) + RMM (A22 , B21 , n/2)

C22 = RMM (A21 , B12 , n/2) + RMM (A22 , B22 , n/2) }

return

Page 22:

Recursive Matrix Multiplication (2/2)

23

func C = RMM (A, B, n)

if n=1, C = A * B, else

{ C11 = RMM (A11 , B11 , n/2) + RMM (A12 , B21 , n/2)

C12 = RMM (A11 , B12 , n/2) + RMM (A12 , B22 , n/2)

C21 = RMM (A21 , B11 , n/2) + RMM (A22 , B21 , n/2)

C22 = RMM (A21 , B12 , n/2) + RMM (A22 , B22 , n/2) }

return

A(n) = # arithmetic operations in RMM(·, ·, n)
     = 8 · A(n/2) + 4·(n/2)^2 if n > 1, else 1
     = 2n^3 … same operations as usual, in a different order

M(n) = # words moved between fast and slow memory by RMM(·, ·, n)
     = 8 · M(n/2) + 4·(n/2)^2 if 3n^2 > M_fast, else 3n^2
     = O( n^3 / (M_fast)^(1/2) + n^2 ) … same as blocked matmul
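A small runnable Python version of RMM for n a power of 2 (a sketch of the recursion above; the literal 1x1 cutoff is kept for clarity, so this is for illustration, not speed):

    def rmm(A, B):
        """Recursive matrix multiply C = A*B for n x n matrices (lists of lists), n a power of 2."""
        n = len(A)
        if n == 1:
            return [[A[0][0] * B[0][0]]]
        h = n // 2
        def quad(M, r, c):                         # extract an h x h quadrant starting at (r, c)
            return [row[c:c + h] for row in M[r:r + h]]
        def add(X, Y):
            return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
        A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
        B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
        C11 = add(rmm(A11, B11), rmm(A12, B21))
        C12 = add(rmm(A11, B12), rmm(A12, B22))
        C21 = add(rmm(A21, B11), rmm(A22, B21))
        C22 = add(rmm(A21, B12), rmm(A22, B22))
        return [r1 + r2 for r1, r2 in zip(C11, C12)] + [r1 + r2 for r1, r2 in zip(C21, C22)]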

Page 23:

Recursion: Cache Oblivious Algorithms

• Recursion for general A (m x n) * B (n x p):
  • Case 1: m ≥ max{n,p}: split A horizontally:  [A1 ; A2] · B = [A1·B ; A2·B]
  • Case 2: n ≥ max{m,p}: split A vertically and B horizontally:  [A1 A2] · [B1 ; B2] = A1·B1 + A2·B2
  • Case 3: p ≥ max{m,n}: split B vertically:  A · [B1 B2] = [A·B1  A·B2]
• Attains the lower bound in the O() sense
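A minimal Python sketch of the three-way recursion (the base case and helper names are our own; a practical version would stop at a larger micro-kernel, as the next slide discusses):

    def co_matmul(A, B):
        """Cache-oblivious C = A*B for an m x n matrix A and an n x p matrix B (lists of lists)."""
        m, n, p = len(A), len(A[0]), len(B[0])
        if m == 1 and n == 1 and p == 1:
            return [[A[0][0] * B[0][0]]]
        if m >= max(n, p):                         # Case 1: split A horizontally
            return co_matmul(A[:m // 2], B) + co_matmul(A[m // 2:], B)
        if n >= max(m, p):                         # Case 2: split A vertically and B horizontally
            A1 = [row[:n // 2] for row in A]
            A2 = [row[n // 2:] for row in A]
            C1 = co_matmul(A1, B[:n // 2])
            C2 = co_matmul(A2, B[n // 2:])
            return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(C1, C2)]
        B1 = [row[:p // 2] for row in B]           # Case 3: split B vertically
        B2 = [row[p // 2:] for row in B]
        return [r1 + r2 for r1, r2 in zip(co_matmul(A, B1), co_matmul(A, B2))]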

Page 24:

Experience with Cache-Oblivious Algorithms

• In practice, need to cut off the recursion well before 1x1 blocks
  • Call a "micro-kernel" for small blocks, e.g. 16 x 16
• Implementing a high-performance cache-oblivious code is not easy
  • Using a fully recursive approach with a highly optimized recursive micro-kernel, Pingali et al. report that they never got more than 2/3 of peak
• Issues with the cache-oblivious (recursive) approach
  • Recursive micro-kernels yield less performance than iterative ones using the same scheduling techniques
  • Pre-fetching is needed to compete with the best code; it is not well understood in the context of cache-oblivious codes

Unpublished work, presented at LACSI 2006

Page 25:

How hard is hand-tuning matmul, anyway?

26

• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09

• Students given “blocked” code to start with

• Still hard to get close to vendor tuned performance (ACML)

• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

Page 26:

How hard is hand-tuning matmul, anyway?

27

Page 27:

Parallel matrix-matrix multiplication

• Consider distributed memory machines
  • Each processor has its own private memory
  • Communication is by sending messages over a network
  • Examples: MPI, UPC, Titanium
• First question: how is the matrix initially distributed across the processors?

28

Page 28:

Different Parallel Data Layouts for Matrices (not all!)

[Figure: processor indices 0-3 mapped to matrix columns/blocks in each layout]

1) 1D Column Blocked Layout
2) 1D Column Cyclic Layout
3) 1D Column Block Cyclic Layout (block size b)
4) Row versions of the previous layouts
5) 2D Row and Column Blocked Layout
6) 2D Row and Column Block Cyclic Layout

(The block cyclic layouts generalize the others.)

29

Page 29:

30

Matrix Multiply with 1D Column Block Layout

• Assume matrices are n x n and n is divisible by p
  (may be a reasonable assumption for analysis, but not for code)
• A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
• B(i,j) is the n/p by n/p sublock of B(i)
  • in rows j*n/p through (j+1)*n/p
• The algorithm uses the formula
    C(i) = C(i) + A*B(i) = C(i) + Σ_j A(j)*B(j,i)

[Figure: block columns owned by processors p0 … p7]

Page 30:

31

Matmul for 1D Column Block Layout on a Processor Ring
• Pairs of adjacent processors can communicate simultaneously

Copy A(myproc) into Tmp

C(myproc) = C(myproc) + Tmp*B(myproc , myproc)

for j = 1 to p-1

Send Tmp to processor myproc+1 mod p

Receive Tmp from processor myproc-1 mod p

C(myproc) = C(myproc) + Tmp*B( myproc-j mod p , myproc)

• Need to be careful about talking to neighboring processors
• May want double buffering in practice for overlap
• Ignoring deadlock details in the code
• Time of inner loop = 2*(α + β*n^2/p) + 2*n*(n/p)^2*γ
  • α = latency, β = time per word, γ = time per flop
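A plain numpy sketch that simulates the ring algorithm sequentially (each processor's send/receive is modeled by rotating a list of Tmp blocks; the function name and use of numpy are our assumptions, and deadlock/double-buffering issues disappear in a sequential simulation):

    import numpy as np

    def ring_matmul_1d(A, B, p):
        """Simulate the 1D block-column ring algorithm; returns C = A @ B.
        A, B are n x n numpy arrays with n divisible by p."""
        n = A.shape[0]
        w = n // p
        Acol = [A[:, i*w:(i+1)*w].copy() for i in range(p)]   # A(i): block column owned by processor i
        Bcol = [B[:, i*w:(i+1)*w] for i in range(p)]
        C = [np.zeros((n, w)) for _ in range(p)]
        Tmp = list(Acol)                                       # Copy A(myproc) into Tmp
        for j in range(p):
            for me in range(p):                                # all processors act "in parallel"
                src = (me - j) % p                             # Tmp on processor me currently holds A(src)
                C[me] += Tmp[me] @ Bcol[me][src*w:(src+1)*w, :]   # += A(src) * B(src, me)
            Tmp = [Tmp[(me - 1) % p] for me in range(p)]       # send Tmp to myproc+1 around the ring
        return np.hstack(C)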

Page 31:

32

Matmul for 1D layout on a Processor Ring
• Time of inner loop = 2*(α + β*n^2/p) + 2*n*(n/p)^2*γ
• Total Time = 2*n*(n/p)^2*γ + (p-1) * Time of inner loop
             ≈ 2*(n^3/p)*γ + 2*p*α + 2*β*n^2
• (Nearly) optimal for a 1D layout on a Ring or Bus, even with Broadcast:
  • Perfect speedup for arithmetic
  • A(myproc) must move to each other processor; costs at least (p-1) * cost of sending n*(n/p) words
• Parallel Efficiency = 2*n^3*γ / (p * Total Time)
    = 1 / (1 + (α/γ) * (p^2/n^3) + (β/γ) * (p/n))
    = 1 / (1 + O(p/n))
  • Grows to 1 as n/p increases (or as α/γ and β/γ shrink)

Page 32:

33

MatMul with 2D Layout

• Consider processors in a 2D grid (physical or logical)
• Processors communicate with 4 nearest neighbors
• Assume p processors form a square s x s grid, s = p^(1/2)

[Figure: C = A * B with each matrix distributed over a 3 x 3 grid of processors p(0,0) … p(2,2)]

Page 33:

34

Cannon’s Algorithm

… C(i,j) = C(i,j) + Σ_k A(i,k)*B(k,j)
… assume s = sqrt(p) is an integer

forall i = 0 to s-1                        … "skew" A
  left-circular-shift row i of A by i      … so that A(i,j) is overwritten by A(i, (j+i) mod s)
forall i = 0 to s-1                        … "skew" B
  up-circular-shift column i of B by i     … so that B(i,j) is overwritten by B((i+j) mod s, j)
for k = 0 to s-1                           … sequential
  forall i = 0 to s-1 and j = 0 to s-1     … all processors in parallel
    C(i,j) = C(i,j) + A(i,j)*B(i,j)
  left-circular-shift each row of A by 1
  up-circular-shift each column of B by 1
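A sequential numpy sketch that simulates Cannon's algorithm on an s x s grid of blocks (useful for checking the skew and shift indexing; in a real distributed code each block lives on its own processor, and the names here are illustrative):

    import numpy as np

    def cannon_matmul(A, B, s):
        """Simulate Cannon's algorithm: C = A @ B with A, B split into s x s blocks (n divisible by s)."""
        n = A.shape[0]
        b = n // s
        blk = lambda M, i, j: M[i*b:(i+1)*b, j*b:(j+1)*b].copy()
        Ab = [[blk(A, i, j) for j in range(s)] for i in range(s)]
        Bb = [[blk(B, i, j) for j in range(s)] for i in range(s)]
        Cb = [[np.zeros((b, b)) for _ in range(s)] for _ in range(s)]
        # skew: A(i,j) <- A(i, (j+i) mod s), B(i,j) <- B((i+j) mod s, j)
        Ab = [[Ab[i][(j + i) % s] for j in range(s)] for i in range(s)]
        Bb = [[Bb[(i + j) % s][j] for j in range(s)] for i in range(s)]
        for _ in range(s):
            for i in range(s):
                for j in range(s):
                    Cb[i][j] += Ab[i][j] @ Bb[i][j]
            Ab = [[Ab[i][(j + 1) % s] for j in range(s)] for i in range(s)]   # left-shift each row of A by 1
            Bb = [[Bb[(i + 1) % s][j] for j in range(s)] for i in range(s)]   # up-shift each column of B by 1
        return np.block(Cb)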

Page 34:

35

C(1,2) = A(1,0) * B(0,2) + A(1,1) * B(1,2) + A(1,2) * B(2,2)

Cannon’s Matrix Multiplication

Page 35:

36

Initial Step to Skew Matrices in Cannon

• Initial blocked input
• After skewing, before the initial block multiplies

[Figure: 3 x 3 block layouts of A(i,j) and B(i,j) before and after the initial skew]

Page 36:

37

Skewing Steps in Cannon
All blocks of A must multiply all like-colored blocks of B
• First step
• Second
• Third

[Figure: positions of the A(i,j) and B(i,j) blocks after each circular shift]

Page 37:

38

Cost of Cannon's Algorithm

forall i = 0 to s-1                            … recall s = sqrt(p)
  left-circular-shift row i of A by i          … cost ≤ s*(α + β*n^2/p)
forall i = 0 to s-1
  up-circular-shift column i of B by i         … cost ≤ s*(α + β*n^2/p)
for k = 0 to s-1
  forall i = 0 to s-1 and j = 0 to s-1
    C(i,j) = C(i,j) + A(i,j)*B(i,j)            … cost = 2*(n/s)^3*γ = 2*(n^3/p^(3/2))*γ
  left-circular-shift each row of A by 1       … cost = α + β*n^2/p
  up-circular-shift each column of B by 1      … cost = α + β*n^2/p

° Total Time = 2*(n^3/p)*γ + 4*s*α + 4*β*n^2/s   … attains the lower bound!
° Parallel Efficiency = 2*n^3*γ / (p * Total Time)
    = 1 / (1 + (α/γ)*2*(s/n)^3 + (β/γ)*2*(s/n))
    = 1 / (1 + O(sqrt(p)/n))
° Grows to 1 as n/s = n/sqrt(p) = sqrt(data per processor) grows
° Better than the 1D layout, which had Efficiency = 1/(1 + O(p/n))

Page 38:

39

Pros and Cons of Cannon
• So what if it's "optimal", is it fast?
  • Yes: the local computation is one call to (optimized) matrix multiply
• Hard to generalize for:
  1. p not a perfect square
  2. A and B not square
  3. dimensions of A, B not perfectly divisible by s = sqrt(p)
  4. A and B not "aligned" in the way they are stored on processors
  5. block-cyclic layouts
• Can you show an optimal Cannon-like implementation (up to a constant) that can deal with items 1 and 3 above?
• Can you show an optimal Cannon-like implementation (up to a constant) that can deal with item 2 above? (hint: what do you get if you use rectangular blocks with the same aspect ratio as the input/output matrices?)
• Memory hog (extra copies of local matrices)

Page 39:

40

SUMMA Algorithm

• SUMMA = Scalable Universal Matrix Multiply
• Slightly less efficient than Cannon, but simpler and easier to generalize
  • Can accommodate any layout, dimensions, alignment
  • Uses broadcast of submatrices instead of circular shifts
  • Sends log p times as much data as Cannon
  • Can use much less extra memory than Cannon, but sends more messages
• Similar ideas appeared many times
• Used in practice in PBLAS = Parallel BLAS
  • www.netlib.org/lapack/lawns/lawn{96,100}.ps

Page 40:

41

SUMMA

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j) on a 4 x 4 processor grid]

• i, j represent all rows, columns owned by a processor
• k is a block of b ≥ 1 rows or columns
• C(i,j) = C(i,j) + Σ_k A(i,k)*B(k,j)
• Assume a pr by pc processor grid (pr = pc = 4 above)
  • Need not be square

Page 41:

42

SUMMA

For k = 0 to n-1                       … or n/b-1, where b is the block size
                                       …    = # cols in A(i,k) and # rows in B(k,j)
  for all i = 1 to pr                  … in parallel
    owner of A(i,k) broadcasts it to whole processor row
  for all j = 1 to pc                  … in parallel
    owner of B(k,j) broadcasts it to whole processor column
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow

[Figure: C(i,j) = C(i,j) + A(i,k) * B(k,j)]
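A sequential numpy sketch that mimics SUMMA's structure on a pr x pc grid (the broadcasts are modeled by simply reading the owner's panel, and the names are illustrative; n is assumed divisible by pr and pc):

    import numpy as np

    def summa_matmul(A, B, pr, pc, b):
        """Simulate SUMMA: C = A @ B for an n x n matrix blocked over a pr x pc processor grid."""
        n = A.shape[0]
        mr, mc = n // pr, n // pc                  # local tile sizes
        C = [[np.zeros((mr, mc)) for _ in range(pc)] for _ in range(pr)]
        for k in range(0, n, b):                   # one step per block of b columns of A / rows of B
            kk = slice(k, min(k + b, n))
            for i in range(pr):
                Acol = A[i*mr:(i+1)*mr, kk]        # "broadcast" A(i,k) along processor row i
                for j in range(pc):
                    Brow = B[kk, j*mc:(j+1)*mc]    # "broadcast" B(k,j) along processor column j
                    C[i][j] += Acol @ Brow         # local rank-b update on processor (i,j)
        return np.block(C)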

Page 42:

43

SUMMA performance

For k = 0 to n/b-1
  for all i = 1 to s                   … s = sqrt(p)
    owner of A(i,k) broadcasts it to whole processor row
                                       … time = log s * (α + β * b*n/s), using a tree
  for all j = 1 to s
    owner of B(k,j) broadcasts it to whole processor column
                                       … time = log s * (α + β * b*n/s), using a tree
  Receive A(i,k) into Acol
  Receive B(k,j) into Brow
  C_myproc = C_myproc + Acol * Brow    … time = 2*(n/s)^2*b*γ

° Total time = 2*(n^3/p)*γ + α * log p * n/b + β * log p * n^2/s
° To simplify the analysis only, assume s = sqrt(p)

Page 43:

44

SUMMA performance

• Total time = 2*(n^3/p)*γ + α * log p * n/b + β * log p * n^2/s
• Parallel Efficiency = 1 / (1 + (α/γ) * log p * p / (2*b*n^2) + (β/γ) * log p * s / (2*n))
• Same β term as Cannon, except for the log p factor
  • log p grows slowly, so this is ok
• Latency (α) term can be larger, depending on b
  • When b = 1, it is (α/γ) * log p * n
  • As b grows to n/s, the term shrinks to (α/γ) * log p * s (log p times Cannon's)
• Temporary storage grows like 2*b*n/s
• Can change b to trade off latency cost against memory

Page 44:

45

Matrix multiplication – summary

• Cannon and SUMMA both (nearly) attain the communication lower bound
  • Assuming 1 copy of the data is allowed (plus a constant amount of buffer space)
  • Both depend on assumptions about the initial and final data layouts
  • Called "2D algorithms" to reflect the layout
• When more memory is available, we get a 2.5D algorithm
  • Future talk (Edgar Solomonik)
• When processors are heterogeneous, we get another algorithm
  • Future talk (Grey Ballard and Andrew Gearhart)
• Since matmul naturally decomposes into smaller analogous problems, hierarchical machines can be accommodated

Page 45:

Why so much about matrix multiplication?

46

Page 46:

A brief history of (Dense) Linear Algebra software (1/5)

• In the beginning was the do-loop…
• Libraries like EISPACK (for eigenvalue problems)
• Then the BLAS (1) were invented (1973-1977)
  • Standard library of 15 operations (mostly) on vectors
  • "AXPY" (y = α·x + y), dot product, scale (x = α·x), etc.
  • Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
  • Goals
    • Common "pattern" to ease programming, readability, self-documentation
    • Robustness, via careful coding (avoiding over/underflow)
    • Portability + efficiency via machine-specific implementations
  • Why BLAS 1? They do O(n^1) ops on O(n^1) data
  • Used in libraries like LINPACK (for linear systems)
    • Source of the name "LINPACK Benchmark" (not the code!)

Page 47:

A brief history of (Dense) Linear Algebra software (2/5)

• But the BLAS-1 weren't enough
  • Consider AXPY (y = α·x + y): 2n flops on 3n reads/writes
  • "Computational intensity" = #flops / #mem_refs = (2n)/(3n) = 2/3
  • Too low to run near peak speed (time for memory references dominates)
  • Hard to vectorize ("SIMD'ize") on supercomputers of the day (1980s)
• So the BLAS-2 were invented (1984-1986)
  • Standard library of 25 operations (mostly) on matrix/vector pairs
    • "GEMV": y = α·A·x + β·y, "GER": A = A + α·x·y^T, "TRSV": y = T^(-1)·x
    • Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
  • Why BLAS 2? They do O(n^2) ops on O(n^2) data
    • So computational intensity is still just ~(2n^2)/(n^2) = 2
    • OK for vector machines, but not for machines with caches

Page 48:

A brief history of (Dense) Linear Algebra software (3/5)

• The next step: BLAS-3 (1987-1988)
  • Standard library of 9 operations (mostly) on matrix/matrix pairs
    • "GEMM": C = α·A·B + β·C, "SYRK": C = α·A·A^T + β·C, "TRSM": C = T^(-1)·B
    • Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
  • Why BLAS 3? They do O(n^3) ops on O(n^2) data
    • So computational intensity is (2n^3)/(4n^2) = n/2 – big at last!
    • Tuning opportunities on machines with caches and other memory hierarchy levels
• How much BLAS1/2/3 code so far (all at www.netlib.org/blas)?
  • Source: 142 routines, 31K LOC; Testing: 28K LOC
  • Reference (unoptimized) implementation only
    • Ex: 3 nested loops for GEMM
• Lots more optimized code
  • Most computer vendors provide their own optimized versions
  • Motivates "automatic tuning" of the BLAS

Page 49:

A brief history of (Dense) Linear Algebra software (4/5)
• LAPACK – "Linear Algebra PACKage" – uses BLAS-3 (1989 – now)
  • Ex: the obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows – BLAS-1
  • How do we reorganize GE to use BLAS-3? (details later)
• Contents of LAPACK (summary)
  • Algorithms we can turn into (nearly) 100% BLAS 3
    – Linear systems: solve Ax = b for x
    – Least squares: choose x to minimize ||r||_2 = (Σ_i r_i^2)^(1/2) where r = Ax - b
  • Algorithms we can only make up to ~50% BLAS 3 (so far)
    – "Eigenproblems": find λ and x where Ax = λx
    – Singular Value Decomposition (SVD): A^T·A·x = σ^2·x
  • Error bounds for everything
  • Lots of variants depending on A's structure (banded, A = A^T, etc.)
• How much code? (Release 3.2, Nov 2008) (www.netlib.org/lapack)
  • Source: 1582 routines, 490K LOC; Testing: 352K LOC
• Ongoing development (at UCB, UTK, and elsewhere)

Page 50:

A brief history of (Dense) Linear Algebra software (5/5)

• Is LAPACK parallel?
  • Only if the BLAS are parallel (possible in shared memory)
• ScaLAPACK – "Scalable LAPACK" (1995 – now)
  • For distributed memory – uses MPI
  • More complex data structures and algorithms than LAPACK
  • Only a (small) subset of LAPACK's functionality is available
  • All at www.netlib.org/scalapack

Page 51:

53

Success Stories for Sca/LAPACK

[Figure: Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000). ScaLAPACK]

• Widely used
  • Adopted by Mathworks, Cray, Fujitsu, HP, IBM, IMSL, Intel, NAG, NEC, SGI, …
  • >143M web hits at Netlib in 2011 (56M in 2006), including CLAPACK and LAPACK95
• New science discovered through the solution of dense matrix systems
  • The Nature article on the flat universe used ScaLAPACK
  • Other articles in Physical Review B also use it
  • 1998 Gordon Bell Prize
  • www.nersc.gov/news/reports/newNERSCresults050703.pdf

Page 52:

There is a lot left to do

• Do dense algorithms as implemented in LAPACK and ScaLAPACK attain the communication lower bounds?
  • Mostly not
• If not, are there other algorithms that do?
  • Yes
• Do LAPACK or ScaLAPACK run (well, or at all) on recent architectures?
  • Not on multicore (PLASMA)
  • Not on GPUs or heterogeneous clusters (MAGMA)
  • Not on cloud, grid, …

54

Page 53:


Gaussian Elimination (GE) for solving Ax=b

• Add multiples of each row to later rows to make A upper triangular
• Solve the resulting triangular system Ux = c by substitution

… for each column i

… zero it out below the diagonal by adding multiples of row i to later rows

for i = 1 to n-1

… for each row j below row i

for j = i+1 to n

… add a multiple of row i to row j

tmp = A(j,i);

for k = i to n

A(j,k) = A(j,k) - (tmp/A(i,i)) * A(i,k)

[Figure: sparsity pattern of A after i = 1, after i = 2, after i = 3, …, after i = n-1: zeros fill in below the diagonal one column at a time]

Page 54:

56

Refine GE Algorithm (1/5)

• Initial Version

• Remove computation of constant tmp/A(i,i) from inner loop.

… for each column i

… zero it out below the diagonal by adding multiples of row i to later rows

for i = 1 to n-1

… for each row j below row i

for j = i+1 to n

… add a multiple of row i to row j

tmp = A(j,i);

for k = i to n

A(j,k) = A(j,k) - (tmp/A(i,i)) * A(i,k)

for i = 1 to n-1

for j = i+1 to n

m = A(j,i)/A(i,i)

for k = i to n

A(j,k) = A(j,k) - m * A(i,k)


Page 55:

57

Refine GE Algorithm (2/5)

• Last version

• Don’t compute what we already know: zeros below diagonal in column i

for i = 1 to n-1

for j = i+1 to n

m = A(j,i)/A(i,i)

for k = i+1 to n

A(j,k) = A(j,k) - m * A(i,k)

for i = 1 to n-1

for j = i+1 to n

m = A(j,i)/A(i,i)

for k = i to n

A(j,k) = A(j,k) - m * A(i,k)


Page 56:

58

Refine GE Algorithm (3/5)

• Last version

• Store multipliers m below diagonal in zeroed entries for later use

for i = 1 to n-1

for j = i+1 to n

m = A(j,i)/A(i,i)

for k = i+1 to n

A(j,k) = A(j,k) - m * A(i,k)

for i = 1 to n-1

for j = i+1 to n

A(j,i) = A(j,i)/A(i,i)

for k = i+1 to n

A(j,k) = A(j,k) - A(j,i) * A(i,k)


Page 57:

59

Refine GE Algorithm (4/5)

• Last version

for i = 1 to n-1

for j = i+1 to n

A(j,i) = A(j,i)/A(i,i)

for k = i+1 to n

A(j,k) = A(j,k) - A(j,i) * A(i,k)

• Split Loop

for i = 1 to n-1

for j = i+1 to n

A(j,i) = A(j,i)/A(i,i)

for j = i+1 to n

for k = i+1 to n

A(j,k) = A(j,k) - A(j,i) * A(i,k)

[Figure: all of column i's multipliers are stored below the diagonal before the rest of the matrix is updated]

Page 58:

60

Refine GE Algorithm (5/5)

• Last version

• Express using matrix operations (BLAS)

for i = 1 to n-1

A(i+1:n,i) = A(i+1:n,i) * ( 1 / A(i,i) )

… BLAS 1 (scale a vector)

A(i+1:n,i+1:n) = A(i+1:n , i+1:n )

- A(i+1:n , i) * A(i , i+1:n)

… BLAS 2 (rank-1 update)

for i = 1 to n-1

for j = i+1 to n

A(j,i) = A(j,i)/A(i,i)

for j = i+1 to n

for k = i+1 to n

A(j,k) = A(j,k) - A(j,i) * A(i,k)

Page 59:

61

What GE really computes

• Call the strictly lower triangular matrix of multipliers M, and let L = I + M
• Call the upper triangle of the final matrix U
• Lemma (LU Factorization): if the above algorithm terminates (does not divide by zero) then A = L*U
• Solving A*x = b using GE
  • Factorize A = L*U using GE         (cost = (2/3)*n^3 flops)
  • Solve L*y = b for y, using substitution   (cost = n^2 flops)
  • Solve U*x = y for x, using substitution   (cost = n^2 flops)
• Thus A*x = (L*U)*x = L*(U*x) = L*y = b as desired

for i = 1 to n-1

A(i+1:n,i) = A(i+1:n,i) / A(i,i) … BLAS 1 (scale a vector)

A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n) … BLAS 2 (rank-1 update)
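The same two BLAS calls written as a small numpy sketch (no pivoting; it assumes no zero pivot is encountered, and the function name is ours):

    import numpy as np

    def lu_no_pivot(A):
        """Factor A = L*U with no pivoting; returns L (unit lower triangular) and U."""
        A = A.astype(float)
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                                # BLAS 1: scale the column of multipliers
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])    # BLAS 2: rank-1 update of the trailing matrix
        return np.tril(A, -1) + np.eye(n), np.triu(A)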

Page 60:

62

Gaussian Elimination with Partial Pivoting (GEPP)
• Partial pivoting: swap rows so that A(i,i) is the largest entry in its column

for i = 1 to n-1
  find and record k where |A(k,i)| = max_{i ≤ j ≤ n} |A(j,i)|
    … i.e. the largest entry in the rest of column i
  if |A(k,i)| = 0
    exit with a warning that A is singular, or nearly so
  elseif k ≠ i
    swap rows i and k of A
  end if
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)    … each |quotient| ≤ 1
  A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n)

• Lemma: This algorithm computes A = P*L*U, where P is a permutation matrix.

• This algorithm is numerically stable in practice

• For details see LAPACK code at

http://www.netlib.org/lapack/single/sgetf2.f

• Standard approach – but communication costs?
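A numpy sketch of the same loop with the row swaps made explicit (it mirrors the pseudocode above rather than the LAPACK routine; the names are illustrative):

    import numpy as np

    def lu_partial_pivot(A):
        """Compute P*A = L*U with partial pivoting; piv records how the rows were permuted."""
        A = A.astype(float)
        n = A.shape[0]
        piv = np.arange(n)
        for i in range(n - 1):
            k = i + np.argmax(np.abs(A[i:, i]))    # largest entry in the rest of column i
            if A[k, i] == 0:
                raise ValueError("A is singular, or nearly so")
            if k != i:
                A[[i, k], :] = A[[k, i], :]        # swap rows i and k
                piv[[i, k]] = piv[[k, i]]
            A[i+1:, i] /= A[i, i]                  # multipliers, each |quotient| <= 1
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return piv, np.tril(A, -1) + np.eye(n), np.triu(A)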

Page 61:

63

Converting BLAS2 to BLAS3 in GEPP
• Blocking
  • Used to optimize matrix multiplication
  • Harder here because of the data dependencies in GEPP
• BIG IDEA: delayed updates
  • Save updates to the "trailing matrix" from several consecutive BLAS2 (rank-1) updates
  • Apply many updates simultaneously in one BLAS3 (matmul) operation
• The same idea works for much of dense linear algebra
  • One-sided factorizations are ~100% BLAS3
  • Two-sided factorizations are ~50% BLAS3
• First approach: need to choose a block size b
  • The algorithm will save and apply b updates
  • b should be small enough that the active submatrix consisting of b columns of A fits in cache
  • b should be large enough to make BLAS3 (matmul) fast

Page 62:

64

Blocked GEPP (www.netlib.org/lapack/single/sgetrf.f)

for ib = 1 to n-1 step b               … process the matrix b columns at a time
  end = ib + b-1                       … point to the end of the block of b columns
  apply the BLAS2 version of GEPP to get A(ib:n , ib:end) = P' * L' * U'
  … let LL denote the strictly lower triangular part of A(ib:end , ib:end) + I
  A(ib:end , end+1:n) = LL^(-1) * A(ib:end , end+1:n)      … update the next b rows of U
  A(end+1:n , end+1:n) = A(end+1:n , end+1:n)
                         - A(end+1:n , ib:end) * A(ib:end , end+1:n)
  … apply the delayed updates with a single matrix multiply with inner dimension b

(For a correctness proof, see the on-line notes from CS267 / 1996.)
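A numpy sketch of the blocked structure with pivoting dropped for brevity, so it shows the delayed (BLAS3) update but not sgetrf's numerics; applying LL^(-1) is done with np.linalg.solve for simplicity:

    import numpy as np

    def blocked_lu_no_pivot(A, b):
        """Blocked right-looking LU without pivoting, illustrating delayed (BLAS3) updates."""
        A = A.astype(float)
        n = A.shape[0]
        for ib in range(0, n, b):
            end = min(ib + b, n)
            # BLAS2 factorization of the panel A(ib:n, ib:end)
            for i in range(ib, end):
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
            if end < n:
                # LL = unit lower triangular part of the diagonal block
                LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
                # update the next b rows of U: A(ib:end, end+1:n) = LL^(-1) * A(ib:end, end+1:n)
                A[ib:end, end:] = np.linalg.solve(LL, A[ib:end, end:])
                # delayed updates: one matrix multiply with inner dimension b
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return np.tril(A, -1) + np.eye(n), np.triu(A)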

Page 63:

Does GE Minimize Communication? (1/4)

• Model of communication costs with fast memory M
• BLAS2 version of GEPP costs
  • O(n·b) if the panel fits in M:  n·b ≤ M
  • O(n·b^2) (#flops) if the panel does not fit in M:  n·b > M
• Update of A(end+1:n , end+1:n) by matmul costs
  • O( max( n·b·n / M^(1/2) , n^2 ) )
• Triangular solve with LL is bounded by the above term
• Total # slow memory refs for GE = (n/b) · (sum of the above terms)

Page 64:

Does GE Minimize Communication? (2/4)
• Model of communication costs with fast memory M (as above)
• Case 1: M < n (one column too large for fast memory)
  • Total # slow memory refs for GE = (n/b) * O( max( n·b^2 , b·n^2 / M^(1/2) , n^2 ) )
      = O( max( n^2·b , n^3 / M^(1/2) , n^3 / b ) )
  • Minimize by choosing b = M^(1/2)
  • Get the desired lower bound O(n^3 / M^(1/2))

66

Page 65:

Does GE Minimize Communication? (3/4)
• Model of communication costs with fast memory M (as above)
• Case 2: M^(2/3) < n ≤ M
  • Total # slow memory refs for GE = (n/b) * O( max( n·b^2 , b·n^2 / M^(1/2) , n^2 ) )
      = O( max( n^2·b , n^3 / M^(1/2) , n^3 / b ) )
  • Minimize by choosing b = n^(1/2) (the panel does not fit in M)
  • Get O(n^2.5) slow memory refs
  • Exceeds the lower bound O(n^3 / M^(1/2)) by a factor (M/n)^(1/2) ≤ M^(1/6)

67

Page 66:

Does GE Minimize Communication? (4/4)
• Model of communication costs with fast memory M (as above)
• Case 3: M^(1/2) < n ≤ M^(2/3)
  • Total # slow memory refs for GE = (n/b) * O( max( n·b , b·n^2 / M^(1/2) , n^2 ) )
      = O( max( n^2 , n^3 / M^(1/2) , n^3 / b ) )
  • Minimize by choosing b = M/n (the panel fits in M)
  • Get O(n^4/M) slow memory refs
  • Exceeds the lower bound O(n^3 / M^(1/2)) by a factor n/M^(1/2) ≤ M^(1/6)
• Case 4: n ≤ M^(1/2) – the whole matrix fits in fast memory

68

Page 67:

Alternative cache-oblivious GE formulation (1/2)

• Toledo (1997)
• Described without pivoting for simplicity
• "Do the left half of the matrix, then the right half"

69

function [L,U] = RLU (A) … assume A is m by n

if (n=1) L = A/A(1,1), U = A(1,1)

else

[L1,U1] = RLU( A(1:m , 1:n/2)) … do left half of A

… let L11 denote top n/2 rows of L1

A( 1:n/2 , n/2+1 : n ) = L11^(-1) * A( 1:n/2 , n/2+1 : n )

… update top n/2 rows of right half of A

A( n/2+1: m, n/2+1:n ) = A( n/2+1: m, n/2+1:n )

- A( n/2+1: m, 1:n/2 ) * A( 1:n/2 , n/2+1 : n )

… update rest of right half of A

[L2,U2] = RLU( A(n/2+1:m , n/2+1:n) ) … do right half of A

return [ L1,[0;L2] ] and [U1, [ A(.,.) ; U2 ] ]

A = L * U

Page 68:

Alternative cache-oblivious GE formulation (2/2)

70

function [L,U] = RLU (A) … assume A is m by n

if (n=1) L = A/A(1,1), U = A(1,1)

else

[L1,U1] = RLU( A(1:m , 1:n/2)) … do left half of A

… let L11 denote top n/2 rows of L1

A( 1:n/2 , n/2+1 : n ) = L11^(-1) * A( 1:n/2 , n/2+1 : n )

… update top n/2 rows of right half of A

A( n/2+1: m, n/2+1:n ) = A( n/2+1: m, n/2+1:n )

- A( n/2+1: m, 1:n/2 ) * A( 1:n/2 , n/2+1 : n )

… update rest of right half of A

[L2,U2] = RLU( A(n/2+1:m , n/2+1:n) ) … do right half of A

return [ L1,[0;L2] ] and [U1, [ A(.,.) ; U2 ] ]

• Mem(m,n) = Mem(m,n/2) + O( max( m·n , m·n^2/M^(1/2) ) ) + Mem(m-n/2, n/2)
           ≤ 2 · Mem(m,n/2) + O( max( m·n , m·n^2/M^(1/2) ) )
           = O( m·n^2/M^(1/2) + m·n·log M )
           = O( m·n^2/M^(1/2) )  if M^(1/2)·log M = O(n)
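A compact numpy sketch of this recursion (no pivoting, working in place on an m x n panel of a float array; the helper name is ours). For a square A, L and U can be read off afterwards as np.tril(A, -1) + np.eye(n) and np.triu(A), matching the unblocked versions earlier:

    import numpy as np

    def rlu(A):
        """Toledo-style recursive LU without pivoting on an m x n float array, in place:
        on return A holds the multipliers below the diagonal and U on and above it."""
        m, n = A.shape
        if n == 1:
            A[1:, 0] /= A[0, 0]                    # single column: store the multipliers
            return
        h = n // 2
        rlu(A[:, :h])                              # do the left half of A
        L11 = np.tril(A[:h, :h], -1) + np.eye(h)   # unit lower triangular top of L1
        A[:h, h:] = np.linalg.solve(L11, A[:h, h:])    # update the top h rows of the right half
        A[h:, h:] -= A[h:, :h] @ A[:h, h:]             # update the rest of the right half
        rlu(A[h:, h:])                             # do the right half of A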

Page 69:

One-sided Factorizations (LU, QR), so far

• Classical approach
    for i = 1 to n
      update column i
      update trailing matrix
  • #words_moved = O(n^3)

• Blocked approach (LAPACK)
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  • #words_moved = O(n^3 / M^(1/3))

• Recursive approach
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  • #words_moved = O(n^3 / M^(1/2))

• None of these approaches minimizes #messages or addresses parallelism
• Need another idea