High Performance Dense Linear Algebra on Spatially Distributed Processors

Jeffrey Diamond and Behnam Robatmili
Stephen Keckler, Robert van de Geijn, Kazushige Goto*, Doug Burger

Department of Computer Science, University of Texas at Austin
*Texas Advanced Computing Center, University of Texas at Austin


Page 1: High Performance Dense Linear Algebra on Spatially Distributed Processors Jeffrey Diamond and Behnam Robatmili Stephen Keckler, Robert van de Geijn, Kazushige

High Performance Dense Linear Algebra on Spatially Distributed Processors

Jeffrey Diamond and Behnam Robatmili

Stephen Keckler, Robert van de Geijn, Kazushige Goto*, Doug Burger

Department of Computer Science, University of Texas at Austin

*Texas Advanced Computing Center, University of Texas at Austin

Page 2:

2

Trends in Chip Level Parallelism

- Emerging architectures are more fine grained: on chip networks, precise control over communication, tight orchestration of computation across ALUs
- Algorithmic insight from the most fine grained case

[Spectrum from Coarse Grained to Fine Grained: Quad Core (MIMD) → Cell → Tilera → TRIPS (SDU)]

Page 3:

3

Parallel Programming Paradigms

- Programming occurs at many levels; trends towards an optimized library model
- Special low level APIs for high performance; we're interested in these low level APIs

[Spectrum from High Level API to Low Level API: Haskell, F#, Sequoia, CUDA, Ct, UPC, etc. → dynamic run times / compilation → classic multithreading → high performance, low level libraries]

Page 4:

4

Case Study: Matrix Multiply

- Implementing full scale DGEMM
- High Performance Dense Linear Algebra Libraries (Level 3 BLAS) are layered on top of high performance Matrix Multiply kernels: SYMM, SYRK, TRSM, TRMM, etc.
- Core LAPACK: LU with partial pivoting, Cholesky, QR factorization, matrix inversion, reduction to tridiagonal/Hessenberg/bidiagonal form
- Control theory: the Sylvester equation, the Lyapunov equation, and many, many others...
- The regular operation is very amenable to algorithmic transformations and easy to reason about
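As a concrete illustration of this layering, a Level 3 BLAS operation such as SYRK (C += A * A^T) is itself a matrix-multiply-shaped loop nest, which is why tuning the multiply kernel pays off across the whole library. The sketch below is a minimal, unoptimized C rendering with a hypothetical name and row-major layout; it is not the real BLAS interface.

```c
#include <assert.h>

/* Minimal sketch: symmetric rank-k update, C (n x n, lower triangle
   only) += A (n x k) * A^T. The inner loop is the same multiply-accumulate
   as DGEMM's, which is why such routines layer on the same kernel.
   Hypothetical name and row-major layout; not the real BLAS interface. */
void syrk_lower(int n, int k, const double *A, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j <= i; j++)          /* lower triangle of C */
            for (int p = 0; p < k; p++)
                C[i * n + j] += A[i * k + p] * A[j * k + p];
}
```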

Page 5:

5

Talk Outline

- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
  - High Level Memory Management
  - Low Level Blocking
  - Inner Kernel
- Optimizing Inner Kernel
- Results
- Conclusion

Page 6:

6

Spatially Distributed Uniprocessors (SDUs)

- Single threaded scalability issues for architectures and implementation technology: wire delay, power, issue width, memory bandwidth...
- Solution: SDU, with partitioned register banks, functional units, ...
- Still executing a single thread across multiple ALUs; where an instruction executes matters
- The program statically determines the location of instructions
- Examples include advanced VLIW processors in the embedded market
- TRIPS partitions most aspects of a single core into tiles: tiles connected by an on chip 2-D network; a large number of distributed ALUs, registers, and data ports; enormous aggregate bandwidth to registers and data, but... communication between ALUs must go through the network

Page 7:

7

TRIPS - a modern SDU

Page 8:

8

TRIPS - a modern SDU

Core 1

Core 2

Shared L2

Page 9:

9

TRIPS - a modern SDU

[Figure labels: Register Banks, L1 banks, L2 banks, Grid of ALUs]

Page 10:

10

Talk Outline

- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
  - High Level Memory Management
  - Low Level Blocking
  - Inner Kernel
- Optimizing Inner Kernel
- Results
- Conclusion

Page 11:

11

Implementing Matrix Multiply

- Outer level: the Goto streaming algorithm, at the heart of the GotoBLAS Linear Algebra Libraries; licensed by many of the top computer vendors; used by many supercomputers in the Top 500 list
- Mid level: an enhanced Goto algorithm with a new hierarchical blocking layer to leverage the SDU topology
- Inner kernel: a novel algorithm suited to SDUs

Page 12:

12

Goto Streaming Algorithm

- Classical blocking algorithm (C += AB): break matrices into square blocks just big enough for a, b and c to fit in L1 cache
- Goto: L2 cache is actually fast enough to access directly from the inner kernel
- Instead of small, square matrix blocks, use huge block-panel multiplies
- Traversal order to maximize reuse
- Stream full-sized panels of B and C directly out of DRAM
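The block-panel structure can be sketched in C as below. This is a minimal, testable sketch under stated assumptions: the block sizes MC/KC/NR and the name gemm_blocked are illustrative placeholders (the real GotoBLAS derives its sizes from the cache hierarchy and writes the kernel in hand tuned assembly), and the caches are only implied by the loop order, with the A block reused across all streamed B/C slices.

```c
/* Hedged sketch of a block-panel multiply: C (m x n) += A (m x k) * B (k x n),
   all row-major. MC/KC bound the A' block (the piece that would live in L2);
   NR is the width of the B'/C' panel slices streamed past it. The tiny sizes
   are placeholders so the sketch is easy to test, not real tuning values. */
enum { MC = 4, KC = 4, NR = 2 };

void gemm_blocked(int m, int n, int k,
                  const double *A, const double *B, double *C)
{
    for (int pc = 0; pc < k; pc += KC) {            /* k-panels of A and B */
        int kb = (k - pc < KC) ? k - pc : KC;
        for (int ic = 0; ic < m; ic += MC) {        /* A' block: heavily reused */
            int mb = (m - ic < MC) ? m - ic : MC;
            for (int jc = 0; jc < n; jc += NR) {    /* stream slices of B', C' */
                int nb = (n - jc < NR) ? n - jc : NR;
                /* block-panel kernel: C' += A' * B' slice */
                for (int i = 0; i < mb; i++)
                    for (int j = 0; j < nb; j++) {
                        double acc = 0.0;
                        for (int p = 0; p < kb; p++)
                            acc += A[(ic + i) * k + (pc + p)]
                                 * B[(pc + p) * n + (jc + j)];
                        C[(ic + i) * n + (jc + j)] += acc;
                    }
            }
        }
    }
}
```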

Page 13:

13

Goto: High Level Blocking

[Figure: the original problem C += A B (dimensions in the thousands) is blocked into panel slices C' += A' B' (hundreds), with A' in L2, B' in DRAM/L1, and C' in DRAM/registers]

Page 14:

14

Enhancing Goto Algorithm

- 128 registers hold non-trivial sized blocks
- The 2-D mesh network has high bandwidth in orthogonal directions (like a systolic array)
- Additionally store blocks of A in registers: bring in elements of A and B simultaneously and maximize bandwidth; use both horizontal and vertical network links
- But to amortize the use of elements of A in registers, we need to add another level of low level blocking to the hierarchy

Page 15:

15

Low Level Blocking Scheme

- B', C' panel slices are broken into mini-panels b', c'
- The A' block is broken into mini-blocks a'
- An a' block and a c' mini-panel are held in registers; a 4x4 a' is amortized over a 4x16 b'
- Careful ordering of data movement preserves the computational properties of the larger block-panel multiply
- The B slice stays in L1 for a LONG time; A stays even longer

[Figure: C' (DRAM) += A' (L2) x B' (L1), panel dimensions in the hundreds, tiled into 4x4 a' and 4x16 b'/c' pieces]
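The mini block-panel step can be written out as a plain C loop nest. This is an illustrative sketch only: the function name is hypothetical, and on TRIPS the same arithmetic is scheduled spatially across the ALU grid rather than expressed as loops. It does make the amortization visible: each element of a' is reused 16 times per call, which is what pays for keeping a' in registers.

```c
/* Mini block-panel multiply: c'(4x16) += a'(4x4) * b'(4x16).
   Sketch of the register-level blocking described above; on TRIPS this
   is laid out across the ALU grid, not written as loops. */
void mini_kernel(double a[4][4], double b[4][16], double c[4][16])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 16; j++) {
            double acc = c[i][j];          /* c' stays resident in registers */
            for (int p = 0; p < 4; p++)
                acc += a[i][p] * b[p][j];  /* each a'[i][p] is reused 16 times */
            c[i][j] = acc;
        }
}
```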

Page 16:

16

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

Load the c' and a' blocks into registers.

[Diagram: C' += A' (128x512) x B' slice, with the current 4x4 a' and 4x16 c' tiles highlighted]

Page 17:

17

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

Stream b' (4x16) from L1 and multiply by a' (4x4). (Reuse a' four times!)

[Diagram: as before, with the current b' mini-panel highlighted]

Page 18:

18

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

[Diagram only: animation frame, the traversal advances to the next tile]

Page 19:

19

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

[Diagram only: animation frame, the traversal advances to the next tile]

Page 20:

20

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

[Diagram only: animation frame, the traversal advances to the next tile]

Page 21:

21

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

Reuse register c'; next a' to the right, next b' below.

[Diagram: traversal advances to the adjacent a' and b' tiles]

Page 22:

22

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

Repeat until at the bottom of the B slice, at the right of the A row.

[Diagram: traversal reaches the end of the current row/column of tiles]

Page 23:

23

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

Save the c's, load the next row of a' and c', and reuse the entire B' slice.

[Diagram: traversal moves down to the next row of a'/c' tiles]

Page 24:

24

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

Repeat the process over the slice of B'.

[Diagram: traversal repeats across the B' slice]

Page 25:

25

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

Continue over the entire block of A' and C'.

[Diagram: traversal covers the whole A'/C' block]

Page 26:

26

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

Fetch the next slice of B' and move into the next slice of C'.

[Diagram: the next B' and C' slices enter the traversal]

Page 27:

27

How do we traverse?

• B' slice fits in L1 cache • A' block fits in L2 cache • C' streams from DRAM

Complete the B', C' panels, load the next A', and repeat...

[Diagram: traversal moves on to the next A' block and fresh B/C panels]

Page 28:

28

Defined Inner Kernel

[Figure: high level blocking recap — the original problem C += A B (thousands) is blocked into panel slices C' += A' B' (hundreds; A' in L2, B' in DRAM/L1, C' in DRAM/registers), which are in turn tiled into the mini block-panel c' (4x16, registers) += a' (4x4, registers) x b' (4x16, L1)]

Page 29:

29

Talk Outline

- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
- Optimizing Inner Kernel
- Results
- Conclusion

Page 30:

30

Optimizing the Inner Kernel

- Developed several optimization principles; first to apply these principles to TRIPS
- Avoiding network contention is critical! A single overscheduled link can cut performance in half. Avoided by datapath routing, direction oriented computation (DOC), register mirroring, and data interleaving; got a 5x jump in Instructions Per Cycle, exceeding 10 IPC
- Load balance every resource in the system: in a loop, total performance is limited by the most used wire link or execution slot; the loop body is scaled to match register and data usage and to minimize architectural overheads
- Results in "fragility" of optimization, typical of spatial architectures with shared resources

Page 31:

31

Simplified Schedule

Step 1: Read A from the register files
Step 2: Load B and broadcast it across rows
Step 3: Do the multiply, then add across columns
Step 4: Write the results back to C

[Diagram: 4x4 ALU grid fed by global tile GT, register tiles R0-R3, and data tiles D0-D3, with operands numbered 1-4 flowing through the grid]
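The four steps above can be emulated sequentially in C. The grid mapping below, one element of a per ALU node, b broadcast to every node in its column, and partial products then reduced and written back, is an illustrative assumption; the slide does not specify the exact operand routing, and the real schedule places these operations spatially across the tiles.

```c
/* Sequential emulation of the four-step schedule on a 4x4 ALU grid,
   computing c += a * b for a 4x4 a and a length-4 slice of b.
   The operand-to-node mapping here is an illustrative assumption. */
void grid_step(double a[4][4], double b[4], double c[4])
{
    double node[4][4];  /* one partial product in flight per ALU node */

    /* Step 1: a[r][col] is preloaded at grid node (r, col) from the
       register files.
       Step 2: b[col] is broadcast so every node in that column sees it.
       Step 3 (first half): each node forms its product locally. */
    for (int r = 0; r < 4; r++)
        for (int col = 0; col < 4; col++)
            node[r][col] = a[r][col] * b[col];

    /* Step 3 (second half): partial products are summed across the grid.
       Step 4: the result is written back to C. */
    for (int r = 0; r < 4; r++) {
        double sum = 0.0;
        for (int col = 0; col < 4; col++)
            sum += node[r][col];
        c[r] += sum;
    }
}
```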

Page 32:

32

What are the complications?

- Every register use must be retrieved across the network
- Every load and store needs to get an address
- Need to interleave prefetching, writing, updating pointers, counters
- Need to account for data movement instructions

Page 33:

33

Talk Outline

- Spatially Distributed Uniprocessors
- Matrix Multiply Algorithm
- Optimizing Inner Kernel
- Results
- Conclusion

Page 34:

34

Comparison of FPC across major processors

[Bar chart: Kernel FPC and DGEMM FPC, on a 0-7 scale, for Opteron, P4, Core 2 Duo, POWER5, Itanium, and TRIPS]

Execution bottlenecks: integer/network ops vs FLOPS; single operand per cycle

Enhancement opportunities: SIMD instruction set; larger instruction window; more network bandwidth

* Results from K. Goto and R. A. van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, 2008.

Page 35:

35

Performance vs Matrix Size

[Line graph: FPC (0-6) vs matrix size (0-4096 in steps of 512) for DGEMM, C Kernel + Goto, and C Kernel without Goto]

Page 36:

36

Role of the Compiler

- The kernel has 8x the performance of the TRIPS C compiler's output
- Did exhaustive empirical studies to determine the individual performance contributions of the optimizations and their interaction with the TRIPS compiler
- The TRIPS compiler does scheduling as a post process; the existing scheduler can handle orchestration well if the algorithm matches the topology: given assembly for the inner loop, the scheduler obtained 75% of total performance
- Lesson: orchestration is not the difficult part; need to consider basic topology during compilation
- Blocking compilers and register clustering are active topics of research; annotations / hints to the compiler?

Page 37:

37

Conclusions

- Fine grained architectures can boost single thread performance
- The optimization principles we learned can be applied to many levels of architectural granularity, but are critical for fine grained architectures
- In the future, high performance will depend on algorithms that incorporate both the memory hierarchy and the topology of the processing/communication substrate

Page 38:

38

Thank You :)

Any Questions?


Page 40:

40

Back Up Slides

Just a list for now:
- Comparison of GotoBLAS against Atlas/LAPACK
- More detailed diagrams of the algorithm
- Other performance graphs
- Systolic Array
- Diagrams of other canonical processors

Page 41:

41

Future work

Explore applicability of optimization principles beyond dense linear algebra, to irregular, control intensive algorithms

Quantify degree to which principles apply to coarser grained architectures (CMPs) and different memory topologies

Page 42:

42

Trends in Chip Level Parallelism

- Multiple ways to exploit parallelism: instruction / thread / data level parallelism; coarse grained vs fine grained
- What's the programming model? High level paradigm of your choice; dynamic compilation and run time systems; low level APIs for writing optimized libraries
- Likely need to rewrite applications

Page 43:

43

Trends in Computer Architecture

- Emerging architectures are trending towards more fine grained control, e.g. Intel Terascale, RAW, Tilera
- Tightly orchestrated computation; on chip networks; precise control over communication
- These represent a step down a path
- Algorithmic insight can be gained by looking at the most fine grained examples

Page 44:

44

Spatially Distributed Uniprocessors

- Scalability issues for both architectures and the underlying technology: wire delay, power, issue width...
- More and more components of microprocessors are becoming distributed: partitioned register banks, functional units, ...
- An SDU partitions all aspects of a single core into tiles: tiles connected by an on chip 2-D network; a large number of distributed registers and data ports; enormous aggregate bandwidth to registers and data, but... communication between ALUs must go through the network
- Key performance characteristic: where an instruction executes matters!

Page 45:

45

TRIPS - a modern SDU

- Grid of ALUs (16)
- Large number of distributed registers
- Large number of data ports
- On chip 2-D mesh network
- S-NUCA distributed L1 and L2 cache

Page 46:

46

TRIPS - a modern SDU

- Potential advantages for Matrix Multiply: large number of ALUs; precise placement of instructions
- Not a MIMD machine: the model of execution is block dataflow graphs, brought in one at a time and executed
- Must also deal with data movement, registers, data bandwidth, control

Page 47:

47

Classical Matrix Multiply

- Need to compute C = AB + C
- Once, one just used a triply nested loop...
- Want to amortize the O(n^2) data movement over the 2n^3 computation of matrix multiply
- Break the A, B and C matrices into square blocks just small enough to fit A, B and C in L1 cache
- The inner kernel computes a block of C by caching elements of C in registers and using values of A and B from L1 cache
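The triply nested loop mentioned above looks like this in C; the name and row-major square-matrix interface are illustrative. It performs the same 2n^3 flops as the blocked variants, just with no reuse-friendly traversal.

```c
/* Reference triply nested loop: C = A*B + C for n x n row-major matrices.
   This is the starting point that the blocked algorithms restructure. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < n; p++)
                C[i * n + j] += A[i * n + p] * B[p * n + j];
}
```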

Page 48:

48

Performance for thin panels

[Line graph "Performance vs Panel Thickness": FPC (0-6) vs k (0-4096 in steps of 512) with m = n = 4096, for C(m x n) = A(m x k) x B(k x n)]

Page 49:

49

Goto’s Streaming Algorithm

- The classical algorithm breaks matrices into blocks just big enough for A, B and C to fit in L1 cache
- Goto realized L2 cache is actually fast enough to access directly from the inner kernel! Use most of the L2 cache for a giant block of A
- The inner kernel uses all levels of the memory hierarchy simultaneously: cache large slices of the B panel in L1, cache a small piece of C in registers
- Instead of square matrix blocks, use block-panel multiplies, with a traversal order that maximizes reuse; stream full-sized contiguous panels of B and C directly out of DRAM
- Use extremely optimized hand tuned assembly

Page 50:

50

Methodology

- So we compiled code using the TRIPS compiler
- And we ran it on a hardware prototype
- We kept making changes and seeing how fast it ran
- We made notes of the changes
- We made graphs from the notes
- We made slides based on the graphs
- We made conclusions based on the slides
- It's 130nm and 366 MHz, but that's OK

Page 51:

51

Controlling The Cache

[Diagram: C += A x B with the 128/512 blocking and 4x4 / 4x16 tiles]

• B slice fits in L1 cache • A block fits in L2 cache • C chunks from L2

How do we keep B in L1 cache while streaming all of A through?

Page 52:

52

A Buffer Size

[Line graph "Effect of Dimensions of A Buffer (same area)": FPC (0-6) vs m = n = k (0-4096 in steps of 512), for A buffer shapes 512x128, 256x256, and 128x512]

Page 53:

53

Block Panel Multiply

[Diagram: C += A x B block-panel multiply]

Doing multiple GEMDOTS in parallel.




