
High Performance Dense Linear Algebra on Spatially Distributed Processors

Jeffrey Diamond and Behnam Robatmili

Stephen Keckler, Robert van de Geijn, Kazushige Goto*, Doug Burger

Department of Computer Science, University of Texas at Austin

*Texas Advanced Computing Center, University of Texas at Austin

2

Trends in Chip Level Parallelism

Emerging architectures are more fine grained: on-chip networks, precise control over communication, tight orchestration of computation across ALUs

Algorithmic insight comes from the most fine grained case

[Figure: granularity spectrum from coarse grained to fine grained: Quad Core (MIMD), Cell, Tilera, TRIPS (SDU)]

3

Parallel Programming Paradigms

Programming occurs at many levels; the trend is towards an optimized library model

Special low level APIs exist for high performance; we're interested in these low level APIs

[Figure: software layers from the High Level API (Haskell, F#, Sequoia, CUDA, Ct, UPC, etc.) through dynamic run times / compilation and classic multithreading down to high performance, low level libraries (the Low Level API)]

4

Case Study: Matrix Multiply

Implementing full scale DGEMM

High performance dense linear algebra libraries (Level 3 BLAS) are layered on top of high performance matrix multiply kernels: SYMM, SYRK, TRSM, TRMM, etc.

Core LAPACK: LU with partial pivoting, Cholesky, QR factorization, matrix inversion, reduction to tridiagonal/Hessenberg/bidiagonal form

Control theory: Sylvester equation, Lyapunov equation, and many, many others...

Matrix multiply is a regular operation, very amenable to algorithmic transformations and easy to reason about

5

Talk Outline

Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
  High Level Memory Management
  Low Level Blocking
  Inner Kernel
Optimizing Inner Kernel
Results
Conclusion

6

Spatially Distributed Uniprocessors (SDUs)

Single threaded scalability issues for architectures and implementation technology: wire delay, power, issue width, memory bandwidth...

Solution: the SDU, with partitioned register banks, functional units, ...

Still executing a single thread across multiple ALUs, so where an instruction executes matters

The program statically determines the location of instructions; examples include advanced VLIW processors in the embedded market

TRIPS partitions most aspects of a single core into tiles:
  Tiles connected by an on chip 2-D network
  Large number of distributed ALUs, registers, and data ports
  Enormous aggregate bandwidth to registers and data, but...
  Communication between ALUs must go through the network

7

TRIPS - a modern SDU

8

TRIPS - a modern SDU

[Figure: TRIPS chip floorplan showing Core 1, Core 2, and the shared L2]

9

TRIPS - a modern SDU

[Figure: TRIPS core floorplan showing the grid of ALUs, register banks, L1 banks, and L2 banks]

10

Talk Outline

Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
  High Level Memory Management
  Low Level Blocking
  Inner Kernel
Optimizing Inner Kernel
Results
Conclusion

11

Implementing Matrix Multiply

Outer level: the Goto streaming algorithm, the heart of the GotoBLAS linear algebra libraries, licensed by many of the top computer vendors and used by many supercomputers in the Top 500 list

Mid level: an enhanced Goto algorithm with a new hierarchical blocking layer to leverage the SDU topology

Inner kernel: a novel algorithm suited to SDUs

12

Goto Streaming Algorithm

Classical blocking algorithm (C += AB): break the matrices into square blocks just big enough for a block each of A, B and C to fit in the L1 cache

Goto: the L2 cache is actually fast enough to access directly from the inner kernel

Instead of small, square matrix blocks, use huge block-panel multiplies, with a traversal order that maximizes reuse

Stream full-sized panels of B and C directly out of DRAM

13

Goto: High Level Blocking

C A B

High Level Blocking

C’ A’ B’

Original Problem

A’C’ B’

L2 DRAM/L1DRAM/REG

Thousands

Thousands

Thousands ThousandsHundreds

Hundreds

Panel Slices

+=

+=

+=
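For concreteness, a plain C sketch of this Goto-style outer structure follows. The MC/KC/NR partition sizes, the row-major layout, and the omission of the packing routines are illustrative simplifications, not the actual GotoBLAS code.

```c
#include <stddef.h>

/* Goto-style outer structure (sketch): an MC x KC block of A is reused from
 * L2, KC x NR slices of B are reused from L1, and C streams through.
 * Row-major storage; packing of A and B is omitted for brevity. */
enum { MC = 512, KC = 128, NR = 16 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void gemm_goto_sketch(size_t m, size_t n, size_t k,
                      const double *A, const double *B, double *C)
{
    for (size_t kk = 0; kk < k; kk += KC) {          /* B panel: KC rows   */
        size_t kb = min_sz(KC, k - kk);
        for (size_t ii = 0; ii < m; ii += MC) {      /* A block: MC x KC   */
            size_t mb = min_sz(MC, m - ii);
            for (size_t jj = 0; jj < n; jj += NR) {  /* B slice: KC x NR   */
                size_t nb = min_sz(NR, n - jj);
                /* "block-panel" step: A block times one slice of B into C */
                for (size_t i = 0; i < mb; i++)
                    for (size_t p = 0; p < kb; p++) {
                        double a = A[(ii + i) * k + (kk + p)];
                        for (size_t j = 0; j < nb; j++)
                            C[(ii + i) * n + (jj + j)] +=
                                a * B[(kk + p) * n + (jj + j)];
                    }
            }
        }
    }
}
```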

14

Enhancing the Goto Algorithm

128 registers can hold non-trivially sized blocks

The 2-D mesh network has high bandwidth in orthogonal directions (like a systolic array)

Additionally store blocks of A in registers: bring in elements of A and B simultaneously to maximize bandwidth, making use of both horizontal and vertical network links

But to amortize the use of the elements of A held in registers, we need to add another level of low level blocking to the hierarchy

15

B’, C’ panel slices broken into mini-panels b’, c’ a’-block broken into mini-blocks, a’

a’ block and c mini panel held in registers 4x4 a’ amortized over 4x16 b’

Careful ordering of data movement preserves computational properties of larger block-panel multiply B slice stays in L1 for a LONG time, A stays even longer

A’C’ B’

(L2) (L1)(DRAM)

16 16444 4

+=Hundreds Hundreds

Low Level Blocking Scheme
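A scalar C sketch of that mini block-panel update (c' 4x16 += a' 4x4 times b' 4x16) follows. The real TRIPS kernel is hand-scheduled across the ALU grid, so this only shows the arithmetic and the intended reuse.

```c
/* One mini block-panel update: c' (4x16, registers) += a' (4x4, registers)
 * times b' (4x16, streamed from L1). Each a' element is reused across all
 * 16 columns of b'; plain C stand-in for the hand-scheduled TRIPS assembly. */
void mini_block_panel(const double a[4][4],   /* a' held in registers      */
                      const double b[4][16],  /* b' slice resident in L1   */
                      double c[4][16])        /* c' accumulators, registers */
{
    for (int i = 0; i < 4; i++)
        for (int k = 0; k < 4; k++) {
            double aik = a[i][k];             /* amortize one a' element */
            for (int j = 0; j < 16; j++)
                c[i][j] += aik * b[k][j];
        }
}
```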

16

How do we traverse?

B' slice fits in the L1 cache; A' block fits in the L2 cache; C' streams from DRAM

Load the c' and a' blocks into registers

[Figure: block-panel multiply C' += A' x B' with the 512 and 128 dimensions labeled; the current 4x4 a', 4x16 b', and 4x16 c' mini blocks are highlighted]
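As a rough size check (the cache capacities mentioned below are assumptions for illustration, not confirmed TRIPS prototype figures), the working sets implied by these block shapes are:

```c
/* Working-set footprints for the block shapes above (8-byte doubles).
 * Cache capacities are illustrative assumptions; the exact TRIPS
 * prototype capacities may differ. */
enum {
    A_BLOCK_BYTES = 512 * 128 * 8,  /* 512 KB: A' block, held in L2        */
    B_SLICE_BYTES = 128 *  16 * 8,  /*  16 KB: one B' slice, held in L1    */
    C_MINI_BYTES  =   4 *  16 * 8,  /* 512 B : c' mini panel, in registers */
    A_MINI_BYTES  =   4 *   4 * 8   /* 128 B : a' mini block, in registers */
};
```

Under these assumptions the A' block fits in a megabyte-class L2, a B' slice fits comfortably in a 32 KB-class L1, and the 4x16 c' plus 4x4 a' together occupy 80 of the 128 registers.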

17

A’C’

B’

128

512

128

1616

512

X

Stream b’(4x16) from L1 & multiply by a’(4x4)(Reuse a’ four times!)

+= B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

How do we traverse?

4 4

18

A’C’

B’

128

512

128

512

X

+= B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

How do we traverse?

16164 4

19

A’C’

B’

128

512

128

512

X

+=

How do we traverse?

B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

16164 4

20

A’C’

B’

128

512

128

512

X

+=

How do we traverse?

B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

16164 4

21

A’C’

B’

128

512

128

161651

2

X

Reuse register c’, next a’ right, next b’ below:

+=

How do we traverse?

B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

22

A’C’

B’

128

512

128

161651

2

X

Repeat until at bottom of B slice, right of A row

+=

How do we traverse?

B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

23

A’C’

B’

128

512

128

161651

2

X

Save c’s, load next row of a’ and c’, reuse entire B’ slice’

+=

How do we traverse?

B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

24

A’C’

B’

128

512

128

161651

2

X

Repeat process over slice of B’

+=

How do we traverse?

B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

25

A’C’

B’

128

512

128

161651

2

X

Continue over entire block of A’ and C’

+=

How do we traverse?

B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

26

C’

B’

A’C’

B’

128

512

128

1616

X

Fetch next slice of B’ and move into next slice of C’

+=

How do we traverse?

B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

27

A’C’

B’

128

512

128

1616

X

Complete B’, C’ Panels, load next A’ and repeat…

C’

B

C’

B

+=

How do we traverse?

B’ slice fits in L1 cache A’ block fits in L2 cache C’ streams from DRAM

28

Defined Inner Kernel

[Figure: the blocking hierarchy from the original problem C += A B (thousands of elements on a side), through high level blocking into C' += A' B' (hundreds on a side, with A' in L2, B' in DRAM/L1, C' in DRAM/registers) and panel slices, down to the mini block-panel multiply c' (4x16, registers) += a' (4x4, registers) x b' (4x16, L1)]

29

Talk Outline

Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
Optimizing Inner Kernel
Results
Conclusion

30

Optimizing the Inner Kernel

We developed several optimization principles, and were the first to apply them to TRIPS

Avoiding network contention is critical! A single overscheduled link can cut performance in half. Contention is avoided by datapath routing, direction oriented computation (DOC), register mirroring, and data interleaving; these gave a 5x jump in Instructions Per Cycle, exceeding 10 IPC

Load balance every resource in the system: in a loop, total performance is limited by the most used wire link or execution slot

The loop body is scaled to match register and data usage and to minimize architectural overheads

This results in the "fragility" of optimization typical of spatial architectures with shared resources

31

Simplified Schedule

Step 1: Read A from the register files
Step 2: Load B and broadcast it across the rows
Step 3: Do the multiply and then add across the columns
Step 4: Write the results back to C

[Figure: the TRIPS tile grid (global tile GT, register tiles R0-R3, data tiles D0-D3, and the 4x4 ALU array) annotated with these four steps]

32

What are the complications?

Every register use must be retrieved across the network

Every load and store needs to get an address

Need to interleave prefetching, writing, and updating pointers and counters

Need to account for data movement instructions

33

Talk Outline

Spatially Distributed Uniprocessors
Matrix Multiply Algorithm
Optimizing Inner Kernel
Results
Conclusion

34

Comparison of FPC across major processors

[Figure: bar chart comparing kernel FPC and DGEMM FPC (floating point operations per cycle, 0 to 7) across Opteron, Pentium 4, Core 2 Duo, POWER5, Itanium, and TRIPS]

Execution bottlenecks: integer/network ops vs FLOPs; single operand per cycle

Enhancement opportunities: SIMD instruction set, larger instruction window, more network bandwidth

* Comparison results from K. Goto and R. A. van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, 34(3), 2008.

35

Performance vs Matrix Size

[Figure: FPC (0 to 6) vs matrix size (0 to 4096) for DGEMM, the C kernel with Goto blocking, and the C kernel without Goto blocking]

36

Role of the Compiler

The kernel has 8x the performance of code from the TRIPS C compiler

We did exhaustive empirical studies to determine the individual performance contributions of the optimizations and their interaction with the TRIPS compiler

The TRIPS compiler does scheduling as a post process; we determined that the existing scheduler can handle orchestration well if the algorithm matches the topology: when the assembly for the inner loop was specified, the scheduler obtained 75% of total performance

Lesson: orchestration is not the difficult part, but the basic topology needs to be considered during compilation

Blocking compilers and register clustering are active topics of research; annotations / hints to the compiler?

37

Conclusions

Fine grained architectures can boost single thread performance

The optimization principles we learned can be applied to many levels of architectural granularity, but they are critical for fine grained architectures

In the future, high performance will depend on algorithms that incorporate both the memory hierarchy and the topology of the processing/ communication substrate

38

Thank You :)

Any Questions?

39

Thank You :)

Any Questions?

40

Back Up Slides

Just a list for now: comparison of GotoBLAS against ATLAS/LAPACK; more detailed diagrams of the algorithm; other performance graphs; systolic array diagrams of other canonical processors

41

Future work

Explore applicability of optimization principles beyond dense linear algebra, to irregular, control intensive algorithms

Quantify degree to which principles apply to coarser grained architectures (CMPs) and different memory topologies

42

Trends in Chip Level Parallelism

Multiple ways to exploit parallelism: instruction/thread/data level parallelism, coarse grained vs fine grained

What's the programming model? A high level paradigm of your choice... dynamic compilation and run time systems... low level APIs for writing optimized libraries

Likely need to rewrite applications

43

Trends in Computer Architecture

Emerging architectures are trending towards more fine grained control, e.g. Intel Terascale, RAW, Tilera: tightly orchestrated computation, on chip networks, precise control over communication

These represent steps down a path; algorithmic insight can be gained by looking at the most fine grained examples

44

Spatially Distributed Uniprocessors

Scalability issues for both architectures and underlying technology: wire delay, power, issue width...

More and more components of microprocessors are becoming distributed: partitioned register banks, functional units, ...

An SDU partitions all aspects of a single core into tiles:
  Tiles connected by an on chip 2-D network
  Large number of distributed registers and data ports
  Enormous aggregate bandwidth to registers and data, but...
  Communication between ALUs must go through the network

Key performance characteristic: Where an instruction executes matters!

45

TRIPS - a modern SDU

Grid of ALUs (16); large number of distributed registers; large number of data ports; on chip 2-D mesh network; S-NUCA distributed L1 and L2 caches

46

TRIPS - a modern SDU

Potential advantages for matrix multiply: a large number of ALUs and precise placement of instructions

Not a MIMD machine: the model of execution is block dataflow graphs, brought in and executed one at a time

Must also deal with data movement, registers, data bandwidth, and control

47

Classical Matrix Multiply

Need to compute C = AB + C; once upon a time this was just a triply nested loop...

Want to amortize the O(n^2) data movement over the 2n^3 computation of matrix multiply

Break the A, B and C matrices into square blocks just small enough for a block each of A, B and C to fit in the L1 cache

The inner kernel computes a block of C by caching elements of C in registers and using values of A and B from the L1 cache
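For reference, the triply nested loop this slide alludes to, with the flop-to-data ratio noted in the comment; this is the unblocked baseline rather than any tuned kernel.

```c
#include <stddef.h>

/* Naive C = A*B + C for n x n row-major matrices: 2n^3 flops against
 * roughly 4n^2 elements of data traffic (read A, B, C; write C), so the
 * arithmetic can only be amortized over the data movement if blocking
 * keeps operands resident in cache and registers. */
void dgemm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = C[i*n + j];
            for (size_t k = 0; k < n; k++)
                acc += A[i*n + k] * B[k*n + j];
            C[i*n + j] = acc;
        }
}
```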

48

Performance for thin panels

Performance vs Panel Thickness

[Figure: FPC (0 to 6) vs panel thickness k (0 to 4096), with m = n = 4096, for C(m x n) = A(m x k) x B(k x n)]

49

Goto’s Streaming Algorithm

Classical algorithm breaks matrices into blocks just big enough for A, B and C to fit in L1 cache

Goto realized the L2 cache is actually fast enough to access directly from the inner kernel! Use most of the L2 cache for a giant block of A

The inner kernel uses all levels of the memory hierarchy simultaneously: cache large slices of the B panel in the L1 cache, and cache a small piece of C in registers

Instead of square matrix blocks, use block-panel multiplies, with a traversal order that maximizes reuse; stream full-sized contiguous panels of B and C directly out of DRAM

Use extremely optimized hand tuned assembly

50

Methodology

So we compiled code using the TRIPS compiler. And we ran it on a hardware prototype. We kept making changes and seeing how fast it ran. We made notes of the changes. We made graphs from the notes. We made slides based on the graphs. We made conclusions based on the slides. It's 130nm and 366 MHz, but that's OK.

51

Controlling The Cache

[Figure: C += A x B block-panel multiply, with the 512 and 128 block dimensions and 16-wide slices labeled]

B slice fits in the L1 cache; A block fits in the L2 cache; C chunks come from L2

How do we keep B in L1 cache while streaming all of A through?

52

A Buffer Size

Effect of the Dimensions of the A Buffer (same area)

[Figure: FPC (0 to 6) vs matrix size (m = n = k, 0 to 4096) for A buffer shapes 512x128, 256x256, and 128x512]

53

Block Panel Multiply

[Figure: C += A x B block-panel multiply]

Doing multiple GEMDOTs in parallel.

54

Block Panel Multiply

[Figure: C += A x B block-panel multiply, animation step]

Doing multiple GEMDOTs in parallel.

55

Block Panel Multiply

[Figure: C += A x B block-panel multiply, animation step]

Doing multiple GEMDOTs in parallel.

56

Block Panel Multiply

[Figure: C += A x B block-panel multiply, animation step]

Doing multiple GEMDOTs in parallel.

57

Block Panel Multiply

[Figure: C += A x B block-panel multiply, animation step]

Doing multiple GEMDOTs in parallel.
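A minimal sketch of one such update, reading "GEMDOT" as the dot-product update of a single element of C (the talk may intend a small block rather than a single element); the block-panel multiply performs many of these concurrently.

```c
#include <stddef.h>

/* One "GEMDOT" interpreted as the dot-product update of a single C element:
 * c[i][j] += (row i of the A block) dot (column j of the B panel).
 * Sketch only; the exact GEMDOT granularity in the talk is not specified. */
double gemdot(size_t k, const double *a_row, const double *b_col, size_t b_stride)
{
    double acc = 0.0;
    for (size_t p = 0; p < k; p++)
        acc += a_row[p] * b_col[p * b_stride];  /* stride walks down B's column */
    return acc;
}
```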

58
