
Page 1:

Minimizing Communication in Numerical Linear Algebra

www.cs.berkeley.edu/~demmel

Sparse-Matrix-Vector-Multiplication (SpMV)

Jim Demmel, EECS & Math Departments, UC Berkeley

[email protected]

Page 2:

Outline

• Motivation for Automatic Performance Tuning

• Results for sparse matrix kernels
  – Sparse Matrix Vector Multiplication (SpMV)
  – Sequential and Multicore results

• OSKI = Optimized Sparse Kernel Interface

• Need to understand tuning of a single SpMV to understand the opportunities and limits for tuning entire sparse solvers

Page 3:

Berkeley Benchmarking and OPtimization (BeBOP)

• Prof. Katherine Yelick
• Current members

– Kaushik Datta, Mark Hoemmen, Marghoob Mohiyuddin, Shoaib Kamil, Rajesh Nishtala, Vasily Volkov, Sam Williams, …

• Previous members
  – Hormozd Gahvari, Eun-Jin Im, Ankit Jain, Rich Vuduc, many undergrads, …

• Many results here are from current and previous students
• bebop.cs.berkeley.edu

Page 4:

Automatic Performance Tuning

• Goal: Let the machine do the hard work of writing fast code
• What does tuning of dense BLAS, FFTs, signal processing, … have in common?
  – Can do the tuning off-line: once per architecture and algorithm
  – Can take as much time as necessary (hours, a week, …)
  – At run-time, the algorithm choice may depend on only a few parameters (matrix dimensions, size of FFT, etc.)
• Can’t always do tuning off-line
  – Algorithm and implementation may strongly depend on data known only at run-time
  – Ex: the sparse matrix nonzero pattern determines both the best data structure and the best implementation of sparse-matrix-vector multiplication (SpMV)
  – Part of the search for the best algorithm must be done (very quickly!) at run-time

Page 5:

Source: Accelerator Cavity Design Problem (Ko via Husbands)

Page 6:

Linear Programming Matrix

Page 7:

A Sparse Matrix You Encounter Every Day

Page 8:

SpMV with Compressed Sparse Row (CSR) Storage

Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)·x(j)

for each row i
  for k = ptr[i] to ptr[i+1]-1 do
    y[i] = y[i] + val[k]*x[ind[k]]


Only 2 flops per 2 memory references: limited by getting data from memory.
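Written out as compilable C, the CSR kernel above might look like the following minimal sketch (function and variable names are illustrative, not taken from OSKI):

/* y = y + A*x, with A in CSR format */
void spmv_csr(int n, const int *ptr, const int *ind, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < n; i++) {                 /* loop over rows */
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++)   /* nonzeros of row i */
            yi += val[k] * x[ind[k]];
        y[i] = yi;
    }
}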

Page 9:

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• kernel: SpMV

• Source: NASA structural analysis problem

Page 10:

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• kernel: SpMV

• Source: NASA structural analysis problem

• 8x8 dense substructure: exploit this to limit #mem_refs

Page 11:

Taking advantage of block structure in SpMV

• Bottleneck is the time to get the matrix from memory
  – Only 2 flops for each nonzero in the matrix
• Don’t store each nonzero with its own index; instead store each nonzero r-by-c block with one index
  – Storage drops by up to 2x if r·c >> 1 and all quantities are 32-bit
  – Time to fetch the matrix from memory decreases
• Change both data structure and algorithm (see the sketch below)
  – Need to pick r and c
  – Need to change the algorithm accordingly
• In the example, is r=c=8 the best choice?
  – It minimizes storage, so it looks like a good idea…
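For concreteness, here is a minimal sketch of an r-by-c register-blocked (BCSR) SpMV, assuming the block dimensions divide the matrix dimensions; arrays and names are illustrative, and a tuned version would fix r and c at compile time and fully unroll the two inner loops:

/* y = y + A*x with A stored as r-by-c blocks (BCSR).
   brow_ptr/bcol_ind index blocks; bval holds r*c values per block, row-major. */
void spmv_bcsr(int nbrows, int r, int c,
               const int *brow_ptr, const int *bcol_ind, const double *bval,
               const double *x, double *y)
{
    for (int I = 0; I < nbrows; I++) {                 /* loop over block rows */
        double *yb = &y[I * r];
        for (int K = brow_ptr[I]; K < brow_ptr[I+1]; K++) {
            const double *b  = &bval[(long)K * r * c]; /* one r-by-c block */
            const double *xb = &x[bcol_ind[K] * c];    /* matching piece of x */
            for (int i = 0; i < r; i++)
                for (int j = 0; j < c; j++)
                    yb[i] += b[i*c + j] * xb[j];
        }
    }
}

Because each block carries a single column index, the index array shrinks by roughly a factor of r·c, which is where the memory-traffic savings come from.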

Page 12:

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking performance profile on Itanium 2 (Mflop/s for each r×c block size); the reference (1×1 CSR) is far from the best, which is the 4×2 block size.]

Page 13:

Register Profile: Itanium 2

[Figure: register profile (Mflop/s over all r×c block sizes) on Itanium 2, ranging from 190 Mflop/s to 1190 Mflop/s.]

Page 14:

Register Profiles: IBM and Intel IA-64

[Figure: register profiles (Mflop/s over all r×c block sizes), best fraction of peak per machine:
  Power3 - 17% (122 to 252 Mflop/s)
  Power4 - 16% (459 to 820 Mflop/s)
  Itanium 1 - 8% (107 to 247 Mflop/s)
  Itanium 2 - 33% (190 Mflop/s to 1.2 Gflop/s)]

Page 15:

Register Profiles: Sun and Intel x86

[Figure: register profiles (Mflop/s over all r×c block sizes), best fraction of peak per machine:
  Ultra 2i - 11% (35 to 72 Mflop/s)
  Ultra 3 - 5% (50 to 90 Mflop/s)
  Pentium III - 21% (42 to 108 Mflop/s)
  Pentium III-M - 15% (58 to 122 Mflop/s)]

Page 16:

Another example of tuning challenges

• More complicated non-zero structure in general

• N = 16614
• NNZ = 1.1 M

Page 17:

Zoom in to top corner

• More complicated non-zero structure in general

• N = 16614
• NNZ = 1.1 M

Page 18:

3x3 blocks look natural, but…

• More complicated non-zero structure in general

• Example: 3x3 blocking
  – Logical grid of 3x3 cells

• But would lead to lots of “fill-in”

Page 19:

Extra Work Can Improve Efficiency!

• More complicated non-zero structure in general

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – “Fill ratio” = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher
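Why the raw rate rises by 2.25x while wall-clock time improves by only 1.5x: the blocked kernel performs 1.5x as many flops (it also multiplies the filled-in zeros) and finishes in 1/1.5 of the time, so the reported rate is 1.5 × 1.5 = 2.25 times the CSR rate. Only the 1.5x time reduction is a real speedup, which is why later slides count “fair” flops that exclude operations on explicit zeros.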

Page 20:

Automatic Register Block Size Selection

• Selecting the r x c block size
  – Off-line benchmark
    • Precompute Mflops(r,c) using dense A, for each r x c
    • Once per machine/architecture
  – Run-time “search”
    • Sample A to estimate Fill(r,c) for each r x c
  – Run-time heuristic model
    • Choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c) (see the sketch below)
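A minimal sketch of this heuristic, assuming the off-line Mflops(r,c) table and run-time fill estimates are already available (RMAX and all names are illustrative):

#define RMAX 8   /* illustrative upper bound on block dimensions */

/* Pick the (r,c) that minimizes estimated time ~ Fill(r,c) / Mflops(r,c).
   mflops[r][c]: off-line benchmark on a dense matrix stored in sparse format.
   fill[r][c]:   run-time estimate obtained by sampling the user's matrix. */
void choose_block_size(const double mflops[RMAX+1][RMAX+1],
                       const double fill[RMAX+1][RMAX+1],
                       int *best_r, int *best_c)
{
    double best_time = 1e300;
    for (int r = 1; r <= RMAX; r++)
        for (int c = 1; c <= RMAX; c++) {
            double t = fill[r][c] / mflops[r][c];  /* relative time estimate */
            if (t < best_time) {
                best_time = t;
                *best_r = r;
                *best_c = c;
            }
        }
}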

Page 21:

Accurate and Efficient Adaptive Fill Estimation

• Idea: Sample the matrix (sketched below)
  – Fraction of matrix to sample: s ∈ [0,1]
  – Control cost = O(s * nnz) by controlling s
• Search at run-time: the constant matters!
• Control s automatically by computing statistical confidence intervals, i.e. by monitoring the variance
• Cost of tuning
  – Lower bound: converting the matrix costs 5 to 40 unblocked SpMVs
  – Heuristic: 1 to 11 SpMVs
• Tuning is only useful when we do many SpMVs
  – Common case, e.g. in sparse solvers
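A minimal sketch of sampling-based fill estimation for a single (r,c), assuming the matrix is in CSR; only a fraction s of the block rows is scanned, so the cost is O(s·nnz). All names are illustrative:

#include <stdlib.h>

/* Estimate Fill(r,c) = (stored entries after r-by-c blocking) / nnz,
   by scanning a random fraction s of the block rows. */
double estimate_fill(int n, const int *ptr, const int *ind,
                     int r, int c, double s)
{
    int nbcols = (n + c - 1) / c;
    char *touched = calloc(nbcols, 1);      /* block columns seen in this block row */
    long nnz_seen = 0, blocks_seen = 0;

    for (int I = 0; I < n; I += r) {         /* block row covering rows I..I+r-1 */
        if ((double)rand() / RAND_MAX > s) continue;    /* not sampled */
        int hi = (I + r < n) ? I + r : n;
        for (int i = I; i < hi; i++)
            for (int k = ptr[i]; k < ptr[i+1]; k++) {
                int J = ind[k] / c;
                if (!touched[J]) { touched[J] = 1; blocks_seen++; }
                nnz_seen++;
            }
        for (int i = I; i < hi; i++)          /* reset markers for the next block row */
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                touched[ind[k] / c] = 0;
    }
    free(touched);
    return nnz_seen ? (double)blocks_seen * r * c / nnz_seen : 1.0;
}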

Page 22:

Accuracy of the Tuning Heuristics (1/4)

NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”).
See p. 375 of Vuduc’s thesis for the matrices.

Page 23:

Accuracy of the Tuning Heuristics (2/4)

Page 24:

Accuracy of the Tuning Heuristics (2/4): DGEMV

Page 25:

Upper Bounds on Performance for blocked SpMV

• P = (flops) / (time)
  – Flops = 2 * nnz(A)
• Upper bound on speed: two main assumptions
  – 1. Count memory ops only (streaming)
  – 2. Count only compulsory and capacity misses: ignore conflicts
• Account for line sizes
• Account for matrix size and nnz
• Charge the minimum access “latency” α_i at the L_i cache and α_mem at memory
  – e.g., from Saavedra-Barrera and PMaC MAPS benchmarks

Lower bound on time:

  Time = Σ_i Hits_i · α_i + Hits_mem · α_mem
  where Hits_1 = Loads − Misses_1,
        Hits_i = Misses_(i−1) − Misses_i  for i > 1,
        Hits_mem = misses at the last cache level.
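A minimal sketch of evaluating this bound, under the same assumptions (per-level miss counts and minimum latencies are taken as inputs; names are illustrative):

/* Upper bound on SpMV Mflop/s from the latency model above.
   loads     = total loads issued
   misses[i] = misses at cache level i, i = 0..nlevels-1 (last level goes to memory)
   alpha[i]  = minimum access latency (seconds) at cache level i
   alpha_mem = minimum memory access latency (seconds) */
double mflops_upper_bound(long nnz, double loads,
                          const double *misses, const double *alpha,
                          double alpha_mem, int nlevels)
{
    double time = 0.0;
    double prev = loads;                      /* "Misses_0" = Loads */
    for (int i = 0; i < nlevels; i++) {
        double hits_i = prev - misses[i];     /* hits at level i */
        time += hits_i * alpha[i];
        prev = misses[i];
    }
    time += prev * alpha_mem;                 /* last-level misses hit DRAM */
    return (2.0 * nnz) / time / 1e6;          /* P = flops / time, in Mflop/s */
}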

Page 26:

Example: Bounds on Itanium 2

Page 27:

Example: Bounds on Itanium 2

Page 28:

Example: Bounds on Itanium 2

Page 29:

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Reordering to create dense structure + splitting: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – A²·x: 2x over CSR, 1.5x over RB
  – [A·x, A²·x, A³·x, …, Aᵏ·x] …. more to say later

Page 30:

Example: Sparse Triangular Factor

• Raefsky4 (structural problem) + SuperLU + colmmd

• N=19779, nnz=12.6 M

Dense trailing triangle: dim=2268, 20% of total nz

Can be as high as 90+%!
1.8x over CSR

Page 31:

Cache Optimizations for A·Aᵀ·x

• Cache-level: Interleave multiplication by A and Aᵀ
  – Only fetch A from memory once
• Register-level: a_iᵀ to be an r×c block row, or diagonal

  A·Aᵀ·x = [a_1 … a_n] · [a_1ᵀ ; … ; a_nᵀ] · x = Σ_{i=1..n} a_i (a_iᵀ · x)

  Each a_iᵀ·x is a dot product; each update y += (a_iᵀ·x)·a_i is an “axpy”.
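A minimal sketch of the interleaving idea, written for the companion kernel Aᵀ·A·x with A in CSR, where the rows of A play the role of the a_i (the A·Aᵀ·x case is analogous using columns); names are illustrative and y is assumed zero-initialized:

/* y = A^T * (A * x), touching each CSR row of A only once.
   For each row a_i: t = a_i . x (dot product), then y += t * a_i (axpy). */
void ata_x_csr(int m, const int *ptr, const int *ind, const double *val,
               const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)   /* dot product a_i . x */
            t += val[k] * x[ind[k]];
        for (int k = ptr[i]; k < ptr[i+1]; k++)   /* axpy: y += t * a_i */
            y[ind[k]] += t * val[k];
    }
}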

Page 32:

Example: Combining Optimizations (1/2)

• Register blocking, symmetry, multiple (k) vectors
  – Three low-level tuning parameters: r, c, v

[Figure: Y += A·X, where A has r×c register blocks and the k vectors in X are processed v at a time.]

Page 33:

Example: Combining Optimizations (2/2)

• Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB]
  – Symmetric, blocked, 1 vector
    • Up to 2.6x over nonsymmetric, blocked, 1 vector
  – Symmetric, blocked, k vectors
    • Up to 2.1x over nonsymmetric, blocked, k vectors
    • Up to 7.3x over nonsymmetric, nonblocked, 1 vector
  – Symmetric storage: up to 64.7% savings

Page 34:

Potential Impact on Applications: Omega3P

• Application: accelerator cavity design [Ko]
• Relevant optimization techniques
  – Symmetric storage
  – Register blocking
  – Reordering, to create more dense blocks
    • Reverse Cuthill-McKee ordering to reduce bandwidth
      – Do breadth-first search, number nodes in reverse of the order visited (a sketch follows below)
    • Traveling Salesman Problem-based ordering to create blocks
      – Nodes = columns of A
      – Weights(u, v) = number of nonzeros u and v have in common
      – Tour = ordering of columns
      – Choose the maximum-weight tour
      – See [Pinar & Heath ’97]
• 2.1x speedup on IBM Power 4
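A minimal sketch of the reverse Cuthill-McKee step described above, assuming a symmetric sparsity pattern in CSR. A production version would also pick a good (pseudo-peripheral) starting node and visit neighbors in order of increasing degree; names are illustrative:

#include <stdlib.h>

/* Reverse Cuthill-McKee: BFS over the graph of A, then reverse the visit order.
   perm[k] = node placed at position k of the new ordering. */
void rcm_order(int n, const int *ptr, const int *ind, int *perm)
{
    int *visited = calloc(n, sizeof(int));
    int *order   = malloc(n * sizeof(int));
    int head = 0, tail = 0;

    for (int start = 0; start < n; start++) {     /* handle disconnected graphs */
        if (visited[start]) continue;
        visited[start] = 1;
        order[tail++] = start;
        while (head < tail) {                     /* plain BFS */
            int u = order[head++];
            for (int k = ptr[u]; k < ptr[u+1]; k++) {
                int v = ind[k];
                if (!visited[v]) { visited[v] = 1; order[tail++] = v; }
            }
        }
    }
    for (int k = 0; k < n; k++)                   /* reverse the BFS order */
        perm[k] = order[n - 1 - k];

    free(order);
    free(visited);
}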

Page 35:

Source: Accelerator Cavity Design Problem (Ko via Husbands)

Page 36:

Post-RCM Reordering

Page 37:

100x100 Submatrix Along Diagonal

Page 38:

Before: Green + Red
After: Green + Blue

“Microscopic” Effect of RCM Reordering

Page 39:

“Microscopic” Effect of Combined RCM+TSP Reordering

Before: Green + Red
After: Green + Blue

Page 40:

(Omega3P)

Page 41:

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user’s matrix & machine
  – BLAS-style functionality: SpMV, A·x & Aᵀ·y, TrSV
  – Hides the complexity of run-time tuning
  – Includes new, faster locality-aware kernels: AᵀA·x, Aᵏ·x
• Faster than standard implementations
  – Up to 4x faster matvec, 1.8x trisolve, 4x AᵀA·x
• For “advanced” users & solver library writers
  – Available as a stand-alone library (OSKI 1.0.1h, 6/07)
  – Available as a PETSc extension (OSKI-PETSc .1d, 3/06)
  – bebop.cs.berkeley.edu/oski

Page 42:

Page 42: Minimizing Communication in Numerical Linear Algebra demmel Sparse-Matrix-Vector-Multiplication (SpMV) Jim Demmel EECS & Math Departments,

How the OSKI Tunes (Overview)

[Diagram: OSKI tuning workflow]

Library install-time (offline):
  1. Build for the target architecture
  2. Benchmark → benchmark data + generated code variants

Application run-time:
  1. Evaluate heuristic models, using the benchmark data, the user’s matrix, workload from program monitoring, and tuning history
  2. Select data structure & code → returns a matrix handle to the user for kernel calls

Extensibility: Advanced users may write & dynamically add “code variants” and “heuristic models” to the system.

Page 43:

How to Call OSKI: Basic Usage

• May gradually migrate existing apps
  – Step 1: “Wrap” existing data structures
  – Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …;                    /* Let x and y be two dense vectors */

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
  my_matmult( ptr, ind, val, α, x, β, y );

Page 44:

How to Call OSKI: Basic Usage

• May gradually migrate existing apps
  – Step 1: “Wrap” existing data structures
  – Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …;                    /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
                                            num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
  my_matmult( ptr, ind, val, α, x, β, y );

Page 45:

How to Call OSKI: Basic Usage

• May gradually migrate existing apps
  – Step 1: “Wrap” existing data structures
  – Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …;                    /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
                                            num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
  oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);  /* Step 2 */

Page 46:

How to Call OSKI: Tune with Explicit Hints

• User calls the “tune” routine
  – May provide explicit tuning hints (OPTIONAL)

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */

/* Tell OSKI we will call SpMV 500 times (workload hint) */
oski_SetHintMatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view, 500);
/* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

oski_TuneMat(A_tunable); /* Ask OSKI to tune */

for( i = 0; i < 500; i++ )
  oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);

Page 47:

How the User Calls OSKI: Implicit Tuning

• Ask the library to infer the workload
  – Library profiles all kernel calls
  – May periodically re-tune

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */

for( i = 0; i < 500; i++ ) {
  oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
  oski_TuneMat(A_tunable); /* Ask OSKI to tune */
}

Page 48:

Multicore SMPs Used for Tuning SpMV

AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
Sun T2+ T5140 (Victoria Falls)
IBM QS20 Cell Blade

Page 49:

Multicore SMPs with Conventional cache-based memory hierarchy

AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
Sun T2+ T5140 (Victoria Falls)
IBM QS20 Cell Blade

Page 50:

Multicore SMPs with local store-based memory hierarchy

AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)

IBM QS20 Cell Blade

Sun T2+ T5140 (Victoria Falls)

Page 51:

Multicore SMPs with CMT = Chip-MultiThreading

AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
Sun T2+ T5140 (Victoria Falls)
IBM QS20 Cell Blade

Page 52:

Multicore SMPs: Number of threads

AMD Opteron 2356 (Barcelona): 8 threads
Intel Xeon E5345 (Clovertown): 8 threads
Sun T2+ T5140 (Victoria Falls): 128 threads
IBM QS20 Cell Blade: 16 threads (*SPEs only)

Page 53:

Multicore SMPs: peak double precision flops

AMD Opteron 2356 (Barcelona): 74 GFlop/s
Intel Xeon E5345 (Clovertown): 75 GFlop/s
Sun T2+ T5140 (Victoria Falls): 19 GFlop/s
IBM QS20 Cell Blade: 29 GFlop/s (*SPEs only)

Page 54:

Multicore SMPs: total DRAM bandwidth

AMD Opteron 2356 (Barcelona): 21 GB/s
Intel Xeon E5345 (Clovertown): 21 GB/s (read), 10 GB/s (write)
Sun T2+ T5140 (Victoria Falls): 42 GB/s (read), 21 GB/s (write)
IBM QS20 Cell Blade: 51 GB/s

Page 55:

Multicore SMPs with Non-Uniform Memory Access - NUMA

AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
Sun T2+ T5140 (Victoria Falls)
IBM QS20 Cell Blade

Page 56:

Set of 14 test matrices

• All bigger than the caches of our SMPs

The 14 matrices: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP

  – Dense: 2K x 2K dense matrix stored in sparse format
  – Protein … FEM/Ship: well structured (sorted by nonzeros/row)
  – Economics … webbase: poorly structured, a hodgepodge
  – LP: extreme aspect ratio (linear programming)

Page 57:

SpMV Performance: Naive parallelization

• Out-of-the-box SpMV performance on a suite of 14 matrices (naïve serial vs. naïve Pthreads; a minimal Pthreads sketch follows below)
• Scalability isn’t great: compare the parallel speedups to the thread counts (8, 8, 128, 16)

[Figure: naïve and naïve Pthreads SpMV performance on the four platforms.]
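A minimal sketch of such a naïve Pthreads baseline: each thread handles a contiguous block of CSR rows. The tuned versions on the following slides layer NUMA-aware allocation, software prefetching, compression, and cache/TLB blocking on top of this; all names are illustrative:

#include <pthread.h>

typedef struct {
    int row_begin, row_end;                  /* this thread's rows [begin, end) */
    const int *ptr, *ind;
    const double *val, *x;
    double *y;
} spmv_task_t;

static void *spmv_worker(void *arg)
{
    spmv_task_t *t = arg;
    for (int i = t->row_begin; i < t->row_end; i++) {
        double yi = 0.0;
        for (int k = t->ptr[i]; k < t->ptr[i+1]; k++)
            yi += t->val[k] * t->x[t->ind[k]];
        t->y[i] = yi;
    }
    return NULL;
}

/* y = A*x with rows partitioned evenly across nthreads (assumes nthreads <= 64) */
void spmv_parallel(int n, const int *ptr, const int *ind, const double *val,
                   const double *x, double *y, int nthreads)
{
    pthread_t tid[64];
    spmv_task_t task[64];
    for (int t = 0; t < nthreads; t++) {
        task[t] = (spmv_task_t){ n * t / nthreads, n * (t + 1) / nthreads,
                                 ptr, ind, val, x, y };
        pthread_create(&tid[t], NULL, spmv_worker, &task[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}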

Page 58:

SpMV Performance: NUMA and Software Prefetching

NUMA-aware allocation is essential on NUMA SMPs.

Explicit software prefetching can boost bandwidth and change cache replacement policies.

(Exhaustive search was used for this step.)

Page 59:

SpMV Performance: “Matrix Compression”

“Compression” includes register blocking, other formats, and smaller indices.

A heuristic is used here rather than exhaustive search.
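One of the index-compression ideas, sketched minimally: when every column index (within the matrix, or within a cache block) fits in 16 bits, the index array can be stored as uint16_t, halving its memory traffic. Names are illustrative:

#include <stdint.h>
#include <stdlib.h>

/* If all column indices fit in 16 bits, return a compressed copy of ind[],
   halving index-array traffic; otherwise return NULL and keep 32-bit indices. */
uint16_t *compress_indices(const int *ind, long nnz, int ncols)
{
    if (ncols > 65536) return NULL;           /* indices would not fit */
    uint16_t *ind16 = malloc(nnz * sizeof(uint16_t));
    for (long k = 0; k < nnz; k++)
        ind16[k] = (uint16_t)ind[k];
    return ind16;
}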

Page 60:

SpMV Performance: cache and TLB blocking

[Figure: SpMV performance as optimizations are stacked: Naïve → Naïve Pthreads → +NUMA/Affinity → +SW Prefetching → +Matrix Compression → +Cache/LS/TLB Blocking.]

Page 61:

SpMV Performance: Architecture specific optimizations

[Figure: SpMV performance including architecture-specific optimizations, with the same stacking: Naïve → Naïve Pthreads → +NUMA/Affinity → +SW Prefetching → +Matrix Compression → +Cache/LS/TLB Blocking.]

Page 62:

SpMV Performance: max speedup

• Fully auto-tuned SpMV performance across the suite of matrices

• Included SPE/local store optimized version

• Why do some optimizations work better on some architectures?

[Figure: fully tuned SpMV with the stacked optimizations (Naïve → Naïve Pthreads → +NUMA/Affinity → +SW Prefetching → +Matrix Compression → +Cache/LS/TLB Blocking). Maximum speedups over the naïve baseline: 2.7x, 4.0x, 2.9x, and 35x across the four platforms; the 35x is on the Cell blade, which includes the SPE/local-store version.]

Page 63:

EXTRA SLIDES
