Lecture 6: Memory Hierarchy and Cache (Continued)


Page 1: Lecture 6: Memory Hierarchy and Cache (Continued)

Lecture 6: Memory Hierarchy and Cache (Continued)

Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

Cache: A safe place for hiding and storing things. Webster’s New World Dictionary (1976)

Page 2: Lecture 6: Memory Hierarchy and Cache (Continued)

Homework Assignment

• Implement, in Fortran or C, the six different ways to perform matrix multiplication by interchanging the loops. (Use 64-bit arithmetic.) Make each implementation a subroutine, like:
   – subroutine ijk ( a, m, n, lda, b, k, ldb, c, ldc )
   – subroutine ikj ( a, m, n, lda, b, k, ldb, c, ldc )
   – …
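For orientation, here is a minimal C sketch of one of the six, the ijk variant (the other five follow by permuting the loops). The interpretation of the arguments, A as m-by-k with leading dimension lda, B as k-by-n, C as m-by-n, all stored by column, is an assumption; the assignment leaves the conventions to you.

   /* Sketch of the ijk variant (assumed column-major storage and the
      argument interpretation described above; 64-bit arithmetic = double). */
   void ijk(const double *a, int m, int n, int lda,
            const double *b, int k, int ldb,
            double *c, int ldc)
   {
       for (int i = 0; i < m; i++)
           for (int j = 0; j < n; j++)
               for (int p = 0; p < k; p++)   /* p plays the role of the k loop index */
                   c[i + j*ldc] += a[i + p*lda] * b[p + j*ldb];
   }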

Page 3: Lecture 6: Memory Hierarchy and Cache (Continued)

6 Variations of Matrix Multiply

for _ = 1:n
   for _ = 1:n
      for _ = 1:n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      end
   end
end


Page 11: Lecture 6: Memory Hierarchy and Cache (Continued)

6 Variations of Matrix Multiply

The six orderings of the loop nest: ijk, ikj, kij, kji, jki, jik. Each computes C(i,j) = C(i,j) + A(i,k)*B(k,j); they differ only in which index varies fastest, and therefore in the stride of their memory accesses. Which orderings run well depends on the storage order: Fortran stores arrays by column, C by row.

However, this is only part of the story.

Page 12: Lecture 6: Memory Hierarchy and Cache (Continued)

SUN Ultra 2, 200 MHz (L1 = 16 KB, L2 = 1 MB)

(Figure: Mflop/s vs. matrix order for the six loop orderings ijk, ikj, jik, jki, kij, kji and the vendor-tuned dgemm.)

Page 13: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrices in Cache

• L1 cache 16 KB: 16 KB / 8 bytes per word holds 2K doubles, so a 45x45 matrix fits in L1
• L2 cache 1 MB: 1 MB / 8 holds 128K doubles, so a 362x362 matrix fits in L2


Page 15: Lecture 6: Memory Hierarchy and Cache (Continued)

Optimizing Matrix Addition for Caches

• Dimension A(n,n), B(n,n), C(n,n)
• A, B, C stored by column (as in Fortran)
• Algorithm 1: for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j)
• Algorithm 2: for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)
• What is the memory access pattern for Algorithms 1 and 2?
• Which is faster?
• What if A, B, C are stored by row (as in C)? (See the sketch below.)
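A minimal C sketch of the two algorithms (not from the slides; N is an arbitrary illustrative size):

   /* The slide's two algorithms, written in C. C stores arrays by row,
      so here Algorithm 1 (j innermost) walks memory with stride 1 while
      Algorithm 2 (i innermost) jumps a whole row of N doubles per access;
      in Fortran (column-major) the roles are exactly reversed. */
   #define N 512
   static double A[N][N], B[N][N], C[N][N];

   void add_alg1(void)   /* unit-stride inner loop: cache friendly in C */
   {
       for (int i = 0; i < N; i++)
           for (int j = 0; j < N; j++)
               A[i][j] = B[i][j] + C[i][j];
   }

   void add_alg2(void)   /* stride-N inner loop: roughly one miss per access for large N */
   {
       for (int j = 0; j < N; j++)
           for (int i = 0; i < N; i++)
               A[i][j] = B[i][j] + C[i][j];
   }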

Page 16: Lecture 6: Memory Hierarchy and Cache (Continued)

Using a Simpler Model of Memory to Optimize

• Assume just 2 levels in the hierarchy, fast and slow
• All data initially in slow memory
   – m = number of memory elements (words) moved between fast and slow memory
   – tm = time per slow memory operation
   – f = number of arithmetic operations
   – tf = time per arithmetic operation, tf < tm
   – q = f/m = average number of flops per slow element access
• Minimum possible time = f*tf, when all data is in fast memory
• Actual time = f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q))
• Larger q means time closer to the minimum f*tf

Page 17: Lecture 6: Memory Hierarchy and Cache (Continued)

Simple Example Using the Memory Model

• To see the effect of changing q, consider this simple computation:

   s = 0
   for i = 1 to n
      s = s + h(X[i])

• Assume tf = 1, i.e., arithmetic on data in fast memory runs at 1 Mflop/s
• Assume moving a data word costs tm = 10
• Assume h takes q flops
• Assume array X is in slow memory
• So m = n and f = q*n
• Time = read X + compute = 10*n + q*n
• Mflop/s = f/time = q/(10 + q)
• As q increases, this approaches the "peak" speed of 1 Mflop/s
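A few values of this model, tabulated by a short C loop (purely illustrative; tf = 1 and tm = 10 are the assumptions above):

   #include <stdio.h>

   /* Tabulate the two-level model Mflop/s = q / (tm/tf + q)
      for several flop-per-word ratios q, with tf = 1 and tm = 10. */
   int main(void)
   {
       const double tm_over_tf = 10.0;
       for (int q = 1; q <= 128; q *= 2)
           printf("q = %3d  ->  %5.2f Mflop/s\n", q, q / (tm_over_tf + q));
       return 0;
   }

For q = 1 the model predicts 0.09 Mflop/s; even q = 128 reaches only 0.93. That is the point of the slide: without data reuse, memory speed, not arithmetic speed, dominates.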

Page 18: Lecture 6: Memory Hierarchy and Cache (Continued)

Simple Example (continued)

• Algorithm 1

   s1 = 0; s2 = 0
   for j = 1 to n
      s1 = s1 + h1(X(j))
      s2 = s2 + h2(X(j))

• Algorithm 2

   s1 = 0; s2 = 0
   for j = 1 to n
      s1 = s1 + h1(X(j))
   for j = 1 to n
      s2 = s2 + h2(X(j))

• Which is faster?

Page 19: Lecture 6: Memory Hierarchy and Cache (Continued)

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
   for (j = 0; j < N; j = j+1)
      a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
   for (j = 0; j < N; j = j+1)
      d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
   for (j = 0; j < N; j = j+1) {
      a[i][j] = 1/b[i][j] * c[i][j];
      d[i][j] = a[i][j] + c[i][j];
   }

Two misses per access to a and c vs. one miss per access; fusing the loops improves temporal locality (a[i][j] and c[i][j] are reused while still in cache).

Page 20: Lecture 6: Memory Hierarchy and Cache (Continued)

Optimizing Matrix Multiply for Caches

• Several techniques for making this run faster on modern processors
   – heavily studied
• Some optimizations are done automatically by the compiler, but you can do much better
• In general, you should use optimized libraries (often supplied by the vendor) for this and other very common linear algebra operations
   – BLAS = Basic Linear Algebra Subroutines
• Other algorithms you may want are not going to be supplied by the vendor, so you need to know these techniques

Page 21: Lecture 6: Memory Hierarchy and Cache (Continued)

Warm up: Matrix-vector multiplication y = y + A*x

for i = 1:n
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)

(Figure: y(i) = y(i) + A(i,:) * x(:), i.e., each element of y accumulates the dot product of row i of A with x.)

Page 22: Lecture 6: Memory Hierarchy and Cache (Continued)

Warm up: Matrix-vector multiplication y = y + A*x

{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
   {read row i of A into fast memory}
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}

• m = number of slow memory refs = 3*n + n^2
• f = number of arithmetic operations = 2*n^2
• q = f/m ~= 2
• Matrix-vector multiplication is limited by slow memory speed
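The same loop nest as plain C, for reference (a sketch; the row-major storage and the function name are mine, not the slide's):

   #include <stddef.h>

   /* y = y + A*x for an n-by-n matrix A stored by row. Each element of A
      is touched exactly once, so f/m ~ 2 regardless of loop order. */
   void matvec(int n, const double *A, const double *x, double *y)
   {
       for (int i = 0; i < n; i++) {
           double yi = y[i];               /* keep y(i) in a register */
           for (int j = 0; j < n; j++)
               yi += A[(size_t)i*n + j] * x[j];
           y[i] = yi;
       }
   }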

Page 23: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply C = C + A*B

for i = 1 to n
   for j = 1 to n
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)

(Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j).)

Page 24: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply C = C + A*B (unblocked, or untiled)

for i = 1 to n
   {read row i of A into fast memory}
   for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}

Page 25: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply (unblocked, or untiled)

Number of slow memory references on unblocked matrix multiply:

   m = n^3        read each column of B n times
     + n^2        read each row of A once
     + 2*n^2      read and write each element of C once
     = n^3 + 3*n^2

So q = f/m = 2*n^3 / (n^3 + 3*n^2) ~= 2 for large n: no improvement over matrix-vector multiply.

(q = ops per slow memory reference)

Page 26: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply (blocked, or tiled)

Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize.

for i = 1 to N
   for j = 1 to N
      {read block C(i,j) into fast memory}
      for k = 1 to N
         {read block A(i,k) into fast memory}
         {read block B(k,j) into fast memory}
         C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}

(Figure: block C(i,j) accumulates A(i,k) * B(k,j).)
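A C sketch of the tiled loop nest (row-major storage assumed; for brevity the blocksize b is assumed to divide n evenly):

   #include <stddef.h>

   /* Blocked C = C + A*B on n-by-n row-major matrices with blocksize b.
      The three innermost loops are the "matrix multiply on blocks". */
   void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
   {
       for (int i = 0; i < n; i += b)
           for (int j = 0; j < n; j += b)
               for (int k = 0; k < n; k += b)
                   /* C[i:i+b, j:j+b] += A[i:i+b, k:k+b] * B[k:k+b, j:j+b] */
                   for (int ii = i; ii < i + b; ii++)
                       for (int jj = j; jj < j + b; jj++) {
                           double cij = C[(size_t)ii*n + jj];
                           for (int kk = k; kk < k + b; kk++)
                               cij += A[(size_t)ii*n + kk] * B[(size_t)kk*n + jj];
                           C[(size_t)ii*n + jj] = cij;
                       }
   }

Choosing b so that three b-by-b blocks fit in cache (3*b^2 <= M) is exactly the limit derived on the next slide.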

Page 27: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply (blocked or tiled)

Why is this algorithm correct?

Number of slow memory references on blocked matrix multiply:

   m = N*n^2      read each block of B N^3 times (N^3 * n/N * n/N)
     + N*n^2      read each block of A N^3 times
     + 2*n^2      read and write each block of C once
     = (2*N + 2) * n^2

So q = f/m = 2*n^3 / ((2*N + 2) * n^2) ~= n/N = b for large n.

So we can improve performance by increasing the blocksize b. This can be much faster than matrix-vector multiply (q = 2).

Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3).

Theorem (Hong and Kung, 1981): any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M)).

(q = ops per slow memory reference)

Page 28: Lecture 6: Memory Hierarchy and Cache (Continued)

Model

• As much as possible will be overlapped
• Dot Product:

   ACC = 0
   do i = 1, n
      ACC = ACC + x(i)*y(i)
   end do

• Experiments done on an IBM RS6000/530
   – 25 MHz
   – 2-cycle FMA (fused multiply-add) that can be pipelined
      » => 50 Mflop/s peak
   – one cycle from cache

Page 29: Lecture 6: Memory Hierarchy and Cache (Continued)

DOT Operation - Data in Cache

   DO 10 I = 1, n
      T = T + X(I)*Y(I)
10 CONTINUE

• Theoretically, 2 loads for X(I) and Y(I), one FMA operation, no re-use of data
• Pseudo-assembler:

      LOAD fp0,T
   label:
      LOAD fp1,X(I)
      LOAD fp2,Y(I)
      FMA fp0,fp0,fp1,fp2
      BRANCH label:

(Pipeline diagram: the loads of x and y overlap with the FMA. With two loads needed per FMA, one FMA (2 flops) completes every two cycles, giving 25 Mflop/s.)

Page 30: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix-Vector Product

• DOT version:

   DO 20 I = 1, M
      DO 10 J = 1, N
         Y(I) = Y(I) + A(I,J)*X(J)
10    CONTINUE
20 CONTINUE

• From cache: 22.7 Mflop/s
• From memory: 12.4 Mflop/s

Page 31: Lecture 6: Memory Hierarchy and Cache (Continued)

Loop Unrolling

   DO 20 I = 1, M, 2
      T1 = Y(I)
      T2 = Y(I+1)
      DO 10 J = 1, N
         T1 = T1 + A(I,J)*X(J)
         T2 = T2 + A(I+1,J)*X(J)
10    CONTINUE
      Y(I) = T1
      Y(I+1) = T2
20 CONTINUE

• 3 loads, 4 flops per inner iteration
• Speed of y = y + A^T*x, N = 48 (Mflop/s):

   Depth      1     2     3     4     limit
   Theory     25    33.3  37.5  40    50
   Measured   22.7  30.5  34.3  36.5
   Memory     12.4  12.7  12.7  12.6

• unroll 1: 2 loads, 2 ops per 2 cycles
• unroll 2: 3 loads, 4 ops per 3 cycles
• unroll 3: 4 loads, 6 ops per 4 cycles
• …
• unroll n: n+1 loads, 2n ops per n+1 cycles
• problem: only so many registers

Page 32: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply

• DOT version - 25 Mflop/s in cache

   DO 30 J = 1, M
      DO 20 I = 1, M
         DO 10 K = 1, L
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
10       CONTINUE
20    CONTINUE
30 CONTINUE

Page 33: Lecture 6: Memory Hierarchy and Cache (Continued)

How to Get Near Peak

   DO 30 J = 1, M, 2
      DO 20 I = 1, M, 2
         T11 = C(I,  J  )
         T12 = C(I,  J+1)
         T21 = C(I+1,J  )
         T22 = C(I+1,J+1)
         DO 10 K = 1, L
            T11 = T11 + A(I,  K)*B(K,J  )
            T12 = T12 + A(I,  K)*B(K,J+1)
            T21 = T21 + A(I+1,K)*B(K,J  )
            T22 = T22 + A(I+1,K)*B(K,J+1)
10       CONTINUE
         C(I,  J  ) = T11
         C(I,  J+1) = T12
         C(I+1,J  ) = T21
         C(I+1,J+1) = T22
20    CONTINUE
30 CONTINUE

• Inner loop: 4 loads, 8 operations - optimal
• In practice we have measured 48.1 out of a peak of 50 Mflop/s when in cache

Page 34: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS -- Introduction

• Clarity: code is shorter and easier to read
• Modularity: gives the programmer larger building blocks
• Performance: manufacturers will provide tuned machine-specific BLAS
• Program portability: machine dependencies are confined to the BLAS

Page 35: Lecture 6: Memory Hierarchy and Cache (Continued)

Memory Hierarchy

   Registers -> L1 Cache -> L2 Cache -> Local Memory -> Remote Memory -> Secondary Memory

• Key to high performance is effective use of the memory hierarchy
• True on all architectures

Page 36: Lecture 6: Memory Hierarchy and Cache (Continued)

Level 1, 2 and 3 BLAS

• Level 1 BLAS: vector-vector operations
• Level 2 BLAS: matrix-vector operations
• Level 3 BLAS: matrix-matrix operations

(Figure: schematic shapes of the three operation classes.)

Page 37: Lecture 6: Memory Hierarchy and Cache (Continued)

More on BLAS (Basic Linear Algebra Subroutines)

• Industry standard interface (evolving)
• Vendors and others supply optimized implementations
• History
   – BLAS1 (1970s):
      » vector operations: dot product, saxpy (y = a*x + y), etc.
      » m = 2*n, f = 2*n, q ~ 1 or less
   – BLAS2 (mid 1980s):
      » matrix-vector operations: matrix-vector multiply, etc.
      » m = n^2, f = 2*n^2, q ~ 2; less overhead
      » somewhat faster than BLAS1
   – BLAS3 (late 1980s):
      » matrix-matrix operations: matrix-matrix multiply, etc.
      » m >= 4*n^2, f = O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
• Good algorithms use BLAS3 when possible (LAPACK)
• www.netlib.org/blas, www.netlib.org/lapack
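In code, "use an optimized library" usually means a single gemm call instead of a hand-written triple loop. A sketch against the standard CBLAS interface (this assumes some CBLAS implementation, e.g. ATLAS or a vendor BLAS, is installed and linked):

   #include <cblas.h>

   /* C = 1.0*A*B + 1.0*C with n-by-n matrices, row-major storage. */
   void matmul_blas(int n, const double *A, const double *B, double *C)
   {
       cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                   n, n, n,        /* M, N, K       */
                   1.0, A, n,      /* alpha, A, lda */
                   B, n,           /* B, ldb        */
                   1.0, C, n);     /* beta, C, ldc  */
   }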

Page 38: Lecture 6: Memory Hierarchy and Cache (Continued)

Why Higher Level BLAS?

• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this

   BLAS                 Memory Refs   Flops   Flops / Memory Refs
   Level 1 (y = y+ax)       3n          2n          2/3
   Level 2 (y = y+Ax)       n^2         2n^2        2
   Level 3 (C = C+AB)       4n^2        2n^3        n/2

(Figure: the memory hierarchy: Registers, L1 Cache, L2 Cache, Local Memory, Remote Memory, Secondary Memory.)

Page 39: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS for Performance

• Development of blocked algorithms important for performance

(Figure: Mflop/s vs. order of vectors/matrices from 10 to 500 on an IBM RS/6000-590, 66 MHz, 264 Mflop/s peak: Level 3 BLAS on top, then Level 2 BLAS, then Level 1 BLAS.)

Page 40: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS for Performance

• Development of blocked algorithms important for performance

(Figure: Mflop/s vs. order of vectors/matrices from 10 to 500 on an Alpha EV 5/6, 500 MHz, 1 Gflop/s peak: Level 3 BLAS well above Level 2 and Level 1.)

BLAS 3 (n-by-n matrix-matrix multiply) vs. BLAS 2 (n-by-n matrix-vector multiply) vs. BLAS 1 (saxpy of n vectors)

Page 41: Lecture 6: Memory Hierarchy and Cache (Continued)

Fast linear algebra kernels: BLAS

• Simple linear algebra kernels such as matrix-matrix multiply
• More complicated algorithms can be built from these basic kernels
• The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines (BLAS)
• Early agreement on a standard interface (~1980)
• Led to portable libraries for vector and shared memory parallel machines
• On distributed memory, there is a less standard interface called the PBLAS

Page 42: Lecture 6: Memory Hierarchy and Cache (Continued)

Level 1 BLAS

• Operate on vectors or pairs of vectors
   – perform O(n) operations;
   – return either a vector or a scalar
• saxpy
   – y(i) = a * x(i) + y(i), for i = 1 to n
   – s stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex
• sscal: x = a * x, for scalar a and vector x
• sdot: computes s = Σ x(i)*y(i), summed over i = 1 to n
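The Level 1 routines are simple enough to state in a few lines of C. A sketch of the axpy pattern in double precision (the real BLAS interface also takes increment arguments incx and incy, omitted here):

   /* y = a*x + y on n-element vectors: the BLAS-1 axpy pattern (daxpy).
      2n flops against 3n memory references, so q = 2/3:
      bound by memory traffic, not arithmetic. */
   void daxpy(int n, double a, const double *x, double *y)
   {
       for (int i = 0; i < n; i++)
           y[i] = a * x[i] + y[i];
   }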

Page 43: Lecture 6: Memory Hierarchy and Cache (Continued)

Level 2 BLAS

• Operate on a matrix and a vector
   – return a matrix or a vector;
   – O(n^2) operations
• sgemv: matrix-vector multiply
   – y = y + A*x
   – where A is m-by-n, x is n-by-1 and y is m-by-1
• sger: rank-one update
   – A = A + y*x^T, i.e., A(i,j) = A(i,j) + y(i)*x(j)
   – where A is m-by-n, y is m-by-1, x is n-by-1
• strsv: triangular solve
   – solves y = T*x for x, where T is triangular

Page 44: Lecture 6: Memory Hierarchy and Cache (Continued)

Level 3 BLAS

• Operate on pairs or triples of matrices
   – returning a matrix;
   – complexity is O(n^3)
• sgemm: matrix-matrix multiplication
   – C = C + A*B
   – where C is m-by-n, A is m-by-k, and B is k-by-n
• strsm: multiple triangular solve
   – solves Y = T*X for X
   – where T is a triangular matrix and X is a rectangular matrix

Page 45: Lecture 6: Memory Hierarchy and Cache (Continued)

Optimizing in practice

• Tiling for registers
   – loop unrolling, use of named "register" variables
• Tiling for multiple levels of cache
• Exploiting fine-grained parallelism within the processor
   – superscalar
   – pipelining
• Complicated compiler interactions
• Hard to do by hand (but you'll try)
• Automatic optimization is an active research area
   – PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac
   – www.cs.berkeley.edu/~iyer/asci_slides.ps
   – ATLAS: www.netlib.org/atlas/index.html

Page 46: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS -- References

• BLAS software and documentation can be obtained via:
   – WWW: http://www.netlib.org/blas
   – (anonymous) ftp ftp.netlib.org: cd blas; get index
   – email [email protected] with the message: send index from blas
• Comments and questions can be addressed to: [email protected]

Page 47: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS Papers

• C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, 5:308-325, 1979.
• J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 14(1):1-32, 1988.
• J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 16(1):1-17, 1990.

Page 48: Lecture 6: Memory Hierarchy and Cache (Continued)

Performance of BLAS

• BLAS are specially optimized by the vendor
   – Sun BLAS uses features in the UltraSPARC
• Big payoff for algorithms that can be expressed in terms of BLAS3 instead of BLAS2 or BLAS1
• The top speed of the BLAS3 approaches the machine's peak
• Algorithms like Gaussian elimination are organized so that they use BLAS3

Page 49: Lecture 6: Memory Hierarchy and Cache (Continued)

How To Get Performance From Commodity Processors?

• Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning
• Routines have a large design space with many parameters
   – blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
   – complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors
• A few months ago there was no tuned BLAS for the Pentium under Linux
• Need for quick/dynamic deployment of optimized routines
• ATLAS - Automatically Tuned Linear Algebra Software
   – PhiPac from Berkeley

Page 50: Lecture 6: Memory Hierarchy and Cache (Continued)

Adaptive Approach for Level 3

(Figure: blocked GEMM, C (M-by-N) computed from A (M-by-K) and B (K-by-N) in NB-by-NB tiles.)

• Do a parameter study of the operation on the target machine, done once
• Only generated code is the on-chip multiply
• BLAS operation written in terms of the generated on-chip multiply
• All transpose cases coerced through data copy to 1 case of on-chip multiply
   – only 1 case generated per platform

Page 51: Lecture 6: Memory Hierarchy and Cache (Continued)

Code Generation Strategy

• Code is iteratively generated and timed until the optimal case is found. We try:
   – differing NBs
   – breaking false dependencies
   – M, N and K loop unrolling
• On-chip multiply optimizes for:
   – TLB access
   – L1 cache reuse
   – FP unit usage
   – memory fetch
   – register reuse
   – loop overhead minimization
• Takes a couple of hours to run

Page 52: Lecture 6: Memory Hierarchy and Cache (Continued)

500x500 Double Precision Matrix-Matrix Multiply Across Multiple Architectures

(Figure: bar chart of Mflop/s, vendor matrix multiply vs. ATLAS matrix multiply, on DEC Alpha 21164a-433, HP PA8000 180 MHz, HP 9000/735/125, IBM Power2-135, IBM PowerPC604e-332, Pentium MMX-150, Pentium Pro-200, Pentium II-266, SGI R4600, SGI R5000, SGI R8000ip21, SGI R10000ip27, Sun Microsparc II Model 70, Sun Darwin-270, and Sun Ultra2 Model 2200.)

Page 53: Lecture 6: Memory Hierarchy and Cache (Continued)

500x500 Double Precision LU Factorization Performance Across Multiple Architectures

(Figure: bar chart of Mflop/s, LU with vendor BLAS vs. LU with ATLAS & GEMM-based BLAS, on DCG LX 21164a-533, DEC Alpha 21164a-433, HP PA8000, IBM Power2-135, IBM PowerPC604e-332, Pentium Pro-200, Pentium II-266, SGI R5000, SGI R10000ip27, Sun Darwin-270, and Sun Ultra2 Model 2200.)

Page 54: Lecture 6: Memory Hierarchy and Cache (Continued)

500x500 gemm-based BLAS on SGI R10000ip28

(Figure: Mflop/s for DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM and DTRSM: vendor BLAS vs. ATLAS/SSBLAS vs. reference BLAS.)

Page 55: Lecture 6: Memory Hierarchy and Cache (Continued)

500x500 gemm-based BLAS on UltraSparc 2200

(Figure: Mflop/s by Level 3 BLAS routine, DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM: vendor BLAS vs. ATLAS/GEMM-based BLAS vs. reference BLAS.)

Page 56: Lecture 6: Memory Hierarchy and Cache (Continued)

Recursive Approach for Other Level 3 BLAS

• Recur down to the L1 cache block size
• Need a kernel at the bottom of the recursion
   – use the gemm-based kernel for portability (a sketch follows below)

(Figure: recursive TRMM, the triangular matrix split into successively smaller triangular and rectangular blocks.)
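To make the recursion concrete, here is a hedged C sketch of B := T*B for a lower-triangular T, splitting the problem until it fits an L1-sized kernel. The base kernel below is a plain triple loop standing in for the gemm-based kernel, and the threshold NB = 64 is illustrative, not tuned.

   #include <stddef.h>

   #define NB 64  /* illustrative L1 block size, not a tuned value */

   /* B := T*B, where T is n-by-n lower triangular and B is n-by-nrhs,
      both column-major with leading dimensions ldt and ldb.
      Splitting T = [T11 0; T21 T22] and B = [B1; B2] gives
        B2 := T22*B2 + T21*B1,   B1 := T11*B1,
      so B2 must be updated before B1 is overwritten. */
   void trmm_lower(int n, int nrhs, const double *T, int ldt,
                   double *B, int ldb)
   {
       if (n <= NB) {
           /* base kernel: in-place multiply, working bottom-up so that
              the rows of B still needed have not yet been overwritten */
           for (int j = 0; j < nrhs; j++)
               for (int i = n - 1; i >= 0; i--) {
                   double s = 0.0;
                   for (int k = 0; k <= i; k++)
                       s += T[i + (size_t)k*ldt] * B[k + (size_t)j*ldb];
                   B[i + (size_t)j*ldb] = s;
               }
           return;
       }
       int n1 = n / 2, n2 = n - n1;
       const double *T21 = T + n1;                   /* rows n1.., cols 0..n1-1 */
       const double *T22 = T + n1 + (size_t)n1*ldt;  /* trailing triangle */
       double *B1 = B, *B2 = B + n1;

       trmm_lower(n2, nrhs, T22, ldt, B2, ldb);      /* B2 := T22*B2 */
       for (int j = 0; j < nrhs; j++)                /* B2 += T21*B1 (a gemm) */
           for (int i = 0; i < n2; i++)
               for (int k = 0; k < n1; k++)
                   B2[i + (size_t)j*ldb] += T21[i + (size_t)k*ldt] * B1[k + (size_t)j*ldb];
       trmm_lower(n1, nrhs, T, ldt, B1, ldb);        /* B1 := T11*B1 */
   }

In a real library the gemm update and the base kernel would both be calls into the tuned on-chip multiply; that is the portability argument on this slide.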

Page 57: Lecture 6: Memory Hierarchy and Cache (Continued)

500x500 Level 2 BLAS DGEMV

(Figure: Mflop/s across architectures for the no-transpose DGEMV: vendor BLAS vs. ATLAS vs. the reference Fortran 77 implementation.)

Page 58: Lecture 6: Memory Hierarchy and Cache (Continued)

Multi-Threaded DGEMM, Intel PIII 550 MHz

(Figure: Mflop/s vs. problem size: Intel BLAS 1 proc, ATLAS 1 proc, Intel BLAS 2 proc, ATLAS 2 proc.)

Page 59: Lecture 6: Memory Hierarchy and Cache (Continued)


ATLAS

• Keep a repository of kernels for specific machines.

• Develop a means of dynamically downloading code

• Extend work to allow sparse matrix operations

• Extend work to include arbitrary code segments

• See: http://www.netlib.org/atlas/

Page 60: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS Technical Forum: http://www.netlib.org/utk/papers/blast-forum.html

• Established a forum to consider expanding the BLAS in light of modern software, language, and hardware developments
• Minutes available from each meeting
• Working proposals for the following:
   – Dense/Band BLAS
   – Sparse BLAS
   – Extended Precision BLAS
   – Distributed Memory BLAS
   – C and Fortran90 interfaces to Legacy BLAS

Page 61: Lecture 6: Memory Hierarchy and Cache (Continued)

Strassen's Matrix Multiply

• The traditional algorithm (with or without tiling) has O(n^3) flops
• Strassen discovered an algorithm with asymptotically lower flop count: O(n^2.81)
• Consider a 2x2 matrix multiply, normally 8 multiplies; Strassen does it in 7:

Let M = [m11 m12] = [a11 a12] * [b11 b12]
        [m21 m22]   [a21 a22]   [b21 b22]

Let p1 = (a12 - a22) * (b21 + b22)      p5 = a11 * (b12 - b22)
    p2 = (a11 + a22) * (b11 + b22)      p6 = a22 * (b21 - b11)
    p3 = (a11 - a21) * (b11 + b12)      p7 = (a21 + a22) * b11
    p4 = (a11 + a12) * b22

Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7

Extends to n-by-n matrices by divide and conquer.
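One level of the recursion, written out in C for scalar 2x2 blocks exactly as in the formulas above (illustrative only; a real implementation recurses on n/2-by-n/2 blocks and switches to the standard algorithm below a cutoff size):

   /* One Strassen step for a 2x2 multiply: 7 multiplies instead of 8.
      The p1..p7 and m11..m22 formulas are the ones given above. */
   void strassen_2x2(const double a[2][2], const double b[2][2], double m[2][2])
   {
       double p1 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
       double p2 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
       double p3 = (a[0][0] - a[1][0]) * (b[0][0] + b[0][1]);
       double p4 = (a[0][0] + a[0][1]) * b[1][1];
       double p5 = a[0][0] * (b[0][1] - b[1][1]);
       double p6 = a[1][1] * (b[1][0] - b[0][0]);
       double p7 = (a[1][0] + a[1][1]) * b[0][0];

       m[0][0] = p1 + p2 - p4 + p6;
       m[0][1] = p4 + p5;
       m[1][0] = p6 + p7;
       m[1][1] = p2 - p3 + p5 - p7;
   }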

Page 62: Lecture 6: Memory Hierarchy and Cache (Continued)

Strassen (continued)

T(n) = cost of multiplying n-by-n matrices
     = 7*T(n/2) + 18*(n/2)^2
     = O(n^(log2 7)) = O(n^2.81)

• Available in several libraries
• Up to several times faster if n is large enough (100s)
• Needs more memory than the standard algorithm
• Can be less accurate because of roundoff error
• Current world's record is O(n^2.376...)
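Why the recurrence gives that exponent (a standard unrolling; the additions shrink geometrically relative to the multiplies):

\[
T(n) = 7\,T(n/2) + 18\,(n/2)^2
\quad\Longrightarrow\quad
T(n) = \Theta\!\left(7^{\log_2 n}\right) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.81}\right),
\]

since each halving multiplies the number of subproblems by 7 while the addition work per level grows only by a factor of \(7/4\), so the multiply count dominates, and \(7^{\log_2 n} = n^{\log_2 7}\).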

Page 63: Lecture 6: Memory Hierarchy and Cache (Continued)

Summary

• Performance programming on uniprocessors requires
   – understanding of the memory system
      » levels, costs, sizes
   – understanding of fine-grained parallelism in the processor to produce a good instruction mix
• Blocking (tiling) is a basic approach that can be applied to many matrix algorithms
• Applies to uniprocessors and parallel processors
   – the technique works for any architecture, but choosing the blocksize b and other details depends on the architecture
• Similar techniques are possible on other data structures

Page 64: Lecture 6: Memory Hierarchy and Cache (Continued)

Summary: Memory Hierarchy

• Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs?
   – 1000X DRAM growth removed the controversy
• Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy
• Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean to compilers, data structures, algorithms?

Page 65: Lecture 6: Memory Hierarchy and Cache (Continued)

Performance = Effective Use of Memory Hierarchy

• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this
• Development of blocked algorithms important for performance

   BLAS                 Memory Refs   Flops   Flops / Memory Refs
   Level 1 (y = y+ax)       3n          2n          2/3
   Level 2 (y = y+Ax)       n^2         2n^2        2
   Level 3 (C = C+AB)       4n^2        2n^3        n/2

(Figure: Level 1, 2 & 3 BLAS, Mflop/s vs. order of vectors/matrices from 10 to 500, Intel PII 450 MHz.)

Page 66: Lecture 6: Memory Hierarchy and Cache (Continued)

Engineering: SUN Enterprise

• Proc + mem card - I/O card
   – 16 cards of either type
   – All memory accessed over bus, so symmetric
   – Higher bandwidth, higher latency bus

(Figure: Gigaplane bus, 256-bit data, 41-bit address, 83 MHz; CPU/mem cards each carry two processors with L1 and L2 caches plus a memory controller behind a bus interface; I/O cards carry a bus interface/switch with SBUS slots, 2 FiberChannel, 100bT, and SCSI.)

Page 67: Lecture 6: Memory Hierarchy and Cache (Continued)

Engineering: Cray T3E

• Scale up to 1024 processors, 480 MB/s links
• Memory controller generates request message for non-local references
• No hardware mechanism for coherence
   » SGI Origin etc. provide this

(Figure: a T3E node: processor with cache, local memory, memory controller and network interface, and a switch with X, Y, Z links into the 3D torus, plus external I/O.)

Page 68: Lecture 6: Memory Hierarchy and Cache (Continued)

Evolution of Message-Passing Machines

• Early machines: FIFO on each link
   – HW close to programming model
   – synchronous ops
   – topology central (hypercube algorithms)

(Figure: a 3-cube with nodes labeled 000 through 111.)

CalTech Cosmic Cube (Seitz, CACM, Jan. 1985)

Page 69: Lecture 6: Memory Hierarchy and Cache (Continued)

Diminishing Role of Topology

• Shift to general links
   – DMA, enabling non-blocking ops
      » buffered by system at destination until recv
   – store & forward routing
• Diminishing role of topology
   – any-to-any pipelined routing
   – node-network interface dominates communication time
   – simplifies programming
   – allows richer design space
      » grids vs. hypercubes
• Message cost over H hops (T0 = startup time, n = message size, B = link bandwidth, per-hop delay normalized to one time unit):

   store & forward: H x (T0 + n/B)
   vs.
   pipelined: T0 + H + n/B

Intel iPSC/1 -> iPSC/2 -> iPSC/860
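A worked comparison with illustrative numbers (my assumptions, not the slide's): H = 10 hops, T0 = 100, n/B = 1000 time units:

\[
\text{store \& forward: } H\,(T_0 + n/B) = 10\,(100 + 1000) = 11000,
\qquad
\text{pipelined: } T_0 + H + n/B = 100 + 10 + 1000 = 1110.
\]

With pipelined routing the hop count contributes almost nothing to the total, which is why topology stopped dominating the design space.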

Page 70: Lecture 6: Memory Hierarchy and Cache (Continued)

Example: Intel Paragon

(Figure: a Paragon node: two i860 processors with L1 caches and a DMA engine on a 64-bit, 50 MHz memory bus, with a memory controller, 4-way interleaved DRAM, and a network interface; nodes are connected by a 2D grid network, 8-bit links at 175 MHz, bidirectional, with a processing node attached to every switch.)

Sandia's Intel Paragon XP/S-based supercomputer

Page 71: Lecture 6: Memory Hierarchy and Cache (Continued)

Building on the mainstream: IBM SP-2

• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bw limited by I/O bus)

(Figure: an SP-2 node: Power2 CPU with L2 cache on the memory bus, memory controller with 4-way interleaved DRAM; the NIC, containing an i860, a DMA engine and its own DRAM, sits on the MicroChannel I/O bus; the general interconnection network is formed from 8-port switches.)

Page 72: Lecture 6: Memory Hierarchy and Cache (Continued)

Berkeley NOW

• 100 Sun Ultra2 workstations
• Intelligent network interface
   – proc + mem
• Myrinet network
   – 160 MB/s per link
   – 300 ns per hop

Page 73: Lecture 6: Memory Hierarchy and Cache (Continued)

Thanks

• These slides came in part from courses taught by the following people:
   – Kathy Yelick, UC Berkeley
   – Dave Patterson, UC Berkeley
   – Randy Katz, UC Berkeley
   – Craig Douglas, U of Kentucky
• Computer Architecture: A Quantitative Approach, Chapter 8, Hennessy and Patterson, Morgan Kaufmann Publishers