Lecture 6: Memory Hierarchy and Cache (Continued)


Page 1: Lecture 6: Memory Hierarchy and Cache (Continued)

Lecture 6: Memory Hierarchy and Cache (Continued)

Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

Cache: A safe place for hiding and storing things. Webster’s New World Dictionary (1976)

Page 2: Lecture 6: Memory Hierarchy and Cache (Continued)

Homework Assignment

• Implement, in Fortran or C, the six different ways to perform matrix multiplication by interchanging the loops. (Use 64-bit arithmetic.) Make each implementation a subroutine, like:
   – subroutine ijk ( a, m, n, lda, b, k, ldb, c, ldc )
   – subroutine ikj ( a, m, n, lda, b, k, ldb, c, ldc )
   – …
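For orientation, here is a minimal C sketch of one of the six, the ijk variant (the other five follow by permuting the loops). The interpretation of the arguments, A as m-by-k with leading dimension lda, B as k-by-n, C as m-by-n, all stored by column, is an assumption; the assignment leaves the conventions to you.

   /* Sketch of the ijk variant (assumed column-major storage and the
      argument interpretation described above; 64-bit arithmetic = double). */
   void ijk(const double *a, int m, int n, int lda,
            const double *b, int k, int ldb,
            double *c, int ldc)
   {
       for (int i = 0; i < m; i++)
           for (int j = 0; j < n; j++)
               for (int p = 0; p < k; p++)   /* p plays the role of the k loop index */
                   c[i + j*ldc] += a[i + p*lda] * b[p + j*ldb];
   }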

Page 3: Lecture 6: Memory Hierarchy and Cache (Continued)

6 Variations of Matrix Multiply

for _ = 1:n
   for _ = 1:n
      for _ = 1:n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      end
   end
end


Page 11: Lecture 6: Memory Hierarchy and Cache (Continued)

6 Variations of Matrix Multiply

The six orderings of the loop nest: ijk, ikj, kij, kji, jki, jik. Each computes C(i,j) = C(i,j) + A(i,k)*B(k,j); they differ only in which index varies fastest, and therefore in the stride of their memory accesses. Which orderings run well depends on the storage order: Fortran stores arrays by column, C by row.

However, this is only part of the story.

Page 12: Lecture 6: Memory Hierarchy and Cache (Continued)

SUN Ultra 2, 200 MHz (L1 = 16 KB, L2 = 1 MB)

(Figure: Mflop/s vs. matrix order for the six loop orderings ijk, ikj, jik, jki, kij, kji and the vendor-tuned dgemm.)

Page 13: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrices in Cache

• L1 cache 16 KB: 16 KB / 8 bytes per word holds 2K doubles, so a 45x45 matrix fits in L1
• L2 cache 1 MB: 1 MB / 8 holds 128K doubles, so a 362x362 matrix fits in L2


Page 15: Lecture 6: Memory Hierarchy and Cache (Continued)

Optimizing Matrix Addition for Caches

• Dimension A(n,n), B(n,n), C(n,n)
• A, B, C stored by column (as in Fortran)
• Algorithm 1: for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j)
• Algorithm 2: for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)
• What is the memory access pattern for Algorithms 1 and 2?
• Which is faster?
• What if A, B, C are stored by row (as in C)? (See the sketch below.)
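A minimal C sketch of the two algorithms (not from the slides; N is an arbitrary illustrative size):

   /* The slide's two algorithms, written in C. C stores arrays by row,
      so here Algorithm 1 (j innermost) walks memory with stride 1 while
      Algorithm 2 (i innermost) jumps a whole row of N doubles per access;
      in Fortran (column-major) the roles are exactly reversed. */
   #define N 512
   static double A[N][N], B[N][N], C[N][N];

   void add_alg1(void)   /* unit-stride inner loop: cache friendly in C */
   {
       for (int i = 0; i < N; i++)
           for (int j = 0; j < N; j++)
               A[i][j] = B[i][j] + C[i][j];
   }

   void add_alg2(void)   /* stride-N inner loop: roughly one miss per access for large N */
   {
       for (int j = 0; j < N; j++)
           for (int i = 0; i < N; i++)
               A[i][j] = B[i][j] + C[i][j];
   }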

Page 16: Lecture 6: Memory Hierarchy and Cache (Continued)

Using a Simpler Model of Memory to Optimize

• Assume just 2 levels in the hierarchy, fast and slow
• All data initially in slow memory
   – m = number of memory elements (words) moved between fast and slow memory
   – tm = time per slow memory operation
   – f = number of arithmetic operations
   – tf = time per arithmetic operation, tf < tm
   – q = f/m = average number of flops per slow element access
• Minimum possible time = f*tf, when all data is in fast memory
• Actual time = f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q))
• Larger q means time closer to the minimum f*tf

Page 17: Lecture 6: Memory Hierarchy and Cache (Continued)

Simple Example Using the Memory Model

• To see the effect of changing q, consider this simple computation:

   s = 0
   for i = 1 to n
      s = s + h(X[i])

• Assume tf = 1, i.e., arithmetic on data in fast memory runs at 1 Mflop/s
• Assume moving a data word costs tm = 10
• Assume h takes q flops
• Assume array X is in slow memory
• So m = n and f = q*n
• Time = read X + compute = 10*n + q*n
• Mflop/s = f/time = q/(10 + q)
• As q increases, this approaches the "peak" speed of 1 Mflop/s
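A few values of this model, tabulated by a short C loop (purely illustrative; tf = 1 and tm = 10 are the assumptions above):

   #include <stdio.h>

   /* Tabulate the two-level model Mflop/s = q / (tm/tf + q)
      for several flop-per-word ratios q, with tf = 1 and tm = 10. */
   int main(void)
   {
       const double tm_over_tf = 10.0;
       for (int q = 1; q <= 128; q *= 2)
           printf("q = %3d  ->  %5.2f Mflop/s\n", q, q / (tm_over_tf + q));
       return 0;
   }

For q = 1 the model predicts 0.09 Mflop/s; even q = 128 reaches only 0.93. That is the point of the slide: without data reuse, memory speed, not arithmetic speed, dominates.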

Page 18: Lecture 6: Memory Hierarchy and Cache (Continued)

Simple Example (continued)

• Algorithm 1

   s1 = 0; s2 = 0
   for j = 1 to n
      s1 = s1 + h1(X(j))
      s2 = s2 + h2(X(j))

• Algorithm 2

   s1 = 0; s2 = 0
   for j = 1 to n
      s1 = s1 + h1(X(j))
   for j = 1 to n
      s2 = s2 + h2(X(j))

• Which is faster?

Page 19: Lecture 6: Memory Hierarchy and Cache (Continued)

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
   for (j = 0; j < N; j = j+1)
      a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
   for (j = 0; j < N; j = j+1)
      d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
   for (j = 0; j < N; j = j+1) {
      a[i][j] = 1/b[i][j] * c[i][j];
      d[i][j] = a[i][j] + c[i][j];
   }

Two misses per access to a and c vs. one miss per access; fusing the loops improves temporal locality (a[i][j] and c[i][j] are reused while still in cache).

Page 20: Lecture 6: Memory Hierarchy and Cache (Continued)

Optimizing Matrix Multiply for Caches

• Several techniques for making this run faster on modern processors
   – heavily studied
• Some optimizations are done automatically by the compiler, but you can do much better
• In general, you should use optimized libraries (often supplied by the vendor) for this and other very common linear algebra operations
   – BLAS = Basic Linear Algebra Subroutines
• Other algorithms you may want are not going to be supplied by the vendor, so you need to know these techniques

Page 21: Lecture 6: Memory Hierarchy and Cache (Continued)

Warm up: Matrix-vector multiplication y = y + A*x

for i = 1:n
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)

(Figure: y(i) = y(i) + A(i,:) * x(:), i.e., each element of y accumulates the dot product of row i of A with x.)

Page 22: Lecture 6: Memory Hierarchy and Cache (Continued)

Warm up: Matrix-vector multiplication y = y + A*x

{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
   {read row i of A into fast memory}
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}

• m = number of slow memory refs = 3*n + n^2
• f = number of arithmetic operations = 2*n^2
• q = f/m ~= 2
• Matrix-vector multiplication is limited by slow memory speed
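The same loop nest as plain C, for reference (a sketch; the row-major storage and the function name are mine, not the slide's):

   #include <stddef.h>

   /* y = y + A*x for an n-by-n matrix A stored by row. Each element of A
      is touched exactly once, so f/m ~ 2 regardless of loop order. */
   void matvec(int n, const double *A, const double *x, double *y)
   {
       for (int i = 0; i < n; i++) {
           double yi = y[i];               /* keep y(i) in a register */
           for (int j = 0; j < n; j++)
               yi += A[(size_t)i*n + j] * x[j];
           y[i] = yi;
       }
   }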

Page 23: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply C = C + A*B

for i = 1 to n
   for j = 1 to n
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)

(Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j).)

Page 24: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply C = C + A*B (unblocked, or untiled)

for i = 1 to n
   {read row i of A into fast memory}
   for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}

Page 25: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply (unblocked, or untiled)

Number of slow memory references on unblocked matrix multiply:

   m = n^3        read each column of B n times
     + n^2        read each row of A once
     + 2*n^2      read and write each element of C once
     = n^3 + 3*n^2

So q = f/m = 2*n^3 / (n^3 + 3*n^2) ~= 2 for large n: no improvement over matrix-vector multiply.

(q = ops per slow memory reference)

Page 26: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply (blocked, or tiled)

Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize.

for i = 1 to N
   for j = 1 to N
      {read block C(i,j) into fast memory}
      for k = 1 to N
         {read block A(i,k) into fast memory}
         {read block B(k,j) into fast memory}
         C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}

(Figure: block C(i,j) accumulates A(i,k) * B(k,j).)
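A C sketch of the tiled loop nest (row-major storage assumed; for brevity the blocksize b is assumed to divide n evenly):

   #include <stddef.h>

   /* Blocked C = C + A*B on n-by-n row-major matrices with blocksize b.
      The three innermost loops are the "matrix multiply on blocks". */
   void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
   {
       for (int i = 0; i < n; i += b)
           for (int j = 0; j < n; j += b)
               for (int k = 0; k < n; k += b)
                   /* C[i:i+b, j:j+b] += A[i:i+b, k:k+b] * B[k:k+b, j:j+b] */
                   for (int ii = i; ii < i + b; ii++)
                       for (int jj = j; jj < j + b; jj++) {
                           double cij = C[(size_t)ii*n + jj];
                           for (int kk = k; kk < k + b; kk++)
                               cij += A[(size_t)ii*n + kk] * B[(size_t)kk*n + jj];
                           C[(size_t)ii*n + jj] = cij;
                       }
   }

Choosing b so that three b-by-b blocks fit in cache (3*b^2 <= M) is exactly the limit derived on the next slide.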

Page 27: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply (blocked or tiled)

Why is this algorithm correct?

Number of slow memory references on blocked matrix multiply:

   m = N*n^2      read each block of B N^3 times (N^3 * n/N * n/N)
     + N*n^2      read each block of A N^3 times
     + 2*n^2      read and write each block of C once
     = (2*N + 2) * n^2

So q = f/m = 2*n^3 / ((2*N + 2) * n^2) ~= n/N = b for large n.

So we can improve performance by increasing the blocksize b. This can be much faster than matrix-vector multiply (q = 2).

Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3).

Theorem (Hong and Kung, 1981): any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M)).

(q = ops per slow memory reference)

Page 28: Lecture 6: Memory Hierarchy and Cache (Continued)

Model

• As much as possible will be overlapped
• Dot Product:

   ACC = 0
   do i = 1, n
      ACC = ACC + x(i)*y(i)
   end do

• Experiments done on an IBM RS6000/530
   – 25 MHz
   – 2-cycle FMA (fused multiply-add) that can be pipelined
      » => 50 Mflop/s peak
   – one cycle from cache

Page 29: Lecture 6: Memory Hierarchy and Cache (Continued)

DOT Operation - Data in Cache

   DO 10 I = 1, n
      T = T + X(I)*Y(I)
10 CONTINUE

• Theoretically, 2 loads for X(I) and Y(I), one FMA operation, no re-use of data
• Pseudo-assembler:

      LOAD fp0,T
   label:
      LOAD fp1,X(I)
      LOAD fp2,Y(I)
      FMA fp0,fp0,fp1,fp2
      BRANCH label:

(Pipeline diagram: the loads of x and y overlap with the FMA. With two loads needed per FMA, one FMA (2 flops) completes every two cycles, giving 25 Mflop/s.)

Page 30: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix-Vector Product

• DOT version:

   DO 20 I = 1, M
      DO 10 J = 1, N
         Y(I) = Y(I) + A(I,J)*X(J)
10    CONTINUE
20 CONTINUE

• From cache: 22.7 Mflop/s
• From memory: 12.4 Mflop/s

Page 31: Lecture 6: Memory Hierarchy and Cache (Continued)

Loop Unrolling

   DO 20 I = 1, M, 2
      T1 = Y(I)
      T2 = Y(I+1)
      DO 10 J = 1, N
         T1 = T1 + A(I,J)*X(J)
         T2 = T2 + A(I+1,J)*X(J)
10    CONTINUE
      Y(I) = T1
      Y(I+1) = T2
20 CONTINUE

• 3 loads, 4 flops per inner iteration
• Speed of y = y + A^T*x, N = 48 (Mflop/s):

   Depth      1     2     3     4     limit
   Theory     25    33.3  37.5  40    50
   Measured   22.7  30.5  34.3  36.5
   Memory     12.4  12.7  12.7  12.6

• unroll 1: 2 loads, 2 ops per 2 cycles
• unroll 2: 3 loads, 4 ops per 3 cycles
• unroll 3: 4 loads, 6 ops per 4 cycles
• …
• unroll n: n+1 loads, 2n ops per n+1 cycles
• problem: only so many registers

Page 32: Lecture 6: Memory Hierarchy and Cache (Continued)

Matrix Multiply

• DOT version - 25 Mflop/s in cache

   DO 30 J = 1, M
      DO 20 I = 1, M
         DO 10 K = 1, L
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
10       CONTINUE
20    CONTINUE
30 CONTINUE

Page 33: Lecture 6: Memory Hierarchy and Cache (Continued)

How to Get Near Peak

   DO 30 J = 1, M, 2
      DO 20 I = 1, M, 2
         T11 = C(I,  J  )
         T12 = C(I,  J+1)
         T21 = C(I+1,J  )
         T22 = C(I+1,J+1)
         DO 10 K = 1, L
            T11 = T11 + A(I,  K)*B(K,J  )
            T12 = T12 + A(I,  K)*B(K,J+1)
            T21 = T21 + A(I+1,K)*B(K,J  )
            T22 = T22 + A(I+1,K)*B(K,J+1)
10       CONTINUE
         C(I,  J  ) = T11
         C(I,  J+1) = T12
         C(I+1,J  ) = T21
         C(I+1,J+1) = T22
20    CONTINUE
30 CONTINUE

• Inner loop: 4 loads, 8 operations - optimal
• In practice we have measured 48.1 out of a peak of 50 Mflop/s when in cache

Page 34: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS -- Introduction

• Clarity: code is shorter and easier to read
• Modularity: gives the programmer larger building blocks
• Performance: manufacturers will provide tuned machine-specific BLAS
• Program portability: machine dependencies are confined to the BLAS

Page 35: Lecture 6: Memory Hierarchy and Cache (Continued)

Memory Hierarchy

   Registers -> L1 Cache -> L2 Cache -> Local Memory -> Remote Memory -> Secondary Memory

• Key to high performance is effective use of the memory hierarchy
• True on all architectures

Page 36: Lecture 6: Memory Hierarchy and Cache (Continued)

Level 1, 2 and 3 BLAS

• Level 1 BLAS: vector-vector operations
• Level 2 BLAS: matrix-vector operations
• Level 3 BLAS: matrix-matrix operations

(Figure: schematic shapes of the three operation classes.)

Page 37: Lecture 6: Memory Hierarchy and Cache (Continued)

More on BLAS (Basic Linear Algebra Subroutines)

• Industry standard interface (evolving)
• Vendors and others supply optimized implementations
• History
   – BLAS1 (1970s):
      » vector operations: dot product, saxpy (y = a*x + y), etc.
      » m = 2*n, f = 2*n, q ~ 1 or less
   – BLAS2 (mid 1980s):
      » matrix-vector operations: matrix-vector multiply, etc.
      » m = n^2, f = 2*n^2, q ~ 2; less overhead
      » somewhat faster than BLAS1
   – BLAS3 (late 1980s):
      » matrix-matrix operations: matrix-matrix multiply, etc.
      » m >= 4*n^2, f = O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
• Good algorithms use BLAS3 when possible (LAPACK)
• www.netlib.org/blas, www.netlib.org/lapack
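In code, "use an optimized library" usually means a single gemm call instead of a hand-written triple loop. A sketch against the standard CBLAS interface (this assumes some CBLAS implementation, e.g. ATLAS or a vendor BLAS, is installed and linked):

   #include <cblas.h>

   /* C = 1.0*A*B + 1.0*C with n-by-n matrices, row-major storage. */
   void matmul_blas(int n, const double *A, const double *B, double *C)
   {
       cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                   n, n, n,        /* M, N, K       */
                   1.0, A, n,      /* alpha, A, lda */
                   B, n,           /* B, ldb        */
                   1.0, C, n);     /* beta, C, ldc  */
   }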

Page 38: Lecture 6: Memory Hierarchy and Cache (Continued)

Why Higher Level BLAS?

• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this

   BLAS                 Memory Refs   Flops   Flops / Memory Refs
   Level 1 (y = y+ax)       3n          2n          2/3
   Level 2 (y = y+Ax)       n^2         2n^2        2
   Level 3 (C = C+AB)       4n^2        2n^3        n/2

(Figure: the memory hierarchy: Registers, L1 Cache, L2 Cache, Local Memory, Remote Memory, Secondary Memory.)

Page 39: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS for Performance

• Development of blocked algorithms important for performance

(Figure: Mflop/s vs. order of vectors/matrices from 10 to 500 on an IBM RS/6000-590, 66 MHz, 264 Mflop/s peak: Level 3 BLAS on top, then Level 2 BLAS, then Level 1 BLAS.)

Page 40: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS for Performance

• Development of blocked algorithms important for performance

(Figure: Mflop/s vs. order of vectors/matrices from 10 to 500 on an Alpha EV 5/6, 500 MHz, 1 Gflop/s peak: Level 3 BLAS well above Level 2 and Level 1.)

BLAS 3 (n-by-n matrix-matrix multiply) vs. BLAS 2 (n-by-n matrix-vector multiply) vs. BLAS 1 (saxpy of n vectors)

Page 41: Lecture 6: Memory Hierarchy and Cache (Continued)

Fast linear algebra kernels: BLAS

• Simple linear algebra kernels such as matrix-matrix multiply
• More complicated algorithms can be built from these basic kernels
• The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines (BLAS)
• Early agreement on a standard interface (~1980)
• Led to portable libraries for vector and shared memory parallel machines
• On distributed memory, there is a less standard interface called the PBLAS

Page 42: Lecture 6: Memory Hierarchy and Cache (Continued)

Level 1 BLAS

• Operate on vectors or pairs of vectors
   – perform O(n) operations;
   – return either a vector or a scalar
• saxpy
   – y(i) = a * x(i) + y(i), for i = 1 to n
   – s stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex
• sscal: x = a * x, for scalar a and vector x
• sdot: computes s = Σ x(i)*y(i), summed over i = 1 to n
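The Level 1 routines are simple enough to state in a few lines of C. A sketch of the axpy pattern in double precision (the real BLAS interface also takes increment arguments incx and incy, omitted here):

   /* y = a*x + y on n-element vectors: the BLAS-1 axpy pattern (daxpy).
      2n flops against 3n memory references, so q = 2/3:
      bound by memory traffic, not arithmetic. */
   void daxpy(int n, double a, const double *x, double *y)
   {
       for (int i = 0; i < n; i++)
           y[i] = a * x[i] + y[i];
   }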

Page 43: Lecture 6: Memory Hierarchy and Cache (Continued)

Level 2 BLAS

• Operate on a matrix and a vector
   – return a matrix or a vector;
   – O(n^2) operations
• sgemv: matrix-vector multiply
   – y = y + A*x
   – where A is m-by-n, x is n-by-1 and y is m-by-1
• sger: rank-one update
   – A = A + y*x^T, i.e., A(i,j) = A(i,j) + y(i)*x(j)
   – where A is m-by-n, y is m-by-1, x is n-by-1
• strsv: triangular solve
   – solves y = T*x for x, where T is triangular

Page 44: Lecture 6: Memory Hierarchy and Cache (Continued)

Level 3 BLAS

• Operate on pairs or triples of matrices
   – returning a matrix;
   – complexity is O(n^3)
• sgemm: matrix-matrix multiplication
   – C = C + A*B
   – where C is m-by-n, A is m-by-k, and B is k-by-n
• strsm: multiple triangular solve
   – solves Y = T*X for X
   – where T is a triangular matrix and X is a rectangular matrix

Page 45: Lecture 6: Memory Hierarchy and Cache (Continued)

Optimizing in practice

• Tiling for registers
   – loop unrolling, use of named "register" variables
• Tiling for multiple levels of cache
• Exploiting fine-grained parallelism within the processor
   – superscalar
   – pipelining
• Complicated compiler interactions
• Hard to do by hand (but you'll try)
• Automatic optimization is an active research area
   – PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac
   – www.cs.berkeley.edu/~iyer/asci_slides.ps
   – ATLAS: www.netlib.org/atlas/index.html

Page 46: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS -- References

• BLAS software and documentation can be obtained via:
   – WWW: http://www.netlib.org/blas
   – (anonymous) ftp ftp.netlib.org: cd blas; get index
   – email [email protected] with the message: send index from blas
• Comments and questions can be addressed to: [email protected]

Page 47: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS Papers

• C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, 5:308-325, 1979.
• J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 14(1):1-32, 1988.
• J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 16(1):1-17, 1990.

Page 48: Lecture 6: Memory Hierarchy and Cache (Continued)

Performance of BLAS

• BLAS are specially optimized by the vendor
   – Sun BLAS uses features in the UltraSPARC
• Big payoff for algorithms that can be expressed in terms of BLAS3 instead of BLAS2 or BLAS1
• The top speed of the BLAS3 approaches the machine's peak
• Algorithms like Gaussian elimination are organized so that they use BLAS3

Page 49: Lecture 6: Memory Hierarchy and Cache (Continued)

How To Get Performance From Commodity Processors?

• Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning
• Routines have a large design space with many parameters
   – blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
   – complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors
• A few months ago there was no tuned BLAS for the Pentium under Linux
• Need for quick/dynamic deployment of optimized routines
• ATLAS - Automatically Tuned Linear Algebra Software
   – PhiPac from Berkeley

Page 50: Lecture 6: Memory Hierarchy and Cache (Continued)

Adaptive Approach for Level 3

(Figure: blocked GEMM, C (M-by-N) computed from A (M-by-K) and B (K-by-N) in NB-by-NB tiles.)

• Do a parameter study of the operation on the target machine, done once
• Only generated code is the on-chip multiply
• BLAS operation written in terms of the generated on-chip multiply
• All transpose cases coerced through data copy to 1 case of on-chip multiply
   – only 1 case generated per platform

Page 51: Lecture 6: Memory Hierarchy and Cache (Continued)

Code Generation Strategy

• Code is iteratively generated and timed until the optimal case is found. We try:
   – differing NBs
   – breaking false dependencies
   – M, N and K loop unrolling
• On-chip multiply optimizes for:
   – TLB access
   – L1 cache reuse
   – FP unit usage
   – memory fetch
   – register reuse
   – loop overhead minimization
• Takes a couple of hours to run

Page 52: Lecture 6: Memory Hierarchy and Cache (Continued)

500x500 Double Precision Matrix-Matrix Multiply Across Multiple Architectures

(Figure: bar chart of Mflop/s, vendor matrix multiply vs. ATLAS matrix multiply, on DEC Alpha 21164a-433, HP PA8000 180 MHz, HP 9000/735/125, IBM Power2-135, IBM PowerPC604e-332, Pentium MMX-150, Pentium Pro-200, Pentium II-266, SGI R4600, SGI R5000, SGI R8000ip21, SGI R10000ip27, Sun Microsparc II Model 70, Sun Darwin-270, and Sun Ultra2 Model 2200.)

Page 53: Lecture 6: Memory Hierarchy and Cache (Continued)

500x500 Double Precision LU Factorization Performance Across Multiple Architectures

(Figure: bar chart of Mflop/s, LU with vendor BLAS vs. LU with ATLAS & GEMM-based BLAS, on DCG LX 21164a-533, DEC Alpha 21164a-433, HP PA8000, IBM Power2-135, IBM PowerPC604e-332, Pentium Pro-200, Pentium II-266, SGI R5000, SGI R10000ip27, Sun Darwin-270, and Sun Ultra2 Model 2200.)

Page 54: Lecture 6: Memory Hierarchy and Cache (Continued)

500x500 gemm-based BLAS on SGI R10000ip28

(Figure: Mflop/s for DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM and DTRSM: vendor BLAS vs. ATLAS/SSBLAS vs. reference BLAS.)

Page 55: Lecture 6: Memory Hierarchy and Cache (Continued)

500x500 gemm-based BLAS on UltraSparc 2200

(Figure: Mflop/s by Level 3 BLAS routine, DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM: vendor BLAS vs. ATLAS/GEMM-based BLAS vs. reference BLAS.)

Page 56: Lecture 6: Memory Hierarchy and Cache (Continued)

Recursive Approach for Other Level 3 BLAS

• Recur down to the L1 cache block size
• Need a kernel at the bottom of the recursion
   – use the gemm-based kernel for portability (a sketch follows below)

(Figure: recursive TRMM, the triangular matrix split into successively smaller triangular and rectangular blocks.)
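To make the recursion concrete, here is a hedged C sketch of B := T*B for a lower-triangular T, splitting the problem until it fits an L1-sized kernel. The base kernel below is a plain triple loop standing in for the gemm-based kernel, and the threshold NB = 64 is illustrative, not tuned.

   #include <stddef.h>

   #define NB 64  /* illustrative L1 block size, not a tuned value */

   /* B := T*B, where T is n-by-n lower triangular and B is n-by-nrhs,
      both column-major with leading dimensions ldt and ldb.
      Splitting T = [T11 0; T21 T22] and B = [B1; B2] gives
        B2 := T22*B2 + T21*B1,   B1 := T11*B1,
      so B2 must be updated before B1 is overwritten. */
   void trmm_lower(int n, int nrhs, const double *T, int ldt,
                   double *B, int ldb)
   {
       if (n <= NB) {
           /* base kernel: in-place multiply, working bottom-up so that
              the rows of B still needed have not yet been overwritten */
           for (int j = 0; j < nrhs; j++)
               for (int i = n - 1; i >= 0; i--) {
                   double s = 0.0;
                   for (int k = 0; k <= i; k++)
                       s += T[i + (size_t)k*ldt] * B[k + (size_t)j*ldb];
                   B[i + (size_t)j*ldb] = s;
               }
           return;
       }
       int n1 = n / 2, n2 = n - n1;
       const double *T21 = T + n1;                   /* rows n1.., cols 0..n1-1 */
       const double *T22 = T + n1 + (size_t)n1*ldt;  /* trailing triangle */
       double *B1 = B, *B2 = B + n1;

       trmm_lower(n2, nrhs, T22, ldt, B2, ldb);      /* B2 := T22*B2 */
       for (int j = 0; j < nrhs; j++)                /* B2 += T21*B1 (a gemm) */
           for (int i = 0; i < n2; i++)
               for (int k = 0; k < n1; k++)
                   B2[i + (size_t)j*ldb] += T21[i + (size_t)k*ldt] * B1[k + (size_t)j*ldb];
       trmm_lower(n1, nrhs, T, ldt, B1, ldb);        /* B1 := T11*B1 */
   }

In a real library the gemm update and the base kernel would both be calls into the tuned on-chip multiply; that is the portability argument on this slide.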

Page 57: Lecture 6: Memory Hierarchy and Cache (Continued)

500x500 Level 2 BLAS DGEMV

(Figure: Mflop/s across architectures for the no-transpose DGEMV: vendor BLAS vs. ATLAS vs. the reference Fortran 77 implementation.)

Page 58: Lecture 6: Memory Hierarchy and Cache (Continued)

Multi-Threaded DGEMM, Intel PIII 550 MHz

(Figure: Mflop/s vs. problem size: Intel BLAS 1 proc, ATLAS 1 proc, Intel BLAS 2 proc, ATLAS 2 proc.)

Page 59: Lecture 6: Memory Hierarchy and Cache (Continued)


ATLAS

• Keep a repository of kernels for specific machines.

• Develop a means of dynamically downloading code

• Extend work to allow sparse matrix operations

• Extend work to include arbitrary code segments

• See: http://www.netlib.org/atlas/

Page 60: Lecture 6: Memory Hierarchy and Cache (Continued)

BLAS Technical Forum: http://www.netlib.org/utk/papers/blast-forum.html

• Established a forum to consider expanding the BLAS in light of modern software, language, and hardware developments
• Minutes available from each meeting
• Working proposals for the following:
   – Dense/Band BLAS
   – Sparse BLAS
   – Extended Precision BLAS
   – Distributed Memory BLAS
   – C and Fortran90 interfaces to Legacy BLAS

Page 61: Lecture 6: Memory Hierarchy and Cache (Continued)

Strassen's Matrix Multiply

• The traditional algorithm (with or without tiling) has O(n^3) flops
• Strassen discovered an algorithm with asymptotically lower flop count: O(n^2.81)
• Consider a 2x2 matrix multiply, normally 8 multiplies; Strassen does it in 7:

Let M = [m11 m12] = [a11 a12] * [b11 b12]
        [m21 m22]   [a21 a22]   [b21 b22]

Let p1 = (a12 - a22) * (b21 + b22)      p5 = a11 * (b12 - b22)
    p2 = (a11 + a22) * (b11 + b22)      p6 = a22 * (b21 - b11)
    p3 = (a11 - a21) * (b11 + b12)      p7 = (a21 + a22) * b11
    p4 = (a11 + a12) * b22

Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7

Extends to n-by-n matrices by divide and conquer.
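One level of the recursion, written out in C for scalar 2x2 blocks exactly as in the formulas above (illustrative only; a real implementation recurses on n/2-by-n/2 blocks and switches to the standard algorithm below a cutoff size):

   /* One Strassen step for a 2x2 multiply: 7 multiplies instead of 8.
      The p1..p7 and m11..m22 formulas are the ones given above. */
   void strassen_2x2(const double a[2][2], const double b[2][2], double m[2][2])
   {
       double p1 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
       double p2 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
       double p3 = (a[0][0] - a[1][0]) * (b[0][0] + b[0][1]);
       double p4 = (a[0][0] + a[0][1]) * b[1][1];
       double p5 = a[0][0] * (b[0][1] - b[1][1]);
       double p6 = a[1][1] * (b[1][0] - b[0][0]);
       double p7 = (a[1][0] + a[1][1]) * b[0][0];

       m[0][0] = p1 + p2 - p4 + p6;
       m[0][1] = p4 + p5;
       m[1][0] = p6 + p7;
       m[1][1] = p2 - p3 + p5 - p7;
   }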

Page 62: Lecture 6: Memory Hierarchy and Cache (Continued)

Strassen (continued)

T(n) = cost of multiplying n-by-n matrices
     = 7*T(n/2) + 18*(n/2)^2
     = O(n^(log2 7)) = O(n^2.81)

• Available in several libraries
• Up to several times faster if n is large enough (100s)
• Needs more memory than the standard algorithm
• Can be less accurate because of roundoff error
• Current world's record is O(n^2.376...)
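Why the recurrence gives that exponent (a standard unrolling; the additions shrink geometrically relative to the multiplies):

\[
T(n) = 7\,T(n/2) + 18\,(n/2)^2
\quad\Longrightarrow\quad
T(n) = \Theta\!\left(7^{\log_2 n}\right) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.81}\right),
\]

since each halving multiplies the number of subproblems by 7 while the addition work per level grows only by a factor of \(7/4\), so the multiply count dominates, and \(7^{\log_2 n} = n^{\log_2 7}\).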

Page 63: Lecture 6: Memory Hierarchy and Cache (Continued)

Summary

• Performance programming on uniprocessors requires
   – understanding of the memory system
      » levels, costs, sizes
   – understanding of fine-grained parallelism in the processor to produce a good instruction mix
• Blocking (tiling) is a basic approach that can be applied to many matrix algorithms
• Applies to uniprocessors and parallel processors
   – the technique works for any architecture, but choosing the blocksize b and other details depends on the architecture
• Similar techniques are possible on other data structures

Page 64: Lecture 6: Memory Hierarchy and Cache (Continued)

Summary: Memory Hierarchy

• Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs?
   – 1000X DRAM growth removed the controversy
• Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy
• Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean to compilers, data structures, algorithms?

Page 65: Lecture 6: Memory Hierarchy and Cache (Continued)

Performance = Effective Use of Memory Hierarchy

• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this
• Development of blocked algorithms important for performance

   BLAS                 Memory Refs   Flops   Flops / Memory Refs
   Level 1 (y = y+ax)       3n          2n          2/3
   Level 2 (y = y+Ax)       n^2         2n^2        2
   Level 3 (C = C+AB)       4n^2        2n^3        n/2

(Figure: Level 1, 2 & 3 BLAS, Mflop/s vs. order of vectors/matrices from 10 to 500, Intel PII 450 MHz.)

Page 66: Lecture 6: Memory Hierarchy and Cache (Continued)

Engineering: SUN Enterprise

• Proc + mem card - I/O card
   – 16 cards of either type
   – All memory accessed over bus, so symmetric
   – Higher bandwidth, higher latency bus

(Figure: Gigaplane bus, 256-bit data, 41-bit address, 83 MHz; CPU/mem cards each carry two processors with L1 and L2 caches plus a memory controller behind a bus interface; I/O cards carry a bus interface/switch with SBUS slots, 2 FiberChannel, 100bT, and SCSI.)

Page 67: Lecture 6: Memory Hierarchy and Cache (Continued)

Engineering: Cray T3E

• Scale up to 1024 processors, 480 MB/s links
• Memory controller generates request message for non-local references
• No hardware mechanism for coherence
   » SGI Origin etc. provide this

(Figure: a T3E node: processor with cache, local memory, memory controller and network interface, and a switch with X, Y, Z links into the 3D torus, plus external I/O.)

Page 68: Lecture 6: Memory Hierarchy and Cache (Continued)

Evolution of Message-Passing Machines

• Early machines: FIFO on each link
   – HW close to programming model
   – synchronous ops
   – topology central (hypercube algorithms)

(Figure: a 3-cube with nodes labeled 000 through 111.)

CalTech Cosmic Cube (Seitz, CACM, Jan. 1985)

Page 69: Lecture 6: Memory Hierarchy and Cache (Continued)

Diminishing Role of Topology

• Shift to general links
   – DMA, enabling non-blocking ops
      » buffered by system at destination until recv
   – store & forward routing
• Diminishing role of topology
   – any-to-any pipelined routing
   – node-network interface dominates communication time
   – simplifies programming
   – allows richer design space
      » grids vs. hypercubes
• Message cost over H hops (T0 = startup time, n = message size, B = link bandwidth, per-hop delay normalized to one time unit):

   store & forward: H x (T0 + n/B)
   vs.
   pipelined: T0 + H + n/B

Intel iPSC/1 -> iPSC/2 -> iPSC/860
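A worked comparison with illustrative numbers (my assumptions, not the slide's): H = 10 hops, T0 = 100, n/B = 1000 time units:

\[
\text{store \& forward: } H\,(T_0 + n/B) = 10\,(100 + 1000) = 11000,
\qquad
\text{pipelined: } T_0 + H + n/B = 100 + 10 + 1000 = 1110.
\]

With pipelined routing the hop count contributes almost nothing to the total, which is why topology stopped dominating the design space.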

Page 70: Lecture 6: Memory Hierarchy and Cache (Continued)

Example: Intel Paragon

(Figure: a Paragon node: two i860 processors with L1 caches and a DMA engine on a 64-bit, 50 MHz memory bus, with a memory controller, 4-way interleaved DRAM, and a network interface; nodes are connected by a 2D grid network, 8-bit links at 175 MHz, bidirectional, with a processing node attached to every switch.)

Sandia's Intel Paragon XP/S-based supercomputer

Page 71: Lecture 6: Memory Hierarchy and Cache (Continued)

Building on the mainstream: IBM SP-2

• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bw limited by I/O bus)

(Figure: an SP-2 node: Power2 CPU with L2 cache on the memory bus, memory controller with 4-way interleaved DRAM; the NIC, containing an i860, a DMA engine and its own DRAM, sits on the MicroChannel I/O bus; the general interconnection network is formed from 8-port switches.)

Page 72: Lecture 6: Memory Hierarchy and Cache (Continued)

Berkeley NOW

• 100 Sun Ultra2 workstations
• Intelligent network interface
   – proc + mem
• Myrinet network
   – 160 MB/s per link
   – 300 ns per hop

Page 73: Lecture 6: Memory Hierarchy and Cache (Continued)

Thanks

• These slides came in part from courses taught by the following people:
   – Kathy Yelick, UC Berkeley
   – Dave Patterson, UC Berkeley
   – Randy Katz, UC Berkeley
   – Craig Douglas, U of Kentucky
• Computer Architecture: A Quantitative Approach, Chapter 8, Hennessy and Patterson, Morgan Kaufmann Publishers