Memory System Performance of High End SMPs, PCs and Clusters of PCs

Page 1: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Eidgenössische Technische Hochschule Zürich

Ecole polytechnique fédérale de Zurich

Politecnico federale di Zurigo

Swiss Federal Institute of Technology Zurich

25th Annual International Symposium on Computer Architecture

7th Workshop on Scalable Shared Memory Multiprocessor

Memory System Performance of High End SMPs, PCs and Clusters of PCs

Ch. Kurmann, T. Stricker

Laboratory for Computer Systems

ETHZ - Swiss Institute of Technology

CH-8092 Zurich

Color Slides: http://www.cs.inf.ethz.ch/CoPs/isca98ws/

Page 2: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Memory Systems

Low End designs in PCs: extremely low cost, standard I/O interface

High End designs in “Killer” Workstations: well engineered memory systems, support for additional data streams, better I/O busses

Are Low End SMPs the universal compute nodes for parallel and distributed systems?

Page 3: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Contribution

The answer is probably the memory system performance.

How significant are the differences in memory system performance?

Limitations of Low End memory systems: for local computation (e.g. in scientific applications) and for inter-node communication (e.g. in databases)

Page 4: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Extended Copy Transfer Characterization

ECT is a method to characterize the performance of memory systems (ISCA95 and HPCA97).

Categories: access pattern, i.e. stride (spatial locality); working set (temporal locality)

Value: transfer bandwidth (for a large amount of data)

Same chart resulting from one microbenchmark: local and remote transfers, compute and communicate accesses
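
As an illustration of how one microbenchmark can fill such a chart, the following Python sketch sweeps a grid of working sets and strides and reports a load bandwidth per cell. It is not the authors' ECT code: the function names (`strided_load_sum`, `load_bandwidth`, `ect_grid`) and the `min_words` parameter are assumptions made for this example. Interpreter overhead dominates the timings, so only the structure of the sweep carries over, not the absolute numbers.

```python
import time
from array import array

WORD_BYTES = 8  # the slides measure strides between 64-bit words


def strided_load_sum(buf, stride):
    """Load every stride-th word of buf once and return the sum of the loads."""
    total = 0
    for i in range(0, len(buf), stride):
        total += buf[i]
    return total


def load_bandwidth(working_set_bytes, stride, min_words=200_000):
    """Time strided loads over one working set.

    Returns (MByte/s, words_loaded). min_words is a hypothetical knob that
    repeats the pass until enough words were loaded to get a usable timing.
    """
    n_words = max(working_set_bytes // WORD_BYTES, stride)
    buf = array("q", range(n_words))
    words_per_pass = len(range(0, n_words, stride))
    passes = max(1, min_words // words_per_pass)

    start = time.perf_counter()
    for _ in range(passes):
        strided_load_sum(buf, stride)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero

    loaded = passes * words_per_pass
    return loaded * WORD_BYTES / elapsed / 1e6, loaded


def ect_grid(working_sets, strides):
    """Return {(working_set_bytes, stride): MByte/s} for the whole grid."""
    return {
        (ws, s): load_bandwidth(ws, s)[0]
        for ws in working_sets
        for s in strides
    }
```

A call such as `ect_grid([4096, 1 << 20], [1, 2, 8])` returns one bandwidth value per (working set, stride) pair, which corresponds to one point of the surface in the charts.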

Page 5: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Measurement Problems

Some parameter combinations are hard to measure, even with carefully tuned C code:

Reduced performance for large strides and small working sets in L1 caches is a measurement artifact and not architecture related.

Compilers occasionally generate suboptimal instruction schedules for loads / stores.

Page 6: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Local Load Access: Pentium Pro PC

[Figure: load bandwidth (MByte/s, axis 0 to 600) as a function of working set (0.5 K to 16 M) and access pattern (stride between 64bit words, 1 to 192). Pentium Pro FX, one processor, 200 MHz. Labelled regions: L1, L2, DRAM.]

Page 7: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Local Load Access: SGI Origin

[Figure: load bandwidth (MByte/s, axis 0 to 1600) as a function of working set (0.5 K to 64 M) and access pattern (stride between 64bit words, 1 to 192). SGI Origin 10000, one processor, 195 MHz. Labelled regions: L1, L2.]

Page 8: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Local Load Access: DEC 8400

[Figure: load bandwidth (MByte/s, axis 0 to 1200) as a function of working set (0.5 K to 64 M) and access pattern (stride between 64bit words, 1 to 192). DEC Alpha 8400, one processor, 300 MHz. Labelled regions: L1, L2, L3.]

Page 9: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Local Load Access: Sun Enterprise

[Figure: load bandwidth (MByte/s, axis 0 to 700) as a function of working set (0.5 K to 16 M) and access pattern (stride between 64bit words, 1 to 192). Sun Ultra Enterprise, one Ultra SPARC II, 248 MHz. Labelled regions: L1, L2, DRAM.]

Page 10: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Local Load Access: SGI Cray T3E

[Figure: load bandwidth (MByte/s, axis 0 to 1200) as a function of working set (0.5 K to 16 M) and access pattern (stride between 64bit words, 1 to 192). Cray T3E, one processor, 300 MHz. Labelled regions: L1, L2, DRAM.]

Page 11: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Comparison - Local Access

[Figure: memory load bandwidth (MByte/s, axis 0 to 300) versus access pattern (stride between 64bit words, 1 to 192). Series: Pentium Pro, SGI Origin, DEC 8400, Sun Enterp., Cray T3E. The chart also carries the value label 450.]

Page 12: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Performance in an SMP setting

Copy bandwidth decreases for simultaneous access with 1, 2, 4 and 8 processors

Topics of interest:

small working sets in caches: performance remains the same

large working sets in memory: interesting differences

behavior for even/uneven strides

“Gather copy stream” (strided load / contiguous store)
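
To make the term concrete, here is a minimal Python sketch of a gather copy: the source is read at a fixed stride and the destination is written contiguously. The function name `gather_copy` and its parameters are assumptions for illustration and do not come from the slides; a real measurement would time such a loop in tuned native code.

```python
from array import array


def gather_copy(src, stride, count):
    """Copy `count` 64-bit words read at `stride` from src into a dense array.

    Loads are strided (src[0], src[stride], src[2*stride], ...),
    stores are contiguous (dst[0], dst[1], dst[2], ...).
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    if count > 0 and (count - 1) * stride >= len(src):
        raise ValueError("source too small for this stride and count")

    dst = array("q", [0]) * count  # contiguous destination buffer
    for j in range(count):
        dst[j] = src[j * stride]
    return dst
```

For example, `list(gather_copy(array("q", range(20)), 3, 5))` yields `[0, 3, 6, 9, 12]`. With a stride of 1 the gather copy degenerates into a plain contiguous copy.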

Page 13: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Local Copy: Pentium Pro SMP

[Figure: memory copy bandwidth (MByte/s, axis 0 to 50) versus access pattern (stride between 64bit words, 1 to 192). Series: one processor, two processors.]

Page 14: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Local Copy: SGI Origin CC-NUMA

[Figure: memory copy bandwidth (MByte/s, axis 0 to 140) versus access pattern (stride between 64bit words, 1 to 192). Series: 1 processor, 2 processors, 4 processors.]

Page 15: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Local Copy: DEC 8400 SMP

[Figure: memory copy bandwidth (MByte/s, axis 0 to 60) versus access pattern (stride between 64bit words, 1 to 64). Series: 1 processor, 4 processors.]

Page 16: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Local Copy: Sun Enterprise SMP

[Figure: memory copy bandwidth (MByte/s, axis 0 to 70) versus access pattern (stride between 64bit words, 1 to 192). Series: 1 processor, 2 processors, 4 processors, 8 processors.]

Page 17: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

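Remote in Parallel Computers

Parallel & Network Computers: SGI Cray T3E, SGI Origin, Clusters of PCs (CoPs)

Symmetric Multiprocessors: DEC 8400, Sun Enterprise, Pentium Pro SMPs

[Diagram: legend P = Processor, C = Caches, M = Memory. Parallel and network computers are drawn as nodes that each have their own P, C and M, joined by a network. Symmetric multiprocessors are drawn as processors with caches attached to shared memories over a bus/network.]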

Page 18: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Remote Transfers: CoPs Pentium Pro with SCI / Myrinet

[Figure: remote copy bandwidth (MByte/s, axis 0 to 80) versus access pattern (stride between 64bit words, 1 to 64). Series: local copy, remote copy by Myrinet, remote copy by SCI. The chart also carries the value label 128.]

Page 19: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Remote Transfers: SGI Origin

[Figure: remote copy bandwidth (MByte/s, axis 0 to 120) versus access pattern (stride between 64bit words, 1 to 64). Series: local copy, remote copy.]

Page 20: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Remote Transfers: DEC 8400

[Figure: memory load bandwidth (MByte/s, axis 0 to 160) versus access pattern (stride between 64bit words, 1 to 64). Series: local loads, remote loads.]

Page 21: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Remote Transfers: SGI Cray T3E

[Figure: memory load bandwidth (MByte/s, axis 0 to 200) versus access pattern (stride between 64bit words, 1 to 64). Series: local loads, remote loads.]

Page 22: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Comparison - Remote Transfers

[Figure: memory load bandwidth (MByte/s, axis 0 to 200) versus access pattern (stride between 64bit words, 1 to 64). Series: PPro-Myrinet, PPro-SCI, SGI Origin, DEC 8400, Cray T3E. The chart also carries the value label 350.]

Page 23: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Improvement of PC Chipsets

Intel 440 BX AGP Chip Set, 400 MHz / 100 MHz

Intel 440 LX AGP Chip Set, 233 MHz / 66 MHz

Intel 440 FX Natoma Chip Set, 200 MHz / 66 MHz

[Figure: memory copy bandwidth (MByte/s, axis 0 to 100) versus access pattern (stride between 64bit words, 1 to 192). Series: 440 FX, 440 LX, 440 BX.]

Page 24: Memory System Performance  of High End SMPs, PCs and  Clusters of PCs

Conclusion

ECT characterizations for different memory systems: T3E (MPP node), Origin (NUMA), DEC 8400 (SMP), CoPs (Intel P6 SMPs and clusters)

High End SMP vs. Low End SMP: less than half the performance on two-processor PCs.

Fast communication puts high demands on the memory system:

Unlike in traditional SMPs and CC-NUMAs, fine-grained remote accesses do not perform well at all in PC-SMPs and CoPs.

Adding more commodity microprocessors without reinforcing the memory system is therefore questionable.