Understanding Application Scaling NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000 Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler {fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU Department of Electrical Engineering and Computer Science Computer Science Division University of California, Berkeley June 15 , 1998


Page 1: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Understanding Application Scaling

NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000

Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler

{fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU

Department of Electrical Engineering and Computer Science

Computer Science Division

University of California, Berkeley

June 15th, 1998

Page 2: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Introduction

NAS Parallel Benchmarks suite 2.2 (NPB) has been used widely to evaluate modern parallel systems

7 scientific benchmarks that represent the most common computation kernels

NPB is written on top of Message Passing Interface (MPI) for portability

NPB is a Constant Problem Size (CPS) scaling benchmark suite

This study focuses on understanding NPB scaling on both NOW and SGI Origin 2000

Page 3: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

[Figure: Speedup on NOW. x-axis: Nodes (1 to 31); y-axis: Speedup (0 to 40); series: lu, mg, sp, Ideal]

Motivation

Early study on NPB shows ideal speedup on NOW!

Scaling as good as T3D and better than SP-2

Per-node performance better than T3D, close to SP-2

[Figure: Speedup on SGI Origin 2000. x-axis: Nodes (1 to 31); y-axis: Speedup (0 to 40); series: lu, mg, sp, Ideal]

Submitted results for Origin 2000 show a spread

Page 4: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Presentation Outline

Hardware Configuration

Time Breakdown of the Applications

Communication Performance

Computation Performance

Conclusion

Page 5: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Hardware Configuration

SGI Origin 2000 (64 nodes)

MIPS R10000 processor, 195 MHz, 32KB/32KB L1

4MB external L2 cache per processor

16GB memory total

MPI performance: 13 µsec one-way latency, 150 MB/sec peak bandwidth, half-power at 8KB message size

Network Of Workstations (NOW)

UltraSPARC I processor, 167 MHz, 16KB/16KB L1

512KB external L2 cache per processor

128 MB memory per processor

MPI performance: 22 µsec one-way latency, 27 MB/sec peak bandwidth, half-power at 4KB message size
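The half-power point quoted for each machine is the message size at which delivered bandwidth reaches half of its peak. A minimal sketch of how such a point could be read off a measured bandwidth curve follows; the function name `half_power_size` and the sample curve are hypothetical and are not measurements from this study.

```python
def half_power_size(samples):
    """Return the message size at which bandwidth first reaches half of peak.

    samples: list of (message_size_bytes, bandwidth_mb_per_sec), sorted by size.
    Linear interpolation is used between the two bracketing samples.
    """
    peak = max(bw for _, bw in samples)
    target = peak / 2.0
    prev_size, prev_bw = samples[0]
    if prev_bw >= target:
        return float(prev_size)
    for size, bw in samples[1:]:
        if bw >= target:
            # Interpolate between the last sample below target and this one.
            frac = (target - prev_bw) / (bw - prev_bw)
            return prev_size + frac * (size - prev_size)
        prev_size, prev_bw = size, bw
    raise ValueError("bandwidth never reaches half of peak")


# Made-up curve for illustration only (NOT data from the slides).
curve = [(1024, 10.0), (4096, 40.0), (8192, 75.0),
         (65536, 140.0), (1048576, 150.0)]
print(half_power_size(curve))
```

With this illustrative curve the peak is 150 MB/sec, so the function looks for the first crossing of 75 MB/sec.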

Page 6: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Time Breakdown -- LU

[Figure: Time Breakdown of LU on NOW. x-axis: Processors (1, 2, 4, 8, 16, 32); y-axis: Time (seconds), 0 to 3000; series: Cumulative, Computation, Communication, Ideal]

Black line -- total running time, summed over all processors

A 10-sec job for a single man ideally requires 5 secs each for 2 men

Total amount of work is still 10 secs

A rising line means more work, due to the need for communication
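The cumulative-time view can be sketched in a few lines. This is only an illustration of the "10-sec job, 2 men" example on the slide; the function names are hypothetical, and the 6-second wall time in the second call is an invented value used to show non-ideal scaling.

```python
def cumulative_time(processors, wall_seconds):
    """Total time summed over all processors (the cumulative line)."""
    return processors * wall_seconds


def extra_work(processors, wall_seconds, single_seconds):
    """Work added by parallel execution, relative to the one-processor run."""
    return cumulative_time(processors, wall_seconds) - single_seconds


# Ideal case from the slide: a 10-sec job split across 2 workers, 5 secs each.
print(cumulative_time(2, 5.0))

# Hypothetical non-ideal case: each worker needs 6 secs instead of 5.
print(extra_work(2, 6.0, 10.0))
```

Under ideal constant-problem-size scaling the cumulative time stays flat as processors are added; any growth is overhead such as communication.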

Page 7: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Time Breakdown -- LU

[Figure: Time Breakdown of LU on Origin 2000. x-axis: Processors (1, 2, 4, 8, 16, 32); y-axis: Time (seconds), 0 to 3000; series: Cumulative, Computation, Communication, Ideal]

[Figure: Time Breakdown of LU on NOW. x-axis: Processors (1, 2, 4, 8, 16, 32); y-axis: Time (seconds), 0 to 3000; series: Cumulative, Computation, Communication, Ideal]

Page 8: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Time Breakdown -- SP

[Figure: Time Breakdown on NOW. x-axis: Processors (1, 4, 9, 16, 25); y-axis: Time (seconds), 0 to 3500; series: Cumulative, Computation, Communication, Ideal]

[Figure: Time Breakdown on SGI. x-axis: Processors (1, 4, 9, 16, 25); y-axis: Time (seconds), 0 to 3000; series: Cumulative, Computation, Communication, Ideal]

Page 9: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Communication Performance

Micro-benchmarks show that the SGI Origin 2000 has better pt2pt comm. performance than NOW

[Figure: MPI Pt2pt Latency (One-way). x-axis: Message Size (1 to 1E+06, log scale); y-axis: Latency (usec), 1 to 100000, log scale; series: Origin 2000, NOW]

[Figure: MPI Pt2pt Bandwidth (One-way). x-axis: Message Size (1 to 1,000,000, log scale); y-axis: MB/sec, 0 to 160; series: SGI, NOW, SGI 1/2 Power, NOW 1/2 Power]

Page 10: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Communication Efficiency

[Figure: Communication Efficiency. x-axis: Processors (0 to 40); y-axis: Efficiency (%), 0% to 100%; series: NOW-LU, SGI-LU, NOW-SP, SGI-SP]

Absolute bandwidths delivered are close

SP/32 on NOW -- 215s

SP/32 on SGI -- 289s

Comm. efficiency on SGI achieved only 30% of potential bandwidth

Protocol tradeoffs are pronounced

hand-shake vs. bulk-send in pt2pt

collective ops
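The slides do not spell out how communication efficiency is computed. One plausible reading, assumed here, is delivered bandwidth divided by the peak micro-benchmark bandwidth. The sketch below uses that assumption; the function name and the byte count and time in the example are hypothetical, while 150 MB/sec is the Origin 2000 peak from the hardware slide.

```python
def comm_efficiency(bytes_moved, comm_seconds, peak_mb_per_sec):
    """Delivered bandwidth as a fraction of peak micro-benchmark bandwidth.

    Assumed definition: (bytes moved / time spent communicating) / peak.
    """
    delivered_mb_per_sec = bytes_moved / comm_seconds / 1e6
    return delivered_mb_per_sec / peak_mb_per_sec


# Hypothetical run: 45 MB moved during 1 second of communication time,
# measured against a 150 MB/sec peak.
print(comm_efficiency(45e6, 1.0, 150.0))
```

A value well below 1.0 means the application sees far less bandwidth than the micro-benchmark peak suggests, which is the effect the slide reports for SGI.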

Page 11: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Computation Performance

Relative performance of the benchmarks on single node roughly close to the processor performance difference

        LU     SP
SGI   1373   1652
NOW   2469   2807

Both computational CPI and L2 misses change significantly on both platforms when scaled

                      LU    SP
CPI decrease         94%   93%
L2 misses decrease   25%   27%
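The single-node comparison can be reduced to a ratio per benchmark. The sketch below uses the four single-node values from the slide (their unit is not stated in the transcript, which does not matter for a ratio); the dictionary layout and function name are illustrative choices.

```python
# Single-node values as given on the slide (unit not stated in the transcript).
single_node = {
    "SGI": {"LU": 1373, "SP": 1652},
    "NOW": {"LU": 2469, "SP": 2807},
}


def now_to_sgi_ratio(benchmark):
    """How many times larger the NOW value is than the SGI value."""
    return single_node["NOW"][benchmark] / single_node["SGI"][benchmark]


for name in ("LU", "SP"):
    print(name, round(now_to_sgi_ratio(name), 2))
```

Both ratios come out similar across the two benchmarks, which fits the slide's point that the single-node gap tracks the difference between the two processors.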

Page 12: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Recap on CPS Scaling

[Figure: CPS scaling diagram; labels: 4, 8, 16, 32, 64, 128, 256]

Page 13: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

LU Working Set

4-processor Knee starts at 256KB

[Figure: LU working set. x-axis: Cache Size (KB), 1 to 10000, log scale; y-axis: Miss Rate (%), 0 to 14; series: 4-Node]

Page 14: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

LU Working Set

4-processor Knee starts at 256KB

[Figure: LU working set. x-axis: Cache Size (KB), 1 to 10000, log scale; y-axis: Miss Rate (%), 0 to 14; series: 4-Node, 8-Node]

8-processor Knee starts at 128KB

Page 15: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

LU Working Set

4-processor Knee starts at 256KB

8-processor Knee starts at 128KB

[Figure: LU working set. x-axis: Cache Size (KB), 1 to 10000, log scale; y-axis: Miss Rate (%), 0 to 14; series: 4-Node, 8-Node, 16-Node]

16-processor Knee starts at 64KB

Page 16: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

LU Working Set

4-processor Knee starts at 256KB

8-processor Knee starts at 128KB

16-processor Knee starts at 64KB

[Figure: LU working set. x-axis: Cache Size (KB), 1 to 10000, log scale; y-axis: Miss Rate (%), 0 to 14; series: 4-Node, 8-Node, 16-Node, 32-Node]

32-processor Knee starts at 32KB

Miss rate drops as the global cache grows from 2MB to 4MB
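The knee positions on these slides halve each time the processor count doubles, which is what constant-problem-size scaling would suggest if the per-processor working set shrinks in proportion to 1/P. A minimal sketch of that relation follows, assuming the inverse-proportional model and using the 4-processor knee (256KB) as the reference point; the function name is hypothetical.

```python
def knee_kb(processors, base_processors=4, base_knee_kb=256):
    """Predicted working-set knee per processor under CPS scaling.

    Assumes the per-processor working set is inversely proportional to the
    processor count, anchored at the 4-processor knee from the slides.
    """
    return base_knee_kb * base_processors / processors


for p in (4, 8, 16, 32):
    print(p, knee_kb(p))
```

The predicted values for 8, 16, and 32 processors can be compared directly against the knees quoted on the slides.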

Page 17: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

SP Working Set

[Figure: SP working set. x-axis: Cache Size (KB), 1 to 10000, log scale; y-axis: Miss Rate (%), 0 to 25; series: 4-Node, 8-Node, 16-Node, 32-Node; annotations: Cost, Benefit]

Cost under scaling: extra work worsens the memory system's performance

Total memory references on SGI

4-processor run has 64.38 billion memory references

25-processor run has 72.35 billion memory references

12.38% increase
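The quoted increase follows directly from the two reference counts on the slide. A short check, with a hypothetical helper name:

```python
def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Memory references in billions, as given on the slide for SP on SGI.
refs_4_proc = 64.38
refs_25_proc = 72.35
print(round(percent_increase(refs_4_proc, refs_25_proc), 2))
```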

Page 18: Understanding Application Scaling NAS Parallel Benchmarks 2.2 on  NOW and SGI Origin 2000

Conclusion

NPB

comm. performance is hard to predict from µ-benchmarks

global cache increases effectively reduce comp. time

sequential node arch. is a dominant factor in NPB perf.

NOW

an inexpensive way to go parallel

absolute performance is excellent

MPI on NOW has good scalability and performance

NOW vs. proprietary system -- detailed instrumentation ability

Speedup cannot tell the whole story; scalability involves:

the interplay of program and machine scaling

delivered comm. performance, not µ-benchmarks

complicated memory system performance