
Understanding Application Scaling

NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000

Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler

{fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU

Department of Electrical Engineering and Computer Science

Computer Science Division

University of California, Berkeley

June 15th, 1998

Introduction

The NAS Parallel Benchmarks suite 2.2 (NPB) has been widely used to evaluate modern parallel systems

7 scientific benchmarks that represent the most common computation kernels

NPB is written on top of the Message Passing Interface (MPI) for portability

NPB is a Constant Problem Size (CPS) scaling benchmark suite

This study focuses on understanding NPB scaling on both NOW and the SGI Origin 2000
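Under CPS scaling the total problem size stays fixed while processors are added, so scaling is usually summarized by speedup and parallel efficiency. The following is a minimal sketch of those two definitions. The function names and the timing values are illustrative assumptions, not measurements from this study.

```python
def speedup(t_one_proc, t_p_procs):
    """Speedup of a fixed-size problem: single-processor time over P-processor time."""
    return t_one_proc / t_p_procs


def parallel_efficiency(t_one_proc, t_p_procs, procs):
    """Fraction of ideal (linear) speedup actually achieved on `procs` processors."""
    return speedup(t_one_proc, t_p_procs) / procs


# Hypothetical timings in seconds, for illustration only.
t1, t8 = 2400.0, 400.0
print(speedup(t1, t8))                 # 6.0
print(parallel_efficiency(t1, t8, 8))  # 0.75
```

An efficiency of 1.0 corresponds to the "Ideal" line in a speedup plot.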

Motivation

[Figure: Speedup on NOW. x-axis: Nodes (1 to 31); y-axis: Speedup (0 to 40); series: lu, mg, sp, Ideal.]

An early study of NPB shows ideal speedup on NOW!

Scaling is as good as the T3D and better than the SP-2

Per-node performance is better than the T3D and close to the SP-2

[Figure: Speedup on SGI Origin 2000. x-axis: Nodes (1 to 31); y-axis: Speedup (0 to 40); series: lu, mg, sp, Ideal.]

Submitted results for Origin 2000 show a spread

Presentation Outline

Hardware Configuration

Time Breakdown of the Applications

Communication Performance

Computation Performance

Conclusion

Hardware Configuration

SGI Origin 2000 (64 nodes)

MIPS R10000 processor, 195 MHz, 32KB/32KB L1

4MB external L2 cache per processor

16GB memory total

MPI performance: 13 µsec one-way latency, 150 MB/s peak, half-power at 8KB message size

Network Of Workstations (NOW)

UltraSPARC I processor, 167 MHz, 16KB/16KB L1

512KB external L2 cache per processor

128MB memory per processor

MPI performance: 22 µsec one-way latency, 27 MB/s peak, half-power at 4KB message size
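One way to read the half-power figures is sketched below. It assumes the common linear (Hockney-style) model, in which delivered bandwidth is peak * n / (n + n_half). The model choice, the function name, the reading of the peak figures as MB/s, and the 1 KB = 1024 bytes convention are all assumptions; only the peak and half-power values come from the configuration above. Measured curves need not follow this model exactly.

```python
def modeled_bandwidth(msg_bytes, peak_mb_s, half_power_bytes):
    """Bandwidth predicted by the linear model: half of peak at the half-power size."""
    return peak_mb_s * msg_bytes / (msg_bytes + half_power_bytes)


KB = 1024  # assumption: 1 KB = 1024 bytes

# Peak bandwidth (assumed MB/s) and half-power message size from the configuration above.
PLATFORMS = {
    "Origin 2000": (150.0, 8 * KB),
    "NOW": (27.0, 4 * KB),
}

for name, (peak, n_half) in PLATFORMS.items():
    for size in (1 * KB, n_half, 64 * KB):
        bw = modeled_bandwidth(size, peak, n_half)
        print(f"{name}: {size} bytes -> {bw:.1f} MB/s")
```

Under this model a platform with a larger half-power size needs larger messages before it delivers a given fraction of its peak.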

Time Breakdown -- LU

[Figure: Time Breakdown of LU on NOW. x-axis: Processors (1, 2, 4, 8, 16, 32); y-axis: Time (seconds), 0 to 3000; series: Cumulative, Computation, Communication, Ideal.]

Black line -- total running time

A job that takes a single man 10 secs ideally requires 5 secs each for 2 men; the total amount of work is still 10 secs

In practice there is more work, because of the need for communication
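The analogy above can be sketched in a few lines. This is a minimal illustration, assuming cumulative time means per-processor time multiplied by the number of processors. The function names and the 6-second figure are hypothetical; the 10-second and 5-second values are the ones from the analogy.

```python
def cumulative_time(per_proc_seconds, procs):
    """Total processor-seconds consumed: per-processor time summed over all processors."""
    return per_proc_seconds * procs


def extra_work(single_proc_seconds, per_proc_seconds, procs):
    """Processor-seconds spent beyond the single-processor run (e.g. communication)."""
    return cumulative_time(per_proc_seconds, procs) - single_proc_seconds


# Ideal case from the analogy: a 10-sec job split over 2 workers takes 5 secs each.
print(cumulative_time(5.0, 2))   # 10.0
print(extra_work(10.0, 5.0, 2))  # 0.0

# Hypothetical non-ideal case: each worker needs 6 secs once communication is included.
print(extra_work(10.0, 6.0, 2))  # 2.0
```

A flat cumulative line therefore corresponds to ideal scaling, and any rise above it is added work.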

[Figure: Time Breakdown of LU on Origin 2000. y-axis: Time (seconds), 0 to 3000; series: Cumulative, Computation, Communication, Ideal.]


Time Breakdown -- SP

[Figure: Time Breakdown on NOW. x-axis: Processors (1, 4, 9, 16, 25); y-axis: Time (seconds), 0 to 3500; series: Cumulative, Computation, Communication, Ideal.]

[Figure: Time Breakdown on SGI. x-axis: Processors (1, 4, 9, 16, 25); y-axis: Time (seconds), 0 to 3000; series: Cumulative, Computation, Communication, Ideal.]

Communication Performance

Micro-benchmarks show that the SGI Origin 2000 has better point-to-point (pt2pt) communication performance than NOW

[Figure: MPI Pt2pt Latency (One-way). x-axis: Message Size (1 to 1E+06, log scale); y-axis: Latency (usec), 1 to 100000, log scale; series: Origin 2000, NOW.]

[Figure: MPI Pt2pt Bandwidth (One-way). x-axis: Message Size (1 to 1,000,000, log scale); y-axis: MB/sec, 0 to 160; series: SGI, NOW, SGI 1/2 Power, NOW 1/2 Power.]

Communication Efficiency

[Figure: Communication Efficiency. x-axis: Processors (0 to 40); y-axis: Efficiency (%), 0% to 100%; series: NOW-LU, SGI-LU, NOW-SP, SGI-SP.]

Absolute bandwidth delivered is close on the two platforms

SP/32 on NOW -- 215s

SP/32 on SGI -- 289s

Communication on SGI achieved only 30% of potential bandwidth

Protocol tradeoffs are pronounced

hand-shake vs. bulk-send in pt2pt

collective ops
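A minimal sketch of how a communication efficiency figure could be computed, assuming it is defined as delivered bandwidth divided by potential bandwidth. The slides do not spell out how the potential bandwidth was determined, so this definition, the function name, and the 45 MB/s value are assumptions chosen for illustration.

```python
def communication_efficiency(delivered_mb_s, potential_mb_s):
    """Ratio of bandwidth delivered to the application to the potential bandwidth."""
    if potential_mb_s <= 0:
        raise ValueError("potential bandwidth must be positive")
    return delivered_mb_s / potential_mb_s


# Hypothetical example: 45 MB/s delivered against a 150 MB/s potential.
eff = communication_efficiency(45.0, 150.0)
print(f"{eff:.0%}")  # 30%
```

Two platforms can deliver similar absolute bandwidth while differing widely in efficiency, because the denominator differs.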

Computation Performance

Relative performance of the benchmarks on a single node is roughly close to the processor performance difference

        LU      SP
SGI     1373    1652
NOW     2469    2807

Both computational CPI and L2 misses change significantly on both platforms when scaled

                      LU     SP
CPI decrease          94%    93%
L2 misses decrease    25%    27%

Recap on CPS Scaling

[Figure: CPS scaling diagram; labels: 4, 8, 16, 32, 64, 128, 256.]

LU Working Set

[Figure: LU miss rate vs. cache size. x-axis: Cache Size (KB), 1 to 10000, log scale; y-axis: Miss Rate (%), 0 to 14; series added one per slide: 4-Node, 8-Node, 16-Node, 32-Node.]

4-processor: knee starts at 256KB

8-processor: knee starts at 128KB
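The idea of a working-set knee can be illustrated with a small cache simulation. This is a sketch under simplifying assumptions, not the method used to produce the curves: it models a fully associative LRU cache, uses a synthetic trace of repeated sweeps over a fixed working set, and measures sizes in abstract blocks rather than KB. All names and sizes are hypothetical.

```python
from collections import OrderedDict


def lru_miss_rate(trace, cache_blocks):
    """Miss rate of a fully associative LRU cache holding `cache_blocks` blocks."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used
    return misses / len(trace)


def sweep_trace(working_set_blocks, sweeps):
    """Synthetic trace: repeated sequential sweeps over a fixed working set."""
    return [b for _ in range(sweeps) for b in range(working_set_blocks)]


# Hypothetical sizes: the per-processor working set halves when processors double.
for procs, working_set in ((4, 256), (8, 128)):
    trace = sweep_trace(working_set, sweeps=10)
    rates = [lru_miss_rate(trace, size) for size in (64, 128, 256)]
    print(procs, rates)
```

In this toy model the miss rate stays high until the cache holds the whole working set and then drops sharply, and the drop moves to a smaller cache size as the per-processor working set shrinks.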


Miss rate drops when the global cache grows from 2MB to 4MB

[Figure: miss rate vs. cache size. x-axis: Cache Size (KB), 1 to 10000, log scale; y-axis: Miss Rate (%), 0 to 25; series: 4-Node, 8-Node, 16-Node, 32-Node.]

Cost under scaling: extra work worsens the memory system's performance

SP Working Set

Total memory references on SGI

4-processor run has 64.38 billion memory references

25-processor run has 72.35 billion memory references

12.38% increase
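The stated increase follows directly from the two reference counts. A short check, with a helper name chosen here for illustration:

```python
def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


refs_4_proc = 64.38   # billion memory references, 4-processor run
refs_25_proc = 72.35  # billion memory references, 25-processor run
print(f"{percent_increase(refs_4_proc, refs_25_proc):.2f}%")  # 12.38%
```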

[Figure labels: Cost, Benefit.]

Conclusion

NPB

comm. performance is hard to predict from µ-benchmarks

global cache increases effectively reduce comp. time

sequential node arch. is a dominant factor in NPB perf.

NOW

an inexpensive way to go parallel

absolute performance is excellent

MPI on NOW has good scalability and performance

NOW vs. proprietary system -- detailed instrumentation ability

Speedup cannot tell the whole story; scalability involves:

the interplay of program and machine scaling

delivered comm. performance, not µ-benchmarks

complicated memory system performance
