
Page 1: Tools for Engineering Analysis of High Performance Parallel Programs

Tools for Engineering Analysis of High Performance Parallel Programs

David Culler, Frederick Wong, Alan Mainwaring
Computer Science Division, U.C. Berkeley
http://www.cs.berkeley.edu/~culler/talks

LLNL ASCI III, 11/5/99

Page 2: Tools for Engineering Analysis of High Performance Parallel Programs


Traditional Parallel Programming Tools

• Focus on showing "what the program did" and "when it did it"
  – microscopic analysis of deterministic events
  – oriented towards initial development of small programs on small data sets and small machines
• Instrumentation – traces, counters, profiles
• Visualization
• Examples
  – AIMS, PTOOLS, PPP
  – Pablo + Paradyn + ... => Delphi
  – ACTS TAU – Tuning and Analysis Utilities

Page 3: Tools for Engineering Analysis of High Performance Parallel Programs


Example: Pablo

Page 4: Tools for Engineering Analysis of High Performance Parallel Programs


Beyond Zeroth-order Analysis

• Basic level: get to a system design that is reasonable and behaves properly under "ideal conditions"
• Subject the system to various stresses to understand its operating regime and gain deeper insight into its dynamic behavior
• Combine empirical data with analytical models
• Iterate
• From "What?" to "What if?"

[Figure: maximum displacement vs. wind speed]

Page 5: Tools for Engineering Analysis of High Performance Parallel Programs


Approach: Framework for Parameterized Sensitivity Analysis

• The framework performs analysis over numerous runs
  – statistical filtering
  – vary the parameter of interest
• Provides a means of combining data to isolate the effects of interest
=> ROBUSTNESS

[Diagram: a well-developed parallel program, a study parameter, a problem data set generator, instrumentation tools, and machine characterizers feed into visualization and modeling; parameters of interest include number of processors, communication performance, cache, scheduling, ...]
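The kind of driver such a framework implies can be sketched simply: loop over the parameter of interest, take several samples at each point, and log the timings for later statistical filtering. A minimal sketch in C, assuming a generic mpirun launcher and a hypothetical ./app benchmark binary (not the actual Berkeley framework):

/* Minimal sketch of a parameter-sweep driver: run a benchmark at several
 * machine sizes, take several samples per point, and log elapsed times for
 * later statistical filtering. "mpirun" and "./app" are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int sizes[] = {4, 8, 16, 32};              /* study parameter: processors */
    int samples = 5;                           /* repeated runs per point      */
    FILE *log = fopen("sweep.csv", "w");
    if (!log) return 1;
    fprintf(log, "procs,sample,seconds\n");

    for (int i = 0; i < 4; i++) {
        for (int s = 0; s < samples; s++) {
            char cmd[256];
            snprintf(cmd, sizeof cmd, "mpirun -np %d ./app", sizes[i]);

            time_t start = time(NULL);
            if (system(cmd) != 0)              /* drop failed runs */
                continue;
            fprintf(log, "%d,%d,%.0f\n",
                    sizes[i], s, difftime(time(NULL), start));
        }
    }
    fclose(log);
    return 0;
}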

Page 6: Tools for Engineering Analysis of High Performance Parallel Programs


Simplest Example: Performance( P )

• NPB 2.2 on the NOW cluster and an Origin 2000 (250 MHz)

[Charts: Origin speedup and Cluster speedup vs. machine size (processors) for BT, SP, LU, MG, FT, and IS, with the ideal speedup line]

Page 7: Tools for Engineering Analysis of High Performance Parallel Programs


Where Time is Spent ( P )

• Reveal basic processor and network loading (vs. P)
• Basis for model derivation – comm(P)

[Charts: LU on the Origin and on the Cluster, time (seconds) vs. machine size (processors), broken into Total, Comp, Comm, and Ideal]

Page 8: Tools for Engineering Analysis of High Performance Parallel Programs


Where Time is Spent ( P ), continued

• Reveal basic processor and network loading (vs. P)

[Charts: FT on the Cluster and on the Origin, time (seconds) vs. machine size (processors), broken into Total, Comp, Comm, and Ideal]

Page 9: Tools for Engineering Analysis of High Performance Parallel Programs


Communication Volume ( P )

[Charts: total communication volume (GB) and bytes per processor (MB) vs. machine size (processors) for BT, SP, LU, MG, FT, and IS]

Page 10: Tools for Engineering Analysis of High Performance Parallel Programs


Communication Structure ( P )

[Charts: normalized messages per processor and average message size (KB) vs. machine size (processors) for BT, SP, LU, MG, FT, and IS]
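Per-process message counts and volumes like these can be gathered without touching the application, for example through the MPI profiling (PMPI) interface. A minimal sketch that intercepts only MPI_Send (a real tool would cover the other send and collective calls as well); this is an illustration, not the instrumentation used in the talk:

/* Count messages and bytes sent by each process via the PMPI interface.
 * Link this object ahead of the MPI library so it intercepts MPI_Send. */
#include <mpi.h>
#include <stdio.h>

static long msg_count = 0;
static long byte_count = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(type, &size);
    msg_count++;
    byte_count += (long)count * size;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld messages, %ld bytes sent\n",
           rank, msg_count, byte_count);
    return PMPI_Finalize();
}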

Page 11: Tools for Engineering Analysis of High Performance Parallel Programs


Understanding Efficiency ( P, M )

• Want to understand both what load the program places on the system and how well the system handles that load
  => characterize the capability of the system via simple benchmarks (rather than advertised peaks)
  => combine with the measured load for a predictive model, and compare

[Charts: MPI one-way latency on the Cluster and on the Origin, time (usec) vs. message size (bytes)]
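A latency curve of this kind comes from a simple ping-pong microbenchmark. A minimal sketch (the 1 KB message and iteration count are chosen for illustration; a real characterizer would sweep the message size):

/* Ping-pong sketch for MPI one-way latency between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    char buf[1024] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency is half the average round trip */
        printf("one-way latency: %.2f usec\n",
               (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}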

Page 12: Tools for Engineering Analysis of High Performance Parallel Programs


Communication Efficiency

[Charts: communication efficiency (%) vs. machine size (processors) on the Cluster (rendezvous protocol) and on the Origin, for BT, SP, LU, MG, FT (plus IS on the Origin)]
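One way an efficiency figure like this might be formed is by combining the two ingredients above: the measured communication load (messages and bytes per process) and the machine characterization (per-message cost and sustained bandwidth from the microbenchmarks). A minimal sketch using a simple linear cost model; every number below is an illustrative placeholder, not a measurement from the talk:

/* Predicted communication time from a linear cost model, compared with the
 * measured communication time to give an efficiency percentage. */
#include <stdio.h>

int main(void)
{
    double alpha = 20e-6;          /* per-message cost (s), from the latency test */
    double beta  = 35e6;           /* sustained bandwidth (bytes/s)               */
    double msgs  = 1.2e5;          /* messages sent by one process                */
    double bytes = 2.0e9;          /* bytes sent by one process                   */
    double measured_comm = 95.0;   /* measured communication time (s)             */

    double predicted = msgs * alpha + bytes / beta;
    printf("predicted %.1f s, measured %.1f s, efficiency %.1f%%\n",
           predicted, measured_comm, 100.0 * predicted / measured_comm);
    return 0;
}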

Page 13: Tools for Engineering Analysis of High Performance Parallel Programs


Tools => Improvements in Run Time

• Efficiency analysis (vs. parameters) gives insight into where to improve the system or the program
  – use traditional profiling to see where in the program the 'bad stuff' happens
  – or go back and tune the system to do better

[Charts: communication efficiency (%) vs. machine size (processors) on the Cluster with the eager protocol (BT, SP, LU, MG, FT, IS) and with the rendezvous protocol (BT, SP, LU, MG, FT)]

Page 14: Tools for Engineering Analysis of High Performance Parallel Programs


Cache Behavior (P, $)

• Combining trace generation with simulation provides new structural insight
• Here: clear knees in the program working set ($); these shift with machine size (P)

[Charts: LU miss rate (%) vs. per-processor cache size (1 KB to 4 MB) at machine sizes 4, 8, 16, and 32]
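Curves like these come from feeding a memory-reference trace of the program into a cache simulator and sweeping the cache size. A minimal sketch of the simulator side (direct-mapped, fixed 32-byte lines, one hex address per trace line; all of these are illustrative assumptions, not the configuration used in the study):

/* Trace-driven, direct-mapped cache model reporting the miss rate.
 * Usage: ./cachesim trace.txt [cache_kb] */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long cache_kb   = argc > 2 ? atol(argv[2]) : 64;  /* per-processor cache size */
    long line_bytes = 32;
    long nlines     = cache_kb * 1024 / line_bytes;
    long *tags      = calloc(nlines, sizeof(long));
    unsigned long addr, refs = 0, misses = 0;

    FILE *trace = argc > 1 ? fopen(argv[1], "r") : NULL;
    if (!trace || !tags) { fprintf(stderr, "usage: %s trace [kb]\n", argv[0]); return 1; }

    while (fscanf(trace, "%lx", &addr) == 1) {
        long block = addr / line_bytes;
        long index = block % nlines;
        refs++;
        if (tags[index] != block + 1) {        /* +1 so that 0 means "empty" */
            misses++;
            tags[index] = block + 1;
        }
    }
    printf("%lu refs, %lu misses, miss rate %.2f%%\n",
           refs, misses, refs ? 100.0 * misses / refs : 0.0);
    fclose(trace);
    free(tags);
    return 0;
}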

Page 15: Tools for Engineering Analysis of High Performance Parallel Programs


Cache Behavior (P, $)

• Clear knees in the program working set ($) are not affected by P

[Chart: FT miss rate (%) vs. per-processor cache size (1 KB to 4 MB) at machine sizes 4, 8, 16, and 32]

Page 16: Tools for Engineering Analysis of High Performance Parallel Programs


Sensitivity to Multiprogramming

• Parallel machines are increasingly general purpose
  – multiprogramming, or at least interrupts and daemons
• Many 'ideal' programs are very sensitive to perturbations
  – message passing is loosely coupled, but the implementation may not be!

Slowdown relative to a dedicated run, with competing sequential jobs:

             LU      FT     MG
  Dedicated   1.00    1.00   1.00
  1-Seq       6.39    1.43   4.11
  2-Seq      19.05    1.63   5.86
  3-Seq      20.25    1.65   6.53

Slowdown relative to a dedicated run, with competing parallel programs:

             LU      FT     MG
  Dedicated   1.00    1.00   1.00
  2-PP        4.20    1.28   4.18
  3-PP       18.24    1.51   6.27

Page 17: Tools for Engineering Analysis of High Performance Parallel Programs


Tools => Improvements in Run Time

• The MPI implementation spin-waits on send until the network is available (or the queue is not full), and on receive until completion
• It should use a two-phase spin-block instead

Slowdown relative to a dedicated run, with competing parallel programs (cf. the previous slide):

             LU      FT     MG
  Dedicated   1.00    1.00   1.00
  2-PP        1.24    0.96   1.16
  3-PP        1.31    0.91   1.20
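The idea behind two-phase spin-block is simple: spin briefly in the hope that the message completes at once, then fall back to a blocking wait so the processor can be given up under multiprogramming. A minimal sketch of the policy (the spin bound is an illustrative choice, and how "blocking" yields the CPU is up to the MPI implementation; this is not the Berkeley MPI code itself):

/* Two-phase spin-block wait for an outstanding MPI request. */
#include <mpi.h>

void wait_two_phase(MPI_Request *req, MPI_Status *status)
{
    const double spin_limit = 50e-6;   /* spin for roughly one message latency */
    double start = MPI_Wtime();
    int done = 0;

    /* Phase 1: spin, hoping for quick completion. */
    while (!done && (MPI_Wtime() - start) < spin_limit)
        MPI_Test(req, &done, status);

    /* Phase 2: block, letting the library/OS deschedule this process. */
    if (!done)
        MPI_Wait(req, status);
}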

Page 18: Tools for Engineering Analysis of High Performance Parallel Programs


Sensitivity to Seemingly Unrelated Activity

• The mechanism for doing parameter studies extends naturally to gathering statistically valid data through multiple samples at each point
  – crisp, fast results tend to come in the wee hours
• Extend the study outside the application
• Example: two programs on a large 64-processor Origin, run alone vs. together
  – 8-processor IS run: 4.71 s alone, 6.18 s together
  – 36-processor SP run: 26.36 s alone, 65.28 s together

Page 19: Tools for Engineering Analysis of High Performance Parallel Programs


Repeatability

• The variance across repeated runs is a key result for production codes - the real world is not ideal

[Scatter plots: FT and LU run times (seconds) on the Origin vs. machine size (processors), 30 samples per point, with the average marked]
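Summarizing each parameter point by its mean and spread is straightforward once the sweep log exists. A minimal sketch (the sample values are illustrative placeholders, not data from these plots):

/* Mean and sample standard deviation of repeated run times at one point. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double t[] = {212.0, 215.5, 231.8, 214.2, 298.7, 213.9};  /* seconds */
    int n = sizeof t / sizeof t[0];

    double sum = 0.0, sq = 0.0;
    for (int i = 0; i < n; i++) sum += t[i];
    double mean = sum / n;
    for (int i = 0; i < n; i++) sq += (t[i] - mean) * (t[i] - mean);
    double stddev = sqrt(sq / (n - 1));

    printf("mean %.1f s, stddev %.1f s (%d samples)\n", mean, stddev, n);
    return 0;
}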

Page 20: Tools for Engineering Analysis of High Performance Parallel Programs


Plans

• Integrate our instrumentation and analysis tools with ACTS TAU
  – port to the UCB Millennium environment
  – experiment with ASCI platforms
• Refine and complete the automated sensitivity analysis framework
• Back-end performance data storage
  – Pablo SPPF?
• Next year
  – integrate performance model development and prediction