
Page 1: Benchmarking and Performance Evaluations

Benchmarking and Performance Evaluations

Todd Mytkowicz, Microsoft Research

Page 2: Benchmarking and Performance Evaluations

Let’s poll for an upcoming election

I ask 3 of my co-workers who they are voting for.

Page 3: Benchmarking and Performance Evaluations

Let’s poll for an upcoming election

I ask 3 of my co-workers who they are voting for.

• My approach does not deal with:
  – Variability
  – Bias

Page 4: Benchmarking and Performance Evaluations

Issues with my approach

Variability

source: http://www.pollster.com

My approach is not reproducible

Page 5: Benchmarking and Performance Evaluations

Issues with my approach (II)

Bias

source: http://www.pollster.com

My approach is not generalizable

Page 6: Benchmarking and Performance Evaluations

Take Home Message

• Variability and Bias are two different things
  – The difference between reproducible and generalizable!

Page 7: Benchmarking and Performance Evaluations

Take Home Message

• Variability and Bias are two different things
  – The difference between reproducible and generalizable!

Do we have to worry about Variability and Bias when we benchmark?

Page 8: Benchmarking and Performance Evaluations

Let’s evaluate the speedup of my whizbang idea

What do we do about Variability?

Page 9: Benchmarking and Performance Evaluations

Let’s evaluate the speedup of my whizbang idea

What do we do about Variability?

Page 10: Benchmarking and Performance Evaluations

Let’s evaluate the speedup of my whizbang idea

What do we do about Variability?

• Statistics to the rescue:
  – mean
  – confidence interval (see the sketch below)
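A minimal sketch of those two quantities in R; the run times are hypothetical values, and t.test on the later slides computes the same interval:

x  <- c(1.21, 1.18, 1.20, 1.17, 1.23)   # hypothetical run times, in seconds
m  <- mean(x)
hw <- qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))   # half-width of the 95% CI
c(lower = m - hw, mean = m, upper = m + hw)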

Page 11: Benchmarking and Performance Evaluations

Intuition for T-Test

• Each of 1–6 is equally likely (p = 1/6)
• Throw the die 10 times: calculate the mean

Page 12: Benchmarking and Performance Evaluations

Intuition for T-Test

• Each of 1–6 is equally likely (p = 1/6)
• Throw the die 10 times: calculate the mean

Trial   Mean of 10 throws
1       4.0
2       4.3
3       4.9
4       3.8
5       4.3
6       2.9
…       …

Page 13: Benchmarking and Performance Evaluations

Intuition for T-Test

• Each of 1–6 is equally likely (p = 1/6)
• Throw the die 10 times: calculate the mean

Trial   Mean of 10 throws
1       4.0
2       4.3
3       4.9
4       3.8
5       4.3
6       2.9
…       …
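The table’s experiment is easy to replay; a short simulation (seed chosen arbitrarily here) shows the trial means clustering around the die’s true mean of 3.5:

set.seed(1)                                            # arbitrary seed, for reproducibility
means <- replicate(1000, mean(sample(1:6, 10, replace = TRUE)))
mean(means)                                            # close to 3.5
quantile(means, c(0.025, 0.975))                       # range that covers most trial means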

Page 14: Benchmarking and Performance Evaluations

Back to our Benchmark: Managing Variability

Page 15: Benchmarking and Performance Evaluations

Back to our Benchmark: Managing Variability

> x=scan('file')
Read 20 items
> t.test(x)

        One Sample t-test

data:  x
t = 49.277, df = 19, p-value < 2.2e-16
95 percent confidence interval:
 1.146525 1.248241
sample estimates:
mean of x
 1.197383
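As a sketch of how a sample of speedups like the one in 'file' might be assembled (the file names and the pairing of runs below are assumptions, not from the slides):

base <- scan('baseline_times.txt')     # run times of the baseline system (hypothetical file)
new  <- scan('innovation_times.txt')   # run times of the system + innovation (hypothetical file)
speedup <- base / new                  # values > 1 mean the innovation is faster
t.test(speedup, mu = 1)                # does the mean speedup differ from 1?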

Page 16: Benchmarking and Performance Evaluations

So we can handle Variability. What about Bias?

Page 17: Benchmarking and Performance Evaluations

System = gcc -O2 perlbench
System + Innovation = gcc -O3 perlbench

Evaluating compiler optimizations

Page 18: Benchmarking and Performance Evaluations

Madan: speedup = 1.18 ± 0.0002

Conclusion: O3 is good

System = gcc -O2 perlbench
System + Innovation = gcc -O3 perlbench

Evaluating compiler optimizations

Page 19: Benchmarking and Performance Evaluations

Madan: speedup = 1.18 ± 0.0002

Conclusion: O3 is good

Todd: speedup = 0.84 ± 0.0002

Conclusion: O3 is bad

System = gcc -O2 perlbench
System + Innovation = gcc -O3 perlbench

Evaluating compiler optimizations

Page 20: Benchmarking and Performance Evaluations

Madan: speedup = 1.18 ± 0.0002

Conclusion: O3 is good

Todd: speedup = 0.84 ± 0.0002

Conclusion: O3 is bad

System = gcc -O2 perlbench
System + Innovation = gcc -O3 perlbench

Why does this happen?

Evaluating compiler optimizations

Page 21: Benchmarking and Performance Evaluations

Madan: HOME=/home/madan
Todd: HOME=/home/toddmytkowicz

[Figure: process memory layouts (env, stack, text) for the two users; the different HOME lengths change the size of the environment, which shifts where the stack starts]

Differences in our experimental setup

Page 22: Benchmarking and Performance Evaluations

Runtime of SPEC CPU 2006 perlbench depends on who runs it!
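One hedged way to observe this yourself (not from the slides): rerun a benchmark while padding the environment to different sizes, so that only the environment, and hence the stack's starting address, changes between runs. The command name and pad sizes below are placeholders.

run_once <- function(pad_len) {
  Sys.setenv(EXPERIMENT_PAD = strrep("x", pad_len))    # the child process inherits a larger environment
  system.time(system("./my_benchmark", ignore.stdout = TRUE))["elapsed"]
}
times <- sapply(seq(0, 4096, by = 512), run_once)      # one timing per environment size
range(times)                                           # spread attributable only to environment size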

Page 23: Benchmarking and Performance Evaluations

32 randomly generated linking orders

Bias from linking orders

[Figure: speedup measured for each of the 32 link orders]

Page 24: Benchmarking and Performance Evaluations

32 randomly generated linking orders

Order of .o files can lead to contradictory conclusions

Bias from linking orders

[Figure: speedup measured for each of the 32 link orders]

Page 25: Benchmarking and Performance Evaluations

Where exactly does Bias come from?

Page 26: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: the O2 code laid out across Page N and Page N + 1]

Page 27: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: the O2 code laid out across Page N and Page N + 1, with dead code marked]

Page 28: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: the O2 code laid out across Page N and Page N + 1, with the code affected by O3 marked]

Page 29: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: the O2 code laid out across Page N and Page N + 1, with the hot code marked]

Page 30: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: O2 vs. O3 hot-code layouts across Page N and Page N + 1]

O3 better than O2

Page 31: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: two possible O2 vs. O3 hot-code layouts across Page N and Page N + 1]

O3 better than O2

O2 better than O3

Page 32: Benchmarking and Performance Evaluations

Interactions with hardware buffers

[Figure: the same two O2 vs. O3 hot-code layouts, now across Cacheline N and Cacheline N + 1]

O3 better than O2

O2 better than O3

Page 33: Benchmarking and Performance Evaluations

Other Sources of Bias

• JIT
• Garbage Collection
• CPU Affinity
• Domain specific (e.g. size of input data)

• How do we manage these?

Page 34: Benchmarking and Performance Evaluations

Other Sources of Bias

How do we manage these?
– JIT:
  • ngen to remove the impact of JIT
  • “warmup” phase to JIT code before measurement
– Garbage Collection:
  • Try different heap sizes (JVM)
  • “warmup” phase to build data structures
  • Ensure the program is not “leaking” memory
– CPU Affinity:
  • Try to bind threads to CPUs (SetProcessAffinityMask)
– Domain Specific:
  • Up to you!

(A minimal warm-up-and-measure harness is sketched below.)
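A minimal warm-up-and-measure harness, sketched in R since that is the tool used later in the deck; workload is a placeholder for whatever is being measured:

measure <- function(workload, warmup = 5, runs = 20) {
  for (i in seq_len(warmup)) workload()   # let compilation, caches, and data structures settle
  gc()                                    # start the measured runs from a freshly collected heap
  replicate(runs, system.time(workload())["elapsed"])
}
times <- measure(function() sort(runif(1e6)))   # example workload
t.test(times)                                   # mean and confidence interval, as before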

Page 35: Benchmarking and Performance Evaluations

R for the T-Test

• Where to download:
  – http://cran.r-project.org

• Simple intro to get data into R
• Simple intro to do t.test (sketch below)
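Beyond the one-sample test shown earlier, the same function compares two configurations directly; the file names here are hypothetical:

o2 <- scan('o2_times.txt')   # per-run elapsed times under gcc -O2 (hypothetical file)
o3 <- scan('o3_times.txt')   # per-run elapsed times under gcc -O3 (hypothetical file)
t.test(o2, o3)               # Welch two-sample t-test: do the mean run times differ?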

Page 40: Benchmarking and Performance Evaluations

Some Conclusions

• Performance evaluations are hard!
  – Variability and Bias are not easy to deal with

• Other experimental sciences go to great effort to work around variability and bias
  – We should too!