Optimizely Stats Engine: An overview and practical tips for running experiments


TRANSCRIPT

Page 1: Optimizely Stats Engine: An overview and practical tips for running experiments

Optimizely Stats Engine

Leo Pekelis, Darwish Gani, Robin Pam

Page 2: Optimizely Stats Engine: An overview and practical tips for running experiments

Housekeeping notes

• Chat box is available for questions
• There will be time for Q&A at the end
• We will be recording the webinar for future viewing
• All attendees will receive a copy of the slides after the webinar

Page 3: Optimizely Stats Engine: An overview and practical tips for running experiments

Your speakers

Darwish Gani Product manager

Robin Pam Product marketing

Leo Pekelis Statistician

Page 4: Optimizely Stats Engine: An overview and practical tips for running experiments

Objectives

Understand why Optimizely built Stats Engine

Introduce the methods Stats Engine uses to calculate results

Get practical recommendations for how to test with Stats Engine

Page 5: Optimizely Stats Engine: An overview and practical tips for running experiments

Why make a new Stats Engine?

Page 6: Optimizely Stats Engine: An overview and practical tips for running experiments

Meet Joe: A farmer who uses a traditional t-test

Page 7: Optimizely Stats Engine: An overview and practical tips for running experiments

Joe wants to try a new fertilizer this year

Page 8: Optimizely Stats Engine: An overview and practical tips for running experiments

With his original fertilizer, 10% of plants survive the winter

He thinks that this new fertilizer might help more plants survive.

Joe has a hypothesis

Page 9: Optimizely Stats Engine: An overview and practical tips for running experiments

Joe calculates a sample size for his experiment in advance, given how much better he thinks the new fertilizer might be
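The slides don't show Joe's arithmetic, so here is a minimal sketch of the standard fixed-horizon calculation he would run, assuming a two-sided test at a 5% significance level with 80% power (the slide gives only the 10% baseline; the 15% target and the other parameters are illustrative):

```python
# Fixed-horizon sample size for comparing two proportions.
# Assumed inputs: baseline 10%, hoped-for 15%, alpha = 0.05,
# power = 0.80 (illustrative; the slide states only the baseline).
from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Visitors needed in EACH group for a two-sided z-test of proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value of the test
    z_beta = norm.ppf(power)            # quantile that delivers the power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(n) + 1

print(sample_size_per_group(0.10, 0.15))  # -> 683 plants in each field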

Page 10: Optimizely Stats Engine: An overview and practical tips for running experiments

He waits through the winter…

Page 11: Optimizely Stats Engine: An overview and practical tips for running experiments

10% of plants survive the winter

15% of plants survive

96% statistical significance!

…and he is rewarded for his patience
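For the curious, the slide's 96% can be reproduced with a pooled two-proportion z-test. The field sizes below are an assumption chosen to land near the slide's figure; the slide itself reports only the two survival rates:

```python
# Two-proportion z-test for Joe's harvest (pooled standard error).
# n = 370 plants per field is an assumption; the slide reports only
# the rates (10% vs. ~15%) and "96% statistical significance".
from scipy.stats import norm

def two_proportion_test(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p2 - p1) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

z, p = two_proportion_test(37, 370, 56, 370)   # 10% vs. ~15% survival
print(f"z = {z:.2f}, significance = {1 - p:.0%}")  # about 96%
```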

Page 12: Optimizely Stats Engine: An overview and practical tips for running experiments

Does that work today?

Page 13: Optimizely Stats Engine: An overview and practical tips for running experiments

Meet Kyle: Head of Optimization at Optimizely

Page 14: Optimizely Stats Engine: An overview and practical tips for running experiments

Kyle doesn’t know what improvement to expect

Page 15: Optimizely Stats Engine: An overview and practical tips for running experiments

Kyle also gets data from Optimizely all the time

Page 16: Optimizely Stats Engine: An overview and practical tips for running experiments

Kyle wants to test many goals and variations at once, instead of just one hypothesis


Page 17: Optimizely Stats Engine: An overview and practical tips for running experiments

Actually that’s a lot of work. It’s cumbersome and error-prone.

Page 18: Optimizely Stats Engine: An overview and practical tips for running experiments

What’s your chance of making an incorrect decision?

Page 19: Optimizely Stats Engine: An overview and practical tips for running experiments

30%
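The slide doesn't show the calculation behind 30%, but familywise arithmetic of this flavor gets you there: if each of k independent comparisons is run at a 5% false positive rate, the chance that at least one comes up falsely significant is 1 - (1 - 0.05)^k, which reaches about 30% around k = 7. The comparison counts below are illustrative:

```python
# Chance of at least one false positive across k independent
# comparisons, each tested at a 5% level. The deck's 30% is
# consistent with roughly 7 comparisons (an assumption).
alpha = 0.05
for k in (1, 3, 5, 7, 10):
    print(k, f"{1 - (1 - alpha) ** k:.0%}")
# 1 -> 5%, 3 -> 14%, 5 -> 23%, 7 -> 30%, 10 -> 40%
```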

Page 20: Optimizely Stats Engine: An overview and practical tips for running experiments

Objectives

Understand why Optimizely built Stats Engine

Introduce the methods Stats Engine uses to calculate results

Get practical recommendations for how to test with Stats Engine

Page 21: Optimizely Stats Engine: An overview and practical tips for running experiments

Introducing Stats Engine

Page 22: Optimizely Stats Engine: An overview and practical tips for running experiments

How we did it

• Partnered with Stanford statisticians
• Talked with customers
• Examined historical experiments
• Found the best methods for real-time data

Page 23: Optimizely Stats Engine: An overview and practical tips for running experiments

What does Stats Engine do?

Provides a principled and mathematical way to calculate your chance of making an incorrect decision.

Page 24: Optimizely Stats Engine: An overview and practical tips for running experiments

Sequential Testing
• First used in the 1940s for military weapons testing
• Sample size is not fixed in advance
• Data is evaluated as it's collected

False Discovery Rate
• First used in the 1990s in genetics
• Corrects error rates for multiple goals and variations
• Controls the expected number of false discoveries

Page 25: Optimizely Stats Engine: An overview and practical tips for running experiments

Sequential Testing + False Discovery Rate control = Statistical Significance for Digital Experimentation

Page 26: Optimizely Stats Engine: An overview and practical tips for running experiments

Statistical Significance for Digital Experimentation
• Continuously evaluate test results
• Run many goals and variations
• Don't worry about estimating an MDE upfront

Page 27: Optimizely Stats Engine: An overview and practical tips for running experiments

Sequential Testing: Finding the right stopping rule

Page 28: Optimizely Stats Engine: An overview and practical tips for running experiments

Declare a winner?

Visitors    Variation #1    Variation #2
500         50%             65%

Is this lift big enough for the visitors I saw?

Page 29: Optimizely Stats Engine: An overview and practical tips for running experiments

Desired Stopping Rule: I will be “wrong” only 5% of the time.

Page 30: Optimizely Stats Engine: An overview and practical tips for running experiments

Visitors    Variation #1    Variation #2    Traditional Error Rate
1000        55%             59%             5%

Find a stopping rule so I declare a winner incorrectly less than 5% of the time.

Page 31: Optimizely Stats Engine: An overview and practical tips for running experiments

Visitors    Variation #1    Variation #2    Traditional Error Rate
500         50%             65%             5%
1000        55%             59%             5%
5000        52%             57%             5%
10000       54%             59%             5%

Page 32: Optimizely Stats Engine: An overview and practical tips for running experiments

Visitors    Variation #1    Variation #2    Traditional Error Rate
500         50%             65%             5%
1000        55%             59%             5%
5000        52%             57%             5%
10000       54%             59%             5%

Look only once: 5% error rate.

Page 33: Optimizely Stats Engine: An overview and practical tips for running experiments

Variation #1

Variation #2

Traditional Error Rates

500 1000 5000 10000

Visitors

50% 55% 52% 54%

65% 59% 57% 59%

5% 5% 5% 5%

Look only once: 5% Error rate

Look more than once: >5% Error rate
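A quick Monte Carlo makes the "look more than once" row concrete: below, the two variations are identical (an A/A test), yet checking a traditional 5%-level z-test at the table's four visitor counts and stopping at the first significant look pushes the error rate well past 5%. The look schedule mirrors the table; everything else is an assumption of the sketch:

```python
# Simulate an A/A test (no real difference) and peek at a
# traditional z-test after 500, 1000, 5000, and 10000 visitors
# per variation. Counting a "win" if ANY look is significant
# shows how peeking inflates the nominal 5% error rate.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
looks = np.array([500, 1000, 5000, 10000])  # visitors per variation
alpha, trials, p = 0.05, 20000, 0.5
z_crit = norm.ppf(1 - alpha / 2)

# Draw conversions for each interval between looks, then cumulate.
increments = np.diff(looks, prepend=0)
a = rng.binomial(increments, p, size=(trials, len(looks))).cumsum(axis=1)
b = rng.binomial(increments, p, size=(trials, len(looks))).cumsum(axis=1)

pooled = (a + b) / (2 * looks)
se = np.sqrt(2 * pooled * (1 - pooled) / looks)
z = np.abs(a / looks - b / looks) / se
any_significant = (z > z_crit).any(axis=1)   # significant at ANY look?
print(f"error rate with 4 looks: {any_significant.mean():.1%}")  # well above 5%
```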

Page 34: Optimizely Stats Engine: An overview and practical tips for running experiments

Variation #1

Variation #2

Traditional Error Rates

500 1000 5000 10000

Visitors

50% 55% 52% 54%

65% 59% 57% 59%

5% 5% 5% 5%

<5% Error rate for the entire test!

Sequential Testing Error Rate 1% .5% 1.5% 1.5% < 5%

Page 35: Optimizely Stats Engine: An overview and practical tips for running experiments

“P-Hacking” • “Continuous Monitoring” • “Repeated Significance Testing”

These are not new problems!

Page 36: Optimizely Stats Engine: An overview and practical tips for running experiments

“The P value was never meant to be used the way it's used today.” (Steven Goodman, Stanford physician and statistician, nature.com)

Page 37: Optimizely Stats Engine: An overview and practical tips for running experiments

Source: Evan Miller, How Not to Run an A/B Test

Page 38: Optimizely Stats Engine: An overview and practical tips for running experiments
Page 39: Optimizely Stats Engine: An overview and practical tips for running experiments

Instead of sample size and power calculations, focus on creating and running tests.

Page 40: Optimizely Stats Engine: An overview and practical tips for running experiments

Sequential Testing: a framework of hypothesis testing created to allow the experimenter to evaluate test results as they come in.

• Continuously evaluate test results
• Don't worry about estimating an MDE upfront
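The slides never name the statistic behind this framework, but Optimizely's technical writing on Stats Engine describes a mixture sequential probability ratio test (mSPRT). The sketch below is the textbook normal-mixture version of that idea, not Stats Engine's production calculation; the prior scale tau2 is an arbitrary illustrative choice:

```python
# Normal-mixture mSPRT: an always-valid sequential test.
# After n observations with running mean difference theta_hat and
# known variance sigma2, the mixture likelihood ratio against
# H0 (no difference), with a N(0, tau2) prior on the effect, is
#   Lambda_n = sqrt(sigma2 / (sigma2 + n*tau2))
#              * exp(n**2 * tau2 * theta_hat**2
#                    / (2 * sigma2 * (sigma2 + n * tau2)))
# and we may stop the FIRST time Lambda_n >= 1/alpha while keeping
# the false positive rate below alpha, no matter how often we look.
import math

def msprt_stop(diffs, sigma2, tau2=0.01, alpha=0.05):
    """Return the first sample index at which we can declare
    significance, or None if the data never crosses the boundary."""
    total = 0.0
    for n, d in enumerate(diffs, start=1):
        total += d
        theta_hat = total / n
        log_lam = (0.5 * math.log(sigma2 / (sigma2 + n * tau2))
                   + n ** 2 * tau2 * theta_hat ** 2
                   / (2 * sigma2 * (sigma2 + n * tau2)))
        if log_lam >= math.log(1 / alpha):
            return n
    return None

# Illustrative use (assumed setup): stream per-visitor paired
# differences in conversion (treatment minus control); for Bernoulli
# outcomes near a 10% rate, sigma2 is about 2 * 0.1 * 0.9 = 0.18.
# stop_at = msprt_stop(diff_stream, sigma2=0.18)
```

Unlike the fixed-horizon z-test above, this boundary can be checked after every visitor, which is what lets results be evaluated "as they come in."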

Page 41: Optimizely Stats Engine: An overview and practical tips for running experiments

False discovery rate control: error rates for a world with many goals and variations.

Page 42: Optimizely Stats Engine: An overview and practical tips for running experiments

A grid of results: Control, Variation 1, and Variation 2, each measured on Goals 1 through 5 (ten variation-by-goal comparisons against the control).

Page 43: Optimizely Stats Engine: An overview and practical tips for running experiments

The same variations-by-goals grid, tested at Significance Level 90 (False Positive Rate 10%).

Page 44: Optimizely Stats Engine: An overview and practical tips for running experiments

The same grid at Significance Level 90 (False Positive Rate 10%): with ten comparisons, expect 1 false positive!

Page 45: Optimizely Stats Engine: An overview and practical tips for running experiments

The same grid at Significance Level 90 (False Positive Rate 10%): 1 false positive, and 1 other variation-by-goal combination has a genuinely large improvement.

Page 46: Optimizely Stats Engine: An overview and practical tips for running experiments

The same grid: the large improvement is detected as 1 true positive, reported alongside the 1 false positive.

Page 47: Optimizely Stats Engine: An overview and practical tips for running experiments

My Report
• Variation 2 is improving on Goal 1
• Variation 1 is improving on Goal 4

“10% of what I report could be wrong.” Not so: with 1 false positive among the 2 reported results, 50% of this report is wrong.

“Furthermore, I found the following results. This leads me to conclude that …”

Page 48: Optimizely Stats Engine: An overview and practical tips for running experiments

Statistical Significance: the chance that any variation on any goal that is reported as a winner or loser is correct.

Page 49: Optimizely Stats Engine: An overview and practical tips for running experiments

• “New York Times has a feature in its Tuesday science section, Take a Number … Today's column is in error … This is the old, old error of confusing p(A|B) with p(B|A).” (Andrew Gelman, Misunderstanding the p-value)

• “If I were to randomly select a drug out of the lot of 100, run it through my tests, and discover a p<0.05 statistically significant benefit, there is only a 62% chance that the drug is actually effective.” (Alex Reinhart, The p value and the base rate fallacy)

• “In this article I'll show that badly performed A/B tests can produce winning results which are more likely to be false than true. At best, this leads to the needless modification of websites; at worst, to modification which damages profits.” (Martin Goodson, Most Winning A/B Test Results are Illusory)

• “An unguarded use of single-inference procedures results in a greatly increased false positive (significance) rate.” (Benjamini, Yoav, and Yosef Hochberg. “Controlling the false discovery rate: a practical and powerful approach to multiple testing.” Journal of the Royal Statistical Society, Series B (Methodological) (1995): 289-300.)
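Reinhart's 62% is base-rate arithmetic, and it is worth running the numbers once. A sketch with round numbers close to his (10 effective drugs out of 100, 80% power, a 5% false positive rate; rounding the expected false positives up to 5 lands on his 62%):

```python
# Base-rate arithmetic behind "p < 0.05 but only ~62% likely real".
# Assumed: 10 of 100 drugs truly work, tests have 80% power, and a
# 5% false positive rate (illustrative; Reinhart's exact figure
# depends on rounding).
effective, ineffective = 10, 90
power, alpha = 0.80, 0.05

true_pos = effective * power          # 8 real effects detected
false_pos = ineffective * alpha       # 4.5 spurious "wins"
ppv = true_pos / (true_pos + false_pos)
print(f"chance a significant result is real: {ppv:.0%}")
# 8 / 12.5 = 64% here; rounding false positives up to 5 gives ~62%
```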


Page 51: Optimizely Stats Engine: An overview and practical tips for running experiments

False Discovery Rate control: a framework for controlling the errors that arise from running multiple experiments at once.

Run many goals and variations.
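The Benjamini-Hochberg procedure cited two slides back is the classic way to control the false discovery rate. Here is a minimal sketch of plain BH (Stats Engine applies FDR control sequentially, which this does not attempt); the fifteen p-values are made-up variation-by-goal results:

```python
# Benjamini-Hochberg: controls the expected fraction of false
# discoveries among everything declared significant at level q.
def benjamini_hochberg(p_values, q=0.10):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank          # largest rank passing the BH line
    return sorted(order[:k_max])  # reject everything up to that rank

# 15 hypothetical variation-by-goal p-values; q = 0.10 mirrors the
# deck's 10% rate. Only the clearly small ones survive.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205,
         0.212, 0.216, 0.222, 0.251, 0.269, 0.275, 0.34]
print(benjamini_hochberg(pvals, q=0.10))  # -> [0, 1]
```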

Page 52: Optimizely Stats Engine: An overview and practical tips for running experiments

What Stats Engine means for you

• You see fewer, but more accurate, conclusive results.
• You can implement winners as soon as significance is reached.
• You get an easier experiment workflow and fewer unforeseen, hidden errors.

Page 53: Optimizely Stats Engine: An overview and practical tips for running experiments

Objectives

Understand why Optimizely built Stats Engine

Introduce the methods Stats Engine uses to calculate results

Get practical recommendations for how to test with Stats Engine

Page 54: Optimizely Stats Engine: An overview and practical tips for running experiments

And now, for some practical guidance

Page 55: Optimizely Stats Engine: An overview and practical tips for running experiments

First, some vocabulary

• Baseline conversion rate: the control group's expected conversion rate.
• Minimum detectable effect: the smallest conversion rate difference it is possible to detect in an A/B test.
• Statistical significance: the likelihood that the observed difference in conversion rates is not due to chance.
• Minimum sample size: the smallest number of visitors required to reliably detect a given conversion rate difference.

Page 56: Optimizely Stats Engine: An overview and practical tips for running experiments

Sample size calculators and statistical power

Page 57: Optimizely Stats Engine: An overview and practical tips for running experiments

How many visitors do you need to see significant results?

Visitors needed to reach significance with Stats Engine:

Baseline             Improvement
conversion rate      5%          10%         25%
1%                   458,900     101,600     13,000
5%                   69,500      15,000      1,800
10%                  29,200      6,200       700
25%                  8,100       1,700       200

Lower conversion rates and smaller effects mean more visitors.

Page 58: Optimizely Stats Engine: An overview and practical tips for running experiments

One example of calculating your opportunity cost

12% minimum detectable effect

Page 59: Optimizely Stats Engine: An overview and practical tips for running experiments

Original

Variation

Page 60: Optimizely Stats Engine: An overview and practical tips for running experiments

Should you stop or continue a test?

Page 61: Optimizely Stats Engine: An overview and practical tips for running experiments

Should you stop or continue a test?

Is my test significant?
• Yes: Congrats!
• No: Can I afford to wait?
  • Yes: Continue.
  • No: Stop, and either accept lower significance or concede the test is inconclusive.


Page 63: Optimizely Stats Engine: An overview and practical tips for running experiments
Page 64: Optimizely Stats Engine: An overview and practical tips for running experiments

Difference intervals

Page 65: Optimizely Stats Engine: An overview and practical tips for running experiments

Difference intervals give a worst case, a middle ground, and a best case for the lift. Slide example: worst case 7.3, middle ground 11.6, best case 15.4.
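Stats Engine's difference intervals are always-valid sequential intervals, so they won't match a textbook formula exactly; still, a standard fixed-horizon interval for the difference of two conversion rates shows the worst case / middle ground / best case idea. The counts below are invented to land near the slide's numbers:

```python
# 95% fixed-horizon interval for the difference in conversion rates.
# Counts are illustrative, chosen to land near the slide's
# 7.3 / 11.6 / 15.4 example (Stats Engine's own intervals differ).
from scipy.stats import norm

def difference_interval(x1, n1, x2, n2, level=0.95):
    p1, p2 = x1 / n1, x2 / n2
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z = norm.ppf(0.5 + level / 2)
    diff = p2 - p1
    return diff - z * se, diff, diff + z * se

lo, mid, hi = difference_interval(318, 1060, 441, 1060)
print(f"worst {lo:.1%}  middle {mid:.1%}  best {hi:.1%}")
# -> worst 7.6%  middle 11.6%  best 15.7%
```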

Page 66: Optimizely Stats Engine: An overview and practical tips for running experiments

Seasonality
• We DO take into account seasonality while a test is running.
• We DO NOT take into account future seasonality after an experiment is stopped.

Page 67: Optimizely Stats Engine: An overview and practical tips for running experiments

Should you stop or continue a test? (Decision flow as on Page 61.)

• Use Difference Intervals to understand the types of lifts you could see.


Page 69: Optimizely Stats Engine: An overview and practical tips for running experiments

A reasonable number of visitors left, relative to your traffic?

Page 70: Optimizely Stats Engine: An overview and practical tips for running experiments

Visitors Remaining Explained: the estimate updates depending on whether the observed improvement increases in magnitude, stays the same, or decreases in magnitude.
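The slides don't give a formula for Visitors Remaining. One plausible reading, which is purely this sketch's assumption and not Optimizely's documented algorithm, is to re-estimate the sample size the currently observed improvement would need and subtract the visitors already seen:

```python
# Rough "visitors remaining" heuristic: estimate the sample size the
# CURRENT observed effect would need, minus visitors seen so far.
# A guess at the mechanics, for illustration only.
from scipy.stats import norm

def visitors_remaining(p_base, p_observed, n_so_far,
                       alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    var = p_base * (1 - p_base) + p_observed * (1 - p_observed)
    n_needed = z ** 2 * var / (p_observed - p_base) ** 2
    return max(0, int(n_needed) - n_so_far)

# Illustrative inputs: 10% baseline, currently seeing 11%,
# 5,000 visitors observed so far.
print(visitors_remaining(0.10, 0.11, 5000))  # -> 9748
```

This matches the slide's qualitative behavior: if the observed improvement shrinks, the required sample grows and Visitors Remaining jumps; if it grows, the number drops.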

Page 71: Optimizely Stats Engine: An overview and practical tips for running experiments

A good idea to wait (slide example: 5,761 visitors seen so far, with 3,200 more to go).

Page 72: Optimizely Stats Engine: An overview and practical tips for running experiments

Should you stop or continue a test? (Decision flow as on Page 61.)

• Use Visitors Remaining to evaluate if waiting makes sense.


Page 74: Optimizely Stats Engine: An overview and practical tips for running experiments
Page 75: Optimizely Stats Engine: An overview and practical tips for running experiments

If you're an organization that can
• iterate quickly on new variations,
• run lots of experiments, and
• absorb little downside risk from implementing non-winning variations,

then you can likely tolerate a higher error rate.

Page 76: Optimizely Stats Engine: An overview and practical tips for running experiments

Difference intervals can guide your decision

Page 77: Optimizely Stats Engine: An overview and practical tips for running experiments

Should you stop or continue a test? (Decision flow as on Page 61.)

• Use Difference Intervals to measure the risk you take on.

Page 78: Optimizely Stats Engine: An overview and practical tips for running experiments

> 100,000 visitors? What next?

Page 79: Optimizely Stats Engine: An overview and practical tips for running experiments

Time to move on to the next idea

Page 80: Optimizely Stats Engine: An overview and practical tips for running experiments

Should you stop or continue a test? (Decision flow as on Page 61.)

• Use Visitors Remaining to evaluate if waiting makes sense.

Pages 81-84: Optimizely Stats Engine: An overview and practical tips for running experiments

Recap

Is my test significant?
• Yes: Congrats!
• No: Can I afford to wait?
  • Yes: Continue.
  • No: Stop, and either accept lower significance or concede the test is inconclusive.

• Use Difference Intervals to understand the types of lifts you could see.
• Use Visitors Remaining to evaluate if waiting makes sense.
• Use Difference Intervals to measure the risk you take on.

Page 85: Optimizely Stats Engine: An overview and practical tips for running experiments

Tuning your testing strategy for your traffic and business


Looking for small effects?
• Tests take longer to reach significance.
• You might find more winners, if you are willing to wait long enough.

Testing for larger effects?
• Run more tests, faster.
• Know when it's time to move on to the next idea.

Page 86: Optimizely Stats Engine: An overview and practical tips for running experiments

Frequently asked questions

Page 87: Optimizely Stats Engine: An overview and practical tips for running experiments

Do I need to re-run my historical tests?

Page 88: Optimizely Stats Engine: An overview and practical tips for running experiments

Is this a one-tailed or two-tailed test? Why did you switch?

Page 89: Optimizely Stats Engine: An overview and practical tips for running experiments

Why does Stats Engine report 0% Statistical Significance when other tools report higher values?

Page 90: Optimizely Stats Engine: An overview and practical tips for running experiments

Why does Statistical Significance increase step-wise?

Page 91: Optimizely Stats Engine: An overview and practical tips for running experiments

If I take the results I see in Optimizely and plug them into any other statistics calculator, the statistical significance is different. Why?

Page 92: Optimizely Stats Engine: An overview and practical tips for running experiments

How does Stats Engine handle revenue calculations?

Page 93: Optimizely Stats Engine: An overview and practical tips for running experiments

Questions?