Optimizely Workshop: Take Action on Results with Statistics

TRANSCRIPT

Page 1: Optimizely Workshop: Take Action on Results with Statistics

Take Action on Results With Statistics

An Optimizely Online Workshop

Statistician: Leonid Pekelis

Page 2: Optimizely Workshop: Take Action on Results with Statistics

Optimizely’s Stats Engine is designed to work with you, not against you, to provide results which are reliable and accurate, without requiring statistical training.

At the same time, by knowing some statistics of your own, you can tune Stats Engine to get the most performance for your unique needs.

Page 3: Optimizely Workshop: Take Action on Results with Statistics

1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?

2. What are the three tradeoffs in an A/B Test? And how are they related?

3. How can you use Optimizely’s results page to best tune the tradeoffs to achieve your experimentation goals?

After this workshop, you should be able to answer…

Page 4: Optimizely Workshop: Take Action on Results with Statistics

We will also preview how to choose the number of goals and variations for your experiment.

Page 5: Optimizely Workshop: Take Action on Results with Statistics

First, some vocabulary (yay!)

Page 6: Optimizely Workshop: Take Action on Results with Statistics

• A) The original, or baseline version of content that you are testing through a variation.

• B) Metric used to measure impact of control and variation

• C) The control group’s expected conversion rate.

• D) The relative percentage difference of your variation from baseline.

• E) The number of visitors in your test.

Which is the Improvement?

Page 7: Optimizely Workshop: Take Action on Results with Statistics

• A) Control and Variation: The original, or baseline version of content that you are testing through a variation.

• B) Goal: Metric used to measure impact of control and variation.

• C) Baseline conversion rate: The control group’s expected conversion rate.

• D) Improvement: The relative percentage difference of your variation from baseline.

• E) Sample size: The number of visitors in your test.
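As a concrete illustration of item D, here is a minimal Python sketch of the improvement calculation; the conversion rates are made-up values, not numbers from the workshop.

```python
baseline_cr = 0.10    # control group's conversion rate (item C), hypothetical
variation_cr = 0.12   # variation's conversion rate, hypothetical

# Improvement (item D): relative percentage difference from baseline.
improvement = (variation_cr - baseline_cr) / baseline_cr * 100
print(improvement)    # 20.0 -> a +20% relative improvement
```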

Page 8: Optimizely Workshop: Take Action on Results with Statistics

Stats Engine corrects the pitfalls of A/B Testing with classical statistics.

Page 9: Optimizely Workshop: Take Action on Results with Statistics

A procedure for classical statistics (a.k.a. “T-test”, a.k.a. “Traditional Frequentist”, a.k.a. “Fixed Horizon Testing”)

Farmer Fred wants to compare the effect of two fertilizers on crop yield.

1. Chooses how many plots to use (sample size).

2. Waits for a crop cycle, collects data once at the end.

3. Asks “What are the chances I’d have gotten these results if there was no difference between the fertilizers?” (a.k.a. the p-value). If the p-value < 5%, his results are significant.

4. Goes on, maybe to test irrigation methods.
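For step 3, a minimal sketch of Fred’s fixed-horizon test using SciPy’s two-sample t-test; the crop yields below are invented illustration data, not anything from the workshop.

```python
from scipy.stats import ttest_ind

# Hypothetical crop yield per plot for each fertilizer.
fertilizer_a = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2]
fertilizer_b = [4.6, 4.3, 4.8, 4.5, 4.2, 4.7]

# "What are the chances I'd have gotten these results if there was no
# difference between the fertilizers?" -- the p-value of a two-sample t-test.
t_stat, p_value = ttest_ind(fertilizer_a, fertilizer_b)

if p_value < 0.05:
    print(f"Significant (p = {p_value:.3f})")
else:
    print(f"Inconclusive (p = {p_value:.3f})")
```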

Page 10: Optimizely Workshop: Take Action on Results with Statistics

1915: Data is expensive. Data is slow. Practitioners are trained.

2015: Data is cheap. Data is real-time. Practitioners are everyone.

Classical statistics were designed for an offline world.

Page 11: Optimizely Workshop: Take Action on Results with Statistics

The modern A/B Testing procedure is different:

1. Start without a good estimate of sample size.

2. Check results early and often. Estimate ROI as quickly as possible.

3. Ask “How likely is it that my testing procedure gave a wrong answer?”

4. Test many variations on multiple goals, not just one.

5. Iterate. Iterate. Iterate.

Page 12: Optimizely Workshop: Take Action on Results with Statistics

Pitfall 1. Peeking

Page 13: Optimizely Workshop: Take Action on Results with Statistics

[Figure: a timeline of an experiment that is checked repeatedly (“peeking”) before the minimum sample size is reached. The early peeks show p-value > 5% (inconclusive); one peek shows p-value < 5% (significant) before the test is done.]

Page 14: Optimizely Workshop: Take Action on Results with Statistics

Why is this a problem?

There is a ~5% chance of false positive each time you peek.

Page 15: Optimizely Workshop: Take Action on Results with Statistics

[Figure: the same peeking timeline, with four peeks marked before the minimum sample size is reached.]

4 peeks → ~18% chance of seeing a false positive
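The ~18% figure follows from the back-of-the-envelope calculation below, which treats each peek as an independent 5% chance of a false positive; in reality peeks at accumulating data are correlated, so this is only an approximation.

```python
alpha = 0.05   # chance of a false positive at any single peek
peeks = 4

# Chance that at least one of the four peeks crosses the threshold,
# assuming (as an approximation) that the peeks are independent.
at_least_one_false_positive = 1 - (1 - alpha) ** peeks
print(at_least_one_false_positive)   # ~0.185, i.e. roughly 18%
```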

Page 16: Optimizely Workshop: Take Action on Results with Statistics

Pitfall 2. Mistaking “False Positive Rate” for “Chance of a wrong conclusion”

Page 17: Optimizely Workshop: Take Action on Results with Statistics

Say I run an experiment.

Page 18: Optimizely Workshop: Take Action on Results with Statistics

1 original page, 5 variations, 6 goals = 30 “A/B Tests”

Page 19: Optimizely Workshop: Take Action on Results with Statistics

After I reach my minimum sample size, I stop the experiment and see 2 of my variations beating control and 1 variation losing to control.

Page 20: Optimizely Workshop: Take Action on Results with Statistics

Winner, Winner, Loser

Classical statistics guarantee <= 5% false positives. What % of my 2 winners and 1 loser do I expect to be false positives?

Page 21: Optimizely Workshop: Take Action on Results with Statistics

[Figure: the full grid of 30 tests.]

2 winners, 1 loser, and 27 inconclusives

Page 22: Optimizely Workshop: Take Action on Results with Statistics

[Figure: the same grid of 30 tests: 2 winners, 1 loser, 27 inconclusives.]

30 A/B Tests x 5% = 1.5 false positives!

Page 23: Optimizely Workshop: Take Action on Results with Statistics

Winner, Winner, Loser

Classical statistics guarantee <= 5% false positives. What % of my winners & losers do I expect to be false positives?

Answer: With 30 A/B Tests, we can expect a 1.5 / 3 = 50% chance of a wrong conclusion! In general, we can’t say without knowing how many other goals & variations were tested.
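The arithmetic behind that answer, as a small sketch; it assumes, for illustration, that none of the 30 tests has a real effect.

```python
tests = 30        # 5 variations x 6 goals, each counted as its own A/B test
alpha = 0.05      # classical per-test false positive rate
conclusive = 3    # 2 winners + 1 loser observed

expected_false_positives = tests * alpha              # 30 x 5% = 1.5
share_of_conclusive_wrong = expected_false_positives / conclusive

print(expected_false_positives)     # 1.5
print(share_of_conclusive_wrong)    # 0.5 -> a ~50% chance that a conclusive result is wrong
```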

Page 24: Optimizely Workshop: Take Action on Results with Statistics

1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?

2. What are the three tradeoffs in an A/B Test? And how are they related?

3. How can you use Optimizely’s results page to best tune the tradeoffs to achieve your experimentation goals?

After this workshop you should be able to answer …

Page 25: Optimizely Workshop: Take Action on Results with Statistics

1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?

A. Peeking and mistaking “False Positive Rate” for “Chance of a wrong conclusion.”

After this webinar, you should be able to answer …

Page 26: Optimizely Workshop: Take Action on Results with Statistics

The tradeoffs of A/B Testing

Page 27: Optimizely Workshop: Take Action on Results with Statistics

[Diagram: the three tradeoffs: Error rates, Runtime, and Improvement & Baseline CR.]

Page 28: Optimizely Workshop: Take Action on Results with Statistics

Error rates: the “Chance of a wrong conclusion.”

Page 29: Optimizely Workshop: Take Action on Results with Statistics

Error rates: the “chance of a wrong conclusion”, i.e. calling a non-winner a winner, or a non-loser a loser.

Page 30: Optimizely Workshop: Take Action on Results with Statistics


Page 31: Optimizely Workshop: Take Action on Results with Statistics

Where is the error rate on Optimizely’s results page?

Statistical Significance = “Chance of a right conclusion” = 100 x (1 - False Discovery Rate)
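Reading the relation the other way, a small sketch of what a given Statistical Significance number implies; the 92 below is a hypothetical value, not one from the workshop.

```python
statistical_significance = 92   # percent, as displayed on the results page (hypothetical)

# Per the relation above: significance = 100 x (1 - false discovery rate).
false_discovery_rate = 1 - statistical_significance / 100
print(false_discovery_rate)     # 0.08 -> about 8% of results declared at this
                                # level are expected to be wrong conclusions
```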

Page 32: Optimizely Workshop: Take Action on Results with Statistics

How can you control the error rate?

Page 33: Optimizely Workshop: Take Action on Results with Statistics


Page 34: Optimizely Workshop: Take Action on Results with Statistics

Where is runtime on Optimizely’s results page?

Page 35: Optimizely Workshop: Take Action on Results with Statistics

Were you expecting a funny picture?

Page 36: Optimizely Workshop: Take Action on Results with Statistics

Where is effect size on Optimizely’s results page?

Page 37: Optimizely Workshop: Take Action on Results with Statistics

These three quantities (Error rates, Runtime, and Improvement & Baseline CR) are all inversely related.

Page 38: Optimizely Workshop: Take Action on Results with Statistics

At any number of visitors, the higher the error rate you allow, the smaller the improvement you can detect.

Page 39: Optimizely Workshop: Take Action on Results with Statistics

At any error rate threshold, stopping your test earlier means you can only detect larger improvements.

Page 40: Optimizely Workshop: Take Action on Results with Statistics

For any improvement, the lower the error rate you want, the longer you need to run your test.

Page 41: Optimizely Workshop: Take Action on Results with Statistics

What does this look like in practice?

Average visitors needed to reach significance with Stats Engine
(baseline conversion rate = 10%)

Significance Threshold |      Improvement (relative)
(Error Rate)           |    5%       10%       25%
95 (5%)                |   62 K     14 K      1,800
90 (10%)               |   59 K     12 K      1,700
80 (20%)               |   53 K     11 K      1,500
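To get a feel for how the three quantities trade off numerically, here is a sketch using the classical fixed-horizon sample-size formula for comparing two proportions. This is not Stats Engine’s sequential calculation, so the numbers will not match the table above; the 80% power default is an assumption for illustration.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variation(baseline_cr, relative_lift, alpha=0.05, power=0.80):
    """Classical (fixed-horizon) approximation of the visitors needed per variation
    to detect a given relative lift at a two-sided significance level alpha."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 10% baseline conversion rate, +10% relative improvement:
print(visitors_per_variation(0.10, 0.10, alpha=0.05))   # stricter threshold -> more visitors
print(visitors_per_variation(0.10, 0.10, alpha=0.20))   # looser threshold  -> fewer visitors
print(visitors_per_variation(0.10, 0.25, alpha=0.05))   # larger improvement -> far fewer visitors
```

Lowering the error rate or chasing a smaller improvement both push the required visitor count up, which is the inverse relationship the table illustrates.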

Page 42: Optimizely Workshop: Take Action on Results with Statistics

~ 1 K visitors per day

Average visitors needed to reach significance with Stats Engine
(baseline conversion rate = 10%)

Significance Threshold |      Improvement (relative)
(Error Rate)           |    5%       10%       25%
95 (5%)                |   62 K     14 K      1,800
90 (10%)               |   59 K     12 K      1,700
80 (20%)               |   53 K     11 K      1,500 (1 day)

Page 43: Optimizely Workshop: Take Action on Results with Statistics

~ 10 K visitors per day

Average visitors needed to reach significance with Stats Engine
(baseline conversion rate = 10%)

Significance Threshold |      Improvement (relative)
(Error Rate)           |    5%       10%             25%
95 (5%)                |   62 K     14 K            1,800
90 (10%)               |   59 K     12 K            1,700
80 (20%)               |   53 K     11 K (1 day)    1,500

Page 44: Optimizely Workshop: Take Action on Results with Statistics

~ 50 K visitors per day

Average visitors needed to reach significance with Stats Engine
(baseline conversion rate = 10%)

Significance Threshold |      Improvement (relative)
(Error Rate)           |    3%       5%              10%
95 (5%)                |  190 K     62 K            14 K
90 (10%)               |  180 K     59 K            12 K
80 (20%)               |  160 K     53 K (1 day)    11 K

Page 45: Optimizely Workshop: Take Action on Results with Statistics

> 100 K visitors per day

Average visitors needed to reach significance with Stats Engine
(baseline conversion rate = 10%)

Significance Threshold |      Improvement (relative)
(Error Rate)           |    3%               5%       10%
95 (5%)                |  190 K             62 K     14 K
90 (10%)               |  180 K             59 K     12 K
80 (20%)               |  160 K (1 day)     53 K     11 K

Page 46: Optimizely Workshop: Take Action on Results with Statistics

1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?

2. What are the three tradeoffs in an A/B Test? And how are they related?

3. How can you use Optimizely’s results page to best tune the tradeoffs to achieve your experimentation goals?

After this workshop, you should be able to answer …

Page 47: Optimizely Workshop: Take Action on Results with Statistics

1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?

2. What are the three tradeoffs in an A/B Test? And how are they related?

A. Error Rates, Runtime, and Effect Size. They are all inversely related.

After this workshop, you should be able to answer …

Page 48: Optimizely Workshop: Take Action on Results with Statistics

Use tradeoffs to align your testing goals

Page 49: Optimizely Workshop: Take Action on Results with Statistics

In the beginning, we make an educated guess …

[Diagram: error rate 5%; improvement +5% on a 10% baseline; runtime 53 K visitors. Inversely related.]

Page 50: Optimizely Workshop: Take Action on Results with Statistics

… but after 1 day …

Data! How can we update the tradeoffs?

Page 51: Optimizely Workshop: Take Action on Results with Statistics

1. Adjust your timeline

Page 52: Optimizely Workshop: Take Action on Results with Statistics

Improvement turns out to be better …

Instead of: 53 K - 10 K = 43 K

[Diagram: error rate 5%, runtime 1,600; improvement +13% on a 10% baseline.]

Page 53: Optimizely Workshop: Take Action on Results with Statistics

… or worse.

[Diagram: error rate 5%, runtime 75 K; improvement +2% on an 8% baseline.]

Page 54: Optimizely Workshop: Take Action on Results with Statistics

2. Accept higher / lower error rate

Page 55: Optimizely Workshop: Take Action on Results with Statistics

Improvement turns out to be better …

[Diagram: error rate 1%, runtime 43 K; improvement +13% on a 10% baseline.]

Page 56: Optimizely Workshop: Take Action on Results with Statistics

… or worse.

[Diagram: error rate 30%, runtime 43 K; improvement +2% on an 8% baseline.]

Page 57: Optimizely Workshop: Take Action on Results with Statistics

3. Admit it. It’s inconclusive.

Page 58: Optimizely Workshop: Take Action on Results with Statistics

… or a lot worse.

[Diagram: error rate > 99%, runtime > 100 K; improvement +0.2% on an 8% baseline.]

Iterate, iterate, iterate!

Page 59: Optimizely Workshop: Take Action on Results with Statistics

Your experiments will not always have the same improvement over time.

So, run A/B Tests for at least a business cycle appropriate for that test and your company.

Seasonality & Time Variation

Page 60: Optimizely Workshop: Take Action on Results with Statistics

1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?

2. What are the three tradeoffs in an A/B Test? And how are they related?

3. How can you use Optimizely’s results page to best tune the tradeoffs to achieve your experimentation goals?

After this workshop, you should be able to answer …

Page 61: Optimizely Workshop: Take Action on Results with Statistics

1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?

2. What are the three tradeoffs in one A/B Test?

3. How can you use Optimizely’s results page to best tune the tradeoffs to achieve your experimentation goals?

A. Adjust your timeline. Accept higher / lower error rate. Admit an inconclusive result.

After this workshop, you should be able to answer …

Page 62: Optimizely Workshop: Take Action on Results with Statistics

1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?

A. Peeking and mistaking “False Positive Rate” for “Chance of a Wrong Answer.”

2. What are the three tradeoffs in one A/B Test?

B. Error Rates, Runtime, and Effect Size. They are all negatively related.

3. How can you use Optimizely’s results page to best tune the tradeoffs to achieve your experimentation goals?

C. Accept higher / lower error rate. Adjust your timeline. Admit an inconclusive result.


Review

Page 63: Optimizely Workshop: Take Action on Results with Statistics

Preview: How many goals and variations should I use?

Page 64: Optimizely Workshop: Take Action on Results with Statistics

Stats Engine is more conservative when there are more goals that are not affected by a variation. So, adding a lot of “random” goals will slow down your experiment.

Page 65: Optimizely Workshop: Take Action on Results with Statistics

Tips & Tricks for using Stats Engine with multiple goals and variations

• Ask: Which goal is most important to me?
  - This should be the primary goal (not impacted by all other goals).

• Run large tests, or large multivariate tests, without fear of finding spurious results, but be prepared for the cost of exploration.

• For maximum velocity, only test goals and variations that you believe will have the highest impact.

Page 66: Optimizely Workshop: Take Action on Results with Statistics
Page 67: Optimizely Workshop: Take Action on Results with Statistics

1. Which two A/B Testing pitfalls inflate error rates when using classical statistics, and are avoided with Stats Engine?

A. Peeking and mistaking “False Positive Rate” for “Chance of a Wrong Answer.”

2. What are the three tradeoffs in one A/B Test?

B. Error Rates, Runtime, and Effect Size. They are all negatively related.

3. How can you use Optimizely’s results page to best tune the tradeoffs to achieve your experimentation goals?

C. Accept higher / lower error rate. Adjust your timeline. Admit an inconclusive result.


Review