TRANSCRIPT
TRANSLATING RESEARCH INTO ACTION
Sample size calculations for randomized evaluations
Rebecca Thornton Assistant Professor of Economics
University of Michigan
povertyactionlab.org
1. Background: The basics
2. Getting more complicated: Clusters
3. How to do this in practice
Outline
• Interviews are expensive and you have a budget
• You do not want to be disappointed that you didn’t have a large enough sample
• If you understand the basics of sample size, there are lots of things you can do to increase your power
• You are spending a lot of money and time on this evaluation
Why care?
• General question: how large does the sample need to be to credibly detect a given effect size? (i.e., a certain effect of a program)
• What does “credibly” mean here?
It means we can be reasonably sure that the difference between the control and treatment group is due to the treatment and not just to chance
Today’s Question
• Two important issues about sample size
1. Larger sample helps to ensure that the treatment and control groups are balanced (on observables and unobservables)
Helps prevent a biased estimate
2. Can detect a significant difference in outcomes between the treatment and control groups
Helps to detect a significant estimate
Sample Size
To start (and to finish)
• Doing sample size calculations is a craft
• The values estimated depend on parameters whose values are unknown and will vary.
– Power calculations involve some guess work.
– Vary across outcomes!
Basic set up
• At the end of an experiment, we compare the average outcome of interest in the treatment group with the average outcome of interest in the control group
• We are interested in the difference: Mean (treatment) - Mean (control) = Effect (size)
• Example: Want to know the effect of giving out textbooks on test scores. You have the scores of treatment students (with books) and control students (without books)
Simple Example
[Figure: bar chart of average test scores (scale 60 to 90) for the No Books and Books groups]
• Subtract the average of the Control from the average of the Treatment
• Run a regression of the outcome (Y) on an indicator of being in the Treatment group: Y = a + bT
Effect of the program
Simple Example
[Figure: the same bar chart of average test scores for the No Books and Books groups, now showing the fitted line]
• Effect size: difference in means, or the slope of the line b in Y = a + bT
• Here Y = 70 + 10*T
• Treatment Effect = 10 points – how confident am I that this is not just chance (i.e., that there really is an effect)?
– * 10 percent chance that there is really no effect
– ** 5 percent chance
– *** 1 percent chance
Effect of the program
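A minimal sketch of this step (my own illustration, not from the lecture; the simulated data, the 10-point effect, and the standard deviation of 15 are assumptions) showing that the difference in means equals the slope b on a 0/1 treatment dummy, together with the significance test behind the stars:

```python
# A minimal sketch (illustrative): estimating b in Y = a + b*T with simulated
# textbook data. The true effect (10 points) and SD (15) are assumed values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
T = rng.integers(0, 2, size=n)                # 1 = received textbooks, 0 = control
Y = 70 + 10 * T + rng.normal(0, 15, size=n)   # outcome: test scores

# With a 0/1 regressor, the OLS slope b equals the difference in means
b = Y[T == 1].mean() - Y[T == 0].mean()
t_stat, p_value = stats.ttest_ind(Y[T == 1], Y[T == 0])
print(f"Estimated effect b = {b:.1f} points, p-value = {p_value:.3f}")
```

A p-value below 0.10, 0.05, or 0.01 corresponds to the *, **, or *** levels above.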
• Is the estimate of b biased?
– Discussed in previous lectures
– Depends on the validity of the randomization and mitigation of other threats
• How precise is the estimate of b?
– Did this difference happen just by chance? How confident am I that there is a true effect of my program?
– Depends on the sample size, the variability of the outcome variable (Y), and the actual effect of the program
• Accuracy vs. Precision
Back to the main questions…
Accuracy versus Precision
[Figure: diagram contrasting accuracy with precision]
Unbiased and sample size
[Figure: the same diagram, relabeled in terms of unbiasedness and sample size]
• When we do survey research and estimate treatment effects…
– Randomization helps us to be accurate (unbiased)
– Sample size allows us to be precise (confident about our estimates)
• Both are independently important
– A large sample may give a precise estimate, but not an accurate one
– Randomization without a large enough sample will allow us to estimate the unbiased effect (accuracy), but we might not be very confident about it
Estimation
• Impact evaluation involves the scientific method
– 1) propose a hypothesis
– 2) design the experiment to test that hypothesis
• How do we test hypotheses?
– We start with a hypothesis (i.e., there will be an effect of the program)
– At the end of the experiment, we test our hypothesis
Scientific method
• In criminal law, most institutions follow the rule: “innocent until proven guilty”
• The presumption is that the accused is innocent and the burden is on the prosecutor to show guilt
– The jury or judge starts with the “null hypothesis” that the accused person is innocent
– The prosecutor has a hypothesis that the accused person is guilty
Hypothesis testing
• In program evaluation, instead of “presumption of innocence,” the rule is: “presumption of insignificance”
• The “Null hypothesis” (H0) is that there was no (zero) impact of the program
• The burden of proof is on the evaluator to show a significant effect of the program
Hypothesis testing
• If our measurements show a difference between the treatment and control group we know:
– There is some difference between the treatment and the control…
– But our presumption is that there is no impact of the program (our H0 is still true)
– It might be that the difference is solely due to chance (random sampling error)
• We need to use statistics to calculate how likely it is that this difference is in fact due to random chance
Hypothesis testing
• Let's say the sample size is 2…
Extreme Example
Is this difference due to random chance? Perhaps…
Less extreme: Is this difference due to random chance?
[Figure: samples of Control and Treatment observations with clearly separated groups]
Probably not…
• Using statistics, if we find that it is very unlikely (say, less than a 5% probability) that the difference is solely due to chance:
– We "reject our null hypothesis"
– We may now say: "our program has a statistically significant impact"
• Are we now 100 percent certain there is an impact?
– No, we may be only 95% confident; and we accept that if we use this threshold, we may be wrong 5% of the time
Hypothesis testing: conclusions
• What if we can't reject our null hypothesis?
– Does that mean we can be 100% certain there is no impact?
– No, it just didn’t meet the statistical threshold to conclude otherwise
Hypothesis testing: conclusions
• Possibility #1: There is an impact
– Could detect it – have enough statistical power
– Could not detect it – do not have enough power
• Possibility #2: There is no impact
– Conclude there was no impact
– Conclude there was an impact
Two possibilities
                          YOU CONCLUDE
                          Effective        No Effect
THE TRUTH   Effective     (correct)        Type II Error
            No Effect     Type I Error     (correct)
Hypothesis testing
                          YOU CONCLUDE
                          Effective                                   No Effect
THE TRUTH   Effective     (correct)                                   Type II Error
            No Effect     Type I Error (probability = sig level)      (correct)
Hypothesis testing
Significance Level: set it to a level that you are comfortable with. With a level of 5%, you can be 95% confident in your conclusion that there is an effect. For policy purposes, you want to be very confident in the answer you give, so the level will be set fairly low. Related to Type I error.
                          YOU CONCLUDE
                          Effective                          No Effect
THE TRUTH   Effective     (correct; probability = power)     Type II Error
            No Effect     Type I Error                       (correct)
Hypothesis testing
Power: how frequently we will detect effective programs. Type II error results from low power.
1. Variance – the more "noisy" it is to start with, the harder it is to measure effects
2. Effect Size to be detected – the smaller the effect size we want to detect, the larger the sample we need
3. Sample Size – the more children we sample, the more likely we are to obtain the true difference
Power: main ingredients
Variance
Low Standard Deviation
[Figure: frequency distributions of the outcome with means 50 and 60; with a low standard deviation the two distributions barely overlap]
Less Precision
Medium Standard Deviation
[Figure: the same two distributions with a medium standard deviation; more overlap]
Even less precise
High Standard Deviation
[Figure: the same two distributions with a high standard deviation; they largely overlap]
• Variance depends first on your outcome variable: which outcome you want to measure
• Must be calculated separately for each outcome
• What can help increase power? You can "absorb" variance by:
– using a baseline
– controlling for other variables
– doing a pilot to measure the outcome variables (field testing)
Variance
1. Variance – the more "noisy" it is to start with, the harder it is to measure effects
2. Effect Size to be detected – the smaller the effect size we want to detect, the larger the sample we need
3. Sample Size – the more children we sample, the more likely we are to obtain the true difference
Power: main ingredients
[Figure: control and treatment distributions whose means are 1 standard deviation apart; substantial overlap]
Effect Size: 1 “standard deviation”
Effect Size: 3 standard deviations
[Figure: control and treatment distributions whose means are 3 standard deviations apart; little overlap]
The less overlap the better… (easier to detect a difference)
• What effect do you think the program will have?
• What is the smallest effect that you would like to be able to detect with confidence?
Effect Size
DO NOT USE: “Expected” effect size
• First start with the question: how big an effect do I think the program will have?
– This is usually large… I like the program; why else implement it?
– But if we overestimate the effect size, we overestimate the power that we will have, and our sample size may be too small
• Be conservative
– What is the smallest effect size that would justify implementing the program?
"Choosing" an effect size
• Different effect sizes for different outcome variables
• Also depends on how variable the outcome is
• How to standardize effect sizes across outcomes?
– The standardized effect size is the effect size divided by the standard deviation of the outcome: (Mean Treatment – Mean Control)/SD
• Common standardized effect sizes
"Choosing" an effect size
An effect size of…   Is considered…   …and it means that…
0.2                  Modest           The average member of the treatment group had a better outcome than the 58th percentile of the control group
0.5                  Large            The average member of the treatment group had a better outcome than the 69th percentile of the control group
0.8                  VERY large       The average member of the treatment group had a better outcome than the 79th percentile of the control group
Standardized effect size
Really? A common danger: picking an effect size that is too large! Calculate!
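As a rough check on the percentile interpretation in the table, here is a small sketch of my own (the test-score means and standard deviation are assumed numbers, not from the lecture):

```python
# A small sketch (illustrative): converting a raw difference into a standardized
# effect size and recovering the percentile interpretation from the table above.
from scipy.stats import norm

treatment_mean, control_mean, sd = 73.0, 70.0, 15.0    # assumed test-score values
d = (treatment_mean - control_mean) / sd               # standardized effect size
percentile = norm.cdf(d) * 100                         # control-group percentile matched
print(f"d = {d:.2f}: average treated student beats ~{percentile:.0f}% of the control group")
```

With these assumed numbers, d = 0.2 and the average treated student does better than roughly the 58th percentile of the control group, as in the table.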
1. Variance – the more "noisy" it is to start with, the harder it is to measure effects
2. Effect Size to be detected – the smaller the effect size we want to detect, the larger the sample we need
3. Sample Size – the more children we sample, the more likely we are to obtain the true difference
Power: main ingredients
[Figure: full distributions of test scores (0–100) for control and treatment, with the control mean and treatment mean marked]
Average difference: 6 points
We only observe a random sample of the students
Say that we have a sample of 1 observation drawn from this distribution of data…
Sample size = 1
[Figure: sampling distributions of the control and treatment means with N=1, very wide and heavily overlapping]
Sample size = 4
[Figure: sampling distributions with N=4, somewhat narrower]
Sample size = 9
[Figure: sampling distributions with N=9, narrower still]
Sample size = 100
[Figure: sampling distributions with N=100, clearly separated around the control and treatment means]
Sample size = 6,000
[Figure: sampling distributions with N=6,000, extremely tight around the two means]
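The narrowing in these figures can be reproduced with a short simulation. This is my own sketch (the mean of 70 and SD of 15 are assumptions, not the lecture's data):

```python
# A small sketch (illustrative): the sampling distribution of a group mean
# tightens as N grows, which is why a fixed gap becomes easier to detect.
import numpy as np

rng = np.random.default_rng(1)
mu, sd = 70.0, 15.0                              # assumed test-score mean and SD
for n in (1, 4, 9, 100, 6000):
    sample_means = rng.normal(mu, sd, size=(2000, n)).mean(axis=1)
    print(f"N = {n:>5}: spread (SD) of the sample mean ~ {sample_means.std():.2f}")
```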
• What is a good level of power?
• A power of 80% tells us that, in 80% of the experiments of this sample size conducted in this population, if the null hypothesis is in fact false (e.g. there is a treatment effect), we will be able to reject it. In other words, 80% of the time we will be able to measure an effect.
• 20% of the time I will be disappointed
• Commonly used power levels: 80%, 90%
• But I don’t like to be disappointed 20% of the time
Power: What level do I want?
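For intuition, here is a minimal sketch of my own (using the standard normal approximation for a two-sided test of a difference in means; the sample sizes and the 0.2 effect size below are assumptions):

```python
# A minimal sketch (illustrative): approximate power for comparing two equal-sized
# groups, with the effect expressed in standard-deviation units.
from scipy.stats import norm

def power(n_per_arm, effect_size_sd, alpha=0.05):
    se = (2 / n_per_arm) ** 0.5          # SE of the difference, in SD units
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - effect_size_sd / se)

print(round(power(100, 0.2), 2))   # roughly 0.29: underpowered
print(round(power(400, 0.2), 2))   # roughly 0.81: about the usual 80% target
```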
1. Background: The basics
2. Getting more complicated: Clusters
3. How to do this in practice
Outline
• Up to now, we have been assuming randomization at the individual level
• But often, we may want to randomize at a higher group level – Village
– School
– District
• In that case, groups are randomized and individuals within each treatment or control group all get the same treatment
Individual vs. Group design
• Minimize or remove contamination across individuals – Example: Deworming, information campaigns
• More feasible
• Only natural choice
– Example: any education intervention that affects an entire classroom (e.g. flipcharts, teacher training)
• Why not? Expense (linked with power)
Reason for cluster randomization
• If the treatment is randomized at a group level, you need more observations
• Why? The observations (i.e., individuals) are not independent of each other
– All villagers are exposed to the same weather
– All districts share a common history
– All students share a schoolmaster
• The more correlation between the outcomes within a group, the larger the sample you need
• A value called r (rho) measures this correlation
Impact of Group-level randomization
• Like a percentage, r must be between 0 and 1
• Higher values mean that outcomes within your clusters are more correlated (bad for power); a lower r is more desirable
• It is sometimes low (0, 0.05, 0.08), but can be high (0.62)
Values of r (rho)
Location           Outcome            rho
Madagascar         Math + Language    0.50
Busia, Kenya       Math + Language    0.22
Udaipur, India     Math + Language    0.23
Mumbai, India      Math + Language    0.29
Vadodara, India    Math + Language    0.28
Busia, Kenya       Math               0.62
• Where do I find *my* rho?
– Use data
– Ask other researchers
– Be conservative and use a high value
Values of r (rho)
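If you do have pilot or survey data, one common way to get a rough rho is a one-way ANOVA estimate of the intra-cluster correlation. A sketch of my own (the function name is hypothetical and the formula uses a balanced-cluster approximation):

```python
# A rough sketch (illustrative): ANOVA-based estimate of the intra-cluster
# correlation rho from pilot data. Works best with roughly equal cluster sizes.
import numpy as np

def estimate_rho(values, clusters):
    values = np.asarray(values, dtype=float)
    clusters = np.asarray(clusters)
    groups = [values[clusters == c] for c in np.unique(clusters)]
    k = len(groups)                                   # number of clusters
    n_bar = np.mean([len(g) for g in groups])         # average cluster size
    grand_mean = values.mean()
    msb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (len(values) - k)
    return (msb - msw) / (msb + (n_bar - 1) * msw)

# e.g. estimate_rho(test_scores, school_ids) on hypothetical pilot data
```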
Impact of r (rho) on sample size?
• Design effect = (sample size needed with clustering) / (sample size needed without clustering)
• Design effect = 1 + (n - 1) * rho, where n is the number of respondents per cluster
– If there is only one respondent per cluster, rho doesn't matter
– The larger rho is, the bigger the design effect
– The larger the cluster size, the larger the effect of rho

                 group size (n)
rho        10      50     100     200
0.02     1.18    1.98    2.98    4.98
0.05     1.45    3.45    5.95   10.95
0.10     1.90    5.90   10.90   20.90
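A small sketch of the arithmetic behind this table (my own illustration; the 800-person individually-randomized benchmark is an assumed number):

```python
# A small sketch (illustrative): the design effect 1 + (n - 1) * rho and the
# extra sample it implies when randomizing clusters of 50 respondents.
def design_effect(cluster_size, rho):
    return 1 + (cluster_size - 1) * rho

n_individual = 800    # assumed sample size needed under individual randomization
for rho in (0.02, 0.05, 0.10):
    deff = design_effect(cluster_size=50, rho=rho)
    print(f"rho = {rho:.2f}: design effect {deff:.2f}, "
          f"need about {round(n_individual * deff)} clustered observations")
```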
• If experimental design is clustered, we now need to consider rho when choosing a sample size
• It is extremely important to randomize an adequate number of groups
• Often the number of individuals within groups matters less than the total number of groups
Implications
1. Background: The basics
2. Getting more complicated: Clusters
3. How to do this in practice
Outline
• Two approaches:
• Approach one: Given budget constraints or logistics, you are given the maximum possible sample size. With your estimated effect size, will you have enough power such that it is worthwhile pursuing the project?
• Approach two: Set the power equal to some acceptable number. Given the estimated effect size, what is the sample required to obtain that power?
How to do “power calculations”?
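Approach two can be sketched with the usual normal-approximation formula. This is my own illustration (individual-level randomization, with the 0.20 effect size assumed):

```python
# A minimal sketch (illustrative): sample size per arm needed for a target power,
# assuming individual-level randomization and a standardized effect size d.
from scipy.stats import norm

def n_per_arm(d, power=0.80, alpha=0.05):
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

print(round(n_per_arm(0.20)))   # roughly 390-400 per arm for d = 0.2 at 80% power
```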
• You plug in some numbers…
• The software will graph either (corresponding to the two approaches above):
– Approach 1: power vs. effect size
– Approach 2: power vs. number of observations
• Follow the graph to find the number of observations or the effect size that gives you ~0.90 power
Power calculations using OD software
Power Calculations using the OD software
• Choose "Power vs number of clusters" in the menu "clustered randomized trials"
Cluster Size (if no clusters)
• If you have no clusters, choose a cluster size of 1 unit… this is a bit confusing
Choose Significance Level and Standardized Effect Size
• Pick α (the significance level)
– Normally you pick 0.05
• Pick δ (the standardized effect size)
– Can experiment with 0.20
• You obtain the resulting graph showing power as a function of sample size.
Power and Sample Size
[Figures from the OD software: power plotted against sample size]
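If you do not have the OD software at hand, the shape of these graphs can be approximated with a short calculation. This is a rough stand-in of my own, not the OD software's exact method (it combines the design effect with a normal approximation; the cluster size of 30, rho of 0.2, and effect size of 0.2 are assumptions):

```python
# A rough stand-in (illustrative, not the OD software's algorithm): power as a
# function of the number of clusters, using the design effect and a normal
# approximation. All parameter values below are assumptions.
from scipy.stats import norm

def clustered_power(n_clusters, cluster_size, rho, d, alpha=0.05):
    deff = 1 + (cluster_size - 1) * rho          # design effect
    n_effective = n_clusters * cluster_size / deff
    se = 2 / n_effective ** 0.5                  # SE of the difference, in SD units
    return 1 - norm.cdf(norm.ppf(1 - alpha / 2) - d / se)

for k in (20, 40, 80, 160):
    print(f"{k} clusters: power ~ {clustered_power(k, 30, 0.2, 0.2):.2f}")
```

Adding clusters raises power much faster than adding respondents within clusters, consistent with the earlier point that the number of groups usually matters more than group size.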
Availability of a Baseline
• A baseline has three main uses:
– Check whether the control and treatment groups are the same before the treatment
– Reduce the sample size needed (use the baseline as a control variable)
– Interactions and subgroups
• To compute power with a baseline:
– Need to know the correlation between the two outcome measures (baseline and endline)
– The stronger the correlation, the bigger the gain
– Very big gains for very persistent outcomes such as test scores
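A small sketch of why the correlation matters (my own illustration; the (1 - r²) shrinkage is the standard approximation for controlling for a baseline covariate, and the correlations below are assumed values):

```python
# A small sketch (illustrative): controlling for a baseline measure with
# baseline-endline correlation r cuts residual variance by roughly (1 - r**2),
# which translates into a roughly proportional cut in the required sample size.
for r in (0.0, 0.3, 0.5, 0.8):
    factor = 1 - r ** 2
    print(f"correlation {r:.1f}: need about {factor:.0%} of the no-baseline sample size")
```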
Stratified Samples
• Stratification reduces the sample size needed to achieve a given power
• Why?
– Reduces the variance of the outcome of interest within each stratum
– Reduces the correlation of units within clusters
• Example: if you randomize within school and grade which class is treated and which class is control:
– The variance of test scores goes down
– The within-cluster correlation goes down
• Common stratification variables:
– Baseline values of the outcomes, when possible
– Variables across whose subgroups we expect the treatment effect to vary
Other considerations
• Are you interested in the difference between two treatments?
• Are you interested in testing whether the effect is different in different subpopulations?
• Will there be attrition?
Conclusions
• Sample size calculations are a craft
• Calculations depend on parameters whose values are unknown and will vary.
– Power calculations involve some guess work.
– Involve pilot testing
– Vary across outcomes!