TRANSCRIPT
TRANSLATING RESEARCH INTO ACTION
Sample size calculations for randomized evaluations
Rebecca Thornton Assistant Professor of Economics
University of Michigan
povertyactionlab.org
1. Background: The basics
2. Getting more complicated: Clusters
3. How to do this in practice
Outline
• Interviews are expensive and you have a budget
• You do not want to be disappointed that you didn’t have a large enough sample
• If you understand the basics of sample size, there are lots of things you can do to increase your power
• You are spending a lot of money and time on this evaluation
Why care?
• General question: how large does the sample need to be to credibly detect a given effect size? (i.e., a certain effect of a program)
• What does “credibly” mean here?
It means we can be reasonably sure that the difference between the control and treatment group is due to the treatment and not just to chance
Today’s Question
• Two important issues about sample size
1. Larger sample helps to ensure that the treatment and control groups are balanced (on observables and unobservables)
Helps prevent a biased estimate
2. Can detect a significant difference in outcomes between the treatment and control groups
Helps to detect a significant estimate
Sample Size
To start (and to finish)
• Doing sample size calculations is a craft
• The values estimated depend on parameters whose values are unknown and will vary.
– Power calculations involve some guess work.
– Vary across outcomes!
Basic set up
• At the end of an experiment, we compare the average outcome of interest in the treatment group with the average outcome of interest in the control group
• We are interested in the difference: Mean (treatment) - Mean (control) = Effect (size)
• Example: Want to know the effect of giving out textbooks on test scores. You have the scores of treatment students (with books) and control students (without books)
Simple Example
[Figure: bar chart of average test scores (scale 60 to 90) for the No Books and Books groups]
• Subtract the average of the Control from the average of the Treatment
• Run a regression of the outcome (Y) on an indicator of being in the Treatment group: Y = a + bT
Effect of the program
Simple Example
[Figure: the same bar chart of average test scores for the No Books and Books groups, now showing the fitted line]
• Effect size: difference in means, or the slope of the line b in Y = a + bT
• Here Y = 70 + 10*T
• Treatment Effect = 10 points – how confident am I that this is not just chance (i.e., that there really is an effect)?
– * 10 percent chance that there is really no effect
– ** 5 percent chance
– *** 1 percent chance
Effect of the program
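A minimal sketch of this step (my own illustration, not from the lecture; the simulated data, the 10-point effect, and the standard deviation of 15 are assumptions) showing that the difference in means equals the slope b on a 0/1 treatment dummy, together with the significance test behind the stars:

```python
# A minimal sketch (illustrative): estimating b in Y = a + b*T with simulated
# textbook data. The true effect (10 points) and SD (15) are assumed values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
T = rng.integers(0, 2, size=n)                # 1 = received textbooks, 0 = control
Y = 70 + 10 * T + rng.normal(0, 15, size=n)   # outcome: test scores

# With a 0/1 regressor, the OLS slope b equals the difference in means
b = Y[T == 1].mean() - Y[T == 0].mean()
t_stat, p_value = stats.ttest_ind(Y[T == 1], Y[T == 0])
print(f"Estimated effect b = {b:.1f} points, p-value = {p_value:.3f}")
```

A p-value below 0.10, 0.05, or 0.01 corresponds to the *, **, or *** levels above.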
• Is the estimate of b biased?
– Discussed in previous lectures
– Depends on the validity of the randomization and mitigation of other threats
• How precise is the estimate of b?
– Did this difference happen just by chance? How confident am I that there is a true effect of my program?
– Depends on the sample size, the variability of the outcome variable (Y), and the actual effect of the program
• Accuracy vs. Precision
Back to the main questions…
Accuracy versus Precision
[Figure: diagram contrasting accuracy with precision]
Unbiased and sample size
[Figure: the same diagram, relabeled in terms of unbiasedness and sample size]
• When we do survey research and estimate treatment effects…
– Randomization helps us to be accurate (unbiased)
– Sample size allows us to be precise (confident about our estimates)
• Both are independently important
– A large sample may give a precise estimate, but not an accurate one
– Randomization without a large enough sample will allow us to estimate the unbiased effect (accuracy), but we might not be very confident about it
Estimation
• Impact evaluation involves the scientific method
– 1) propose a hypothesis
– 2) design the experiment to test that hypothesis
• How do we test hypotheses?
– We start with a hypothesis (i.e., there will be an effect of the program)
– At the end of the experiment, we test our hypothesis
Scientific method
• In criminal law, most institutions follow the rule: “innocent until proven guilty”
• The presumption is that the accused is innocent and the burden is on the prosecutor to show guilt
– The jury or judge starts with the “null hypothesis” that the accused person is innocent
– The prosecutor has a hypothesis that the accused person is guilty
Hypothesis testing
• In program evaluation, instead of “presumption of innocence,” the rule is: “presumption of insignificance”
• The “Null hypothesis” (H0) is that there was no (zero) impact of the program
• The burden of proof is on the evaluator to show a significant effect of the program
Hypothesis testing
• If our measurements show a difference between the treatment and control group we know:
– There is some difference between the treatment and the control…
– But our presumption is that there is no impact of the program (our H0 is still true)
– It might be that the difference is solely due to chance (random sampling error)
• We need to use statistics to calculate how likely it is that this difference is in fact due to random chance
Hypothesis testing
• Let's say the sample size is 2…
Extreme Example
Is this difference due to random chance? Perhaps…
Less extreme: Is this difference due to random chance?
[Figure: samples of Control and Treatment observations with clearly separated groups]
Probably not…
• Using statistics, if we find that it is very unlikely (say, less than a 5% probability) that the difference is solely due to chance:
– We "reject our null hypothesis"
– We may now say: "our program has a statistically significant impact"
• Are we now 100 percent certain there is an impact?
– No, we may be only 95% confident; and we accept that if we use this threshold, we may be wrong 5% of the time
Hypothesis testing: conclusions
• What if we can't reject our null hypothesis?
– Does that mean we can be 100% certain there is no impact?
– No, it just didn’t meet the statistical threshold to conclude otherwise
Hypothesis testing: conclusions
• Possibility #1: There is an impact
– Could detect it – have enough statistical power
– Could not detect it – do not have enough power
• Possibility #2: There is no impact
– Conclude there was no impact
– Conclude there was an impact
Two possibilities
                          YOU CONCLUDE
                          Effective        No Effect
THE TRUTH   Effective     (correct)        Type II Error
            No Effect     Type I Error     (correct)
Hypothesis testing
                          YOU CONCLUDE
                          Effective                                   No Effect
THE TRUTH   Effective     (correct)                                   Type II Error
            No Effect     Type I Error (probability = sig level)      (correct)
Hypothesis testing
Significance Level: set it to a level that you are comfortable with. With a level of 5%, you can be 95% confident in your conclusion that there is an effect. For policy purposes, you want to be very confident in the answer you give, so the level will be set fairly low. Related to Type I error.
                          YOU CONCLUDE
                          Effective                          No Effect
THE TRUTH   Effective     (correct; probability = power)     Type II Error
            No Effect     Type I Error                       (correct)
Hypothesis testing
Power: how frequently we will detect effective programs. Type II error results from low power.
1. Variance – the more "noisy" it is to start with, the harder it is to measure effects
2. Effect Size to be detected – the smaller the effect size we want to detect, the larger the sample we need
3. Sample Size – the more children we sample, the more likely we are to obtain the true difference
Power: main ingredients
Variance
Low Standard Deviation
[Figure: frequency distributions of the outcome with means 50 and 60; with a low standard deviation the two distributions barely overlap]
Less Precision
Medium Standard Deviation
[Figure: the same two distributions with a medium standard deviation; more overlap]
Even less precise
High Standard Deviation
[Figure: the same two distributions with a high standard deviation; they largely overlap]
• Variance depends first on your outcome variable: which outcome you want to measure
• Must be calculated separately for each outcome
• What can help increase power? You can "absorb" variance by:
– using a baseline
– controlling for other variables
– doing a pilot to measure the outcome variables (field testing)
Variance
1. Variance – the more "noisy" it is to start with, the harder it is to measure effects
2. Effect Size to be detected – the smaller the effect size we want to detect, the larger the sample we need
3. Sample Size – the more children we sample, the more likely we are to obtain the true difference
Power: main ingredients
[Figure: control and treatment distributions whose means are 1 standard deviation apart; substantial overlap]
Effect Size: 1 “standard deviation”
Effect Size: 3 standard deviations
[Figure: control and treatment distributions whose means are 3 standard deviations apart; little overlap]
The less overlap the better… (easier to detect a difference)
• What effect do you think the program will have?
• What is the smallest effect that you would like to be able to detect with confidence?
Effect Size
DO NOT USE: “Expected” effect size
• First start with the question: how big an effect do I think the program will have?
– This is usually large… I like the program; why else implement it?
– But if we overestimate the effect size, we overestimate the power that we will have, and our sample size may be too small
• Be conservative
– What is the smallest effect size that would justify implementing the program?
"Choosing" an effect size
• Different effect sizes for different outcome variables
• Also depends on how variable the outcome is
• How to standardize effect sizes across outcomes?
– The standardized effect size is the effect size divided by the standard deviation of the outcome: (Mean Treatment – Mean Control)/SD
• Common standardized effect sizes
"Choosing" an effect size
An effect size of…   Is considered…   …and it means that…
0.2                  Modest           The average member of the treatment group had a better outcome than the 58th percentile of the control group
0.5                  Large            The average member of the treatment group had a better outcome than the 69th percentile of the control group
0.8                  VERY large       The average member of the treatment group had a better outcome than the 79th percentile of the control group
Standardized effect size
Really? A common danger: picking an effect size that is too large! Calculate!
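As a rough check on the percentile interpretation in the table, here is a small sketch of my own (the test-score means and standard deviation are assumed numbers, not from the lecture):

```python
# A small sketch (illustrative): converting a raw difference into a standardized
# effect size and recovering the percentile interpretation from the table above.
from scipy.stats import norm

treatment_mean, control_mean, sd = 73.0, 70.0, 15.0    # assumed test-score values
d = (treatment_mean - control_mean) / sd               # standardized effect size
percentile = norm.cdf(d) * 100                         # control-group percentile matched
print(f"d = {d:.2f}: average treated student beats ~{percentile:.0f}% of the control group")
```

With these assumed numbers, d = 0.2 and the average treated student does better than roughly the 58th percentile of the control group, as in the table.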
1. Variance – the more "noisy" it is to start with, the harder it is to measure effects
2. Effect Size to be detected – the smaller the effect size we want to detect, the larger the sample we need
3. Sample Size – the more children we sample, the more likely we are to obtain the true difference
Power: main ingredients
[Figure: full distributions of test scores (0–100) for control and treatment, with the control mean and treatment mean marked]
Average difference: 6 points
We only observe a random sample of the students
Say that we have a sample of 1 observation drawn from this distribution of data…
Sample size = 1
[Figure: sampling distributions of the control and treatment means with N=1, very wide and heavily overlapping]
Sample size = 4
[Figure: sampling distributions with N=4, somewhat narrower]
Sample size = 9
[Figure: sampling distributions with N=9, narrower still]
Sample size = 100
[Figure: sampling distributions with N=100, clearly separated around the control and treatment means]
Sample size = 6,000
[Figure: sampling distributions with N=6,000, extremely tight around the two means]
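The narrowing in these figures can be reproduced with a short simulation. This is my own sketch (the mean of 70 and SD of 15 are assumptions, not the lecture's data):

```python
# A small sketch (illustrative): the sampling distribution of a group mean
# tightens as N grows, which is why a fixed gap becomes easier to detect.
import numpy as np

rng = np.random.default_rng(1)
mu, sd = 70.0, 15.0                              # assumed test-score mean and SD
for n in (1, 4, 9, 100, 6000):
    sample_means = rng.normal(mu, sd, size=(2000, n)).mean(axis=1)
    print(f"N = {n:>5}: spread (SD) of the sample mean ~ {sample_means.std():.2f}")
```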
• What is a good level of power?
• A power of 80% tells us that, in 80% of the experiments of this sample size conducted in this population, if the null hypothesis is in fact false (e.g. there is a treatment effect), we will be able to reject it. In other words, 80% of the time we will be able to measure an effect.
• 20% of the time I will be disappointed
• Commonly used power levels: 80%, 90%
• But I don’t like to be disappointed 20% of the time
Power: What level do I want?
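For intuition, here is a minimal sketch of my own (using the standard normal approximation for a two-sided test of a difference in means; the sample sizes and the 0.2 effect size below are assumptions):

```python
# A minimal sketch (illustrative): approximate power for comparing two equal-sized
# groups, with the effect expressed in standard-deviation units.
from scipy.stats import norm

def power(n_per_arm, effect_size_sd, alpha=0.05):
    se = (2 / n_per_arm) ** 0.5          # SE of the difference, in SD units
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - effect_size_sd / se)

print(round(power(100, 0.2), 2))   # roughly 0.29: underpowered
print(round(power(400, 0.2), 2))   # roughly 0.81: about the usual 80% target
```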
1. Background: The basics
2. Getting more complicated: Clusters
3. How to do this in practice
Outline
• Up to now, we have been assuming randomization at the individual level
• But often, we may want to randomize at a higher group level – Village
– School
– District
• In that case, groups are randomized and individuals within each treatment or control group all get the same treatment
Individual vs. Group design
• Minimize or remove contamination across individuals – Example: Deworming, information campaigns
• More feasible
• Only natural choice
– Example: any education intervention that affects an entire classroom (e.g. flipcharts, teacher training)
• Why not? Expense (linked with power)
Reason for cluster randomization
• If the treatment is randomized at a group level, you need more observations
• Why? The observations (i.e., individuals) are not independent of each other
– All villagers are exposed to the same weather
– All districts share a common history
– All students share a schoolmaster
• The more correlation between the outcomes within a group, the larger the sample you need
• A value called r (rho) measures this correlation
Impact of Group-level randomization
• Like a percentage, r must be between 0 and 1
• Higher values mean that outcomes within your clusters are more correlated (bad for power); a lower r is more desirable
• It is sometimes low (0, 0.05, 0.08), but can be high (0.62)
Values of r (rho)
Location           Outcome            rho
Madagascar         Math + Language    0.50
Busia, Kenya       Math + Language    0.22
Udaipur, India     Math + Language    0.23
Mumbai, India      Math + Language    0.29
Vadodara, India    Math + Language    0.28
Busia, Kenya       Math               0.62
• Where do I find *my* rho?
– Use data
– Ask other researchers
– Be conservative and use a high value
Values of r (rho)
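If you do have pilot or survey data, one common way to get a rough rho is a one-way ANOVA estimate of the intra-cluster correlation. A sketch of my own (the function name is hypothetical and the formula uses a balanced-cluster approximation):

```python
# A rough sketch (illustrative): ANOVA-based estimate of the intra-cluster
# correlation rho from pilot data. Works best with roughly equal cluster sizes.
import numpy as np

def estimate_rho(values, clusters):
    values = np.asarray(values, dtype=float)
    clusters = np.asarray(clusters)
    groups = [values[clusters == c] for c in np.unique(clusters)]
    k = len(groups)                                   # number of clusters
    n_bar = np.mean([len(g) for g in groups])         # average cluster size
    grand_mean = values.mean()
    msb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (len(values) - k)
    return (msb - msw) / (msb + (n_bar - 1) * msw)

# e.g. estimate_rho(test_scores, school_ids) on hypothetical pilot data
```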
Impact of r (rho) on sample size?
• Design effect = (sample size needed with clustering) / (sample size needed without clustering)
• Design effect = 1 + (n - 1) * rho, where n is the number of respondents per cluster
– If there is only one respondent per cluster, rho doesn't matter
– The larger rho is, the bigger the design effect
– The larger the cluster size, the larger the effect of rho

                 group size (n)
rho        10      50     100     200
0.02     1.18    1.98    2.98    4.98
0.05     1.45    3.45    5.95   10.95
0.10     1.90    5.90   10.90   20.90
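A small sketch of the arithmetic behind this table (my own illustration; the 800-person individually-randomized benchmark is an assumed number):

```python
# A small sketch (illustrative): the design effect 1 + (n - 1) * rho and the
# extra sample it implies when randomizing clusters of 50 respondents.
def design_effect(cluster_size, rho):
    return 1 + (cluster_size - 1) * rho

n_individual = 800    # assumed sample size needed under individual randomization
for rho in (0.02, 0.05, 0.10):
    deff = design_effect(cluster_size=50, rho=rho)
    print(f"rho = {rho:.2f}: design effect {deff:.2f}, "
          f"need about {round(n_individual * deff)} clustered observations")
```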
• If experimental design is clustered, we now need to consider rho when choosing a sample size
• It is extremely important to randomize an adequate number of groups
• Often the number of individuals within groups matters less than the total number of groups
Implications
1. Background: The basics
2. Getting more complicated: Clusters
3. How to do this in practice
Outline
• Two approaches:
• Approach one: Given budget constraints or logistics, you are given the maximum possible sample size. With your estimated effect size, will you have enough power such that it is worthwhile pursuing the project?
• Approach two: Set the power equal to some acceptable number. Given the estimated effect size, what is the sample required to obtain that power?
How to do “power calculations”?
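Approach two can be sketched with the usual normal-approximation formula. This is my own illustration (individual-level randomization, with the 0.20 effect size assumed):

```python
# A minimal sketch (illustrative): sample size per arm needed for a target power,
# assuming individual-level randomization and a standardized effect size d.
from scipy.stats import norm

def n_per_arm(d, power=0.80, alpha=0.05):
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

print(round(n_per_arm(0.20)))   # roughly 390-400 per arm for d = 0.2 at 80% power
```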
• You plug in some numbers…
• The software will graph either (corresponding to the two approaches above):
– Approach 1: power vs. effect size
– Approach 2: power vs. number of observations
• Follow the graph to find the number of observations or the effect size that gives you ~0.90 power
Power calculations using OD software
Power Calculations using the OD software
• Choose "Power vs number of clusters" in the menu "clustered randomized trials"
Cluster Size (if no clusters)
• If you have no clusters, choose a cluster size of 1 unit… this is a bit confusing
Choose Significance Level and Standardized Effect Size
• Pick α (the significance level)
– Normally you pick 0.05
• Pick δ (the standardized effect size)
– Can experiment with 0.20
• You obtain the resulting graph showing power as a function of sample size.
Power and Sample Size
[Figures from the OD software: power plotted against sample size]
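If you do not have the OD software at hand, the shape of these graphs can be approximated with a short calculation. This is a rough stand-in of my own, not the OD software's exact method (it combines the design effect with a normal approximation; the cluster size of 30, rho of 0.2, and effect size of 0.2 are assumptions):

```python
# A rough stand-in (illustrative, not the OD software's algorithm): power as a
# function of the number of clusters, using the design effect and a normal
# approximation. All parameter values below are assumptions.
from scipy.stats import norm

def clustered_power(n_clusters, cluster_size, rho, d, alpha=0.05):
    deff = 1 + (cluster_size - 1) * rho          # design effect
    n_effective = n_clusters * cluster_size / deff
    se = 2 / n_effective ** 0.5                  # SE of the difference, in SD units
    return 1 - norm.cdf(norm.ppf(1 - alpha / 2) - d / se)

for k in (20, 40, 80, 160):
    print(f"{k} clusters: power ~ {clustered_power(k, 30, 0.2, 0.2):.2f}")
```

Adding clusters raises power much faster than adding respondents within clusters, consistent with the earlier point that the number of groups usually matters more than group size.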
Availability of a Baseline
• A baseline has three main uses:
– Check whether the control and treatment groups are the same before the treatment
– Reduce the sample size needed (use the baseline as a control variable)
– Interactions and subgroups
• To compute power with a baseline:
– Need to know the correlation between the two outcome measures (baseline and endline)
– The stronger the correlation, the bigger the gain
– Very big gains for very persistent outcomes such as test scores
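A small sketch of why the correlation matters (my own illustration; the (1 - r²) shrinkage is the standard approximation for controlling for a baseline covariate, and the correlations below are assumed values):

```python
# A small sketch (illustrative): controlling for a baseline measure with
# baseline-endline correlation r cuts residual variance by roughly (1 - r**2),
# which translates into a roughly proportional cut in the required sample size.
for r in (0.0, 0.3, 0.5, 0.8):
    factor = 1 - r ** 2
    print(f"correlation {r:.1f}: need about {factor:.0%} of the no-baseline sample size")
```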
Stratified Samples
• Stratification reduces the sample size needed to achieve a given power
• Why?
– Reduces the variance of the outcome of interest within each stratum
– Reduces the correlation of units within clusters
• Example: if you randomize within school and grade which class is treated and which class is control:
– The variance of test scores goes down
– The within-cluster correlation goes down
• Common stratification variables:
– Baseline values of the outcomes, when possible
– Variables across whose subgroups we expect the treatment effect to vary
Other considerations
• Are you interested in the difference between two treatments?
• Are you interested in testing whether the effect is different in different subpopulations?
• Will there be attrition?
Conclusions
• Sample size calculations are a craft
• Calculations depend on parameters whose values are unknown and will vary.
– Power calculations involve some guess work.
– Involve pilot testing
– Vary across outcomes!