sampling and statistical power

AADAPT Workshop Latin AmericaBrasilia, November 16-20, 2009

Sampling and Statistical Power

Laura Chioda World Bank

1

Outline

Ingredients of a general recipe called “Statistical Sampling Theory” Key message: a successful design involves

some guess work. It is important to have a general rule, but then needs discussion depending on the case at hand

Use randomization as a benchmark I will not deal with non-experimental designs in

general, though what follows can be straightforwardly adapted to cover that case

2

Basic set up (1) Key ingredient for impact evaluation:

Questions, what do we want to learn about this initiative/project? How do I identify the effect that answers my questions? After this

morning possibly a bit more convinced that experimental methods can be ethical, interesting and funny!!

What is Next? DATA COLLECTION, sampling and sampling design Set up: assume we have performed our lottery and we have

identified our treatment and the control groups What we would like to measure is difference:

Effect of the Program = Mean in treatment - Mean in controlExample: average farmers' income who adopted fertilizer because of new incentive program vs the farmers in the control group who didn’t receive any incentives

3

Basic set up (2) Why is Randomization good?

may produce experimental groups that differ by chance, however. These differences are random errors, not biases

when done properly allows us to identify casual effects attributable to the program.

Bottom line: randomization removes bias, but it does not remove random noise in the data.

The latter statement is the reason why we worry about sampling!!!

4

Basic set up (3) In a world with no budget constraint we could collect data on

ALL the individuals (universe) in the treatment and in the control groups.

In practice, we do not observe the entire population, just a sample. Example: we do not have data for all farmers of the country/region, but just for a random sample of them in treatment and control groups

Hence we do not measure the true effect, but we estimate the mean outcome of interest by computing the average in the sample. Example: we compute the average income for farmers in the treatment vs avg. income for farmers in the control group.

Bottom line: Estimated Effect = True Effect + Noise

5

Planning Sample Size for Randomized Evaluations The noise coming from the fact we collect data for a subset

(sample) of the entire universe forces us to think about sampling & how to minimize the prob. that we may end up with a wrong conclusion (scary!!! Not really)

Question: How large does the sample need to be to credibly detect a given effect size?

What does “credibly” mean here?It means that I can measure with a certain degree of confidence the difference between participants and non-participants

Key ingredients: number of units (e.g. villages) randomized number of individuals (e.g. households) within units info on the outcome of interest and the expected size of the effect on

this outcome 6

Hypothesis Testing (1)

0 sizeEffect :a H

7

Let’s go back to basics: The general idea can be easily put across by testing the

hypothesis that the effect size is equal to zero We want to test:

Against:

>> Can be done for different groups of individuals

0 sizeEffect : oH

0 sizeEffect :a H

Hypothesis Testing (2)

80 1 2 3 4 5 6 7 8 9 10

0%10%20%30%40%50%60%70%80%90%

100%

Probability of detecting a positive ef -fect

It is reasonable to think of the following “ideal” property of any testing procedure: minimize disappointment , but allow for a minimum degree of error

True Effect Size

Two Types of Mistakes (1) First type of error: conclude that the program has an effect, when in fact at best it has no effect Significance level of a test: Probability that you will falsely conclude that the program has an effect, when in fact it does not▪ If you find an effect with a level of 5%, you can be 95% confident in the validity of your conclusion that the program had an effect

For policy purpose, you want to be very confident of the answer you give: the level will be set fairly low▪ Common levels are: 5%, 10%, 1%

9

Two Types of Mistakes (2)

Second Type of Error: You conclude that the program has no effect when indeed it had an effect, but it was not measure with enough precision (sort of technical, see next)

Power of a test: Probability to find a significant effect if there truly is an effect▪ higher power is better since I am more likely to have an effect to report▪ don’t think about statistics, simply look at the graph!

10

Practical Steps Set a pre-specified confidence level (5%) – i.e. just set the initial point of the line in the graphDecide for a sample size that allows to achieve a given power.

Common values used are 80% or 90%. Intuitively, the larger the sample, the larger the power. Power is a planning tool: one minus the power is the probability to be disappointed….

Set a range of pre-specified effect sizes (what you think the program will do)

What is the smallest effect that should prompt a policy response? Aka minimum detectable effect

11

Picking an Effect SizeWait a minute is this a catch 22, we do not know the true effect? No, it is a thought experiment, we gain insight from economics past data on the outcome of interest or even past evaluationsWhat is the smallest effect that should justify the program to be adopted? Cost of this program v the benefits it bringsCost of this program v the alternative use of the moneyCommon danger: picking effect size that are too optimistic—the sample size may be set too low to detect an actual effect

Practical Steps and “magic formulas” Proposition I:There exists at least one statistician in the world who has already put into a magic formula the optimal sample size required to address this problem Proposition II:The rule has also been implemented for almost all computer software

Not difficult to do, and only requires simple calculations to understand the logic (really simple!)

13

Picking an Effect Size

General “rule”: the sample size required is a function of: Significance level (often set to 5%) Minimum detectable effect – you set

this Power to detect it (often set to 80%) Variance of the outcome of interest

before the intervention takes place (derived from baseline data)

14

The Design Factors that Influence Power Which elements can influence the power?

1. The level of randomization2. Stratification3. Availability of Control Variables

Very fancy words, to express very simple concepts…

Statistics is nothing more than common sense

15

Level of RandomizationClustered Design (1)Cluster (or group) randomized trials are experiments in which social units or clusters rather than individuals are randomly allocated to the intervention groupExamples: First randomize villages

Then observe outcome variables at the household level

In an education program, randomize schools Then look at students’ achievement

16

Level of RandomizationClustered Design (2) Cluster randomization provides unbiased estimates of intervention effects for the same reasons that individual randomization does… However, makes our lives a bit more complicated when we have to do our power calculations:the statistical power or precision of cluster randomization is less than that for individual randomization, and often by a lot!

Necessitates larger samples to increase statistical power!

17

Implications of Clustering Forced to “cluster”, i.e. to take into consideration that

there are some elements that affect all the people in a given village/group/ basin in the same way.

Why? The outcomes for all the individuals within a cluster may be correlated All villagers are exposed to the same NGO The members of a village interact with each other sharing

knowledge, technology and best practices.. Individuals in a given area are exposed to the same soil

characteristics. The sample size needs to be adjusted for this correlation

The more correlation between the outcomes, the more we need to adjust the standard errors

18

In Practice…

It is extremely important to randomize an adequate number of clusters. The general result is that the number of

individuals within clusters matters less than the number of clusters

Think that the “law of large number” applies only when the number of clusters that are randomized is sufficient Number of Clusters == Number of

Observations19

Stratification (1) What do we mean by stratification of the sample and what is a stratum? Esoteric expression to simply say: the more elaborate is the question you

want to answer the larger is the requirement in terms of sample size

Compare the 2 scenarios: Research question 1: I would like to know the average effect of

the fertilizer program on farmers’ income? Research question 2: I would like to know the average effect of the

program for female headed low income families under the age of 35 , living in the north east?

Question: keeping all other elements constants (minimum size effect, power target, significance level) which research question will need more data in order to be answered with precision?

20

Stratification (2) To improve precision one could block or stratify

experimental sample members by some combination of their baseline characteristics, and then randomize within each block or stratum

Factors used for blocking in social research typically include: geographic location demographic characteristics past outcomes

Punch Line: stratification demands for larger sample size, but we get more detailed answers without sacrificing precision

21

Control Variables (1)Good news: if control variables such as

Gender, education levels, age (demographics)Farms characteristics, Soil compositionAre available

Then the precision of our estimates will increase hence the sample size requirement decreases. This reduces variance for two reasons: Reduces the variance of the outcome of interest in each stratum, and Reduces the correlation of units within clusters

Warning: control variables must only include variables that are not INFLUENCED by the treatment, i.e. variables that have been collected BEFORE the intervention >> If not: Problem of reverse causality!

22

Control Variables (3)

What matters in terms of power is ,the residual variation after controlling for those variables

so just replicate the steps described above within strata

It may help stratifying along dimension that we know from previous studies are important for the effects of the program

Example: we might expect to have differential effects by gender or age groups

>> This may help understand “non-response” rates

23

Non-response

Wrap up

Power calculations look scary but they are just a formalization of common sense

At times we do not have the right information to conduct it very properly

However, it is important to spend effort on them: Avoid launching studies that will have no

power at all: waste of time and money, potentially harmful

Devote the appropriate resources to the studies that you decide to conduct (and not too much)

25

Thank You

26

What do we mean by “noise”?

27

What do we mean by “noise”?

28

Relation with Confidence Intervals

A 95% confidence interval for an effect size tells us that, for 95% of any samples that we could have drawn from the same population, the estimated effect would have fallen into this interval

If zero does not belong to the 95% confidence interval of the effect size we measured, then we can be at least 95% sure that the effect size is not zero

The rule of thumb is that if the effect size is more than twice the standard error, you can conclude with more than 95% certainty that the program had an effect

29

Standardized Effect Sizes (but this is a 2OP here) Sometimes impacts are measured as a

standardized mean difference Example: outcomes in different metrics must

be combined or compared The standardized mean effect size equals

the difference in mean outcomes for the treatment group and control group, divided by the standard deviation of outcomes across subjects within experimental groups

30

sampling and statistical power

Documents