sampling: what you don’t know can hurt you juan muñoz

30
Sampling: What you don’t know can hurt you Juan Muñoz

Upload: shon-lawrence

Post on 28-Dec-2015

229 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Sampling: What you don’t know can hurt you Juan Muñoz

Sampling:What you don’t know can hurt you

Juan Muñoz

Page 2: Sampling: What you don’t know can hurt you Juan Muñoz

Outline of presentation

• Basic concepts– Scientific sampling

– Simple Random Sampling

– Sampling errors and confidence intervals

– Sampling errors and sample size

– Sample size and population size

– Non-sampling errors

– Sampling for rare events

– Two-stage sampling and clustering

– Stratification

– Design effect

• Implementation issues– Planning the survey

– Sample frames

– Excluded strata

– Paneling

– Nonresponse

Page 3: Sampling: What you don’t know can hurt you Juan Muñoz

Random Sampling

• Random Sampling (a.k.a. Scientific Sampling) is a selection procedure that gives each element of the population a known, positive probability of being included in the sample

• Random Sampling permits establishing Sampling Errors and Confidence Intervals

• Other sampling procedures (purposive sampling, quota sampling, etc.) cannot do that

• Other sampling procedures can also yield biased conclusions

Page 4: Sampling: What you don’t know can hurt you Juan Muñoz

• In a Simple Random Sample, households are chosen– With the same probability– Independently of each other

• In a Simple Random Sample, the selection probability of each household is p = n / N, where– n = sample size– N = size of the population

• A Simple Random Sample is self-weighted

Simple Random Sampling

Page 5: Sampling: What you don’t know can hurt you Juan Muñoz

• A simple random sample would be hard to implement...– A list of all households in the country is generally not

available to select the sample from– In other words, we don’t have a good sample frame

– High transportation costs– Difficult management

• ...but can be used to illustrate some basic facts about sampling– Sampling Errors and Confidence Intervals– The relationship between sampling error and sample size– The relationship between sample size and population size– Sampling vs. non-sampling errors

Simple Random Sampling

Page 6: Sampling: What you don’t know can hurt you Juan Muñoz

Sampling error and sample size

Standard error e when estimating a prevalence P in a sample of size ntaken from an infinite population

n

PPe

)1(

Page 7: Sampling: What you don’t know can hurt you Juan Muñoz

Confidence intervalsIn a sample of 1,000 households, 280 households

(28 percent) have preschool children.

0142.0000,1

72.028.0

e

Standard error is 1.42 percent.

Page 8: Sampling: What you don’t know can hurt you Juan Muñoz

Confidence intervals

24 25 26 27 28 29 30 31 32

In a sample of 1,000 households, 280 households (28 percent) have preschool children. Standard error is 1.42 percent.

Standard error

95 percent confidence interval:28 ± 1.42 • 1.96

99 percent confidence interval: 28 ± 1.42 • 2.58

Page 9: Sampling: What you don’t know can hurt you Juan Muñoz

Sampling error and sample size

Standarderror

Sample size

To halve sampling error...

...sample size must be quadrupled

Page 10: Sampling: What you don’t know can hurt you Juan Muñoz

Sample size and population size

Standard error e when estimating a prevalence P in a sample of size ntaken from a population of size N

n

PP

N

ne

)1(1

finite population correction

Page 11: Sampling: What you don’t know can hurt you Juan Muñoz

Sample size and population size

Samplesize

needed for a given

precision

Population size

Page 12: Sampling: What you don’t know can hurt you Juan Muñoz

Sample size

Sampling errorNon-sampling error

Sampling vs. non-sampling errors

Total error

Page 13: Sampling: What you don’t know can hurt you Juan Muñoz

Absolute and relative errors

Formula gives the absolute error n

ppe

)1(

But we are often interested in the relative error pn

p

p

e )1(

For rare events (small p,) the relative error can be large, even with very big samples

This may be the case of some of the MDG’s• Infant / maternal mortality• HIV/AIDS prevalence• Extreme poverty

Page 14: Sampling: What you don’t know can hurt you Juan Muñoz

Two-stage sampling• The country is divided into small

Primary Sampling Units (PSUs)

• In the first stage, PSUs are selected

• In the second stage, households are chosen within the selected PSUs

Page 15: Sampling: What you don’t know can hurt you Juan Muñoz

Two-stage sampling• Solves the problems of Simple Random Sampling• Provides an opportunity to link community-level factors

to household behavior• The sample can be made self-weighted if

– In the first stage, PSUs are selected with Probability Proportional to Size (PPS)

– In the second stage, a fixed number of households are chosen within each of the selected PSUs

• The price to pay is cluster effect

Page 16: Sampling: What you don’t know can hurt you Juan Muñoz

Cluster effectStandard error grows when the sample of size n is

drawn from k PSUs, with m households in each PSU (n=k•m)

Cluster effect

Intra-clustercorrelationcoefficient

1122 mee SRSTSS

Two Stage Sample Simple Random Sample

Page 17: Sampling: What you don’t know can hurt you Juan Muñoz

1.03 1.06 1.15 1.301.05 1.10 1.25 1.501.07 1.14 1.35 1.701.11 1.22 1.55 2.101.14 1.28 1.70 2.401.19 1.38 1.95 2.901.29 1.58 2.45 3.901.39 1.78 2.95 4.901.59 2.18 3.95 6.901.79 2.58 4.95 8.902.19 3.38 6.95 12.90

Cluster effects

Intra-cluster correlation coefficient

0.010.02 0.050.10Numberof PSUs

Number ofhouseholds

per PSU

For a total sample size of 12,000 households

3000200015001000800600400300200150100

468

12152030406080

120

Page 18: Sampling: What you don’t know can hurt you Juan Muñoz

Sampling weights need to be used

to analyze the data

Sampling weights need to be used

to analyze the data

Stratified Sampling

These objectives are

often contradictory in

practice

These objectives are

often contradictory in

practice

• The population is divided up into subgroups or “strata”.

• A separate sample of households is then selected from each stratum.

• There are two primary reasons for using a stratified sampling design:– To potentially reduce sampling

error by gaining greater control over the composition of the sample.

– To ensure that particular groups within a population are adequately represented in the sample.

• The sampling fraction generally varies across strata.

Page 19: Sampling: What you don’t know can hurt you Juan Muñoz

Design effect

• In a two-stage sampleCluster effect = e²TSS / e²SRS

• In a more complex sample (with two or more stages, stratification, etc.)Design effect = Deff = e²CS / e²SRS

• It can be interpreted as an apparent shrinking of the sample size, as a result of clustering and stratification.

• It can be estimated with specialized software (such as the Stata’s svy commands)

Page 20: Sampling: What you don’t know can hurt you Juan Muñoz

First stage sample frame:The list of Census Enumeration Areas

• Exhaustive

• Unambiguous

• Linked with cartography

• Measure of size (for PPS selection)

• Up to date (?)

• Area Units of adequate size

Page 21: Sampling: What you don’t know can hurt you Juan Muñoz

Second stage sample frame:The household listing operation

• What is involved?• How long does it take?• How much does it cost?• How much earlier than

the survey?• Is it always needed?• Dwellings or

households?• Who draws the sample?• Asking extra questions

during listing• Can new technologies

help?

• Training, organization, supervision, forms

• 50-80 households per enumerator/day

• ~15% of the total cost of fieldwork

• As close as possible• Yes (almost)does • A dwelling listing is more

permanent• Ideally, central staff• Not recommended• Yes (GPS)

Page 22: Sampling: What you don’t know can hurt you Juan Muñoz

Planning the survey• Selected PSUs should be allocated

– Among teams

– During the survey period

Page 23: Sampling: What you don’t know can hurt you Juan Muñoz

• Parts of the country may need to be excluded from the sample for security or other reasons

Excluded strata

Page 24: Sampling: What you don’t know can hurt you Juan Muñoz

Panel Surveys can measure change better

Y2001

Y2005

2001 2005It seems that Y2001 > Y2005 but…

…both measures are affected by sampling errors (e2001 et e2005)

The error of the difference Y2005 - Y2001 is…

…√ (e²2001 + e²2005) if the two samples are independent

…only √(e²2001+e²2005–2ρ[Y2001,Y2005]) if the sample is the same

Page 25: Sampling: What you don’t know can hurt you Juan Muñoz

Advantages and disadvantagesof panels

• Analyticaladvantages

– Can measure changes better– Permit understanding better why

things changed– Permits correlating past and

present behavior

• Analyticaldisadvantages

– Become progressively less representative of the population

• Practicaldisadvantages

– Sample attrition– Much harder to manage– Better to design them

prospectively rather than in afterthought

• Practicaladvantages

– No sampling design needed for the second and subsequent surveys

Page 26: Sampling: What you don’t know can hurt you Juan Muñoz

Nonresponse

•Possible solutions… Replace nonrespondents with similar households Increase the sample size to compensate for it Use correction formulas Use imputation techniques (hot-deck, cold-deck,

warm-deck, etc.) to simulate the answers of nonrespondents

None of the above✔

Page 27: Sampling: What you don’t know can hurt you Juan Muñoz

The best way to deal with nonresponse is to prevent it

Lohr, Sharon L. Sampling: Design & Analysis (1999)

Page 28: Sampling: What you don’t know can hurt you Juan Muñoz

TotalNonresponse

Interviewers

Type of survey

Respondents

Training

Work LoadMotivation

Qualification Data collection method

Demographic

Socio-economic

Economic

Burden

Motivation

Proxy

Availability

Source: “Some factors affecting Non-Response.” by R. Platek. 1977. Survey Methodology. 3. 191-214

Page 29: Sampling: What you don’t know can hurt you Juan Muñoz

• Total sample size: 18,144 households• 56 Strata = 18 governorates x 3 zones (5 in Bagdad)

( Urban Center / Other Urban / Rural )• No explicitly excluded strata• Within each stratum: 324 households, selected in two-

stages:– 54 Blocks, selected with PPS– In each block: 6 households (a cluster,) selected with EP

• The 162 clusters of each governorate were allocated– To fieldworkers: 3 teams x 3 interviewers x 18 clusters– In time: 18 waves x 9 clusters (randomly)

One wave = 20 days fieldwork period = 12 months

Case study: The IHSESIraq Household Socio-Economic Survey

Presenter: Ms Najla Murad - COSIT

Page 30: Sampling: What you don’t know can hurt you Juan Muñoz

• If a cluster could not be visited at the scheduled time, it was swapped with one of the selected clusters not yet visited, chosen at random.

• At the end of fieldwork, 75 of the 3,024 originally selected clusters could not be visited (2.5 percent)

• However, over 30 percent of the clusters were not visited at the scheduled time

• In the clusters that could be visited, non-response was negligible (~1.5 percent)

Case study: The IHSESIraq Household Socio-Economic Survey

Performance of the contingency plans