Variance Estimation in Complex Surveys

Drew Hardin

Kinfemichael Gedif

So far...

Variance for estimated mean and total under
– SRS, Stratified, Cluster (single, multi-stage), etc.

Variance for estimating a ratio of two means under
– SRS (we used the linearization method)

What about other cases?

Variance for estimators that are not linear combinations of means and totals
– Ratios

Variance for estimating other statistics from complex surveys
– Median, quantiles, functions of EMF, etc.

Other approaches are necessary

Outline

Variance Estimation Methods
– Linearization
– Random Group Methods
– Balanced Repeated Replication (BRR)
– Resampling techniques: Jackknife, Bootstrap

Adapting to complex surveys

‘Hot’ research areas

Reference

Linearization (Taylor Series Methods)

We have seen this before (ratio estimator and other courses).

Suppose our statistic is non-linear. It can often be approximated using Taylor’s Theorem.

We know how to calculate variances of linear functions of means and totals.

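Linearization (Taylor Series Methods)

Linearize

$$h(\hat t_1, \hat t_2, \hat t_3, \ldots, \hat t_k) \approx h(t_1, t_2, t_3, \ldots, t_k) + \sum_{j=1}^{k} \left.\frac{\partial h(c_1, \ldots, c_k)}{\partial c_j}\right|_{(t_1, \ldots, t_k)} (\hat t_j - t_j)$$

Calculate Variance

$$V\left[h(\hat t_1, \ldots, \hat t_k)\right] \approx \sum_{j=1}^{k} \left(\frac{\partial h}{\partial t_j}\right)^2 V(\hat t_j) + 2 \sum_{i<j} \frac{\partial h}{\partial t_i}\,\frac{\partial h}{\partial t_j}\,\mathrm{Cov}(\hat t_i, \hat t_j)$$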
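As a hedged illustration of the two steps above, the sketch below applies them to a ratio of two totals, h(t_x, t_y) = t_y / t_x. It assumes an SRS of size n from a population of size N with the finite population correction ignored; the function names and the toy data are hypothetical and not from the slides.

```python
from statistics import mean


def sample_cov(a, b):
    """Unbiased sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def ratio_linearization_variance(x, y, N):
    """Linearization variance of B = t_y / t_x under SRS (fpc ignored)."""
    n = len(x)
    t_x, t_y = N * mean(x), N * mean(y)
    # Estimated variances and covariance of the estimated totals.
    v_x = N ** 2 * sample_cov(x, x) / n
    v_y = N ** 2 * sample_cov(y, y) / n
    c_xy = N ** 2 * sample_cov(x, y) / n
    # Partial derivatives of h(t_x, t_y) = t_y / t_x at the estimates.
    d_x = -t_y / t_x ** 2
    d_y = 1.0 / t_x
    # Squared-derivative terms plus twice the cross term.
    return d_x ** 2 * v_x + d_y ** 2 * v_y + 2 * d_x * d_y * c_xy


# Toy data (assumed for illustration only).
x = [10.0, 12.0, 9.0, 15.0, 11.0, 13.0]
y = [21.0, 25.0, 17.0, 32.0, 20.0, 27.0]
print(ratio_linearization_variance(x, y, N=600))
```

For this particular statistic the result coincides algebraically with the familiar residual form, the sample variance of e_i = y_i - B x_i divided by n times the squared mean of x, which gives a quick way to sanity-check the derivatives.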

Linearization (Taylor Series) Methods

– Pro:
Can be applied in general sampling designs
Theory is well developed
Software is available

– Con:
Finding partial derivatives may be difficult
A different method is needed for each statistic
The function of interest may not be expressible as a smooth function of population totals or means
Accuracy of the linearization approximation

Random Group Methods

Based on the concept of replicating the survey design
It is not usually possible to simply go and replicate the survey
However, the survey can often be divided into R groups so that each group forms a miniature version of the survey

Random Group Methods

[Figure: five strata (Stratum 1 to Stratum 5), each showing sampled units numbered 1 to 8; annotation: "Treat as miniature sample"]

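Unbiased Estimator (Average of Samples)

$$\hat V_1(\tilde\theta) = \frac{1}{R} \sum_{r=1}^{R} \frac{(\hat\theta_r - \tilde\theta)^2}{R - 1}$$

Slightly Biased Estimator (All Data)

$$\hat V_2(\hat\theta) = \frac{1}{R} \sum_{r=1}^{R} \frac{(\hat\theta_r - \hat\theta)^2}{R - 1}$$

Here $\hat\theta_r$ is the estimate from random group r, $\tilde\theta$ is the average of the R group estimates, and $\hat\theta$ is the estimate from all the data.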
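The two random group estimators above can be sketched in a few lines. This is a minimal illustration, assuming the R group estimates and the full-sample estimate have already been computed; the function name and the numbers are hypothetical.

```python
from statistics import mean


def random_group_variances(group_estimates, full_estimate):
    """Return (theta_tilde, V1, V2) for R random group estimates."""
    R = len(group_estimates)
    theta_tilde = mean(group_estimates)  # average of the group estimates
    # V1: deviations taken around the average of the group estimates.
    v1 = sum((t - theta_tilde) ** 2 for t in group_estimates) / (R * (R - 1))
    # V2: deviations taken around the estimate from all the data.
    v2 = sum((t - full_estimate) ** 2 for t in group_estimates) / (R * (R - 1))
    return theta_tilde, v1, v2


# Hypothetical estimates from R = 5 random groups.
estimates = [10.2, 9.8, 10.5, 9.9, 10.1]
print(random_group_variances(estimates, full_estimate=10.05))
```

Because the average of the group estimates minimizes the sum of squared deviations, V2 is never smaller than V1 for the same set of group estimates.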

Random Group Methods

Pro:
– Easy to calculate
– General method (can also be used for non-smooth functions)

Con:
– Assumption of independent groups (a problem when N is small)
– Small number of groups (particularly if one stratum is sampled only a few times)
– Survey design must be replicated in each random group (the strata and clusters present must remain the same)

Resampling and Replication Methods

Balanced Repeated Replication (BRR)
– Special case when $n_h = 2$
Jackknife (Quenouille (1949), Tukey (1958))
Bootstrap (Efron (1979), Shao and Tu (1995))

These methods
– Extend the idea of the random group method
– Allow replicate groups to overlap
– Are all-purpose methods
– Asymptotic properties ??

Balanced Repeated Replication

Suppose we had sampled 2 per stratum
There are $2^H$ ways to pick 1 from each stratum.
Each combination could be treated as a sample.
Pick R samples.

Which samples should we include?
– Assign each value either 1 or -1 within the stratum
– Select samples that are orthogonal to one another to create balance
– You can use the design matrix for a fractional factorial
– Specify a vector $\alpha_r$ of 1, -1 values for each stratum

Estimator

$$\hat V_{BRR}(\hat\theta) = \frac{1}{R} \sum_{r=1}^{R} \left(\hat\theta(\alpha_r) - \hat\theta\right)^2$$
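A minimal sketch of the BRR estimator follows, assuming two sampled units per stratum and known stratum weights W_h. The balanced half-samples are taken from the rows of a Sylvester-type Hadamard matrix, which is one common way to obtain orthogonal +1/-1 vectors; the helper names and toy data are hypothetical.

```python
def sylvester_hadamard(order):
    """Hadamard matrix of a power-of-two order (Sylvester construction)."""
    H = [[1]]
    while len(H) < order:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H


def brr_variance(strata, weights, g=lambda ybar: ybar):
    """BRR variance of g(stratified mean); strata is a list of (y_h1, y_h2)."""
    n_strata = len(strata)
    R = 1
    while R <= n_strata:  # need R > H so the all-ones column can be skipped
        R *= 2
    had = sylvester_hadamard(R)
    theta_hat = g(sum(W * (a + b) / 2 for W, (a, b) in zip(weights, strata)))
    total = 0.0
    for r in range(R):
        alpha = had[r][1:n_strata + 1]  # +1/-1 entry for each stratum
        # Half-sample: first unit where alpha is +1, second unit where it is -1.
        ybar_r = sum(W * (pair[0] if a == 1 else pair[1])
                     for W, pair, a in zip(weights, strata, alpha))
        total += (g(ybar_r) - theta_hat) ** 2
    return total / R


# Toy data (assumed): three strata, two sampled units each.
strata = [(10.0, 14.0), (20.0, 18.0), (5.0, 9.0)]
weights = [0.5, 0.3, 0.2]
print(brr_variance(strata, weights))
```

For the stratified mean itself, a balanced set of half-samples reproduces the usual stratified variance estimate, the sum over strata of W_h^2 (y_h1 - y_h2)^2 / 4, which is a convenient check on the balancing.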


Pro
– Relatively few computations
– Asymptotically equivalent to linearization methods for smooth functions of population totals and quantiles
– Can be extended to use weights

Con
– Requires 2 PSUs per stratum
Can be extended with more complex schemes

The Jackknife: SRS with replacement

Quenouille (1949); Tukey (1958); Shao and Tu (1995)
Let $\hat\theta_{(i)}$ be the estimator of $\theta$ after omitting the ith observation
Jackknife estimate

$$\tilde\theta_J = \sum_{i=1}^{n} \tilde\theta_i / n, \quad\text{where}\quad \tilde\theta_i = n\hat\theta - (n-1)\hat\theta_{(i)}$$

Jackknife estimator of the variance $V(\hat\theta)$

$$V_J(\hat\theta) = \frac{n-1}{n} \sum_{i=1}^{n} \left(\hat\theta_{(i)} - \hat\theta_{(\cdot)}\right)^2 = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left(\tilde\theta_i - \tilde\theta_J\right)^2, \quad\text{where}\quad \hat\theta_{(\cdot)} = \sum_{i=1}^{n} \hat\theta_{(i)} / n$$

For stratified SRS without replacement: Jones (1974)
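The delete-one jackknife described above can be sketched as follows. This is an illustration under the with-replacement SRS assumption; `jackknife` and the toy data are hypothetical names, and `estimator` is any function of the sample.

```python
from statistics import mean


def jackknife(data, estimator):
    """Delete-one jackknife estimate and variance for an i.i.d. sample."""
    n = len(data)
    theta_hat = estimator(data)
    # Estimator recomputed with the i-th observation omitted.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    # Pseudovalues: n * theta_hat - (n - 1) * theta_(i).
    pseudo = [n * theta_hat - (n - 1) * t for t in leave_one_out]
    theta_j = mean(pseudo)
    theta_dot = mean(leave_one_out)
    v_j = (n - 1) / n * sum((t - theta_dot) ** 2 for t in leave_one_out)
    return theta_j, v_j


# Toy data (assumed); for the sample mean the result equals s^2 / n.
data = [3.0, 5.0, 7.0, 8.0, 12.0]
print(jackknife(data, mean))
```

With the sample mean as the estimator, each pseudovalue equals the corresponding observation, so the jackknife variance reduces to the ordinary s^2 / n.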

The Jackknife: stratified multistage design

In stratum h, delete one PSU at a time
Let $\hat\theta^{(hi)}$ be the estimator of the same form as $\hat\theta$ when PSU i of stratum h is omitted
Jackknife estimate:

$$\hat\theta^{(hi)} = g(\bar y^{(hi)}), \quad\text{where}\quad \bar y^{(hi)} = \sum_{h' \neq h} W_{h'} \bar y_{h'} + W_h \frac{n_h \bar y_h - y_{hi}}{n_h - 1}$$

Or using pseudovalues

$$\tilde\theta^{(hi)} = n_h \hat\theta - (n_h - 1)\hat\theta^{(hi)}; \qquad \tilde\theta_J^{I} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} \tilde\theta^{(hi)} / n; \qquad \tilde\theta_J^{II} = \frac{1}{L} \sum_{h=1}^{L} \frac{1}{n_h} \sum_{i=1}^{n_h} \tilde\theta^{(hi)}$$

where $n = \sum_h n_h$.

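The Jackknife: stratified multistage design

Different formulae for $\hat V(\hat\theta)$

$$\hat V(\hat\theta) = \sum_{h=1}^{L} \frac{n_h - 1}{n_h} \sum_{i=1}^{n_h} \left(\hat\theta^{(hi)} - \hat\theta_{method}\right)^2$$

Where $\hat\theta_{method}$ can be

$$\hat\theta, \quad \hat\theta^{(h)}, \quad \sum_{h=1}^{L} \sum_{i=1}^{n_h} \hat\theta^{(hi)} / n, \quad\text{or}\quad \sum_{h=1}^{L} \hat\theta^{(h)} / L, \qquad\text{with}\quad \hat\theta^{(h)} = \sum_{i=1}^{n_h} \hat\theta^{(hi)} / n_h$$

Using the pseudovalues

$$\hat V(\hat\theta) = \sum_{h=1}^{L} \frac{1}{n_h (n_h - 1)} \sum_{i=1}^{n_h} \left(\tilde\theta^{(hi)} - \tilde\theta_J^{j}\right)^2, \quad j = I, II$$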
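As a hedged sketch of the delete-one-PSU jackknife, the code below implements the variant that centres the replicates at the full-sample estimate. It assumes the statistic is a function g of the stratified mean and that one value per PSU is available; the function name and toy data are hypothetical.

```python
from statistics import mean


def stratified_jackknife_variance(strata, weights, g=lambda ybar: ybar):
    """Delete-one-PSU jackknife variance of g(stratified mean)."""
    stratum_means = [mean(s) for s in strata]
    ybar = sum(W * m for W, m in zip(weights, stratum_means))
    theta_hat = g(ybar)
    v = 0.0
    for s, W, ybar_h in zip(strata, weights, stratum_means):
        n_h = len(s)
        for y_hi in s:
            # Stratified mean with PSU i of stratum h deleted;
            # the other strata are left unchanged.
            ybar_hi = ybar - W * ybar_h + W * (n_h * ybar_h - y_hi) / (n_h - 1)
            v += (n_h - 1) / n_h * (g(ybar_hi) - theta_hat) ** 2
    return v


# Toy data (assumed): two strata with 3 and 4 PSUs.
strata = [[10.0, 14.0, 12.0], [20.0, 18.0, 25.0, 21.0]]
weights = [0.4, 0.6]
print(stratified_jackknife_variance(strata, weights))
```

For the identity function g this reproduces the customary stratified estimate, the sum of W_h^2 s_h^2 / n_h, so a linear statistic is a useful test case before moving to a nonlinear g.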

The Jackknife: Asymptotics

Krewski and Rao (1981)
Based on the concept of a sequence of finite populations $\{\Pi_L\}_{L=1}^{\infty}$ with L strata in $\Pi_L$

Under conditions C1-C6 given in the paper

$$\text{i)}\;\; n^{1/2}(\hat\theta - \theta) \xrightarrow{d} N(0, \sigma^2)$$

$$\text{ii)}\;\; n \hat V_{method}(\hat\theta) \xrightarrow{p} \sigma^2$$

$$\text{iii)}\;\; T_{method} = \frac{\hat\theta - \theta}{\hat V_{method}(\hat\theta)^{1/2}} \xrightarrow{d} N(0, 1)$$

Where method is the estimator used (Linearization, BRR, Jackknife)

The Bootstrap: Naïve bootstrap

Efron (1979); Rao and Wu (1988); Shao and Tu (1995)

Resample $\{y^*_{hi}\}_{i=1}^{n_h}$ with replacement in stratum h

Estimate:

$$\bar y_h^{*(b)} = n_h^{-1} \sum_{i} y_{hi}^{*(b)}, \quad \bar y^{*(b)} = \sum_{h} W_h \bar y_h^{*(b)}, \quad\text{and}\quad \hat\theta^{*(b)} = g(\bar y^{*(b)}), \qquad b = 1, 2, \ldots, B$$

Variance:

$$\hat V_{NBS}(\hat\theta) = E_*\left(\hat\theta^* - E_*(\hat\theta^*)\right)^2$$

– Or approximate by

$$\hat V_{NBS}(\hat\theta) \approx \frac{1}{B - 1} \sum_{b=1}^{B} \left(\hat\theta^{*(b)} - \hat\theta^{*(\cdot)}\right)^2$$

The estimator is not a consistent estimator of the variance of a general nonlinear statistic
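The Monte Carlo approximation above can be sketched as follows. This is a minimal illustration, assuming a stratified sample with known weights W_h and a statistic of the form g(stratified mean); the function name, the seed handling and the toy data are assumptions made for the example.

```python
import random
from statistics import mean, variance


def naive_bootstrap_variance(strata, weights, g=lambda ybar: ybar,
                             B=200, seed=0):
    """Naive bootstrap: resample n_h units with replacement in each stratum."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(B):
        # Stratified mean of one bootstrap sample.
        ybar_star = sum(W * mean(rng.choices(s, k=len(s)))
                        for W, s in zip(weights, strata))
        replicates.append(g(ybar_star))
    # Sample variance of the B replicates (divisor B - 1).
    return variance(replicates)


# Toy data (assumed): two strata with 3 and 4 sampled units.
strata = [[10.0, 14.0, 12.0], [20.0, 18.0, 25.0, 21.0]]
weights = [0.4, 0.6]
print(naive_bootstrap_variance(strata, weights))
```

With small n_h, as in this toy example, the result for the stratified mean tends to sit below the customary estimate, which is the inconsistency the slides go on to describe.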

The Bootstrap: Naïve bootstrap

For $\hat\theta^* = \bar y^* = \sum_h W_h \bar y_h^*$

$$\mathrm{Var}_*(\bar y^*) = \sum_h \frac{W_h^2}{n_h}\,\frac{n_h - 1}{n_h}\, s_h^2$$

Comparing with

$$\mathrm{Var}(\bar y) = \sum_h \frac{W_h^2}{n_h}\, s_h^2$$

The ratio $\mathrm{Var}_*(\bar y^*) / \mathrm{Var}(\bar y)$ does not converge to 1 for bounded $n_h$
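As a worked check of the comparison above, assume (purely for illustration) that every stratum has the same sample size, n_h = n_0. The common factor then comes out of both sums:

```latex
\frac{\operatorname{Var}_*(\bar y^*)}{\operatorname{Var}(\bar y)}
  = \frac{\sum_h \frac{W_h^2}{n_0}\,\frac{n_0 - 1}{n_0}\, s_h^2}
         {\sum_h \frac{W_h^2}{n_0}\, s_h^2}
  = \frac{n_0 - 1}{n_0},
\qquad \text{e.g. } n_0 = 2 \;\Rightarrow\; \tfrac{1}{2}.
```

So with two PSUs per stratum the naïve bootstrap variance is half the customary estimate, however many strata there are.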

The Bootstrap: Modified bootstrap

Resample $\{y^*_{hi}\}_{i=1}^{m_h}$, $m_h \ge 1$, with replacement in stratum h

Calculate:

$$\tilde y_{hi} = \bar y_h + \frac{m_h^{1/2}}{(n_h - 1)^{1/2}} (y^*_{hi} - \bar y_h), \quad \tilde y_h = \sum_{i=1}^{m_h} \tilde y_{hi} / m_h, \quad \tilde y = \sum_{h=1}^{L} W_h \tilde y_h, \quad \tilde\theta = g(\tilde y)$$

Variance:

$$\hat V_{MBS}(\tilde\theta) = E_*\left(\tilde\theta - E_*(\tilde\theta)\right)^2$$

Can be approximated with Monte Carlo
For the linear case, it reduces to the customary unbiased variance estimator
$m_h < n_h$
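The rescaling step can be sketched in the same way as the naïve bootstrap. This is an illustration only, assuming with-replacement sampling within strata and the default choice m_h = n_h - 1; the function name, its parameters and the toy data are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def rescaled_bootstrap_variance(strata, weights, g=lambda ybar: ybar,
                                m=None, B=200, seed=0):
    """Modified bootstrap: resample m_h units, rescale, then apply g."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(B):
        y_tilde = 0.0
        for W, s in zip(weights, strata):
            n_h = len(s)
            m_h = n_h - 1 if m is None else m  # resample size (assumed default)
            ybar_h = mean(s)
            scale = sqrt(m_h / (n_h - 1))
            resample = rng.choices(s, k=m_h)
            # Rescaled values: ybar_h + scale * (y* - ybar_h), then averaged.
            y_tilde += W * mean(ybar_h + scale * (y - ybar_h) for y in resample)
        replicates.append(g(y_tilde))
    return variance(replicates)


# Toy data (assumed): two strata with 3 and 4 sampled units.
strata = [[10.0, 14.0, 12.0], [20.0, 18.0, 25.0, 21.0]]
weights = [0.4, 0.6]
print(rescaled_bootstrap_variance(strata, weights))
```

For the stratified mean, the Monte Carlo result should fluctuate around the customary estimate, the sum of W_h^2 s_h^2 / n_h, whatever resample size m_h is used.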

More on bootstrap

The method can be extended to stratified SRS without replacement by simply changing

$$\tilde y_{hi} \quad\text{to}\quad \tilde y_{hi} = \bar y_h + \frac{m_h^{1/2}}{(n_h - 1)^{1/2}} (1 - f_h)^{1/2} (y^*_{hi} - \bar y_h)$$

where $f_h = n_h / N_h$ is the sampling fraction in stratum h.

For $m_h = n_h - 1$, this method reduces to the naïve BS

For $n_h = 2$, $m_h = 1$, the method reduces to the random half-sample replication method

For $n_h > 3$, choice of $m_h$ …see Rao and Wu (1988)

Simulation: Rao and Wu (1988)

Jackknife and linearization intervals gave substantial bias for nonlinear statistics in one-sided intervals

The bootstrap performs best for one-sided intervals (especially when $m_h = n_h - 1$)

For two-sided intervals, the three methods have similar performance in coverage probabilities

The jackknife and linearization methods are more stable than the bootstrap

B = 200 is sufficient

‘Hot’ topics

Jackknife with non-smooth functions (Rao and Sitter 1996)

Two-phase variance estimation (Graubard and Korn 2002; Rubin-Bleuer and Schiopu-Kratina 2005)

Estimating Function (EF) bootstrap method (Rao and Tausi 2004)

Software

OSIRIS – BRR, Jackknife
SAS – Linearization
Stata – Linearization
SUDAAN – Linearization, Bootstrap, Jackknife
WesVar – BRR, Jackknife, Bootstrap

References:

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.

Graubard, B. J., and Korn, E. L. (2002). Inference for superpopulation parameters using sample surveys. Statistical Science, 17, 73-96.

Krewski, D., and Rao, J. N. K. (1981). Inference from stratified samples: Properties of linearization, jackknife, and balanced replication methods. Annals of Statistics, 9, 1010-1019.

Quenouille, M. H. (1949). Problems in plane sampling. Annals of Mathematical Statistics, 20, 355-375.

Rao, J. N. K., and Wu, C. F. J. (1988). Resampling inference with complex survey data. JASA, 83, 231-241.

Rao, J. N. K., and Tausi, M. (2004). Estimating function variance estimation under stratified multistage sampling. Communications in Statistics, 33, 2087-2095.

Rao, J. N. K., and Sitter, R. R. (1996). Discussion of Shao’s paper. Statistics, 27, pp. 246-247.

Rubin-Bleuer, S., and Schiopu-Kratina, I. (2005). On the two-phase framework for joint model and design based framework. Annals of Statistics (to appear).

Shao, J., and Tu, D. (1995). The Jackknife and Bootstrap. New York: Springer-Verlag.

Tukey, J. W. (1958). Bias and confidence in not-quite large samples. Annals of Mathematical Statistics, 29, 614.

Not referred to in the presentation

Wolter, K. M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.

Shao, J. (1996). Resampling Methods in Sample Surveys. Invited paper, Statistics, 27, pp. 203-237, with discussion, 237-254.
