Lynn Lethbridge SHRUG November, 2010

Page 1: Lynn Lethbridge SHRUG November, 2010. What is Bootstrapping? A method to estimate a statistic’s sampling distribution Bootstrap samples are drawn repeatedly

Lynn Lethbridge

SHRUG November, 2010

Page 2: What is Bootstrapping?

A method to estimate a statistic's sampling distribution

Bootstrap samples are drawn repeatedly with replacement from the original data

From each new sample, the statistic is re-calculated and saved in a dataset (e.g. 200 bootstraps, 200 statistics)

The standard error of the statistic is calculated as the standard deviation of the bootstrap statistics

Bootstrapping is not used for the point estimate
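
The standard-error step can be written out explicitly. This is a sketch in notation that does not appear in the slides: B is the number of bootstrap samples and \hat{\theta}^{*}_{b} is the statistic computed from bootstrap sample b. The B-1 divisor is an assumption, chosen because it matches the default standard deviation reported by PROC MEANS.

$$
\bar{\theta}^{*} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*}_{b},
\qquad
\widehat{\mathrm{se}}_{\mathrm{boot}} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^{*}_{b}-\bar{\theta}^{*}\right)^{2}}
$$

The point estimate \hat{\theta} is still computed once, from the original sample.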

Page 3: When to Use Bootstrapping

Distribution has no clear analytical solution

  eg Gini coefficient, poverty intensity

Test for sensitivity

Complex survey design (not a simple random sample)

  eg Statistics Canada surveys are a stratified, multistage design

  Households within clusters within strata are selected

  Observations will not be independent, so variance calculated the usual way will be underestimated

Page 4: Two Programs

One is 'traditional' bootstrapping

  re-sampling from the original sample

The second is bootstrapping using Statistics Canada survey data

  Statistics Canada does the re-sampling heavy lifting in most of its surveys

  Use the bootstrap weights provided to calculate the standard error

Page 5: Program 1

Project where we examined the effect of trade on 'poverty intensity' in Canada/US

Used state/province level measures in regression analysis

Used bootstrapping to measure robustness of results given a different mix of policies

Our dataset consists of 61 unique observations of states and provinces. Re-sample to see whether the results would change with a different make-up of regions

Page 6:

/** run the regression with original sample to get point estimates */
proc reg data=orig.pov97
         outest=work.estpoint(keep=intercept lmurate aveuiben tradeimp tradeexp sambearn can);
  model sst = lmurate aveuiben tradeimp tradeexp sambearn can;
  weight invse;
  title " 1997";
run;

/* turn the one-row estimates dataset into one row per coefficient */
proc transpose data=work.estpoint
               out=work.estpoint2(drop=_label_ rename=(col1=coef));
run;

Page 7:

/* put sample size in a macro variable */
proc means data=orig.pov97 noprint;
  var year;
  output out=work.out n=totnum;
run;

data _null_;
  set work.out;
  /* symputx converts the number to text without leading blanks or a conversion note */
  call symputx('totnum', totnum);
run;

Page 8:

/** make a temporary file of original dataset */
data work.pov97;
  set orig.pov97;
run;

/* initiate bootstrap dataset */
data work.boot97fin;
  set _null_;
run;

options nonotes;

/* create macro variable for number of bootstraps */
%let bt=1000;

Page 9:

%macro boot;
/** construct new sample of 61 observations - randomly drawn with replacement */
/* &x is the replicate counter set by the calling macro; it varies the seed */
data work.boot;
  do i=1 to &totnum;
    /* pick a random observation number between 1 and &totnum */
    _p=ceil(ranuni(i+&x)*&totnum);
    do obsnum=_p to _p;
      /* read that observation directly */
      set work.pov97 point=obsnum;
      if _error_ then abort;
      output;
    end;
  end;
  stop;
run;

Page 10:

/* estimate coefficients from bootstrap sample */
proc reg data=work.boot noprint
         outest=work.est(keep=intercept lmurate aveuiben tradeimp tradeexp sambearn can);
  model sst = lmurate aveuiben tradeimp tradeexp sambearn can;
  weight invse;
  title " 1997";
run;

/** add coefficients to dataset */
data work.boot97fin;
  set work.boot97fin work.est;
run;
%mend boot;

Page 11:

/** invoke the boot macro 1000 times */
%macro reps;
  %do x=1 %to &bt;
    %boot;
  %end;
%mend reps;
%reps;
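
For comparison, the same resample-and-refit idea can be sketched without a macro loop by drawing all the bootstrap samples in one step and fitting the regression with BY-group processing. This is a minimal sketch and not the program from the talk: `sashelp.class` and its variables `weight` and `height` stand in for the analysis data, and the 200 replicates, the seed value, and the dataset names `work.bootall` and `work.bootest` are arbitrary choices made for illustration.

```sas
/* Assumption: sashelp.class is a stand-in for the original analysis dataset */

/* draw 200 samples with replacement, each the same size as the input data */
proc surveyselect data=sashelp.class out=work.bootall
                  method=urs      /* unrestricted random sampling = with replacement */
                  samprate=1      /* each replicate has as many rows as the input */
                  reps=200        /* number of bootstrap samples (arbitrary here) */
                  outhits         /* one output row per selection, so repeats appear */
                  seed=2010 noprint;
run;

/* fit the regression once per replicate and keep the coefficients */
proc reg data=work.bootall noprint
         outest=work.bootest(keep=replicate intercept height);
  by replicate;
  model weight = height;
run;
quit;

/* the standard deviation of each coefficient is its bootstrap standard error */
proc means data=work.bootest n mean std;
  var intercept height;
run;
```

The trade-off is storage for speed: all replicates are held in one dataset, but the regression procedure is started once rather than once per replicate.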

Page 12:

options notes;

/** calculate the standard deviation of each bootstrapped coefficient */
/* std= names are assigned by position to the variables in work.boot97fin */
proc means data=work.boot97fin n mean std;
  output out=work.std std=intercept lmurate aveuiben tradeimp tradeexp sambearn can;
run;

proc transpose data=work.std(drop=_type_ _freq_)
               out=work.std2(drop=_label_ rename=(col1=se));
run;

Page 13:

/** merge point estimates together with standard errors and calculate statistics */
/* one-to-one merge with no BY statement: both datasets must list coefficients in the same order */
data work.final;
  merge work.estpoint2 work.std2;
  t=coef/se;
  pvalue=(1-probnorm(abs(t)))*2;
run;

proc print data=work.final;
run;
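
The final data step is a normal-approximation test on each coefficient. Written as formulas (the symbols are introduced here for illustration; \Phi is the standard normal distribution function, which is what `probnorm` returns):

$$
t_j = \frac{\hat{\beta}_j}{\widehat{\mathrm{se}}_{\mathrm{boot}}(\hat{\beta}_j)},
\qquad
p_j = 2\left(1-\Phi\left(|t_j|\right)\right)
$$

For lmurate, for instance, a coefficient of 0.062098 with a bootstrap standard error of 0.01782 gives t of about 3.48 and a two-sided p-value of about 0.0005.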

Page 14:

Parameter Estimates

Variable    DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept    1   0.05648            0.02317          2.44    0.0181
lmurate      1   0.06210            0.01433          4.33    <.0001
aveuiben     1  -0.00009479         0.00003002      -3.16    0.0026
tradeimp     1  -0.07186            0.12541         -0.57    0.5690
tradeexp     1   0.02107            0.13190          0.16    0.8737
sambearn     1  -0.06155            0.04973         -1.24    0.2212
can          1  -0.03489            0.02739         -1.27    0.2081

Page 15:

1997

The MEANS Procedure

Variable    Label       N       Mean            Std Dev
---------------------------------------------------------------
Intercept   Intercept   1000     0.0581707      0.0305142
lmurate                 1000     0.0616976      0.0178248
aveuiben                1000    -0.000101532    0.000037820
tradeimp                1000    -0.0258204      0.1743886
tradeexp                1000    -0.0355008      0.1880651
sambearn                1000    -0.0635708      0.0673242
can                     1000    -0.0228619      0.0402765
---------------------------------------------------------------

Page 16:

Obs  _NAME_      coef       se        t         pvalue
1    intercept   0.056482  0.03051   1.85102   0.06417
2    lmurate     0.062098  0.01782   3.48378   0.00049
3    aveuiben   -0.000095  0.00004  -2.50627   0.01220
4    tradeimp   -0.071862  0.17439  -0.41208   0.68028
5    tradeexp    0.021066  0.18807   0.11202   0.91081
6    sambearn   -0.061547  0.06732  -0.91419   0.36062
7    can        -0.034891  0.04028  -0.86628   0.38634

Page 17: Program 2

Project using the National Longitudinal Survey of Children and Youth (NLSCY)

Examined the effect of having a child with disabilities on the health of mothers and fathers

Ordered Probit utilizing Statistics Canada NLSCY bootstrap weights to estimate standard errors

Page 18: Weighting

Many survey datasets include sampling weights so results will represent the population

The mechanics of using bootstrap weights are the same as for sampling weights

Each individual in the survey has a sample weight and all the bootstrap weights

Re-estimate your model or statistic over and over using a different weight each time

Page 19: Bootstrap Weight Derivation

[Diagram: Re-sampling -> A Miracle Occurs -> Bootstrap Weights]

Page 20:

/** macro variables to indicate the dependent variable and independent variables */
%let depvar=momhealth00;
%let indepvars=hhdis00 momage00 momlthigh00 momcertdip00 momunivdeg00 momimm eqinc00 hhchlt500 kids01700 momvg94 momg94 momfp94 momsmokesdaily00;

/** separate macro variable for the intercepts and the independent variables */
/* intercepts are listed in the order PROC LOGISTIC writes them with DESCENDING */
/* (intercept_5 first), because PROC MEANS later assigns std= names by position */
%let allrhs=intercept_5 intercept_4 intercept_3 intercept_2 &indepvars;

Page 21:

/*** get point estimates using sample weight */
proc logistic data=nlscy.age615validboot descending
              outest=work.point(keep=&allrhs);
  model &depvar = &indepvars / link=normit maxiter=50 rsq;
  weight dwtcwd1l / norm;
  where validdis=1;
  title " mom 2000 ";
run;

/** transpose the data which contains the point estimates */
proc transpose data=work.point
               out=work.pointtrans(drop=_label_ rename=(col1=coef));
run;

Page 22:

/** copy data to the WORK library */
data work.age615validboot;
  set nlscy.age615validboot;
run;

/** create empty dataset for coefficients */
data work.probitboot;
  set _null_;
run;

%global bt;
%let bt=1000; /** 1000 bootstrap weights provided */

Page 23:

%macro boot;
  options nonotes;
  %do i=1 %to &bt;
    /* re-estimate the model using bootstrap weight number &i */
    proc logistic data=work.age615validboot noprint descending
                  outest=work.est(keep=&allrhs);
      model &depvar = &indepvars / link=normit maxiter=50 rsq;
      weight bsw&i / norm;
      where validdis=1;
      title " mom 2000 ";
    run;

    /* add this replicate's coefficients to the running dataset */
    data work.probitboot;
      set work.probitboot work.est;
    run;
  %end;
  options notes;
%mend boot;
%boot;

Page 24:

/** calculate the standard deviation */
/* std= names are assigned by position to the variables in work.probitboot */
proc means data=work.probitboot n mean std;
  output out=work.std std=&allrhs;
run;

proc transpose data=work.std(drop=_type_ _freq_)
               out=work.std2(drop=_label_ rename=(col1=se));
run;

Page 25:

/* one-to-one merge with no BY statement: both datasets must list coefficients in the same order */
data work.final;
  merge work.pointtrans work.std2;
  /** Wald chi square */
  z=coef/se;
  chi=z*z;
  pvaluechi=1-probchi(chi,1);
run;

proc print data=work.final;
  title " married moms";
run;
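
The Wald test in this data step can be written as follows. The symbols are introduced here for illustration: F_{\chi^2_1} is the chi-square distribution function with one degree of freedom (what `probchi(chi,1)` returns) and \Phi is the standard normal distribution function.

$$
z_j = \frac{\hat{\beta}_j}{\widehat{\mathrm{se}}_{\mathrm{boot}}(\hat{\beta}_j)},
\qquad
\chi^2_j = z_j^{2},
\qquad
p_j = 1 - F_{\chi^2_1}\left(\chi^2_j\right) = 2\left(1-\Phi\left(|z_j|\right)\right)
$$

The last equality shows this is the same two-sided test as the normal-based p-value used in Program 1. For hhdis00, a coefficient of 0.30519 with a bootstrap standard error of 0.09732 gives z of about 3.14, a chi-square of about 9.83, and a p-value of about 0.0017.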

Page 26:

Analysis of Maximum Likelihood Estimates

Parameter         DF  Estimate   Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept 5        1  -2.9050    0.1513          368.5150         <.0001
Intercept 4        1  -2.0956    0.1451          208.6086         <.0001
Intercept 3        1  -1.0202    0.1429           50.9855         <.0001
Intercept 2        1   0.2247    0.1424            2.4906         0.1145
hhdis00            1   0.3052    0.0427           51.1371         <.0001
momage00           1   0.00579   0.00314           3.4098         0.0648
momlthigh00        1   0.1499    0.0583            6.6078         0.0102
momcertdip00       1  -0.0731    0.0384            3.6231         0.0570
momunivdeg00       1  -0.1781    0.0433           16.9065         <.0001
momimm             1   0.3377    0.0419           64.9256         <.0001
eqinc00            1  -2.95E-6   6.018E-7         24.0756         <.0001
hhchlt500          1  -0.1872    0.0876            4.5628         0.0327
kids01700          1  -0.1262    0.0161           61.0665         <.0001
momvg94            1   0.6181    0.0350          312.6018         <.0001
momg94             1   1.1116    0.0458          589.8279         <.0001
momfp94            1   1.5644    0.0912          294.0294         <.0001
momsmokesdaily00   1   0.1706    0.0430           15.7629         <.0001

Page 27:

The MEANS Procedure

Variable          N      Mean            Std Dev
----------------------------------------------------------
Intercept_5       1000   -2.9650753      0.3107804
Intercept_4       1000   -2.1470196      0.2770212
Intercept_3       1000   -1.0465351      0.2621726
Intercept_2       1000    0.2091371      0.2622451
hhdis00           1000    0.2846419      0.0973226
momage00          1000    0.0057067      0.0055820
momlthigh00       1000    0.1293874      0.0932894
momcertdip00      1000   -0.0739417      0.0772243
momunivdeg00      1000   -0.1852935      0.0980241
momimm            1000    0.3191519      0.1181139
eqinc00           1000   -3.090889E-6    1.1721765E-6
hhchlt500         1000   -0.1760001      0.1143188
kids01700         1000   -0.1148346      0.0351904
momvg94           1000    0.6399775      0.0754143
momg94            1000    1.1403891      0.1000578
momfp94           1000    1.6089774      0.1664408
momsmokesdaily00  1000    0.1618192      0.0882162
----------------------------------------------------------

Page 28:

Obs  _NAME_             coef      se       chi      pvaluechi
1    intercept_2       -2.90503  0.31078   87.376   0.00000
2    intercept_3       -2.09565  0.27702   57.228   0.00000
3    intercept_4       -1.02021  0.26217   15.143   0.00010
4    intercept_5        0.22473  0.26225    0.734   0.39147
5    hhdis00            0.30519  0.09732    9.834   0.00171
6    momage00           0.00579  0.00558    1.076   0.29961
7    momlthigh00        0.14987  0.09329    2.581   0.10815
8    momcertdip00      -0.07309  0.07722    0.896   0.34390
9    momunivdeg00      -0.17806  0.09802    3.300   0.06930
10   momimm             0.33771  0.11811    8.175   0.00425
11   eqinc00           -0.00000  0.00000    6.346   0.01176
12   hhchlt500         -0.18722  0.11432    2.682   0.10149
13   kids01700         -0.12618  0.03519   12.857   0.00034
14   momvg94            0.61807  0.07541   67.169   0.00000
15   momg94             1.11157  0.10006  123.417   0.00000
16   momfp94            1.56445  0.16644   88.349   0.00000
17   momsmokesdaily00   0.17064  0.08822    3.742   0.05307

Note: in this listing the intercept labels run in the opposite order to the PROC LOGISTIC output. The row labelled intercept_2 carries the Intercept 5 estimate (-2.9050), and so on down to intercept_5, which carries the Intercept 2 estimate (0.2247). The coefficients and standard errors are still paired correctly.
