DEPARTMENT OF MATHEMATICS AND STATISTICS
Handling Missing Data
Tao Sun
Lena Zhang
Yaqing Chen
Francisco Aguirre
SSC Case Study 2002
SSC Conference Hamilton Ontario May 2002 2
Presentation Outline
1. Preliminary analysis
   • Various plots
2. Assessing the missing pattern
   • Spearman rank correlation, logistic regression
3. Data analysis with missing data - Multiple Imputation
   • Random hot deck imputation with bootstrap
   • PROC MI and MIANALYZE (SAS)
   • Transcan function (Hmisc library in S-plus or R)
4. Conclusions
5. Further work
Objective:
Compare different approaches to handling missing data from a practitioner’s point of view
Preliminary analysis
RESPONSE OVERVIEW
Sample size: 2389
Males: 1097 (45.9%)
Females: 1292 (54.1%)
Observed: 1691
Missing: 698 (29.2%)
Mean: 0.9129
• The response variable is highly skewed to the left.
[Figure: Histogram of observed responses (DVHST94); x-axis: DVHST94, from 0.4 to 1.0]
Preliminary analysis
• 8 covariates in total, first 4 shown here.
• There appears to be a pattern of two clusters in the response DVHST94 (below 0.5 and above 0.5).
• DVBMI94 appears to have some “wild” values (= 96)
  – 43 observations, all males (3.9% of the male sample)
  – Wild values were replaced with the mean DVBMI94 of males
  – DVBMI94 transformation: NEW.DVBMI94 = abs(DVBMI94 – 22)
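The cleaning steps for DVBMI94 can be sketched as follows. This is a minimal illustration in Python, not the S-plus or SAS code used in the study. The record layout (`sex`, `bmi` keys), the function name `clean_bmi`, the toy data, and the choice to compute the male mean from non-wild values only are all assumptions.

```python
from statistics import mean

WILD_BMI = 96      # value flagged as "wild" on the slide
BMI_CENTER = 22    # constant used in the slide's transformation

def clean_bmi(records):
    """Return NEW.DVBMI94 for each record (dicts with 'sex' and 'bmi' keys)."""
    # Assumption: the male mean excludes the wild values themselves.
    male_mean = mean(r["bmi"] for r in records
                     if r["sex"] == "M" and r["bmi"] != WILD_BMI)
    cleaned = []
    for r in records:
        # On the slide all wild values belong to males, so the male mean is used.
        bmi = male_mean if r["bmi"] == WILD_BMI else r["bmi"]
        cleaned.append(abs(bmi - BMI_CENTER))
    return cleaned

# Toy data, not taken from the case study.
toy = [
    {"sex": "M", "bmi": 24.0},
    {"sex": "M", "bmi": 28.0},
    {"sex": "M", "bmi": 96},
    {"sex": "F", "bmi": 20.0},
]
new_bmi = clean_bmi(toy)
```

The absolute deviation from 22 treats values below and above that reference point symmetrically, which is what the transformation on the slide does.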
Preliminary analysis
• There are no obvious linear patterns between the covariates and the response DVHST94
• DVPP94 is recoded as dichotomous:
  – NEW.DVPP94 = 0 (91% of observations)
  – NEW.DVPP94 > 0 (9% of observations)
• The AGEGRP covariate is recoded to NEW.AGE:
  – NEW.AGE = mid-range value (AGEGRP) – 20
Preliminary analysis
[Figure: Mean DVHST94 by levels of each covariate (NEW.AGE, SEX, DVHHIN94, DVSMKT94, NEW.DVPP94, NUMCHRON, VISITS, NEW.WT6, NEW.DVBMI94) and overall; x-axis: mean DVHST94, from 0.84 to 0.92; N = 1691, N missing = 698]
Preliminary analysis
• Strength of marginal relationships between the covariates and the response using generalized Spearman chi-square
Assessing the missing pattern
• The missing pattern of the response does not appear to depend on the sampling weights
Assessing the missing pattern
• The missing values depend on age
[Figure: Missing response DVHST94 vs NEW.AGE; x-axis: NEW.AGE (2 to 42); left axis: sample size (0 to 500); right axis: percentage of missing (0% to 100%); series: total sample size, % missing values]
Assessing the missing pattern
LOGISTIC REGRESSION
Coefficients:
                   Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)       -5.058793    0.367083    -13.781   < 2e-16   ***
NEW.AGE            0.181625    0.007524     24.140   < 2e-16   ***
SEXMale           -0.847947    0.131475     -6.450   1.12e-10  ***
DVHHIN94           0.047828    0.026768      1.787   0.0740    .
DVSMKT94          -0.015131    0.031662     -0.478   0.6327
NEW.DVPP94 = 0     0.233188    0.226732      1.028   0.3037
NUMCHRON          -0.087992    0.048783     -1.804   0.0713    .
VISITS             0.012483    0.006563      1.902   0.0572    .
NEW.WT6           -0.043935    0.077407     -0.568   0.5703
NEW.DVBMI94       -0.015622    0.017299     -0.903   0.3665
[Figure: Missing response DVHST94 vs Gender; bars for Male, Female and Total split into Missing and Observed; y-axis: 0 to 3000]
% missing for males: 24%
% missing for females: 34%
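The idea behind the logistic regression above (model the indicator "response is missing" as a function of the covariates) can be sketched with a single covariate. This is an illustration under stated assumptions, not the fit reported on the slide: the data are simulated, only an age-like covariate is used, the "true" coefficients are chosen for illustration, and `fit_logistic` is a hypothetical helper that runs a plain Newton-Raphson iteration.

```python
import math
import random

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson fit of P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + information^{-1} * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated data: missingness becomes more likely as the age covariate grows.
random.seed(2002)
ages = [random.choice(range(2, 43, 5)) for _ in range(2000)]
true_b0, true_b1 = -5.0, 0.18   # illustrative values only
missing = [
    1 if random.random() < 1.0 / (1.0 + math.exp(-(true_b0 + true_b1 * a))) else 0
    for a in ages
]
b0_hat, b1_hat = fit_logistic(ages, missing)
```

A clearly positive slope estimate indicates that the probability of a missing response increases with the covariate, which is the kind of evidence used here against MCAR.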
Multiple imputation
Methods:
– Random Hot Deck MI with Bootstrap
– SAS PROC MI and PROC MIANALYZE
– Function TRANSCAN in S-plus from the Hmisc library (Frank Harrell)
Multiple Imputation
INCOMPLETE DATA → [IMPUTATION] → IMPUTED DATA → [ANALYSIS] → ANALYSIS RESULTS → [POOLING] → FINAL RESULTS
• IMPUTATION:
Impute the missing entries of the incomplete data set B times, resulting in B completed data sets.
• ANALYSIS:
Analyze each of the B completed data sets using weighted least squares.
• POOLING:
Integrate the B analysis results into a final result. Simple rules exist for combining the B analyses.
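The pooling step can be sketched for a single scalar parameter using Rubin's combining rules: the pooled estimate is the average of the B estimates, and the total variance is the average within-imputation variance plus (1 + 1/B) times the between-imputation variance. The function name `pool_rubin` and the numbers below are hypothetical and serve only as an illustration.

```python
from statistics import mean, variance

def pool_rubin(estimates, within_variances):
    """Combine B completed-data analyses of one scalar parameter."""
    b_count = len(estimates)
    pooled = mean(estimates)               # pooled point estimate
    within = mean(within_variances)        # average within-imputation variance
    between = variance(estimates)          # between-imputation variance (divides by B - 1)
    total = within + (1 + 1 / b_count) * between
    return pooled, total

# Made-up results from B = 5 completed data sets.
estimates = [0.10, 0.12, 0.11, 0.13, 0.09]
variances = [0.0004, 0.0005, 0.0004, 0.0006, 0.0005]
pooled_est, total_var = pool_rubin(estimates, variances)
```

The square root of the total variance is the standard error that would be used to build an interval for the pooled estimate.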
Random hot-deck MI with Bootstrap
B = 1000 replicates
• Replicate 1: each missing response is replaced by an observed response chosen randomly with replacement (probability ~ weights), giving a complete data set. From it we obtain $(\hat{\theta}_1, U_1, R_1^2)$: estimate, within variance, R-square.
• Same procedure for the remaining replicates, up to $(\hat{\theta}_{1000}, U_{1000}, R_{1000}^2)$.
• Pooled estimates:
$$\bar{\theta}_B = \frac{1}{B}\sum_{i=1}^{B}\hat{\theta}_i, \qquad \bar{R}^2_B = \frac{1}{B}\sum_{i=1}^{B} R_i^2$$
• Total variance:
$$T_B = \bar{U}_B + \left(1 + \frac{1}{B}\right) b_B$$
where
$$\bar{U}_B = \frac{1}{B}\sum_{i=1}^{B} U_i \quad \text{(within variance)}$$
$$b_B = \frac{1}{B-1}\sum_{i=1}^{B} (\hat{\theta}_i - \bar{\theta}_B)(\hat{\theta}_i - \bar{\theta}_B)' \quad \text{(between variance)}$$
• Compute 95% CI for judging significance of predictors
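A minimal sketch of the random hot-deck draw is given below. It assumes that "probability ~ weights" means each observed response is selected with probability proportional to its own sampling weight; the helper `hot_deck_impute` and the toy data are hypothetical. For brevity each replicate only records the mean of the completed responses, whereas the study fits a weighted regression to each completed data set.

```python
import random

def hot_deck_impute(responses, weights, rng):
    """Fill None entries by drawing observed responses with replacement,
    with probability proportional to the donor's sampling weight (assumption)."""
    donors = [(y, w) for y, w in zip(responses, weights) if y is not None]
    donor_values = [y for y, _ in donors]
    donor_weights = [w for _, w in donors]
    n_missing = sum(y is None for y in responses)
    draws = iter(rng.choices(donor_values, weights=donor_weights, k=n_missing))
    return [next(draws) if y is None else y for y in responses]

# Toy data, not taken from the case study.
rng = random.Random(2002)
responses = [0.95, None, 0.80, 0.99, None, 0.60]
weights = [1.2, 0.8, 1.0, 1.5, 0.9, 0.7]

B = 1000
replicate_means = []
for _ in range(B):
    completed = hot_deck_impute(responses, weights, rng)
    replicate_means.append(sum(completed) / len(completed))
```

Because donors are drawn without reference to the covariates, every missing case receives a value from the same pool regardless of its age or sex.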
PROC MI & MIANALYZE Method
PROC MI
1. By default generates 5 imputed values for each missing value
2. Imputation method: MCMC (Markov Chain Monte Carlo)
   • EM algorithm determines initial values
   • MCMC repeatedly simulates the distribution of interest from which the imputed values are drawn
3. Assumption: data follow a multivariate normal distribution
PROC REG
Fits five weighted linear regression models to the five completed data sets obtained from PROC MI (using the BY _Imputation_ statement)
PROC MIANALYZE
Reads the parameter estimates and associated covariance matrix from the analysis performed on the multiply imputed data sets and derives valid statistics for the parameters
TRANSCAN (S-plus, Hmisc; Frank Harrell)
Transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables.
It approximates multiple imputation using the approximate Bayesian bootstrap described by Rubin:
• Draws a sample of size r with replacement from the r non-missing residuals.
• Chooses a sample of size m with replacement from this sample of size r, where m is the number of missing values.
• Generates imputed values with the linear imputation model and the bootstrapped residuals.
$$(Y_{obs}, X_{obs}) \xrightarrow{\text{LS}} (\hat{\varepsilon}_1, \ldots, \hat{\varepsilon}_r) \xrightarrow{\text{Bootstrap}} (\hat{\varepsilon}'_1, \ldots, \hat{\varepsilon}'_r) \xrightarrow{\text{Bootstrap}} (\hat{\varepsilon}^{*}_1, \ldots, \hat{\varepsilon}^{*}_m)$$
This algorithm is repeated B times to obtain the multiply imputed data sets, which are analyzed using WLS with the function lm.
Advantages:
• Does not need a normality assumption or symmetry of residuals.
• Applies shrinkage to avoid overfitting.
Disadvantage:
• “Freezes” the imputation model before drawing the multiple imputations.
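The residual-resampling steps can be sketched for a simple linear imputation model with one predictor. This is only an illustration of the two-stage bootstrap of residuals: it does not reproduce Transcan's variable transformations or its shrinkage, and the helpers `fit_line` and `abb_impute` and the toy data are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def abb_impute(x, y, rng):
    """One imputation: fitted value plus a twice-bootstrapped residual."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_line([xi for xi, _ in obs], [yi for _, yi in obs])
    residuals = [yi - (a + b * xi) for xi, yi in obs]
    r = len(residuals)
    m = sum(yi is None for yi in y)
    first = rng.choices(residuals, k=r)       # size r, with replacement
    second = iter(rng.choices(first, k=m))    # size m, with replacement
    return [a + b * xi + next(second) if yi is None else yi
            for xi, yi in zip(x, y)]

# Toy data, not taken from the case study.
rng = random.Random(2002)
x = [2, 7, 12, 17, 22, 27, 32, 37]
y = [0.95, 0.94, None, 0.91, 0.90, None, 0.86, 0.85]
imputations = [abb_impute(x, y, rng) for _ in range(5)]
```

Note that the line is fitted once per call from the observed cases only, which mirrors the "frozen" imputation model listed as a disadvantage.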
Comparing imputation methods
Ranking:
1. TRANSCAN (advantage: shrinkage correction to prevent overfitting)
2. PROC MI (drawback: normality assumption)
3. Bootstrap random hot deck (does not use the information in the covariates)

Coefficient estimates (standard error); * marks a significant predictor.

                  S-plus TRANSCAN      SAS PROC MI          Bootstrap (random hot deck)   Available data only
(Intercept)        0.8495 (0.0135) *    0.9281 (0.01) *      0.8711 (0.0128) *             0.861 (0.012) *
NEW.AGE           -0.0039 (0.0004) *   -0.0016 (0.0004) *   -0.0006 (0.0002) *            -0.0013 (0.0003) *
SEX (Male=1)       0.0045 (0.0045)      0.0023 (0.0045)      0.0031 (0.0055)               0.0037 (0.0049)
DVHHIN94           0.0083 (0.0016) *    0.0061 (0.0012) *    0.0029 (0.0007) *             0.0051 (0.0011) *
NEW.DVBMI94       -0.0001 (0.0007)     -0.0005 (0.0008)     -0.0006 (0.0005)              -0.0007 (0.0007)
DVSMKT94           0.0012 (0.0014)      0.0009 (0.0013)      0.0019 (0.0008) *             0.0012 (0.0012)
NEW.DVPP94(=0)     0.0904 (0.0085) *    0.0717 (0.0092) *    0.0531 (0.0089) *             0.0686 (0.0081) *
NUMCHRON          -0.0174 (0.0022) *   -0.0123 (0.0023) *   -0.0079 (0.0013) *            -0.013 (0.0021) *
VISITS            -0.0026 (0.0003) *   -0.0023 (0.0003) *   -0.0017 (0.0002) *            -0.0023 (0.0003) *
Mean R-square      0.33                 0.193                0.093                         0.183
Significant variables
[Figure: six panels (Intercept, DVHHIN94, NEW.DVPP94(=0), NEW.AGE, VISITS, NUMCHRON), each comparing results across S-plus TRANSCAN, SAS PROC MI, Random Hot Deck (Bootstrap) and complete observations]
Conclusions about the missing pattern
• The missing values of the response variable DVHST94 are not MCAR. The probability of a missing response depends primarily on the age and sex covariates, so the missing values are treated as MAR.
Conclusions about multiple imputation
• The Transcan function appeared to perform better than PROC MI for imputing and analyzing this data set, given the non-normality of the response.
• Random hot deck MI with bootstrap gave significantly biased results. This approach does not take into account the information provided by the covariates and is therefore not appropriate for MAR data.
Conclusions about the data analysis
• The health status of the population tends to decrease with age.
• People with higher income tend to have better health than people with lower income.
• People with lower health status demand more medical services (visits to a doctor).
• People who are prone to depression have lower health status.
• Smoking does not appear to have a decisive influence on health status.
Future work
• GLM could be used to model the categorical response GQ.H1 using a multinomial logistic model to impute the missing categorical responses
• Interactions of the significant variables with the insignificant variables should be explored in order to further assess the concomitant effects (e.g. smoking and depression).
Acknowledgements:
Special thanks to Professors Peggy Ng and George Monette for their support.