BMTRY 701 Biostatistical Methods II
Lecture 13
Diagnostics in MLR
• Added variable plots
• Identifying outliers
• Variance Inflation Factor

BMTRY 701 Biostatistical Methods II
Recall the added variable plots
These can help check the adequacy of the model:
• Is there curvature between Y and X after adjusting for the other X's?
• "Refined" residual plots
• They show the marginal importance of an individual predictor
• Help figure out a good form for the predictor
Example: SENIC
Recall the difficulty determining the form for INFRISK in our regression model.
Last time, we settled on including one term, INFRISK^2.
But we could take an added variable plot approach.
How? We want to know: adjusting for all else in the model, what is the right form for INFRISK?
R code
av1 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
av2 <- lm(INFRISK ~ AGE + XRAY + CENSUS + factor(REGION))
resy <- av1$residuals
resx <- av2$residuals
plot(resx, resy, pch=16)
abline(lm(resy ~ resx), lwd=2)
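The defining property of this construction can be checked numerically: the least-squares slope of resy on resx equals the coefficient the predictor would receive in the full multiple regression (the Frisch-Waugh-Lovell result). A minimal numpy sketch with simulated data (all variable names hypothetical, not the SENIC data):

```python
import numpy as np

# Simulated data: outcome y, predictor of interest x, one adjustment variable z
rng = np.random.default_rng(1)
n = 100
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)
y = 1.0 + 2.0 * x - 1.5 * z + rng.normal(size=n)

def residuals(response, design):
    """Residuals from an OLS fit of response on design (intercept included)."""
    D = np.column_stack([np.ones(len(response)), design])
    beta, *_ = np.linalg.lstsq(D, response, rcond=None)
    return response - D @ beta

res_y = residuals(y, z)   # y adjusted for z
res_x = residuals(x, z)   # x adjusted for z

# Slope of the added variable plot (resy on resx) ...
av_slope = np.linalg.lstsq(np.column_stack([np.ones(n), res_x]),
                           res_y, rcond=None)[0][1]

# ... equals the coefficient of x in the full model y ~ x + z
full, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x, z]), y, rcond=None)
print(np.isclose(av_slope, full[1]))  # True
```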
Added Variable Plot
[Figure: added variable plot of resy versus resx, with fitted regression line.]
What does that show?
The relationship between logLOS and INFRISK if you added INFRISK to the regression.
But is that what we want to see? How about looking at the residuals versus INFRISK (before including INFRISK in the model)?
R code

mlr8 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
smoother <- lowess(INFRISK, mlr8$residuals)
plot(INFRISK, mlr8$residuals)
lines(smoother)
[Figure: mlr8$residuals versus INFRISK, with lowess smoother.]
R code
> infrisk.star <- ifelse(INFRISK > 4, INFRISK - 4, 0)
> mlr9 <- lm(logLOS ~ INFRISK + infrisk.star + AGE + XRAY +
+            CENSUS + factor(REGION))
> summary(mlr9)
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.798e+00  1.667e-01  10.790  < 2e-16 ***
INFRISK          1.836e-03  1.984e-02   0.093 0.926478
infrisk.star     6.795e-02  2.810e-02   2.418 0.017360 *
AGE              5.554e-03  2.535e-03   2.191 0.030708 *
XRAY             1.361e-03  6.562e-04   2.073 0.040604 *
CENSUS           3.718e-04  7.913e-05   4.698 8.07e-06 ***
factor(REGION)2 -7.182e-02  3.051e-02  -2.354 0.020452 *
factor(REGION)3 -1.030e-01  3.036e-02  -3.391 0.000984 ***
factor(REGION)4 -2.068e-01  3.784e-02  -5.465 3.19e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1137 on 104 degrees of freedom
Multiple R-Squared: 0.6209, Adjusted R-squared: 0.5917
F-statistic: 21.29 on 8 and 104 DF, p-value: < 2.2e-16
Residual Plots
[Figure: residuals versus INFRISK for two models. Left panel: mlr9$residuals (spline for INFRISK). Right panel: mlr7$residuals (INFRISK^2).]
Which is better?
Cannot compare via ANOVA because they are not nested!
But we can compare statistics qualitatively.
R-squared:
• MLR7: 0.60
• MLR9: 0.62
Partial R-squared:
• MLR7: 0.17
• MLR9: 0.19
Identifying Outliers
Harder to do in the MLR setting than in the SLR setting.
Recall two concepts that make outliers important:
• Leverage is a function of the explanatory variable(s) alone and measures the potential for a data point to affect the model parameter estimates.
• Influence is a measure of how much a data point actually does affect the estimated model.
Leverage and influence both may be defined in terms of matrices.
“Hat” matrix
We must do some matrix work to understand this. Section 6.2 presents MLR in matrix terms.
Notation for an MLR with p predictors and data on n patients. The data:

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p} \\ 1 & X_{21} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{np} \end{pmatrix}$$
Matrix Format for the MLR model

More notation:

$$e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}$$

THE MODEL:

$$Y = X\beta + e$$

What are the dimensions of each?
“Transpose” and “Inverse”
X-transpose: $X'$ or $X^T$
X-inverse: $X^{-1}$
Hat matrix:

$$H = X(X'X)^{-1}X'$$

Why is H important? It transforms Y's to Yhat's:

$$\hat{Y} = HY$$
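As a quick numerical sanity check, a small numpy sketch (simulated data, not the SENIC data) can form H directly and verify that it maps Y to the least-squares fitted values:

```python
import numpy as np

# Simulated illustration: n = 6 cases, intercept plus 2 predictors.
rng = np.random.default_rng(0)
n = 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

# H transforms Y's to Yhat's: compare H @ Y with the fitted values
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(H @ Y, X @ beta_hat))  # True
```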
Estimating, based on the fitted model

Variance-covariance matrix of the residuals:

$$s^2(e) = MSE \, (I - H)$$

Variance of the ith residual:

$$s^2(e_i) = MSE \, (1 - h_{ii})$$

Covariance of the ith and jth residuals ($i \neq j$):

$$s(e_i, e_j) = -h_{ij} \, MSE$$
Other uses of H
$$e = (I - H)Y$$

(I = identity matrix)

Variance-covariance matrix of the residuals:

$$\sigma^2(e) = \sigma^2 (I - H)$$

Variance of the ith residual:

$$\sigma^2(e_i) = \sigma^2 (1 - h_{ii})$$

Covariance of the ith and jth residuals ($i \neq j$):

$$\sigma(e_i, e_j) = -h_{ij} \, \sigma^2$$
Property of hij’s
$$\sum_{j=1}^{n} h_{ij} = \sum_{i=1}^{n} h_{ij} = 1$$

This means that each row of H sums to 1, and each column of H sums to 1.
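A minimal numpy check of this property (simulated design matrix; the property relies on the model including an intercept, since then HX = X and the first column of X is all ones):

```python
import numpy as np

# Simulated design matrix with an intercept column.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H.sum(axis=1), 1.0))  # each row sums to 1 -> True
print(np.allclose(H.sum(axis=0), 1.0))  # each column sums to 1 (H is symmetric) -> True
```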
Other use of H
Identifies points of leverage
[Figure: scatterplot of y versus x with four numbered points illustrating differing leverage.]
Using the Hat Matrix to identify outliers
Look at h_ii to see if a data point is an outlier:
• Large values of h_ii imply small values of var(e_i)
• As h_ii gets close to 1, var(e_i) approaches 0

Note that

$$\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j = h_{ii} y_i + \sum_{j \neq i} h_{ij} y_j$$

As h_ii approaches 1, yhat_i approaches y_i. This gives h_ii the name "leverage". A HIGH HAT VALUE IMPLIES POTENTIAL FOR AN OUTLIER!
R code
hat <- hatvalues(reg)
plot(1:102, hat)
highhat <- ifelse(hat > 0.10, 1, 0)
plot(x, y)
points(x[highhat==1], y[highhat==1], col=2, pch=16, cex=1.5)
Hat values versus index
[Figure: hat values versus index (1:102); values range from about 0.02 to 0.14.]
Identifying points with high hii
[Figure: scatterplot of y versus x, with the high-h_ii points highlighted.]
Does a high hat mean it has a large residual?
No. h_ii measures leverage, not influence.
Recall what h_ii is made of:
• it depends ONLY on the X's
• it does not depend on the actual Y value
Look back at the plot: which of these is probably most "influential"?
Standard cutoffs for "large" h_ii:
• 2p/n
• 0.5 very high; 0.2-0.5 high
Let’s look at our MLR9
Any outliers?
[Figure: hat values for mlr9 versus index; values range from about 0.05 to 0.20.]
Using the hat matrix in MLR
Studentized residuals. Acknowledge that:
• each residual has a different variance
• the magnitude of a residual should be made relative to its variance (or SD)
Studentized residuals recognize these differences in sampling errors.
Defining Studentized Residuals
From slide 15,

$$s^2(e_i) = MSE \, (1 - h_{ii})$$

We then define

$$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE \, (1 - h_{ii})}}$$

Comparing e_i and r_i:
• e_i have different variances due to sampling variation
• r_i have constant variance
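A sketch of this computation on simulated data (in R this is what `rstandard()` returns; the numpy version below just spells out the formula, and all data are simulated, not the SENIC data):

```python
import numpy as np

# Simulated data: intercept plus 2 predictors.
rng = np.random.default_rng(3)
n, n_params = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = Y - H @ Y                    # ordinary residuals e = (I - H) Y
MSE = e @ e / (n - n_params)     # SSE / (n - number of parameters)
r = e / np.sqrt(MSE * (1 - h))   # studentized residuals

print(np.isfinite(r).all())  # True
```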
Deleted Residuals
Influence is more intuitively quantified by how things change when an observation is in versus out of the estimation process.
It would be more useful to have residuals for the situation when the observation is removed.
Example:
• if a Y_i is far out, then it may be very influential in the regression, and its residual will be small
• but if that case is removed before estimating, and the residual is then calculated from the resulting fit, the residual would be large
Deleted Residuals, di
Process:
• delete the ith case
• fit the regression with all other cases
• obtain an estimate of E(Y_i) based on its X's and the fitted model

$$\hat{Y}_{i(i)} = \text{fitted value for case } i \text{, with case } i \text{ removed from estimation}$$

$$d_i = Y_i - \hat{Y}_{i(i)}$$
Deleted Residuals, di
Nice result: you don't actually have to refit without the ith case!

$$d_i = \frac{e_i}{1 - h_{ii}}$$

where e_i is the "plain" residual from the ith case and h_ii is the hat value; both are from the regression INCLUDING the case.
For small h_ii, e_i and d_i will be similar; for large h_ii, they will be different.
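This identity can be verified against a brute-force leave-one-out refit; a numpy sketch on simulated data (not the SENIC data):

```python
import numpy as np

# Simulated simple regression: intercept plus one predictor.
rng = np.random.default_rng(4)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = Y - H @ Y
d_shortcut = e / (1 - np.diag(H))   # deleted residuals, no refitting needed

# Brute force: drop case i, refit, predict case i
i = 0
mask = np.arange(n) != i
beta_i, *_ = np.linalg.lstsq(X[mask], Y[mask], rcond=None)
d_refit = Y[i] - X[i] @ beta_i

print(np.isclose(d_shortcut[i], d_refit))  # True
```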
Studentized Deleted Residuals
Recall the need to standardize, based on knowledge of the variance:

$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)} \, (1 - h_{ii})}}$$

What is the difference between t_i and r_i?
Another nice result
You can calculate MSE(i) without refitting the model
$$t_i = e_i \left[ \frac{n - p - 1}{SSE \, (1 - h_{ii}) - e_i^2} \right]^{1/2}$$
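Here p counts so that n - p - 1 is the residual degrees of freedom of the leave-one-out fit. A numpy check of this shortcut against the definition t_i = d_i / s(d_i), on simulated data with an intercept plus one predictor (two estimated parameters):

```python
import numpy as np

# Simulated data: intercept plus 1 predictor (n_params = 2).
rng = np.random.default_rng(5)
n, n_params = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = Y - H @ Y
SSE = e @ e

# Shortcut formula: no refitting (df = n - n_params - 1)
t_shortcut = e * np.sqrt((n - n_params - 1) / (SSE * (1 - h) - e**2))

# Definition: refit without case i, compute MSE_(i), standardize
i = 3
mask = np.arange(n) != i
beta_i, *_ = np.linalg.lstsq(X[mask], Y[mask], rcond=None)
resid_i = Y[mask] - X[mask] @ beta_i
MSE_i = resid_i @ resid_i / (n - 1 - n_params)
t_def = e[i] / np.sqrt(MSE_i * (1 - h[i]))

print(np.isclose(t_shortcut[i], t_def))  # True
```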
Testing for outliers
An outlier is a Y observation whose studentized deleted residual is large in absolute value.
t_i ~ t with n-p-1 degrees of freedom.
Two examples:
• simulated data
• mlr9