
Page 1: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Lecture 5: SLR Diagnostics (Continued), Correlation, Introduction to Multiple Linear Regression

BMTRY 701 Biostatistical Methods II

Page 2: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

From last lecture

What were the problems we diagnosed?

We shouldn’t just give up!

Some possible approaches for improvement:
• remove the outliers: does the model change?
• transform LOS: do we better adhere to model assumptions?

Page 3: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Outlier Quandary

To remove or not to remove outliers

Are they real data?

If they are truly reflective of the data, then what does removing them imply?

Use caution!
• better to be true to the data
• having a perfect model should not be at the expense of using 'real' data!

Page 4: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Removing the outliers: How to?

I am always reluctant.

My approach in this example:
• remove each outlier separately
• remove both together
• compare each model with the model that includes the outliers

How to decide: compare slope estimates.

Page 5: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

SENIC Data

> par(mfrow=c(1,2))
> hist(data$LOS)
> plot(data$BEDS, data$LOS)

[Figure: histogram of data$LOS (x-axis: LOS, y-axis: Frequency) and scatterplot of data$LOS versus data$BEDS]

Page 6: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

How to fit regression removing outlier(s)?

> keep.remove.both <- ifelse(data$LOS<16, 1, 0)
> keep.remove.20 <- ifelse(data$LOS<19, 1, 0)
> keep.remove.18 <- ifelse(data$LOS<16 | data$BEDS<600, 1, 0)
>
> table(keep.remove.both)
keep.remove.both
  0   1
  2 111
> table(keep.remove.20)
keep.remove.20
  0   1
  1 112
> table(keep.remove.18)
keep.remove.18
  0   1
  1 112

Page 7: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Regression Fitting

reg <- lm(LOS ~ BEDS, data=data)

reg.remove.both <- lm(LOS ~ BEDS, data=data[keep.remove.both==1,])

reg.remove.20 <- lm(LOS ~ BEDS, data=data[keep.remove.20==1,])

reg.remove.18 <- lm(LOS ~ BEDS, data=data[keep.remove.18==1,])
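The comparison table on the next slide can be assembled directly from these fits. A minimal sketch (the slide itself does not show this code):

# tabulate the slope and its SE from each fit
fits <- list(reg = reg, remove.both = reg.remove.both,
             remove.20 = reg.remove.20, remove.18 = reg.remove.18)
slopes <- sapply(fits, function(f) coef(summary(f))["BEDS", c("Estimate", "Std. Error")])
slopes
# percent change in the slope relative to the full-data fit
round(100 * abs(1 - slopes["Estimate", ] / slopes["Estimate", "reg"]), 0)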

Page 8: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

How much do our inferences change?

                reg        remove both   remove 20   remove 18
β1 estimate     0.00406    0.00299       0.00393     0.00314
se(β1)          0.00086    0.00070       0.00073     0.00085
% change        0 (ref)    26%           3%          23%

Why is “18” a bigger outlier than “20”?

Page 9: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Leverage and Influence

Leverage is a function of the explanatory variable(s) alone and measures the potential for a data point to affect the model parameter estimates.

Influence is a measure of how much a data point actually does affect the estimated model.

Leverage and influence both may be defined in terms of matrices

More later in MLR (MPV ch. 6)
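A minimal sketch using R's built-in diagnostics, assuming the reg fit from the earlier slides (the matrix definitions come later with MLR):

# Leverage: diagonal of the hat matrix (a function of BEDS alone)
h <- hatvalues(reg)
# Influence: Cook's distance combines leverage with residual size
d <- cooks.distance(reg)
# flag the most extreme points
which(h > 2 * mean(h))      # a common rule of thumb for high leverage
which(d > 4 / nrow(data))   # a common rule of thumb for high influence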

Page 10: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Graphically

[Figure: scatterplot of data$LOS versus data$BEDS with four fitted lines: reg, w/out 18, w/out 20, w/out both]

Page 11: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

R code

par(mfrow=c(1,1))
plot(data$BEDS, data$LOS, pch=16)
# plain old regression model
abline(reg, lwd=2)
# plot "20" to show which point we are removing, then add regression line
points(data$BEDS[keep.remove.20==0], data$LOS[keep.remove.20==0],
       col=2, cex=1.5, pch=16)
abline(reg.remove.20, col=2, lwd=2)
# plot "18" and then add regression line
points(data$BEDS[keep.remove.18==0], data$LOS[keep.remove.18==0],
       col=4, cex=1.5, pch=16)
abline(reg.remove.18, col=4, lwd=2)
# add regression line where we removed both outliers
abline(reg.remove.both, col=5, lwd=2)
# add a legend to the plot
legend(1, 19, c("reg", "w/out 18", "w/out 20", "w/out both"),
       lwd=rep(2,4), lty=rep(1,4), col=c(1,2,4,5))

Page 12: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

What to do?

Let's try something else. What was our other problem?
• heteroskedasticity (great word… try that in Scrabble)
• non-normality of the residuals

Common way to solve: transform the outcome.

Page 13: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Determining the Transformation

Box-Cox transformation approach

Finds the "best" power transformation to achieve the distribution closest to normality.

Can apply to:
• a single variable
• a linear regression model

When applied to a regression model, the result tells you the "best" power transform of Y to achieve normal residuals.

Page 14: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Review of power transformation

Assume we want to transform Y. Box-Cox considers Y^a for all values of a. The solution is the a that provides the "most normal" looking Y^a.

Practical powers:
• a = 1: identity
• a = 1/2: square root
• a = 0: log
• a = −1: 1/Y. Usually we also take the negative (−1/Y) so that the order of the data is maintained (see the quick check below).

The optimal power often has no practical interpretation, e.g. Y^−0.136.
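A tiny numeric check of why we negate the reciprocal:

y <- c(2, 5, 10)   # increasing
1/y                # 0.50, 0.20, 0.10: order is reversed
-1/y               # -0.50, -0.20, -0.10: order is preserved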

Page 15: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Box-Cox for linear regression

library(MASS)

bc <- boxcox(reg)

[Figure: Box-Cox profile log-likelihood plotted against the power λ from −2 to 2, with the 95% confidence interval marked]
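The "best" power can also be read off numerically; a sketch assuming the bc object created above:

# boxcox() returns the grid of powers (x) and the profile log-likelihoods (y)
lambda.hat <- bc$x[which.max(bc$y)]
lambda.hat   # in practice we usually round to a nearby interpretable power (0, -1/2, -1, ...)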

Page 16: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Transform

[Figure: plot of ty = −1/LOS versus data$LOS]

ty <- -1/data$LOS

plot(data$LOS, ty)

Page 17: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

[Figure: plot of ty = −1/LOS versus data$BEDS with the fitted regression line]

New regression: transform is -1/LOS

plot(data$BEDS, ty, pch=16)
reg.ty <- lm(ty ~ data$BEDS)
abline(reg.ty, lwd=2)

Page 18: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

More interpretable?

LOS is often analyzed in the literature. The common transform is the log:
• it is well known that LOS is skewed in most applications
• most people take the log
• people are used to seeing and interpreting it on the log scale

How good is our model if we just take the log?

Page 19: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Regression with log(LOS)

[Figure: plots of log(LOS) versus data$LOS (left) and versus data$BEDS (right)]

Page 20: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Let’s compare: residual plots

[Figure: reg.ty$residuals versus data$BEDS (left) and reg.logy$residuals versus data$BEDS (right)]

Page 21: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Let’s compare: distribution of residuals

[Figure: boxplots of the residuals where Y = −1/LOS (left) and where Y = log(LOS) (right)]

Page 22: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Let’s Compare: |Residuals|

[Figure: abs(reg.ty$residuals) versus data$BEDS (left) and abs(reg.logy$residuals) versus data$BEDS (right)]

p = 0.59 (left)   p = 0.12 (right)
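One plausible way p-values like these could be computed (the slide does not show the code) is a crude check for trend in residual spread: regress |residuals| on BEDS and test the slope.

# crude heteroskedasticity check: is |residual| associated with BEDS?
summary(lm(abs(reg.ty$residuals) ~ data$BEDS))$coefficients["data$BEDS", "Pr(>|t|)"]
summary(lm(abs(reg.logy$residuals) ~ data$BEDS))$coefficients["data$BEDS", "Pr(>|t|)"]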

Page 23: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Let’s Compare: QQ-plot

[Figure: normal QQ-plots (Sample Quantiles versus Theoretical Quantiles) of the residuals for TY = −1/LOS (left) and LogY = log(LOS) (right)]

Page 24: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

R code

logy <- log(data$LOS)
par(mfrow=c(1,2))
plot(data$LOS, logy)
plot(data$BEDS, logy, pch=16)
reg.logy <- lm(logy ~ data$BEDS)
abline(reg.logy, lwd=2)

par(mfrow=c(1,2))
plot(data$BEDS, reg.ty$residuals, pch=16)
abline(h=0, lwd=2)
plot(data$BEDS, reg.logy$residuals, pch=16)
abline(h=0, lwd=2)

boxplot(reg.ty$residuals)
title("Residuals where Y = -1/LOS")
boxplot(reg.logy$residuals)
title("Residuals where Y = log(LOS)")

qqnorm(reg.ty$residuals, main="TY")
qqline(reg.ty$residuals)
qqnorm(reg.logy$residuals, main="LogY")
qqline(reg.logy$residuals)

Page 25: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Regression results

> summary(reg.ty)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.169e-01  2.522e-03 -46.371  < 2e-16 ***
data$BEDS    3.953e-05  7.957e-06   4.968 2.47e-06 ***
---

> summary(reg.logy)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1512591  0.0251328  85.596  < 2e-16 ***
data$BEDS   0.0003921  0.0000793   4.944 2.74e-06 ***
---

Page 26: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Let’s compare: results ‘untransformed’

[Figure: data$LOS versus data$BEDS with the fitted curves from each model; full y-range (left) and zoomed to LOS 7–12 (right)]

Page 27: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

R code

par(mfrow=c(1,2))
plot(data$BEDS, data$LOS, pch=16)
abline(reg, lwd=2)
lines(sort(data$BEDS), -1/sort(reg.ty$fitted.values), lwd=2, lty=2)
lines(sort(data$BEDS), exp(sort(reg.logy$fitted.values)), lwd=2, lty=3)

plot(data$BEDS, data$LOS, pch=16, ylim=c(7,12))
abline(reg, lwd=2)
lines(sort(data$BEDS), -1/sort(reg.ty$fitted.values), lwd=2, lty=2)
lines(sort(data$BEDS), exp(sort(reg.logy$fitted.values)), lwd=2, lty=3)

Page 28: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

So, what to do?

What are the pros and cons of each transform?

Should we transform at all?!

Page 29: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Switching Gears: Correlation

"Pearson" correlation measures the linear association between two variables.

A natural by-product of linear regression.

Notation: r or ρ (rho)

Page 30: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Correlation versus slope?

They measure different aspects of the association between X and Y:
• slope: measures whether there is a linear trend
• correlation: measures how closely the data points fall to the line

Statistical significance is IDENTICAL:
• the p-value for testing that the correlation is 0 is the SAME as the p-value for testing that the slope is 0 (see the check below)
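A quick check in R with the SENIC data:

# the test of rho = 0 and the test of beta1 = 0 give the same p-value
cor.test(data$BEDS, data$LOS)$p.value
summary(lm(LOS ~ BEDS, data=data))$coefficients["BEDS", "Pr(>|t|)"]
# both are about 6.8e-06 here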

Page 31: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

[Figure: Example: same slope, different correlation — two scatterplots of y versus x with fitted lines; left: r = 0.46, b1 = 2; right: r = 0.95, b1 = 2]

Page 32: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

[Figure: Example: same correlation, different slope — two scatterplots of y versus x with fitted lines; left: r = 0.46, b1 = 4; right: r = 0.46, b1 = 2]

Page 33: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Correlation

A scaled version of the covariance between X and Y.

Recall the covariance:

$\mathrm{Cov}(X, Y) = \sigma_{xy} = E[(X - \mu_x)(Y - \mu_y)]$

Estimating the covariance:

$\hat{\sigma}_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

Page 34: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Correlation

$\hat{\rho}_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$
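Equivalently, the correlation is the covariance scaled by the two standard deviations; a quick check with the SENIC data:

cov(data$BEDS, data$LOS) / (sd(data$BEDS) * sd(data$LOS))
cor(data$BEDS, data$LOS)    # same value, about 0.41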

Page 35: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Interpretation

Correlation tells how closely two variables "track" one another.

It provides information about the ability to predict Y from X.

Regression output:
• look for R²
• for SLR, sqrt(R²) = |correlation|

You can have a low correlation yet a significant association.

With correlation, a 95% confidence interval is helpful.

Page 36: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

LOS ~ BEDS

> summary(lm(data$LOS ~ data$BEDS))

Call:
lm(formula = data$LOS ~ data$BEDS)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8291 -1.0028 -0.1302  0.6782  9.6933

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.6253643  0.2720589  31.704  < 2e-16 ***
data$BEDS   0.0040566  0.0008584   4.726 6.77e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.752 on 111 degrees of freedom
Multiple R-squared: 0.1675,  Adjusted R-squared: 0.16
F-statistic: 22.33 on 1 and 111 DF,  p-value: 6.765e-06

$\sqrt{0.1675} = 0.409$
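The same relationship in R:

# for SLR, the square root of R-squared equals |correlation|
sqrt(summary(reg)$r.squared)    # 0.409
cor(data$BEDS, data$LOS)        # 0.409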

Page 37: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

95% Confidence Interval for Correlation

The computation of a confidence interval on the population value of Pearson's correlation (ρ) is complicated by the fact that the sampling distribution of r is not normally distributed. The solution lies with Fisher's z' transformation, described in the section on the sampling distribution of Pearson's r. The steps in computing a confidence interval for ρ are:
• convert r to z'
• compute a confidence interval in terms of z'
• convert the confidence interval back to r

Freeware!
• http://www.danielsoper.com/statcalc/calc28.aspx
• http://glass.ed.asu.edu/stats/analysis/rci.html
• http://faculty.vassar.edu/lowry/rho.html
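A sketch of the three steps in R, assuming the SENIC data frame data from the earlier slides:

r <- cor(data$BEDS, data$LOS)
n <- nrow(data)
z <- atanh(r)                               # step 1: convert r to z'
se <- 1 / sqrt(n - 3)
ci.z <- z + c(-1, 1) * qnorm(0.975) * se    # step 2: CI on the z' scale
tanh(ci.z)                                  # step 3: convert back to r
# cor.test() reports the same interval
cor.test(data$BEDS, data$LOS)$conf.int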

Page 38: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

log(LOS) ~ BEDS

> summary(lm(log(data$LOS) ~ data$BEDS))

Call:
lm(formula = log(data$LOS) ~ data$BEDS)

Residuals:
      Min        1Q    Median        3Q       Max
-0.296328 -0.106103 -0.005296  0.084177  0.702262

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1512591  0.0251328  85.596  < 2e-16 ***
data$BEDS   0.0003921  0.0000793   4.944 2.74e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1618 on 111 degrees of freedom
Multiple R-squared: 0.1805,  Adjusted R-squared: 0.1731
F-statistic: 24.44 on 1 and 111 DF,  p-value: 2.737e-06

$\sqrt{0.1805} = 0.425$

Page 39: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Multiple Linear Regression

Most regression applications include more than one covariate

Allows us to make inferences about the relationship between two variables (X and Y) adjusting for other variables

Used to account for confounding; especially important in observational studies.
• smoking and lung cancer
• we know that people who smoke tend to expose themselves to other risks and harms
• if we didn't adjust, we would overestimate the effect of smoking on the risk of lung cancer

Page 40: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Importance of including ‘important’ covariates

If you leave out relevant covariates, your estimate of β1 will be biased

How biased?

Assume:
• true model: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + e_i$
• fitted model: $Y_i = \beta_0^* + \beta_1^* X_{i1} + e_i^*$

Page 41: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Fun derivation

Starting from the SLR estimator of the slope in the fitted (misspecified) model:

$E(\hat{\beta}_1^*) = E\left[\frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}\right] = \frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)\,E(Y_i) - E(\bar{Y})\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}$

Because $\sum_{i=1}^{n}(X_{i1}-\bar{X}_1) = 0$, the second term in the numerator vanishes. Substituting the true model, $E(Y_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}$:

$E(\hat{\beta}_1^*) = \frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2})}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2} = \frac{\beta_1\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)X_{i1} + \beta_2\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)X_{i2}}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}$

Page 42: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Fun derivation

Using $\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)X_{i1} = \sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2$ and $\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)X_{i2} = \sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(X_{i2}-\bar{X}_2)$:

$E(\hat{\beta}_1^*) = \beta_1\,\frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2} + \beta_2\,\frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(X_{i2}-\bar{X}_2)}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}$

Page 43: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Fun derivation

$E(\hat{\beta}_1^*) = \beta_1 + \beta_2\,\frac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(X_{i2}-\bar{X}_2)}{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2} = \beta_1 + \beta_2\,\hat{\gamma}_1$

where $\hat{\gamma}_1$ is the estimated slope from the simple regression of $X_2$ on $X_1$. Hence $E(\hat{\beta}_1^*) = \beta_1$ only if $\beta_2 = 0$ or $\hat{\gamma}_1 = 0$.
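A simulation sketch of this result, using hypothetical data (not from the slides), checking that the slope from the misspecified fit is approximately β1 + β2 γ̂1:

set.seed(1)
n <- 1000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)             # X2 is correlated with X1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true model: beta1 = 2, beta2 = 3
b1.star <- coef(lm(y ~ x1))["x1"]     # slope from the fit that omits X2
gamma1  <- coef(lm(x2 ~ x1))["x1"]    # slope of X2 on X1
b1.star                               # approximately equal to ...
2 + 3 * gamma1                        # ... beta1 + beta2 * gamma1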

Page 44: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Implications

The bias is a function of the correlation between the two covariates, X1 and X2:
• if the correlation is high, the bias will be high
• if the correlation is low, the bias may be quite small
• if there is no correlation between X1 and X2, then omitting X2 does not bias inferences

However, it is still not a good model for prediction if X2 is related to Y.

Page 45: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

Example: LOS ~ BEDS analysis.

[Figure: log(data$LOS) versus data$BEDS, log(data$LOS) versus data$NURSE, and data$NURSE versus data$BEDS]

> cor(cbind(data$BEDS, data$NURSE, data$LOS))
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.9155042 0.4092652
[2,] 0.9155042 1.0000000 0.3403671
[3,] 0.4092652 0.3403671 1.0000000

Page 46: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

R code

reg.beds <- lm(log(data$LOS) ~ data$BEDS)
reg.nurse <- lm(log(data$LOS) ~ data$NURSE)
reg.beds.nurse <- lm(log(data$LOS) ~ data$BEDS + data$NURSE)
summary(reg.beds)
summary(reg.nurse)
summary(reg.beds.nurse)

Page 47: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

SLRs

Call:
lm(formula = log(data$LOS) ~ data$BEDS)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1512591  0.0251328  85.596  < 2e-16 ***
data$BEDS   0.0003921  0.0000793   4.944 2.74e-06 ***
---

Call:
lm(formula = log(data$LOS) ~ data$NURSE)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1682138  0.0250054  86.710  < 2e-16 ***
data$NURSE  0.0004728  0.0001127   4.195 5.51e-05 ***
---

Page 48: Lecture 5: SLR Diagnostics (Continued) Correlation Introduction to Multiple Linear Regression

BEDS + NURSE

> summary(reg.beds.nurse)

Call:
lm(formula = log(data$LOS) ~ data$BEDS + data$NURSE)

Residuals:
      Min        1Q    Median        3Q       Max
-0.291537 -0.108447 -0.006711  0.087594  0.696747

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.1522361  0.0252758  85.150   <2e-16 ***
data$BEDS    0.0004910  0.0001977   2.483   0.0145 *
data$NURSE  -0.0001497  0.0002738  -0.547   0.5857
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1624 on 110 degrees of freedom
Multiple R-squared: 0.1827,  Adjusted R-squared: 0.1678
F-statistic: 12.29 on 2 and 110 DF,  p-value: 1.519e-05