Detecting and reducing multicollinearity



Page 1: Detecting and reducing multicollinearity

Page 2: Detecting multicollinearity

Page 3: Common methods of detection

• Realized effects of multicollinearity: changes in coefficients, changes in the standard errors of coefficients, and changes in sequential sums of squares.

• Non-significant t-tests for all of the slopes but a significant overall F-test.

• Significant correlations among pairs of predictor variables (correlations, matrix scatter plots).

• Variance inflation factors (VIF).

Page 4: The first variance at issue

For the model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$$

the variance of the estimated coefficient b_k is:

$$\mathrm{Var}(b_k) = \frac{\sigma^2}{\left(1 - R_k^2\right)\sum_{i=1}^{n}\left(x_{ik} - \bar{x}_k\right)^2}$$

where $R_k^2$ is the R² value obtained by regressing the kth predictor on the remaining predictors.

Page 5: The second variance at issue

For the model:

$$y_i = \beta_0 + \beta_k x_{ik} + \varepsilon_i$$

the variance of the estimated coefficient b_k is:

$$\mathrm{Var}(b_k)_{\min} = \frac{\sigma^2}{\sum_{i=1}^{n}\left(x_{ik} - \bar{x}_k\right)^2}$$

Page 6: The ratio of the two variances

$$\frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}} = \frac{\sigma^2}{\left(1 - R_k^2\right)\sum_{i=1}^{n}\left(x_{ik} - \bar{x}_k\right)^2} \times \frac{\sum_{i=1}^{n}\left(x_{ik} - \bar{x}_k\right)^2}{\sigma^2} = \frac{1}{1 - R_k^2}$$
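This identity is easy to verify numerically. Below is a minimal sketch on synthetic data (not the slides' data; it assumes only numpy) that compares the slope variance from a two-predictor fit against the single-predictor fit:

```python
# Sketch: empirically verify Var(b_k) / Var(b_k)_min = 1 / (1 - R_k^2)
# on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # deliberately correlated with x1

def coef_variances(X, sigma2=1.0):
    """Sampling variances of the OLS coefficients: sigma^2 * diag((X'X)^-1)."""
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))

var_full = coef_variances(np.column_stack([np.ones(n), x1, x2]))[1]  # x2 in model
var_min  = coef_variances(np.column_stack([np.ones(n), x1]))[1]      # x1 alone

# R_1^2 from regressing x1 on x2 (with intercept)
Z = np.column_stack([np.ones(n), x2])
resid = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
r2 = 1 - resid.var() / x1.var()

print(var_full / var_min, 1 / (1 - r2))   # the two values agree
```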

Page 7: Variance inflation factors

The variance inflation factor for the kth predictor is:

$$\mathrm{VIF}_k = \frac{1}{1 - R_k^2}$$

where $R_k^2$ is the R² value obtained by regressing the kth predictor on the remaining predictors.
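A minimal sketch of this definition in Python (assuming numpy, and that the predictors are the columns of an n × p array X):

```python
# Sketch: VIF_k straight from the definition -- regress the kth predictor
# on the remaining predictors and take 1 / (1 - R_k^2).
import numpy as np

def vifs(X):
    """Return the VIF of each column of the (n x p) predictor matrix X."""
    n, p = X.shape
    out = []
    for k in range(p):
        y = X[:, k]
        Z = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1 / (1 - r2))
    return np.array(out)
```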

Page 8: Variance inflation factors (VIFk)

• A measure of how much the variance of the estimated regression coefficient bk is “inflated” by the existence of correlation among the predictor variables in the model.

• VIFs exceeding 4 warrant investigation.

• VIFs exceeding 10 are signs of serious multicollinearity.

Page 9: Blood pressure example

[Matrix scatter plot of BP, Age, Weight, BSA, Duration, Pulse, and Stress]

n = 20 hypertensive individuals

p-1 = 6 predictor variables

Page 10: Blood pressure example

Correlations (Pearson):

          BP     Age    Weight  BSA    Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378  0.875
Duration  0.293  0.344  0.201   0.131
Pulse     0.721  0.619  0.659   0.465  0.402
Stress    0.164  0.368  0.034   0.018  0.312     0.506

Blood pressure (BP) is the response.

Page 11: Regress y = BP on all 6 predictors

Predictor   Coef       SE Coef    T       P      VIF
Constant  -12.870      2.557     -5.03   0.000
Age         0.70326    0.04961   14.18   0.000   1.8
Weight      0.96992    0.06311   15.37   0.000   8.4
BSA         3.776      1.580      2.39   0.033   5.3
Dur         0.06838    0.04844    1.41   0.182   1.2
Pulse      -0.08448    0.05161   -1.64   0.126   4.4
Stress      0.005572   0.003412   1.63   0.126   1.8

S = 0.4072   R-Sq = 99.6%   R-Sq(adj) = 99.4%

Analysis of Variance
Source          DF   SS       MS      F       P
Regression       6   557.844  92.974  560.64  0.000
Residual Error  13     2.156   0.166
Total           19   560.000
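For readers following along in Python, a sketch that reproduces this kind of output with statsmodels. It assumes the blood-pressure data sit in a whitespace-delimited file named bloodpress.txt with columns BP, Age, Weight, BSA, Dur, Pulse, and Stress; adjust the name and columns to your copy of the data:

```python
# Sketch: fit BP on all 6 predictors and report VIFs with statsmodels.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("bloodpress.txt", sep=r"\s+")   # assumed file name and layout
X = sm.add_constant(df[["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]])
fit = sm.OLS(df["BP"], X).fit()
print(fit.summary())

# VIF for each predictor (index 0 is the constant, so start at 1)
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))
```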

Page 12: Regress x2 = Weight on the 5 other predictors

Predictor   Coef      SE Coef   T      P      VIF
Constant   19.674     9.465     2.08   0.057
Age        -0.1446    0.2065   -0.70   0.495   1.7
BSA        21.422     3.465     6.18   0.000   1.4
Dur         0.0087    0.2051    0.04   0.967   1.2
Pulse       0.5577    0.1599    3.49   0.004   2.4
Stress     -0.02300   0.01308  -1.76   0.101   1.5

S = 1.725   R-Sq = 88.1%   R-Sq(adj) = 83.9%

Analysis of Variance
Source          DF   SS       MS      F      P
Regression       5   308.839  61.768  20.77  0.000
Residual Error  14    41.639   2.974
Total           19   350.478

Page 13: The variance inflation factor calculated from its definition

$$\frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}} = \frac{1}{1 - R_k^2} = \frac{1}{1 - 0.881} = 8.40$$

where 0.881 is the R-Sq value from the preceding regression of Weight on the five other predictors.

The variance of the weight coefficient is inflated by a factor of 8.40 due to the existence of correlation among the predictor variables in the model.
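A one-line check of that arithmetic:

```python
print(1 / (1 - 0.881))   # 8.403..., matching the VIF of 8.4 for Weight
```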

Page 14: The pairwise correlations

          BP     Age    Weight  BSA    Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378  0.875
Duration  0.293  0.344  0.201   0.131
Pulse     0.721  0.619  0.659   0.465  0.402
Stress    0.164  0.368  0.034   0.018  0.312     0.506

Blood pressure (BP) is the response.

Page 15: Regress y = BP on Age, Weight, Duration, and Stress

Predictor   Coef       SE Coef    T       P      VIF
Constant  -15.870      3.195     -4.97   0.000
Age         0.68374    0.06120   11.17   0.000   1.5
Weight      1.03413    0.03267   31.65   0.000   1.2
Dur         0.03989    0.06449    0.62   0.545   1.2
Stress      0.002184   0.003794   0.58   0.573   1.2

S = 0.5505   R-Sq = 99.2%   R-Sq(adj) = 99.0%

Analysis of Variance
Source          DF   SS      MS      F       P
Regression       4   555.45  138.86  458.28  0.000
Residual Error  15     4.55    0.30
Total           19   560.00
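Continuing the earlier statsmodels sketch (same assumed data frame df), dropping the highly correlated predictors and refitting looks like this:

```python
# Sketch: refit with BSA and Pulse removed; the remaining VIFs should be small.
X_red = sm.add_constant(df[["Age", "Weight", "Dur", "Stress"]])
fit_red = sm.OLS(df["BP"], X_red).fit()
print(fit_red.summary())
for i, name in enumerate(X_red.columns[1:], start=1):
    print(name, variance_inflation_factor(X_red.values, i))
```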

Page 16: Reducing data-based multicollinearity

Page 17: Data-based multicollinearity

• Multicollinearity that results from a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which you collect the data.

Page 18: Some methods

• Modify the regression model by eliminating one or more predictor variables.

• Collect additional data under different experimental or observational conditions.

Page 19: (Modified!) Allen Cognitive Level (ACL) Study

• Relationship of the ACL test to level of pathology in a set of 23 patients in a hospital psychiatry unit:

– Response y = ACL score

– x1 = vocabulary (Vocab) score on Shipley Institute of Living Scale

– x2 = abstraction (Abstract) score on Shipley Institute of Living Scale

– x3 = score on Symbol-Digit Modalities Test (SDMT)

Page 20: Allen Cognitive Level (ACL) Study on 23 patients

[Matrix scatter plot of ACL, SDMT, Vocab, and Abstract]

Page 21: Strong correlation between Vocab and Abstract

[Scatter plot of Vocab versus Abstract]

Pearson correlation of Vocab and Abstract = 0.990

Page 22: Regress y = ACL on SDMT, Vocab, and Abstract

Predictor   Coef      SE Coef   T      P      VIF
Constant    3.747     1.342     2.79   0.012
SDMT        0.02326   0.01273   1.83   0.083   1.7
Vocab       0.0283    0.1524    0.19   0.855  49.3
Abstract   -0.0138    0.1006   -0.14   0.892  50.6

S = 0.7344   R-Sq = 26.5%   R-Sq(adj) = 14.8%

Analysis of Variance
Source          DF   SS       MS      F     P
Regression       3    3.6854  1.2285  2.28  0.112
Residual Error  19   10.2476  0.5393
Total           22   13.9330

Page 23: Allen Cognitive Level (ACL) Study on 69 patients

[Matrix scatter plot of ACL, SDMT, Vocab, and Abstract for the expanded sample of 69 patients]

Page 24: Plot after having collected more data

[Scatter plot of Vocab versus Abstract for the 69 patients]

Pearson correlation of Vocab and Abstract = 0.698

Page 25: Regress y = ACL on SDMT, Vocab, and Abstract

Predictor   Coef       SE Coef    T      P      VIF
Constant    3.9463     0.3381    11.67   0.000
SDMT        0.027404   0.007168   3.82   0.000   1.6
Vocab      -0.01740    0.01808   -0.96   0.339   2.1
Abstract    0.01218    0.01159    1.05   0.297   2.2

S = 0.6878   R-Sq = 28.6%   R-Sq(adj) = 25.3%

Analysis of Variance
Source          DF   SS       MS      F     P
Regression       3   12.3009  4.1003  8.67  0.000
Residual Error  65   30.7487  0.4731
Total           68   43.0496

Page 26: Reducing structural multicollinearity

In the context of polynomial regression models

Page 27: Structural multicollinearity

• Multicollinearity that is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x.

Page 28: Example

• (General research question) What is the impact of exercise on the human immune system?

• (Specific research question) How is the amount of immunoglobulin in the blood (y) related to maximal oxygen uptake (x)?

Page 29: Scatter plot

[Scatter plot of immunoglobulin (mg) versus maximal oxygen uptake (ml/kg)]

Page 30: A quadratic polynomial regression function

$$y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i$$

where:

• yi = amount of immunoglobulin in blood (mg)

• xi = maximal oxygen uptake (ml/kg)

• the error terms satisfy the typical assumptions ("INE": independent, normally distributed, equal variances)
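A sketch of fitting this quadratic in Python (synthetic oxygen/igg values generated to resemble the slide's fit, since the raw data are not reproduced here):

```python
# Sketch: quadratic fit igg = b0 + b1*oxygen + b11*oxygen^2 on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
oxygen = rng.uniform(30, 70, size=30)                # synthetic uptake values
igg = (-1464.4 + 88.31 * oxygen - 0.5362 * oxygen**2
       + rng.normal(0, 106, size=30))                # noise scaled to S = 106.4

X = sm.add_constant(np.column_stack([oxygen, oxygen**2]))
print(sm.OLS(igg, X).fit().params)   # lands near -1464.4, 88.31, -0.5362
```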

Page 31: Estimated quadratic function

[Regression plot of igg versus oxygen with the fitted quadratic curve]

igg = -1464.40 + 88.3071 oxygen - 0.536247 oxygen**2

S = 106.427   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Page 32: Interpretation of the regression coefficients

• If x = 0 is a possible value, then b0 is the predicted response at x = 0. Otherwise, the interpretation of b0 is meaningless.

• b1 is the slope of the tangent line at x = 0.

• b2 indicates the up/down direction of the curve:

– b2 < 0 means the curve is concave down

– b2 > 0 means the curve is concave up

Page 33: Regress y = igg on oxygen and oxygen²

The regression equation is igg = -1464 + 88.3 oxygen - 0.536 oxygensq

Predictor   Coef       SE Coef   T      P      VIF
Constant  -1464.4      411.4    -3.56   0.001
oxygen       88.31      16.47    5.36   0.000  99.9
oxygensq     -0.5362     0.1582 -3.39   0.002  99.9

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source          DF   SS       MS       F       P
Regression       2   4602211  2301105  203.16  0.000
Residual Error  27    305818    11327
Total           29   4908029

Page 34: Structural multicollinearity

[Scatter plot of oxygensq versus oxygen]

Pearson correlation of oxygen and oxygensq = 0.995

Page 35: "Center" the predictors

OxCent = Oxygen − 50.637

OxCentSq = (Oxygen − 50.637)²

Mean of oxygen = 50.637

oxygen   oxcent    oxcentsq
34.6    -16.037    257.185
45.0     -5.637     31.776
62.3     11.663    136.026
58.9      8.263     68.277
42.5     -8.137     66.211
44.3     -6.337     40.158
67.9     17.263    298.011
58.5      7.863     61.827
35.6    -15.037    226.111
49.6     -1.037      1.075
33.0    -17.637    311.064
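The effect of centering on the x versus x² correlation is easy to demonstrate (again with synthetic oxygen values standing in for the data):

```python
# Sketch: centering collapses the correlation between x and x^2.
import numpy as np

rng = np.random.default_rng(2)
oxygen = rng.uniform(30, 70, size=30)
oxcent = oxygen - oxygen.mean()

print(np.corrcoef(oxygen, oxygen**2)[0, 1])   # near 1 (slide: 0.995)
print(np.corrcoef(oxcent, oxcent**2)[0, 1])   # near 0 (slide: 0.219)
```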

Page 36: Wow! It really works!

[Scatter plot of oxcentsq versus oxcent]

Pearson correlation of oxcent and oxcentsq = 0.219

Page 37: A better quadratic polynomial regression function

$$y_i = \beta_0^* + \beta_1^* x_i^* + \beta_{11}^* \left(x_i^*\right)^2 + \varepsilon_i$$

where $x_i^* = x_i - \bar{x}$ denotes the centered predictor, and:

• yi = amount of immunoglobulin in blood (mg)

• the error terms satisfy the typical assumptions ("INE")

Page 38: Regress y = igg on oxcent and oxcent²

The regression equation is igg = 1632 + 34.0 oxcent - 0.536 oxcentsq

Predictor   Coef      SE Coef   T      P      VIF
Constant   1632.20    29.35    55.61   0.000
oxcent       34.000    1.689   20.13   0.000   1.1
oxcentsq     -0.5362   0.1582  -3.39   0.002   1.1

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source          DF   SS       MS       F       P
Regression       2   4602211  2301105  203.16  0.000
Residual Error  27    305818    11327
Total           29   4908029

Page 39: Interpretation of the regression coefficients

• b0 is the predicted response at the predictor mean.

• b1 is the estimated slope of the tangent line at the predictor mean; it is often similar to the estimated slope from the simple linear model.

• b2 indicates the up/down direction of the curve:

– b2 < 0 means the curve is concave down

– b2 > 0 means the curve is concave up

Page 40: Estimated regression function

[Regression plot of igg versus oxcent with the fitted quadratic curve]

igg = 1632.20 + 33.9995 oxcent - 0.536247 oxcent**2

S = 106.427   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Page 41: Similar estimates of coefficients from first-order linear model

[Regression plot of igg versus oxcent with the fitted line]

igg = 1557.63 + 32.7427 oxcent

S = 124.783   R-Sq = 91.1%   R-Sq(adj) = 90.8%

Page 42: The relationship between the two forms of the model

Centered model:

$$\hat{y}_i = b_0^* + b_1^* x_i^* + b_{11}^* \left(x_i^*\right)^2$$

Original model:

$$\hat{y}_i = b_0 + b_1 x_i + b_{11} x_i^2$$

where:

$$b_0 = b_0^* - b_1^* \bar{x} + b_{11}^* \bar{x}^2$$

$$b_1 = b_1^* - 2\, b_{11}^* \bar{x}$$

$$b_{11} = b_{11}^*$$

Page 43: The relationship between the two forms of the model, numerically

Centered fit:

$$\hat{y}_i = 1632.2 + 34.0\, x_i^* - 0.5362\, \left(x_i^*\right)^2$$

With the mean of oxygen = 50.637:

$$b_0 = 1632.2 - 34.0(50.637) - 0.5362(50.637)^2 = -1464.4$$

$$b_1 = 34.0 + 2(0.5362)(50.637) = 88.3$$

$$b_{11} = -0.5362$$

Equivalent original-scale fit:

$$\hat{y}_i = -1464.4 + 88.3\, x_i - 0.5362\, x_i^2$$
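The same back-conversion as a quick script:

```python
# Check the back-conversion of the centered coefficients (x-bar = 50.637).
b0s, b1s, b11s, xbar = 1632.2, 34.0, -0.5362, 50.637
b0  = b0s - b1s * xbar + b11s * xbar**2   # about -1464.4
b1  = b1s - 2 * b11s * xbar               # about   88.3
b11 = b11s                                # unchanged: -0.5362
print(b0, b1, b11)
```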

Page 44: Model evaluation

[Residuals versus fitted values plot (response is igg)]

Page 45: Model evaluation

[Normal probability plot of the residuals (response is igg)]

Page 46: Model use: What is the predicted IgG if maximal oxygen uptake is 90?

There is an even greater danger in extrapolation when modeling data with a polynomial function, because the fitted curve can change direction.

Predicted Values for New Observations

New Obs   Fit      SE Fit   95.0% CI            95.0% PI
1         2139.6   219.2    (1689.8, 2589.5)    (1639.6, 2639.7)  XX

X denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations

New Obs   oxcent   oxcentsq
1         39.4     1549
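The point prediction is easy to reproduce by hand from the centered fit (coefficients copied from Page 38; a fitted-model object would also supply the CI and PI):

```python
# Sketch: predicted IgG at oxygen = 90 using the centered coefficients.
xbar = 50.637
oxcent = 90 - xbar                    # 39.363 -- far outside the 30-70 range
igg_hat = 1632.20 + 34.000 * oxcent - 0.5362 * oxcent**2
print(igg_hat)                        # about 2140, but a risky extrapolation
```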

Page 47: The hierarchical approach to model fitting

A widely accepted approach is to fit a higher-order model and then explore whether a lower-order (simpler) model is adequate.

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i$$

Is a first-order linear model ("line") adequate?

$$H_0: \beta_{11} = \beta_{111} = 0$$
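In Python, this hypothesis is a partial F-test between nested fits; a sketch on synthetic data using statsmodels' anova_lm:

```python
# Sketch: test H0: beta_11 = beta_111 = 0 by comparing nested models.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(30, 70, size=30)})
df["y"] = 5 + 2 * df.x - 0.03 * df.x**2 + rng.normal(0, 5, size=30)

line  = smf.ols("y ~ x", df).fit()
cubic = smf.ols("y ~ x + I(x**2) + I(x**3)", df).fit()
print(anova_lm(line, cubic))   # a small p-value says the line is not adequate
```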

Page 48: The hierarchical approach to model fitting

But then ... if a polynomial term of a given order is retained, then all related lower-order terms are also retained.

That is, if the quadratic term is significant, you would use this regression function:

$$E(Y_i) = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2$$

and not this one:

$$E(Y_i) = \beta_0 + \beta_{11} x_i^2$$