chapter 13: collinearity

We have the following important topics left that I want to cover:

- Assessing collinearity (Sect 13.1)
- Model building (Sects 13.1‐2, 22.1) & validation (22.3)
- Time series and generalized least squares (Chap 16)
- Logistic regression (Sect 14.1)

Chapter 13: Collinearity

13.1 Detecting Collinearity

• We cannot deal with exact linear relationships among the predictors, including the simple case of two predictors being perfectly correlated:

$$c_1 X_{i1} + c_2 X_{i2} + \cdots + c_k X_{ik} = c_0$$

• In that case the design matrix is rank-deficient, so $X^T X$ is singular and we cannot compute $(X^T X)^{-1}$.

• With a high degree of collinearity, least squares regression parameter estimates have large sampling variances:

$c_1 X_{i1} + c_2 X_{i2} + \cdots + c_k X_{ik} \approx c_0$ means a high squared multiple correlation $R_j^2$ between $X_j$ and the rest of the $X$'s, and we know

$$\mathrm{Var}(\hat\beta_j) = \frac{1}{1 - R_j^2} \times \frac{\sigma_\varepsilon^2}{(n-1)\, S_j^2}$$

where $S_j^2$ is the sample variance of $X_j$.

Figure 13.1: Note that the square root of the variance inflation factor, $\mathrm{VIF}_j = 1/(1 - R_j^2)$, which scales the width of a confidence interval, is close to 2, doubling the standard error of the estimate, when $R_j$ approaches 0.9.
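As a quick numeric check of the caption, the sketch below evaluates $\sqrt{\mathrm{VIF}_j}$ on a small grid of values of $R_j$. The grid is an arbitrary illustrative choice, not taken from Fox.

```r
# Multiple correlation R_j between X_j and the other predictors (illustrative grid)
Rj <- c(0, 0.5, 0.75, 0.9, 0.95, 0.99)

# sqrt(VIF_j) = 1 / sqrt(1 - R_j^2): the factor by which the standard error
# (and so the confidence-interval width) of the j-th coefficient is inflated
sqrt_vif <- 1 / sqrt(1 - Rj^2)

print(round(data.frame(Rj = Rj, sqrt_vif = sqrt_vif), 2))
```

The inflation stays modest until $R_j$ is quite large and then grows quickly, which is the point of the figure.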

• There is a relevant result in section 9.4.4 on joint confidence regions that we did not cover. If $X_1$ and $X_2$ are highly positively correlated, we get a joint confidence region like that shown in Figure 9.1.

• In this case the individual confidence intervals for B1 and B2 are wide, but we can actually estimate the sum of the regression coefficients reasonably precisely.
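A small simulation can make this concrete. The sketch below uses made-up data (the sample size, coefficients, and noise levels are assumptions chosen only for illustration) with two highly correlated predictors, and compares the standard errors of the individual coefficients with the standard error of their sum, computed from the estimated coefficient covariance matrix.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)          # x2 is almost a copy of x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # hypothetical true coefficients 2 and 3

fit <- lm(y ~ x1 + x2)
V   <- vcov(fit)                       # estimated covariance matrix of the coefficients

se_b1  <- sqrt(V["x1", "x1"])
se_b2  <- sqrt(V["x2", "x2"])
# Var(b1 + b2) = Var(b1) + Var(b2) + 2 Cov(b1, b2); the covariance is strongly negative
se_sum <- sqrt(V["x1", "x1"] + V["x2", "x2"] + 2 * V["x1", "x2"])

round(c(cor_x1_x2 = cor(x1, x2), se_b1 = se_b1, se_b2 = se_b2, se_sum = se_sum), 3)
```

The strongly negative covariance between the two estimates is what makes the joint confidence ellipse long and thin, and it is also why the sum is estimated much more precisely than either coefficient alone.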

• Notes:

o The VIF is the most convenient numerical indicator of the consequences of collinearity for a particular coefficient. You can look at all of them, but note that these do not make sense for groups of variables that are necessarily/intentionally correlated, as are the dummy variables we have shown for polytomous factors. (They are correlated, but usually not too highly.) Fox has a “Generalized VIF” for collinearity between groups of variables, but we won’t deal with that.

o Computing the VIF: On p. 321 Fox notes that $\mathrm{VIF}_j = 1/(1 - R_j^2)$ is the $j$th diagonal element of $R_{XX}^{-1}$.

R example: Canadian women’s labor‐force participation time series

library(car)   # provides the Bfox data
data(Bfox)
Bfox$time <- 1:30   # time index, plotted below as 1945 + time
# Panel titles, in the column order of Bfox
titles <- c("Women's Labor-Force Participation", "Total Fertility Rate",
            "Men's Weekly Wage Rate", "Women's Weekly Wage Rate",
            "Per-Capita Consumer Debt", "Percent Part-Time Work")
dev.new(height=8, width=8)   # portable replacement for windows()
par(mfrow=c(3,2), mgp=c(2,.75,0), mar=c(4,3,3,1))
# One time-series panel per variable (every column except time)
for (i in 1:(ncol(Bfox) - 1)) {
  plot(1945 + Bfox$time, Bfox[,i], xlab="Year", type="l", cex=1.5, main=titles[i])
}

Fig 13.4 Time‐series data on Canadian women’s labor‐force participation and other variables.

> Bfox$time <- as.numeric(row.names(Bfox))
> Bfox$time <- 1:30
> summary(fit1 <- lm( partic ~ tfr + menwage + womwage + debt + parttime + time, data=Bfox ))

Call:
lm(formula = partic ~ tfr + menwage + womwage + debt + parttime + time, data = Bfox)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.2949089  3.6260388   4.218 0.000327 ***
tfr          0.0002618  0.0004093   0.640 0.528741
menwage     -0.0415038  0.1477735  -0.281 0.781328
womwage      0.0572354  0.1758473   0.325 0.747757
debt         0.0668872  0.0155761   4.294 0.000270 ***
parttime     0.6864741  0.0818507   8.387 1.89e-08 ***
time        -0.0180641  0.0994435  -0.182 0.857447
---
Residual standard error: 0.5333 on 23 degrees of freedom
Multiple R-squared: 0.9936, Adjusted R-squared: 0.992
F-statistic: 597.8 on 6 and 23 DF, p-value: < 2.2e-16

> vif <- function(X) {
+   # Function to compute variance inflation factor for
+   # predictors in a design matrix X
+   RX <- cor(X)
+   vif <- diag(solve(RX))
+   return(vif=vif)
+ }

> # Print sqrt VIF's as in Fox, p. 313
> sqrt(vif(Bfox[,-1]))
      tfr   menwage   womwage      debt  parttime      time
 3.014261 10.592193  8.344881  9.748452  2.765828  8.839506
> ## Note: these are large VIFs!
> # Check one of these using the definition
> temp <- lm(tfr ~ menwage+womwage+debt+parttime + time, data=Bfox)
> sqrt(1/(1-summary(temp)$r.squared))
[1] 3.014261
>
> # Turns out that I didn't need to create this "vif" function as there is
> # one in Fox's car library that takes as input the result of a lm fit.
> rm(vif)
>
> sqrt(vif(fit1))
      tfr   menwage   womwage      debt  parttime      time
 3.014261 10.592193  8.344881  9.748452  2.765828  8.839506
> # The collinearity problem:
> round(cor(Bfox),2)
         partic   tfr menwage womwage  debt parttime  time
partic     1.00 -0.88    0.96    0.97  0.98     0.95  0.95
tfr       -0.88  1.00   -0.79   -0.85 -0.84    -0.89 -0.76
menwage    0.96 -0.79    1.00    0.98  0.99     0.85  0.99
womwage    0.97 -0.85    0.98    1.00  0.99     0.87  0.96
debt       0.98 -0.84    0.99    0.99  1.00     0.89  0.98
parttime   0.95 -0.89    0.85    0.87  0.89     1.00  0.85
time       0.95 -0.76    0.99    0.96  0.98     0.85  1.00
> pairs(Bfox)

Strategies to deal with issues of multicollinearity in model building

• 13.1.1 Replace sets of variables with Principal Components

o I will cover the essence of this multivariate computation, and I recommend this section, but I won’t ask you any theoretical exercises on this.

• 13.2.1 Model respecification: Consider the meaning of the variables and determine whether there is a meaningful approach to the scientific task that doesn’t involve all the correlated predictors

• 13.2.2 Variable selection: We’ll jump to section 22.1 to discuss model selection criteria. Model selection, as discussed in Chapter 22, is an important issue even without high degrees of collinearity

• 13.2.3 Biased estimation: Neither Fox nor I like this approach (the primary one being called “Ridge regression”), so we won’t do it. There are, however, newer methods in the statistical literature that do variable selection/biased estimation in a clever way.

• 13.2.4 Prior information: It would be nice to talk about Bayesian methods, but we don’t have time.

Figure 13.6 Principal components: find the linear combination of maximum variance

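This presentation works with standardized regressors (explanatory variables) $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_k$. The first principal component is the linear combination

$$\mathbf{w}_1 = A_{11}\mathbf{z}_1 + A_{21}\mathbf{z}_2 + \cdots + A_{k1}\mathbf{z}_k = \mathbf{Z}_X \mathbf{a}_1$$

with variance

$$S^2_{w_1} = \frac{1}{n-1}\mathbf{w}_1'\mathbf{w}_1 = \frac{1}{n-1}\mathbf{a}_1'\mathbf{Z}_X'\mathbf{Z}_X\mathbf{a}_1 = \mathbf{a}_1'\mathbf{R}_{XX}\mathbf{a}_1$$

Maximizing $\mathbf{a}_1'\mathbf{R}_{XX}\mathbf{a}_1$ subject to $\mathbf{a}_1'\mathbf{a}_1 = 1$ using the method of Lagrange multipliers leads to

$$(\mathbf{R}_{XX} - \lambda_1\mathbf{I}_k)\,\mathbf{a}_1 = \mathbf{0} \quad\text{and}\quad \mathbf{a}_1'\mathbf{a}_1 = 1$$

with nontrivial solution only when $(\mathbf{R}_{XX} - \lambda_1\mathbf{I}_k)$ is singular, or

$$|\mathbf{R}_{XX} - \lambda_1\mathbf{I}_k| = 0$$

which means that $\lambda_1$ is an eigenvalue of $\mathbf{R}_{XX}$ and $\mathbf{a}_1$ is the corresponding eigenvector satisfying $\mathbf{R}_{XX}\mathbf{a}_1 = \lambda_1\mathbf{a}_1$.

And, finally we have that

$$S^2_{w_1} = \mathbf{a}_1'\mathbf{R}_{XX}\mathbf{a}_1 = \lambda_1\mathbf{a}_1'\mathbf{a}_1 = \lambda_1 ,$$

so the variance is maximized by taking $\lambda_1$ to be the largest eigenvalue.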

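Subsequent principal components solve the problem

$$\max\ \mathbf{a}_2'\mathbf{R}_{XX}\mathbf{a}_2 \quad\text{subject to}\quad \mathbf{a}_2'\mathbf{a}_2 = 1 \ \text{and}\ \mathbf{a}_2'\mathbf{a}_1 = 0$$

and similarly for $\mathbf{a}_3, \ldots, \mathbf{a}_k$.

The complete matrix of principal component coefficients is an orthogonal matrix

$$\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k]$$

The matrix of eigenvectors of $\mathbf{R}_{XX}$ determines linear combinations we call principal component scores having variances

$$S^2_{w_j} = \mathbf{a}_j'\mathbf{R}_{XX}\mathbf{a}_j = \lambda_j\mathbf{a}_j'\mathbf{a}_j = \lambda_j, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k .$$

Or, in matrix notation, the $(n \times k)$ matrix of principal component scores can be written

$$\mathbf{W} = \mathbf{Z}_X \mathbf{A} \qquad (13.3)$$

with covariance matrix

$$\frac{1}{n-1}\mathbf{W}'\mathbf{W} = \frac{1}{n-1}\mathbf{A}'\mathbf{Z}_X'\mathbf{Z}_X\mathbf{A} = \mathbf{A}'\mathbf{R}_{XX}\mathbf{A} = \mathbf{A}'\mathbf{A}\boldsymbol{\Lambda} = \boldsymbol{\Lambda}$$

where $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues, which also provides the variances of the principal component scores. Because the matrix of eigenvectors $\mathbf{A}$ has orthonormal columns, we see that principal components represent a rotation of the (standardized) data matrix $\mathbf{Z}_X$, simply a reexpression of the original data in terms of a new coordinate system.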

Note also that

$$\mathrm{trace}(\boldsymbol{\Lambda}) = \sum_i \lambda_i = \mathrm{trace}(\mathbf{R}_{XX}) = k$$

I.e. the sum of the variances of the principal component scores is the sum of the variances of the original (standardized) variables.

So, we say that the principal components provide an orthogonal decomposition of the total variance in the original dataset and we say, for example, that $100 \times \lambda_1 / \sum_{j=1}^{k} \lambda_j$ is the “percent variance explained by the first principal component” and similarly $100 \times \sum_{j=1}^{J} \lambda_j / \sum_{j=1}^{k} \lambda_j$ is the “cumulative percent variance explained by the first J principal components.”

Note that we can solve (13.3) to express the original variables in terms of the principal component scores:

$$\mathbf{Z}_X = \mathbf{W}\mathbf{A}^{-1} = \mathbf{W}\mathbf{A}'$$

This fact underlies the following Figure 13.5 from Fox.

Figure 13.5 Vector geometry of principal components

For the case of the explanatory variables of the women’s labor force data set it produces Figure 13.7. First we’ll look at the actual computation of the principal components in R.


Figure 13.7 Orthogonal projections of six explanatory variables onto subspace spanned by the first two principal components.

R‐code for this computation and interpretation of this plot.
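As a stand-in for that computation, here is a minimal sketch on simulated data (not the labor‐force data; the three variables `x1`, `x2`, `x3` and their relationships are assumptions made only for illustration). It follows the algebra above: eigen-decompose the correlation matrix, form the scores $\mathbf{W} = \mathbf{Z}_X\mathbf{A}$, and check that the score variances equal the eigenvalues, that the eigenvalues sum to $k$, and that $\mathbf{Z}_X = \mathbf{W}\mathbf{A}'$.

```r
set.seed(135)
n  <- 200
# Simulated regressors: x2 is strongly related to x1, x3 is unrelated
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)
k  <- ncol(X)

Z <- scale(X)            # standardized regressors Z_X (mean 0, sd 1)
R <- cor(X)              # R_XX = Z'Z / (n - 1)
e <- eigen(R)
A <- e$vectors           # columns a_1, ..., a_k (orthonormal eigenvectors)
lambda <- e$values       # lambda_1 >= ... >= lambda_k

W      <- Z %*% A                    # principal component scores, equation (13.3)
var_W  <- apply(W, 2, var)           # variances of the scores
pct    <- 100 * lambda / sum(lambda) # percent variance explained
Z_back <- W %*% t(A)                 # recover Z_X from the scores

round(rbind(lambda = lambda, var_W = var_W, pct = pct, cum_pct = cumsum(pct)), 3)
sum(lambda)                          # trace(R_XX) = k
max(abs(Z_back - Z))                 # numerically zero
```

The same eigenvalues should be reproduced by `prcomp(X, scale. = TRUE)$sdev^2`; the signs of individual eigenvectors are arbitrary and may differ between routines.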


Summary (so far):

Principal components can be interpreted in a number of ways:

• They are derived algebraically to find a sequence of linear combinations of maximal variance‐‐‐to “explain” the variance in the original data with as few composite variables as possible

• The principal components represent a rotation of the original coordinate axes (defined by the original variables) to a new set so that the first few variables are sufficient to approximate the original k‐dimensional dataset. They provide a new “orthogonal basis” (the principal component scores are uncorrelated) for the original variables.

• Geometrically the principal components identify the “principal axes” defining a set of orthogonal directions of greatest/least variance in the original k‐dimensions.

• We computed these using standardized variables because the original variables were in different units and linear combinations of variables in different units are generally uninterpretable. However, if a set of variables are all defined in the same units, it usually makes sense to compute principal components without standardizing, so we compute eigenvectors of a covariance matrix rather than a correlation matrix.

• Because the last principal components represent linear combinations of smallest variance, they may define near collinearity. In fact, it is the relative magnitudes of the eigenvalues (variances) that matter, and $K = \sqrt{\lambda_1/\lambda_k}$ is called the “condition number”, the most common indicator of global instability in the least squares regression coefficients, meaning small changes in the data can result in large changes in the regression coefficients. Large “condition indices” $K_j = \sqrt{\lambda_1/\lambda_j}$, say values greater than 10, may define near collinearities. For the labor‐force data (L is the vector of eigenvalues):

> round(sqrt(L[1]/L),2)
[1] 1.00 3.95 7.10 16.83 23.70 33.68
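The same calculation can be tried on any set of predictors. The sketch below uses simulated data (the variables and the degree of collinearity are assumptions for illustration only) in which `x2` is nearly a copy of `x1`, so the last condition index comes out well above 10.

```r
set.seed(13)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly collinear with x1
x3 <- rnorm(n)                   # unrelated to the others

# Eigenvalues of the correlation matrix, returned in decreasing order
L <- eigen(cor(cbind(x1, x2, x3)))$values

# Condition indices sqrt(lambda_1 / lambda_j); the last one is the condition number
cond_index <- sqrt(L[1] / L)
round(cond_index, 2)
```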


Example of a possible application of principal components to a current consulting problem.

A consulting client in Speech and Hearing sciences is carrying out a study on the effects of three different types of hearing aids (which amplify, but may also distort the original sound signal) on speech recognition performance in normal and hearing‐impaired subjects. The first question or research aim is:

#1: Are there predictive factors that explain variance in speech recognition performance for listeners with hearing loss using linear and non-linear hearing aids?

For this question, I have the following predictors
1) amount of hearing loss
2) spectral resolution (5 different measures)
3) cognitive capacity (3 different measures)
4) Audibility (but this may be a co-variate that I do not know how to handle it...it relates to performance but varies tremendously across subjects.)

There are, in fact, 6 “spectral resolution” measures on each subject: two “Equivalent Rectangular Bandwidth” (ERB) measures, one at 500Hz and the other at 2000Hz, and four “Broad Bandwidth” (BB) spectral modulation transfer function scores. Here is the correlation matrix for these data, including also a hearing loss score (PTA) and the speech recognition outcome score for the “linear” hearing aid.

> round(cor(hearing.df[,c(4,25,26,15:18,9)],use="pair"),2)
               PTA lERB_Hz_500 lERB_Hz_2000 BB_0.25_SMT BB_0.5_SMT BB_1.0_SMT BB_2.0_SMT Score_lnr
PTA           1.00        0.19         0.36       -0.39      -0.01       0.21       0.40     -0.38
lERB_Hz_500   0.19        1.00         0.42        0.29       0.66       0.48       0.56     -0.41
lERB_Hz_2000  0.36        0.42         1.00       -0.02       0.21       0.31       0.39     -0.63
BB_0.25_SMT  -0.39        0.29        -0.02        1.00       0.78       0.65       0.32     -0.43
BB_0.5_SMT   -0.01        0.66         0.21        0.78       1.00       0.87       0.71     -0.50
BB_1.0_SMT    0.21        0.48         0.31        0.65       0.87       1.00       0.81     -0.59
BB_2.0_SMT    0.40        0.56         0.39        0.32       0.71       0.81       1.00     -0.48
Score_lnr    -0.38       -0.41        -0.63       -0.43      -0.50      -0.59      -0.48      1.00

> # Remove missing data
> anymiss <- apply( is.na(hearing.df[,c(25,26,15:18)]),1,any )
> # "princomp" (in the stats package, not car)
> pca.spectral <- princomp( hearing.df[!anymiss,c(25,26,15:18)], cor=TRUE)
> # or "prcomp" (also in the stats package)
> pca.spectral <- prcomp( hearing.df[!anymiss,c(25,26,15:18)], scale.=TRUE)
> options(digits=5)
> summary(pca.spectral)
Importance of components:
                       PC1  PC2   PC3   PC4   PC5  PC6
Standard deviation     1.9 1.12 0.763 0.665 0.276 0.24
Proportion of Variance 0.6 0.21 0.097 0.074 0.013 0.01
Cumulative Proportion  0.6 0.81 0.904 0.977 0.990 1.00
> round(pca.spectral$rotation,2)
              PC1   PC2   PC3   PC4   PC5   PC6
lERB_Hz_500  0.37  0.36  0.57 -0.57  0.18 -0.23
lERB_Hz_2000 0.20  0.73  0.08  0.65 -0.02  0.04
BB_0.25_SMT  0.37 -0.51  0.41  0.43 -0.24 -0.45
BB_0.5_SMT   0.50 -0.18  0.14 -0.04 -0.20  0.81
BB_1.0_SMT   0.48 -0.16 -0.38  0.11  0.76 -0.08
BB_2.0_SMT   0.45  0.16 -0.58 -0.25 -0.54 -0.29

13.2.1 Model Respecification

Collinearity is a data problem, not (necessarily) a deficiency of the model. One approach to the problem is to respecify the model ‐‐‐if in fact someone gave you a “specified model” to start with.

Sometimes a group of collinear variables may be combined in some manner to compute a “composite measure” of an “underlying construct.” In this case high correlation is good as it indicates that the individual variables are “reliable” as indicators of that construct. Principal components may be used to define such a composite, or to suggest one.

Alternatively, one may select one of a group of correlated variables as the single measure to represent the others.

When focusing on the relationship between, say, Y and X1, consider whether we need to control for X2, correlated with X1. (Think back to the path diagrams for multiple regression.)

Fox: “Generally, though, respecification of this variety is possible only where the original model was poorly thought out or where the researcher is willing to abandon some of the goals of the research.”


13.2.2 Variable Selection

Fox: “A common, but usually misguided, approach to collinearity is variable selection, where some procedure is used to reduce the regressors in the model to a less highly correlated set.”

- Forward selection: add variables to a model sequentially so as to increase $R^2$.
- Backward elimination: start with the “full” or “complete” model and delete variables one at a time.
- Forward/backward or stepwise: add the variable that yields the best improvement, then check to see if any should be removed.

Issues:
- Interpreting the order of entry of variables as reflecting importance.
- Stepwise methods can fail to find the optimal subset of variables of a given size; i.e. the subset of variables that optimizes $R^2$.
- When variables occur in sets, as in dummy variables, these should generally be kept together, and when there are hierarchical relationships, as in interaction terms, these relations should be respected.
- Because variable selection optimizes the fit of the model to the sample data, coefficient standard errors calculated following variable selection‐‐‐and hence confidence intervals and tests of hypotheses‐‐‐almost surely overstate the precision of results. Don’t take the resulting model too seriously!
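The three search strategies can be sketched with base R's `step()`. Note the assumptions: the data are simulated (two real predictors, two pure-noise predictors), and `step()` chooses moves by AIC rather than by $R^2$, so this illustrates the mechanics of the searches, not any particular criterion recommended here.

```r
set.seed(22)
n <- 80
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 1 * d$x2 + rnorm(n)   # x3 and x4 are pure noise

full <- lm(y ~ x1 + x2 + x3 + x4, data = d)  # "full" or "complete" model
null <- lm(y ~ 1, data = d)                  # intercept-only starting point

# Backward elimination: start from the full model, drop one term at a time
back <- step(full, direction = "backward", trace = 0)
# Forward selection: start from the null model, add one term at a time
fwd  <- step(null, scope = formula(full), direction = "forward", trace = 0)
# Stepwise: add a term, then check whether any term should be removed
both <- step(null, scope = formula(full), direction = "both", trace = 0)

formula(back)
formula(fwd)
formula(both)
```

Standard errors reported by `summary()` on any of the selected models ignore the selection step, which is exactly the overstatement of precision warned about above.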

22.1.1 Model selection criteria

- Adjusted or corrected $R^2$
- Mallows’ $C_p$
- Cross‐validation and Generalized Cross‐Validation (GCV)
- Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

$\mathcal{M} = \{M_1, M_2, \ldots, M_m\}$ is a set of models to be compared. These are generally defined by different subsets of possible explanatory variables.

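1. Adjusted R2: Ordinary $R^2$ never decreases when predictors are added.

a. $R_j^2 = 1 - \dfrac{RSS_j}{TSS}$

b. Instead use

$$\tilde{R}_j^2 = 1 - \frac{S_E^{(j)2}}{S_Y^2} = 1 - \frac{n-1}{n-s_j} \times \frac{RSS_j}{TSS}$$

which penalizes for using more parameters ($s_j$ is the number of parameters in model $j$). Intuitive but ad hoc.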
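A minimal check of the adjusted $R^2$ formula on simulated data (the data-generating model is an assumption for illustration): compute it by hand with $s_j$ counting the intercept, and compare with the value reported by `summary()`.

```r
set.seed(221)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)                      # irrelevant predictor
y  <- 1 + x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
s_j <- length(coef(fit))            # number of parameters, including the intercept
RSS <- sum(resid(fit)^2)
TSS <- sum((y - mean(y))^2)

R2     <- 1 - RSS / TSS
R2_adj <- 1 - (n - 1) / (n - s_j) * RSS / TSS

c(R2 = R2, R2_adj = R2_adj, from_summary = summary(fit)$adj.r.squared)
```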

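2. Mallows’ Cp: Want an estimate of the total (normalized) MSE (= variance plus squared bias) for the given model:

$$\gamma_j = \frac{1}{\sigma_\varepsilon^2}\sum_{i=1}^{n} MSE(\hat Y_i^{(j)}) = \frac{1}{\sigma_\varepsilon^2}\sum_{i=1}^{n}\left\{ V(\hat Y_i^{(j)}) + \left[E(\hat Y_i^{(j)}) - E(Y_i)\right]^2 \right\}$$

The approach is to estimate $\sigma_\varepsilon^2$ and $MSE(\hat Y_i^{(j)})$ from the data.

$$C_{p_j} = \frac{RSS_j}{S_E^2} + 2 s_j - n = (k + 1 - s_j)(F_j - 1) + s_j$$

where $s_j$ is the number of parameters in model $j$, $S_E^2$ is the error variance estimate from the full model, and $F_j$ is the incremental F‐statistic for testing the hypothesis that the regressors omitted from model $M_j$ have zero coefficients. For a “good model” the F‐statistic should be near 1 and $E(C_{p_j}) \approx s_j$. For the full model, $C_p = k + 1$.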
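The two expressions for $C_p$ can be checked numerically. The sketch below uses simulated data with $k = 3$ candidate predictors (an assumption for illustration) and takes $F_j$ to be the incremental F‐statistic comparing the submodel with the full model, as returned by `anova()`.

```r
set.seed(222)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)    # x3 has no effect

full <- lm(y ~ x1 + x2 + x3)              # k = 3 predictors
sub  <- lm(y ~ x1 + x2)                   # candidate model M_j
k    <- 3
s_j  <- length(coef(sub))                 # parameters in M_j, including the intercept

S2_E  <- sum(resid(full)^2) / df.residual(full)   # error variance from the full model
RSS_j <- sum(resid(sub)^2)

# First form: RSS_j / S_E^2 + 2 s_j - n
Cp_rss <- RSS_j / S2_E + 2 * s_j - n

# Second form: (k + 1 - s_j)(F_j - 1) + s_j, with F_j the incremental F-statistic
F_j  <- anova(sub, full)$F[2]
Cp_F <- (k + 1 - s_j) * (F_j - 1) + s_j

# For the full model itself the statistic reduces to k + 1
Cp_full <- sum(resid(full)^2) / S2_E + 2 * (k + 1) - n

c(Cp_rss = Cp_rss, Cp_F = Cp_F, Cp_full = Cp_full)
```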


3. Cross‐Validation: Choose the model $M_j$ that has the smallest leave‐one‐out (cross‐validation) prediction error. Let $\hat Y_{-i}^{(j)}$ denote the predicted value of the ith observation computed from model $M_j$ fitted on the dataset excluding the ith observation.

$$CV_j = \frac{\sum_{i=1}^{n} \left(\hat Y_{-i}^{(j)} - Y_i\right)^2}{n}$$

which is the prediction error sum of squares, usually referred to as PRESS, divided by n. We would prefer the model with the smallest value.

For the type of linear least squares models we are currently working with, there are efficient computational approaches to compute $CV_j$ without running n separate regressions. In some cases with large datasets one may choose to partition the data into p (random) subsets and leave out one subset at a time, leading to p‐fold cross‐validation.

Where computation is an issue, we can also compute an estimate of the cross‐validation criterion from the single fitted model (as Mallows’ Cp was an estimate of the total mean squared error from the single fitted model). This is called the generalized cross‐validation criterion

$$GCV_j = \frac{n \times RSS_j}{df_{res_j}^2}, \quad\text{where } df_{res_j} = n - s_j .$$
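Both criteria are easy to compute for a least squares fit. The sketch below, on simulated data (an assumption for illustration), computes leave‐one‐out $CV_j$ by brute force and by the standard hat‐value shortcut, in which the leave‐one‐out residual equals $e_i/(1 - h_i)$, and then computes $GCV_j$ from the single fitted model.

```r
set.seed(223)
n <- 40
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

# Brute force: refit n times, each time predicting the held-out case
cv_loop <- mean(sapply(seq_len(n), function(i) {
  f_i <- lm(y ~ x1 + x2, data = d[-i, ])
  (d$y[i] - predict(f_i, newdata = d[i, ]))^2
}))

# Shortcut from the single fit: leave-one-out residual = e_i / (1 - h_i)
cv_fast <- mean((resid(fit) / (1 - hatvalues(fit)))^2)

# GCV: n * RSS / df_res^2, with df_res = n - s_j
RSS <- sum(resid(fit)^2)
gcv <- n * RSS / df.residual(fit)^2

c(cv_loop = cv_loop, cv_fast = cv_fast, gcv = gcv)
```

GCV amounts to replacing each hat‐value $h_i$ in the shortcut by the average hat‐value $s_j/n$, which is why the two numbers are usually close.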

Finally, with a large dataset one may simply set aside a single validation or test set, fitting the models on the remaining training set.


4. Most popular these days are the AIC and BIC criteria:

a. AIC is based on the Kullback‐Leibler information measure comparing the true distribution of the data, $p(\mathbf{y})$, to the distribution of the data $p_j(\mathbf{y} \mid \boldsymbol{\theta}_j)$ under a particular model $M_j$. We want it to be small. It is defined as

$$AIC_j = -2\log_e L(\hat{\boldsymbol{\theta}}_j \mid \mathbf{y}) + 2 s_j = n \log_e \hat\sigma_\varepsilon^{(j)2} + 2 s_j$$

b. BIC is an approximation to twice the log of the Bayes factor comparing a particular model to the full (“saturated”) model, where the Bayes factor is the ratio of the marginal probability of the data under the two models (with a certain choice for the prior for model parameters).

More specifically, using

$$\log_e \frac{p(\mathbf{y} \mid M_2)}{p(\mathbf{y} \mid M_1)} \approx \log_e p(\mathbf{y} \mid \hat{\boldsymbol{\theta}}_2) - \log_e p(\mathbf{y} \mid \hat{\boldsymbol{\theta}}_1) - \frac{1}{2}(s_2 - s_1)\log_e n$$

BIC is defined as

$$BIC_j = -2\log_e p(\mathbf{y} \mid \hat{\boldsymbol{\theta}}_j) + s_j \log_e n$$

Note that for the BIC, in comparison with AIC, the penalty for the number of parameters increases with sample size, with BIC having a more severe penalty for $n \ge 8$.
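For linear models these criteria can be computed by hand and compared with R's built‐in `AIC()` and `BIC()`. The sketch below uses simulated data (an assumption for illustration) and the form $n\log_e\hat\sigma^2 + \text{penalty}$ with $\hat\sigma^2 = RSS/n$. R's functions include additive constants and count the error variance as an extra parameter, so the values themselves differ from the hand computation, but differences between models fitted to the same data agree, and only differences matter for model comparison.

```r
set.seed(224)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)

m1 <- lm(y ~ x1, data = d)
m2 <- lm(y ~ x1 + x2, data = d)

# Hand-computed criteria: n * log(RSS / n) plus the penalty term
aic_hand <- function(m) {
  n <- nobs(m); s <- length(coef(m))
  n * log(sum(resid(m)^2) / n) + 2 * s
}
bic_hand <- function(m) {
  n <- nobs(m); s <- length(coef(m))
  n * log(sum(resid(m)^2) / n) + s * log(n)
}

# Differences between models agree with R's AIC() and BIC()
c(AIC_diff_R = AIC(m1) - AIC(m2), AIC_diff_hand = aic_hand(m1) - aic_hand(m2))
c(BIC_diff_R = BIC(m1) - BIC(m2), BIC_diff_hand = bic_hand(m1) - bic_hand(m2))

# The BIC penalty per parameter exceeds the AIC penalty once log(n) > 2
log(c(7, 8))
```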

Example: R‐code for variable selection using “regsubsets”
