
Page 1:

Multicollinearity: an introductory example

A high-tech business wants to measure the effect of advertising on sales and would like to distinguish between traditional advertising (TV and newspapers) and advertising on the internet.

– Y: sales in $m

– X1: traditional advertising in $m

– X2: internet advertising in $m

Data: Sales3.sav

Page 2:

A matrix scatter plot of the data

Cor(y, x1) = 0.983
Cor(y, x2) = 0.986
Cor(x1, x2) = 0.990

x1 and x2 are strongly correlated, i.e. they carry a substantial amount of common information:

x1 = α0 + α1x2 + ε
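Outside SPSS, the same correlations can be reproduced quickly with pandas. A minimal sketch, assuming Sales3.sav is read via pandas' read_spss (which needs the pyreadstat package) and assuming hypothetical column names sales, advertising and internet:

```python
import pandas as pd

# Hypothetical column names; replace them with the actual variable
# names stored in Sales3.sav.
df = pd.read_spss("Sales3.sav")[["sales", "advertising", "internet"]]

# Pairwise Pearson correlations, as in the matrix scatter plot:
# Cor(y, x1) ~ 0.983, Cor(y, x2) ~ 0.986, Cor(x1, x2) ~ 0.990.
print(df.corr())
```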

Page 3:

Regression output: using x1 only

Model summary: R = .983, R² = .965, adjusted R² = .962, std. error of the estimate = .9764

ANOVA
  Regression: SS = 265.466, df = 1,  MS = 265.466, F = 278.438, Sig. = .000
  Residual:   SS = 9.534,   df = 10, MS = .953
  Total:      SS = 275.000, df = 11

Coefficients
                                        B       SE      t        Sig.
  (Constant)                            .885    .696    1.272    .232
  X1 = traditional advertising in $m   2.254    .135   16.686    .000

Equivalent results are obtained when using x2 only.

Page 4:

Regression output: using x1 and x2

Model summary: R = .987, R² = .974, adjusted R² = .968, std. error of the estimate = .8916

ANOVA
  Regression: SS = 267.846, df = 2,  MS = 133.923, F = 168.483, Sig. = .000
  Residual:   SS = 7.154,   df = 9,  MS = .795
  Total:      SS = 275.000, df = 11

Coefficients
                                        B       SE      t       Sig.
  (Constant)                           1.992    .902    2.210   .054
  X1 = traditional advertising in $m    .767    .868     .884   .400
  X2 = internet advertising in $m      1.275    .737    1.730   .118

Neither x1 nor x2 is significant any more.
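The two fits above can be reproduced with statsmodels to see the effect of adding the second, highly correlated predictor. A minimal sketch, again with the hypothetical column names used earlier:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("Sales3.sav")  # hypothetical column names below

# Simple regression on x1 only: the slope is highly significant.
m1 = smf.ols("sales ~ advertising", data=df).fit()

# Adding x2: the overall F test stays significant, but the individual
# t tests collapse because the standard errors of the slopes inflate.
m12 = smf.ols("sales ~ advertising + internet", data=df).fit()

print(m1.summary())
print(m12.summary())
print(m1.bse, m12.bse, sep="\n")  # compare the standard errors directly
```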

Page 5:

Multicollinearity

Multicollinearity exists when two or more of the independent variables are moderately or highly correlated with each other.

In practice, if independent variables are (highly) correlated they contribute largely redundant information, which makes it hard to isolate the effect of any single independent variable on y. Confusing results are often the consequence.

High levels of multicollinearity: a) inflate the variance of the β estimates; b) can make the regression results misleading and confusing.

In the extreme case, if there exists perfect correlation among some of the independent variables, OLS estimates cannot be computed.

xi = α0 + α1xj + … + αpxj+p + ε,   with j + p < k and i ≠ j, j+1, …, j+p
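To make the extreme case concrete, here is a small numpy sketch with made-up numbers in which one predictor is an exact linear combination of another, so the design matrix loses rank and the OLS formula (X'X)⁻¹X'y cannot be evaluated:

```python
import numpy as np

# Made-up data: x2 is exactly 2*x1 + 1, i.e. perfect collinearity.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1 + 1.0

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Only 2 of the 3 columns are linearly independent, so X'X is singular.
print(np.linalg.matrix_rank(X))   # -> 2
print(np.linalg.cond(X.T @ X))    # astronomically large condition number
```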

Page 6:

Detecting Multicollinearity

The following are indicators of multicollinearity:

1. Significant correlations between pairs of independent variables in the model (sufficient but not necessary).

2. Nonsignificant t-tests for all (or nearly all) the individual β parameters when the F test for model adequacy H0: β1= β2 = … = βk = 0 is significant.

3. Estimated parameters whose signs are opposite to what is expected.

4. A variance inflation factor (VIF) for a β parameter greater than 10.

The VIFs can be calculated in SPSS by selecting “Collinearity diagnostics” in the “Statistics” options in the “Regression” dialog box.
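The same VIFs are also easy to obtain with statsmodels; a minimal sketch, assuming the predictors sit in a data frame with the hypothetical column names used earlier:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_spss("Sales3.sav")  # hypothetical column names below

# Design matrix with a constant, as used by the regression itself.
X = sm.add_constant(df[["advertising", "internet"]])

# VIF of each predictor (the constant in column 0 is skipped).
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)  # values above 10 flag substantial multicollinearity
```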

Page 7:

A typical situation

Multicollinearity can arise when transforming variables, e.g. when using both x1 and x1² in the regression equation and the range of values of x1 is limited.

  x      x²
  1.0    1.00
  1.2    1.44
  1.4    1.96
  1.6    2.56
  1.8    3.24
  2.0    4.00
  2.2    4.84
  2.4    5.76
  2.6    6.76
  2.8    7.84
  3.0    9.00
  3.2   10.24
  3.4   11.56
  3.6   12.96
  3.8   14.44
  4.0   16.00

[Scatter plot of x² against x over the range 1.0 to 4.0: the relation is very close to linear.]

Cor(x, x²) = 0.987
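The correlation on the slide is easy to verify with numpy over the same grid of x values:

```python
import numpy as np

# x restricted to the limited range 1.0, 1.2, ..., 4.0, as in the table.
x = np.arange(1.0, 4.01, 0.2)
x_sq = x ** 2

# Over this narrow range x and x^2 are almost linearly related, so the
# correlation is very close to the 0.987 reported on the slide.
print(np.corrcoef(x, x_sq)[0, 1])
```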

Page 8:

Remember: if multicollinearity is present but not excessive (no high correlations, no VIFs above 10), you can ignore it. Each variable still provides enough independent information, and its effect can be assessed.

If your main goal is explaining relationships, then multicollinearity may be a problem, because the measured effects can be misleading.

If your main goal is prediction (using the available explanatory variables to predict the response), then you can safely ignore the multicollinearity.

Page 9:

Some solutions to Multicollinearity

Get more data if you can.

Drop one or more of the correlated independent variables from the final model. A screening procedure like Stepwise regression may be helpful in determining which variable to drop.

If you keep all independent variables, be cautious in interpreting parameter values and keep predictions within the range of your data.

Use Ridge regression (we do not touch this subject in the course).

Page 10:

Some solutions to multicollinearity (continued)

If the multicollinearity is introduced by the use of higher-order models (e.g. models that use x and x², or x1, x2 and x1x2), use the independent variables as deviations from their means.

Example: suppose multicollinearity is present in E(Y) = β0 + β1x + β2x²

1) Compute x* = x − mean(x)
2) Run the regression E(Y) = β0 + β1x* + β2(x*)²

In most cases the multicollinearity is greatly reduced. Clearly the β parameters of the new regression will have different values and a different interpretation.
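A minimal sketch of this centering trick with statsmodels, using made-up data and hypothetical variable names, comparing the VIFs before and after:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 4.0, size=50)  # limited range, as in the earlier example

def vifs(columns):
    """VIF of every predictor in the given dict of columns."""
    X = sm.add_constant(pd.DataFrame(columns))
    return {c: variance_inflation_factor(X.values, i)
            for i, c in enumerate(X.columns) if c != "const"}

# Raw x and x^2: strongly collinear, large VIFs.
print(vifs({"x": x, "x_sq": x ** 2}))

# Centered x* = x - mean(x) and (x*)^2: the VIFs drop close to 1.
x_star = x - x.mean()
print(vifs({"x_star": x_star, "x_star_sq": x_star ** 2}))
```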

Page 11:

Example: shipping costs (continued)

– Y : cost of shipment in dollars

– X1: package weight in pounds

– X2: distance shipped in miles

Model 1: E(Y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²

Data: Express.sav

A company conducted a study to investigate the relationship between the cost of shipment and the variables that control the shipping charge: weight and distance.

It is suspected that nonlinear effects may be present, so let us analyze the model above.
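Model 1 can be written directly with the statsmodels formula API; a minimal sketch, assuming Express.sav is loaded with hypothetical column names cost, weight and distance:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("Express.sav")  # hypothetical column names below

# Full second-order model: main effects, interaction and squared terms.
model1 = smf.ols(
    "cost ~ weight + distance + weight:distance"
    " + I(weight ** 2) + I(distance ** 2)",
    data=df,
).fit()
print(model1.summary())
```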

Page 12:

Matrix scatter-plot

A matrix scatter plot shows at a glance the bivariate scatter plots for the selected variables. Use it as a preliminary screening tool.

In SPSS, choose the “Matrix” option from the “Scatter/Dot” Graph dialog and enter the variables of interest.

Note the obvious quadratic relation for some of the variables, which is very close to linear.

The matrix is symmetric, so it is enough to look at the lower triangle.

Page 13:

Correlation matrix

Pearson correlations (N = 20 for every pair):

                      Weight   Distance   Cost     Weight²   Dist.²   Weight*Dist.
  Weight of parcel     1        .182       .774**   .967**    .151      .820**
  Distance shipped     .182     1          .695**   .202      .980**    .633**
  Cost of shipment     .774**   .695**     1        .799**    .652**    .989**
  Weight squared       .967**   .202       .799**   1         .160      .821**
  Distance squared     .151     .980**     .652**   .160      1         .590**
  Weight*Distance      .820**   .633**     .989**   .821**    .590**    1

**. Correlation is significant at the 0.01 level (2-tailed). The starred correlations all have two-tailed p ≤ .006; the unstarred pairs have p between .39 and .52.

All the independent variables are individually strongly related to Y.

Page 14:

Model 1:VIF statistics

A VIF statistic larger than 10 is usually considered an indicator of substantial collinearity.

The VIFs can be calculated in SPSS by selecting “Collinearity diagnostics” in the “Statistics” options in the “Regression” dialog box.

Coefficients (Model 1)

                              B           SE      t        Sig.   VIF
  (Constant)                   .827       .702    1.178    .259
  Weight of parcel in lbs.    -.609       .180   -3.386    .004   20.031
  Distance shipped             .004       .008     .503    .623   35.526
  Weight squared               .090       .020    4.442    .001   17.027
  Distance squared            1.507E-5    .000     .672    .513   28.921
  Weight*Distance              .007       .001   11.495    .000   12.618

Page 15:

Model 2: Using IV as deviations from their mean

Note: the problems of multicollinearity have disappeared.

Coefficients (Model 2)

                B           SE      t        Sig.   VIF
  (Constant)    5.467       .216    25.252   .000
  X1star        1.263       .042    30.128   .000   1.087
  X2star         .038       .001    27.563   .000   1.081
  X1x2star       .007       .001    11.495   .000   1.095
  X1star2        .090       .020     4.442   .001   1.113
  x2star2       1.507E-5    .000      .672   .513   1.120

Note: R² (adjusted), the ANOVA table and the predictions are the same for the two models (check).

The x2star2 term is still not significant (Sig. = .513); it seems actually irrelevant and can be dropped.
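As a quick check of the note above, a sketch comparing the two fits (same hypothetical column names for Express.sav): the centered model is just a reparametrisation of Model 1, so the adjusted R² and the fitted values coincide.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("Express.sav")                   # hypothetical column names
df["w_star"] = df["weight"] - df["weight"].mean()  # centered IVs
df["d_star"] = df["distance"] - df["distance"].mean()

f1 = "cost ~ weight + distance + weight:distance + I(weight**2) + I(distance**2)"
f2 = "cost ~ w_star + d_star + w_star:d_star + I(w_star**2) + I(d_star**2)"
m1 = smf.ols(f1, data=df).fit()
m2 = smf.ols(f2, data=df).fit()

print(m1.rsquared_adj, m2.rsquared_adj)               # identical
print(np.allclose(m1.fittedvalues, m2.fittedvalues))  # True: same predictions
```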