Chapter 5: Multicollinearity, Dummy and Interaction Variables

Uploaded by garey-harmon, 03-Jan-2016

Page 1:

Chapter 5

Multicollinearity, Dummy and Interaction variables

Page 2:

We noted from the previous chapter that there are several dangers of adding new variables indiscriminately to the model. First, although unadjusted R2 goes up, we lose degrees of freedom because we have to estimate additional coefficients. The smaller the degrees of freedom, the less precise the parameter estimates. There is another serious consequence of adding too many variables to a model. If a model has several variables, it is likely that some of the variables will be strongly correlated. This property, known as multicollinearity, can drastically alter the results from one model to another, making them much harder to interpret.

Page 3:

Example 1

Let

Housingt = number of housing starts (in thousands) in the U.S.

Popt = U.S. population in millions

GDPt = U.S. Gross Domestic Product in billions of dollars

Intratet = new home mortgage rate

t = 1963 to 1985

Page 4:

SAS codes:

data housing;
infile 'd:\teaching\MS3215\housing.txt';
input year housing pop gdp unemp intrate;

proc reg data=housing;
model housing = pop intrate;
run;

proc reg data=housing;
model housing = gdp intrate;
run;

proc reg data=housing;
model housing = pop gdp intrate;
run;

Page 5:

The REG Procedure
Model: MODEL1
Dependent Variable: housing

Number of Observations Read    23
Number of Observations Used    23

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2           1125359         562679       7.50    0.0037
Error              20           1500642          75032
Corrected Total    22           2626001

Root MSE          273.91987    R-Square    0.4285
Dependent Mean   1601.07826    Adj R-Sq    0.3714
Coeff Var          17.10846

Parameter Estimates

                     Parameter      Standard
Variable    DF        Estimate         Error    t Value    Pr > |t|
Intercept    1     -3813.21672    1588.88417      -2.40      0.0263
pop          1        33.82138       9.37464       3.61      0.0018
intrate      1      -198.41880      51.29444      -3.87      0.0010

Page 6:

The REG Procedure
Model: MODEL1
Dependent Variable: housing

Number of Observations Read    23
Number of Observations Used    23

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2           1134747         567374       7.61    0.0035
Error              20           1491254          74563
Corrected Total    22           2626001

Root MSE          273.06168    R-Square    0.4321
Dependent Mean   1601.07826    Adj R-Sq    0.3753
Coeff Var          17.05486

Parameter Estimates

                     Parameter      Standard
Variable    DF        Estimate         Error    t Value    Pr > |t|
Intercept    1       687.92418     382.69637       1.80      0.0874
gdp          1         0.90543       0.24899       3.64      0.0016
intrate      1      -169.67320      43.83996      -3.87      0.0010

Page 7:

The REG Procedure
Model: MODEL1
Dependent Variable: housing

Number of Observations Read    23
Number of Observations Used    23

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3           1147699         382566       4.92    0.0108
Error              19           1478302          77805
Corrected Total    22           2626001

Root MSE          278.93613    R-Square    0.4371
Dependent Mean   1601.07826    Adj R-Sq    0.3482
Coeff Var          17.42177

Parameter Estimates

                     Parameter      Standard
Variable    DF        Estimate         Error    t Value    Pr > |t|
Intercept    1     -1317.45317    4930.68042      -0.27      0.7922
pop          1        14.91398      36.55401       0.41      0.6878
gdp          1         0.52186       0.97391       0.54      0.5983
intrate      1      -184.77902      58.10610      -3.18      0.0049

Page 8:

Note that in the last model, the t statistics for Pop and GDP are insignificant, but they are both significant when entered separately in the first and second models. This is because the three variables Pop, GDP and Intrate are highly correlated. It can be shown that

Cor(GDP, Pop) = 0.99

Cor(GDP, Intrate ) = 0.88

Cor(Pop, Intrate ) = 0.91
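These pairwise correlations can be computed directly. A minimal Python sketch of the Pearson correlation coefficient; the two series below are made up purely to mimic strongly trending variables, and are not the course's housing data:

```python
from math import sqrt

def cor(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Two trending series, loosely "Pop"- and "GDP"-like (hypothetical numbers):
pop = [189 + 2 * t for t in range(23)]                # steady growth
gdp = [600 + 40 * t + (-1) ** t for t in range(23)]   # growth plus tiny noise

print(round(cor(pop, gdp), 4))  # very close to 1, as with the real Pop/GDP pair
```

Any two variables that mostly track a common time trend will show a correlation near 1, which is exactly why macroeconomic regressors such as Pop and GDP are so often collinear.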

Page 9:

Example 2

Let expensesi be the cumulative expenditure on the maintenance of a given automobile, milesi be its cumulative mileage in thousands of miles and weeksi be its age in weeks since the original purchase, i = 1, …, 57.

Page 10:

SAS codes:

data automobile;
infile 'd:\teaching\MS3215\automobile.txt';
input weeks miles expenses;

proc reg data=automobile;
model expenses = weeks;
run;

proc reg data=automobile;
model expenses = miles;
run;

proc reg data=automobile;
model expenses = weeks miles;
run;

Page 11:

The REG Procedure
Model: MODEL1
Dependent Variable: expenses

Number of Observations Read    57
Number of Observations Used    57

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1          66744854       66744854     491.16    <.0001
Error              55           7474117         135893
Corrected Total    56          74218972

Root MSE          368.63674    R-Square    0.8993
Dependent Mean   1426.57895    Adj R-Sq    0.8975
Coeff Var          25.84061

Parameter Estimates

                     Parameter      Standard
Variable    DF        Estimate         Error    t Value    Pr > |t|
Intercept    1      -626.35977     104.71371      -5.98      <.0001
weeks        1         7.34942       0.33162      22.16      <.0001

Page 12:

The REG Procedure
Model: MODEL1
Dependent Variable: expenses

Number of Observations Read    57
Number of Observations Used    57

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1          63715228       63715228     333.63    <.0001
Error              55          10503743         190977
Corrected Total    56          74218972

Root MSE          437.00933    R-Square    0.8585
Dependent Mean   1426.57895    Adj R-Sq    0.8559
Coeff Var          30.63338

Parameter Estimates

                     Parameter      Standard
Variable    DF        Estimate         Error    t Value    Pr > |t|
Intercept    1      -796.19928     134.75770      -5.91      <.0001
miles        1        53.45246       2.92642      18.27      <.0001

Page 13:

The REG Procedure
Model: MODEL1
Dependent Variable: expenses

Number of Observations Read    57
Number of Observations Used    57

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2          70329066       35164533     488.16    <.0001
Error              54           3889906          72035
Corrected Total    56          74218972

Root MSE          268.39391    R-Square    0.9476
Dependent Mean   1426.57895    Adj R-Sq    0.9456
Coeff Var          18.81381

Parameter Estimates

                     Parameter      Standard
Variable    DF        Estimate         Error    t Value    Pr > |t|
Intercept    1         7.20143     117.81217       0.06      0.9515
weeks        1        27.58405       2.87875       9.58      <.0001
miles        1      -151.15752      21.42918      -7.05      <.0001

Page 14:

A car that is driven more should have a greater maintenance expense. Similarly, the older the car the greater the cost of maintaining it. So we would expect both slope coefficients to be positive. It is interesting to note that even though the coefficient for miles is positive in the second model, it is negative in the third model. Thus, there is a reversal in sign. The magnitude of the coefficient for weeks also changes substantially. The t statistics for miles and weeks are also much lower in the third model even though both variables are still significant.

The problem is again the high correlation between weeks and miles.

Page 15:

Consider the model

yi = β0 + β1x1i + β2x2i + εi

and let β̂1 and β̂2 be the least squares estimates of β1 and β2 respectively. It can be shown that

var(β̂1) = σ² / [(1 − r12²) Σi (x1i − x̄1)²]

var(β̂2) = σ² / [(1 − r12²) Σi (x2i − x̄2)²]

cov(β̂1, β̂2) = −r12 σ² / [(1 − r12²) √(Σi (x1i − x̄1)²) √(Σi (x2i − x̄2)²)]

where r12 = Cor(X1, X2).

Page 16:

The effect of increasing r12 on var(β̂2), where A = σ² / Σi (x2i − x̄2)² is the value of var(β̂2) when r12 = 0:

Value of r12    var(β̂2)
0.00            A
0.5             1.33 x A
0.7             1.96 x A
0.8             2.78 x A
0.9             5.26 x A
0.95            10.26 x A
0.97            16.92 x A
0.99            50.25 x A
0.995           100 x A
0.999           500 x A

The sign reversal and decrease in t value are caused by the inflated variance of the estimators.
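The entries in the table follow directly from the variance formula: relative to the uncorrelated case A, var(β̂2) is inflated by the factor 1/(1 − r12²). A quick Python check of the tabulated factors (illustrative only):

```python
# Inflation of var(beta2_hat) relative to the uncorrelated case A
# is 1 / (1 - r12**2); tabulate it for the r12 values above.
for r12 in [0.0, 0.5, 0.7, 0.8, 0.9, 0.95, 0.97, 0.99, 0.995, 0.999]:
    factor = 1.0 / (1.0 - r12 ** 2)
    print(f"r12 = {r12:<6} inflation = {factor:.2f} x A")
```

Note how slowly the factor grows up to r12 = 0.8 and how explosively it grows beyond 0.95: going from 0.9 to 0.999 multiplies the variance by almost 100.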

Page 17:

Consequences of Multicollinearity

- Wider confidence intervals.
- Insignificant t values.
- High R2, so that the F test convincingly rejects H0: β1 = β2 = … = βp = 0, yet few t values are significant.
- Sensitivity of the least squares estimates and their standard errors to small changes in the model.

Page 18:

Exact multicollinearity exists if two or more independent variables have a perfect linear relationship between them. In this case there is no unique solution to the normal equations derived from least squares. When this happens, one or more variables should be dropped from the model.

Multicollinearity is very much the norm in regression analysis involving non-experimental (or uncontrolled) data. It can never be eliminated. The question is not about the existence or non-existence of multicollinearity, but about how serious the problem is.

Page 19:

Identifying Multicollinearity

- High R2 (and a significant F value) but low values for the t statistics.
- High correlation coefficients between the explanatory variables. But the converse need not be true. In other words, multicollinearity may still be a problem even though the correlation between two variables does not appear to be high, because three or more variables may be strongly correlated even when the pairwise correlations are not high.
- Regression coefficient estimates and standard errors sensitive to small changes in specification.

Page 20:

Variance Inflation Factor (VIF):

Let x1, x2, …, xp be the p explanatory variables in a regression. Perform the regression of xK on the remaining p − 1 explanatory variables and let RK² be the coefficient of determination from that regression. The VIF for variable xK is

VIFK = 1 / (1 − RK²)

VIF is a measure of the strength of the relationship between each explanatory variable and all other explanatory variables in the model.
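The auxiliary-regression recipe can be sketched in a few lines of pure Python. This is only an illustration on made-up columns, not the course's SAS /vif option; the data and helper names are invented for the example:

```python
def solve(a, b):
    """Solve the linear system a @ beta = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(y, xcols):
    """R^2 from regressing y on xcols (with an intercept), via normal equations."""
    X = [[1.0] + [c[i] for c in xcols] for i in range(len(y))]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * xi for b, xi in zip(beta, row)) for row in X]
    ybar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - sse / sst

def vif(k, cols):
    """VIF of explanatory variable k: regress it on the others, take 1/(1 - R^2)."""
    others = [c for j, c in enumerate(cols) if j != k]
    return 1.0 / (1.0 - r_squared(cols[k], others))

# Hypothetical columns: x2 is nearly a linear function of x1, x3 is unrelated.
x1 = [1.0, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 14.2, 15.9]
x3 = [5.0, 1, 4, 2, 6, 3, 7, 2]
print([round(vif(k, [x1, x2, x3]), 1) for k in range(3)])
```

Because x2 is almost exactly 2·x1, the auxiliary R² for either of them is close to 1 and their VIFs are very large, mirroring the pop/gdp diagnostics shown later on page 24.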

Page 21:

Relationship between RK² and VIFK:

RK²     VIFK
0       1
0.9     10
0.99    100

Page 22:

How large does VIFK have to be to suggest a serious problem with multicollinearity?

a) An individual VIFK larger than 10 indicates that multicollinearity may be seriously influencing the least squares estimates of the regression coefficients.

b) If the average of the VIFK,

average VIF = (VIF1 + VIF2 + … + VIFp) / p,

is larger than 5, then serious problems may exist. The average VIF indicates how many times larger the error of the regression is due to multicollinearity than it would be if the variables were uncorrelated.

Page 23:

SAS codes:

data housing;

infile 'd:\teaching\MS3215\housing.txt';

input year housing pop gdp unemp intrate ;

proc reg data= housing ;

model housing= pop gdp intrate/vif;

run;

Page 24:

The REG Procedure
Model: MODEL1
Dependent Variable: housing

Number of Observations Read    23
Number of Observations Used    23

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3           1147699         382566       4.92    0.0108
Error              19           1478302          77805
Corrected Total    22           2626001

Root MSE          278.93613    R-Square    0.4371
Dependent Mean   1601.07826    Adj R-Sq    0.3482
Coeff Var          17.42177

Parameter Estimates

                     Parameter      Standard                           Variance
Variable    DF        Estimate         Error    t Value    Pr > |t|    Inflation
Intercept    1     -1317.45317    4930.68042      -0.27      0.7922            0
pop          1        14.91398      36.55401       0.41      0.6878     87.97808
gdp          1         0.52186       0.97391       0.54      0.5983     64.66953
intrate      1      -184.77902      58.10610      -3.18      0.0049      7.42535

Page 25:

Solutions to Multicollinearity

1) Benign Neglect

If an analyst is less interested in interpreting individual coefficients and more interested in forecasting, then multicollinearity may not be a serious concern. Even with high correlations among independent variables, if the regression coefficients are significant and have meaningful signs and magnitudes, one need not be too concerned with multicollinearity.

Page 26:

2) Eliminating Variables

Removing the variable most strongly correlated with the rest would generally improve the significance of the other variables. There is a danger, however, in removing too many variables from the model, because that would lead to bias in the estimates.

3) Re-specifying the model

For example, in the housing regression, we can express housing starts and GDP in per-capita terms rather than include population as an explanatory variable, leading to

Housingt/Popt = β0 + β1(GDPt/Popt) + β2Intratet + εt

Page 27:

SAS codes:

data housing;

infile 'd:\teaching\MS3215\housing.txt';

input year housing pop gdp unemp intrate ;

phousing= housing/pop;

pgdp= gdp/pop;

proc reg data= housing ;

model phousing = pgdp intrate/vif;

run;

Page 28:

The REG Procedure
Model: MODEL1
Dependent Variable: phousing

Number of Observations Read    23
Number of Observations Used    23

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2          26.33472       13.16736       7.66    0.0034
Error              20          34.38472        1.71924
Corrected Total    22          60.71944

Root MSE        1.31120    R-Square    0.4337
Dependent Mean  7.50743    Adj R-Sq    0.3771
Coeff Var      17.46531

Parameter Estimates

                     Parameter    Standard                           Variance
Variable    DF        Estimate       Error    t Value    Pr > |t|    Inflation
Intercept    1         2.07920     3.34724       0.62      0.5415            0
pgdp         1         0.93567     0.36701       2.55      0.0191      3.45825
intrate      1        -0.69832     0.18640      -3.75      0.0013      3.45825

Page 29:

4) Increasing the sample size

This solution is often recommended on the grounds that such an increase improves the precision of an estimator and hence reduces the adverse effects of multicollinearity. But sometimes additional sample information may not be available.

5) Other estimation techniques (beyond the scope of this course)

Ridge regression
Principal component analysis

Page 30:

Dummy variables

In regression analysis, qualitative or categorical variables are often useful. Qualitative variables such as sex, marital status or political affiliation can be represented by dummy variables, usually coded as 0 and 1. The two values signify that the observation belongs to one of two possible categories.

Page 31:

Example

The Salary Survey data set was developed from a salary survey of computer professionals in a large corporation. The objective of the survey was to identify and quantify those variables that determine salary differentials. In addition, the data could be used to determine if the corporation’s salary administration guidelines were being followed. The data appear in the file salary.txt. The response variable is salary (S) and the explanatory variables are: (1) experience (X), measured in years; (2) education (E), coded as 1 for completion of a high school (H.S.) diploma, 2 for completion of a bachelor degree (B.S.), and 3 for the completion of an advanced degree; (3) management (M), which is coded as 1 for a person with management responsibility and 0 otherwise. We shall try to measure the effects of these three variables on salary using regression analysis.

Page 32:

So, the regression model is

Si = β0 + β1Xi + β2Ei + β3Mi + εi

where

Mi = 1 if employee i takes on management responsibility, 0 otherwise.

This leads to two possible regressions:

i) For managerial positions: Si = (β0 + β3) + β1Xi + β2Ei + εi

ii) For non-managerial positions: Si = β0 + β1Xi + β2Ei + εi

β3 therefore represents the average salary difference between employees with and without managerial responsibilities.

Page 33:

SAS codes:

data salary;

infile 'd:\teaching\MS3215\salary.txt';

input s x e m;

proc reg data= salary;

model s= x e m;

run;

Page 34:

The REG Procedure
Model: MODEL1
Dependent Variable: s

Number of Observations Read    46
Number of Observations Used    46

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3         928714168      309571389     179.63    <.0001
Error              42          72383410        1723415
Corrected Total    45        1001097577

Root MSE       1312.78883    R-Square    0.9277
Dependent Mean      17270    Adj R-Sq    0.9225
Coeff Var         7.60147

Parameter Estimates

                     Parameter      Standard
Variable    DF        Estimate         Error    t Value    Pr > |t|
Intercept    1      6963.47772     665.69473      10.46      <.0001
x            1       570.08738      38.55905      14.78      <.0001
e            1      1578.75032     262.32162       6.02      <.0001
m            1      6688.12994     398.27563      16.79      <.0001

Page 35:

So the estimated regressions are

i) For managerial positions:

Ŝi = 6963.48 + 6688.13 + 570.09Xi + 1578.75Ei = 13651.61 + 570.09Xi + 1578.75Ei

ii) For non-managerial positions:

Ŝi = 6963.48 + 570.09Xi + 1578.75Ei

Note that all variables are significant and all estimated coefficients have positive signs, indicating that, other things being equal,

a. Each additional year of work experience is worth a salary increment of $570.

b. An improvement of qualification from high school to a bachelor’s degree or from bachelor’s degree to advanced degree is worth $1579.

c. On average, employees with managerial responsibility receive $6688 more than employees without managerial responsibility.

Page 36:

Note that only one dummy variable is needed to represent M, which contains two categories. Suppose we define a new variable M′ which is the complement of M, that is,

M′i = 1 if employee i does not take on management responsibility, 0 otherwise.

Note that whenever Mi = 1, M′i = 0. If M′i is used in conjunction with Mi then we have

Si = β0 + β1Xi + β2Ei + β3Mi + β4M′i + εi

but note that the "implicit" explanatory variable (call it Ii) attached to the intercept term is represented by a vector of 1s. Hence,

Ii = Mi + M′i

PERFECT MULTICOLLINEARITY

Page 37:

The method of least squares fails as a result and there is no unique solution to the normal equations.

This problem is known as “dummy variable trap”. In general, for a qualitative variable containing J categories, only J-1 dummy variables are required.
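The trap can be seen numerically: with the intercept's implicit column of ones, a dummy and its complement sum exactly to that column, so X′X is singular. A small Python illustration on toy data (not the salary file):

```python
# Four toy observations: a dummy M, its complement M', and the intercept column I.
M  = [1, 0, 1, 0]
Mp = [1 - m for m in M]          # the complement dummy
I  = [1, 1, 1, 1]                # implicit intercept column; note I = M + M'

X = list(zip(I, M, Mp))          # design matrix rows

# Form X'X (3x3) and check that it is singular.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]

def det3(a):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
          - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
          + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

print(det3(xtx))  # 0 -- perfect multicollinearity, no unique solution
```

Because the determinant is exactly zero, the normal equations X′Xβ = X′y cannot be solved uniquely, which is why SAS (or any least squares routine) fails or drops a column.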

Question: What if Mi is replaced by M′i? Will there be any difference in the result?

Answer: Note that Mi and M′i contain essentially the same information. The results will be exactly the same.

Page 38:

SAS codes:

data salary;

infile 'd:\teaching\MS3215\salary.txt';

input s x e m;

mp= 0;

If m eq 0 then mp= 1;

proc reg data= salary;

model s= x e mp;

run;

Page 39:

The REG Procedure
Model: MODEL1
Dependent Variable: s

Number of Observations Read    46
Number of Observations Used    46

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3         928714168      309571389     179.63    <.0001
Error              42          72383410        1723415
Corrected Total    45        1001097577

Root MSE       1312.78883    R-Square    0.9277
Dependent Mean      17270    Adj R-Sq    0.9225
Coeff Var         7.60147

Parameter Estimates

                     Parameter      Standard
Variable    DF        Estimate         Error    t Value    Pr > |t|
Intercept    1           13652     734.39164      18.59      <.0001
x            1       570.08738      38.55905      14.78      <.0001
e            1      1578.75032     262.32162       6.02      <.0001
mp           1     -6688.12994     398.27563     -16.79      <.0001

Page 40:

So the regressions are

i) For managerial positions:

Ŝi = 13652 + 570.09Xi + 1578.75Ei

ii) For non-managerial positions:

Ŝi = 13652 − 6688.13 + 570.09Xi + 1578.75Ei = 6963.87 + 570.09Xi + 1578.75Ei

The results, except for minor differences due to rounding, are essentially the same as those when M, instead of M′, is used as an explanatory variable.

Page 41:

So far, education has been treated in a linear fashion. This may be too restrictive. Instead, we shall view education as a categorical variable and define two dummy variables to represent the three categories:

Bi = 1 if employee i completes a bachelor degree as his/her highest level of educational attainment, 0 otherwise

Ai = 1 if employee i completes an advanced degree as his/her highest level of educational attainment, 0 otherwise

So, when

Ei = 1: Bi = Ai = 0
Ei = 2: Bi = 1, Ai = 0
Ei = 3: Bi = 0, Ai = 1

So the model to be estimated is

Si = β0 + β1Xi + β2Bi + β3Ai + β4Mi + εi
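The mapping from the coded variable E to the two dummies is the same assignment the SAS data step on the next page performs; here is the equivalent as a small Python sketch (function name invented for the example):

```python
def education_dummies(e):
    """Map the education code E (1 = H.S., 2 = B.S., 3 = advanced)
    to the dummy pair (B, A)."""
    return (1 if e == 2 else 0, 1 if e == 3 else 0)

print([education_dummies(e) for e in (1, 2, 3)])
# [(0, 0), (1, 0), (0, 1)] -- high school is the baseline category
```

High school (E = 1) gets no dummy of its own: it is the baseline absorbed by the intercept, which is exactly the J − 1 rule from page 37.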

Page 42:

SAS codes:

data salary;

infile 'd:\teaching\MS3215\salary.txt';

input s x e m;

a= 0;

b= 0;

If e eq 2 then b= 1;

If e eq 3 then a= 1;

proc reg data= salary;

model s= x b a m;

run;

Page 43:

The REG Procedure
Model: MODEL1
Dependent Variable: s

Number of Observations Read    46
Number of Observations Used    46

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4         957816858      239454214     226.84    <.0001
Error              41          43280719        1055627
Corrected Total    45        1001097577

Root MSE       1027.43725    R-Square    0.9568
Dependent Mean      17270    Adj R-Sq    0.9525
Coeff Var         5.94919

Parameter Estimates

                     Parameter      Standard
Variable    DF        Estimate         Error    t Value    Pr > |t|
Intercept    1      8035.59763     386.68926      20.78      <.0001
x            1       546.18402      30.51919      17.90      <.0001
b            1      3144.03521     361.96827       8.69      <.0001
a            1      2996.21026     411.75271       7.28      <.0001
m            1      6883.53101     313.91898      21.93      <.0001

Page 44:

The interpretation of the coefficients of X and M is the same as before. The estimated coefficient of Bi (3144.04) measures the salary differential of bachelor degree holders relative to high school leavers. Similarly, the estimated coefficient of Ai (2996.21) measures the salary differential of advanced degree holders relative to high school leavers. The difference (3144.04 − 2996.21 = 147.83) measures the salary differential between bachelor degree and advanced degree holders. Interestingly, the results suggest that a bachelor degree is worth more than an advanced degree! (But is the difference significant?)

Page 45:

Interaction Variables

The previous models all suggest that the effects of education and management status on salary determination are additive. For example, the effect of a management position is measured by β4 independently of the level of educational attainment. Possible non-additive effects may be evaluated by constructing additional variables designed to capture interaction effects. Interaction variables are products of existing variables: for example, BiMi and AiMi are interaction variables capturing interaction effects between educational levels and managerial responsibility.

The expanded model is

Si = β0 + β1Xi + β2Bi + β3Ai + β4Mi + β5BiMi + β6AiMi + εi
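The interaction variables are literal products of the existing columns, as in the SAS statements bm = b*m and am = a*m on the next page. A small Python sketch with made-up dummy values:

```python
# Hypothetical dummy columns for five employees.
b = [0, 1, 0, 1, 0]   # bachelor degree dummy
a = [0, 0, 1, 0, 1]   # advanced degree dummy
m = [0, 1, 1, 0, 1]   # management dummy

# Interaction columns: elementwise products.
bm = [bi * mi for bi, mi in zip(b, m)]
am = [ai * mi for ai, mi in zip(a, m)]
print(bm, am)  # [0, 1, 0, 0, 0] [0, 0, 1, 0, 1]
```

Each interaction column is 1 only for employees who are in both categories at once, which is what lets the model assign a management premium that differs by education level.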

Page 46:

SAS codes:

data salary;
infile 'd:\teaching\MS3215\salary.txt';
input s x e m;
a=0;
b=0;
if e eq 2 then b=1;
if e eq 3 then a=1;
bm=b*m;
am=a*m;

proc reg data=salary;
model s = x b a m bm am;
run;

Page 47:

The REG Procedure
Model: MODEL1
Dependent Variable: s

Number of Observations Read    46
Number of Observations Used    46

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               6         999919409      166653235    5516.60    <.0001
Error              39           1178168          30209
Corrected Total    45        1001097577

Root MSE        173.80861    R-Square    0.9988
Dependent Mean      17270    Adj R-Sq    0.9986
Coeff Var         1.00641

Parameter Estimates

                     Parameter      Standard
Variable    DF        Estimate         Error    t Value    Pr > |t|
Intercept    1      9472.68545      80.34365     117.90      <.0001
x            1       496.98701       5.56642      89.28      <.0001
b            1      1381.67063      77.31882      17.87      <.0001
a            1      1730.74832     105.33389      16.43      <.0001
m            1      3981.37690     101.17472      39.35      <.0001
bm           1      4902.52307     131.35893      37.32      <.0001
am           1      3066.03512     149.33044      20.53      <.0001

Page 48:

Interpretation of regression results:

There are 6 regression models altogether:

i) High school leavers in non-managerial positions:

Si = β0 + β1Xi + εi

ii) High school leavers in managerial positions:

Si = (β0 + β4) + β1Xi + εi

iii) Bachelor degree holders in non-managerial positions:

Si = (β0 + β2) + β1Xi + εi

iv) Bachelor degree holders in managerial positions:

Si = (β0 + β2 + β4 + β5) + β1Xi + εi

Page 49:

v) Advanced degree holders in non-managerial positions:

Si = (β0 + β3) + β1Xi + εi

vi) Advanced degree holders in managerial positions:

Si = (β0 + β3 + β4 + β6) + β1Xi + εi

For example, to ascertain the marginal change in salary due to the acquisition of an advanced degree,

ΔSi/ΔAi = β3 + β6Mi = 1730.75 + 3066.04Mi (estimate)

Thus, the marginal change is $1730.75 for non-managerial employees and $4796.79 for managerial employees.

Page 50:

Similarly, to investigate the impact of a change from a non-managerial to a managerial position,

ΔSi/ΔMi = β4 + β5Bi + β6Ai = 3981.38 + 4902.52Bi + 3066.04Ai (estimate)

Thus, the marginal change is $3981.38 for high school leavers, $8883.90 for bachelor degree holders and $7047.42 for advanced degree holders.
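These marginal changes are just sums of the estimated coefficients from the page 47 output (3981.38 for m, 4902.52 for bm, 3066.04 for am); a quick arithmetic check in Python:

```python
# Estimated coefficients of m, b*m and a*m from the interaction model.
b4, b5, b6 = 3981.38, 4902.52, 3066.04

def managerial_premium(B, A):
    """Estimated salary change of moving to a managerial position,
    for an employee with education dummies (B, A)."""
    return b4 + b5 * B + b6 * A

print(round(managerial_premium(0, 0), 2))  # 3981.38  high school leavers
print(round(managerial_premium(1, 0), 2))  # 8883.9   bachelor degree holders
print(round(managerial_premium(0, 1), 2))  # 7047.42  advanced degree holders
```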

Page 51:

Comparing different groups of regression models by dummy variables

Sometimes a collection of data may consist of two or more distinct subsets, each of which may require a separate regression. Serious bias may be incurred if a combined regression relationship is used to represent the pooled data set.

Page 52:

Example

A job performance test was given to a group of 20 trainees on a special employment program at the end of the job training period. All these 20 trainees were eventually employed by the company and given a performance evaluation score after 6 months. The data are given in the file employment.txt.

Let Y represent the job performance score of an employee and X the score on the pre-employment test. We are concerned with equal employment opportunity. We want to compare:

Page 53:

Model 1 (pooled): yij = β0 + β1xij + εij,  i = 1, 2, …, nj;  j = 1, 2

Model 2A (Minority): yi1 = β01 + β11xi1 + εi1

Model 2B (White): yi2 = β02 + β12xi2 + εi2

In Model 1, race distinction is ignored, the data are pooled and there is a single regression line. In Models 2A and 2B, there are two separate regression relationships for the two subgroups, each with a distinct set of regression coefficients.

Page 54:

Page 55:

So, formally, we want to test

H0: β01 = β02, β11 = β12

vs.

H1: at least one of the equalities in H0 is false.

The test may be performed using dummy and interaction variables. Define

Zij = 1 if j = 1 (minority), 0 if j = 2 (white)

and formulate the following model, which we call Model 3:

yij = β0 + β1xij + β2Zij + β3Zijxij + εij,  i = 1, 2, …, nj;  j = 1, 2

Page 56:

Note that Model 3 is equivalent to Models 2A and 2B. When j = 1, Zi1 = 1, and Model 3 becomes

yi1 = (β0 + β2) + (β1 + β3)xi1 + εi1,

which is Model 2A with β01 = β0 + β2 and β11 = β1 + β3; and when j = 2, Zi2 = 0, and Model 3 reduces to

yi2 = β0 + β1xi2 + εi2,

which is Model 2B.

So a comparison between Model 1 and Models 2A and 2B is equivalent to a comparison between Model 1 and Model 3.

Page 57:

Now, Model 1

yij = β0 + β1xij + εij,  i = 1, 2, …, nj;  j = 1, 2

may be obtained by setting β2 = 0 and β3 = 0 in Model 3

yij = β0 + β1xij + β2Zij + β3Zijxij + εij,  i = 1, 2, …, nj;  j = 1, 2.

Thus, the hypothesis of interest becomes

H0: β2 = β3 = 0

H1: at least one of β2 and β3 is non-zero

Page 58:

The test may be carried out using a partial-F test defined as

F = [(SSER − SSEF)/k] / [SSEF/(n − p − 1)]

where k is the number of restrictions under H0, SSER is the SSE corresponding to the restricted model, SSEF is the SSE corresponding to the full model, and n − p − 1 is the error degrees of freedom of the full model.
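The partial-F formula translates directly into code. A small sketch, checked against the SSE values that appear in the employment-example output on later pages (restricted SSE 45.56830, full SSE 31.65547):

```python
def partial_f(sse_r, sse_f, k, n, p):
    """Partial F statistic for testing k restrictions; the full model
    has p explanatory variables, hence n - p - 1 error df."""
    return ((sse_r - sse_f) / k) / (sse_f / (n - p - 1))

# Restricted (pooled) vs full (dummy + interaction) model SSEs:
print(round(partial_f(45.56830, 31.65547, k=2, n=20, p=3), 2))  # 3.52
```

This reproduces the F value that SAS reports for the TEST statement on page 61 (numerator MS 6.95641, denominator MS 1.97847).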

Page 59:

SAS codes:

data employ;

infile 'd:\teaching\MS3215\employ.txt';

input x race y;

z= 0;

if race eq 1 then z= 1;

zx= z*x;

proc reg data= employ;

model y= x z zx;

test: test z= 0, zx= 0;

run;

proc reg data= employ;

model y= x;

run;

Page 60:

Regression results of Model 3 (full model)

The REG Procedure
Model: MODEL1
Dependent Variable: y

Number of Observations Read    20
Number of Observations Used    20

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3          62.63578       20.87859      10.55    0.0005
Error              16          31.65547        1.97847
Corrected Total    19          94.29125

Root MSE        1.40658    R-Square    0.6643
Dependent Mean  4.50850    Adj R-Sq    0.6013
Coeff Var      31.19840

Parameter Estimates

                     Parameter    Standard
Variable    DF        Estimate       Error    t Value    Pr > |t|
Intercept    1         2.01028     1.05011       1.91      0.0736
x            1         1.31340     0.67037       1.96      0.0677
z            1        -1.91317     1.54032      -1.24      0.2321
zx           1         1.99755     0.95444       2.09      0.0527

Page 61:

The REG Procedure
Model: MODEL1

Test "test" Results for Dependent Variable y

Source         DF    Mean Square    F Value    Pr > F
Numerator       2        6.95641       3.52    0.0542
Denominator    16        1.97847

Page 62:

Regression results of Model 1 (restricted model)

The REG Procedure
Model: MODEL1
Dependent Variable: y

Number of Observations Read    20
Number of Observations Used    20

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1          48.72296       48.72296      19.25    0.0004
Error              18          45.56830        2.53157
Corrected Total    19          94.29125

Root MSE        1.59109    R-Square    0.5167
Dependent Mean  4.50850    Adj R-Sq    0.4899
Coeff Var      35.29093

Parameter Estimates

                     Parameter    Standard
Variable    DF        Estimate       Error    t Value    Pr > |t|
Intercept    1         1.03497     0.86803       1.19      0.2486
x            1         2.36053     0.53807       4.39      0.0004

Page 63:

Here, k = 2, n = 20, p = 3 and n − p − 1 = 16.

Hence,

F = [(45.57 − 31.66)/2] / (31.66/16) = 3.52

Now let α = 0.10. Then F(0.10, 2, 16) = 2.67 < 3.52 (and the p-value 0.0542 < 0.10).

Hence we reject H0 and conclude that the relationship is different for the two groups. Specifically, for minorities, we have

ŷi1 = 0.10 + 3.31xi1

and for whites,

ŷi2 = 2.01 + 1.31xi2
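The two fitted lines can be recovered from the Model 3 estimates on page 60 (intercept 2.01028, x 1.31340, z −1.91317, zx 1.99755): the minority line adds the z and zx coefficients to the pooled intercept and slope. A quick Python check:

```python
# Model 3 estimates: intercept, x, z, zx.
b0, b1, b2, b3 = 2.01028, 1.31340, -1.91317, 1.99755

minority = (b0 + b2, b1 + b3)   # (intercept, slope) when Z = 1
white    = (b0, b1)             # (intercept, slope) when Z = 0

print([round(v, 2) for v in minority])  # [0.1, 3.31]
print([round(v, 2) for v in white])     # [2.01, 1.31]
```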