TRANSCRIPT
Lecture 9
Today: Ch. 3: Multiple Regression Analysis • Example with two independent variables • Frisch-Waugh-Lovell theorem
© Christopher Dougherty 1999–2006
MULTIPLE REGRESSION WITH TWO EXPLANATORY VARIABLES: EXAMPLE
[Figure: three-dimensional diagram of the model, with axes EARNINGS, S, and EXP, and intercept b1 marked]
We’ll look at the geometrical interpretation of a multiple regression model with two explanatory variables.
Specifically, we will look at an earnings function model where hourly earnings, EARNINGS, depend on years of schooling (highest grade completed), S, and years of work experience, EXP.
The model has three dimensions, one each for EARNINGS, S, and EXP. The starting point for investigating the determination of EARNINGS is the intercept, b1.
Literally the intercept gives EARNINGS for those respondents who have no schooling and no work experience. However, there were no respondents with less than 6 years of schooling. Hence a literal interpretation of b1 would be unwise.
EARNINGS = b1 + b2S + b3EXP + u
The next term on the right side of the equation gives the effect of variations in S. A one year increase in S causes EARNINGS to increase by b2 dollars, holding EXP constant.
Similarly, the third term gives the effect of variations in EXP. A one year increase in EXP causes EARNINGS to increase by b3 dollars, holding S constant.
Different combinations of S and EXP give rise to values of EARNINGS which lie on the plane shown in the diagram, defined by EARNINGS = b1 + b2S + b3EXP.
This plane is the deterministic (nonstochastic, nonrandom) component of the model.
The final element of the model is the disturbance term, u. This causes the actual values of EARNINGS to deviate from the plane. In this observation, u happens to have a positive value.
A sample consists of a number of observations generated in this way. Note that the interpretation of the model does not depend on whether S and EXP are correlated or not.
However, we do assume that the effects of S and EXP on EARNINGS are additive: the impact of a difference in S on EARNINGS is not affected by the value of EXP, and vice versa.
Yi = β1 + β2X2i + β3X3i + ui

Ŷi = b1 + b2X2i + b3X3i

ei = Yi − Ŷi = Yi − b1 − b2X2i − b3X3i
The regression coefficients are derived using the same least squares principle used in simple regression analysis. The fitted value of Y in observation i depends on our choice of b1, b2, and b3.
The residual ei in observation i is the difference between the actual and fitted values of Y.
RSS = Σei² = Σ(Yi − b1 − b2X2i − b3X3i)²
We define RSS, the sum of the squares of the residuals, and choose b1, b2, and b3 so as to minimize it.
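The minimization can be illustrated directly. Below is a minimal Python sketch using simulated data with hypothetical coefficient values (not the lecture's Data Set 21 sample): it feeds RSS to a numerical optimizer and recovers estimates close to the true parameters.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data with hypothetical coefficient values (not Data Set 21).
rng = np.random.default_rng(0)
n = 200
X2 = rng.uniform(6, 20, n)           # e.g. years of schooling
X3 = rng.uniform(0, 30, n)           # e.g. years of work experience
Y = -26.0 + 2.7 * X2 + 0.56 * X3 + rng.normal(0, 5, n)

def rss(b):
    """Residual sum of squares for a candidate (b1, b2, b3)."""
    b1, b2, b3 = b
    e = Y - b1 - b2 * X2 - b3 * X3   # residuals implied by this choice of b
    return e @ e

result = minimize(rss, x0=np.zeros(3))
print(result.x)  # close to the true parameters (-26, 2.7, 0.56)
```

In practice no one minimizes RSS numerically like this; the point is only that the least squares coefficients are, by definition, the values at which RSS is smallest.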
RSS = ΣYi² + nb1² + b2²ΣX2i² + b3²ΣX3i²
      − 2b1ΣYi − 2b2ΣX2iYi − 2b3ΣX3iYi
      + 2b1b2ΣX2i + 2b1b3ΣX3i + 2b2b3ΣX2iX3i

∂RSS/∂b1 = 0
∂RSS/∂b2 = 0
∂RSS/∂b3 = 0
First we expand RSS as shown, and then we use the first-order conditions for minimizing it.
b1 = Ȳ − b2X̄2 − b3X̄3

b2 = [Σ(X2i − X̄2)(Yi − Ȳ) Σ(X3i − X̄3)² − Σ(X3i − X̄3)(Yi − Ȳ) Σ(X2i − X̄2)(X3i − X̄3)]
     / [Σ(X2i − X̄2)² Σ(X3i − X̄3)² − (Σ(X2i − X̄2)(X3i − X̄3))²]

We thus obtain three equations in three unknowns. Solving for b1, b2, and b3, we obtain the expressions shown above. (The expression for b3 is the same as that for b2, with the subscripts 2 and 3 interchanged everywhere.)
The expression for b1 is a straightforward extension of the expression for it in simple regression analysis.
However, the expressions for the slope coefficients are considerably more complex than that for the slope coefficient in simple regression analysis.
For the general case when there are many explanatory variables, ordinary algebra is inadequate. It is necessary to switch to matrix algebra.
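For the two-variable case, the closed-form expressions can still be checked numerically. The Python sketch below, using simulated data with hypothetical values (not Data Set 21), implements the formulas for b1, b2, and b3 and compares them with a general-purpose least squares solver.

```python
import numpy as np

# Simulated data with hypothetical values (not Data Set 21); X3 is made
# negatively correlated with X2, as EXP is with S in the lecture's sample.
rng = np.random.default_rng(1)
n = 300
X2 = rng.uniform(6, 20, n)
X3 = 30 - X2 + rng.normal(0, 3, n)
Y = -26.0 + 2.7 * X2 + 0.56 * X3 + rng.normal(0, 5, n)

# Deviations from the sample means.
x2, x3, y = X2 - X2.mean(), X3 - X3.mean(), Y - Y.mean()

# Slope formulas from the text; b3 is b2 with subscripts 2 and 3 interchanged.
den = (x2 @ x2) * (x3 @ x3) - (x2 @ x3) ** 2
b2 = ((x2 @ y) * (x3 @ x3) - (x3 @ y) * (x2 @ x3)) / den
b3 = ((x3 @ y) * (x2 @ x2) - (x2 @ y) * (x2 @ x3)) / den
b1 = Y.mean() - b2 * X2.mean() - b3 * X3.mean()

# Cross-check against a general-purpose least squares solver.
X = np.column_stack([np.ones(n), X2, X3])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose([b1, b2, b3], beta))  # True
```

The matrix solver is, of course, what is used in practice: it handles any number of regressors, which is why the general treatment requires matrix algebra.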
. reg EARNINGS S EXP
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =   67.54
       Model |  22513.6473     2  11256.8237           Prob > F      =  0.0000
    Residual |  89496.5838   537  166.660305           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  112010.231   539  207.811189           Root MSE      =   12.91

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------
Here is the regression output for the earnings function using Data Set 21.
EARNINGS-hat = −26.49 + 2.68 S + 0.56 EXP
It indicates that earnings increase by $2.68 for every extra year of schooling and by $0.56 for every extra year of work experience.
Literally, the intercept indicates that an individual who had no schooling or work experience would have hourly earnings of –$26.49.
Obviously, this is impossible. The lowest value of S in the sample was 6. We have obtained a nonsense estimate because we have extrapolated too far from the data range.
GRAPHING A RELATIONSHIP IN A MULTIPLE REGRESSION MODEL
Suppose that you were particularly interested in the relationship between EARNINGS and S and wished to represent it graphically, using the sample data.
A simple plot would be misleading.
[Figure: scatter diagram of hourly earnings ($) against years of schooling, with fitted regression line]
Schooling is negatively correlated with work experience. The plot fails to take account of this, and as a consequence the regression line underestimates the impact of schooling on earnings. This is an example of omitted variable bias. (Later, we'll discuss the mathematical details of this distortion.)
To eliminate the distortion, you purge both EARNINGS and S of their components related to EXP and then draw a scatter diagram using the purged variables.
. cor S EXP
(obs=540)

        |      S    EXP
--------+------------------
      S |  1.0000
    EXP | -0.2179  1.0000
. reg EARNINGS EXP
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =    2.98
       Model |  617.717488     1  617.717488           Prob > F      =  0.0847
    Residual |  111392.514   538  207.049282           R-squared     =  0.0055
-------------+------------------------------           Adj R-squared =  0.0037
       Total |  112010.231   539  207.811189           Root MSE      =  14.389

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |   .2414715   .1398002     1.73   0.085    -.0331497    .5160927
       _cons |   15.55527   2.442468     6.37   0.000     10.75732    20.35321
------------------------------------------------------------------------------
. predict EEARN, resid
We start by regressing EARNINGS on EXP, as shown above. The residuals are the part of EARNINGS which is not related to EXP. The ‘predict’ command is the Stata command for saving the residuals from the most recent regression. We name them EEARN.
. reg S EXP
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   26.82
       Model |  152.160205     1  152.160205           Prob > F      =  0.0000
    Residual |  3052.82313   538  5.67439243           R-squared     =  0.0475
-------------+------------------------------           Adj R-squared =  0.0457
       Total |  3204.98333   539  5.94616574           Root MSE      =  2.3821

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |  -.1198454   .0231436    -5.18   0.000    -.1653083   -.0743826
       _cons |   15.69765   .4043447    38.82   0.000     14.90337    16.49194
------------------------------------------------------------------------------
. predict ES, resid
We do the same with S. We regress it on EXP and save the residuals as ES.
Now we plot EEARN on ES and the scatter is a faithful representation of the relationship, both in terms of the slope of the trend line (the black line) and in terms of the variation about that line.
As you would expect, the trend line is steeper than in the scatter diagram which did not control for EXP (reproduced here as the red line).
[Figure: scatter diagram of EEARN (earnings residuals) against ES (schooling residuals), with the purged-regression trend line in black and the simple-regression line in red]
. reg EEARN ES

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  131.63
       Model |  21895.9298     1  21895.9298           Prob > F      =  0.0000
    Residual |  89496.5833   538  166.350527           R-squared     =  0.1966
-------------+------------------------------           Adj R-squared =  0.1951
       Total |  111392.513   539  206.665145           Root MSE      =  12.898

------------------------------------------------------------------------------
       EEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ES |   2.678125   .2334325    11.47   0.000     2.219574    3.136676
       _cons |   8.10e-09   .5550284     0.00   1.000    -1.090288    1.090288
------------------------------------------------------------------------------
From multiple regression:
. reg EARNINGS S EXP

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------
Here is the regression of EEARN on ES.
We will content ourselves by verifying that the estimate of the slope coefficient is the same as that from a multiple regression. A mathematical proof that the technique works requires matrix algebra.
This result is also called the Frisch-Waugh-Lovell theorem.
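The verification can also be replicated outside Stata. A minimal Python sketch, with simulated data standing in for EARNINGS, S, and EXP (hypothetical numbers, not Data Set 21), carries out the purging steps and confirms that the residual-regression slope equals the multiple-regression slope.

```python
import numpy as np

# Simulated stand-ins for EARNINGS, S and EXP (hypothetical, not Data Set 21).
rng = np.random.default_rng(2)
n = 540
EXP = rng.uniform(0, 30, n)
S = 15 - 0.12 * EXP + rng.normal(0, 2.4, n)   # negatively correlated with EXP
EARNINGS = -26.5 + 2.68 * S + 0.56 * EXP + rng.normal(0, 12, n)

def resid(y, x):
    """Residuals from a simple regression of y on x (with an intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Purge EARNINGS and S of their EXP components (the 'predict ..., resid' step).
EEARN = resid(EARNINGS, EXP)
ES = resid(S, EXP)

# Slope of the regression of EEARN on ES (its intercept is zero by construction).
fwl_slope = (ES @ EEARN) / (ES @ ES)

# Slope of S in the full multiple regression.
X = np.column_stack([np.ones(n), S, EXP])
full_slope = np.linalg.lstsq(X, EARNINGS, rcond=None)[0][1]

print(np.isclose(fwl_slope, full_slope))  # True
```

The equality is exact, not approximate: partialling EXP out of both variables and then running a simple regression reproduces the multiple-regression coefficient, which is precisely what the theorem asserts.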
Finally, a small and not very important technical point. You may have noticed that the standard error and t statistic do not quite match. The reason for this is that the number of degrees of freedom is overstated by 1 in the residuals regression. That regression has not made allowance for the fact that we have already used up 1 degree of freedom in removing EXP from the model.
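The degrees-of-freedom claim can be checked with the figures from the two outputs: multiplying the residual-regression standard error by the square root of 538/537 should recover the multiple-regression standard error. A quick Python check (the two standard errors are those printed by Stata):

```python
import math

# Figures taken from the two Stata outputs above.
se_resid_reg = 0.2334325   # . reg EEARN ES        (df = 540 - 2 = 538)
se_multiple  = 0.2336497   # . reg EARNINGS S EXP  (df = 540 - 3 = 537)

# Rescaling to the correct degrees of freedom recovers the multiple-regression
# standard error (up to rounding in the printed output).
corrected = se_resid_reg * math.sqrt(538 / 537)
print(abs(corrected - se_multiple) < 1e-6)  # True
```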
A.1: The model is linear in parameters and correctly specified.
A.2: There does not exist an exact linear relationship among the regressors in the sample.
A.3: The disturbance term has zero expectation.
A.4: The disturbance term is homoscedastic.
A.5: The values of the disturbance term have independent distributions.
A.6: The disturbance term has a normal distribution.
PROPERTIES OF THE MULTIPLE REGRESSION COEFFICIENTS
Y = β1 + β2X2 + ... + βkXk + u
Moving from the simple to the multiple regression model, we start by restating the regression model assumptions. Only A.2 is different. Previously it was stated that there must be some variation in the X variable. We will explain the difference in one of the following lectures.
Provided that the regression model assumptions are valid, the OLS estimators in the multiple regression model are unbiased and efficient, as in the simple regression model.