TRANSCRIPT
Lecture 9
Today: Ch. 3: Multiple Regression Analysis • Example with two independent variables • Frisch-Waugh-Lovell theorem
© Christopher Dougherty 1999–2006
MULTIPLE REGRESSION WITH TWO EXPLANATORY VARIABLES: EXAMPLE
[Figure: three-dimensional diagram of the model, with axes EARNINGS, S, and EXP, and intercept b1 marked]
We’ll look at the geometrical interpretation of a multiple regression model with two explanatory variables.
Specifically, we will look at an earnings function model where hourly earnings, EARNINGS, depend on years of schooling (highest grade completed), S, and years of work experience, EXP.
The model has three dimensions, one each for EARNINGS, S, and EXP. The starting point for investigating the determination of EARNINGS is the intercept, b1.
Literally the intercept gives EARNINGS for those respondents who have no schooling and no work experience. However, there were no respondents with less than 6 years of schooling. Hence a literal interpretation of b1 would be unwise.
EARNINGS = b1 + b2S + b3EXP + u
The next term on the right side of the equation gives the effect of variations in S. A one year increase in S causes EARNINGS to increase by b2 dollars, holding EXP constant.
Similarly, the third term gives the effect of variations in EXP. A one year increase in EXP causes EARNINGS to increase by b3 dollars, holding S constant.
Different combinations of S and EXP give rise to values of EARNINGS which lie on the plane shown in the diagram, defined by EARNINGS = b1 + b2S + b3EXP.
This plane is the deterministic (nonstochastic, nonrandom) component of the model.
The final element of the model is the disturbance term, u. This causes the actual values of EARNINGS to deviate from the plane. In this observation, u happens to have a positive value.
A sample consists of a number of observations generated in this way. Note that the interpretation of the model does not depend on whether S and EXP are correlated or not.
However, we do assume that the effects of S and EXP on EARNINGS are additive: the impact of a difference in S on EARNINGS is not affected by the value of EXP, and vice versa.
Yi = β1 + β2X2i + β3X3i + ui

Ŷi = b1 + b2X2i + b3X3i

ei = Yi − Ŷi = Yi − b1 − b2X2i − b3X3i
The regression coefficients are derived using the same least squares principle used in simple regression analysis. The fitted value of Y in observation i depends on our choice of b1, b2, and b3.
The residual ei in observation i is the difference between the actual and fitted values of Y.
RSS = Σei² = Σ(Yi − b1 − b2X2i − b3X3i)²
We define RSS, the sum of the squares of the residuals, and choose b1, b2, and b3 so as to minimize it.
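The minimization can be illustrated directly. Below is a minimal Python sketch using simulated data with hypothetical coefficient values (not the lecture's Data Set 21 sample): it feeds RSS to a numerical optimizer and recovers estimates close to the true parameters.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data with hypothetical coefficient values (not Data Set 21).
rng = np.random.default_rng(0)
n = 200
X2 = rng.uniform(6, 20, n)           # e.g. years of schooling
X3 = rng.uniform(0, 30, n)           # e.g. years of work experience
Y = -26.0 + 2.7 * X2 + 0.56 * X3 + rng.normal(0, 5, n)

def rss(b):
    """Residual sum of squares for a candidate (b1, b2, b3)."""
    b1, b2, b3 = b
    e = Y - b1 - b2 * X2 - b3 * X3   # residuals implied by this choice of b
    return e @ e

result = minimize(rss, x0=np.zeros(3))
print(result.x)  # close to the true parameters (-26, 2.7, 0.56)
```

In practice no one minimizes RSS numerically like this; the point is only that the least squares coefficients are, by definition, the values at which RSS is smallest.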
RSS = ΣYi² + nb1² + b2²ΣX2i² + b3²ΣX3i²
      − 2b1ΣYi − 2b2ΣX2iYi − 2b3ΣX3iYi
      + 2b1b2ΣX2i + 2b1b3ΣX3i + 2b2b3ΣX2iX3i

∂RSS/∂b1 = 0
∂RSS/∂b2 = 0
∂RSS/∂b3 = 0
First we expand RSS as shown, and then we use the first-order conditions for minimizing it.
b1 = Ȳ − b2X̄2 − b3X̄3

b2 = [Σ(X2i − X̄2)(Yi − Ȳ) Σ(X3i − X̄3)² − Σ(X3i − X̄3)(Yi − Ȳ) Σ(X2i − X̄2)(X3i − X̄3)]
     / [Σ(X2i − X̄2)² Σ(X3i − X̄3)² − (Σ(X2i − X̄2)(X3i − X̄3))²]

We thus obtain three equations in three unknowns. Solving for b1, b2, and b3, we obtain the expressions shown above. (The expression for b3 is the same as that for b2, with the subscripts 2 and 3 interchanged everywhere.)
The expression for b1 is a straightforward extension of the expression for it in simple regression analysis.
However, the expressions for the slope coefficients are considerably more complex than that for the slope coefficient in simple regression analysis.
For the general case when there are many explanatory variables, ordinary algebra is inadequate. It is necessary to switch to matrix algebra.
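For the two-variable case, the closed-form expressions can still be checked numerically. The Python sketch below, using simulated data with hypothetical values (not Data Set 21), implements the formulas for b1, b2, and b3 and compares them with a general-purpose least squares solver.

```python
import numpy as np

# Simulated data with hypothetical values (not Data Set 21); X3 is made
# negatively correlated with X2, as EXP is with S in the lecture's sample.
rng = np.random.default_rng(1)
n = 300
X2 = rng.uniform(6, 20, n)
X3 = 30 - X2 + rng.normal(0, 3, n)
Y = -26.0 + 2.7 * X2 + 0.56 * X3 + rng.normal(0, 5, n)

# Deviations from the sample means.
x2, x3, y = X2 - X2.mean(), X3 - X3.mean(), Y - Y.mean()

# Slope formulas from the text; b3 is b2 with subscripts 2 and 3 interchanged.
den = (x2 @ x2) * (x3 @ x3) - (x2 @ x3) ** 2
b2 = ((x2 @ y) * (x3 @ x3) - (x3 @ y) * (x2 @ x3)) / den
b3 = ((x3 @ y) * (x2 @ x2) - (x2 @ y) * (x2 @ x3)) / den
b1 = Y.mean() - b2 * X2.mean() - b3 * X3.mean()

# Cross-check against a general-purpose least squares solver.
X = np.column_stack([np.ones(n), X2, X3])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose([b1, b2, b3], beta))  # True
```

The matrix solver is, of course, what is used in practice: it handles any number of regressors, which is why the general treatment requires matrix algebra.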
. reg EARNINGS S EXP
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =   67.54
       Model |  22513.6473     2  11256.8237           Prob > F      =  0.0000
    Residual |  89496.5838   537  166.660305           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  112010.231   539  207.811189           Root MSE      =   12.91

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------
Here is the regression output for the earnings function using Data Set 21.
EARNINGS-hat = −26.49 + 2.68 S + 0.56 EXP
It indicates that earnings increase by $2.68 for every extra year of schooling and by $0.56 for every extra year of work experience.
Literally, the intercept indicates that an individual who had no schooling or work experience would have hourly earnings of –$26.49.
Obviously, this is impossible. The lowest value of S in the sample was 6. We have obtained a nonsense estimate because we have extrapolated too far from the data range.
GRAPHING A RELATIONSHIP IN A MULTIPLE REGRESSION MODEL
Suppose that you were particularly interested in the relationship between EARNINGS and S and wished to represent it graphically, using the sample data.
A simple plot would be misleading.
[Figure: scatter diagram of hourly earnings ($) against years of schooling, with fitted regression line]
Schooling is negatively correlated with work experience. The plot fails to take account of this, and as a consequence the regression line underestimates the impact of schooling on earnings. This is an example of omitted variable bias. (Later, we'll discuss the mathematical details of this distortion.)
To eliminate the distortion, you purge both EARNINGS and S of their components related to EXP and then draw a scatter diagram using the purged variables.
. cor S EXP
(obs=540)

        |      S    EXP
--------+------------------
      S |  1.0000
    EXP | -0.2179  1.0000
. reg EARNINGS EXP
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =    2.98
       Model |  617.717488     1  617.717488           Prob > F      =  0.0847
    Residual |  111392.514   538  207.049282           R-squared     =  0.0055
-------------+------------------------------           Adj R-squared =  0.0037
       Total |  112010.231   539  207.811189           Root MSE      =  14.389

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |   .2414715   .1398002     1.73   0.085    -.0331497    .5160927
       _cons |   15.55527   2.442468     6.37   0.000     10.75732    20.35321
------------------------------------------------------------------------------
. predict EEARN, resid
We start by regressing EARNINGS on EXP, as shown above. The residuals are the part of EARNINGS which is not related to EXP. The ‘predict’ command is the Stata command for saving the residuals from the most recent regression. We name them EEARN.
. reg S EXP
      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   26.82
       Model |  152.160205     1  152.160205           Prob > F      =  0.0000
    Residual |  3052.82313   538  5.67439243           R-squared     =  0.0475
-------------+------------------------------           Adj R-squared =  0.0457
       Total |  3204.98333   539  5.94616574           Root MSE      =  2.3821

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |  -.1198454   .0231436    -5.18   0.000    -.1653083   -.0743826
       _cons |   15.69765   .4043447    38.82   0.000     14.90337    16.49194
------------------------------------------------------------------------------
. predict ES, resid
We do the same with S. We regress it on EXP and save the residuals as ES.
Now we plot EEARN on ES and the scatter is a faithful representation of the relationship, both in terms of the slope of the trend line (the black line) and in terms of the variation about that line.
As you would expect, the trend line is steeper than in the scatter diagram which did not control for EXP (reproduced here as the red line).
[Figure: scatter diagram of EEARN (earnings residuals) against ES (schooling residuals), with the purged-regression trend line in black and the simple-regression line in red]
. reg EEARN ES

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  131.63
       Model |  21895.9298     1  21895.9298           Prob > F      =  0.0000
    Residual |  89496.5833   538  166.350527           R-squared     =  0.1966
-------------+------------------------------           Adj R-squared =  0.1951
       Total |  111392.513   539  206.665145           Root MSE      =  12.898

------------------------------------------------------------------------------
       EEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ES |   2.678125   .2334325    11.47   0.000     2.219574    3.136676
       _cons |   8.10e-09   .5550284     0.00   1.000    -1.090288    1.090288
------------------------------------------------------------------------------
From multiple regression:
. reg EARNINGS S EXP

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------
Here is the regression of EEARN on ES.
We will content ourselves by verifying that the estimate of the slope coefficient is the same as that from a multiple regression. A mathematical proof that the technique works requires matrix algebra.
This result is also called the Frisch-Waugh-Lovell theorem.
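The verification can also be replicated outside Stata. A minimal Python sketch, with simulated data standing in for EARNINGS, S, and EXP (hypothetical numbers, not Data Set 21), carries out the purging steps and confirms that the residual-regression slope equals the multiple-regression slope.

```python
import numpy as np

# Simulated stand-ins for EARNINGS, S and EXP (hypothetical, not Data Set 21).
rng = np.random.default_rng(2)
n = 540
EXP = rng.uniform(0, 30, n)
S = 15 - 0.12 * EXP + rng.normal(0, 2.4, n)   # negatively correlated with EXP
EARNINGS = -26.5 + 2.68 * S + 0.56 * EXP + rng.normal(0, 12, n)

def resid(y, x):
    """Residuals from a simple regression of y on x (with an intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Purge EARNINGS and S of their EXP components (the 'predict ..., resid' step).
EEARN = resid(EARNINGS, EXP)
ES = resid(S, EXP)

# Slope of the regression of EEARN on ES (its intercept is zero by construction).
fwl_slope = (ES @ EEARN) / (ES @ ES)

# Slope of S in the full multiple regression.
X = np.column_stack([np.ones(n), S, EXP])
full_slope = np.linalg.lstsq(X, EARNINGS, rcond=None)[0][1]

print(np.isclose(fwl_slope, full_slope))  # True
```

The equality is exact, not approximate: partialling EXP out of both variables and then running a simple regression reproduces the multiple-regression coefficient, which is precisely what the theorem asserts.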
Finally, a small and not very important technical point. You may have noticed that the standard error and t statistic do not quite match. The reason for this is that the number of degrees of freedom is overstated by 1 in the residuals regression. That regression has not made allowance for the fact that we have already used up 1 degree of freedom in removing EXP from the model.
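The degrees-of-freedom claim can be checked with the figures from the two outputs: multiplying the residual-regression standard error by the square root of 538/537 should recover the multiple-regression standard error. A quick Python check (the two standard errors are those printed by Stata):

```python
import math

# Figures taken from the two Stata outputs above.
se_resid_reg = 0.2334325   # . reg EEARN ES        (df = 540 - 2 = 538)
se_multiple  = 0.2336497   # . reg EARNINGS S EXP  (df = 540 - 3 = 537)

# Rescaling to the correct degrees of freedom recovers the multiple-regression
# standard error (up to rounding in the printed output).
corrected = se_resid_reg * math.sqrt(538 / 537)
print(abs(corrected - se_multiple) < 1e-6)  # True
```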
A.1: The model is linear in parameters and correctly specified.
A.2: There does not exist an exact linear relationship among the regressors in the sample.
A.3: The disturbance term has zero expectation.
A.4: The disturbance term is homoscedastic.
A.5: The values of the disturbance term have independent distributions.
A.6: The disturbance term has a normal distribution.
PROPERTIES OF THE MULTIPLE REGRESSION COEFFICIENTS
Y = β1 + β2X2 + ... + βkXk + u
Moving from the simple to the multiple regression model, we start by restating the regression model assumptions. Only A.2 is different. Previously it was stated that there must be some variation in the X variable. We will explain the difference in one of the following lectures.
Provided that the regression model assumptions are valid, the OLS estimators in the multiple regression model are unbiased and efficient, as in the simple regression model.