TRANSCRIPT
Course Econometrics I
2. Multiple Regression Analysis: Further Issues
Martin Halla
Johannes Kepler University of Linz, Department of Economics
Last update: April 1, 2014
Effects of data scaling on OLS Statistics
Data scaling means changing the units of measurement of the dependent and the independent variables.
- Due to scaling, the estimated coefficients (∂y/∂x), standard errors, test statistics, etc. change in a way that preserves all measured effects and testing outcomes.
- Linear transformations do not change the fit of the regression.
- Data scaling is done for cosmetic purposes:
  - To improve interpretability
Effects of data scaling on OLS Statistics – An example I
Let’s study the effects of data scaling based on an example:
ˆbwght = β0 + β1 cigs + β2 faminc   (1)
bwght . . . child birth weight in ounces
cigs . . . no. of cigarettes smoked by the mother per day
faminc . . . annual family income in USD 1,000
Results are displayed in column (1) on the next slide.
Effects of data scaling on OLS statistics – An example II
>> see do-file 2-1.do <<
Effects of data scaling on OLS Statistics – An example III
Now suppose that we decide to measure birth weight in pounds, rather than in ounces:
ˆbwght/16 = (β0/16) + (β1/16) cigs + (β2/16) faminc   (2)
- New coefficients are the old coefficients divided by 16.
- New standard errors are the old ones divided by 16.
- t-statistics are unchanged.
- R-squared is unchanged.
- SSR has to be divided by 16²; the SER by 16.
Results are displayed in column (2) on the previous slide.
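This rescaling can be verified directly in Stata; a minimal sketch, assuming the BWGHT data used in do-file 2-1.do are in memory (variables bwght, cigs, faminc as defined above):
* measure birth weight in pounds instead of ounces (1 pound = 16 ounces)
gen bwghtlbs = bwght/16
* original specification (1)
reg bwght cigs faminc
* rescaled LHS, specification (2): coefficients and s.e. are the old ones divided by 16;
* t-statistics and R-squared are unchanged
reg bwghtlbs cigs faminc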
Effects of data scaling on OLS Statistics – An example IV
Now suppose that, compared to (1), we change the measurement of cigarettes to packs = cigs/20:
ˆbwght = β0 + (20β1)(cigs/20) + β2 faminc
       = β0 + (20β1) packs + β2 faminc   (3)
- β0 and β2 are unchanged.
- The coefficient and standard error on packs are 20 times as large as those on cigs.
- t-statistics, R-squared, SSR and SER are unchanged.
Results are displayed in column (3) on the penultimate slide.
Effects of data scaling on OLS Statistics – Log. forms
- If the LHS var appears in log form, changing its units of measurement has no impact on the slope coefficients.
- This follows from the fact that log(c1 · yi) = log(c1) + log(yi) for any c1 > 0.
- The new intercept will be log(c1) + β0.
- The same holds for RHS vars that appear in log form: only the intercept changes.
Scaling in the birth weight model
--------------------------------------------------------------------------------
    Variable |       (1)        (2)        (3)         (4)            (5)
             |     bwght   bwghtlbs      bwght  log(bwght)  log(bwghtlbs)
-------------+------------------------------------------------------------------
        cigs |   -0.4634    -0.0290               -0.0040        -0.0040
             |    0.0916     0.0057                0.0009         0.0009
      faminc |    0.0928     0.0058     0.0928     0.0008         0.0008
             |    0.0292     0.0018     0.0292     0.0003         0.0003
       packs |                         -9.2682
             |                          1.8315
       _cons |  116.9741     7.3109   116.9741     4.7440         1.9714
             |    1.0490     0.0656     1.0490     0.0098         0.0098
-------------+------------------------------------------------------------------
          r2 |    0.0298     0.0298     0.0298     0.0265         0.0265
         rss |  5.57e+05  2177.6778   5.57e+05    49.0862        49.0862
        rmse |   20.0628     1.2539    20.0628     0.1883         0.1883
--------------------------------------------------------------------------------
legend: b/se
Beta coefficients – I
A key var may be measured on a scale that is hard to interpret.
- Often the case in other disciplines, e.g., test scores, indices, or responses to attitudinal questions
  - PISA test, trust, or political freedom
- Instead of asking by how much the LHS var changes if the (test) score were 10 points higher, we can ask what happens if the score is one standard deviation (sd) higher.
- The sample sd of all variables can easily be obtained.
- Therefore, we standardize the variables (see Appendix C):
  - Subtract off the sample mean and divide by the sample sd.
  - This generates vars with a mean of zero and an sd of one.
  - These are called z-scores.
Beta coefficients – II
We want to standardize all variables in the following original model:
yi = β0 + β1xi1 + β2xi2 + . . .+ βkxik + ui
(i) subtract means of all variables:
yi − ȳ = β1(xi1 − x̄1) + β2(xi2 − x̄2) + . . . + βk(xik − x̄k) + (ui − 0)
(ii) divide each variable by its sample sd (σy and σk)
(yi − ȳ)/σy = (σ1/σy)β1[(xi1 − x̄1)/σ1] + . . . + (σk/σy)βk[(xik − x̄k)/σk] + (ui/σy)
- Since we divide the LHS by σy → the coefficients are divided by σy.
- Since we divide each RHS var by its σk → its coefficient is multiplied by σk.
- The constant is zero by construction.
Beta coefficients – III
The last equation (without i subscripts) can be re-written as follows:
zy = b1z1 + b2z2 + . . .+ bkzk + error (4)
where zy denotes the z-score of y, z1 the z-score of x1, and so on. The new coefficients are
bj = (σj/σy) βj   for j = 1, . . . , k.   (5)
These bj are either called beta coefficients or standardized coeffs.
- Interpretation: if x1 increases by one sd, then y changes by b1 sds.
- Since we measure all variables in sds, the scale is irrelevant.
- The most important RHS var can easily be identified.
- See Example 6.1 and >> do-file 2-2.do <<.
Calculation of beta coefficients in Stata
. reg price nox crime rooms dist stratio, beta
Source | SS df MS Number of obs = 506
-------------+------------------------------ F( 5, 500) = 174.47
Model | 2.7223e+10 5 5.4445e+09 Prob > F = 0.0000
Residual | 1.5603e+10 500 31205611.6 R-squared = 0.6357
-------------+------------------------------ Adj R-squared = 0.6320
Total | 4.2826e+10 505 84803032 Root MSE = 5586.2
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
nox | -2706.433 354.0869 -7.64 0.000 -.340446
crime | -153.601 32.92883 -4.66 0.000 -.1432828
rooms | 6735.498 393.6037 17.11 0.000 .5138878
dist | -1026.806 188.1079 -5.46 0.000 -.2348385
stratio | -1149.204 127.4287 -9.02 0.000 -.2702799
_cons | 20871.13 5054.599 4.13 0.000 .
------------------------------------------------------------------------------
>> do-file 2-2.do <<
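The beta coefficients in the last column can also be reproduced by standardizing all variables by hand; a minimal sketch, assuming the HPRICE2 data used above are in memory:
* z-scores: subtract the sample mean and divide by the sample sd
egen zprice = std(price)
egen znox = std(nox)
egen zcrime = std(crime)
egen zrooms = std(rooms)
egen zdist = std(dist)
egen zstratio = std(stratio)
* the slopes equal the Beta column above; the intercept is zero by construction
reg zprice znox zcrime zrooms zdist zstratio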
More on using logarithmic functional forms – I
Summary of functional forms involving logarithms
Model         Dependent Var.   Independent Var.   Interpretation of β1
Level-level   y                x                  Δy = β1·Δx
Level-log     y                log(x)             Δy = (β1/100)·%Δx
Log-level     log(y)           x                  %Δy = (100β1)·Δx
Log-log       log(y)           log(x)             %Δy = β1·%Δx
More on using logarithmic functional forms – II
Let’s consider the following equation
log(price) = β0 + β1 log(nox) + β2 rooms + u.   (6)
- β1 is the elasticity of price with respect to nox (the NOx pollution).
- β2 is the change in log(price) when Δrooms = 1.
- β2 is also called a semi-elasticity.
- (100β2)Δx2 is the approximate percentage change in price.
- 100[exp(β2Δx2) − 1] is the exact percentage change in price.
More on using logarithmic functional forms – III
When estimated using the data in HPRICE2, we obtain
ˆlog(price) = 9.23 − 0.718 log(nox) + 0.306 rooms
n = 506, R2 = 0.514
- When nox increases by 1%, price falls by 0.718%, holding rooms fixed.
- When rooms increases by one, price increases by approximately 100(0.306) = 30.6%.
- The exact percentage change in price is 100[exp(0.306) − 1] = 35.8%.
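The approximate and the exact percentage change can be verified directly; a sketch, assuming the HPRICE2 data are in memory and the logs are generated as below:
gen lprice = log(price)
gen lnox = log(nox)
reg lprice lnox rooms
* approximate percentage change in price for one additional room
display 100*_b[rooms]
* exact percentage change in price for one additional room
display 100*(exp(_b[rooms]) - 1)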
More on using logarithmic functional forms – IV
Why do so many econometric models utilize logs?
- Log models often more closely satisfy our assumptions.
- Many econ vars are constrained to be positive.
- Taking logs compresses extreme values and curtails the effects of outliers.
- Ratios are often left in levels:
  - They could be expressed in logs.
  - Something like an unemployment rate already has a nice percentage interpretation.
  - Be careful to distinguish between a 0.01 change and a one-unit change.
Models with quadratics – I
- The following model is linear in parameters:
y = β0 + β1x1 + β2x1² + β3x3 + . . . + βkxk + ε.   (7)
- However, the squared term x1² allows more flexibility in the modeling of the relationship between x1 and y.
- β1 and β2 have to be interpreted jointly; if x1 changes, x1² changes too.
- The marginal effect of x1 on y is given by
∂y/∂x1 = β1 + 2β2x1.   (8)
- The effect of x1 on y therefore depends on the level of x1.
Models with quadratics – II
Different combinations of β1 and β2 result in different functional forms:
- β1 < 0, β2 > 0: U-shaped relationship
- β1 > 0, β2 < 0: inverted U-shaped relationship
- β1 > 0, β2 > 0: y increases quadratically in x1
- β1 < 0, β2 < 0: y decreases quadratically in x1
The extremum/turning point is at the value of x1 given by:
∂y/∂x1 = β1 + 2β2x1 = 0  ⇒  x1* = −β1/(2β2).   (9)
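A sketch of how the marginal effect (8) and the turning point (9) can be computed after estimating a quadratic; the variables y and x1 are hypothetical placeholders:
gen x1sq = x1^2
reg y x1 x1sq
* marginal effect of x1 evaluated at, e.g., x1 = 10: beta1 + 2*beta2*10
lincom x1 + 20*x1sq
* turning point -beta1/(2*beta2), with a delta-method standard error
nlcom -_b[x1]/(2*_b[x1sq])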
Models with quadratics – An example
See equation (6.12).
Models with quadratics – A further example
See Example 6.2.
Models with higher order polynomials
Higher order polynomials:
y = β0 + β1x1 + β2x1² + β3x1³ + ε.   (10)
The marginal effect of x1 on y is given by:
∂y/∂x1 = β1 + 2β2x1 + 3β3x1².   (11)
Not as commonly found in empirical work.
Models with interaction terms
Interaction terms are the product of two (or more) RHS variables:
- This technique allows for nonlinearities.
- For instance, consider the determinants of housing prices
price = β0 + β1 sqrft + β2 bdrms + β3 sqrft·bdrms + β4 bthrms + u,
where the partial effect of bdrms on price is given by
∂price/∂bdrms = β2 + β3·sqrft.
- That means that if β3 > 0, an additional bedroom yields a higher increase in housing price for larger houses.
- We must evaluate this effect at interesting values of sqrft.
- The respective standard errors can easily be obtained by reparameterizing the model.
- The same applies to the partial effect of sqrft on price.
Interaction terms – Reparameterization (general case) I
Let's consider the following model including an interaction term:
y = β0 + β1x1 + β2x2 + β3x1x2 + u.
Here β2 gives the partial effect of x2 on y when x1 = 0.
Often, this is not of interest. Therefore, we reparameterize the model:
y = α0 + α1x1 + δ2x2 + β3(x1 − µ1)x2 + u,
where µ1 is the population mean of x1. After rearranging we get:
y = α0 + α1x1 + δ2x2 + β3x1x2 − β3x2µ1 + u.
We can check (see next slide) that δ2 is the partial effect of x2 on y at the mean value of x1:
δ2 = β2 + β3µ1.
Interaction terms – Reparameterization (general case) II
Original model:
y = β0 + β1x1 + β2x2 + β3x1x2 + u
∂y/∂x2 = β2 + β3x1
Reparameterized model:
y = α0 + α1x1 + δ2x2 + β3(x1 − µ1)x2 + u
y = α0 + α1x1 + δ2x2 + β3x1x2 − β3x2µ1 + u
∂y/∂x2 = δ2 + β3x1 − β3µ1
Equating the two partial effects:
β2 + β3x1 = δ2 + β3x1 − β3µ1  ⇒  δ2 = β2 + β3µ1
Interaction terms – Reparameterization (Example 6.3) I
gen priGPA2 = priGPA^2
gen ACT2 = ACT^2
gen interaction = priGPA*atndrte
reg stndfnl atndrte priGPA ACT priGPA2 ACT2 interaction
Source | SS df MS Number of obs = 680
-------------+------------------------------ F( 6, 673) = 33.25
Model | 152.001001 6 25.3335002 Prob > F = 0.0000
Residual | 512.76244 673 .761905557 R-squared = 0.2287
-------------+------------------------------ Adj R-squared = 0.2218
Total | 664.763441 679 .97903305 Root MSE = .87287
------------------------------------------------------------------------------
stndfnl | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
atndrte | -.0067129 .0102321 -0.66 0.512 -.0268035 .0133777
priGPA | -1.62854 .4810025 -3.39 0.001 -2.572986 -.6840938
ACT | -.1280394 .098492 -1.30 0.194 -.3214279 .0653492
priGPA2 | .2959046 .1010495 2.93 0.004 .0974945 .4943147
ACT2 | .0045334 .0021764 2.08 0.038 .00026 .0088068
interaction | .0055859 .0043174 1.29 0.196 -.0028913 .0140631
_cons | 2.050293 1.360319 1.51 0.132 -.6206864 4.721272
------------------------------------------------------------------------------
test atndrte interaction
( 1) atndrte = 0
( 2) interaction = 0
F( 2, 673) = 4.32
Prob > F = 0.0137
Interaction terms – Reparameterization (Example 6.3) II
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
priGPA | 680 2.586775 .5447141 .857 3.93
display _b[atndrte] + _b[interaction]*2.59
0.00775457
* This is the estimated effect of atndrte on stndfnl at the mean of priGPA
* A 10 percentage point increase in atndrte increases stndfnl by 0.078 s.d. from the mean final exam score.
* Is this effect statistically significant from zero?
gen priGPA_mean = priGPA-2.59
gen interaction2 = priGPA_mean* atndrte
reg stndfnl atndrte priGPA ACT priGPA2 ACT2 interaction2
------------------------------------------------------------------------------
stndfnl | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
atndrte | .0077546 .0026393 2.94 0.003 .0025723 .0129368
priGPA | -1.62854 .4810025 -3.39 0.001 -2.572986 -.6840938
ACT | -.1280394 .098492 -1.30 0.194 -.3214279 .0653492
priGPA2 | .2959046 .1010495 2.93 0.004 .0974945 .4943147
ACT2 | .0045334 .0021764 2.08 0.038 .00026 .0088068
interaction2 | .0055859 .0043174 1.29 0.196 -.0028913 .0140631
_cons | 2.050293 1.360319 1.51 0.132 -.6206863 4.721272
------------------------------------------------------------------------------
* Check the estimated coefficient (and standard errors) of atndrte.
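The reparameterization is convenient, but the same point estimate and standard error can also be obtained from the original regression with lincom; a sketch based on Example 6.3 above:
* partial effect of atndrte at priGPA = 2.59, from the non-reparameterized model
reg stndfnl atndrte priGPA ACT priGPA2 ACT2 interaction
lincom atndrte + 2.59*interaction
* matches the atndrte coefficient (0.0078, t = 2.94) in the reparameterized regression above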
Adjusted R-squared
- R-squared cannot decrease when additional RHS vars are added to the model, even if these have no significant effect on the LHS var.
- Judged by R-squared alone, a richer model will always be preferred over a more parsimonious one.
- The adjusted R-squared imposes a penalty for an additional RHS var:
R̄2 = 1 − [SSR/(n − k − 1)] / [SST/(n − 1)]   (12)
- R̄2 increases iff the absolute t statistic on the new var is greater than 1.
- R̄2 can be negative.
- Equivalent formula: R̄2 = 1 − (1 − R2)(n − 1)/(n − k − 1)
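Both formulas can be checked after any regression from Stata's stored results; a sketch, assuming the GPA2 data used later in this chapter (e(df_r) equals n − k − 1, and SST = e(mss) + e(rss)):
reg colgpa sat hsperc hsize
* adjusted R-squared via formula (12)
display 1 - (e(rss)/e(df_r)) / ((e(mss) + e(rss))/(e(N) - 1))
* adjusted R-squared via 1 - (1 - R2)(n - 1)/(n - k - 1)
display 1 - (1 - e(r2))*(e(N) - 1)/e(df_r)
* both match the Adj R-squared reported by regress
display e(r2_a)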
Using R̄2 to choose between nonnested models
The R̄2 can help us to choose a model without redundant variables.
- Models are nonnested if none is a special case of the other(s).
- For instance,
wage = β0 + β1 educ + β2 female + β3 weight   (13)
wage = β0 + β1 educ + β2 female + β3 height   (14)
- R̄2 may be used to make informal comparisons of nonnested models, as long as they have the same dependent variable.
- For the limitations of the R̄2, see Example 6.4.
Controlling for too many factors in regression analysis
- MLR.3 shows that we should worry about omitted vars.
- It is also possible to control for too many vars.
- Do not overemphasize goodness-of-fit.
- Let's assume we want to study the effect of state beer taxes on traffic fatalities:
fatalities = β0 + β1 tax + β2 perc male + . . .   (15)
fatalities = β0 + β1 tax + β2 perc male + β3 beer cons + . . .   (16)
- Shall we control for beer consumption (beer cons)?
  - Remember the ceteris paribus nature of multiple regression.
  - Beer consumption is a so-called bad control var.
Adding regressors to reduce the error variance
Additional RHS vars:
1. exacerbate the multicollinearity problem.
2. reduce the error variance.
- Generally, we do not know which effect will dominate.
- However, always include RHS vars that affect the LHS var and are uncorrelated with all other RHS vars.
- Why? Because such RHS vars
  - will not cause multicollinearity; and
  - reduce the error variance and the standard errors (s.e.).
  - The issue is not unbiasedness, but smaller sampling variance!
  - Think about a randomized controlled trial.
Confidence intervals (CI) for predictions
- Predictions for a specific subpopulation:
  1. Sampling error in ˆy0, because the ˆβj are estimated.
- Predictions for a particular unit:
  1. Sampling error in ˆy0, because the ˆβj are estimated.
  2. Variance of the error in the population, σ².
CI for prediction for a specific subpopulation – I
Fitted values or predictions are subject to sampling variation:
- Suppose we have estimated ˆy = ˆβ0 + ˆβ1x1 + ˆβ2x2 + . . . + ˆβkxk.
- To obtain a prediction we plug in particular values c1, c2, . . . , ck for each of the k RHS vars.
- The parameter we would like to estimate is:
θ0 = β0 + β1c1 + β2c2 + . . . + βkck   (17)
θ0 = E(y|x1 = c1, x2 = c2, . . . , xk = ck)
- The estimator of θ0 is
ˆθ0 = ˆβ0 + ˆβ1c1 + ˆβ2c2 + . . . + ˆβkck.
- To obtain a confidence interval (CI) for θ0, we need the s.e. of ˆθ0.
CI for prediction for a specific subpopulation – II
- We re-write (17) as β0 = θ0 − β1c1 − . . . − βkck and plug this into
y = β0 + β1x1 + β2x2 + . . . + βkxk + u
to obtain
y = θ0 + β1(x1 − c1) + β2(x2 − c2) + . . . + βk(xk − ck) + u
- In other words, we subtract the value cj from each observation on xj, and then run the regression of
yi on (xi1 − c1), . . . , (xik − ck), i = 1, 2, . . . , n.
- The predicted value and its s.e. are obtained from the intercept.
- The variance of the prediction is smallest at the mean values of the xj.
- See Example 6.5 and >> see do-file 2-4.do <<
CI for prediction for a specific subpopulation – III
gen sat0 = sat-1200
gen hsperc0 = hsperc-30
gen hsize0 = hsize-5
gen hsize20 = hsize2-25
reg colgpa sat0 hsperc0 hsize0 hsize20
Source | SS df MS Number of obs = 4137
-------------+------------------------------ F( 4, 4132) = 398.02
Model | 499.030503 4 124.757626 Prob > F = 0.0000
Residual | 1295.16517 4132 .313447524 R-squared = 0.2781
-------------+------------------------------ Adj R-squared = 0.2774
Total | 1794.19567 4136 .433799728 Root MSE = .55986
------------------------------------------------------------------------------
colgpa | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sat0 | .0014925 .0000652 22.89 0.000 .0013646 .0016204
hsperc0 | -.0138558 .000561 -24.70 0.000 -.0149557 -.0127559
hsize0 | -.0608815 .0165012 -3.69 0.000 -.0932327 -.0285302
hsize20 | .0054603 .0022698 2.41 0.016 .0010102 .0099104
_cons | 2.700075 .0198778 135.83 0.000 2.661104 2.739047
------------------------------------------------------------------------------
- The 95% confidence interval for the expected college GPA is about 2.66 to 2.74.
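The same predicted value and CI can also be obtained without centering, by running the regression in the original variables and using lincom; a sketch (hsize2 denotes hsize squared, as in do-file 2-4.do):
reg colgpa sat hsperc hsize hsize2
* expected college GPA at sat = 1200, hsperc = 30, hsize = 5
lincom _cons + 1200*sat + 30*hsperc + 5*hsize + 25*hsize2
* point estimate, s.e., and 95% CI match the intercept of the centered regression above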
CI for prediction for a particular unit – I
- We have just derived the CI of the prediction for the subpopulation with a given set of RHS vars.
- This is different from the CI for a particular unit (e.g., individual, firm, country).
- Here, we must also account for the variance in the unobserved error.
- On average, that error is assumed to be zero; that is, E(u) = 0.
- For a specific value of y, there will be an error ui; we do not know its magnitude.
CI for prediction for a particular unit – V
- Let y0 denote the value for which we would like to construct a prediction interval.
- y0 could represent a unit not in our original sample.
- Let x01, . . . , x0k be the values of the RHS vars we assume to observe, and u0 be the unobserved error.
- Therefore, we have
y0 = β0 + β1x01 + β2x02 + . . . + βkx0k + u0.
- As before, our best prediction of y0 is
ˆy0 = ˆβ0 + ˆβ1x01 + ˆβ2x02 + . . . + ˆβkx0k.
- The prediction error (with E(ˆe0) = 0) is given by
ˆe0 = y0 − ˆy0 = (β0 + β1x01 + . . . + βkx0k) + u0 − ˆy0,
where the term in parentheses plus u0 equals y0.
CI for prediction for a particular unit – VI
- To find the variance of ˆe0, note that u0 is uncorrelated with each ˆβj.
- Therefore, the variance of the prediction error is the sum of the variances:
Var(ˆe0) = Var(ˆy0) + Var(u0) = Var(ˆy0) + σ²,
where Var(u0) is the error variance σ². [See (B.31)]
- There are two sources of variation in ˆe0:
  1. Sampling error in ˆy0, because the ˆβj are estimated.
  2. The variance of the error in the population, σ².
- The s.e. of ˆe0 is given by se(ˆe0) = {[se(ˆy0)]² + ˆσ²}^(1/2).
- Due to the second term, CIs formed for specific values of y will always be wider than those for predictions of the mean y.
CI for prediction for a particular unit – VII
- Using the same reasoning as for the t statistics of the ˆβj, we find that ˆe0/se(ˆe0) has a t distribution with n − (k + 1) degrees of freedom:
P[−t0.025 ≤ ˆe0/se(ˆe0) ≤ t0.025] = 0.95,
where t0.025 is the 97.5th percentile in the t(n−k−1) distribution.
- Plugging in ˆe0 = y0 − ˆy0 and rearranging gives a 95% prediction interval for y0:
ˆy0 ± t0.025 · se(ˆe0).
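In Stata, se(ˆe0) is available through predict's stdf option (standard error of the forecast); a sketch for the college GPA model, assuming the GPA2 data are in memory:
reg colgpa sat hsperc hsize hsize2
predict yhat, xb
predict sef, stdf
* 95% prediction interval for an individual outcome, observation by observation
gen pi_low  = yhat - invttail(e(df_r), 0.025)*sef
gen pi_high = yhat + invttail(e(df_r), 0.025)*sef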
CI for prediction for a particular unit – VIII
See Example 6.6:
- ˆy0 ± 1.96 · se(ˆe0), with se(ˆe0) = {0.02² + 0.56²}^(1/2) ≈ 0.56.
Source | SS df MS Number of obs = 4137
-------------+------------------------------ F( 4, 4132) = 398.02
Model | 499.030503 4 124.757626 Prob > F = 0.0000
Residual | 1295.16517 4132 .313447524 R-squared = 0.2781
-------------+------------------------------ Adj R-squared = 0.2774
Total | 1794.19567 4136 .433799728 Root MSE = .55986
------------------------------------------------------------------------------
colgpa | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sat0 | .0014925 .0000652 22.89 0.000 .0013646 .0016204
hsperc0 | -.0138558 .000561 -24.70 0.000 -.0149557 -.0127559
hsize0 | -.0608815 .0165012 -3.69 0.000 -.0932327 -.0285302
hsize20 | .0054603 .0022698 2.41 0.016 .0010102 .0099104
_cons | 2.700075 .0198778 135.83 0.000 2.661104 2.739047
------------------------------------------------------------------------------
Residual analysis
OLS residuals are often calculated and analyzed after an estimation:
- Purely technical: the residuals may be used to test the validity of several assumptions.
- Systematic behavior in their magnitude, or in their dispersion, would cast doubt on the OLS results.
  - When plotted, do they appear systematic?
  - Does their dispersion appear to be constant, or is it larger for some RHS var values than for others?
- Residual analysis can also show whether particular units have predicted values that are well above or well below the actual outcome.
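A minimal residual-analysis sketch after any regression (y, x1, x2 are hypothetical placeholders):
reg y x1 x2
predict uhat, resid
predict yhat, xb
* plot residuals against fitted values; look for systematic patterns or non-constant dispersion
scatter uhat yhat
* units with the largest positive and negative residuals
gsort -uhat
list y yhat uhat in 1/5
gsort uhat
list y yhat uhat in 1/5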
Predicting y when log(y) is the dependent variable
- ˆy ≠ exp[ˆlog(y)]
- ˆy = ˆα0 exp[ˆlog(y)], where α0 = E[exp(u)]
1. Obtain the fitted values, ˆlog(yi), and residuals, ˆui, from the regression of log(y) on x1, . . . , xk.
2. For each obs. i, create ˆmi = exp[ˆlog(yi)].
3. Regress y on ˆmi without a constant; the coefficient on ˆmi is the estimate of α0.
4. For given values of x1, . . . , xk, obtain ˆlog(y).
5. Plug ˆα0 and ˆlog(y) into the equation above.
See Example 6.7
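A sketch of steps 1–5 in Stata; y, x1, x2 are hypothetical placeholders, and the prediction is evaluated at x1 = 5, x2 = 10:
* step 1: regression in logs; fitted values and residuals
gen logy = log(y)
reg logy x1 x2
predict logyhat, xb
* step 4 (done here while the log regression is still in memory): fitted log(y) at the chosen values
scalar logy0 = _b[_cons] + _b[x1]*5 + _b[x2]*10
* step 2: m_i = exp(fitted value)
gen m = exp(logyhat)
* step 3: regression through the origin; the slope is the estimate of alpha0
reg y m, noconstant
* step 5: predicted y at the chosen values
display _b[m]*exp(logy0)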