ETC 2410 Notes


Introductory Econometrics

Contents

1 Review of simple regression
  1.1 The Sample Regression Function
  1.2 Interpretation of regression as prediction
  1.3 Regression in Eviews
  1.4 Goodness of fit
  1.5 Derivations
    1.5.1 Summation notation
    1.5.2 Derivation of OLS
    1.5.3 Properties of predictions and residuals

2 Statistical Inference and the Population Regression Function
  2.1 Simple random sample
  2.2 Population distributions and parameters
  2.3 Population vs Sample
  2.4 Conditional Expectation
  2.5 The Population Regression Function
  2.6 Statistical Properties of OLS
    2.6.1 Properties of Expectations
    2.6.2 Unbiasedness
    2.6.3 Variance
    2.6.4 Asymptotic normality
  2.7 Summary

3 Hypothesis Testing and Confidence Intervals
  3.1 Hypothesis testing
    3.1.1 The null hypothesis
    3.1.2 The alternative hypothesis
    3.1.3 The null distribution
    3.1.4 The alternative distribution
    3.1.5 Decision rules and the significance level
    3.1.6 The t test – theory
    3.1.7 The t test – two sided example
    3.1.8 The t test – one sided example
    3.1.9 p-values
    3.1.10 Testing other null hypotheses
  3.2 Confidence intervals
  3.3 Prediction intervals
    3.3.1 Derivations

4 Multiple Regression
  4.1 Population Regression Function
  4.2 Sample Regression Function and OLS
  4.3 Example: house price modelling
  4.4 Statistical Inference
  4.5 Applications to house price regression
  4.6 Joint hypothesis tests
  4.7 Multicollinearity
    4.7.1 Perfect multicollinearity
    4.7.2 Imperfect multicollinearity

5 Dummy Variables
  5.1 Estimating two means
  5.2 Estimating several means
  5.3 Dummy variables in general regressions
    5.3.1 Dummies for intercepts
    5.3.2 Dummies for slopes

6 Some non-linear functional forms
  6.1 Quadratic regression
    6.1.1 Example: wages and work experience
  6.2 Regression with logs – explanatory variable
    6.2.1 Example: wages and work experience
  6.3 Regression with logs – dependent variable
    6.3.1 Example: modelling the log of wages
    6.3.2 Choosing between levels and logs for the dependent variable
  6.4 Practical summary of functional forms

7 Comparing regressions
  7.1 Adjusted R2
  7.2 Information criteria
  7.3 Adjusted R2 as an IC*

8 Functional form

9 Regression and Causality
  9.1 Notation
  9.2 Regression for prediction
  9.3 Omitted variables
  9.4 Simultaneity
  9.5 Sample selection

10 Regression with Time Series
  10.1 Dynamic regressions
    10.1.1 Finite Distributed Lag model
    10.1.2 Autoregressive Distributed Lag model
    10.1.3 Forecasting
    10.1.4 Application
  10.2 OLS estimation
    10.2.1 Bias
    10.2.2 A general theory for time series regression
  10.3 Checking weak dependence
  10.4 Model specification
  10.5 Interpretation
    10.5.1 Interpretation of FDL models
    10.5.2 Interpretation of ARDL models

11 Regression in matrix notation
  11.1 Definitions
  11.2 Addition and Subtraction
  11.3 Multiplication
  11.4 The PRF
  11.5 Matrix Inverse
  11.6 OLS in matrix notation
    11.6.1 Proof*
  11.7 Unbiasedness of OLS
  11.8 Time series regressions

1 Review of simple regression

1.1 The Sample Regression Function

Regression is the primary statistical tool used in econometrics to understand the relationship between variables. To illustrate, consider the dataset introduced in Example 2.3 of Wooldridge for relating the salary paid to corporate chief executive officers to the return on equity achieved by their firms. Data is available for 209 firms. The idea is to examine whether the salaries paid to CEOs are related to the earnings of their firms, and specifically whether firms with higher incomes reward their CEOs with higher salaries. A scatter plot of the possible relationship is shown in Figure 1, which suggests that higher returns on equity may correspond to higher CEO salaries, but with some apparently high salaries for a small number of CEOs also included (these are known as outliers, to be discussed later).

A regression line can be fit to this data using the method of Ordinary Least Squares (OLS), as shown in Figure 2. The OLS method works as follows. The dependent variable for the regression is denoted y_i, where the subscript i refers to the number of the observation for i = 1, ..., n. In the example we have n = 209 and y_i corresponds to the CEO salary for each of the 209 firms. The explanatory variable, or regressor, is denoted x_i for i = 1, ..., n and corresponds to the Return on Equity for each of the 209 firms. The data are shown in Table 1. The first observation in the dataset is y_1 = 1095 and x_1 = 14.10, meaning that the CEO of the first firm earns $1,095,000 and the firm's Return on Equity is 14.10%. The second observation is y_2 = 1001 and x_2 = 10.90, and so on; the last observation is y_209 = 626 and x_209 = 14.40.

The regression line is a linear function of x_i that is used to calculate a prediction of y_i, denoted \hat{y}_i. This regression line is expressed

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \qquad i = 1, \dots, n.    (1)

This is called the Sample Regression Function (SRF). The "hat" on top of any quantity implies that it is a prediction or an estimate that is calculated from the data. The method of OLS is used to calculate \hat{\beta}_0 and \hat{\beta}_1, respectively the intercept and the slope of the regression line. The prediction errors, or regression residuals, are denoted

\hat{u}_i = y_i - \hat{y}_i, \qquad i = 1, \dots, n,    (2)


and OLS chooses the values of \hat{\beta}_0 and \hat{\beta}_1 such that the overall residuals (\hat{u}_1, \dots, \hat{u}_n) are minimised, in the sense that the Sum of Squared Residuals (SSR)

SSR = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

is as small as possible. This is the sense in which the OLS regression line is known as the line of best fit.

The formulae for \hat{\beta}_0 and \hat{\beta}_1 are given by

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},    (3)

and

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},    (4)

where \bar{y} and \bar{x} are the sample means

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.

The derivations are given below. For the CEO salary data, the coefficients of the regression line can be calculated to be \hat{\beta}_0 = 963.191 and \hat{\beta}_1 = 18.501, so the regression line can be written

\hat{y}_i = 963.191 + 18.501 x_i,

or equivalently using the names of the variables:

\widehat{salary}_i = 963.191 + 18.501 \, RoE_i.

The interpretation of this regression line is that it gives a prediction of CEO salary in terms of the return on equity of the firm. For example, for the first firm the predicted salary on the basis of return on equity is

\hat{y}_1 = 963.191 + 18.501 x_1 = 963.191 + 18.501 \times 14.10 = 1224.1,

or $1,224,100, and the residual is

\hat{u}_1 = y_1 - \hat{y}_1 = 1095 - 1224.1 = -129.1,

or -$129,100. That is, the CEO of the first company in the dataset is earning $129,100 less than predicted by the firm's return on equity. Table 2 gives some of the values of \hat{y}_i and \hat{u}_i corresponding to the values of y_i and x_i given in Table 1.
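To make formulas (3) and (4) concrete, here is a minimal Python sketch (an illustration added for these notes, not part of the original analysis) that applies them to the five observations listed in Table 1. With only five observations the coefficients will of course differ from the full-sample values; applying the same formulas to all 209 observations reproduces 963.191 and 18.501.

import numpy as np

# The five observations reported in Table 1 (illustrative subset only)
x = np.array([14.10, 10.90, 23.50, 13.70, 14.40])     # Return on Equity (%)
y = np.array([1095.0, 1001.0, 1122.0, 555.0, 626.0])  # CEO salary ($000s)

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # equation (3)
beta0_hat = y_bar - beta1_hat * x_bar                                     # equation (4)

y_hat = beta0_hat + beta1_hat * x   # predictions from the SRF
u_hat = y - y_hat                   # residuals
print(beta0_hat, beta1_hat)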


Figure 1: Scatter plot of CEO salaries against Return on Equity

Figure 2: CEO salaries vs Return on Equity with OLS regression line


Table 1: Data on CEO salaries and Return on Equity

Observation (i)   Salary (y_i)   Return on Equity (x_i)
1                 1095           14.10
2                 1001           10.90
3                 1122           23.50
...               ...            ...
208               555            13.70
209               626            14.40

Table 2: CEO salaries and Return on Equity, with regression predictions and residuals

Observation (i)   Salary (y_i)   Return on Equity (x_i)   Predicted Salary (\hat{y}_i)   Residual (\hat{u}_i)
1                 1095           14.10                    1224.1                         -129.1
2                 1001           10.90                    1164.9                         -163.9
3                 1122           23.50                    1398.0                         -276.0
...               ...            ...                      ...                            ...
208               555            13.70                    1216.7                         -661.7
209               626            14.40                    1229.6                         -603.6

1.2 Interpretation of regression as prediction

The interpretation of the regression coefficients \hat{\beta}_0 and \hat{\beta}_1 relies on the interpretation of regression as giving predictions for y_i using x_i. For a general regression equation

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i,

the interpretation of \hat{\beta}_0 is that it is the predicted value of y_i when x_i = 0. It depends on the application whether x_i = 0 is practically relevant. In the CEO salary example, a firm with zero return on equity (i.e. net income of zero) is predicted to have a CEO with a salary of $963,191. Such a prediction has some value in this case because it is possible for a firm to have zero net income in a particular year, and the data contains observations where the return on equity is quite close to zero. As a different example, if we had a regression of individual wages on age of the form \widehat{wage}_i = \hat{\beta}_0 + \hat{\beta}_1 age_i, it would make no practical sense to predict the wage of an individual of age zero! In this case the intercept coefficient \hat{\beta}_0 does not have a natural interpretation.

The slope coefficient \hat{\beta}_1 measures the change in the predicted value \hat{y}_i that would follow from a one unit increase in the regressor x_i. The predicted value of y_i given the regressor takes the value x_i is \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, while the predicted value of y_i given the regressor takes the value x_i + 1 is \hat{y}_i^* = \hat{\beta}_0 + \hat{\beta}_1 (x_i + 1). The change in prediction for y_i based on this change in x_i is \hat{y}_i^* - \hat{y}_i = \hat{\beta}_1. In the CEO salary example, an increase of 1% in a firm's return on equity corresponds to a predicted increase of 18.501 ($18,501) in CEO salary. This quantifies how increases in firm income change our prediction for CEO salary. Econometrics is especially concerned with the estimation and interpretation of such slope coefficients.

1.3 Regression in Eviews

Eviews is statistical software designed specifically for econometric analysis. Data can be read in from Excel files and then easily analysed using OLS regression. The steps to carry out the CEO salary analysis in the previous section are presented here.


Figure 3: Excel spreadsheet for CEO salary data

Figure 3 shows part of an Excel spreadsheet containing the CEO salary data. The variable names are in the first row, followed by the observations for each variable. To open this file in Eviews, go to "File - Open - Foreign Data as Workfile..." as shown in Figure 4, and select the Excel file in the subsequent window. On opening the file, the dialog boxes in Figures 5, 6 and 7 can often be left unchanged. The first specifies the range of the data within the spreadsheet (in this case the first two columns of Sheet 1), the second specifies that the variable names are contained in the first row of the spreadsheet, and the third specifies that a new workfile be created in Eviews to contain the data. For simple data sets such as this, the defaults in these dialog boxes will be correct. More involved data sets will be considered later. On clicking "Finish" in the final dialog box, the new workfile is displayed in Eviews, see Figure 8.

The Range of the workfile specifies the total number of observations available for analysis, in this case 209. The Sample of the workfile specifies which observations are currently being used for analysis, and this defaults to the full range of the workfile unless otherwise specified. There are four objects displayed in the workfile – c, resid, roe and salary. The first two of these will be present in any workfile. The "c" and "resid" objects contain the coefficient values and residuals from the most recently estimated regression. The objects roe and salary contain the data on those two variables. For example, double clicking on salary gives the object view shown in Figure 9, where the observations can be seen. Many other views are possible, but a common and important first step is to obtain some graphical and statistical summaries by selecting "View - Descriptive Statistics & Tests - Histogram and Stats" as shown in Figure 10. This results in Figure 11, where the histogram gives an idea of the distribution of the variable and the descriptive statistics provide an idea of the measures of central tendency (mean, median), dispersion (maximum, minimum, standard deviation) and other measures. The mean CEO salary is $1,281,120 while the median is $1,039,000. The substantial difference between these two statistics is because there are at least three very large salaries that are very influential on the mean, but not the median. These observations were also evident in the scatter plot in Figure 1. The same descriptive statistics can be obtained for the Return on Equity variable.


Figure 4: To open an Excel file in Eviews...

Figure 5: ... specify the range of the data in the spreadsheet...


Figure 6: ... specify that the first header line contains the variable names (salary and ROE)...


Figure 7: ... and specify that a new undated workfile be created.

Figure 8: New Eviews workfile for CEO salary data


Figure 9: Contents of the salary object

The scatter plots in Figures 1 and 2 can be obtained by selecting "Quick - Graph..." as shown in Figure 12, entering "roe salary" into the resulting Series List box as shown in Figure 13, and then specifying a Scatter with regression line (if desired) as shown in Figure 14. The result is Figure 2.

The regression equation itself can be computed by selecting "Quick - Estimate Equation..." as shown in Figure 15, and then specifying the equation as shown in Figure 16. The dependent variable (salary) for the regression equation goes first, the "c" refers to the intercept of the equation (\hat{\beta}_0) and then comes the explanatory variable (roe). The results of the regression calculation are shown in Figure 17. In particular the values of the intercept (\hat{\beta}_0 = 963.1913) and the slope coefficient on RoE (\hat{\beta}_1 = 18.50119) can be read from the "Coefficient" column of the tabulated results. The equation can be named as shown in Figure 18, which means that it will appear as an object in the workfile and can be saved for future reference.

To obtain the predicted values \widehat{salary}_i for the regression, click on the "Forecast" button and enter a new variable name in the "Forecast name" box, say "salary_hat", as shown in Figure 19. A new object called salary_hat is created in the workfile, and double clicking on it reveals the predicted values, the first three of which correspond to the values given in Table 2 for \hat{y}_i.

To obtain the residuals \hat{u}_i for the regression, select "Proc - Make Residual Series" in the equation window as shown in Figure 20 and name the new residuals object as shown in Figure 21. The resulting residuals for the CEO salary regression are shown in Figure 22, the first three of which correspond to the values given in Table 2 for \hat{u}_i.
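For readers who prefer a scripted alternative to the Eviews menus, the same estimates, predicted values and residuals can be produced in Python with the statsmodels package. This is only a sketch: the file name ceosal1.xlsx and the column names salary and roe are assumptions about how the spreadsheet in Figure 3 might be saved.

import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("ceosal1.xlsx")        # assumed file name for the Figure 3 spreadsheet

X = sm.add_constant(df["roe"])            # the constant plays the role of "c" in Eviews
results = sm.OLS(df["salary"], X).fit()   # OLS regression of salary on an intercept and roe

print(results.params)                     # intercept and slope, cf. Figure 17
salary_hat = results.fittedvalues         # predicted values, analogue of the Forecast step
u_hat = results.resid                     # residuals, analogue of Proc - Make Residual Series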


Figure 10: Obtaining descriptive statistics

Figure 11: Descriptive statistics and histogram for the CEO salaries
(Series: SALARY, sample 1 to 209, 209 observations. Mean 1281.120, Median 1039.000, Maximum 14822.00, Minimum 223.0000, Std. Dev. 1372.345, Skewness 6.854923, Kurtosis 60.54128, Jarque-Bera 30470.10, Probability 0.000000.)


Figure 12: Selecting Quick - Graph...

Figure 13: Variables to plot – roe first because it goes on the x-axis.


Figure 14: Selecting a scatter plot with regression line.

Figure 15: Selecting a Quick Equation...


Figure 16: Specifying a regression of CEO salary on an intercept and Return on Equity

Figure 17: Regression results for CEO salary on Return on Equity


Figure 18: Naming an equation to keep it in the workfile.

Figure 19: Use the Forecast procedure to calculate predicted values from the regression.


Figure 20: Make residuals object from a regression

Figure 21: Name the residuals series u_hat


Figure 22: The residuals from the CEO salary regression.

1.4 Goodness of fit

The equation (2) that defines the regression residuals can be written

y_i = \hat{y}_i + \hat{u}_i,    (5)

which states that the regression decomposes each observation into a prediction (\hat{y}_i) that is a function of x_i, and the residual \hat{u}_i. Let var(y_i) denote the sample variance of y_1, \dots, y_n:

var(y_i) = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2,

and similarly var(\hat{y}_i) and var(\hat{u}_i) are the sample variances of \hat{y}_1, \dots, \hat{y}_n and \hat{u}_1, \dots, \hat{u}_n. Some simple algebra (in section 1.5.3 below) shows that

var(y_i) = var(\hat{y}_i) + var(\hat{u}_i).    (6)

(Note that (6) does not follow automatically from (5) and requires the additional property that \sum_{i=1}^{n} \hat{y}_i \hat{u}_i = 0.) Equation (6) shows that the variation in y_i can be decomposed into the sum of the variation in the regression predictions \hat{y}_i and the variation in the residuals \hat{u}_i. The variation of the regression predictions is referred to as the variation in y_i that is explained by the regression. A common descriptive statistic is

R^2 = \frac{var(\hat{y}_i)}{var(y_i)},

which measures the goodness of fit of a regression as the proportion of variation in the dependent variable that is explained by the variation in x_i. The R^2 is known as the coefficient of determination and lies between 0 and 1. The closer R^2 is to one, the better the regression is said to fit. Note that this is just one criterion by which to evaluate the quality of a regression, and others will be given during the course.


It is common, as in Wooldridge, to express the R^2 definitions and algebra in terms of sums of squares rather than sample variances. Equation (6) can be written

\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \frac{1}{n-1} \sum_{i=1}^{n} \hat{u}_i^2,

where use is made of \sum_{i=1}^{n} \hat{u}_i = 0, which in turn implies \sum_{i=1}^{n} \hat{y}_i = \sum_{i=1}^{n} y_i (see section 1.5.3 for the derivation). Cancelling the 1/(n-1) gives

SST = SSE + SSR,

where

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2      "total sum of squares"

SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2      "explained sum of squares"

SSR = \sum_{i=1}^{n} \hat{u}_i^2      "residual sum of squares".

In this case R^2 can equivalently be defined

R^2 = \frac{SSE}{SST}.

The R^2 for the CEO salary regression in Figure 17 is 0.0132, so that just 1.32% of the variation in CEO salaries is explained by the Return on Equity of the firm. This low R^2 (i.e. close to zero) need not imply the regression is useless, but it does imply that CEO salaries are determined by other important factors besides just the profitability of the firm.

Some intuition for what R^2 is measuring can be found in Figures 23 and 24, which show two hypothetical regressions with R^2 = 0.185 and R^2 = 0.820 respectively. The data in Figure 24 are less dispersed around the regression line, so that changes in x_i more precisely predict changes in y_{2,i} than y_{1,i}. There is more variation in y_{1,i} that is left unexplained by the regression.

Figure 25 gives one example of how R^2 does not always provide a foolproof measure of the quality of a regression. The regression in Figure 25 has R^2 = 0.975, very close to the maximum possible value of one, but the scatter plot clearly reveals that the regression does not explain an important feature of the relationship between y_{3,i} and x_i – there is some curvature or non-linearity that is not captured by the regression. A high R^2 is a nice property for a regression to have, but is neither necessary nor sufficient for a regression to be useful.
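The decomposition SST = SSE + SSR and the equivalence of the two R^2 definitions are easy to check numerically. The sketch below uses hypothetical data (any OLS fit would do) and computes R^2 both ways; it is an illustration added to these notes rather than part of the original example.

import numpy as np

# hypothetical data, fitted by the closed-form OLS expressions (3) and (4)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.4])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat, u_hat = b0 + b1 * x, y - (b0 + b1 * x)

SST = np.sum((y - y.mean()) ** 2)       # total sum of squares
SSE = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
SSR = np.sum(u_hat ** 2)                # residual sum of squares

print(np.isclose(SST, SSE + SSR))                            # decomposition holds for OLS
print(SSE / SST, np.var(y_hat, ddof=1) / np.var(y, ddof=1))  # the two R^2 definitions agree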

1.5 Derivations

1.5.1 Summation notation

It will be necessary to know some simple properties of summation operators to follow the derivations. The summation operator is defined by

\sum_{i=1}^{n} a_i = a_1 + a_2 + \dots + a_n.


Figure 23: Y1 plotted against X, R^2 = 0.185

Figure 24: Y2 plotted against X, R^2 = 0.820


Figure 25: Y3 plotted against X, R^2 = 0.975

It follows that

\sum_{i=1}^{n} (a_i + b_i) = \sum_{i=1}^{n} a_i + \sum_{i=1}^{n} b_i.

If c is a constant (i.e. does not vary with i) then

\sum_{i=1}^{n} c = c + c + \dots + c \ (n \text{ times}) = nc.

Similarly

\sum_{i=1}^{n} c a_i = c \sum_{i=1}^{n} a_i,

which is an extension of taking c outside the brackets in (c a_1 + c a_2) = c(a_1 + a_2). The sample mean of a_1, \dots, a_n is

\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i,

and then

\sum_{i=1}^{n} (a_i - \bar{a}) = \sum_{i=1}^{n} a_i - \sum_{i=1}^{n} \bar{a} = n\bar{a} - n\bar{a} = 0,    (7)

so that the sum of deviations around the sample mean is always exactly zero.


The sum of squares of a_i around the sample mean \bar{a} can be expressed

\sum_{i=1}^{n} (a_i - \bar{a})^2 = \sum_{i=1}^{n} (a_i^2 - 2 a_i \bar{a} + \bar{a}^2)
                               = \sum_{i=1}^{n} a_i^2 - 2\bar{a} \sum_{i=1}^{n} a_i + \sum_{i=1}^{n} \bar{a}^2
                               = \sum_{i=1}^{n} a_i^2 - 2n\bar{a}^2 + n\bar{a}^2
                               = \sum_{i=1}^{n} a_i^2 - n\bar{a}^2.
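Both summation results are easy to verify numerically; the short sketch below (added for illustration) checks (7) and the sum-of-squares identity for an arbitrary vector of numbers.

import numpy as np

a = np.array([3.0, 7.0, 1.0, 4.0, 10.0])   # arbitrary illustrative numbers
n, a_bar = len(a), a.mean()

print(np.isclose(np.sum(a - a_bar), 0.0))                                     # equation (7)
print(np.isclose(np.sum((a - a_bar) ** 2), np.sum(a ** 2) - n * a_bar ** 2))  # sum of squares identity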

1.5.2 Derivation of OLS

Consider a prediction for y_i of the form

\hat{y}_i = b_0 + b_1 x_i,

where b_0 and b_1 could be any coefficients. The residuals from this predictor are

\hat{u}_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i.

The idea of OLS is to choose the values of b_0 and b_1 that minimise the sum of squared residuals

SSR(b_0, b_1) = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2.

The minimisation can be done using calculus. The first derivatives of SSR(b_0, b_1) with respect to b_0 and b_1 are

\frac{\partial SSR(b_0, b_1)}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)

\frac{\partial SSR(b_0, b_1)}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i).

Setting these first derivatives to zero at the desired estimators \hat{\beta}_0 and \hat{\beta}_1 gives the first order conditions

\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0    (8)

\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0.    (9)

(See equations (2.14) and (2.15) of Wooldridge, who takes a different approach to arrive at these equations.)

The first equation can be written

\sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i = 0,


which is equivalent (after dividing both sides by n) to

\bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x} = 0.

Solving for \hat{\beta}_0 gives (4). Substituting this expression for \hat{\beta}_0 into the second equation gives

\sum_{i=1}^{n} x_i \left( (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x}) \right) = 0,

or

\sum_{i=1}^{n} x_i (y_i - \bar{y}) - \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0.

Notice that

\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i (y_i - \bar{y}) - \bar{x} \sum_{i=1}^{n} (y_i - \bar{y}) = \sum_{i=1}^{n} x_i (y_i - \bar{y}),

and similarly

\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i (x_i - \bar{x}),

so the first order condition for \hat{\beta}_1 can be written

\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) - \hat{\beta}_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0,

which leads to (3).

1.5.3 Properties of predictions and residuals

The OLS residuals

\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i,

satisfy

\sum_{i=1}^{n} \hat{u}_i = 0,    (10)

because of (8), and hence \bar{\hat{u}} = 0. Similarly because of (9) the residuals satisfy

\sum_{i=1}^{n} x_i \hat{u}_i = 0.    (11)

From these two it follows that

\sum_{i=1}^{n} \hat{y}_i = \sum_{i=1}^{n} (y_i - \hat{u}_i) = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \hat{u}_i = \sum_{i=1}^{n} y_i,    (12)

so the OLS predictions \hat{y}_i have the same sum, and hence the same sample mean, as the original dependent variable y_i. Also

\sum_{i=1}^{n} \hat{y}_i \hat{u}_i = \hat{\beta}_0 \sum_{i=1}^{n} \hat{u}_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i \hat{u}_i = 0.    (13)


Now consider the total sum of squares

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
    = \sum_{i=1}^{n} (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2
    = \sum_{i=1}^{n} (\hat{u}_i + \hat{y}_i - \bar{y})^2
    = \sum_{i=1}^{n} \hat{u}_i^2 + 2 \sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
    = \sum_{i=1}^{n} \hat{u}_i^2 + 2 \left( \sum_{i=1}^{n} \hat{u}_i \hat{y}_i - \bar{y} \sum_{i=1}^{n} \hat{u}_i \right) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
    = SSR + SSE.

The last step uses (13) and (10) to cancel the middle two terms. It also uses (12) to identify SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2. Also (10) implies that \bar{\hat{u}} = 0, so that SSR = \sum_{i=1}^{n} (\hat{u}_i - \bar{\hat{u}})^2 does not require the \bar{\hat{u}}. Dividing this equality through by n - 1 gives (6).
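Properties (10), (11) and (13) hold for any OLS fit and can be confirmed numerically; the sketch below (added for illustration) reuses the closed-form estimator from section 1.5.2 on hypothetical data.

import numpy as np

x = np.array([0.5, 1.5, 2.0, 3.5, 4.0, 5.5])   # hypothetical regressor
y = np.array([1.2, 2.6, 2.4, 4.1, 4.9, 6.3])   # hypothetical dependent variable

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat, u_hat = b0 + b1 * x, y - (b0 + b1 * x)

print(np.isclose(np.sum(u_hat), 0.0))           # (10): residuals sum to zero
print(np.isclose(np.sum(x * u_hat), 0.0))       # (11): residuals uncorrelated with the regressor
print(np.isclose(np.sum(y_hat * u_hat), 0.0))   # (13): residuals uncorrelated with the predictions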

2 Statistical Inference and the Population Regression Function

The review of regression in section 1 suggests that regression is a useful tool for summarising the relationship between two observed variables and for calculating predictions for one variable based on observations on the other. In econometrics we want to do more than this. We want to use the information contained in a sample to carry out inductive inference (statistical inference) on the underlying population from which the sample was drawn. For example, we want to take the sample of 209 CEO salaries in section 1 as being representative of the salaries of CEOs in the population of all firms. In practice it is necessary to be very careful about the definition of this population. This dataset, taken from the American textbook of Wooldridge, would best be taken as being representative of only US firms, rather than all firms in the world, or all firms in OECD countries. In fact the population may be US publicly listed firms, since firms unlisted on the stock market may have quite different processes for executive salaries. Nevertheless, with the population carefully defined, the idea of statistical inference is to make statistical statements about that population, not only the sample that has been observed.

2.1 Simple random sample

Suppose there is a well defined population in which we are interested, e.g. the population of publicly listed firms in the US. A simple random sample is one in which each firm in the population has an equal probability of being included in the sample. Moreover each firm in the sample is chosen independently of all the others. That is, the probability of inclusion or exclusion of one firm in the sample does not depend on the inclusion or exclusion of any other firm.

For each firm included in the sample, we take one or more measurements of interest (e.g. CEO salary and the firm's Return on Equity). Mathematically these are represented as the random variables y_i and x_i for i = 1, ..., n, where n is the sample size. The concept of a random variable reflects the idea that the values taken in the sample would have been different if a different random sample had been drawn. In the observed sample we had y_1 = 1095, y_2 = 1001, etc., but if another


simple random sample had been drawn then different firms would (most likely) have been chosen and the y_i values would have been different. That is, random variables take different values if a different sample is drawn.

In the population of all firms, there is a distribution of CEO salaries. The random variables y_1, y_2, ... are independent random drawings from this distribution. That is, each of y_1, y_2, ... are independent of each other and are drawn from the same underlying distribution. They are therefore called independent and identically distributed random variables, always abbreviated as i.i.d..

There are many other sampling schemes that may arise in practice, some of which will be introduced later. Our initial discussion of regression modelling will be confined to cross-sectional data drawn as a simple random sample, i.e. confined to i.i.d. random variables.

2.2 Population distributions and parameters

The population distribution of interest will be characterised by certain parameters of interest. For example, the distribution of CEO salaries for the population of publicly listed firms in the US will have some population mean that could be denoted \mu. The population mean is defined mathematically as the expected value of the population distribution. This expected value is the weighted average of all possible values in the population, with weights given by the probability distribution denoted f(y), i.e. \mu = \int y f(y) \, dy. (The evaluation of such integrals will not be required here, but see Appendix B.3 of Wooldridge for some more details.) Since each of the random variables y_1, y_2, ... represents a random drawing from the population distribution, each of them also has a mean of \mu. This is written

E(y_i) = \mu, \qquad i = 1, 2, \dots.    (14)

Similarly the population distribution of CEO salaries has some variance that could be denoted \sigma^2, which is defined in terms of y_i as

\sigma^2 = var(y_i) = E[(y_i - \mu)^2], \qquad i = 1, 2, \dots.

(Or, in terms of integrals, as \sigma^2 = \int (y - \mu)^2 f(y) \, dy.)

2.3 Population vs Sample

It will be important throughout to be clear on the distinction between the population and the sample. The population is too large or unwieldy or simply impossible to fully observe and measure. Therefore a quantity such as the population mean \mu = E(y_i) is also impossible to observe. Instead we take a sample, which is a subset of the population, and attempt to estimate the population mean \mu based on that sample. An obvious (but not the only) statistic to use to estimate \mu is the sample mean \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i. A sample statistic such as \bar{y} is observable, e.g. \bar{y} = 1281.12 for the CEO salary data (see Figure 11).

It is vital at all times to keep clear the distinction between an unobservable population parameter like \mu = E(y_i) about which we wish to learn, and an observable sample statistic \bar{y} that we use to estimate \mu. More generally we want to use \bar{y} (and perhaps other statistics) to draw statistical inferences about \mu.

2.4 Conditional Expectation

In econometrics we are nearly always interested in at least two random variables in a population, e.g. y_i for CEO salary and x_i for Return on Equity, and the relationships between them. Of central


interest in econometrics is the conditional distribution of y_i given x_i. That is, rather than being interested in the distribution of CEO salaries in isolation (the so-called marginal distribution of y_i), we are interested in how the distribution of CEO salaries changes as the Return on Equity of the firm changes. For regression analysis, the fundamental population quantity of interest is the conditional expectation of y_i given x_i, which is denoted by the function

E(y_i | x_i) = \mu(x_i).    (15)

(As outlined in Appendix B.4 of Wooldridge, this conditional expectation is defined as \mu(x) = \int y f_{Y|X}(y|x) \, dy, where f_{Y|X} is the conditional distribution of y_i given x_i.) Much of econometrics is devoted to estimating conditional expectation functions.

The idea is that E(y_i | x_i) provides the prediction of y_i corresponding to a given value of x_i (i.e. the value of y_i that we would expect given some value of x_i). For example, \mu(10) is the population mean of CEO salary for a firm with Return on Equity of 10%. This conditional mean will be different (perhaps lower?) than \mu(20), which is the population mean of CEO salary for a firm with Return on Equity of 20%. If the population mean of y_i changes when we change the value of x_i, there is a potentially interesting relationship between y_i and x_i to explore.

Consider the difference between the unconditional mean \mu = E(y_i) given in (14) and the conditional mean \mu(x_i) = E(y_i|x_i) given in (15). These are different population quantities with different uses. The unconditional mean \mu provides an overall measure of central tendency for the distribution of y_i but provides no information on the relationship between y_i and x_i. The conditional mean \mu(x_i), by contrast, describes how the predicted/mean value of y_i changes with x_i. For example, \mu is of interest if we want to investigate the overall average level of CEO salaries (perhaps to compare them to other occupations, say), while \mu(x_i) is of interest if we want to start to try to understand what factors may help explain the level of CEO salaries.

Note also that \mu is, by definition, a single number. On the other hand \mu(x_i) is a function, that is, it is able to take different values for different values of x_i.

2.5 The Population Regression Function

The Population Regression Function (PRF) is, by definition, the conditional expectations function (15). In a simple regression analysis, it is assumed that this function is linear, i.e.

E(y_i | x_i) = \beta_0 + \beta_1 x_i.    (16)

This linearity assumption need not always be true and is discussed more later. This PRF specifies the conditional mean of y_i in the population for any value of x_i. It specifies one important aspect of the relationship between y_i and x_i.

Statistical inference in regression models is about using sample information to learn about E(y_i | x_i), which in the case of (16) amounts to learning about \beta_0 and \beta_1. Consider the SRF introduced in (1), restated here:

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.    (17)

The idea is that \hat{\beta}_0 and \hat{\beta}_1 are the sample OLS estimators that we calculate to estimate the unobserved population coefficients \beta_0 and \beta_1. Then for any x_i we can use the sample predicted value \hat{y}_i to estimate the conditional expectation E(y_i | x_i).

2.6 Statistical Properties of OLS

An important question is whether the OLS SRF (17) provides a good estimator of the PRF (16) in some sense. In this section we address this question assuming that


A1 (y_i, x_i), i = 1, ..., n, are i.i.d. random variables (i.e. from a simple random sample)

A2 the linear form (16) of the PRF is correct.

Estimators in statistics (such as a sample mean \bar{y} or regression coefficients \hat{\beta}_0, \hat{\beta}_1) can be considered to be random variables since they are functions of the random variables that represent the data. For example the sample mean \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i is a random variable because it is defined in terms of the random variables y_1, ..., y_n. That is, if a different random sample had been drawn for y_1, ..., y_n then a different value for \bar{y} would be obtained. The distribution of an estimator is called the sampling distribution of the estimator. The statistical properties of an estimator are derived from its sampling distribution.

2.6.1 Properties of Expectations

The properties of a sampling distribution are often defined in terms of its mean and variance and other similar quantities. To work these out, it is necessary to use some simple properties of the expectations operator E and the conditional expectations operator, summarised here.

Suppose z_1, ..., z_n are i.i.d. random variables and c_1, ..., c_n are non-random. Then

E1 E(\sum_{i=1}^{n} c_i z_i) = \sum_{i=1}^{n} c_i E(z_i)

E2 var(\sum_{i=1}^{n} c_i z_i) = \sum_{i=1}^{n} c_i^2 var(z_i)

E3 E(c_i) = c_i, var(c_i) = 0.

Property E1 continues to hold if the z_i are not i.i.d. (for example, if they are correlated with each other) but Property E2 does not continue to hold if the z_i are correlated. Recall from Assumption A1 that, at least for now, we are assuming that the random variables y_i and x_i are each i.i.d. across i. Property E3 simply states that the expectation of a constant (c_i) is itself, and that a constant has no variation.

In view of the definition of the PRF (16), conditional expectations are fundamental to regression analysis. It turns out to be useful to be able to work with not only E(y_i | x_i) but E(y_i | x_1, ..., x_n), which is the conditional expectation of y_i given information on the explanatory variables for all observations, not only observation i. The reason for this becomes clear in the following section. Under Assumption A1

E(y_i | x_i) = E(y_i | x_1, \dots, x_n).    (18)

This can be proven formally, but the intuition is simply that under independent sampling, information in explanatory variables x_j for j ≠ i is not informative about y_i, since (y_i, x_i) and (y_j, x_j) are independent for all j ≠ i. That is, knowing x_j for j ≠ i does not change our prediction of y_i. For example, our prediction of the CEO salary for firm 1 is not improved by knowing the Return on Equity of any other firm; it is assumed to be explained only by the performance of firm 1. That is

E(salary_i | RoE_i) = E(salary_i | RoE_1, \dots, RoE_n).

Equation (18) is reasonable under Assumption A1, but not in other sampling situations such as time series data considered later.

The conditional variance of a random variable is a measure of its conditional dispersion around its conditional mean. For example

var(y_i | x_i) = E\left[ (y_i - E(y_i|x_i))^2 \,\middle|\, x_i \right].


(Compare this to the unconditional variance var(y_i) = E[(y_i - E(y_i))^2].) The conditional variance of y_i is the variation in y_i that remains when x_i is given a fixed value. The unconditional variance of y_i is the overall variation in y_i, averaged across all x_i values. It follows that, on average, conditioning cannot increase the variance: E[var(y_i|x_i)] ≤ var(y_i) (see property LIEvar below). If y_i and x_i are independent then var(y_i|x_i) = var(y_i). It is frequently the case in practice that var(y_i|x_i) may vary in important ways with x_i. For example, it may be that CEO salaries are more highly variable for more profitable firms than for less profitable firms. Or, if y_i is wages and x_i is individual age, then it is likely that the variation in wages across individuals becomes greater as age increases. If var(y_i|x_i) varies with x_i then this is called heteroskedasticity. If var(y_i|x_i) is constant across x_i then this is called homoskedasticity.

Under Assumption A1, the conditional expectations operator has properties similar to E1 and E2. Suppose c_1, ..., c_n are either non-random or functions of x_1, ..., x_n only (i.e. not functions of y_1, ..., y_n). Then

CE1 E(\sum_{i=1}^{n} c_i y_i | x_1, \dots, x_n) = \sum_{i=1}^{n} c_i E(y_i | x_i)

CE2 var(\sum_{i=1}^{n} c_i y_i | x_1, \dots, x_n) = \sum_{i=1}^{n} c_i^2 var(y_i | x_i)

Without i.i.d. sampling (e.g. time series), CE1 would continue to hold in the form E(\sum_{i=1}^{n} c_i y_i | x_1, \dots, x_n) = \sum_{i=1}^{n} c_i E(y_i | x_1, \dots, x_n), while CE2 would generally not be true.

The final very useful property of conditional expectations is the Law of Iterated Expectations:

LIE For any random variables z and x, E [z] = E [E (z|x)].

The LIE may appear odd at first but is very useful and has some intuition. Leaving aside the regression context, let z represent the outcome from a roll of a die, i.e. a number from 1, 2, ..., 6. The expected value of this random variable is E(z) = \frac{1}{6}(1 + 2 + \dots + 6) = 3.5, since the probability of each possible outcome is 1/6. Now suppose we define another random variable x that takes the value 0 if z is even and 1 if z is odd. That is, x = 0 if z = 2, 4, 6, and x = 1 if z = 1, 3, 5, so that Pr(x = 0) = 1/2 and Pr(x = 1) = 1/2. It should be clear that E(z|x = 0) = 4 and E(z|x = 1) = 3, which illustrates the idea that conditional expectations can take different values (4 or 3) when the conditioning variables take different values (0 or 1). The expected value of the random variable E(z|x) is taken as an average over the possible x values, that is, E[E(z|x)] = \frac{1}{2}(4 + 3) = 3.5, since the probability of each possible outcome of E(z|x) is 1/2. This illustrates the LIE, i.e. E(z) = E[E(z|x)] = 3.5. While E[E(z|x)] may appear more complicated than E(z), it frequently turns out to be easier to work with.
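The die example can also be checked by simulation; the sketch below (illustrative only, not part of the original notes) draws a large number of rolls and compares a direct estimate of E(z) with the weighted average of the two conditional means.

import numpy as np

rng = np.random.default_rng(0)
z = rng.integers(1, 7, size=1_000_000)   # die rolls, values 1..6
x = (z % 2 == 1).astype(int)             # x = 1 if z is odd, x = 0 if z is even

e_z = z.mean()                           # direct estimate of E(z) = 3.5
lie = (x == 0).mean() * z[x == 0].mean() + (x == 1).mean() * z[x == 1].mean()
print(e_z, lie)                          # both close to 3.5, illustrating E(z) = E[E(z|x)]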

The LIE also has a version in variances:

LIEvar var (z) = E [var (z|x)] + var [E (z|x)].

This shows that the variance of a random variable can be decomposed into its average conditional variance given x and the variance of the regression function on x.

2.6.2 Unbiasedness

An estimator is defined to be unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated. If \hat{\theta} is any estimator of a parameter \theta, it is unbiased if E(\hat{\theta}) = \theta. The idea is that an unbiased estimator is one that does not systematically under-estimate or over-estimate the true value \theta. Some samples from the population will give values of \hat{\theta} below \theta and some samples will give values of \hat{\theta} above \theta, and these differences average out. In practice we only get to observe a single value of \hat{\theta} of course, and this single value may differ from \theta


by being too large or too small. It is only on average over all possible samples that the estimator gives \theta. So unbiasedness is a desirable property for a statistical estimator, although not one that occurs very often. However in linear regression models there are situations where the OLS estimator can be shown to be unbiased. We consider the unbiasedness of the sample mean first, and then the OLS estimator of the slope coefficient in a simple regression.

Let \mu_y = E(y_i) denote the population mean of the i.i.d. random variables y_1, ..., y_n. Then

E(\bar{y}) = E\left( \frac{1}{n} \sum_{i=1}^{n} y_i \right) = \frac{1}{n} \sum_{i=1}^{n} E(y_i) = \frac{1}{n} \sum_{i=1}^{n} \mu_y = \mu_y,    (19)

where the second step uses Property E1 above, and this shows that the sample mean is an unbiased estimator of the population mean.

Under Assumptions A1 and A2 above, the OLS estimators \hat{\beta}_0 and \hat{\beta}_1 can be shown to be unbiased. Just \hat{\beta}_1 is considered here. First recall the property of zero sums around sample means (7), which implies

\sum_{i=1}^{n} (x_i - \bar{x}) = 0,    (20)

and

\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x}) y_i - \bar{y} \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x}) y_i,    (21)

and similarly

\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \bar{x}) x_i.    (22)

Using (21) allows \hat{\beta}_1 in (3) to be written

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{\sum_{j=1}^{n} (x_j - \bar{x})^2} \, y_i = \sum_{i=1}^{n} a_{n,i} y_i.    (23)

This shows that \hat{\beta}_1 is a weighted sum of y_1, ..., y_n, with the weight on each observation y_i being given by

a_{n,i} = \frac{(x_i - \bar{x})}{\sum_{j=1}^{n} (x_j - \bar{x})^2},    (24)

which for each i depends on all of x_1, ..., x_n (hence the subscript n included in the a_{n,i} notation). Now use the LIE to write

E(\hat{\beta}_1 | x_1, \dots, x_n) = \sum_{i=1}^{n} a_{n,i} E(y_i | x_1, \dots, x_n)
                                 = \sum_{i=1}^{n} a_{n,i} (\beta_0 + \beta_1 x_i)
                                 = \beta_0 \sum_{i=1}^{n} a_{n,i} + \beta_1 \sum_{i=1}^{n} a_{n,i} x_i,    (25)


where the second line uses (18), which holds under Assumption A1. Using (20) gives

\sum_{i=1}^{n} a_{n,i} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 0,

and using (22) gives

\sum_{i=1}^{n} a_{n,i} x_i = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) x_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 1.

Substituting these into (25) gives

E(\hat{\beta}_1 | x_1, \dots, x_n) = \beta_1,    (26)

and hence, applying the LIE,

E[\hat{\beta}_1] = E\left[ E(\hat{\beta}_1 | x_1, \dots, x_n) \right] = E[\beta_1] = \beta_1.

This shows that \hat{\beta}_1 is an unbiased estimator of \beta_1.
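Unbiasedness can be illustrated by simulation: generate many samples from a known PRF, compute the OLS slope in each, and check that the estimates average to the true slope. The design below (true values \beta_0 = 1, \beta_1 = 2, normal errors) is purely illustrative and not from the notes.

import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, n, reps = 1.0, 2.0, 50, 10_000

estimates = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0.0, 10.0, size=n)                    # regressor for this sample
    y = beta0 + beta1 * x + rng.normal(0.0, 2.0, size=n)  # PRF plus an i.i.d. error
    estimates[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print(estimates.mean())   # close to beta1 = 2, consistent with E(beta1_hat) = beta1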

2.6.3 Variance

The variance of an estimator measures how dispersed values of the estimator can be around the mean. In general it is preferred for an estimator to have a small variance, implying that it tends not to produce estimates very far from its mean. This is especially so for an unbiased estimator, for which a small variance implies the distribution of the estimator is closely concentrated around the true population value of the parameter of interest.

For the sample mean, consider again the i.i.d. random variables y_1, ..., y_n, each with population mean \mu_y = E(y_i) and population variance \sigma_y^2. Then

var(\bar{y}) = var\left( \frac{1}{n} \sum_{i=1}^{n} y_i \right) = \frac{1}{n^2} \sum_{i=1}^{n} var(y_i) = \frac{\sigma_y^2}{n},    (27)

the second equality following from Property E2. This formula shows what factors influence the precision of the sample mean: the variance \sigma_y^2 and the sample size n. Specifically, having a population with a small variance \sigma_y^2 leads to a more precise estimator \bar{y} of \mu_y, which makes intuitive sense. Similarly intuitively, a larger sample size n implies a smaller variance of \bar{y}, implying that more precise estimates are obtained from larger sample sizes.

Now consider the variance of the OLS slope estimator \hat{\beta}_1. Using Property LIEvar above, the variance of \hat{\beta}_1 can be expressed

var(\hat{\beta}_1) = E\left[ var(\hat{\beta}_1 | x_1, \dots, x_n) \right] + var\left[ E(\hat{\beta}_1 | x_1, \dots, x_n) \right]
                  = E\left[ var(\hat{\beta}_1 | x_1, \dots, x_n) \right] + var[\beta_1]
                  = E\left[ var(\hat{\beta}_1 | x_1, \dots, x_n) \right],

where (26) is used to get the second line and then Property E3 (the variance of a constant is zero) to get the third line. The conditional variance of \hat{\beta}_1 given x_1, ..., x_n is

var(\hat{\beta}_1 | x_1, \dots, x_n) = \sum_{i=1}^{n} a_{n,i}^2 \, var(y_i | x_i)
                                    = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, var(y_i | x_i)}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2},


using Property CE2 to obtain the first line and then substituting for a_{n,i} to obtain the second line. This implies

var(\hat{\beta}_1) = E\left[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, var(y_i | x_i)}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2} \right],    (28)

which is a fairly complicated formula that doesn't shed a lot of light on the properties of \hat{\beta}_1, but it does have later practical use when we talk about hypothesis testing.

A simplification of the variance occurs under homoskedasticity, that is when var(y_i | x_i) = \sigma^2 for every i. If the conditional variance is constant then

var(\hat{\beta}_1) = E\left[ \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] = \frac{\sigma^2}{n-1} E\left[ \frac{1}{s_x^2} \right],    (29)

where

s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

is the usual sample variance of the explanatory variable x_i. Formula (29) is simple enough to understand what factors in a regression influence the precision of \hat{\beta}_1. The variance will be small for small values of \sigma^2 and large values of n - 1 and s_x^2. This implies practically that slope coefficients can be precisely estimated in situations where the sample size is large, where the regressor x_i is highly variable, and where the dependent variable y_i has small variation around the regression function (i.e. small \sigma^2).

2.6.4 Asymptotic normality

Having discussed the mean and variance of a sampling distribution, it is also possible to consider the entire sampling distribution. This becomes important when we discuss hypothesis testing.

First consider the sample mean of some i.i.d. random variables y_1, ..., y_n with mean \mu_y and variance \sigma_y^2. Recall from (19) and (27) that the sample mean \bar{y} has mean E(\bar{y}) = \mu_y and variance var(\bar{y}) = \sigma_y^2 / n. In general the sampling distribution of \bar{y} is not known, but in the special case where each y_i is known to be normally distributed, it follows that \bar{y} is also normally distributed. That is, if y_i \sim i.i.d.\ N(\mu_y, \sigma_y^2) then

\bar{y} \sim N\left( \mu_y, \frac{\sigma_y^2}{n} \right).    (30)

If the distribution of y_i is not normal, then the distribution of \bar{y} is also not normal. In econometrics it is very rare to know that each y_i is normally distributed, so it would appear that (30) has only theoretical interest. However, there is a powerful result in probability called the Central Limit Theorem which states that even if y_i is not normally distributed, the sample mean \bar{y} can still be taken to be approximately normally distributed, with the approximation generally working better for larger values of n. Technically we say that \bar{y} converges to a normal distribution as n → ∞, or that \bar{y} is asymptotically normal, and we will write this in the form

\bar{y} \overset{a}{\sim} N\left( \mu_y, \frac{\sigma_y^2}{n} \right),    (31)


Figure 26: The Gamma distribution with parameters b = r = 2

with the "a" denoting the fact that the normal distribution for \bar{y} is asymptotic (i.e. as n → ∞) or, more simply, approximate.

The proof of the Central Limit Theorem goes beyond our scope, but it can be illustrated using simulated data. Suppose that y_1, ..., y_n are i.i.d. random variables with a Gamma distribution as shown in Figure 26. The mean of this distribution is \mu_y = 4. The details of the Gamma distribution are not important for this discussion, although it is a well-known distribution for modelling certain types of data in econometrics. For example, the skewed shape of the distribution can make it suitable for income distribution modelling, in which many people or households make low to moderate incomes and a relative few make high to very high incomes. Clearly this Gamma distribution is very different in shape from a normal distribution! We can use Eviews to draw a sample of size n from this Gamma distribution and to compute \bar{y}. Repeating this many times builds up a picture of the sampling distribution of \bar{y} for the given n. The results of doing this are given in Figures 27-31.

Figure 27 shows the simulated sampling distribution of \bar{y} when n = 5. The skewness of the population distribution of y_i in Figure 26 remains evident in the distribution of \bar{y} in Figure 27, but to a reduced extent. The approximation (31), which is meant to hold for large n, does not work very well for n = 5. As n increases, however, through n = 10, 20, 40, 80 in Figures 28-31, it is clear that the sampling distribution of \bar{y} becomes more and more like a normal distribution, even though the underlying data on y_i is very far from being normal. This is the Central Limit Theorem at work and is why, for reasonable sample sizes, we are prepared to rely on an approximate distribution such as (31) to carry out statistical inference.

Two other features of the sampling distributions in Figures 27-31 are worth noting. Firstly, the mean of each sampling distribution is known to be \mu_y = 4 because \bar{y} is unbiased for every n. Secondly, the variance of the sampling distribution becomes smaller as n increases because var(\bar{y}) = \sigma_y^2 / n. That is, the sampling distribution becomes more concentrated around \mu_y = 4 as n increases (note carefully the scale on the horizontal axis changing as n increases).
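The simulation described above is straightforward to replicate outside Eviews. The sketch below (an illustration added to these notes) draws repeated samples of size n from a Gamma distribution with shape 2 and scale 2, so that the mean is 4 as stated for Figure 26, and records the sample means; their distribution looks increasingly normal, and increasingly concentrated around 4, as n grows.

import numpy as np

rng = np.random.default_rng(2)
reps = 10_000

for n in (5, 10, 20, 40, 80):
    y_bars = rng.gamma(shape=2.0, scale=2.0, size=(reps, n)).mean(axis=1)
    # the mean of the sampling distribution stays near 4; its variance shrinks like sigma^2/n
    print(n, y_bars.mean(), y_bars.var())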

The same principle applies to the regression coefficients \hat{\beta}_0 and \hat{\beta}_1. Each can be shown to be asymptotically normal because of the Central Limit Theorem.


Figure 27: Sampling distribution of \bar{y} with n = 5 observations from the Gamma(2, 2) distribution.

Figure 28: Sampling distribution of \bar{y} with n = 10 observations from the Gamma(2, 2) distribution.


Figure 29: Sampling distribution of \bar{y} with n = 20 observations from the Gamma(2, 2) distribution.

Figure 30: Sampling distribution of \bar{y} with n = 40 observations from the Gamma(2, 2) distribution.


Figure 31: Sampling distribution of \bar{y} with n = 80 observations from the Gamma(2, 2) distribution.

For \hat{\beta}_1, the Central Limit Theorem applies to the sum (23) and gives the approximate distribution

\hat{\beta}_1 \overset{a}{\sim} N\left( \beta_1, \omega_{1,n}^2 \right),    (32)

where in general

\omega_{1,n}^2 = var(\hat{\beta}_1) = E\left[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, var(y_i | x_i)}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2} \right],    (33)

as given in (28). Under homoskedasticity, this simplifies to

\omega_{1,n}^2 = \frac{\sigma^2}{n-1} E\left[ \frac{1}{s_x^2} \right],    (34)

as shown in (29).

2.7 Summary

In introductory econometrics the topic of statistical inference and its theory is typically the most difficult to grasp, both in its concepts and its formulae. What follows is a summary of the important concepts of this section.

Populations and Samples

• Statistical inference is the process of attempting to learn about some characteristics of a population based on a sample drawn from that population.

• The most straightforward sampling approach is a simple random sample, in which every element in the population has an equal chance of being included in the sample.

Mean and variance in the Population and the Sample


• Population characteristics such as means and variances are defined as the expectations µy = E(yi) and $\sigma_y^2 = E\left[(y_i - \mu_y)^2\right]$.

• Sample estimators of means and variances are defined as $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$.

Regression in the Population and the Sample

• The Population Regression Function (PRF) is defined in terms of the conditional expectations operator E(yi|xi) = β0 + β1xi.

• The Sample Regression Function (SRF) is defined in terms of the OLS regression line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.

Statistical properties

Under simple random sampling

• $\bar{y} \overset{a}{\sim} N\left(\mu_y, \sigma_y^2/n\right)$

• $\hat{\beta}_1 \overset{a}{\sim} N\left(\beta_1, \omega_{1,n}^2\right)$, where in general

$$\omega_{1,n}^2 = E\left[\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\mathrm{var}(y_i|x_i)}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^2}\right],$$

or under homoskedasticity

$$\omega_{1,n}^2 = \frac{\sigma^2}{n-1}\,E\left[\frac{1}{s_x^2}\right].$$


3 Hypothesis Testing and Confidence Intervals

The idea of statistical inference is that we use the observable sample information summarised by the SRF

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

to make inferences about the unobservable PRF

$$E(y_i|x_i) = \beta_0 + \beta_1 x_i.$$

For example, in the CEO salary regression in Figure 17, we take β̂1 = 18.50 to be the point estimate of the unknown population coefficient β1. This point estimate is very useful, but on its own it doesn't communicate the uncertainty that is implicit in having taken a sample of just n = 209 firms from all firms in the population. If we had taken a different sample of firms, we would have obtained a different value for β̂1. This uncertainty is summarised in the sampling distribution of β̂1 in equation (32), which quantifies (approximately) the entire distribution of estimates that could have been obtained by taking different samples from the underlying population. The techniques of hypothesis testing and confidence intervals provide ways of making probabilistic statements about β1 that are more informative, and more honest about the statistical uncertainty, than a simple point estimate.

3.1 Hypothesis testing

3.1.1 The null hypothesis

The approach in hypothesis testing is to specify a null hypothesis about a particular value for a population parameter (say β0 or β1) and then to investigate whether the observed data provide evidence for the rejection of this hypothesis. For example, in the CEO salary regression, we might specify a null hypothesis that firm profitability has no predictive power for CEO salary. In the PRF

E (Salaryi|RoEi) = β0 + β1RoEi, (35)

the null hypothesis would be expressed

H0 : β1 = 0. (36)

If the null hypothesis were true then E(Salaryi|RoEi) = β0, which states that average CEO salaries are constant (β0) across all levels of firm profitability.

Note that the hypothesis is expressed in terms of the population parameter β1, not the sample estimate β̂1. Since we know that β̂1 = 18.50, it would be nonsense to investigate whether β̂1 = 0 .... it isn't! Instead we are interested in testing whether β̂1 = 18.50 differs sufficiently from zero for us to conclude that β1 also differs from zero, albeit with some level of uncertainty that acknowledges the sampling variability inherent in β̂1.

3.1.2 The alternative hypothesis

After specifying the null hypothesis, the next requirement is the alternative hypothesis. The alternative hypothesis is specified as an inequality, as opposed to the null hypothesis, which is an equality. In the case of a null hypothesis specified as (36), the alternative hypothesis would be one of the following three possibilities

H1 : β1 ≠ 0 or

H1 : β1 > 0 or

H1 : β1 < 0,

37

Page 38: Etc 2410 Notes

depending on the practical context. The alternative H1 : β1 ≠ 0 is called a two-sided alternative (falling on both sides of the null hypothesis) while H1 : β1 > 0 and H1 : β1 < 0 are called one-sided alternatives. A one-sided alternative would be specified in situations where the only reasonable or interesting deviations from the null hypothesis lie on one side. In the case of the null hypothesis H0 : β1 = 0 in (35), we might specify H1 : β1 > 0 if the only interest were in the hypothesis that profitable firms reward their CEOs with higher salaries. However there is also a possibility that some less profitable firms might try to improve their fortunes by attempting to attract proven CEOs with offers of a higher salary. With two conflicting stories like this, the sign of the possible relationship would be unclear and we would specify H1 : β1 ≠ 0. One very important point is that we must not use the sign of the sample estimate to specify the alternative hypothesis – the hypothesis testing methodology requires that the hypotheses be specified before looking at any sample information. The hypotheses must be specified on the basis of the practical questions of interest. Both one-sided and two-sided testing will be discussed below.

3.1.3 The null distribution

The idea in hypothesis testing is to make a decision whether or not to reject H0 in favour of H1 on the basis of the evidence in the data. For testing (36), an approach to making this decision can be based on the sampling distribution (32). Specifically, if H0 : β1 = 0 is true then

$$\hat{\beta}_1 \overset{a}{\sim} N\left(0, \omega_{1,n}^2\right),$$

where ω²1,n is given in (33), or (34) in the special case where homoskedasticity can be assumed. This sampling distribution can also be written

$$\frac{\hat{\beta}_1}{\omega_{1,n}} \overset{a}{\sim} N(0,1),$$

which is useful because the distribution on the right hand side is now a very well known one, the standard normal distribution, for which derivations and computations are relatively straightforward. However this expression is not yet usable in practice because ω1,n depends on population expectations (i.e. it contains an E and a var) and is not observable. It can, however, be estimated using

$$\hat{\omega}_{1,n} = \frac{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\hat{u}_i^2}}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad (37)$$

which is obtained from (33) by dropping the outside expectation, replacing var(yi|xi) by the squared residuals û²i, and then taking the square root to turn ω̂²1,n into ω̂1,n. This quantity ω̂1,n is called the standard error of β̂1. It can then be shown (using derivations beyond our scope) that

$$\frac{\hat{\beta}_1}{\hat{\omega}_{1,n}} \overset{a}{\sim} N(0,1).$$

That is, replacing the unknown standard deviation ω1,n with the observable standard error ω̂1,n does not change the approximate distribution of β̂1. However, a better practical approximation is often provided by

$$\frac{\hat{\beta}_1}{\hat{\omega}_{1,n}} \overset{a}{\sim} t_{n-2}, \qquad (38)$$

where tn−2 denotes the t distribution with n − 2 degrees of freedom. For large values of n the tn−2 and N(0, 1) distributions are almost indistinguishable (indeed limn→∞ tn−2 = N(0, 1)) but


for smaller n using (38) can often give a more accurate approximation. Equation (38) provides a practically usable approximate null distribution for β̂1 (it is called the null distribution because we imposed the null hypothesis to obtain $\hat{\beta}_1 \overset{a}{\sim} N(0, \omega_{1,n}^2)$ in the first step above).

If it is known that the conditional distribution of yi given xi is homoskedastic (i.e. that var(yi|xi) is constant) then (34) can be used to justify the alternative estimator

$$\hat{\omega}_{1,n} = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}, \qquad (39)$$

where

$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2$$

is the sample variance of the OLS residuals. In small samples the standard error estimated using (39) may be more precise than that estimated by (37), provided the assumption of homoskedasticity is correct. If the assumption of homoskedasticity is incorrect, however, the standard error in (39) is not valid. In econometrics the standard error in (37) is referred to as "White's standard error" while (39) is referred to as the "OLS standard error". Modern econometric practice is to favour the robustness of (37), and we will generally follow that practice.
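For readers who want to see the two formulas side by side, here is a small Python sketch (simulated data, not the CEO salary data set) that computes the OLS slope together with the White standard error (37) and the OLS standard error (39).

    # Sketch: OLS slope with White (37) and OLS (39) standard errors, simulated data.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    x = rng.normal(size=n)
    y = 1.0 + 0.5 * x + rng.normal(scale=1.0 + 0.5 * np.abs(x), size=n)  # heteroskedastic errors

    xd = x - x.mean()
    b1 = np.sum(xd * (y - y.mean())) / np.sum(xd**2)       # OLS slope
    b0 = y.mean() - b1 * x.mean()
    u = y - b0 - b1 * x                                    # OLS residuals

    se_white = np.sqrt(np.sum(xd**2 * u**2)) / np.sum(xd**2)      # equation (37)
    se_ols = np.sqrt(np.sum(u**2) / (n - 2) / np.sum(xd**2))      # equation (39)
    print(b1, se_white, se_ols)

Because the simulated errors here are heteroskedastic, the two standard errors will generally differ, and only the White version (37) remains valid.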

If it is known that the conditional distribution of yi given xi is both homoskedastic and normally distributed (written yi|xi ∼ N(β0 + β1xi, σ²)) then the null distribution (38) with ω̂1,n given in (39) is exact, no longer an approximation. This is a beautiful theoretical result, but since it is very rarely known in practice that yi|xi ∼ N(β0 + β1xi, σ²), we should acknowledge that (38) is an approximation.

3.1.4 The alternative distribution

If H0 : β1 = 0 is not true then the approximate sampling distribution (32) can be written

$$\hat{\beta}_1 \overset{a}{\sim} \beta_1 + N\left(0, \omega_{1,n}^2\right),$$

which is informal notation that represents a normal distribution with a constant β1 added to it (which is identical to a N(β1, ω²1,n) distribution). Then repeating the steps leading to (38) gives

$$\frac{\hat{\beta}_1}{\omega_{1,n}} \overset{a}{\sim} \frac{\beta_1}{\omega_{1,n}} + N(0,1),$$

and then replacing ω1,n by the standard error ω̂1,n gives

$$\frac{\hat{\beta}_1}{\hat{\omega}_{1,n}} \overset{a}{\sim} \frac{\beta_1}{\hat{\omega}_{1,n}} + t_{n-2}. \qquad (40)$$

This equation says that if the null hypothesis is false then the distribution of the ratio β̂1/ω̂1,n is no longer approximately tn−2, but instead is tn−2 with a constant β1/ω̂1,n added to it. That is, the distribution is shifted (either positively or negatively, depending on the sign of β1) relative to the tn−2 distribution. The difference between (38) under the null and (40) under the alternative provides the basis for the hypothesis test.


3.1.5 Decision rules and the significance level

In hypothesis testing we either reject H0 or do not reject H0 (we don't accept hypotheses, more on this soon). A hypothesis test requires a decision rule that specifies when H0 is to be rejected.

Because we have only partial information, i.e. a random sample rather than the entire population, there is some probability that any decision we make will be incorrect. That is, there is a chance we might reject H0 when H0 is in fact true, which is called a Type I error. There is also a chance that we might not reject H0 when H0 is in fact false, which is called a Type II error. The four possibilities are summarised in this table.

                              Truth in the population
Decision                 H0 true           H0 false
Reject H0                Type I error      Correct
Do not reject H0         Correct           Type II error

Clearly we would like a hypothesis test to minimise the probabilities of both Type I and Type II errors, but there is no unique way of doing this. The convention is to set the significance level of the test to a small fixed probability α, which specifies the probability of a Type I error. The most common choice is α = 0.05, although α = 0.01 and α = 0.10 are sometimes used.

3.1.6 The t test – theory

The t statistic for testing H0 : β1 = 0 is

$$t = \frac{\hat{\beta}_1}{\hat{\omega}_{1,n}}. \qquad (41)$$

From (38) we know that t is approximately distributed as tn−2 if H0 is true, while from (40) we know that t is shifted away from the tn−2 distribution if H0 is false. First consider testing H0 : β1 = 0 against the one-sided alternative H1 : β1 > 0, implying the interesting deviations from the null hypothesis induce a positive shift of t away from the tn−2 distribution. We will therefore define a decision rule based on t that states that H0 is rejected if t takes a larger value than would be thought reasonable from the tn−2 distribution. The way we formalise the statement "t takes a larger value than would be thought reasonable from the tn−2 distribution" is to use the significance level. The decision rule is defined to reject H0 if t takes a larger value than a critical value cα, which is defined by the probability

Pr (tn−2 > cα) = α

for significance level α. The distribution of t under H0 is tn−2, so the value of cα can be computed from the tn−2 distribution, as shown graphically in Figure 32 for α = 0.05 and n − 2 = 30. The critical value in this case is c0.05 = 1.697, which can be found in Table G.2 of Wooldridge (p.833) or computed in Eviews.
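If a statistics package is available, the same critical values can be obtained directly from the tn−2 distribution. A small Python sketch (using scipy, which is assumed to be installed):

    # Sketch: t critical values for 30 degrees of freedom.
    from scipy import stats

    print(stats.t.ppf(0.95, df=30))    # one-sided 5% critical value, approx. 1.697
    print(stats.t.ppf(0.975, df=30))   # two-sided 5% critical value, approx. 2.042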

For testing H0 : β1 = 0 against H1 : β1 < 0, the procedure is essentially a mirror image. The decision rule is to reject H0 if t takes a smaller value than the critical value, which is shown in Figure 33. If cα is the α-significance critical value for testing against H1 : β1 > 0, then −cα is the α-significance critical value for testing against H1 : β1 < 0. That is

Pr (tn−2 < −cα) = α.

The critical value for α = 0.05 and n − 2 = 30 is therefore simply −c0.05 = −1.697.

For testing H0 : β1 = 0 against H1 : β1 ≠ 0, the potentially interesting deviations from the null hypothesis might induce either a positive or negative shift of t away from the tn−2 distribution.


Figure 32: The tn−2 distribution with α = 0.05 critical value for testing H0 : β1 = 0 against H1 : β1 > 0.

Therefore we need to check in either direction. That is, we will reject H0 if either t takes a larger value than considered reasonable for the tn−2 distribution, or a smaller value. The decision rule is to reject H0 if t > cα/2 or t < −cα/2, which can be expressed more simply as |t| > cα/2, where cα/2 satisfies

$$\Pr\left(t_{n-2} > c_{\alpha/2}\right) = \frac{\alpha}{2},$$

or equivalently

$$\Pr\left(|t_{n-2}| > c_{\alpha/2}\right) = \alpha.$$

The critical value for α = 0.05 and n− 2 = 30 is cα/2 = 2.042.

3.1.7 The t test – two sided example

Every hypothesis test needs to specify the following elements:

1. The null hypothesis H0.

2. The alternative hypothesis H1.

3. A significance level α.

4. A test statistic (in this case t, but we will see others soon).

5. A decision rule that states when H0 is rejected.

6. The decision, and its interpretation.

Consider the CEO salary regression, which has PRF

E (Salaryi|RoEi) = β0 + β1RoEi, (42)


Figure 33: The tn−2 distribution with α = 0.05 critical value for testing H0 : β1 = 0 against H1 : β1 < 0.

Figure 34: The tn−2 distribution with α = 0.05 critical value for testing H0 : β1 = 0 against H1 : β1 ≠ 0.


Figure 35: Choosing to use White standard errors that allow for heteroskedasticity

and the hypotheses H0 : β1 = 0 and H1 : β1 ≠ 0, so that we are interested in either positive or negative deviations from the null hypothesis, i.e. any role for firm profitability in predicting CEO salaries, whether positively or negatively. We will choose α = 0.05, which is the default choice unless specified otherwise.

The test statistic will be the t statistic given in (41). This statistic can be computed in Eviews using either of (37) or (39), with the default choice being (39), which imposes the homoskedasticity assumption. This assumption can frequently be violated in practice, and can be tested for, but we will play it safe for now and use the (37) version of ω̂1,n which allows for heteroskedasticity. This requires an additional option to be changed in Eviews. When specifying the regression in Eviews in Figure 16, click on the "Options" tab to reveal the options shown in Figure 35, and select "White" for the coefficient covariance matrix as shown. The resulting regression is shown in Figure 36, with the selection of the appropriate White standard errors highlighted. We now have enough information to carry out the hypothesis test. The details are as follows.

1. H0 : β1 = 0

2. H1 : β1 6= 0

3. Significance level: α = 0.05

4. Test statistic: t = 2.71

5. Reject H0 if |t| > c0.025 = 1.980

6. H0 is rejected, so Return on Equity is a significant predictor for CEO Salary.

The critical value of c0.025 = 1.980 is found from the table of critical values on p.833 of Wooldridge, reproduced in Figure 37. For this regression with n = 209, the relevant t distribution has n − 2 = 207 degrees of freedom. This many degrees of freedom is not included in the table, so we choose the closest degrees of freedom that is less than this number, i.e. 120. The test is two-sided with significance level of α = 0.05, so the critical value of c0.025 = 1.980 can be read from the third column of critical values in the table.


Figure 36: CEO salary regression with White standard errors

3.1.8 The t test – one sided example

The assessment for ETC2410/ETC3440 in semester two of 2013 consisted of 40% assignments during the semester and a 60% final exam. Descriptive statistics for these marks, both expressed as percentages, are shown in Figures 38 and 39. It may be of interest to investigate how well assignment marks earned during the semester predict final exam marks. In particular, we would expect that those students who do better on assignments during the semester will go on to also do better on their final exams. The scatter plot in Figure 40 shows that such a relationship potentially does exist in the data, so we will carry out a formal hypothesis test in a regression.

The PRF has the form

E (exami|asgnmti) = β0 + β1asgnmti, (43)

and we will test H0 : β1 = 0 (that assignment marks have no predictive power for exam marks) against the one-sided alternative H1 : β1 > 0 (that higher assignment marks predict higher exam marks). The estimates are given in Figure 41, in which the SRF is

$$\widehat{exam}_i = \underset{(5.360)}{23.763} + \underset{(0.095)}{0.548}\,asgnmt_i.$$

The numbers in parentheses below the coefficients are the standard errors. This is a common way of reporting an estimated regression equation, since it provides sufficient information for the reader to carry out some inference themselves if they wish. The hypothesis test of interest proceeds as follows.

1. H0 : β1 = 0

2. H1 : β1 > 0

3. Significance level: α = 0.05

4. Test statistic: t = 5.766


Figure 37: Critical values from the t distribution from Wooldridge


Figure 38: Assignment marks for ETC2410 / ETC3440 in semester two of 2013 (histogram; n = 118, mean 59.67, median 61.92, maximum 83.63, minimum 19.25, std. dev. 13.27, skewness −0.74, kurtosis 3.34).

Figure 39: Exam marks for ETC2410 / ETC3440 in semester two of 2013 (histogram; n = 118, mean 56.49, median 56.62, maximum 93.25, minimum 0.00, std. dev. 16.74, skewness −0.79, kurtosis 4.71).

5. Reject H0 if t > c0.05 = 1.662

6. H0 is rejected, so there is evidence that higher assignment marks predict significantly higher final exam marks.

The critical value in this case is found from the table in Figure 37 using 90 degrees of freedom (the closest in the table to n − 2 = 116 in this case) and the column corresponding to the α = 0.05 level of significance for a one-sided test.

3.1.9 p-values

A convenient alternative way to express a decision rule for a hypothesis test is to use p-values rather than critical values, where they are available.

First consider testing H0 : β1 = 0 against H1 : β1 > 0. The critical value for this t test is c0.05 as shown in Figure 32. Recall that c0.05 is defined to satisfy Pr(tn−2 > c0.05) = 0.05, which means that the area under the tn−2 distribution to the right of c0.05 is 0.05. Any value of the test statistic t that falls above c0.05 leads to a rejection of the null hypothesis, and the area under the tn−2 distribution to the right of such a value of t must be less than 0.05.


Figure 40: Scatter plot of exam marks (EXAM, vertical axis) against assignment marks (ASGNMT, horizontal axis).

Figure 41: Regression of exam marks on assignment marks


So instead of defining a decision rule in terms of t > c0.05, we could equivalently define the decision in terms of Pr(tn−2 > t) < 0.05. That is, the decision rules "reject H0 if t > c0.05" and "reject H0 if Pr(tn−2 > t) < 0.05" yield identical tests. Similarly if we are testing H0 : β1 = 0 against H1 : β1 < 0, the decision rules "reject H0 if t < −c0.05" and "reject H0 if Pr(tn−2 < t) < 0.05" yield identical tests.

For the two sided problem H0 : β1 = 0 against H1 : β1 ≠ 0, the decision rule is "reject H0 if |t| > c0.025". Recall that c0.025 is defined to satisfy Pr(tn−2 > c0.025) = 0.025, see Figure 34. The condition |t| > c0.025 therefore implies that Pr(tn−2 > |t|) < 0.025, because |t| is further out into the tail of the tn−2 distribution than c0.025. Multiplying this inequality by 2 gives 2 Pr(tn−2 > |t|) < 0.05, so the critical value decision rule "reject H0 if |t| > c0.025" is equivalent to "reject H0 if 2 Pr(tn−2 > |t|) < 0.05".

It is conventional in econometrics and statistics (and in Eviews!) to define the p-value for a regression t statistic as

p = 2 Pr (tn−2 > |t|) . (44)

Therefore the decision rule for testing H0 : β1 = 0 against H1 : β1 ≠ 0 is

“reject H0 if p < 0.05”,

where p is the value printed out by Eviews under the "Prob." column of the regression output. The two-sided test of the significance of RoEi in the model (42) can be re-expressed in terms of p-values as follows.

1. H0 : β1 = 0

2. H1 : β1 6= 0

3. Significance level: α = 0.05

4. Test statistic: p = 0.0073

5. Reject H0 if p < 0.05

6. H0 is rejected, so Return on Equity is a significant predictor for CEO Salary.

The p-value in item 4 is read directly from the regression output in Figure 36. Clearly having a p-value available makes the hypothesis test more convenient to carry out because it is not necessaryto look up or compute a critical value. The vast majority of hypothesis tests computed in moderneconometrics and statistics software are accompanied by a p-value for easy testing.
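The p-value itself can be reproduced from the t statistic and the degrees of freedom. A small Python sketch using the figures reported for the CEO salary regression (t = 2.71 with n − 2 = 207 degrees of freedom, as read from Figure 36):

    # Sketch: two-sided p-value (44) for the CEO salary example.
    from scipy import stats

    t_stat, df = 2.71, 207
    p = 2 * stats.t.sf(abs(t_stat), df)   # p = 2 Pr(t_{n-2} > |t|)
    print(p)                              # approximately 0.0073, as in Figure 36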

For testing against one-sided alternative hypotheses, a small modification is required. In the introductory discussion it was shown that the decision rule for testing H0 : β1 = 0 against H1 : β1 > 0 is to "reject H0 if Pr(tn−2 > t) < 0.05". If t ≥ 0 then Pr(tn−2 > t) = Pr(tn−2 > |t|) = p/2 using (44). On the other hand if t < 0 then Pr(tn−2 > t) = 1 − Pr(tn−2 > |t|) > 0.5 (by the symmetry of the tn−2 distribution) so H0 will never be rejected if t < 0. This makes intuitive sense since t < 0 can only occur if β̂1 < 0, and an estimate β̂1 < 0 cannot provide evidence to reject H0 : β1 = 0 in favour of H1 : β1 > 0. So the decision rule for testing H0 : β1 = 0 against H1 : β1 > 0 is "reject H0 if t > 0 and p/2 < 0.05", or more simply

“reject H0 if t > 0 and p < 0.10”.

That is, to carry out a one-sided test at the 5% level of significance, the comparison of the p-value is made with 0.10 not 0.05. The reason is that the p-value provided by Eviews is (44), which is for testing against two-sided alternatives.


The decision rule for testing H0 : β1 = 0 against H1 : β1 < 0 is a mirror image of the upper-tailed version, that is

“reject H0 if t < 0 and p < 0.10”,

so that the null is rejected only for negative estimates of β1 whose p-value is less than 0.10.

The one-sided test of the significance of assignment marks in (43) can therefore be re-expressed as follows.

1. H0 : β1 = 0

2. H1 : β1 > 0

3. Significance level: α = 0.05

4. Test statistic: t = 5.766, p = 0.0000

5. Reject H0 if t > 0 and p < 0.10

6. H0 is rejected, so there is evidence that higher assignment marks predict significantly higher final exam marks.

The outcome of a hypothesis test carried out using critical values or p-values will always be the same; the choice comes down to one of convenience. Most often p-values are more convenient and more often used in practice, and we will generally rely on them from now on.

3.1.10 Testing other null hypotheses

By far the most common hypothesis tests in regression models have the null in the form H0 : β1 = 0. However there are other null hypotheses that can also be of interest. For example, in the exam marks application we might want to test whether an extra 1% gained on assignment marks predicts an extra 1% gained on the final exam. In the regression model (43), this would translate to a null hypothesis of the form H0 : β1 = 1.

In general, consider testing a null hypothesis of the form H0 : β1 = b1, where b1 is a specified number (e.g. 0, 1, etc.). The t statistic for testing this null hypothesis is

$$t = \frac{\hat{\beta}_1 - b_1}{\hat{\omega}_{1,n}}. \qquad (45)$$

Obviously this reduces to (41) when b1 = 0. The decision rules presented above remain unchanged, both for critical values and p-values.

To illustrate, consider testing H0 : β1 = 1 against H1 : β1 ≠ 1 in the exam regression (43). From the results in Figure 41 we can calculate

$$t = \frac{0.548 - 1}{0.095} = -4.758. \qquad (46)$$

The hypothesis test using critical values can then proceed as follows.

1. H0 : β1 = 1

2. H1 : β1 6= 1

3. Significance level: α = 0.05


4. Test statistic: t = −4.758

5. Reject H0 if |t| > c0.025 = 1.987.

6. H0 is rejected, so the predicted change in final exam scores corresponding to a 1% higher assignment score is significantly different from 1%.

Note that a two-sided alternative was used in this case because there is no prior expectation before the analysis whether the coefficient should be greater than one or less than one. Having estimated the regression it appears the coefficient is less than one, but we must not use that information to formulate the alternative hypothesis.

In order to avoid re-calculating the t statistic manually as we did above, and also in order to obtain convenient p-values, the regression model can be re-estimated in a form that makes testing a null hypothesis H0 : β1 = b1 very easy. Suppose in general we have a PRF of the form

E (yi|xi) = β0 + β1xi.

Subtracting b1xi from both sides gives

$$E(y_i - b_1 x_i \,|\, x_i) = \beta_0 + (\beta_1 - b_1)x_i = \beta_0 + \beta_1^* x_i.$$

The null hypothesis H0 : β1 = b1 in the original regression of yi on xi can therefore be equivalently re-expressed as H0 : $\beta_1^*$ = 0 in the regression of (yi − b1xi) on xi.
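The equivalence is easy to verify numerically. The following Python sketch (simulated data, not the exam data set) shows that regressing (yi − b1xi) on xi produces a slope equal to β̂1 − b1 with an unchanged standard error, so the t statistic reported for that slope is exactly the statistic (45) for H0 : β1 = b1.

    # Sketch: the re-specification trick for testing H0: beta1 = b1.
    import numpy as np

    rng = np.random.default_rng(2)
    n, b1_null = 100, 1.0
    x = rng.normal(size=n)
    y = 2.0 + 0.6 * x + rng.normal(size=n)

    def slope_and_se(y, x):
        xd = x - x.mean()
        b1 = np.sum(xd * (y - y.mean())) / np.sum(xd**2)
        u = (y - y.mean()) - b1 * xd                               # OLS residuals
        se = np.sqrt(np.sum(u**2) / (len(y) - 2) / np.sum(xd**2))  # OLS standard error (39)
        return b1, se

    b1, se = slope_and_se(y, x)
    b1_star, se_star = slope_and_se(y - b1_null * x, x)
    print((b1 - b1_null) / se, b1_star / se_star)   # identical t statistics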

For testing H0 : β1 = 1 against H1 : β1 ≠ 1 in the exam regression (43), the PRF is re-written

$$E(exam_i - asgnmt_i \,|\, asgnmt_i) = \beta_0 + (\beta_1 - 1)\,asgnmt_i = \beta_0 + \beta_1^*\,asgnmt_i.$$

The results from regressing (exami − asgnmti) on asgnmti are given in Figure 42, which shows that $\hat{\beta}_1^*$ = −0.452 with t = −4.746. (This latter t statistic differs from (46) only because of the rounding error induced by the calculation in (46) being carried out using three decimal places in the numerator and denominator. Without this rounding error, the two would be identical.) The hypothesis test in terms of p-values then proceeds as follows.

1. H0 : β1 = 1

2. H1 : β1 6= 1

3. Significance level: α = 0.05

4. Test statistic: p = 0.0000

5. Reject H0 if p < 0.05

6. H0 is rejected, so the predicted change in final exam scores corresponding to a 1% higher assignment score is significantly different from 1%.

The same conclusions will always be found from this approach (re-specifying the regression) and the previous approach that manually computes the t statistic and uses a critical value. It will usually be more convenient in practice to re-specify the regression and use the p-value that is then automatically provided.


Figure 42: Regression of (exami−asgnmti) on asgnmti

3.2 Confidence intervals

Confidence intervals provide an alternative method for summarising the uncertainty due to sampling in coefficient estimates. A confidence interval is a pair of numbers that form an interval within which the true value of the parameter is contained with a pre-specified probability. This probability, called the confidence level, is typically chosen to be 1 − α, where α is the usual significance level used in hypothesis tests. So, for a regression coefficient β1, the aim is to find numbers $\underline{\beta}_1$ and $\overline{\beta}_1$ such that

$$\Pr\left(\underline{\beta}_1 \le \beta_1 \le \overline{\beta}_1\right) = 1 - \alpha. \qquad (47)$$

The derivation of the confidence interval follows from hypothesis tests of the form H0 : β1 = b1 against H1 : β1 ≠ b1. If we imagine testing these hypotheses for all possible values of b1, the confidence interval is formed by those values of b1 for which H0 : β1 = b1 is not rejected using a two-sided t test with significance level α. To show where this leads, for any b1 the null hypothesis H0 : β1 = b1 is not rejected if the t statistic in (45) satisfies |t| ≤ cα/2, which implies

$$-c_{\alpha/2} \le \frac{\hat{\beta}_1 - b_1}{\hat{\omega}_{1,n}} \le c_{\alpha/2}.$$

These two inequalities can be re-arranged to give

$$\hat{\beta}_1 - c_{\alpha/2}\,\hat{\omega}_{1,n} \le b_1 \le \hat{\beta}_1 + c_{\alpha/2}\,\hat{\omega}_{1,n}. \qquad (48)$$

That is, H0 : β1 = b1 will not be rejected for all b1 in the interval

$$\left[\underline{\beta}_1, \overline{\beta}_1\right] = \left[\hat{\beta}_1 - c_{\alpha/2}\,\hat{\omega}_{1,n},\; \hat{\beta}_1 + c_{\alpha/2}\,\hat{\omega}_{1,n}\right], \qquad (49)$$

which is the desired confidence interval. It has the desired level α because when b1 is the true value of the parameter, the null hypothesis H0 : β1 = b1 is rejected with probability α (this is the definition of the significance level of the test), which implies that it is not rejected with probability


1 − α. Therefore the true value β1 is included in the confidence interval (49) with probability 1 − α as required.

To illustrate, consider a confidence interval for the slope coefficient in the salary PRF (42). From the results in Figure 36 we see that β̂1 = 18.501 and ω̂1,n = 6.829. The critical value for a two-sided t test with significance level α = 0.05 is c0.025 = 1.980. The 95% confidence interval for β1 is therefore

$$\left[\underline{\beta}_1, \overline{\beta}_1\right] = [18.501 - 1.980 \times 6.829,\; 18.501 + 1.980 \times 6.829] = [4.980, 32.022]. \qquad (50)$$

The interpretation of this interval is that it contains the true value of β1 with probability 95%. (In fact this probability of 95% is an approximation because the distribution of t in (38) on which it is based is also approximate. In practice though we usually just talk about a "95% confidence interval", rather than an approximate or asymptotic 95% confidence interval.) The 95% confidence interval (or "interval estimate") of the coefficient implies that an increase of 1% in a firm's Return on Equity predicts an increase in CEO salary of between $4,980 and $32,022.
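The arithmetic behind (50) is simple enough to check by hand, or in a couple of lines of Python using the estimate and standard error reported in Figure 36:

    # Sketch: 95% confidence interval (50) from the reported estimate and standard error.
    b1_hat, se, c = 18.501, 6.829, 1.980
    print(b1_hat - c * se, b1_hat + c * se)   # approximately (4.980, 32.022)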

A confidence interval provides a convenient and informative way to report the findings of a regression. The mid-point of the interval is the point estimate β̂1, while its width represents how much uncertainty there is about the estimate. A narrow confidence interval implies the sample has provided a precise estimate of the coefficient. The width of the confidence interval is determined by the standard error ω̂1,n, so a small standard error implies a precise estimate and a narrow confidence interval.

From a hypothesis testing perspective, the confidence interval provides a nice summary of all the null hypotheses that would not be rejected by a two-sided t test (those values within the interval) and all of the null hypotheses that would be rejected (those values outside the interval). Clearly this is much more informative than simply reporting a coefficient estimate and whether or not it is significantly different from zero (which does happen sometimes...). A confidence interval that does not include zero, such as (50) above, immediately conveys the information that the coefficient estimate is significantly different from zero, but it contains much more information as well.

These ideas also emphasise why in a hypothesis test we never claim to "accept H0", we only say that we "do not reject H0". Consider the confidence interval $[\underline{\beta}_1, \overline{\beta}_1]$ = [4.980, 32.022] constructed above. This implies that H0 : β1 = b1 would not be rejected for all b1 between 4.980 and 32.022. It would be illogical to say that we accept H0 : β1 = 5 and H0 : β1 = 10 and H0 : β1 = 25 and so on; we cannot accept that β1 is equal to several different values at once! Instead we say that the sample does not provide sufficient evidence to reject those values at the specified level of significance.

3.3 Prediction intervals

Suppose we want to make a prediction for yi for a particular fixed value x of xi. For example, to predict average CEO salary for Return on Equity of x = 15%, or final exam marks for an assignment mark of x = 75%. The prediction is given by

$$\hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1 x, \qquad (51)$$

and this can be taken as an estimator of the true value

µy (x) = E (yi|xi = x) = β0 + β1x.


Just like a confidence interval for the population parameter β1, a prediction interval can be calculated for the population conditional mean E(yi|xi = x), i.e. an interval $[\underline{\mu}_y(x), \overline{\mu}_y(x)]$ such that

$$\Pr\left(\underline{\mu}_y(x) \le \mu_y(x) \le \overline{\mu}_y(x)\right) = 1 - \alpha,$$

compare to (47) for β1. The distribution of ŷ(x) as an estimator of µy(x) can be derived to be

$$\hat{y}(x) \overset{a}{\sim} N\left(\mu_y(x), \omega_{n,\mu}^2\right),$$

where

$$\omega_{n,\mu}^2 = E\left[\sum_{i=1}^{n}\left(\frac{1}{n} + (x-\bar{x})\,a_{n,i}\right)^2 \mathrm{var}(y_i|x_1,\ldots,x_n)\right],$$

and an,i was given in (24). This leads to the prediction interval

$$\left[\underline{\mu}_y(x), \overline{\mu}_y(x)\right] = \left[\hat{y}(x) - c_{\alpha/2}\,\hat{\omega}_{n,\mu},\; \hat{y}(x) + c_{\alpha/2}\,\hat{\omega}_{n,\mu}\right], \qquad (52)$$

where

$$\hat{\omega}_{n,\mu}^2 = \sum_{i=1}^{n}\left(\frac{1}{n} + (x-\bar{x})\,a_{n,i}\right)^2 \hat{u}_i^2.$$

Fortunately there is a convenient way to calculate ω̂²n,µ without dealing with the formula. If we take the usual SRF $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and subtract the prediction formula at x given by (51), we obtain

$$\hat{y}_i = \hat{y}(x) + \hat{\beta}_1(x_i - x).$$

This shows that an OLS regression of yi on an intercept and (xi − x) will provide an intercept that corresponds to ŷ(x), and then the ω̂n,µ required for the prediction interval is simply the standard error on this intercept estimate.

As an example, consider making a prediction for the average final exam mark for an assignment mark of x = 75%. A regression in Eviews specified as "exam c (asgnmt-75)" will produce an intercept corresponding to ŷ(75). The Eviews output is shown in Figure 43. The prediction is ŷ(75) = 64.90%, with standard error ω̂n,µ = 2.34. The 95% prediction interval based on (52) is therefore

$$\left[\underline{\mu}_y(75), \overline{\mu}_y(75)\right] = [64.90 - 1.987 \times 2.34,\; 64.90 + 1.987 \times 2.34] = [60.25, 69.55],$$

where c0.025 = 1.987 is obtained from the t distribution table with 90 degrees of freedom (the closest to n − 2 = 116 in this example). The interpretation of this interval is that it contains the population conditional mean µy(75) = E(exami|asgnmti = 75) with probability of 95%.
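The centring trick and the formula for ω̂n,µ can be checked numerically. The Python sketch below uses simulated data (not the actual exam marks); it computes the prediction and its standard error from the formula, and confirms that the intercept of the regression on the centred regressor reproduces the prediction.

    # Sketch: prediction at x_star via the formula and via the centred regression.
    import numpy as np

    rng = np.random.default_rng(3)
    n, x_star = 120, 75.0
    x = rng.uniform(20, 90, size=n)
    y = 20.0 + 0.5 * x + rng.normal(scale=10.0, size=n)

    xd = x - x.mean()
    b1 = np.sum(xd * (y - y.mean())) / np.sum(xd**2)
    b0 = y.mean() - b1 * x.mean()
    u = y - b0 - b1 * x
    a = xd / np.sum(xd**2)                          # the weights a_{n,i}

    y_hat_star = b0 + b1 * x_star                   # prediction (51)
    w = 1.0 / n + (x_star - x.mean()) * a
    se_pred = np.sqrt(np.sum(w**2 * u**2))          # estimate of omega_{n,mu}

    # centred regression: the slope is unchanged and the intercept equals the prediction
    xc = x - x_star
    b0_c = y.mean() - b1 * xc.mean()
    print(y_hat_star, b0_c, se_pred)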

3.3.1 Derivations

These derivations of the distribution of the prediction follow easily from the preceding derivations we did for ȳ and β̂1, but this subsection is not required for the course.

First recall the representation

$$\hat{\beta}_1 = \sum_{i=1}^{n} a_{n,i}\, y_i,$$


Figure 43: Predicting final exam mark for an assignment mark of 75%

where

$$a_{n,i} = \frac{x_i - \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})^2},$$

in which $\sum_{i=1}^{n} a_{n,i} = 0$ and $\sum_{i=1}^{n} a_{n,i}\,x_i = 1$. This can be used to give a representation for β̂0:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = \sum_{i=1}^{n}\left(\frac{1}{n} - \bar{x}\,a_{n,i}\right) y_i$$

and substituting for β̂0 and β̂1 into (51) gives

$$\hat{y}(x) = \sum_{i=1}^{n}\left(\frac{1}{n} + (x-\bar{x})\,a_{n,i}\right) y_i,$$

which shows that ŷ(x) is a weighted sum of y1, . . . , yn. Its mean and variance can therefore be derived in the same way as we did for β̂1.

The mean of ŷ(x) is

$$\begin{aligned} E[\hat{y}(x)] &= E\left[\sum_{i=1}^{n}\left(\frac{1}{n} + (x-\bar{x})\,a_{n,i}\right) E(y_i|x_1,\ldots,x_n)\right] && \text{by the LIE} \\ &= E\left[\sum_{i=1}^{n}\left(\frac{1}{n} + (x-\bar{x})\,a_{n,i}\right)(\beta_0 + \beta_1 x_i)\right] && \text{substituting the PRF} \\ &= E\left[\beta_0 + \beta_1\bar{x} + \beta_1(x-\bar{x})\right] && \text{using } \textstyle\sum_{i=1}^{n} a_{n,i} = 0,\ \sum_{i=1}^{n} a_{n,i} x_i = 1 \\ &= \beta_0 + \beta_1 x \\ &= \mu_y(x), \end{aligned}$$

so ŷ(x) is an unbiased estimator of µy(x).


The variance is

$$\omega_{n,\mu}^2 = \mathrm{var}(\hat{y}(x)) = E\left[\sum_{i=1}^{n}\left(\frac{1}{n} + (x-\bar{x})\,a_{n,i}\right)^2 \mathrm{var}(y_i|x_1,\ldots,x_n)\right],$$

using LIEvar, which can be estimated by

$$\hat{\omega}_{n,\mu}^2 = \sum_{i=1}^{n}\left(\frac{1}{n} + (x-\bar{x})\,a_{n,i}\right)^2 \hat{u}_i^2,$$

where ûi are the OLS residuals. The approximate normality of ŷ(x) follows from the Central Limit Theorem.

4 Multiple Regression

An extremely useful feature of regression modelling is that it easily allows for the inclusion ofmore than one explanatory variable. This is very useful for interpreting the roles of individualexplanatory variables and potentially for improving predictions. The techniques for OLS esti-mation and inference that we have discussed for simple regression extend straightforwardly tothe multiple regression setting. The models and methods will be discussed here, with formulaepostponed until the section on matrix notation for regression.

4.1 Population Regression Function

A linear PRF with multiple explanatory variables x1,i, . . . , xk,i takes the form

E (yi|x1,i, . . . , xk,i) = β0 + β1x1,i + . . .+ βkxk,i. (53)

That is, the population conditional mean of yi given x1,i, . . . , xk,i is specified as a weighted sum of x1,i, . . . , xk,i.

The interpretation of the coefficients β1, . . . , βk is similar to that in a simple regression, with an important qualification. To interpret β1, consider the predicted value of yi with x1,i increased by one unit and with x2,i, . . . , xk,i unchanged:

E (yi|x1,i + 1, . . . , xk,i) = β0 + β1 (x1,i + 1) + . . .+ βkxk,i.

Then

$$E(y_i|x_{1,i}+1, \ldots, x_{k,i}) - E(y_i|x_{1,i}, \ldots, x_{k,i}) = \beta_1,$$

so that we interpret β1 as the change in the prediction of yi corresponding to a one unit increase in x1,i, holding x2,i, . . . , xk,i constant. This aspect of holding all of the other explanatory variables constant leads to the regression coefficient being called a marginal effect or partial effect. In general, for any j = 1, . . . , k, the parameter βj is the change in the predicted value of yi corresponding to a one unit increase in xj,i, holding xh,i constant for all h ≠ j.

The intercept β0 only has a meaningful interpretation if it makes sense for all of x1,i, . . . , xk,i to take the value zero. In that case β0 is the predicted value of yi when x1,i = . . . = xk,i = 0.

4.2 Sample Regression Function and OLS

The SRF that estimates the PRF in (53) is

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1,i} + \ldots + \hat{\beta}_k x_{k,i}, \qquad (54)$$


where β̂0, β̂1, . . . , β̂k are the values that minimise the sum of squared residuals

$$SSR(b_0, b_1, \ldots, b_k) = \sum_{i=1}^{n}\left(y_i - b_0 - b_1 x_{1,i} - \ldots - b_k x_{k,i}\right)^2.$$

The separate formulae for β̂0, β̂1, . . . , β̂k are messy and omitted for now, but can easily be expressed in matrix notation later. The OLS residuals are denoted

$$\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \ldots - \hat{\beta}_k x_{k,i}.$$

The R² for the regression is

$$R^2 = \frac{SSE}{SST} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2},$$

which has the same derivation, properties and interpretation as the R² in a simple regression. That is, R² measures the proportion of the variance in yi explained by the regression.

4.3 Example: house price modelling

An example data set from Chapter 4 of Wooldridge contains the following data on house prices and explanatory variables.

price : selling price of the house ($'000)
assess : assessed value prior to sale ($'000)
lotsize : size of the block in square feet
sqrft : size of the house in square feet
bdrms : number of bedrooms

The histogram and descriptive statistics for the dependent variable price are shown in Figure 44. It may be expected that increases in each of the explanatory variables assess, lotsize, sqrft, bdrms would predict an increase in the selling price of a house. The PRF in this case is

E (pricei|assessi, lotsizei, sqrfti, bdrmsi) = β0+β1assessi+β2lotsizei+β3sqrfti+β4bdrmsi. (55)

The specification of a multiple regression in Eviews simply involves a list of variables as shown in Figure 45, with the dependent variable price first, followed by the explanatory variables. The results are shown in Figure 46. The SRF can be written

$$\widehat{price}_i = \underset{(23.77)}{-38.89} + \underset{(0.119)}{0.908}\,assess_i + \underset{(0.000210)}{0.000587}\,lotsize_i - \underset{(0.0174)}{0.000517}\,sqrft_i + \underset{(5.55)}{11.60}\,bdrms_i$$

$$n = 88, \quad R^2 = 0.83$$

The intercept of β̂0 = −38.89 has no meaningful interpretation since none of the explanatory variables would reasonably take the value zero. The slope coefficients are interpreted as follows.

1. β1 = 0.908 : an increase in assessed value of a house of $1,000 predicts an increase in thesale price of $908, holding lot size, house size and number of bedrooms fixed. That is, thecoeffi cient measures the effect of variations in assessed value for a house of particular size.It is therefore capturing variations in other aspects that affect the price of the house besidesits size, for example, its kitchen and bathroom quality, its suburb, proximity to transport,shops, schools and major roads, architectural style, renovated or not, and so on.


Figure 44: Histogram and descriptive statistics of house price data (n = 88, mean 293.55, median 265.50, maximum 725.00, minimum 111.00, std. dev. 102.71, skewness 2.00, kurtosis 8.39).

2. β2 = 0.000587 : each extra square foot of lot size predicts an increase in sale price of 58.7cents, holding the other explanatory variables fixed. The interpretation could equivalentlybe expressed as saying that an extra 1000 square feet of lot size predicts an increase in saleprice of $587, which may make the magnitudes more relevant. Note that this coeffi cientmeasures the effect of lot size on average sale price while holding house size and bedroomsfixed. That is, it measures the effect of a larger lot for a house of a given size. It does notmeasure the effect of a larger lot size with a larger house on it. It isolates the effect of lotsize alone.

3. β̂3 = −0.000517 : each extra square foot of house size predicts a decrease in sale price of 51.7 cents, holding the other explanatory variables fixed. This finding is highly counter-intuitive, but when we look at the t test in this regression it will be seen that the coefficient is not significantly different from zero, so this interpretation can be ignored.

4. β4 = 11.60 : each extra bedroom in a house predicts an increase in sale price of $11,600.Note that this interpretation holds house size constant, so it is specifically measuring theeffect of number of bedrooms, not overall size of house. Generally these two variables wouldbe positively related (a correlation of 0.53 in this sample) but the multiple regression allowstheir effects to be estimated separately.

The regression has an R² of 83% and so explains a high proportion of the variation in selling prices of houses in this sample.

4.4 Statistical Inference

The derivations of OLS properties in multiple regression are simple in matrix notation, but messy otherwise. For now they are simply stated. If $(y_i, x_i)_{i=1}^n$ are i.i.d. and the PRF is given by (53), each OLS coefficient β̂j for j = 0, 1, . . . , k is unbiased and satisfies

$$\hat{\beta}_j \overset{a}{\sim} N\left(\beta_j, \omega_{j,n}^2\right), \qquad (56)$$

where ω²j,n is a variance that depends on the conditional variance var(yi|x1,i, . . . , xk,i). The implications are the same as in the simple regression.


Figure 45: Specifying the multiple regression for house prices in Eviews

Figure 46: Results for house price multiple regression


To carry out a hypothesis test of H0 : βj = bj for any given bj the t statistic is

$$t = \frac{\hat{\beta}_j - b_j}{\hat{\omega}_{j,n}},$$

where ω̂j,n is the standard error of β̂j that is computed to estimate ωj,n. As in simple regressions, the computation can be done imposing homoskedasticity (OLS standard errors) or allowing for heteroskedasticity (White's standard errors). The approximate null distribution of this statistic can be derived from (56) and is given by

$$t \overset{a}{\sim} t_{n-k-1},$$

which is the t distribution with n − k − 1 degrees of freedom. The degrees of freedom in a multiple regression is the sample size less the number of regression coefficients estimated. The decision rules for a hypothesis test at the α = 0.05 significance level are summarised in the following table, in which c0.025 and c0.05 are critical values from the tn−k−1 distribution.

                      Rejection rule for H0 : βj = bj
                      Critical value        p-value
H1 : βj ≠ bj          |t| > c0.025          p < 0.05
H1 : βj > bj          t > c0.05             t > 0 and p < 0.10
H1 : βj < bj          t < −c0.05            t < 0 and p < 0.10

A 95% confidence interval for a parameter βj is given by

$$\left[\underline{\beta}_j, \overline{\beta}_j\right] = \left[\hat{\beta}_j - c_{\alpha/2}\,\hat{\omega}_{j,n},\; \hat{\beta}_j + c_{\alpha/2}\,\hat{\omega}_{j,n}\right],$$

where again c0.025 is the critical value for the tn−k−1 distribution.

To make a prediction from a multiple regression, values x1, . . . , xk need to be specified for the explanatory variables. Then

$$\hat{y}(x_1, \ldots, x_k) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \ldots + \hat{\beta}_k x_k.$$

Subtracting this from (54) and rearranging gives

$$\hat{y}_i = \hat{y}(x_1, \ldots, x_k) + \hat{\beta}_1(x_{1,i} - x_1) + \ldots + \hat{\beta}_k(x_{k,i} - x_k). \qquad (57)$$

That is, ŷ(x1, . . . , xk) can be calculated as the intercept in a regression of yi on an intercept and (x1,i − x1), . . . , (xk,i − xk).

4.5 Applications to house price regression

The significance of the regression coefficients reported in Figure 46 can be tested using t tests. Here is the test of whether increased lot size predicts increased sale price.

1. H0 : β2 = 0

2. H1 : β2 > 0

3. α = 0.05


4. Test statistic : t = 2.80, p = 0.0064

5. Decision rule : reject H0 if t > 0 and p < 0.10

6. Reject H0, so increased lot size does predict increased selling price, holding the other three regressors fixed.

The same analysis shows that assessed value and number of bedrooms also predict increasedselling price. However the house size (sqrft), with p value of 0.9764, is not significant – theimplication is that once we control for the size of the block the house is on and the number ofbedrooms, the overall size of the house itself has no further predictive power for the selling price.

It may be of interest to test the null hypothesis H0 : β1 = 1. Under this null we could take the assessed value as being an unbiased predictor¹ of the sale price, in the sense that changes in the assessed value would be matched one-for-one by changes in the predictor of the sale price. The test is as follows.

1. H0 : β1 = 1

2. H1 : β1 6= 1

3. α = 0.05

4. Test statistic : t = (0.908− 1) /0.119 = −0.773

5. Decision rule : reject H0 if |t| > c0.025 = 2.000

6. Do not reject H0, so there is no evidence to suggest that assessed value is not an unbiased predictor of the sale price.

A 95% confidence interval can be constructed for β4 in order to give an interval estimate of the contribution of each bedroom to the selling price. The calculation is

$$\left[\underline{\beta}_4, \overline{\beta}_4\right] = \left[\hat{\beta}_4 - c_{0.025}\,\hat{\omega}_{4,n},\; \hat{\beta}_4 + c_{0.025}\,\hat{\omega}_{4,n}\right] = [11.602 - 2.000 \times 5.552,\; 11.602 + 2.000 \times 5.552] = [0.498, 22.706],$$

which states that the predicted increase in selling value corresponding to an extra bedroom lies in the interval [$498, $22,706] with confidence level of 95%.

Suppose we want to predict the selling price of a four bedroom house with assessed value of $350,000, lot size of 6000 square feet and house size of 2000 square feet. Following (57), we specify the SRF

$$\widehat{price}_i = \widehat{price}(350, 6000, 2000, 4) + \hat{\beta}_1(assess_i - 350) + \hat{\beta}_2(lotsize_i - 6000) + \hat{\beta}_3(sqrft_i - 2000) + \hat{\beta}_4(bdrms_i - 4).$$

The specification of this SRF in Eviews is shown in Figure 47, with results in Figure 48. This gives $\widehat{price}(350, 6000, 2000, 4) = 327.913$, or a predicted selling price of $327,913. The 95% prediction interval is

$$\widehat{price}(350, 6000, 2000, 4) \pm c_{0.025}\,\hat{\omega}_{n,\mu} = [327.913 - 2.000 \times 8.035,\; 327.913 + 2.000 \times 8.035] = [311.843, 343.983],$$

so the predicted selling price lies between $311,843 and $343,983 with confidence level of 95%.

¹ "Unbiased" has a different meaning in prediction, as opposed to statistical estimation.


Figure 47: Specification of the regression for prediction of the selling price of a four bedroom house with assessed value of $350,000, lot size of 6000 square feet, house size of 2000 square feet

Figure 48: OLS regression for predicting the selling price of a four bedroom house with assessed value of $350,000, lot size of 6000 square feet, house size of 2000 square feet


4.6 Joint hypothesis tests

In multiple regressions it can be interesting to test hypotheses about more than one coefficient at a time. The most common example is to jointly test that all slope coefficients are equal to zero, which implies that none of the explanatory variables have any predictive power for the dependent variable. The null hypothesis in (53) takes the form

H0 : β1 = β2 = . . . = βk = 0,

i.e. all k slope coefficients are set to zero. The alternative hypothesis is

H1 : at least one of β1, . . . , βk not equal to 0,

which covers the possibilities that one or some or all of the slope coefficients are not zero. The alternative hypothesis implies that the regression provides some explanatory power for yi. The most common way of testing this null is using an F test. The test statistic is

$$F = \frac{(SSR_0 - SSR_1)/k}{SSR_1/(n-k-1)},$$

where SSR0 is the sum of squared residuals from the SRF under H0 :

$$\hat{y}_i = \hat{\beta}_0$$

and SSR1 is the sum of squared residuals from the SRF under H1 :

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1,i} + \ldots + \hat{\beta}_k x_{k,i}.$$

The null distribution of the F statistic is an Fk,n−k−1 distribution; that is, an F distribution with k and n − k − 1 degrees of freedom. Tables of critical values are provided for this distribution in Wooldridge, but Eviews provides convenient p-values. For the regression results for house prices in Figure 46, the F statistic is reported (F = 100.7409) along with its p-value (p = 0.0000). The test proceeds as follows.

1. H0 : β1 = β2 = β3 = β4 = 0

2. H1 : at least one of β1, β2, β3, β4 not equal to zero

3. α = 0.05

4. Test statistic : p = 0.0000

5. Decision rule : reject H0 if p < 0.05.

6. Reject H0, at least one of the regressors has significant explanatory power for housing prices.

This F test is very convenient and hence popular for this hypothesis, but unfortunately it is not valid in the presence of heteroskedasticity.
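For reference, the homoskedasticity-based F statistic is straightforward to compute from the restricted and unrestricted sums of squared residuals. A Python sketch with simulated data (the house price data themselves are not reproduced here):

    # Sketch: F test of H0: all slope coefficients are zero.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 88
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])   # intercept + k = 4 regressors
    y = X @ np.array([1.0, 0.5, 0.0, 0.3, -0.2]) + rng.normal(size=n)

    k = X.shape[1] - 1
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr1 = np.sum((y - X @ beta_hat)**2)     # unrestricted SSR
    ssr0 = np.sum((y - y.mean())**2)         # restricted SSR (intercept-only model)

    F = ((ssr0 - ssr1) / k) / (ssr1 / (n - k - 1))
    p = stats.f.sf(F, k, n - k - 1)          # p-value from the F(k, n-k-1) distribution
    print(F, p)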

Just as White’s standard errors can be used to construct a t test that is valid in the presenceof heteroskedasticity, there is a modification of the F test to allow for heteroskedasticity. Theformula is not easily expressed without matrix notation, but the implementation in Eviews isstraightforward. To test H0 : β1 = β2 = β3 = β4 = 0 in (55), make sure the regression has beenestimated using White’s standard errors, and then select “View -- Coefficient Diagnostics-- Wald Test - Coefficient Restrictions...” as shown in Figure 49. In the subsequentdialogue box shown in Figure 50, specify the null hypothesis for the test. Eviews uses the syntax


Figure 49: Selecting a Wald test in a regression equation.

Figure 50: Specifying the null hypothesis for a Wald test


Figure 51: Results of the Wald F test on the house price regression.

c(1), c(2),. . . corresponding to our regression coeffi cients β0, β1, . . ., so the null hypothesis β1 =β2 = β3 = β4 = 0 is entered as shown in the Figure. The results of the test are shown in Figure51. The heteroskedasticity-robust F statistic is F = 55.90, with p = 0.0000. The presentationand outcome of the test is therefore unchanged from that given preceding this paragraph, but atleast now we know that the result is still valid even if there is heteroskedasticity.

It is possible to test other joint hypotheses as well. For example, in the housing regression (55)we might be interested in testing H0 : β2 = β3 = β4 = 0, which would imply that the assessedvalue of the house fully takes into account all of the information about the size of the house andits block. That is, under H0 the PRF would be

E (pricei|assessi, lotsizei, sqrfti, bdrmsi) = β0 + β1assessi,

which states that once we have the assessor’s valuation, there is no extra explanatory power inthe block size, house size, or number of bedrooms. This could be taken as a test of the effi ciencyof the assessor’s valuation. The Wald test is carried out in Eviews following the same steps as inFigures 49 and 50, except that the hypothesis is now entered as only “c(3)=0, c(4)=0, c(5)=0”.The results are given in Figure 52, resulting in the following hypothesis test.

1. H0 : β2 = β3 = β4 = 0,

2. H1 : at least one of β2, β3, β4 not equal to zero

3. α = 0.05

4. Test statistic : p = 0.0013

5. Decision rule : reject H0 if p < 0.05.

6. Reject H0, so the size of the house and block have significant predictive power for house prices in addition to that in the assessed value. This is evidence that the assessed value does not efficiently capture all information about the pricing of the house.

The question might be raised: why do we need this joint test of β2, β3, β4 when we can already see from the individual t tests that β2 and β4 are significantly different from zero? The answer to this lies in the significance level α. In hypothesis testing, we aim to make a decision about a hypothesis with probability of Type I error equal to α. If we do three separate t tests to test a single hypothesis about three coefficients then each of these t tests has a significance level of α, so the three of them together have a significance level that will be greater than α. Intuitively there are three opportunities for this procedure to make a Type I error instead of just one. So in order to test a hypothesis about three coefficients and keep the significance level controlled at α, it is necessary to do a single test (a Wald test) and not three separate tests.
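A stylised calculation illustrates the size of the problem: if the three t tests were independent (which they are not exactly, but the point is the same), the probability of at least one Type I error across three separate 5% tests would be 1 − 0.95³ ≈ 0.14, nearly three times the nominal significance level. A single Wald test keeps the overall probability of a Type I error at α.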


Figure 52: Wald test of H0 : β2 = β3 = β4 = 0 in the house price regression

As a final example of a joint test, we can combine the hypotheses about the unbiasedness and efficiency of the assessed value of the house as a predictor of the selling price. We can test the null hypothesis

H0 : β1 = 1 and β0 = β2 = β3 = β4 = 0,

under which the PRF reduces to

E (pricei|assessi, lotsizei, sqrfti, bdrmsi) = assessi,

which states that the assessor’s value is an unbiased predictor of the selling price (adjusts one-for-one and is not systematically too high or low) and is effi cient is the sense of capturing all ofthe size characteristics of the house. The alternative hypothesis is

H1 : β1 ≠ 1 and/or at least one of β0, β2, β3, β4 not equal to zero.

The Wald test in Eviews is carried out by specifying the null hypothesis as in Figure 53 to give the results in Figure 54. The hypothesis test is therefore done as follows.

1. H0 : β1 = 1, β0 = β2 = β3 = β4 = 0,

2. H1 : β1 ≠ 1 and/or at least one of β0, β2, β3, β4 not equal to zero.

3. α = 0.05

4. Test statistic : p = 0.0000

5. Decision rule : reject H0 if p < 0.05.

6. Reject H0, so that joint unbiasedness and efficiency of the assessor's value as a predictor of selling price is rejected.

4.7 Multicollinearity

In multiple regression there are potential issues associated with the degree of correlation between the explanatory variables. These go under the general heading of "multicollinearity", which refers to linear relationships among the explanatory variables.

4.7.1 Perfect multicollinearity

Perfect multicollinearity means that there is an exact linear relationship among some or all of the explanatory variables. This makes the computation of the OLS estimator impossible and Eviews will return an error message if this is attempted.


Figure 53: Specifying a Wald test of H0 : β1 = 1 and β0 = β2 = β3 = β4 = 0

Figure 54: Wald test of H0 : β1 = 1 and β0 = β2 = β3 = β4 = 0 in the house price equation

A version of the problem can occur in a simple regression if there is no variation in the explanatory variable xi. Suppose we want to estimate the PRF

E (yi|xi) = β0 + β1xi,

but we have an unfortunate (or ill-designed) sample in which x1 = x2 = . . . = xn = c for some constant c. For example, we might want to regress yi = wagei on xi = agei, but in our sample every individual is the same age. Without variation in age in our sample, we can't expect to be able to estimate the effect of variations in age on wages. In the formula for the OLS estimator β̂1,

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},$$

note that xi = c for every i implies that x̄ = c and hence that xi − x̄ = 0 for every i, so that

$$\hat{\beta}_1 = \frac{0}{0},$$

which is undefined. Even in more complicated multiple regression cases, perfect multicollinearity induces this sort of "divide by zero" problem for the OLS estimator.
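The problem can also be seen as a rank deficiency of the matrix of regressors, which is essentially what Eviews is detecting when it refuses to estimate. A small Python sketch with simulated data (the variable names mirror the example below and are illustrative only):

    # Sketch: an exact linear combination makes the regressor matrix rank deficient.
    import numpy as np

    rng = np.random.default_rng(6)
    n = 88
    lotsize = rng.uniform(5000, 12000, size=n)
    sqrft = rng.uniform(1500, 3000, size=n)
    garden = lotsize - sqrft                       # exact linear combination

    X = np.column_stack([np.ones(n), lotsize, sqrft, garden])
    print(np.linalg.matrix_rank(X))                # 3 rather than 4: X'X is singular
    # Inverting X'X (which OLS requires) fails or is numerically meaningless here.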

To illustrate in the house price example, suppose we wanted to include the size of the garden as a possible predictor of selling price. The size of the garden can be taken to be

gardeni = lotsizei − sqrfti,

that is, that part of the block not taken up by the house. The PRF

E (pricei|assessi, lotsizei, sqrfti,bdrmsi) = β0+β1assessi+β2lotsizei+β3sqrfti+β4bdrmsi+β5gardeni


Figure 55: Generating the garden variable

is then subject to the perfect multicollinearity problem because of the perfect linear relationshipbetween gardeni, lotsizei and sqrfti. To see what happens in Eviews if we try to estimate thisregression, we first generate the gardeni variable as in Figure 55 and then specify the regressionas in Figure 56. Attempting to estimate this equation gives the error message shown in Figure57.

Perfect multicollinearity can easily be fixed by removing one or more explanatory variables until the problem disappears. In the example, any one of gardeni, lotsizei or sqrfti can be removed from the regression. The choice of which to drop depends on the practical interpretation of the variables in each case. In this example, it might be argued that the most natural variable to drop is lotsizei, since in the original PRF (55) the interpretation of β2 is really capturing garden size anyway. Recall that β2 measures the change in predicted selling price for a one square foot increase in lot size, holding all the other explanatory variables constant. A one square foot increase in lot size holding house size constant must be a one square foot increase in garden size, so the clarity of the practical interpretation of the model could be improved by including gardeni instead of lotsizei. The results are shown in Figure 58. Nearly all of the results are identical, except for the coefficient on sqrfti, which can be explained by substituting lotsizei = sqrfti + gardeni into (55) to obtain

E (pricei|assessi, lotsizei, sqrfti,bdrmsi) = β0 + β1assessi + β2 (sqrfti + gardeni) + β3sqrfti + β4bdrmsi= β0 + β1assessi + β2gardeni + (β2 + β3) sqrfti + β4bdrmsi.

This show the PRF with gardeni included is the same as that with lotsizei included except thatthe coeffi cient on sqrfti is changed to β2+β3. Therefore the coeffi cient on sqrfti in Figure 58 (i.e.0.0000693) is the sum of the coeffi cients on lotsizei (i.e. 0.000587) and sqrfti (i.e. −0.000517) inFigure 46. The other coeffi cient estimates and goodness of fit are unchanged so overall the tworegressions statistically equivalent and the choice can be made on the grounds of which choice ofvariables is more meaningfully interpretable.


Figure 56: Attempting to estimate the house price regression with gardeni included.

Figure 57: The Eviews error message when there is perfect multicollinearity.


Figure 58: House price regression with garden size instead of lot size.

4.7.2 Imperfect multicollinearity

Imperfect multicollinearity is a situation where some or all of the regressors are highly correlated with each other, but not with a correlation of ±1 that would come with an exact linear relationship that implies perfect multicollinearity. Imperfect multicollinearity does not invalidate any assumptions of OLS estimation, so computation of the estimator can proceed and its unbiasedness and distributional properties hold. The issue with imperfect multicollinearity is that the standard errors of the estimated regression coefficients can be quite large as a result, implying the estimates are not very precise and hence confidence intervals will be quite wide. One symptom of imperfect multicollinearity is a regression whose coefficients are insignificant according to the individual t tests (because of the large standard errors) but are significant according to the joint F test (or its Wald heteroskedasticity-consistent variant). More details are given in Wooldridge p.94–97 for the homoskedastic case.

5 Dummy Variables

A dummy variable (or indicator variable) can be used to include qualitative or categorical variables in a regression. In the simplest case this refers to variables for which there are two categories, for example an individual can be categorised as male/female or employed/unemployed or have some/no private health insurance and so on. The inclusion of such characteristics in regression models can be extremely informative.

5.1 Estimating two means

Consider the CEO salary data again. The sample of n = 209 CEOs has been drawn from several industries, one of which is summarised as "Utility", which includes firms in the transport and utilities industries (utilities includes electricity, gas and water firms). We can define a dummy


[Histogram of the UTILITY dummy, 209 observations: mean 0.1722, median 0, minimum 0, maximum 1, std. dev. 0.3785.]

Figure 59: Histogram of the utilities dummy variable

variable, or indicator variable, as

utilityi = 1 if firm i is in transport or utilities, and utilityi = 0 if firm i is in any other industry.

A histogram of this variable is shown in Figure 59, where it can be seen that the variable takes only the values 0 or 1. There are 36 firms in the sample in transport and utilities, with 173 in other industries. The mean of this variable is therefore (1/209) ∑209i=1 utilityi = 36/209 = 0.1722, as shown in the Figure.

Consider the simple regression

E (salaryi|utilityi) = β0 + β1utilityi. (58)

If firm i is not in either transport or utilities then utilityi = 0, giving

E (salaryi|utilityi = 0) = β0,

so that β0 is the population mean of CEO salaries across all industries except transport and utilities. If firm i is in either transport or utilities then utilityi = 1, giving

E (salaryi|utilityi = 1) = β0 + β1,

so that the population mean of CEO salaries in the transport and utilities industries is β0 + β1. Therefore β1 measures the difference between average salaries in transport and utilities versus all other industries. We can therefore use (58) to estimate the mean salaries for these two industry groups and also test for differences between them.
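As a sketch of this equivalence between the dummy-variable regression and the two group means, the following Python snippet (hypothetical numbers, not the CEO dataset) regresses a variable on an intercept and a 0/1 dummy and compares the coefficients with the group means.

# A sketch: beta0_hat equals the mean of the 0 group and
# beta0_hat + beta1_hat equals the mean of the 1 group.
import numpy as np

salary = np.array([900.0, 1500.0, 1800.0, 700.0, 650.0, 2100.0])  # made-up values
utility = np.array([0, 0, 0, 1, 1, 0])                            # 1 = utilities firm

X = np.column_stack([np.ones(len(salary)), utility])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)

print(beta[0], salary[utility == 0].mean())            # identical
print(beta[0] + beta[1], salary[utility == 1].mean())  # identical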

Figure 60 shows the results of an OLS regression of CEO salary on an intercept and utilityi. The SRF is

salaryi = 1396.225 − 668.523 utilityi,
         (112.402)   (12.229)

with standard errors in parentheses, implying β0 = 1396.225 and β1 = −668.523. The estimated average CEO salary across all industries other than transport and utilities is therefore $1,396,225, while the estimated average CEO salary in transport and utilities is $1,396,225 − $668,523 = $727,702. A test of the significance of β1 is a test of whether average salaries differ in the transport and utilities industries from the others.


Figure 60: SRF of CEO salary on an intercept and utility dummy variable

1. H0 : β1 = 0 (average CEO salaries across all industries are equal)

2. H1 : β1 6= 0 (average CEO salaries are different in transport and utilities)

3. α = 0.05

4. Test statistic: p = 0.0000

5. Decision rule: reject H0 if p < 0.05

6. Reject H0, so average CEO salaries are significantly different in transport and utilities compared to other industries.

5.2 Estimating several means

Dummy variables can be used to estimate several means (or differences between means) at once. In the CEO salary dataset the firms are classified into four industries – utilities/transport, financial, industrial and consumer products. The additional dummy variables are

financei = 1 if firm i is in the finance industry, 0 if firm i is in any other industry
indusi = 1 if firm i is in industrial production, 0 if firm i is in any other industry
consprodi = 1 if firm i is in consumer products, 0 if firm i is in any other industry

Each firm in the sample falls into one of these four categories. The following PRF can be specified:

E (salaryi|utilityi,financei, indusi) = β0 + β1utilityi + β2financei + β3 indusi. (59)

The implied mean for CEO salaries in the consumer product industry is found from setting utilityi = financei = indusi = 0, giving

E (salaryi|utilityi = 0,financei = 0, indusi = 0) = β0.


The average salaries for the other three industries are defined relative to the consumer product industry (the "base category" in this case). For utilities/transport we have utilityi = 1 and financei = indusi = 0, so

E (salaryi|utilityi = 1,financei = 0, indusi = 0) = β0 + β1.

For the finance industry we have financei = 1 and utilityi = indusi = 0, so

E (salaryi|utilityi = 0,financei = 1, indusi = 0) = β0 + β2.

For industrial production we have indusi = 1 and financei = utilityi = 0, so

E (salaryi|utilityi = 0, financei = 0, indusi = 1) = β0 + β3.

We do not also include the consprodi dummy variable in the PRF because this would cause a perfect multicollinearity problem – because each firm in the sample is categorised as one (and only one) of utilities, financial, industrial or consumer product, we have the perfect linear relationship

utilityi + financei + indusi + consprodi = 1,

where 1 is the regressor for an intercept term. Therefore a PRF

E (salaryi|utilityi,financei, indusi, consprodi) = β0+β1utilityi+β2financei+β3 indusi+β4consprodi

has an exact linear relationship among its regressors and therefore has perfect multicollinearity and cannot be estimated. One of the five explanatory variables needs to be omitted in order for the PRF to be estimated. In (59) we chose to omit consprodi, but any one of the other regressors could have been omitted, including the intercept.

The SRF corresponding to (59) is shown in Figure 61. The estimated average salary for CEOs in the consumer products industry is β0 = 1722.417, or $1,722,417. The estimated average salary for CEOs in the finance industry is β0 + β2 = 1722.417 − 377.5036 = 1344.913, or $1,344,913. The finance dummy variable is not significant at the 5% level (p = 0.2473, so that H0 : β2 = 0 would not be rejected against H1 : β2 ≠ 0), so there is no evidence of a significant difference between CEO salaries in the financial and consumer products industries. The interpretations for the utilities and industrial dummies follow similarly, with average CEO salaries in utilities differing significantly (p = 0.0009) from the consumer products industry, while those in industrial production do not differ significantly from consumer products (only just, with p = 0.0519).

5.3 Dummy variables in general regressions

Dummy variables are useful for more than just estimating means; they can be used in more general regression models as well. The dataset cochlear.wf1 contains observations on n = 91 severely hearing impaired children who have received Cochlear Implants (CIs) to enable some form of hearing. Some children have a single CI in one ear (a unilateral CI) while others have received two CIs, one in each ear (bilateral CIs). It is believed that bilateral CIs provide an advantage to children in real world listening and learning situations because they allow better directional recognition of sounds and voices and also better hearing in background noise. However, CIs are expensive (approximately $25,000 per implant), which must be borne by public or private health insurance or the families themselves. Also the implantation of a CI involves damage to the inner ear that then rules out the use of any newly discovered surgical procedure or device in the future that might deliver improved performance. This background provides motivation for why it is important to be able to detect and quantify improvements in listening and language that children can achieve through the use of either one or two CIs.


Figure 61: SRF for average CEO salary classified by industry

The datafile contains outcomes for young children (ages 5–8) with either unilateral or bilateral CIs on the standardised² Peabody Picture Vocabulary Test (PPVT). The histogram and descriptive statistics for this dependent variable are shown in Figure 62. The datafile also contains the dummy variable bilati, which takes the value 1 if child i has bilateral CIs and the value 0 if they have unilateral CIs. The PRF

E (PPVTi|bilati) = β0 + β1bilati

allows a test of the difference of means between the bilateral and unilateral outcomes. The SRF in Figure 63 shows that β0 = 85.21 is the estimated average score for unilateral children, while β0 + β1 = 85.21 + 9.36 = 94.57 is the estimated average score for bilateral children. The null hypothesis H0 : β1 = 0 is rejected against H1 : β1 ≠ 0 at the 5% level of significance (p = 0.0045) so there is a significant difference in outcomes between bilateral and unilateral children.

5.3.1 Dummies for intercepts

There is also clinical experience that children should not be made to wait too long to receive their CIs. There is a window early in life when the young brain needs to receive sounds and language inputs in order to develop best to be able to hear and understand language. Delaying the CIs can result in developmental delays that are very difficult to later catch up. A PRF to analyse this question is

E (PPVTi|bilati, ageCI1i, ageCI2i) = β0 + β1bilati + β2ageCI1i + β3ageCI2i, (60)

where ageCI1i and ageCI2i are the respective ages in years when the first and second CIs were switched on. (For children with only a unilateral CI, ageCI2i = 0.)

² The average score in the normal-hearing population is standardised to be 100.

Histograms and descriptive statistics for these ages are shown in Figures 64 and 65 and the SRF is shown in Figure 66. It has the form

ppvti = 92.77 + 16.22 bilati − 3.81 ageCI1i − 3.13 ageCI2i,
        (4.64)   (5.91)         (1.86)          (1.53)

with standard errors in parentheses.

A way to interpret this SRF is to think of it as containing two different SRFs: one for unilateral children (bilati = 0 and ageCI2i = 0):

ppvti = 92.77− 3.81ageCI1i,

and one for bilateral children (bilati = 1)

ppvti = 108.99− 3.81ageCI1i − 3.13ageCI2i.

The role of the bilati dummy variable in this SRF is to allow the regression to have different intercepts for unilateral and bilateral children – in this case the intercept for bilateral children is higher (108.99 vs 92.77), reflecting the higher average scores for them relative to unilateral children. The bilati dummy variable is significant at the 5% level (p = 0.0074) so the difference between the regression lines is a statistically significant one.

To interpret its practical significance, suppose we compare the difference between the predicted outcomes for a unilateral child with ageCI1 = a1 against a bilateral child also with ageCI1 = a1 and who received their second CI at age 2 (the average age of second implant being 2.16 years). The unilateral prediction would be

ppvt (0, a1, 0) = 92.77− 3.81a1

and the bilateral prediction would be

ppvt (1, a1, 2) = 108.99− 3.81a1 − 3.13× 2

= 102.73− 3.81a1.

The difference between these two is the predicted difference due to the bilateral CI:

ppvt (1, a1, 2)− ppvt (0, a1, 0) = 102.73− 92.77 = 9.96. (61)

Relative to the average standardised score of 100, the bilateral child is predicted to score approximately 10% better on the PPVT language test. A method of computing this difference and its standard error is to recognise that

ppvt (1, a1, 2) − ppvt (0, a1, 0) = β1 + 2β3,

and that β1 + 2β3 can be directly estimated by re-parameterising the PRF as

E (PPVTi|bilati, ageCI1i, ageCI2i) = β0 + (β1 + 2β3) bilati + β2ageCI1i + β3 (ageCI2i − 2bilati).

That is, the coefficient on bilati in a regression of PPVTi on an intercept, bilati, ageCI1i and (ageCI2i − 2bilati) delivers the estimated effect of a second CI at age 2 versus sticking with a unilateral CI. The results of this regression are shown in Figure 67, where it can be seen that β1 + 2β3 = 9.95 (which differs from (61) only because of rounding) with standard error 3.86. The 95% confidence interval for this effect is therefore

[9.95 ± 2.00 × 3.85] = [2.25, 17.65].
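The following Python sketch illustrates the re-parameterisation trick with simulated data (the cochlear.wf1 variables are not reproduced here, so all numbers are artificial): the coefficient on bilati in the re-parameterised regression equals β1 + 2β3 from the original regression, and its standard error is then available directly from the usual output.

# A sketch of the re-parameterisation trick with simulated (artificial) data.
import numpy as np

rng = np.random.default_rng(0)
n = 91
bilat = (rng.random(n) < 0.5).astype(float)
ageci1 = rng.uniform(0.3, 3.8, n)
ageci2 = np.where(bilat == 1, rng.uniform(0.5, 6.0, n), 0.0)
ppvt = 93 + 16 * bilat - 4 * ageci1 - 3 * ageci2 + rng.normal(0, 15, n)

X1 = np.column_stack([np.ones(n), bilat, ageci1, ageci2])            # original regressors
b1 = np.linalg.lstsq(X1, ppvt, rcond=None)[0]

X2 = np.column_stack([np.ones(n), bilat, ageci1, ageci2 - 2 * bilat])  # re-parameterised
b2 = np.linalg.lstsq(X2, ppvt, rcond=None)[0]

print(b1[1] + 2 * b1[3])  # beta1_hat + 2*beta3_hat from the original SRF
print(b2[1])              # identical: coefficient on bilat after re-parameterising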


[Histogram of PPVT, 91 observations: mean 92.10, median 94, minimum 59, maximum 139, std. dev. 16.43.]

Figure 62: Histogram of language scores for children with Cochlear Implants

Figure 63: SRF for PPVT scores on bilateral CI dummy variable


[Histogram of AGE_CI1, 91 observations: mean 1.53, median 1.43, minimum 0.34, maximum 3.79, std. dev. 0.82.]

Figure 64: Histogram and descriptive statistics for age at first CI.

[Histogram of AGE_CI2, 91 observations: mean 2.16, median 2.02, minimum 0, maximum 5.82, std. dev. 1.75.]

Figure 65: Histogram and descriptive statistics for age at second CI


Figure 66: SRF for PPVT language scores on bilateral CIs and implant ages.

Figure 67: SRF for the marginal effect of a second CI at age 2


5.3.2 Dummies for slopes

Dummy variables can be used to allow slope coefficients to change for different categories of observations as well. For example we could specify

E (PPVTi|bilati, ageCI1i, ageCI2i) = β0 + β1bilati + β2 (bilati × ageCI1i) + β3 (unilati × ageCI1i) + β4ageCI2i, (62)

where unilati = 1 − bilati is a dummy variable that takes the value 1 for a child with a unilateral CI and 0 for a child with a bilateral CI. Regressors that involve products of explanatory variables like this are often called interactions. In this case the effect of the bilateral CI is being allowed to interact with the age of first implant. The PRF for a unilateral child is then

E (PPVTi|bilati = 0, ageCI1i, ageCI2i) = β0 + β3ageCI1i,

while for a bilateral child it is

E (PPVTi|bilati = 1, ageCI1i, ageCI2i) = (β0 + β1) + β2ageCI1i + β4ageCI2i.

The unilateral and bilateral PRFs therefore have potentially different intercepts and slope coefficients on ageCI1i. The statistical significance of these differences can be tested.

The SRF is shown in Figure 68. The slope coefficients on ageCI1i are different for unilateral and bilateral children, and in fact the slope is only significant (p = 0.0299) for bilateral children. A year of delay in the first CI predicts a fall in the PPVT outcome of 5.43 points for bilateral children but only 0.62 points for unilateral children. The prediction equations are

ppvti = 86.43− 0.62ageCI1i

for a unilateral child and

ppvti = (86.43 + 23.69) − 5.43 ageCI1i − 2.76 ageCI2i
      = 110.12 − 5.43 ageCI1i − 2.76 ageCI2i

for a bilateral child.

A Wald test can be used to test H0 : β2 = β3 against H1 : β2 ≠ β3. Under H0 the slope coefficients on ageCI1i are equal for unilateral and bilateral children and the PRF would simplify back to (60). The results of the Wald test (specified as "c(3)=c(4)") are shown in Figure 69. The details are

1. H0 : β2 = β3 in (62)

2. H1 : β2 ≠ β3

3. α = 0.05

4. Test statistic : F statistic p-value = 0.1639.

5. Decision rule : reject H0 if p < 0.05.

6. Do not reject H0, the PRF could be simplified back to (60).


Figure 68: SRF of regression for language outcomes with bilateral interactions

Figure 69: Wald test for equality of ageCI1i coeffi cients


6 Some non-linear functional forms

In all cases so far we have assumed that the PRF is a linear function of the explanatory variables. Non-linearities can be introduced in an endless variety of ways. The interactions involving dummy variables in the previous section were one step in this direction. Here we look at some common examples of non-linear regression models that can be handled using the methods of OLS estimation and inference considered so far.

6.1 Quadratic regression

Some non-linearity can be introduced into a PRF by including the square of an explanatory variable. Consider a simple regression

E (yi|xi) = β0 + β1xi + β2xi². (63)

It no longer makes sense to interpret β1 as the change in E (yi|xi) from a unit increase in xi, since a unit increase in xi necessarily increases xi² as well. The term β1xi + β2xi² needs to be interpreted as a whole. The sign of β2 dictates the shape of the parabola – a positive sign giving a convex shape (a valley) and a negative sign giving a concave shape (a hill). The general approach to regression interpretation is to consider the difference between the conditional mean at some value x:

E (yi|xi = x) = β0 + β1x + β2x² (64)

and the conditional mean at x + 1:

E (yi|xi = x+1) = β0 + β1 (x+1) + β2 (x+1)².

This gives an expression for the change in the conditional mean that results from a one unit increase in xi:

E (yi|xi = x+1) − E (yi|xi = x) = β1 + β2 (2x+1). (65)

For a linear regression (i.e. one without the xi² term) the expression is E (yi|xi = x+1) − E (yi|xi = x) = β1. The effect of the quadratic term is to introduce β2 (2x+1) into the marginal effect of xi. An important difference from the linear model is that this marginal effect now depends on x, implying that the effect of increasing xi by one unit now depends on what value of xi we start from. The following application will illustrate the sense of this property. Sometimes the derivative of the regression function with respect to x is used as an approximation to the marginal effect shown in (65). The derivative is

dE (yi|xi = x) / dx = β1 + 2β2x,

which differs from (65) by β2. Estimated values of β2 are often small, so that the derivative is close to (65). We will focus on (65) as the exact marginal effect of a unit change in xi.

For a given value x, the estimation of E (yi|xi = x) and its standard error follows by subtracting (64) from (63) to obtain

E (yi|xi) = E (yi|xi = x) + β1 (xi − x) + β2 (xi² − x²),

so that computing this prediction involves a regression of yi on an intercept, (xi − x) and (xi² − x²), and taking the intercept estimate and its standard error. The approach is identical to that for linear models. The estimation of the marginal effect in (65) for a given value x can proceed by re-arranging (63) to give

E (yi|xi) = β0 + (β1 + β2 (2x+1)) xi + β2 (xi² − (2x+1) xi) = β0 + β∗1xi + β2 (xi² − (2x+1) xi).

That is, the marginal effect (65) for a given x is estimated from a regression of yi on an intercept, xi and (xi² − (2x+1) xi), and taking the coefficient on xi and its standard error.

A quadratic function has a turning point, either a maximum (for β2 < 0) or a minimum (for β2 > 0). The location of this turning point is found by setting the derivative equal to zero:

dE (yi|xi = x) / dx = 0  ⇒  x = −β1 / (2β2).

This is the point at which the effect of xi on E (yi|xi) changes from being positive to negative (for β2 < 0) or negative to positive (β2 > 0). The estimated turning point is therefore −β1 / (2β2), where β1 and β2 are the usual OLS estimators of the PRF (63).
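A small Python sketch, using simulated data rather than any dataset from these notes, of the marginal-effect estimation just described: regressing yi on an intercept, xi and (xi² − (2x+1) xi) makes the coefficient on xi equal to β1 + β2 (2x+1) from the original quadratic SRF.

# A sketch with simulated data of the re-parameterised marginal-effect regression.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 40, n)
y = 3 + 0.25 * x - 0.004 * x**2 + rng.normal(0, 1, n)

x0 = 10  # evaluate the marginal effect of a one unit increase starting from x = 10

X_orig = np.column_stack([np.ones(n), x, x**2])
b = np.linalg.lstsq(X_orig, y, rcond=None)[0]

X_repar = np.column_stack([np.ones(n), x, x**2 - (2 * x0 + 1) * x])
c = np.linalg.lstsq(X_repar, y, rcond=None)[0]

print(b[1] + b[2] * (2 * x0 + 1))  # marginal effect built from the original SRF
print(c[1])                        # identical: coefficient on x after re-parameterising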

6.1.1 Example: wages and work experience

A quadratic term is commonly included when modelling wages in terms of labour market experience. The workfile wages.wf1 contains data on n = 1260 individuals with their wages ($/hour) and various potential explanatory variables. A straightforward linear PRF would take the form

E (wagei|femalei, educi, experi) = β0 + β1femalei + β2educi + β3experi,

where educi is years of education, experi is years of labour force experience and femalei is a dummy variable that takes the value 1 if individual i is female and 0 otherwise. The SRF is shown in Figure 70. Each of the slope coefficients is significant at the 5% level and each has an interpretable sign and magnitude. An extra year of education increases the estimated conditional mean of wages by $0.45, an extra year of experience increases the estimated conditional mean of wages by $0.08, and the average wage of females is $2.57 below that of males with the same levels of education and experience.

However, experience is generally not modelled in a linear form in a wage equation like this. The idea is that the initial years of work experience involve the greatest learning and greatest increases in productivity for an employee, resulting in the greater increases in wages at that time. As experience increases, the rate of growth in productivity, and hence in wages, slows. This effect can be captured by including a quadratic term in the PRF

E (wagei|femalei, educi, experi) = β0 + β1femalei + β2educi + β3experi + β4experi², (66)

The interpretations of the femalei and educi variables in this model are unchanged, but the interpretation of experience must be altered as shown above. The coefficients β3 and β4 are no longer individually interpretable because it is impossible to make a one year increase in experi while holding experi² fixed (or vice versa). Instead the marginal effect on E (wagei|femalei, educi, experi) of one extra year of work experience follows from (65):

E (wagei|femalei, educi, experi + 1) − E (wagei|femalei, educi, experi) = β3 + β4 (2experi + 1). (67)

The effect on average wage of an extra year of work experience depends on the amount of experience obtained so far. For an individual with one year of work experience, a second year of work experience will change the expected wage by β3 + 3β4 dollars per hour. For an individual with 20 years of work experience, the next year of experience will change the expected wage by β3 + 41β4 dollars per hour.

The SRF for (66) is shown in Figure 71, showing β3 = 0.2527 and β4 = −0.0039. The quadratic term in experience is significant at the 5% level so it adds some explanatory power for wages. Figure 72 gives a graphical representation of the contribution of the experience variables to the wages PRF, given by (β3experi + β4experi²) plotted over the range of observed values of experi. Also shown for comparison is the linear term in experience from the SRF in Figure 70, given by 0.0847experi. The quadratic function has a positive slope for all levels of experience from one year up to the turning point experi = −β3 / (2β4) = −0.2527 / (2 × −0.0039) = 32.40 years. After that the quadratic has a negative slope. The implication is that extra experience increases average wages, at a decreasing rate, until experience reaches about 32.4 years. After that, extra work experience has a negative effect on average wages. The same information is also displayed in Figure 73, which graphs the effect of an extra year of work experience on average wages. The effect in the linear case is forced to be constant for all experience at β3 = 0.0847, while the effect in the quadratic case is β3 + β4 (2experi + 1) = 0.2527 − 0.0039 (2experi + 1). Again this shows that an extra year of experience raises average wages until experience reaches 32.4 years, at which point the effect crosses the x-axis and implies decreases in average wages.

Prediction in a quadratic regression works in the same way as in a linear model. Suppose we want to calculate the average wage for a female with 15 years of education and 10 years of work experience. Figure 74 shows the regression for this purpose, re-specified in terms of (femalei − 1), (educi − 15), (experi − 10) and (experi² − 10²). The resulting prediction is wage (1, 15, 10) = $5.08 with standard error of 0.23, which can be used to compute the 95% prediction interval

[5.0827 ± 1.980 × 0.2261] = [$4.63, $5.53].

Suppose we also want a 95% confidence interval for the effect of an extra year of work experience on average wages for an individual with these characteristics. The desired effect is (67) with experience set to 10 years, i.e. β3 + 21β4. The PRF (66) can be re-written as

E (wagei|femalei, educi, experi) = β0 + β1femalei + β2educi + (β3 + 21β4) experi + β4 (experi² − 21experi),

so that a regression of wagei on an intercept, femalei, educi, experi and (experi² − 21experi) will provide the desired coefficient on experi. The SRF is shown in Figure 75, which shows that β3 + 21β4 = 0.17 with 95% confidence interval computed as

[0.1707 ± 1.980 × 0.0154] = [$0.14, $0.20].
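As a quick arithmetic check, the following snippet reproduces the turning point and the marginal effect at 10 years of experience from the coefficient estimates reported above; any small discrepancies are due to rounding in the reported values.

# Arithmetic check using the reported estimates beta3_hat = 0.2527, beta4_hat = -0.0039.
b3, b4 = 0.2527, -0.0039

effect_at_10 = b3 + b4 * (2 * 10 + 1)   # marginal effect of one more year at 10 years
turning_point = -b3 / (2 * b4)          # experience level at which the slope is zero
ci = (0.1707 - 1.980 * 0.0154, 0.1707 + 1.980 * 0.0154)

print(effect_at_10)    # about 0.17 dollars per hour
print(turning_point)   # about 32.4 years
print(ci)              # about (0.14, 0.20)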

6.2 Regression with logs – explanatory variable

It is common practice to work with variables in logs rather than their original levels. Consider a PRF

E (yi|xi) = β0 + β1 log xi. (68)

Taking logs is only possible when xi takes only values greater than zero, so this specification is not always available. It can be used for a positive variable like years of work experience though. The interpretation of the effect of xi in this PRF needs to be derived. Following the same general approach as in the quadratic model, we consider a fixed value x and compare

E (yi|xi = x) = β0 + β1 log x (69)


Figure 70: Linear SRF for wages

Figure 71: SRF for wages with quadratic in experience



Figure 72: Quadratic and linear in experience components of SRFs



Figure 73: Effects of an extra year of work experience on average wages


Figure 74: SRF to predict the wages for females with 15 years of education and 10 years of workexperience

Figure 75: SRF to estimate the effect on average wages of an extra year of experience for anindividual with 10 years of experience


and

E (yi|xi = x+1) = β0 + β1 log (x+1).

The effect on the conditional mean of yi of a one unit increase in xi is therefore

E (yi|xi = x+1) − E (yi|xi = x) = β1 (log (x+1) − log x) = β1 log (1 + 1/x). (70)

For a fixed value x this can be estimated by re-arranging (68) as

E (yi|xi) = β0 + (β1 log (1 + 1/x)) · [log xi / log (1 + 1/x)] = β0 + β∗1 · [log xi / log (1 + 1/x)],

so the desired marginal effect (70) for a given value x is estimated as the slope coefficient from a regression of yi on an intercept and log xi / log (1 + 1/x).

An alternative and very common interpretation of this PRF is to consider a 1% increase in xi rather than a one unit increase. That is, instead of comparing E (yi|xi) at xi = x and xi = x + 1, we compare it at xi = x and xi = 1.01x. This gives

E (yi|xi = 1.01x) − E (yi|xi = x) = β1 (log (1.01x) − log x) = β1 (log 1.01 + log x − log x) = β1 log (1.01) ≈ β1/100,

where the last step uses log 1.01 = 0.00995 ≈ 0.01 = 1/100. Therefore a 1% increase in xi results in a change of β1/100 in E (yi|xi). This interpretation is common because the result is not dependent on x, so it gives β1 a convenient interpretation without reference to some starting value x.
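The following sketch (the slope value is illustrative only, not an estimate from the wages data) compares the exact unit-increase effect β1 log (1 + 1/x) with the 1% interpretation β1/100.

# Exact one unit effect at several starting values of x versus the 1% approximation.
import numpy as np

beta1 = 1.3  # illustrative slope on log(x)

for x in (1, 5, 10, 30):
    print(x, beta1 * np.log(1 + 1 / x))   # effect of a one unit increase starting from x

print(beta1 * np.log(1.01))  # effect of a 1% increase in x ...
print(beta1 / 100)           # ... and its usual approximation beta1/100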

Prediction is carried out in the usual way by subtracting (69) from (68) to obtain

E (yi|xi) = E (yi|xi = x) + β1 (log xi − log x),

so the estimate of E (yi|xi = x) is the estimated intercept in a regression of yi on an intercept and (log xi − log x).

6.2.1 Example: wages and work experience

In the context of wages and work experience, using the log of work experience provides an alternative non-linear specification to a quadratic. Consider the PRF

E (wagei|femalei, educi, experi) = β0 + β1femalei + β2educi + β3 log (experi) . (71)

Including experience in logs instead of linearly allows the effect of an extra year of experience on average wages to decrease as experience increases. The effect of an extra year of experience (holding femalei and educi constant) on the conditional mean of wages is obtained from (70) to be

β3 log (1 + 1/experi). (72)


Alternatively an increase of 1% in work experience increases average wages by approximately β3/100 dollars per hour.

The SRF for (71) is shown in Figure 76, and the experience component is illustrated in Figure 77, with the quadratic component from Figure 71 for comparison. Both functional forms capture the initially larger gains to work experience at the beginning of a career and the reduction of those gains as experience grows. The difference is that the log specification does not imply a negative effect of experience at any level, instead showing a continuing gradual increase in wages with experience at all levels. Figure 78 shows the estimated effect on average wages of an extra year of work experience, comparing the results from the log and quadratic models. The biggest differences occur at the ends of the distribution of experience, where data is most sparse, so choosing between the two specifications will not be simple. For now we consider each as providing a reasonable approximation to the role of experience and return later to the topics of model comparison and selection.

The average wage can again be estimated for a female with 15 years of education and 10 years of work experience; see Figure 79. This gives wage (1, 15, 10) = $5.27 with 95% prediction interval

[5.2728± 1.980× 0.2231] = [$4.83, $5.71] .

This interval mostly overlaps that from the quadratic model (i.e. [$4.63, $5.53]) so the predictions from the two models are very similar for this type of individual. We can expect them to differ more for very small or very large values of experience.

To estimate the effect of an extra year of experience for this individual, which from (72) would be

β3 log (1 + 1/10) = β3 log (1.1),

consider the re-specified PRF

E (wagei|femalei, educi, experi) = β0 + β1femalei + β2educi + β∗3 [log (experi) / log (1.1)],

where β∗3 = β3 log (1.1). The results are shown in Figure 80, from which we find the estimated increase in average wages is $0.12 with 95% confidence interval

[0.1246± 1.980× 0.0099] = [$0.11, $0.14] .

This interval does not overlap that constructed with the quadratic model, so the two models make different predictions in this case.

For any values of the explanatory variables (i.e. regardless of work experience) an increase of work experience by 1% increases the average wage by approximately $0.013. In this case a one year change in experience is the more natural unit to consider, but in other applications a percentage change is very natural.

6.3 Regression with logs – dependent variable

It is standard in econometrics to model a variable like wages in log form rather than its levels. There is no definite rule for choosing between logs and levels, but variables like wages or incomes that are positive (logs only apply to positive numbers) and generally highly positively skewed and non-normal are rendered less skewed and closer to normal by a log transformation. Figures 81 (wages in levels) and 82 (wages in logs) illustrate the point. This can lessen the impact of the small number of very large incomes. Also the approximate normal, t and F distributions that rely on the Central Limit Theorem will tend to work better for more symmetrically distributed data.


Figure 76: SRF for wages with log of work experience


Figure 77: Comparison of experience in quadratic and log form in the wage equations




Figure 78: Estimated effects of a year of extra work experience of average wages for the log andquadratic models

Figure 79: Prediction for a female with 15 years education and 10 years of work experience


Figure 80: Estimating the effect of an extra year of work experience for an individual with 10 years of work experience

[Histogram of WAGE, 1260 observations: mean 6.31, median 5.30, minimum 1.02, maximum 77.72, std. dev. 4.66, skewness 4.81, kurtosis 54.01.]

Figure 81: Histogram of wages


[Histogram of LWAGE, 1260 observations: mean 1.66, median 1.67, minimum 0.02, maximum 4.35, std. dev. 0.59, skewness 0.08, kurtosis 3.43.]

Figure 82: Histogram of log of wages

Consider the simple regression

E (log yi|xi) = β0 + β1xi. (73)

The interpretation of this regression in terms of log yi is simple. However we are rarely interested in log yi for practical purposes; we are interested in yi. So we want to work out the implications of this model in log yi for yi itself. That is, we would like to deduce an expression for E (yi|xi), but the fundamental difficulty is that E (log yi|xi) ≠ log E (yi|xi). The log is a non-linear function so it cannot be interchanged with the expectations operator. In fact we know from Jensen's inequality that E (log yi|xi) < log E (yi|xi). Instead we write (73) as

log yi = β0 + β1xi + (log yi − E (log yi|xi)) = β0 + β1xi + ui, (74)

where

ui = log yi − E (log yi|xi).

Taking the exponential of both sides of (74) gives

yi = exp (β0 + β1xi) exp (ui) . (75)

Making any progress with the interpretation of this model requires the assumption that ui is independent of xi. This is a difficult assumption to interpret or test, but tends to be made in practice without any discussion, so we will do so here. Under this assumption, taking the conditional expectation of both sides of (75) gives

E (yi|xi) = exp (β0 + β1xi) E [exp (ui) |xi] = exp (β0 + β1xi) E [exp (ui)] = α0 exp (β0 + β1xi),

where α0 = E [exp (ui)]. Now

E (yi|xi = x) = α0 exp (β0 + β1x) (76)


and

E (yi|xi = x+1) = α0 exp (β0 + β1 (x+1)) = α0 exp (β0 + β1x) exp (β1) = E (yi|xi = x) exp (β1),

so

E (yi|xi = x+1) − E (yi|xi = x) = E (yi|xi = x) (exp (β1) − 1).

This is frequently expressed as the percentage change

[E (yi|xi = x+1) − E (yi|xi = x)] / E (yi|xi = x) · 100% = (exp (β1) − 1) · 100%. (77)

That is, a one unit increase in xi produces a (exp (β1) − 1) · 100% change in E (yi|xi). It is common to approximate exp (β1) − 1 by β1 (an approximation that works best for small β1) so that the interpretation of the model becomes

[E (yi|xi = x+1) − E (yi|xi = x)] / E (yi|xi = x) · 100% = β1 · 100%. (78)

That is, a one unit increase in xi produces an approximate β1 · 100% change in E (yi|xi). The convenience of using β1 instead of having to compute (exp (β1) − 1) means this approximate interpretation is more often used in practice. The approximation can also be derived using calculus:

d/dx E (yi|xi = x) = α0 exp (β0 + β1x) · β1 = E (yi|xi = x) · β1,

which implies

[1 / E (yi|xi = x)] · d/dx E (yi|xi = x) · 100% = β1 · 100%,

each side of which is an approximation to each side of (77). We will proceed with (78) as the interpretation of (73).
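A small sketch of the two interpretations, comparing the exact percentage effect (exp (β1) − 1) · 100% with the approximation β1 · 100% for several illustrative values of β1.

# Exact versus approximate percentage effect in a log-y regression.
import numpy as np

for beta1 in (0.01, 0.05, 0.0679, 0.25, 0.5):
    exact = (np.exp(beta1) - 1) * 100   # exact percentage change in E(y|x)
    approx = beta1 * 100                # usual approximation
    print(beta1, round(exact, 2), round(approx, 2))  # close for small beta1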

Estimation of E (yi|xi) is more difficult in a model that is expressed in terms of log yi. It is straightforward to obtain the SRF

log yi = β0 + β1xi,

from an OLS regression of log yi on an intercept and xi. The prediction equation

log y (x) = β0 + β1x

provides the right way to estimate E (log yi|xi = x). It is then tempting to use

exp (log y (x)) = exp (β0 + β1x)

to estimate E (yi|xi = x), but this would not be right because it omits the α0 term in the correct expression (76). Since α0 is necessarily greater than one (because E (log yi|xi) < log E (yi|xi)), using y (x) = exp (β0 + β1x) will systematically under-estimate E (yi|xi = x), i.e. it will be negatively biased. An estimator of α0 is required to correct this bias. Since α0 = E [exp (ui)] (the population mean of exp (ui)), a natural estimator is the sample mean

α0 = (1/n) ∑ni=1 exp (ui),

where

ui = log yi − β0 − β1xi

are the usual OLS residuals from the SRF corresponding to (73). The prediction equation

y (x) = α0 exp (β0 + β1x)

can then be used.

Figure 83: SRF for log(wagei)
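The following Python sketch, using simulated data rather than the wages file, carries out the correction just described: estimate the log model, compute α0 as the sample mean of exp of the residuals, and scale the naive exp() prediction by it.

# Bias-corrected prediction of E(y|x) from a model estimated in log(y).
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 10, n)
logy = 0.5 + 0.1 * x + rng.normal(0, 0.5, n)
y = np.exp(logy)

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, logy, rcond=None)[0]
resid = logy - X @ b

alpha0_hat = np.exp(resid).mean()      # estimate of E[exp(u)], greater than one
naive = np.exp(b[0] + b[1] * 5)        # ignores alpha0, biased downwards
corrected = alpha0_hat * naive         # corrected prediction of E(y | x = 5)

print(alpha0_hat, naive, corrected)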

6.3.1 Example: modelling the log of wages

A common base specification for a wage equation takes the form of the PRF

E (log (wagei) |femalei, educi, experi) = β0 + β1femalei + β2educi + β3experi + β4experi².

The interpretation of this PRF is simple in terms of log (wagei), but it is generally wagei which is the quantity of interest. Using (77), an extra year of education increases the average wage by (exp (β2) − 1) · 100%. The SRF for this model is given in Figure 83, in which β2 = 0.0679. The estimated effect of an extra year of education in this regression is to increase the average wage by (exp (β2) − 1) · 100% = (exp (0.0679) − 1) · 100% = 7.03%. The approximation (78) gives the effect as 6.79%, which is practically very similar. Moreover the standard error for this latter estimate β2 is immediately available for computing a confidence interval, whereas computing a standard error for exp (β2) − 1 goes beyond our scope.

The interpretation of the effect of work experience on wages involves a non-linear transformation for both variables. The general form is

E (log yi|xi) = β0 + β1xi + β2xi²,


which can be written, following the steps leading to (76) above, as

E (yi|xi) = α0 exp (β0 + β1xi + β2xi²).

Now to find the marginal effect of xi we take

E (yi|xi = x) = α0 exp (β0 + β1x + β2x²)

and

E (yi|xi = x+1) = α0 exp (β0 + β1 (x+1) + β2 (x+1)²)
               = α0 exp (β0 + β1x + β2x²) exp (β1 + β2 (2x+1))
               = E (yi|xi = x) exp (β1 + β2 (2x+1)).

Thus the marginal effect can be expressed

[E (yi|xi = x+1) − E (yi|xi = x)] / E (yi|xi = x) · 100% = (exp (β1 + β2 (2x+1)) − 1) · 100% ≈ (β1 + β2 (2x+1)) · 100%.

This shows that the interpretation of a quadratic regression with a logged dependent variable simply combines the two elements of each of the transformations. The marginal effect of the quadratic regression is (β1 + β2 (2x+1)) as before, but the presence of the logged dependent variable means that this effect needs to be interpreted as a percentage change in E (yi|xi), rather than an absolute change in E (yi|xi).

To compute the marginal effect of an extra year of experience for an individual with x years of experience, we would re-arrange the PRF to give

E (log wagei|femalei, educi, experi) = β0 + β1femalei + β2educi + (β3 + β4 (2x+1)) experi + β4 (experi² − (2x+1) experi),

so the required marginal effect is the coefficient on experi in a regression of log (wagei) on an intercept, femalei, educi, experi and (experi² − (2x+1) experi). For individuals with 10 years of experience, the SRF is shown in Figure 84. The interpretation of the estimate is that an extra year of work experience increases average wages by approximately 2.73%. A confidence interval for this is obtained from

0.0273 ± 1.980 × 0.00238 = [0.0226, 0.0320],

or [2.26%, 3.20%].

6.3.2 Choosing between levels and logs for the dependent variable

Comparing the SRFs in Figures 71 and 83 shows that the model in log wages has a much higher R2, which would appear to suggest it is superior. However, models with different dependent variables cannot be compared using R2. Instead, the predictions from one of the models need to be transformed to match those of the other model to allow a valid comparison.

Suppose we want to compare

E (yi|xi) = β0 + β1xi

and

E (log yi|xi) = δ0 + δ1xi.


Figure 84: SRF for computing the marginal effect of one extra year of experience on wages for anindividual with 10 years of experience

Let

yi = β0 + β1xi, (79)

denote the usual SRF for the levels yi. For the log of yi, write the SRF as

log yi = δ0 + δ1xi,

where δ0 and δ1 are the usual OLS estimators from a regression of log yi on an intercept and xi, the different notation only being used to distinguish them from β0 and β1 in the levels SRF. Now transform the fitted values for log yi into fitted values for yi, denoted ỹi:

ỹi = exp (δ0 + δ1xi).

These fitted values are not unbiased estimators of E (yi|xi) since they omit consideration of α0 in (76), but this turns out not to matter for the R2 comparison. The comparison is made by computing the R2 from a regression of yi on an intercept and ỹi, and then comparing this with the R2 from (79). Whichever is larger suggests whether yi should be logged or not. Note that this comparison is valid only when the two regressions (for yi and log yi) contain the same explanatory variables.

This procedure is made more convenient in Eviews because it offers the option of computing fitted values for yi directly from a regression estimated in log yi. In the regression for log wagei, choose "Proc - Forecast..." as shown in Figure 85, and then ensure that the fitted values are obtained for wage, and not log(wage), as shown in Figure 86. This creates a variable called "wagef" in the workfile (its name can be changed if desired). Figure 87 shows the regression used to compute R2 = 0.211, which is the proportion of variation in wagei explained by the regression for log (wagei). This R2 is slightly higher than that for the regression for wagei, implying the log transformation for wages is to be preferred in this case.
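The same comparison can also be sketched outside Eviews. The following Python snippet (simulated data, made-up coefficients) follows the manual procedure described above: compute fitted values for yi from the log model and compare the R2 each model achieves for yi itself.

# Levels versus logs comparison via R-squared for y itself, with simulated data.
import numpy as np

def r_squared(y, fitted):
    resid = y - fitted
    return 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(1, 20, n)
y = np.exp(0.2 + 0.1 * x + rng.normal(0, 0.4, n))   # a positively skewed outcome

X = np.column_stack([np.ones(n), x])
b_lev = np.linalg.lstsq(X, y, rcond=None)[0]            # levels SRF
b_log = np.linalg.lstsq(X, np.log(y), rcond=None)[0]    # log SRF

r2_levels = r_squared(y, X @ b_lev)

# fitted values for y from the log model, then R^2 from regressing y on them
yhat_from_log = np.exp(X @ b_log)
Z = np.column_stack([np.ones(n), yhat_from_log])
c = np.linalg.lstsq(Z, y, rcond=None)[0]
r2_from_log = r_squared(y, Z @ c)

print(r2_levels, r2_from_log)   # the larger value suggests which form to prefer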


Figure 85: Choosing to calculate fitted values

Figure 86: Choosing to calculate fitted values for wage, not log(wage)


Figure 87: Regression to compute R2 for wagei from the log wagei regression

6.4 Practical summary of functional forms

Define the general notation µy (x) = E (yi|xi = x). The following gives a summary of the functional forms and their marginal effects and prediction equations.

Linear
PRF: E (yi|xi) = β0 + β1xi
SRF: yi = β0 + β1xi
Marginal effect: µy (x+1) − µy (x) = β1
Prediction of µy (x): E (yi|xi) = µy (x) + β1 (xi − x)

Quadratic x
PRF: E (yi|xi) = β0 + β1xi + β2xi²
SRF: yi = β0 + β1xi + β2xi²
Marginal effect: µy (x+1) − µy (x) = β1 + β2 (2x+1) = β∗1
Marginal effect estimation: E (yi|xi) = β0 + β∗1xi + β2 (xi² − (2x+1) xi)
Prediction of µy (x): E (yi|xi) = µy (x) + β1 (xi − x) + β2 (xi² − x²)
Turning point: x = −β1 / (2β2)

Log x
PRF: E (yi|xi) = β0 + β1 log xi
SRF: yi = β0 + β1 log xi
Marginal effect: µy (1.01x) − µy (x) ≈ β1/100
Prediction of µy (x): E (yi|xi) = µy (x) + β1 log (xi/x)

Log y
PRF: E (log yi|xi) = β0 + β1xi
SRF: log yi = β0 + β1xi
Marginal effect: [µy (x+1) − µy (x)] / µy (x) · 100% ≈ β1 · 100%
Prediction of µlog y (x): E (log yi|xi) = µlog y (x) + β1 (xi − x)
Prediction of µy (x): µy (x) = α0 exp (µlog y (x)), with α0 = (1/n) ∑ni=1 exp (ui), where ui are the residuals from the log yi SRF
R2 for yi: R2 from an SRF of yi on an intercept and exp (fitted log yi)

Log y + quadratic x
PRF: E (log yi|xi) = β0 + β1xi + β2xi²
SRF: log yi = β0 + β1xi + β2xi²
Marginal effect: [µy (x+1) − µy (x)] / µy (x) · 100% ≈ (β1 + β2 (2x+1)) · 100% = β∗1 · 100%
Marginal effect estimation: E (log yi|xi) = β0 + β∗1xi + β2 (xi² − (2x+1) xi)
Prediction of µlog y (x): E (log yi|xi) = µlog y (x) + β1 (xi − x) + β2 (xi² − x²)
Prediction of µy (x): µy (x) = α0 exp (µlog y (x)), with α0 as above
R2 for yi: R2 from an SRF of yi on an intercept and exp (fitted log yi)

7 Comparing regressions

Comparing the fit of different regressions for the same dependent variable can be done in many different ways; there is not one correct approach. Four statistics will be discussed here for the purpose. First note, however, that while R2 is a useful descriptive statistic for a single regression, it has only very limited use for comparing different regressions. It can only be used for comparing regressions with the same number of explanatory variables. The problem with R2 is that it will never decrease when a new explanatory variable is added to a regression, no matter how little explanatory power it has. So comparing regressions with R2 will always end up giving preference to the largest model. The four closely related statistics given here do not have this problem and should be used for regression comparison.

7.1 Adjusted R2

For any SRF

yi = β0 + β1x1,i + . . . + βkxk,i,

recall the definitions

SST = ∑ni=1 (yi − ȳ)²   ("total sum of squares")
SSE = ∑ni=1 (ŷi − ȳ)²   ("explained sum of squares")
SSR = ∑ni=1 ûi²   ("residual sum of squares"),

where ŷi denotes the fitted value, ûi the residual and ȳ the sample mean. These satisfy

SST = SSE + SSR.

The R2 is defined as

R2 = SSE/SST = (SST − SSR)/SST = 1 − [SSR/(n−1)] / [SST/(n−1)].

The adjusted R2, denoted R̄2, is defined as

R̄2 = 1 − [SSR/(n−k−1)] / [SST/(n−1)],

the adjustment being the use of the degrees of freedom n−k−1 as the divisor of SSR in the numerator. A result of this change is that R̄2 may decrease if an explanatory variable with little predictive power is added to a regression, so it is a legitimate strategy to compare regressions with different numbers of explanatory variables based on R̄2 (as long as they have the same dependent variable). Another result of the change is that R̄2 ≥ 0 need not always hold as it does for R2. A negative R̄2 is a sign of a regression with very little overall explanatory power.
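A minimal sketch of the adjusted R2 formula, with made-up sums of squares chosen to show that R2 can rise while R̄2 falls when a regressor with little explanatory power is added.

# Adjusted R-squared as defined above, from given sums of squares.
def r2(ssr: float, sst: float) -> float:
    return 1 - ssr / sst

def adjusted_r2(ssr: float, sst: float, n: int, k: int) -> float:
    return 1 - (ssr / (n - k - 1)) / (sst / (n - 1))

# Illustrative numbers only: adding a near-useless regressor (k up, SSR barely down)
print(r2(420.0, 1000.0), adjusted_r2(420.0, 1000.0, n=100, k=3))
print(r2(419.5, 1000.0), adjusted_r2(419.5, 1000.0, n=100, k=4))  # R2 rises, adjusted R2 falls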

7.2 Information criteria

There are three closely related information criteria that can be used for comparisons of regression models – the Akaike, Schwarz and Hannan-Quinn criteria. They have the general form

IC = log (SSR/n) + (k+1) p/n,

where p is a "penalty term" taking the values

Akaike: p = 2
Schwarz: p = log n
Hannan-Quinn: p = 2 log log n

A regression is preferred to another if it has a smaller IC, whichever of the three is used. The problem with having four different criteria for model comparison is that it is unclear which to rely on. All four methods are widely used in practice and each of them is derived from different principles and has different desirable (and undesirable) properties. In order, the R̄2 is most inclined to prefer the larger of two regression models, followed by the Akaike IC, the Hannan-Quinn IC and then the Schwarz IC, which is the most likely of the four to prefer the smaller of two regression models. We will rely on the Akaike criterion in this subject.
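A short sketch of the three criteria exactly in the form given above; note that software packages may report information criteria computed from the log-likelihood, which differ by constants, so only comparisons between models matter. The numbers are illustrative only.

# The three information criteria in the form IC = log(SSR/n) + (k+1)*p/n.
import numpy as np

def info_criteria(ssr: float, n: int, k: int) -> dict:
    base = np.log(ssr / n)
    return {
        "Akaike": base + (k + 1) * 2 / n,
        "Schwarz": base + (k + 1) * np.log(n) / n,
        "Hannan-Quinn": base + (k + 1) * 2 * np.log(np.log(n)) / n,
    }

# Illustrative numbers: the model with the smaller criterion value is preferred.
print(info_criteria(ssr=420.0, n=100, k=3))
print(info_criteria(ssr=419.5, n=100, k=4))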

7.3 Adjusted R2 as an IC∗

The adjusted R2 at first sight appears quite different from the other three ICs, but in fact is very closely related. Choosing a model with a larger value of R̄2 is identical to choosing a model with a smaller value of log (1 − R̄2), and

log (1 − R̄2) = log [SSR/(n−k−1)] − log [SST/(n−1)]
             = log (SSR/n) + log [1/(1 − (k+1)/n)] − log [SST/(n−1)]
             ≈ log (SSR/n) + (k+1)/n − log [SST/(n−1)].

The last term does not depend on which regressors are included, so choosing a regression with larger R̄2 is (almost) equivalent to choosing a regression with smaller

log (SSR/n) + (k+1)/n,

implying p = 1.

8 Functional form

An additional issue that can occur is one of incorrect functional form. This has implications for the estimation of a conditional mean, whether or not causal inference is of interest. In general suppose the true conditional expectation is

E (yi|xi) = δ0 + δ1xi + g (xi),

but a linear SRF

yi = β0 + β1xi

is estimated. Recall the slope coefficient β1 has the representation

β1 = ∑ni=1 an,i yi,

where

an,i = (xi − x) / ∑ni=1 (xi − x)²,

and ∑ni=1 an,i = 0 and ∑ni=1 an,i xi = 1. Then

E [β1] = E [∑ni=1 an,i E (yi|xi)]
       = E [δ0 ∑ni=1 an,i + δ1 ∑ni=1 an,i xi + ∑ni=1 an,i g (xi)]
       = δ1 + E [∑ni=1 an,i g (xi)]
       ≠ δ1,

where ∑ni=1 an,i g (xi) has the interpretation of being the slope coefficient from a regression of g (xi) on xi.

The general conclusion from this is that a misspecified functional form results in biased estimates of the conditional mean E (yi|xi). This is a different problem from omitted variables, which does not bias conditional mean estimates, although it will generally bias estimates of causal effects if this is of interest.
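A small simulation sketch of this conclusion (all values made up): when the true conditional mean is quadratic but a straight line is fitted, the fitted line differs systematically from E (yi|xi) at most values of x.

# Bias in the fitted conditional mean when the functional form is misspecified.
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(0, 10, n)
true_mean = 1 + 0.5 * x + 0.3 * x**2          # delta0 + delta1*x + g(x)
y = true_mean + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]      # misspecified linear SRF

for x0 in (0.0, 5.0, 10.0):
    fitted = b[0] + b[1] * x0
    truth = 1 + 0.5 * x0 + 0.3 * x0**2
    print(x0, round(fitted, 2), round(truth, 2))   # fitted differs from the true mean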

9 Regression and Causality

A regression is a statistical model for the conditional mean of a dependent variable given some explanatory variables. To take a simple example, the PRF

E (wagei|educi) = β0 + β1educi (80)


measures how the average wage changes with different values of education. With β1 > 0, we would find the average wage of individuals with 15 years of education is higher than the average wage of individuals with 12 years of education, and the difference between these two averages would be 3β1.

It is common in practice to want to take the interpretation of a regression further and to claim a causal relationship. For example, that an individual who undertakes a university degree (hence increasing their years of education from 12 to 15) can expect to increase their wages by 3β1 as a result of this extra education. This causal statement is a much stronger interpretation of (80) than simply saying that higher educated individuals have higher average wages, and is far more difficult to justify. Much research at the frontier of econometrics focusses on if and how different statistical models might be given causal interpretations. It is generally necessary to go beyond statistical arguments to a clear understanding of the nature of the practical question and the way that the data has been obtained.

In order to give (80) a causal interpretation, it is necessary that an individual's wages be caused in a manner that satisfies the mathematical relationship

wagei = β0 + β1educi + ui, (81)

where ui is the disturbance term that captures all of the other factors that cause wages besides education, and it is necessary that this disturbance term satisfy

E (ui|educi) = 0. (82)

Taking the conditional expectation of both sides of (81) given educi and applying (82) gives (80). It is necessary that both (81) and (82) hold in order for the regression (80) to be given the interpretation that an extra year of education causes an individual's wage to rise by β1. Sometimes this interpretation may be possible, but there are many ways in which (81) and especially (82) may be violated, even though (80) may be a valid representation of the conditional mean of wages. Note that (82) requires that education have no explanatory power for any of the factors that make up the disturbance term ui, a requirement that can be very difficult to satisfy in practice.

9.1 Notation

One aspect of the notation here differs from that of the textbook. A regression is a statistical model of a conditional expectation, and so for our purposes is always represented explicitly as a conditional expectation as in (80). In Wooldridge and other textbooks, it is common to also represent a regression in the form (81), as well as sometimes in the form (80). In these notes the notation (81) will be reserved for an equation representing how the dependent variable is caused. This causal equation may or may not correspond to a regression equation, as we will now discuss.

To be clear, a regression model represents the conditional mean of the dependent variable and is therefore written in terms of that conditional mean (eg E (wagei|educi)). A causal equation represents how the dependent variable itself is determined and is therefore written in terms of that dependent variable (eg wagei). The regression model always measures the conditional mean, but if the regression model and the causal equation happen to coincide then the regression can also be given a causal interpretation.

9.2 Regression for prediction

Before discussing causal interpretations further, it should be noted that many regressions are not meant to be causal in the first place. Regressions for prediction / forecasting are a leading example. Consider the PRF for final exam marks

E (exami|asgnmti) = β0 + β1asgnmti. (83)


This provides a statistical model for how predicted final exam marks vary with assignment marks. It may have interest for both students and teachers in summarising the relationship between on-course assessment and the final exam. It is clearly not a causal regression though. Assignment marks do not cause exam marks. A better causal story would be that both assignment and exam marks are caused by some combination of study during the semester (including lecture and tutorial participation, reading and revision and so on) and pre-existing ability (extent of previous exposure to statistics, general intelligence and so on). A highly stylised causal model of this might be

exami = δ0 + δ1studyi + δ2abilityi + ui

asgnmti = γ0 + γ1studyi + γ2abilityi + vi,

where ui and vi represent the disturbances capturing all the other causal factors that influence individual marks. Presumably all of δ1, δ2, γ1, γ2 are positive, so that the causal model generates a positive statistical relationship between assignment and exam marks, and this statistical relationship is captured by (83). So estimates of (83) may be useful for estimating predicted final exam marks, but they do not attempt to uncover any causal factors that produce either of those marks in the first place. Regression (83) is an example of the saying that "correlation need not imply causation".

This discussion reveals one way in which an attempt at causal modelling may fail. A regression model E (yi|xi) = β0 + β1xi may be specified in the belief that xi causes yi, when the true story is that some other factor zi causes both yi and xi and produces a purely statistical relationship between them.

9.3 Omitted variables

Omitted explanatory variables are a common reason that regression models fail to measure causal effects. The case of wages and education is famous for this problem in econometrics. Suppose wages are truly caused by

wagei = δ0 + δ1educi + δ2abilityi + ui, (84)

where

E (ui|educi, abilityi) = 0. (85)

This is a highly simplified model of wages, but is sufficient for this discussion. Natural ability is a difficult concept involving intelligence of various sorts, persistence, resilience and other such factors. Numerical measurement of natural ability is probably impossible and wage regressions do not contain this variable in practice. Nevertheless, ability is surely an important causal factor for an individual's productivity, and hence their wages, implying δ2 > 0 in (84).

In addition, more able individuals will generally obtain higher levels of education, since they can use their ability to qualify for higher education opportunities and also will benefit more from taking up such opportunities. We might therefore expect to find a statistical relationship between education and ability of the form

E (abilityi|educi) = γ0 + γ1educi, (86)

with γ1 > 0. This education / ability relationship may or may not be causal, or causation may run in the opposite direction, but it doesn't matter for the discussion of the interpretation of (84).

Now suppose we specify a PRF of the form

E (wagei|educi) = β0 + β1educi, (87)


not including ability. The omission of ability does not introduce a problem for the SRF as an estimator of this PRF (we still have unbiasedness, asymptotically normal coefficients and t statistics and so on), so the estimation of the conditional mean of wages given education is correct. The question is whether β1 measures the causal effect of education on wages, i.e. whether β1 = δ1 in (84).

To answer this requires an extension of the LIE E [y] = E [E (y|x)]. A more general version is

E [y|z] = E [E (y|x, z) |z] .

This has exactly the same structure as the basic LIE, but each of the expectations has z as an additional conditioning variable. In the current context this extended LIE can be used to write

E (wagei|educi) = E [E (wagei|educi, abilityi) |educi] (88)

Taking the conditional expectation of (84) given educi and abilityi and applying (85) gives

E (wagei|educi, abilityi) = δ0 + δ1educi + δ2abilityi,

and substituting this into (88) gives

E (wagei|educi) = E [δ0 + δ1educi + δ2abilityi|educi] = δ0 + δ1educi + δ2E (abilityi|educi).

Substituting (86) then gives

E (wagei|educi) = δ0 + δ1educi + δ2 (γ0 + γ1educi)

= (δ0 + δ2γ0) + (δ1 + δ2γ1) educi.

Comparison with (87) reveals the relationship

β1 = δ1 + δ2γ1.

That is, β1 does not measure the causal effect δ1. Instead it measures a mixture of coefficients from both (84) and (86). The fact that the SRF for (87) estimates β1 and not δ1 is generally referred to as “omitted variable bias”. It is not bias in the statistical sense, since β̂1 is unbiased for β1 regardless. The so-called bias is not really an estimator property, but rather the fact that the model (87) does not match the causal mechanism (84) and therefore has different parameter values.

In this case it is plausible to think that δ2 > 0 and γ1 > 0, which implies β1 > δ1. That is, the regression (87) will over-state the causal effect of education on wages. For some intuition for this, imagine comparing average wages between two groups of individuals, the first group with 12 years of education, the second group with 15 years of education. The average wage for the second group will be higher (by 3β1). But this difference is due to two factors – the second group has extra education, but will also consist of individuals of generally higher ability. So the average wage difference between the two groups is due to both education and ability differences, not education alone. Attributing the entire average wage difference to education is an error because the comparison fails to control for ability differences.

Note that omitted variables would not be a problem for causal estimation if γ1 = 0. (It is assumed that δ2 ≠ 0 for this discussion, otherwise abilityi would be irrelevant anyway and could be safely omitted.) That is, if the included explanatory variable has no explanatory power for the omitted variable, there will be no “omitted variable bias”, i.e. β1 = δ1.
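The relationship β1 = δ1 + δ2γ1 can be illustrated with a small simulation. The following Python sketch is not part of the notes; all parameter values are made up for illustration. It generates data from a stylised version of the causal mechanism (84), fits the short regression of wage on education alone, and compares the fitted slope with δ1 + δ2γ1.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # made-up parameter values for illustration only
    d0, d1, d2 = 1.0, 0.08, 0.5                      # delta_0, delta_1, delta_2 in (84)

    ability = rng.normal(size=n)
    educ = 12 + 2 * ability + rng.normal(size=n)     # more able people get more education
    wage = d0 + d1 * educ + d2 * ability + rng.normal(size=n)   # causal mechanism (84)

    # gamma_1: slope of E(ability|educ), estimated by regressing ability on educ
    g1 = np.polyfit(educ, ability, 1)[0]

    # short regression of wage on educ alone (ability omitted)
    b1 = np.polyfit(educ, wage, 1)[0]

    print("short regression slope:", round(b1, 3))
    print("delta_1 + delta_2 * gamma_1:", round(d1 + d2 * g1, 3))

The two printed numbers should be very close, illustrating that the short regression recovers δ1 + δ2γ1 rather than δ1.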


9.4 Simultaneity

Another problem with causal interpretations of regression models arises when the causality between two variables runs in both directions. That is, there is causality from the explanatory variable to the dependent variable of the regression, but also causality in the other direction from the dependent variable to the explanatory variable. In this case we say the variables are simultaneously determined.

Consider the CEO salary example, where firm profitability (as measured by Return on Equity) was used as an explanatory variable. It was found that average CEO salary varied with firm profitability. This is not the same thing, however, as saying that the level of CEO salary is caused by firm profitability. This may be true, or it may be that highly paid CEOs are more competent and cause firms to be more highly profitable, or a mixture of the two effects. If firms determine their CEO’s salary on the basis of their profitability, and highly paid CEOs also cause higher profits, we would say the two outcomes are simultaneously determined. This might be represented in equation form as

salaryi = δ0 + δ1roei + ui (89)

roei = γ0 + γ1salaryi + vi, (90)

where ui represents the other factors that determine CEO salary and vi represents all the other factors that determine the firm’s return on equity. In order for each of these equations to be given some sort of statistical interpretation, it is necessary to say something about ui and vi. In the first equation we would like to assume that E (ui|roei) = 0, while in the second E (vi|salaryi) = 0. These assumptions would allow each of these equations to be given regression representations. Unfortunately neither assumption is possible when there is simultaneity. For example, E (ui|roei) = 0 implies that ui and roei must be uncorrelated, but the simultaneous structure of the equations dictates that any factor that causes the CEO’s salary must then also be a factor causing Return on Equity, because of salary’s presence in the second equation. This can be made explicit by substituting the equation for salaryi into the equation for roei and re-arranging to give

roei = (γ0 + γ1δ0)/(1 − δ1γ1) + γ1/(1 − δ1γ1) · ui + 1/(1 − δ1γ1) · vi.

This correlation between roei and ui implies that E (ui|roei) = 0 is not possible. Therefore the PRF

E (salaryi|roei) = β0 + β1roei

does not have the same parameters as the causal equation (89), i.e. β1 ≠ δ1. The PRF provides a representation of the conditional mean of CEO salary given Return on Equity, and an unbiased estimate is provided by the SRF, but the conditional mean differs from the causal equation because of the simultaneity.
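A small simulation also illustrates the simultaneity problem. The sketch below is illustrative only; the coefficient values are invented and chosen so that δ1γ1 is well below one and the system can be solved. It generates ui and vi, solves (89) and (90) for the reduced form, and shows that the OLS slope of salary on roe differs from the causal coefficient δ1.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000

    # made-up coefficients for illustration only
    d0, d1 = 1.0, 0.5    # salary equation (89)
    g0, g1 = 2.0, 0.3    # roe equation (90)

    u = rng.normal(size=n)
    v = rng.normal(size=n)

    # reduced form obtained by solving the two equations simultaneously
    denom = 1 - d1 * g1
    roe = (g0 + g1 * d0 + g1 * u + v) / denom
    salary = d0 + d1 * roe + u

    # OLS of salary on roe does not recover delta_1 because roe is correlated with u
    slope = np.polyfit(roe, salary, 1)[0]
    print("OLS slope:", round(slope, 3), " true delta_1:", d1)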

9.5 Sample selection

Sample selection problems can result in differences between the parameters of a PRF and the underlying causal mechanism. The problem arises when a simple random sample is not available, and instead the sample is chosen at least partly based on the dependent variable itself, or some other factor correlated with the dependent variable.

In Tutorial 5 it was found that a firm’s CEO salary was a positive predictor of the risk of the firm’s stock. Suppose there is a causal relationship

|returni| = δ0 + δ1salaryi + ui, (91)


with δ1 > 0 implying that higher CEO salaries cause higher risk in the stocks (greater magnitude movements in share price, either positive or negative). Further suppose for this story that

E (ui|salaryi) = 0,

so that

E (|returni| |salaryi) = δ0 + δ1salaryi. (92)

However, the risks undertaken by some highly paid CEOs may have been so large and gone so wrong that their firms went bankrupt. Such firms with very large negative returns may therefore be excluded from the sample (if, for example, their bankruptcy resulted in them being removed from a database of currently trading firms). To make the story simple, suppose we only observed firms for whom returni > −90 say, such that firms that lost more than 90% of their value went bankrupt and were excluded from the database. (This 90% figure is just made up for this story; firm bankruptcy is more complicated in practice of course!) In that case our regression model for E (|returni| |salaryi) is in fact a regression model for E (|returni| |salaryi, returni > −90). That is, if firms with returni ≤ −90 are unavailable for our sample, our regression model is really

E (|returni| |salaryi, returni > −90) = β0 + β1salaryi. (93)

However E (|returni| |salaryi, returni > −90) ≤ E (|returni| |salaryi), because the latter averages over some larger absolute returns that are excluded from the former. The main point is that the PRF (93) based on the available sample would not match the PRF (92) derived from the causal equation (91) for all firms, so the coefficients in (93) would differ from the causal coefficients in (91).

10 Regression with Time Series

Time series data differs in important respects from cross-sectional data. Time series data on a variable is collected over a period of time, as opposed to a cross-section which is collected at (at least approximately) a single point in time. Examples of time series data include observations on a share price or market index recorded each minute or each day or at any other frequency, or exchange rates measured similarly, or macroeconomic variables like price inflation or GDP growth that are measured monthly or quarterly, and so on. This time series aspect introduces different features to the data compared to a cross section. Firstly the observations are ordered, meaning that there is a natural ordering in time that does not apply to cross sections. When we take a simple random sample of individuals or firms or countries there is no single order of observations that is naturally imposed (although they can of course be ordered according to any criterion we wish after they are collected).

Statistically a very interesting feature of time series data is that there is generally some form of temporal dependence that is interesting to model. Temporal dependence means there is statistical dependence (i.e. correlation or predictability) between observations at different points in time. For example there may be information in today’s stock prices that is useful to predict movements in prices tomorrow, or information in this month’s inflation figure about next month’s inflation or interest rates or GDP growth, and so on. Modelling this dependence over time is of great interest both for forecasting / prediction purposes and also for attempts at causal modelling with time series. The dependence also means that the theory underlying regression using OLS is different, because the i.i.d. assumption is generally no longer applicable. That is, time series data cannot be collected using a simple random sample.

Variables with time series data are generally denoted as yt and xt rather than yi and xi. The difference is purely convention, but helps to remind which type of data is in use for a particular


model. Following Wooldridge, it will be useful to begin by denoting the dependent variable as yt and an explanatory variable as zt. The reason for the switch to zt instead of xt will become clear, but isn’t very important. A simple static regression with time series then looks like

E (yt|zt) = β0 + β1zt, (94)

which has the same structure as a cross sectional regression, but needs different theoretical underpinnings without the structure of an i.i.d. sample. We will not pursue this, but instead discuss some more interesting time series models that are used both for forecasting and causal modelling.

10.1 Dynamic regressions

A dynamic regression is one that models the ways in which the relationships between variables can evolve over time. There are many ways of doing this, but just two of the most popular approaches will be covered here.

10.1.1 Finite Distributed Lag model

In specifying a regression model, the concept of conditioning is obviously fundamental. A regression model is a model of a conditional mean. In time series analysis it becomes important to be clear about exactly what is being conditioned on in any regression model. It is often of most practical interest to condition not just on zt as in (94), but also on its past values as well. That is, a time series regression model is often specified conditional on all values of the explanatory variable that are observable at time t. The conditional expectation is written E (yt|zt, zt−1, . . . , z1). The idea is that previous values of zt might also be useful for explaining yt. A regression of the form

E (yt|zt, zt−1, . . . z1) = α0 + δ0zt + δ1zt−1 + . . .+ δqzt−q (95)

is often used. A variable of the form zt−j (for any j > 0) is called a lag of zt. The regression (95) is called a Finite Distributed Lag (FDL) model (the “Finite” part not being used by all authors). The number of lags q to include in this model can be determined on the basis of the sample size (using few lags if few observations are available), the frequency of the data (sometimes q = 12 for monthly data, q = 5 for daily data etc) or most commonly on the basis of statistical analysis of the model to see what value of q seems most appropriate for explaining yt.

An FDL model captures the idea that the full effect of a change in zt on the mean of yt may not occur immediately, but may take several time periods. For example, a central bank may raise official interest rates in order to attempt to reduce the level of inflation, but it is well known there are lags in adjustments in the economy such that interest rate changes take some months (as many as 12-18 months) for their effects to be fully felt. A very simple FDL model with monthly data to capture this idea would take the form

E (inft|rt, rt−1, . . . , r1) = α0 + δ0rt + δ1rt−1 + . . .+ δ12rt−12, (96)

which allows for current inflation (inft) to be explained by interest rate changes that were made up to 12 months ago. FDL models are typically used for policy analysis questions such as these.
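In these notes the estimation is done in Eviews; purely as an illustration, an FDL model can also be estimated by OLS in Python by constructing the lags explicitly. The sketch below assumes a pandas DataFrame df holding monthly series named 'inf' and 'r' (hypothetical names) and builds the regressors zt, zt−1, . . . , zt−q before fitting by OLS.

    import pandas as pd
    import statsmodels.api as sm

    def fit_fdl(df, y_col, z_col, q):
        """Fit an FDL(q) model by OLS: y_t on z_t, z_{t-1}, ..., z_{t-q}."""
        data = pd.DataFrame({y_col: df[y_col]})
        for j in range(q + 1):
            data[f"{z_col}_lag{j}"] = df[z_col].shift(j)   # z_{t-j}
        data = data.dropna()                               # drops the first q observations
        X = sm.add_constant(data.drop(columns=y_col))
        return sm.OLS(data[y_col], X).fit()

    # example: a 12-lag FDL of inflation on the interest rate, as in (96)
    # res = fit_fdl(df, "inf", "r", q=12)
    # print(res.summary())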

10.1.2 Autoregressive Distributed Lag model

FDL models can be extended to allow for lags of the dependent variable. The conditioning set for the regression is extended to cover not only present and past explanatory variables, but also past values of the dependent variable. The model is

E (yt|zt, yt−1, zt−1, . . . , y1, z1) = α0 + φ1yt−1 + . . . + φpyt−p + δ0zt + δ1zt−1 + . . . + δqzt−q, (97)


so that past values of yt are permitted to have explanatory power for yt. This is an additional way to introduce a concept of lagged effects or inertia into a model of a dynamic situation. Model (97) is called an Autoregressive Distributed Lag (ARDL) model. It is a flexible way of capturing dynamic effects, but requires more effort to interpret than the FDL model.

10.1.3 Forecasting

A small variation on the ARDL model is often used for forecasting. Forecasting is the attempt to predict a variable in the future. In the simplest case, a forecast is made one time period into the future. Regressions such as (95) and (97) are not useful for forecasting because they contain zt as an explanatory variable for yt, which means a forecast for the future value of yt is being expressed in terms of the future value of zt. A forecasting model needs to remove any variables at time t from its set of conditioning variables, and may take a form such as

E (yt|yt−1, zt−1, . . . , y1, z1) = α0 + φ1yt−1 + . . . + φpyt−p + δ1zt−1 + . . . + δqzt−q. (98)

In this model the forecast of yt (i.e. E (yt|yt−1, zt−1, . . . , y1, z1)) is expressed purely in terms of variables that are available in the previous time period t − 1.
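The practical appeal of a model like (98) is that, because only lagged variables appear on the right hand side, a forecast for the next period can be formed from data already observed. A minimal sketch, again assuming a DataFrame df with hypothetical columns 'y' and 'z' and taking p = q = 1:

    import pandas as pd
    import statsmodels.api as sm

    def one_step_forecast(df):
        data = pd.DataFrame({
            "y": df["y"],
            "y_lag1": df["y"].shift(1),   # y_{t-1}
            "z_lag1": df["z"].shift(1),   # z_{t-1}
        }).dropna()
        X = sm.add_constant(data[["y_lag1", "z_lag1"]])
        res = sm.OLS(data["y"], X).fit()
        # the forecast of the next period uses the last observed y and z
        x_new = [1.0, df["y"].iloc[-1], df["z"].iloc[-1]]
        return res.predict([x_new])[0]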

10.1.4 Application

Especially since the GFC, there has been considerable discussion in economics about the various possible effects of government debt on economic growth. The “austerity” story is that government debt crowds out private sector economic activity and undermines confidence, and hence prolongs the recession, giving rise to calls to cut government spending and hence government debt. The “fiscal stimulus” story is that at a time of recession the government should spend more than they otherwise might (going into further debt if necessary) in an effort to stimulate the economy and end the recession, leaving the task of then reducing the debt to when strong economic growth has resumed.

We will look at a simple dynamic model relating government debt and economic growth in Australia using annual data from 1971 to 2012 on the real GDP growth rate per year (from the RBA) and on the net government debt as a percentage of GDP (from the Australian government budget papers). Time series plots are shown in Figure 88 and the figure that follows it. Observe that the evolution of the debt/GDP ratio is much smoother than that of GDP growth, a fact that will inform our regression modelling later.

A time series regression model for the question of interest is the ARDL PRF

E (growtht|debtt, growtht−1, debtt−1, . . .) = α0 + φ1growtht−1 + . . . + φpgrowtht−p + δ0debtt + δ1debtt−1 + . . . + δqdebtt−q.

The debt variables in this model can be used to measure the effect of government debt on economic growth. The inclusion of the lagged dependent variables is a simple way for the model to allow for the other dynamics in the economy.


Figure 88: Annual real GDP growth, Australia

Net government debt as a percentage of GDP (DEBT_GDP)

10.2 OLS estimation

The algebra of OLS estimation in these regressions is almost identical to that for cross-sectional regressions, with one exception. The presence of lags in these regressions requires adjustments to be made at the start of the sample. To illustrate, suppose we have observations for t = 1, . . . , n, and specify the first-order FDL model

E (yt|zt, zt−1, . . . , z1) = α0 + δ0zt + δ1zt−1.

This equation has a problem for t = 1 because it involves the variable zt−1 = z0 on the right hand side, and this is unavailable. Strictly speaking we should write down these models as applying only to values of t for which the variables are available. For example

E (yt|zt, zt−1, . . . , z1) = α0 + δ0zt + δ1zt−1, t = 2, . . . , n


or

E (yt|zt, zt−1, . . . , z1) = α0 + δ0zt + δ1zt−1 + . . . + δqzt−q, t = q + 1, . . . , n

or

E (yt|zt, yt−1, zt−1, . . . , y1, z1) = α0 + φ1yt−1 + . . . + φpyt−p + δ0zt + δ1zt−1 + . . . + δqzt−q, t = max (p, q) + 1, . . . , n

and so on. OLS estimation therefore does not use all n observations; it only uses those observations for which the regression is well-defined for the sample. In the preceding examples, OLS will use respectively n − 1, n − q and n − max (p, q) observations.

The theory for time series regressions is more difficult than for i.i.d. regressions. Only an outline of some practically important points is given here.

10.2.1 Bias

Unbiasedness is more difficult to show in time series regressions, and often does not hold. Recall the unbiasedness proof for an i.i.d. regression

E (yi|xi) = β0 + β1xi,

in which the OLS estimator is written

β̂1 = ∑_{i=1}^{n} (xi − x̄) yi / ∑_{i=1}^{n} (xi − x̄)² = ∑_{i=1}^{n} an,i yi,

and then the independence part of the i.i.d. conditions is used to deduce that

E (yi|xi) = E (yi|x1, . . . , xn) , (99)

so that

E[β̂1] = E [∑_{i=1}^{n} an,i E (yi|x1, . . . , xn)] (100)
      = E [β0 ∑_{i=1}^{n} an,i + β1 ∑_{i=1}^{n} an,i xi] (101)
      = β1

(using ∑_{i=1}^{n} an,i = 0 and ∑_{i=1}^{n} an,i xi = 1). The crucial condition is (99), since without that the step from (100) to (101) cannot happen.

“Small” biases  Consider a simple time series regression

E (yt|xt, . . . , x1) = β0 + β1xt, (102)

where xt might be an explanatory variable such as zt in (95), or xt might just be the lagged dependent variable xt = yt−1, in which case we would have the so-called AR(1) model

E (yt|yt−1, yt−2, . . . , y1) = β0 + β1yt−1, (103)


which is often used as a very simple forecasting model. Now the crucial condition, analogous to (99), that is required is

E (yt|xt, . . . , x1) = E (yt|xn, . . . , x1) . (104)

If this is true then xt is said to be a strictly exogenous regressor and the OLS estimator of β1 is unbiased. However, in a time series setting without independence across time, (104) can easily fail.

The simplest situation in which (104) is certain to fail is in the AR(1) model. In that case we have

E (yt|yn−1, . . . , y1) = yt,

since yt is included in the conditioning set yn−1, . . . , y1. Thus E (yt|yn−1, . . . , y1) differs from E (yt|yt−1, yt−2, . . . , y1) in (103) for all t = 2, . . . , n − 1, implying that strict exogeneity does not hold in this model. The OLS estimator of an AR(1) model is biased, and more generally the OLS estimator of any model with a lagged dependent variable (eg any ARDL model) will also be biased. It turns out, however, that this bias is “small” in the sense that it does not arise from a misspecification of the model and disappears as the sample size grows. That is, in a “reasonable” sized sample we can expect the bias to be practically unimportant (just as in a “reasonable” sized sample we can treat the OLS coefficients and t statistics as being approximately normal and t distributed).
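The claim that the AR(1) bias is “small” and disappears as the sample grows can be checked by simulation. The sketch below is an illustration with a made-up value of φ1, not part of the notes; it simulates many AR(1) series of different lengths and reports the average OLS estimate of the slope, which moves towards the true value as n increases.

    import numpy as np

    rng = np.random.default_rng(2)
    phi = 0.5                      # true AR(1) slope, made up for illustration

    def ar1_ols_slope(n):
        y = np.zeros(n)
        for t in range(1, n):
            y[t] = 1.0 + phi * y[t - 1] + rng.normal()
        return np.polyfit(y[:-1], y[1:], 1)[0]   # OLS of y_t on y_{t-1}

    for n in (25, 100, 400):
        est = np.mean([ar1_ols_slope(n) for _ in range(2000)])
        print(f"n={n:4d}  average OLS estimate of phi: {est:.3f}")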

In (102) it is also possible for there to be bias if xt is not a lagged dependent variable, depending on the nature of the relationships between xt and yt. If yt has some explanatory value for future values of xt (i.e. the leads xt+j for some j > 0) then (104) will fail. For example, if yt is correlated with xt+1 as well as xt then we may have the relationship

E (yt|xn, . . . x1) = γ0 + γ1xt + γ2xt+1,

which is not equal to (102). This latter relationship is usually not interesting from a practical perspective since saying that yt is explained by future values of xt is useless for forecasting and is unlikely to be meaningful in causal modelling. The point is that we can use this to derive an expression for the coefficients in (102) by taking expectations conditional on xt, . . . , x1 and applying the LIE:

E (yt|xt, . . . , x1) = E [E (yt|xn, . . . , x1) |xt, . . . , x1] = γ0 + γ1xt + γ2E [xt+1|xt, . . . , x1].

For simplicity suppose that xt has the AR(1) conditional mean

E [xt+1|xt, . . . , x1] = λ0 + λ1xt,

so that

E (yt|xt, . . . , x1) = γ0 + γ1xt + γ2 (λ0 + λ1xt)

= (γ0 + γ2λ0) + (γ1 + γ2λ1)xt

= β0 + β1xt.

That is, in this situation the value of β1 in (102) is given by (γ1 + γ2λ1). Now the usual unbiasedness proof approach gives

E[β̂1] = E [∑_{t=1}^{n} an,t E (yt|xn, . . . , x1)]
      = E [∑_{t=1}^{n} an,t (γ0 + γ1xt + γ2xt+1)]
      = γ1 + γ2E [∑_{t=1}^{n} an,t xt+1]
      = γ1 + γ2E[λ̂1],

where λ̂1 is the OLS estimator of λ1 in the AR(1) model for xt. As an OLS estimator of an AR(1) model, E[λ̂1] ≠ λ1, and this is the source of the bias in β̂1. Again this bias is “small”, so that for “reasonable” sample sizes the bias in β̂1 can be treated as unimportant.

Such explanatory power of yt for future xt is quite realistic. For example, in (96), current values of inflation may be useful for forecasting future interest rate movements because the central bank may set interest rates partly in response to observed inflation. Some bias may be present in the OLS estimation of the FDL model (96) as a result.

“Large” biases  Biases can also arise from mis-specifying the conditional expectations. For example, suppose (102) is the assumed model, while the true conditional mean is

E (yt|xn, . . . , x1) = α0 + δ0xt + δ1xt−1,

with the form of the conditioning set implying that xt is a strictly exogenous regressor. This looks a lot like an omitted variables problem (i.e. xt−1 is omitted in (102)) but the consequences of omitting xt−1 are more like a functional form misspecification. That is, the estimates of the conditional mean E (yt|xt, . . . , x1) can be biased by the omission of xt−1. To see this, we take the same approach as in the analysis of functional form misspecification. The OLS estimator β̂1 in the SRF

ŷt = β̂0 + β̂1xt,

can be written

β̂1 = ∑_{t=1}^{n} an,t yt,

as usual, with ∑_{t=1}^{n} an,t = 0 and ∑_{t=1}^{n} an,t xt = 1. Now

E[β̂1] = E [∑_{t=1}^{n} an,t E (yt|xn, . . . , x1)]
      = E [∑_{t=1}^{n} an,t (α0 + δ0xt + δ1xt−1)]
      = δ0 + δ1E [∑_{t=1}^{n} an,t xt−1] ≠ δ0,

where ∑_{t=1}^{n} an,t xt−1 is the slope coefficient in a regression of xt−1 on an intercept and xt. If there is temporal dependence in xt then this regression coefficient will generally be non-zero, implying


that β̂1 is not an unbiased estimator of δ0. This bias does not disappear with larger samples. In attempting to model E (yt|xt, . . . , x1) (i.e. in any FDL or ARDL model), it is necessary to have a method of choosing enough lags to go in the regression in order to avoid inducing biases in the estimates.

Summary of biases  A time series regression is a conditional expectation E (yt|xt, xt−1, . . . , x1). The explanatory variables xt may include lagged dependent variables yt−1, yt−2, . . . and/or other explanatory variables zt, zt−1, . . .. That is, E (yt|xt, xt−1, . . . , x1) can include FDL, AR and ARDL models.

A “large bias” occurs if E (yt|xt, xt−1, . . . , x1) is not correctly specified, i.e. if insufficient lags are included or if the incorrect functional form is specified. A “large bias” is one that does not disappear no matter how large the sample size and is one we should try to avoid by careful specification.

If E (yt|xt, xt−1, . . . , x1) is correctly specified then the OLS estimates of its parameters may still be subject to “small biases” arising from the temporal dependence in the variables. This bias is difficult to avoid (i.e. it arises even in well-specified models) but will disappear for larger samples and is usually not worried about in practical work.

10.2.2 A general theory for time series regression

A general theoretical result that underpins much of practical time series analysis is as follows. If

1. the true conditional expectation for the PRF is

E (yt|zt, yt−1, zt−1, . . . , y1, z1) = α0 + φ1yt−1 + . . . + φpyt−p + δ0zt + δ1zt−1 + . . . + δqzt−q,

2. both yt and zt are weakly dependent,

then the parameters of the OLS SRF

ŷt = α̂0 + φ̂1yt−1 + . . . + φ̂pyt−p + δ̂0zt + δ̂1zt−1 + . . . + δ̂qzt−q

are consistent and asymptotically normal estimators of the parameters of the PRF.

There are some new terms in this. A consistent estimator is asymptotically unbiased, so that

any bias disappears as the sample size grows. That is, a consistent estimator may exhibit the “small bias” discussed above, but not “large bias”. An asymptotically normal estimator is one that obeys the Central Limit Theorem, just like the cross sectional case. Then the OLS estimators are approximately normal and subsequent t and Wald tests are valid. The practical implication of this result is that we can use the OLS estimators (and resulting t and Wald tests) in just the same way as they are used in cross-sectional regressions.

There are two important conditions to be satisfied. The first is that sufficient lags have been included in the ARDL model to remove any “large bias”, as discussed above. The second condition is that yt and zt are weakly dependent, which is a new concept. A time series xt is weakly dependent if any dependence between xt and xt−h decreases quickly to zero as h increases to infinity. An implication is that the correlation between xt and xt−h must quickly decrease to zero as h increases, which we will use to check for weak dependence. In a time series plot, a strongly dependent time series may exhibit a trend (a persistent upwards or downwards movement) and/or evolve very smoothly, and needs to be transformed before being included in a time series regression.

The practical steps for time series regression are therefore the following.


1. Check each variable for weak dependence and transform if necessary.

2. Choose an FDL/AR/ARDL specification with sufficient lags.

3. Carry out estimation and inference by OLS methods as usual.

This is not the final word on time series regression; there are more complications that can arise, but this approach is often sufficient.

10.3 Checking weak dependence

Deciding whether or not a time series displays weak or strong dependence can be a difficult and inexact process. The first piece of evidence to check is the time series plot. A strongly dependent time series may display a trend or very smooth plot, while a weakly dependent time series will be less smooth. The plots of GDP growth and debt/GDP above suggest that GDP growth is weakly dependent because its plot is not smooth at all, while the plot of debt/GDP is quite smooth and suggests strong dependence.

The other piece of evidence we will use is the correlogram. For a weakly dependent time series the correlation cor (xt, xt−h) will decrease quickly to zero as h increases, while for a strongly dependent time series cor (xt, xt−h) will decrease much more slowly. This is not a clear-cut criterion to apply, but is often informative. To obtain the correlogram of a time series in Eviews, choose “View - Correlogram...” for that series as shown in Figure 89, and then select “Level” for now. The correlograms for growth and debt/GDP are shown in Figures 90 and 91. The relevant correlations are in the graph under the heading “Autocorrelation” and in the table under the heading “AC”. The autocorrelations for growth are all quite small and support the graphical evidence that GDP growth is weakly dependent. The autocorrelations for debt/GDP are considerably larger and decrease much more slowly towards zero. This evidence, together with the time series plot, leads us to treat debt/GDP as strongly dependent.

If a variable is judged to be strongly dependent then the usual next step is to take its first difference in order to achieve weak dependence. The difference is defined as

∆debtt = debtt − debtt−1,

i.e. the amount by which the debt/GDP ratio changes from one year to the next. Usually one difference is sufficient, but occasionally differencing twice may be required. Eviews uses the letter “D” to generate a difference. The time series plot of ∆debtt is shown in Figure 92, where it can be seen to be substantially less smooth than its undifferenced version. The correlogram of ∆debtt is shown in Figure 93, where the autocorrelations can be seen to decrease towards zero faster than for the undifferenced version. These pieces of evidence are sufficient for us to proceed using debt in its first-differenced form ∆debtt.
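The differencing and the informal autocorrelation check can also be reproduced outside Eviews. As an illustration only, assuming debt is a pandas Series holding the annual debt/GDP ratio (a hypothetical name), the following sketch computes the first difference and the sample autocorrelations used to judge weak dependence.

    import pandas as pd

    def autocorrelations(series, max_lag=8):
        """Sample autocorrelations cor(x_t, x_{t-h}) for h = 1, ..., max_lag."""
        return [series.autocorr(lag=h) for h in range(1, max_lag + 1)]

    # d_debt = debt.diff()                      # first difference, as D(DEBT_GDP) in Eviews
    # print(autocorrelations(debt.dropna()))    # slow decay suggests strong dependence
    # print(autocorrelations(d_debt.dropna()))  # faster decay after differencing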

10.4 Model specification

To illustrate the specification and interpretation of these models, we will first select an FDL model and then an ARDL model. This is just for illustration and usually the selection process could cover all possibilities together.

For any model of E (yt|zt, yt−1, zt−1, . . . , y1, z1), it is a necessary condition for correct specification that the residuals of the SRF not display significant temporal dependence, in particular no autocorrelation. If we define the prediction error of the PRF as

et = yt − E (yt|zt, yt−1, zt−1, . . . , y1, z1) ,


Figure 89: Choosing a correlogram view of a variable

Figure 90: Correlogram of growth


Figure 91: Correlogram of debt/GDP

Figure 92: First difference of debt/GDP


Figure 93: Correlogram of ∆debtt

then it must be the case that

E (et|zt, yt−1, zt−1, . . . , y1, z1) = 0,

and hence E (et) = 0 by the LIE. Then

cov (et, et−j) = E (et et−j)
             = E [E (et et−j |zt, yt−1, zt−1, . . . , y1, z1)] (by the LIE)
             = E [E (et|zt, yt−1, zt−1, . . . , y1, z1) et−j]
             = 0,

so that et has no correlation with any lags of itself. The important step in this proof is from the second line to the third, where et−j is taken outside of the conditional expectation. This is possible because et−j is a function of yt−j, zt−j, yt−j−1, zt−j−1, . . . , y1, z1, all of which are contained in the conditioning set zt, yt−1, zt−1, . . . , y1, z1 for any j > 0. The practical implication of this is that any evidence of autocorrelation in the residuals of the SRF, which would imply cov (et, et−j) ≠ 0, suggests that the PRF has been misspecified and requires additional lags. A convenient check for autocorrelation is provided by the residual correlogram in Eviews, see Figure 94.

Figures 95 to 97 show the results of FDL regressions for growth including zero, one and two lags respectively. The residual correlograms are shown in Figures 98 to 100. The last two columns of the residual correlograms are useful for autocorrelation testing. The null hypothesis for the Q-stat in row r is that there is no correlation between et and et−j for all j = 1, . . . , r, with the p value for the test in the last column. For example, in Figure 98 we could set out a test for correlation at lags 1 to 4 as


Figure 94: Choosing a residual correlogram in an estimated equation

1. H0 : cov (et, et−j) = 0 for j = 1, 2, 3, 4

2. H1 : cov (et, et−j) ≠ 0 for one or more of j = 1, 2, 3, 4

3. α = 0.05

4. Test statistic : p = 0.662

5. Decision rule : reject H0 for p < 0.05

6. Do not reject H0, so there is no evidence of autocorrelation at any lags less than or equal to four.

It is generally unnecessary to set out the full test like this for an autocorrelation check. It is sufficient to look down the last column of p values, and if any of them are less than 0.05 then consider the model as being misspecified and move on to another that attempts to address the problem, for example by using more lags. In this case, there is no evidence of residual autocorrelation in any of the three FDL models.

Since all three models pass the autocorrelation test, we can compare them using their AIC values. The FDL model with a single lag has the smallest AIC value and so would be chosen from among these three models.
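Outside Eviews, the same residual autocorrelation check and AIC comparison can be done with statsmodels. The sketch below is illustrative only and assumes res0, res1 and res2 are fitted OLS results for the three FDL models (for example from the fit_fdl sketch earlier). It reports each model's AIC and a Ljung-Box Q test on the residuals at lag 4; statsmodels and Eviews may scale the AIC differently, but the ranking of the models is what matters.

    from statsmodels.stats.diagnostic import acorr_ljungbox

    def check_model(res, name):
        print(name, "AIC:", round(res.aic, 2))
        print(acorr_ljungbox(res.resid, lags=[4]))   # Q-stat and p-value at lag 4

    # for name, res in [("FDL(0)", res0), ("FDL(1)", res1), ("FDL(2)", res2)]:
    #     check_model(res, name)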

ARDL models are also specified for growth, see Figures 101, 103, 105. These are, respectively, ARDL(1,0), ARDL(1,1) and ARDL(1,2) models, implying each has a single lagged dependent variable (the AR(1) part) and respectively 0, 1 and 2 lagged explanatory variables. The residual correlogram for the ARDL(1,0) model in Figure 102 shows a significant lag nine autocorrelation, so this model is excluded from further comparisons. The other two models pass the residual autocorrelation tests. The ARDL(1,1) model has a lower AIC than the ARDL(1,2) model and is therefore preferred. The ARDL(1,1) model is also preferred to the FDL(1) model in Figure 96 according to the AIC. Of the six models considered here, the ARDL(1,1) would therefore be the one preferred overall. We will interpret both the ARDL(1,1) and FDL(1) models for illustrative purposes.


Figure 95: FDL model for growth with no lags

Figure 96: FDL model for growth with one lag


Figure 97: FDL model for growth with two lags

Figure 98: Residual correlogram for the FDL model with zero lags


Figure 99: Residual correlogram for the FDL model with one lag

Figure 100: Residual correlogram for the FDL model with two lags


Figure 101: ARDL(1,0) model for growth

Figure 102: Residual correlogram for ARDL(1,0) model


Figure 103: ARDL(1,1) model for growth

Figure 104: Residual correlogram for ARDL(1,1) model


Figure 105: ARDL(1,2) model for growth

Figure 106: Residual correlogram for ARDL(1,2) model


10.5 Interpretation

The interpretation of models with lags is not quite as straightforward as in static regressions.

10.5.1 Interpretation of FDL models

Consider an FDL model

E (yt|xt, xt−1, . . . , x1) = α0 + δ0xt + δ1xt−1.

The individual coefficients have similar interpretations to usual regressions. If xt is increased by one unit then the conditional mean of yt changes by δ0 units, and in these regressions this is called the impact multiplier. If xt−1 is increased by one unit then the conditional mean of yt changes by δ1 units, so this change takes one time period before it occurs. The coefficient δ1 is called the lag one multiplier. These interpretations carry over to longer lags in FDL models.

Now suppose xt−1 were increased by one unit and this increase were allowed to remain in time t as well. In that case both xt and xt−1 have been increased by one unit, so the effect on the conditional mean of yt is δ0 + δ1. This is called the long run multiplier. This joint interpretation of the coefficients (i.e. allowing both xt and xt−1 to increase, rather than increasing one and holding the other constant) makes practical sense. If xt were the official interest rate and yt inflation for example, the central bank would be interested to measure the effect on inflation if the interest rate were increased by 1% this month and the increase allowed to stay in place next month. In that context, the long run multiplier has more practical meaning than the lag one multiplier, which measures the effect of a 1% increase in the interest rate in one month that is then reversed the following month.

The long run multiplier can be estimated directly by re-writing the FDL regression as

E (yt|xt, xt−1, . . . , x1) = α0 + (δ0 + δ1)xt − δ1∆xt.

That is, regressing yt on an intercept and xt and ∆xt will give a direct estimate of the long run multiplier, along with its standard error for t statistics and confidence intervals.

The FDL(1) model in Figure 96 can be written

growtht = 3.198 − 0.586∆debtt + 0.563∆debtt−1, n = 40, R² = 0.333,

with standard errors 0.186, 0.153 and 0.161 respectively.

The two slope coefficients are significant at the 5% level (i.e. the p values on their t statistics are less than 0.05). The impact multiplier is −0.586, so that a one percentage point increase in the change of the debt/GDP ratio predicts a 0.586 percentage point fall in the growth rate. This is both statistically and economically significant and would be consistent with the “austerity” story. However, the lag one multiplier is +0.563, so that one period later the effect of the increase in the rate of change of the debt/GDP ratio is of opposite sign and approximately the same magnitude. The long run multiplier is −0.586 + 0.563 = −0.023, so that an initial negative effect on growth is almost completely offset the following year by a positive effect on growth, with the net effect of the increase in debt on growth being very small. The transformed regression to estimate the long run multiplier directly involves a regression of growth on an intercept, ∆debtt and ∆²debtt (i.e. the second difference of debt), the results of which are shown in Figure 107. The long run multiplier estimate of −0.023 is insignificant, implying very little long run effect of government debt changes on predictions for economic growth, despite the significant changes in the short run (at lags 0 and 1).
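The direct estimation of the long run multiplier in Figure 107 amounts to regressing growth on ∆debtt and ∆²debtt. A minimal Python sketch of that transformed regression, assuming annual pandas Series named growth and debt (hypothetical names), is:

    import pandas as pd
    import statsmodels.api as sm

    def long_run_multiplier_regression(growth, debt):
        d_debt = debt.diff()        # the regressor x_t = Δdebt_t
        d2_debt = d_debt.diff()     # Δx_t = Δ²debt_t
        data = pd.DataFrame({"growth": growth,
                             "d_debt": d_debt,
                             "d2_debt": d2_debt}).dropna()
        X = sm.add_constant(data[["d_debt", "d2_debt"]])
        res = sm.OLS(data["growth"], X).fit()
        # the coefficient on d_debt is the long run multiplier, with its own standard error
        return res.params["d_debt"], res.bse["d_debt"]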


Figure 107: Direct estimation of long run multiplier on the FDL model for growth

10.5.2 Interpretation of ARDL models

Interpretation of ARDL models is more complicated because the dynamic effects are formed by a mixture of the lagged dependent and lagged explanatory variables. For the purposes of interpretation here, we will assume that zt is strictly exogenous. This makes the derivations simpler and may be a reasonable assumption when zt is a policy variable such as government debt or an official interest rate.

ARDL(1,0) Consider first the ARDL(1,0) model

E (yt|yt−1, . . . , y1, zn, . . . , z1) = α0 + φ1yt−1 + δ0zt. (105)

The conditioning on all of z1, . . . , zn, not just z1, . . . , zt, is possible because of the assumption that zt is strictly exogenous. The impact multiplier of a one unit increase in zt on the conditional mean of yt is δ0. That is a standard interpretation.

Looking for effects at higher lags requires some derivations. First take (105) and lag it by one time period:

E (yt−1|yt−2, . . . , y1, zn, . . . , z1) = α0 + φ1yt−2 + δ0zt−1. (106)

Now take the expectation of both sides of (105) conditional on yt−2, . . . , y1, zn, . . . , z1 and use the LIE on the left hand side and (106) on the right hand side to obtain

E (yt|yt−2, . . . , y1, zn, . . . , z1) = α0 + φ1E (yt−1|yt−2, . . . , y1, zn, . . . , z1) + δ0zt
                                   = α0 + φ1 (α0 + φ1yt−2 + δ0zt−1) + δ0zt
                                   = α0 (1 + φ1) + φ1²yt−2 + φ1δ0zt−1 + δ0zt. (107)

This representation shows that the lag one multiplier for a one unit increase in zt is φ1δ0. To find the lag two multiplier, lagging (106) by another time period gives

E (yt−2|yt−3, . . . , y1, zn, . . . , z1) = α0 + φ1yt−3 + δ0zt−2,


and then taking the expectation of (107) conditional on yt−3, . . . , y1, zn, . . . , z1 gives

E (yt|yt−3, . . . , y1, zn, . . . , z1) = α0 (1 + φ1) + φ1²E (yt−2|yt−3, . . . , y1, zn, . . . , z1) + φ1δ0zt−1 + δ0zt
                                   = α0 (1 + φ1) + φ1² (α0 + φ1yt−3 + δ0zt−2) + φ1δ0zt−1 + δ0zt
                                   = α0 (1 + φ1 + φ1²) + φ1³yt−3 + φ1²δ0zt−2 + φ1δ0zt−1 + δ0zt.

This shows that the lag two multiplier for a one unit increase in zt is φ1²δ0.

This process can be repeated as often as desired. The pattern is clearly that the lag j multiplier for a one unit increase in zt is φ1^j δ0. The long run multiplier is the sum over all of the individual lag j multipliers. It is both simple and conventional to sum over all j = 0, 1, 2, . . . without upper limit, giving

δ0 + δ0φ1 + δ0φ1² + δ0φ1³ + . . . = δ0 / (1 − φ1),

which uses the geometric series ∑_{j=0}^{∞} φ1^j = 1/(1 − φ1) for |φ1| < 1, the latter condition being satisfied when yt is weakly dependent. This is the long run effect on the conditional mean of yt of a permanent one unit increase in zt.

ARDL(1,1) The same approach to interpretation applies to the ARDL(1,1) model

E (yt|yt−1, . . . , y1, zn, . . . , z1) = α0 + φ1yt−1 + δ0zt + δ1zt−1. (108)

Repeated lagging and taking conditional expectations as above leads to

E (yt|yt−j, . . . , y1, zn, . . . , z1) = α0 (1 + φ1 + . . . + φ1^{j−1}) + φ1^j yt−j + δ0zt + (φ1δ0 + δ1) zt−1 + φ1 (φ1δ0 + δ1) zt−2 + φ1² (φ1δ0 + δ1) zt−3 + . . . ,

from which it can be seen that the lag j multiplier for a one unit increase in zt is δ0 for j = 0 and φ1^{j−1} (φ1δ0 + δ1) for j > 0. The long run multiplier is therefore

δ0 + (φ1δ0 + δ1) ∑_{j=1}^{∞} φ1^{j−1} = δ0 + (φ1δ0 + δ1)/(1 − φ1) = (δ0 + δ1)/(1 − φ1).

Returning to the ARDL(1,1) model in Figure 103, the SRF can be written

growtht = 4.070 − 0.269 growtht−1 − 0.726∆debtt + 0.617∆debtt−1, n = 40, R² = 0.383,

with standard errors 0.470, 0.153, 0.181 and 0.176 respectively, giving φ̂1 = −0.269, δ̂0 = −0.726, δ̂1 = 0.617.

The multipliers are therefore

Impact (lag 0): δ̂0 = −0.726
Lag 1: φ̂1δ̂0 + δ̂1 = (−0.269)(−0.726) + 0.617 = 0.812
Lag 2: φ̂1(φ̂1δ̂0 + δ̂1) = (−0.269)(0.812) = −0.218
Lag 3: φ̂1²(φ̂1δ̂0 + δ̂1) = (−0.269)(−0.218) = 0.059
and so on.


Figure 108: Wald test of the long run multiplier

The long run multiplier is

(δ̂0 + δ̂1)/(1 − φ̂1) = (−0.726 + 0.617)/(1 − (−0.269)) = −0.086.

The evidence from this regression is that changes in government debt predict short run changes in economic growth, greatest in the first 2-3 years, but quickly decreasing to zero thereafter. Also the effects tend to oscillate in sign so that they cancel out when added, producing a long run effect that is quite small and, according to the Wald test in Figure 108, statistically insignificant.
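The multipliers above follow a simple recursion: each lag multiplier beyond the first is the previous one multiplied by φ1. The short sketch below reproduces the table and the long run multiplier from the reported ARDL(1,1) estimates (rounding explains any small discrepancies with the figures quoted above).

    # multipliers implied by the ARDL(1,1) estimates reported above
    phi1, d0, d1 = -0.269, -0.726, 0.617

    mult = [d0, phi1 * d0 + d1]          # impact and lag-1 multipliers
    for j in range(2, 6):
        mult.append(phi1 * mult[-1])     # lag-j multiplier = phi1 * lag-(j-1) multiplier

    long_run = (d0 + d1) / (1 - phi1)

    print("lag multipliers:", [round(m, 3) for m in mult])
    print("long run multiplier:", round(long_run, 3))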

These results are about the predictions of economic growth on the basis of changes in government debt. All of the difficulties outlined above in making causal interpretations apply here. In particular there are other variables besides government debt that may influence economic growth. Also there may be simultaneity between government debt and growth – for example, a slowdown in economic growth may increase government expenditures (unemployment benefits) and decrease tax receipts (reduced company and income taxes and GST because of reduced economic activity) and hence increase the government debt. So these results are informative about the dynamic structure of the conditional expectations that relate economic growth and government debt, but must be treated with very great caution for causal inference.

11 Regression in matrix notation

Linear regression is far neater to present using matrix notation.

11.1 Definitions

A matrix is a rectangular arrangement of numbers. For example

A = [ 1 4
      2 5
      3 6 ].

The dimension of a matrix is denoted r × c, where r is the number of rows and c is the number of columns. The matrix A has dimension 3 × 2.


The individual elements of the matrix A can be denoted ai,j, where i = 1, . . . , r indexes the row and j = 1, . . . , c indexes the column. So a2,1 = 2 and a3,2 = 6.

Two matrices are defined to be equal if they have the same dimensions and their individual elements are all equal.

A square matrix has the same number of columns as rows; that is, r = c.

A column vector, or simply a vector, is a matrix consisting of a single column. A row vector

consists of a single row. For example, if we define

B = [ 1
      2 ],   C = [ 1 2 3 ],

then B is a (column) vector and C is a row vector. Their dimensions are 2 × 1 and 1 × 3 respectively.

A scalar is a 1 × 1 matrix, that is, a single number.

The transpose of a matrix turns its columns into rows (equivalently rows into columns). The transpose of A is denoted A′ (though some denote the transpose as Aᵀ). For example,

A′ = [ 1 2 3
       4 5 6 ],   B′ = [ 1 2 ],   C′ = [ 1
                                         2
                                         3 ].

Note that (A′)′ = A for any matrix, and that the transpose of a scalar is just the scalar (eg. 2′ = 2). If A has dimension r × c then A′ has dimension c × r.

A square matrix M is symmetric if M = M ′. For example,

M = [ 1 4 6
      4 2 0
      6 0 3 ]

is a symmetric matrix. If M has elements mi,j then it is symmetric if mi,j = mj,i for all i and j.

The main diagonal of a square matrix consists of the elements running from the top left corner to the bottom right corner of the matrix, denoted diag (M). In the example,

diag (M) = [ 1
             2
             3 ].

That is, diag (M) is the vector of elements mi,i for all i. A symmetric matrix is symmetric about its main diagonal, meaning those elements below the main diagonal are reflected above the main diagonal.

11.2 Addition and Subtraction

Two matrices can be added or subtracted if they have the same dimensions; that is, if they are conformable for addition. Addition and subtraction are element-wise, for example if

D = [ 3 8
      2 9
      1 −2 ],

then A and D are conformable for addition and

A + D = [ 4 12
          4 14
          4  4 ],   A − D = [ −2 −4
                               0 −4
                               2  8 ].

It is not possible to add or subtract non-conformable matrices. For example, A + B and A + C are not defined.


11.3 Multiplication

If x is a scalar then the product Ax means each element of A is multiplied by x. For example, if x = 2 then

Ax = [ 2 8
       4 10
       6 12 ].

Suppose we have two matrices A and B with respective dimensions rA × cA and rB × cB. The matrix product AB can be defined if cA = rB; that is, if the number of columns of A matches the number of rows of B. In this case, AB is a matrix of dimension rA × cB, with individual elements of the form

(AB)i,j = ∑_{k=1}^{cA} ai,k bk,j.

For example, with A and B as defined above, we have cA = rB = 2 so that the product AB is defined, and the result will have dimension rA × cB = 3 × 1:

AB = [ 1 4
       2 5
       3 6 ] [ 1
               2 ]
   = [ 1×1 + 4×2
       2×1 + 5×2
       3×1 + 6×2 ]
   = [ 9
       12
       15 ].

Unlike with scalars, matrix multiplication is not commutative; that is, AB ≠ BA in general. In fact, AB may be defined but BA not defined. The current definitions of A and B illustrate this: BA is not defined because B has one column and A has three rows. Even if AB and BA are both defined, they may not be the same dimension, and even if they are the same dimension, AB and BA will generally be different.

For example, if

E = [ 2 3 ],

then

BE = [ 2 3
       4 6 ],   EB = 8.

The transpose of a matrix product satisfies (AB)′ = B′A′.
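These matrix operations correspond directly to array operations in numerical software. As an illustration only (not part of the notes), the examples above can be reproduced in Python with numpy:

    import numpy as np

    A = np.array([[1, 4], [2, 5], [3, 6]])   # 3x2
    B = np.array([[1], [2]])                 # 2x1 column vector
    E = np.array([[2, 3]])                   # 1x2 row vector

    print(A @ B)          # matrix product, a 3x1 vector: [[9], [12], [15]]
    print(B @ E)          # 2x2 matrix
    print(E @ B)          # 1x1 matrix containing the scalar 8
    print((A @ B).T)      # transpose of a product; equals B.T @ A.T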

11.4 The PRF

A multiple regression model has the general form

E (yi|x1,i, . . . xk,i) = β0 + β1x1,i + . . .+ βkxk,i.

The right hand side of this can be compactly written in matrix form. Define the (k + 1) × 1 vectors

xi = [ 1
       x1,i
       ...
       xk,i ],   β = [ β0
                       β1
                       ...
                       βk ].

Then

x′iβ = β0 + β1x1,i + . . . + βkxk,i,

and the PRF can be written

E (yi|xi) = x′iβ.

This representation is very useful for theoretical and computational purposes.


11.5 Matrix Inverse

The determinant of a square r × r matrix A is denoted |A|. If A is 2× 2 then

|A| = | a11 a12 |
      | a21 a22 |  = a11a22 − a12a21.

Formulae for higher order matrices are more involved.

If |A| = 0 then A is called singular, while if |A| ≠ 0 then A is non-singular. For example,

| 1 2 |            | 1 −2 |
| 2 4 | = 0,       | 2  4 | = 8,

so the first matrix is singular and the second is non-singular. A singular matrix satisfies Ac = 0 for some vector c ≠ 0, while a non-singular matrix has Ac ≠ 0 for all c ≠ 0. For example,

A = [ 1 2
      2 4 ],   c = [ 2
                     −1 ],   Ac = [ 0
                                    0 ],

while if

A = [ 1 −2
      2  4 ]

there is no vector c ≠ 0 such that Ac = 0.

The identity matrix is a square matrix of dimension r, denoted Ir, such that

AIr = IrA = A

for any r × r matrix A. It has ones on the main diagonal and zeros elsewhere, that is

Ir = [ 1 0 . . . 0
       0 1 . . . 0
       ...
       0 0 . . . 1 ].

The inverse, if it exists, of an r× r square matrix A is an r× r matrix denoted A−1 and satisfies

A−1A = AA−1 = Ir.

The inverse exists only if A is non-singular.

If A is 2 × 2 then

A−1 = [ a11 a12
        a21 a22 ]−1 = (1/|A|) [ a22 −a12
                                −a21 a11 ].

For example,

[ 1 −2
  2  4 ]−1 = (1/8) [ 4 2
                     −2 1 ] = [ 1/2  1/4
                                −1/4 1/8 ],

but

[ 1 2
  2 4 ]−1

does not exist.

In the linear system of equations

Ax = b,

if A is non-singular then A−1 exists. Multiplying both sides of the system by A−1 gives A−1Ax = A−1b, so the solution is

x = A−1b.
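Numerically, a non-singular system Ax = b is usually solved directly rather than by forming A−1 explicitly. An illustrative numpy sketch (the right hand side b is made up):

    import numpy as np

    A = np.array([[1.0, -2.0], [2.0, 4.0]])   # non-singular: |A| = 8
    b = np.array([1.0, 2.0])                  # made-up right hand side

    x = np.linalg.solve(A, b)        # solves Ax = b without forming the inverse explicitly
    print(x, np.linalg.inv(A) @ b)   # same answer as x = A^{-1} b

Using solve rather than inv is generally preferred for numerical accuracy, though both give the solution here.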


11.6 OLS in matrix notation

For a PRF E (yi|xi) = x′iβ, the OLS estimator β̂ is the choice of vector b that minimises the sum of squared residuals

SSR(b) = ∑_{i=1}^{n} (yi − x′ib)².

This can be shown to be

β̂ = (∑_{i=1}^{n} xi x′i)−1 ∑_{i=1}^{n} xi yi.

The matrix ∑_{i=1}^{n} xi x′i is a square (k + 1) × (k + 1) matrix. To be non-singular, there must be no vector c ≠ 0 such that

∑_{i=1}^{n} xi x′i c = 0.

For there to be such a vector c, ∑_{i=1}^{n} xi x′i c = 0 would require x′ic = 0 for all i, which would imply a perfect linear relationship among the elements of the xi vector, i.e. perfect multicollinearity. So the condition that there is no perfect multicollinearity implies that ∑_{i=1}^{n} xi x′i is non-singular and has an inverse, and hence that β̂ can be computed.
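Noting that ∑_{i=1}^{n} xi x′i = X′X and ∑_{i=1}^{n} xi yi = X′y when the rows of X are the x′i, the OLS formula can be computed in a few lines. The sketch below is an illustration on simulated data (all values made up), solving the normal equations rather than inverting X′X.

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 200, 2
    beta_true = np.array([1.0, 0.5, -0.3])                        # made-up (k+1)x1 parameters

    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])    # rows are x_i'
    y = X @ beta_true + rng.normal(size=n)

    # beta_hat = (sum_i x_i x_i')^{-1} sum_i x_i y_i = (X'X)^{-1} X'y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)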

11.6.1 Proof∗

The OLS estimator can be derived using vector calculus, or it can be shown to minimise SSR (b) as follows. Write

SSR (b) = ∑_{i=1}^{n} ((yi − x′iβ̂) − x′i(b − β̂))²
        = ∑_{i=1}^{n} (yi − x′iβ̂)² − 2(b − β̂)′ ∑_{i=1}^{n} xi (yi − x′iβ̂) + (b − β̂)′ ∑_{i=1}^{n} xi x′i (b − β̂).

The first term is SSR(β̂) = ∑_{i=1}^{n} (yi − x′iβ̂)², while the formula for β̂ can be used to show that the second term satisfies

∑_{i=1}^{n} xi (yi − x′iβ̂) = ∑_{i=1}^{n} xi yi − ∑_{i=1}^{n} xi x′i (∑_{i=1}^{n} xi x′i)−1 ∑_{i=1}^{n} xi yi = ∑_{i=1}^{n} xi yi − ∑_{i=1}^{n} xi yi = 0,

so

SSR (b) = SSR(β̂) + (b − β̂)′ ∑_{i=1}^{n} xi x′i (b − β̂).

If b = β̂ then (b − β̂)′ ∑_{i=1}^{n} xi x′i (b − β̂) = 0. If b ≠ β̂ then writing c = b − β̂ gives

(b − β̂)′ ∑_{i=1}^{n} xi x′i (b − β̂) = ∑_{i=1}^{n} c′xi x′ic = ∑_{i=1}^{n} zi² > 0

since zi = c′xi ≠ 0 when there is no perfect multicollinearity. Thus SSR (b) > SSR(β̂) when b ≠ β̂. This shows that SSR (b) is minimised by b = β̂.


11.7 Unbiasedness of OLS

Suppose (yi, xi) are i.i.d. for i = 1, . . . , n and E (yi|xi) = x′iβ. Then the independence part of i.i.d. implies that

E (yi|xi) = E (yi|x1, . . . , xn) .

Then

E[β̂] = E [(∑_{i=1}^{n} xi x′i)−1 ∑_{i=1}^{n} xi E (yi|x1, . . . , xn)]
     = E [(∑_{i=1}^{n} xi x′i)−1 ∑_{i=1}^{n} xi E (yi|xi)]
     = E [(∑_{i=1}^{n} xi x′i)−1 ∑_{i=1}^{n} xi x′iβ]
     = β,

showing the OLS estimator is unbiased. The proof is clearly far simpler when expressed in matrix notation.

11.8 Time series regressions

Matrix notation provides a convenient way to represent time series regressions. For example, the ARDL model

E (yt|zt, yt−1, zt−1, . . . , y1, z1) = α0 + φ1yt−1 + . . . + φpyt−p + δ0zt + δ1zt−1 + . . . + δqzt−q,

can be written as

E (yt|xt, . . . , x1) = x′tβ,

where

xt = [ 1
       yt−1
       ...
       yt−p
       zt
       ...
       zt−q ],   β = [ α0
                       φ1
                       ...
                       φp
                       δ0
                       ...
                       δq ].

The strict exogeneity condition is

E (yt|xt, . . . , x1) = E (yt|xn, . . . , x1) ,


under which β̂ is exactly unbiased:

E[β̂] = E [(∑_{t=1}^{n} xt x′t)−1 ∑_{t=1}^{n} xt E (yt|xn, . . . , x1)]
     = E [(∑_{t=1}^{n} xt x′t)−1 ∑_{t=1}^{n} xt E (yt|xt, . . . , x1)]
     = E [(∑_{t=1}^{n} xt x′t)−1 ∑_{t=1}^{n} xt x′tβ]
     = β.

Without the strict exogeneity condition, β̂ need only be asymptotically unbiased.
