
COMPUTATIONAL LABORATORY FOR ECONOMICS

GABRIELE CANTALUPPI

Notes for the students

Milano 2013

© 2012-2013 EDUCatt - Ente per il Diritto allo Studio Universitario dell'Università Cattolica
Largo Gemelli 1, 20123 Milano - tel. 02.7234.22.35 - fax 02.80.53.215
e-mail: [email protected] (production); [email protected] (distribution)
web: www.educatt.it/libri

ISBN print edition: 978-88-6780-021-6
ISBN electronic edition: 978-88-6780-022-3

The print edition of this volume was printed in September 2013 at Litografia Solari (Peschiera Borromeo - Milano)

CONTENTS

Preface

1 Some Elements of Statistical Inference
  1.1 On the Properties of the Sample Mean
    1.1.1 The Normal Distribution Case
    1.1.2 The Central Limit Theorem

2 An Introduction to Linear Regression
  2.1 Example: Individual wages (2.1.2)
    2.1.1 Data Reading and summary statistics
    2.1.2 Some graphical representations and grouping statistics
    2.1.3 Simple Linear Regression
    2.1.4 Confidence intervals (Section 2.5.2)
  2.2 Multiple Linear Regression (Section 2.5.5)
    2.2.1 Parameter estimation
    2.2.2 ANOVA to compare the two models (Section 2.5.5)
  2.3 CAPM example (Section 2.7)
    2.3.1 CAPM regressions (without intercept) (Table 2.3)
    2.3.2 Testing an hypothesis on β₁
    2.3.3 CAPM regressions (with intercept) (Table 2.4)
    2.3.4 CAPM regressions (with intercept and January dummy) (Table 2.5)
  2.4 The World's Largest Hedge Fund (Section 2.7.3)
  2.5 Dummy Variables Treatment and Multicollinearity (Section 2.8.1)
  2.6 Missing Data, Outliers and Influential Observations
  2.7 How to check the form of the distribution
    2.7.1 Data histogram with the theoretical density function
    2.7.2 The χ² goodness-of-fit test
    2.7.3 The Kolmogorov-Smirnov test
    2.7.4 The PP-plot and the QQ-plot
    2.7.5 Use of the function fit.cont
  2.8 Two tests for assessing normality
    2.8.1 The Jarque-Bera test
    2.8.2 The Shapiro-Wilk test
  2.9 Some further comments on the QQ-plot
    2.9.1 Positively skewed distributions
    2.9.2 Negatively skewed distributions
    2.9.3 Leptokurtic distributions
    2.9.4 Platykurtic distributions

3 Interpreting and comparing Linear Regression Models
  3.1 Explaining House Prices (Section 3.4)
    3.1.1 Testing the functional form: construction of the RESET test
    3.1.2 Testing the functional form: a direct function to perform the RESET test
    3.1.3 Testing the functional form: the RESET test for the extended model
    3.1.4 Testing the functional form: the interaction term
    3.1.5 Prediction
    3.1.6 Model with price instead of log(price) as dependent variable and lotsize instead of log(lotsize) among the predictors
    3.1.7 The PE test to compare a loglinear specification with the linear specification
  3.2 Selection procedures: Predicting Stock Index Returns (Section 3.5)
    3.2.1 The full model
    3.2.2 The max R² criterion
    3.2.3 Stepwise
    3.2.4 An algorithm to perform a stepwise backward elimination of regressors
    3.2.5 AIC
    3.2.6 BIC
    3.2.7 A better output to compare the results
    3.2.8 Some remarks on the AIC and BIC values
    3.2.9 Out of sample forecasting performance (Table 3.5)
  3.3 Explaining Individual Wages (Section 3.6)
    3.3.1 Linear Models (Section 3.6.1)
    3.3.2 Loglinear Models (Section 3.6.2)
    3.3.3 The Effects of Gender (Section 3.6.3)

4 Heteroscedasticity and Autocorrelation
  4.1 Explaining Labour Demand (Section 4.5)
    4.1.1 Linear Model
    4.1.2 Breusch-Pagan test - construction
    4.1.3 Breusch-Pagan test - direct function
    4.1.4 Loglinear model
    4.1.5 White Heteroscedasticity test
    4.1.6 Heteroscedasticity consistent covariance matrix
    4.1.7 Estimated Generalized Least Squares
    4.1.8 Types of Heteroscedasticity consistent covariance matrices
  4.2 The Demand for Ice Cream (Section 4.8)
    4.2.1 The Durbin-Watson statistic - construction
    4.2.2 The Durbin-Watson statistic - direct function
    4.2.3 Estimation of the first-order autocorrelation coefficient
    4.2.4 The Breusch-Godfrey test to test the presence of autocorrelation - construction
    4.2.5 The Breusch-Godfrey test to test the presence of autocorrelation - direct function
    4.2.6 Some remarks on the procedure presented by Verbeek on page 113
    4.2.7 The EGLS (iterative Cochrane-Orcutt) procedure
    4.2.8 The model with the lagged temperature
  4.3 Risk Premia in Foreign Exchange Markets (Section 4.11)
    4.3.1 Tests for Risk Premia in the 1 month Market
    4.3.2 Tests for Risk Premia using Overlapping Samples

5 Endogeneity, Instrumental Variables and GMM
  5.1 Estimating the Returns to Schooling (Section 5.4)
  5.2 Example of an application of the Generalized Method of Moments
  5.3 Estimating Intertemporal Asset Pricing Models (Section 5.7)

6 Maximum Likelihood Estimation and Specification Tests
  6.1 Normal distribution
  6.2 Bernoulli distribution
  6.3 Exponential distribution
  6.4 Poisson distribution
  6.5 Linear model
  6.6 Individual wages (Section 2.5.5)

7 Models with Limited Dependent Variables
  7.1 The Impact of Unemployment Benefits on Recipiency (Section 7.1.6)
    7.1.1 Estimation of the linear probability model
    7.1.2 Estimation of the Logit model
    7.1.3 Estimation of the Probit model
    7.1.4 A unique table for comparing model estimates
    7.1.5 Some additional goodness of fit measures
  7.2 Some remarks on the interpretation of a parameter in a logit model
  7.3 Explaining Firms' Credit Ratings (Section 7.2.1)
  7.4 Willingness to Pay for Natural Areas (Section 7.2.4)
  7.5 Patent and R&D Expenditures (Section 7.3.2)
  7.6 Expenditures on Alcohol and Tobacco (Part 1) (Section 7.4.3)
  7.7 Expenditures on Alcohol and Tobacco (Part 2) (Section 7.5.4)

8 Univariate Time Series Models
  8.1 Some examples of stochastic processes
    8.1.1 The Gaussian White Noise
    8.1.2 The Autoregressive Process
    8.1.3 The Moving Average Process
    8.1.4 Simulation of a realization from an AR(1) process with drift
  8.2 Autocorrelation, Partial autocorrelation functions and ARMA model identification
    8.2.1 Autocorrelation and Partial autocorrelation functions for an AR(1) process with drift
    8.2.2 Autocorrelation and Partial autocorrelation functions for some AR(p) processes with drift
    8.2.3 Autocorrelation and Partial autocorrelation functions for a MA(1) process
    8.2.4 Autocorrelation and Partial autocorrelation functions for some MA(p) processes
    8.2.5 Autocorrelation and Partial autocorrelation functions for an ARMA(1,1) process
    8.2.6 Problems in identifying an ARMA model for a time series
  8.3 On the bias of the OLS estimator of the autoregressive coefficient for an AR(1) process with AR(1) errors
    8.3.1 Some remarks on the use of the function curve
  8.4 Estimation of ARIMA Models with the function arima
    8.4.1 No unit roots in the characteristic equation φp(z) = 0
    8.4.2 1 unit root in the characteristic equation φp+1(z) = 0
    8.4.3 2 unit roots in the characteristic equation φp+2(z) = 0
  8.5 Some other R functions for ARMA model parameter estimation
    8.5.1 The arima function
    8.5.2 The sarima function in the package astsa
    8.5.3 The Arima function in the package forecast
    8.5.4 The armaFit function
    8.5.5 The FitARMA function
    8.5.6 The ar function
    8.5.7 The arima function in the package TSA
  8.6 R functions for predicting with ARMA models
  8.7 Stock Prices and Earnings (Section 8.4.4)
    8.7.1 Dickey-Fuller test - construction
    8.7.2 Dickey-Fuller test - direct function
    8.7.3 How to produce the Dickey-Fuller statistic for different lags
    8.7.4 Other tests for unit roots detection
    8.7.5 Testing for multiple unitary roots
  8.8 Some remarks on the function ur.df
    8.8.1 The Dickey-Fuller test for a unit root, type "none"
    8.8.2 Dickey-Fuller test for a unit root, type "drift"
    8.8.3 Dickey-Fuller test for a unit root, type "trend"
    8.8.4 Example
    8.8.5 Exercise
    8.8.6 Exercise
  8.9 Long-run Purchasing Power Parity (Part 1) (Section 8.5)
  8.10 The Persistence of Inflation (Section 8.8)
    8.10.1 AR estimation
    8.10.2 The Ljung-Box statistic - construction
    8.10.3 The Ljung-Box statistic - direct function
    8.10.4 AR estimation via Maximum Likelihood
    8.10.5 AR(4) estimation
    8.10.6 ARMA estimation
    8.10.7 AR(6) estimation
    8.10.8 Non complete models
  8.11 The Expectations Theory of the Term Structure (Section 8.10)
  8.12 Autoregressive Conditional Heteroscedasticity
    8.12.1 A Brief Presentation of ARCH Processes
    8.12.2 A First Example
  8.13 Volatility in Daily Exchange Rates (Section 8.11.3)

9 Multivariate Time Series Models
  9.1 Spurious Regression (Section 9.2.1)
  9.2 Long-run Purchasing Power Parity (Part 2) (Section 9.3)
  9.3 Long-run Purchasing Power Parity (Part 3) (Section 9.5.4)
  9.4 Money Demand and Inflation (Section 9.6)

10 Models based on panel data
  10.1 Explaining Individual Wages (Section 10.3)
  10.2 Explaining Capital Structure (Section 10.5)

References

A Some useful R functions
  A.1 How to Install R
  A.2 How to Install and Update Packages
  A.3 Data Reading
    A.3.1 zip files
    A.3.2 Reading from a text file
    A.3.3 Reading from a Stata file
    A.3.4 Reading from an EViews file
    A.3.5 Reading from a Microsoft Excel file
  A.4 formula{stats}
  A.5 linear model
  A.6 Deducer

B Addendum 3rd edition
  B.1 Annual Price/Earnings Ratio (Section 8.4.4 third edition)
    B.1.1 Dickey-Fuller test
    B.1.2 Testing for multiple unitary roots
  B.2 Modelling the Price/Earnings Ratio (Section 8.7.5 third edition)
    B.2.1 AR estimation
    B.2.2 The Ljung-Box statistic
    B.2.3 AR estimation via Maximum Likelihood
    B.2.4 MA estimation
    B.2.5 Non complete models
  B.3 Volatility in Daily Exchange Rates (Section 8.10.3 third edition)
  B.4 Long-run Purchasing Power Parity (Part 1) (Section 8.5 third edition)
  B.5 Long-run Purchasing Power Parity (Part 2) (Section 9.3)
  B.6 Long-run Purchasing Power Parity (Part 3) (Section 9.5.4)

PREFACE

These Lecture Notes refer to the examples and illustrations proposed in the book A Guide to Modern Econometrics by Marno Verbeek (4th and 3rd editions).

The source code described here is written in the R language (R Development Core Team 2012); R version 3.0.1 was used.

Subjects are presented in the course Computational Laboratory for Economics, held at Università Cattolica del Sacro Cuore, Graduate Program in Economics. The course runs in parallel with the course Empirical Economics, where the methodological background is assessed.

Care was taken to obtain results first according to their mathematical structure, and then by using appropriate built-in R functions, in both cases aiming at an efficient and elegant programming style.

The reader is assumed to possess a basic knowledge of R. An Introduction to R by Longhow Lam, available at http://www.splusbook.com/RIntro/RCourse.pdf, may represent a good reference.

Chapters 2 to 10 recall the contents of Verbeek's Guide. Appendix A describes how to read data from text, Stata and EViews files, which are the formats used on Verbeek's book website, where the data sets are available. Appendix B contains results for examples which were present in the 3rd edition of Verbeek's Guide.

Some companion materials to these Lecture Notes can be downloaded from the book site www.educatt.it/libri/materiali.

I warmly thank Diego Zappa and Giuseppe Boari for having read parts of the manuscript. I wish to thank Stefano Iacus for his short course on an efficient and advanced use of R, and Achim Zeileis, Giovanni Millo and Yves Croissant for having improved their packages lmtest and plm in order to properly fit some problems presented here.

1 Some Elements of Statistical Inference

    1.1 On the Properties of the Sample Mean

Consider a random variable X with mean E(X) = μ and variance Var(X) = σ². Let (x₁, …, xₙ) be a realization of the n-dimensional random variable (X₁, …, Xₙ), whose components are identically and independently distributed as X. The random variable sample mean

    X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ        (1.1)

has the properties:

    E(X̄) = μ        (1.2)

and

    Var(X̄) = σ²/n.        (1.3)
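Properties (1.2) and (1.3) can be checked by simulation; here is a minimal sketch, with the arbitrary choices μ = 4, σ² = 2 and n = 5:

> set.seed(1)
> xbar <- replicate(10000, mean(rnorm(5, mean = 4, sd = sqrt(2))))
> mean(xbar)   # close to mu = 4
> var(xbar)    # close to sigma^2/n = 2/5 = 0.4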

    1.1.1 The Normal Distribution Case

We consider, as an example giving evidence of the properties of the sample mean, the empirical distribution of the sample means for k = 100 replications of samples of size n = 5 of pseudo-random numbers from a Normal distribution with mean μ = 4 and variance σ² = 2. We recall that, since the sum of normally distributed random variables is a normally distributed random variable, the sample mean X̄ will also be normally distributed. By means of the following code it is possible to create an array x whose elements are the sample means evaluated for the k replications of n pseudo-random numbers from X ∼ N(μ = 4, σ² = 2).

> set.seed(1000)
> k <- 100
> n <- 5
> mean <- 4
> sigma2 <- 2


Table 1.1 Four samples of size 5 from X ∼ N(μ = 4, σ² = 1) and their sample mean

       x1    x2    x3    x4    x5    x̄
  1  3.37  2.29  4.06  4.90  2.89  3.50
  2  3.45  3.33  5.02  3.97  2.06  3.57
  3  2.61  3.22  4.17  3.83  2.11  3.19
  4  4.24  4.22  4.04  1.11  4.30  3.58

> sd <- sqrt(sigma2)
> x <- colMeans(matrix(rnorm(n * k, mean = mean, sd = sd), nrow = n))
> set.seed(1000)
> sampletable <- matrix(rnorm(4 * n, mean = mean, sd = 1), nrow = 4, byrow = TRUE)
> xbar <- rowMeans(sampletable)
> sampletable <- cbind(sampletable, xbar)
> sampletable

Figure 1.1 Distribution of the sample mean from X ∼ N(4, 2); sample size: n = 5, number of replications: k = 100

> set.seed(1000)
> kvals <- c(50, 100, 500, 1000)
> nvals <- c(9, 25, 64, 100)
> X <- NULL
> for (k in kvals) {
      for (n in nvals) {
          set.seed(1000)
          xbar <- colMeans(matrix(rnorm(n * k, mean = 4, sd = 1), nrow = n))
          X <- rbind(X, data.frame(xbar = xbar, k = k, n = n))
      }
  }
> X$k <- factor(X$k)
> X$n <- factor(X$n)
> library(lattice)
> histogram(~xbar | k:n, data = X, breaks = seq(from = min(X$xbar),
      to = max(X$xbar), length = 25), type = "density",
      as.table = TRUE, xlab = paste("n = ", paste(nvals,
      collapse = ", ")), ylab = paste("k = ", paste(rev(kvals),
      collapse = ", ")))


kvals and nvals are arrays containing respectively the values of the variables k and n in the 16 situations depicted in Fig. 1.2.

Figure 1.2 Distribution of the sample mean from X ∼ N(4, 1); n: sample size, k: number of replications

The 16 histograms in Fig. 1.2 were obtained with the function histogram, available in the package lattice. The function histogram is applied to represent the values of the sample means (xbar), classified according to the different levels of the interaction of k and n; see the R help ?lattice::histogram for more information on the function.

    1.1.2 The Central Limit Theorem

We now consider what happens when X, a random variable with E(X) = μ and variance Var(X) = σ², is not Normally distributed.

If X̄ is the sample mean from (x₁, …, xₙ), realization of the n-dimensional random variable (X₁, …, Xₙ), whose components are identically and independently distributed as X, by invoking the central limit theorem we have asymptotically that:

    X̄ ∼ N(μ, σ²/n).        (1.4)

    We remark that in this instance we have not required X to be normally distributed.


To give evidence to the central limit theorem result, let X₁, …, Xₙ be identically and independently distributed as a Uniform(0, 1) or an Exponential(λ = 4) random variable, whose density functions are respectively:

    Y ∼ U(0, 1):    f(y) = 1 for 0 < y < 1, and 0 elsewhere;

    W ∼ Exp(λ):     f(w; λ) = λ e^(−λw) for 0 < w, and 0 elsewhere.
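Figures 1.3 and 1.4 below display the empirical distributions of the corresponding sample means. A minimal sketch of how such figures can be reproduced, assuming the same lattice layout used above for the Normal case (replace runif with rexp(n * k, rate = 4) for the exponential variant):

> set.seed(1000)
> X <- NULL
> for (k in c(50, 100, 500, 1000)) {
      for (n in c(9, 25, 64, 100)) {
          set.seed(1000)
          xbar <- colMeans(matrix(runif(n * k), nrow = n))
          X <- rbind(X, data.frame(xbar = xbar, k = factor(k), n = factor(n)))
      }
  }
> library(lattice)
> histogram(~xbar | k:n, data = X, type = "density", as.table = TRUE)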

Figure 1.3 Distribution of the sample mean from Y ∼ U(0, 1); n: sample size, k: number of replications

Figure 1.4 Distribution of the sample mean from W ∼ Exp(4); n: sample size, k: number of replications

2 An Introduction to Linear Regression

    2.1 Example: Individual wages (2.1.2)

We have first to read the data, available in the file wages1.dat, included in the compressed file ch02.zip.

    2.1.1 Data Reading and summary statistics

The function read.table allows one to read a data set from a file where data have been stored in text format, and create a data.frame, see Appendix A.3. The data set file is assumed to be in tabular form, with one or more spaces or a tab as field separator. The function unzip extracts a file from a compressed archive.

> wages1 <- read.table(unzip("ch02.zip", "wages1.dat"), header = TRUE)
> head(wages1)

    EXPER MALE SCHOOL WAGE

    1 9 0 13 6.315296

    2 12 0 12 5.479770

    3 11 0 11 3.642170

    4 9 0 14 4.593337

    5 8 0 14 2.418157

    6 9 0 14 2.094058


    > tail(wages1)

    EXPER MALE SCHOOL WAGE

    3289 5 1 8 5.512004

    3290 6 1 9 4.287114

    3291 5 1 9 7.145190

    3292 6 1 9 4.538784

    3293 10 1 8 2.909113

    3294 7 1 7 4.153974

The function summary produces some statistics summarizing the columns (variables) of the data frame. The results may be compared with the sample statistics provided by Verbeek in the file wages1.txt.

    > summary(wages1)

    EXPER MALE SCHOOL

    Min. : 1.000 Min. :0.0000 Min. : 3.00

    1st Qu.: 7.000 1st Qu.:0.0000 1st Qu.:11.00

    Median : 8.000 Median :1.0000 Median :12.00

    Mean : 8.043 Mean :0.5237 Mean :11.63

    3rd Qu.: 9.000 3rd Qu.:1.0000 3rd Qu.:12.00

    Max. :18.000 Max. :1.0000 Max. :16.00

    WAGE

    Min. : 0.07656

    1st Qu.: 3.62157

    Median : 5.20578

    Mean : 5.75759

    3rd Qu.: 7.30451

    Max. :39.80892

If you want all the sample statistics provided in the file wages1.txt, you can use the function vsummary, defined by the following code¹:

> vsummary0 <- function(x) c(Obs = sum(!is.na(x)), Mean = mean(x, na.rm = TRUE),
      "Std.Dev." = sd(x, na.rm = TRUE), Min = min(x, na.rm = TRUE),
      Max = max(x, na.rm = TRUE), na = sum(is.na(x)))
> vsummary <- function(x) t(sapply(x, vsummary0))
> vsummary(wages1)

    Obs Mean Std.Dev. Min Max na

    EXPER 3294 8.0434123 2.2906610 1.00000000 18.00000 0

    MALE 3294 0.5236794 0.4995148 0.00000000 1.00000 0

    SCHOOL 3294 11.6305404 1.6575447 3.00000000 16.00000 0

    WAGE 3294 5.7575850 3.2691858 0.07655561 39.80892 0

¹We add the information regarding the possible presence of missing values. The function is.na returns the logical value TRUE if its argument is identified as not available (NA), otherwise FALSE.

Figure 2.1 Box & Whiskers plot of wages for males and females

    2.1.2 Some graphical representations and grouping statistics

Let's compare the wages for males and females. A useful graphical representation is the Box & Whiskers plot, see Fig. 2.1. Recall that the levels of the three lines defining the box correspond respectively to the first, the second and the third quartile of the data (the second quartile is the median). The values placed outside the two whiskers may be considered anomalous with respect to the other data, see Chambers et al. (1983).

We can obtain the graph by having recourse to the function boxplot. The first argument of this function is a formula, see Appendix A.4, establishing that we are studying WAGE as a function (~) of gender (the dummy variable MALE). The second argument is the name of the data.frame containing the involved variables. By means of the third argument we attribute proper names, which will appear on the graph, to the values 0 and 1 assumed by the variable MALE.

    > boxplot(WAGE ~ MALE, data = wages1, names = c("females",

    "males"))

    We can also represent the wage as a function of the years of experience, see Fig. 2.2

Figure 2.2 Scatterplot and Box & Whiskers plot of wages by the number of years of experience

    > layout(1:2)

    > plot(WAGE ~ EXPER, data = wages1)

    > boxplot(WAGE ~ EXPER, data = wages1)

The function plot produces a scatter plot of the involved variables. The function layout(matrix) creates a multifigure environment; the numbers in the matrix (in our instance a column vector) define the pointer sequence specifying the order in which the different graphs will appear.
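For instance, a 2 × 2 grid filled column by column could be set up as follows (a minimal illustration, not used in the text):

> layout(matrix(1:4, nrow = 2))   # regions 1-2 in the first column, 3-4 in the second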

We may desire to produce different graphs, for males and females, representing the wage as a function of the years of experience, see Fig. 2.3. It is preferable to first recode the dummy variable MALE into a categorical one, e.g. gender, that is a factor whose levels are f and m.

> wages1$gender <- factor(wages1$MALE, levels = 0:1, labels = c("f", "m"))
> levels(wages1$gender)

Figure 2.3 Scatterplot and Box & Whiskers plot of wages by gender and the number of years of experience

The formula WAGE ~ EXPER:gender used below produces one box for each of the pairs of levels assumed by the two variables.

    > layout(1:2)

    > boxplot(WAGE ~ EXPER, data = wages1)

    > boxplot(WAGE ~ EXPER:gender, data = wages1)

An easy way to obtain summary results for the variables in the data.frame, separately for females and males, is by means of the instruction by.

The first argument is an array, a data.frame or a matrix, on whose columns the function specified as third argument will be applied. The second argument is a grouping variable, whose length must be equal to the number of rows of the object given as first argument.

We omit from the analysis the categorical variable gender (fifth column in the data.frame wages1).

    > by(wages1[, -5], wages1$MALE, summary)

    wages1$MALE: 0

    EXPER MALE SCHOOL WAGE


    Min. : 1.000 Min. :0 Min. : 5.00 Min. : 0.07656

    1st Qu.: 6.000 1st Qu.:0 1st Qu.:11.00 1st Qu.: 3.17564

    Median : 8.000 Median :0 Median :12.00 Median : 4.69326

    Mean : 7.732 Mean :0 Mean :11.84 Mean : 5.14692

    3rd Qu.: 9.000 3rd Qu.:0 3rd Qu.:13.00 3rd Qu.: 6.53275

    Max. :16.000 Max. :0 Max. :16.00 Max. :32.49740

    ------------------------------------------------

    wages1$MALE: 1

    EXPER MALE SCHOOL WAGE

    Min. : 2.000 Min. :1 Min. : 3.00 Min. : 0.1535

    1st Qu.: 7.000 1st Qu.:1 1st Qu.:10.00 1st Qu.: 4.0290

    Median : 8.000 Median :1 Median :12.00 Median : 5.6543

    Mean : 8.326 Mean :1 Mean :11.44 Mean : 6.3130

    3rd Qu.:10.000 3rd Qu.:1 3rd Qu.:12.00 3rd Qu.: 7.8913

    Max. :18.000 Max. :1 Max. :16.00 Max. :39.8089

    2.1.3 Simple Linear Regression

Let's study, by a linear regression model, how the mean level of the variable WAGE changes as a function of gender: we can regress the variable WAGE on the dummy variable MALE, which assumes value 1 when the subject is male and 0 when she is female. We make use of the function linear model (lm); the first argument is the regression formula, where the ~ symbol separates the dependent variable from the independent one. The intercept is included by default. The data argument specifies the name of the data.frame containing the data.

We are thus studying the model

    WAGE = β₁ + β₂ MALE + ERROR        (2.1)

whose parameter estimates are reported in Verbeek's Table 2.1.

> regr2.1 <- lm(WAGE ~ MALE, data = wages1)
> summary(regr2.1)

    Call:

    lm(formula = WAGE ~ MALE, data = wages1)

    Residuals:

    Min 1Q Median 3Q Max

    -6.160 -2.102 -0.554 1.487 33.496

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

(Intercept)  5.14692    0.08122   63.37   <2e-16 ***
MALE         1.16610    0.11224   10.39   <2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

Residual standard error: 3.217 on 3292 degrees of freedom
Multiple R-squared: 0.03175, Adjusted R-squared: 0.03145
F-statistic: 107.9 on 1 and 3292 DF, p-value: < 2.2e-16

The components stored in the object regr2.1 can be listed with the function names²:

> names(regr2.1)

    [1] "coefficients" "residuals" "effects"

    [4] "rank" "fitted.values" "assign"

    [7] "qr" "df.residual" "xlevels"

    [10] "call" "terms" "model"

Thus the object regr2.1 is a list containing 12 elements. If we want to extract one of its elements, e.g. the coefficients, we may invoke one of the three following commands:

    > regr2.1$coefficients

    (Intercept) MALE

    5.146924 1.166097

    > regr2.1["coefficients"]

    $coefficients

    (Intercept) MALE

    5.146924 1.166097

    > regr2.1[["coefficients"]]

    (Intercept) MALE

    5.146924 1.166097

obtaining respectively a vector, a list and again a vector. Pay attention! The command³

    > regr2.1["coefficients"] %*% c(1,2)

returns an Error, since the result of regr2.1["coefficients"] is a list and not a vector, and cannot be used as an argument of a matrix product. See Chapter 2 of Longhow Lam (2010) for the definition of the Data Objects: list and vector. Remember always to use double square brackets to extract elements in the form of vectors from a list object. The following instructions are correct:

    > regr2.1[["coefficients"]] %*% c(1, 2)

²We omit to report the call and the result of the function str(regr2.1).
³See the help ?Arithmetic for information on arithmetic operators in R: here %*% stands for the matrix product.


    [,1]

    [1,] 7.479118

    > regr2.1$coefficients %*% c(1, 2)

    [,1]

    [1,] 7.479118

Other useful statistics resulting from a regression analysis are available in the object obtained by applying the function summary to the result of lm; so names(regr2.1) and names(summary(regr2.1)) give different information. The result of summary(regr2.1) is itself a list, containing 11 elements.

> output <- summary(regr2.1)
> names(output)

    [1] "call" "terms" "residuals"

    [4] "coefficients" "aliased" "sigma"

    [7] "df" "r.squared" "adj.r.squared"

    [10] "fstatistic" "cov.unscaled"

    > output$fstatistic

    value numdf dendf

    107.9338 1.0000 3292.0000

    2.1.4 Confidence intervals (Section 2.5.2)

To test whether the parameter β₂ is zero, that is, to test the null hypothesis H₀: β₂ = 0, we can construct a confidence interval at level (1 − α).

We have first to recall the coefficient estimates, their standard errors and the degrees of freedom; we must then establish a value for α and determine the corresponding percentage points of the t random variable.

As we have just recalled, regr2.1$coefficients and regr2.1$df extract respectively the coefficients and the degrees of freedom from the object regr2.1.

output$cov.unscaled extracts from the object output the matrix (X′X)⁻¹. The instruction is equivalent to summary(regr2.1)$cov.unscaled, remembering that we have assigned to the object output the result of summary(regr2.1).

Finally, the function diag extracts the main diagonal from a matrix, and by means of qt(p, df) it is possible to obtain the p quantile of a t distribution with df degrees of freedom.

    > regr2.1$coefficients

    (Intercept) MALE

    5.146924 1.166097

> coefse <- output$sigma * sqrt(diag(output$cov.unscaled))
> coefse

    (Intercept) MALE

    0.08122482 0.11224216


    > regr2.1$df

    [1] 3292

> alpha <- 0.05
> qt(1 - alpha/2, regr2.1$df)

    [1] 1.960685

The lower and upper bounds of the confidence interval for the MALE coefficient result respectively:

    > regr2.1$coefficients[2] + c(-1, 1) * qt(1 - alpha/2,

    regr2.1$df) * output$sigma * output$cov.unscaled[2,

    2]^0.5

    [1] 0.946 1.386

The confidence intervals, based on the t distribution, may also be obtained directly for all parameter estimates by using the function confint:

    > confint(regr2.1, level = 1 - alpha)

    2.5 % 97.5 %

    (Intercept) 4.988 5.306

    MALE 0.946 1.386

    2.2 Multiple Linear Regression (Section 2.5.5)

    2.2.1 Parameter estimation

    We want to obtain the parameter estimates of the following linear model:

    WAGE = β₁ + β₂ MALE + β₃ SCHOOL + β₄ EXPER + ERROR        (2.2)

The function lm also allows us to perform a linear regression with several regressors.

As we have already stated, the symbol ~ separates in a formula the dependent variable from the independent ones, and the + symbol, preceding a variable, indicates the presence of that variable in the model. The intercept is included by default. See Appendix A.4.

With the following syntax we declare that we want to study, by making use of a linear model (lm), the relationship between the variable WAGE and the set of independent variables MALE, SCHOOL and EXPER for the data.frame wages1.

> regr2.2 <- lm(WAGE ~ MALE + SCHOOL + EXPER, data = wages1)
> summary(regr2.2)

    Call:

    lm(formula = WAGE ~ MALE + SCHOOL + EXPER, data = wages1)

    Residuals:

    Min 1Q Median 3Q Max

    -7.654 -1.967 -0.457 1.444 34.194


    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) -3.38002 0.46498 -7.269 4.50e-13 ***

    MALE 1.34437 0.10768 12.485 < 2e-16 ***

    SCHOOL 0.63880 0.03280 19.478 < 2e-16 ***

    EXPER 0.12483 0.02376 5.253 1.59e-07 ***

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

    Residual standard error: 3.046 on 3290 degrees of freedom

    Multiple R-squared: 0.1326, Adjusted R-squared: 0.1318

    F-statistic: 167.6 on 3 and 3290 DF, p-value: < 2.2e-16

    2.2.2 ANOVA to compare the two models (Section 2.5.5)

To establish whether the variables SCHOOL and EXPER add a significant joint effect to the variable MALE in explaining the dependent variable WAGE, we can compare the latter model we have estimated, (2.2), with (2.1), by using the function anova, which performs an analysis of variance in presence of nested models, see Verbeek p. 27. The first argument of anova is the object resulting from lm applied to the simpler model; the second argument is the lm object from the estimation of the more complex model.

    > anova(regr2.1, regr2.2)

    Analysis of Variance Table

    Model 1: WAGE ~ MALE

    Model 2: WAGE ~ MALE + SCHOOL + EXPER

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 3292 34077

    2 3290 30528 2 3549 191.24 < 2.2e-16 ***

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
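The F statistic in the table can also be reproduced by hand from the two residual sums of squares; a minimal check, using the function deviance, which returns the RSS of an lm object:

> rss1 <- deviance(regr2.1)   # RSS of the restricted model (2.1)
> rss2 <- deviance(regr2.2)   # RSS of the unrestricted model (2.2)
> Fstat <- ((rss1 - rss2)/2)/(rss2/regr2.2$df.residual)
> Fstat                       # about 191.24, as in the ANOVA table
> 1 - pf(Fstat, 2, regr2.2$df.residual)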

    2.3 CAPM example (Section 2.7)

We can import the data from the data set capm.dat as we did in Section 2.1.

> capm <- read.table(unzip("ch02.zip", "capm.dat"), header = TRUE)


The data set contains stock market data, see the file capm.dat. Data pertaining to the following variables were collected from January 1960 to December 2006.

    foodrf: excess returns food industry

    durblrf: excess returns durables industry

    constrrf: excess returns construction industry

    rmrf: excess returns market portfolio

    rf: risk free return

    jan: dummy for January

    smb: excess return on the Fama-French size (small minus big) factor

    hml: excess return on the Fama-French value (high minus low) factor

    2.3.1 CAPM regressions (without intercept) (Table 2.3)

Verbeek first considers the parameter estimation of the following three linear regression models, where the intercept is not included:

    foodrf = β₁ rmrf + ERROR        (2.3)
    durblrf = β₁ rmrf + ERROR       (2.4)
    constrrf = β₁ rmrf + ERROR      (2.5)

Observe the presence of the element -1 in the following formulae, the first arguments of the calls to lm: it drops the intercept from the list of regressors. See Appendix A.4.

> regr2.3f <- lm(foodrf ~ -1 + rmrf, data = capm)
> regr2.3d <- lm(durblrf ~ -1 + rmrf, data = capm)
> regr2.3c <- lm(constrrf ~ -1 + rmrf, data = capm)
> summary(regr2.3f)

    Call:

    lm(formula = foodrf ~ -1 + rmrf, data = capm)

    Residuals:

    Min 1Q Median 3Q Max

    -13.539 -1.026 0.141 1.745 15.924

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

rmrf 0.75774    0.02579   29.39   <2e-16 ***
---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

    Residual standard error: 2.884 on 609 degrees of freedom

    Multiple R-squared: 0.5864, Adjusted R-squared: 0.5857

    F-statistic: 863.5 on 1 and 609 DF, p-value: < 2.2e-16

    Durables

    > summary(regr2.3d)

    Call:

    lm(formula = durblrf ~ -1 + rmrf, data = capm)

    Residuals:

    Min 1Q Median 3Q Max

    -9.6504 -1.9420 -0.3069 1.7332 17.8871

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

rmrf 1.04736    0.02775   37.74   <2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

Residual standard error: 3.105 on 609 degrees of freedom
Multiple R-squared: 0.7005, Adjusted R-squared: 0.7001
F-statistic: 1424 on 1 and 609 DF, p-value: < 2.2e-16

Construction

> summary(regr2.3c)

    Call:

    lm(formula = constrrf ~ -1 + rmrf, data = capm)

    Residuals:

    Min 1Q Median 3Q Max

    -12.9414 -1.7193 -0.1866 1.4458 11.6551

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

rmrf 1.16662    0.02535   46.01   <2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

Residual standard error: 2.836 on 609 degrees of freedom
Multiple R-squared: 0.7766, Adjusted R-squared: 0.7763
F-statistic: 2117 on 1 and 609 DF, p-value: < 2.2e-16

    How to produce results more appealing to read

The three preceding outputs are useful for interpreting separately the three models we had to estimate, regarding respectively the food, durables and construction industries.

We can present the results in a form that makes the three models easier to compare, by making use of the function mtable, available in the package memisc. The arguments to pass to mtable are the three objects we obtained by applying the function lm to the food, durables and construction industries.

    > library(memisc)

    > mtable(regr2.3f, regr2.3d, regr2.3c)

    Calls:

    regr2.3f: lm(formula = foodrf ~ -1 + rmrf, data = capm)

    regr2.3d: lm(formula = durblrf ~ -1 + rmrf, data = capm)

    regr2.3c: lm(formula = constrrf ~ -1 + rmrf, data = capm)

    =============================================

    regr2.3f regr2.3d regr2.3c

    ---------------------------------------------

    rmrf 0.758*** 1.047*** 1.167***

    (0.026) (0.028) (0.025)

    ---------------------------------------------

    R-squared 0.586 0.700 0.777

    adj. R-squared 0.586 0.700 0.776

    sigma 2.884 3.105 2.836

    F 863.524 1424.100 2117.287

    p 0.000 0.000 0.000

    Log-likelihood -1511.236 -1556.104 -1500.924

    Deviance 5066.744 5869.713 4898.298

    AIC 3026.472 3116.207 3005.847

    BIC 3035.299 3125.034 3014.674

    N 610 610 610

    =============================================


We can change the title and the labels in the preceding table, specify which statistics have to appear in the final part of the table, and also relabel the name of the independent variable rmrf:

> mtable2.3fdc <- mtable(Food = regr2.3f, Durables = regr2.3d,
      Construction = regr2.3c, summary.stats = c("R-squared", "sigma"))
> mtable2.3fdc <- relabel(mtable2.3fdc, rmrf = "excess market return")
> mtable2.3fdc

    Calls:

    Food: lm(formula = foodrf ~ -1 + rmrf, data = capm)

    Durables: lm(formula = durblrf ~ -1 + rmrf, data = capm)

    Construction: lm(formula = constrrf ~ -1 + rmrf, data = capm)

    ============================================================

    Food Durables Construction

    ------------------------------------------------------------

    excess market return 0.758*** 1.047*** 1.167***

    (0.026) (0.028) (0.025)

    ------------------------------------------------------------

    R-squared 0.586 0.700 0.777

    sigma 2.884 3.105 2.836

    ============================================================

Evaluation of the uncentered R²s

According to relationship (2.43) in Verbeek, the uncentered R² is to be evaluated when a linear model has no intercept. The uncentered R²s are automatically produced by R for the three models and appear in the previous output as R-squared (the R software takes into account the information that the models are constrained).

    > 1 - sum(regr2.3f$residuals^2)/sum(capm$foodrf^2)

    [1] 0.5864245

    > 1 - sum(regr2.3d$residuals^2)/sum(capm$durblrf^2)

    [1] 0.7004574

    > 1 - sum(regr2.3c$residuals^2)/sum(capm$constrrf^2)

    [1] 0.7766193


2.3.2 Testing an hypothesis on β₁

To test whether the coefficients β₁ in the linear models (2.3)-(2.5) can be assumed different from 1, we have to evaluate the statistic:

    (β̂₁ − 1) / se(β̂₁).

The estimate of the variance of β̂₁ may be obtained by using the instruction vcov, which returns the covariance matrix of the parameter estimates. The matrix reduces in the present case to a scalar, since we are considering a linear model with only one predictor and without the constant term.

    > vcov(regr2.3f)

    rmrf

    rmrf 0.0006649123

    We can thus evaluate the above statistic for the three situations:

> sampletf <- (coef(regr2.3f) - 1)/sqrt(vcov(regr2.3f))
> sampletd <- (coef(regr2.3d) - 1)/sqrt(vcov(regr2.3d))
> sampletc <- (coef(regr2.3c) - 1)/sqrt(vcov(regr2.3c))
> paste("(Food) statistic: ", round(sampletf, 4),

    " p-value: ", round(2 * (1 - pt(abs(sampletf),

    regr2.3f$df)), 4))

    > paste("(Durables) statistic: ", round(sampletd,

    4), " p-value: ", round(2 * (1 - pt(abs(sampletd),

    regr2.3d$df)), 4))

    > paste("(Construction) statistic: ", round(sampletc,

    4), " p-value: ", round(2 * (1 - pt(abs(sampletc),

    regr2.3c$df)), 4))

    we obtain

    [1] "(Food) statistic: -9.3951 p-value: 0"

    [1] "(Durables) statistic: 1.7065 p-value: 0.0884"

    [1] "(Construction) statistic: 6.5719 p-value: 0"

The function linearHypothesis in the package car directly performs an F test. The first argument is the lm object and the second one specifies the hypothesis to be tested, in matrix or symbolic form (see the help ?car::linearHypothesis). Observe that the values of the F statistic are equal to the squared values of the t statistics obtained above, while the p-values coincide, since the proposed tests are equivalent.

    > library(car)

    > linearHypothesis(regr2.3f, "rmrf=1")


    Linear hypothesis test

    Hypothesis:

    rmrf = 1

    Model 1: restricted model

    Model 2: foodrf ~ -1 + rmrf

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 610 5801.1

    2 609 5066.7 1 734.37 88.268 < 2.2e-16 ***

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

    > linearHypothesis(regr2.3d, "rmrf=1")

    Linear hypothesis test

    Hypothesis:

    rmrf = 1

    Model 1: restricted model

    Model 2: durblrf ~ -1 + rmrf

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 610 5897.8

    2 609 5869.7 1 28.067 2.912 0.08843 .

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

    > linearHypothesis(regr2.3c, "rmrf=1")

    Linear hypothesis test

    Hypothesis:

    rmrf = 1

    Model 1: restricted model

    Model 2: constrrf ~ -1 + rmrf

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 610 5245.7

    2 609 4898.3 1 347.39 43.19 1.068e-10 ***

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
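As a quick numerical check of the equivalence noted above, the squared t statistics computed earlier reproduce the three F statistics:

> sampletf^2   # 88.268, the F statistic of the Food test
> sampletd^2   #  2.912, the F statistic of the Durables test
> sampletc^2   # 43.190, the F statistic of the Construction test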


    2.3.3 CAPM regressions (with intercept) (Table 2.4)

Verbeek then considers the parameter estimation of the following three linear regression models:

    foodrf = β₁ + β₂ rmrf + ERROR        (2.6)
    durblrf = β₁ + β₂ rmrf + ERROR       (2.7)
    constrrf = β₁ + β₂ rmrf + ERROR      (2.8)

> regr2.4f <- lm(foodrf ~ rmrf, data = capm)
> regr2.4d <- lm(durblrf ~ rmrf, data = capm)
> regr2.4c <- lm(constrrf ~ rmrf, data = capm)
> library(memisc)
> mtable2.4 <- mtable(Food = regr2.4f, Durables = regr2.4d,
      Construction = regr2.4c, summary.stats = c("R-squared", "sigma"))
> mtable2.4 <- relabel(mtable2.4, "(Intercept)" = "constant",
      rmrf = "excess market return")
> mtable2.4

    Calls:

    Food: lm(formula = foodrf ~ rmrf, data = capm)

    Durables: lm(formula = durblrf ~ rmrf, data = capm)

    Construction: lm(formula = constrrf ~ rmrf, data = capm)

    ============================================================

    Food Durables Construction

    ------------------------------------------------------------

    constant 0.325** -0.131 -0.073

    (0.117) (0.126) (0.115)

    excess market return 0.751*** 1.050*** 1.168***

    (0.026) (0.028) (0.025)

    ------------------------------------------------------------

    R-squared 0.583 0.700 0.776

    sigma 2.869 3.104 2.837

    ============================================================


2.3.4 CAPM regressions (with intercept and January dummy) (Table 2.5)

The following models are considered to verify the presence of the January effect:

    foodrf = β₁ + β₂ jan + β₃ rmrf + ERROR        (2.9)
    durblrf = β₁ + β₂ jan + β₃ rmrf + ERROR       (2.10)
    constrrf = β₁ + β₂ jan + β₃ rmrf + ERROR      (2.11)

> regr2.5f <- lm(foodrf ~ jan + rmrf, data = capm)
> regr2.5d <- lm(durblrf ~ jan + rmrf, data = capm)
> regr2.5c <- lm(constrrf ~ jan + rmrf, data = capm)
> library(memisc)
> mtable2.5 <- mtable(Food = regr2.5f, Durables = regr2.5d,
      Construction = regr2.5c, summary.stats = c("R-squared", "sigma"))
> mtable2.5 <- relabel(mtable2.5, "(Intercept)" = "constant",
      jan = "January dummy", rmrf = "excess market return")
> mtable2.5

    Calls:

    Food: lm(formula = foodrf ~ jan + rmrf, data = capm)

    Durables: lm(formula = durblrf ~ jan + rmrf, data = capm)

    Construction: lm(formula = constrrf ~ jan + rmrf, data = capm)

    ============================================================

    Food Durables Construction

    ------------------------------------------------------------

    constant 0.397** -0.143 -0.122

    (0.121) (0.132) (0.120)

    January dummy -0.878* 0.139 0.604

    (0.419) (0.455) (0.415)

    excess market return 0.753*** 1.050*** 1.167***

    (0.026) (0.028) (0.025)

    ------------------------------------------------------------

    R-squared 0.586 0.700 0.776

    sigma 2.861 3.107 2.835

    ============================================================

2.4 The World's Largest Hedge Fund (Section 2.7.3)

    Data are available in the file madoff.dat in the zip file ch02.zip.

> madoff <- read.table(unzip("ch02.zip", "madoff.dat"), header = TRUE)


    The following variables are included:

    fsl: return (in %) on Fairfield Sentry

    fslrf: excess returns

    rf: risk free rate

    rmrf: excess return on the market portfolio

    hml: excess return on the Fama-French value (high minus low) factor

    smb: excess return on the Fama-French size (small minus big) factor

Verbeek observes that a simple inspection of the return series produces some suspicious results, which are evident by considering some summary statistics:

the mean and the standard deviation, which can be obtained by using the functions mean and sd

    > mean(madoff$fsl)

    [1] 0.8422326

    > sd(madoff$fsl)

    [1] 0.7086928

the number of months with a negative return, computed by summing up the elements of the logical variable resulting from madoff$fsl < 0

> sum(madoff$fsl < 0)

    [1] 16

and the fraction of months with a negative return over the whole considered period, that is, the ratio between the last result we obtained and the length of the series (number of periods)

    > sum(madoff$fsl < 0)/length(madoff$fsl)

    [1] 0.0744186

A CAPM analysis is then performed, see Verbeek's Table 2.6, by considering the following linear model:

    fslrf = β₁ + β₂ rmrf + ERROR

> regr2.6 <- lm(fslrf ~ rmrf, data = madoff)
> summary(regr2.6)


    Call:

    lm(formula = fslrf ~ rmrf, data = madoff)

    Residuals:

    Min 1Q Median 3Q Max

    -1.34773 -0.48005 -0.08337 0.38865 2.97276

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 0.50495 0.04570 11.049 < 2e-16 ***

    rmrf 0.04089 0.01072 3.813 0.00018 ***

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

    Residual standard error: 0.6658 on 213 degrees of freedom

    Multiple R-squared: 0.06388, Adjusted R-squared: 0.05949

    F-statistic: 14.54 on 1 and 213 DF, p-value: 0.0001801
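The file also contains the Fama-French factors hml and smb, so a three-factor extension can be fitted along the same lines (a sketch, not part of Verbeek's printed results above):

> regr2.6b <- lm(fslrf ~ rmrf + hml + smb, data = madoff)
> summary(regr2.6b)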

2.5 Dummy Variables Treatment and Multicollinearity (Section 2.8.1)

With regard to the data set on Wages in the USA we now consider the parameter estimation of the following three equivalent linear regression models⁴:

WAGE = const + β_M MALE + ERROR                 (2.12)
WAGE = const + β_F FEMALE + ERROR               (2.13)
WAGE = β_M MALE + β_F FEMALE + ERROR            (2.14)

⁴Recall that the parameters in the model

WAGE = const + β_M MALE + β_F FEMALE + ERROR,

where MALE is a dummy variable with values 0 and 1 and FEMALE satisfies FEMALE = 1 − MALE, are not identified, since there is exact collinearity among the constant and the dummy variables MALE and FEMALE; so one of the variables has to be omitted from the model.
In (2.12) the substitution FEMALE = 1 − MALE has been performed, so dropping the variable FEMALE:

WAGE = (const + β_F) + (β_M − β_F) MALE + ERROR.

In (2.13) the substitution MALE = 1 − FEMALE has been performed, so dropping the variable MALE:

WAGE = (const + β_M) + (β_F − β_M) FEMALE + ERROR.

In (2.14) the identity FEMALE + MALE = 1 has been taken into account; it follows that

WAGE = const (MALE + FEMALE) + β_M MALE + β_F FEMALE + ERROR
     = (const + β_M) MALE + (const + β_F) FEMALE + ERROR.

Hence the constant in (2.12) estimates const + β_F, the constant in (2.13) estimates const + β_M, and these two quantities are exactly the coefficients of FEMALE and MALE in (2.14).


    to which correspond the following three model formulae.

    WAGE ~ MALE

    WAGE ~ I(1 - MALE)

    WAGE ~ -1 + MALE + I(1 - MALE)

Remember that the dummy variable MALE assumes value 1 when the statistical unit is male and 0 when she is female; so we can define a new dummy variable FEMALE as 1 - MALE.
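For instance, the explicit definition amounts to the following sketch (the variable name FEMALE is our choice; the fitted values coincide with those of the I(1 - MALE) formulation used below):

> wages1$FEMALE <- 1 - wages1$MALE
> regr2.7B.alt <- lm(WAGE ~ FEMALE, data = wages1)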

To write the formula for the second regression model we have to use WAGE ~ I(1 - MALE), unless we explicitly define the new variable FEMALE as above.

> regr2.7A <- lm(WAGE ~ MALE, data = wages1)
> summary(regr2.7A)

    Call:

    lm(formula = WAGE ~ MALE, data = wages1)

    Residuals:

    Min 1Q Median 3Q Max

    -6.160 -2.102 -0.554 1.487 33.496

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

(Intercept)  5.14692    0.08122   63.37   <2e-16 ***
MALE         1.16610    0.11224   10.39   <2e-16 ***


> regr2.7B <- lm(WAGE ~ I(1 - MALE), data = wages1)
> summary(regr2.7B)

    Call:

    lm(formula = WAGE ~ I(1 - MALE), data = wages1)

    Residuals:

    Min 1Q Median 3Q Max

    -6.160 -2.102 -0.554 1.487 33.496

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

(Intercept)  6.31302    0.07747   81.50   <2e-16 ***
I(1 - MALE) -1.16610    0.11224  -10.39   <2e-16 ***

> regr2.7C <- lm(WAGE ~ -1 + MALE + I(1 - MALE), data = wages1)
> summary(regr2.7C)

Call:
lm(formula = WAGE ~ -1 + MALE + I(1 - MALE), data = wages1)

Residuals:
   Min     1Q Median     3Q    Max
-6.160 -2.102 -0.554  1.487 33.496

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
MALE         6.31302    0.07747   81.50   <2e-16 ***
I(1 - MALE)  5.14692    0.08122   63.37   <2e-16 ***


Presenting the results in a nicer way

As we have already recalled in Section 2.3.1, we can produce an output to compare the results for the three models, like Verbeek's Table 2.7, by having recourse to the function mtable in the package memisc.

    > library(memisc)

> mtable2.7 <- mtable(A = regr2.7A, B = regr2.7B, C = regr2.7C,
    summary.stats = c("R-squared", "sigma"))
> mtable2.7 <- relabel(mtable2.7, "(Intercept)" = "constant",
    MALE = "male", "I(1 - MALE)" = "female")
> mtable2.7

    Calls:

    A: lm(formula = WAGE ~ MALE, data = wages1)

    B: lm(formula = WAGE ~ I(1 - MALE), data = wages1)

    C: lm(formula = WAGE ~ -1 + MALE + I(1 - MALE), data = wages1)

========================================
                 A          B          C
----------------------------------------
constant       5.147***   6.313***
              (0.081)    (0.077)
male           1.166***              6.313***
              (0.112)               (0.077)
female                   -1.166***   5.147***
                         (0.112)    (0.081)
----------------------------------------
R-squared      0.032      0.032      0.764
sigma          3.217      3.217      3.217
========================================
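For inclusion in a LaTeX document, memisc also provides a toLatex method for mtable objects (the exact rendering may depend on the package version):

> toLatex(mtable2.7)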

    2.6 Missing Data, Outliers and Influential Observations

See Section 4.1.8. The Least Absolute Deviation approach to parameter estimation has been implemented in R by Koenker in the package quantreg, see Koenker (2012).
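As a minimal sketch (assuming the wages1 data frame loaded above), a median regression, which is the LAD fit, is obtained with the function rq and tau = 0.5:

> library(quantreg)
> regr.lad <- rq(WAGE ~ MALE, tau = 0.5, data = wages1)
> summary(regr.lad)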

    2.7 How to check the form of the distribution

In statistical analyses it is important to check whether data follow some distribution. For example the classical assumptions on the linear model require that errors are distributed according to a Normal random variable. Thus, after having estimated a linear model, one has to check whether this distributional hypothesis is not rejected for the residuals. The same issue is present in the analysis of time series when e.g. a distributional assumption on white noise is made, see Chapter 8.


In what follows let data be a series with elements x_1, …, x_n. For the sake of simplicity we work with data simulated from a normal distribution with mean equal to 50 and unitary variance.

> set.seed(123)
> data <- rnorm(100, mean = 50, sd = 1)

2.7.1 Data histogram with the theoretical density function

A first check is graphical: we draw the data histogram on the density scale and superimpose the Normal density with mean and standard deviation estimated from the sample, see Fig. 2.4.

> data.hist <- hist(data, freq = FALSE)
> curve(dnorm(x, mean = mean(data), sd = sd(data)),
    add = TRUE)

2.7.2 The χ² goodness-of-fit test

The object data.hist contains all the information necessary to create the histogram. Namely data.hist$breaks gives the limits of the intervals (classes) in the histogram, and data.hist$counts the count corresponding to each class.

    > data.hist$breaks

    [1] 47.5 48.0 48.5 49.0 49.5 50.0 50.5 51.0 51.5 52.0 52.5

    > data.hist$counts

    [1] 1 3 10 11 23 22 13 9 5 3

We can thus build the following table by considering the same classes as the histogram (the lowest and highest bounds of the histogram are replaced with −∞ and +∞ respectively)


[Figure 2.4 about here: histogram of data with superimposed Normal density; y axis: Density]

Figure 2.4 Histogram of data with the theoretical density function under the hypothesis of normality

> data.hist$breaks[1] <- -Inf
> data.hist$breaks[length(data.hist$breaks)] <- Inf
> table <- cbind(inf = data.hist$breaks[-length(data.hist$breaks)],
    sup = data.hist$breaks[-1],
    "observed count" = data.hist$counts,
    "theoretical count" = length(data) *
        diff(pnorm(data.hist$breaks, mean = mean(data), sd = sd(data))))
> table

    inf sup observed count theoretical count

    [1,] -Inf 48.0 1 1.100883

    [2,] 48.0 48.5 3 2.971850

    [3,] 48.5 49.0 10 7.540375

    [4,] 49.0 49.5 11 14.275082

    [5,] 49.5 50.0 23 20.167108

    [6,] 50.0 50.5 22 21.262835

    [7,] 50.5 51.0 13 16.730787

  • 34 An Introduction to Linear Regression

    [8,] 51.0 51.5 9 9.824401

    [9,] 51.5 52.0 5 4.304671

    [10,] 52.0 Inf 3 1.822008

The first two columns of table contain the class bounds z_{j-1} and z_j. The third column contains the observed frequencies and the fourth column the theoretical frequencies under the assumption of normality. These theoretical frequencies are obtained as n p_j, where the probabilities p_j are defined as

p_j = Φ((z_j − x̄)/s) − Φ((z_{j-1} − x̄)/s)

where z_{j-1} and z_j are the class limits, Φ is the standard normal cdf, and x̄ and s² are the sample mean and the sample variance.

For testing the null hypothesis of Normality we can have recourse to the χ² goodness-of-fit test, see Mood, Graybill and Boes (1974), which is based on the statistic

Q_k = Σ_{j=1}^{k+1} (n_j − n p_j)² / (n p_j)

where k + 1 is the number of classes. Q_k is distributed according to a χ²_k random variable with k degrees of freedom. With reference to data we have

> (qstat <- sum((table[, "observed count"] -
    table[, "theoretical count"])^2/table[, "theoretical count"]))
> 1 - pchisq(qstat, nrow(table) - 1)

    [1] 0.9263825

so we will not reject the null hypothesis that the elements of data are distributed according to a Normal random variable.
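The same kind of computation can also be sketched with the built-in function chisq.test, passing the class probabilities obtained from the ±Inf break points set above (here too the degrees of freedom are k; R may warn that some expected counts are small):

> chisq.test(data.hist$counts,
    p = diff(pnorm(data.hist$breaks, mean = mean(data), sd = sd(data))))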

2.7.3 The Kolmogorov-Smirnov test

Let

F_n(x) = #{x_i ≤ x} / n

be the empirical cumulative distribution function (cdf) of data and F_0(x) a theoretical cumulative distribution function, see Fig. 2.5, where the empirical cdf is the step function and the theoretical cdf is the continuous one.
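In R the step function F_n is returned by ecdf as a function object, which can be evaluated at any point:

> Fn <- ecdf(data)
> Fn(50)    # fraction of observations not exceeding 50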

The Kolmogorov-Smirnov statistic to test the null hypothesis X ~ F_0(·), where F_0(·) is some completely specified continuous cumulative distribution function, is

K_n = sup_x |F_n(x) − F_0(x)|.      (2.15)
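A direct computation of K_n on data can be sketched as follows; the supremum is attained at the data points, where both one-sided differences have to be checked since F_n jumps there:

> F0 <- pnorm(sort(data), mean = mean(data), sd = sd(data))
> n <- length(data)
> Dplus <- max((1:n)/n - F0)         # F_n just after each jump
> Dminus <- max(F0 - (0:(n - 1))/n)  # F_n just before each jump
> max(Dplus, Dminus)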


[Figure 2.5 about here: x axis: x; y axis: F_n(x) and F_0(x)]

Figure 2.5 Empirical cumulative distribution function (the step function) and the theoretical distribution function under the null hypothesis of normality

This test can also be used to check whether the observations in two data sets (x_1, …, x_{n_x}) and (y_1, …, y_{n_y}) come from the same distribution; in this case F_0(x) is replaced with the empirical cdf calculated on (y_1, …, y_{n_y}).

The Kolmogorov-Smirnov statistic is based on the maximum absolute distance between the empirical cdf F_n(·) and the theoretical one F_0(·), see Fig. 2.6.

    > plot(ecdf(data), xlim = c(47, 53), cex = 0.5, main = "",

    ylab = expression(F[n](x)~~and~~F[0](x)))

    > curve(pnorm(x, mean = mean(data), sd = sd(data)),

    add = TRUE)

> curve(ecdf(data)(x) - pnorm(x, mean = mean(data),

    sd = sd(data)), n = 10000, xlim = c(47, 53),

    ylim = c(-0.06, 0.06), ylab = "distance")

    > abline(h = 0)


[Figure 2.6 about here: x axis: x; y axis: distance]

Figure 2.6 Distance between the empirical cumulative distribution function and the theoretical distribution function under the null hypothesis of normality

The Kolmogorov-Smirnov test can be performed by having recourse to the function ks.test, whose arguments are: x, the data whose distribution we want to test; y, either a numeric vector of data values (in case one wants to compare y to x), or a character string naming a cdf given by the user or one of the cdfs available in R such as pnorm (only continuous cdfs are valid); ..., additional arguments specifying the parameters of the distribution given (as a character string) by y; alternative, which indicates the alternative hypothesis and must be one of "two.sided" (default), "less", or "greater"; exact, which is NULL by default, can be a logical indicating whether an exact p-value should be computed.
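For instance, on the simulated sample (a sketch; strictly speaking F_0 should be completely specified, while here its mean and standard deviation are estimated from the same data):

> ks.test(data, "pnorm", mean = mean(data), sd = sd(data))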

Relationship (2.15) makes reference to the two.sided alternative hypothesis. By setting the option alternative = "less" the null hypothesis is specified as F_X(·) ≥ F_Y(·), that is X ≤ Y, i.e. X is stochastically smaller than Y; while if we set the option alternative = "greater" the null hypothesis is specified as F_X(·) ≤ F_Y(·), that is X ≥ Y, i.e. X is stochastically greater than Y.

The corresponding Kolmogorov-Smirnov statistics are respectively

D⁻ = sup_x (F_0(x) − F_n(x))   and   D⁺ = sup_x (F_n(x) − F_0(x)).

Figure 2.7 illustrates the quantities involved. Its annotations can be drawn along the following lines, where the window limits xlim, the label offset xtextshift and the marked abscissa point are assumed values chosen to match the figure:

> xlim <- c(47, 50)
> xtextshift <- 0.35
> plot(ecdf(data), xlim = xlim, ylim = c(0, 0.4), cex = 0.5,
    main = "", ylab = expression(F[n](x)~~and~~F[0](x)))
> curve(pnorm(x, mean = mean(data), sd = sd(data)), add = TRUE)
> point <- 48.6
> arrows(point, 0, point, ecdf(data)(point), length = 0.1,
    angle = 22)

    > arrows(point, 0, point, pnorm(point, mean = mean(data),

    sd = sd(data)), length = 0.1, angle = 22)


    > arrows(point, ecdf(data)(point), xlim[1], ecdf(data)(point),

    length = 0.1, angle = 22)

    > arrows(point, pnorm(point, mean = mean(data), sd = sd(data)),

    xlim[1], pnorm(point, mean = mean(data), sd = sd(data)),

    length = 0.1, angle = 22)

    > text(point + 0.05, ecdf(data)(xlim[1]) + 0.01, expression(x[p]))

    > text(xlim[1] + xtextshift, ecdf(data)(point) + 0.01,

    expression(F[n](x[p])))

    > text(xlim[1] + xtextshift, pnorm(point, mean = mean(data),

    sd = sd(data)) + 0.01, expression(F[0](x[p])))

> point <- 49.4
> fpoint <- ecdf(data)(point)
> arrows(xlim[1], fpoint, qnorm(fpoint, mean = mean(data),
    sd = sd(data)), fpoint, length = 0.1, angle = 22)

    > arrows(xlim[1], fpoint, point, fpoint, length = 0.1,

    angle = 22)

    > arrows(point, fpoint, point, ecdf(data)(xlim[1]),

    length = 0.1, angle = 22)

    > arrows(qnorm(fpoint, mean = mean(data), sd = sd(data)),

    fpoint, qnorm(fpoint, mean = mean(data), sd = sd(data)),

    ecdf(data)(xlim[1]), length = 0.1, angle = 22)

    > text(xlim[1] + xtextshift/2, fpoint + 0.01, expression(tilde(p)))

    > text(point - xtextshift, ecdf(data)(xlim[1]) + 0.01,

    expression(x[tilde(p)]))

    > text(qnorm(fpoint, mean = mean(data), sd = sd(data)) +

    xtextshift, ecdf(data)(xlim[1]) + 0.01,

    expression(x[0][tilde(p)]))

2.7.4 The PP-plot and the QQ-plot

We have seen above that the Kolmogorov-Smirnov statistics are defined as a function of the largest absolute, positive or negative difference between the two functions on varying x.

For each x_p in the ordered data set let p = F_n(x_p) be the value assumed by the empirical cdf, see Fig. 2.7: x_p is the p percentage point in the data.

Let now p* = F_0(x_p) be the value assumed by the theoretical cdf in x_p. One way to compare the empirical cdf with the theoretical cdf is to obtain a scatter plot representing the pairs (p*, p). This graphical representation is named the probability-probability plot (PP-plot), see Fig. 2.8.

> p.orders <- ppoints(length(data))
> plot(pnorm(sort(data), mean = mean(data), sd = sd(data)),
    p.orders, pch = 16, xlab = "p* (theoretical probabilities)",
    ylab = "p (sample probabilities)")

    > abline(0, 1)

Observe that if the points lie on the straight line through (0,0) and (1,1) then F_0 could represent the data generating model. The PP-plot is particularly effective in detecting

[Figure 2.7 about here: empirical and theoretical cdfs with the points x_p, F_n(x_p), F_0(x_p), p̃, x_p̃ and x_0p̃ marked]

Figure 2.7 Empirical cumulative distribution function and theoretical cumulative distribution function: introduction to the PP-plot and QQ-plot graphical representations

deviations from F_0 in regions of high probability density (typically in the middle of the distribution), see Section 2.9.

A dual way to compare the empirical cdf with the theoretical cdf is to start from a generic value p̃ assumed by the empirical cdf, see Fig. 2.7. We have two inverse images of p̃: the value x_p̃ whose image through the empirical cdf, F_n(x), is p̃, and the value x_0p̃ which has image p̃ by using the theoretical cdf F_0. The scatter plot of the pairs (x_0p̃, x_p̃) is named Quantile-Quantile plot (QQ-plot), see Fig. 2.9.

    > plot(qnorm(p.orders, mean = mean(data), sd = sd(data)),

    sort(data), pch = 16,

    xlab = expression(x[0][tilde(p)]~~(theoretical~quantiles)),

    ylab = expression(x[tilde(p)]~~(sample~quantiles)))

    > abline(0, 1)

The same graph can be obtained by applying to data the function qqnorm. Also in this case if the points are on a straight line then F_0 could represent the data generating model. The QQ-plot is particularly effective in detecting deviations from F_0 on the tails of the distribution, see Section 2.9.
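For instance (qqnorm uses standard normal quantiles, so the picture is the same up to the scale of the horizontal axis):

> qqnorm(data)
> qqline(data)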


[Figure 2.8 about here: x axis: p* (theoretical probabilities); y axis: p (sample probabilities)]

Figure 2.8 PP-plot

    2.7.5 Use of the function fit.cont

The function fit.cont, available in the package rriskDistributions, gives several goodness-of-fit statistics (loglikelihood, AIC, BIC, Chi-squared, Anderson-Darling and Kolmogorov-Smirnov) to check whether the data follow some theoretical cdf. The Beta, Cauchy, chi-square, non-central chi-square, exponential, F, gamma, Gompertz, hypergeometric, lognormal, logistic, negative binomial, Normal, pert, Poisson, Student's t, truncated normal, triangular, uniform and Weibull models are implemented. A theoretical cdf appears in the output only when the procedure succeeds in estimating its parameters, otherwise a warning message is returned. Have a look at the help system for more information.

    The function fit.cont also produces the histogram with the theoretical density,the QQ-plot, the empirical and theoretical cdfs and the PP-plot, see Fig. 2.10.
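On the simulated sample the call is simply the following (the plots and the goodness-of-fit statistics are collected in the package's interactive window):

> library(rriskDistributions)
> fit.cont(data)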

We observe that other statistical software packages draw the PP- and QQ-plots by switching the x and y axes, so theoretical probabilities and theoretical quantiles will appear on the y axis.


[Figure 2.9 about here: x axis: x_0p̃ (theoretical quantiles); y axis: x_p̃ (sample quantiles)]

Figure 2.9 QQ-plot

    2.8 Two tests for assessing normality

    We consider two tests for assessing the normality distributional assumption.

    2.8.1 The Jarque-Bera test

The Jarque-Bera test, see Jarque and Bera (1987), is obtained as a Lagrange Multiplier statistic, see Verbeek's Chapter 6, and has the following forms:

in case of a sample of n observations (x_1, …, x_n) the Jarque-Bera statistic is defined as:

JB = n [ (√b_1)²/6 + (b_2 − 3)²/24 ]

where:

√b_1 = μ̂_3/μ̂_2^{3/2},   b_2 = μ̂_4/μ̂_2²,   μ̂_j = (1/n) Σ_{i=1}^{n} (x_i − x̄)^j   and   x̄ = (1/n) Σ_{i=1}^{n} x_i.


    Figure 2.10 Fitting a continuous distribution by using the function fit.cont

Observe that √b_1 and b_2 − 3 are respectively the sample skewness and excess kurtosis coefficients, which are null under the normality assumption.


in case of a sample of n OLS residuals (e_1, …, e_n) the Jarque-Bera statistic is defined as:

JB = n [ μ̂_3²/(6 μ̂_2³) + (1/24)(μ̂_4/μ̂_2² − 3)² ] + n [ 3 μ̂_1²/(2 μ̂_2) − μ̂_3 μ̂_1/μ̂_2² ]

where:

μ̂_j = (1/n) Σ_{i=1}^{n} e_i^j.

When the linear model includes a constant the residuals have zero mean, that is μ̂_1 = 0, and the Jarque-Bera statistic reduces to the former definition.

In both cases the Jarque-Bera statistic is distributed as a χ²_2 random variable with 2 degrees of freedom.
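As a check of the first form, the statistic can be computed by hand on the simulated sample data (a sketch, which should reproduce the value returned below by jarque.bera.test):

> n <- length(data)
> m <- function(j) mean((data - mean(data))^j)   # sample central moments
> sk <- m(3)/m(2)^1.5   # sample skewness
> ku <- m(4)/m(2)^2     # sample kurtosis
> n * (sk^2/6 + (ku - 3)^2/24)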

In the package tseries the function jarque.bera.test is available to perform the Jarque-Bera test on a set of observations. By applying it to data we obtain

    > library(tseries)

    > jarque.bera.test(data)

    Jarque Bera Test

    data: data

    X-squared = 0.1691, df = 2, p-value = 0.9189

    and the null hypothesis of normality will not be rejected.

    2.8.2 The Shapiro-Wilk test

The Shapiro-Wilk normality test, see Shapiro and Wilk (1965), is implemented in the function shapiro.test; applying this function to data we obtain

    > shapiro.test(data)

    Shapiro-Wilk normality test

    data: data

    W = 0.9939, p-value = 0.9349

    which does not reject the null hypothesis of normality.

    2.9 Some further comments on the QQ-plot

We now consider the behaviour of the QQ-plot (and of the PP-plot), under the null hypothesis of normality, in the presence of data characterized by skewness and by leptokurtic and platikurtic behaviour.


    2.9.1 Positively skewed distributions

Let X be distributed according to a Gamma distribution. The density function is

f(x; α, λ) = (λ^α/Γ(α)) x^{α−1} e^{−λx} I_{(0,∞)}(x),   α > 0, λ > 0,

and we have E(X) = α/λ and Var(X) = α/λ².

Figure 2.11 shows the density functions and the cdfs of a Gamma random variable, X, with parameters α = 4 and λ = 2 and of a Normal random variable, Y, with mean α/λ = 4/2 = 2 and variance α/λ² = 4/2² = 1.

    > layout(1:2)

    > par(mai = c(0.5, 0.82, 0.1, 0.42))

    > alpha = 4

    > lambda = 2

    > curve(dgamma(x, alpha, lambda), xlim = c(-2, 6),

    ylab = expression(f[X](x)~~and~~f[Y](x)))

    > curve(dnorm(x, mean = alpha/lambda), add = TRUE)

    > text(0.75, 0.4, expression(f[X](x)), cex = 0.75)

    > text(3, 0.35, expression(f[Y](x)), cex = 0.75)

    > curve(pgamma(x, alpha, lambda), xlim = c(-2, 6),

    ylab = expression(F[X](x)~~and~~F[Y](x)))

    > curve(pnorm(x, mean = alpha/lambda), add = TRUE)

    > text(2, 0.75, expression(F[X](x)), cex = 0.75)

    > text(2, 0.35, expression(F[Y](x)), cex = 0.75)

We can establish the behaviour of the PP- and QQ-plots by considering the cumulative distribution functions, as was shown in Section 2.7.4.

    > layout(1:2)

    > par(mai = c(0.9, 0.82, 0.1, 0.42))

> x <- seq(-2, 6, length = 1000)
> plot(pnorm(x, mean = alpha/lambda), pgamma(x, alpha,

    lambda), type = "l", xaxs = "i", yaxs = "i",

    xlab = "theoretical probabilities",

    ylab = "sample probabilities",

    ylim = c(0, 1))

    > abline(0, 1)

> x <- seq(0.001, 0.999, length = 1000)
> plot(qnorm(x, mean = alpha/lambda), qgamma(x, alpha,

    lambda), xlim = c(-2, 6), ylim = c(-2, 6), type = "l",

    xlab = "theoretical quantiles", ylab = "sample quantiles")

    > abline(0, 1)

    > text(-0.75, 1.5, "left tail thinner than the normal tail",

    cex = 0.75)

    > text(3, 5.5, "right tail fatter than the normal tail",

    cex = 0.75)


In this situation the left tail of X is thinner than that of Y while the right tail of X is fatter than that of Y. Thus the quantiles on the tails of the two distributions will have the following behaviour: for any given p (close to 0 or to 1) the quantiles of X are larger than those of Y. The behaviour is evident by examining the QQ-plot.

The PP-plot clearly detects a different behaviour of the two distributions in the middle of the domain.
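The tail behaviour can also be checked numerically on a pair of extreme probabilities (a sketch):

> qgamma(c(0.01, 0.99), alpha, lambda)
> qnorm(c(0.01, 0.99), mean = alpha/lambda)   # smaller than the Gamma quantiles in both tails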

    We now apply the function fit.cont to some simulated data, see Fig. 2.15.

    > set.seed(123)

> skew.data <- rgamma(100, alpha, lambda)
> library(rriskDistributions)

    > fit.cont(skew.data)

    2.9.2 Negatively skewed distributions

Figure 2.12 shows the density functions and the cdfs of X = −W, W being the Gamma random variable with parameters α = 4 and λ = 2 considered in the previous section, and of a Normal random variable, Y, with mean −α/λ = −2 and variance α/λ² = 1.

    > layout(1:2)

    > par(mai = c(0.5, 0.82, 0.1, 0.42))

    > alpha = 4

    > lambda = 2

    > curve(dgamma(-x, alpha, lambda), xlim = c(-6, 2),

    ylab = expression(f[X](x)~~and~~f[Y](x)))

    > curve(dnorm(x, mean = -alpha/lambda), add = TRUE)

    > text(-0.75, 0.4, expression(f[X](x)), cex = 0.75)

    > text(-3, 0.35, expression(f[Y](x)), cex = 0.75)

    > curve(1 - pgamma(-x, alpha, lambda), xlim = c(-6,

    2), ylab = expression(F[X](x)~~and~~F[Y](x)))

    > curve(pnorm(x, mean = -alpha/lambda), add = TRUE)

    > text(-1.75, 0.75, expression(F[X](x)), cex = 0.75)

    > text(-1.75, 0.35, expression(F[Y](x)), cex = 0.75)

    > layout(1:2)

    > par(mai = c(0.9, 0.82, 0.1, 0.42))

> x <- seq(-6, 2, length = 1000)
> plot(pnorm(x, mean = -alpha/lambda), 1 - pgamma(-x,

    alpha, lambda), type = "l", xaxs = "i", yaxs = "i",

    xlab = "theoretical probabilities",

    ylab = "sample probabilities",

    ylim = c(0, 1))

    > abline(0, 1)

> x <- seq(0.001, 0.999, length = 1000)
> plot(qnorm(x, mean = -alpha/lambda), -qgamma(1 -

    x, alpha, lambda), xlim = c(-6, 2), ylim = c(-6,


    2), type = "l", xlab = "theoretical quantiles",

    ylab = "sample quantiles")

    > abline(0, 1)

    > text(-2.5, -5, "left tail fatter than the normal tail",

    cex = 0.75)

    > text(0.75, -1.75, "right tail thinner than the normal tail",

    cex = 0.75)

In this situation the left tail of X is fatter than that of Y while the right tail of X is thinner than that of Y. Thus the quantiles on the tails of the two distributions will have the following behaviour: for any given p (close to 0 or to 1) the quantiles of X are smaller than those of Y. The behaviour is evident by examining the QQ-plot.

As above the PP-plot clearly detects a different behaviour of the two distributions in the middle of the domain.

    We apply the function fit.cont to some simulated data, see Fig. 2.16.

    > set.seed(123)

> skew.data <- -rgamma(100, alpha, lambda)
> library(rriskDistributions)

    > fit.cont(skew.data)

    2.9.3 Leptokurtic distributions

Let X be distributed according to a t_k distribution with k degrees of freedom. We have E(X) = 0 and Var(X) = k/(k − 2).

The t distribution is used in finance since it is able to capture the fatter tails which characterize the residuals distribution.

Figure 2.13 shows the density functions and the cdfs of a t random variable with k = 4 degrees of freedom and of a Normal random variable, Y, with mean 0 and variance k/(k − 2) = 2.

    > layout(1:2)

    > par(mai = c(0.5, 0.82, 0.1, 0.42))

    > k = 4

    > curve(dt(x, k), xlim = c(-8, 8),

    ylab = expression(f[X](x)~~and~~f[Y](x)))

    > curve(dnorm(x, mean = 0, sd = (k/(k - 2))^0.5), add = TRUE)

    > text(0.75, 0.35, expression(t[4]), cex = 0.75)

    > text(0, 0.24, "normal", cex = 0.75)

    > curve(pt(x, k), xlim = c(-8, 8),

    ylab = expression(F[X](x)~~and~~F[Y](x)))

    > curve(pnorm(x, mean = 0, sd = (k/(k - 2))^0.5), add = TRUE)

    > text(0, 0.2, expression(F[X](x)), cex = 0.75)

    > text(1.5, 0.7, expression(F[Y](x)), cex = 0.75)


    > layout(1:2)

    > par(mai = c(0.9, 0.82, 0.1, 0.42))

> x <- seq(-8, 8, length = 1000)
> plot(pnorm(x, mean = 0, sd = (k/(k - 2))^0.5), pt(x,

    k), type = "l", xaxs = "i", yaxs = "i",

    xlab = "theoretical probabilities",

    ylab = "sample probabilities", ylim = c(0, 1))

    > abline(0, 1)

> x <- seq(0.001, 0.999, length = 1000)
> plot(qnorm(x, mean = 0, sd = (k/(k - 2))^0.5), qt(x,

    k), xlim = c(-8, 8), ylim = c(-8, 8), type = "l",

    xlab = "theoretical quantiles", ylab = "sample quantiles")

    > abline(0, 1)

    > text(-1.25, -7.5, "left tail fatter than the normal tail",

    cex = 0.75)

    > text(1, 7.5, "right tail fatter than the normal tail",

    cex = 0.75)

In this situation the tails of X are fatter than those of Y. Thus the quantiles on the tails of the two distributions will have the following behaviour: for any given p close to 0 the quantiles of X are smaller than those of Y; for any given p close to 1 the quantiles of X are larger than those of Y. The behaviour is evident by examining the QQ-plot.

The density functions are now symmetric and thus the PP-plot intersects the (0,0)-(1,1) line at the centre of the distributions; however it can still detect the different behaviour of the two distributions in the middle of their domain.

    We apply the function fit.cont to some simulated data, see Fig. 2.17.

    > set.seed(123)

> leptokurtic.data <- rt(100, k)
> library(rriskDistributions)

    > fit.cont(leptokurtic.data)

    2.9.4 Platikurtic distributions

Let X be distributed according to a uniform distribution on (0, 1). We have E(X) = 0.5 and Var(X) = 1/12.

Figure 2.14 shows the density functions and the cdfs of X and of a Normal random variable, Y, with mean 0.5 and variance 1/12.

    > layout(1:2)

    > par(mai = c(0.5, 0.82, 0.1, 0.42))

    > curve(dunif(x), xlim = c(-1, 2), ylim = c(0, 1.5),

    ylab = expression(f[X](x)~~and~~f[Y](x)))

    > curve(dnorm(x, mean = 0.5, sd = 1/12^0.5), add = TRUE)

    > text(-0.1, 1, expression(f[X](x)), cex = 0.75)

    > text(0.75, 1.25, expression(f[Y](x)), cex = 0.75)

  • 48 An Introduction to Linear Regression

    > curve(punif(x), xlim = c(-1, 2),

    ylab = expression(F[X](x)~~and~~F[Y](x)))

    > curve(pnorm(x, mean = 0.5, sd = 1/12^0.5), add = TRUE)

    > text(0.9, 0.75, expression(F[X](x)), cex = 0.75)

    > text(0.4, 0.2, expression(F[Y](x)), cex = 0.75)

    > layout(1:2)

    > par(mai = c(0.9, 0.82, 0.1, 0.42))

> x <- seq(-1, 2, length = 1000)
> plot(pnorm(x, mean = 0.5, sd = 1/12^0.5), punif(x),

    type = "l", xaxs = "i", yaxs = "i",

    xlab = "theoretical probabilities",

    ylab = "sample probabilities", ylim = c(0, 1))

    > abline(0, 1)

> x <- seq(0.001, 0.999, length = 1000)
> plot(qnorm(x, mean = 0.5, sd = 1/12^0.5), qunif(x),

    xlim = c(-1, 2), ylim = c(-1, 2), type = "l",

    xlab = "theoretical quantiles", ylab = "sample quantiles")

    > abline(0, 1)

    > text(-0.5, 0.5, "left tail thinner than the normal tail",

    cex = 0.75)

    > text(1.5, 0.5, "right tail thinner than the normal tail",

    cex = 0.75)

In this situation the tails of Y are fatter than those of X. Thus the quantiles on the tails of the two distributions will have the following behaviour: for any given p close to 0 the quantiles of X are larger than those of Y; for any given p close to 1 the quantiles of X are smaller than those of Y. The behaviour is evident by examining the QQ-plot.

As above the density functions are symmetric and thus the PP-plot intersects the (0,0)-(1,1) line at the centre of the distributions; however it can detect the different behaviour of the two distributions in the middle of their domain.

    We apply the function fit.cont to some simulated data, see Fig. 2.18.

    > set.seed(123)

> platikurtic.data <- runif(100)
> library(rriskDistributions)

    > fit.cont(platikurtic.data)


[Figure 2.11 about here: density functions f_X(x), f_Y(x) and cdfs F_X(x), F_Y(x) of the Gamma and Normal random variables, followed by the corresponding PP-plot (theoretical vs sample probabilities)]