
COMPUTATIONAL LABORATORY FOR ECONOMICS

GABRIELE CANTALUPPI

Notes for the students

Milano 2013

© 2012-2013 EDUCatt - Ente per il Diritto allo Studio Universitario dell'Università Cattolica
Largo Gemelli 1, 20123 Milano - tel. 02.7234.22.35 - fax 02.80.53.215
e-mail: [email protected] (production); [email protected] (distribution)
web: www.educatt.it/libri

ISBN print edition: 978-88-6780-021-6
ISBN electronic edition: 978-88-6780-022-3

The print edition of this volume was printed in September 2013 at Litografia Solari (Peschiera Borromeo - Milano)

CONTENTS

Preface

1 Some Elements of Statistical Inference
  1.1 On the Properties of the Sample Mean
    1.1.1 The Normal Distribution Case
    1.1.2 The Central Limit Theorem

2 An Introduction to Linear Regression
  2.1 Example: Individual wages (2.1.2)
    2.1.1 Data Reading and summary statistics
    2.1.2 Some graphical representations and grouping statistics
    2.1.3 Simple Linear Regression
    2.1.4 Confidence intervals (Section 2.5.2)
  2.2 Multiple Linear Regression (Section 2.5.5)
    2.2.1 Parameter estimation
    2.2.2 ANOVA to compare the two models (Section 2.5.5)
  2.3 CAPM example (Section 2.7)
    2.3.1 CAPM regressions (without intercept) (Table 2.3)
    2.3.2 Testing an hypothesis on β₁
    2.3.3 CAPM regressions (with intercept) (Table 2.4)
    2.3.4 CAPM regressions (with intercept and January dummy) (Table 2.5)
  2.4 The World's Largest Hedge Fund (Section 2.7.3)
  2.5 Dummy Variables Treatment and Multicollinearity (Section 2.8.1)
  2.6 Missing Data, Outliers and Influential Observations
  2.7 How to check the form of the distribution
    2.7.1 Data histogram with the theoretical density function
    2.7.2 The χ² goodness-of-fit test
    2.7.3 The Kolmogorov-Smirnov test
    2.7.4 The PP-plot and the QQ-plot
    2.7.5 Use of the function fit.cont
  2.8 Two tests for assessing normality
    2.8.1 The Jarque-Bera test
    2.8.2 The Shapiro-Wilk test
  2.9 Some further comments on the QQ-plot
    2.9.1 Positively skewed distributions
    2.9.2 Negatively skewed distributions
    2.9.3 Leptokurtic distributions
    2.9.4 Platykurtic distributions

3 Interpreting and comparing Linear Regression Models
  3.1 Explaining House Prices (Section 3.4)
    3.1.1 Testing the functional form: construction of the RESET test
    3.1.2 Testing the functional form: a direct function to perform the RESET test
    3.1.3 Testing the functional form: the RESET test for the extended model
    3.1.4 Testing the functional form: the interaction term
    3.1.5 Prediction
    3.1.6 Model with price instead of log(price) as dependent variable and lotsize instead of log(lotsize) among the predictors
    3.1.7 The PE test to compare a loglinear specification with the linear specification
  3.2 Selection procedures: Predicting Stock Index Returns (Section 3.5)
    3.2.1 The full model
    3.2.2 The max R² criterion
    3.2.3 Stepwise
    3.2.4 An algorithm to perform a stepwise backward elimination of regressors
    3.2.5 AIC
    3.2.6 BIC
    3.2.7 A better output to compare the results
    3.2.8 Some remarks on the AIC and BIC values
    3.2.9 Out of sample forecasting performance (Table 3.5)
  3.3 Explaining Individual Wages (Section 3.6)
    3.3.1 Linear Models (Section 3.6.1)
    3.3.2 Loglinear Models (Section 3.6.2)
    3.3.3 The Effects of Gender (Section 3.6.3)

4 Heteroscedasticity and Autocorrelation
  4.1 Explaining Labour Demand (Section 4.5)
    4.1.1 Linear Model
    4.1.2 Breusch-Pagan test - construction
    4.1.3 Breusch-Pagan test - direct function
    4.1.4 Loglinear model
    4.1.5 White Heteroscedasticity test
    4.1.6 Heteroscedasticity consistent covariance matrix
    4.1.7 Estimated Generalized Least Squares
    4.1.8 Types of Heteroscedasticity consistent covariance matrices
  4.2 The Demand for Ice Cream (Section 4.8)
    4.2.1 The Durbin-Watson statistic - construction
    4.2.2 The Durbin-Watson statistic - direct function
    4.2.3 Estimation of the first-order autocorrelation coefficient
    4.2.4 The Breusch-Godfrey test to test the presence of autocorrelation - construction
    4.2.5 The Breusch-Godfrey test to test the presence of autocorrelation - direct function
    4.2.6 Some remarks on the procedure presented by Verbeek on page 113
    4.2.7 The EGLS (iterative Cochrane-Orcutt) procedure
    4.2.8 The model with the lagged temperature
  4.3 Risk Premia in Foreign Exchange Markets (Section 4.11)
    4.3.1 Tests for Risk Premia in the 1 month Market
    4.3.2 Tests for Risk Premia using Overlapping Samples

5 Endogeneity, Instrumental Variables and GMM
  5.1 Estimating the Returns to Schooling (Section 5.4)
  5.2 Example of an application of the Generalized Method of Moments
  5.3 Estimating Intertemporal Asset Pricing Models (Section 5.7)

6 Maximum Likelihood Estimation and Specification Tests
  6.1 Normal distribution
  6.2 Bernoulli distribution
  6.3 Exponential distribution
  6.4 Poisson distribution
  6.5 Linear model
  6.6 Individual wages (Section 2.5.5)

7 Models with Limited Dependent Variables
  7.1 The Impact of Unemployment Benefits on Recipiency (Section 7.1.6)
    7.1.1 Estimation of the linear probability model
    7.1.2 Estimation of the Logit model
    7.1.3 Estimation of the Probit model
    7.1.4 A unique table for comparing model estimates
    7.1.5 Some additional goodness of fit measures
  7.2 Some remarks on the interpretation of a parameter in a logit model
  7.3 Explaining Firms' Credit Ratings (Section 7.2.1)
  7.4 Willingness to Pay for Natural Areas (Section 7.2.4)
  7.5 Patent and R&D Expenditures (Section 7.3.2)
  7.6 Expenditures on Alcohol and Tobacco (Part 1) (Section 7.4.3)
  7.7 Expenditures on Alcohol and Tobacco (Part 2) (Section 7.5.4)

8 Univariate Time Series Models
  8.1 Some examples of stochastic processes
    8.1.1 The Gaussian White Noise
    8.1.2 The Autoregressive Process
    8.1.3 The Moving Average Process
    8.1.4 Simulation of a realization from an AR(1) process with drift
  8.2 Autocorrelation, Partial autocorrelation functions and ARMA model identification
    8.2.1 Autocorrelation and Partial autocorrelation functions for an AR(1) process with drift
    8.2.2 Autocorrelation and Partial autocorrelation functions for some AR(p) processes with drift
    8.2.3 Autocorrelation and Partial autocorrelation functions for a MA(1) process
    8.2.4 Autocorrelation and Partial autocorrelation functions for some MA(p) processes
    8.2.5 Autocorrelation and Partial autocorrelation functions for an ARMA(1,1) process
    8.2.6 Problems in identifying an ARMA model for a time series
  8.3 On the bias of the OLS estimator of the autoregressive coefficient for an AR(1) process with AR(1) errors
    8.3.1 Some remarks on the use of the function curve
  8.4 Estimation of ARIMA Models with the function arima
    8.4.1 No unit roots in the characteristic equation φp(z) = 0
    8.4.2 1 unit root in the characteristic equation φp+1(z) = 0
    8.4.3 2 unit roots in the characteristic equation φp+2(z) = 0
  8.5 Some other R functions for ARMA model parameter estimation
    8.5.1 The arima function
    8.5.2 The sarima function in the package astsa
    8.5.3 The Arima function in the package forecast
    8.5.4 The armaFit function
    8.5.5 The FitARMA function
    8.5.6 The ar function
    8.5.7 The arima function in the package TSA
  8.6 R functions for predicting with ARMA models
  8.7 Stock Prices and Earnings (Section 8.4.4)
    8.7.1 Dickey-Fuller test - construction
    8.7.2 Dickey-Fuller test - direct function
    8.7.3 How to produce the Dickey-Fuller statistic for different lags
    8.7.4 Other tests for unit roots detection
    8.7.5 Testing for multiple unitary roots
  8.8 Some remarks on the function ur.df
    8.8.1 The Dickey-Fuller test for a unit root, type "none"
    8.8.2 Dickey-Fuller test for a unit root, type "drift"
    8.8.3 Dickey-Fuller test for a unit root, type "trend"
    8.8.4 Example
    8.8.5 Exercise
    8.8.6 Exercise
  8.9 Long-run Purchasing Power Parity (Part 1) (Section 8.5)
  8.10 The Persistence of Inflation (Section 8.8)
    8.10.1 AR estimation
    8.10.2 The Ljung-Box statistic - construction
    8.10.3 The Ljung-Box statistic - direct function
    8.10.4 AR estimation via Maximum Likelihood
    8.10.5 AR(4) estimation
    8.10.6 ARMA estimation
    8.10.7 AR(6) estimation
    8.10.8 Non complete models
  8.11 The Expectations Theory of the Term Structure (Section 8.10)
  8.12 Autoregressive Conditional Heteroscedasticity
    8.12.1 A Brief Presentation of ARCH Processes
    8.12.2 A First Example
  8.13 Volatility in Daily Exchange Rates (Section 8.11.3)

9 Multivariate Time Series Models
  9.1 Spurious Regression (Section 9.2.1)
  9.2 Long-run Purchasing Power Parity (Part 2) (Section 9.3)
  9.3 Long-run Purchasing Power Parity (Part 3) (Section 9.5.4)
  9.4 Money Demand and Inflation (Section 9.6)

10 Models based on panel data
  10.1 Explaining Individual Wages (Section 10.3)
  10.2 Explaining Capital Structure (Section 10.5)

References

A Some useful R functions
  A.1 How to Install R
  A.2 How to Install and Update Packages
  A.3 Data Reading
    A.3.1 zip files
    A.3.2 Reading from a text file
    A.3.3 Reading from a Stata file
    A.3.4 Reading from an EViews file
    A.3.5 Reading from a Microsoft Excel file
  A.4 formula{stats}
  A.5 linear model
  A.6 Deducer

B Addendum 3rd edition
  B.1 Annual Price/Earnings Ratio (Section 8.4.4 third edition)
    B.1.1 Dickey-Fuller test
    B.1.2 Testing for multiple unitary roots
  B.2 Modelling the Price/Earnings Ratio (Section 8.7.5 third edition)
    B.2.1 AR estimation
    B.2.2 The Ljung-Box statistic
    B.2.3 AR estimation via Maximum Likelihood
    B.2.4 MA estimation
    B.2.5 Non complete models
  B.3 Volatility in Daily Exchange Rates (Section 8.10.3 third edition)
  B.4 Long-run Purchasing Power Parity (Part 1) (Section 8.5 third edition)
  B.5 Long-run Purchasing Power Parity (Part 2) (Section 9.3)
  B.6 Long-run Purchasing Power Parity (Part 3) (Section 9.5.4)

PREFACE

These Lecture Notes refer to the examples and illustrations proposed in the book A Guide to Modern Econometrics by Marno Verbeek (4th and 3rd editions).

The source code described here is written in the R language (R Development Core Team 2012); R version 3.0.1 was used.

Subjects are presented in the course Computational Laboratory for Economics, held at Università Cattolica del Sacro Cuore, Graduate Program in Economics. The course runs in parallel with the course Empirical Economics, where the methodological background is assessed.

Care was taken to obtain results first according to their mathematical structure, and then by using appropriate built-in R functions, in both cases aiming at an efficient and elegant programming style.

The reader is assumed to possess a basic knowledge of R. An Introduction to R by Longhow Lam, available at http://www.splusbook.com/RIntro/RCourse.pdf, may represent a good reference.

Chapters 2 to 10 recall the contents of Verbeek's Guide. Appendix A describes how to read data from text, Stata and EViews files, which are the formats used on Verbeek's book website, where the data sets are available. Appendix B contains results for examples which were present in the 3rd edition of Verbeek's Guide.

Some companion materials to these Lecture Notes can be downloaded from the book site www.educatt.it/libri/materiali.

I warmly thank Diego Zappa and Giuseppe Boari for having read parts of the manuscript. I wish to thank Stefano Iacus for his short course on an efficient and advanced use of R, and Achim Zeileis, Giovanni Millo and Yves Croissant for having improved their packages lmtest and plm in order to properly fit some problems presented here.

1 Some Elements of Statistical Inference

    1.1 On the Properties of the Sample Mean

Consider a random variable X with mean E(X) = μ and variance Var(X) = σ². Let (x₁, …, xₙ) be a realization of the n-dimensional random variable (X₁, …, Xₙ), whose components are identically and independently distributed as X. The random variable sample mean

    X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ        (1.1)

has the properties:

    E(X̄) = μ        (1.2)

and

    Var(X̄) = σ²/n.        (1.3)
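Properties (1.2) and (1.3) can be checked by simulation; here is a minimal sketch, with the arbitrary choices μ = 4, σ² = 2 and n = 5:

> set.seed(1)
> xbar <- replicate(10000, mean(rnorm(5, mean = 4, sd = sqrt(2))))
> mean(xbar)   # close to mu = 4
> var(xbar)    # close to sigma^2/n = 2/5 = 0.4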

    1.1.1 The Normal Distribution Case

We consider, as an example giving evidence of the properties of the sample mean, the empirical distribution of the sample means for k = 100 replications of samples of size n = 5 of pseudo-random numbers from a Normal distribution with mean μ = 4 and variance σ² = 2. We recall that, since the sum of normally distributed random variables is a normally distributed random variable, the sample mean X̄ will also be normally distributed. By means of the following code it is possible to create an array x whose elements are the sample means evaluated for the k replications of n pseudo-random numbers from X ∼ N(μ = 4, σ² = 2).

> set.seed(1000)
> k <- 100
> n <- 5
> mean <- 4
> sigma2 <- 2


Table 1.1 Four samples of size 5 from X ∼ N(μ = 4, σ² = 1) and their sample mean

       x1    x2    x3    x4    x5    x̄
  1  3.37  2.29  4.06  4.90  2.89  3.50
  2  3.45  3.33  5.02  3.97  2.06  3.57
  3  2.61  3.22  4.17  3.83  2.11  3.19
  4  4.24  4.22  4.04  1.11  4.30  3.58

> sd <- sqrt(sigma2)
> x <- colMeans(matrix(rnorm(n * k, mean = mean, sd = sd), nrow = n))
> set.seed(1000)
> sampletable <- matrix(rnorm(4 * n, mean = mean, sd = 1), nrow = 4, byrow = TRUE)
> xbar <- rowMeans(sampletable)
> sampletable <- cbind(sampletable, xbar)
> sampletable

Figure 1.1 Distribution of the sample mean from X ∼ N(4, 2); sample size: n = 5, number of replications: k = 100

> set.seed(1000)
> kvals <- c(50, 100, 500, 1000)
> nvals <- c(9, 25, 64, 100)
> X <- NULL
> for (k in kvals) {
      for (n in nvals) {
          set.seed(1000)
          xbar <- colMeans(matrix(rnorm(n * k, mean = 4, sd = 1), nrow = n))
          X <- rbind(X, data.frame(xbar = xbar, k = k, n = n))
      }
  }
> X$k <- factor(X$k)
> X$n <- factor(X$n)
> library(lattice)
> histogram(~xbar | k:n, data = X, breaks = seq(from = min(X$xbar),
      to = max(X$xbar), length = 25), type = "density",
      as.table = TRUE, xlab = paste("n = ", paste(nvals,
      collapse = ", ")), ylab = paste("k = ", paste(rev(kvals),
      collapse = ", ")))


kvals and nvals are arrays containing respectively the values of the variables k and n in the 16 situations depicted in Fig. 1.2.

Figure 1.2 Distribution of the sample mean from X ∼ N(4, 1); n: sample size, k: number of replications

The 16 histograms in Fig. 1.2 were obtained with the function histogram, available in the package lattice. The function histogram is applied to represent the values of the sample means (xbar), classified according to the different levels of the interaction of k and n; see the R help ?lattice::histogram for more information on the function.

    1.1.2 The Central Limit Theorem

We now consider what happens when X, a random variable with E(X) = μ and variance Var(X) = σ², is not Normally distributed.

If X̄ is the sample mean from (x₁, …, xₙ), realization of the n-dimensional random variable (X₁, …, Xₙ), whose components are identically and independently distributed as X, by invoking the central limit theorem we have asymptotically that:

    X̄ ∼ N(μ, σ²/n).        (1.4)

    We remark that in this instance we have not required X to be normally distributed.


To give evidence to the central limit theorem result, let X₁, …, Xₙ be identically and independently distributed as a Uniform(0, 1) or an Exponential(λ = 4) random variable, whose density functions are respectively:

    Y ∼ U(0, 1):    f(y) = 1 for 0 < y < 1, and 0 elsewhere;

    W ∼ Exp(λ):     f(w; λ) = λ e^(−λw) for 0 < w, and 0 elsewhere.
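Figures 1.3 and 1.4 below display the empirical distributions of the corresponding sample means. A minimal sketch of how such figures can be reproduced, assuming the same lattice layout used above for the Normal case (replace runif with rexp(n * k, rate = 4) for the exponential variant):

> set.seed(1000)
> X <- NULL
> for (k in c(50, 100, 500, 1000)) {
      for (n in c(9, 25, 64, 100)) {
          set.seed(1000)
          xbar <- colMeans(matrix(runif(n * k), nrow = n))
          X <- rbind(X, data.frame(xbar = xbar, k = factor(k), n = factor(n)))
      }
  }
> library(lattice)
> histogram(~xbar | k:n, data = X, type = "density", as.table = TRUE)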

Figure 1.3 Distribution of the sample mean from Y ∼ U(0, 1); n: sample size, k: number of replications

Figure 1.4 Distribution of the sample mean from W ∼ Exp(4); n: sample size, k: number of replications

2 An Introduction to Linear Regression

    2.1 Example: Individual wages (2.1.2)

We have first to read the data, available in the file wages1.dat, included in the compressed file ch02.zip.

    2.1.1 Data Reading and summary statistics

The function read.table allows one to read a data set from a file where data have been stored in text format, and create a data.frame, see Appendix A.3. The data set file is assumed to be in tabular form, with one or more spaces or a tab as field separator. The function unzip extracts a file from a compressed archive.

> wages1 <- read.table(unzip("ch02.zip", "wages1.dat"), header = TRUE)
> head(wages1)

    EXPER MALE SCHOOL WAGE

    1 9 0 13 6.315296

    2 12 0 12 5.479770

    3 11 0 11 3.642170

    4 9 0 14 4.593337

    5 8 0 14 2.418157

    6 9 0 14 2.094058


    > tail(wages1)

    EXPER MALE SCHOOL WAGE

    3289 5 1 8 5.512004

    3290 6 1 9 4.287114

    3291 5 1 9 7.145190

    3292 6 1 9 4.538784

    3293 10 1 8 2.909113

    3294 7 1 7 4.153974

The function summary produces some statistics summarizing the columns (variables) of the data frame. The results may be compared with the sample statistics provided by Verbeek in the file wages1.txt.

    > summary(wages1)

    EXPER MALE SCHOOL

    Min. : 1.000 Min. :0.0000 Min. : 3.00

    1st Qu.: 7.000 1st Qu.:0.0000 1st Qu.:11.00

    Median : 8.000 Median :1.0000 Median :12.00

    Mean : 8.043 Mean :0.5237 Mean :11.63

    3rd Qu.: 9.000 3rd Qu.:1.0000 3rd Qu.:12.00

    Max. :18.000 Max. :1.0000 Max. :16.00

    WAGE

    Min. : 0.07656

    1st Qu.: 3.62157

    Median : 5.20578

    Mean : 5.75759

    3rd Qu.: 7.30451

    Max. :39.80892

If you want all the sample statistics provided in the file wages1.txt, you can use the function vsummary, defined by the following code¹:

> vsummary0 <- function(x) c(Obs = sum(!is.na(x)), Mean = mean(x, na.rm = TRUE),
      "Std.Dev." = sd(x, na.rm = TRUE), Min = min(x, na.rm = TRUE),
      Max = max(x, na.rm = TRUE), na = sum(is.na(x)))
> vsummary <- function(x) t(sapply(x, vsummary0))
> vsummary(wages1)

    Obs Mean Std.Dev. Min Max na

    EXPER 3294 8.0434123 2.2906610 1.00000000 18.00000 0

    MALE 3294 0.5236794 0.4995148 0.00000000 1.00000 0

    SCHOOL 3294 11.6305404 1.6575447 3.00000000 16.00000 0

    WAGE 3294 5.7575850 3.2691858 0.07655561 39.80892 0

¹We add the information regarding the possible presence of missing values. The function is.na returns the logical value TRUE if its argument is identified as not available (NA), otherwise FALSE.

Figure 2.1 Box & Whiskers plot of wages for males and females

    2.1.2 Some graphical representations and grouping statistics

Let's compare the wages for males and females. A useful graphical representation is the Box & Whiskers plot, see Fig. 2.1. Recall that the levels of the three lines defining the box correspond respectively to the first, the second and the third quartile of the data (the second quartile is the median). The values placed outside the two whiskers may be considered anomalous with respect to the other data, see Chambers et al. (1983).

We can obtain the graph by having recourse to the function boxplot. The first argument of this function is a formula, see Appendix A.4, establishing that we are studying WAGE as a function (~) of gender (the dummy variable MALE). The second argument is the name of the data.frame containing the involved variables. By means of the third argument we attribute proper names, which will appear on the graph, to the values 0 and 1 assumed by the variable MALE.

    > boxplot(WAGE ~ MALE, data = wages1, names = c("females",

    "males"))

    We can also represent the wage as a function of the years of experience, see Fig. 2.2

Figure 2.2 Scatterplot and Box & Whiskers plot of wages by the number of years of experience

    > layout(1:2)

    > plot(WAGE ~ EXPER, data = wages1)

    > boxplot(WAGE ~ EXPER, data = wages1)

The function plot produces a scatter plot of the involved variables. The function layout(matrix) creates a multifigure environment; the numbers in the matrix (in our instance a column vector) define the pointer sequence specifying the order in which the different graphs will appear.
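For instance, a 2 × 2 grid filled column by column could be set up as follows (a minimal illustration, not used in the text):

> layout(matrix(1:4, nrow = 2))   # regions 1-2 in the first column, 3-4 in the second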

We may desire to produce different graphs, for males and females, representing the wage as a function of the years of experience, see Fig. 2.3. It is preferable to first recode the dummy variable MALE into a categorical one, e.g. gender, that is a factor whose levels are f and m.

> wages1$gender <- factor(wages1$MALE, levels = 0:1, labels = c("f", "m"))
> levels(wages1$gender)

Figure 2.3 Scatterplot and Box & Whiskers plot of wages by gender and the number of years of experience

The formula WAGE ~ EXPER:gender used below produces one box for each of the pairs of levels assumed by the two variables.

    > layout(1:2)

    > boxplot(WAGE ~ EXPER, data = wages1)

    > boxplot(WAGE ~ EXPER:gender, data = wages1)

An easy way to obtain summary results for the variables in the data.frame, separately for females and males, is by means of the instruction by.

The first argument is an array, a data.frame or a matrix, on whose columns the function specified as third argument will be applied. The second argument is a grouping variable, whose length must be equal to the number of rows of the object given as first argument.

We omit from the analysis the categorical variable gender (fifth column in the data.frame wages1).

    > by(wages1[, -5], wages1$MALE, summary)

    wages1$MALE: 0

    EXPER MALE SCHOOL WAGE


    Min. : 1.000 Min. :0 Min. : 5.00 Min. : 0.07656

    1st Qu.: 6.000 1st Qu.:0 1st Qu.:11.00 1st Qu.: 3.17564

    Median : 8.000 Median :0 Median :12.00 Median : 4.69326

    Mean : 7.732 Mean :0 Mean :11.84 Mean : 5.14692

    3rd Qu.: 9.000 3rd Qu.:0 3rd Qu.:13.00 3rd Qu.: 6.53275

    Max. :16.000 Max. :0 Max. :16.00 Max. :32.49740

    ------------------------------------------------

    wages1$MALE: 1

    EXPER MALE SCHOOL WAGE

    Min. : 2.000 Min. :1 Min. : 3.00 Min. : 0.1535

    1st Qu.: 7.000 1st Qu.:1 1st Qu.:10.00 1st Qu.: 4.0290

    Median : 8.000 Median :1 Median :12.00 Median : 5.6543

    Mean : 8.326 Mean :1 Mean :11.44 Mean : 6.3130

    3rd Qu.:10.000 3rd Qu.:1 3rd Qu.:12.00 3rd Qu.: 7.8913

    Max. :18.000 Max. :1 Max. :16.00 Max. :39.8089

    2.1.3 Simple Linear Regression

Let's study, by a linear regression model, how the mean level of the variable WAGE changes as a function of gender: we can regress the variable WAGE on the dummy variable MALE, which assumes value 1 when the subject is male and 0 when she is female. We make use of the function linear model (lm); the first argument is the regression formula, where the ~ symbol separates the dependent variable from the independent one. The intercept is included by default. The data argument specifies the name of the data.frame containing the data.

We are thus studying the model

    WAGE = β₁ + β₂ MALE + ERROR        (2.1)

whose parameter estimates are reported in Verbeek's Table 2.1.

> regr2.1 <- lm(WAGE ~ MALE, data = wages1)
> summary(regr2.1)

    Call:

    lm(formula = WAGE ~ MALE, data = wages1)

    Residuals:

    Min 1Q Median 3Q Max

    -6.160 -2.102 -0.554 1.487 33.496

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

(Intercept)  5.14692    0.08122   63.37   <2e-16 ***
MALE         1.16610    0.11224   10.39   <2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

Residual standard error: 3.217 on 3292 degrees of freedom
Multiple R-squared: 0.03175, Adjusted R-squared: 0.03145
F-statistic: 107.9 on 1 and 3292 DF, p-value: < 2.2e-16

The components stored in the object regr2.1 can be listed with the function names²:

> names(regr2.1)

    [1] "coefficients" "residuals" "effects"

    [4] "rank" "fitted.values" "assign"

    [7] "qr" "df.residual" "xlevels"

    [10] "call" "terms" "model"

Thus the object regr2.1 is a list containing 12 elements. If we want to extract one of its elements, e.g. the coefficients, we may invoke one of the three following commands:

    > regr2.1$coefficients

    (Intercept) MALE

    5.146924 1.166097

    > regr2.1["coefficients"]

    $coefficients

    (Intercept) MALE

    5.146924 1.166097

    > regr2.1[["coefficients"]]

    (Intercept) MALE

    5.146924 1.166097

obtaining respectively a vector, a list and again a vector. Pay attention! The command³

    > regr2.1["coefficients"] %*% c(1,2)

returns an Error, since the result of regr2.1["coefficients"] is a list and not a vector, and cannot be used as an argument of a matrix product. See Chapter 2 of Longhow Lam (2010) for the definition of the Data Objects: list and vector. Remember always to use double square brackets to extract elements in the form of vectors from a list object. The following instructions are correct:

    > regr2.1[["coefficients"]] %*% c(1, 2)

²We omit to report the call and the result of the function str(regr2.1).
³See the help ?Arithmetic for information on arithmetic operators in R: here %*% stands for the matrix product.


    [,1]

    [1,] 7.479118

    > regr2.1$coefficients %*% c(1, 2)

    [,1]

    [1,] 7.479118

Other useful statistics resulting from a regression analysis are available in the object obtained by applying the function summary to the result of lm; so names(regr2.1) and names(summary(regr2.1)) give different information. The result of summary(regr2.1) is itself a list, containing 11 elements.

> output <- summary(regr2.1)
> names(output)

    [1] "call" "terms" "residuals"

    [4] "coefficients" "aliased" "sigma"

    [7] "df" "r.squared" "adj.r.squared"

    [10] "fstatistic" "cov.unscaled"

    > output$fstatistic

    value numdf dendf

    107.9338 1.0000 3292.0000

    2.1.4 Confidence intervals (Section 2.5.2)

To test whether the parameter β₂ is zero, that is, to test the null hypothesis H₀: β₂ = 0, we can construct a confidence interval at level (1 − α).

We have first to recall the coefficient estimates, their standard errors and the degrees of freedom; we must then establish a value for α and determine the corresponding percentage points of the t random variable.

As we have just recalled, regr2.1$coefficients and regr2.1$df extract respectively the coefficients and the degrees of freedom from the object regr2.1.

output$cov.unscaled extracts from the object output the matrix (X′X)⁻¹. The instruction is equivalent to summary(regr2.1)$cov.unscaled, remembering that we have assigned to the object output the result of summary(regr2.1).

Finally, the function diag extracts the main diagonal from a matrix, and by means of qt(p, df) it is possible to obtain the p quantile of a t distribution with df degrees of freedom.

    > regr2.1$coefficients

    (Intercept) MALE

    5.146924 1.166097

> coefse <- output$sigma * sqrt(diag(output$cov.unscaled))
> coefse

    (Intercept) MALE

    0.08122482 0.11224216


    > regr2.1$df

    [1] 3292

> alpha <- 0.05
> qt(1 - alpha/2, regr2.1$df)

    [1] 1.960685

The lower and upper bounds of the confidence interval for the MALE coefficient result respectively:

    > regr2.1$coefficients[2] + c(-1, 1) * qt(1 - alpha/2,

    regr2.1$df) * output$sigma * output$cov.unscaled[2,

    2]^0.5

    [1] 0.946 1.386

The confidence intervals, based on the t distribution, may also be obtained directly for all parameter estimates by using the function confint:

    > confint(regr2.1, level = 1 - alpha)

    2.5 % 97.5 %

    (Intercept) 4.988 5.306

    MALE 0.946 1.386

    2.2 Multiple Linear Regression (Section 2.5.5)

    2.2.1 Parameter estimation

    We want to obtain the parameter estimates of the following linear model:

    WAGE = β₁ + β₂ MALE + β₃ SCHOOL + β₄ EXPER + ERROR        (2.2)

The function lm also allows us to perform a linear regression with several regressors.

As we have already stated, the symbol ~ separates in a formula the dependent variable from the independent ones, and the + symbol, preceding a variable, indicates the presence of that variable in the model. The intercept is included by default. See Appendix A.4.

With the following syntax we declare that we want to study, by making use of a linear model (lm), the relationship between the variable WAGE and the set of independent variables MALE, SCHOOL and EXPER for the data.frame wages1.

> regr2.2 <- lm(WAGE ~ MALE + SCHOOL + EXPER, data = wages1)
> summary(regr2.2)

    Call:

    lm(formula = WAGE ~ MALE + SCHOOL + EXPER, data = wages1)

    Residuals:

    Min 1Q Median 3Q Max

    -7.654 -1.967 -0.457 1.444 34.194


    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) -3.38002 0.46498 -7.269 4.50e-13 ***

    MALE 1.34437 0.10768 12.485 < 2e-16 ***

    SCHOOL 0.63880 0.03280 19.478 < 2e-16 ***

    EXPER 0.12483 0.02376 5.253 1.59e-07 ***

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

    Residual standard error: 3.046 on 3290 degrees of freedom

    Multiple R-squared: 0.1326, Adjusted R-squared: 0.1318

    F-statistic: 167.6 on 3 and 3290 DF, p-value: < 2.2e-16

    2.2.2 ANOVA to compare the two models (Section 2.5.5)

To establish whether the variables SCHOOL and EXPER add a significant joint effect to the variable MALE in explaining the dependent variable WAGE, we can compare the latter model we have estimated, (2.2), with (2.1), by using the function anova, which performs an analysis of variance in presence of nested models, see Verbeek p. 27. The first argument of anova is the object resulting from lm applied to the simpler model; the second argument is the lm object from the estimation of the more complex model.

    > anova(regr2.1, regr2.2)

    Analysis of Variance Table

    Model 1: WAGE ~ MALE

    Model 2: WAGE ~ MALE + SCHOOL + EXPER

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 3292 34077

    2 3290 30528 2 3549 191.24 < 2.2e-16 ***

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
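The F statistic in the table can also be reproduced by hand from the two residual sums of squares; a minimal check, using the function deviance, which returns the RSS of an lm object:

> rss1 <- deviance(regr2.1)   # RSS of the restricted model (2.1)
> rss2 <- deviance(regr2.2)   # RSS of the unrestricted model (2.2)
> Fstat <- ((rss1 - rss2)/2)/(rss2/regr2.2$df.residual)
> Fstat                       # about 191.24, as in the ANOVA table
> 1 - pf(Fstat, 2, regr2.2$df.residual)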

    2.3 CAPM example (Section 2.7)

We can import the data from the data set capm.dat as we did in Section 2.1.

> capm <- read.table(unzip("ch02.zip", "capm.dat"), header = TRUE)


The data set contains stock market data, see the file capm.dat. Data pertaining to the following variables were collected from January 1960 to December 2006.

    foodrf: excess returns food industry

    durblrf: excess returns durables industry

    constrrf: excess returns construction industry

    rmrf: excess returns market portfolio

    rf: risk free return

    jan: dummy for January

    smb: excess return on the Fama-French size (small minus big) factor

    hml: excess return on the Fama-French value (high minus low) factor

    2.3.1 CAPM regressions (without intercept) (Table 2.3)

Verbeek first considers the parameter estimation of the following three linear regression models, where the intercept is not included:

    foodrf = β₁ rmrf + ERROR        (2.3)
    durblrf = β₁ rmrf + ERROR       (2.4)
    constrrf = β₁ rmrf + ERROR      (2.5)

Observe the presence of the element -1 in the following formulae, the first arguments of the calls to lm: it drops the intercept from the list of regressors. See Appendix A.4.

> regr2.3f <- lm(foodrf ~ -1 + rmrf, data = capm)
> regr2.3d <- lm(durblrf ~ -1 + rmrf, data = capm)
> regr2.3c <- lm(constrrf ~ -1 + rmrf, data = capm)
> summary(regr2.3f)

    Call:

    lm(formula = foodrf ~ -1 + rmrf, data = capm)

    Residuals:

    Min 1Q Median 3Q Max

    -13.539 -1.026 0.141 1.745 15.924

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

rmrf 0.75774    0.02579   29.39   <2e-16 ***
---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

    Residual standard error: 2.884 on 609 degrees of freedom

    Multiple R-squared: 0.5864, Adjusted R-squared: 0.5857

    F-statistic: 863.5 on 1 and 609 DF, p-value: < 2.2e-16

    Durables

    > summary(regr2.3d)

    Call:

    lm(formula = durblrf ~ -1 + rmrf, data = capm)

    Residuals:

    Min 1Q Median 3Q Max

    -9.6504 -1.9420 -0.3069 1.7332 17.8871

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

rmrf 1.04736    0.02775   37.74   <2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

Residual standard error: 3.105 on 609 degrees of freedom
Multiple R-squared: 0.7005, Adjusted R-squared: 0.7001
F-statistic: 1424 on 1 and 609 DF, p-value: < 2.2e-16

Construction

> summary(regr2.3c)

    Call:

    lm(formula = constrrf ~ -1 + rmrf, data = capm)

    Residuals:

    Min 1Q Median 3Q Max

    -12.9414 -1.7193 -0.1866 1.4458 11.6551

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

rmrf 1.16662    0.02535   46.01   <2e-16 ***
---
Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

Residual standard error: 2.836 on 609 degrees of freedom
Multiple R-squared: 0.7766, Adjusted R-squared: 0.7763
F-statistic: 2117 on 1 and 609 DF, p-value: < 2.2e-16

    How to produce results more appealing to read

The three preceding outputs are useful for interpreting separately the three models we had to estimate, regarding respectively the food, durables and construction industries.

We can present the results in a form that makes the three models easier to compare, by making use of the function mtable, available in the package memisc. The arguments to pass to mtable are the three objects we obtained by applying the function lm to the food, durables and construction industries.

    > library(memisc)

    > mtable(regr2.3f, regr2.3d, regr2.3c)

    Calls:

    regr2.3f: lm(formula = foodrf ~ -1 + rmrf, data = capm)

    regr2.3d: lm(formula = durblrf ~ -1 + rmrf, data = capm)

    regr2.3c: lm(formula = constrrf ~ -1 + rmrf, data = capm)

    =============================================

    regr2.3f regr2.3d regr2.3c

    ---------------------------------------------

    rmrf 0.758*** 1.047*** 1.167***

    (0.026) (0.028) (0.025)

    ---------------------------------------------

    R-squared 0.586 0.700 0.777

    adj. R-squared 0.586 0.700 0.776

    sigma 2.884 3.105 2.836

    F 863.524 1424.100 2117.287

    p 0.000 0.000 0.000

    Log-likelihood -1511.236 -1556.104 -1500.924

    Deviance 5066.744 5869.713 4898.298

    AIC 3026.472 3116.207 3005.847

    BIC 3035.299 3125.034 3014.674

    N 610 610 610

    =============================================


We can change the title and the labels in the preceding table, specify which statistics have to appear in the final part of the table, and also relabel the name of the independent variable rmrf:

> mtable2.3fdc <- mtable(Food = regr2.3f, Durables = regr2.3d,
      Construction = regr2.3c, summary.stats = c("R-squared", "sigma"))
> mtable2.3fdc <- relabel(mtable2.3fdc, rmrf = "excess market return")
> mtable2.3fdc

    Calls:

    Food: lm(formula = foodrf ~ -1 + rmrf, data = capm)

    Durables: lm(formula = durblrf ~ -1 + rmrf, data = capm)

    Construction: lm(formula = constrrf ~ -1 + rmrf, data = capm)

    ============================================================

    Food Durables Construction

    ------------------------------------------------------------

    excess market return 0.758*** 1.047*** 1.167***

    (0.026) (0.028) (0.025)

    ------------------------------------------------------------

    R-squared 0.586 0.700 0.777

    sigma 2.884 3.105 2.836

    ============================================================

Evaluation of the uncentered R²s

According to relationship (2.43) in Verbeek, the uncentered R² is to be evaluated when a linear model has no intercept. The uncentered R²s are automatically produced by R for the three models and appear in the previous output as R-squared (the R software takes into account the information that the models are constrained).

    > 1 - sum(regr2.3f$residuals^2)/sum(capm$foodrf^2)

    [1] 0.5864245

    > 1 - sum(regr2.3d$residuals^2)/sum(capm$durblrf^2)

    [1] 0.7004574

    > 1 - sum(regr2.3c$residuals^2)/sum(capm$constrrf^2)

    [1] 0.7766193


2.3.2 Testing an hypothesis on β₁

To test whether the coefficients β₁ in the linear models (2.3)-(2.5) can be assumed different from 1, we have to evaluate the statistic:

    (β̂₁ − 1) / se(β̂₁).

The estimate of the variance of β̂₁ may be obtained by using the instruction vcov, which returns the covariance matrix of the parameter estimates. The matrix reduces in the present case to a scalar, since we are considering a linear model with only one predictor and without the constant term.

    > vcov(regr2.3f)

    rmrf

    rmrf 0.0006649123

    We can thus evaluate the above statistic for the three situations:

> sampletf <- (coef(regr2.3f) - 1)/sqrt(vcov(regr2.3f))
> sampletd <- (coef(regr2.3d) - 1)/sqrt(vcov(regr2.3d))
> sampletc <- (coef(regr2.3c) - 1)/sqrt(vcov(regr2.3c))
> paste("(Food) statistic: ", round(sampletf, 4),

    " p-value: ", round(2 * (1 - pt(abs(sampletf),

    regr2.3f$df)), 4))

    > paste("(Durables) statistic: ", round(sampletd,

    4), " p-value: ", round(2 * (1 - pt(abs(sampletd),

    regr2.3d$df)), 4))

    > paste("(Construction) statistic: ", round(sampletc,

    4), " p-value: ", round(2 * (1 - pt(abs(sampletc),

    regr2.3c$df)), 4))

    we obtain

    [1] "(Food) statistic: -9.3951 p-value: 0"

    [1] "(Durables) statistic: 1.7065 p-value: 0.0884"

    [1] "(Construction) statistic: 6.5719 p-value: 0"

The function linearHypothesis in the package car directly performs an F test. The first argument is the lm object and the second one specifies the hypothesis to be tested, in matrix or symbolic form (see the help ?car::linearHypothesis). Observe that the values of the F statistic are equal to the squared values of the t statistics obtained above, while the p-values coincide, since the proposed tests are equivalent.

    > library(car)

    > linearHypothesis(regr2.3f, "rmrf=1")


    Linear hypothesis test

    Hypothesis:

    rmrf = 1

    Model 1: restricted model

    Model 2: foodrf ~ -1 + rmrf

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 610 5801.1

    2 609 5066.7 1 734.37 88.268 < 2.2e-16 ***

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

    > linearHypothesis(regr2.3d, "rmrf=1")

    Linear hypothesis test

    Hypothesis:

    rmrf = 1

    Model 1: restricted model

    Model 2: durblrf ~ -1 + rmrf

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 610 5897.8

    2 609 5869.7 1 28.067 2.912 0.08843 .

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

    > linearHypothesis(regr2.3c, "rmrf=1")

    Linear hypothesis test

    Hypothesis:

    rmrf = 1

    Model 1: restricted model

    Model 2: constrrf ~ -1 + rmrf

    Res.Df RSS Df Sum of Sq F Pr(>F)

    1 610 5245.7

    2 609 4898.3 1 347.39 43.19 1.068e-10 ***

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1
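As a quick numerical check of the equivalence noted above, the squared t statistics computed earlier reproduce the three F statistics:

> sampletf^2   # 88.268, the F statistic of the Food test
> sampletd^2   #  2.912, the F statistic of the Durables test
> sampletc^2   # 43.190, the F statistic of the Construction test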


    2.3.3 CAPM regressions (with intercept) (Table 2.4)

Verbeek then considers the parameter estimation of the following three linear regression models:

    foodrf = β₁ + β₂ rmrf + ERROR        (2.6)
    durblrf = β₁ + β₂ rmrf + ERROR       (2.7)
    constrrf = β₁ + β₂ rmrf + ERROR      (2.8)

> regr2.4f <- lm(foodrf ~ rmrf, data = capm)
> regr2.4d <- lm(durblrf ~ rmrf, data = capm)
> regr2.4c <- lm(constrrf ~ rmrf, data = capm)
> library(memisc)
> mtable2.4 <- mtable(Food = regr2.4f, Durables = regr2.4d,
      Construction = regr2.4c, summary.stats = c("R-squared", "sigma"))
> mtable2.4 <- relabel(mtable2.4, "(Intercept)" = "constant",
      rmrf = "excess market return")
> mtable2.4

    Calls:

    Food: lm(formula = foodrf ~ rmrf, data = capm)

    Durables: lm(formula = durblrf ~ rmrf, data = capm)

    Construction: lm(formula = constrrf ~ rmrf, data = capm)

    ============================================================

    Food Durables Construction

    ------------------------------------------------------------

    constant 0.325** -0.131 -0.073

    (0.117) (0.126) (0.115)

    excess market return 0.751*** 1.050*** 1.168***

    (0.026) (0.028) (0.025)

    ------------------------------------------------------------

    R-squared 0.583 0.700 0.776

    sigma 2.869 3.104 2.837

    ============================================================


2.3.4 CAPM regressions (with intercept and January dummy) (Table 2.5)

The following models are considered to verify the presence of the January effect:

    foodrf = β₁ + β₂ jan + β₃ rmrf + ERROR        (2.9)
    durblrf = β₁ + β₂ jan + β₃ rmrf + ERROR       (2.10)
    constrrf = β₁ + β₂ jan + β₃ rmrf + ERROR      (2.11)

> regr2.5f <- lm(foodrf ~ jan + rmrf, data = capm)
> regr2.5d <- lm(durblrf ~ jan + rmrf, data = capm)
> regr2.5c <- lm(constrrf ~ jan + rmrf, data = capm)
> library(memisc)
> mtable2.5 <- mtable(Food = regr2.5f, Durables = regr2.5d,
      Construction = regr2.5c, summary.stats = c("R-squared", "sigma"))
> mtable2.5 <- relabel(mtable2.5, "(Intercept)" = "constant",
      jan = "January dummy", rmrf = "excess market return")
> mtable2.5

    Calls:

    Food: lm(formula = foodrf ~ jan + rmrf, data = capm)

    Durables: lm(formula = durblrf ~ jan + rmrf, data = capm)

    Construction: lm(formula = constrrf ~ jan + rmrf, data = capm)

    ============================================================

    Food Durables Construction

    ------------------------------------------------------------

    constant 0.397** -0.143 -0.122

    (0.121) (0.132) (0.120)

    January dummy -0.878* 0.139 0.604

    (0.419) (0.455) (0.415)

    excess market return 0.753*** 1.050*** 1.167***

    (0.026) (0.028) (0.025)

    ------------------------------------------------------------

    R-squared 0.586 0.700 0.776

    sigma 2.861 3.107 2.835

    ============================================================

2.4 The World's Largest Hedge Fund (Section 2.7.3)

    Data are available in the file madoff.dat in the zip file ch02.zip.

> madoff <- read.table(unzip("ch02.zip", "madoff.dat"), header = TRUE)


    The following variables are included:

    fsl: return (in %) on Fairfield Sentry

    fslrf: excess returns

    rf: risk free rate

    rmrf: excess return on the market portfolio

    hml: excess return on the Fama-French value (high minus low) factor

    smb: excess return on the Fama-French size (small minus big) factor

Verbeek observes that a simple inspection of the return series produces some suspicious results, which are evident by considering some summary statistics:

the mean and the standard deviation, which can be obtained by using the functions mean and sd

    > mean(madoff$fsl)

    [1] 0.8422326

    > sd(madoff$fsl)

    [1] 0.7086928

the number of months with a negative return, computed by summing up the elements of the logical variable resulting from madoff$fsl < 0

> sum(madoff$fsl < 0)

    [1] 16

and the fraction of months with a negative return over the whole considered period, that is, the ratio between the last result we obtained and the length of the series (number of periods)

    > sum(madoff$fsl < 0)/length(madoff$fsl)

    [1] 0.0744186

A CAPM analysis is then performed, see Verbeek's Table 2.6, by considering the following linear model:

    fslrf = β₁ + β₂ rmrf + ERROR

> regr2.6 <- lm(fslrf ~ rmrf, data = madoff)
> summary(regr2.6)


    Call:

    lm(formula = fslrf ~ rmrf, data = madoff)

    Residuals:

    Min 1Q Median 3Q Max

    -1.34773 -0.48005 -0.08337 0.38865 2.97276

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 0.50495 0.04570 11.049 < 2e-16 ***

    rmrf 0.04089 0.01072 3.813 0.00018 ***

    ---

    Signif. codes: 0 "***" 0.001 "**" 0.01 "*" 0.05 "." 0.1 " " 1

    Residual standard error: 0.6658 on 213 degrees of freedom

    Multiple R-squared: 0.06388, Adjusted R-squared: 0.05949

    F-statistic: 14.54 on 1 and 213 DF, p-value: 0.0001801
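The file also contains the Fama-French factors hml and smb, so a three-factor extension can be fitted along the same lines (a sketch, not part of Verbeek's printed results above):

> regr2.6b <- lm(fslrf ~ rmrf + hml + smb, data = madoff)
> summary(regr2.6b)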

2.5 Dummy Variables Treatment and Multicollinearity (Section 2.8.1)

With regard to the data set on Wages in the USA we now consider the parameter estimation of the following three equivalent linear regression models⁴:

WAGE = const + β_M MALE + ERROR                 (2.12)
WAGE = const + β_F FEMALE + ERROR               (2.13)
WAGE = β_M MALE + β_F FEMALE + ERROR            (2.14)

⁴Recall that the parameters in the model

WAGE = const + β_M MALE + β_F FEMALE + ERROR,

where MALE is a dummy variable with values 0 and 1 and FEMALE satisfies FEMALE = 1 − MALE, are not identified, since there is exact collinearity among the constant and the dummy variables MALE and FEMALE; so one of the variables has to be omitted from the model.
In (2.12) the substitution FEMALE = 1 − MALE has been performed, so dropping the variable FEMALE:

WAGE = (const + β_F) + (β_M − β_F) MALE + ERROR.

In (2.13) the substitution MALE = 1 − FEMALE has been performed, so dropping the variable MALE:

WAGE = (const + β_M) + (β_F − β_M) FEMALE + ERROR.

In (2.14) the identity FEMALE + MALE = 1 has been taken into account; it follows that

WAGE = const (MALE + FEMALE) + β_M MALE + β_F FEMALE + ERROR
     = (const + β_M) MALE + (const + β_F) FEMALE + ERROR.

Hence the constant in (2.12) estimates const + β_F, the constant in (2.13) estimates const + β_M, and these two quantities are exactly the coefficients of FEMALE and MALE in (2.14).


    to which correspond the following three model formulae.

    WAGE ~ MALE

    WAGE ~ I(1 - MALE)

    WAGE ~ -1 + MALE + I(1 - MALE)

Remember that the dummy variable MALE assumes value 1 when the statistical unit is male and 0 when she is female; so we can define a new dummy variable FEMALE as 1 - MALE.
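For instance, the explicit definition amounts to the following sketch (the variable name FEMALE is our choice; the fitted values coincide with those of the I(1 - MALE) formulation used below):

> wages1$FEMALE <- 1 - wages1$MALE
> regr2.7B.alt <- lm(WAGE ~ FEMALE, data = wages1)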

To write the formula for the second regression model we have to use WAGE ~ I(1 - MALE), unless we explicitly define the new variable FEMALE as above.

> regr2.7A <- lm(WAGE ~ MALE, data = wages1)
> summary(regr2.7A)

    Call:

    lm(formula = WAGE ~ MALE, data = wages1)

    Residuals:

    Min 1Q Median 3Q Max

    -6.160 -2.102 -0.554 1.487 33.496

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

(Intercept)  5.14692    0.08122   63.37   <2e-16 ***
MALE         1.16610    0.11224   10.39   <2e-16 ***


> regr2.7B <- lm(WAGE ~ I(1 - MALE), data = wages1)
> summary(regr2.7B)

    Call:

    lm(formula = WAGE ~ I(1 - MALE), data = wages1)

    Residuals:

    Min 1Q Median 3Q Max

    -6.160 -2.102 -0.554 1.487 33.496

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

(Intercept)  6.31302    0.07747   81.50   <2e-16 ***
I(1 - MALE) -1.16610    0.11224  -10.39   <2e-16 ***

> regr2.7C <- lm(WAGE ~ -1 + MALE + I(1 - MALE), data = wages1)
> summary(regr2.7C)

Call:
lm(formula = WAGE ~ -1 + MALE + I(1 - MALE), data = wages1)

Residuals:
   Min     1Q Median     3Q    Max
-6.160 -2.102 -0.554  1.487 33.496

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
MALE         6.31302    0.07747   81.50   <2e-16 ***
I(1 - MALE)  5.14692    0.08122   63.37   <2e-16 ***


Presenting the results in a nicer way

As we have already recalled in Section 2.3.1, we can produce an output to compare the results for the three models, like Verbeek's Table 2.7, by having recourse to the function mtable in the package memisc.

    > library(memisc)

> mtable2.7 <- mtable(A = regr2.7A, B = regr2.7B, C = regr2.7C,
    summary.stats = c("R-squared", "sigma"))
> mtable2.7 <- relabel(mtable2.7, "(Intercept)" = "constant",
    MALE = "male", "I(1 - MALE)" = "female")
> mtable2.7

    Calls:

    A: lm(formula = WAGE ~ MALE, data = wages1)

    B: lm(formula = WAGE ~ I(1 - MALE), data = wages1)

    C: lm(formula = WAGE ~ -1 + MALE + I(1 - MALE), data = wages1)

========================================
                 A          B          C
----------------------------------------
constant       5.147***   6.313***
              (0.081)    (0.077)
male           1.166***              6.313***
              (0.112)               (0.077)
female                   -1.166***   5.147***
                         (0.112)    (0.081)
----------------------------------------
R-squared      0.032      0.032      0.764
sigma          3.217      3.217      3.217
========================================
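For inclusion in a LaTeX document, memisc also provides a toLatex method for mtable objects (the exact rendering may depend on the package version):

> toLatex(mtable2.7)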

    2.6 Missing Data, Outliers and Influential Observations

See Section 4.1.8. The Least Absolute Deviation approach to parameter estimation has been implemented in R by Koenker in the package quantreg, see Koenker (2012).
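As a minimal sketch (assuming the wages1 data frame loaded above), a median regression, which is the LAD fit, is obtained with the function rq and tau = 0.5:

> library(quantreg)
> regr.lad <- rq(WAGE ~ MALE, tau = 0.5, data = wages1)
> summary(regr.lad)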

    2.7 How to check the form of the distribution

In statistical analyses it is important to check whether data follow some distribution. For example the classical assumptions on the linear model require that errors are distributed according to a Normal random variable. Thus, after having estimated a linear model, one has to check whether this distributional hypothesis is not rejected for the residuals. The same issue is present in the analysis of time series when e.g. a distributional assumption on white noise is made, see Chapter 8.


In what follows let data be a series with elements x_1, …, x_n. For the sake of simplicity we work with data simulated from a normal distribution with mean equal to 50 and unitary variance.

> set.seed(123)
> data <- rnorm(100, mean = 50, sd = 1)

2.7.1 Data histogram with the theoretical density function

A first check is graphical: we draw the data histogram on the density scale and superimpose the Normal density with mean and standard deviation estimated from the sample, see Fig. 2.4.

> data.hist <- hist(data, freq = FALSE)
> curve(dnorm(x, mean = mean(data), sd = sd(data)),
    add = TRUE)

2.7.2 The χ² goodness-of-fit test

The object data.hist contains all the information necessary to create the histogram. Namely data.hist$breaks gives the limits of the intervals (classes) in the histogram, and data.hist$counts the count corresponding to each class.

    > data.hist$breaks

    [1] 47.5 48.0 48.5 49.0 49.5 50.0 50.5 51.0 51.5 52.0 52.5

    > data.hist$counts

    [1] 1 3 10 11 23 22 13 9 5 3

We can thus build the following table by considering the same classes as the histogram (the lowest and highest bounds of the histogram are replaced with −∞ and +∞ respectively)


[Figure 2.4 about here: histogram of data with superimposed Normal density; y axis: Density]

Figure 2.4 Histogram of data with the theoretical density function under the hypothesis of normality

> data.hist$breaks[1] <- -Inf
> data.hist$breaks[length(data.hist$breaks)] <- Inf
> table <- cbind(inf = data.hist$breaks[-length(data.hist$breaks)],
    sup = data.hist$breaks[-1],
    "observed count" = data.hist$counts,
    "theoretical count" = length(data) *
        diff(pnorm(data.hist$breaks, mean = mean(data), sd = sd(data))))
> table

    inf sup observed count theoretical count

    [1,] -Inf 48.0 1 1.100883

    [2,] 48.0 48.5 3 2.971850

    [3,] 48.5 49.0 10 7.540375

    [4,] 49.0 49.5 11 14.275082

    [5,] 49.5 50.0 23 20.167108

    [6,] 50.0 50.5 22 21.262835

    [7,] 50.5 51.0 13 16.730787

  • 34 An Introduction to Linear Regression

    [8,] 51.0 51.5 9 9.824401

    [9,] 51.5 52.0 5 4.304671

    [10,] 52.0 Inf 3 1.822008

The first two columns of table contain the class bounds z_{j-1} and z_j. The third column contains the observed frequencies and the fourth column the theoretical frequencies under the assumption of normality. These theoretical frequencies are obtained as n p_j, where the probabilities p_j are defined as

p_j = Φ((z_j − x̄)/s) − Φ((z_{j-1} − x̄)/s)

where z_{j-1} and z_j are the class limits, Φ is the standard normal cdf, and x̄ and s² are the sample mean and the sample variance.

For testing the null hypothesis of Normality we can have recourse to the χ² goodness-of-fit test, see Mood, Graybill and Boes (1974), which is based on the statistic

Q_k = Σ_{j=1}^{k+1} (n_j − n p_j)² / (n p_j)

where k + 1 is the number of classes. Q_k is distributed according to a χ²_k random variable with k degrees of freedom. With reference to data we have

> (qstat <- sum((table[, "observed count"] -
    table[, "theoretical count"])^2/table[, "theoretical count"]))
> 1 - pchisq(qstat, nrow(table) - 1)

    [1] 0.9263825

so we will not reject the null hypothesis that the elements of data are distributed according to a Normal random variable.
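The same kind of computation can also be sketched with the built-in function chisq.test, passing the class probabilities obtained from the ±Inf break points set above (here too the degrees of freedom are k; R may warn that some expected counts are small):

> chisq.test(data.hist$counts,
    p = diff(pnorm(data.hist$breaks, mean = mean(data), sd = sd(data))))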

2.7.3 The Kolmogorov-Smirnov test

Let

F_n(x) = #{x_i ≤ x} / n

be the empirical cumulative distribution function (cdf) of data and F_0(x) a theoretical cumulative distribution function, see Fig. 2.5, where the empirical cdf is the step function and the theoretical cdf is the continuous one.
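In R the step function F_n is returned by ecdf as a function object, which can be evaluated at any point:

> Fn <- ecdf(data)
> Fn(50)    # fraction of observations not exceeding 50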

The Kolmogorov-Smirnov statistic to test the null hypothesis X ~ F_0(·), where F_0(·) is some completely specified continuous cumulative distribution function, is

K_n = sup_x |F_n(x) − F_0(x)|.      (2.15)
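A direct computation of K_n on data can be sketched as follows; the supremum is attained at the data points, where both one-sided differences have to be checked since F_n jumps there:

> F0 <- pnorm(sort(data), mean = mean(data), sd = sd(data))
> n <- length(data)
> Dplus <- max((1:n)/n - F0)         # F_n just after each jump
> Dminus <- max(F0 - (0:(n - 1))/n)  # F_n just before each jump
> max(Dplus, Dminus)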


[Figure 2.5 about here: x axis: x; y axis: F_n(x) and F_0(x)]

Figure 2.5 Empirical cumulative distribution function (the step function) and the theoretical distribution function under the null hypothesis of normality

This test can also be used to check whether the observations in two data sets (x_1, …, x_{n_x}) and (y_1, …, y_{n_y}) come from the same distribution; in this case F_0(x) is replaced with the empirical cdf calculated on (y_1, …, y_{n_y}).

The Kolmogorov-Smirnov statistic is based on the maximum absolute distance between the empirical cdf F_n(·) and the theoretical one F_0(·), see Fig. 2.6.

    > plot(ecdf(data), xlim = c(47, 53), cex = 0.5, main = "",

    ylab = expression(F[n](x)~~and~~F[0](x)))

    > curve(pnorm(x, mean = mean(data), sd = sd(data)),

    add = TRUE)

> curve(ecdf(data)(x) - pnorm(x, mean = mean(data),

    sd = sd(data)), n = 10000, xlim = c(47, 53),

    ylim = c(-0.06, 0.06), ylab = "distance")

    > abline(h = 0)


[Figure 2.6 about here: x axis: x; y axis: distance]

Figure 2.6 Distance between the empirical cumulative distribution function and the theoretical distribution function under the null hypothesis of normality

The Kolmogorov-Smirnov test can be performed by having recourse to the function ks.test, whose arguments are: x, the data whose distribution we want to test; y, either a numeric vector of data values (in case one wants to compare y to x), or a character string naming a cdf given by the user or one of the cdfs available in R such as pnorm (only continuous cdfs are valid); ..., additional arguments specifying the parameters of the distribution given (as a character string) by y; alternative, which indicates the alternative hypothesis and must be one of "two.sided" (default), "less", or "greater"; exact, which is NULL by default, can be a logical indicating whether an exact p-value should be computed.
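For instance, on the simulated sample (a sketch; strictly speaking F_0 should be completely specified, while here its mean and standard deviation are estimated from the same data):

> ks.test(data, "pnorm", mean = mean(data), sd = sd(data))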

Relationship (2.15) makes reference to the two.sided alternative hypothesis. By setting the option alternative = "less" the null hypothesis is specified as F_X(·) ≥ F_Y(·), that is X ≤ Y, i.e. X is stochastically smaller than Y; while if we set the option alternative = "greater" the null hypothesis is specified as F_X(·) ≤ F_Y(·), that is X ≥ Y, i.e. X is stochastically greater than Y.

The corresponding Kolmogorov-Smirnov statistics are respectively

D⁻ = sup_x (F_0(x) − F_n(x))   and   D⁺ = sup_x (F_n(x) − F_0(x)).

Figure 2.7 illustrates the quantities involved. Its annotations can be drawn along the following lines, where the window limits xlim, the label offset xtextshift and the marked abscissa point are assumed values chosen to match the figure:

> xlim <- c(47, 50)
> xtextshift <- 0.35
> plot(ecdf(data), xlim = xlim, ylim = c(0, 0.4), cex = 0.5,
    main = "", ylab = expression(F[n](x)~~and~~F[0](x)))
> curve(pnorm(x, mean = mean(data), sd = sd(data)), add = TRUE)
> point <- 48.6
> arrows(point, 0, point, ecdf(data)(point), length = 0.1,
    angle = 22)

    > arrows(point, 0, point, pnorm(point, mean = mean(data),

    sd = sd(data)), length = 0.1, angle = 22)


    > arrows(point, ecdf(data)(point), xlim[1], ecdf(data)(point),

    length = 0.1, angle = 22)

    > arrows(point, pnorm(point, mean = mean(data), sd = sd(data)),

    xlim[1], pnorm(point, mean = mean(data), sd = sd(data)),

    length = 0.1, angle = 22)

    > text(point + 0.05, ecdf(data)(xlim[1]) + 0.01, expression(x[p]))

    > text(xlim[1] + xtextshift, ecdf(data)(point) + 0.01,

    expression(F[n](x[p])))

    > text(xlim[1] + xtextshift, pnorm(point, mean = mean(data),

    sd = sd(data)) + 0.01, expression(F[0](x[p])))

> point <- 49.4
> fpoint <- ecdf(data)(point)
> arrows(xlim[1], fpoint, qnorm(fpoint, mean = mean(data),
    sd = sd(data)), fpoint, length = 0.1, angle = 22)

    > arrows(xlim[1], fpoint, point, fpoint, length = 0.1,

    angle = 22)

    > arrows(point, fpoint, point, ecdf(data)(xlim[1]),

    length = 0.1, angle = 22)

    > arrows(qnorm(fpoint, mean = mean(data), sd = sd(data)),

    fpoint, qnorm(fpoint, mean = mean(data), sd = sd(data)),

    ecdf(data)(xlim[1]), length = 0.1, angle = 22)

    > text(xlim[1] + xtextshift/2, fpoint + 0.01, expression(tilde(p)))

    > text(point - xtextshift, ecdf(data)(xlim[1]) + 0.01,

    expression(x[tilde(p)]))

    > text(qnorm(fpoint, mean = mean(data), sd = sd(data)) +

    xtextshift, ecdf(data)(xlim[1]) + 0.01,

    expression(x[0][tilde(p)]))

2.7.4 The PP-plot and the QQ-plot

We have seen above that the Kolmogorov-Smirnov statistics are defined as a function of the largest absolute, positive or negative difference between the two functions on varying x.

For each x_p in the ordered data set let p = F_n(x_p) be the value assumed by the empirical cdf, see Fig. 2.7: x_p is the p percentage point in the data.

Let now p* = F_0(x_p) be the value assumed by the theoretical cdf in x_p. One way to compare the empirical cdf with the theoretical cdf is to obtain a scatter plot representing the pairs (p*, p). This graphical representation is named the probability-probability plot (PP-plot), see Fig. 2.8.

> p.orders <- ppoints(length(data))
> plot(pnorm(sort(data), mean = mean(data), sd = sd(data)),
    p.orders, pch = 16, xlab = "p* (theoretical probabilities)",
    ylab = "p (sample probabilities)")

    > abline(0, 1)

Observe that if the points lie on the straight line through (0,0) and (1,1) then F_0 could represent the data generating model. The PP-plot is particularly effective in detecting

[Figure 2.7 about here: empirical and theoretical cdfs with the points x_p, F_n(x_p), F_0(x_p), p̃, x_p̃ and x_0p̃ marked]

Figure 2.7 Empirical cumulative distribution function and theoretical cumulative distribution function: introduction to the PP-plot and QQ-plot graphical representations

deviations from F_0 in regions of high probability density (typically in the middle of the distribution), see Section 2.9.

A dual way to compare the empirical cdf with the theoretical cdf is to start from a generic value p̃ assumed by the empirical cdf, see Fig. 2.7. We have two inverse images of p̃: the value x_p̃ whose image through the empirical cdf, F_n(x), is p̃, and the value x_0p̃ which has image p̃ by using the theoretical cdf F_0. The scatter plot of the pairs (x_0p̃, x_p̃) is named Quantile-Quantile plot (QQ-plot), see Fig. 2.9.

    > plot(qnorm(p.orders, mean = mean(data), sd = sd(data)),

    sort(data), pch = 16,

    xlab = expression(x[0][tilde(p)]~~(theoretical~quantiles)),

    ylab = expression(x[tilde(p)]~~(sample~quantiles)))

    > abline(0, 1)

The same graph can be obtained by applying to data the function qqnorm. Also in this case if the points are on a straight line then F_0 could represent the data generating model. The QQ-plot is particularly effective in detecting deviations from F_0 on the tails of the distribution, see Section 2.9.
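For instance (qqnorm uses standard normal quantiles, so the picture is the same up to the scale of the horizontal axis):

> qqnorm(data)
> qqline(data)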


[Figure 2.8 about here: x axis: p* (theoretical probabilities); y axis: p (sample probabilities)]

Figure 2.8 PP-plot

    2.7.5 Use of the function fit.cont

The function fit.cont, available in the package rriskDistributions, gives several goodness-of-fit statistics (loglikelihood, AIC, BIC, Chi-squared, Anderson-Darling and Kolmogorov-Smirnov) to check whether the data follow some theoretical cdf. The Beta, Cauchy, chi-square, non-central chi-square, exponential, F, gamma, Gompertz, hypergeometric, lognormal, logistic, negative binomial, Normal, pert, Poisson, Student's t, truncated normal, triangular, uniform and Weibull models are implemented. A theoretical cdf appears in the output only when the procedure succeeds in estimating its parameters, otherwise a warning message is returned. Have a look at the help system for more information.

    The function fit.cont also produces the histogram with the theoretical density,the QQ-plot, the empirical and theoretical cdfs and the PP-plot, see Fig. 2.10.
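On the simulated sample the call is simply the following (the plots and the goodness-of-fit statistics are collected in the package's interactive window):

> library(rriskDistributions)
> fit.cont(data)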

We observe that other statistical software packages draw the PP- and QQ-plots by switching the x and y axes, so theoretical probabilities and theoretical quantiles will appear on the y axis.


[Figure 2.9 about here: x axis: x_0p̃ (theoretical quantiles); y axis: x_p̃ (sample quantiles)]

Figure 2.9 QQ-plot

    2.8 Two tests for assessing normality

    We consider two tests for assessing the normality distributional assumption.

    2.8.1 The Jarque-Bera test

The Jarque-Bera test, see Jarque and Bera (1987), is obtained as a Lagrange Multiplier statistic, see Verbeek's Chapter 6, and has the following forms:

in case of a sample of n observations (x_1, …, x_n) the Jarque-Bera statistic is defined as:

JB = n [ (√b_1)²/6 + (b_2 − 3)²/24 ]

where:

√b_1 = μ̂_3/μ̂_2^{3/2},   b_2 = μ̂_4/μ̂_2²,   μ̂_j = (1/n) Σ_{i=1}^{n} (x_i − x̄)^j   and   x̄ = (1/n) Σ_{i=1}^{n} x_i.


    Figure 2.10 Fitting a continuous distribution by using the function fit.cont

Observe that √b_1 and b_2 − 3 are respectively the sample skewness and excess kurtosis coefficients, which are null under the normality assumption.


in case of a sample of n OLS residuals (e_1, …, e_n) the Jarque-Bera statistic is defined as:

JB = n [ μ̂_3²/(6 μ̂_2³) + (1/24)(μ̂_4/μ̂_2² − 3)² ] + n [ 3 μ̂_1²/(2 μ̂_2) − μ̂_3 μ̂_1/μ̂_2² ]

where:

μ̂_j = (1/n) Σ_{i=1}^{n} e_i^j.

When the linear model includes a constant the residuals have zero mean, that is μ̂_1 = 0, and the Jarque-Bera statistic reduces to the former definition.

In both cases the Jarque-Bera statistic is distributed as a χ²_2 random variable with 2 degrees of freedom.
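As a check of the first form, the statistic can be computed by hand on the simulated sample data (a sketch, which should reproduce the value returned below by jarque.bera.test):

> n <- length(data)
> m <- function(j) mean((data - mean(data))^j)   # sample central moments
> sk <- m(3)/m(2)^1.5   # sample skewness
> ku <- m(4)/m(2)^2     # sample kurtosis
> n * (sk^2/6 + (ku - 3)^2/24)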

In the package tseries the function jarque.bera.test is available to perform the Jarque-Bera test on a set of observations. By applying it to data we obtain

    > library(tseries)

    > jarque.bera.test(data)

    Jarque Bera Test

    data: data

    X-squared = 0.1691, df = 2, p-value = 0.9189

    and the null hypothesis of normality will not be rejected.

    2.8.2 The Shapiro-Wilk test

The Shapiro-Wilk normality test, see Shapiro and Wilk (1965), is implemented in the function shapiro.test; applying this function to data we obtain

    > shapiro.test(data)

    Shapiro-Wilk normality test

    data: data

    W = 0.9939, p-value = 0.9349

    which does not reject the null hypothesis of normality.

    2.9 Some further comments on the QQ-plot

We now consider the behaviour of the QQ-plot (and of the PP-plot), under the null hypothesis of normality, in the presence of data characterized by skewness and by leptokurtic and platikurtic behaviour.


    2.9.1 Positively skewed distributions

Let X be distributed according to a Gamma distribution. The density function is

f(x; α, λ) = (λ^α/Γ(α)) x^{α−1} e^{−λx} I_{(0,∞)}(x),   α > 0, λ > 0,

and we have E(X) = α/λ and Var(X) = α/λ².

Figure 2.11 shows the density functions and the cdfs of a Gamma random variable, X, with parameters α = 4 and λ = 2 and of a Normal random variable, Y, with mean α/λ = 4/2 = 2 and variance α/λ² = 4/2² = 1.

    > layout(1:2)

    > par(mai = c(0.5, 0.82, 0.1, 0.42))

    > alpha = 4

    > lambda = 2

    > curve(dgamma(x, alpha, lambda), xlim = c(-2, 6),

    ylab = expression(f[X](x)~~and~~f[Y](x)))

    > curve(dnorm(x, mean = alpha/lambda), add = TRUE)

    > text(0.75, 0.4, expression(f[X](x)), cex = 0.75)

    > text(3, 0.35, expression(f[Y](x)), cex = 0.75)

    > curve(pgamma(x, alpha, lambda), xlim = c(-2, 6),

    ylab = expression(F[X](x)~~and~~F[Y](x)))

    > curve(pnorm(x, mean = alpha/lambda), add = TRUE)

    > text(2, 0.75, expression(F[X](x)), cex = 0.75)

    > text(2, 0.35, expression(F[Y](x)), cex = 0.75)

We can establish the behaviour of the PP- and QQ-plots by considering the cumulative distribution functions, as was shown in Section 2.7.4.

    > layout(1:2)

    > par(mai = c(0.9, 0.82, 0.1, 0.42))

> x <- seq(-2, 6, length = 1000)
> plot(pnorm(x, mean = alpha/lambda), pgamma(x, alpha,

    lambda), type = "l", xaxs = "i", yaxs = "i",

    xlab = "theoretical probabilities",

    ylab = "sample probabilities",

    ylim = c(0, 1))

    > abline(0, 1)

> x <- seq(0.001, 0.999, length = 1000)
> plot(qnorm(x, mean = alpha/lambda), qgamma(x, alpha,

    lambda), xlim = c(-2, 6), ylim = c(-2, 6), type = "l",

    xlab = "theoretical quantiles", ylab = "sample quantiles")

    > abline(0, 1)

    > text(-0.75, 1.5, "left tail thinner than the normal tail",

    cex = 0.75)

    > text(3, 5.5, "right tail fatter than the normal tail",

    cex = 0.75)


In this situation the left tail of X is thinner than that of Y while the right tail of X is fatter than that of Y. Thus the quantiles on the tails of the two distributions will have the following behaviour: for any given p (close to 0 or to 1) the quantiles of X are larger than those of Y. The behaviour is evident by examining the QQ-plot.

The PP-plot clearly detects a different behaviour of the two distributions in the middle of the domain.
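The tail behaviour can also be checked numerically on a pair of extreme probabilities (a sketch):

> qgamma(c(0.01, 0.99), alpha, lambda)
> qnorm(c(0.01, 0.99), mean = alpha/lambda)   # smaller than the Gamma quantiles in both tails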

    We now apply the function fit.cont to some simulated data, see Fig. 2.15.

    > set.seed(123)

> skew.data <- rgamma(100, alpha, lambda)
> library(rriskDistributions)

    > fit.cont(skew.data)

    2.9.2 Negatively skewed distributions

Figure 2.12 shows the density functions and the cdfs of X = −W, W being the Gamma random variable with parameters α = 4 and λ = 2 considered in the previous section, and of a Normal random variable, Y, with mean −α/λ = −2 and variance α/λ² = 1.

    > layout(1:2)

    > par(mai = c(0.5, 0.82, 0.1, 0.42))

    > alpha = 4

    > lambda = 2

    > curve(dgamma(-x, alpha, lambda), xlim = c(-6, 2),

    ylab = expression(f[X](x)~~and~~f[Y](x)))

    > curve(dnorm(x, mean = -alpha/lambda), add = TRUE)

    > text(-0.75, 0.4, expression(f[X](x)), cex = 0.75)

    > text(-3, 0.35, expression(f[Y](x)), cex = 0.75)

    > curve(1 - pgamma(-x, alpha, lambda), xlim = c(-6,

    2), ylab = expression(F[X](x)~~and~~F[Y](x)))

    > curve(pnorm(x, mean = -alpha/lambda), add = TRUE)

    > text(-1.75, 0.75, expression(F[X](x)), cex = 0.75)

    > text(-1.75, 0.35, expression(F[Y](x)), cex = 0.75)

    > layout(1:2)

    > par(mai = c(0.9, 0.82, 0.1, 0.42))

> x <- seq(-6, 2, length = 1000)
> plot(pnorm(x, mean = -alpha/lambda), 1 - pgamma(-x,

    alpha, lambda), type = "l", xaxs = "i", yaxs = "i",

    xlab = "theoretical probabilities",

    ylab = "sample probabilities",

    ylim = c(0, 1))

    > abline(0, 1)

> x <- seq(0.001, 0.999, length = 1000)
> plot(qnorm(x, mean = -alpha/lambda), -qgamma(1 -

    x, alpha, lambda), xlim = c(-6, 2), ylim = c(-6,


    2), type = "l", xlab = "theoretical quantiles",

    ylab = "sample quantiles")

    > abline(0, 1)

    > text(-2.5, -5, "left tail fatter than the normal tail",

    cex = 0.75)

    > text(0.75, -1.75, "right tail thinner than the normal tail",

    cex = 0.75)

In this situation the left tail of X is fatter than that of Y while the right tail of X is thinner than that of Y. Thus the quantiles on the tails of the two distributions will have the following behaviour: for any given p (close to 0 or to 1) the quantiles of X are smaller than those of Y. The behaviour is evident by examining the QQ-plot.

As above the PP-plot clearly detects a different behaviour of the two distributions in the middle of the domain.

    We apply the function fit.cont to some simulated data, see Fig. 2.16.

    > set.seed(123)

> skew.data <- -rgamma(100, alpha, lambda)
> library(rriskDistributions)

    > fit.cont(skew.data)

    2.9.3 Leptokurtic distributions

Let X be distributed according to a t_k distribution with k degrees of freedom. We have E(X) = 0 and Var(X) = k/(k − 2).

The t distribution is used in finance since it is able to capture the fatter tails which characterize the residuals distribution.

Figure 2.13 shows the density functions and the cdfs of a t random variable with k = 4 degrees of freedom and of a Normal random variable, Y, with mean 0 and variance k/(k − 2) = 2.

    > layout(1:2)

    > par(mai = c(0.5, 0.82, 0.1, 0.42))

    > k = 4

    > curve(dt(x, k), xlim = c(-8, 8),

    ylab = expression(f[X](x)~~and~~f[Y](x)))

    > curve(dnorm(x, mean = 0, sd = (k/(k - 2))^0.5), add = TRUE)

    > text(0.75, 0.35, expression(t[4]), cex = 0.75)

    > text(0, 0.24, "normal", cex = 0.75)

    > curve(pt(x, k), xlim = c(-8, 8),

    ylab = expression(F[X](x)~~and~~F[Y](x)))

    > curve(pnorm(x, mean = 0, sd = (k/(k - 2))^0.5), add = TRUE)

    > text(0, 0.2, expression(F[X](x)), cex = 0.75)

    > text(1.5, 0.7, expression(F[Y](x)), cex = 0.75)


    > layout(1:2)

    > par(mai = c(0.9, 0.82, 0.1, 0.42))

> x <- seq(-8, 8, length = 1000)
> plot(pnorm(x, mean = 0, sd = (k/(k - 2))^0.5), pt(x,

    k), type = "l", xaxs = "i", yaxs = "i",

    xlab = "theoretical probabilities",

    ylab = "sample probabilities", ylim = c(0, 1))

    > abline(0, 1)

> x <- seq(0.001, 0.999, length = 1000)
> plot(qnorm(x, mean = 0, sd = (k/(k - 2))^0.5), qt(x,

    k), xlim = c(-8, 8), ylim = c(-8, 8), type = "l",

    xlab = "theoretical quantiles", ylab = "sample quantiles")

    > abline(0, 1)

    > text(-1.25, -7.5, "left tail fatter than the normal tail",

    cex = 0.75)

    > text(1, 7.5, "right tail fatter than the normal tail",

    cex = 0.75)

In this situation the tails of X are fatter than those of Y. Thus the quantiles on the tails of the two distributions will have the following behaviour: for any given p close to 0 the quantiles of X are smaller than those of Y; for any given p close to 1 the quantiles of X are larger than those of Y. The behaviour is evident by examining the QQ-plot.

The density functions are now symmetric and thus the PP-plot intersects the (0,0)-(1,1) line at the centre of the distributions; however it can still detect the different behaviour of the two distributions in the middle of their domain.

    We apply the function fit.cont to some simulated data, see Fig. 2.17.

    > set.seed(123)

> leptokurtic.data <- rt(100, k)
> library(rriskDistributions)

    > fit.cont(leptokurtic.data)

    2.9.4 Platikurtic distributions

Let X be distributed according to a uniform distribution on (0, 1). We have E(X) = 0.5 and Var(X) = 1/12.

Figure 2.14 shows the density functions and the cdfs of X and of a Normal random variable, Y, with mean 0.5 and variance 1/12.

    > layout(1:2)

    > par(mai = c(0.5, 0.82, 0.1, 0.42))

    > curve(dunif(x), xlim = c(-1, 2), ylim = c(0, 1.5),

    ylab = expression(f[X](x)~~and~~f[Y](x)))

    > curve(dnorm(x, mean = 0.5, sd = 1/12^0.5), add = TRUE)

    > text(-0.1, 1, expression(f[X](x)), cex = 0.75)

    > text(0.75, 1.25, expression(f[Y](x)), cex = 0.75)

  • 48 An Introduction to Linear Regression

    > curve(punif(x), xlim = c(-1, 2),

    ylab = expression(F[X](x)~~and~~F[Y](x)))

    > curve(pnorm(x, mean = 0.5, sd = 1/12^0.5), add = TRUE)

    > text(0.9, 0.75, expression(F[X](x)), cex = 0.75)

    > text(0.4, 0.2, expression(F[Y](x)), cex = 0.75)

    > layout(1:2)

    > par(mai = c(0.9, 0.82, 0.1, 0.42))

> x <- seq(-1, 2, length = 1000)
> plot(pnorm(x, mean = 0.5, sd = 1/12^0.5), punif(x),

    type = "l", xaxs = "i", yaxs = "i",

    xlab = "theoretical probabilities",

    ylab = "sample probabilities", ylim = c(0, 1))

    > abline(0, 1)

> x <- seq(0.001, 0.999, length = 1000)
> plot(qnorm(x, mean = 0.5, sd = 1/12^0.5), qunif(x),

    xlim = c(-1, 2), ylim = c(-1, 2), type = "l",

    xlab = "theoretical quantiles", ylab = "sample quantiles")

    > abline(0, 1)

    > text(-0.5, 0.5, "left tail thinner than the normal tail",

    cex = 0.75)

    > text(1.5, 0.5, "right tail thinner than the normal tail",

    cex = 0.75)

In this situation the tails of Y are fatter than those of X. Thus the quantiles on the tails of the two distributions will have the following behaviour: for any given p close to 0 the quantiles of X are larger than those of Y; for any given p close to 1 the quantiles of X are smaller than those of Y. The behaviour is evident by examining the QQ-plot.

As above the density functions are symmetric and thus the PP-plot intersects the (0,0)-(1,1) line at the centre of the distributions; however it can detect the different behaviour of the two distributions in the middle of their domain.

    We apply the function fit.cont to some simulated data, see Fig. 2.18.

    > set.seed(123)

> platikurtic.data <- runif(100)
> library(rriskDistributions)

    > fit.cont(platikurtic.data)


[Figure 2.11 about here: density functions f_X(x), f_Y(x) and cdfs F_X(x), F_Y(x) of the Gamma and Normal random variables, followed by the corresponding PP-plot (theoretical vs sample probabilities)]