
Slide Set 3
Regression Models and the Classical Linear Regression Model (CLRM)

Pietro Coretto
pcoretto@unisa.it

Econometrics
Master in Economics and Finance (MEF)
Università degli Studi di Napoli “Federico II”

Version: Tuesday 21st January, 2020 (h11:31)


Regression analysis

Regression analysis is a set of statistical techniques for modeling and analyzing the relationship between a dependent variable Y and one or more independent variables X. Typically X = (X_1, X_2, ..., X_K)′ is a vector of variables.

Depending on the context, and the field of application, we have different names:

- Y: dependent variable, response variable, outcome variable, output variable, target variable, etc.
- X: independent variable, regressor, covariate, explanatory variable, predictor, feature, etc.

In regression analysis we assume a certain mechanism linking the X to the Y. We want to use observed data to understand the link.


Regression function and types of regression models

The link is formalized in terms of a regression function. The latter models the relationship between Y and X

Y ≈ f(X)

Building a regression model requires specifying:

- how f(·) transforms X
- in which sense f(·) approximates Y


Depending on how the f(·) transforms X we have:

- nonparametric models
- parametric models

Nonparametric models
The f(·) is treated as the unknown. Therefore the object of interest belongs to an infinite-dimensional space. Usually we restrict our quest to some well defined class; for instance, we may restrict the analysis to

$$\left\{\, f : f \text{ is real valued, smooth, and } \int |f(x)|\,dx < +\infty \,\right\}$$

Nonparametric models allow for a lot of flexibility, and this comes at a price.
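As an illustration (not part of the original slides), here is a minimal numpy sketch of one common nonparametric estimator, the Nadaraya–Watson kernel smoother; the simulated data and the bandwidth h = 0.5 are made-up choices. The "price" of the flexibility shows up in the bandwidth: small h gives a wiggly, high-variance fit, large h a smooth but biased one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a nonlinear signal plus noise
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + 0.3 * rng.standard_normal(200)

def nw_smoother(x0, x, y, h=0.5):
    """Nadaraya-Watson estimate of E[Y | X = x0] with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)  # kernel weights centered at x0
    return np.sum(w * y) / np.sum(w)        # locally weighted average of y

# Estimated regression function on a grid, with no parametric form assumed
grid = np.linspace(0, 10, 50)
fhat = np.array([nw_smoother(x0, x, y) for x0 in grid])
```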


Parametric models
f is assumed to have a specific form controlled by a scalar parameter β or a vector of parameters β = (β_1, β_2, ..., β_p)′. Therefore the object of interest is β.

Examples:

- linear parametric regression function: f(X; β) = β_1 X_1 + β_2 X_2
- nonlinear parametric regression function: f(X; β) = sin(β_1 X_1) + e^{β_2 X_2}

Some nonlinear regression functions can be linearized. Example:

$$f(X;\beta) = e^{X_1 + \beta X_2} \;\longrightarrow\; \log(f(X;\beta)) = X_1 + \beta X_2$$

Sometimes a regression function is not linear in the original X, but it is linear in a transformation of X

$$f(X;\beta) = \beta_1 X_1^2 + \beta_2 X_2$$
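A minimal sketch (with made-up numbers, not from the slides) of why this matters in practice: the model above is nonlinear in X_1 but linear in the transformed regressor X_1², so ordinary least squares on the transformed columns recovers β.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Made-up DGP with true beta = (2.0, -1.5)
X1, X2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 * X1**2 - 1.5 * X2 + rng.normal(scale=0.5, size=n)

# Nonlinear in X1 but linear in the transformed regressor X1^2:
# stack the transformed columns and solve the least-squares problem
Z = np.column_stack([X1**2, X2])
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta_hat)  # close to [2.0, -1.5]
```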


Depending on what kind of approximation f(·) provides about Y:

- (conditional) mean regression: f(X) = E[Y | X]
- (conditional) quantile regression:
  f(X) = median[Y | X]
  f(X) = quantile_α[Y | X]
  etc.

Conditional mean regression functions are central in regression analysis for several reasons:

- approximating Y by an average is intuitive
- most theoretical models are expressed in terms of expectations
- the “Optimal Predictor Theorem”


The quality of the approximation Y ≈ f(X) can be measured by the quadratic risk or MSE

$$E\big[(Y - f(X))^2\big]$$

Theorem (Optimal Predictor)
Under suitable regularity conditions

$$\inf_{f}\; E\big[(Y - f(X))^2\big] = E\big[(Y - E[Y \mid X])^2\big],$$

and the infimum is attained at f(X) = E[Y | X].

In other words:

- f(X) = E[Y | X] gives the best approximation to Y in terms of MSE
- if we want to guess Y based on the information generated by X, then f(X) = E[Y | X] is the best guess in terms of MSE
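A quick Monte Carlo check of the theorem (a sketch with a made-up DGP, not from the slides): when the conditional mean is known, no other predictor based on X achieves a smaller MSE.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Made-up DGP in which the conditional mean is known: E[Y | X] = X**2
X = rng.normal(size=n)
Y = X**2 + rng.normal(size=n)

def mse(pred):
    """Sample analogue of E[(Y - f(X))^2]."""
    return np.mean((Y - pred) ** 2)

print(mse(X**2))                  # ~1.0 (the noise variance): the optimum
print(mse(X**2 + 0.5))            # shifted predictor: larger MSE
print(mse(np.abs(X)))             # a different function of X: larger MSE
print(mse(np.full(n, Y.mean())))  # best constant guess: larger still
```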


Proof. We need to show that for any function f(X)

$$E\big[(Y - E[Y \mid X])^2\big] \;\le\; E\big[(Y - f(X))^2\big]$$

$$E\big[(Y - f(X))^2\big] = E\Big[\big(\underbrace{Y - E[Y \mid X]}_{A} \;+\; \underbrace{E[Y \mid X] - f(X)}_{B}\big)^2\Big]$$

Expanding (A + B)^2 and using the linearity of expectations:

$$E[A^2] + E[B^2] + 2\,E[AB] = E\big[(Y - f(X))^2\big] \qquad (3.1)$$

E[B²] = E[(E[Y | X] − f(X))²] ≥ 0, therefore (3.1) becomes

$$E[A^2] + 2\,E[AB] \;\le\; E\big[(Y - f(X))^2\big] \qquad (3.2)$$


By the law of iterated expectations, E[AB] = E[ E[AB | X] ], and

$$\begin{aligned}
E[AB \mid X] &= E\big[(Y - E[Y \mid X])\,(E[Y \mid X] - f(X)) \,\big|\, X\big] \\
&= (E[Y \mid X] - f(X))\; E\big[\,Y - E[Y \mid X] \,\big|\, X\big] && \text{(pull out what's known)} \\
&= (E[Y \mid X] - f(X))\,\big\{\, E[Y \mid X] - E\big[E[Y \mid X] \mid X\big] \,\big\} \\
&= (E[Y \mid X] - f(X)) \times 0 = 0,
\end{aligned}$$

so that E[AB] = E[ E[AB | X] ] = E[0] = 0.

Therefore, (3.2) becomes

$$E[A^2] \;\le\; E\big[(Y - f(X))^2\big]$$

with E[A²] = E[(Y − E[Y | X])²], which proves the result. ∎


In this course we focus on conditional mean regression models where the regression function has a linear parametric form:

$$Y \approx E[Y \mid X] = f(X;\beta) = \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_K X_K$$

The reason why this class of regression models is so popular is that they can reproduce the correlation between Y and the Xs. Going back to the example of Slide Set #1:


[Figure: scatter plot of the sample (X, Y); X ranges roughly over 5–20 and Y over 50–250]

[Figure: the same scatter plot, highlighting the conditional distribution of Y at x = 5, with conditional mean y|x = 135.1]


[Figure: conditional distribution of Y at x = 10, with conditional mean y|x = 171.8]

[Figure: conditional distribution of Y at x = 15, with conditional mean y|x = 208.6]


The model postulates that Y ≈ E[Y |X], but we cannot observe E[Y |X].

For each sample unit i = 1, 2, ..., n we observe the sample (Y_i, X_{i1}, X_{i2}, ..., X_{iK}). Therefore, we need an additional term which summarizes the difference between Y_i and its conditional mean E[Y_i | X].

A way to reproduce the previous sampling mechanism is to add an error term ε_i, which is a random variable that “summarizes” the deviations of Y_i from its conditional mean E[Y_i | X]. Therefore

$$\begin{aligned}
Y_i &= f(X_i;\beta) + \varepsilon_i \\
&= E[Y_i \mid X] + \varepsilon_i \\
&= \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_K X_{iK} + \varepsilon_i
\end{aligned}$$

The short name for this class of models is linear regression model: a linear parametric regression function plus an additive error term.
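A minimal simulation of this sampling mechanism (made-up β, n, and K; not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 200, 3
beta = np.array([1.0, 2.0, -0.5])    # made-up coefficients

X = rng.normal(size=(n, K))          # regressors, one row per sample unit
eps = rng.normal(size=n)             # additive error term
y = X @ beta + eps                   # Y_i = x_i' beta + eps_i

# Each Y_i deviates from its conditional mean x_i' beta exactly by eps_i
print(np.allclose(y - X @ beta, eps))  # True
```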


Partial or marginal effects

The partial/marginal effect is a measure of the effect on the regression function determined by a change in a regressor X_j holding all the other regressors constant (waee = “with all else equal”).

Let us focus on conditional mean models. Assuming differentiability, the partial/marginal effect of a change ΔX_j is given by

$$\Delta E[Y \mid X] = \frac{\partial E[Y \mid X]}{\partial X_j}\, \Delta X_j \qquad \text{holding fixed } X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_K$$

Computing marginal/partial effects makes sense only when the model has a causal interpretation (see later).


For the linear regression model

$$\frac{\partial E[Y \mid X]}{\partial X_j} = \beta_j$$

Therefore, the unknown parameter β_j coincides with the partial effect of a unit change in X_j, waee.

For a discrete regressor X_j, partial effects are computed as the variations in E[Y | X] obtained by changing the level of X_j, waee.

Suppose X_k ∈ {a, b, c}. The partial effect when X_k changes from level a to b (waee) is given by

$$E[Y \mid X_k = b, X] \;-\; E[Y \mid X_k = a, X]$$
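A sketch of this computation on made-up data (not from the slides): with dummy coding for the levels of the factor, the discrete partial effect from a to b (waee) is the coefficient on the b-dummy.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000

# Made-up DGP: one continuous regressor and a three-level factor {a, b, c}
x1 = rng.normal(size=n)
level = rng.choice(["a", "b", "c"], size=n)
d_b = (level == "b").astype(float)   # dummy for level b (a is the baseline)
d_c = (level == "c").astype(float)   # dummy for level c
y = 1.0 + 0.8 * x1 + 2.0 * d_b + 3.5 * d_c + rng.normal(size=n)

# Least squares with an intercept and the two dummies
Z = np.column_stack([np.ones(n), x1, d_b, d_c])
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)

# E[Y | Xk = b, X] - E[Y | Xk = a, X], waee = coefficient on the b-dummy
print(beta_hat[2])  # close to the true value 2.0
```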

Another measure of the regressors’ effect on the Y is the partial/marginal elasticity (see homeworks).


Notations

Indexes and constants:

- n: number of sample units
- K: number of covariates/regressors/features measured on each of the n sample units
- i = 1, 2, ..., n: indexes sample units
- k = 1, 2, ..., K: indexes regressors

y ∈ R^n: column vector of the dependent/response variable

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$


x_i ∈ R^K: column vector containing the K regressors measured on the i-th sample unit

$$x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iK} \end{pmatrix}$$

The so-called design matrix is the (n × K) matrix whose rows correspond to sample units and whose columns correspond to regressors

$$X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1K} \\
x_{21} & x_{22} & \cdots & x_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nK}
\end{pmatrix} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix}$$


ε ∈ R^n: column vector containing the error term for each unit

$$\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
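In numpy terms, the notation above maps onto array shapes as in this small sketch (illustrative sizes only, not from the slides):

```python
import numpy as np

n, K = 100, 4          # illustrative sizes

y = np.zeros(n)        # response vector, y in R^n
X = np.zeros((n, K))   # design matrix: rows = sample units, columns = regressors
eps = np.zeros(n)      # error vector, eps in R^n

x_i = X[0]             # x_i in R^K: the K regressors of one unit (a row of X)
print(y.shape, X.shape, eps.shape, x_i.shape)  # (100,) (100, 4) (100,) (4,)
```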


Classical linear regression model (CLRM)

A1: linearity
For all i = 1, 2, ..., n observed data are generated by the following linear model

$$\begin{aligned}
y_i &= \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_K x_{iK} + \varepsilon_i && (3.3) \\
&= x_i'\beta + \varepsilon_i && (3.4)
\end{aligned}$$

where β ∈ R^K is a vector of coefficients. In matrix form

$$y = X\beta + \varepsilon \qquad (3.5)$$

Remark 1: linearity of the model is with respect to the parameters, not the regressors. For x_2 = log(consumption), the model is still linear with respect to log(consumption).


Remark 2: often a constant/intercept term is introduced in the model

$$y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_K x_{iK} + \varepsilon_i \qquad \text{for } i = 1, 2, \ldots, n$$

In this case we conventionally read (3.3) as if x_{i1} = 1 for all i in the sample.

If the model includes a constant/intercept term, then the first column of X is the unit column vector, that is X_{·1} = 1_n = (1, 1, ..., 1)′

$$X = \begin{pmatrix}
1 & x_{12} & \cdots & x_{1K} \\
1 & x_{22} & \cdots & x_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n2} & \cdots & x_{nK}
\end{pmatrix}$$
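In code, adding the intercept amounts to prepending the unit column to the design matrix; a minimal sketch with made-up data (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 100, 3

# K - 1 genuine regressors, then the unit column 1_n in front,
# so that x_i1 = 1 for all i and beta_1 acts as the intercept
X_raw = rng.normal(size=(n, K - 1))
X = np.column_stack([np.ones(n), X_raw])
print(X[:3, 0])  # the first column is all ones
```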


A2: strict exogeneity
E[ε_i | X] = 0 for all i = 1, 2, ..., n; in vector form, E[ε | X] = 0.

Implications:

1. E[ε_i] = 0 for all i = 1, 2, ..., n.
2. All regressors are orthogonal to the error term for all units: E[x_{jk} ε_i] = 0 for all i, j = 1, 2, ..., n and k = 1, 2, ..., K.
3. Orthogonality implies the zero-correlation conditions: Cov[x_{jk}, ε_i] = 0 for all i, j = 1, 2, ..., n and k = 1, 2, ..., K.

If i = time (time-series data), strict exogeneity implies that the error term is orthogonal to past, current, and future regressors. For most time-series data this condition is not satisfied, so the finite-sample theory based on strict exogeneity is rarely applicable in time-series contexts.
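A simulated illustration of this failure (a sketch, not from the slides): in an AR(1) model the regressor is the lagged outcome, so the error is uncorrelated with the current regressor but correlated with future ones.

```python
import numpy as np

rng = np.random.default_rng(6)
T, beta = 100_000, 0.5

# AR(1): y_t = beta * y_{t-1} + eps_t, so the regressor is x_t = y_{t-1}
eps = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = beta * y[t - 1] + eps[t]

x = y[:-1]    # regressor x_t = y_{t-1}
e = eps[1:]   # contemporaneous error eps_t

print(np.mean(x * e))      # ~0: contemporaneous orthogonality holds
print(np.mean(y[1:] * e))  # ~Var(eps) = 1, not 0: the error is correlated
                           # with the *future* regressor x_{t+1} = y_t,
                           # so strict exogeneity fails
```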


Proof of the implications:

1. It follows from the law of iterated expectations:

$$E[\varepsilon_i] = E\big[\,E[\varepsilon_i \mid X]\,\big] = E[0] = 0$$

2. By the law of iterated expectations write

$$E[x_{jk}\,\varepsilon_i] = E\big[\,E[x_{jk}\,\varepsilon_i \mid x_{jk}]\,\big]$$

and by the linearity of the conditional expectation (“pull out what’s known”), write

$$E\big[\,E[x_{jk}\,\varepsilon_i \mid x_{jk}]\,\big] = E\big[\,x_{jk}\,E[\varepsilon_i \mid x_{jk}]\,\big]$$

But A2 implies E[ε_i | x_{jk}] = E[ E[ε_i | X] | x_{jk} ] = 0 (iterated expectations again), which proves the desired result.

3. This follows from the previous results:

$$\operatorname{Cov}[x_{jk}, \varepsilon_i] = E[x_{jk}\,\varepsilon_i] - E[x_{jk}]\,E[\varepsilon_i] = E[x_{jk}\,\varepsilon_i] = 0$$


A3: absence of multicollinearity
The (n × K) design matrix X has rank K with probability 1.

This assumption implies that X has full column rank, which means that the columns of X are linearly independent.

A3 also requires that n ≥ K.

Essentially, A3 is a technical assumption that guarantees that most computations can be performed (more on this later).


A4: spherical error variance
For all i, j = 1, 2, ..., n with i ≠ j:

1. E[ε_i² | X] = σ² > 0 (homoskedasticity)
2. E[ε_i ε_j | X] = 0 (units are uncorrelated)

A vector random variable is said to have a spherical distribution if its covariance matrix is a scalar multiple of the identity matrix. The sphericity here is shown as follows:

$$\operatorname{Var}[\varepsilon_i \mid X] = E[\varepsilon_i^2 \mid X] - E[\varepsilon_i \mid X]^2 = E[\varepsilon_i^2 \mid X] = \sigma^2$$

and

$$\operatorname{Cov}[\varepsilon_i, \varepsilon_j \mid X] = E[\varepsilon_i\varepsilon_j \mid X] - E[\varepsilon_i \mid X]\,E[\varepsilon_j \mid X] = E[\varepsilon_i\varepsilon_j \mid X] = 0$$

Now it is easy to show (do it as an exercise) that

$$E[\varepsilon\varepsilon' \mid X] = \operatorname{Var}[\varepsilon \mid X] = \sigma^2 I_n$$
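As a numerical counterpart of the exercise (a sketch with made-up σ² and n, not from the slides): averaging εε′ over many independent draws recovers σ²I_n.

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma2, reps = 5, 2.0, 200_000

# Many error vectors with E[eps_i^2] = sigma2 and E[eps_i eps_j] = 0
E = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))

# Monte Carlo estimate of E[eps eps'] -- close to sigma2 * I_n
S = E.T @ E / reps
print(np.round(S, 2))  # ~2 on the diagonal, ~0 off the diagonal
```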
