
Economics G6411, Marcelo J. Moreira
Fall 2011, Columbia University

Lecture 16: Linear Regression Models

In Economics we are often interested in making assessments about how much of the value of one random variable can be explained by the value of other variables. A commonly chosen way to do so is through the estimation of a linear model. We start by considering a rather simple model of this sort.

Bivariate Linear Regression

Consider the relation between random variables $X$ and $Y$ in a bivariate population. Assume that:

1. $y_i = x_i\beta + u_i$

2. $E(u_i) = 0$

3. $\{u_i : i = 1, \ldots, n\}$ is a set of mutually independent random variables.

4. $x_i$ is deterministic, for every $i = 1, \ldots, n$.

5. $u_i \overset{iid}{\sim} N(0, 1)$

The first of the above assumptions determines that $X$ is the independent variable and $Y$ is the dependent variable. We do not know the value of the parameter $\beta$, so we are interested in estimating it. One way to do so is by using a method already known to us: maximum likelihood estimation. From the last assumption of the model, $u_i \overset{iid}{\sim} N(0, 1)$, we get $f(y; \beta)$, the joint pdf of $y$, and $g_i(y_i; \beta)$, the pdf of $y_i$, for $i = 1, \ldots, n$.

\[
f(y; \beta) = \prod_{i=1}^{n} g_i(y_i; \beta) = \prod_{i=1}^{n} (2\pi)^{-\frac{1}{2}} \exp\left\{-\frac{(y_i - x_i\beta)^2}{2}\right\} = (2\pi)^{-\frac{n}{2}} \exp\left\{-\sum_{i=1}^{n} \frac{(y_i - x_i\beta)^2}{2}\right\}
\]


The maximum likelihood estimator of $\beta$ is then given by:
\[
\hat\beta = \arg\max_{\beta}\ \left[-\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(y_i - x_i\beta)^2\right] = \arg\min_{\beta}\ \sum_{i=1}^{n}(y_i - x_i\beta)^2 \;\Rightarrow\; \hat\beta = \frac{\sum x_i y_i}{\sum x_i^2}
\]
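As a quick numerical check (a minimal Python sketch, not part of the original notes; the sample size and coefficient value are hypothetical), the closed-form estimator $\sum x_iy_i/\sum x_i^2$ can be compared with a generic least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 200, 1.5                     # hypothetical sample size and true coefficient
x = rng.uniform(1.0, 3.0, size=n)      # treated as fixed regressors
u = rng.standard_normal(n)             # u_i ~ N(0, 1), as in assumption 5
y = x * beta + u

beta_hat = np.sum(x * y) / np.sum(x**2)                          # closed-form ML/OLS estimator
beta_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]    # generic least squares
print(beta_hat, beta_lstsq)            # the two estimates coincide
```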

Since we know the distribution of $u_i$, we could use small sample results to test hypotheses. If $u_i$ is normally distributed, so is $\hat\beta$. However, we can choose a more general approach, which uses asymptotic results to make inferences concerning the unknown parameter. For that, we will need the following theorem:

Result 1 (Lindeberg-Feller CLT) For each $n$, let $X_{n1}, \ldots, X_{nn}$ be independent random variables such that:

1. $E[X_{nt}] = 0$

2. $\sum_{t=1}^{n} E[X_{nt}^2] = 1$

3. $\lim_{n\to\infty} \sum_{t=1}^{n} E\left[X_{nt}^2 \cdot I(|X_{nt}| > \epsilon)\right] = 0$

Then $\left\{\sum_{t=1}^{n} X_{nt}\right\}_n$ converges in distribution to a standard normally distributed random variable.

Notice that to use the Lindeberg-Feller CLT we do not require the variables to be independent and identically distributed. We only require independence and the validity of the three conditions above. This is a very important theorem, which we will use repeatedly.

Now let us return to our model. We wish to apply the Lindeberg-Feller CLT to obtain the asymptotic distribution of our statistic. To do so, we must check if all the assumptions of the theorem are valid.

\[
\hat\beta - \beta = \frac{\sum x_t u_t}{\sum x_t^2} = \sum_{t=1}^{n} \frac{x_t}{\sum x_t^2}\, u_t
\]


Consistent with the notation of the Lindeberg-Feller CLT, we can manipulate our equation to obtain the desired variables $X_{nt}$ and $a_{nt}$:

\[
\sqrt{\textstyle\sum x_t^2}\,(\hat\beta - \beta) = \sum_{t=1}^{n} \underbrace{\underbrace{\frac{x_t}{\left(\sum x_t^2\right)^{1/2}}}_{a_{nt}}\, u_t}_{X_{nt}} = \sum_{t=1}^{n} X_{nt} = \sum_{t=1}^{n} a_{nt} u_t
\]

Now, we check if the required conditions hold:

\[
\sum_{t=1}^{n} a_{nt}^2 = \sum_{t=1}^{n} \left[\frac{x_t}{\left(\sum x_t^2\right)^{1/2}}\right]^2 = \frac{\sum x_t^2}{\sum x_t^2} = 1 \;\Rightarrow\; \sum_{t=1}^{n} E[X_{nt}^2] = \sum_{t=1}^{n} a_{nt}^2 E[u_t^2] = 1 \tag{1}
\]
\[
E[X_{nt}] = E[a_{nt} u_t] = a_{nt} E[u_t] = 0 \tag{2}
\]

Assuming that $\lim_{n\to\infty} \sum_{t=1}^{n} E[X_{nt}^2] = \delta^2$, our final condition,
\[
\lim_{n\to\infty} \sum_{t=1}^{n} E\left[a_{nt}^2 u_t^2 \cdot I(|a_{nt} u_t| > \epsilon)\right] = 0, \tag{3}
\]
follows from an application of the dominated convergence theorem. Equations (1), (2) and (3) guarantee that we can apply the Lindeberg-Feller theorem to obtain the asymptotic distribution of our statistic of interest.
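A small simulation can illustrate this application of the CLT (a hedged sketch with hypothetical design values; the errors are drawn from a non-normal distribution with mean zero and unit variance, which is all the argument above requires):

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, reps = 500, 0.8, 5000            # hypothetical design
x = rng.uniform(0.5, 2.0, size=n)         # fixed regressors, reused in every replication
scale = np.sqrt(np.sum(x**2))

z = np.empty(reps)
for r in range(reps):
    u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)   # mean 0, variance 1, not normal
    beta_hat = np.sum(x * (x * beta + u)) / np.sum(x**2)
    z[r] = scale * (beta_hat - beta)      # sqrt(sum x_t^2) (beta_hat - beta)

# mean ~ 0, std ~ 1, and about 5% of draws exceed 1.96 in absolute value
print(z.mean(), z.std(), np.mean(np.abs(z) > 1.96))
```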

Multivariate Linear Regression

In the previous section, we studied a model with only one explanatory variable $X$. However, we can consider the bivariate model as a particular case of the multivariate model, and seek results valid for every linear model with any given number $k$ of explanatory variables, in a multivariate framework. Given $n$ random variables $Y_1, \ldots, Y_n$, a multiple linear regression model stipulates a dependence relation between these random variables and $k$ explanatory variables. For each of these random variables $Y_i$, $i = 1, \ldots, n$,


there is a corresponding $k$-dimensional vector of explanatory variables $x_i$, which we assume is related to the random variable in the following manner:

\[
y_i = x_{i1}\beta_1 + \cdots + x_{ik}\beta_k + u_i = x_i'\beta + u_i, \qquad i = 1, \ldots, n
\]
\[
y = X\beta + u \tag{4}
\]

where the matrix of values of the explanatory variables, X, is given by:

\[
\underset{n\times k}{X} = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{nk} \end{pmatrix} = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix}
\]
and the other matrices are:
\[
y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad x_i = \begin{pmatrix} x_{i1} \\ \vdots \\ x_{ik} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \qquad u = \begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix}
\]

Notice that equation (4) is simply the multivariate version of the equation shown in the first assumption of our previous bivariate model. The classical multiple regression model has four basic assumptions:

1. $E(y) = X\beta$

2. $X$ is nonstochastic

3. $V(y) = \sigma^2 I_n$

4. $X$ has rank $k$

For any given sample $(y, X)$, the parameters $\beta$ and $\sigma^2$ are unknown and we are interested in estimating them. To do so, we can start from equation (4) and choose $\beta$ to minimize the sum of squared residuals $u'u = (y - X\beta)'(y - X\beta)$:
\[
\hat\beta = \arg\min_{\beta}\ \frac{1}{2n}\sum_{i}(y_i - x_i'\beta)^2 = \arg\min_{\beta}\ \frac{1}{2n}(y - X\beta)'(y - X\beta)
\]


This estimator is called the Ordinary Least Squares (OLS) estimator. Regarding the objective function, notice that:
\[
(y - X\beta)'(y - X\beta) = y'y - y'X\beta - \beta'X'y + \beta'X'X\beta = y'y - 2y'X\beta + \beta'X'X\beta
\]

The first-order condition is:

\[
\frac{\partial}{\partial\beta}\left[\frac{1}{2n}(y - X\beta)'(y - X\beta)\right] = 0 \;\Rightarrow\; X'y = X'X\hat\beta
\]

If the matrix $X'X$ has full rank, it is invertible and we can explicitly express $\hat\beta$ as:
\[
\hat\beta = (X'X)^{-1}X'y \tag{5}
\]
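A minimal sketch of equation (5) in Python (hypothetical dimensions and coefficients; the normal equations $X'X\hat\beta = X'y$ are solved directly rather than inverting $X'X$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3                                  # hypothetical dimensions
X = rng.standard_normal((n, k))                # design matrix, full column rank
beta = np.array([1.0, -2.0, 0.5])              # hypothetical true coefficients
y = X @ beta + rng.standard_normal(n)

# beta_hat = (X'X)^{-1} X'y, computed by solving X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```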

Example 1 Consider the linear model below:

\[
y_i = \beta_1 + i\,\beta_2 + u_i = \begin{bmatrix} 1 & i \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + u_i = x_i'\beta + u_i
\]
where
\[
\underset{2\times 1}{x_i} = \begin{bmatrix} 1 \\ i \end{bmatrix}, \qquad \underset{2\times 1}{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \qquad \underset{n\times 2}{X} = \begin{bmatrix} 1 & 1 \\ \vdots & \vdots \\ 1 & n \end{bmatrix}
\]

To find the ordinary least squares estimators, we apply equation (5) using the above matrices. So, we have that:
\[
\underset{2\times 2}{X'X} = \begin{bmatrix} 1 & \ldots & 1 \\ 1 & \ldots & n \end{bmatrix}\begin{bmatrix} 1 & 1 \\ \vdots & \vdots \\ 1 & n \end{bmatrix} = \begin{bmatrix} \sum 1 & \sum i \\ \sum i & \sum i^2 \end{bmatrix}
\]
\[
(X'X)^{-1} = \begin{bmatrix} n & \frac{1}{2}n(n+1) \\ \frac{1}{2}n(n+1) & \frac{1}{6}n(n+1)(2n+1) \end{bmatrix}^{-1}
\]


Therefore, our estimator of $\beta$ is given by:
\[
\hat\beta = \frac{1}{\det(X'X)}\begin{bmatrix} \frac{1}{6}n(n+1)(2n+1) & -\frac{1}{2}n(n+1) \\ -\frac{1}{2}n(n+1) & n \end{bmatrix}\begin{bmatrix} \sum y_i \\ \sum i\,y_i \end{bmatrix} = \frac{1}{\det(X'X)}\begin{bmatrix} \frac{1}{6}n^2(n+1)(2n+1)\frac{1}{n}\sum y_i - \frac{1}{2}n^2(n+1)\frac{1}{n}\sum i\,y_i \\[4pt] -\frac{1}{2}n^2(n+1)\frac{1}{n}\sum y_i + n^2\frac{1}{n}\sum i\,y_i \end{bmatrix} = \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix}
\]
where the determinant of $X'X$ is
\[
\det(X'X) = \frac{n^2(n+1)(2n+1)}{6} - \frac{n^2(n+1)^2}{4}
\]

The OLS estimator has some important properties that make it interesting. For example, it is easy to see that it is an unbiased estimator of $\beta$:
\[
E(\hat\beta) = E\left[(X'X)^{-1}X'y\right] = (X'X)^{-1}X'E(y) = (X'X)^{-1}X'X\beta = \beta
\]

In addition to that, under the assumptions of the classical regression model, the variance of $\hat\beta$ is:
\[
V(\hat\beta) = (X'X)^{-1}X'V(y)X(X'X)^{-1} = (X'X)^{-1}X'\sigma^2 I_n X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'X(X'X)^{-1}
\]
\[
V(\hat\beta) = \sigma^2(X'X)^{-1} \tag{6}
\]

However, $\sigma^2$ is not observed, and if we wish to estimate $V(\hat\beta)$ consistently, we must find a consistent estimator of $\sigma^2$. To do so, let us define the following two auxiliary matrices:
\[
N = X(X'X)^{-1}X' \tag{7}
\]


\[
M = I_n - N = I_n - X(X'X)^{-1}X' \tag{8}
\]

Using these matrices, we can write the fitted-value vector as:

\[
\hat y = Ny = X\underbrace{(X'X)^{-1}X'y}_{\hat\beta} = \underset{n\times k}{\begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{nk} \end{pmatrix}}\ \underset{k\times 1}{\begin{pmatrix} \hat\beta_1 \\ \vdots \\ \hat\beta_k \end{pmatrix}}
\]

and the vector of residuals as:

\[
e = y - \hat y = My = (I_n - N)y = \underset{n\times 1}{\begin{pmatrix} y_1 - \sum_{j=1}^{k} x_{1j}\hat\beta_j \\ \vdots \\ y_n - \sum_{j=1}^{k} x_{nj}\hat\beta_j \end{pmatrix}}
\]

Additionally, these matrices have the following properties:

1. $N' = \left(X(X'X)^{-1}X'\right)' = X\left(X(X'X)^{-1}\right)' = X(X'X)^{-1}X' = N$

2. $N'N = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = N$

3. $M' = (I - N)' = I' - N' = I - N = M$

4. $M'M = I - N - N + N'N = I - N - N + N = I - N = M$

5. $M'N = (I - N)N = N - N'N = N - N = 0$

6. $MX = (I_n - N)X = X - X(X'X)^{-1}X'X = X - X = 0$

7. $My = M(X\beta + u) = 0 + Mu = Mu$
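The properties above, together with the trace identity $\mathrm{tr}(N) = k$ used shortly, can be verified numerically on a small hypothetical design matrix (a sketch, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 8, 3                                    # small hypothetical dimensions
X = rng.standard_normal((n, k))

N = X @ np.linalg.inv(X.T @ X) @ X.T           # projection onto the column space of X
M = np.eye(n) - N                              # residual-maker matrix

assert np.allclose(N, N.T)                     # N is symmetric
assert np.allclose(N @ N, N)                   # N is idempotent
assert np.allclose(M @ M, M)                   # M is idempotent
assert np.allclose(M @ N, np.zeros((n, n)))    # M and N are orthogonal
assert np.allclose(M @ X, np.zeros((n, k)))    # M annihilates X
assert np.isclose(np.trace(N), k)              # tr(N) = k
```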

Now we can return to our consistent estimation of $\sigma^2$. Applying the above properties, we can compute the expected value of the random variable $e'e$


(the sum of squared residuals), which we use in the estimation of $\sigma^2$:
\[
E(e'e) = E\left[\mathrm{tr}(e'e)\right] = \mathrm{tr}\left[E(ee')\right] = \mathrm{tr}\left[E(Myy'M')\right] = \mathrm{tr}\left[M E(uu')M'\right] = \mathrm{tr}\left[M\sigma^2 I_n M'\right] = \sigma^2\mathrm{tr}(M) = \sigma^2\mathrm{tr}(I_n - N) = \sigma^2\left[\mathrm{tr}(I_n) - \mathrm{tr}(N)\right] = \sigma^2(n - k)
\]

where the last equality follows from:

\[
\mathrm{tr}(N) = \mathrm{tr}\left[X(X'X)^{-1}X'\right] = \mathrm{tr}\left[(X'X)^{-1}X'X\right] = \mathrm{tr}(I_k) = k
\]

Therefore, we can define the adjusted mean squared residual $\hat\sigma^2$ as:
\[
\hat\sigma^2 = \frac{e'e}{n - k} \tag{9}
\]
which gives us an estimator of $\sigma^2$ that is unbiased:
\[
E(\hat\sigma^2) = \frac{E(e'e)}{n - k} = \sigma^2
\]

Now that we have a consistent estimator of $\sigma^2$, given by equation (9), we can use it to estimate the variance matrix from equation (6):
\[
\hat V(\hat\beta) = \hat\sigma^2(X'X)^{-1}
\]
Since $X$ is assumed to be nonstochastic and $\hat\sigma^2$ is unbiased, our estimator for the variance matrix is also unbiased.
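A short sketch of the estimators $\hat\sigma^2$ and $\hat V(\hat\beta)$ (hypothetical data-generating values):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma = 200, 3, 2.0                      # hypothetical design
X = rng.standard_normal((n, k))
beta = np.array([0.5, 1.0, -1.0])
y = X @ beta + sigma * rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                           # residual vector
sigma2_hat = e @ e / (n - k)                   # adjusted mean squared residual, equation (9)
V_hat = sigma2_hat * np.linalg.inv(X.T @ X)    # estimated variance matrix of beta_hat
se = np.sqrt(np.diag(V_hat))                   # standard errors
print(sigma2_hat, se)
```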

Result 2 (Gauss-Markov Theorem) In the framework of the classical regression model, the vector of OLS coefficients $\hat\beta$ is the minimum variance linear unbiased estimator of the parameter vector $\beta$.

Proof: Any linear estimator of $\beta$ can be written as $\tilde\beta = Ay$, where $A$ is a $k\times n$ nonstochastic matrix. Since we also want the estimator to be unbiased, we must have:
\[
E(\tilde\beta) = AE(y) = AX\beta = \beta \;\Leftrightarrow\; AX = I_k
\]


Define the matrices $\bar A = (X'X)^{-1}X'$ and $D = A - \bar A$. It follows that:
\[
AX = (\bar A + D)X = I_k
\]
But since the OLS estimator of $\beta$, denoted by $\hat\beta = \bar A y$, is also unbiased, we have that $\bar A X = I_k$. Therefore, $DX = 0$. The variance of $\tilde\beta$ is given by:
\[
V(\tilde\beta) = AV(y)A' = (\bar A + D)\sigma^2 I_n(\bar A + D)' = \sigma^2\left[\bar A\bar A' + \bar A D' + D\bar A' + DD'\right]
\]
But notice that:
\[
\bar A D' = (D\bar A')' \quad\text{and}\quad D\bar A' = \underbrace{DX}_{=0}(X'X)^{-1} = 0
\]
Therefore:
\[
V(\tilde\beta) = \sigma^2\left[\bar A\bar A' + DD'\right] = V(\hat\beta) + \sigma^2 DD' \;\Rightarrow\; V(\tilde\beta) - V(\hat\beta) = \sigma^2 DD'
\]
Since $\sigma^2$ is a positive scalar and $DD'$ is a positive semi-definite matrix, we have that $V(\tilde\beta) \geq V(\hat\beta)$.

This important result concerning the OLS estimator of the regression coefficients tells us that within the class of linear unbiased estimators of $\beta$, we will not find any other estimator with greater precision. It is a result applicable not only to the coefficients themselves, but also to linear combinations of them. Suppose we are interested in the estimation of a parameter $\theta$, such that:
\[
\theta = \underset{1\times k}{\alpha'}\ \underset{k\times 1}{\beta}
\]
For any given $\alpha$, it seems reasonable to use as an estimator $\hat\theta = \alpha'\hat\beta$, where $\hat\beta$ is the OLS estimator that by now we are already familiar with. As a matter of fact, if we consider any other estimator of the type $\tilde\theta = \alpha'\tilde\beta$, we would have that:
\[
V(\tilde\theta) = \alpha'V(\tilde\beta)\alpha = \alpha'\left[V(\hat\beta) + \sigma^2 DD'\right]\alpha = \underbrace{\alpha'V(\hat\beta)\alpha}_{V(\hat\theta)} + \sigma^2\alpha'DD'\alpha
\]


which brings us to:
\[
V(\tilde\theta) - V(\hat\theta) = \sigma^2\alpha'DD'\alpha \geq 0
\]
where $\alpha'DD'\alpha$ is a nonnegative scalar (since $DD'$ is positive semi-definite) and $\sigma^2$ is a positive scalar, which confirms our optimality result.

Other interesting properties of the OLS estimators are shown in the following examples.

Example 2 Consider the standard linear equation:

\[
y = X\beta + u
\]
Suppose we subtract the $n\times 1$ vector $X\alpha$ from both sides, obtaining a new regression equation:
\[
\underbrace{y - X\alpha}_{\tilde y} = X\underbrace{(\beta - \alpha)}_{\tilde\beta} + u \;\Rightarrow\; \tilde y = X\tilde\beta + u
\]

How does this affect our OLS estimator? Notice that the OLS estimator of our newly defined coefficient $\tilde\beta$ is given by:
\[
\hat{\tilde\beta} = (X'X)^{-1}X'\tilde y = (X'X)^{-1}X'(y - X\alpha) = (X'X)^{-1}X'y - (X'X)^{-1}X'X\alpha = \hat\beta - \alpha
\]
which gives us a rather intuitive result.

Example 3 Once again, consider the linear model:

\[
y = X\beta + u = \begin{bmatrix} X_1 & X_2 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + u = X_1\beta_1 + X_2\beta_2 + u
\]

But now suppose that $\beta_2$ is known to us. How can we properly estimate $\beta_1$? One way to do that is by defining a new model, in which we subtract $X_2\beta_2$ from both sides of the old equation:
\[
y - X_2\beta_2 = \tilde y = X_1\beta_1 + u
\]


We can then form a vector composed both of our OLS estimate of $\beta_1$ and of the known value of $\beta_2$:
\[
\hat\beta^{**} = \begin{bmatrix} \hat\beta_1 \\ \beta_2 \end{bmatrix}, \qquad \text{where } \hat\beta_1 = (X_1'X_1)^{-1}X_1'\tilde y
\]


Long and Short Regressions

Given $k$ explanatory variables, we can create a partition of the matrix $X$, regress $y$ on only the first $k_1$ explanatory variables, and compare the OLS coefficients $b_1$ of the regression with the short list of variables with the OLS coefficients $\hat\beta_1$ of the regression with the longer list of explanatory variables.
\[
X = \begin{bmatrix} \underset{n\times k_1}{X_1} & \underset{n\times(k-k_1)}{X_2} \end{bmatrix}
\]

Long regression: the regression of $y$ on all $k$ variables, which is what we have been doing so far. Its OLS coefficients and residuals satisfy
\[
y = X\hat\beta + e = \begin{bmatrix} X_1 & X_2 \end{bmatrix}\begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} + e = X_1\hat\beta_1 + X_2\hat\beta_2 + e
\]

Short regression: the regression of $y$ on a smaller number $k_1 < k$ of explanatory variables:
\[
y = X_1 b_1 + e_1^*
\]

The OLS estimator of coefficients from the short regression is given by:

\[
b_1 = (X_1'X_1)^{-1}X_1'y = (X_1'X_1)^{-1}X_1'\left[X_1\hat\beta_1 + X_2\hat\beta_2 + e\right] = \hat\beta_1 + (X_1'X_1)^{-1}X_1'X_2\hat\beta_2 \tag{10}
\]
where the last equality uses $X_1'e = 0$, since the long-regression residuals are orthogonal to every column of $X$.

From equation (10) we have that the OLS estimator $b_1$ of the short regression will be equal to the OLS estimator $\hat\beta_1$ of the long regression if and only if one of two conditions holds:

1. $\hat\beta_2 = 0$

2. $X_1'X_2 = 0$. The matrix $(X_1'X_1)^{-1}X_1'X_2$ contains in each column $j$ the $k_1$ estimated coefficients of the regression of the variable in the $j$-th column of $X_2$ on $X_1$. If these coefficients are equal to zero, the variables in $X_1$ are orthogonal to those in $X_2$.


The conditions under which we have equality between the short and long regression estimators become clearer when we look at the whole matrix $X$. By now we already know that the OLS estimator $\hat\beta$ is given by:
\[
\hat\beta = (X'X)^{-1}X'y = \left[(X_1, X_2)'(X_1, X_2)\right]^{-1}(X_1, X_2)'y
\]

With some matrix algebra, we get to:

\[
\hat\beta = \begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}^{-1}\begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix} = \begin{bmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{bmatrix}^{-1}\begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix} = \begin{bmatrix} (X_1'X_1)^{-1} & 0 \\ 0 & (X_2'X_2)^{-1} \end{bmatrix}\begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix}
\]

Notice that we used the assumption of matrix orthogonality, $X_1'X_2 = 0$, to transform the matrix $X'X$ into a block-diagonal matrix.

Therefore, it follows that:

\[
\hat\beta = \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = \begin{bmatrix} (X_1'X_1)^{-1}X_1'y \\ (X_2'X_2)^{-1}X_2'y \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
\]

A different way to write the OLS estimator of $\beta$, using our partition of $X$, is by defining the two following matrices:
\[
N_i = X_i(X_i'X_i)^{-1}X_i' \tag{11}
\]
\[
M_i = I - N_i \tag{12}
\]
where $i$ is the index of the short regression. Using these newly defined matrices, we can manipulate the regression equation in a very useful manner:

\[
\begin{aligned}
y &= X\beta + u \\
&= X_1\beta_1 + X_2\beta_2 + u \\
&= X_1\beta_1 + (M_1X_2 + N_1X_2)\beta_2 + u \\
&= X_1\left(\beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2\right) + M_1X_2\beta_2 + u \\
&= X_1\left(\beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2\right) + X_2^*\beta_2 + u
\end{aligned}
\]


where $X_2^* = M_1X_2$. In addition, notice that:
\[
M_1X_1 = \left(I - X_1(X_1'X_1)^{-1}X_1'\right)X_1 = 0
\]
So, we have that:
\[
X_2^{*\prime}X_1 = X_2'M_1X_1 = 0
\]

We wish to show that:

\[
b_2^* = (X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}y = \hat\beta_2
\]

Indeed, it follows that:

\[
\begin{aligned}
b_2^* &= (X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}\left(X_1\hat\beta_1 + X_2\hat\beta_2 + e\right) \\
&= \underbrace{(X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}X_1\hat\beta_1}_{=0} + (X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}X_2^*\hat\beta_2 + \underbrace{(X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}e}_{=0} \\
&= \hat\beta_2
\end{aligned}
\]
The last underbrace is zero because $M_1M = (I - N_1)(I - N) = I - N_1 - N + N_1N = I - N = M$ (using $N_1N = N_1$), so that $X_2^{*\prime}e = X_2'M_1My = X_2'My = X_2'e = 0$.
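A numerical sketch of this partialling-out result (hypothetical data; the claim checked is that regressing $y$ on $X_2^* = M_1X_2$ reproduces the long-regression coefficients $\hat\beta_2$):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, 2)) + 0.5 * X1[:, [1]]
X = np.hstack([X1, X2])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)             # full (long) regression
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # residual maker for X1
X2_star = M1 @ X2                                        # X2 purged of X1
b2_star = np.linalg.solve(X2_star.T @ X2_star, X2_star.T @ y)

print(beta_hat[2:], b2_star)                             # identical up to rounding
```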

Analogously, $b_1^* = (X_1^{*\prime}X_1^*)^{-1}X_1^{*\prime}y = \hat\beta_1$, where $X_1^* = M_2X_1$. Using the partition of $X$ that we have established, we can write the variance of the OLS estimators as:

\[
V\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \sigma^2\begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix}^{-1}
\]

It is also possible to write the variance of a subvector of the OLS estimator as a function of $X_2^*$:
\[
V(\hat\beta_2) = \sigma^2(X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}\,I\,X_2^*(X_2^{*\prime}X_2^*)^{-1} = \sigma^2(X_2^{*\prime}X_2^*)^{-1}
\]
Let us denote by $e_1^* = M_1y$ the residual vector of the short regression. Therefore, we have that:
\[
e_1^* = M_1\left(X_1\hat\beta_1 + X_2\hat\beta_2 + e\right) = e + M_1X_2\hat\beta_2 = e + X_2^*\hat\beta_2
\]

It follows that
\[
e_1^{*\prime}e_1^* = e'e + \hat\beta_2'X_2^{*\prime}X_2^*\hat\beta_2 \tag{13}
\]


Consequently, the sum of squared residuals of the long regression cannot exceed the sum of squared residuals of the short regression. So, we cannot improve the fit by shortening the list of explanatory variables.

Example 4 Let us consider the matrix representation of the bivariate regression model, with an intercept:

\[
\underset{n\times 1}{y} = \underset{n\times 1}{1_n}\,\underset{1\times 1}{\beta_1} + \underset{n\times 1}{X_2}\,\underset{1\times 1}{\beta_2} + \underset{n\times 1}{u}
\]
For this particular model, our partition will give us the matrix $X_1 = 1_n$ and the $n\times 1$ matrix $X_2$ with the observations of the explanatory variable.
\[
y = X_1\beta_1 + X_2\beta_2 + u, \qquad X_1 = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}, \qquad X = (X_1, X_2)
\]

The OLS estimator for the long regression is given by:

\[
\hat\beta = \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = (X'X)^{-1}X'y = \begin{bmatrix} 1_n'1_n & 1_n'X_2 \\ X_2'1_n & X_2'X_2 \end{bmatrix}^{-1}\begin{bmatrix} 1_n'y \\ X_2'y \end{bmatrix} = \begin{bmatrix} n & \sum_i x_{2i} \\ \sum_i x_{2i} & \sum_i x_{2i}^2 \end{bmatrix}^{-1}\begin{bmatrix} \sum_i y_i \\ \sum_i x_{2i}y_i \end{bmatrix}
\]
\[
\hat\beta = \frac{1}{n\sum_i x_{2i}^2 - \left(\sum_i x_{2i}\right)^2}\begin{bmatrix} \sum_i x_{2i}^2 & -\sum_i x_{2i} \\ -\sum_i x_{2i} & n \end{bmatrix}\begin{bmatrix} \sum_i y_i \\ \sum_i x_{2i}y_i \end{bmatrix}
\]

Since we are particularly interested in $\beta_2$, we have that:
\[
\hat\beta_2 = \frac{n\sum x_{2i}y_i - \sum x_{2i}\sum y_i}{n\sum_i x_{2i}^2 - \left(\sum x_{2i}\right)^2} = \frac{\overline{x_2y} - \bar x_2\,\bar y}{\overline{x_2^2} - \bar x_2^2} \tag{14}
\]


where the last equality follows from dividing both the numerator and the denominator by $n^2$. The alternative formula for the OLS estimator of $\beta_2$ is:

\[
b_2^* = (X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}y
\]
where
\[
X_2^* = M_1X_2 = \left(I - 1_n(1_n'1_n)^{-1}1_n'\right)X_2 = X_2 - \bar x_2\,1_n
\]
\[
X_2^{*\prime}X_2^* = \sum x_{2i}^2 - n^{-1}\left(\sum x_{2i}\right)^2, \qquad X_2^{*\prime}y = \sum x_{2i}y_i - n^{-1}\sum x_{2i}\sum y_i
\]

So, it follows that:

\[
b_2^* = \frac{\sum x_{2i}y_i - \bar x_2\sum y_i}{\sum_i x_{2i}^2 - \bar x_2\sum x_{2i}} = \frac{\overline{x_2y} - \bar x_2\,\bar y}{\overline{x_2^2} - \bar x_2^2} \tag{15}
\]

where the last equality comes from dividing both the denominator and the numerator by $n$. Notice the equivalence between formulas (14) and (15). As expected, we obtained the same expression for $\hat\beta_2$ and $b_2^*$.


Inference in Linear Regression Models

Once we have estimated the parameter values, we might be interested in turning to hypothesis testing. If we know the distribution of the residuals, we can use small sample results to test the null hypothesis. However, a more general approach, which does not make assumptions concerning the distribution of the residuals, consists in using the asymptotic distribution of the test statistics to draw a conclusion about $H_0$. We start by constructing our test statistic from the model equation.
\[
y = X\beta + u \;\Rightarrow\; y - X\beta = u
\]

If we premultiply both sides by (X′X)−1X′, we have:

\[
\underbrace{(X'X)^{-1}X'y}_{\hat\beta} - (X'X)^{-1}X'X\beta = (X'X)^{-1}X'u
\]
\[
\hat\beta - \beta = (X'X)^{-1}X'u \tag{16}
\]

Or, equivalently:

\[
\sqrt{n}(\hat\beta - \beta) = \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{\sqrt{n}}X'u = \left(\frac{1}{n}X'X\right)^{-\frac{1}{2}}(X'X)^{-\frac{1}{2}}X'u
\]

To be able to use asymptotic results, we need to make an important assumption concerning our $k\times k$ matrix $X'X$:
\[
\frac{1}{n}X'X \xrightarrow{p} B \tag{17}
\]

where $B$ is a positive definite matrix. If this assumption is valid, we can apply a multivariate version of the Lindeberg-Feller Central Limit Theorem to obtain:
\[
\frac{1}{\sqrt{n}}X'u \xrightarrow{d} N(0, \sigma^2 B) \tag{18}
\]


We can combine results (17) and (18) to get to:
\[
\left(\frac{1}{n}X'X\right)^{-\frac{1}{2}}\frac{1}{\sqrt{n}}X'u \xrightarrow{d} B^{-\frac{1}{2}}N(0, \sigma^2 B)
\]

which gives us the result we wanted:
\[
(X'X)^{-\frac{1}{2}}X'u \xrightarrow{d} N(0, \sigma^2 I_k) \tag{19}
\]

And if result (19) is valid, the Continuous Mapping Theorem assures us that:
\[
\frac{u'X(X'X)^{-\frac{1}{2}}(X'X)^{-\frac{1}{2}}X'u}{\sigma^2} \xrightarrow{d} \chi^2(k) \tag{20}
\]

In addition to that, we can substitute equation (16) into (19) and (20) to obtain:
\[
\frac{(\hat\beta - \beta)'(X'X)(\hat\beta - \beta)}{\sigma^2} = (\hat\beta - \beta)'V(\hat\beta)^{-1}(\hat\beta - \beta) \xrightarrow{d} \chi^2(k)
\]

And since, under some fairly general assumptions, $\hat\sigma^2$ is a consistent estimator of $\sigma^2$, we have that:
\[
\hat\sigma^2\left(\frac{1}{n}X'X\right)^{-1} \xrightarrow{p} \sigma^2 B^{-1}
\]

Therefore, it follows that:
\[
\frac{(\hat\beta - \beta)'(X'X)(\hat\beta - \beta)}{\hat\sigma^2} \xrightarrow{d} \chi^2(k)
\]
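A sketch of this statistic in Python (hypothetical data generated under the null, so the statistic should typically fall below the $\chi^2(k)$ critical value; scipy is assumed to be available for the quantile):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
n, k = 400, 3
X = rng.standard_normal((n, k))
beta0 = np.array([1.0, 0.0, -0.5])                      # value under the null
y = X @ beta0 + rng.standard_normal(n)                  # data generated under H0

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - k)

d = beta_hat - beta0
wald = d @ (X.T @ X) @ d / sigma2_hat                   # (b - b0)'(X'X)(b - b0) / sigma2_hat
print(wald, chi2.ppf(0.95, df=k))                       # compare with the chi^2(k) critical value
```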

Furthermore, if our test does not concern the entire vector $\beta$ but only some part of it, say $\beta_2$, it is easy to derive asymptotic results for it, by premultiplying our test statistic by a selection matrix:
\[
\sqrt{n}(\hat\beta_2 - \beta_2) = \sqrt{n}\begin{bmatrix} 0 & I_{k_2} \end{bmatrix}(\hat\beta - \beta)
\]

which gives us:
\[
\sqrt{n}(b_2^* - \beta_2) = \left(\frac{1}{n}X_2^{*\prime}X_2^*\right)^{-\frac{1}{2}}(X_2^{*\prime}X_2^*)^{-\frac{1}{2}}X_2^{*\prime}u
\]


\[
(X_2^{*\prime}X_2^*)^{-\frac{1}{2}}X_2^{*\prime}u \xrightarrow{d} N(0, \sigma^2 I_{k_2})
\]
\[
\frac{u'X_2^*(X_2^{*\prime}X_2^*)^{-\frac{1}{2}}(X_2^{*\prime}X_2^*)^{-\frac{1}{2}}X_2^{*\prime}u}{\sigma^2} \xrightarrow{d} \chi^2(k_2)
\]

Or, equivalently:
\[
\frac{(b_2^* - \beta_2)'(X_2^{*\prime}X_2^*)(b_2^* - \beta_2)}{\sigma^2} \xrightarrow{d} \chi^2(k_2)
\]
and, replacing $\sigma^2$ with its consistent estimator,
\[
\frac{(b_2^* - \beta_2)'(X_2^{*\prime}X_2^*)(b_2^* - \beta_2)}{\hat\sigma^2} \xrightarrow{d} \chi^2(k_2)
\]

These newly obtained asymptotic results can be used in hypothesis testing. For example, if we wish to test $H_0: \beta_2 = \beta_{2,0}$, we use:
\[
\frac{(b_2^* - \beta_{2,0})'X_2^{*\prime}X_2^*(b_2^* - \beta_{2,0})}{\hat\sigma^2} = \left[\frac{e'e}{n - k}\right]^{-1}\times\left[\tilde e_1'\tilde e_1 - e'e\right] \xrightarrow{d} \chi^2(k_2)
\]
where $\tilde e_1 = M_1\tilde y$ are the residuals of the regression of the modified dependent variable $\tilde y = y - X_2\beta_{2,0}$ on $X_1$. For the particular case in which $\beta_{2,0} = 0$, we have that:

\[
\frac{b_2^{*\prime}X_2^{*\prime}X_2^*b_2^*}{\hat\sigma^2} = \left[\frac{e'e}{n - k}\right]^{-1}\times\left[e_1^{*\prime}e_1^* - e'e\right] \xrightarrow{d} \chi^2(k_2)
\]

where the above equality follows from:
\[
e_1^* = M_1y = M_1\left(X_1\hat\beta_1 + X_2\hat\beta_2 + e\right) = e + X_2^*b_2^*
\]
\[
e_1^{*\prime}e_1^* = e'e + b_2^{*\prime}X_2^{*\prime}X_2^*b_2^*
\]
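The sum-of-squared-residuals form of the test can be sketched as follows (hypothetical data with $\beta_2 = 0$, so the statistic should typically be below the $\chi^2(k_2)$ critical value):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
n, k1, k2 = 400, 2, 2
X1 = rng.standard_normal((n, k1))
X2 = rng.standard_normal((n, k2))
X = np.hstack([X1, X2])
y = X1 @ np.array([1.0, -1.0]) + rng.standard_normal(n)    # beta2 = 0 holds here

def ssr(Z, y):
    """Sum of squared residuals from regressing y on Z."""
    b = np.linalg.solve(Z.T @ Z, Z.T @ y)
    e = y - Z @ b
    return e @ e

ee_long, ee_short = ssr(X, y), ssr(X1, y)
sigma2_hat = ee_long / (n - (k1 + k2))
stat = (ee_short - ee_long) / sigma2_hat                   # (e1*'e1* - e'e) / sigma2_hat
print(stat, chi2.ppf(0.95, df=k2))                         # ~ chi^2(k2) under H0: beta2 = 0
```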

These important asymptotic results can be used to test the value of any linear function of the parameters of the model. Suppose that $H_0$ consists of $p$ different hypotheses concerning linear combinations of the elements of $\beta$. These hypotheses can be summarized by the $p\times k$ matrix $H$, so that the hypothesis we wish to test can be written as:
\[
H_0: \underset{p\times k}{H}\,\underset{k\times 1}{\beta} = \underset{p\times 1}{\theta_0}
\]


such that $H$ has rank $p$. We can then create a larger matrix $F$, formed by $H$ as a submatrix, together with another submatrix $L$, in the following manner:
\[
F\beta = \begin{bmatrix} \underset{(k-p)\times k}{L} \\[4pt] \underset{p\times k}{H} \end{bmatrix}\beta = \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix}
\]

Because we wish to test $H_0$, it is important that the submatrix $L$ is such that $F$ is an invertible matrix, so we can modify our model to become:
\[
y = XF^{-1}F\beta + u = Z\gamma + u, \qquad \text{where } Z = XF^{-1} = \begin{bmatrix} Z_1 & Z_2 \end{bmatrix} \text{ and } \gamma = F\beta
\]

In a previous example, we were interested in testing the hypothesis that the last $k_2$ of the $k$ elements of $\beta$ are equal to 0. To test this null hypothesis, we can use the following matrix $H$:
\[
H = \begin{bmatrix} \underset{k_2\times k_1}{0} & \underset{k_2\times k_2}{I_{k_2}} \end{bmatrix} = \begin{bmatrix} 0 & 0 & \ldots & 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & \ldots & 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 0 & 0 & 0 & \ldots & 1 \end{bmatrix}
\]

Typically, we divide the matrix $H$ into two submatrices, $H_1$ and $H_2$:
\[
H = \begin{bmatrix} \underset{p\times(k-p)}{H_1} & \underset{p\times p}{H_2} \end{bmatrix}
\]

H2 must be invertible, so that we can compute the inverse matrix of  F:

\[
F = \begin{bmatrix} I_{k-p} & 0 \\ H_1 & H_2 \end{bmatrix}, \qquad F^{-1}F = \begin{bmatrix} I_{k-p} & 0 \\ -H_2^{-1}H_1 & H_2^{-1} \end{bmatrix}\begin{bmatrix} I_{k-p} & 0 \\ H_1 & H_2 \end{bmatrix} = \begin{bmatrix} I_{k-p} & 0 \\ 0 & I_p \end{bmatrix}
\]


Starting from these matrices, we can adapt our regression model:
\[
H_0: H\beta = \theta_0 \;\Rightarrow\; H_0: \gamma_2 = \theta_0
\]
\[
\underbrace{y - Z_2\theta_0}_{\tilde y} = Z_1\gamma_1 + Z_2\underbrace{(\gamma_2 - \theta_0)}_{\tilde\gamma_2} + u
\]
The last step is to test the new null hypothesis, $H_0: \tilde\gamma_2 = 0$, which is something we already know how to do.

Example 5 Consider a multiple regression model with three explanatory variables ($X$ is an $n\times 3$ matrix):
\[
y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} \in \mathbb{R}^3
\]
Suppose we are interested in testing the following null hypothesis:
\[
H_0: \beta_1 + \beta_2 + \beta_3 = 0 \;\Rightarrow\; H_0: 1_3'\beta = 0
\]
To test this hypothesis we use the submatrix $H = 1_3'$. And since we are interested only in $H$, we can choose $L$ in a way such that $F$ is invertible in a simple manner:
\[
F = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}
\]
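A sketch of this reparametrization for Example 5 (hypothetical data satisfying the null; the third element of $\hat\gamma = F\hat\beta$ estimates $\beta_1 + \beta_2 + \beta_3$, and the restriction is tested by comparing the long and short regressions in the transformed variables):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
n, k = 300, 3
X = rng.standard_normal((n, k))
beta = np.array([1.0, 0.5, -1.5])                        # satisfies beta1 + beta2 + beta3 = 0
y = X @ beta + rng.standard_normal(n)

F = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])                          # last row is H = (1, 1, 1)
Z = X @ np.linalg.inv(F)                                 # reparametrized regressors, gamma = F beta

def fit(Z, y):
    b = np.linalg.solve(Z.T @ Z, Z.T @ y)
    e = y - Z @ b
    return b, e @ e

gamma_hat, ee_long = fit(Z, y)                           # gamma_hat[2] estimates beta1+beta2+beta3
_, ee_short = fit(Z[:, :2], y)                           # regression imposing gamma_2 = 0
sigma2_hat = ee_long / (n - k)
stat = (ee_short - ee_long) / sigma2_hat
print(gamma_hat[2], stat, chi2.ppf(0.95, df=1))          # test H0: beta1 + beta2 + beta3 = 0
```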


Relaxing the assumptions of the classical model

Until now, we have studied the linear regression model under the classical assumptions. One of these assumptions is the nonstochastic nature of the explanatory variables. Now we leave it aside and let $X$ be a random matrix. This model starts from the following assumptions:

1. $E(y\,|\,X) = X\beta$

2. $X$ is stochastic

3. $V(y\,|\,X) = \sigma^2 I_n$

4. $X$ has rank $k$.

It is easy to see that the OLS estimator remains unbiased if  X is random:

\[
E(\hat\beta) = E_X\left[E(\hat\beta\,|\,X)\right] = E_X\left[(X'X)^{-1}X'E(y\,|\,X)\right] = E_X\left[(X'X)^{-1}X'X\beta\right] = \beta
\]

The variance of our estimator, however, does change under this new assumption. From the conditional variance identity, we have that:
\[
V(\hat\beta) = E_X\left[V(\hat\beta\,|\,X)\right] + \underbrace{V_X\left[E(\hat\beta\,|\,X)\right]}_{=0} \tag{21}
\]
where the conditional variance is given by:
\[
V(\hat\beta\,|\,X) = V\left((X'X)^{-1}X'y\,|\,X\right) = (X'X)^{-1}X'V(y\,|\,X)X(X'X)^{-1}
\]

If $V(y\,|\,X) = V(u\,|\,X) = \sigma^2 I_n$, we have that:
\[
V(\hat\beta\,|\,X) = (X'X)^{-1}X'\sigma^2 I_n X(X'X)^{-1} = \sigma^2(X'X)^{-1}
\]

Substituting this result into equation (21), we get:
\[
V(\hat\beta) = \sigma^2 E\left[(X'X)^{-1}\right], \qquad V\left(\sqrt{n}(\hat\beta - \beta)\right) = \sigma^2 E\left[\left(\frac{1}{n}X'X\right)^{-1}\right]
\]


On the one hand, since we did not give any details about the distribution of the $x_i$'s, we cannot say more about the variance of $\sqrt{n}(\hat\beta - \beta)$ in small samples.

On the other hand, notice that to get to this new formula for the variance of our OLS estimator we made a crucial hypothesis concerning the variance of $u$: we maintained the classical assumption that $V(u) = \sigma^2 I_n$. Nonetheless, if we assume a different variance matrix for $u$, we move further away from our initial model, but make our results more general. Suppose that:
\[
V(u\,|\,X) = \Sigma(\theta) \neq \sigma^2 I_n, \qquad \theta \in \mathbb{R}^d
\]

where $\Sigma(\theta)$ is a positive definite matrix.

We start with the pure heteroskedasticity case. Assume that the variance matrix of $u$ takes the following form:
\[
V(u\,|\,X) = \begin{bmatrix} \sigma_1^2 & 0 & \ldots & 0 \\ 0 & \sigma_2^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_n^2 \end{bmatrix}
\]

Then the variance matrix of our estimator of $\beta$, conditional on the observed value of $X$, is given by:
\[
V(\hat\beta\,|\,X) = (X'X)^{-1}X'\begin{bmatrix} \sigma_1^2 & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & \sigma_n^2 \end{bmatrix}X(X'X)^{-1} = \left(\sum_i x_ix_i'\right)^{-1}\left(\sum_i \sigma_i^2 x_ix_i'\right)\left(\sum_i x_ix_i'\right)^{-1}
\]

It follows that:
\[
V\left(\sqrt{n}(\hat\beta - \beta)\,|\,X\right) = \left(\frac{\sum x_ix_i'}{n}\right)^{-1}\left(\frac{\sum x_ix_i'\sigma_i^2}{n}\right)\left(\frac{\sum x_ix_i'}{n}\right)^{-1} \tag{22}
\]

\[
\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N\!\left(0,\ \left(E(x_ix_i')\right)^{-1}\,\lim_{n\to\infty} n^{-1}\sum_i E\left(x_ix_i'u_i^2\right)\,\left(E(x_ix_i')\right)^{-1}\right)
\]
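In practice the $\sigma_i^2$ in (22) are unknown; a standard plug-in, going beyond what is derived above, replaces them with squared OLS residuals (the "sandwich" form). A hedged sketch with a hypothetical heteroskedastic design:

```python
import numpy as np

rng = np.random.default_rng(10)
n, k = 1000, 2
X = np.column_stack([np.ones(n), rng.uniform(1.0, 4.0, size=n)])
u = rng.standard_normal(n) * X[:, 1]                     # Var(u_i | x_i) grows with x_i
y = X @ np.array([1.0, 0.5]) + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * e[:, None]**2).T @ X                         # sum_i e_i^2 x_i x_i'
V_robust = XtX_inv @ meat @ XtX_inv                      # sandwich estimator of V(beta_hat)
V_classic = (e @ e / (n - k)) * XtX_inv                  # classical estimator, assumes homoskedasticity
print(np.sqrt(np.diag(V_robust)), np.sqrt(np.diag(V_classic)))
```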


For the more general case, we have the following conditional variance matrix:
\[
V(u\,|\,X) = \begin{bmatrix} \sigma_{11} & \ldots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \ldots & \sigma_{nn} \end{bmatrix}
\]

which leads to
\[
\sqrt{n}(\hat\beta - \beta) = \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{\sqrt{n}}X'u = \left(\frac{1}{n}\sum_i x_ix_i'\right)^{-1}\frac{1}{\sqrt{n}}\sum_i x_iu_i
\]

and the following result for the variance:
\[
V\left(\sqrt{n}(\hat\beta - \beta)\right) = E\left[\left(\frac{1}{n}\sum_i x_ix_i'\right)^{-1}\left(\frac{1}{n}\sum_i\sum_j x_ix_j'\sigma_{ij}\right)\left(\frac{1}{n}\sum_i x_ix_i'\right)^{-1}\right]
\]

Notice that although we have a new variance matrix $\Sigma$, our least squares estimator remains unbiased:
\[
E(\hat\beta\,|\,X) = E\left((X'X)^{-1}X'y\,|\,X\right) = (X'X)^{-1}X'E(y\,|\,X) = (X'X)^{-1}X'X\beta = \beta
\]
The variance of $\hat\beta$, however, does not remain the same:
\[
V(\hat\beta\,|\,X) = (X'X)^{-1}X'V(u\,|\,X)X(X'X)^{-1} = (X'X)^{-1}X'\Sigma(\theta)X(X'X)^{-1}
\]

In addition to that, the new variance matrix $\Sigma(\theta)$ affects our asymptotic results, especially when the asymptotic variance of the score differs from the probability limit of the Hessian matrix. To see that, let us repeat the process we have previously gone through to find the asymptotic distribution of the OLS estimator. Starting from our usual objective function:
\[
Q_n(\beta) = \frac{1}{2n}(y - X\beta)'(y - X\beta)
\]

we obtain the score statistic and the Hessian matrix:
\[
S_n(\beta) = \frac{\partial Q_n(\beta)}{\partial\beta} = -\frac{1}{n}X'(y - X\beta)
\]


\[
H_n(\beta) = \frac{\partial^2 Q_n(\beta)}{\partial\beta\,\partial\beta'} = \frac{1}{n}X'X
\]

Once again, we make an important assumption concerning $X'X$:
\[
H_n(\beta) = \frac{1}{n}X'X \xrightarrow{p} B, \qquad \text{where } B \text{ is a positive definite matrix}
\]

Similarly to before, we can use the Lindeberg-Feller CLT:
\[
\sqrt{n}\,S_n(\beta^*) = -\frac{1}{\sqrt{n}}X'u = -\frac{1}{\sqrt{n}}\sum_i x_iu_i \xrightarrow{d} N(0, A),
\]
where $\mathrm{plim}\ \frac{1}{n}X'\Sigma X = A$. So it follows that:

\[
\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, B^{-1}AB^{-1})
\]
In detail:
\[
0 \approx S_n(\hat\beta) \approx S_n(\beta^*) + H_n(\beta^*)(\hat\beta - \beta^*) \;\Rightarrow\; \sqrt{n}(\hat\beta - \beta^*) \approx -H_n(\beta^*)^{-1}\sqrt{n}\,S_n(\beta^*)
\]
We can then conclude that the OLS estimator will in general be asymptotically inefficient, unless $B$ is proportional to $A$.

It may be useful to compare the OLS estimator to the ML estimator when we do not have homoskedasticity. Consider the joint probability density function of $y$:
\[
f(y; \beta, \Sigma) = f(y\,|\,X; \beta, \Sigma)\,g(X), \qquad \ln f(y; \beta, \Sigma) = \ln f(y\,|\,X; \beta, \Sigma) + \ln g(X)
\]

The vector $\beta$ that maximizes the pdf $f(y; \beta, \Sigma)$ also maximizes the conditional function $f(y\,|\,X; \beta, \Sigma)$. If we assume that $u\,|\,X \sim N(0, \Sigma)$, it follows that $y\,|\,X \sim N(X\beta, \Sigma)$. So we have that:
\[
f(y\,|\,X; \beta, \Sigma) = (2\pi)^{-\frac{n}{2}}\det(\Sigma)^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta)\right\}
\]
So, the maximum likelihood estimator will be given by:
\[
\hat\beta_{ml} = \arg\min_{\beta}\ \frac{1}{2n}(y - X\beta)'\Sigma^{-1}(y - X\beta)
\]


This optimization problem, which renders the ML estimator when the residuals are normally distributed, is identical to the optimization problem that gives us the Generalized Least Squares (GLS) estimator:
\[
\hat\beta_{gls} = \arg\min_{\beta}\ \frac{1}{2n}(y - X\beta)'\Sigma^{-1}(y - X\beta)
\]

which gives us the following first-order condition:
\[
-\frac{1}{n}X'\Sigma^{-1}(y - X\hat\beta_{gls}) = 0
\]

Under certain conditions, we can obtain the GLS estimator:
\[
\hat\beta_{gls} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y
\]

Under the new assumption concerning the variance matrix, the GLS estimator is still unbiased:
\[
E(\hat\beta_{gls}\,|\,X) = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}E(y\,|\,X) = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}X\beta = \beta
\]

As expected, however, the variance of the estimator is altered by the new variance matrix of the residuals:
\[
V(\hat\beta_{gls}\,|\,X) = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\Sigma\Sigma^{-1}X(X'\Sigma^{-1}X)^{-1} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}X(X'\Sigma^{-1}X)^{-1} = (X'\Sigma^{-1}X)^{-1} \tag{23}
\]

Remember that the variance of the OLS estimator is:
\[
V(\hat\beta_{ols}\,|\,X) = (X'X)^{-1}X'\Sigma X(X'X)^{-1} \tag{24}
\]

In general, these two variance matrices need not be equal. However, for the particular case in which $\Sigma = \sigma^2 I_n$, we have that:
\[
\hat\beta_{ml} = \hat\beta_{gls} = \left(X'(\sigma^2 I_n)^{-1}X\right)^{-1}X'(\sigma^2 I_n)^{-1}y = (X'X)^{-1}X'y = \hat\beta_{ols}
\]

In addition to that, our GLS estimator recovers asymptotic optimality properties that were lost by the least squares estimator once we altered the variance of the residuals. First, let us look at the score statistic:
\[
S_n(\beta) = -\frac{1}{n}X'\Sigma^{-1}(y - X\beta)
\]


\[
\sqrt{n}\,S_n(\beta^*) = -\frac{1}{\sqrt{n}}X'\Sigma^{-1}(y - X\beta^*) \xrightarrow{d} N(0, A)
\]
\[
H_n(\beta) = \frac{\partial S_n(\beta)}{\partial\beta} = \frac{1}{n}X'\Sigma^{-1}X \xrightarrow{p} A
\]
where $A = \mathrm{plim}\ \frac{1}{n}X'\Sigma^{-1}X$. So we have that:

\[
\sqrt{n}(\hat\beta - \beta^*) \approx -H_n(\beta^*)^{-1}\sqrt{n}\,S_n(\beta^*) \xrightarrow{d} N(0, A^{-1})
\]

The idea behind the GLS estimator is to transform the model in a way that the variance of the adjusted residuals is proportional to $I_n$, so that we can apply the Gauss-Markov theorem. In other words, we seek an optimality result that does not depend on the normality of the residuals, obtained by restricting the class of estimators at which we look. To do so, let us premultiply the regression equation by $\Sigma^{-\frac{1}{2}}$:
\[
\underbrace{\Sigma^{-\frac{1}{2}}y}_{y^*} = \underbrace{\Sigma^{-\frac{1}{2}}X}_{X^*}\beta + \underbrace{\Sigma^{-\frac{1}{2}}u}_{u^*}, \qquad V(u\,|\,X) = \underset{n\times n}{\Sigma}
\]

Since $\Sigma$ is a positive definite matrix, we can write it as:
\[
\Sigma = P\Lambda P', \qquad \text{where } P'P = I_n \text{ and } \Lambda \text{ is a diagonal matrix}
\]
\[
\Sigma^{\frac{1}{2}} = P\Lambda^{\frac{1}{2}}P' \;\Rightarrow\; \Sigma^{\frac{1}{2}}\Sigma^{\frac{1}{2}} = P\Lambda^{\frac{1}{2}}P'P\Lambda^{\frac{1}{2}}P' = P\Lambda P' = \Sigma
\]

Our new regression equation is:
\[
y^* = X^*\beta + u^*
\]
For this new modified model, the variance matrix of the residuals becomes:
\[
V(u^*) = E(u^*u^{*\prime}) = E\left(\Sigma^{-\frac{1}{2}}uu'\Sigma^{-\frac{1}{2}}\right) = \Sigma^{-\frac{1}{2}}E(uu')\Sigma^{-\frac{1}{2}} = \Sigma^{-\frac{1}{2}}\Sigma\Sigma^{-\frac{1}{2}} = I_n
\]

Therefore, all the requirements of the classical regression model are met, and we can apply the Gauss-Markov theorem to $y^* = X^*\beta + u^*$:
\[
\hat\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = \left(X'\Sigma^{-\frac{1}{2}}\Sigma^{-\frac{1}{2}}X\right)^{-1}X'\Sigma^{-\frac{1}{2}}\Sigma^{-\frac{1}{2}}y = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y = \hat\beta_{gls}
\]


The generalized least squares estimator $\hat\beta_{gls}$ is the minimum variance linear unbiased estimator of $\beta$, a result known as Aitken's Theorem.

Although it is an interesting extension of the OLS estimator, the GLS estimator requires knowledge of $\Sigma$. However, we rarely know $\Sigma$, which means that testing hypotheses and constructing confidence intervals is a very complicated matter. What is left to do is to estimate this matrix and substitute $\Sigma$ in the GLS estimator formula with its estimator, $\hat\Sigma$. Nonetheless, for estimates of $\Sigma$, the resulting statistic may have a rather complicated distribution in small samples, leaving us no alternative but to focus on its large sample properties. This new estimator of $\beta$, with $\hat\Sigma$ instead of $\Sigma$, is called the Feasible Generalized Least Squares (FGLS) estimator.

\[
\hat\beta_{fgls} = \left(X'\hat\Sigma(\hat\theta)^{-1}X\right)^{-1}X'\hat\Sigma(\hat\theta)^{-1}y = \beta + \left(X'\hat\Sigma^{-1}X\right)^{-1}X'\hat\Sigma^{-1}u
\]
The properties of the FGLS estimator will of course depend on the properties of the estimator of the variance matrix, $\hat\Sigma$.

\[
\mathrm{plim}\ \hat\beta_{fgls} = \beta + \mathrm{plim}\left[\left(\frac{1}{n}X'\hat\Sigma^{-1}X\right)^{-1}\frac{1}{n}X'\hat\Sigma^{-1}u\right] = \beta + \left[\mathrm{plim}\ \frac{1}{n}X'\hat\Sigma^{-1}X\right]^{-1}\mathrm{plim}\ \frac{1}{n}X'\hat\Sigma^{-1}u
\]

However, if we choose a consistent estimator $\hat\Sigma$, under general conditions we can be sure that the FGLS estimator of $\beta$ will have the same asymptotic distribution as the GLS estimator $\hat\beta_{gls}$:
\[
\sqrt{n}(\hat\beta_{fgls} - \beta^*) \xrightarrow{d} N(0, A^{-1})
\]