

    Ordinary Least Squares: A Review

Rómulo A. Chumacero*

    February 2005

1 Introduction

Ragnar Frisch (one of the founders of the Econometric Society) is credited with coining the term 'econometrics'. Econometrics aims at giving empirical content to economic relationships by uniting three key ingredients: economic theory, economic data, and statistical methods. Neither 'theory without measurement' nor 'measurement without theory' is sufficient for explaining economic phenomena. It is their union that is the key to understanding economic relationships.

Social scientists generally must accept the conditions under which their subjects act and the responses occur. As economic data come almost exclusively from non-experimental sources, researchers cannot specify or choose the level of a stimulus and record the outcome. They can only observe the natural experiments that take place. In this sense economics, like meteorology, is an observational science.

For example, many economists have studied the influence of monetary policy on macroeconomic conditions, yet the effects of actions by central banks continue to be widely debated. Some of the controversies would be removed if a central bank could experiment with monetary policy over repeated trials under identical conditions, thus being able to isolate the effects of policy more accurately. However, no one can turn back the clock to try various policies under essentially the same conditions. Each time a central bank contemplates an action, it faces a new set of conditions. The actors and technologies have all changed. The social, economic, and political orders are different. To learn about one aspect of the economic world, one must take into account many others. To apply past experience effectively, one must take into account similarities and differences between the past, present, and future.

Here we will review some of the finite sample properties of the most basic and popular estimation procedure in econometrics (Ordinary Least Squares, OLS for short). The document is organized as follows: Section 2 describes the general framework for regression analysis. Section 3 derives the OLS estimator and discusses its properties.

*Department of Economics of the University of Chile and Research Department of the Central Bank of Chile. E-mail address: rchumace@econ.uchile.cl


Section 4 considers OLS estimation subject to linear constraints. Section 5 develops tests for linear constraints. Finally, Section 6 discusses the issue of prediction within the OLS context.

2 Preliminaries

An econometrician has the observational data $\{w_1, w_2, \ldots, w_T\}$, where each $w_t$ is a vector of data. Partition $w_t = (y_t, x_t)$, where $y_t \in \mathbb{R}$ and $x_t \in \mathbb{R}^k$. Let the joint density of the variables be given by $f(y_t, x_t, \theta)$, where $\theta$ is a vector of unknown parameters.

In econometrics we are often interested in the conditional distribution of one set of random variables given another set of random variables (e.g., the conditional distribution of consumption given income, or the conditional distribution of wages given individual characteristics). Recalling that the joint density can be written as the product of the conditional density and the marginal density, we have:

$$f(y_t, x_t, \theta) = f(y_t \mid x_t, \theta_1)\, f(x_t, \theta_2),$$

where $f(x_t, \theta_2) = \int_{-\infty}^{\infty} f(y_t, x_t, \theta)\, dy$ is the marginal density of $x$.

Regression analysis can be defined as statistical inference on $\theta_1$. For this purpose we can ignore $f(x_t, \theta_2)$, provided there is no relationship between $\theta_1$ and $\theta_2$.¹ In this framework, $y$ is called the 'dependent' or 'endogenous' variable and the vector $x$ is called the vector of 'independent' or 'exogenous' variables.

In regression analysis we usually want to estimate only the first and second moments of the conditional distribution, rather than the whole parameter vector $\theta_1$ (in certain cases the first two moments characterize $\theta_1$ completely). Thus we can define the conditional mean $m(x_t, \theta_3)$ and conditional variance $g(x_t, \theta_4)$ as

$$m(x_t, \theta_3) = E(y_t \mid x_t, \theta_3) = \int_{-\infty}^{\infty} y\, f(y \mid x_t, \theta_1)\, dy,$$

$$g(x_t, \theta_4) = \int_{-\infty}^{\infty} y^2 f(y \mid x_t, \theta_1)\, dy - \left[m(x_t, \theta_3)\right]^2.$$

The conditional mean and variance are random variables, as they are functions of the random vector $x_t$. If we define $u_t$ as the difference between $y_t$ and its conditional mean,

$$u_t = y_t - m(x_t, \theta_3),$$

we obtain:

$$y_t = m(x_t, \theta_3) + u_t. \qquad (1)$$

Other than $(y_t, x_t)$ having a joint density, no assumptions have been made to develop (1).

¹In this case we say that $x$ is "weakly exogenous" for $\theta_1$.


Proposition 1 (Properties of $u_t$)
1. $E(u_t \mid x_t) = 0$,
2. $E(u_t) = 0$,
3. $E[h(x_t)u_t] = 0$ for any function $h(\cdot)$,
4. $E(x_t u_t) = 0$.

Proof. 1. By the definition of $u_t$ and the linearity of conditional expectations,²

$$E(u_t \mid x_t) = E[y_t - m(x_t) \mid x_t] = E[y_t \mid x_t] - E[m(x_t) \mid x_t] = m(x_t) - m(x_t) = 0.$$

2. By the law of iterated expectations and the first result,³

$$E(u_t) = E[E(u_t \mid x_t)] = E(0) = 0.$$

3. By essentially the same argument,

$$E[h(x_t)u_t] = E\big[E[h(x_t)u_t \mid x_t]\big] = E\big[h(x_t)E[u_t \mid x_t]\big] = E[h(x_t)\cdot 0] = 0.$$

4. Follows from the third result, setting $h(x_t) = x_t$.

Equation (1) plus the first result of Proposition 1 are often stated jointly as the regression framework:

$$y_t = m(x_t, \theta_3) + u_t, \qquad E(u_t \mid x_t) = 0.$$

This is a framework, not a model, because no restrictions have been placed on the joint distribution of the data. These equations hold true by definition.

Given that the moments $m(\cdot)$ and $g(\cdot)$ can take any shape (usually nonlinear), a regression model imposes further restrictions on the joint distribution and on $u$ (the regression error). If we assume that $m(\cdot)$ is linear we obtain what is known as the linear regression model:

$$m(x_t, \theta_3) = x_t'\beta,$$

where $\beta$ is a $k$-element vector. Finally, let

$$\underset{T\times 1}{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_T \end{pmatrix}, \qquad
\underset{T\times k}{X} = \begin{pmatrix} x_1' \\ \vdots \\ x_T' \end{pmatrix}
= \begin{pmatrix} x_{1,1} & \cdots & x_{1,k} \\ \vdots & \ddots & \vdots \\ x_{T,1} & \cdots & x_{T,k} \end{pmatrix}, \qquad
\underset{T\times 1}{u} = \begin{pmatrix} u_1 \\ \vdots \\ u_T \end{pmatrix}.$$

²The linearity of conditional expectations states that $E[g(x)y \mid x] = g(x)E[y \mid x]$.
³The law of iterated expectations states that $E[E[y \mid x, z] \mid x] = E[y \mid x]$.


Definition 1 The Linear Regression Model (LRM) is:
1. $y_t = x_t'\beta + u_t$ or $Y = X\beta + u$,
2. $E(u_t \mid x_t) = 0$,
3. $\mathrm{rank}(X) = k$ or $\det(X'X) \neq 0$,
4. $E(u_t u_s) = 0$ for all $t \neq s$.

The most important assumption of the model is the linearity of the conditional expectation. Furthermore, this framework considers that $x$ provides no information for forecasting $u$ and that $X$ is of full rank. Finally, it is assumed that $u_t$ is uncorrelated with $u_s$.⁴

Definition 2 The Homoskedastic Linear Regression Model (HLRM) is the LRM plus
5. $E(u_t^2 \mid x_t) = \sigma^2$ or $E(uu' \mid X) = \sigma^2 I_T$.

This model adds the auxiliary assumption that $g(\cdot)$ is conditionally homoskedastic.

Definition 3 The Normal Linear Regression Model (NLRM) is the LRM plus
6. $u_t \sim N(0, \sigma^2)$.

By imposing an additional assumption, this model has the advantage that exact distributional results are available for the OLS estimators and test statistics. It is not very popular in current econometric practice and, as we will see, is not necessary to derive most of the results that follow.

3 OLS Estimation

This section defines the OLS estimator of $\beta$ and shows that it is the best linear unbiased estimator. The estimation of the error variance is also discussed.

3.1 Definition of the OLS Estimators of $\beta$ and $\sigma^2$

Define the sum of squares of the residuals (SSR) function as:

$$S_T(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - 2Y'X\beta + \beta'X'X\beta.$$

The OLS estimator $\hat\beta$ minimizes $S_T(\beta)$. The First Order Necessary Conditions (FONC) for minimization are:

$$\left.\frac{\partial S_T(\beta)}{\partial \beta}\right|_{\hat\beta} = -2X'Y + 2X'X\hat\beta = 0,$$

which yield the normal equations $X'Y = X'X\hat\beta$.

⁴Occasionally, we will make the assumption of serial independence of $\{u_t\}$, which is stronger than no correlation, although both concepts are equivalent when $u$ is normal.


Proposition 2 $\arg\min_\beta S_T(\beta) = \hat\beta = (X'X)^{-1}(X'Y)$.

Proof. Using the normal equations we obtain $\hat\beta = (X'X)^{-1}(X'Y)$. To verify that $\hat\beta$ is indeed a minimum we evaluate the Second Order Sufficient Conditions (SOSC):

$$\left.\frac{\partial^2 S_T(\beta)}{\partial \beta\, \partial \beta'}\right|_{\hat\beta} = 2X'X,$$

which show that $\hat\beta$ is a minimum, as $X'X$ is a positive definite matrix.

Three important implications are derived from this theorem: First, $\hat\beta$ is a linear function of $Y$. Second, even if $X$ is a nonstochastic matrix, $\hat\beta$ is a random variable, as it depends on $Y$, which is itself a random variable. Finally, in order to obtain the OLS estimator we require $X'X$ to be of full rank.

Given $\hat\beta$, we define

$$\hat u = Y - X\hat\beta, \qquad (2)$$

and call it the vector of least squares residuals. Using $\hat u$, we can estimate $\sigma^2$ by $\hat\sigma^2 = T^{-1}\hat u'\hat u$.

Using (2), we can write

$$Y = X\hat\beta + \hat u = PY + MY,$$

where $P = X(X'X)^{-1}X'$ and $M = I - P$. Given that $\hat u$ is orthogonal to $X$ (that is, $\hat u'X = 0$), OLS can be regarded as decomposing $Y$ into two orthogonal components: a component that can be written as a linear combination of the column vectors of $X$ and a component that is orthogonal to $X$. Alternatively, we can call $PY$ the projection of $Y$ onto the space spanned by the column vectors of $X$ and $MY$ the projection of $Y$ onto the space orthogonal to $X$. These properties are illustrated in Figure 1.⁵
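To make these objects concrete, here is a minimal numerical sketch (not part of the original notes; the data are simulated and all names are illustrative) that computes $\hat\beta$ from the normal equations, builds $P$ and $M$, and checks the orthogonality and idempotency properties just described.

```python
# Illustrative sketch of OLS and the projection matrices P and M (simulated data).
import numpy as np

rng = np.random.default_rng(0)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])  # includes a constant
beta = np.array([1.0, 0.5, -0.3])
y = X @ beta + rng.normal(size=T)

# OLS estimator: solve the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Projection matrices P = X(X'X)^{-1}X' and M = I - P
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(T) - P

u_hat = M @ y                                           # least squares residuals
print(np.allclose(u_hat, y - X @ beta_hat))             # same residuals either way
print(np.allclose(X.T @ u_hat, 0))                      # residuals orthogonal to X
print(np.allclose(P @ P, P), np.allclose(M @ M, M))     # P and M are idempotent
```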

Proposition 3 Let $A$ be an $n \times r$ matrix of rank $r$. A matrix of the form $P = A(A'A)^{-1}A'$ is called a projection matrix and has the following properties:

i) $P = P' = P^2$ (hence $P$ is symmetric and idempotent),
ii) $\mathrm{rank}(P) = r$,
iii) the characteristic roots (eigenvalues) of $P$ consist of $r$ ones and $n-r$ zeros,
iv) if $Z = Ac$ for some vector $c$, then $PZ = Z$ (hence the word projection),
v) $M = I - P$ is also idempotent with rank $n-r$, its eigenvalues consist of $n-r$ ones and $r$ zeros, and if $Z = Ac$, then $MZ = 0$,
vi) $P$ can be written as $G'G$, where $GG' = I$, or as $v_1v_1' + v_2v_2' + \ldots + v_rv_r'$, where each $v_i$ is a vector and $r = \mathrm{rank}(P)$.

Proof. Left as an exercise.⁶

⁵The column space of $X$ is denoted by Col($X$).
⁶Appendix A presents this and other exercises.


[Figure 1: Orthogonal Decomposition of $Y$ — $PY$ lies in Col($X$) and $MY$ is orthogonal to Col($X$).]

3.2 Gaussian Quasi-Maximum Likelihood Estimator

Now we relate a traditional motivation for the OLS estimator. The NLRM is $y_t = x_t'\beta + u_t$ with $u_t \sim N(0, \sigma^2)$.

The density function for a single observation is

$$f(y_t \mid x_t, \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_t - x_t'\beta)^2}{2\sigma^2}},$$

and the log-likelihood for the full sample is

$$\begin{aligned}
\mathcal{L}_T(\beta, \sigma^2; Y \mid X) &= \ln\left[\prod_{t=1}^{T} f(y_t \mid x_t, \beta, \sigma^2)\right] = \sum_{t=1}^{T} \ln f(y_t \mid x_t, \beta, \sigma^2) \\
&= -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(y_t - x_t'\beta)^2 \\
&= -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}S_T(\beta).
\end{aligned}$$


Proposition 4 In the NLRM, $\hat\beta_{MLE} = \hat\beta_{OLS}$.

Proof. The FONC for the maximization of $\mathcal{L}_T(\beta, \sigma^2)$ are:

$$\left.\frac{\partial \mathcal{L}_T(\beta, \sigma^2)}{\partial \beta}\right|_{\hat\beta, \hat\sigma^2} = \frac{1}{\hat\sigma^2}\left(X'Y - X'X\hat\beta\right) = 0,$$

$$\left.\frac{\partial \mathcal{L}_T(\beta, \sigma^2)}{\partial \sigma^2}\right|_{\hat\beta, \hat\sigma^2} = -\frac{T}{2\hat\sigma^2} + \frac{(Y - X\hat\beta)'(Y - X\hat\beta)}{2\hat\sigma^4} = 0.$$

Thus, $\hat\beta_{MLE} = (X'X)^{-1}(X'Y)$ and $\hat\sigma^2 = T^{-1}\hat u'\hat u$.⁷

This result is obvious since $\mathcal{L}_T(\beta, \sigma^2)$ is a function of $\beta$ only through $S_T(\beta)$. Thus, MLE maximizes $\mathcal{L}_T(\beta, \sigma^2)$ by minimizing $S_T(\beta)$. Due to this equivalence, the OLS estimator $\hat\beta$ is frequently referred to as the "Gaussian MLE", the "Gaussian Quasi-MLE", or the "Gaussian Pseudo-MLE".⁸
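As a hedged numerical check of Proposition 4 (simulated data and assumed names, not material from the notes), one can maximize the Gaussian log-likelihood above numerically and confirm that the maximizer coincides with the OLS estimates.

```python
# Numerical check: the Gaussian MLE coincides with OLS (illustrative data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
T, k = 200, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([0.7, 1.5]) + rng.normal(size=T)

def neg_loglik(params):
    beta, log_sigma2 = params[:k], params[k]
    sigma2 = np.exp(log_sigma2)              # reparameterize so sigma^2 > 0
    u = y - X @ beta
    return 0.5 * T * np.log(2 * np.pi) + 0.5 * T * log_sigma2 + u @ u / (2 * sigma2)

res = minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
beta_mle, sigma2_mle = res.x[:k], np.exp(res.x[k])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_ols = (y - X @ beta_ols) @ (y - X @ beta_ols) / T   # sigma_hat^2 = u'u / T
print(beta_mle, beta_ols)        # equal up to optimizer tolerance
print(sigma2_mle, sigma2_ols)
```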

3.3 The Mean and Variance of $\hat\beta$ and $\hat\sigma^2$

Proposition 5 In the LRM, $E[(\hat\beta - \beta) \mid X] = 0$ and $E(\hat\beta) = \beta$.

Proof. From previous results,

$$\hat\beta = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u.$$

Then

$$E[(\hat\beta - \beta) \mid X] = E\left[(X'X)^{-1}X'u \mid X\right] = (X'X)^{-1}X'E(u \mid X) = 0.$$

Applying the law of iterated expectations, $E(\hat\beta) = E[E(\hat\beta \mid X)] = \beta$.

Thus, $\hat\beta$ is unbiased for $\beta$. Indeed, it is conditionally unbiased (conditional on $X$), which is a stronger result.

Proposition 6 In the HLRM, $V(\hat\beta \mid X) = \sigma^2(X'X)^{-1}$ and $V(\hat\beta) = \sigma^2 E\left[(X'X)^{-1}\right]$.

⁷Verify that the SOSC are satisfied.
⁸The term "quasi" ("pseudo") is used for misspecified models. In this case, the normality assumption was used to construct the likelihood and the estimator, but may be believed not to be true.


Proof. Since $\hat\beta - \beta = (X'X)^{-1}X'u$,

$$\begin{aligned}
V(\hat\beta \mid X) &= E\left[(\hat\beta - \beta)(\hat\beta - \beta)' \mid X\right] \\
&= E\left[(X'X)^{-1}X'uu'X(X'X)^{-1} \mid X\right] \\
&= (X'X)^{-1}X'E[uu' \mid X]X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}.
\end{aligned}$$

Thus, $V(\hat\beta) = E\left[V(\hat\beta \mid X)\right] + V\left[E(\hat\beta \mid X)\right] = \sigma^2 E\left[(X'X)^{-1}\right]$.

This result is derived from the assumptions that $u$ is uncorrelated and homoskedastic. The variance-covariance matrix of $\hat\beta$ measures the precision with which the relationship between $Y$ and $X$ is estimated. Some of its features are: First, and most obvious, the variance of $\hat\beta$ grows proportionally with $\sigma^2$ (the volatility of the unpredictable component). Second, although less obvious, as the sample size increases, the variance-covariance matrix of $\hat\beta$ should decrease (we will provide formal arguments in this regard when we analyze the asymptotic properties of OLS). Finally, it also depends on the volatility of the regressors; as it increases, the precision with which we measure $\beta$ is enhanced. Thus, we generally "prefer" a sample of $X$ that is more volatile, given that it would better help us uncover its association with $Y$.

Proposition 7 In the LRM, $\hat\sigma^2$ is biased.

Proof. We know that $\hat u = MY$. It is trivial to verify that $\hat u = Mu$. Then $\hat\sigma^2 = T^{-1}\hat u'\hat u = T^{-1}u'Mu$. This implies that

$$\begin{aligned}
E(\hat\sigma^2 \mid X) &= T^{-1}E[u'Mu \mid X] \\
&= T^{-1}\mathrm{tr}\,E[u'Mu \mid X] \\
&= T^{-1}E[\mathrm{tr}(u'Mu) \mid X] \\
&= T^{-1}E[\mathrm{tr}(Muu') \mid X] \\
&= T^{-1}\sigma^2\mathrm{tr}(M) \\
&= \sigma^2(T - k)T^{-1}.
\end{aligned}$$

Applying the law of iterated expectations we obtain $E(\hat\sigma^2) = \sigma^2(T - k)T^{-1}$.

To derive this result we used the facts that $\sigma^2$ is a scalar (tr denotes the trace of a matrix), that the expectation is a linear operator (thus tr and $E$ are interchangeable), that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, and that $M$ is symmetric, in which case $\mathrm{tr}(M) = \sum_{i=1}^{T}\lambda_i$, where $\lambda_i$ denotes the $i$-th eigenvalue of $M$ (here we used the results of Proposition 3).

Proposition 7 shows that $\hat\sigma^2$ is a biased estimator of $\sigma^2$. A trivial modification yields an unbiased estimator of $\sigma^2$: $\tilde\sigma^2 = (T - k)^{-1}\hat u'\hat u$.
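A small Monte Carlo sketch (illustrative only; the design below is assumed, not taken from the notes) makes Proposition 7 visible: dividing $\hat u'\hat u$ by $T$ gives a downward-biased estimate of $\sigma^2$, while dividing by $T - k$ does not.

```python
# Monte Carlo illustration of the bias of sigma_hat^2 versus sigma_tilde^2.
import numpy as np

rng = np.random.default_rng(2)
T, k, sigma2, reps = 30, 4, 2.0, 20000
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])  # fixed regressors
M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)

ssr_draws = np.empty(reps)
for r in range(reps):
    u = rng.normal(scale=np.sqrt(sigma2), size=T)
    u_hat = M @ u                        # residuals u_hat = Mu
    ssr_draws[r] = u_hat @ u_hat

print(np.mean(ssr_draws) / T)            # approx sigma2 * (T - k) / T  (biased)
print(np.mean(ssr_draws) / (T - k))      # approx sigma2                (unbiased)
```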


Proposition 8 In the NLRM, $V(\hat\sigma^2) = 2(T - k)\sigma^4 T^{-2}$.

Proof. Left as an exercise.

From the results derived so far, three facts are worth mentioning: First, with the exception of Proposition 8, none of the results derived required the assumption of normality of the error term. Second, while $\hat\sigma^2$ is biased, it coincides with the maximum likelihood estimator of the variance under the assumption of normality of $u$ and, as we will show later, it is consistent. Finally, both the variance-covariance matrix of $\hat\beta$ and $\hat\sigma^2$ depend on $\sigma^2$, which is unknown; thus in practice we estimate the variance-covariance matrix of the OLS estimators by replacing $\sigma^2$ with $\hat\sigma^2$ or $\tilde\sigma^2$. For example, the estimator of the variance-covariance matrix of $\hat\beta$ is $\hat V(\hat\beta) = \tilde\sigma^2(X'X)^{-1}$.

3.4 $\hat\beta$ is BLUE

Definition 4 Let $\hat\theta$ and $\theta^*$ be estimators of a vector parameter $\theta$. Let $A$ and $B$ be their respective mean squared error matrices; that is, $A = E(\hat\theta - \theta)(\hat\theta - \theta)'$ and $B = E(\theta^* - \theta)(\theta^* - \theta)'$. We say that $\hat\theta$ is better (or more efficient) than $\theta^*$ if $c'(B - A)c \geq 0$ for every vector $c$ and every parameter value, and $c'(B - A)c > 0$ for at least one value of $c$ and at least one value of the parameter.⁹

Once we have made precise what we mean by better, we are ready to present one of the most famous theorems in econometrics:

Theorem 1 (Gauss-Markov) The Best Linear Unbiased Estimator (BLUE) is $\hat\beta$.

Proof. Let $A = (X'X)^{-1}X'$; then $\hat\beta = AY$. Consider any other linear estimator $b$. Without loss of generality, let $b = (A + C)Y$. Then,

$$E(b \mid X) = (X'X)^{-1}X'X\beta + CX\beta = (I + CX)\beta.$$

For $b$ to be unbiased we require $CX = 0$ to hold, in which case

$$V(b \mid X) = E\left[(A + C)uu'(A + C)'\right].$$

As $(A + C)(A + C)' = (X'X)^{-1} + CC'$, we obtain

$$V(b \mid X) = V(\hat\beta \mid X) + \sigma^2 CC'.$$

⁹This definition can also be stated as $B \geq A$ for every parameter value and $B \neq A$ for at least one parameter value (in this context, $B \geq A$ means that $B - A$ is positive semi-definite and $B > A$ means that $B - A$ is positive definite).


Then $V(b \mid X) \geq V(\hat\beta \mid X)$, as $CC'$ is a positive semi-definite matrix.

Despite its popularity, the Gauss-Markov theorem is not very powerful. It restricts our quest for alternative candidates to those that are both linear and unbiased estimators. There may be a "nonlinear" or biased estimator that can do better in the metric of Definition 4. Furthermore, OLS ceases to be BLUE when the assumption of homoskedasticity is relaxed. If both homoskedasticity and normality are present, we can rely on a stronger theorem which we will discuss later (the Cramér-Rao lower bound).

3.5 Analysis of Variance (ANOVA)

By definition,

$$Y = \hat Y + \hat u.$$

Subtracting $\bar Y$ (the sample mean of $Y$) from both sides we have

$$Y - \bar Y = (\hat Y - \bar Y) + \hat u.$$

Thus

$$(Y - \bar Y)'(Y - \bar Y) = (\hat Y - \bar Y)'(\hat Y - \bar Y) + 2(\hat Y - \bar Y)'\hat u + \hat u'\hat u,$$

but $\hat Y'\hat u = Y'PMY = 0$ and $\bar Y'\hat u = \bar Y\imath'\hat u = 0$ when the model contains an intercept (more generally, if $\imath$ lies in the space spanned by $X$).¹⁰ Thus

$$(Y - \bar Y)'(Y - \bar Y) = (\hat Y - \bar Y)'(\hat Y - \bar Y) + \hat u'\hat u.$$

This is called the analysis of variance formula, often written as

$$TSS = ESS + SSR,$$

where $TSS$, $ESS$, and $SSR$ stand for "total sum of squares", "equation sum of squares", and "sum of squares of the residuals", respectively. The coefficient $R^2$ (also known as the centered coefficient of determination) is defined as

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS} = 1 - \frac{Y'MY}{Y'LY},$$

where $L = I_T - T^{-1}\imath\imath'$. Therefore, provided that the regressors include a constant, $0 \leq R^2 \leq 1$. If the regressors do not include a constant, $R^2$ can be negative because, without the benefit of an intercept, the regression could do worse (at tracking the dependent variable) than the sample mean.

¹⁰We define $\imath$ as a column vector of ones.


$R^2$ measures the percentage of the variance of $Y$ that is accounted for by the variation of the predicted value $\hat Y$. $R^2$ is typically reported in applied work and is frequently referred to as a "measure" of "goodness of fit". This label is inappropriate, as $R^2$ does not measure the adequacy or "fit" of a model.¹¹

It is not even clear that $R^2$ has an unambiguous interpretation in terms of forecast performance. To see this, note that the "explanatory" power of the models $y_t = x_t\beta + u_t$ and $y_t - x_t = x_t\gamma + u_t$ with $\gamma = \beta - 1$ is the same. The models are mathematically identical and yield the same implications and forecasts. Yet their reported $R^2$ will differ greatly. For illustration, suppose that $\beta \approx 1$. Then the $R^2$ from the second model will (nearly) equal zero, while the $R^2$ from the first model can be arbitrarily close to one. An econometrician reporting the near-unit $R^2$ from the first model might claim "success", while an econometrician reporting the $R^2 \approx 0$ from the second model might be accused of a poor fit. This difference in reporting is quite unfortunate, since the two models and their implications are mathematically identical. The bottom line is that $R^2$ is not a measure of fit and should not be interpreted as such.

Another interesting fact about $R^2$ is that it necessarily increases as regressors are added to the model. As by definition the OLS estimate minimizes the $SSR$, by adding additional regressors the $SSR$ cannot increase; it can either stay the same or (more likely) decrease. But the $TSS$ is unaffected by adding regressors, so $R^2$ either stays constant or increases. To counteract this effect, Theil proposed an adjustment, typically called $\bar R^2$ (or "adjusted" $R^2$), which penalizes model dimensionality and is defined as:

$$\bar R^2 = 1 - \frac{SSR/(T - k)}{TSS/T} = 1 - \frac{\tilde\sigma^2}{\hat\sigma^2_y}.$$

While often reported in applied work, this statistic is not used much today, as better model evaluation criteria have been developed (we will discuss this later).
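The ANOVA quantities and both versions of $R^2$ are straightforward to compute. The sketch below (simulated data, illustrative names, and the $T$-denominator convention for $\hat\sigma^2_y$ used in the formula above; many texts use $T - 1$ instead) shows one way to do it.

```python
# Illustrative computation of TSS, ESS, SSR, R^2, and adjusted R^2 (intercept included).
import numpy as np

rng = np.random.default_rng(3)
T, k = 120, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

TSS = np.sum((y - y.mean()) ** 2)     # total sum of squares
SSR = u_hat @ u_hat                   # sum of squares of the residuals
ESS = TSS - SSR                       # equation sum of squares (holds with an intercept)

R2 = 1 - SSR / TSS
R2_adj = 1 - (SSR / (T - k)) / (TSS / T)   # convention of the notes; others use TSS/(T-1)
print(R2, R2_adj)
```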

3.6 OLS Estimator of a Subset of $\beta$

Sometimes we may not be interested in obtaining estimates of the whole parameter vector, but only of a subset of $\beta$. Partition

$$X = \begin{bmatrix} X_1 & X_2 \end{bmatrix} \qquad \text{and} \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.$$

¹¹More unfortunate is the claim that $R^2$ measures the percentage of the variance of $y$ that is "explained" by the model. An econometric model, by itself, doesn't explain anything. Only the combination of a good econometric model and sound economic theory can, in principle, "explain" a phenomenon.


Then $X'X\hat\beta = X'Y$ can be written as:

$$X_1'X_1\hat\beta_1 + X_1'X_2\hat\beta_2 = X_1'Y, \qquad (3a)$$
$$X_2'X_1\hat\beta_1 + X_2'X_2\hat\beta_2 = X_2'Y. \qquad (3b)$$

Solving for $\hat\beta_2$ and reinserting in (3a) we obtain

$$\hat\beta_1 = (X_1'M_2X_1)^{-1}X_1'M_2Y \qquad \text{and} \qquad \hat\beta_2 = (X_2'M_1X_2)^{-1}X_2'M_1Y,$$

where $M_i = I - P_i = I - X_i(X_i'X_i)^{-1}X_i'$ (for $i = 1, 2$).

These results can also be derived using the following theorem:

Theorem 2 (Frisch-Waugh-Lovell) $\hat\beta_2$ and $\hat u$ can be computed using the following algorithm:
1. Regress $Y$ on $X_1$, obtain residuals $\tilde Y$,
2. Regress $X_2$ on $X_1$, obtain residuals $\tilde X_2$,
3. Regress $\tilde Y$ on $\tilde X_2$, obtain $\hat\beta_2$ and residuals $\hat u$.

Proof. Left as an exercise.

In some contexts, the Frisch-Waugh-Lovell (FWL) theorem can be used to speed up computation, but in most cases there is little computational advantage to using it.¹² There are, however, two common applications of the FWL theorem, one of which is usually presented in introductory econometrics courses: the demeaning formula for regression; the other deals with ill-conditioned problems.

The first application can be constructed as follows: Partition $X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}$, where $X_1 = \imath$ is a vector of ones and $X_2$ is the matrix of observed regressors. In this case,

$$\tilde X_2 = M_1X_2 = X_2 - \imath(\imath'\imath)^{-1}\imath'X_2 = X_2 - \bar X_2$$

and

$$\tilde Y = M_1Y = Y - \imath(\imath'\imath)^{-1}\imath'Y = Y - \bar Y,$$

which are 'demeaned'.

¹²A few decades ago, a crucial limitation for conducting OLS estimation was the computational cost of inverting even moderately sized matrices, and the FWL theorem was invoked routinely.


The FWL theorem says that $\hat\beta_2$ is the OLS estimate from a regression of $\tilde Y$ on $\tilde X_2$, or of $y_t - \bar Y$ on $x_{2t} - \bar X_2$:

$$\hat\beta_2 = \left(\sum_{t=1}^{T}(x_{2t} - \bar X_2)(x_{2t} - \bar X_2)'\right)^{-1}\left(\sum_{t=1}^{T}(x_{2t} - \bar X_2)(y_t - \bar Y)\right).$$

Thus, the OLS estimator for the slope coefficients can be obtained from a regression with demeaned data.
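The following sketch (simulated data and assumed names, not from the notes) illustrates the FWL algorithm of Theorem 2: the slope coefficients from the full regression coincide with those from regressing the $X_1$-residuals of $Y$ on the $X_1$-residuals of $X_2$.

```python
# Illustrative check of the Frisch-Waugh-Lovell theorem.
import numpy as np

rng = np.random.default_rng(4)
T = 150
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])   # partialled-out regressors
X2 = rng.normal(size=(T, 2))                             # regressors of interest
X = np.hstack([X1, X2])
y = X @ np.array([1.0, 0.2, -0.7, 0.4]) + rng.normal(size=T)

def ols(Z, w):
    return np.linalg.solve(Z.T @ Z, Z.T @ w)

beta_full = ols(X, y)                                    # full regression
M1 = np.eye(T) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
y_tilde, X2_tilde = M1 @ y, M1 @ X2                      # residuals from regressions on X1
beta2_fwl = ols(X2_tilde, y_tilde)                       # step 3 of the FWL algorithm

print(np.allclose(beta_full[2:], beta2_fwl))             # True: identical slope estimates
```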

The other application is more useful. In our analysis we assumed that $X$ is of full rank ($X'X$ is invertible). Suppose for a moment that $X_1$ is of full rank but that $X_2$ is not. In that case $\beta_2$ cannot be estimated, but $\beta_1$ still can be, as follows:

$$\hat\beta_1 = (X_1'M_2^*X_1)^{-1}X_1'M_2^*Y,$$

where $M_2^*$ is formed using $X_2^*$, which has columns equal to the maximal number of linearly independent columns of $X_2$.

4 Constrained Least Squares (CLS)

In this section we shall consider the estimation of $\beta$ and $\sigma^2$ when there are certain linear constraints on the elements of $\beta$. We shall assume that the constraints are of the form:

$$Q'\beta = c, \qquad (4)$$

where $Q$ is a $k \times q$ matrix of known constants and $c$ is a $q$-vector of known constants. We shall also assume that $q < k$ and $\mathrm{rank}(Q) = q$.

4.1 Derivation of the CLS Estimator

The CLS estimator of $\beta$, denoted by $\bar\beta$, is defined to be the value of $\beta$ that minimizes the $SSR$ subject to the constraint (4). The Lagrange expression for the CLS minimization problem is

$$\mathcal{L}(\beta, \gamma) = (Y - X\beta)'(Y - X\beta) + 2\gamma'(Q'\beta - c),$$

where $\gamma$ is a $q$-vector of Lagrange multipliers corresponding to the $q$ constraints. The FONC are

$$\left.\frac{\partial \mathcal{L}}{\partial \beta}\right|_{\bar\beta, \bar\gamma} = -2X'Y + 2X'X\bar\beta + 2Q\bar\gamma = 0,$$

$$\left.\frac{\partial \mathcal{L}}{\partial \gamma}\right|_{\bar\beta, \bar\gamma} = Q'\bar\beta - c = 0.$$


The solution for $\bar\beta$ is

$$\bar\beta = \hat\beta - (X'X)^{-1}Q\left[Q'(X'X)^{-1}Q\right]^{-1}\left(Q'\hat\beta - c\right). \qquad (5)$$

The corresponding estimator of $\sigma^2$ can be defined as

$$\bar\sigma^2 = T^{-1}(Y - X\bar\beta)'(Y - X\bar\beta).$$
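As an illustration of (5), the sketch below (assumed data; the restriction $\beta_2 + \beta_3 = 1$ is purely hypothetical) computes the CLS estimator and checks that the constraint holds exactly.

```python
# Illustrative constrained least squares via formula (5).
import numpy as np

rng = np.random.default_rng(5)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([0.5, 0.6, 0.4]) + rng.normal(size=T)

Q = np.array([[0.0], [1.0], [1.0]])     # k x q with q = 1: restricts beta_2 + beta_3
c = np.array([1.0])                     # ... to equal 1

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                                   # unconstrained OLS
A = Q.T @ XtX_inv @ Q                                          # q x q matrix Q'(X'X)^{-1}Q
beta_bar = beta_hat - XtX_inv @ Q @ np.linalg.solve(A, Q.T @ beta_hat - c)   # equation (5)

sigma2_bar = (y - X @ beta_bar) @ (y - X @ beta_bar) / T       # CLS estimate of sigma^2
print(beta_bar, Q.T @ beta_bar)                                # constraint satisfied exactly
```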

4.2 CLS as BLUE

It can be shown that (5) can be expressed as

$$\bar\beta = \beta + R(R'X'XR)^{-1}R'X'u,$$

where $R$ is a $k \times (k - q)$ matrix such that the matrix $(Q, R)$ is nonsingular and $R'Q = 0$.¹³ Therefore $\bar\beta$ is unbiased and its variance-covariance matrix is given by

$$V(\bar\beta) = \sigma^2 R(R'X'XR)^{-1}R'.$$

Now define the class of linear estimators $\beta^* = D'Y - d$, where $D'$ is a $k \times T$ matrix and $d$ is a $k$-vector. This class is broader than the class of linear estimators considered in the unconstrained case because of the additive constant $d$. We did not include $d$ previously because in the unconstrained model the unbiasedness condition would ensure $d = 0$. Here, the unbiasedness condition $E(D'Y - d) = \beta$ implies $D'X = I + GQ'$ and $d = Gc$ for some arbitrary $k \times q$ matrix $G$. We have $V(\beta^*) = \sigma^2 D'D$, and CLS is BLUE because of the identity

$$D'D - R(R'X'XR)^{-1}R' = \left[D' - R(R'X'XR)^{-1}R'X'\right]\left[D' - R(R'X'XR)^{-1}R'X'\right]',$$

where we have used $D'X = I + GQ'$ and $R'Q = 0$.

5 Inference with Linear Constraints

In this section we shall regard the linear constraints (4) as a testable hypothesis, calling it the null hypothesis. For now we will assume that the normal linear regression model holds and derive the most frequently used tests in the OLS context.¹⁴

¹³Such a matrix can always be found and is not unique; any matrix that satisfies these conditions will do.
¹⁴We will discuss the case of inference in the presence of nonlinear constraints and departures from normality of $u$ later. For the impatient: none of the results derived here change when these assumptions are relaxed (at least asymptotically).


5.1 The $t$ Test

The $t$ test is an ideal test to use when we have a single constraint, that is, $q = 1$. As we assumed that $u$ is normally distributed, so is $\hat\beta$; thus under the null hypothesis we have

$$Q'\hat\beta \sim N\left[c,\, \sigma^2 Q'(X'X)^{-1}Q\right].$$

With $q = 1$, $Q'$ is a row vector and $c$ is a scalar. Therefore

$$\frac{Q'\hat\beta - c}{\left[\sigma^2 Q'(X'X)^{-1}Q\right]^{1/2}} \sim N(0, 1). \qquad (6)$$

This is the test statistic that we would use if $\sigma$ were known. As

$$\frac{\hat u'\hat u}{\sigma^2} \sim \chi^2_{T-k}, \qquad (7)$$

and it can be shown that (6) and (7) are independent, we have:

$$t_T = \frac{Q'\hat\beta - c}{\left[\tilde\sigma^2 Q'(X'X)^{-1}Q\right]^{1/2}} \sim S_{T-k},$$

which is Student's $t$ with $T - k$ degrees of freedom. Only now have we invoked the assumption of normality of $u$ and, as shown later, it is not necessary for (6) to hold (in large samples).

If we were interested in testing a single hypothesis of the form

$$H_0 : \beta_1 = 0,$$

we would define $Q = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}'$ and $c = 0$, in which case we would obtain the familiar $t$ test

$$t_T = \frac{\hat\beta_1}{\sqrt{\hat V_{1,1}}},$$

where $\hat V_{1,1}$ is the $(1,1)$ component of the estimator of the variance-covariance matrix of $\hat\beta$.

With these tools we can construct confidence intervals $C_T$ for $\beta_i$. As $C_T$ is a function of the data, it is random. Its objective is to cover $\beta_i$ with high probability. The coverage probability is $\Pr(\beta \in C_T)$. We say that $C_T$ has $(1 - \alpha)\%$ coverage for $\beta$ if $\Pr(\beta \in C_T) \geq (1 - \alpha)$. We construct a confidence interval as follows:

$$\Pr\left[\hat\beta_i - z_{\alpha/2}\sqrt{\hat V_{i,i}} < \beta_i < \hat\beta_i + z_{\alpha/2}\sqrt{\hat V_{i,i}}\right] = 1 - \alpha.$$


Here $z_{\alpha/2}$ denotes the standard normal critical value with tail probability $\alpha/2$, and the most common choice for $\alpha$ is 0.05. If $|t_T| < z_{\alpha/2}$, we cannot reject the null hypothesis at the $\alpha\%$ significance level; otherwise the null hypothesis is rejected.

An alternative approach to reporting results is to report a p-value. The p-value for the above statistic is constructed as follows. Define the tail probability, or p-value function,

$$p_T = p(t_T) = \Pr(|Z| \geq |t_T|) = 2\left(1 - \Phi(|t_T|)\right).$$

If the p-value $p_T$ is small (close to zero), then the evidence against $H_0$ is strong. In a sense, p-values and hypothesis tests are equivalent, since $p_T \leq \alpha$ if and only if $|t_T| \geq z_{\alpha/2}$. The p-value is more general, however, in that the reader is allowed to pick the level of significance $\alpha$.¹⁵
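The following sketch (simulated data; it uses the exact $t_{T-k}$ critical value in place of $z_{\alpha/2}$, which is my assumption rather than the notes' choice) computes the $t$ statistic, its two-sided p-value, and a confidence interval for a single coefficient.

```python
# Illustrative t test, p-value, and confidence interval for one coefficient.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
T, k = 80, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([0.3, 0.0, 1.0]) + rng.normal(size=T)      # true beta_2 = 0

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
sigma2_tilde = u_hat @ u_hat / (T - k)
V_hat = sigma2_tilde * XtX_inv                              # estimated Var(beta_hat)

i = 1                                                       # test H0: beta_i = 0
t_stat = beta_hat[i] / np.sqrt(V_hat[i, i])
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=T - k))      # two-sided p-value

alpha = 0.05
crit = stats.t.ppf(1 - alpha / 2, df=T - k)
ci = (beta_hat[i] - crit * np.sqrt(V_hat[i, i]),
      beta_hat[i] + crit * np.sqrt(V_hat[i, i]))
print(t_stat, p_value, ci)
```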

A confidence interval for $\sigma^2$ can be constructed as follows:

$$\Pr\left[\frac{(T - k)\tilde\sigma^2}{\chi^2_{T-k,\,1-\alpha/2}} < \sigma^2 < \frac{(T - k)\tilde\sigma^2}{\chi^2_{T-k,\,\alpha/2}}\right] = 1 - \alpha. \qquad (8)$$

5.2 The $F$ Test

When $q > 1$ we cannot apply the $t$ test described above, and instead use a simple transformation of what is known as the Likelihood Ratio test (which we will discuss at length later). Under the null hypothesis, it can be shown that

$$\frac{S_T(\bar\beta) - S_T(\hat\beta)}{\sigma^2} \sim \chi^2_q.$$

As in the previous case, when $\sigma^2$ is not known, a finite sample correction can be made by replacing $\sigma^2$ with $\tilde\sigma^2$, in which case we have

$$\frac{S_T(\bar\beta) - S_T(\hat\beta)}{q\,\tilde\sigma^2} = \frac{T - k}{q}\,\frac{\left(Q'\hat\beta - c\right)'\left[Q'(X'X)^{-1}Q\right]^{-1}\left(Q'\hat\beta - c\right)}{\hat u'\hat u} \sim F_{q,\,T-k}. \qquad (9)$$

Once again, as in the case of $t$ tests, we reject the null hypothesis when the computed value exceeds the critical value.
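A minimal sketch of the $F$ statistic (9) follows (assumed data; the null that the last two coefficients equal zero is purely illustrative).

```python
# Illustrative F test of q linear restrictions Q'beta = c.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
T, k = 90, 4
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=T)

Q = np.zeros((k, 2)); Q[2, 0] = 1.0; Q[3, 1] = 1.0          # k x q, restricts beta_3 = beta_4 = 0
c = np.zeros(2)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat

r = Q.T @ beta_hat - c
middle = np.linalg.inv(Q.T @ XtX_inv @ Q)
q = Q.shape[1]
F = ((T - k) / q) * (r @ middle @ r) / (u_hat @ u_hat)      # equation (9)
p_value = 1 - stats.f.cdf(F, q, T - k)
print(F, p_value)
```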

5.3 Tests for Structural Breaks

Suppose we have a two-regime regression

$$Y_1 = X_1\beta_1 + u_1, \qquad Y_2 = X_2\beta_2 + u_2,$$

¹⁵GAUSS tip: to compute $p(t)$ use 2*cdfnc(t).


where the vectors and matrices have $T_1$ and $T_2$ rows, respectively ($T = T_1 + T_2$). Suppose further that

$$E\begin{bmatrix} u_1 \\ u_2 \end{bmatrix}\begin{bmatrix} u_1' & u_2' \end{bmatrix} = \begin{bmatrix} \sigma_1^2 I_{T_1} & 0 \\ 0 & \sigma_2^2 I_{T_2} \end{bmatrix}.$$

We want to test the null hypothesis $H_0 : \beta_1 = \beta_2$. First, we will derive an $F$ test assuming homoskedasticity across regimes, and later we will relax this assumption. To apply the test we define

$$Y = X\beta + u,$$

where

$$Y = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}, \qquad X = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \qquad u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}.$$

Applying (9) we obtain:

$$\frac{T_1 + T_2 - 2k}{k}\,\frac{\left(\hat\beta_1 - \hat\beta_2\right)'\left[(X_1'X_1)^{-1} + (X_2'X_2)^{-1}\right]^{-1}\left(\hat\beta_1 - \hat\beta_2\right)}{Y'\left[I - X(X'X)^{-1}X'\right]Y} \sim F_{k,\,T_1+T_2-2k}, \qquad (10)$$

where $\hat\beta_1 = (X_1'X_1)^{-1}X_1'Y_1$ and $\hat\beta_2 = (X_2'X_2)^{-1}X_2'Y_2$.

Alternatively, the same result can be derived as follows: Define the sum of squares of the residuals under the alternative of structural change,

$$S_T(\hat\beta) = Y'\left[I - X(X'X)^{-1}X'\right]Y,$$

and the sum of squares of the residuals under the null hypothesis,

$$S_T(\bar\beta) = Y'\left[I - \bar X(\bar X'\bar X)^{-1}\bar X'\right]Y,$$

where $\bar X$ stacks $X_1$ and $X_2$ (the regressor matrix obtained by imposing $\beta_1 = \beta_2$). It is easy to show that

$$\frac{T_1 + T_2 - 2k}{k}\,\frac{S_T(\bar\beta) - S_T(\hat\beta)}{S_T(\hat\beta)} \sim F_{k,\,T_1+T_2-2k}. \qquad (11)$$

In this case an unbiased estimate of $\sigma^2$ is

$$\tilde\sigma^2 = \frac{S_T(\hat\beta)}{T_1 + T_2 - 2k}.$$

Before we remove the assumption that $\sigma_1 = \sigma_2$, we will first derive a test of the equality of the variances. Under the null hypothesis (same variances across regimes) we have

$$\frac{\hat u_i'\hat u_i}{\sigma^2} \sim \chi^2_{T_i - k} \qquad \text{for } i = 1, 2.$$


Because these chi-square variables are independent, we have

$$\frac{T_2 - k}{T_1 - k}\cdot\frac{\hat u_1'\hat u_1}{\hat u_2'\hat u_2} \sim F_{T_1-k,\,T_2-k}.$$

Unlike the previous tests, a two-tailed test should be used here, because either a large or a small value of the statistic is a reason to reject the null hypothesis.
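The sketch below (assumed data and break point, not from the notes) computes the Chow-type $F$ test in its SSR form (11) together with the two-tailed variance-equality test just described.

```python
# Illustrative Chow test of H0: beta_1 = beta_2, plus the variance-equality F test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
T1, T2, k = 60, 70, 2
X1 = np.column_stack([np.ones(T1), rng.normal(size=T1)])
X2 = np.column_stack([np.ones(T2), rng.normal(size=T2)])
y1 = X1 @ np.array([1.0, 0.5]) + rng.normal(size=T1)
y2 = X2 @ np.array([1.0, 0.5]) + rng.normal(size=T2)

def ssr(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ b
    return u @ u

# Unrestricted: separate regressions per regime (the block-diagonal X above)
ssr_u = ssr(X1, y1) + ssr(X2, y2)
# Restricted: pooled regression imposing beta_1 = beta_2
X_pool, y_pool = np.vstack([X1, X2]), np.concatenate([y1, y2])
ssr_r = ssr(X_pool, y_pool)

F_chow = ((T1 + T2 - 2 * k) / k) * (ssr_r - ssr_u) / ssr_u           # equation (11)
p_chow = 1 - stats.f.cdf(F_chow, k, T1 + T2 - 2 * k)

# Two-tailed F test of equal error variances across regimes
F_var = (ssr(X1, y1) / (T1 - k)) / (ssr(X2, y2) / (T2 - k))
p_var = 2 * min(stats.f.cdf(F_var, T1 - k, T2 - k),
                1 - stats.f.cdf(F_var, T1 - k, T2 - k))
print(F_chow, p_chow, F_var, p_var)
```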

If we remove the assumption of equal variances across regimes and focus on the hypothesis of equality of the regression parameters, the tests are more involved. We will concentrate on the case in which $k = 1$, where a $t$ test is applicable. It can be shown (though this is not trivial) that

$$t_T = \frac{\hat\beta_1 - \hat\beta_2}{\sqrt{\dfrac{\sigma_1^2}{X_1'X_1} + \dfrac{\sigma_2^2}{X_2'X_2}}} \sim S_v,$$

where

$$v = \frac{\left[\dfrac{\sigma_1^2}{X_1'X_1} + \dfrac{\sigma_2^2}{X_2'X_2}\right]^2}{\dfrac{\sigma_1^4}{(T_1 - 1)(X_1'X_1)^2} + \dfrac{\sigma_2^4}{(T_2 - 1)(X_2'X_2)^2}}.$$

A cleaner way to perform this type of test is through the use of direct Likelihood Ratio tests (which we will discuss in depth later).

Even though structural change (or Chow) tests are popular, modern econometric practice is skeptical of the way in which they are described above, particularly because in these cases the econometrician sets in an ad hoc manner the point at which to split the sample. Recent theoretical and empirical applications work on treating the period of a possible break as an endogenous latent variable.

6 Prediction

We are now interested in producing out-of-sample predictions of $y_p$ (for $p > T$). In that period, the relationship will be:

$$y_p = x_p'\beta + u_p,$$

where $y_p$ and $u_p$ are scalars and $x_p$ contains the $p$th-period observations on the regressors. If we assume that the conditions outlined in the HLRM are satisfied, it is trivial to verify that the best linear predictor is $x_p'\hat\beta_T$, with $\hat\beta_T$ denoting the OLS estimator of $\beta$ conditional on the information available in period $T$.¹⁶

¹⁶"Best" is defined in terms of the candidate that minimizes the mean squared prediction error conditional on observing $x_p$.


In this case, it can be verified that, conditional on $x_p$, the mean squared prediction error is

$$E\left[(\hat y_p - y_p)^2 \mid x_p\right] = \sigma^2\left[1 + x_p'(X'X)^{-1}x_p\right].$$

In order to construct an estimator of the variance of the forecast error, replace $\sigma^2$ with $\tilde\sigma^2$. It may be thought that the construction of confidence intervals for the prediction is trivial and could be formulated as follows:

$$\Pr\left[\hat y_p - z_{\alpha/2}\sqrt{\hat V_{y_p}} < y_p < \hat y_p + z_{\alpha/2}\sqrt{\hat V_{y_p}}\right] = 1 - \alpha.$$
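The following sketch (assumed data and a hypothetical out-of-sample regressor vector $x_p$) computes the point forecast, the estimated mean squared prediction error defined above, and the naive interval just formulated.

```python
# Illustrative point forecast, forecast-error variance, and naive prediction interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
T, k = 100, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([0.5, 2.0]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
sigma2_tilde = u_hat @ u_hat / (T - k)

x_p = np.array([1.0, 0.3])                                  # hypothetical out-of-sample regressors
y_hat_p = x_p @ beta_hat                                    # point forecast
V_yp = sigma2_tilde * (1 + x_p @ XtX_inv @ x_p)             # estimated forecast error variance

z = stats.norm.ppf(0.975)                                   # z_{alpha/2} with alpha = 0.05
interval = (y_hat_p - z * np.sqrt(V_yp), y_hat_p + z * np.sqrt(V_yp))
print(y_hat_p, np.sqrt(V_yp), interval)
```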


Standard measures of out-of-sample forecasting performance include

$$RMSE = \sqrt{\frac{1}{P}\sum_{p=1}^{P}(y_p - \hat y_p)^2} \qquad \text{and} \qquad MAE = \frac{1}{P}\sum_{p=1}^{P}\left|y_p - \hat y_p\right|,$$

where RMSE stands for Root Mean Squared Error, MAE for Mean Absolute Error, and $P$ is the number of periods being forecast. These have an obvious scaling problem. Several measures that do not are based on the Theil $U$ statistic:

$$U = \sqrt{\frac{\sum_{p=1}^{P}(y_p - \hat y_p)^2}{\sum_{p=1}^{P}y_p^2}}.$$

This measure is related to $R^2$ but is not bounded by zero and one. Large values indicate a poor forecasting performance. An alternative is to compute the measure in terms of the changes in $y$:

$$U' = \sqrt{\frac{\sum_{p=1}^{P}(\Delta y_p - \Delta\hat y_p)^2}{\sum_{p=1}^{P}(\Delta y_p)^2}},$$

where

$$\Delta y_p = y_p - y_{p-1} \quad \text{and} \quad \Delta\hat y_p = \hat y_p - y_{p-1},$$

or, in percentage changes,

$$\Delta y_p = \frac{y_p - y_{p-1}}{y_{p-1}} \quad \text{and} \quad \Delta\hat y_p = \frac{\hat y_p - y_{p-1}}{y_{p-1}}.$$

These measures reflect the model's ability to track turning points in the data.

When several competing forecast models are considered, one set of them will appear more successful than another in a given dimension (say, one model has the smallest MAE for 2-step-ahead forecasts). It is inevitable then to ask how likely it is that this result is due to chance. Diebold and Mariano (1995) approach forecast comparison in this framework.
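Before turning to formal comparisons, here is an illustrative computation (with made-up actual and forecast series, not taken from the notes) of the accuracy measures defined above: RMSE, MAE, Theil's $U$, and the change-based $U'$.

```python
# Illustrative forecast accuracy measures: RMSE, MAE, Theil U, and U' on changes.
import numpy as np

rng = np.random.default_rng(10)
P = 24
y = np.cumsum(rng.normal(size=P + 1)) + 10.0       # actual series y_0, ..., y_P
y_hat = y[1:] + rng.normal(scale=0.5, size=P)      # hypothetical one-step forecasts of y_1..y_P
y_act = y[1:]

rmse = np.sqrt(np.mean((y_act - y_hat) ** 2))
mae = np.mean(np.abs(y_act - y_hat))
U = np.sqrt(np.sum((y_act - y_hat) ** 2) / np.sum(y_act ** 2))

dy = y_act - y[:-1]                                # actual changes
dy_hat = y_hat - y[:-1]                            # forecast changes
U_prime = np.sqrt(np.sum((dy - dy_hat) ** 2) / np.sum(dy ** 2))
print(rmse, mae, U, U_prime)
```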

Consider the pair of $h$-step-ahead forecast errors of models $i$ and $j$, $(\hat u_{i,p}, \hat u_{j,p})$ for $p = 1, \ldots, P$, whose quality is to be judged by the loss function $g(\hat u_{i,p})$.¹⁸ Defining $d_p = g(\hat u_{i,p}) - g(\hat u_{j,p})$, under the null hypothesis of equal forecast accuracy between models $i$ and $j$ we have $E d_p = 0$. Given the covariance-stationary realization $\{d_p\}_{p=1}^{P}$, it is natural to base a test on the observed sample mean:

$$\bar d = \frac{1}{P}\sum_{p=1}^{P}d_p.$$

Even with optimal $h$-step-ahead forecasts, the sequence of forecast errors follows an MA($h-1$) process. If the autocorrelations of order $h$ and higher are zero, the variance of $\bar d$ can be consistently estimated as follows:

$$V_{\bar d} = \frac{1}{P}\left(\hat\gamma_0 + 2\sum_{j=1}^{h-1}\hat\gamma_j\right),$$

¹⁸For example, in the case of Mean Squared Error comparison, $g(\cdot)$ is the quadratic loss function $g(\hat u_{i,p}) = \hat u_{i,p}^2$, and in the case of MAE, it is the absolute value loss function $g(\hat u_{i,p}) = |\hat u_{i,p}|$.


where $\hat\gamma_j$ is an estimate of the $j$-th autocovariance of $d_p$. The Diebold-Mariano (DM) statistic is given by

$$DM = \frac{\bar d}{\sqrt{V_{\bar d}}} \rightarrow N(0, 1)$$

under the null of equal forecast accuracy. Harvey et al. (1997) suggest modifying the DM test and using instead

$$HLN = DM\cdot\left[\frac{P + 1 - 2h + h(h-1)/P}{P}\right]^{1/2}$$

to correct size problems of DM. They also suggest using a Student's $t$ with $P - 1$ degrees of freedom instead of a standard normal to account for possible fat-tailed errors.

To test whether model $i$ is not dominated by model $j$ in terms of forecasting accuracy for the loss function $g(\cdot)$, a one-sided test of DM or HLN can be conducted, where under the null $E d_p \leq 0$. Thus, if the null is rejected, we conclude that model $j$ dominates model $i$.
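A compact sketch of the DM and HLN statistics (with simulated forecast errors and a quadratic loss, both assumed purely for illustration) follows; it mirrors the formulas above, including the $t_{P-1}$ reference distribution suggested by Harvey et al. (1997).

```python
# Illustrative Diebold-Mariano test and Harvey-Leybourne-Newbold correction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
P, h = 40, 2
u_i = rng.normal(size=P)                  # forecast errors of model i (simulated)
u_j = rng.normal(scale=1.2, size=P)       # forecast errors of model j (simulated)

d = u_i ** 2 - u_j ** 2                   # loss differential, quadratic loss g(u) = u^2
d_bar = d.mean()

# Autocovariances of d up to order h-1, then the long-run variance of d_bar
gammas = [np.sum((d[j:] - d_bar) * (d[:P - j] - d_bar)) / P for j in range(h)]
V_d = (gammas[0] + 2 * sum(gammas[1:])) / P

DM = d_bar / np.sqrt(V_d)
HLN = DM * np.sqrt((P + 1 - 2 * h + h * (h - 1) / P) / P)
p_value = 2 * (1 - stats.t.cdf(abs(HLN), df=P - 1))   # t_{P-1} reference distribution
print(DM, HLN, p_value)
```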


References

Amemiya, T. (1985). Advanced Econometrics. Harvard University Press.

Baltagi, B. (1999). Econometrics. Springer-Verlag.

Diebold, F. and R. Mariano (1995). "Comparing Predictive Accuracy," Journal of Business and Economic Statistics 13, 253-65.

Greene, W. (1993). Econometric Analysis. Macmillan.

Hansen, B. (2001). "Lecture Notes on Econometrics," Manuscript. Michigan University.

Harvey, D., S. Leybourne, and P. Newbold (1997). "Testing the Equality of Prediction Mean Square Errors," International Journal of Forecasting 13, 281-91.

Hayashi, F. (2000). Econometrics. Princeton University Press.

Lam, J. and M. Veall (2002). "Bootstrap Prediction Intervals for Single Period Regression Forecasts," International Journal of Forecasting 18, 125-30.

Mittelhammer, R., G. Judge, and D. Miller (2000). Econometric Foundations. Cambridge University Press.

Ruud, P. (2000). An Introduction to Classical Econometric Theory. Oxford University Press.


A Workout Problems

1. Prove that independence implies no correlation but that the contrary is not necessarily true. Give an example of variables that are uncorrelated but not independent.

2. Let $y$, $x$ be scalar dichotomous random variables with zero means. Define $u = y - \mathrm{Cov}(y, x)\left[V(x)\right]^{-1}x$. Prove that $E(u \mid x) = 0$. Are $u$ and $x$ independent?

3. Let $y$ be a scalar random variable and $x$ a vector random variable. Prove that $E\left[y - E(y \mid x)\right]^2 \leq E\left[y - w(x)\right]^2$ for any function $w$.

4. Prove that if $V(u_t) = \sigma^2$, then $V(\hat u_t) = (1 - h_t)\sigma^2$. Find an expression for $h_t$.

5. Prove Proposition 3.

6. Prove Proposition 8.

7. In Theorem 1 we used the fact that $(A + C)(A + C)' = (X'X)^{-1} + CC'$. Prove this.

8. Prove that when a constant is included, $R^2 = 1 - (Y'MY / Y'LY)$, with $L$ as defined in Section 3.5.

9. Derive the variance-covariance matrix of $\hat\beta_2$ defined in Section 3.6.

10. Prove Theorem 2.

11. Prove that (5) is the CLS estimator.

12. Prove that the CLS estimator can be expressed as $\bar\beta = \beta + R(R'X'XR)^{-1}R'X'u$ and obtain $V(\bar\beta)$.

13. Show that $(\hat u'\hat u)\sigma^{-2} \sim \chi^2_{T-k}$.

14. Demonstrate (8).

15. Derive equations (9), (10), and (11).

16. Prove that to test the null $H_0: \beta_i = 0$ for all $i$ except the constant, the $F$ test is equivalent to $(T - k)R^2 / \left[(1 - R^2)(k - 1)\right]$.
