

    Ordinary Least Squares: A Review

Rómulo A. Chumacero*

    February 2005

1 Introduction

Ragnar Frisch (one of the founders of the Econometric Society) is credited with coining the term 'econometrics'. Econometrics aims at giving empirical content to economic relationships by uniting three key ingredients: economic theory, economic data, and statistical methods. Neither 'theory without measurement' nor 'measurement without theory' is sufficient for explaining economic phenomena. It is their union that is the key to understanding economic relationships.

Social scientists generally must accept the conditions under which their subjects act and the responses occur. As economic data come almost exclusively from non-experimental sources, researchers cannot specify or choose the level of a stimulus and record the outcome. They can only observe the natural experiments that take place. In this sense economics, like meteorology, is an observational science.

For example, many economists have studied the influence of monetary policy on macroeconomic conditions, yet the effects of actions by central banks continue to be widely debated. Some of the controversies would be removed if a central bank could experiment with monetary policy over repeated trials under identical conditions, thus being able to isolate the effects of policy more accurately. However, no one can turn back the clock to try various policies under essentially the same conditions. Each time a central bank contemplates an action, it faces a new set of conditions. The actors and technologies have all changed. The social, economic, and political orders are different. To learn about one aspect of the economic world, one must take into account many others. To apply past experience effectively, one must take into account similarities and differences between the past, present, and future.

Here we will review some of the finite sample properties of the most basic and popular estimation procedure in econometrics (Ordinary Least Squares, OLS for short). The document is organized as follows: Section 2 describes the general framework for regression analysis. Section 3 derives the OLS estimator and discusses its properties.

*Department of Economics of the University of Chile and Research Department of the Central Bank of Chile. E-mail address: rchumace@econ.uchile.cl


Section 4 considers OLS estimation subject to linear constraints. Section 5 develops tests for linear constraints. Finally, Section 6 discusses the issue of prediction within the OLS context.

2 Preliminaries

An econometrician has the observational data $\{w_1, w_2, \ldots, w_T\}$, where each $w_t$ is a vector of data. Partition $w_t = (y_t, x_t)$, where $y_t \in \mathbb{R}$ and $x_t \in \mathbb{R}^k$. Let the joint density of the variables be given by $f(y_t, x_t, \theta)$, where $\theta$ is a vector of unknown parameters.

In econometrics we are often interested in the conditional distribution of one set of random variables given another set of random variables (e.g., the conditional distribution of consumption given income, or the conditional distribution of wages given individual characteristics). Recalling that the joint density can be written as the product of the conditional density and the marginal density, we have:

$$f(y_t, x_t, \theta) = f(y_t \mid x_t, \theta_1)\, f(x_t, \theta_2),$$

where $f(x_t, \theta_2) = \int_{-\infty}^{\infty} f(y_t, x_t, \theta)\, dy$ is the marginal density of $x$.

Regression analysis can be defined as statistical inference on $\theta_1$. For this purpose we can ignore $f(x_t, \theta_2)$, provided there is no relationship between $\theta_1$ and $\theta_2$.¹ In this framework, $y$ is called the 'dependent' or 'endogenous' variable and the vector $x$ is called the vector of 'independent' or 'exogenous' variables.

In regression analysis we usually want to estimate only the first and second moments of the conditional distribution, rather than the whole parameter vector $\theta_1$ (in certain cases the first two moments characterize $\theta_1$ completely). Thus we can define the conditional mean $m(x_t, \theta_3)$ and conditional variance $g(x_t, \theta_4)$ as

$$m(x_t, \theta_3) = E(y_t \mid x_t, \theta_3) = \int_{-\infty}^{\infty} y\, f(y \mid x_t, \theta_1)\, dy,$$

$$g(x_t, \theta_4) = \int_{-\infty}^{\infty} y^2 f(y \mid x_t, \theta_1)\, dy - \left[m(x_t, \theta_3)\right]^2.$$

The conditional mean and variance are random variables, as they are functions of the random vector $x_t$. If we define $u_t$ as the difference between $y_t$ and its conditional mean,

$$u_t = y_t - m(x_t, \theta_3),$$

we obtain:

$$y_t = m(x_t, \theta_3) + u_t. \qquad (1)$$

Other than $(y_t, x_t)$ having a joint density, no assumptions have been made to develop (1).

¹In this case we say that $x$ is "weakly exogenous" for $\theta_1$.


Proposition 1 (Properties of $u_t$)
1. $E(u_t \mid x_t) = 0$,
2. $E(u_t) = 0$,
3. $E[h(x_t)u_t] = 0$ for any function $h(\cdot)$,
4. $E(x_t u_t) = 0$.

Proof. 1. By the definition of $u_t$ and the linearity of conditional expectations,²

$$E(u_t \mid x_t) = E[y_t - m(x_t) \mid x_t] = E[y_t \mid x_t] - E[m(x_t) \mid x_t] = m(x_t) - m(x_t) = 0.$$

2. By the law of iterated expectations and the first result,³

$$E(u_t) = E[E(u_t \mid x_t)] = E(0) = 0.$$

3. By essentially the same argument,

$$E[h(x_t)u_t] = E\big[E[h(x_t)u_t \mid x_t]\big] = E\big[h(x_t)E[u_t \mid x_t]\big] = E[h(x_t)\cdot 0] = 0.$$

4. Follows from the third result, setting $h(x_t) = x_t$.

Equation (1) plus the first result of Proposition 1 are often stated jointly as the regression framework:

$$y_t = m(x_t, \theta_3) + u_t, \qquad E(u_t \mid x_t) = 0.$$

This is a framework, not a model, because no restrictions have been placed on the joint distribution of the data. These equations hold true by definition.

Given that the moments $m(\cdot)$ and $g(\cdot)$ can take any shape (usually nonlinear), a regression model imposes further restrictions on the joint distribution and on $u$ (the regression error). If we assume that $m(\cdot)$ is linear we obtain what is known as the linear regression model:

$$m(x_t, \theta_3) = x_t'\beta,$$

where $\beta$ is a $k$-element vector. Finally, let

$$\underset{T\times 1}{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_T \end{pmatrix}, \qquad
\underset{T\times k}{X} = \begin{pmatrix} x_1' \\ \vdots \\ x_T' \end{pmatrix}
= \begin{pmatrix} x_{1,1} & \cdots & x_{1,k} \\ \vdots & \ddots & \vdots \\ x_{T,1} & \cdots & x_{T,k} \end{pmatrix}, \qquad
\underset{T\times 1}{u} = \begin{pmatrix} u_1 \\ \vdots \\ u_T \end{pmatrix}.$$

²The linearity of conditional expectations states that $E[g(x)y \mid x] = g(x)E[y \mid x]$.
³The law of iterated expectations states that $E[E[y \mid x, z] \mid x] = E[y \mid x]$.


Definition 1 The Linear Regression Model (LRM) is:
1. $y_t = x_t'\beta + u_t$ or $Y = X\beta + u$,
2. $E(u_t \mid x_t) = 0$,
3. $\mathrm{rank}(X) = k$ or $\det(X'X) \neq 0$,
4. $E(u_t u_s) = 0$ for all $t \neq s$.

The most important assumption of the model is the linearity of the conditional expectation. Furthermore, this framework considers that $x$ provides no information for forecasting $u$ and that $X$ is of full rank. Finally, it is assumed that $u_t$ is uncorrelated with $u_s$.⁴

Definition 2 The Homoskedastic Linear Regression Model (HLRM) is the LRM plus
5. $E(u_t^2 \mid x_t) = \sigma^2$ or $E(uu' \mid X) = \sigma^2 I_T$.

This model adds the auxiliary assumption that $g(\cdot)$ is conditionally homoskedastic.

Definition 3 The Normal Linear Regression Model (NLRM) is the LRM plus
6. $u_t \sim N(0, \sigma^2)$.

By imposing an additional assumption, this model has the advantage that exact distributional results are available for the OLS estimators and test statistics. It is not very popular in current econometric practice and, as we will see, is not necessary to derive most of the results that follow.

3 OLS Estimation

This section defines the OLS estimator of $\beta$ and shows that it is the best linear unbiased estimator. The estimation of the error variance is also discussed.

3.1 Definition of the OLS Estimators of $\beta$ and $\sigma^2$

Define the sum of squares of the residuals (SSR) function as:

$$S_T(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - 2Y'X\beta + \beta'X'X\beta.$$

The OLS estimator $\hat\beta$ minimizes $S_T(\beta)$. The First Order Necessary Conditions (FONC) for minimization are:

$$\left.\frac{\partial S_T(\beta)}{\partial \beta}\right|_{\hat\beta} = -2X'Y + 2X'X\hat\beta = 0,$$

which yield the normal equations $X'Y = X'X\hat\beta$.

⁴Occasionally, we will make the assumption of serial independence of $\{u_t\}$, which is stronger than no correlation, although both concepts are equivalent when $u$ is normal.


Proposition 2 $\arg\min_\beta S_T(\beta) = \hat\beta = (X'X)^{-1}(X'Y)$.

Proof. Using the normal equations we obtain $\hat\beta = (X'X)^{-1}(X'Y)$. To verify that $\hat\beta$ is indeed a minimum we evaluate the Second Order Sufficient Conditions (SOSC):

$$\left.\frac{\partial^2 S_T(\beta)}{\partial \beta\, \partial \beta'}\right|_{\hat\beta} = 2X'X,$$

which show that $\hat\beta$ is a minimum, as $X'X$ is a positive definite matrix.

Three important implications are derived from this theorem: First, $\hat\beta$ is a linear function of $Y$. Second, even if $X$ is a nonstochastic matrix, $\hat\beta$ is a random variable, as it depends on $Y$, which is itself a random variable. Finally, in order to obtain the OLS estimator we require $X'X$ to be of full rank.

Given $\hat\beta$, we define

$$\hat u = Y - X\hat\beta, \qquad (2)$$

and call it the vector of least squares residuals. Using $\hat u$, we can estimate $\sigma^2$ by $\hat\sigma^2 = T^{-1}\hat u'\hat u$.

Using (2), we can write

$$Y = X\hat\beta + \hat u = PY + MY,$$

where $P = X(X'X)^{-1}X'$ and $M = I - P$. Given that $\hat u$ is orthogonal to $X$ (that is, $\hat u'X = 0$), OLS can be regarded as decomposing $Y$ into two orthogonal components: a component that can be written as a linear combination of the column vectors of $X$ and a component that is orthogonal to $X$. Alternatively, we can call $PY$ the projection of $Y$ onto the space spanned by the column vectors of $X$ and $MY$ the projection of $Y$ onto the space orthogonal to $X$. These properties are illustrated in Figure 1.⁵
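To make these objects concrete, here is a minimal numerical sketch (not part of the original notes; the data are simulated and all names are illustrative) that computes $\hat\beta$ from the normal equations, builds $P$ and $M$, and checks the orthogonality and idempotency properties just described.

```python
# Illustrative sketch of OLS and the projection matrices P and M (simulated data).
import numpy as np

rng = np.random.default_rng(0)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])  # includes a constant
beta = np.array([1.0, 0.5, -0.3])
y = X @ beta + rng.normal(size=T)

# OLS estimator: solve the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Projection matrices P = X(X'X)^{-1}X' and M = I - P
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(T) - P

u_hat = M @ y                                           # least squares residuals
print(np.allclose(u_hat, y - X @ beta_hat))             # same residuals either way
print(np.allclose(X.T @ u_hat, 0))                      # residuals orthogonal to X
print(np.allclose(P @ P, P), np.allclose(M @ M, M))     # P and M are idempotent
```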

Proposition 3 Let $A$ be an $n \times r$ matrix of rank $r$. A matrix of the form $P = A(A'A)^{-1}A'$ is called a projection matrix and has the following properties:

i) $P = P' = P^2$ (hence $P$ is symmetric and idempotent),
ii) $\mathrm{rank}(P) = r$,
iii) the characteristic roots (eigenvalues) of $P$ consist of $r$ ones and $n-r$ zeros,
iv) if $Z = Ac$ for some vector $c$, then $PZ = Z$ (hence the word projection),
v) $M = I - P$ is also idempotent with rank $n-r$, its eigenvalues consist of $n-r$ ones and $r$ zeros, and if $Z = Ac$, then $MZ = 0$,
vi) $P$ can be written as $G'G$, where $GG' = I$, or as $v_1v_1' + v_2v_2' + \ldots + v_rv_r'$, where each $v_i$ is a vector and $r = \mathrm{rank}(P)$.

Proof. Left as an exercise.⁶

⁵The column space of $X$ is denoted by Col($X$).
⁶Appendix A presents this and other exercises.


[Figure 1: Orthogonal Decomposition of $Y$ — $PY$ lies in Col($X$) and $MY$ is orthogonal to Col($X$).]

3.2 Gaussian Quasi-Maximum Likelihood Estimator

Now we relate a traditional motivation for the OLS estimator. The NLRM is $y_t = x_t'\beta + u_t$ with $u_t \sim N(0, \sigma^2)$.

The density function for a single observation is

$$f(y_t \mid x_t, \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_t - x_t'\beta)^2}{2\sigma^2}},$$

and the log-likelihood for the full sample is

$$\begin{aligned}
\mathcal{L}_T(\beta, \sigma^2; Y \mid X) &= \ln\left[\prod_{t=1}^{T} f(y_t \mid x_t, \beta, \sigma^2)\right] = \sum_{t=1}^{T} \ln f(y_t \mid x_t, \beta, \sigma^2) \\
&= -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(y_t - x_t'\beta)^2 \\
&= -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}S_T(\beta).
\end{aligned}$$


Proposition 4 In the NLRM, $\hat\beta_{MLE} = \hat\beta_{OLS}$.

Proof. The FONC for the maximization of $\mathcal{L}_T(\beta, \sigma^2)$ are:

$$\left.\frac{\partial \mathcal{L}_T(\beta, \sigma^2)}{\partial \beta}\right|_{\hat\beta, \hat\sigma^2} = \frac{1}{\hat\sigma^2}\left(X'Y - X'X\hat\beta\right) = 0,$$

$$\left.\frac{\partial \mathcal{L}_T(\beta, \sigma^2)}{\partial \sigma^2}\right|_{\hat\beta, \hat\sigma^2} = -\frac{T}{2\hat\sigma^2} + \frac{(Y - X\hat\beta)'(Y - X\hat\beta)}{2\hat\sigma^4} = 0.$$

Thus, $\hat\beta_{MLE} = (X'X)^{-1}(X'Y)$ and $\hat\sigma^2 = T^{-1}\hat u'\hat u$.⁷

This result is obvious since $\mathcal{L}_T(\beta, \sigma^2)$ is a function of $\beta$ only through $S_T(\beta)$. Thus, MLE maximizes $\mathcal{L}_T(\beta, \sigma^2)$ by minimizing $S_T(\beta)$. Due to this equivalence, the OLS estimator $\hat\beta$ is frequently referred to as the "Gaussian MLE", the "Gaussian Quasi-MLE", or the "Gaussian Pseudo-MLE".⁸
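As a hedged numerical check of Proposition 4 (simulated data and assumed names, not material from the notes), one can maximize the Gaussian log-likelihood above numerically and confirm that the maximizer coincides with the OLS estimates.

```python
# Numerical check: the Gaussian MLE coincides with OLS (illustrative data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
T, k = 200, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([0.7, 1.5]) + rng.normal(size=T)

def neg_loglik(params):
    beta, log_sigma2 = params[:k], params[k]
    sigma2 = np.exp(log_sigma2)              # reparameterize so sigma^2 > 0
    u = y - X @ beta
    return 0.5 * T * np.log(2 * np.pi) + 0.5 * T * log_sigma2 + u @ u / (2 * sigma2)

res = minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
beta_mle, sigma2_mle = res.x[:k], np.exp(res.x[k])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_ols = (y - X @ beta_ols) @ (y - X @ beta_ols) / T   # sigma_hat^2 = u'u / T
print(beta_mle, beta_ols)        # equal up to optimizer tolerance
print(sigma2_mle, sigma2_ols)
```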

3.3 The Mean and Variance of $\hat\beta$ and $\hat\sigma^2$

Proposition 5 In the LRM, $E[(\hat\beta - \beta) \mid X] = 0$ and $E(\hat\beta) = \beta$.

Proof. From previous results,

$$\hat\beta = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u.$$

Then

$$E[(\hat\beta - \beta) \mid X] = E\left[(X'X)^{-1}X'u \mid X\right] = (X'X)^{-1}X'E(u \mid X) = 0.$$

Applying the law of iterated expectations, $E(\hat\beta) = E[E(\hat\beta \mid X)] = \beta$.

Thus, $\hat\beta$ is unbiased for $\beta$. Indeed, it is conditionally unbiased (conditional on $X$), which is a stronger result.

Proposition 6 In the HLRM, $V(\hat\beta \mid X) = \sigma^2(X'X)^{-1}$ and $V(\hat\beta) = \sigma^2 E\left[(X'X)^{-1}\right]$.

⁷Verify that the SOSC are satisfied.
⁸The term "quasi" ("pseudo") is used for misspecified models. In this case, the normality assumption was used to construct the likelihood and the estimator, but may be believed not to be true.


Proof. Since $\hat\beta - \beta = (X'X)^{-1}X'u$,

$$\begin{aligned}
V(\hat\beta \mid X) &= E\left[(\hat\beta - \beta)(\hat\beta - \beta)' \mid X\right] \\
&= E\left[(X'X)^{-1}X'uu'X(X'X)^{-1} \mid X\right] \\
&= (X'X)^{-1}X'E[uu' \mid X]X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}.
\end{aligned}$$

Thus, $V(\hat\beta) = E\left[V(\hat\beta \mid X)\right] + V\left[E(\hat\beta \mid X)\right] = \sigma^2 E\left[(X'X)^{-1}\right]$.

This result is derived from the assumptions that $u$ is uncorrelated and homoskedastic. The variance-covariance matrix of $\hat\beta$ measures the precision with which the relationship between $Y$ and $X$ is estimated. Some of its features are: First, and most obvious, the variance of $\hat\beta$ grows proportionally with $\sigma^2$ (the volatility of the unpredictable component). Second, although less obvious, as the sample size increases, the variance-covariance matrix of $\hat\beta$ should decrease (we will provide formal arguments in this regard when we analyze the asymptotic properties of OLS). Finally, it also depends on the volatility of the regressors; as it increases, the precision with which we measure $\beta$ is enhanced. Thus, we generally "prefer" a sample of $X$ that is more volatile, given that it would better help us uncover its association with $Y$.

Proposition 7 In the LRM, $\hat\sigma^2$ is biased.

Proof. We know that $\hat u = MY$. It is trivial to verify that $\hat u = Mu$. Then $\hat\sigma^2 = T^{-1}\hat u'\hat u = T^{-1}u'Mu$. This implies that

$$\begin{aligned}
E(\hat\sigma^2 \mid X) &= T^{-1}E[u'Mu \mid X] \\
&= T^{-1}\mathrm{tr}\,E[u'Mu \mid X] \\
&= T^{-1}E[\mathrm{tr}(u'Mu) \mid X] \\
&= T^{-1}E[\mathrm{tr}(Muu') \mid X] \\
&= T^{-1}\sigma^2\mathrm{tr}(M) \\
&= \sigma^2(T - k)T^{-1}.
\end{aligned}$$

Applying the law of iterated expectations we obtain $E(\hat\sigma^2) = \sigma^2(T - k)T^{-1}$.

To derive this result we used the facts that $\sigma^2$ is a scalar (tr denotes the trace of a matrix), that the expectation is a linear operator (thus tr and $E$ are interchangeable), that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, and that $M$ is symmetric, in which case $\mathrm{tr}(M) = \sum_{i=1}^{T}\lambda_i$, where $\lambda_i$ denotes the $i$-th eigenvalue of $M$ (here we used the results of Proposition 3).

Proposition 7 shows that $\hat\sigma^2$ is a biased estimator of $\sigma^2$. A trivial modification yields an unbiased estimator of $\sigma^2$: $\tilde\sigma^2 = (T - k)^{-1}\hat u'\hat u$.
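A small Monte Carlo sketch (illustrative only; the design below is assumed, not taken from the notes) makes Proposition 7 visible: dividing $\hat u'\hat u$ by $T$ gives a downward-biased estimate of $\sigma^2$, while dividing by $T - k$ does not.

```python
# Monte Carlo illustration of the bias of sigma_hat^2 versus sigma_tilde^2.
import numpy as np

rng = np.random.default_rng(2)
T, k, sigma2, reps = 30, 4, 2.0, 20000
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])  # fixed regressors
M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)

ssr_draws = np.empty(reps)
for r in range(reps):
    u = rng.normal(scale=np.sqrt(sigma2), size=T)
    u_hat = M @ u                        # residuals u_hat = Mu
    ssr_draws[r] = u_hat @ u_hat

print(np.mean(ssr_draws) / T)            # approx sigma2 * (T - k) / T  (biased)
print(np.mean(ssr_draws) / (T - k))      # approx sigma2                (unbiased)
```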


Proposition 8 In the NLRM, $V(\hat\sigma^2) = 2(T - k)\sigma^4 T^{-2}$.

Proof. Left as an exercise.

From the results derived so far, three facts are worth mentioning: First, with the exception of Proposition 8, none of the results derived required the assumption of normality of the error term. Second, while $\hat\sigma^2$ is biased, it coincides with the maximum likelihood estimator of the variance under the assumption of normality of $u$ and, as we will show later, it is consistent. Finally, both the variance-covariance matrix of $\hat\beta$ and $\hat\sigma^2$ depend on $\sigma^2$, which is unknown; thus in practice we estimate the variance-covariance matrix of the OLS estimators by replacing $\sigma^2$ with $\hat\sigma^2$ or $\tilde\sigma^2$. For example, the estimator of the variance-covariance matrix of $\hat\beta$ is $\hat V(\hat\beta) = \tilde\sigma^2(X'X)^{-1}$.

3.4 $\hat\beta$ is BLUE

Definition 4 Let $\hat\theta$ and $\theta^*$ be estimators of a vector parameter $\theta$. Let $A$ and $B$ be their respective mean squared error matrices; that is, $A = E(\hat\theta - \theta)(\hat\theta - \theta)'$ and $B = E(\theta^* - \theta)(\theta^* - \theta)'$. We say that $\hat\theta$ is better (or more efficient) than $\theta^*$ if $c'(B - A)c \geq 0$ for every vector $c$ and every parameter value, and $c'(B - A)c > 0$ for at least one value of $c$ and at least one value of the parameter.⁹

Once we have made precise what we mean by better, we are ready to present one of the most famous theorems in econometrics:

Theorem 1 (Gauss-Markov) The Best Linear Unbiased Estimator (BLUE) is $\hat\beta$.

Proof. Let $A = (X'X)^{-1}X'$; then $\hat\beta = AY$. Consider any other linear estimator $b$. Without loss of generality, let $b = (A + C)Y$. Then,

$$E(b \mid X) = (X'X)^{-1}X'X\beta + CX\beta = (I + CX)\beta.$$

For $b$ to be unbiased we require $CX = 0$ to hold, in which case

$$V(b \mid X) = E\left[(A + C)uu'(A + C)'\right].$$

As $(A + C)(A + C)' = (X'X)^{-1} + CC'$, we obtain

$$V(b \mid X) = V(\hat\beta \mid X) + \sigma^2 CC'.$$

⁹This definition can also be stated as $B \geq A$ for every parameter value and $B \neq A$ for at least one parameter value (in this context, $B \geq A$ means that $B - A$ is positive semi-definite and $B > A$ means that $B - A$ is positive definite).


Then $V(b \mid X) \geq V(\hat\beta \mid X)$, as $CC'$ is a positive semi-definite matrix.

Despite its popularity, the Gauss-Markov theorem is not very powerful. It restricts our quest for alternative candidates to those that are both linear and unbiased estimators. There may be a "nonlinear" or biased estimator that can do better in the metric of Definition 4. Furthermore, OLS ceases to be BLUE when the assumption of homoskedasticity is relaxed. If both homoskedasticity and normality are present, we can rely on a stronger theorem which we will discuss later (the Cramér-Rao lower bound).

3.5 Analysis of Variance (ANOVA)

By definition,

$$Y = \hat Y + \hat u.$$

Subtracting $\bar Y$ (the sample mean of $Y$) from both sides we have

$$Y - \bar Y = (\hat Y - \bar Y) + \hat u.$$

Thus

$$(Y - \bar Y)'(Y - \bar Y) = (\hat Y - \bar Y)'(\hat Y - \bar Y) + 2(\hat Y - \bar Y)'\hat u + \hat u'\hat u,$$

but $\hat Y'\hat u = Y'PMY = 0$ and $\bar Y'\hat u = \bar Y\imath'\hat u = 0$ when the model contains an intercept (more generally, if $\imath$ lies in the space spanned by $X$).¹⁰ Thus

$$(Y - \bar Y)'(Y - \bar Y) = (\hat Y - \bar Y)'(\hat Y - \bar Y) + \hat u'\hat u.$$

This is called the analysis of variance formula, often written as

$$TSS = ESS + SSR,$$

where $TSS$, $ESS$, and $SSR$ stand for "total sum of squares", "equation sum of squares", and "sum of squares of the residuals", respectively. The coefficient $R^2$ (also known as the centered coefficient of determination) is defined as

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS} = 1 - \frac{Y'MY}{Y'LY},$$

where $L = I_T - T^{-1}\imath\imath'$. Therefore, provided that the regressors include a constant, $0 \leq R^2 \leq 1$. If the regressors do not include a constant, $R^2$ can be negative because, without the benefit of an intercept, the regression could do worse (at tracking the dependent variable) than the sample mean.

¹⁰We define $\imath$ as a column vector of ones.


$R^2$ measures the percentage of the variance of $Y$ that is accounted for by the variation of the predicted value $\hat Y$. $R^2$ is typically reported in applied work and is frequently referred to as a "measure" of "goodness of fit". This label is inappropriate, as $R^2$ does not measure the adequacy or "fit" of a model.¹¹

It is not even clear that $R^2$ has an unambiguous interpretation in terms of forecast performance. To see this, note that the "explanatory" power of the models $y_t = x_t\beta + u_t$ and $y_t - x_t = x_t\gamma + u_t$ with $\gamma = \beta - 1$ is the same. The models are mathematically identical and yield the same implications and forecasts. Yet their reported $R^2$ will differ greatly. For illustration, suppose that $\beta \approx 1$. Then the $R^2$ from the second model will (nearly) equal zero, while the $R^2$ from the first model can be arbitrarily close to one. An econometrician reporting the near-unit $R^2$ from the first model might claim "success", while an econometrician reporting the $R^2 \approx 0$ from the second model might be accused of a poor fit. This difference in reporting is quite unfortunate, since the two models and their implications are mathematically identical. The bottom line is that $R^2$ is not a measure of fit and should not be interpreted as such.

Another interesting fact about $R^2$ is that it necessarily increases as regressors are added to the model. As by definition the OLS estimate minimizes the $SSR$, by adding additional regressors the $SSR$ cannot increase; it can either stay the same or (more likely) decrease. But the $TSS$ is unaffected by adding regressors, so $R^2$ either stays constant or increases. To counteract this effect, Theil proposed an adjustment, typically called $\bar R^2$ (or "adjusted" $R^2$), which penalizes model dimensionality and is defined as:

$$\bar R^2 = 1 - \frac{SSR/(T - k)}{TSS/T} = 1 - \frac{\tilde\sigma^2}{\hat\sigma^2_y}.$$

While often reported in applied work, this statistic is not used much today, as better model evaluation criteria have been developed (we will discuss this later).
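The ANOVA quantities and both versions of $R^2$ are straightforward to compute. The sketch below (simulated data, illustrative names, and the $T$-denominator convention for $\hat\sigma^2_y$ used in the formula above; many texts use $T - 1$ instead) shows one way to do it.

```python
# Illustrative computation of TSS, ESS, SSR, R^2, and adjusted R^2 (intercept included).
import numpy as np

rng = np.random.default_rng(3)
T, k = 120, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

TSS = np.sum((y - y.mean()) ** 2)     # total sum of squares
SSR = u_hat @ u_hat                   # sum of squares of the residuals
ESS = TSS - SSR                       # equation sum of squares (holds with an intercept)

R2 = 1 - SSR / TSS
R2_adj = 1 - (SSR / (T - k)) / (TSS / T)   # convention of the notes; others use TSS/(T-1)
print(R2, R2_adj)
```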

3.6 OLS Estimator of a Subset of $\beta$

Sometimes we may not be interested in obtaining estimates of the whole parameter vector, but only of a subset of $\beta$. Partition

$$X = \begin{bmatrix} X_1 & X_2 \end{bmatrix} \qquad \text{and} \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.$$

¹¹More unfortunate is the claim that $R^2$ measures the percentage of the variance of $y$ that is "explained" by the model. An econometric model, by itself, doesn't explain anything. Only the combination of a good econometric model and sound economic theory can, in principle, "explain" a phenomenon.


Then $X'X\hat\beta = X'Y$ can be written as:

$$X_1'X_1\hat\beta_1 + X_1'X_2\hat\beta_2 = X_1'Y, \qquad (3a)$$
$$X_2'X_1\hat\beta_1 + X_2'X_2\hat\beta_2 = X_2'Y. \qquad (3b)$$

Solving for $\hat\beta_2$ and reinserting in (3a) we obtain

$$\hat\beta_1 = (X_1'M_2X_1)^{-1}X_1'M_2Y \qquad \text{and} \qquad \hat\beta_2 = (X_2'M_1X_2)^{-1}X_2'M_1Y,$$

where $M_i = I - P_i = I - X_i(X_i'X_i)^{-1}X_i'$ (for $i = 1, 2$).

These results can also be derived using the following theorem:

Theorem 2 (Frisch-Waugh-Lovell) $\hat\beta_2$ and $\hat u$ can be computed using the following algorithm:
1. Regress $Y$ on $X_1$, obtain residuals $\tilde Y$,
2. Regress $X_2$ on $X_1$, obtain residuals $\tilde X_2$,
3. Regress $\tilde Y$ on $\tilde X_2$, obtain $\hat\beta_2$ and residuals $\hat u$.

Proof. Left as an exercise.

In some contexts, the Frisch-Waugh-Lovell (FWL) theorem can be used to speed up computation, but in most cases there is little computational advantage to using it.¹² There are, however, two common applications of the FWL theorem, one of which is usually presented in introductory econometrics courses: the demeaning formula for regression; the other deals with ill-conditioned problems.

The first application can be constructed as follows: Partition $X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}$, where $X_1 = \imath$ is a vector of ones and $X_2$ is the matrix of observed regressors. In this case,

$$\tilde X_2 = M_1X_2 = X_2 - \imath(\imath'\imath)^{-1}\imath'X_2 = X_2 - \bar X_2$$

and

$$\tilde Y = M_1Y = Y - \imath(\imath'\imath)^{-1}\imath'Y = Y - \bar Y,$$

which are 'demeaned'.

¹²A few decades ago, a crucial limitation for conducting OLS estimation was the computational cost of inverting even moderately sized matrices, and the FWL theorem was invoked routinely.


The FWL theorem says that $\hat\beta_2$ is the OLS estimate from a regression of $\tilde Y$ on $\tilde X_2$, or of $y_t - \bar Y$ on $x_{2t} - \bar X_2$:

$$\hat\beta_2 = \left(\sum_{t=1}^{T}(x_{2t} - \bar X_2)(x_{2t} - \bar X_2)'\right)^{-1}\left(\sum_{t=1}^{T}(x_{2t} - \bar X_2)(y_t - \bar Y)\right).$$

Thus, the OLS estimator for the slope coefficients can be obtained from a regression with demeaned data.
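The following sketch (simulated data and assumed names, not from the notes) illustrates the FWL algorithm of Theorem 2: the slope coefficients from the full regression coincide with those from regressing the $X_1$-residuals of $Y$ on the $X_1$-residuals of $X_2$.

```python
# Illustrative check of the Frisch-Waugh-Lovell theorem.
import numpy as np

rng = np.random.default_rng(4)
T = 150
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])   # partialled-out regressors
X2 = rng.normal(size=(T, 2))                             # regressors of interest
X = np.hstack([X1, X2])
y = X @ np.array([1.0, 0.2, -0.7, 0.4]) + rng.normal(size=T)

def ols(Z, w):
    return np.linalg.solve(Z.T @ Z, Z.T @ w)

beta_full = ols(X, y)                                    # full regression
M1 = np.eye(T) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
y_tilde, X2_tilde = M1 @ y, M1 @ X2                      # residuals from regressions on X1
beta2_fwl = ols(X2_tilde, y_tilde)                       # step 3 of the FWL algorithm

print(np.allclose(beta_full[2:], beta2_fwl))             # True: identical slope estimates
```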

The other application is more useful. In our analysis we assumed that $X$ is of full rank ($X'X$ is invertible). Suppose for a moment that $X_1$ is of full rank but that $X_2$ is not. In that case $\beta_2$ cannot be estimated, but $\beta_1$ still can be, as follows:

$$\hat\beta_1 = (X_1'M_2^*X_1)^{-1}X_1'M_2^*Y,$$

where $M_2^*$ is formed using $X_2^*$, which has columns equal to the maximal number of linearly independent columns of $X_2$.

4 Constrained Least Squares (CLS)

In this section we shall consider the estimation of $\beta$ and $\sigma^2$ when there are certain linear constraints on the elements of $\beta$. We shall assume that the constraints are of the form:

$$Q'\beta = c, \qquad (4)$$

where $Q$ is a $k \times q$ matrix of known constants and $c$ is a $q$-vector of known constants. We shall also assume that $q < k$ and $\mathrm{rank}(Q) = q$.

4.1 Derivation of the CLS Estimator

The CLS estimator of $\beta$, denoted by $\bar\beta$, is defined to be the value of $\beta$ that minimizes the $SSR$ subject to the constraint (4). The Lagrange expression for the CLS minimization problem is

$$\mathcal{L}(\beta, \gamma) = (Y - X\beta)'(Y - X\beta) + 2\gamma'(Q'\beta - c),$$

where $\gamma$ is a $q$-vector of Lagrange multipliers corresponding to the $q$ constraints. The FONC are

$$\left.\frac{\partial \mathcal{L}}{\partial \beta}\right|_{\bar\beta, \bar\gamma} = -2X'Y + 2X'X\bar\beta + 2Q\bar\gamma = 0,$$

$$\left.\frac{\partial \mathcal{L}}{\partial \gamma}\right|_{\bar\beta, \bar\gamma} = Q'\bar\beta - c = 0.$$


The solution for $\bar\beta$ is

$$\bar\beta = \hat\beta - (X'X)^{-1}Q\left[Q'(X'X)^{-1}Q\right]^{-1}\left(Q'\hat\beta - c\right). \qquad (5)$$

The corresponding estimator of $\sigma^2$ can be defined as

$$\bar\sigma^2 = T^{-1}(Y - X\bar\beta)'(Y - X\bar\beta).$$
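As an illustration of (5), the sketch below (assumed data; the restriction $\beta_2 + \beta_3 = 1$ is purely hypothetical) computes the CLS estimator and checks that the constraint holds exactly.

```python
# Illustrative constrained least squares via formula (5).
import numpy as np

rng = np.random.default_rng(5)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([0.5, 0.6, 0.4]) + rng.normal(size=T)

Q = np.array([[0.0], [1.0], [1.0]])     # k x q with q = 1: restricts beta_2 + beta_3
c = np.array([1.0])                     # ... to equal 1

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                                   # unconstrained OLS
A = Q.T @ XtX_inv @ Q                                          # q x q matrix Q'(X'X)^{-1}Q
beta_bar = beta_hat - XtX_inv @ Q @ np.linalg.solve(A, Q.T @ beta_hat - c)   # equation (5)

sigma2_bar = (y - X @ beta_bar) @ (y - X @ beta_bar) / T       # CLS estimate of sigma^2
print(beta_bar, Q.T @ beta_bar)                                # constraint satisfied exactly
```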

4.2 CLS as BLUE

It can be shown that (5) can be expressed as

$$\bar\beta = \beta + R(R'X'XR)^{-1}R'X'u,$$

where $R$ is a $k \times (k - q)$ matrix such that the matrix $(Q, R)$ is nonsingular and $R'Q = 0$.¹³ Therefore $\bar\beta$ is unbiased and its variance-covariance matrix is given by

$$V(\bar\beta) = \sigma^2 R(R'X'XR)^{-1}R'.$$

Now define the class of linear estimators $\beta^* = D'Y - d$, where $D'$ is a $k \times T$ matrix and $d$ is a $k$-vector. This class is broader than the class of linear estimators considered in the unconstrained case because of the additive constant $d$. We did not include $d$ previously because in the unconstrained model the unbiasedness condition would ensure $d = 0$. Here, the unbiasedness condition $E(D'Y - d) = \beta$ implies $D'X = I + GQ'$ and $d = Gc$ for some arbitrary $k \times q$ matrix $G$. We have $V(\beta^*) = \sigma^2 D'D$, and CLS is BLUE because of the identity

$$D'D - R(R'X'XR)^{-1}R' = \left[D' - R(R'X'XR)^{-1}R'X'\right]\left[D' - R(R'X'XR)^{-1}R'X'\right]',$$

where we have used $D'X = I + GQ'$ and $R'Q = 0$.

5 Inference with Linear Constraints

In this section we shall regard the linear constraints (4) as a testable hypothesis, calling it the null hypothesis. For now we will assume that the normal linear regression model holds and derive the most frequently used tests in the OLS context.¹⁴

¹³Such a matrix can always be found and is not unique; any matrix that satisfies these conditions will do.
¹⁴We will discuss the case of inference in the presence of nonlinear constraints and departures from normality of $u$ later. For the impatient: none of the results derived here change when these assumptions are relaxed (at least asymptotically).


5.1 The $t$ Test

The $t$ test is an ideal test to use when we have a single constraint, that is, $q = 1$. As we assumed that $u$ is normally distributed, so is $\hat\beta$; thus under the null hypothesis we have

$$Q'\hat\beta \sim N\left[c,\, \sigma^2 Q'(X'X)^{-1}Q\right].$$

With $q = 1$, $Q'$ is a row vector and $c$ is a scalar. Therefore

$$\frac{Q'\hat\beta - c}{\left[\sigma^2 Q'(X'X)^{-1}Q\right]^{1/2}} \sim N(0, 1). \qquad (6)$$

This is the test statistic that we would use if $\sigma$ were known. As

$$\frac{\hat u'\hat u}{\sigma^2} \sim \chi^2_{T-k}, \qquad (7)$$

and it can be shown that (6) and (7) are independent, we have:

$$t_T = \frac{Q'\hat\beta - c}{\left[\tilde\sigma^2 Q'(X'X)^{-1}Q\right]^{1/2}} \sim S_{T-k},$$

which is Student's $t$ with $T - k$ degrees of freedom. Only now have we invoked the assumption of normality of $u$ and, as shown later, it is not necessary for (6) to hold (in large samples).

If we were interested in testing a single hypothesis of the form

$$H_0 : \beta_1 = 0,$$

we would define $Q = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}'$ and $c = 0$, in which case we would obtain the familiar $t$ test

$$t_T = \frac{\hat\beta_1}{\sqrt{\hat V_{1,1}}},$$

where $\hat V_{1,1}$ is the $(1,1)$ component of the estimator of the variance-covariance matrix of $\hat\beta$.

With these tools we can construct confidence intervals $C_T$ for $\beta_i$. As $C_T$ is a function of the data, it is random. Its objective is to cover $\beta_i$ with high probability. The coverage probability is $\Pr(\beta \in C_T)$. We say that $C_T$ has $(1 - \alpha)\%$ coverage for $\beta$ if $\Pr(\beta \in C_T) \geq (1 - \alpha)$. We construct a confidence interval as follows:

$$\Pr\left[\hat\beta_i - z_{\alpha/2}\sqrt{\hat V_{i,i}} < \beta_i < \hat\beta_i + z_{\alpha/2}\sqrt{\hat V_{i,i}}\right] = 1 - \alpha.$$


Here $z_{\alpha/2}$ denotes the standard normal critical value with tail probability $\alpha/2$, and the most common choice for $\alpha$ is 0.05. If $|t_T| < z_{\alpha/2}$, we cannot reject the null hypothesis at the $\alpha\%$ significance level; otherwise the null hypothesis is rejected.

An alternative approach to reporting results is to report a p-value. The p-value for the above statistic is constructed as follows. Define the tail probability, or p-value function,

$$p_T = p(t_T) = \Pr(|Z| \geq |t_T|) = 2\left(1 - \Phi(|t_T|)\right).$$

If the p-value $p_T$ is small (close to zero), then the evidence against $H_0$ is strong. In a sense, p-values and hypothesis tests are equivalent, since $p_T \leq \alpha$ if and only if $|t_T| \geq z_{\alpha/2}$. The p-value is more general, however, in that the reader is allowed to pick the level of significance $\alpha$.¹⁵
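The following sketch (simulated data; it uses the exact $t_{T-k}$ critical value in place of $z_{\alpha/2}$, which is my assumption rather than the notes' choice) computes the $t$ statistic, its two-sided p-value, and a confidence interval for a single coefficient.

```python
# Illustrative t test, p-value, and confidence interval for one coefficient.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
T, k = 80, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([0.3, 0.0, 1.0]) + rng.normal(size=T)      # true beta_2 = 0

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
sigma2_tilde = u_hat @ u_hat / (T - k)
V_hat = sigma2_tilde * XtX_inv                              # estimated Var(beta_hat)

i = 1                                                       # test H0: beta_i = 0
t_stat = beta_hat[i] / np.sqrt(V_hat[i, i])
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=T - k))      # two-sided p-value

alpha = 0.05
crit = stats.t.ppf(1 - alpha / 2, df=T - k)
ci = (beta_hat[i] - crit * np.sqrt(V_hat[i, i]),
      beta_hat[i] + crit * np.sqrt(V_hat[i, i]))
print(t_stat, p_value, ci)
```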

A confidence interval for $\sigma^2$ can be constructed as follows:

$$\Pr\left[\frac{(T - k)\tilde\sigma^2}{\chi^2_{T-k,\,1-\alpha/2}} < \sigma^2 < \frac{(T - k)\tilde\sigma^2}{\chi^2_{T-k,\,\alpha/2}}\right] = 1 - \alpha. \qquad (8)$$

5.2 The $F$ Test

When $q > 1$ we cannot apply the $t$ test described above, and instead use a simple transformation of what is known as the Likelihood Ratio test (which we will discuss at length later). Under the null hypothesis, it can be shown that

$$\frac{S_T(\bar\beta) - S_T(\hat\beta)}{\sigma^2} \sim \chi^2_q.$$

As in the previous case, when $\sigma^2$ is not known, a finite sample correction can be made by replacing $\sigma^2$ with $\tilde\sigma^2$, in which case we have

$$\frac{S_T(\bar\beta) - S_T(\hat\beta)}{q\,\tilde\sigma^2} = \frac{T - k}{q}\,\frac{\left(Q'\hat\beta - c\right)'\left[Q'(X'X)^{-1}Q\right]^{-1}\left(Q'\hat\beta - c\right)}{\hat u'\hat u} \sim F_{q,\,T-k}. \qquad (9)$$

Once again, as in the case of $t$ tests, we reject the null hypothesis when the computed value exceeds the critical value.
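A minimal sketch of the $F$ statistic (9) follows (assumed data; the null that the last two coefficients equal zero is purely illustrative).

```python
# Illustrative F test of q linear restrictions Q'beta = c.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
T, k = 90, 4
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=T)

Q = np.zeros((k, 2)); Q[2, 0] = 1.0; Q[3, 1] = 1.0          # k x q, restricts beta_3 = beta_4 = 0
c = np.zeros(2)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat

r = Q.T @ beta_hat - c
middle = np.linalg.inv(Q.T @ XtX_inv @ Q)
q = Q.shape[1]
F = ((T - k) / q) * (r @ middle @ r) / (u_hat @ u_hat)      # equation (9)
p_value = 1 - stats.f.cdf(F, q, T - k)
print(F, p_value)
```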

5.3 Tests for Structural Breaks

Suppose we have a two-regime regression

$$Y_1 = X_1\beta_1 + u_1, \qquad Y_2 = X_2\beta_2 + u_2,$$

¹⁵GAUSS tip: to compute $p(t)$ use 2*cdfnc(t).


where the vectors and matrices have $T_1$ and $T_2$ rows, respectively ($T = T_1 + T_2$). Suppose further that

$$E\begin{bmatrix} u_1 \\ u_2 \end{bmatrix}\begin{bmatrix} u_1' & u_2' \end{bmatrix} = \begin{bmatrix} \sigma_1^2 I_{T_1} & 0 \\ 0 & \sigma_2^2 I_{T_2} \end{bmatrix}.$$

We want to test the null hypothesis $H_0 : \beta_1 = \beta_2$. First, we will derive an $F$ test assuming homoskedasticity across regimes, and later we will relax this assumption. To apply the test we define

$$Y = X\beta + u,$$

where

$$Y = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}, \qquad X = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \qquad u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}.$$

Applying (9) we obtain:

$$\frac{T_1 + T_2 - 2k}{k}\,\frac{\left(\hat\beta_1 - \hat\beta_2\right)'\left[(X_1'X_1)^{-1} + (X_2'X_2)^{-1}\right]^{-1}\left(\hat\beta_1 - \hat\beta_2\right)}{Y'\left[I - X(X'X)^{-1}X'\right]Y} \sim F_{k,\,T_1+T_2-2k}, \qquad (10)$$

where $\hat\beta_1 = (X_1'X_1)^{-1}X_1'Y_1$ and $\hat\beta_2 = (X_2'X_2)^{-1}X_2'Y_2$.

Alternatively, the same result can be derived as follows: Define the sum of squares of the residuals under the alternative of structural change,

$$S_T(\hat\beta) = Y'\left[I - X(X'X)^{-1}X'\right]Y,$$

and the sum of squares of the residuals under the null hypothesis,

$$S_T(\bar\beta) = Y'\left[I - \bar X(\bar X'\bar X)^{-1}\bar X'\right]Y,$$

where $\bar X$ stacks $X_1$ and $X_2$ (the regressor matrix obtained by imposing $\beta_1 = \beta_2$). It is easy to show that

$$\frac{T_1 + T_2 - 2k}{k}\,\frac{S_T(\bar\beta) - S_T(\hat\beta)}{S_T(\hat\beta)} \sim F_{k,\,T_1+T_2-2k}. \qquad (11)$$

In this case an unbiased estimate of $\sigma^2$ is

$$\tilde\sigma^2 = \frac{S_T(\hat\beta)}{T_1 + T_2 - 2k}.$$

Before we remove the assumption that $\sigma_1 = \sigma_2$, we will first derive a test of the equality of the variances. Under the null hypothesis (same variances across regimes) we have

$$\frac{\hat u_i'\hat u_i}{\sigma^2} \sim \chi^2_{T_i - k} \qquad \text{for } i = 1, 2.$$


Because these chi-square variables are independent, we have

$$\frac{T_2 - k}{T_1 - k}\cdot\frac{\hat u_1'\hat u_1}{\hat u_2'\hat u_2} \sim F_{T_1-k,\,T_2-k}.$$

Unlike the previous tests, a two-tailed test should be used here, because either a large or a small value of the statistic is a reason to reject the null hypothesis.
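The sketch below (assumed data and break point, not from the notes) computes the Chow-type $F$ test in its SSR form (11) together with the two-tailed variance-equality test just described.

```python
# Illustrative Chow test of H0: beta_1 = beta_2, plus the variance-equality F test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
T1, T2, k = 60, 70, 2
X1 = np.column_stack([np.ones(T1), rng.normal(size=T1)])
X2 = np.column_stack([np.ones(T2), rng.normal(size=T2)])
y1 = X1 @ np.array([1.0, 0.5]) + rng.normal(size=T1)
y2 = X2 @ np.array([1.0, 0.5]) + rng.normal(size=T2)

def ssr(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ b
    return u @ u

# Unrestricted: separate regressions per regime (the block-diagonal X above)
ssr_u = ssr(X1, y1) + ssr(X2, y2)
# Restricted: pooled regression imposing beta_1 = beta_2
X_pool, y_pool = np.vstack([X1, X2]), np.concatenate([y1, y2])
ssr_r = ssr(X_pool, y_pool)

F_chow = ((T1 + T2 - 2 * k) / k) * (ssr_r - ssr_u) / ssr_u           # equation (11)
p_chow = 1 - stats.f.cdf(F_chow, k, T1 + T2 - 2 * k)

# Two-tailed F test of equal error variances across regimes
F_var = (ssr(X1, y1) / (T1 - k)) / (ssr(X2, y2) / (T2 - k))
p_var = 2 * min(stats.f.cdf(F_var, T1 - k, T2 - k),
                1 - stats.f.cdf(F_var, T1 - k, T2 - k))
print(F_chow, p_chow, F_var, p_var)
```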

If we remove the assumption of equal variances across regimes and focus on the hypothesis of equality of the regression parameters, the tests are more involved. We will concentrate on the case in which $k = 1$, where a $t$ test is applicable. It can be shown (though this is not trivial) that

$$t_T = \frac{\hat\beta_1 - \hat\beta_2}{\sqrt{\dfrac{\sigma_1^2}{X_1'X_1} + \dfrac{\sigma_2^2}{X_2'X_2}}} \sim S_v,$$

where

$$v = \frac{\left[\dfrac{\sigma_1^2}{X_1'X_1} + \dfrac{\sigma_2^2}{X_2'X_2}\right]^2}{\dfrac{\sigma_1^4}{(T_1 - 1)(X_1'X_1)^2} + \dfrac{\sigma_2^4}{(T_2 - 1)(X_2'X_2)^2}}.$$

A cleaner way to perform this type of test is through the use of direct Likelihood Ratio tests (which we will discuss in depth later).

Even though structural change (or Chow) tests are popular, modern econometric practice is skeptical of the way in which they are described above, particularly because in these cases the econometrician sets in an ad hoc manner the point at which to split the sample. Recent theoretical and empirical applications work on treating the period of a possible break as an endogenous latent variable.

6 Prediction

We are now interested in producing out-of-sample predictions of $y_p$ (for $p > T$). In that period, the relationship will be:

$$y_p = x_p'\beta + u_p,$$

where $y_p$ and $u_p$ are scalars and $x_p$ contains the $p$th-period observations on the regressors. If we assume that the conditions outlined in the HLRM are satisfied, it is trivial to verify that the best linear predictor is $x_p'\hat\beta_T$, with $\hat\beta_T$ denoting the OLS estimator of $\beta$ conditional on the information available in period $T$.¹⁶

¹⁶"Best" is defined in terms of the candidate that minimizes the mean squared prediction error conditional on observing $x_p$.


In this case, it can be verified that, conditional on $x_p$, the mean squared prediction error is

$$E\left[(\hat y_p - y_p)^2 \mid x_p\right] = \sigma^2\left[1 + x_p'(X'X)^{-1}x_p\right].$$

In order to construct an estimator of the variance of the forecast error, replace $\sigma^2$ with $\tilde\sigma^2$. It may be thought that the construction of confidence intervals for the prediction is trivial and could be formulated as follows:

$$\Pr\left[\hat y_p - z_{\alpha/2}\sqrt{\hat V_{y_p}} < y_p < \hat y_p + z_{\alpha/2}\sqrt{\hat V_{y_p}}\right] = 1 - \alpha.$$
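The following sketch (assumed data and a hypothetical out-of-sample regressor vector $x_p$) computes the point forecast, the estimated mean squared prediction error defined above, and the naive interval just formulated.

```python
# Illustrative point forecast, forecast-error variance, and naive prediction interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
T, k = 100, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([0.5, 2.0]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
sigma2_tilde = u_hat @ u_hat / (T - k)

x_p = np.array([1.0, 0.3])                                  # hypothetical out-of-sample regressors
y_hat_p = x_p @ beta_hat                                    # point forecast
V_yp = sigma2_tilde * (1 + x_p @ XtX_inv @ x_p)             # estimated forecast error variance

z = stats.norm.ppf(0.975)                                   # z_{alpha/2} with alpha = 0.05
interval = (y_hat_p - z * np.sqrt(V_yp), y_hat_p + z * np.sqrt(V_yp))
print(y_hat_p, np.sqrt(V_yp), interval)
```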


Standard measures of out-of-sample forecasting performance include

$$RMSE = \sqrt{\frac{1}{P}\sum_{p=1}^{P}(y_p - \hat y_p)^2} \qquad \text{and} \qquad MAE = \frac{1}{P}\sum_{p=1}^{P}\left|y_p - \hat y_p\right|,$$

where RMSE stands for Root Mean Squared Error, MAE for Mean Absolute Error, and $P$ is the number of periods being forecast. These have an obvious scaling problem. Several measures that do not are based on the Theil $U$ statistic:

$$U = \sqrt{\frac{\sum_{p=1}^{P}(y_p - \hat y_p)^2}{\sum_{p=1}^{P}y_p^2}}.$$

This measure is related to $R^2$ but is not bounded by zero and one. Large values indicate a poor forecasting performance. An alternative is to compute the measure in terms of the changes in $y$:

$$U' = \sqrt{\frac{\sum_{p=1}^{P}(\Delta y_p - \Delta\hat y_p)^2}{\sum_{p=1}^{P}(\Delta y_p)^2}},$$

where

$$\Delta y_p = y_p - y_{p-1} \quad \text{and} \quad \Delta\hat y_p = \hat y_p - y_{p-1},$$

or, in percentage changes,

$$\Delta y_p = \frac{y_p - y_{p-1}}{y_{p-1}} \quad \text{and} \quad \Delta\hat y_p = \frac{\hat y_p - y_{p-1}}{y_{p-1}}.$$

These measures reflect the model's ability to track turning points in the data.

When several competing forecast models are considered, one set of them will appear more successful than another in a given dimension (say, one model has the smallest MAE for 2-step-ahead forecasts). It is inevitable then to ask how likely it is that this result is due to chance. Diebold and Mariano (1995) approach forecast comparison in this framework.
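Before turning to formal comparisons, here is an illustrative computation (with made-up actual and forecast series, not taken from the notes) of the accuracy measures defined above: RMSE, MAE, Theil's $U$, and the change-based $U'$.

```python
# Illustrative forecast accuracy measures: RMSE, MAE, Theil U, and U' on changes.
import numpy as np

rng = np.random.default_rng(10)
P = 24
y = np.cumsum(rng.normal(size=P + 1)) + 10.0       # actual series y_0, ..., y_P
y_hat = y[1:] + rng.normal(scale=0.5, size=P)      # hypothetical one-step forecasts of y_1..y_P
y_act = y[1:]

rmse = np.sqrt(np.mean((y_act - y_hat) ** 2))
mae = np.mean(np.abs(y_act - y_hat))
U = np.sqrt(np.sum((y_act - y_hat) ** 2) / np.sum(y_act ** 2))

dy = y_act - y[:-1]                                # actual changes
dy_hat = y_hat - y[:-1]                            # forecast changes
U_prime = np.sqrt(np.sum((dy - dy_hat) ** 2) / np.sum(dy ** 2))
print(rmse, mae, U, U_prime)
```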

Consider the pair of $h$-step-ahead forecast errors of models $i$ and $j$, $(\hat u_{i,p}, \hat u_{j,p})$ for $p = 1, \ldots, P$, whose quality is to be judged by the loss function $g(\hat u_{i,p})$.¹⁸ Defining $d_p = g(\hat u_{i,p}) - g(\hat u_{j,p})$, under the null hypothesis of equal forecast accuracy between models $i$ and $j$ we have $E d_p = 0$. Given the covariance-stationary realization $\{d_p\}_{p=1}^{P}$, it is natural to base a test on the observed sample mean:

$$\bar d = \frac{1}{P}\sum_{p=1}^{P}d_p.$$

Even with optimal $h$-step-ahead forecasts, the sequence of forecast errors follows an MA($h-1$) process. If the autocorrelations of order $h$ and higher are zero, the variance of $\bar d$ can be consistently estimated as follows:

$$V_{\bar d} = \frac{1}{P}\left(\hat\gamma_0 + 2\sum_{j=1}^{h-1}\hat\gamma_j\right),$$

¹⁸For example, in the case of Mean Squared Error comparison, $g(\cdot)$ is the quadratic loss function $g(\hat u_{i,p}) = \hat u_{i,p}^2$, and in the case of MAE, it is the absolute value loss function $g(\hat u_{i,p}) = |\hat u_{i,p}|$.


where $\hat\gamma_j$ is an estimate of the $j$-th autocovariance of $d_p$. The Diebold-Mariano (DM) statistic is given by

$$DM = \frac{\bar d}{\sqrt{V_{\bar d}}} \rightarrow N(0, 1)$$

under the null of equal forecast accuracy. Harvey et al. (1997) suggest modifying the DM test and using instead

$$HLN = DM\cdot\left[\frac{P + 1 - 2h + h(h-1)/P}{P}\right]^{1/2}$$

to correct size problems of DM. They also suggest using a Student's $t$ with $P - 1$ degrees of freedom instead of a standard normal to account for possible fat-tailed errors.

To test whether model $i$ is not dominated by model $j$ in terms of forecasting accuracy for the loss function $g(\cdot)$, a one-sided test of DM or HLN can be conducted, where under the null $E d_p \leq 0$. Thus, if the null is rejected, we conclude that model $j$ dominates model $i$.
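A compact sketch of the DM and HLN statistics (with simulated forecast errors and a quadratic loss, both assumed purely for illustration) follows; it mirrors the formulas above, including the $t_{P-1}$ reference distribution suggested by Harvey et al. (1997).

```python
# Illustrative Diebold-Mariano test and Harvey-Leybourne-Newbold correction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
P, h = 40, 2
u_i = rng.normal(size=P)                  # forecast errors of model i (simulated)
u_j = rng.normal(scale=1.2, size=P)       # forecast errors of model j (simulated)

d = u_i ** 2 - u_j ** 2                   # loss differential, quadratic loss g(u) = u^2
d_bar = d.mean()

# Autocovariances of d up to order h-1, then the long-run variance of d_bar
gammas = [np.sum((d[j:] - d_bar) * (d[:P - j] - d_bar)) / P for j in range(h)]
V_d = (gammas[0] + 2 * sum(gammas[1:])) / P

DM = d_bar / np.sqrt(V_d)
HLN = DM * np.sqrt((P + 1 - 2 * h + h * (h - 1) / P) / P)
p_value = 2 * (1 - stats.t.cdf(abs(HLN), df=P - 1))   # t_{P-1} reference distribution
print(DM, HLN, p_value)
```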


References

Amemiya, T. (1985). Advanced Econometrics. Harvard University Press.

Baltagi, B. (1999). Econometrics. Springer-Verlag.

Diebold, F. and R. Mariano (1995). "Comparing Predictive Accuracy," Journal of Business and Economic Statistics 13, 253-65.

Greene, W. (1993). Econometric Analysis. Macmillan.

Hansen, B. (2001). "Lecture Notes on Econometrics," Manuscript. Michigan University.

Harvey, D., S. Leybourne, and P. Newbold (1997). "Testing the Equality of Prediction Mean Square Errors," International Journal of Forecasting 13, 281-91.

Hayashi, F. (2000). Econometrics. Princeton University Press.

Lam, J. and M. Veall (2002). "Bootstrap Prediction Intervals for Single Period Regression Forecasts," International Journal of Forecasting 18, 125-30.

Mittelhammer, R., G. Judge, and D. Miller (2000). Econometric Foundations. Cambridge University Press.

Ruud, P. (2000). An Introduction to Classical Econometric Theory. Oxford University Press.


A Workout Problems

1. Prove that independence implies no correlation but that the contrary is not necessarily true. Give an example of variables that are uncorrelated but not independent.

2. Let $y$, $x$ be scalar dichotomous random variables with zero means. Define $u = y - \mathrm{Cov}(y, x)\left[V(x)\right]^{-1}x$. Prove that $E(u \mid x) = 0$. Are $u$ and $x$ independent?

3. Let $y$ be a scalar random variable and $x$ a vector random variable. Prove that $E\left[y - E(y \mid x)\right]^2 \leq E\left[y - w(x)\right]^2$ for any function $w$.

4. Prove that if $V(u_t) = \sigma^2$, then $V(\hat u_t) = (1 - h_t)\sigma^2$. Find an expression for $h_t$.

5. Prove Proposition 3.

6. Prove Proposition 8.

7. In Theorem 1 we used the fact that $(A + C)(A + C)' = (X'X)^{-1} + CC'$. Prove this.

8. Prove that when a constant is included, $R^2 = 1 - (Y'MY / Y'LY)$, with $L$ as defined in Section 3.5.

9. Derive the variance-covariance matrix of $\hat\beta_2$ defined in Section 3.6.

10. Prove Theorem 2.

11. Prove that (5) is the CLS estimator.

12. Prove that the CLS estimator can be expressed as $\bar\beta = \beta + R(R'X'XR)^{-1}R'X'u$ and obtain $V(\bar\beta)$.

13. Show that $(\hat u'\hat u)\sigma^{-2} \sim \chi^2_{T-k}$.

14. Demonstrate (8).

15. Derive equations (9), (10), and (11).

16. Prove that to test the null $H_0: \beta_i = 0$ for all $i$ except the constant, the $F$ test is equivalent to $(T - k)R^2 / \left[(1 - R^2)(k - 1)\right]$.
