
    Topic 3 : The Classical Linear Model

    Dr. Ani Dasgupta

    Department of International Business, Mass. Maritime Academy

    and

    Department of Economics, Boston University

Draft lecture notes for a graduate econometrics course offered at Boston University, Spring 2013. Please do not circulate without permission.

    1 The Setting

Assume that we have data on $n$ observations, each of the form $(y_i, x_{i1}, \ldots, x_{ik})$ with $i = 1, \ldots, n$. Thus, we have one left hand side variable and $k$ right hand side variables. As before, we will use the vector $y$ and the matrix $X$ to denote the complete stacked set of observations on the left and right hand side variables respectively. In keeping with standard notational practice, we will sometimes use the symbol $x_i'$ ($= X_{i\cdot}$) to denote the $i$-th row of $X$ and $x_i$ as the corresponding column vector obtained by transposition.

In what follows, no causality is implied from the right hand side variables to the left hand side variable. Thus, it is entirely possible to conceive of a DGP (data generating process) where $y$ is determined first and then the various $x$s are generated from it. The important thing is to make certain reasonable assumptions on the joint distribution of the variables, armed with which we may be interested in predicting the (unknown) $y$ value of a new observation using (known) values of the $x$s for that observation (even though the causality may run from left to right).

    Here are some assumptions we will invoke for various results in this topic.

Assumption 1 (Linearity) $E(y_i|x_i) = x_i'\beta$; thus the expectation of $y_i$ is linear in the corresponding $x_{ij}$s.

This assumption (A1) is sometimes written as $y_i = x_i'\beta + \epsilon_i$ with $E(\epsilon_i|x_i) = 0$, where $\epsilon_i$ is


referred to as a zero mean error/disturbance term. Of course, this is a semantic issue, since we can always define $\epsilon_i$ to be the difference between $y_i$ and the conditional expectation $E(y_i|x_i)$. Note that in the physical sciences, where controlled experiments are usually the norm, the researcher chooses the $X$ matrix, so the $x$s are non-random. In that case, the assumption is sometimes written as $E(y_i) = x_i'\beta$ (this is why the old terminology for the $X$ matrix is "design matrix"). However, in the social sciences, the right-hand side variables are typically also random/stochastic, in which case the left hand side is to be interpreted as conditional on $x_i$.

It is important to understand that the linear model is linear not really because it is linear in the regressors, but because it is linear in the parameters and the disturbance term. Suppose you are regressing consumption on income and income squared. Although you may think that this model is nonlinear in income, I can always claim that it is linear in two variables: income and income squared. There is no problem in applying the techniques discussed in these notes to such a model. Thus, if one could transform the old variables and disturbance terms into new variables and a new disturbance term so that the model is linear in these, then we would still have a linear model. A familiar example is the Cobb-Douglas production model: $y = A K^{\alpha} L^{\beta} \epsilon$. Note that the disturbance term is assumed to be multiplicative in the original model. Once we take logs, we have a linear model, with $\log(\epsilon)$ being the new disturbance term. However, if you had a model such as $y = \frac{\beta_0 + x}{\beta_1} + \epsilon$, this would not be a linear model. You might argue, "But I can always redefine $\gamma_0$ as $\beta_0/\beta_1$ and $\gamma_1$ as $1/\beta_1$ and claim that we have a linear model, can't I?" Well, you can. The problem is that you will be able to estimate the gammas with nice properties using the (ordinary) least squares technique, but the betas extracted from them will (typically) not

have the same nice properties.

Assumption 2 (Strict Exogeneity) $E(y|X) = X\beta$

This assumption (A2) requires a detailed explanation, both because it seems completely redundant given the previous assumption and because it is the one assumption that is really crucial for the OLS technique to work (in the sense of having good properties). To elaborate on the first point: A2 is actually a much stronger assumption than A1 (A2 implies A1 but is not implied by A1). Note that the first assumption is saying that $E(y_i|x_i) = x_i'\beta$ while the current one is saying $E(y_i|X) = x_i'\beta$ for each $i$. Thus A2 says that if you knew the right hand side variables for one observation, those are all that you need for calculating the expectation of the left hand side variable for that observation - knowing the right hand side variables for another observation will not make any difference to your calculations.

Note that if $\epsilon = y - X\beta$, then this assumption implies $E(\epsilon|X) = 0$.

Assumption 3 (I.I.D. observations) $((y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n))$ are i.i.d.; or, they form a random sample.

This assumption (A3), along with A1, will guarantee A2 (convince yourself!). We will see later


    that A3 is not needed to show that the OLS estimates are unbiased; A2 will suffice (but not just

A1). A3 will often be invoked for demonstrating large sample results, however. To motivate the subtle distinctions between these assumptions, consider 3 different DGPs generating bivariate models, i.e. models where there is only 1 true regressor ($k = 2$); a small simulation sketch follows the list:

1. $x_1, x_2, \ldots, x_n$ are i.i.d. $N(\mu_x, \sigma_x^2)$ random variables; $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$ variables, each independent of each $x$; and finally $y_i$ is generated via the equation $y_i = \alpha + \beta x_i + \epsilon_i$, $i = 1, \ldots, n$.

2. $x_1, x_2, \ldots, x_n$ are i.i.d. $N(\mu_x, \sigma_x^2)$ random variables. $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are not i.i.d. variables; they are autocorrelated, in the sense that each is generated from the previous one via an equation of the sort $\epsilon_i = \rho\,\epsilon_{i-1} + u_i$, where the $u_i$ sequence is i.i.d. $N(0, \sigma_u^2)$. Next, $y_i = \alpha + \beta x_i + \epsilon_i$ as before.

3. $y_i = \alpha + \beta y_{i-1} + \epsilon_i$, where $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ is an i.i.d. $N(0, \sigma^2)$ sequence.
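The following is a small Stata sketch (my own illustration, not part of the original notes) that simulates one draw of each of these three DGPs; all parameter values, including the intercept, slope, and the AR coefficient 0.7, are arbitrary choices made only for the demonstration.

clear
set seed 20130101
set obs 200
gen t = _n
tsset t
* DGP 1: i.i.d. regressor and i.i.d. errors, independent of each other
gen x  = rnormal(2, 1)
gen y1 = 1 + 0.5*x + rnormal(0, 1)
* DGP 2: same regressor, but AR(1) errors (rho = 0.7), still independent of x
gen u   = rnormal(0, 1)
gen eps = u
replace eps = 0.7*L.eps + u if t > 1
gen y2 = 1 + 0.5*x + eps
* DGP 3: a pure autoregression; the regressor is the lagged dependent variable
gen y3 = rnormal(0, 1) in 1
replace y3 = 1 + 0.5*L.y3 + rnormal(0, 1) if t > 1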

    You should now be able to verify the following facts.

Fact 1. Model 1 obeys A1-A3, model 2 obeys A1-A2 but not A3, while model 3 obeys A1 but neither A2 nor A3.

    The next fact is about some properties of the disturbance terms.

Fact 2. Under A1 and A3, i) $E(\epsilon_i) = 0$, ii) $E(\epsilon_i x_{jl}) = 0$ for all $i, j, l$, and iii) $E(\epsilon_i \epsilon_j) = 0$ for $i \neq j$.

Also note that the assertions in the last fact hold not just unconditionally, but also conditional on $X$.

According to the last fact, $E(\epsilon_i x_i) = 0$ (which implies that the covariance of $\epsilon_i$ with $x_{il}$ is 0), but this does not mean that $\epsilon_i$ and $x_i$ are independent. For example, again consider a bivariate model such that the $x_i$s are i.i.d. normal, and $y_i$ is generated from $N(\alpha + \beta x_i,\ x_i^2)$. As the second moment of $\epsilon_i$ depends on $x_i$, clearly those two random variables are not independent. Our next assumption unhinges $\epsilon_i$ further from $x_i$.

Assumption 4 (Homoskedasticity) $E(\epsilon_i^2|x_i) = \sigma^2$

This assumption (A4) is making two statements at once. First, it is saying that the variance of $\epsilon_i$ does not depend on $x_i$; the second statement it makes is that this variance is the same across observations. Note that if we also have A1 and A3, A4 tells us that $\mathrm{Var}(\epsilon|X) = \sigma^2 I$.

Our next assumption frees $\epsilon_i$ totally from the shackles of $x_i$.

Assumption 5 (Normality) $\epsilon_i|x_i \sim N(0, \sigma^2)$.


    Generally speaking, even though the mean and variance of a random variable do not depend on

the realization of another random variable, other moments could. But as the normal density is completely characterized by its mean and variance, it is no longer possible that some aspect of the distribution of $\epsilon_i$ depends on $x_i$. All the aforementioned assumptions can be summarily stated using matrix notation as $y|X \sim N(X\beta, \sigma^2 I)$.

    Our last assumption is a technical assumption on X.

    Assumption 6 (Full Rank) X is of full column rank (k) with probability 1.

This is a perfectly reasonable assumption (especially with continuous variables), which rules out variables that are linear combinations of other variables. From your study of linear algebra,

you now know that this will enable us to invert the $X'X$ matrix.

Before I leave this section, I would like to raise an important question that has probably already crossed your mind. Of course, we can never really know the true data generating process. So, suppose we have i.i.d. observations (i.e. A3), but the world is highly non-linear: $E(y_i|x_i) = f(x_i)$, where $f(\cdot)$ cannot be written as $x_i'\beta$ for some $\beta$. If we still assume a classical linear model, and calculate OLS estimates, are we getting pure garbage? Actually not. We still obtain something very useful: an estimate of the best linear predictor of the left hand side variable, given the right hand side variables. We will revisit this in a later problem set. For now, let us take the leap of faith that the assumptions we are making for the results we need do actually hold.

    Section End Questions

    1. Verify Fact 1.

    2. Verify Fact 2.

3. Provide an example of a realistic model where A6 may be violated.

4. Consider the following data generating process. First, a random variable $\mu$ is generated from a $N(0, 1)$ distribution. Then 100 independent observations from $N(\mu, 10)$ are drawn sequentially. Call these values $x_1, x_2, \ldots, x_{100}$. Next, for each $i$, $y_i$ is generated via the equation $y_i = 1 + 2x_i + \epsilon_i$, where the $\epsilon_i$s are i.i.d. noise from the uniform distribution on $[-.5, .5]$. Which of A1-A6 are satisfied? Also, can you think of modeling some actual situation using this set-up?


    2 LS Geometry Revisited: The P and M Matrices

In Topic 1, we talked about one method of estimating the unknown coefficients in our linear model: minimization of the sum of squared errors, defined as $\sum_{i=1}^{n} e_i^2$, where $e_i = y_i - \hat{y}_i = y_i - x_i'b$, with $b$ being the vector of estimated coefficients and $\hat{y}_i$ the predicted value of the left hand side variable for the $i$-th observation. We learnt from that discussion that, using either the projection approach or the matrix differentiation approach,[1] we can write the OLS coefficient vector as

$$b = (X'X)^{-1}X'y \qquad (1)$$

Let us collect some properties of OLS-related objects. Let $e$ be the vector of the residuals and $\hat{y}$ the vector of predicted values. Then from the previous equation it follows that

$$\hat{y} = Xb = X(X'X)^{-1}X'y = Py \qquad (2)$$

where $P$ is the projection matrix $X(X'X)^{-1}X'$, which projects the $y$ vector into the plane of the columns of $X$ and thereby creates the vector of predicted values. The following things can then be verified:
a) $P$ is symmetric.
b) $P$ is idempotent.[2]
c) $PX = X$ and $X'P = X'$.
d) rank$(P) = k$.
While the first three properties are a matter of routine checking, the last one follows from the fact that the rank of an idempotent matrix is equal to its trace, and $\mathrm{trace}(X(X'X)^{-1}X') = \mathrm{trace}((X'X)(X'X)^{-1}) = k$ (note that $X'X$ is $k \times k$).

Similarly, we can express the residual vector as

$$e = y - Xb = (I - X(X'X)^{-1}X')y = My \qquad (3)$$

where $M$ is the residual maker matrix $(I - X(X'X)^{-1}X')$, which creates the orthogonal complement of the $y$ vector to the plane of the columns of $X$. You should be able to verify that
a) $M$ is symmetric.
b) $M$ is idempotent.
c) $MX = 0$ and $X'M = 0$.

[1] There are two other ways to motivate this particular solution to the estimation problem in question, called the method of moments approach and the maximum likelihood approach, which we will discuss later in the course.

[2] This makes sense, since the projection of a projection must be the first projection itself!


d) rank$(M) = n - k$.
The last claim follows from the fact that $\mathrm{trace}(P) + \mathrm{trace}(M) = \mathrm{trace}(I) = n$.

Finally, it is easy to show by direct multiplication that

$$PM = MP = 0. \qquad (4)$$

Let us summarize what we have learnt: the vector of observations $y$ can be resolved into two mutually orthogonal components: $Py$ (the component along the $X$-plane) and $My$ (the component orthogonal to that plane). The first component is also the (vector of) fitted values, $\hat{y} = Xb$, and the second component is also the (vector of) errors, $e = y - Xb$. $P$ and $M$ are symmetric, idempotent matrices with ranks $k$ and $n - k$ respectively.
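As a quick illustration (my own, not from the original notes), the following Mata sketch builds $P$ and $M$ for a toy design matrix consisting of a constant and one regressor taking the values 1, 3 and 5, and checks the properties just listed.

mata:
X = (1,1 \ 1,3 \ 1,5)        // constant plus one regressor
P = X*invsym(X'X)*X'         // projection matrix
M = I(3) - P                 // residual-maker matrix
P*P - P                      // idempotence: prints a (numerically) zero matrix
M*X                          // M annihilates the columns of X
P*M                          // the two projections are orthogonal
trace(P), trace(M)           // ranks equal traces: k = 2 and n - k = 1
end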

Now let us assume that the matrix $X$ has a column of 1s, i.e. there is a constant term in our model. Then the following are true:

Fact 3. The sum of the residuals is zero.
Proof: $\iota'e = \iota'(My) = y'M\iota = 0$, where $\iota$ is the vector of 1s. This follows since $\iota$ belongs to the $X$-plane and so its orthogonal component to the plane is the null vector.

Fact 4. The regression plane passes through the center of the data, i.e. through the point $(\bar{y}, \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k)$, where a bar above a variable represents its (sample) average.
Proof: This involves showing $\bar{y} = b_1\bar{x}_1 + \ldots + b_k\bar{x}_k$, or $\frac{1}{n}\iota'y = \frac{1}{n}\iota'Xb$, which easily follows on writing $Xb = Py$ and $P'\iota = \iota$.

Fact 5. The errors are orthogonal to each of the regressors, i.e. $e'X = 0$, which coupled with Fact 3 implies that the errors are uncorrelated with each regressor.
Proof: The first part is obvious. For the second part, note that the (sample) covariance between the errors and any regressor is 0 if $e'DX = 0$, which it is since by Fact 3, $e'D = e'$ and $e'X = 0$.

Fact 6. The errors are also orthogonal to the fitted values, i.e. $e'\hat{y} = 0$, which coupled with Fact 3 implies that the fitted values are uncorrelated with the errors.
Proof: Easy - verify yourself.

    Section End Questions

1. For a bivariate model, there are three observations with the corresponding $x$ values being 1, 3 and 5. Write down explicitly the P and M matrices. Now, without doing any calculation, find another set of observational values such that, with these observations, you would have arrived at the same P and M matrices that you found before. Verify.

    2. Verify Fact 7.

3. If the model does not have a constant term, does this imply that $M\iota$ must be non-zero?


    3 Basic Statistical Properties of OLS estimates

So far, we have said nothing about why the OLS procedure, statistically speaking, is to be recommended. We now turn to that task.

There are 3 basic properties one expects from an estimator: unbiasedness, efficiency and consistency, which we now (intuitively) explain. Remember that an estimator is just a statistic - it is a random variable whose realization depends on the actual data sample. The randomness of the data imparts randomness to the estimate, giving rise to what is known as the sampling distribution of the estimator. If the sampling distribution is such that its expectation is equal to the unknown parameter the estimator is trying to estimate (whatever its true value is), we call the estimator unbiased. If two estimators are both unbiased, we tend to prefer the one that has the smaller variance, which is also referred to as the more efficient estimator. Finally, we hope that as the sample size becomes larger and larger, the probability that the estimator returns a value that differs from the true parameter by more than a certain fixed error margin eventually falls to zero. This is known as consistency. In this topic we will discuss the first two properties in relation to the OLS estimator, beginning by proving five important and useful theorems in this section.

    Theorem 1 Under A2, A6 the OLS estimator is unbiased.

Proof:
$$E(b|X) = E((X'X)^{-1}X'y\,|\,X) = (X'X)^{-1}X'E(X\beta + \epsilon\,|\,X) = \beta + (X'X)^{-1}X'E(\epsilon|X) = \beta \qquad (5)$$

This proves that the OLS estimator is conditionally unbiased. However, since for any $X$ the conditional expectation $E(b|X)$ is $\beta$, the unconditional expectation of the estimator, $E(b)$, is $\beta$ as well.

Theorem 2 Under A2-A4 and A6, $\mathrm{Var}(b\,|\,X) = \sigma^2(X'X)^{-1}$.

Proof:
$$\mathrm{Var}(b\,|\,X) = \mathrm{Var}[(X'X)^{-1}X'y\,|\,X] = ((X'X)^{-1}X')\,\mathrm{Var}(y|X)\,((X'X)^{-1}X')' = ((X'X)^{-1}X')\,\sigma^2 I\,(X(X'X)^{-1})$$
$$= \sigma^2(X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1} \qquad (6)$$


Likelihood methods, we will show that under the additional assumption of normality (A5), we can do much better; we can show that the OLS estimator is most efficient among all unbiased estimators.

    Our next task is to find a good estimator for the unknown variance of the error term.

Theorem 4 Under A2-A4 and A6, an unbiased estimate of $\sigma^2$ is $\frac{e'e}{n-k}$ (where $e$ is the vector of residuals and $e'e$ is the residual sum of squares).

    Proof: Note that

$$e = My \text{ (see equation 3 in Section 2)} \qquad (8)$$
$$= M(X\beta + \epsilon) \qquad (9)$$
$$= M\epsilon \text{ (since } MX = 0\text{)} \qquad (10)$$

Further, since $M$ is idempotent, it follows that

$$e'e = \epsilon'M\epsilon \qquad (11)$$

Hence,
$$E(e'e\,|\,X) = E(\epsilon'M\epsilon\,|\,X) = E(\mathrm{tr}(\epsilon'M\epsilon)\,|\,X) = E(\mathrm{tr}(M\epsilon\epsilon')\,|\,X) = \mathrm{tr}\,E(M\epsilon\epsilon'\,|\,X) = \mathrm{tr}\,M\,E(\epsilon\epsilon'|X) = \sigma^2\,\mathrm{tr}\,M = \sigma^2(n-k)$$
Hence, both the conditional as well as the unconditional expectation of $e'e$ is $\sigma^2(n-k)$, and our claim is proved.[3]
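A simple Monte Carlo sketch in Stata (my own addition, with arbitrarily chosen true values $\beta_1 = 1$, $\beta_2 = 2$ and $\sigma = 1$) makes Theorems 1 and 4 concrete: across many simulated samples, the average of the OLS slope should be close to 2 and the average of $e'e/(n-k)$ close to 1.

capture program drop olssim
program define olssim, rclass
    clear
    set obs 50
    gen x = rnormal()
    gen y = 1 + 2*x + rnormal()      // true beta_1 = 1, beta_2 = 2, sigma = 1
    regress y x
    return scalar b2 = _b[x]         // the OLS slope
    return scalar s2 = e(rmse)^2     // e'e/(n-k)
end
simulate b2=r(b2) s2=r(s2), reps(1000) nodots: olssim
summarize b2 s2                      // means should be near 2 and 1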

The last theorem in this section assumes A5, and derives the complete distributions of all estimators of interest, paving the way to hypothesis testing for linear models (taken up in Section 6). The proof is simple and elegant thanks to the machinery we have built up so far.

Theorem 5 Under the assumption $y|X \sim N(X\beta, \sigma^2 I)$,
a. the OLS estimator $b|X$ is distributed as $N(\beta, \sigma^2(X'X)^{-1})$

[3] You will do well to justify the transition of each line into the next for the above equations.


b. $\frac{e'e}{\sigma^2} \sim \chi^2_{n-k}$

c. $(b - \beta)$ and $e'e$ are independent (hence $b$ and $e'e$ are independent as well).

Proof: a. $b = (X'X)^{-1}X'y$ is linear in $y$, and since $y|X$ is normal, the result follows from Theorems 1 and 2.

b. Since $\epsilon \sim N(0, \sigma^2 I)$, $\epsilon/\sigma = u \sim N(0, I)$. Now, since $\frac{e'e}{\sigma^2} = u'Mu$ and $M$ is symmetric and idempotent with rank $n - k$, the result follows from Theorem 3 in Topic 1.3 (on the distribution of quadratic forms).

c. $b - \beta = \sigma(X'X)^{-1}X'u$ and $e'e$ is $\sigma^2 u'Mu$. Now the result follows from the fact that $(X'X)^{-1}X'M = 0$, upon using property c of the $M$ matrix and Theorem 5 from Topic 1.3.

    Section End Questions

    1. Prove the assertion made in Fact 7.

    2. The Gauss Markov theorem is really quite weak. What do you think would be an ideal,stronger version of the theorem? Do you think that version is true?

3. In part b) of the last theorem, can we claim what was claimed? Haven't we actually proved $\frac{e'e}{\sigma^2}\,\big|\,X \sim \chi^2_{n-k}$?

    4 R2, Decomposition of Sum of Squares and ANOVA

A linear regression model, after all, is only a model, and we might be interested in knowing how well the model fits the data. For this purpose, the measure most often used (rightly or wrongly) is $R^2$. The obvious measure of model fit would be the Residual Sum of Squares (RSS) $= e'e$, as the OLS method is based on trying to obtain a low value of this (if this were 0, that would mean that we have a perfect fit, i.e. $y_i = x_i'b$ for each observation). But this measure suffers from two important defects: a) it is not unit-free, and b) a larger number of observations will tend to increase its value (although conceptually the model fit might not worsen in the process). One solution is to define a measure which takes the variance of the errors and normalizes it by dividing it by the variance of the left hand side variable. This is a measure of model misfit, if you will. One can show that, assuming we have a constant term in the model, this measure of misfit is always bounded between 0 and 1. Finally, one minus this misfit measure is our measure of model fit, or $R^2$.


Now, let us do all this formally. Define TSS (Total Sum of Squares) as

$$TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2 = (Dy)'(Dy) = y'Dy$$

where $D$ is the deviation-producing matrix. Also define ESS (Explained Sum of Squares) as

$$ESS = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = (D\hat{y})'(D\hat{y}) = \hat{y}'D\hat{y}$$

where, recall, $\hat{y} = Xb$ is the vector of predicted values. It is easy to establish that

Fact 8. Assuming that the model has a constant term, TSS = ESS + RSS.

Proof:
$$y = \hat{y} + e$$
$$Dy = D\hat{y} + e \quad \text{(since the mean error is 0)}$$
Premultiplying by $y'$ we have
$$y'Dy = (\hat{y} + e)'D\hat{y} + (\hat{y} + e)'e$$
$$y'Dy = \hat{y}'D\hat{y} + e'e \qquad (12)$$
where we have used the fact that $e'D = (De)' = e'$ and $\hat{y}$ is orthogonal to $e$. Now, we define $R^2$ to be ESS/TSS; since all three terms in the above equation are nonnegative, it is clear why $R^2$ must range between 0 and 1.

Two remarks are in order. First, the decomposition of squares achieved in the above formula essentially says that the variance in the left-hand-side variable is partially captured by the model through the variances of the regressors, and hence those of the fitted values, and in that sense this portion is "explained" by the model. The remaining part is residual or unexplained. This explains the terminology. However, you should be aware that, unfortunately, some authors use RSS for what we call ESS (as a short form for "Regression Sum of Squares") and some use ESS for what we have called RSS (as a short form for "Error Sum of Squares").

Second, you should be keenly aware that this decomposition result requires that there is a constant term in the model. If there is no such term, there is no reason to assume that $ESS$ as defined will be less than or equal to $TSS$, and hence there is no reason to believe that $R^2$ as traditionally defined will be less than or equal to one. Some software packages (STATA, for example) use a different form of $R^2$ under those scenarios: $\frac{y'y - e'e}{y'y}$, i.e. they do not center the $y$ values anymore in calculating the TSS. This definition of $R^2$ is known as the uncentered $R^2$, as opposed to the centered version defined earlier. So, if you notice that your $R^2$ value has shot up on adding the noc option, you should not get too excited - realize that a new formula is being used in the latter case.

An interesting question is: if our model in reality had no explanatory power at all (i.e. the true betas associated with the true regressors were all actually 0), what will be the distribution


of the three quadratic forms TSS, ESS and RSS (and hence $R^2$)? Understanding this enables us a) to construct a simple test for the null hypothesis $H_0: \beta_2 = \ldots = \beta_k = 0$ and b) to make sense of the ANOVA table that always accompanies the output of all regression packages.

Fact 9. Under $H_0$, and the assumption that there is an intercept term, given $X$,
a) $TSS/\sigma^2 \sim \chi^2_{n-1}$
b) $ESS/\sigma^2 \sim \chi^2_{k-1}$
c) $RSS/\sigma^2 \sim \chi^2_{n-k}$
d) $ESS$ and $RSS$ are independent.

I will ask you to prove this in the next problem set. To proceed, use the usual tricks: define a $N(0, I)$ random vector $z$ as $z = \frac{1}{\sigma}(y - \beta_1\iota)$. Then, using the fact that $\hat{y} = Py$ and $e = My$, show that each of the three variates mentioned above is a quadratic form in $z$, with an idempotent matrix sandwiched in between $z'$ and $z$; find the rank (or trace) of these matrices, etc.

    From the above result, it immediately follows that under the null,

$$\frac{ESS/(k-1)}{RSS/(n-k)} \sim F_{k-1,\,n-k}$$

This gives a convenient test statistic for testing that the regression model is junk, and is the F-statistic reported in every regression output.

Finally, the previous fact suggests a different measure of fit which takes into account the number of regressors we have in the model. Unless one adjusts for the number of regressors, one can always improve upon the $R^2$ value by throwing more regressors into the model; after all, $R^2$ cannot go down if an additional regressor is added. So instead of defining the measure of fit as $1 - \frac{e'e}{y'Dy}$, we can use another measure, called the Adjusted $R^2$ or $\bar{R}^2$, which is given by

$$\bar{R}^2 = 1 - \frac{e'e/(n-k)}{y'Dy/(n-1)}$$

Noting that the expectation of a chi-squared variate is its degrees of freedom, one can see why, under this measure, throwing in additional regressors is unlikely to help the modeler's cause if they are really not explaining anything.
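To connect these formulas with actual regression output, here is a short Stata sketch (mine, using the bundled auto dataset purely for illustration) that recomputes $R^2$, $\bar{R}^2$ and the ANOVA F-statistic from the stored sums of squares; e(mss) and e(rss) are Stata's names for ESS and RSS.

sysuse auto, clear
quietly regress price mpg weight
display "ESS = " e(mss) "   RSS = " e(rss) "   TSS = " e(mss) + e(rss)
display "R2      = " e(mss)/(e(mss) + e(rss)) "   (reported: " e(r2) ")"
display "Adj. R2 = " 1 - (e(rss)/e(df_r))/((e(mss) + e(rss))/(e(N) - 1)) "   (reported: " e(r2_a) ")"
display "F       = " (e(mss)/e(df_m))/(e(rss)/e(df_r)) "   (reported: " e(F) ")"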

    Section End Questions

    1. Why is it true that R2 can never decrease if you add more rhs variables to your model?

2. Is the uncentered $R^2$ assured to be within 0 and 1? How about the adjusted $R^2$?

3. For the F test described above, should we reject the null if the statistic is too high or too low? Why?


    5 More LS Geometry: The Frisch-Waugh-Lovell Theorem

Economists often de-trend all their variables before running a regression. Suppose my left hand side variable and my right hand side variables are all generally increasing in time. Then chances are that even though the two sides are completely unrelated in a theoretical sense, the fact that time induces a correlation between the LHS and RHS variables will affect the regression results. So people often fit a time trend to each variable, and then work with the residuals, calling these (i.e. the residuals) the de-trended versions of the original variables. An interesting question in this context is: should we do this, or should we simply include time as a new regressor and work with all the original un-detrended variables? The surprising answer to this question is: it does not matter. This section will tell you why, via an extremely useful theorem that makes its appearance in a wide variety of contexts such as derivation of OLS formulas, specification analysis, nonlinear regression, outlier analysis, as well as panel data models. The theorem is a pure algebraic/geometrical result - it has nothing to do with the statistical assumptions we have made earlier - and indeed, our proof will exploit both the algebra and the geometry of least squares. But first, let us state the theorem, which is kind of long...

Theorem 6 (Frisch-Waugh-Lovell Theorem) Suppose we split our regressors into two groups: the first set containing $k_1$ and the second set containing $k_2$ of them. Accordingly, suppose the $X$ matrix has been partitioned as $[X_1\ X_2]$. Let $b_1$ and $b_2$ represent the $k_1$- and $k_2$-dimensional OLS coefficient (sub)vectors on these two sets of regressors. Now suppose you carry out the following procedure:

1. Regress $y$ on the variables in $X_1$; collect the residuals.

2. For each variable in $X_2$, regress it on all the variables in $X_1$ and collect the residuals (thus you are running $k_2$ regressions, each with $k_1$ variables on the right hand side, and collecting $k_2$ sets of residuals).

3. Regress the residual variable obtained in step 1 on the $k_2$ residual variables obtained in step 2.

Then,
a) in step 3 you will obtain as your coefficient vector $b_2$, the coefficient subvector you would have obtained on the second set of regressors had you run your full original regression;
b) the residual/error vector from the original regression will coincide with the residual/error vector obtained in step 3.

Proof: Let $P, P_1, P_2$ be the projection matrices onto the $X$, $X_1$, $X_2$ planes respectively. Let the corresponding orthogonal complementor (or residual-maker, or annihilator) matrices be $M, M_1, M_2$ respectively. Finally, let $e$ be the residual vector for the full, original regression. Now, we must have

$$y = X_1 b_1 + X_2 b_2 + e \qquad (13)$$


Multiplying the above by $M_1$ gives

$$M_1 y = M_1 X_2 b_2 + e \qquad (14)$$

This is because $M_1$ will annihilate the columns of $X_1$, and given that $e$ is orthogonal to each column in $X$, it is clearly orthogonal already to each column of $X_1$, so $M_1 e = e$. Multiplying the above equation by $X_2'$ gives

$$X_2'M_1 y = X_2'M_1 X_2 b_2 \qquad (15)$$

for $e$ is orthogonal to the columns of $X_2$. Hence, we can write

$$b_2 = (X_2'M_1X_2)^{-1}(X_2'M_1 y) \qquad (16)$$

Since $M_1$ is idempotent and symmetric, we can also write

$$b_2 = ((M_1X_2)'(M_1X_2))^{-1}((M_1X_2)'(M_1 y)) \qquad (17)$$

But now note that $M_1 y$ is nothing but the residual obtained in step 1, while $M_1X_2$ is nothing but the (set of) residuals obtained in step 2, and equation (17) is nothing but the formula for the OLS estimator when the former is regressed on the latter! This proves part a). To prove part b), note that given part a) and equation (14), the error vector in step 3 must be exactly $e$, the error vector in the original regression.

Here is a simple application of FWL. Suppose we have a bivariate regression model with $k = 2$ (1 true regressor plus a constant term) - hence, we are essentially fitting a line to a bivariate scatter. Can we quickly find a formula for the slope of the line without inverting $2 \times 2$ $X'X$ matrices? Here is a way. Think of the $X$ matrix as partitioned, with $X_1$ simply being $\iota$, and $X_2$ being the column of observations on the true regressor. Now $P_1 = \iota(\iota'\iota)^{-1}\iota' = \frac{1}{n}\iota\iota'$, and hence $M_1$ is nothing but the de-meaner $D$. So $M_1 y$ is just the de-meaned $y$ vector and $M_1 X_2$ is just the de-meaned vector of the true regressor. Hence, if we use the formula in equation (17), we see that

$$b_2 = \frac{\sum_{i=1}^{n}(x_{i2} - \bar{x}_2)(y_i - \bar{y})}{\sum_{i=1}^{n}(x_{i2} - \bar{x}_2)^2} \qquad (18)$$

which just happens to be the (sample) covariance of $y$ with the true regressor divided by the variance of the true regressor.
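Here is a hedged Stata sketch (my own, on simulated data with made-up coefficients) verifying both parts of the theorem: the slope from step 3 should match the coefficient on x2 in the full regression, and the step-3 residuals should coincide with the full-regression residuals.

clear
set seed 987
set obs 200
gen x1 = rnormal()
gen x2 = rnormal() + 0.5*x1          // make the two regressors correlated
gen y  = 1 + 2*x1 - 3*x2 + rnormal()
quietly regress y x1 x2              // the full regression
display "b2 from the full regression: " _b[x2]
quietly regress y x1                 // step 1: partial x1 out of y
predict ey, resid
quietly regress x2 x1                // step 2: partial x1 out of x2
predict ex2, resid
regress ey ex2, noconstant           // step 3: slope equals b2; residuals coincide with e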

    Section End Questions

    1. Verify the FWL theorem by creating a simulated data set and running a series of regres-sions in STATA.

2. For the bivariate model, we just figured out how to obtain a formula for $b_2$. Can you now quickly write down a formula for $b_1$?


    6 Hypothesis Testing in the Linear Model

In Section 3, by way of Theorem 5, we laid all the groundwork for performing tests of (linear) hypotheses in the linear model, which we now exploit. Note that the results stated in that theorem require the normality assumption (unlike the theorems on the unbiasedness and efficiency aspects of the OLS estimator).

The simplest hypothesis one can think of testing is on a particular coefficient. Consider the question of testing the null hypothesis $H_0: \beta_j = c$ against the alternative $H_1: \beta_j \neq c$, where $c$ is some given number and $\beta_j$ is the $j$-th coefficient (in the special case where $c = 0$, we call such a test the test of significance of the $j$-th coefficient). The idea of the test is to construct a statistic which, under the null hypothesis, will be t-distributed with $n - k$ degrees of freedom, and then to reject the null at the $\alpha\%$ level of significance if the computed statistic lies outside the interval $(-t_{\alpha/2},\ t_{\alpha/2})$, with $t_{\alpha/2}$ being a number such that a t-variate with $n - k$ degrees of freedom has $\alpha/2$ probability of exceeding it.[4]

How do we create such a t-statistic? Let us refer to the matrix $(X'X)^{-1}$ as $V$. Then according to Theorem 5, part a),

$$\frac{b_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0, 1) \qquad (19)$$

where $V_{jj}$ is the $j$-th diagonal element of $V$. Under the null, $\beta_j$ is $c$, so

$$\frac{b_j - c}{\sigma\sqrt{V_{jj}}} \sim N(0, 1) \qquad (20)$$

    Also, according to theorem 5, part b),

$$\frac{e'e}{\sigma^2} \sim \chi^2_{n-k} \qquad (21)$$

And finally, according to part c) of Theorem 5, the $b$ vector (and hence all components of it) is independent of this chi-squared variate. Since a standard normal variate divided by the square root of a chi-squared variate (which has been divided by its degrees of freedom) is $t_{n-k}$ (as long as the two variates are independent), we have:

$$\frac{\dfrac{b_j - c}{\sigma\sqrt{V_{jj}}}}{\sqrt{\dfrac{e'e}{\sigma^2(n-k)}}} = \frac{b_j - c}{\sqrt{\dfrac{e'e}{n-k}}\,\sqrt{V_{jj}}} \sim t_{n-k} \qquad (22)$$

Notice that the above expression is exactly $b_j - c$ divided by the estimated standard deviation of $b_j$, since the true standard deviation is $\sigma\sqrt{V_{jj}}$ and an estimate of $\sigma$ is $\sqrt{\frac{e'e}{n-k}}$ (Theorem

[4] Of course, this kind of critical region is to be used only for two-tailed tests, where we are testing whether $\beta_j = c$. If we were to test $\beta_j \geq c$ (respectively $\beta_j \leq c$), we will reject if the computed t-statistic is smaller (respectively larger) than $-t_\alpha$ (respectively $t_\alpha$). These tests will be one-tailed tests.


4). This estimated standard deviation of $b_j$ is called its standard error. In most cases, when an OLS estimate is reported, conventionally either the standard error or the t-statistic for $c = 0$ (which is $\frac{b_j}{s.e.(b_j)}$) is also reported. The following table displays $t_{.025}$ and $t_{.01}$ values for various degrees of freedom:

d.f.    t.025       t.01
3       3.1824463   4.5407029
5       2.5705818   3.36493
10      2.2281388   2.7637695
15      2.1314495   2.6024803
30      2.0422724   2.4572615
50      2.0085591   2.4032719
100     1.9839715   2.3642174
200     1.9718962   2.3451371
300     1.9679029   2.3388419

The table shows the justification for the following rule of thumb used for medium-sized samples ($n - k$ around 25-30): declare a coefficient significant at the 5% level if its t-statistic is more than 2 in magnitude, or equivalently, if the standard error is less than half of the magnitude of the coefficient. Since for smaller degrees of freedom the t-distribution is fatter, and since for smaller levels of significance the critical region must be smaller, under either of these circumstances $t_{\alpha/2}$ is larger than 2, and the computed t-statistic needs to be larger (but, as you can see, hardly ever larger than 3). On the other hand, as the sample size increases, for any reasonable size of the test, $t_{\alpha/2}$ can be smaller than 2, yet it is very close to it.
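If you want entries of this table (or critical values for other degrees of freedom) on demand, Stata's invttail() function returns the upper-tail critical point directly; for example:

display invttail(3, .025)       // 3.1824..., the first t.025 entry in the table
display invttail(100, .01)      // about 2.364, the t.01 entry for 100 d.f.
display invttail(28, .025)      // roughly 2.05: the basis of the "|t| > 2" rule of thumb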

A slightly more complicated hypothesis could involve a linear combination of the coefficients: $a'\beta = c$, where $a$ is some $k$-dimensional vector, $\beta$ is the coefficient vector, and $c$ is, as before, some constant. For instance, in a model with $k = 6$, if we are interested in testing $\beta_1 - 2\beta_3 + 4\beta_6 = 7$, the vector $a$ is $(1, 0, -2, 0, 0, 4)'$ and the number $c$ is 7. We can easily handle such situations using Theorem 5 and what we know about distributions of linear functions of normally distributed random vectors. Since (given $X$)

$$b \sim N(\beta,\ \sigma^2(X'X)^{-1})$$

it follows that $a'b \sim N(a'\beta,\ \sigma^2 a'(X'X)^{-1}a)$.

Now, since under the null $a'\beta = c$, following arguments completely analogous to the ones made above, we can argue that under the null

$$\frac{a'b - c}{\sqrt{\sigma^2\, a'(X'X)^{-1}a}}$$

is distributed $N(0, 1)$ and, independent of it, $\frac{e'e}{\sigma^2}$ will be distributed as chi-squared with $n - k$ degrees of freedom. Hence, the statistic

$$t = \frac{a'b - c}{\sqrt{\dfrac{e'e}{n-k}}\,\sqrt{a'(X'X)^{-1}a}}$$


under the null, will be t-distributed with $n - k$ degrees of freedom, and we can use this as our test statistic.
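In practice, a test of a single linear combination $a'\beta = c$ does not require assembling $a$ and $(X'X)^{-1}$ by hand: after regress, Stata's lincom and test commands do the work. A hedged sketch with hypothetical variable names, mirroring the $k = 6$ example above (where $\beta_1$ is the coefficient on the constant):

quietly regress y x2 x3 x4 x5 x6
lincom _cons - 2*x3 + 4*x6           // point estimate and standard error of a'b
test  _cons - 2*x3 + 4*x6 = 7        // the test of a'beta = 7, reported in its (equivalent) F form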

We now discuss a general method of testing linear hypotheses, which can not only be used for the testing situations described above, but can also be used to test the simultaneous, or joint, validity of several linear hypotheses. Suppose our null hypothesis can be expressed in the form $R\beta = r$, where $R$ is a $q \times k$ matrix of rank $q$ and $r$ is $q \times 1$. Here are some examples of this set-up:

1. To test $H_0: \beta_3 = 5$, we set
$$R = \begin{pmatrix} 0 & 0 & 1 & 0 & \ldots & 0 \end{pmatrix}, \quad \text{and } r = 5$$

2. To test $H_0: \beta_2 + \beta_3 = 1$, we set
$$R = \begin{pmatrix} 0 & 1 & 1 & 0 & \ldots & 0 \end{pmatrix}, \quad \text{and } r = 1$$

3. To test $H_0: \beta_2 = \beta_3 = \ldots = \beta_k = 0$, we set
$$R = \begin{pmatrix} 0 & 1 & 0 & 0 & \ldots & 0 \\ 0 & 0 & 1 & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & \ldots & 1 \end{pmatrix}_{(k-1)\times k}, \quad \text{and } r = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}_{(k-1)\times 1}$$

4. To test $H_0: \beta_1 = \beta_2 = \beta_3$, we set
$$R = \begin{pmatrix} 1 & -1 & 0 & 0 & \ldots & 0 \\ 0 & 1 & -1 & 0 & \ldots & 0 \end{pmatrix}, \quad \text{and } r = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

Note that in the last two examples $R$ has as many rows as there are (independent) restrictions. This is why we assume that $R$ is of full (row) rank $q$: otherwise one would be able to express the coefficients involved in one of the restrictions as a linear combination of the coefficients in another restriction, which would imply that the former is either redundant or inconsistent with the others (which one it is depends on the values of the corresponding elements of $r$).

We will now construct an F-test for $H_0: R\beta = r$ against the alternative $H_1: R\beta \neq r$. Since, intuitively, it makes sense to reject the hypothesis if $Rb$ differs from $r$ by much (note that each is a vector here), let us examine the distribution of $Rb - r$. Now,

$$Rb - r = R(\beta + (X'X)^{-1}X'\epsilon) - r \qquad (23)$$

If the null held, we would have

$$Rb - r = R(X'X)^{-1}X'\epsilon = \sigma R(X'X)^{-1}X'u \qquad (24)$$


where $u$ is $N(0, I)$. Hence,

$$E(Rb - r) = 0 \qquad (25)$$
$$\mathrm{Var}(Rb - r) = \sigma^2(R(X'X)^{-1}X')\,I\,(X(X'X)^{-1}R') = \sigma^2 R(X'X)^{-1}R' \qquad (26)$$

Hence, by Theorem 2 in Topic 1.3,

$$\frac{1}{\sigma^2}(Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r) \sim \chi^2_q \qquad (27)$$

Now, note that using equation 24, we can write the above expression as $u'Au$, where

$$A = X(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}X' \qquad (28)$$

It is easy to check that $A$ is symmetric and idempotent. Also, recall that

$$\frac{e'e}{\sigma^2} = u'Mu \sim \chi^2_{n-k} \qquad (29)$$

where $M$ is the symmetric, idempotent residual-maker matrix. Since $MX = 0$, it is routine to check that $AM = 0$. Now, appealing to equations 27 and 29, Theorem 5 in Topic 1.3, and the definition of an F-variate, we see that

$$\frac{(Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r)\,/\,q}{e'e\,/\,(n-k)} \sim F_{q,\,n-k} \qquad (30)$$

and we just got ourselves an F-statistic (free of unknown parameters) to test $R\beta = r$. One can now test the given hypothesis by computing the above statistic and rejecting the null at the $\alpha\%$ level if the statistic exceeds the value that an F-distributed random variable with $q$ and $n - k$ degrees of freedom exceeds with $\alpha\%$ probability.
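After regress, each of the four example hypotheses listed earlier can be handed to Stata's test command, which computes exactly the F statistic in equation 30; a sketch with hypothetical regressor names x2, x3 and x4 (so that $\beta_1$ is the constant):

quietly regress y x2 x3 x4
test x3 = 5                          // example 1: a single restriction (q = 1)
test x2 + x3 = 1                     // example 2
test x2 x3 x4                        // example 3: joint significance of all true regressors
test (_cons = x2) (x2 = x3)          // example 4: beta_1 = beta_2 = beta_3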

    At this stage several natural questions come to mind.

Q1. For a single linear hypothesis of the form $\beta_j = c$ or $a'\beta = c$, are the two methods, one that uses the two-tailed t-test and one that uses the just-described one-tailed F-test, equivalent?
A1. The answer is yes. The key to seeing this is that in these cases $q = 1$, the square of the first test statistic is exactly the second test statistic, and the square of a $t_{n-k}$ variate is an $F_{1,\,n-k}$ variate.

Q2. Are these tests valid only given $X$?
A2. Although the test statistics clearly are computed using $X$, the theory developed shows that the distributions of the statistics depend only on $n$, $k$ and $q$. Hence, the tests are valid unconditionally as well. Put another way, even if the $x$s were stochastic, if we use these methods to reject at the 5% level, then on average we will be right 95% of the time.

Q3. If one believes $R\beta = r$, how does one compute the (restricted) OLS estimates?
A3. Section 8 answers that question, and in the process identifies a way of conducting the F-test above without actually computing the messy matrix expression in equation 30.


    Section End Questions

1. Provide at least 3 examples of a linear model that you may be interested in estimating, where testing a linear hypothesis such as $a'\beta = c$ may be of interest.

2. Provide at least two examples of a linear model that you may be interested in estimating where testing a linear hypothesis of the form $R\beta = r$ may be of interest.

3. How do we know that the matrix $R(X'X)^{-1}R'$ is invertible?

4. If I wish to test $\beta_2 + \beta_3 \geq 5$, can I test that using a t-test?

5. If I wish to simultaneously test that $\beta_2 + \beta_3 \geq 5$ and $\beta_3 + \beta_4 \geq 10$, can I conduct an F test of it? Any other test?

    7 Inferential Oddities

In the last few sections we have pieced together an impressive body of theory that can be used for inference in linear models. We have talked about two criteria of model fit: $R^2$ and the joint significance of the (true) regressors. Now here is a piece of bad news: these criteria do not always go hand in hand. For example, it is perfectly possible for a model to yield a very high $R^2$, yet each coefficient may turn out to be insignificant (and the ANOVA F-test may fail to establish joint significance).

    I show this by means of an (admittedly artificial) example. Consider the following dataset:

    +-------------+

    | y x2 x3 |

    |-------------|

    1. | 3 3 5 |

    2. | 1 1 1 |

    3. | 8 5 7 |

    4. | 3 2 4 |

5. | 5 4 6 |
+-------------+

    A regression run on this model yields the following output (in STATA):


Source | SS df MS Number of obs = 5
-------------+------------------------------ F( 2, 2) = 11.17

    Model | 25.7 2 12.85 Prob > F = 0.0821

    Residual | 2.3 2 1.15 R-squared = 0.9179

    -------------+------------------------------ Adj R-squared = 0.8357

    Total | 28 4 7 Root MSE = 1.0724

    ------------------------------------------------------------------------------

    y | Coef. Std. Err. t P>|t| [95% Conf. Interval]

    -------------+----------------------------------------------------------------

x2 | 1.95 1.234403 1.58 0.255 -3.361206 7.261206
x3 | -.25 .8477912 -0.29 0.796 -3.897751 3.397751

    _cons | -.7 1.174734 -0.60 0.612 -5.754473 4.354473

    ------------------------------------------------------------------------------

As if this were not disconcerting enough, we might also have the situation where all coefficients are significant (and the model passes the test of joint significance of the true regressors) but $R^2$ is absurdly low. This happens very often in large datasets with few regressors. For instance, I generated a dataset in STATA using the following commands:[5]

    set seed 123456789

    set obs 100

    gen x3 = int(10*uniform())

    gen x2 = int(10*uniform())

    gen y = .5*x2 + .5*x3+ 10*uniform()

When I ran a regression of y on x2 and x3, I got the following output:

    Source | SS df MS Number of obs = 100

    -------------+------------------------------ F( 2, 97) = 18.47

    Model | 310.790141 2 155.395071 Prob > F = 0.0000

    Residual | 816.24906 97 8.41493876 R-squared = 0.2758

    -------------+------------------------------ Adj R-squared = 0.2608

    Total | 1127.0392 99 11.3842344 Root MSE = 2.9009

[5] I set the seed so that you can replicate my dataset exactly if you don't believe what you are about to see...


------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]

    -------------+----------------------------------------------------------------

    x2 | .3611357 .101986 3.54 0.001 .1587218 .5635497

    x3 | .5083714 .1029474 4.94 0.000 .3040494 .7126934

    _cons | 5.69346 .7035135 8.09 0.000 4.297181 7.08974

    ------------------------------------------------------------------------------

So what is an applied researcher to do? Of course, one would like to have both a high $R^2$ and significance of coefficients, but in cases such as the above, most people place a lot more trust in significance of coefficients than in $R^2$. First of all, you can estimate two mathematically identical versions of the same relation and yet get widely different $R^2$ values. Secondly, it is well understood that in the social sciences many variables will not be represented in your model, which will generate a lot of noise in your left hand side variable. That should not deter us from the search for significance in those variables which theory suggests do have influence on the left hand side variable.

Section End Questions

1. Justify the statement: "...you can estimate two mathematically identical versions of the same relation and yet get widely different $R^2$."

2. The $R^2$ in the last example is about 28%. If you wanted to get an even lower $R^2$ without losing the significance of the variables, how would you have tweaked the simulation?

    8 Estimation Subject to Linear Restrictions

What are the estimates if we try to minimize the sum of squared errors subject to $R\beta = r$? To answer this, we set up the problem as

$$\text{Minimize } (y - Xb_*)'(y - Xb_*) \quad \text{subject to } Rb_* = r \qquad (31)$$

and solve it by the method of Lagrange multipliers.[6] Let $2\lambda$ be a $q \times 1$ vector of such multipliers.[7] The Lagrangian then is

$$\mathcal{L} = (y - Xb_*)'(y - Xb_*) + 2\lambda'(Rb_* - r) \qquad (32)$$

[6] Note that $b_*$ denotes the restricted estimator, whereas $b$ will continue to denote the (unrestricted) OLS estimator.

[7] The factor 2 is used just to simplify the ensuing algebra.


Taking derivatives with respect to $b_*$ and $\lambda$ and setting these to 0 gives:

$$-2X'y + 2(X'X)b_* + 2R'\lambda = 0 \qquad (33)$$
$$Rb_* - r = 0 \qquad (34)$$

Pre-multiplying both sides of equation 33 by $(X'X)^{-1}$ and re-arranging, we get

$$b_* = (X'X)^{-1}X'y - (X'X)^{-1}R'\lambda = b - (X'X)^{-1}R'\lambda \qquad (35)$$

Pre-multiplying both sides of equation 35 by $R$, we get

$$Rb_* = Rb - R(X'X)^{-1}R'\lambda \qquad (36)$$

which on using equation 34 gives

$$R(X'X)^{-1}R'\lambda = Rb - r, \quad \text{i.e. } \lambda = [R(X'X)^{-1}R']^{-1}(Rb - r) \qquad (37)$$

Using this in equation 35 we obtain

$$b_* = b - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(Rb - r) \qquad (38)$$

which is the expression for the restricted least squares estimator.

Now let $e_* = y - Xb_*$ be the errors from fitting the restricted estimator and let $e = y - Xb$ be the usual errors obtained from fitting the unrestricted estimator. We have

$$e_* = e - X(b_* - b) \qquad (39)$$

Taking the transpose and post-multiplying by $e_*$ gives

$$e_*'e_* = e'e + (b_* - b)'X'X(b_* - b) \quad (\text{Note: } e'X = 0)$$
$$= e'e + (Rb - r)'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}(X'X)(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(Rb - r)$$
$$\quad (\text{Note: } X'X \text{ and } R(X'X)^{-1}R' \text{ are symmetric.})$$
$$= e'e + (Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r) \qquad (40)$$

Hence,

$$RSS_{restricted} - RSS_{unrestricted} = (Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r) \qquad (41)$$

But now notice that the right hand side of the above equation looks very much like the numerator of the F-statistic in equation 30. Hence, this gives us a way to obtain the test statistic cleanly. It simply is

$$\frac{(RSS_{restricted} - RSS_{unrestricted})\,/\,q}{RSS_{unrestricted}\,/\,(n-k)} \qquad (42)$$

This useful device simplifies a lot of things, as we will see when studying the Nerlove paper and running dummy variable regressions and conducting tests on them.
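A small Stata sketch (my own illustration, for a hypothetical model with regressors x2 and x3) showing that the RSS-based statistic in (42) reproduces what the test command computes from equation 30; here the single restriction is that the two slopes sum to one, imposed by substitution.

quietly regress y x2 x3              // unrestricted model
scalar rss_u = e(rss)
scalar dfr   = e(df_r)
test x2 + x3 = 1                     // the F test computed as in equation 30
generate ynew = y - x3               // impose beta_2 + beta_3 = 1 by substitution
generate xnew = x2 - x3
quietly regress ynew xnew            // restricted model
scalar rss_r = e(rss)
display "F from (42) = " ((rss_r - rss_u)/1)/(rss_u/dfr)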


    Section End Questions

1. Suppose we have a simple bivariate model: $y = \beta_1 + \beta_2 x + \epsilon$. We are interested in estimating the model subject to $\beta_1 + \beta_2 = 2$. We can do it in two ways. First, we can use equation 38. Or we could create a new model: $u = \beta_1 v + \epsilon$, where $u = y - 2x$ and $v = 1 - x$. Verify that the second model is algebraically equivalent to the first. Once we obtain an estimate of $\beta_1$ from the second model, we can easily find the estimate of $\beta_2$ using the given relationship. Do you think this second procedure is valid, and will it give the same estimates as the first procedure? If you are not sure how to answer the last part theoretically, check it empirically by constructing an artificial dataset.

2. Is the restricted RSS (divided by $\sigma^2$) distributed as a chi-squared variate?

    9 Confidence Regions

From your study of statistics, you know how to construct confidence intervals for a parameter. For instance, since $\frac{b_j - \beta_j}{s.e.(b_j)}$ is t-distributed with $n - k$ degrees of freedom, you can claim that

$$\mathrm{Prob}\left[-t_{\alpha/2} \leq \frac{b_j - \beta_j}{s.e.(b_j)} \leq t_{\alpha/2}\right] = 1 - \alpha \qquad (43)$$

where $\alpha$ is a pre-specified number such as .05 or .1 and $t_{\alpha/2}$ is as defined in Section 6. A little manipulation of this yields

$$\mathrm{Prob}\left[b_j - t_{\alpha/2}\, s.e.(b_j) \leq \beta_j \leq b_j + t_{\alpha/2}\, s.e.(b_j)\right] = 1 - \alpha \qquad (44)$$

One then proceeds to claim that a $1 - \alpha$ percent confidence interval[8] is

$$[\,b_j - t_{\alpha/2}\, s.e.(b_j)\,,\ b_j + t_{\alpha/2}\, s.e.(b_j)\,]$$

[8] Statisticians are often at pains to point out that it is not quite accurate to claim that there is a 95% chance that the true $\beta_j$ is contained in this interval (for either the true $\beta_j$ is inside the interval or it is not). What is true is this: if we used this procedure to calculate 95% confidence intervals again and again from various random datasets, 95% of the time the true beta will lie inside the computed interval.

Now consider the question of computing a joint confidence region for several of the coefficients. Suppose the 95% confidence interval for $\beta_1$ is $[-1, 3]$ and the 95% confidence interval for $\beta_3$ is $[-4, 7]$. Does this mean that the 95% (joint) confidence region for the two parameters is the rectangular region $[-1, 3] \times [-4, 7]$? The answer is no, as you will soon see.

The following is a way to organize our thoughts about calculating $1 - \alpha$ percent confidence regions for any subset of the $\beta$ vector. Let $R$ be a matrix with one row for each coefficient we are


interested in, and let a row have a 1 in the $j$-th place and zeros elsewhere if it corresponds to coefficient $j$. Using the theory developed in Section 6, one can show that

$$\frac{(Rb - R\beta)'[R(X'X)^{-1}R']^{-1}(Rb - R\beta)\,/\,k_1}{e'e\,/\,(n-k)} \sim F_{k_1,\,n-k} \qquad (45)$$

where $k_1$ is the number of rows in the matrix $R$. Hence, if we construct the set

$$\left\{\beta : \frac{(Rb - R\beta)'[R(X'X)^{-1}R']^{-1}(Rb - R\beta)\,/\,k_1}{e'e\,/\,(n-k)} \leq F_{k_1,\,n-k;\,\alpha}\right\} \qquad (46)$$

where $F_{k_1,\,n-k;\,\alpha}$ is the $1 - \alpha$ percentile point of the F-distribution with $k_1$ and $n - k$ degrees of freedom, we will be assured that $1 - \alpha$ percent of the time the true parameter (subvector) will lie in this region. This is our confidence region; geometrically, this turns out to be an ellipsoid,

    not a rectangle.

Let us put this into practice using an example (adapted from Johnston). Suppose we have run a regression with $n = 5$ and $k = 3$, and obtained the following results:

$$X'X = \begin{pmatrix} 5 & 15 & 25 \\ 15 & 55 & 81 \\ 25 & 81 & 129 \end{pmatrix}, \quad b = \begin{pmatrix} 4 \\ 2.5 \\ 1.5 \end{pmatrix}, \quad RSS = 1.5$$

Suppose we are interested in finding a confidence region for $(\beta_2, \beta_3)$. We then set

$$R = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

Next, we compute the matrix $[R(X'X)^{-1}R']^{-1}$, which you can show to be

$$\begin{pmatrix} 10 & 6 \\ 6 & 4 \end{pmatrix}$$

Hence, the boundary of the 95% confidence region is:

$$\frac{\left(\,[\,2.5 - \beta_2,\ 1.5 - \beta_3\,]\begin{pmatrix} 10 & 6 \\ 6 & 4 \end{pmatrix}\begin{pmatrix} 2.5 - \beta_2 \\ 1.5 - \beta_3 \end{pmatrix}\right)/\,2}{1.5\,/\,2} = F_{2,2;.05} = 19 \qquad (47)$$

Simplification gives

$$10\beta_2^2 + 12\beta_2\beta_3 + 4\beta_3^2 - 68\beta_2 - 42\beta_3 + 88 = 0 \qquad (48)$$

And the region bounded by this curve (which is the equation of an ellipse!) is the required 95% confidence region.
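The 2 x 2 matrix used above is easy to reproduce; the following Mata sketch (mine, not from the notes) inverts the given $X'X$, picks out the block corresponding to $\beta_2$ and $\beta_3$, and inverts it back.

mata:
XX = (5, 15, 25 \ 15, 55, 81 \ 25, 81, 129)   // the given X'X
R  = (0, 1, 0 \ 0, 0, 1)                      // selects beta_2 and beta_3
invsym(R*invsym(XX)*R')                       // should print (10, 6 \ 6, 4)
end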

    Section End Questions

1. In the above example, construct the 95% confidence intervals separately for $\beta_2$ and $\beta_3$. How does the rectangle compare with the ellipse?

    2. Is it true that the 99% confidence region will always contain the 95% confidence region?


    10 Categorical and Dummy Variables

By a categorical variable we mean a variable that can take only certain discrete values (numerical or non-numerical), which we call the levels of the categorical variable. Variables which take only 0/1 values are called dummy variables. Thus a dummy variable is a categorical variable with only two levels. In this section we investigate methods to estimate the effects of various levels of a categorical variable on the left hand side variable. With multiple-level categorical variables, as we will see, the trick is to represent them by a bunch of dummy variables. We organize our discussion by breaking it into various cases.

    10.1 Case 1: One Categorical Variable - Two Levels

Suppose Hollywood actresses claim that there is a gender bias in the industry: actors are paid higher salaries compared to actresses, other things equal. A basic two-sample equality of means test does not suffice here, because it does not control for other things being equal. Suppose people really will pay a lot to see the actors act but not so much the actresses; then the salary differential could be justified! One way to test this theory is to posit the following two salary relationships:

$$y = \alpha_1 + \beta g + \epsilon \quad \text{for actors}$$

and

$$y = \alpha_2 + \beta g + \epsilon \quad \text{for actresses}$$

where $y$ is salary from a certain movie, $g$ is the gross earnings of the movie, and $\epsilon$ is a disturbance term, and then to test whether $\alpha_1 = \alpha_2$. An equivalent way to do the same thing is to posit just one relationship

$$y = \alpha_2 + \beta g + \delta D + \epsilon \quad \text{for actors and actresses}$$

where $D$ is a dummy variable taking the value 1 for actors and 0 for actresses. We then need to test whether $\delta = 0$.

10.2 Case 2: One Categorical Variable with Several Levels

Suppose we want to test if salary is affected by educational achievement. Assume that educational achievement is a categorical variable; we only know whether a particular individual is a high school graduate only, holds a bachelor's degree, or has a master's or beyond. We also allow salary to be affected by other measures such as ability (let's say this is proxied by IQ scores). We call this variable $A$. Now suppose we define three dummy variables:

$D_h = 1$ if the subject is a high school graduate only (and 0 otherwise)


Now suppose we wish to test the claim that gender does not matter in determining salary; the table tells us that we need to test $H_0: \beta_3 = 0$. On the other hand, if we wished to test that education does not matter, we would be testing $H_0: \beta_4 = 0,\ \beta_5 = 0$.

    10.4 Case 4: Case 3 Plus Interactions

A problem with the above type of model is that it implies that if there is a gender differential in salary, that differential is unaffected by education level, which may not be the case in reality. Similarly, it may be the case that higher education not only generates higher salary, but the magnitude of that effect differs between men and women. To allow for such possibilities, we create some additional dummy variables. The rules are simple: create a new dummy for every combination of old dummies, one from each category. Thus, if category 1 has 4 levels and category 2 has 5 levels, we had originally 3 + 4 = 7 dummies; now we will create 3 x 4 = 12 new dummies. To go back to our original problem, our regression equation now becomes:

$$y = \beta_1 + \beta_2 A + \beta_3 D_{male} + \beta_4 D_h + \beta_5 D_b + \beta_6 D_{male,h} + \beta_7 D_{male,b} + \epsilon$$

Now the previous table changes to:

          High School                                    Bachelors                                      Masters
Male      $\beta_1+\beta_2 A+\beta_3+\beta_4+\beta_6$    $\beta_1+\beta_2 A+\beta_3+\beta_5+\beta_7$    $\beta_1+\beta_2 A+\beta_3$
Female    $\beta_1+\beta_2 A+\beta_4$                    $\beta_1+\beta_2 A+\beta_5$                    $\beta_1+\beta_2 A$

This table allows us to test everything we could test before, but in addition it allows us to test some new hypotheses. For instance, to test that education does not affect salary, we test $H_0: \beta_4 = \beta_5 = \beta_6 = \beta_7 = 0$. To test that, if there is a sex differential in salary, it is unaffected by education, we test $H_0: \beta_6 = 0,\ \beta_7 = 0$.
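With Stata's factor-variable notation, all of these dummies and interactions can be generated on the fly. A hedged sketch with hypothetical variable names (salary, iq, a 0/1 male dummy, and educ coded 1 = high school, 2 = bachelor's, 3 = master's):

regress salary iq i.male##i.educ     // main effects plus all male-by-education interactions
testparm i.male#i.educ               // H0: any sex differential is unaffected by education
testparm i.educ i.male#i.educ        // H0: education does not affect salary at all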

    Section End Questions

    1. For the two models proposed in case 2, are we guaranteed to get the same R2 either way?

2. How can we be sure that there will not be a case of perfect multicollinearity in case 4?

3. If we had a sample of 40 men and 60 women, and we were testing the last hypothesis mentioned in the last case, what will be the degrees of freedom for the F test?


    Appendix: A Tour Guide of Regression Outputs

Consider the following (totally artificial) dataset and the corresponding regression output (here $n = 4$ and $k = 3$, including the constant term).

    +-------------+

    | y x2 x3 |

    |-------------|

    1. | 1 1 1 |

    2. | 2 0 1 |

    3. | 3 0 1 |

    4. | 4 1 0 |

    +-------------+

    . reg y x2 x3

    Source | SS df MS Number of obs = 4

    -------------+------------------------------ F( 2, 1) = 4.50

    Model | 4.5 2 2.25 Prob > F = 0.3162

    Residual | .5 1 .5 R-squared = 0.9000

    -------------+------------------------------ Adj R-squared = 0.7000

    Total | 5 3 1.66666667 Root MSE = .70711

    ------------------------------------------------------------------------------

    y | Coef. Std. Err. t P>|t| [95% Conf. Interval]

    -------------+----------------------------------------------------------------

    x2 | -1.5 .8660254 -1.73 0.333 -12.5039 9.503896

    x3 | -3 1 -3.00 0.205 -15.7062 9.706205

    _cons | 5.5 1.118034 4.92 0.128 -8.705969 19.70597

    ------------------------------------------------------------------------------

There are three other rather useful items that STATA can be asked to release at this stage: the predicted values ($\hat{y}_i$s), the residuals ($e_i$s) and the (estimated) variance-covariance matrix (that will be $\hat{\sigma}^2(X'X)^{-1}$, where $\hat{\sigma}^2$ is the estimated value of the variance of the error term - see Theorem 2).

    I now list the commands that will obtain these objects and the outputs:


. predict yhat
(option xb assumed; fitted values)

    . list yhat

    +------+

    | yhat |

    |------|

    1. | 1 |

    2. | 2.5 |

    3. | 2.5 |

4. | 4 |
+------+

    . predict e, resid

    . list e

    +----------+

    | e |

    |----------|

    1. | 8.88e-16 |

2. | -.5 |
3. | .5 |

    4. | 0 |

    +----------+

    . mat Varb = e(V)

    . mat list Varb

    symmetric Varb[3,3]

x2 x3 _cons
x2 .75

    x3 .5 1

    _cons -.75 -1 1.25

Here, yhat, e and Varb are the names I have given to the three objects (you could choose your own names; for instance, if you wanted to call your residual vector err (say), you would issue the command: predict err, resid). I should add that the value of the first residual that you see to


be 8.88e-16 should really have been zero in a perfect world, but in the world of numerical computations by computers, alas, although it is a very small number, it is not exactly 0. For practical purposes though, know that when you see such small numbers, for all you know, they could actually be 0.

Let us now go back to the regression output and try to understand each and every part of it.

You can see that there are 3 parts to the output: the table at top left, which we will call the ANOVA (Analysis of Variance) table, then some information at top right, and finally a table at the bottom. Let's go over each item in each of the three portions of the output.

The ANOVA table really creates the setting for testing the hypothesis that the regression model is no good at all, i.e. the coefficient for each true variable is 0. In this table, you see three columns: SS, df and MS. These stand for Sum of Squares, degrees of freedom and Mean Sum of Squares. The number in the Model row and the first column is then the model sum of squares, or the Explained Sum of Squares (ESS). Recall that ESS is $\hat{y}'D\hat{y}$. If you are curious, you can try to create the deviations from the mean for the yhat vector listed above; then, upon squaring and summing the deviations, you will get the number listed above: 4.5. Similarly, the SS number along the Residual row is our old friend RSS, or the sum of the squared residuals. You can check from the e vector listed above that this is indeed .5. Along the same column, we lastly have the Total Sum of Squares, which can be obtained by taking the original y vector, creating deviations from the mean, squaring and summing. Since TSS = ESS + RSS when there is an intercept term, we can also compute TSS = 4.5 + .5 = 5.

Now, recall from Fact 9 that under the null hypothesis we are interested in, ESS, RSS and TSS (divided by $\sigma^2$) are chi-squared variates with degrees of freedom $k - 1$, $n - k$ and $n - 1$ respectively. These degrees of freedom are listed in the next column of the ANOVA table. Then, in the last column, the item in column 1 is divided by the item in column 2 (so, for instance, Model MS (Mean Squares) = 4.5 / 2 = 2.25).

Let us now move to the right hand side of the ANOVA table. The first item is self-explanatory: the number of observations, which in this case is 4. Next, we have the F-statistic for testing the null hypothesis (the regression model is no good). Recall from the discussion following Fact 9 that this statistic is ESS/(k - 1) divided by RSS/(n - k), and it is supposed to be F with $k - 1$ and $n - k$ degrees of freedom. Given that we have already computed the model MS and the residual MS in the ANOVA table, this is just the former divided by the latter: 2.25 / .5 = 4.5. Note also that $k - 1 = 2$ and $n - k = 1$ in our example. This explains the second line on the right hand side. The third line gives you the p-value of the test statistic: the probability that an F(2,1) variate will take a value of 4.5 or above. This probability is apparently .3162. Since this number exceeds .05, the F-statistic does not belong to the critical region and we cannot reject the null at 5% significance (clearly, we cannot reject it at 10% either). The next items are $R^2$ and adjusted $R^2$, which we defined in Section 4 of this topic. The last item is Root


Mean Squared Error, or the square root of the Residual MS (which is .5 from the ANOVA table). The significance of this item is that it is an estimator for the unknown parameter $\sigma$ in our model (see Theorem 4).

Moving on to the table below, the first column of course lists the estimated coefficients. The next item is the standard error. The standard error is the estimated standard deviation of the associated coefficient. Recall that the variance of the OLS coefficient vector is $\sigma^2(X'X)^{-1}$; when the $\sigma$ in this expression is replaced by its estimate (the root MSE), we get the estimated variance of the OLS coefficients. This is what was captured by the matrix Varb in the output above. Now if you take its $j$-th diagonal entry and take the square root, you get the estimated standard deviation, or the standard error, of the $j$-th coefficient.

Next in the table we see the t-values. These are the t-statistics computed to test whether a certain coefficient is 0. The t-statistic turns out to be nothing but the coefficient divided by the standard error (see the discussion following equation 22 in Section 6). Under the null hypothesis that the $j$-th coefficient is 0, this statistic is distributed as a t-variate with $n - k$ degrees of freedom. In the next column, STATA answers the following question: what is the probability that a t-variate with $n - k$ df will take a value of magnitude higher than the magnitude of the t-statistic in the previous column? From your knowledge of two-tailed t-tests, you should know that if this probability is .05 or less, then we can reject the null hypothesis that the corresponding variable (or coefficient) is insignificant (at the 5% level of significance). Hence, from the above output we see that none of the variables may be deemed (statistically) significant.

Finally, the last two columns compute the confidence intervals. In Section 9, we showed that this interval is

$$[\,b_j - t_{\alpha/2}\, s.e.(b_j)\,,\ b_j + t_{\alpha/2}\, s.e.(b_j)\,]$$

where $\alpha$ is a pre-specified number like .05, and $t_{\alpha/2}$ is a number such that a t-variate with $n - k$ degrees of freedom will exceed this number with probability $\alpha/2$.