Least-Squares Estimation:
Recall that the projection of y onto C(X), the set of all vectors of the form Xb for b ∈ ℝᵏ⁺¹, yields the closest point in C(X) to y. That is, p(y|C(X)) yields the minimizer of

    Q(β) = ‖y − Xβ‖²   (the least-squares criterion)

This leads to the estimator β̂ given by the solution of

    XᵀXβ = Xᵀy   (the normal equations)

or β̂ = (XᵀX)⁻¹Xᵀy.

All of this has already been established back when we studied projections (see pp. 30–31). Alternatively, we could use calculus:
To find a stationary point (maximum, minimum, or saddle point) of Q(β), we set the partial derivative of Q(β) equal to zero and solve:

    ∂Q(β)/∂β = ∂/∂β (y − Xβ)ᵀ(y − Xβ) = ∂/∂β (yᵀy − 2yᵀXβ + βᵀ(XᵀX)β)
             = 0 − 2Xᵀy + 2XᵀXβ.

Here we've used the vector differentiation formulas ∂(cᵀz)/∂z = c and ∂(zᵀAz)/∂z = 2Az for symmetric A (see §2.14 of our text).
Setting this result equal to zero, we obtain the normal equations, which have solution β̂ = (XᵀX)⁻¹Xᵀy. That this is a minimum, rather than a maximum or saddle point, can be verified by checking the second derivative matrix of Q(β):

    ∂²Q(β)/∂β∂βᵀ = 2XᵀX,

which is positive definite (result 7, p. 54); therefore β̂ is a minimum.
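As a quick numerical sketch (made-up data; `numpy` assumed available), solving the normal equations gives the same β̂ as a QR/SVD-based least-squares routine:

```python
import numpy as np

# Illustrative data (not from the notes): n = 50 observations, k = 2 predictors.
rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # first column = intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X^T X b = X^T y directly ...
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
# ... and compare with numpy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_normal, beta_lstsq))  # → True
```

Solving XᵀXb = Xᵀy is fine for a well-conditioned design like this one; for ill-conditioned X, the QR/SVD route is numerically preferable.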
101
Example — Simple Linear Regression
Consider the case k = 1:
    yᵢ = β₀ + β₁xᵢ + eᵢ,   i = 1, . . . , n,

where e₁, . . . , eₙ are i.i.d., each with mean 0 and variance σ². Then the model equation becomes

    y = Xβ + e,

where y = (y₁, y₂, . . . , yₙ)ᵀ, e = (e₁, e₂, . . . , eₙ)ᵀ, β = (β₀, β₁)ᵀ, and X is the n × 2 matrix whose ith row is (1, xᵢ).
It follows that

    XᵀX = [ n       ∑ᵢxᵢ  ]        Xᵀy = [ ∑ᵢyᵢ   ]
          [ ∑ᵢxᵢ    ∑ᵢxᵢ² ],             [ ∑ᵢxᵢyᵢ ],

and

    (XᵀX)⁻¹ = (1 / [n∑ᵢxᵢ² − (∑ᵢxᵢ)²]) [ ∑ᵢxᵢ²   −∑ᵢxᵢ ]
                                        [ −∑ᵢxᵢ    n    ].
Therefore, β̂ = (XᵀX)⁻¹Xᵀy yields

    β̂ = ( β̂₀ ) = (1 / [n∑ᵢxᵢ² − (∑ᵢxᵢ)²]) [ (∑ᵢxᵢ²)(∑ᵢyᵢ) − (∑ᵢxᵢ)(∑ᵢxᵢyᵢ) ]
        ( β̂₁ )                             [ −(∑ᵢxᵢ)(∑ᵢyᵢ) + n∑ᵢxᵢyᵢ        ].
After a bit of algebra, these estimators simplify to

    β̂₁ = ∑ᵢ(xᵢ − x̄)(yᵢ − ȳ) / ∑ᵢ(xᵢ − x̄)² = Sxy/Sxx   and   β̂₀ = ȳ − β̂₁x̄.
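A numerical sketch on made-up data (numpy assumed): the Sxy/Sxx shortcut and the matrix formula agree.

```python
import numpy as np

# Hypothetical simple linear regression data.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.7 * x + rng.normal(scale=0.5, size=30)

# Closed-form estimators via Sxy/Sxx.
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

# Matrix formula (X^T X)^{-1} X^T y.
X = np.column_stack([np.ones_like(x), x])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose([b0, b1], b_matrix))  # → True
```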
102
In the case that X is of full rank, β̂ and μ̂ are given by

    β̂ = (XᵀX)⁻¹Xᵀy,   μ̂ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = P_C(X) y.
• Notice that both β̂ and μ̂ are linear functions of y. That is, in each case the estimator is given by some matrix times y.
Note also that

    β̂ = (XᵀX)⁻¹Xᵀy = (XᵀX)⁻¹Xᵀ(Xβ + e) = β + (XᵀX)⁻¹Xᵀe.
From this representation several important properties of the least-squares estimator β̂ follow easily:

1. (unbiasedness):

    E(β̂) = E(β + (XᵀX)⁻¹Xᵀe) = β + (XᵀX)⁻¹Xᵀ E(e) = β,   since E(e) = 0.

2. (var-cov matrix):

    var(β̂) = var(β + (XᵀX)⁻¹Xᵀe) = (XᵀX)⁻¹Xᵀ var(e) X(XᵀX)⁻¹ = σ²(XᵀX)⁻¹,   since var(e) = σ²I.
3. (normality) β̂ ∼ Nₖ₊₁(β, σ²(XᵀX)⁻¹) (if e is assumed normal).

• These three properties require increasingly strong assumptions. Property (1) holds under assumptions A1 and A2 (additive error and linearity).

• Property (2) requires, in addition, the assumption of sphericity.

• Property (3) requires assumption A5 (normality). However, later we will present a central limit theorem-like result that establishes the asymptotic normality of β̂ under certain conditions even when e is not normal.
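A small Monte Carlo sketch of properties 1 and 2 on a hypothetical fixed design (numpy assumed): averaging β̂ over repeated spherical-error draws approaches β, and its sample covariance approaches σ²(XᵀX)⁻¹.

```python
import numpy as np

# Fixed design X, true beta and sigma; repeatedly redraw spherical errors.
rng = np.random.default_rng(2)
n, R = 40, 20000
X = np.column_stack([np.ones(n), rng.uniform(0, 5, size=n)])
beta, sigma = np.array([1.0, -2.0]), 0.5
XtX_inv = np.linalg.inv(X.T @ X)

# Each row of draws is one realization of beta_hat = (X^T X)^{-1} X^T y.
Y = X @ beta + rng.normal(scale=sigma, size=(R, n))
draws = Y @ X @ XtX_inv        # valid because XtX_inv is symmetric

print(draws.mean(axis=0))      # ≈ beta (unbiasedness)
print(np.cov(draws.T))         # ≈ sigma^2 (X^T X)^{-1} (var-cov formula)
```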
103
Example — Simple Linear Regression (Continued)
Result 2 on the previous page says that for var(y) = σ²I, var(β̂) = σ²(XᵀX)⁻¹. Therefore, in the simple linear regression case,

    var( β̂₀ ) = σ²(XᵀX)⁻¹ = (σ² / [n∑ᵢxᵢ² − (∑ᵢxᵢ)²]) [ ∑ᵢxᵢ²   −∑ᵢxᵢ ]
       ( β̂₁ )                                          [ −∑ᵢxᵢ    n    ]

              = (σ² / ∑ᵢ(xᵢ − x̄)²) [ n⁻¹∑ᵢxᵢ²   −x̄ ]
                                    [ −x̄          1  ].

Thus,

    var(β̂₀) = σ² (∑ᵢxᵢ²/n) / ∑ᵢ(xᵢ − x̄)² = σ² [ 1/n + x̄²/∑ᵢ(xᵢ − x̄)² ],

    var(β̂₁) = σ² / ∑ᵢ(xᵢ − x̄)²,

and

    cov(β̂₀, β̂₁) = −σ²x̄ / ∑ᵢ(xᵢ − x̄)².
• Note that if x̄ > 0, then cov(β̂₀, β̂₁) is negative, meaning that the slope and intercept estimators are inversely related. That is, over repeated samples from the same model, the estimated intercept will tend to decrease when the estimated slope increases.
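A tiny numeric check with made-up x values (numpy assumed): the (0, 1) entry of σ²(XᵀX)⁻¹ matches −σ²x̄/∑ᵢ(xᵢ − x̄)² and is negative when x̄ > 0.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # all positive, so x-bar > 0
sigma2 = 2.0
X = np.column_stack([np.ones_like(x), x])
V = sigma2 * np.linalg.inv(X.T @ X)        # var-cov of (b0_hat, b1_hat)

cov_formula = -sigma2 * x.mean() / np.sum((x - x.mean()) ** 2)
print(V[0, 1], cov_formula)                # both ≈ -0.6
print(V[0, 1] < 0)                         # → True
```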
104
Gauss-Markov Theorem:
We have seen that in the spherical-errors, full-rank linear model, the least-squares estimator β̂ = (XᵀX)⁻¹Xᵀy is unbiased and is a linear estimator.

The following theorem states that in the class of linear and unbiased estimators, the least-squares estimator is optimal (or best) in the sense that it has minimum variance among all estimators in this class.

Gauss-Markov Theorem: Consider the linear model y = Xβ + e, where X is n × (k + 1) of rank k + 1 with n > k + 1, E(e) = 0, and var(e) = σ²I. The least-squares estimators β̂ⱼ, j = 0, 1, . . . , k (the elements of β̂ = (XᵀX)⁻¹Xᵀy) have minimum variance among all linear unbiased estimators.
Proof: Write βⱼ as βⱼ = cᵀβ, where c is the indicator vector containing a 1 in the (j+1)st position and 0's elsewhere. Then β̂ⱼ = cᵀ(XᵀX)⁻¹Xᵀy = aᵀy, where a = X(XᵀX)⁻¹c. The quantity being estimated is βⱼ = cᵀβ = cᵀ(XᵀX)⁻¹Xᵀμ = aᵀμ, where μ = Xβ.

Consider an arbitrary linear estimator β̃ⱼ = dᵀy of βⱼ. For such an estimator to be unbiased, it must satisfy E(β̃ⱼ) = E(dᵀy) = dᵀμ = aᵀμ for any μ ∈ C(X). I.e.,

    dᵀμ − aᵀμ = 0  ⇒  (d − a)ᵀμ = 0 for all μ ∈ C(X),

or (d − a) ⊥ C(X). Then

    β̃ⱼ = dᵀy = aᵀy + (d − a)ᵀy = β̂ⱼ + (d − a)ᵀy.

The random variables on the right-hand side, β̂ⱼ and (d − a)ᵀy, have covariance

    cov(aᵀy, (d − a)ᵀy) = aᵀ var(y) (d − a) = σ²aᵀ(d − a) = σ²(dᵀa − aᵀa).

Since dᵀμ = aᵀμ for any μ ∈ C(X), and a = X(XᵀX)⁻¹c ∈ C(X), it follows that dᵀa = aᵀa, so that

    cov(aᵀy, (d − a)ᵀy) = σ²(dᵀa − aᵀa) = σ²(aᵀa − aᵀa) = 0.
105
It follows that

    var(β̃ⱼ) = var(β̂ⱼ) + var((d − a)ᵀy) = var(β̂ⱼ) + σ²‖d − a‖².

Therefore, var(β̃ⱼ) ≥ var(β̂ⱼ), with equality if and only if d = a, or equivalently, if and only if β̃ⱼ = β̂ⱼ.
Comments:
1. Notice that nowhere in this proof did we make use of the specific form of c as an indicator for one of the elements of β. That is, we have proved a slightly more general result than that given in the statement of the theorem: cᵀβ̂ is the minimum variance estimator in the class of linear unbiased estimators of cᵀβ, for any vector of constants c.

2. The least-squares estimator cᵀβ̂, where β̂ = (XᵀX)⁻¹Xᵀy, is often called the B.L.U.E. (best linear unbiased estimator) of cᵀβ. Sometimes it is called the Gauss-Markov estimator.
3. The variance of the BLUE is

    var(cᵀβ̂) = σ²‖a‖² = σ²[X(XᵀX)⁻¹c]ᵀ X(XᵀX)⁻¹c = σ²[cᵀ(XᵀX)⁻¹c].

Note that this variance formula depends upon X through (XᵀX)⁻¹. Two implications of this observation are:

    – If the columns of the X matrix are mutually orthogonal, then (XᵀX)⁻¹ will be diagonal, so that the elements of β̂ are uncorrelated.

    – Even for a given set of explanatory variables, the values at which the explanatory variables are observed will affect the variance (precision) of the resulting parameter estimators.
4. What is remarkable about the Gauss-Markov Theorem is its distributional generality. It does not require normality! It says that β̂ is BLUE regardless of the distribution of e (or y), as long as we have mean-zero, spherical errors.
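To make the theorem concrete, here is a hypothetical comparison for the slope in simple linear regression: the "endpoint" estimator (yₙ − y₁)/(xₙ − x₁) is also linear and unbiased, but its variance 2σ²/(xₙ − x₁)² can never beat the OLS variance σ²/Sxx. (Illustrative numbers only; numpy assumed.)

```python
import numpy as np

x = np.linspace(0.0, 10.0, 11)              # made-up design points 0, 1, ..., 10
sigma2 = 1.0
Sxx = np.sum((x - x.mean()) ** 2)

var_ols = sigma2 / Sxx                      # variance of the OLS slope
# Endpoint estimator (y_n - y_1)/(x_n - x_1): linear, unbiased for the slope,
# with variance 2*sigma^2/(x_n - x_1)^2.
var_endpoint = 2 * sigma2 / (x[-1] - x[0]) ** 2

print(var_ols, var_endpoint)                # 1/110 vs 2/100
print(var_ols <= var_endpoint)              # → True, as Gauss-Markov guarantees
```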
106
An additional property of least-squares estimation is that the estimated mean μ̂ = X(XᵀX)⁻¹Xᵀy is invariant to (doesn't change as a result of) linear changes of scale in the explanatory variables.
That is, consider the linear models

    y = Xβ + e,   where X = [ 1  x₁₁  x₁₂  · · ·  x₁ₖ
                              1  x₂₁  x₂₂  · · ·  x₂ₖ
                              ⋮   ⋮    ⋮    ⋱     ⋮
                              1  xₙ₁  xₙ₂  · · ·  xₙₖ ],

and

    y = Zβ* + e,  where Z = [ 1  c₁x₁₁  c₂x₁₂  · · ·  cₖx₁ₖ
                              1  c₁x₂₁  c₂x₂₂  · · ·  cₖx₂ₖ
                              ⋮    ⋮      ⋮      ⋱      ⋮
                              1  c₁xₙ₁  c₂xₙ₂  · · ·  cₖxₙₖ ].
Then μ̂, the least-squares estimator of E(y), is the same in both of these two models. This follows from a more general theorem:

Theorem: In the linear model y = Xβ + e, where E(e) = 0 and X is of full rank, μ̂, the least-squares estimator of E(y), is invariant to a full-rank linear transformation of X.
Proof: A full-rank linear transformation of X is given by

    Z = XH,

where H is square and of full rank. In the original (untransformed) linear model, μ̂ = X(XᵀX)⁻¹Xᵀy = P_C(X) y. In the transformed model y = Zβ* + e, μ̂ = Z(ZᵀZ)⁻¹Zᵀy = P_C(Z) y = P_C(XH) y. So it suffices to show that P_C(X) = P_C(XH), for which it is enough that C(X) = C(XH).

If x ∈ C(XH), then x = XHb for some b, so x = Xc where c = Hb, hence x ∈ C(X); thus C(XH) ⊂ C(X). Conversely, if x ∈ C(X), then x = Xd for some d, so x = XHH⁻¹d = XHa where a = H⁻¹d, hence x ∈ C(XH); thus C(X) ⊂ C(XH). Therefore C(X) = C(XH).
• The simple case described above, where each of the xⱼ's is rescaled by a constant cⱼ, occurs when H = diag(1, c₁, c₂, . . . , cₖ).
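A quick sketch of this invariance on hypothetical data (numpy assumed): rescaling the non-intercept columns by a diagonal full-rank H leaves the fitted values unchanged.

```python
import numpy as np

# Hypothetical data; H = diag(1, c1, c2) rescales the non-intercept columns.
rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

H = np.diag([1.0, 10.0, 0.01])                 # full rank, so C(XH) = C(X)
Z = X @ H

mu_X = X @ np.linalg.solve(X.T @ X, X.T @ y)   # fitted values from X
mu_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)   # fitted values from Z = XH
print(np.allclose(mu_X, mu_Z))  # → True
```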
107
Maximum Likelihood Estimation:
Least squares provides a simple, intuitively reasonable criterion for estimation: if we want to estimate a parameter describing μ, the mean of y, then choose the parameter value that minimizes the squared distance between y and μ. If var(y) = σ²I, then the resulting estimator is BLUE (optimal, in some sense).

• Least squares is based only on assumptions concerning the mean and variance-covariance matrix (the first two moments) of y.

• Least squares tells us how to estimate parameters associated with the mean (e.g., β) but nothing about how to estimate parameters describing the variance (e.g., σ²) or other aspects of the distribution of y.
An alternative method of estimation is maximum likelihood estimation.
• Maximum likelihood requires the specification of the entire distribution of y (up to some unknown parameters), rather than just the mean and variance of that distribution.

• ML estimation provides a criterion of estimation for any parameter describing the distribution of y, including parameters describing the mean (e.g., β), the variance (σ²), or any other aspect of the distribution.

• Thus, ML estimation is simultaneously more general and less general than least squares in certain senses. It can provide estimators of all sorts of parameters in a broad array of model types, including models much more complex than those for which least squares is appropriate; but it requires stronger assumptions than least squares.
108
ML Estimation:
Suppose we have a discrete random variable Y (possibly a vector) with observed value y. Suppose Y has probability mass function

    f(y; γ) = Pr(Y = y; γ),

which depends upon an unknown p × 1 parameter vector γ taking values in a parameter space Γ.

The likelihood function L(γ; y) is defined to equal the probability mass function, but viewed as a function of γ, not y:

    L(γ; y) = f(y; γ).

Therefore, the likelihood at γ₀, say, has the interpretation

    L(γ₀; y) = Pr(Y = y when γ = γ₀)
             = Pr(observing the obtained data when γ = γ₀).

Logic of ML: choose the value of γ that makes this probability largest ⇒ γ̂, the maximum likelihood estimator or MLE.
We use the same procedure when Y is continuous, except in this context Y has a probability density function f(y; γ) rather than a p.m.f. Nevertheless, the likelihood is defined the same way, as L(γ; y) = f(y; γ), and we choose γ̂ to maximize L.

Often, our data come from a random sample, so that we observe y corresponding to Y (n × 1), a random vector. In this case, we either

(i) specify a multivariate distribution for Y directly, and then the likelihood is equal to that probability density function (e.g., we assume Y is multivariate normal, and then the likelihood would be equal to a multivariate normal density), or

(ii) use an assumption of independence among the components of Y to obtain the joint density of Y as the product of the marginal densities of its components (the Yᵢ's).
109
Under independence,

    L(γ; y) = ∏ᵢ₌₁ⁿ f(yᵢ; γ).

Since it's easier to work with sums than products, it is useful to note that in general

    argmaxγ L(γ; y) = argmaxγ log L(γ; y),   where log L(γ; y) ≡ ℓ(γ; y).

Therefore, we define an MLE of γ as a γ̂ such that

    ℓ(γ̂; y) ≥ ℓ(γ; y) for all γ ∈ Γ.

If Γ is an open set, then γ̂ must satisfy (if it exists)

    ∂ℓ(γ)/∂γⱼ = 0,   j = 1, . . . , p,

or, in vector form,

    ∂ℓ(γ; y)/∂γ = ( ∂ℓ(γ)/∂γ₁, . . . , ∂ℓ(γ)/∂γₚ )ᵀ = 0   (the likelihood equation, a.k.a. score equation).
110
In the classical linear model, the unknown parameters of the model are β and σ², so the pair (β, σ²) plays the role of γ.

Under the assumption A5 that e ∼ Nₙ(0, σ²Iₙ), it follows that y ∼ Nₙ(Xβ, σ²Iₙ), so the likelihood function is given by

    L(β, σ²; y) = (2πσ²)⁻ⁿ′² exp{ −‖y − Xβ‖² / (2σ²) },

for β ∈ ℝᵏ⁺¹ and σ² > 0. (Here the prefactor is 1/(2πσ²)^(n/2).)
The log-likelihood is a bit easier to work with, and it has the same maximizers. It is given by

    ℓ(β, σ²; y) = −(n/2) log(2π) − (n/2) log(σ²) − ‖y − Xβ‖² / (2σ²).
We can maximize this function with respect to β and σ² in two steps: first maximize with respect to β treating σ² as fixed; then plug that estimator back into the log-likelihood function and maximize with respect to σ².
For fixed σ², maximizing ℓ(β, σ²; y) is equivalent to maximizing the third term −‖y − Xβ‖²/(2σ²) or, equivalently, minimizing ‖y − Xβ‖². This is just what we do in least squares, and it leads to the estimator β̂ = (XᵀX)⁻¹Xᵀy.

Next we plug this estimator back into the log-likelihood (this gives what's known as the profile log-likelihood for σ²):

    ℓ(β̂, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − ‖y − Xβ̂‖² / (2σ²),

and maximize with respect to σ².
111
Since the 2 exponent in σ² can be a little confusing when taking derivatives, let's change symbols from σ² to φ. Then, taking the derivative and setting it equal to zero, we get the (profile) likelihood equation

    ∂ℓ/∂φ = −(n/2)/φ + ‖y − Xβ̂‖² / (2φ²) = 0,

which has solution

    φ̂ = σ̂² = (1/n)‖y − Xβ̂‖² = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − xᵢᵀβ̂)²,

where xᵢᵀ is the ith row of X.
• Note that to be sure that the solution to this equation is a maximum (rather than a minimum or saddle point) we must check that ∂²ℓ/∂φ² is negative at the solution. I leave it as an exercise for you to check that this is indeed the case.
Therefore, the MLE of (β, σ²) in the classical linear model is (β̂, σ̂²), where

    β̂ = (XᵀX)⁻¹Xᵀy

and

    σ̂² = (1/n)‖y − Xβ̂‖² = (1/n)‖y − μ̂‖²,

where μ̂ = Xβ̂.

• Note that

    σ̂² = (1/n)‖y − p(y|C(X))‖² = (1/n)‖p(y|C(X)⊥)‖².
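A numerical sketch of that last identity on hypothetical data (numpy assumed): the MLE RSS/n equals (1/n) times the squared length of the projection of y onto C(X)⊥.

```python
import numpy as np

# Hypothetical data; P projects onto C(X), I - P onto its orthogonal complement.
rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)         # projection matrix onto C(X)
resid = y - P @ y                             # y - mu_hat
sigma2_mle = resid @ resid / n                # (1/n) ||y - X beta_hat||^2
sigma2_perp = y @ (np.eye(n) - P) @ y / n     # (1/n) ||p(y | C(X)-perp)||^2
print(np.isclose(sigma2_mle, sigma2_perp))    # → True
```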
112
Estimation of σ²:

• Maximum likelihood estimation provides a unified approach to estimating all parameters in the model, β and σ².

• In contrast, least squares estimation only provides an estimator of β.

We've seen that the LS and ML estimators of β coincide. However, the MLE of σ² is not the usually preferred estimator of σ² and is not the estimator of σ² that is typically combined with LS estimation of β.

Why not?

Because σ̂² is biased.

That E(σ̂²) ≠ σ² can easily be established using our results for taking expected values of quadratic forms:
    E(σ̂²) = E( (1/n)‖P_C(X)⊥ y‖² ) = (1/n) E{ (P_C(X)⊥ y)ᵀ P_C(X)⊥ y } = (1/n) E( yᵀ P_C(X)⊥ y )

          = (1/n) [ σ² dim(C(X)⊥) + ‖P_C(X)⊥ Xβ‖² ]     (the second term is 0, because Xβ ∈ C(X))

          = (σ²/n) dim(C(X)⊥) = (σ²/n) {n − dim(C(X))} = (σ²/n) {n − rank(X)} = (σ²/n) {n − (k + 1)}.
Therefore, the MLE σ̂² is biased by a multiplicative factor of (n − k − 1)/n, and an alternative unbiased estimator of σ² can easily be constructed as

    s² ≡ (n / (n − k − 1)) σ̂² = (1 / (n − k − 1)) ‖y − Xβ̂‖²,

or, more generally (that is, for X not necessarily of full rank),

    s² = (1 / (n − rank(X))) ‖y − Xβ̂‖².
113
• s², rather than σ̂², is generally the preferred estimator of σ². In fact, it can be shown that in the spherical-errors linear model, s² is the best (minimum variance) estimator of σ² in the class of quadratic (in y) unbiased estimators.

• In the special case that Xβ = μjₙ (i.e., the model contains only an intercept, or constant term), so that C(X) = L(jₙ), we get β̂ = μ̂ = ȳ and rank(X) = 1. Therefore, s² becomes the usual sample variance from the one-sample problem:

    s² = (1/(n − 1)) ‖y − ȳjₙ‖² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (yᵢ − ȳ)².
If e has a normal distribution, then by part 3 of the theorem on p. 85,

    ‖y − Xβ̂‖²/σ² ∼ χ²(n − rank(X)),

and, since the central χ²(m) has mean m and variance 2m,

    s² = (σ² / (n − rank(X))) · ‖y − Xβ̂‖²/σ²,   where ‖y − Xβ̂‖²/σ² ∼ χ²(n − rank(X)),

implies

    E(s²) = (σ² / (n − rank(X))) {n − rank(X)} = σ²

and

    var(s²) = (σ⁴ / {n − rank(X)}²) · 2{n − rank(X)} = 2σ⁴ / (n − rank(X)).
114
Properties of β̂ and s² — Summary:

Theorem: Under assumptions A1–A5 of the classical linear model,

    i.   β̂ ∼ Nₖ₊₁(β, σ²(XᵀX)⁻¹),
    ii.  (n − k − 1)s²/σ² ∼ χ²(n − k − 1), and
    iii. β̂ and s² are independent.

Proof: We've already shown (i) and (ii). Result (iii) follows from the fact that β̂ = (XᵀX)⁻¹Xᵀμ̂ = (XᵀX)⁻¹Xᵀ P_C(X) y and s² = (n − k − 1)⁻¹ ‖P_C(X)⊥ y‖² are functions of projections onto the mutually orthogonal subspaces C(X) and C(X)⊥.
Minimum Variance Unbiased Estimation:
• The Gauss-Markov Theorem establishes that the least-squares estimator cᵀβ̂ of cᵀβ in the linear model with spherical, but not-necessarily-normal, errors is the minimum variance linear unbiased estimator.

• If, in addition, we add the assumption of normal errors, then the least-squares estimator has minimum variance among all unbiased estimators.

• The general theory of minimum variance unbiased estimation is beyond the scope of this course, but we will present the background material we need without proof or detailed discussion. Our main goal is just to establish that cᵀβ̂ and s² are minimum variance unbiased. A more general and complete discussion of minimum variance unbiased estimation can be found in STAT 6520 or STAT 6820.
115
Our model is the classical linear model with normal errors:

    y = Xβ + e,   e ∼ N(0, σ²Iₙ).
We first need the concept of a complete sufficient statistic:
Sufficiency: Let y be a random vector with p.d.f. f(y; θ) depending on an unknown k × 1 parameter θ. Let T(y) be an r × 1 vector-valued statistic that is a function of y. Then T(y) is said to be a sufficient statistic for θ if and only if the conditional distribution of y given the value of T(y) does not depend upon θ.

• If T is sufficient for θ then, loosely, T summarizes all of the information in the data y relevant to θ. Once we know T, there's no more information in y about θ.

The property of completeness is needed as well, but it is somewhat technical. Briefly, it ensures that if a function of the sufficient statistic exists that is unbiased for the quantity being estimated, then it is unique.

Completeness: A vector-valued sufficient statistic T(y) is said to be complete if and only if E{h(T(y))} = 0 for all θ implies Pr{h(T(y)) = 0} = 1 for all θ.
Theorem: If T(y) is a complete sufficient statistic, then f(T(y)) is a minimum variance unbiased estimator of E{f(T(y))}.

Proof: This theorem is known as the Lehmann-Scheffé Theorem, and its proof follows easily from the Rao-Blackwell Theorem. See, e.g., Bickel and Doksum, p. 122, or Casella and Berger, p. 320.

In the linear model, the p.d.f. of y depends upon β and σ², so the pair (β, σ²) plays the role of θ.
116
Is there a complete sufficient statistic for (β, σ²) in the classical linear model?

Yes, by the following result:

Theorem: Let θ = (θ₁, . . . , θᵣ)ᵀ and let y be a random vector with probability density function

    f(y) = c(θ) exp{ ∑ᵢ₌₁ʳ θᵢTᵢ(y) } h(y).

Then T(y) = (T₁(y), . . . , Tᵣ(y))ᵀ is a complete sufficient statistic, provided that neither θ nor T(y) satisfies any linear constraints.

• The density function in the above theorem describes the exponential family of distributions. For this family, which includes the normal distribution, it is easy to find a complete sufficient statistic.
Consider the classical linear model

    y = Xβ + e,   e ∼ N(0, σ²Iₙ).

The density of y can be written as

    f(y; β, σ²) = (2π)⁻ⁿ′²(σ²)⁻ⁿ′² exp{ −(y − Xβ)ᵀ(y − Xβ)/(2σ²) }
                = c₁(σ²) exp{ −(yᵀy − 2βᵀXᵀy + βᵀXᵀXβ)/(2σ²) }
                = c₂(β, σ²) exp{ (−1/(2σ²)) yᵀy + (σ⁻²βᵀ)(Xᵀy) }.

If we reparameterize in terms of θ, where

    θ₁ = −1/(2σ²),   (θ₂, . . . , θₖ₊₂)ᵀ = (1/σ²) β,

then this density can be seen to be of the exponential form, with vector-valued complete sufficient statistic

    ( yᵀy, Xᵀy )ᵀ.
117
So, since cᵀβ̂ = cᵀ(XᵀX)⁻¹Xᵀy is a function of Xᵀy and is an unbiased estimator of cᵀβ, it must be minimum variance among all unbiased estimators.

In addition, s² = (1/(n − k − 1)) (y − Xβ̂)ᵀ(y − Xβ̂) is an unbiased estimator of σ² and can be written as a function of the complete sufficient statistic as well:

    s² = (1/(n − k − 1)) [(Iₙ − P_C(X))y]ᵀ[(Iₙ − P_C(X))y]
       = (1/(n − k − 1)) yᵀ(Iₙ − P_C(X))y
       = (1/(n − k − 1)) { yᵀy − (yᵀX)(XᵀX)⁻¹(Xᵀy) }.

Therefore, s² is a minimum variance unbiased estimator as well.
Taken together, these results prove the following theorem:
Theorem: For the full-rank, classical linear model with y = Xβ + e, e ∼ Nₙ(0, σ²Iₙ), s² is a minimum variance unbiased estimator of σ², and cᵀβ̂ is a minimum variance unbiased estimator of cᵀβ, where β̂ = (XᵀX)⁻¹Xᵀy is the least squares estimator (MLE) of β.
118
Generalized Least Squares
Up to now, we have assumed var(e) = σ²I in our linear model. There are two aspects to this assumption: (i) uncorrelatedness (var(e) is diagonal), and (ii) homoscedasticity (the diagonal elements of var(e) are all the same).

Now we relax these assumptions simultaneously by considering a more general variance-covariance structure. We now consider the linear model

    y = Xβ + e,   where E(e) = 0, var(e) = σ²V,

where X is full rank as before, and where V is a known positive definite matrix.

• Note that we assume V is known, so there still is only one variance-covariance parameter to be estimated, σ².

• In the context of least squares, allowing V to be unknown complicates things substantially, so we postpone discussion of this case. V unknown can be handled via ML estimation, and we'll talk about that later. Of course, V unknown is the typical scenario in practice, but there are cases when V would be known.

• A good example of such a situation is the simple linear regression model with uncorrelated, but heteroscedastic, errors:

    yᵢ = β₀ + β₁xᵢ + eᵢ,

where the eᵢ's are independent, each with mean 0, and var(eᵢ) = σ²xᵢ. In this case, var(e) = σ²V where V = diag(x₁, . . . , xₙ), a known matrix of constants.
119
Estimation of β and σ² when var(e) = σ²V:

A nice feature of the model

    y = Xβ + e, where var(e) = σ²V,   (1)

is that, although it is not a Gauss-Markov (spherical errors) model, it is simple to transform it into a Gauss-Markov model. This allows us to apply what we've learned about the spherical-errors case to obtain methods and results for the non-spherical case.

Since V is known and positive definite, it is possible to find a matrix Q such that V = QQᵀ (e.g., Q could be taken as the lower-triangular Cholesky factor of V).

Multiplying both sides of the model equation in (1) by the known matrix Q⁻¹, it follows that the following transformed model holds as well:

    Q⁻¹y = Q⁻¹Xβ + Q⁻¹e,

or

    ỹ = X̃β + ẽ, where var(ẽ) = σ²I,   (2)

where ỹ = Q⁻¹y, X̃ = Q⁻¹X, and ẽ = Q⁻¹e.

• Notice that model (2) is a Gauss-Markov model because

    E(ẽ) = Q⁻¹E(e) = Q⁻¹0 = 0

and

    var(ẽ) = Q⁻¹var(e)(Q⁻¹)ᵀ = σ²Q⁻¹V(Q⁻¹)ᵀ = σ²Q⁻¹QQᵀ(Q⁻¹)ᵀ = σ²I.
120
The least-squares estimator based on the transformed model minimizes

    ẽᵀẽ = (ỹ − X̃β)ᵀ(ỹ − X̃β) = (y − Xβ)ᵀ(Q⁻¹)ᵀQ⁻¹(y − Xβ)
        = (y − Xβ)ᵀ(QQᵀ)⁻¹(y − Xβ)
        = (y − Xβ)ᵀV⁻¹(y − Xβ)   (the GLS criterion)

• So the generalized least squares estimate of β from model (1) minimizes a squared statistical (rather than Euclidean) distance between y and Xβ, one that takes into account the differing variances among the yᵢ's and the covariances (correlations) among the yᵢ's.

• There is some variability in terminology here. Most authors refer to this approach as generalized least squares when V is an arbitrary, known, positive definite matrix, and use the term weighted least squares for the case in which V is diagonal. Others use the terms interchangeably.
Since GLS estimators for model (1) are just ordinary least squares estimators from model (2), many properties of GLS estimators follow easily from the properties of ordinary least squares.

Properties of GLS Estimators:

1. The best linear unbiased estimator of β in model (1) is

    β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.

Proof: Since model (2) is a Gauss-Markov model, we know that (X̃ᵀX̃)⁻¹X̃ᵀỹ is the BLUE of β. But this estimator simplifies to

    (X̃ᵀX̃)⁻¹X̃ᵀỹ = [(Q⁻¹X)ᵀ(Q⁻¹X)]⁻¹(Q⁻¹X)ᵀ(Q⁻¹y)
                 = [Xᵀ(Q⁻¹)ᵀQ⁻¹X]⁻¹Xᵀ(Q⁻¹)ᵀQ⁻¹y
                 = [Xᵀ(QQᵀ)⁻¹X]⁻¹Xᵀ(QQᵀ)⁻¹y
                 = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.
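A numerical sketch of this equivalence on hypothetical heteroscedastic data (numpy assumed): OLS on the whitened data (Q⁻¹y, Q⁻¹X) reproduces the direct GLS formula.

```python
import numpy as np

# Hypothetical data with known diagonal V (heteroscedastic, uncorrelated errors).
rng = np.random.default_rng(6)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)
V = np.diag(rng.uniform(0.5, 2.0, size=n))

# Whitening: V = Q Q^T with Q the lower-triangular Cholesky factor,
# then OLS on the transformed data (Q^{-1} y, Q^{-1} X).
Q = np.linalg.cholesky(V)
beta_whitened, *_ = np.linalg.lstsq(np.linalg.solve(Q, X),
                                    np.linalg.solve(Q, y), rcond=None)

# Direct GLS formula (X^T V^{-1} X)^{-1} X^T V^{-1} y.
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
print(np.allclose(beta_whitened, beta_gls))  # → True
```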
121
2. Since μ = E(y) = Xβ in model (1), the estimated mean of y is

    μ̂ = Xβ̂ = X(XᵀV⁻¹X)⁻¹XᵀV⁻¹y.

In going from the var(e) = σ²I case to the var(e) = σ²V case, we've changed our estimate of the mean from

    X(XᵀX)⁻¹Xᵀy = P_C(X) y

to

    X(XᵀV⁻¹X)⁻¹XᵀV⁻¹y.

Geometrically, we've changed from using the Euclidean (or orthogonal) projection matrix P_C(X) to using a non-Euclidean (or oblique) projection matrix X(XᵀV⁻¹X)⁻¹XᵀV⁻¹. The latter accounts for correlation and heteroscedasticity among the elements of y when projecting onto C(X).
3. The var-cov matrix of β̂ is

    var(β̂) = σ²(XᵀV⁻¹X)⁻¹.

Proof:

    var(β̂) = var{(XᵀV⁻¹X)⁻¹XᵀV⁻¹y}
            = (XᵀV⁻¹X)⁻¹XᵀV⁻¹ var(y) V⁻¹X(XᵀV⁻¹X)⁻¹     (var(y) = σ²V)
            = σ²(XᵀV⁻¹X)⁻¹.
122
4. An unbiased estimator of σ² is

    s² = (y − Xβ̂)ᵀV⁻¹(y − Xβ̂) / (n − k − 1)
       = yᵀ[V⁻¹ − V⁻¹X(XᵀV⁻¹X)⁻¹XᵀV⁻¹]y / (n − k − 1),

where β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y.

Proof: Homework.
5. If e ∼ N(0, σ²V), then the MLEs of β and σ² are

    β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y   and   σ̂² = (1/n)(y − Xβ̂)ᵀV⁻¹(y − Xβ̂).

Proof: We already know that β̂ is the OLS estimator in model (2), and that the OLS estimator and MLE in such a Gauss-Markov model coincide, so β̂ is the MLE of β. In addition, the MLE of σ² is the MLE of this quantity in model (2), which is

    σ̂² = (1/n)‖(I − X̃(X̃ᵀX̃)⁻¹X̃ᵀ)ỹ‖² = (1/n)(y − Xβ̂)ᵀV⁻¹(y − Xβ̂)

after plugging in X̃ = Q⁻¹X, ỹ = Q⁻¹y, and some algebra.
123
Misspecification of the Error Structure:
Q: What happens if we use OLS when GLS is appropriate?
A: The OLS estimator is still linear and unbiased, but no longer best. In addition, we need to be careful to compute the var-cov matrix of our estimator correctly.
Suppose the true model is

    y = Xβ + e,   E(e) = 0,  var(e) = σ²V.

The BLUE of β here is the GLS estimator β̂ = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y, with var-cov matrix σ²(XᵀV⁻¹X)⁻¹.

However, suppose we use OLS here instead of GLS. That is, suppose we use the estimator

    β̂* = (XᵀX)⁻¹Xᵀy.

Obviously, this estimator is still linear, and it is unbiased because

    E(β̂*) = E{(XᵀX)⁻¹Xᵀy} = (XᵀX)⁻¹XᵀE(y) = (XᵀX)⁻¹XᵀXβ = β.

However, the variance formula var(β̂*) = σ²(XᵀX)⁻¹ is no longer correct, because it was derived under the assumption that var(e) = σ²I (see p. 103). Instead, the correct var-cov matrix of the OLS estimator here is

    var(β̂*) = var{(XᵀX)⁻¹Xᵀy} = (XᵀX)⁻¹Xᵀ var(y) X(XᵀX)⁻¹     (var(y) = σ²V)
             = σ²(XᵀX)⁻¹XᵀVX(XᵀX)⁻¹.   (∗)

In contrast, if we had used the GLS estimator (the BLUE), the var-cov matrix of our estimator would have been

    var(β̂) = σ²(XᵀV⁻¹X)⁻¹.   (∗∗)
124
Since β̂ is the BLUE, we know that the variances from (∗) will be ≥ the variances from (∗∗), which means that the OLS estimator here is a less efficient (precise), but not necessarily much less efficient, estimator under the GLS model.
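A numerical sketch of (∗) versus (∗∗) with a made-up heteroscedastic V (numpy assumed): the correct OLS variances are computed from the sandwich formula, and they dominate the GLS variances on the diagonal.

```python
import numpy as np

# Hypothetical design and known heteroscedastic V.
rng = np.random.default_rng(7)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma2 = 1.0
V = np.diag(rng.uniform(0.2, 5.0, size=n))

XtX_inv = np.linalg.inv(X.T @ X)
var_ols = sigma2 * XtX_inv @ X.T @ V @ X @ XtX_inv        # (*) correct OLS var-cov
Vinv = np.linalg.inv(V)
var_gls = sigma2 * np.linalg.inv(X.T @ Vinv @ X)          # (**) GLS var-cov

# GLS is BLUE, so each OLS variance is at least the GLS variance.
print(np.all(np.diag(var_ols) >= np.diag(var_gls) - 1e-12))  # → True
```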
Misspecification of E(y):
Suppose that the true model is y = Xβ + e, where we return to the spherical-errors case: var(e) = σ²I. We want to consider what happens when we omit some explanatory variables in X and when we include too many x's. So, let's partition our model as

    y = Xβ + e = (X₁, X₂) ( β₁ )
                          ( β₂ ) + e = X₁β₁ + X₂β₂ + e.   (†)

• If we leave out X₂β₂ when it should be included (when β₂ ≠ 0), then we are underfitting.

• If we include X₂β₂ when it doesn't belong in the true model (when β₂ = 0), then we are overfitting.

• We will consider the effects of both overfitting and underfitting on the bias and variance of β̂. The book also considers effects on predicted values and on the MSE s².
125
Underfitting:
Suppose model (†) holds, but we fit the model

    y = X₁β₁* + e*,   var(e*) = σ²I.   (♣)

The following theorem gives the bias and var-cov matrix of β̂₁*, the OLS estimator from (♣).

Theorem: If we fit model (♣) when model (†) is the true model, then the mean and var-cov matrix of the OLS estimator β̂₁* = (X₁ᵀX₁)⁻¹X₁ᵀy are as follows:

    (i)  E(β̂₁*) = β₁ + Aβ₂, where A = (X₁ᵀX₁)⁻¹X₁ᵀX₂;
    (ii) var(β̂₁*) = σ²(X₁ᵀX₁)⁻¹.
Proof:

(i)  E(β̂₁*) = E[(X₁ᵀX₁)⁻¹X₁ᵀy] = (X₁ᵀX₁)⁻¹X₁ᵀE(y)
            = (X₁ᵀX₁)⁻¹X₁ᵀ(X₁β₁ + X₂β₂) = β₁ + Aβ₂.

(ii) var(β̂₁*) = var[(X₁ᵀX₁)⁻¹X₁ᵀy] = (X₁ᵀX₁)⁻¹X₁ᵀ(σ²I)X₁(X₁ᵀX₁)⁻¹ = σ²(X₁ᵀX₁)⁻¹.

• This result says that when underfitting, β̂₁* is biased by an amount that depends upon both the omitted and included explanatory variables.

Corollary: If X₁ᵀX₂ = 0, i.e., if the columns of X₁ are orthogonal to the columns of X₂, then β̂₁* is unbiased.
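A numerical sketch of part (i) on hypothetical data (numpy assumed): since E(e) = 0, applying (X₁ᵀX₁)⁻¹X₁ᵀ to the true mean X₁β₁ + X₂β₂ gives exactly β₁ + Aβ₂.

```python
import numpy as np

# Hypothetical data; X2 is deliberately correlated with X1 so the bias is nonzero.
rng = np.random.default_rng(8)
n = 30
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.8 * X1[:, 1] + rng.normal(scale=0.5, size=n)).reshape(-1, 1)
beta1, beta2 = np.array([1.0, 2.0]), np.array([3.0])

A = np.linalg.solve(X1.T @ X1, X1.T @ X2)   # alias matrix A = (X1'X1)^{-1} X1'X2
bias = A @ beta2

mu = X1 @ beta1 + X2 @ beta2                # E(y) under the true model
E_beta1_star = np.linalg.solve(X1.T @ X1, X1.T @ mu)
print(np.allclose(E_beta1_star, beta1 + bias))  # → True
```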
126
Note that in the above theorem the var-cov matrix of β̂₁*, σ²(X₁ᵀX₁)⁻¹, is not the same as the var-cov matrix of β̂₁, the corresponding portion of the OLS estimator β̂ = (XᵀX)⁻¹Xᵀy from the full model. How these var-cov matrices differ is established in the following theorem:

Theorem: Let β̂ = (XᵀX)⁻¹Xᵀy from the full model (†) be partitioned as

    β̂ = ( β̂₁ )
         ( β̂₂ )

and let β̂₁* = (X₁ᵀX₁)⁻¹X₁ᵀy be the estimator from the reduced model (♣). Then

    var(β̂₁) − var(β̂₁*) = σ²AB⁻¹Aᵀ,

a n.n.d. matrix. Here, A = (X₁ᵀX₁)⁻¹X₁ᵀX₂ and B = X₂ᵀX₂ − X₂ᵀX₁A.

• Thus var(β̂ⱼ) ≥ var(β̂ⱼ*), meaning that underfitting results in smaller variances of the β̂ⱼ's and overfitting results in larger variances of the β̂ⱼ's.
Proof: Partitioning XᵀX to conform to the partitioning of X and β, we have

    var(β̂) = var( β̂₁ ) = σ²(XᵀX)⁻¹ = σ² ( X₁ᵀX₁  X₁ᵀX₂ )⁻¹
                 ( β̂₂ )                  ( X₂ᵀX₁  X₂ᵀX₂ )

           = σ² ( H₁₁  H₁₂ )⁻¹ = σ² ( H¹¹  H¹² )
                ( H₂₁  H₂₂ )        ( H²¹  H²² ),

where Hᵢⱼ = XᵢᵀXⱼ and Hⁱʲ is the corresponding block of the inverse matrix (XᵀX)⁻¹ (see p. 54).

So, var(β̂₁) = σ²H¹¹. Using the formulas for inverses of partitioned matrices,

    H¹¹ = H₁₁⁻¹ + H₁₁⁻¹H₁₂B⁻¹H₂₁H₁₁⁻¹,

where

    B = H₂₂ − H₂₁H₁₁⁻¹H₁₂.
127
In the previous theorem, we showed that var(β̂₁*) = σ²(X₁ᵀX₁)⁻¹ = σ²H₁₁⁻¹. Hence,

    var(β̂₁) − var(β̂₁*) = σ²(H¹¹ − H₁₁⁻¹)
                        = σ²(H₁₁⁻¹ + H₁₁⁻¹H₁₂B⁻¹H₂₁H₁₁⁻¹ − H₁₁⁻¹)
                        = σ²(H₁₁⁻¹H₁₂B⁻¹H₂₁H₁₁⁻¹)
                        = σ²[(X₁ᵀX₁)⁻¹(X₁ᵀX₂)B⁻¹(X₂ᵀX₁)(X₁ᵀX₁)⁻¹]
                        = σ²AB⁻¹Aᵀ.

We leave it as homework for you to show that AB⁻¹Aᵀ is n.n.d.
• To summarize, we've seen that underfitting reduces the variances of regression parameter estimators but introduces bias. On the other hand, overfitting produces unbiased estimators with increased variances. Thus it is the task of a regression model builder to find an optimum set of explanatory variables, balancing a biased model against one with large variances.
128
The Model in Centered Form
For some purposes it is useful to write the regression model in centered form; that is, in terms of the centered explanatory variables (the explanatory variables minus their means).

The regression model can be written

    yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + · · · + βₖxᵢₖ + eᵢ
       = α + β₁(xᵢ₁ − x̄₁) + β₂(xᵢ₂ − x̄₂) + · · · + βₖ(xᵢₖ − x̄ₖ) + eᵢ,

for i = 1, . . . , n, where

    α = β₀ + β₁x̄₁ + β₂x̄₂ + · · · + βₖx̄ₖ,   (♥)

and where x̄ⱼ = (1/n)∑ᵢ₌₁ⁿ xᵢⱼ.
In matrix form, the equivalence between the original model and the centered model that we've written above becomes

    y = Xβ + e = (jₙ, X_c) ( α  )
                           ( β₁ ) + e,

where β₁ = (β₁, . . . , βₖ)ᵀ, and

    X_c = (I − (1/n)J_{n,n}) X₁ = [ x₁₁ − x̄₁   x₁₂ − x̄₂   · · ·   x₁ₖ − x̄ₖ
                                    x₂₁ − x̄₁   x₂₂ − x̄₂   · · ·   x₂ₖ − x̄ₖ
                                        ⋮           ⋮        ⋱         ⋮
                                    xₙ₁ − x̄₁   xₙ₂ − x̄₂   · · ·   xₙₖ − x̄ₖ ],

where I − (1/n)J_{n,n} = P_{L(jₙ)⊥}, and X₁ is the matrix consisting of all but the first column of X, the original model matrix.

• P_{L(jₙ)⊥} = I − (1/n)J_{n,n} is sometimes called the centering matrix.
Based on the centered model, the least squares estimators become

    ( α̂  ) = [(jₙ, X_c)ᵀ(jₙ, X_c)]⁻¹(jₙ, X_c)ᵀy = ( n  0       )⁻¹ ( jₙᵀy  )
    ( β̂₁ )                                        ( 0  X_cᵀX_c )    ( X_cᵀy )

           = ( n⁻¹  0           ) ( nȳ    ) = ( ȳ                )
             ( 0    (X_cᵀX_c)⁻¹ ) ( X_cᵀy )   ( (X_cᵀX_c)⁻¹X_cᵀy ),

or

    α̂ = ȳ   and   β̂₁ = (X_cᵀX_c)⁻¹X_cᵀy.
β̂₁ here is the same as the usual least-squares estimator. That is, its elements are the same as β̂₁, . . . , β̂ₖ from β̂ = (XᵀX)⁻¹Xᵀy. However, the intercept α̂ differs from β̂₀. The relationship between α̂ and β̂ is just what you'd expect from the reparameterization (see (♥)):

    α̂ = β̂₀ + β̂₁x̄₁ + β̂₂x̄₂ + · · · + β̂ₖx̄ₖ.

From the expression for the estimated mean based on the centered model,

    Ê(yᵢ) = α̂ + β̂₁(xᵢ₁ − x̄₁) + β̂₂(xᵢ₂ − x̄₂) + · · · + β̂ₖ(xᵢₖ − x̄ₖ),

it is clear that the fitted regression plane passes through the point of averages (ȳ, x̄₁, x̄₂, . . . , x̄ₖ).
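A numerical sketch of these facts on hypothetical data (numpy assumed): α̂ = ȳ, the centered-model slopes match the full OLS slopes, and the relationship α̂ = β̂₀ + ∑ⱼβ̂ⱼx̄ⱼ holds.

```python
import numpy as np

# Hypothetical data with k = 3 explanatory variables.
rng = np.random.default_rng(9)
n = 40
X1 = rng.normal(size=(n, 3))                       # non-intercept columns
X = np.column_stack([np.ones(n), X1])
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(size=n)

Xc = X1 - X1.mean(axis=0)                          # centered explanatory variables
alpha_hat = y.mean()                               # intercept of the centered model
slopes_c = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)    # (Xc'Xc)^{-1} Xc'y

beta_full = np.linalg.solve(X.T @ X, X.T @ y)      # usual OLS from the full model
print(np.allclose(slopes_c, beta_full[1:]))        # → True
print(np.isclose(alpha_hat,
                 beta_full[0] + X1.mean(axis=0) @ beta_full[1:]))  # → True
```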
In general, we can write SSE, the error sum of squares, as

    SSE = (y − Xβ̂)ᵀ(y − Xβ̂) = (y − P_C(X)y)ᵀ(y − P_C(X)y)
        = yᵀy − yᵀP_C(X)y − yᵀP_C(X)y + yᵀP_C(X)y
        = yᵀy − yᵀP_C(X)y = yᵀy − β̂ᵀXᵀy.

From the centered model we see that Ê(y) = Xβ̂ = (jₙ, X_c)(α̂, β̂₁ᵀ)ᵀ, so SSE can also be written as

    SSE = yᵀy − (α̂, β̂₁ᵀ) ( jₙᵀ  ) y
                          ( X_cᵀ )
        = yᵀy − ȳ jₙᵀy − β̂₁ᵀX_cᵀy
        = (y − ȳjₙ)ᵀy − β̂₁ᵀX_cᵀy
        = (y − ȳjₙ)ᵀ(y − ȳjₙ) − β̂₁ᵀX_cᵀy
        = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)² − β̂₁ᵀX_cᵀy.   (∗)
130
R², the Estimated Coefficient of Determination

Rearranging (∗), we obtain a decomposition of the total variability in the data:

    ∑ᵢ₌₁ⁿ (yᵢ − ȳ)² = β̂₁ᵀX_cᵀy + SSE,

or

    SST = SSR + SSE.

• Here SST is the (corrected) total sum of squares. The term "corrected" here indicates that we've taken the sum of the squared y's after correcting, or adjusting, them for the mean. The uncorrected sum of squares would be ∑ᵢ₌₁ⁿ yᵢ², but this quantity arises less frequently, and by "SST" or "total sum of squares" we will generally mean the corrected quantity unless stated otherwise.

• Note that SST quantifies the total variability in the data (if we added a 1/(n − 1) multiplier in front, SST would become the sample variance).

• The first term on the right-hand side is called the regression sum of squares. It represents the variability in the data (the portion of SST) that can be explained by the regression terms β₁x₁ + β₂x₂ + · · · + βₖxₖ.
• This interpretation can be seen by writing SSR as

    SSR = β̂1ᵀXcᵀy = β̂1ᵀXcᵀXc(XcᵀXc)⁻¹Xcᵀy = (Xcβ̂1)ᵀ(Xcβ̂1).
The proportion of the total sum of squares that is due to regression is

    R² = SSR/SST = β̂1ᵀXcᵀXcβ̂1 / Σi (yi − ȳ)² = (β̂ᵀXᵀy − nȳ²)/(yᵀy − nȳ²).
• This quantity is called the coefficient of determination, and it is usually denoted R². It is the sample estimate of the squared multiple correlation coefficient we discussed earlier (see p. 77).
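As a numerical check, the sketch below (numpy; invented data, not from the notes) computes R² three equivalent ways and also computes adjusted R² from both algebraic forms discussed later in this section:

```python
import numpy as np

# Invented dataset for illustration only
rng = np.random.default_rng(1)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = sst - sse

r2 = ssr / sst
# equivalent form: (beta_hat' X'y - n ybar^2) / (y'y - n ybar^2)
r2_alt = (beta_hat @ X.T @ y - n * y.mean() ** 2) / (y @ y - n * y.mean() ** 2)
# R is also the sample correlation between y and the fitted values
r = np.corrcoef(y, y_hat)[0, 1]

# adjusted R^2, both algebraic forms
r2a_1 = (r2 - k / (n - 1)) * (n - 1) / (n - k - 1)
r2a_2 = ((n - 1) * r2 - k) / (n - k - 1)
```

The three R² computations agree, and R² equals the squared correlation between the observed and fitted values (fact 2 below).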
131
Facts about R2:
1. The range of R² is 0 ≤ R² ≤ 1, with 0 corresponding to the explanatory variables x1, . . . , xk explaining none of the variability in y and 1 corresponding to x1, . . . , xk explaining all of the variability in y.

2. R, the multiple correlation coefficient or positive square root of R², is equal to the sample correlation coefficient between the observed yi's and their fitted values, the ŷi's. (Here the fitted value is just the estimated mean: ŷi = Ê(yi) = xiᵀβ̂.)

3. R² will always stay the same or (typically) increase if an explanatory variable xk+1 is added to the model.

4. If β1 = β2 = · · · = βk = 0, then

    E(R²) = k/(n − 1).
• From properties 3 and 4, we see that R² tends to be higher for a model with many predictors than for a model with few predictors, even if those models have the same explanatory power. That is, as a measure of goodness of fit, R² rewards complexity and penalizes parsimony, which is certainly not what we would like to do.

• Therefore, a version of R² that penalizes for model complexity was developed, known as Ra² or adjusted R²:

    Ra² = (R² − k/(n − 1))(n − 1)/(n − k − 1) = [(n − 1)R² − k]/(n − k − 1).
132
5. Unless the xj's, j = 1, . . . , k, are mutually orthogonal, R² cannot be written as a sum of k components uniquely attributable to x1, . . . , xk. (R² represents the joint explanatory power of the xj's, not the sum of the explanatory powers of the individual xj's.)

6. R² is invariant to a full-rank linear transformation of X and to a scale change on y (but not invariant to a joint linear transformation on [y, X]).

7. Geometrically, R, the multiple correlation coefficient, is equal to R = cos(θ), where θ is the angle between y and ŷ after correcting each for its mean, ȳjn. This is depicted in the picture below.
133
Inference in the Multiple Regression Model
Testing a Subset of β: Testing Nested Models
All testing of linear hypotheses (nonlinear hypotheses are rarely encountered in practice) in linear models reduces essentially to putting linear constraints on the model space. The test amounts to comparing the resulting constrained model against the original unconstrained model.
We start with a model we know (assume, really) to be valid:
y = µ + e, where µ = Xβ ∈ C(X) ≡ V, e ∼ Nn(0, σ2In)
and then ask the question of whether or not a simpler model holds, corresponding to µ ∈ V0, where V0 is a proper subspace of V. (E.g., V0 = C(X0), where X0 is a matrix consisting of a subset of the columns of X.)
For example, consider the second order response surface model
    yi = β0 + β1xi1 + β2xi2 + β3xi1² + β4xi2² + β5xi1xi2 + ei,   i = 1, . . . , n.   (†)
This model says that E(y) is a quadratic function of x1 and x2.
A hypothesis we might be interested in here is that the second-order terms are unnecessary; i.e., we might be interested in H0 : β3 = β4 = β5 = 0, under which the model is linear in x1 and x2:
yi = β∗0 + β∗1xi1 + β∗2xi2 + e∗i , i = 1, . . . , n. (‡)
• Testing H0 : β3 = β4 = β5 = 0 is equivalent to testing H0 : model (‡) holds versus H1 : model (†) holds but (‡) does not.
• I.e., we test H0 : µ ∈ C([ jn,x1,x2]) versus
H1 : µ ∈ C([ jn,x1,x2,x1∗x1,x2∗x2,x1∗x2]) and µ /∈ C([ jn,x1,x2]).
Here ∗ denotes the element-wise product and µ = E(y).
134
Without loss of generality, we can always arrange the linear model so the terms we want to test appear last in the linear predictor. So, we write our model as

    y = Xβ + e = (X1, X2)(β1ᵀ, β2ᵀ)ᵀ + e = X1β1 + X2β2 + e,   e ∼ N(0, σ²I),   (FM)

where X1 is n × (k + 1 − h), X2 is n × h, and we are interested in the hypothesis H0 : β2 = 0.
Under H0 : β2 = 0 the model becomes
y = X1β∗1 + e∗, e∗ ∼ N(0, σ2I) (RM)
The problem is to test
H0 : µ ∈ C(X1) (RM) versus H1 : µ /∈ C(X1)
under the maintained hypothesis that µ ∈ C(X) = C([X1,X2]) (FM).
We’d like to find a test statistic whose size measures the strength of theevidence against H0. If that evidence is overwhelming (the test statisticis large enough) then we reject H0.
The test statistic should be large, but large relative to what?
Large relative to its distribution under the null hypothesis.
How large?
That’s up to the user, but an α−level test rejects H0 if, assuming H0 istrue, the probability of getting a test statistic at least as far from expectedas the one obtained (the p−value) is less than α.
135
• E.g., suppose we compute a test statistic and obtain a p-value of p = 0.02. This says that, assuming H0 is true, the results that we obtained were very unlikely (results this extreme should happen only 2% of the time). If these results are so unlikely assuming H0 is true, perhaps H0 is not true. The cut-off for how unlikely our results must be before we're willing to reject H0 is the significance level α. (We reject if p < α.)

So, we want a test statistic that measures the strength of the evidence against H0 : µ ∈ C(X1) (i.e., one that is small for µ ∈ C(X1) and large for µ /∈ C(X1)) whose distribution is available.

• This will lead to an F test which is equivalent to the likelihood ratio test, and which has some optimality properties.

Note that under RM, µ ∈ C(X1) ⊂ C(X) = C([X1, X2]). Therefore, if RM is true, then FM must be true as well. So, if RM is true, then the least squares estimates of the mean µ, PC(X1)y and PC(X)y, are estimates of the same thing.
This suggests that the difference between the two estimates
PC(X)y −PC(X1)y = (PC(X) −PC(X1))y
should be small under H0 : µ ∈ C(X1).
• Note that PC(X) − PC(X1) is the projection matrix onto C(X1)⊥ ∩ C(X), the orthogonal complement of C(X1) with respect to C(X), and C(X1) ⊕ [C(X1)⊥ ∩ C(X)] = C(X). (See bottom of p. 43 of these notes.)

So, under H0, (PC(X) − PC(X1))y should be "small". A measure of the "smallness" of this vector is its squared length:
‖(PC(X) −PC(X1))y‖2 = yT (PC(X) −PC(X1))y.
136
By our result on expected values of quadratic forms,
E[yT (PC(X) −PC(X1))y] = σ2 dim[C(X1)⊥ ∩ C(X)] + µT (PC(X) −PC(X1))µ
= σ2h + [(PC(X) −PC(X1))µ]T [(PC(X) −PC(X1))µ]
= σ2h + (PC(X)µ−PC(X1)µ)T (PC(X)µ−PC(X1)µ)
Under H0, µ ∈ C(X1) and µ ∈ C(X), so
(PC(X)µ−PC(X1)µ) = µ− µ = 0.
Under H1, PC(X)µ = µ, but PC(X1)µ ≠ µ.

I.e., letting µ0 denote p(µ|C(X1)),

    E[yᵀ(PC(X) − PC(X1))y] = σ²h under H0;   = σ²h + ‖µ − µ0‖² under H1.
• That is, under H0 we expect the squared length of

    PC(X)y − PC(X1)y ≡ ŷ − ŷ0

to be small, on the order of σ²h. If H0 is not true, then the squared length of ŷ − ŷ0 will be larger, with expected value σ²h + ‖µ − µ0‖².

Therefore, if σ² is known,

    ‖ŷ − ŷ0‖²/(σ²h) = (‖ŷ − ŷ0‖²/h)/σ²   { ≈ 1 under H0; > 1 under H1 }

is an appropriate test statistic for testing H0.
137
Typically, σ² will not be known, so it must be estimated. The appropriate estimator is s² = ‖y − ŷ‖²/(n − k − 1), the mean squared error from FM, the model which is valid both under H0 and under H1. Our test statistic then becomes

    F = (‖ŷ − ŷ0‖²/h)/s² = (‖ŷ − ŷ0‖²/h)/[‖y − ŷ‖²/(n − k − 1)]   { ≈ 1 under H0; > 1 under H1 }.
By the theorems on pp. 84–85, the following results on the numerator and denominator of F hold:

Theorem: Suppose y ∼ N(Xβ, σ²I), where X is n × (k + 1) of full rank, Xβ = X1β1 + X2β2, and X2 is n × h. Let ŷ = p(y|C(X)) = PC(X)y, ŷ0 = p(y|C(X1)) = PC(X1)y, and µ0 = p(µ|C(X1)) = PC(X1)µ. Then

(i) (1/σ²)‖y − ŷ‖² = (1/σ²) yᵀ(I − PC(X))y ∼ χ²(n − k − 1);

(ii) (1/σ²)‖ŷ − ŷ0‖² = (1/σ²) yᵀ(PC(X) − PC(X1))y ∼ χ²(h, λ1), where

    λ1 = (1/(2σ²))‖(PC(X) − PC(X1))µ‖² = (1/(2σ²))‖µ − µ0‖²;

and

(iii) (1/σ²)‖y − ŷ‖² and (1/σ²)‖ŷ − ŷ0‖² are independent.
Proof: Parts (i) and (ii) follow immediately from part (3) of the theorem on p. 84. Part (iii) follows because

    ‖y − ŷ‖² = ‖p(y|C(X)⊥)‖²   and   ‖ŷ − ŷ0‖² = ‖p(y|C(X1)⊥ ∩ C(X))‖²,

where C(X1)⊥ ∩ C(X) ⊂ C(X), are squared lengths of projections onto orthogonal subspaces, so they are independent according to the theorem on p. 85.
138
From this result, the distribution of our test statistic F follows easily:
Theorem: Under the conditions of the previous theorem,
    F = (‖ŷ − ŷ0‖²/h)/s² = [yᵀ(PC(X) − PC(X1))y/h] / [yᵀ(I − PC(X))y/(n − k − 1)]
      ∼ F(h, n − k − 1) under H0;   ∼ F(h, n − k − 1, λ1) under H1,
where λ1 is as given in the previous theorem.
Proof: Follows from the previous theorem and the definition of the F distribution.

Therefore, the α-level F-test for H0 : β2 = 0 versus H1 : β2 ≠ 0 (equivalently, of RM vs. FM) is:

    reject H0 if F > F1−α(h, n − k − 1).
• It is worth noting that the numerator of this F test can be obtained as the difference in the SSE's under FM and RM divided by the difference in the dfE (degrees of freedom for error) for the two models. This is so because the Pythagorean Theorem yields

    ‖ŷ − ŷ0‖² = ‖y − ŷ0‖² − ‖y − ŷ‖² = SSE(RM) − SSE(FM).

The difference in the dfE's is dfE(RM) − dfE(FM) = [n − (k + 1 − h)] − [n − (k + 1)] = h. Therefore,

    F = {[SSE(RM) − SSE(FM)]/[dfE(RM) − dfE(FM)]} / {SSE(FM)/dfE(FM)}.
• In addition, because SSE = SST − SSR,

    ‖ŷ − ŷ0‖² = SSE(RM) − SSE(FM) = [SST − SSR(RM)] − [SST − SSR(FM)] = SSR(FM) − SSR(RM) ≡ SS(β2|β1),

which we denote as SS(β2|β1), and which is known as the "extra" regression sum of squares due to β2 after accounting for β1.
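The equivalence between the projection form of the numerator and the difference in SSE's can be verified numerically. The sketch below (numpy; the data and the choice of second-order terms are invented for illustration) computes the nested-model F statistic both ways:

```python
import numpy as np

# Invented data: reduced model is linear in x1, x2; full model adds
# three second-order terms (the terms under test)
rng = np.random.default_rng(2)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X1 = np.column_stack([np.ones(n), x1, x2])        # reduced model matrix
X2 = np.column_stack([x1**2, x2**2, x1 * x2])     # h = 3 terms under test
X = np.hstack([X1, X2])                           # full model matrix
y = X1 @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

def proj(A):
    # projection matrix onto C(A), assuming A has full column rank
    return A @ np.linalg.solve(A.T @ A, A.T)

P, P1 = proj(X), proj(X1)
h = X2.shape[1]
dfe_fm = n - X.shape[1]                           # n - k - 1

# F via projections: y'(P - P1)y / h over y'(I - P)y / (n - k - 1)
F_proj = (y @ (P - P1) @ y / h) / (y @ (np.eye(n) - P) @ y / dfe_fm)

# F via the SSE difference of the two fitted models
sse_fm = np.sum((y - P @ y) ** 2)
sse_rm = np.sum((y - P1 @ y) ** 2)
F_sse = ((sse_rm - sse_fm) / h) / (sse_fm / dfe_fm)
```

Both computations give the same F, illustrating that the "extra sum of squares" numerator is just the squared length of ŷ − ŷ0.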
139
The results leading to the F-test for H0 : β2 = 0 that we have just developed can be summarized in an ANOVA table:

Source of Variation          Sum of Squares                     df          Mean Square       F
Due to β2, adjusted for β1   SS(β2|β1) = yᵀ(PC(X) − PC(X1))y    h           SS(β2|β1)/h       MS(β2|β1)/MSE
Error                        SSE = yᵀ(I − PC(X))y               n − k − 1   SSE/(n − k − 1)
Total (Corr.)                SST = yᵀy − nȳ²
An additional column is sometimes added to the ANOVA table for E(MS), or expected mean squares. The expected mean squares here are

    E{MS(β2|β1)} = (1/h) E{SS(β2|β1)} = (σ²/h) E{SS(β2|β1)/σ²} = (σ²/h)(h + 2λ1) = σ² + (1/h)‖µ − µ0‖²

and

    E(MSE) = E(SSE)/(n − k − 1) = (n − k − 1)σ²/(n − k − 1) = σ².
These expected mean squares give additional insight into why F is an appropriate test of H0 : β2 = 0. Any mean square can be thought of as an estimate of its expectation. Therefore, MSE estimates σ² (always), and MS(β2|β1) estimates σ² under H0 and σ² plus a positive quantity under H1. Therefore, our test statistic F will behave as

    F ≈ 1 under H0,   F > 1 under H1,

where how much larger F is than 1 depends upon "how false" H0 is.
140
Overall Regression Test:
An important special case of the test of H0 : β2 = 0 that we have just developed occurs when we partition β so that β1 contains just the intercept and β2 contains all of the regression coefficients. That is, if we write the model as

    y = X1β1 + X2β2 + e = β0jn + X2β2 + e,

where X1 = jn, β1 = β0, X2 is the n × k matrix with (i, j)th element xij (the values of the explanatory variables), and β2 = (β1, . . . , βk)ᵀ,
then our hypothesis H0 : β2 = 0 is equivalent to
H0 : β1 = β2 = · · · = βk = 0,
which says that the collection of explanatory variables x1, . . . , xk has no linear effect on (does not predict) y.

The test of this hypothesis is called the overall regression test and occurs as a special case of the test of β2 = 0 that we've developed. Under H0,

    ŷ0 = p(y|C(X1)) = p(y|L(jn)) = ȳjn

and h = k, so the numerator of our F-test statistic becomes

    (1/k) yᵀ(PC(X) − PL(jn))y = (1/k)[yᵀPC(X)y − yᵀPL(jn)y]
      = (1/k)[(PC(X)y)ᵀy − (PL(jn)y)ᵀ(PL(jn)y)]      (using PL(jn)y = ȳjn)
      = (1/k)(β̂ᵀXᵀy − nȳ²) = SSR/k ≡ MSR.
141
Thus, the test statistic for overall regression is given by

    F = (SSR/k)/[SSE/(n − k − 1)] = MSR/MSE
      ∼ F(k, n − k − 1) under H0 : β1 = · · · = βk = 0;
      ∼ F(k, n − k − 1, (1/(2σ²)) β2ᵀX2ᵀPL(jn)⊥X2β2) otherwise.
The ANOVA table for this test is given below. This ANOVA table is typically part of the output of regression software (e.g., PROC REG in SAS).
Source of Variation   Sum of Squares           df          Mean Square       F
Regression            SSR = β̂ᵀXᵀy − nȳ²        k           SSR/k             MSR/MSE
Error                 SSE = yᵀ(I − PC(X))y     n − k − 1   SSE/(n − k − 1)
Total (Corr.)         SST = yᵀy − nȳ²
142
F test in terms of R2:
The F test statistics we have just developed can be written in terms of R², the coefficient of determination. This relationship is given by the following theorem.
Theorem: The F statistic for testing H0 : β2 = 0 in the full rank model y = X1β1 + X2β2 + e (top of p. 138) can be written in terms of R² as

    F = [(R²(FM) − R²(RM))/h] / [(1 − R²(FM))/(n − k − 1)],

where R²(FM) corresponds to the full model y = X1β1 + X2β2 + e, and R²(RM) corresponds to the reduced model y = X1β1∗ + e∗.
Proof: Homework.
Corollary: The F statistic for overall regression (for testing H0 : β1 = β2 = · · · = βk = 0) in the full rank model yi = β0 + β1xi1 + · · · + βkxik + ei, i = 1, . . . , n, where e1, . . . , en are i.i.d. N(0, σ²), can be written in terms of R², the coefficient of determination from this model, as follows:

    F = (R²/k) / [(1 − R²)/(n − k − 1)].
Proof: For this hypothesis h, the dimension of the regression parameter being tested, is k. In addition, the reduced model here is

    y = jnβ0 + e,

so the estimated mean of y under the reduced model is (Xβ̂)RM = ȳjn. So, R²(RM) in the previous theorem is (cf. p. 131)

    R²(RM) = [(Xβ̂)RMᵀy − nȳ²]/(yᵀy − nȳ²) = (ȳjnᵀy − nȳ²)/(yᵀy − nȳ²) = 0,

since ȳjnᵀy = nȳ².
The result now follows from the previous theorem.
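A quick numerical check of the corollary (numpy sketch; invented data, not from the notes) computes the overall-regression F both from the ANOVA quantities and from R²:

```python
import numpy as np

# Invented dataset for illustration only
rng = np.random.default_rng(3)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, 0.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = sst - sse

F_anova = (ssr / k) / (sse / (n - k - 1))     # MSR / MSE
r2 = ssr / sst
F_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))    # corollary form
```

The two F values agree exactly (up to floating-point rounding), as the corollary asserts.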
143
The General Linear Hypothesis H0 : Cβ = t
The hypothesis H0 : Cβ = t is called the general linear hypothesis. Here C is a q × (k + 1) matrix of (known) coefficients with rank(C) = q. We will consider the slightly simpler case H0 : Cβ = 0 (i.e., t = 0) first.

Most of the questions that are typically asked about the coefficients of a linear model can be formulated as hypotheses that can be written in the form H0 : Cβ = 0, for some C. For example, the hypothesis H0 : β2 = 0 in the model
y = X1β1 + X2β2 + e, e ∼ N(0, σ2I)
can be written as

    H0 : Cβ = (0, Ih)(β1ᵀ, β2ᵀ)ᵀ = β2 = 0,

where the zero block is h × (k + 1 − h). The test of overall regression can be written as

    H0 : Cβ = (0, Ik)(β0, β1, . . . , βk)ᵀ = (β1, . . . , βk)ᵀ = 0,

where the zero block is k × 1.
Hypotheses encompassed by H0 : Cβ = 0 are not limited to ones in which certain regression coefficients are set equal to zero. Another example that can be handled is the hypothesis H0 : β1 = β2 = · · · = βk. For example, suppose k = 4; then this hypothesis can be written as
               ( 0  1  −1   0   0 ) ( β0 )     ( β1 − β2 )
    H0 : Cβ =  ( 0  0   1  −1   0 ) ( β1 )  =  ( β2 − β3 )  = 0.
               ( 0  0   0   1  −1 ) ( β2 )     ( β3 − β4 )
                                    ( β3 )
                                    ( β4 )
Another equally good choice for C in this example is

        ( 0  1  −1   0   0 )
    C = ( 0  1   0  −1   0 ).
        ( 0  1   0   0  −1 )
144
The test statistic for H0 : Cβ = 0 is based on comparing Cβ̂ to its null value 0, using a squared statistical distance (quadratic form) of the form

    Q = {Cβ̂ − E0(Cβ̂)}ᵀ{var0(Cβ̂)}⁻¹{Cβ̂ − E0(Cβ̂)} = (Cβ̂)ᵀ{var0(Cβ̂)}⁻¹(Cβ̂),

since E0(Cβ̂) = 0.
• Here, the 0 subscript indicates that the expected value and variance are computed under H0.
Recall that β̂ ∼ Nk+1(β, σ²(XᵀX)⁻¹). Therefore,

    Cβ̂ ∼ Nq(Cβ, σ²C(XᵀX)⁻¹Cᵀ).
We estimate σ² using s² = MSE = SSE/(n − k − 1), so

    v̂ar0(Cβ̂) = s²C(XᵀX)⁻¹Cᵀ

and Q becomes

    Q = (Cβ̂)ᵀ{s²C(XᵀX)⁻¹Cᵀ}⁻¹Cβ̂ = (Cβ̂)ᵀ{C(XᵀX)⁻¹Cᵀ}⁻¹Cβ̂ / [SSE/(n − k − 1)].
145
To use Q to form a test statistic, we need its distribution, which is given by the following theorem:
Theorem: If y ∼ Nn(Xβ, σ2In) where X is n × (k + 1) of full rank andC is q × (k + 1) of rank q ≤ k + 1, then
(i) Cβ ∼ Nq[Cβ, σ2C(XT X)−1CT ];(ii) (Cβ)T [C(XT X)−1CT ]−1Cβ/σ2 ∼ χ2(q, λ), where
λ = (Cβ)T [C(XT X)−1CT ]−1Cβ/(2σ2);
(iii) SSE/σ2 ∼ χ2(n− k − 1); and(iv) (Cβ)T [C(XT X)−1CT ]−1Cβ and SSE are independent.
Proof: Part (i) follows from the normality of β̂ and the fact that Cβ̂ is an affine transformation of a normal. Part (iii) has been proved previously (p. 138).

(ii) Recall the theorem on the bottom of p. 82 (Thm 5.5A in our text). This theorem said that if y ∼ Nn(µ, Σ) and A is n × n of rank r, then yᵀAy ∼ χ²(r, (1/2)µᵀAµ) iff AΣ is idempotent. Here Cβ̂ plays the role of y, Cβ plays the role of µ, σ²C(XᵀX)⁻¹Cᵀ plays the role of Σ, and {σ²C(XᵀX)⁻¹Cᵀ}⁻¹ plays the role of A. Then the result follows because AΣ = {σ²C(XᵀX)⁻¹Cᵀ}⁻¹σ²C(XᵀX)⁻¹Cᵀ = I is obviously idempotent.

(iv) Since β̂ and SSE are independent (p. 115), (Cβ̂)ᵀ[C(XᵀX)⁻¹Cᵀ]⁻¹Cβ̂ (a function of β̂) and SSE must be independent.
Therefore,

    F = Q/q = (Cβ̂)ᵀ{C(XᵀX)⁻¹Cᵀ}⁻¹Cβ̂/q / [SSE/(n − k − 1)] = (SSH/q)/[SSE/(n − k − 1)]

has the form of a ratio of independent χ²'s, each divided by its d.f.

• Here, SSH denotes (Cβ̂)ᵀ{C(XᵀX)⁻¹Cᵀ}⁻¹Cβ̂, the sum of squares due to the hypothesis H0.
146
Theorem: If y ∼ Nn(Xβ, σ²In), where X is n × (k + 1) of full rank and C is q × (k + 1) of rank q ≤ k + 1, then

    F = (Cβ̂)ᵀ{C(XᵀX)⁻¹Cᵀ}⁻¹Cβ̂/q / [SSE/(n − k − 1)] = (SSH/q)/[SSE/(n − k − 1)]
      ∼ F(q, n − k − 1) if H0 : Cβ = 0 is true;   ∼ F(q, n − k − 1, λ) if H0 : Cβ = 0 is false,
where λ is as in the previous theorem.
Proof: Follows from the previous theorem and the definition of the F distribution.

So, to conduct a hypothesis test of H0 : Cβ = 0, we compute F and reject at level α if F > F1−α(q, n − k − 1). (F1−α denotes the (1 − α)th quantile, or upper αth quantile, of the F distribution.)
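The sketch below (numpy; data invented) computes the general linear hypothesis F statistic for H0 : β1 = β2 = β3 = β4, using both C matrices given above. Since the two C's have the same row space, they define the same hypothesis and yield the same F:

```python
import numpy as np

# Invented dataset with k = 4 predictors
rng = np.random.default_rng(4)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, 0.8, 0.8, 0.8]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sse = np.sum((y - X @ beta_hat) ** 2)

def glh_F(C):
    # F statistic for H0: C beta = 0 with q = number of rows of C
    q = C.shape[0]
    Cb = C @ beta_hat
    ssh = Cb @ np.linalg.solve(C @ XtX_inv @ C.T, Cb)
    return (ssh / q) / (sse / (n - k - 1))

# two equally good choices of C for H0: beta1 = beta2 = beta3 = beta4
C1 = np.array([[0, 1, -1, 0, 0], [0, 0, 1, -1, 0], [0, 0, 0, 1, -1]], float)
C2 = np.array([[0, 1, -1, 0, 0], [0, 1, 0, -1, 0], [0, 1, 0, 0, -1]], float)

F1, F2 = glh_F(C1), glh_F(C2)
```

That F1 equals F2 reflects the fact that the test depends on C only through the constraint it imposes on the model space, not on the particular rows chosen.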
147
The general linear hypothesis as a test of nested models:
We have seen that the test of β2 = 0 in the model y = X1β1 + X2β2 + e can be formulated as a test of Cβ = 0. Therefore, special cases of the general linear hypothesis correspond to tests of nested (full and reduced) models. In fact, all F tests of the general linear hypothesis H0 : Cβ = 0 can be formulated as tests of nested models.

Theorem: The F test for the general linear hypothesis H0 : Cβ = 0 is a full-and-reduced-model test.

Proof: The book, in combination with a homework problem, provides a proof based on Lagrange multipliers. Here we offer a different proof based on geometry.
Under H0,
y = Xβ + e and Cβ = 0
⇒ C(XT X)−1XT Xβ = 0
⇒ C(XT X)−1XT µ = 0
⇒ TT µ = 0 where T = X(XT X)−1CT .
That is, under H0, µ = Xβ ∈ C(X) = V and µ ⊥ C(T), or
µ ∈ [C(T)⊥ ∩ C(X)] = V0
where V0 = C(T)⊥ ∩ C(X) is the orthogonal complement of C(T) with respect to C(X).

• Thus, under H0 : Cβ = 0, µ ∈ V0 ⊂ V = C(X), and under H1 : Cβ ≠ 0, µ ∈ V but µ /∈ V0. That is, these hypotheses correspond to nested models. It just remains to establish that the F test for these nested models is the F test for the general linear hypothesis H0 : Cβ = 0 given on p. 147.
148
The F test statistic for nested models given on p. 139 is

    F = [yᵀ(PC(X) − PC(X1))y/h] / [SSE/(n − k − 1)].

Here, we replace PC(X1) by the projection matrix onto V0,

    PV0 = PC(X) − PC(T),

and replace h with dim(V) − dim(V0), the reduction in dimension of the model space when we go from the full to the reduced model.

Since V0 is the orthogonal complement of C(T) with respect to C(X), dim(V0) is given by

    dim(V0) = dim(C(X)) − dim(C(T)) = rank(X) − rank(T) = k + 1 − q.
Here, rank(T) = q by the following argument:

    rank(T) = rank(Tᵀ) ≥ rank(TᵀX) = rank(C(XᵀX)⁻¹XᵀX) = rank(C) = q

and

    rank(T) = rank(TᵀT) = rank(C(XᵀX)⁻¹XᵀX(XᵀX)⁻¹Cᵀ) = rank(C(XᵀX)⁻¹Cᵀ) ≤ q,

since C(XᵀX)⁻¹Cᵀ is q × q. Therefore, q ≤ rank(T) ≤ q, so rank(T) = q.

So,

    h = dim(V) − dim(V0) = (k + 1) − [(k + 1) − q] = q.
149
Thus the full vs. reduced model F statistic becomes

    F = yᵀ[PC(X) − PV0]y/q / [SSE/(n − k − 1)]
      = yᵀ[PC(X) − (PC(X) − PC(T))]y/q / [SSE/(n − k − 1)]
      = yᵀPC(T)y/q / [SSE/(n − k − 1)],

where

    yᵀPC(T)y = yᵀT(TᵀT)⁻¹Tᵀy
             = yᵀX(XᵀX)⁻¹Cᵀ{C(XᵀX)⁻¹XᵀX(XᵀX)⁻¹Cᵀ}⁻¹C(XᵀX)⁻¹Xᵀy
             = β̂ᵀCᵀ{C(XᵀX)⁻¹Cᵀ}⁻¹Cβ̂       (using β̂ = (XᵀX)⁻¹Xᵀy),

which is our test statistic for the general linear hypothesis H0 : Cβ = 0 from p. 147.
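The key identity in this proof, yᵀPC(T)y = SSH, can be checked numerically. A minimal numpy sketch (invented data) constructs T = X(XᵀX)⁻¹Cᵀ and compares the two quadratic forms:

```python
import numpy as np

# Invented dataset for illustration only
rng = np.random.default_rng(5)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# H0: beta1 = beta2 = 0, so q = 2
C = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.]])
T = X @ XtX_inv @ C.T                       # T = X (X'X)^{-1} C'

# left side: y' P_{C(T)} y, with P_{C(T)} = T (T'T)^{-1} T'
PT = T @ np.linalg.solve(T.T @ T, T.T)
qf_proj = y @ PT @ y

# right side: SSH = (C beta_hat)' {C (X'X)^{-1} C'}^{-1} (C beta_hat)
Cb = C @ beta_hat
qf_glh = Cb @ np.linalg.solve(C @ XtX_inv @ C.T, Cb)
```

The two quadratic forms agree, which is exactly the chain of equalities displayed above.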
The case H0 : Cβ = t where t 6= 0:
Extension to this case is straightforward. The only requirement is that the system of equations Cβ = t be consistent, which is ensured by C having full row rank q.
Then the F test statistic for H0 : Cβ = t is given by

    F = (Cβ̂ − t)ᵀ[C(XᵀX)⁻¹Cᵀ]⁻¹(Cβ̂ − t)/q / [SSE/(n − k − 1)]
      ∼ F(q, n − k − 1) under H0;   ∼ F(q, n − k − 1, λ) otherwise,

where λ = (Cβ − t)ᵀ[C(XᵀX)⁻¹Cᵀ]⁻¹(Cβ − t)/(2σ²).
150
Tests on βj and on aT β:
Tests of H0 : βj = 0 or H0 : aᵀβ = 0 occur as special cases of the tests we have already considered. To test H0 : aᵀβ = 0, we use aᵀ in place of C in our test of the general linear hypothesis Cβ = 0. In this case q = 1 and the test statistic becomes

    F = (aᵀβ̂)ᵀ[aᵀ(XᵀX)⁻¹a]⁻¹aᵀβ̂ / [SSE/(n − k − 1)] = (aᵀβ̂)² / [s²aᵀ(XᵀX)⁻¹a]
      ∼ F(1, n − k − 1) under H0 : aᵀβ = 0.

• Note that since t²(ν) = F(1, ν), an equivalent test of H0 : aᵀβ = 0 is given by the t-test with test statistic

    t = aᵀβ̂ / [s√(aᵀ(XᵀX)⁻¹a)] ∼ t(n − k − 1) under H0.
An important special case of the hypothesis H0 : aᵀβ = 0 occurs when a = (0, . . . , 0, 1, 0, . . . , 0)ᵀ, where the 1 appears in the (j + 1)th position. This is the hypothesis H0 : βj = 0, and it says that the jth explanatory variable xj has no partial regression effect on y (no effect above and beyond the effects of the other explanatory variables in the model).
The test statistic for this hypothesis simplifies from that given above to yield

    F = (β̂j)² / (s²gjj) ∼ F(1, n − k − 1) under H0 : βj = 0,

where gjj is the diagonal element of (XᵀX)⁻¹ corresponding to βj. Equivalently, we could use the t test statistic

    t = β̂j / (s√gjj) = β̂j / s.e.(β̂j) ∼ t(n − k − 1) under H0 : βj = 0.
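The relationship t² = F for a single-coefficient test is immediate to verify. A short numpy sketch (invented data):

```python
import numpy as np

# Invented dataset for illustration only
rng = np.random.default_rng(6)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=n)

G = np.linalg.inv(X.T @ X)                    # (X'X)^{-1}
beta_hat = G @ X.T @ y
s2 = np.sum((y - X @ beta_hat) ** 2) / (n - k - 1)

j = 1                                          # test H0: beta_1 = 0
g_jj = G[j, j]                                 # diagonal element for beta_1
t = beta_hat[j] / np.sqrt(s2 * g_jj)           # t statistic
F = beta_hat[j] ** 2 / (s2 * g_jj)             # F statistic
```

Squaring the t statistic recovers the F statistic exactly, consistent with t²(ν) = F(1, ν).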
151
Confidence and Prediction Intervals
Hypothesis tests and confidence regions (e.g., intervals) are really two different ways to look at the same problem.

• For an α-level test of a hypothesis of the form H0 : θ = θ0, a 100(1 − α)% confidence region for θ is given by all those values of θ0 such that the hypothesis would not be rejected. That is, the acceptance region of the α-level test is the 100(1 − α)% confidence region for θ.

• Conversely, θ0 falls outside of a 100(1 − α)% confidence region for θ iff an α-level test of H0 : θ = θ0 is rejected.

• That is, we can invert the statistical tests that we have derived to obtain confidence regions for parameters of the linear model.
Confidence Region for β:
If we set C = Ik+1 and t = β in the F statistic on the bottom of p. 150, we obtain

    (β̂ − β)ᵀXᵀX(β̂ − β)/[(k + 1)s²] ∼ F(k + 1, n − k − 1).

From this distributional result, we can make the probability statement

    Pr{ (β̂ − β)ᵀXᵀX(β̂ − β)/[s²(k + 1)] ≤ F1−α(k + 1, n − k − 1) } = 1 − α.

Therefore, the set of all vectors β that satisfy

    (β̂ − β)ᵀXᵀX(β̂ − β) ≤ (k + 1)s²F1−α(k + 1, n − k − 1)

forms a 100(1 − α)% confidence region for β.
152
• Such a region is an ellipsoid, and it is easy to draw and interpret only for k = 1 (e.g., simple linear regression).

• If one can't plot the region and then plot a point to see whether it's in or out of the region (i.e., for k > 1), then this region isn't any more informative than the test of H0 : β = β0. To decide whether β0 is in the region, we essentially have to perform the test!

• More useful are confidence intervals for the individual βj's and for linear combinations of the form aᵀβ.
Confidence Interval for aT β:
If we set C = aᵀ and t = aᵀβ in the F statistic on the bottom of p. 150, we obtain

    (aᵀβ̂ − aᵀβ)² / [s²aᵀ(XᵀX)⁻¹a] ∼ F(1, n − k − 1),

which implies

    (aᵀβ̂ − aᵀβ) / [s√(aᵀ(XᵀX)⁻¹a)] ∼ t(n − k − 1).

From this distributional result, we can make the probability statement

    Pr{ −t1−α/2(n − k − 1) ≤ (aᵀβ̂ − aᵀβ)/[s√(aᵀ(XᵀX)⁻¹a)] ≤ t1−α/2(n − k − 1) } = 1 − α

(note that tα/2(n − k − 1) = −t1−α/2(n − k − 1)). Rearranging this inequality so that aᵀβ falls in the middle, we get

    Pr{ aᵀβ̂ − t1−α/2(n − k − 1)s√(aᵀ(XᵀX)⁻¹a) ≤ aᵀβ ≤ aᵀβ̂ + t1−α/2(n − k − 1)s√(aᵀ(XᵀX)⁻¹a) } = 1 − α.

Therefore, a 100(1 − α)% CI for aᵀβ is given by

    aᵀβ̂ ± t1−α/2(n − k − 1) s√(aᵀ(XᵀX)⁻¹a).
153
Confidence Interval for βj:
A special case of this interval occurs when a = (0, . . . , 0, 1, 0, . . . , 0)ᵀ, where the 1 is in the (j + 1)th position. In this case aᵀβ = βj, aᵀβ̂ = β̂j, and aᵀ(XᵀX)⁻¹a = {(XᵀX)⁻¹}jj ≡ gjj. The confidence interval for βj is then given by

    β̂j ± t1−α/2(n − k − 1) s√gjj.
Confidence Interval for E(y):
Let x0 = (1, x01, x02, . . . , x0k)ᵀ denote a particular choice of the vector of explanatory variables x = (1, x1, x2, . . . , xk)ᵀ, and let y0 denote the corresponding response.

We assume that the model y = Xβ + e, e ∼ N(0, σ²I), applies to (y0, x0) as well. This may be because (y0, x0) were in the original sample to which the model was fit (i.e., x0ᵀ is a row of X), or because we believe that (y0, x0) will behave similarly to the data (y, X) in the sample. Then

    y0 = x0ᵀβ + e0,   e0 ∼ N(0, σ²),

where β and σ² are the same parameters as in the fitted model y = Xβ + e.
Suppose we wish to find a CI for

    E(y0) = x0ᵀβ.

This quantity is of the form aᵀβ with a = x0, so the BLUE of E(y0) is x0ᵀβ̂, and a 100(1 − α)% CI for E(y0) is given by

    x0ᵀβ̂ ± t1−α/2(n − k − 1) s√(x0ᵀ(XᵀX)⁻¹x0).
154
• This confidence interval holds for a particular value x0ᵀβ. Sometimes, it is of interest to form simultaneous confidence intervals around each and every point x0ᵀβ for all x0 in the range of x. That is, we sometimes desire a simultaneous confidence band for the entire regression line (or plane, for k > 1). The confidence interval given above, if plotted for each value of x0, does not give such a simultaneous band; instead it gives a "point-wise" band. For discussion of simultaneous intervals, see §8.6.7 of our text.

• The confidence interval given above is for E(y0), not for y0 itself. E(y0) is a parameter; y0 is a random variable. Therefore, we can't estimate y0 or form a confidence interval for it. However, we can predict its value, and an interval around that prediction that quantifies the uncertainty associated with that prediction is called a prediction interval.
Prediction Interval for an Unobserved y-value:
For an unobserved value y0 with known explanatory vector x0 assumed to follow our linear model y = Xβ + e, we predict y0 by

    ŷ0 = x0ᵀβ̂.

• Note that this predictor of y0 coincides with our estimator of E(y0). However, the uncertainty associated with the quantity x0ᵀβ̂ as a predictor of y0 is different from (greater than) its uncertainty as an estimator of E(y0). Why? Because observations (e.g., y0) are more variable than their means (e.g., E(y0)).
155
To form a CI for the estimator x0ᵀβ̂ of E(y0), we examine the variance of the error of estimation:

    var{E(y0) − x0ᵀβ̂} = var(x0ᵀβ̂).

In contrast, to form a PI for the predictor x0ᵀβ̂ of y0, we examine the variance of the error of prediction:

    var(y0 − x0ᵀβ̂) = var(y0) + var(x0ᵀβ̂) − 2cov(y0, x0ᵀβ̂)      (the covariance is 0)
                   = var(x0ᵀβ + e0) + var(x0ᵀβ̂)
                   = var(e0) + var(x0ᵀβ̂) = σ² + σ²x0ᵀ(XᵀX)⁻¹x0.
Since σ² is unknown, we must estimate this quantity with s², yielding

    v̂ar(y0 − ŷ0) = s²{1 + x0ᵀ(XᵀX)⁻¹x0}.
It’s not hard to show thaty0 − y0
s√
1 + xT0 (XT X)−1x0
∼ t(n− k − 1),
therefore
Pr
{−t1−α/2(n− k − 1) ≤ y0 − y0
s√
1 + xT0 (XT X)−1x0
≤ t1−α/2(n− k − 1)
}= 1−α.
Rearranging,
Pr{
y0 − t1−α/2(n− k − 1)s√
1 + xT0 (XT X)−1x0 ≤ y0
≤ y0 + t1−α/2(n− k − 1)s√
1 + xT0 (XT X)−1x0
}= 1− α.
Therefore, a 100(1− α)% prediction interval for y0 is given by
y0 ± t1−α/2(n− k − 1)s√
1 + xT0 (XT X)−1x0.
• Once again, this is a point-wise interval. Simultaneous prediction intervals for predicting multiple y-values with given coverage probability are discussed in §8.6.7.
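The only difference between the CI for E(y0) and the PI for y0 is the extra "1 +" under the square root. The numpy sketch below (invented data and an invented x0) computes both standard errors and checks the variance decomposition:

```python
import numpy as np

# Invented dataset for illustration only
rng = np.random.default_rng(7)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

G = np.linalg.inv(X.T @ X)
beta_hat = G @ X.T @ y
s2 = np.sum((y - X @ beta_hat) ** 2) / (n - k - 1)

x0 = np.array([1.0, 0.3, -0.2])              # hypothetical new covariate vector
est = x0 @ beta_hat                           # estimate of E(y0) and predictor of y0

se_mean = np.sqrt(s2 * (x0 @ G @ x0))         # std. error for the CI on E(y0)
se_pred = np.sqrt(s2 * (1 + x0 @ G @ x0))     # std. error for the PI on y0
# endpoints are est +/- t_{1-alpha/2}(n - k - 1) times the relevant std. error
```

The prediction standard error is always larger, and satisfies se_pred² = s² + se_mean², reflecting var(y0 − ŷ0) = σ² + var(x0ᵀβ̂).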
156
Equivalence of the F−test and Likelihood Ratio Test:
Recall that for the classical linear model y = Xβ + e, with normal, homoscedastic errors, the likelihood function is given by
L(β, σ2;y) = (2πσ2)−n/2 exp{−‖y −Xβ‖2/(2σ2)},
or expressing the likelihood as a function of µ = Xβ instead of β:
L(µ, σ2;y) = (2πσ2)−n/2 exp{−‖y − µ‖2/(2σ2)}.
• L(µ, σ²; y) gives the probability of observing y for specified values of the parameters µ and σ² (to be more precise, the probability that the response vector is "close to" the observed value y),

– or, roughly, it measures how likely the data are for given values of the parameters.
The idea behind a likelihood ratio test (LRT) for some hypothesis H0 is to compare the likelihood function maximized over the parameters subject to the restriction imposed by H0 (the constrained maximum likelihood) with the likelihood function maximized over the parameters without assuming H0 is true (the unconstrained maximum likelihood).

• That is, we compare how probable the data are under the most favorable values of the parameters subject to H0 (the constrained MLEs) with how probable the data are under the most favorable values of the parameters under the maintained hypothesis (the unconstrained MLEs).

• If assuming H0 makes the data substantially less probable than not assuming H0, then we reject H0.
157
Consider testing H0 : µ ∈ V0 versus H1 : µ /∈ V0 under the maintained hypothesis that µ is in V. Here V0 ⊂ V and dim(V0) = k + 1 − h ≤ k + 1 = dim(V).

Let ŷ = p(y|V) and ŷ0 = p(y|V0). Then the unconstrained MLEs of (µ, σ²) are µ̂ = ŷ and σ̂² = ‖y − ŷ‖²/n, and the constrained MLEs are µ̂0 = ŷ0 and σ̂0² = ‖y − ŷ0‖²/n.
Therefore, the likelihood ratio statistic is

    LR = sup over µ ∈ V0 of L(µ, σ²; y) / sup over µ ∈ V of L(µ, σ²; y) = L(ŷ0, σ̂0²)/L(ŷ, σ̂²)
       = (2πσ̂0²)^(−n/2) exp{−‖y − ŷ0‖²/(2σ̂0²)} / [(2πσ̂²)^(−n/2) exp{−‖y − ŷ‖²/(2σ̂²)}]
       = (σ̂0²/σ̂²)^(−n/2) exp(−n/2)/exp(−n/2) = (σ̂0²/σ̂²)^(−n/2).
• We reject for small values of LR. Typically in LRTs, we work with λ = −2 log(LR) so that we can reject for large values of λ. In this case, λ = n log(σ̂0²/σ̂²).
• Equivalently, we reject for large values of σ̂0²/σ̂², where

    σ̂0²/σ̂² = ‖y − ŷ0‖²/‖y − ŷ‖² = [‖y − ŷ‖² + ‖ŷ − ŷ0‖²]/‖y − ŷ‖²
            = 1 + ‖ŷ − ŷ0‖²/‖y − ŷ‖² = 1 + [h/(n − k − 1)] F,

a monotone function of F (cf. the F statistic on the top of p. 138).
Therefore, large values of λ correspond to large values of F, and the decision rules based on LR and on F are the same.
• Therefore, the LRT and the F−test are equivalent.
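The monotone relationship between the variance ratio and F is easy to confirm numerically. A numpy sketch (invented data; reduced model with h = 2 terms dropped):

```python
import numpy as np

# Invented data: full model has 4 columns, reduced model drops 2 of them
rng = np.random.default_rng(8)
n = 30
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # reduced model
X2 = rng.normal(size=(n, 2))                             # h = 2 extra terms
X = np.hstack([X1, X2])                                  # full model
y = X1 @ np.array([1.0, 2.0]) + rng.normal(size=n)

def sse(A, y):
    # residual sum of squares from a least-squares fit of y on A
    b = np.linalg.lstsq(A, y, rcond=None)[0]
    return np.sum((y - A @ b) ** 2)

sse_full, sse_red = sse(X, y), sse(X1, y)
h = X2.shape[1]
dfe = n - X.shape[1]                                     # n - k - 1

F = ((sse_red - sse_full) / h) / (sse_full / dfe)
# sigma0_hat^2 / sigma_hat^2 = SSE(RM)/SSE(FM) (the n's cancel)
ratio = sse_red / sse_full
```

The variance ratio equals 1 + hF/(n − k − 1) exactly, so rejecting for large F and rejecting for large λ = n·log(ratio) are the same decision rule.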
158
Analysis of Variance Models: The Non-Full Rank Linear Model
• To this point, we have focused exclusively on the case in which the model matrix X of the linear model is of full rank. We now consider the case when X is n × p with rank(X) = k < p.

• The basic ideas behind estimation and inference in this case are the same as in the full rank case, but the fact that (XᵀX)⁻¹ doesn't exist, and therefore the normal equations have no unique solution, causes a number of technical complications.

• We wouldn't bother to dwell on these technicalities if it weren't for the fact that the non-full rank case arises frequently in applications in the form of analysis of variance models.
The One-way Model:
Consider the balanced one-way layout model for yij, a response on the jth unit in the ith treatment group. Suppose that there are a treatments and n units in each treatment group. The cell-means model for this situation is

    yij = µi + eij,   i = 1, . . . , a,   j = 1, . . . , n,

where the eij's are i.i.d. N(0, σ²).

An alternative, but equivalent, linear model is the effects model for the one-way layout:
yij = µ + αi + eij , i = 1, . . . , a, j = 1, . . . , n,
with the same assumptions on the errors.
159
The cell means model can be written in vector notation as
y = µ1x1 + µ2x2 + · · ·+ µaxa + e, e ∼ N(0, σ2I),
and the effects model can be written as
y = µjN + α1x1 + α2x2 + · · ·+ αaxa + e, e ∼ N(0, σ2I),
where xi is the indicator vector for treatment i, and N = an is the total sample size.

• That is, the effects model has the same model matrix as the cell-means model, but with one extra column, a column of ones, in the first position.

• Notice that Σi xi = jN. Therefore, the columns of the model matrix for the effects model are linearly dependent.

Let X1 denote the model matrix of the cell-means model, and let X2 = (jN, X1) denote the model matrix of the effects model.

• Note that C(X1) = C(X2).

In general, two linear models y = X1β1 + e1 and y = X2β2 + e2, with the same assumptions on e1 and e2, are equivalent linear models if C(X1) = C(X2).
Why?
Because the mean vectors µ1 = X1β1 and µ2 = X2β2 in the two cases are both restricted to fall in the same subspace C(X1) = C(X2).
In addition,

    µ̂1 = p(y|C(X1)) = p(y|C(X2)) = µ̂2

is the same in both models, and

    S² = ‖y − µ̂1‖²/[n − dim(C(X1))] = ‖y − µ̂2‖²/[n − dim(C(X2))]

is the same in both models.
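The rank deficiency of the effects model, and the equality of the fitted means across the two parameterizations, can be seen directly in a small numpy sketch (a = 3 treatments, n = 2 units each; the responses are invented):

```python
import numpy as np

a, n = 3, 2
N = a * n
# indicator columns x1, x2, x3 for the three treatments
Xind = np.kron(np.eye(a), np.ones((n, 1)))
X1 = Xind                                    # cell-means model matrix (N x a)
X2 = np.hstack([np.ones((N, 1)), Xind])      # effects model: extra column of ones

rank1 = np.linalg.matrix_rank(X1)            # a = 3
rank2 = np.linalg.matrix_rank(X2)            # still 3: 4 columns are dependent

y = np.array([10.2, 9.8, 8.1, 7.9, 5.0, 5.2])   # invented responses
# fitted means agree because C(X1) = C(X2); lstsq handles the deficient case
mu1 = X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
mu2 = X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]
```

Both model matrices have rank 3 even though X2 has 4 columns, and both fits reproduce the treatment sample means, illustrating that µ̂ depends only on the column space.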
160
• The cell-means and effects models are simply reparameterizations of one another. The relationship between the parameters in this case is very simple: µi = µ + αi, i = 1, . . . , a.

• Let V = C(X1) = C(X2). In the case of the cell-means model, rank(X1) = a = dim(V) and β1 is a × 1. In the case of the effects model, rank(X2) = a = dim(V) but β2 is (a + 1) × 1. The effects model is overparameterized.
To understand overparameterization, consider the model
yij = µ + αi + eij , i = 1, . . . , a, j = 1, . . . , n.
This model says that E(yij) = µ + αi = µi, or
E(y1j) = µ + α1 = µ1 j = 1, . . . , n,
E(y2j) = µ + α2 = µ2 j = 1, . . . , n,
...
Suppose the true treatment means are µ1 = 10 and µ2 = 8. In terms of the parameters of the effects model, µ and the αi's, these means can be represented in an infinity of possible ways,
E(y1j) = 10 + 0 j = 1, . . . , n,
E(y2j) = 10 + (−2) j = 1, . . . , n,
(µ = 10, α1 = 0, and α2 = −2), or
E(y1j) = 8 + 2 j = 1, . . . , n,
E(y2j) = 8 + 0 j = 1, . . . , n,
(µ = 8, α1 = 2, and α2 = 0), or
E(y1j) = 1 + 9 j = 1, . . . , n,
E(y2j) = 1 + 7 j = 1, . . . , n,
(µ = 1, α1 = 9, and α2 = 7), etc.
161
Why would we want to consider an overparameterized model like the effects model?
In a simple case like the one-way layout, I would argue that we wouldn’t.
The most important criterion for choice of parameterization of a model is interpretability. Without imposing any constraints, the parameters of the effects model do not have clear interpretations.
However, subject to the constraint ∑i αi = 0, the parameters of the effects model have the following interpretations:
µ = the grand mean response across all treatments
αi = the deviation from the grand mean placing µi (the ith treatment mean) up or down from the grand mean; i.e., the effect of the ith treatment.
Without the constraint, though, µ is not constrained to fall in the center of the µi's; µ is in no sense the grand mean, it is just an arbitrary baseline value.
In addition, adding the constraint ∑i αi = 0 has essentially the effect of reparameterizing from the overparameterized (non-full rank) effects model to a just-parameterized (full rank) model that is equivalent (in the sense of having the same model space) to the cell-means model.
To see this, consider the one-way effects model with a = 3, n = 2. Then ∑i αi = 0 implies α1 + α2 + α3 = 0, or α3 = −(α1 + α2). Subject to the constraint, the effects model is
y = µjN + α1x1 + α2x2 + α3x3 + e, where α3 = −(α1 + α2),
162
or
y = µjN + α1x1 + α2x2 + (−α1 − α2)x3 + e
= µjN + α1(x1 − x3) + α2(x2 − x3) + e
= µ(1, 1, 1, 1, 1, 1)T + α1(1, 1, 0, 0, −1, −1)T + α2(0, 0, 1, 1, −1, −1)T + e,
which has the same model space as the cell-means model.
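This equality of model spaces is easy to check numerically: two design matrices span the same column space exactly when stacking them side by side does not increase the rank. A minimal sketch in numpy (the matrices are the a = 3, n = 2 designs from the text):

```python
import numpy as np

# Cell-means design (a = 3 treatments, n = 2 replicates): one indicator per treatment.
X1 = np.kron(np.eye(3), np.ones((2, 1)))          # 6 x 3

# Constrained effects design: intercept, x1 - x3, x2 - x3 (from alpha3 = -(alpha1 + alpha2)).
Xc = np.column_stack([np.ones(6),
                      X1[:, 0] - X1[:, 2],
                      X1[:, 1] - X1[:, 2]])       # 6 x 3

# C(X1) = C(Xc) iff neither matrix adds directions to the other.
r1 = np.linalg.matrix_rank(X1)
rc = np.linalg.matrix_rank(Xc)
r_both = np.linalg.matrix_rank(np.hstack([X1, Xc]))
print(r1, rc, r_both)   # all three ranks equal 3, so the model spaces coincide
```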
Thus, when faced with a non-full rank model like the one-way effects model, we have three ways to proceed:
(1) Reparameterize to a full rank model.
– E.g., switch from the effects model to the cell-means model.
(2) Add constraints to the model parameters to remove the overparameterization.
– E.g., add a constraint such as ∑i αi = 0 to the one-way effects model.
– Such constraints are usually called side-conditions.
– Adding side conditions essentially accomplishes a reparameterization to a full rank model as in (1).
(3) Analyze the model as a non-full rank model, but limit estimation and inference to those functions of the (overparameterized) parameters that can be uniquely estimated.
– Such functions of the parameters are called estimable.
– It is only in this case that we are actually using an overparameterized model, for which some new theory is necessary. (In cases (1) and (2) we remove the overparameterization somehow.)
163
Why would we choose option (3)?
Three reasons:
i. We can. Although we may lose nice parameter interpretations in using an unconstrained effects model or other unconstrained, non-full-rank model, there is no theoretical or methodological reason to avoid them (they can be handled with a little extra trouble).
ii. It is often easier to formulate an appropriate (and possibly overparameterized) model without worrying about whether or not it's of full rank than to specify that model's “full-rank version” or to identify and impose the appropriate constraints on the model to make it full rank. This is especially true in modelling complex experimental data that are not balanced.
iii. Non-full rank model matrices may arise for reasons other than the structure of the model that's been specified. E.g., in an observational study, several explanatory variables may be collinear.
So, let’s consider the overparameterized (non-full-rank) case.
• In the non-full-rank case, it is not possible to obtain linear unbiased estimators of all of the components of β.
To illustrate this, consider the effects version of the one-way layout model with no parameter constraints.
Can we find an unbiased linear estimator of α1?
To be linear, such an estimator (call it T ) would be of the form T = ∑i ∑j dij yij for some coefficients {dij}. For T to be unbiased we require E(T ) = α1. However,
E(T ) = E(∑i ∑j dij yij) = ∑i ∑j dij(µ + αi) = µd·· + ∑i di·αi .
Thus, the unbiasedness requirement E(T ) = α1 implies d·· = 0, d1· = 1, d2· = · · · = da· = 0. This is impossible! (These conditions on the di·'s force d·· = ∑i di· = 1 ≠ 0.)
164
• So, α1 is non-estimable. In fact, all of the parameters of the unconstrained one-way effects model are non-estimable. More generally, in any non-full rank linear model, at least one of the individual parameters of the model is not estimable.
If the parameters of a non-full rank linear model are non-estimable, what does least-squares yield?
Even if X is not of full rank, the least-squares criterion is still a reasonable one for estimation, and it still leads to the normal equations:
XT Xβ = XT y. (♣)
Theorem: For X an n × p matrix of rank k < p ≤ n, (♣) is a consistent system of equations.
Proof: By the Theorem on p. 60 of these notes, (♣) is consistent iff
XT X(XT X)−XT y = XT y.
But this equation holds by result 3, on p. 57.
So (♣) is consistent, and therefore has a non-unique (for X not of full rank) solution given by
β̂ = (XT X)−XT y,
where (XT X)− is some (any) generalized inverse of XT X.
What does β̂ estimate in the non-full rank case?
Well, in general a statistic estimates its expectation, so for a particular generalized inverse (XT X)−, β̂ estimates
E(β̂) = E{(XT X)−XT y} = (XT X)−XT E(y) = (XT X)−XT Xβ ≠ β.
• That is, in the non-full rank case, β̂ = (XT X)−XT y is not unbiased for β. This is not surprising given that we said earlier that β is not estimable.
165
• Note that E(β̂) = (XT X)−XT Xβ depends upon which (of many possible) generalized inverses (XT X)− is used in β̂ = (XT X)−XT y. That is, β̂, a solution of the normal equations, is not unique, and each possible choice estimates something different.
• This is all to reiterate that β is not estimable, and β̂ is not an estimator of β in the non-full rank model. However, certain linear combinations of β are estimable, and we will see that the corresponding linear combinations of β̂ are BLUEs of these estimable quantities.
Estimability: Let λ = (λ1, . . . , λp)T be a vector of constants. The parameter λT β = ∑j λjβj is said to be estimable if there exists a vector a in Rn such that
E(aT y) = λT β, for all β ∈ Rp. (†)
Since (†) is equivalent to aT Xβ = λT β for all β, it follows that λT β is estimable if and only if there exists a such that XT a = λ (i.e., iff λ lies in the row space of X).
This and two other necessary and sufficient conditions for estimability of λT β are given in the following theorem:
Theorem: In the model y = Xβ + e, where E(y) = Xβ and X is n × p of rank k < p ≤ n, the linear function λT β is estimable if and only if any one of the following conditions holds:
(i) λ lies in the row space of X. I.e., λ ∈ C(XT ), or, equivalently, there exists a vector a such that
λ = XT a.
(ii) λ ∈ C(XT X). I.e., there exists a vector r such that
λ = (XT X)r.
(iii) λ satisfies
XT X(XT X)−λ = λ,
where (XT X)− is any symmetric generalized inverse of XT X.
166
Proof: Part (i) follows from the comment directly following the definition of estimability. That is, (†), the definition of estimability, is equivalent to aT Xβ = λT β for all β, which happens iff aT X = λT ⇔ λ = XT a ⇔ λ ∈ C(XT ).
Now, condition (iii) is equivalent to condition (i) because (i) implies (iii): (i) implies
λT (XT X)−XT X = aT X(XT X)−XT X = aT X = λT
(using X(XT X)−XT X = X), which, taking transposes of both sides, implies (iii); and (iii) implies (i): (iii) says
λ = XT X(XT X)−λ,
which is of the form XT a for a = X(XT X)−λ.
Finally, condition (ii) is equivalent to condition (iii) because (ii) implies (iii): (ii) implies
XT X(XT X)−λ = XT X(XT X)−XT Xr = XT Xr = λ;
and (iii) implies (ii): (iii) says
λ = XT X(XT X)−λ,
which is of the form XT Xa for a = (XT X)−λ.
167
Example: Let x1 = (1, 1, 1, 1)T , x2 = (1, 1, 1, 0)T and x3 = 3x1 − 2x2 = (1, 1, 1, 3)T . Then X = (x1, x2, x3) is 4 × 3 but has rank of only 2.
Consider the linear combination η = 5β1 + 3β2 + 9β3 = λT β, where λ = (5, 3, 9)T . η is estimable because λ is in the row space of X:
λ = (5, 3, 9)T = XT a, where the rows of XT are (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 1, 3) and a = (1, 1, 1, 2)T .
The parameters β1 and β1 − β2 are not estimable. Why? Because for λT β to be estimable, there must exist an a so that
XT a = λ,
or
xT1 a = λ1, xT2 a = λ2, and 3xT1 a − 2xT2 a = λ3,
which implies 3λ1 − 2λ2 = λ3 must hold for λT β to be estimable. This equality does not hold for β1 = 1β1 + 0β2 + 0β3 or for β1 − β2 = 1β1 + (−1)β2 + 0β3. It does hold for λ = (5, 3, 9)T because 3(5) − 2(3) = 9.
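A numerical version of this check: λT β is estimable iff appending λ as an extra row of X (equivalently, an extra column of XT ) leaves the rank unchanged. A sketch using the matrices of this example:

```python
import numpy as np

x1 = np.array([1., 1., 1., 1.])
x2 = np.array([1., 1., 1., 0.])
x3 = 3 * x1 - 2 * x2                 # (1, 1, 1, 3); X has rank 2
X = np.column_stack([x1, x2, x3])

def estimable(lam, X):
    """lam' beta is estimable iff lam lies in the row space of X."""
    return np.linalg.matrix_rank(np.vstack([X, lam])) == np.linalg.matrix_rank(X)

print(estimable(np.array([5., 3., 9.]), X))    # True:  3(5) - 2(3) = 9
print(estimable(np.array([1., 0., 0.]), X))    # False: beta1 alone is not estimable
print(estimable(np.array([1., -1., 0.]), X))   # False: beta1 - beta2 is not estimable
```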
168
Theorem: In the non-full rank linear model y = Xβ + e, the number of linearly independent estimable functions of β is the rank of X.
Proof: This follows from the fact that estimable functions λT β must satisfy λ ∈ C(XT ) and dim{C(XT )} = rank(XT ) = rank(X).
• Let xTi be the ith row of X. Since each xi is in the row space of X, it follows that every xTi β (every element of µ = Xβ) is estimable, i = 1, . . . , n.
• Similarly, from the theorem on p. 165, every row (element) of XT Xβ is estimable, and therefore XT Xβ itself is estimable.
• In fact, all estimable functions can be obtained from Xβ or XT Xβ.
Theorem: In the model y = Xβ + e, where E(y) = Xβ and X is n × p of rank k < p ≤ n, any estimable function λT β can be obtained by taking a linear combination of the elements of Xβ or of the elements of XT Xβ.
Proof: Follows directly from the theorem on p. 166.
Example: The one-way layout model (effects version).
Consider again the effects version of the (balanced) one way layout model:
yij = µ + αi + eij , i = 1, . . . , a, j = 1, . . . , n.
Suppose that a = 3 and n = 2. Then, in matrix notation, this model is
(y11, y12, y21, y22, y31, y32)T = X(µ, α1, α2, α3)T + e, where
X =
1 1 0 0
1 1 0 0
1 0 1 0
1 0 1 0
1 0 0 1
1 0 0 1
169
The previous theorem says that any estimable function of β can be obtained as a linear combination of the elements of Xβ. In addition, by the theorem on p. 166, vice versa (any linear combination of the elements of Xβ is estimable).
So, any linear combination aT Xβ for some a is estimable.
Examples:
aT = (1, 0, −1, 0, 0, 0) ⇒ aT Xβ = (0, 1, −1, 0)β = α1 − α2
aT = (0, 0, 1, 0, −1, 0) ⇒ aT Xβ = (0, 0, 1, −1)β = α2 − α3
aT = (1, 0, 0, 0, −1, 0) ⇒ aT Xβ = (0, 1, 0, −1)β = α1 − α3
aT = (1, 0, 0, 0, 0, 0) ⇒ aT Xβ = (1, 1, 0, 0)β = µ + α1
aT = (0, 0, 1, 0, 0, 0) ⇒ aT Xβ = (1, 0, 1, 0)β = µ + α2
aT = (0, 0, 0, 0, 1, 0) ⇒ aT Xβ = (1, 0, 0, 1)β = µ + α3
• So, all treatment means (quantities of the form µ + αi) are estimable, and all pairwise differences in the treatment effects (quantities of the form αi − αj) are estimable in the one-way layout model. Actually, any contrast in the treatment effects is estimable. A contrast is a linear combination whose coefficients sum to zero.
• Thus, even though the individual parameters (µ, α1, α2, . . .) of the one-way layout model are non-estimable, the model is still useful, because all of the quantities of interest in the model (treatment means and contrasts) are estimable.
170
Estimation in the non-full rank linear model:
A natural candidate for an estimator of an estimable function λT β is λT β̂, where β̂ is a solution of the least squares normal equations XT Xβ = XT y (that is, where β̂ = (XT X)−XT y for some generalized inverse (XT X)−).
The following theorem shows that this estimator is unbiased, and even though β̂ is not unique, λT β̂ is.
Theorem: Let λT β be an estimable function of β in the model y = Xβ + e, where E(y) = Xβ and X is n × p of rank k < p ≤ n. Let β̂ be any solution of the normal equations XT Xβ = XT y. Then the estimator λT β̂ has the following properties:
(i) (unbiasedness) E(λT β̂) = λT β; and
(ii) (uniqueness) λT β̂ is invariant to the choice of β̂ (to the choice of generalized inverse (XT X)− in the formula β̂ = (XT X)−XT y).
Proof: Part (i):
E(λT β̂) = λT E(β̂) = λT (XT X)−XT Xβ = λT β,
where the last equality follows from part (iii) of the theorem on p. 165.
Part (ii): Because λT β is estimable, λ = XT a for some a. Therefore,
λT β̂ = aT X(XT X)−XT y = aT PC(X)y.
The result now follows from the fact that projection matrices are unique (see pp. 57–58).
• Note that λT β̂ can be written as λT β̂ = rT XT y for r a solution of XT Xr = λ. (This fact is used quite a bit in our book.)
171
Theorem: Under the conditions of the previous theorem, and where var(e) = var(y) = σ2I, the variance of λT β̂ is unique, and is given by
var(λT β̂) = σ2 λT (XT X)−λ,
where (XT X)− is any generalized inverse of XT X.
Proof:
var(λT β̂) = λT var((XT X)−XT y)λ
= λT (XT X)−XT (σ2I)X{(XT X)−}T λ
= σ2 λT (XT X)−XT X{(XT X)−}T λ
= σ2 λT {(XT X)−}T λ   (since λT (XT X)−XT X = λT )
= σ2 aT X{(XT X)−}T XT a   (for some a, since λ = XT a)
= σ2 aT X(XT X)−XT a = σ2 λT (XT X)−λ.
Uniqueness: since λT β is estimable, λ = XT a for some a. Therefore,
var(λT β̂) = σ2 λT (XT X)−λ = σ2 aT X(XT X)−XT a = σ2 aT PC(X)a.
Again, the result follows from the fact that projection matrices are unique.
Theorem: Let λT1 β and λT2 β be two estimable functions in the model considered in the previous theorem (the spherical errors, non-full-rank linear model). Then the covariance of the least-squares estimators of these quantities is
cov(λT1 β̂, λT2 β̂) = σ2 λT1 (XT X)−λ2.
Proof: Homework.
172
In the full rank linear model, the Gauss-Markov theorem established that λT β̂ = λT (XT X)−1XT y was the BLUE of its mean λT β. This result holds in the non-full rank linear model as well, as long as λT β is estimable.
Theorem: (Gauss-Markov in the non-full rank case) If λT β is estimable in the spherical errors non-full rank linear model y = Xβ + e, then λT β̂ is its BLUE.
Proof: Since λT β is estimable, λ = XT a for some a. λT β̂ is a linear estimator because it is of the form
λT β̂ = aT X(XT X)−XT y = aT PC(X)y = cT y,
where c = PC(X)a. We have already seen that λT β̂ is unbiased. Consider any other linear estimator dT y of λT β. For dT y to be unbiased, the mean of dT y, which is E(dT y) = dT Xβ, must satisfy E(dT y) = λT β, for all β, or equivalently, it must satisfy dT Xβ = λT β, for all β, which implies
dT X = λT .
The covariance between λT β̂ and dT y is
cov(λT β̂, dT y) = cov(cT y, dT y) = σ2 cT d = σ2 λT (XT X)−XT d = σ2 λT (XT X)−λ.
Now
0 ≤ var(λT β̂ − dT y) = var(λT β̂) + var(dT y) − 2 cov(λT β̂, dT y)
= σ2 λT (XT X)−λ + var(dT y) − 2σ2 λT (XT X)−λ
= var(dT y) − σ2 λT (XT X)−λ = var(dT y) − var(λT β̂).
Therefore,
var(dT y) ≥ var(λT β̂),
with equality holding iff dT y = λT β̂. I.e., an arbitrary linear unbiased estimator dT y has variance ≥ that of the least squares estimator, with equality iff the arbitrary estimator is the least squares estimator.
173
ML Estimation:
In the not-necessarily-full-rank model with normal errors:
y = Xβ + e, e ∼ Nn(0, σ2In), (∗)
where X is n × p with rank k ≤ p ≤ n, the ML estimators of β, σ2 change as expected from their values in the full rank case. That is, we replace inverses with generalized inverses in the formulas for the MLEs β̂ and σ̂2, and the MLE of β coincides with the OLS estimator, which is BLUE.
Theorem: In model (*), MLEs of β and σ2 are given by
β̂ = (XT X)−XT y,
σ̂2 = (1/n)(y − Xβ̂)T (y − Xβ̂) = (1/n)‖y − Xβ̂‖2.
Proof: As in the full rank case, the loglikelihood is
ℓ(β, σ2; y) = −(n/2) log(2π) − (n/2) log(σ2) − (1/(2σ2))‖y − Xβ‖2
(cf. p. 110). By inspection, it is clear that the maximizer of ℓ(β, σ2; y) with respect to β is the same as the minimizer of ‖y − Xβ‖2, which is the least-squares criterion. Differentiating the LS criterion w.r.t. β gives the normal equations, which we know have solution β̂ = (XT X)−XT y. Plugging β̂ back into ℓ(β, σ2; y) gives the profile loglikelihood for σ2, which we then maximize w.r.t. σ2. These steps follow exactly as in the full rank case, leading to σ̂2 = (1/n)‖y − Xβ̂‖2.
• Note that β̂ is not the (unique) MLE, but is an MLE of β corresponding to one particular choice of generalized inverse (XT X)−.
• σ̂2 is the unique MLE of σ2, though, because σ̂2 is a function of Xβ̂ = X(XT X)−XT y = p(y|C(X)), which is invariant to the choice of (XT X)−.
174
s2 = MSE is an unbiased estimator of σ2:
As in the full rank case, the MLE σ̂2 is biased as an estimator of σ2, and is therefore not the preferred estimator. The bias of σ̂2 can be seen as follows:
E(σ̂2) = (1/n) E{(y − Xβ̂)T (y − Xβ̂)}
= (1/n) E{yT (I − X(XT X)−XT )T (I − X(XT X)−XT )y}   (where I − X(XT X)−XT = PC(X)⊥ )
= (1/n) E{yT PC(X)⊥ y}
= (1/n){σ2 dim[C(X)⊥] + (Xβ)T PC(X)⊥ (Xβ)}   (Xβ ∈ C(X), so the second term is 0)
= (1/n) σ2 (n − dim[C(X)]) + 0 = (1/n) σ2 (n − rank(X)) = σ2 (n − k)/n.
Therefore, an unbiased estimator of σ2 is
s2 = (n/(n − k)) σ̂2 = (1/(n − k)) (y − Xβ̂)T (y − Xβ̂) = SSE/dfE = MSE.
Theorem: In the model y = Xβ + e, E(e) = 0, var(e) = σ2I, and where X is n × p of rank k ≤ p ≤ n, we have the following properties of s2:
(i) (unbiasedness) E(s2) = σ2.
(ii) (uniqueness) s2 is invariant to the choice of β̂ (i.e., to the choice of generalized inverse (XT X)−).
Proof: (i) follows from the construction of s2 as nσ̂2/(n − k) and the bias of σ̂2. (ii) follows from the uniqueness (invariance) of σ̂2.
175
Distributions of β̂ and s2:
In the normal-errors, not-necessarily full rank model (*), the distributions of β̂ and s2 can be obtained. These distributional results are essentially the same as in the full rank case, except for the mean and variance of β̂:
Theorem: In model (*),
(i) For any given choice of (XT X)−,
β̂ ∼ Np[(XT X)−XT Xβ, σ2(XT X)−XT X{(XT X)−}T ],
(ii) (n − k)s2/σ2 ∼ χ2(n − k), and
(iii) For any given choice of (XT X)−, β̂ and s2 are independent.
Proof: Homework. The proof is essentially the same as in the full rank case. Adapt the proof on p. 115.
• In the full rank case we saw that with normal errors and spherical var-cov structure, β̂ and s2 were minimum variance unbiased estimators. This result continues to hold in the not-full-rank case.
176
Reparameterization:
The idea in reparameterization is to transform from the vector of non-estimable parameters β in the model y = Xβ + e, where X is n × p with rank k < p ≤ n, to a vector of linearly independent estimable functions of β:
(uT1 β, uT2 β, . . . , uTk β)T = Uβ ≡ γ.
Here U is the k × p matrix with rows uT1 , . . . , uTk , so that the elements of γ = Uβ are a “full set” of linearly independent estimable functions of β.
The new full-rank model is
y = Zγ + e, (∗)
where Z is n × k of full rank, and Zγ = Xβ (the mean under the non-full rank model is the same as under the full rank model; we've just changed the parameterization, i.e., switched from β to γ).
To find the new (full rank) model matrix Z, note that Zγ = Xβ and γ = Uβ for all β imply
ZUβ = Xβ, for all β, ⇒ ZU = X ⇒ ZUUT = XUT ⇒ Z = XUT (UUT )−1.
• Note that U is of full rank, so (UUT )−1 exists.
• Note also that we have constructed Z to be of full rank:
rank(Z) ≥ rank(ZU) = rank(X) = k,
but rank(Z) ≤ k, because Z is n × k. Therefore, rank(Z) = k.
177
Thus the reparameterized model (*) is a full rank model, and we can obtain the BLUE of E(y) as
µ̂ = PC(Z)y = Z(ZT Z)−1ZT y,
and the BLUE of γ as
γ̂ = (ZT Z)−1ZT y.
If we are interested in any other estimable function of the original parameter β than those given by γ = Uβ, such quantities are easily estimated from γ̂. Any estimable function λT β must satisfy λT = aT X for some a. So
λT β = aT Xβ = aT Zγ = bT γ
for b = ZT a. Therefore, any estimable function λT β can be written as bT γ for some b, and the BLUE of λT β is given by bT γ̂.
• Note that the choice of a “full set” of linearly independent estimable functions Uβ = γ is not unique. We could choose another set of LIN estimable functions Vβ = δ, and then reparameterize to a different full rank linear model y = Wδ + e where Wδ = Zγ = Xβ. Any reparameterization leads to the same estimator of λT β.
178
Example: The Two-way ANOVA Model
In a two-way layout, observations are taken at all combinations of the levels of two treatment factors. Suppose factor A has a levels and factor B has b levels; then in a balanced two-way layout n observations are obtained in each of the ab treatments (combinations of A and B).
Let yijk = the kth observation at the ith level of A combined with the jth level of B.
One way to think about the analysis of a two-way layout is that if we ignore factors A and B, then what we have is really just a one-way experiment with ab treatments. Therefore, a one-way layout-type model, with a mean for each treatment, can be used. This leads to the cell-means model for the two-way layout, which is a full-rank model:
yijk = µij + eijk, i = 1, . . . , a, j = 1, . . . , b, k = 1, . . . , n.
Often, an effects model is used instead for the two-way layout. In the effects model, the (i, j)th treatment mean is decomposed into a constant term plus additive effects for factor A, factor B, and factor A combined with factor B:
µij = µ + αi + βj + (αβ)ij
This leads to the effects model for the two-way layout:
yijk = µ + αi + βj + (αβ)ij + eijk, i = 1, . . . , a, j = 1, . . . , b, k = 1, . . . , n.
179
Suppose a = 2, b = 3 and n = 2. Then the effects model is
y = Xβ + e,
where y = (y111, y112, y121, y122, y131, y132, y211, y212, y221, y222, y231, y232)T ,
β = (µ, α1, α2, β1, β2, β3, (αβ)11, (αβ)12, (αβ)13, (αβ)21, (αβ)22, (αβ)23)T ,
and
X =
1 1 0 1 0 0 1 0 0 0 0 0
1 1 0 1 0 0 1 0 0 0 0 0
1 1 0 0 1 0 0 1 0 0 0 0
1 1 0 0 1 0 0 1 0 0 0 0
1 1 0 0 0 1 0 0 1 0 0 0
1 1 0 0 0 1 0 0 1 0 0 0
1 0 1 1 0 0 0 0 0 1 0 0
1 0 1 1 0 0 0 0 0 1 0 0
1 0 1 0 1 0 0 0 0 0 1 0
1 0 1 0 1 0 0 0 0 0 1 0
1 0 1 0 0 1 0 0 0 0 0 1
1 0 1 0 0 1 0 0 0 0 0 1
The columns of X are, in order, j12 (a column of ones), the indicators for the two levels of A, the indicators for the three levels of B, and the six interaction columns I6 ⊗ (1, 1)T .
180
Obviously, the effects model is overparameterized and X is not of full rank. In fact, rank(X) = ab = 6.
One way to reparameterize the effects model is to choose γ to be the vector of treatment means. That is, take
γ = (γ1, γ2, γ3, γ4, γ5, γ6)T
= (µ + α1 + β1 + (αβ)11, µ + α1 + β2 + (αβ)12, µ + α1 + β3 + (αβ)13, µ + α2 + β1 + (αβ)21, µ + α2 + β2 + (αβ)22, µ + α2 + β3 + (αβ)23)T
= Uβ,
where U is the 6 × 12 matrix
U =
1 1 0 1 0 0 1 0 0 0 0 0
1 1 0 0 1 0 0 1 0 0 0 0
1 1 0 0 0 1 0 0 1 0 0 0
1 0 1 1 0 0 0 0 0 1 0 0
1 0 1 0 1 0 0 0 0 0 1 0
1 0 1 0 0 1 0 0 0 0 0 1
To reparameterize in terms of γ, we can use
Z = I6 ⊗ (1, 1)T =
1 0 0 0 0 0
1 0 0 0 0 0
0 1 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 1 0
0 0 0 0 0 1
0 0 0 0 0 1
181
I leave it to you to verify that Zγ = Xβ, and that ZU = X.
• Note that this choice of γ and Z amounts to a reparameterization from the effects model y = Xβ + e to the cell-means model y = Zγ + e. That is, γ = (γ1, . . . , γ6)T is just a relabelling of (µ11, µ12, µ13, µ21, µ22, µ23)T .
• Any estimable function λT β can be obtained as bT γ for some b. For example, the main effect of A corresponds to
λT β = {α1 + (1/3)[(αβ)11 + (αβ)12 + (αβ)13]} − {α2 + (1/3)[(αβ)21 + (αβ)22 + (αβ)23]}
= {α1 + (αβ)1·} − {α2 + (αβ)2·},
where (αβ)i· denotes the average of the (αβ)ij over j (that is, λ = (0, 1, −1, 0, 0, 0, 1/3, 1/3, 1/3, −1/3, −1/3, −1/3)T ), which can be written as bT γ for b = (1/3, 1/3, 1/3, −1/3, −1/3, −1/3)T .
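A numerical sketch of this equivalence, using the a = 2, b = 3, n = 2 design above: the A main effect computed as λT β̂ in the (non-full-rank) effects model equals bT γ̂ in the cell-means reparameterization.

```python
import numpy as np

a, b, n = 2, 3, 2
jn = np.ones((n, 1))
# Cell-means design Z = I6 kron (1,1)', and effects design X = (j, A-cols, B-cols, Z).
Z = np.kron(np.eye(a * b), jn)                               # 12 x 6
Acols = np.kron(np.kron(np.eye(a), np.ones((b, 1))), jn)     # 12 x 2
Bcols = np.kron(np.kron(np.ones((a, 1)), np.eye(b)), jn)     # 12 x 3
X = np.hstack([np.ones((12, 1)), Acols, Bcols, Z])           # 12 x 12, rank 6

rng = np.random.default_rng(3)
y = rng.normal(20, 2, 12)

beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y                 # one solution of the normal eqns
gam_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)                  # cell means (full-rank LS)

lam = np.array([0, 1, -1, 0, 0, 0, 1/3, 1/3, 1/3, -1/3, -1/3, -1/3])
bvec = np.array([1/3, 1/3, 1/3, -1/3, -1/3, -1/3])

print(lam @ beta_hat, bvec @ gam_hat)                        # the same A main effect estimate
```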
Side Conditions:
Another approach for removing the rank deficiency of X in the non-full rank case is to impose linear constraints on the parameters, called side conditions. We have already seen one example (pp. 162–163): the one-way effects model with effects that sum to zero,
yij = µ + αi + eij , ∑i αi = 0,
for i = 1, . . . , a, j = 1, . . . , n.
Consider the case a = 3 and n = 2. Then the model can be written as
(y11, y12, y21, y22, y31, y32)T = X(µ, α1, α2, α3)T + e, where α1 + α2 + α3 = 0 and
X =
1 1 0 0
1 1 0 0
1 0 1 0
1 0 1 0
1 0 0 1
1 0 0 1
182
Imposing the constraint α3 = −(α1 + α2) on the model equation, we can rewrite it as
y = X̃(µ, α1, α2)T + e, where X̃ =
1 1 0
1 1 0
1 0 1
1 0 1
1 −1 −1
1 −1 −1
and now we are back in the full rank case with the same model space, since C(X̃) = C(X).
Another Example - The Two-way Layout Model w/o Interaction:
We have seen that there are two equivalent models for the two-way layout with interaction: the cell-means model,
yijk = µij + eijk , i = 1, . . . , a, j = 1, . . . , b, k = 1, . . . , nij , (∗)
and the effects model
yijk = µ + αi + βj + γij + eijk. (∗∗)
• Model (**) is overparameterized and has a non-full rank model matrix. Model (*) is just-parameterized and has a full rank model matrix. However, they both have the same model space, and are therefore equivalent.
• We've now developed theory that allows us to use model (**) “as is” by restricting attention to estimable functions, using generalized inverses, etc.
• We've also seen that we can reparameterize from model (**) to model (*) by identifying the µij 's, where µij = µ + αi + βj + γij , as a full set of LIN estimable functions of the parameters in (**).
183
Another way to get around the overparameterization of (**) is to impose side conditions. Side conditions are not unique; there are lots of valid choices. But in this model, the set of side conditions that is most commonly used is
∑i αi = 0, ∑j βj = 0, ∑i γij = 0 for each j, and ∑j γij = 0 for each i.
As in the one-way effects model example, substituting these constraints into the model equation leads to an equivalent full rank model.
• These side conditions, and those considered in the one-way model, are often called the “sum-to-zero constraints”, or the “usual constraints”, or sometimes “the anova constraints”.
• The sum-to-zero constraints remove the rank deficiency in the model matrix, but they also give the parameters nice interpretations. E.g., in the one-way layout model, the constraint ∑i αi = 0 forces µ to fall in the middle of all of the µ + αi 's, rather than being some arbitrary constant. Thus, µ is the overall mean response averaged over all of the treatment groups, and the αi 's are deviations from this “grand mean” associated with the ith treatment.
In both the one-way model and the two-way model with interaction, there's an obvious alternative (the cell-means model) to reparameterize to, so perhaps these aren't the best examples to motivate the use of side conditions.
A better example is the two-way model with no interaction. That is, suppose we want to assume that there is no interaction between the two treatment factors; i.e., that the difference between any two levels of factor A is the same across levels of factor B.
184
How could we formulate such a model?
The easiest way is just to set the interaction effects γij in the effects model (**) to 0, yielding the (still overparameterized) model
yijk = µ + αi + βj + eijk . (∗∗∗)
• In contrast, it is not so easy to see how a no-interaction, full-rank version of the cell-means model can be formed. And, therefore, reparameterization from (***) is not an easy option, since it's not so obvious what the model is that we would like to reparameterize to.
Side conditions are a much easier option to remove the overparameterization in (***). Again, the “sum-to-zero” constraints are convenient because they remove the rank deficiency and give the parameters nice interpretations.
Under the sum-to-zero constraints, model (***) becomes
yijk = µ + αi + βj + eijk , ∑i αi = 0, ∑j βj = 0. (†)
In model (†) we can substitute
αa = −(α1 + · · · + αa−1) and βb = −(β1 + · · · + βb−1)
into the model equation
yijk = µ + αi + βj + eijk , i = 1, . . . , a, j = 1, . . . , b, k = 1, . . . , nij .
185
For example, consider the following data from an unbalanced two-way layout in which rats were fed diets that differed in factor A, protein level (high and low), and factor B, food type (beef, cereal, pork). The response is weight gain.

           High Protein             Low Protein
        Beef  Cereal  Pork      Beef  Cereal  Pork
         73     98     94        90    107     49
        102     74     79        76     95     82
                56                90
Letting αi represent the effect of the ith level of protein and βj be the effect of the jth food type, the model matrix for model (***) (the unconstrained version of model (†)) based on these data is
X =
1 1 0 1 0 0
1 1 0 1 0 0
1 1 0 0 1 0
1 1 0 0 1 0
1 1 0 0 1 0
1 1 0 0 0 1
1 1 0 0 0 1
1 0 1 1 0 0
1 0 1 1 0 0
1 0 1 1 0 0
1 0 1 0 1 0
1 0 1 0 1 0
1 0 1 0 0 1
1 0 1 0 0 1
186
If we add the constraints so that we use the full-rank model (†) instead of (***), we can substitute
α2 = −α1 and β3 = −β1 − β2,
so that the model mean becomes
X(µ, α1, −α1, β1, β2, −β1 − β2)T = XC(µ, α1, β1, β2)T ,
where C is the 6 × 4 substitution matrix
C =
1 0 0 0
0 1 0 0
0 −1 0 0
0 0 1 0
0 0 0 1
0 0 −1 −1
Therefore, the constrained model (†) is equivalent to a model with unconstrained parameter vector (µ, α1, β1, β2)T and full rank model matrix
X̃ = XC =
1 1 1 0
1 1 1 0
1 1 0 1
1 1 0 1
1 1 0 1
1 1 −1 −1
1 1 −1 −1
1 −1 1 0
1 −1 1 0
1 −1 1 0
1 −1 0 1
1 −1 0 1
1 −1 −1 −1
1 −1 −1 −1
• Another valid set of side conditions in this model is
α2 = 0, β3 = 0.
I leave it to you to derive the reduced model matrix (the model matrix under the side conditions) for these conditions, and to convince yourself that these side conditions, like the sum-to-zero constraints, leave the model space unchanged.
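Checking this is mechanical: α2 = 0 and β3 = 0 simply delete the α2 and β3 columns of X, and the resulting 14 × 4 matrix spans the same column space as the sum-to-zero version. A sketch (X is the rat-diet design above; the only assumption is the per-cell row counts read off that matrix):

```python
import numpy as np

# Unbalanced two-way additive design: (protein, food-type) cell sizes 2,3,2 / 3,2,2.
cells = [(0, 0, 2), (0, 1, 3), (0, 2, 2), (1, 0, 3), (1, 1, 2), (1, 2, 2)]
rows = []
for i, j, nij in cells:
    for _ in range(nij):
        r = np.zeros(6); r[0] = 1; r[1 + i] = 1; r[3 + j] = 1
        rows.append(r)
X = np.array(rows)                                   # 14 x 6, rank 4

# Side conditions alpha2 = 0, beta3 = 0: drop those columns.
X_drop = np.delete(X, [2, 5], axis=1)                # 14 x 4

# Sum-to-zero version: X times the substitution matrix.
C = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, -1, 0, 0],
              [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, -1, -1]], float)
X_sum = X @ C                                        # 14 x 4

for M in (X_drop, X_sum):
    assert np.linalg.matrix_rank(M) == 4             # both are full rank
# Same column space: stacking side by side does not raise the rank.
assert np.linalg.matrix_rank(np.hstack([X_drop, X_sum])) == 4
print("both reduced matrices span C(X):",
      np.linalg.matrix_rank(np.hstack([X, X_sum])) == 4)
```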
187
• In the previous example, note that the two sets of side conditions were Tβ = 0 with
T =
0 1 1 0 0 0
0 0 0 1 1 1
(so that Tβ = (α1 + α2, β1 + β2 + β3)T = (0, 0)T ), and Tβ = 0 with
T =
0 0 1 0 0 0
0 0 0 0 0 1
(so that Tβ = (α2, β3)T = (0, 0)T ).
In each case, the side condition was of the form Tβ = 0 where Tβ was a vector of non-estimable functions of β.
This result is general. Side conditions must be restrictions on non-estimable functions of β. If constraints are placed on estimable functions of β, then this actually changes the model space, which is not our goal.
• Note also that in the example the rank deficiency of X was 2 (X had 6 columns but rank equal to 4). Therefore, two side conditions were necessary to remove this rank deficiency (the number of elements of Tβ was 2).
188
In general, for the linear model
y = Xβ + e, E(e) = 0, var(e) = σ2In ,
where X is n × p with rank(X) = k < p ≤ n, we define side conditions to be a set of constraints of the form Tβ = 0, where T has rank q with q = p − k (q = the rank deficiency), and, writing (X; T) for X stacked on top of T,
i. rank(X; T) = p, and
ii. rank(X; T) = rank(T) + rank(X).
• Note that (i) and (ii) imply Tβ is nonestimable.
Theorem: In the spherical errors linear model y = Xβ + e, where X is n × p of rank k < p, the unique least-squares estimator of β under the side condition Tβ = 0 is given by
β̂ = (XT X + TT T)−1XT y.
Proof: As in problem 8.19 from your homework, we can use the method of Lagrange multipliers. Introducing a Lagrange multiplier λ, the constrained least squares estimator of β minimizes
u = (y − Xβ)T (y − Xβ) + λT (Tβ − 0).
Differentiating u with respect to β and λ leads to the equations
XT Xβ + (1/2)TT λ = XT y
Tβ = 0,
which can be written as the single equation
( XT X  TT )( β   )   ( XT y )
(  T     0 )( ½λ  ) = (   0  ).
189
Under the conditions for Tβ = 0 to be a side condition, it can be shown that the matrix
( XT X  TT )
(  T     0 )
is nonsingular with inverse given by
( XT X  TT )−1   ( H−1XT XH−1  H−1TT )
(  T     0 )   = ( TH−1         0    ),
where H = XT X + TT T. (See Wang and Chow, Advanced Linear Models, §5.2, for details.)
Therefore, the constrained least-squares equations have a unique solution given by
( β̂  )   ( H−1XT XH−1  H−1TT )( XT y )   ( H−1XT XH−1XT y )   ( H−1XT y )
( ½λ̂ ) = ( TH−1         0    )(   0  ) = ( TH−1XT y       ) = (    0     ).
Here, the last equality follows because
i. XT XH−1XT = XT , and
ii. TH−1XT = 0.
190
To show (i) and (ii), let X1 be a k × p matrix containing the linearly independent rows of X, and let L = (X1 ; T) (X1 stacked on top of T). Then L is a p × p nonsingular matrix.
There exists an n × k matrix C such that X = CX1 = (C, 0)L. In addition, we can write T = 0X1 + T = (0, Iq)L.
Therefore,
XT X = LT ( CT C  0 ; 0  0 ) L, and TT T = LT ( 0  0 ; 0  Iq ) L,
where ( A ; B ) here denotes a 2 × 2 block matrix written row by row. Note that CT C is nonsingular (this follows from result 3 on rank, p. 15), so direct calculation gives
XT X(XT X + TT T)−1XT = LT ( CT C  0 ; 0  0 ) L { LT ( CT C  0 ; 0  Iq ) L }−1 XT
= LT ( CT C  0 ; 0  0 ) L { L−1 ( (CT C)−1  0 ; 0  Iq ) (LT )−1 } XT
= LT ( Ik  0 ; 0  0 ) (LT )−1 XT = LT ( Ik  0 ; 0  0 ) (LT )−1 LT ( CT over 0 )
= LT ( Ik  0 ; 0  0 ) ( CT over 0 ) = LT ( CT over 0 ) = XT ,
establishing (i), and
T(XT X + TT T)−1XT = (0, Iq) L { L−1 ( (CT C)−1  0 ; 0  Iq ) (LT )−1 } LT ( CT over 0 ) = 0,
which establishes (ii). (Here ( CT over 0 ) is the p × n matrix with CT stacked on top of a q × n block of zeros, so that XT = LT ( CT over 0 ).)
191
Example - Weight Gain in Rats (Continued):
Returning to the data of p. 186, we now have two equivalent methods of obtaining the unique least-squares parameter estimates of the constrained model:
yijk = µ + αi + βj + eijk , ∑i αi = 0, ∑j βj = 0. (†)
First, we can solve the constraints to yield
α2 = −α1 and β3 = −β1 − β2.
Substituting into the model equation gives the full rank model matrix X̃ given on p. 187. Thus the least-squares estimate for the unconstrained parameter vector δ = (µ, α1, β1, β2)T based on model (†) is given by
δ̂ = (X̃T X̃)−1X̃T y = (82.733, −0.941, 3.278, 3.455)T .
Alternatively, we can use the method of Lagrange multipliers to obtain the least-squares estimate of the original parameter vector β = (µ, α1, α2, β1, β2, β3)T subject to the constraints. This estimator is given by
β̂ = (XT X + TT T)−1XT y, (‡)
where T is the sum-to-zero constraint matrix
T =
0 1 1 0 0 0
0 0 0 1 1 1
Substituting the original model matrix X from p. 186 and T into (‡), we obtain
β̂ = (82.733, −0.941, 0.941, 3.278, 3.455, −6.733)T .
192
• Note that the solution $\hat{\beta}$ is one (of infinitely many) valid solutions to the unconstrained model (***). It corresponds to one possible choice of generalized inverse for $X^T X$.

• In particular, it corresponds to choosing $(X^T X + T^T T)^{-1}$ as the generalized inverse of $X^T X$. That $(X^T X + T^T T)^{-1}$ is a generalized inverse of $X^T X$ follows from result (i) on p. 190. By (i),

$$\underbrace{X^T X (X^T X + T^T T)^{-1} X^T}_{= X^T} X = X^T X,$$

which is the defining property of a generalized inverse.
• Whichever approach we take, we obtain the same estimate of $\mu = X\beta$ and of any estimable function of $\beta$ (for example, $\alpha_1 - \alpha_2$, the difference in treatment effects for high and low protein) in our original overparameterized model. See ratsexamp.txt.
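The agreement across choices of solution can be sketched numerically. The example below uses a hypothetical 2×3 two-way layout with made-up response values (not the rat data) and checks that the Lagrange-multiplier estimator satisfies the side conditions and agrees with another solution of the normal equations on an estimable function:

```python
import numpy as np

# Hypothetical 2x3 two-way layout, one observation per cell (made-up
# responses, not the rat data). Columns of X: (1, a1, a2, b1, b2, b3).
X = np.array([[1., 1, 0, 1, 0, 0],
              [1., 1, 0, 0, 1, 0],
              [1., 1, 0, 0, 0, 1],
              [1., 0, 1, 1, 0, 0],
              [1., 0, 1, 0, 1, 0],
              [1., 0, 1, 0, 0, 1]])
T = np.array([[0., 1, 1, 0, 0, 0],   # sum of alpha's = 0
              [0., 0, 0, 1, 1, 1]])  # sum of beta's  = 0
y = np.array([82., 84, 80, 81, 85, 79])

# Lagrange-multiplier estimator: beta-hat = (X'X + T'T)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X + T.T @ T, X.T @ y)

# It satisfies the side conditions ...
print(beta_hat[1] + beta_hat[2], beta_hat[3:].sum())    # both ~ 0
# ... and agrees with any other solution of the normal equations on
# estimable functions, e.g. alpha_1 - alpha_2:
beta_mp = np.linalg.pinv(X) @ y          # Moore-Penrose solution
c = np.array([0., 1, -1, 0, 0, 0])       # alpha_1 - alpha_2 (estimable)
print(np.allclose(c @ beta_hat, c @ beta_mp))           # True
```

Nonestimable quantities (e.g. $\mu$ alone) would generally differ between the two solutions; only estimable functions are invariant.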
Hypothesis Testing:
For inference, we need distributional assumptions, so throughout this section we assume the non-full-rank linear model with normal errors:

$$y = X\beta + e, \qquad e \sim N_n(0, \sigma^2 I_n), \qquad (\spadesuit)$$

where $X$ is $n \times p$ with rank $k < p \le n$.
193
Testable hypotheses:
Suppose we are interested in testing a hypothesis of the form
H0 : Cβ = 0.
• Not all such hypotheses are testable. E.g., if $C\beta$ is nonestimable (i.e., is a vector with at least one nonestimable element), then $H_0$ cannot be tested.

• This should come as no surprise. E.g., in the one-way layout model $y_{ij} = \mu + \alpha_i + e_{ij}$ without any constraints, we cannot test $\mu = \mu_0$ because $\mu$ is not identifiable. $\mu$ could be any one of an infinite number of values without changing the model (as long as we change the $\alpha_i$'s accordingly), so how could we test whether it equals a given null value?
A hypothesis is said to be testable when we can calculate an $F$-statistic that is suitable for testing it. There are three conditions that $C$ must satisfy for $H_0$ to be testable:

1. $C\beta$ must be estimable (must have estimable elements).
⇒ $C^T$ must lie in the row space of $X$.
⇒ there must exist an $A$ so that $C^T = X^T A$ or, equivalently, $C = A^T X$ for some $A$.
(It would be meaningless to test hypotheses about nonestimable functions anyway.)
194
2. $C$ must have full row rank. I.e., for $C$ $m \times p$, we require $\mathrm{rank}(C) = m$.
⇒ this ensures that the hypothesis contains no redundant statements.

– E.g., suppose $\beta$ is $3 \times 1$ and we wish to test the hypothesis that $\beta_1 = \beta_2 = \beta_3$. We can express this hypothesis in the form $H_0: C\beta = 0$ for (infinitely) many choices of $C$. A valid choice of $C$ is

$$C = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}.$$

Notice that this $2 \times 3$ matrix has row rank 2.

An invalid choice of $C$ is

$$C = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \\ 1 & 0 & -1 \end{pmatrix}.$$

Notice that the last row is redundant (given that $\beta_1 = \beta_2$ and $\beta_2 = \beta_3$, it is redundant to require that $\beta_1 = \beta_3$), so $\mathrm{rank}(C) = 2$, not 3.

3. $C$ must have no more than $\mathrm{rank}(X) = k$ rows. This is because, in general, one can only construct up to $\mathrm{rank}(X)$ linearly independent estimable functions.
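The three conditions above translate directly into rank computations. The sketch below wraps them in a small helper (the function name and the example design are illustrative, not from the notes):

```python
import numpy as np

# A sketch of checking the three testability conditions for H0: C beta = 0.
# The helper name and example design are illustrative, not from the notes.
def is_testable(C, X):
    m = C.shape[0]
    k = np.linalg.matrix_rank(X)
    # 1. C beta estimable: every row of C lies in the row space of X,
    #    i.e. stacking C under X does not increase the rank.
    estimable = np.linalg.matrix_rank(np.vstack([X, C])) == k
    # 2. no redundant statements: C has full row rank.
    full_row_rank = np.linalg.matrix_rank(C) == m
    # 3. no more than rank(X) rows.
    return bool(estimable and full_row_rank and m <= k)

# One-way layout with two groups: columns (intercept, group 1, group 2).
X = np.array([[1., 1, 0],
              [1., 1, 0],
              [1., 0, 1],
              [1., 0, 1]])
print(is_testable(np.array([[0., 1, -1]]), X))  # alpha_1 = alpha_2: True
print(is_testable(np.array([[1., 0, 0]]), X))   # mu = 0: nonestimable, False
```

The second call fails condition 1: $(1, 0, 0)$ is not in the row space of $X$, matching the earlier remark that $\mu$ alone is not identifiable.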
Subject to these conditions, $H_0$ is testable. As in the full rank case, there are two equivalent approaches to testing $H_0$:

1. Formulate a test statistic based on a comparison between $C\hat{\beta}$ and its null value $0$.

2. Recast the null hypothesis $H_0: C\beta = 0$ in an equivalent form $H_0: \mu \in V_0$ for $\mu = E(y) = X\beta$ and an appropriate subspace $V_0 \subset V = C(X)$. That is, translate testing $H_0$ into a full vs. reduced model testing problem.

• In the full rank case we described both of these approaches. Because of time constraints, we'll only describe approach 1 in the non-full-rank case, but approach 2 follows in much the same way as before.
195
The General Linear Hypothesis:
As in the full rank case, our $F$ test is based on the quadratic form

$$\{C\hat{\beta} - E_0(C\hat{\beta})\}^T \{\mathrm{var}_0(C\hat{\beta})\}^{-1} \{C\hat{\beta} - E_0(C\hat{\beta})\} = (C\hat{\beta})^T \{\mathrm{var}_0(C\hat{\beta})\}^{-1} (C\hat{\beta}).$$

(Here the 0 subscript indicates that the expected value and variance are computed under $H_0: C\beta = 0$.)
The theorem leading to the F statistic is as follows:
Theorem: In model $(\spadesuit)$, if $C$ is $m \times p$ of rank $m \le k = \mathrm{rank}(X)$ such that $C\beta$ is a set of $m$ LIN estimable functions, and if $\hat{\beta} = G X^T y$ for some generalized inverse $G$ of $X^T X$, then

(i) $CGC^T$ is nonsingular and invariant to the choice of $G$;

(ii) $C\hat{\beta} \sim N_m(C\beta, \sigma^2 CGC^T)$;

(iii) $SSH/\sigma^2 = (C\hat{\beta})^T (CGC^T)^{-1} C\hat{\beta}/\sigma^2 \sim \chi^2(m, \lambda)$, where

$$\lambda = (C\beta)^T (CGC^T)^{-1} C\beta/(2\sigma^2);$$

(iv) $SSE/\sigma^2 = y^T(I - XGX^T)y/\sigma^2 \sim \chi^2(n-k)$; and

(v) $SSH$ and $SSE$ are independent.
Proof:
(i) Since $C\beta$ is a vector of estimable functions, there must exist an $A$ so that $C = A^T X$. Therefore,

$$CGC^T = A^T X G X^T A = A^T P_{C(X)} A,$$

which is unique. To show $CGC^T$ is nonsingular we show that it is of full rank.

196

In general, we have the following two results about rank: (i) for any matrix $M$, $\mathrm{rank}(M^T M) = \mathrm{rank}(M^T)$; and (ii) for any matrices $M$ and $N$, $\mathrm{rank}(MN) \le \mathrm{rank}(M)$.

In addition, $G$ can be chosen to be a symmetric generalized inverse of $X^T X$ (this is always possible), so $G$ can be written $G = L^T L$ for some $L$.

Therefore,

$$\mathrm{rank}(CGC^T) = \mathrm{rank}(CL^T L C^T) = \mathrm{rank}(CL^T) \ge \mathrm{rank}(CL^T L) = \mathrm{rank}(CG) = \mathrm{rank}(A^T X G) \ge \mathrm{rank}(A^T X G X^T X) = \mathrm{rank}(A^T X) = \mathrm{rank}(C) = m.$$

So we've established that $\mathrm{rank}(CGC^T) \ge m$. But, since $CGC^T$ is $m \times m$, it follows that $\mathrm{rank}(CGC^T) \le m$. Together, these results imply

$$\mathrm{rank}(CGC^T) = m.$$
(ii) By the theorem on p. 176,

$$\hat{\beta} \sim N_p[G X^T X \beta, \sigma^2 G X^T X G].$$

Since $C\hat{\beta}$ is a linear transformation of a normal random vector, and $C = A^T X$ for some $A$,

$$C\hat{\beta} \sim N_m[C G X^T X \beta, \sigma^2 C G X^T X G C^T] = N_m[A^T \underbrace{X G X^T X}_{=X} \beta, \sigma^2 A^T \underbrace{X G X^T X}_{=X} G X^T A] = N_m(C\beta, \sigma^2 C G C^T).$$
197
(iii) By part (ii), $\mathrm{var}(C\hat{\beta}) = \sigma^2 CGC^T$. Therefore, $SSH/\sigma^2$ is a quadratic form in a normal random vector. Since

$$\{(CGC^T)^{-1}/\sigma^2\}\{\sigma^2 CGC^T\} = I,$$

which is idempotent, the result follows from the theorem on the bottom of p. 81.
(iv) This was established in part (ii) of the theorem on p. 176.
(v) Homework.
Putting the results of this theorem together, we obtain the $F$ statistic for testing $H_0: C\beta = 0$ for $H_0$ a testable hypothesis:

Theorem: In the setup of the previous theorem, the $F$ statistic for testing $H_0: C\beta = 0$ is as follows:

$$F = \frac{SSH/m}{SSE/(n-k)} = \frac{(C\hat{\beta})^T [CGC^T]^{-1} C\hat{\beta}/m}{SSE/(n-k)} \sim F(m, n-k, \lambda),$$

where $G$ is a generalized inverse of $X^T X$ and

$$\lambda = \begin{cases} \frac{1}{2\sigma^2}(C\beta)^T (CGC^T)^{-1} (C\beta), & \text{in general;} \\ 0, & \text{under } H_0. \end{cases}$$
Proof: Follows directly from the previous theorem and the definition of the noncentral $F$ distribution.
• As in the full rank case, this $F$ statistic can be extended to test a hypothesis of the form $H_0: C\beta = t$ for $t$ a vector of constants, $C\beta$ estimable, and $C$ of full row rank. The resulting test statistic is identical to $F$ above, but with $C\hat{\beta}$ replaced by $C\hat{\beta} - t$.
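The pieces of the theorem assemble into a short computation. The sketch below evaluates the $F$ statistic for the testable hypothesis $\alpha_1 = \alpha_2$ in a non-full-rank one-way layout (the data here are illustrative, not from the notes), using the Moore-Penrose inverse as one choice of generalized inverse $G$:

```python
import numpy as np

# Sketch of the F statistic for a testable hypothesis in a non-full-rank
# model: one-way layout, H0: alpha_1 = alpha_2 (illustrative data).
X = np.array([[1., 1, 0],
              [1., 1, 0],
              [1., 0, 1],
              [1., 0, 1]])
y = np.array([10., 12, 15, 17])
C = np.array([[0., 1, -1]])                # C beta = alpha_1 - alpha_2
n, m = X.shape[0], C.shape[0]
k = np.linalg.matrix_rank(X)               # rank(X) = 2 < 3 columns

G = np.linalg.pinv(X.T @ X)                # one choice of generalized inverse
beta_hat = G @ X.T @ y
Cb = C @ beta_hat                          # estimable: equals ybar1 - ybar2
SSH = Cb @ np.linalg.inv(C @ G @ C.T) @ Cb
SSE = y @ (np.eye(n) - X @ G @ X.T) @ y
F = (SSH / m) / (SSE / (n - k))
print(F)                                    # 12.5: group means 11 vs 16
```

Because $C\hat{\beta}$ and $CGC^T$ are invariant to the choice of $G$, any other generalized inverse of $X^T X$ would give the same value of $F$.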
198
Breaking up Sums of Squares:
Consider again the problem of testing a full versus reduced model. That is, suppose we are interested in testing the hypothesis $H_0: \beta_2 = 0$ in the model

$$y = X\beta + e = (X_1, X_2)\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + e = X_1\beta_1 + X_2\beta_2 + e, \qquad e \sim N(0, \sigma^2 I). \qquad (FM)$$

Under $H_0: \beta_2 = 0$ the model becomes

$$y = X_1\beta_1^* + e^*, \qquad e^* \sim N(0, \sigma^2 I). \qquad (RM)$$

The problem is to test

$$H_0: \mu \in C(X_1) \ (RM) \quad\text{versus}\quad H_1: \mu \notin C(X_1)$$

under the maintained hypothesis that $\mu \in C(X) = C([X_1, X_2])$ (FM).
When discussing the full rank CLM, we saw that the appropriate test statistic for this problem was

$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/h}{s^2} = \frac{y^T(P_{C(X)} - P_{C(X_1)})y/h}{y^T(I - P_{C(X)})y/(n-p)} \sim \begin{cases} F(h, n-p), & \text{under } H_0; \\ F(h, n-p, \lambda_1), & \text{under } H_1, \end{cases}$$

where

$$\lambda_1 = \frac{1}{2\sigma^2}\|(P_{C(X)} - P_{C(X_1)})\mu\|^2 = \frac{1}{2\sigma^2}\|\mu - \mu_0\|^2,$$

$h = \dim(\beta_2) = \mathrm{rank}(X_2) = \mathrm{rank}(X) - \mathrm{rank}(X_1)$, and $p = \dim(\beta) = \mathrm{rank}(X) =$ the number of columns in $X$ (which we called $k+1$ previously).
199
More generally, in the not-necessarily-full-rank CLM where $X$ is $n \times p$ with $\mathrm{rank}(X) = k \le p < n$, this result generalizes:

$$F = \frac{\|\hat{y} - \hat{y}_0\|^2/m}{s^2} = \frac{y^T(P_{C(X)} - P_{C(X_1)})y/m}{y^T(I - P_{C(X)})y/(n-k)} \sim \begin{cases} F(m, n-k), & \text{under } H_0; \\ F(m, n-k, \lambda_1), & \text{under } H_1, \end{cases}$$

where $\lambda_1$ is as before and $m = \mathrm{rank}(X) - \mathrm{rank}(X_1)$.
Recall that the squared projection length $y^T(P_{C(X)} - P_{C(X_1)})y$ in the numerator of this $F$ statistic is equal to

$$SS(\beta_2|\beta_1) \equiv SSR(FM) - SSR(RM) = SSE(RM) - SSE(FM),$$

so that $F = \dfrac{SS(\beta_2|\beta_1)/m}{MSE}$.
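The equality between the squared projection length and the difference in error sums of squares can be verified directly. The sketch below uses illustrative design matrices ($X_1$ an intercept, $X_2$ one added regressor) and random data:

```python
import numpy as np

# Numerical check that y'(P - P1)y = SSE(RM) - SSE(FM), using illustrative
# design matrices (X1 = intercept only, X2 = one added regressor).
def proj(M):
    """Orthogonal projection onto C(M); works even if M is not full rank."""
    return M @ np.linalg.pinv(M)

rng = np.random.default_rng(0)
X1 = np.ones((10, 1))
X2 = rng.normal(size=(10, 1))
X = np.hstack([X1, X2])
y = rng.normal(size=10)

P, P1 = proj(X), proj(X1)
SSE_FM = y @ (np.eye(10) - P) @ y          # SSE under the full model
SSE_RM = y @ (np.eye(10) - P1) @ y         # SSE under the reduced model
SS_extra = y @ (P - P1) @ y                # SS(beta_2 | beta_1)
print(np.allclose(SS_extra, SSE_RM - SSE_FM))   # True
```

Since $C(X_1) \subset C(X)$, the difference $P - P_1$ is itself a projection, so $SS(\beta_2|\beta_1)$ is always nonnegative.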
The quantity $SS(\beta_2|\beta_1)$ goes by several different names (the extra regression sum of squares, or the reduction in error sum of squares due to $\beta_2$ after fitting $\beta_1$) and several different notations: $R(\beta_2|\beta_1)$, $SSR(X_2|X_1)$, etc.

Regardless of the notation or terminology, $SS(\beta_2|\beta_1)$ is a sum of squares that quantifies the amount of variability in $y$ accounted for by adding $X_2\beta_2$ to the regression model that includes only $X_1\beta_1$.
That is, it quantifies the contribution to the regression model of the explanatory variables in $X_2$ above and beyond the explanatory variables in $X_1$.
200