Multivariate Statistics: A Vector Space Approach
Author: Morris L. Eaton
Source: Lecture Notes-Monograph Series, Vol. 53, Multivariate Statistics: A Vector Space Approach (2007), pp. i-viii, 1-463, 465-501, 503-512
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/20461449
Accessed: 14/06/2014 17:27
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PM. All use subject to JSTOR Terms and Conditions.
Institute of Mathematical Statistics
LECTURE NOTES-MONOGRAPH SERIES
Volume 53
Multivariate Statistics A Vector Space Approach
Morris L. Eaton
Institute of Mathematical Statistics, Beachwood, Ohio, USA
Institute of Mathematical Statistics Lecture Notes-Monograph Series
Series Editor: R. A. Vitale
The production of the Institute of Mathematical Statistics Lecture Notes-Monograph Series is managed by the
IMS Office: Jiayang Sun, Treasurer and Elyse Gustafson, Executive Director.
Library of Congress Control Number: 2006940290
International Standard Book Number 978-0-940600-69-0, 0-940600-69-2
International Standard Serial Number 0749-2170
Copyright © 2007 Institute of Mathematical Statistics
All rights reserved
Printed in Lithuania
Contents

Preface ......... v
Notation ......... viii

1. VECTOR SPACE THEORY ......... 1
1.1. Vector Spaces ......... 2
1.2. Linear Transformations ......... 6
1.3. Inner Product Spaces ......... 13
1.4. The Cauchy-Schwarz Inequality ......... 25
1.5. The Space L(V, W) ......... 29
1.6. Determinants and Eigenvalues ......... 38
1.7. The Spectral Theorem ......... 49
Problems ......... 63
Notes and References ......... 69

2. RANDOM VECTORS ......... 70
2.1. Random Vectors ......... 70
2.2. Independence of Random Vectors ......... 76
2.3. Special Covariance Structures ......... 81
Problems ......... 98
Notes and References ......... 102

3. THE NORMAL DISTRIBUTION ON A VECTOR SPACE ......... 103
3.1. The Normal Distribution ......... 104
3.2. Quadratic Forms ......... 109
3.3. Independence of Quadratic Forms ......... 113
3.4. Conditional Distributions ......... 116
3.5. The Density of the Normal Distribution ......... 120
Problems ......... 127
Notes and References ......... 131

4. LINEAR STATISTICAL MODELS ......... 132
4.1. The Classical Linear Model ......... 132
4.2. More About the Gauss-Markov Theorem ......... 140
4.3. Generalized Linear Models ......... 146
Problems ......... 154
Notes and References ......... 157

5. MATRIX FACTORIZATIONS AND JACOBIANS ......... 159
5.1. Matrix Factorizations ......... 159
5.2. Jacobians ......... 166
Problems ......... 180
Notes and References ......... 183

6. TOPOLOGICAL GROUPS AND INVARIANT MEASURES ......... 184
6.1. Groups ......... 185
6.2. Invariant Measures and Integrals ......... 194
6.3. Invariant Measures on Quotient Spaces ......... 207
6.4. Transformations and Factorizations of Measures ......... 218
Problems ......... 228
Notes and References ......... 232

7. FIRST APPLICATIONS OF INVARIANCE ......... 233
7.1. Left O_n Invariant Distributions on n × p Matrices ......... 233
7.2. Groups Acting on Sets ......... 241
7.3. Invariant Probability Models ......... 251
7.4. The Invariance of Likelihood Methods ......... 258
7.5. Distribution Theory and Invariance ......... 267
7.6. Independence and Invariance ......... 284
Problems ......... 296
Notes and References ......... 299

8. THE WISHART DISTRIBUTION ......... 302
8.1. Basic Properties ......... 302
8.2. Partitioning a Wishart Matrix ......... 309
8.3. The Noncentral Wishart Distribution ......... 316
8.4. Distributions Related to Likelihood Ratio Tests ......... 318
Problems ......... 329
Notes and References ......... 332

9. INFERENCE FOR MEANS IN MULTIVARIATE LINEAR MODELS ......... 334
9.1. The MANOVA Model ......... 336
9.2. MANOVA Problems with Block Diagonal Covariance Structure ......... 350
9.3. Intraclass Covariance Structure ......... 355
9.4. Symmetry Models: An Example ......... 361
9.5. Complex Covariance Structures ......... 370
9.6. Additional Examples of Linear Models ......... 381
Problems ......... 397
Notes and References ......... 401

10. CANONICAL CORRELATION COEFFICIENTS ......... 403
10.1. Population Canonical Correlation Coefficients ......... 403
10.2. Sample Canonical Correlations ......... 419
10.3. Some Distribution Theory ......... 427
10.4. Testing for Independence ......... 443
10.5. Multivariate Regression ......... 451
Problems ......... 456
Notes and References ......... 463

APPENDIX ......... 465
COMMENTS ON SELECTED PROBLEMS ......... 471
BIBLIOGRAPHY ......... 503
INDEX ......... 507
Preface
The purpose of this book is to present a version of multivariate statistical theory in which vector space and invariance methods replace, to a large extent, more traditional multivariate methods. The book is a text. Over the past ten years, various versions have been used for graduate multivariate courses at the University of Chicago, the University of Copenhagen, and the University of Minnesota. Designed for a one year lecture course or for independent study, the book contains a full complement of problems and problem solutions.

My interest in using vector space methods in multivariate analysis was aroused by William Kruskal's success with such methods in univariate linear model theory. In the late 1960s, I had the privilege of teaching from Kruskal's lecture notes, where a coordinate free (vector space) approach to univariate analysis of variance was developed. (Unfortunately, Kruskal's notes have not been published.) This approach provided an elegant unification of linear model theory together with many useful geometric insights. In addition, I found the pedagogical advantages of the approach far outweighed the extra effort needed to develop the vector space machinery. Extending the vector space approach to multivariate situations became a goal, which is realized here. Basic material on vector spaces, random vectors, the normal distribution, and linear models takes up most of the first half of this book.

Invariance (group theoretic) arguments have long been an important research tool in multivariate analysis as well as in other areas of statistics. In fact, invariance considerations shed light on most multivariate hypothesis testing, estimation, and distribution theory problems. When coupled with vector space methods, invariance provides an important complement to the traditional distribution theory-likelihood approach to multivariate analysis. Applications of invariance to multivariate problems occur throughout the second half of this book.

A brief summary of the contents and flavor of the ten chapters herein follows. In Chapter 1, the elements of vector space theory are presented. Since my approach to the subject is geometric rather than algebraic, there is an emphasis on inner product spaces, where the notions of length, angle, and orthogonal projection make sense. Geometric topics of particular importance in multivariate analysis include singular value decompositions and angles between subspaces. Random vectors taking values in inner product spaces is the general topic of Chapter 2. Here, induced distributions, means, covariances, and independence are introduced in the inner product space setting. These results are then used to establish many traditional properties of the multivariate normal distribution in Chapter 3. In Chapter 4, a theory of linear models is given that applies directly to multivariate problems. This development, suggested by Kruskal's treatment of univariate linear models, contains results that identify all the linear models to which the Gauss-Markov Theorem applies.

Chapter 5 contains some standard matrix factorizations and some elementary Jacobians that are used in later chapters. In Chapter 6, the theory of invariant integrals (measures) is outlined. The many examples here were chosen to illustrate the theory and prepare the reader for the statistical applications to follow. A host of statistical applications of invariance, ranging from the invariance of likelihood methods to the use of invariance in deriving distributions and establishing independence, are given in Chapter 7. Invariance arguments are used throughout the remainder of the book.

The last three chapters are devoted to a discussion of some traditional and not so traditional problems in multivariate analysis. Here, I have stressed the connections between classical likelihood methods, linear model considerations, and invariance arguments. In Chapter 8, the Wishart distribution is defined via its representation in terms of normal random vectors. This representation, rather than the form of the Wishart density, is used to derive properties of the Wishart distribution. Chapter 9 begins with a thorough discussion of the multivariate analysis of variance (MANOVA) model. Variations on the MANOVA model, including multivariate linear models with structured covariances, are the main topic of the rest of Chapter 9. An invariance argument that leads to the relationship between canonical correlations and angles between subspaces is the lead topic in Chapter 10. After a discussion of some distribution theory, the chapter closes with the connection between testing for independence and testing in multivariate regression models.

Throughout the book, I have assumed that the reader is familiar with the basic ideas of matrix and vector algebra in coordinate spaces and has some knowledge of measure and integration theory. As for statistical prerequisites, a solid first year graduate course in mathematical statistics should suffice. The book is probably best read and used as it was written, from front to back. However, I have taught short (one quarter) courses on topics in MANOVA using the material in Chapters 1, 2, 3, 4, 8, and 9 as a basis.
It is very difficult to compare this text with others on multivariate analysis. Although there may be a moderate amount of overlap with other texts, the approach here is sufficiently different to make a direct comparison inappropriate. Upon reflection, my attraction to vector space and invariance methods was, I think, motivated by a desire for a more complete understanding of multivariate statistical models and techniques. Over the years, I have found vector space ideas and invariance arguments have served me well in this regard. There are many multivariate topics not even mentioned here. These include discrimination and classification, factor analysis, Bayesian multivariate analysis, asymptotic results, and decision theory results. Discussions of these topics can be found in one or more of the books listed in the Bibliography.

As multivariate analysis is a relatively old subject within statistics, a bibliography of the subject is very large. For example, the entries in A Bibliography of Multivariate Analysis by T. W. Anderson, S. Das Gupta, and G. H. P. Styan, published in 1972, number over 6000. The condensed bibliography here contains a few of the important early papers plus a sample of some recent work that reflects my bias. A more balanced view of the subject as a whole can be obtained by perusing the bibliographies of the multivariate texts listed in the Bibliography.

My special thanks go to the staff of the Institute of Mathematical Statistics at the University of Copenhagen for support and encouragement. It was at their invitation that I spent the 1971-1972 academic year at the University of Copenhagen lecturing on multivariate analysis. These lectures led to Multivariate Statistical Analysis, which contains some of the ideas and the flavor of this book. Much of the work herein was completed during a second visit to Copenhagen in 1977-1978. Portions of the work have been supported by the National Science Foundation and the University of Minnesota. This generous support is gratefully acknowledged.

A number of people have read different versions of my manuscript and have made a host of constructive suggestions. Particular thanks go to Michael Meyer, whose good sense of pedagogy led to major revisions in a number of places. Others whose help I would like to acknowledge are Murray Clayton, Siu Chuen Ho, and Takeaki Kariya.

Most of the typing of the manuscript was done by Hanne Hansen. Her efforts are very much appreciated. For their typing of various corrections, addenda, changes, and so on, I would like to thank Melinda Hutson, Catherine Stepnes, and Victoria Wagner.
Morris L. Eaton
Minneapolis, Minnesota
May 1983
Notation
(V, (·, ·))   an inner product space: vector space V with inner product (·, ·)
L(V, W)   the vector space of linear transformations on V to W
Gl(V)   the group of nonsingular linear transformations on V to V
O(V)   the orthogonal group of the inner product space (V, (·, ·))
R^n   Euclidean coordinate space of all n-dimensional column vectors
L_{p,n}   the linear space of all n × p real matrices
Gl_n   the group of n × n nonsingular matrices
O_n   the group of n × n orthogonal matrices
F_{p,n}   the space of n × p real matrices whose p columns form an orthonormal set in R^n
G_T^+   the group of lower triangular matrices with positive diagonal elements (dimension implied by context)
G_U^+   the group of upper triangular matrices with positive diagonal elements (dimension implied by context)
S_p^+   the set of p × p real symmetric positive definite matrices
A > 0   the matrix or linear transformation A is positive definite
A ≥ 0   A is positive semidefinite (non-negative definite)
det   determinant
tr   trace
x □ y   the outer product of the vectors x and y
A ⊗ B   the Kronecker product of the linear transformations A and B
Δ_r   the right-hand modulus of a locally compact topological group
L(·)   the distributional law of "·"
N(μ, Σ)   the normal distribution with mean μ and covariance Σ on an inner product space
W(Σ, p, n)   the Wishart distribution with n degrees of freedom and p × p parameter matrix Σ
CHAPTER 1
Vector Space Theory
In order to understand the structure and geometry of multivariate distributions and associated statistical problems, it is essential that we be able to distinguish those aspects of multivariate distributions that can be described without reference to a coordinate system and those that cannot. Finite dimensional vector space theory provides us with a framework in which it becomes relatively easy to distinguish between coordinate free and coordinate concepts. It is fair to say that the material presented in this chapter furnishes the language we use in the rest of this book to describe many of the geometric (coordinate free) and coordinate properties of multivariate probability models. The treatment of vector spaces here is far from complete, but those aspects of the theory that arise in later chapters are covered. Halmos (1958) has been followed quite closely in the first two sections of this chapter, and because of space limitations, proofs sometimes read "see Halmos (1958)."

The material in this chapter runs from the elementary notions of basis, dimension, linear transformation, and matrix to inner product space, orthogonal projection, and the spectral theorem for self-adjoint linear transformations. In particular, the linear space of linear transformations is studied in detail, and the chapter ends with a discussion of what is commonly known as the singular value decomposition theorem. Most of the vector spaces here are finite dimensional real vector spaces, although excursions into infinite dimensions occur via applications of the Cauchy-Schwarz Inequality. As might be expected, we introduce complex coordinate spaces in the discussion of determinants and eigenvalues.

Multilinear algebra and tensors are not covered systematically, although the outer product of vectors and the Kronecker product of linear transformations are covered. It was felt that the simplifications and generality obtained by introducing tensors were not worth the price in terms of added notation, vocabulary, and abstractness.
1.1. VECTOR SPACES
Let R denote the set of real numbers. Elements of R, called scalars, are denoted by α, β, ....
Definition 1.1. A set V, whose elements are called vectors, is called a real vector space if:

(I) To each pair of vectors x, y ∈ V, there is a vector x + y ∈ V, called the sum of x and y, and for all vectors in V,
(i) x + y = y + x.
(ii) (x + y) + z = x + (y + z).
(iii) There exists a unique vector 0 ∈ V such that x + 0 = x for all x.
(iv) For each x ∈ V, there is a unique vector -x such that x + (-x) = 0.

(II) For each α ∈ R and x ∈ V, there is a vector denoted by αx ∈ V, called the product of α and x, and for all scalars and vectors,
(i) α(βx) = (αβ)x.
(ii) 1x = x.
(iii) (α + β)x = αx + βx.
(iv) α(x + y) = αx + αy.

In II(iii), (α + β)x means the sum of the two scalars, α and β, times x, while αx + βx means the sum of the two vectors, αx and βx. This multiple use of the plus sign should not cause any confusion. The reason for calling V a real vector space is that multiplication of vectors by real numbers is permitted.

A classical example of a real vector space is the set R^n of all ordered n-tuples of real numbers. An element of R^n, say x, is represented as

        [x_1]
    x = [x_2] ,   x_i ∈ R, i = 1, ..., n,
        [ ⋮ ]
        [x_n]

and x_i is called the ith coordinate of x. The vector x + y has ith coordinate x_i + y_i, and αx, α ∈ R, is the vector with coordinates αx_i, i = 1, ..., n. With 0 ∈ R^n representing the vector of all zeroes, it is routine to check that R^n is a real vector space. Vectors in the coordinate space R^n are always represented by a column of n real numbers as indicated above. For typographical convenience, a vector is often written as a row and appears as x' = (x_1, ..., x_n). The prime denotes the transpose of the vector x ∈ R^n.

The following example provides a method of constructing real vector spaces and yields the space R^n as a special case.
* Example 1.1. Let 𝒳 be a set. The set V is the collection of all the real-valued functions defined on 𝒳. For any two elements x_1, x_2 ∈ V, define x_1 + x_2 as the function on 𝒳 whose value at t is x_1(t) + x_2(t). Also, if α ∈ R and x ∈ V, αx is the function on 𝒳 given by (αx)(t) = αx(t). The symbol 0 ∈ V is the zero function. It is easy to verify that V is a real vector space with these definitions of addition and scalar multiplication. When 𝒳 = {1, 2, ..., n}, then V is just the real vector space R^n, and x ∈ R^n has as its ith coordinate the value of x at i ∈ 𝒳. Every vector space discussed in the sequel is either V (for some set 𝒳) or a linear subspace (to be defined in a moment) of some V.
Before defining the dimension of a vector space, we need to discuss linear dependence and independence. The treatment here follows Halmos (1958, Sections 5-9). Let V be a real vector space.

Definition 1.2. A finite set of vectors {x_i | i = 1, ..., k} is linearly dependent if there exist real numbers α_1, ..., α_k, not all zero, such that Σα_i x_i = 0. Otherwise, {x_i | i = 1, ..., k} is linearly independent.

A brief word about summation notation. Ordinarily, we do not indicate indices of summation on a summation sign when the range of summation is clear from the context. For example, in Definition 1.2, the index i was specified to range between 1 and k before the summation on i appeared; hence, no range was indicated on the summation sign.

An arbitrary subset S ⊆ V is linearly independent if every finite subset of S is linearly independent. Otherwise, S is linearly dependent.
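Definition 1.2 has a convenient numerical counterpart: vectors x_1, ..., x_k in R^n are linearly independent exactly when the n × k matrix with the x_i as columns has rank k. A sketch with NumPy (the helper name is ours; `matrix_rank` uses an SVD-based tolerance, so this is a floating-point test, not an exact one):

```python
# Numerical check of linear independence via matrix rank.
import numpy as np

def is_linearly_independent(vectors):
    """vectors: list of 1-D arrays of equal length (elements of R^n)."""
    A = np.column_stack(vectors)
    return bool(np.linalg.matrix_rank(A) == len(vectors))

e1, e2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(is_linearly_independent([e1, e2]))           # True
print(is_linearly_independent([e1, e2, e1 + e2]))  # False: the sum is dependent
```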
Definition 1.3. A basis for a vector space V is a linearly independent set S such that every vector in V is a linear combination of elements of S. V is
finite dimensional if it has a finite set S that is a basis.
* Example 1.2. Take V = R^n and let ε_i = (0, ..., 0, 1, 0, ..., 0)', where the one occurs as the ith coordinate of ε_i, i = 1, ..., n. For x ∈ R^n, it is clear that x = Σx_i ε_i, where x_i is the ith coordinate of x. Thus every vector in R^n is a linear combination of ε_1, ..., ε_n. To show that {ε_i | i = 1, ..., n} is a linearly independent set, suppose Σα_i ε_i = 0 for some scalars α_i, i = 1, ..., n. Then x = Σα_i ε_i = 0 has α_i as its ith coordinate, so α_i = 0, i = 1, ..., n. Thus {ε_i | i = 1, ..., n} is a basis for R^n, and R^n is finite dimensional. The basis {ε_i | i = 1, ..., n} is called the standard basis for R^n.
Let V be a finite dimensional real vector space. The basic properties of linearly independent sets and bases are:

(i) If {x_1, ..., x_m} is a linearly independent set in V, then there exist vectors x_{m+1}, ..., x_{m+k} such that {x_1, ..., x_{m+k}} is a basis for V.
(ii) All bases for V have the same number of elements. The dimension of V is defined to be the number of elements in any basis.
(iii) Every set of n + 1 vectors in an n-dimensional vector space is linearly dependent.

Proofs of the above assertions can be found in Halmos (1958, Sections 5-8). The dimension of a finite dimensional vector space is denoted by dim(V). If {x_1, ..., x_n} is a basis for V, then every x ∈ V is a unique linear combination of {x_1, ..., x_n}, say x = Σα_i x_i. That every x can be so expressed follows from the definition of a basis, and the uniqueness follows from the linear independence of {x_1, ..., x_n}. The numbers α_1, ..., α_n are called the coordinates of x in the basis {x_1, ..., x_n}. Clearly, the coordinates of x depend on the order in which we write the basis. Thus by a basis we always mean an ordered basis.
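Finding the coordinates of x in an ordered basis amounts to solving a linear system: if B has the basis vectors as its columns, the coordinate vector a satisfies Ba = x. A brief NumPy sketch (the helper name is ours; it assumes the supplied vectors really form a basis, so B is nonsingular):

```python
# Coordinates of x in an (ordered) basis {b_1, ..., b_n} of R^n.
import numpy as np

def coordinates(basis, x):
    B = np.column_stack(basis)
    return np.linalg.solve(B, x)  # unique solution since B is nonsingular

b1, b2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
x = np.array([3.0, 1.0])
a = coordinates([b1, b2], x)
# x = 2*b1 + 1*b2, so the coordinates in this ordered basis are (2, 1).
print(a)
```

Reversing the order of b1 and b2 reverses the coordinates, which is the point of insisting on an ordered basis.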
We now introduce the notion of a subspace of a vector space.

Definition 1.4. A nonempty subset M ⊆ V is a subspace (or linear manifold) of V if, for each x, y ∈ M and α, β ∈ R, αx + βy ∈ M.

A subspace M of a real vector space V is easily shown to satisfy the vector space axioms (with addition and scalar multiplication inherited from V), so subspaces are real vector spaces. It is not difficult to verify the following assertions (Halmos, 1958, Sections 10-12):

(i) The intersection of subspaces is a subspace.
(ii) If M is a subspace of a finite dimensional vector space V, then dim(M) ≤ dim(V).
(iii) Given an m-dimensional subspace M of an n-dimensional vector space V, there is a basis {x_1, ..., x_m, ..., x_n} for V such that {x_1, ..., x_m} is a basis for M.
Given any set S ⊆ V, span(S) is defined to be the intersection of all the subspaces that contain S; that is, span(S) is the smallest subspace that contains S. It is routine to show that span(S) is equal to the set of all linear combinations of elements of S. The subspace span(S) is often called the subspace spanned by the set S.

If M and N are subspaces of V, then span(M ∪ N) is the set of all vectors of the form x + y where x ∈ M and y ∈ N. The suggestive notation M + N ≡ {z | z = x + y, x ∈ M, y ∈ N} is used for span(M ∪ N) when M and N are subspaces. Using the fact that a linearly independent set can be extended to a basis in a finite dimensional vector space, we have the following. Let V be finite dimensional and suppose M and N are subspaces of V.
(i) Let m = dim(M), n = dim(N), and k = dim(M ∩ N). Then there exist vectors x_1, ..., x_k, y_{k+1}, ..., y_m, and z_{k+1}, ..., z_n such that {x_1, ..., x_k} is a basis for M ∩ N, {x_1, ..., x_k, y_{k+1}, ..., y_m} is a basis for M, {x_1, ..., x_k, z_{k+1}, ..., z_n} is a basis for N, and {x_1, ..., x_k, y_{k+1}, ..., y_m, z_{k+1}, ..., z_n} is a basis for M + N. If k = 0, then {x_1, ..., x_k} is interpreted as the empty set.
(ii) dim(M + N) = dim(M) + dim(N) - dim(M ∩ N).
(iii) There exists a subspace M_1 ⊆ V such that M ∩ M_1 = {0} and M + M_1 = V.
Definition 1.5. If M and N are subspaces of V that satisfy M ∩ N = {0} and M + N = V, then M and N are complementary subspaces.

The technique of decomposing a vector space into two (or more) complementary subspaces arises again and again in the sequel. The basic property of such a decomposition is given in the following proposition.
Proposition 1.1. Suppose M and N are complementary subspaces in V. Then each x ∈ V has a unique representation x = y + z with y ∈ M and z ∈ N.

Proof. Since M + N = V, each x ∈ V can be written x = y_1 + z_1 with y_1 ∈ M and z_1 ∈ N. If x = y_2 + z_2 with y_2 ∈ M and z_2 ∈ N, then 0 = x - x = (y_1 - y_2) + (z_1 - z_2). Hence (y_2 - y_1) = (z_1 - z_2), so (y_2 - y_1) ∈ M ∩ N = {0}. Thus y_1 = y_2. Similarly, z_1 = z_2. □
The above proposition shows that we can decompose the vector space V into two vector spaces M and N, and each x in V has a unique piece in M and in N. Thus x can be represented as (y, z) with y ∈ M and z ∈ N. Also, note that if x_1, x_2 ∈ V have the representations (y_1, z_1), (y_2, z_2), then αx_1 + βx_2 has the representation (αy_1 + βy_2, αz_1 + βz_2), for α, β ∈ R. In other words, the function that maps x into its decomposition (y, z) is linear. To make this a bit more precise, we now define the direct sum of two vector spaces.

Definition 1.6. Let V_1 and V_2 be two real vector spaces. The direct sum of V_1 and V_2, denoted by V_1 ⊕ V_2, is the set of all ordered pairs {x, y}, x ∈ V_1, y ∈ V_2, with the linear operations defined by α_1{x_1, y_1} + α_2{x_2, y_2} = {α_1x_1 + α_2x_2, α_1y_1 + α_2y_2}.

That V_1 ⊕ V_2 is a real vector space with the above operations can easily be verified. Further, identifying V_1 with {{x, 0} | x ∈ V_1} and V_2 with {{0, y} | y ∈ V_2}, we can think of V_1 and V_2 as complementary subspaces of V_1 ⊕ V_2, since V_1 + V_2 = V_1 ⊕ V_2 and V_1 ∩ V_2 = {0, 0}, which is the zero element in V_1 ⊕ V_2. The relation of the direct sum to our previous decomposition of a vector space should be clear.
* Example 1.3. Consider V = R^n, n ≥ 2, and let p and q be positive integers such that p + q = n. Then R^p and R^q are both real vector spaces. Each element of R^n is an n-tuple of real numbers, and we can construct subspaces of R^n by setting some of these coordinates equal to zero. For example, consider M = {x ∈ R^n | x = (y, 0)' with y ∈ R^p, 0 ∈ R^q} and N = {x ∈ R^n | x = (0, z)' with 0 ∈ R^p and z ∈ R^q}. It is clear that dim(M) = p, dim(N) = q, M ∩ N = {0}, and M + N = R^n. The identification of R^p with M and R^q with N shows that it is reasonable to write R^p ⊕ R^q = R^{p+q}.
1.2. LINEAR TRANSFORMATIONS
Linear transformations occupy a central position, both in vector space theory and in multivariate analysis. In this section, we discuss the basic properties of linear transformations, leaving the deeper results for consideration after the introduction of inner products. Let V and W be real vector spaces.
Definition 1.7. Any function A defined on V and taking values in W is called a linear transformation if A(α_1x_1 + α_2x_2) = α_1A(x_1) + α_2A(x_2) for all x_1, x_2 ∈ V and α_1, α_2 ∈ R.

Frequently, A(x) is written Ax when there is no danger of confusion. Let L(V, W) be the set of all linear transformations on V to W. For two linear transformations A_1 and A_2 in L(V, W), A_1 + A_2 is defined by (A_1 + A_2)(x) = A_1x + A_2x, and (αA)(x) = αAx for α ∈ R. The zero linear transformation is denoted by 0. It should be clear that L(V, W) is a real vector space with these definitions of addition and scalar multiplication.
* Example 1.4. Suppose dim(V) = m and let x1, …, xm be a basis for V. Also, let y1, …, ym be arbitrary vectors in W. The claim is that there is a unique linear transformation A such that Axi = yi, i = 1, …, m. To see this, consider x ∈ V and express x as a unique linear combination of the basis vectors, x = Σ ai xi. Define A by

Ax = Σ ai Axi = Σ ai yi.

The linearity of A is easy to check. To show that A is unique, let B be another linear transformation with Bxi = yi, i = 1, …, m. Then (A - B)(xi) = 0 for i = 1, …, m, and (A - B)(x) = (A - B)(Σ ai xi) = Σ ai (A - B)(xi) = 0 for all x ∈ V. Thus A = B.
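Numerically, the construction in Example 1.4 amounts to solving for a matrix: if the basis vectors are stored as the columns of an invertible matrix X and their prescribed images as the columns of Y, the unique A with Axi = yi is Y X⁻¹. A sketch (the matrices are our own illustration, not from the text):

```python
import numpy as np

# Basis x1, x2 of R^2 as columns of X; prescribed images y1, y2 as columns of Y.
X = np.array([[1.0, 1.0],
              [0.0, 1.0]])
Y = np.array([[2.0, 3.0],
              [4.0, 5.0]])
# The unique linear A with A @ X[:, i] = Y[:, i] for each i:
A = Y @ np.linalg.inv(X)
```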
The above example illustrates a general principle: a linear transformation is completely determined by its values on a basis. This principle is used often to construct linear transformations with specified properties. A modification of the construction in Example 1.4 yields a basis for L(V, W) when V and W are both finite dimensional. This basis is given in the proof of the following proposition.
Proposition 1.2. If dim(V) = m and dim(W) = n, then dim(L(V, W)) = mn.

Proof. Let x1, …, xm be a basis for V and let y1, …, yn be a basis for W. Define a linear transformation Aji, i = 1, …, m and j = 1, …, n, by

Aji(xk) = 0 if k ≠ i,  Aji(xk) = yj if k = i.

For each (j, i), Aji has been defined on a basis in V, so the linear transformation Aji is uniquely determined. We now claim that {Aji | i = 1, …, m; j = 1, …, n} is a basis for L(V, W). To show linear independence, suppose Σj Σi aji Aji = 0. Then for each k = 1, …, m,

0 = Σj Σi aji Aji(xk) = Σj ajk yj.

Since {y1, …, yn} is a linearly independent set, this implies that ajk = 0 for all j and k. Thus linear independence holds. To show every A ∈ L(V, W) is a linear combination of the Aji, first note that Axk is a vector in W and thus is a unique linear combination of y1, …, yn, say Axk = Σj ajk yj where ajk ∈ R. However, the linear transformation Σj Σi aji Aji evaluated at xk is

Σj Σi aji Aji(xk) = Σj ajk yj.

Since A and Σj Σi aji Aji agree on a basis in V, they are equal. This completes the proof since there are mn elements in the basis {Aji | i = 1, …, m; j = 1, …, n} for L(V, W). □
Since L(V, W) is a vector space, general results about vector spaces, of course, apply to L(V, W). However, linear transformations have many interesting properties not possessed by vectors in general. For example, consider vector spaces Vi, i = 1, 2, 3. If A ∈ L(V1, V2) and B ∈ L(V2, V3), then we can compose the functions B and A by defining (BA)(x) = B(A(x)). The linearity of A and B implies that BA is a linear transformation on V1 to V3; that is, BA ∈ L(V1, V3). Usually, BA is called the product of B and A.

There are two special cases of L(V, W) that are of particular interest. First, if A, B ∈ L(V, V), then AB ∈ L(V, V) and BA ∈ L(V, V), so we have a multiplication defined in L(V, V). However, this multiplication is not commutative; that is, AB is not, in general, equal to BA. Clearly, A(B + C) = AB + AC for A, B, C ∈ L(V, V). The identity linear transformation in L(V, V), usually denoted by I, satisfies AI = IA = A for all A ∈ L(V, V), since Ix = x for all x ∈ V. Thus L(V, V) is not only a vector space, but there is a multiplication defined in L(V, V).

The second special case of L(V, W) we wish to consider is when W = R; that is, W is the one-dimensional real vector space R. The space L(V, R) is called the dual space of V and, if dim(V) = n, then dim(L(V, R)) = n. Clearly, L(V, R) is the vector space of all real-valued linear functions defined on V. We have more to say about L(V, R) after the introduction of inner products on V.
Understanding the geometry of linear transformations usually begins with a specification of the range and null space of the transformation. These objects are now defined. Let A ∈ L(V, W) where V and W are finite dimensional.

Definition 1.8. The range of A, denoted by R(A), is

R(A) = {u | u ∈ W, Ax = u for some x ∈ V}.

The null space of A, denoted by N(A), is

N(A) = {x | x ∈ V, Ax = 0}.
It is routine to verify that R(A) is a subspace of W and N(A) is a subspace of V. The rank of A, denoted by r(A), is the dimension of R(A).

Proposition 1.3. If A ∈ L(V, W) and n = dim(V), then r(A) + dim(N(A)) = n.
Proof. Let M be a subspace of V such that M ⊕ N(A) = V, and consider a basis {x1, …, xk} for M. Since dim(M) + dim(N(A)) = n, we need to show that k = r(A). To do this, it is sufficient to show that {Ax1, …, Axk} is a basis for R(A). If 0 = Σ ai Axi = A(Σ ai xi), then Σ ai xi ∈ M ∩ N(A), so Σ ai xi = 0. Hence a1 = … = ak = 0 as {x1, …, xk} is a basis for M. Thus {Ax1, …, Axk} is a linearly independent set. To verify that {Ax1, …, Axk} spans R(A), suppose w ∈ R(A). Then w = Ax for some x ∈ V. Write x = y + z where y ∈ M and z ∈ N(A). Then w = A(y + z) = Ay. Since y ∈ M, y = Σ ai xi for some scalars a1, …, ak. Therefore, w = A(Σ ai xi) = Σ ai Axi. □
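Proposition 1.3 is easy to check numerically: for any matrix, the rank plus the dimension of the null space equals the dimension of the domain. A sketch using the singular values to count the null space dimension (the example matrix is our own):

```python
import numpy as np

# A 3 x 4 matrix: the second row is twice the first, so r(A) = 2.
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [1.0, 0.0, 1.0, 0.0]])
rank = np.linalg.matrix_rank(A)
# dim N(A): count (near-)zero singular values, plus the "missing" ones
# when the domain dimension exceeds the number of singular values.
s = np.linalg.svd(A, compute_uv=False)
null_dim = int(np.sum(s < 1e-10)) + (A.shape[1] - len(s))
```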
Definition 1.9. A linear transformation A ∈ L(V, V) is called invertible if there exists a linear transformation, denoted by A⁻¹, such that AA⁻¹ = A⁻¹A = I.

The following assertions hold; see Halmos (1958, Section 36):

(i) A is invertible iff R(A) = V iff Ax = 0 implies x = 0.
(ii) If A, B, C ∈ L(V, V) and if AB = CA = I, then A is invertible and B = C = A⁻¹.
(iii) If A and B are invertible, then AB is invertible and (AB)⁻¹ = B⁻¹A⁻¹. If A is invertible and a ≠ 0, then (aA)⁻¹ = a⁻¹A⁻¹ and (A⁻¹)⁻¹ = A.
In terms of bases, invertible transformations are characterized by the following.
Proposition 1.4. Let A ∈ L(V, V) and suppose {x1, …, xn} is a basis for V. The following are equivalent:

(i) A is invertible.
(ii) {Ax1, …, Axn} is a basis for V.

Proof. Suppose A is invertible. Since dim(V) = n, we must show {Ax1, …, Axn} is a linearly independent set. Thus if 0 = Σ ai Axi = A(Σ ai xi), then Σ ai xi = 0 since A is invertible. Hence ai = 0, i = 1, …, n, as {x1, …, xn} is a basis for V. Therefore, {Ax1, …, Axn} is a basis.

Conversely, suppose {Ax1, …, Axn} is a basis. We show that Ax = 0 implies x = 0. First, write x = Σ ai xi, so Ax = 0 implies Σ ai Axi = 0. Hence ai = 0, i = 1, …, n, as {Ax1, …, Axn} is a basis. Thus x = 0, so A is invertible. □
We now introduce real matrices and consider their relation to linear transformations. Consider vector spaces V and W of dimension m and n, respectively, and bases {x1, …, xm} and {y1, …, yn} for V and W. Each x ∈ V has a unique representation x = Σ ai xi. Let [x] denote the column vector of coordinates of x in the given basis. Thus [x] ∈ R^m and the ith coordinate of [x] is ai, i = 1, …, m. Similarly, [y] ∈ R^n is the column vector of coordinates of y ∈ W in the basis {y1, …, yn}. Consider A ∈ L(V, W) and express Axj in the given basis of W: Axj = Σi aij yi for unique scalars aij, i = 1, …, n, j = 1, …, m. The n × m rectangular array of real scalars

[A] = (aij) =

a11 a12 … a1m
a21 a22 … a2m
 .   .      .
an1 an2 … anm

is called the matrix of A relative to the two given bases. Conversely, given any n × m rectangular array of real scalars (aij), i = 1, …, n, j = 1, …, m, the linear transformation A defined by Axj = Σi aij yi has as its matrix [A] = (aij).

Definition 1.10. A rectangular array (aij): m × n of real scalars is called an m × n matrix. If A = (aij): m × n is a matrix and B = (bij): n × p is a matrix, then C = AB, called the matrix product of A and B (in that order), is defined to be the matrix (cij): m × p with cij = Σk aik bkj.
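The row-by-column rule of Definition 1.10, written out explicitly and checked against NumPy's built-in product (a didactic sketch; in practice one would simply write A @ B):

```python
import numpy as np

def matprod(A, B):
    # c_ij = sum_k a_ik * b_kj, exactly as in Definition 1.10.
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must agree"
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))
    return C

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # 3 x 2
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])     # 2 x 3
C = matprod(A, B)                   # 3 x 3
```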
In this book, the distinction between linear transformations, matrices, and the matrix of a linear transformation is always made. The notation [A] means the matrix of a linear transformation with respect to two given bases. However, symbols like A, B, or C may represent either linear transformations or real matrices; care is taken to clearly indicate which case is under consideration.

Each matrix A = (aij): m × n defines a linear transformation on R^n to R^m as follows. For x ∈ R^n with coordinates x1, …, xn, Ax is the vector y in R^m with coordinates yi = Σj aij xj, i = 1, …, m. Of course, this is the usual row-by-column rule of a matrix operating on a vector. The matrix of this linear transformation in the standard bases for R^n and R^m is just the matrix A. However, if the bases are changed, then the matrix of the linear transformation changes. When m = n, the matrix A = (aij) determines a linear transformation on R^n to R^n via the above definition of a matrix times a vector. The matrix A is called nonsingular (or invertible) if there exists a matrix, denoted by A⁻¹, such that AA⁻¹ = A⁻¹A = In, where In is the n × n identity matrix consisting of ones on the diagonal and zeroes off the diagonal. As with linear transformations, A⁻¹ is unique and exists iff Ax = 0 implies x = 0.
The symbol L_{n,m} denotes the real vector space of m × n real matrices with the usual operations of addition and scalar multiplication. In other words, if A = (aij) and B = (bij) are elements of L_{n,m}, then A + B = (aij + bij) and aA = (a aij). Notice that L_{n,m} is the set of m × n matrices (m and n are in reverse order). The reason for writing L_{n,m} is that an m × n matrix determines a linear transformation from R^n to R^m. We have made the choice of writing L(V, W) for linear transformations from V to W, and it is an unpleasant fact that the dimensions of a matrix occur in reverse order to the dimensions of the spaces V and W. The next result summarizes the relations between linear transformations and matrices.
Proposition 1.5. Consider vector spaces V1, V2, and V3 with bases {x1, …, xn1}, {y1, …, yn2}, and {z1, …, zn3}, respectively. For x ∈ V1, y ∈ V2, and z ∈ V3, let [x], [y], and [z] denote the vectors of coordinates of x, y, and z in the given bases, so [x] ∈ R^{n1}, [y] ∈ R^{n2}, and [z] ∈ R^{n3}. For A ∈ L(V1, V2) and B ∈ L(V2, V3), let [A] ([B]) denote the matrix of A (B) relative to the bases {x1, …, xn1} and {y1, …, yn2} ({y1, …, yn2} and {z1, …, zn3}). Then:

(i) [Ax] = [A][x].
(ii) [BA] = [B][A].
(iii) If V1 = V2 and A is invertible, [A⁻¹] = [A]⁻¹. Here, [A⁻¹] and [A] are matrices in the bases {x1, …, xn1} and {x1, …, xn1}.
Proof. A few words are in order concerning the notation in (i), (ii), and (iii). In (i), [Ax] is the vector of coordinates of Ax ∈ V2 with respect to the basis {y1, …, yn2}, and [A][x] means the matrix [A] times the coordinate vector [x] as defined previously. Since both sides of (i) are linear in x, it suffices to verify (i) for x = xj, j = 1, …, n1. But [A][xj] is just the column vector with coordinates aij, i = 1, …, n2, and Axj = Σi aij yi, so [Axj] is the column vector with coordinates aij, i = 1, …, n2. Hence (i) holds.

For (ii), [B][A] is just the matrix product of [B] and [A]. Also, [BA] is the matrix of the linear transformation BA ∈ L(V1, V3) with respect to the bases {x1, …, xn1} and {z1, …, zn3}. To show that [BA] = [B][A], we must verify that, for all x ∈ V1, [BA][x] = [B][A][x]. But by (i), [BA][x] = [BAx] and, using (i) twice, [B][A][x] = [B][Ax] = [BAx]. Thus (ii) is established.

In (iii), [A]⁻¹ denotes the inverse of the matrix [A]. Since A is invertible, AA⁻¹ = A⁻¹A = I where I is the identity linear transformation on V1 to V1. Thus by (ii), with In denoting the n × n identity matrix, In = [I] = [AA⁻¹] = [A][A⁻¹] = [A⁻¹A] = [A⁻¹][A]. By the uniqueness of the matrix inverse, [A⁻¹] = [A]⁻¹. □
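Proposition 1.5(ii) can be verified numerically. Storing each basis as the columns of an invertible matrix, the matrix of T relative to a pair of bases has as its jth column the coordinates of T(xj) in the output basis. A sketch with random bases (our own construction, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 3, 4, 2
# Random (almost surely invertible) bases for V1, V2, V3, as columns.
X = rng.normal(size=(n1, n1))
Y = rng.normal(size=(n2, n2))
Z = rng.normal(size=(n3, n3))
# Random linear transformations A: V1 -> V2 and B: V2 -> V3.
A = rng.normal(size=(n2, n1))
B = rng.normal(size=(n3, n2))

def mat(T, basis_out, basis_in):
    # Column j = coordinates of T(x_j) in basis_out, i.e. basis_out^{-1} T x_j.
    return np.linalg.solve(basis_out, T @ basis_in)

lhs = mat(B @ A, Z, X)              # [BA]
rhs = mat(B, Z, Y) @ mat(A, Y, X)   # [B][A]
```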
Projections are the final topic in this section. If V is a finite dimensional vector space and M and N are subspaces of V such that M ⊕ N = V, we have seen that each x ∈ V has a unique piece in M and a unique piece in N. In other words, x = y + z where y ∈ M, z ∈ N, and y and z are unique.

Definition 1.11. Given subspaces M and N in V such that M ⊕ N = V, if x = y + z with y ∈ M and z ∈ N, then y is called the projection of x on M along N and z is called the projection of x on N along M.
Since M and N play symmetric roles in the above definition, we concentrate on the projection on M.
Proposition 1.6. The function P mapping V into V whose value at x is the projection of x on M along N is a linear transformation that satisfies

(i) R(P) = M, N(P) = N.
(ii) P² = P.
Proof. We first show that P is linear. If x = y + z with y ∈ M, z ∈ N, then by definition, Px = y. Also, if x1 = y1 + z1 and x2 = y2 + z2 are the decompositions of x1 and x2, respectively, then a1 x1 + a2 x2 = (a1 y1 + a2 y2) + (a1 z1 + a2 z2) is the decomposition of a1 x1 + a2 x2. Thus P(a1 x1 + a2 x2) = a1 Px1 + a2 Px2, so P is linear. By definition Px ∈ M, so R(P) ⊆ M. But if x ∈ M, Px = x, and R(P) = M. Also, if x ∈ N, Px = 0, so N(P) ⊇ N. However, if Px = 0, then x = 0 + x, and therefore x ∈ N. Thus N(P) = N. To show P² = P, note that Px ∈ M and Px = x for x ∈ M. Hence, Px = P(Px) = P²x, which implies that P = P². □
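A projection on M along N can be computed concretely: express x in a basis adapted to M ⊕ N = V and keep only the M-coordinates. A small sketch in R^2 (our own example; note that M and N need not be perpendicular, so P is in general not symmetric):

```python
import numpy as np

M = np.array([[1.0], [1.0]])    # M = span{(1, 1)}
N = np.array([[1.0], [0.0]])    # N = span{(1, 0)};  M (+) N = R^2
basis = np.hstack([M, N])
# Keep the M-coordinate, send the N-coordinate to zero:
P = np.hstack([M, np.zeros_like(N)]) @ np.linalg.inv(basis)
```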
A converse to Proposition 1.6 gives a complete description of all linear transformations on V to V that satisfy A² = A.

Proposition 1.7. If A ∈ L(V, V) and satisfies A² = A, then R(A) ⊕ N(A) = V and A is the projection on R(A) along N(A).
Proof. To show R(A) ⊕ N(A) = V, we must verify that R(A) ∩ N(A) = {0} and that each x ∈ V is the sum of a vector in R(A) and a vector in N(A). If x ∈ R(A) ∩ N(A), then x = Ay for some y ∈ V and Ax = 0. Since A² = A, 0 = Ax = A²y = Ay = x, and R(A) ∩ N(A) = {0}. For x ∈ V, write x = Ax + (I - A)x and let y = Ax and z = (I - A)x. Then y ∈ R(A) by definition and Az = A(I - A)x = (A - A²)x = 0, so z ∈ N(A). Thus R(A) ⊕ N(A) = V.

The verification that A is the projection on R(A) along N(A) goes as follows. A is zero on N(A) by definition. Also, for x ∈ R(A), x = Ay for some y ∈ V. Thus Ax = A²y = Ay = x, so Ax = x. However, the projection on R(A) along N(A), say P, also satisfies Px = x for x ∈ R(A) and Px = 0 for x ∈ N(A). This implies that P = A since R(A) ⊕ N(A) = V. □
The above proof shows that the projection on M along N is the unique linear transformation that is the identity on M and zero on N. Also, it is clear that P is the projection on M along N iff I - P is the projection on N along M.
1.3. INNER PRODUCT SPACES
The discussion of the previous section was concerned mainly with the linear aspects of vector spaces. Here, we introduce inner products on vector spaces so that the geometric notions of length, angle, and orthogonality become
meaningful. Let us begin with an example.
* Example 1.5. Consider coordinate space R^n with the standard basis {ε1, …, εn}. For x, y ∈ R^n, define x'y = Σ xi yi where x and y have coordinates x1, …, xn and y1, …, yn. Of course, x' is the transpose of the vector x, and x'y can be thought of as the 1 × n matrix x' times the n × 1 matrix y. The real number x'y is sometimes called the scalar product (or inner product) of x and y. Some properties of the scalar product are:

(i) x'y = y'x (symmetry).
(ii) x'y is linear in y for fixed x and linear in x for fixed y.
(iii) x'x = Σ xi² ≥ 0 and is zero iff x = 0.

The norm of x, defined by ||x|| = (x'x)^{1/2}, can be thought of as the distance between x and 0 ∈ R^n. Hence, ||x - y|| = (Σ(xi - yi)²)^{1/2} is usually called the distance between x and y. When x and y are both not zero, then the cosine of the angle between x and y is x'y/(||x|| ||y||) (see Halmos, 1958, p. 118). Thus we have a geometric interpretation of the scalar product. In particular, the angle between x and y is π/2 (cos π/2 = 0) iff x'y = 0. Thus we say x and y are orthogonal (perpendicular) iff x'y = 0.
Let V be a real vector space. An inner product on V is obtained by simply abstracting the properties of the scalar product on R^n.
Definition 1.12. An inner product on a real vector space V is a real-valued function on V × V, denoted by (·, ·), with the following properties:

(i) (x, y) = (y, x) (symmetry).
(ii) (a1 x1 + a2 x2, y) = a1 (x1, y) + a2 (x2, y) (linearity).
(iii) (x, x) ≥ 0 and (x, x) = 0 only if x = 0 (positivity).
From (i) and (ii) it follows that (x, a1 y1 + a2 y2) = a1 (x, y1) + a2 (x, y2). In other words, inner products are linear in each variable when the other variable is fixed. The norm of x, denoted by ||x||, is defined to be ||x|| = (x, x)^{1/2}, and the distance between x and y is ||x - y||. Hence geometrically meaningful names and properties related to the scalar product on R^n have become definitions on V. To establish the existence of inner products on finite dimensional vector spaces, we have the following proposition.
Proposition 1.8. Suppose {x1, …, xn} is a basis for the real vector space V. The function (·, ·) defined on V × V by (x, y) = Σ αi βi, where x = Σ αi xi and y = Σ βi xi, is an inner product on V.

Proof. Clearly (x, y) = (y, x). If x = Σ αi xi and z = Σ γi xi, then for scalars a and c, (ax + cz, y) = Σ (a αi + c γi) βi = a Σ αi βi + c Σ γi βi = a(x, y) + c(z, y). This establishes the linearity. Also, (x, x) = Σ αi², which is zero iff all the αi are zero, and this is equivalent to x being zero. Thus (·, ·) is an inner product on V. □
A vector space V with a given inner product (·, ·) is called an inner product space.
Definition 1.13. Two vectors x and y in an inner product space (V, (·, ·)) are orthogonal, written x ⊥ y, if (x, y) = 0. Two subsets S1 and S2 of V are orthogonal, written S1 ⊥ S2, if x ⊥ y for all x ∈ S1 and y ∈ S2.

Definition 1.14. Let (V, (·, ·)) be a finite dimensional inner product space. A set of vectors {x1, …, xk} is called an orthonormal set if (xi, xj) = δij for i, j = 1, …, k, where δij = 1 if i = j and 0 if i ≠ j. A set {x1, …, xk} is called an orthonormal basis if the set is both a basis and an orthonormal set.
First note that an orthonormal set {x1, …, xk} is linearly independent. To see this, suppose 0 = Σ ai xi. Then 0 = (0, xj) = (Σ ai xi, xj) = Σ ai (xi, xj) = Σi ai δij = aj. Hence aj = 0 for j = 1, …, k, and the set {x1, …, xk} is linearly independent.
In Proposition 1.8, the basis used to define the inner product is, in fact, an orthonormal basis for that inner product. Also, the standard basis for R^n is an orthonormal basis for the scalar product on R^n; this scalar product is called the standard inner product on R^n. An algorithm for constructing orthonormal sets from linearly independent sets is now given. It is known as the Gram-Schmidt orthogonalization procedure.
Proposition 1.9. Let {x1, …, xk} be a linearly independent set in the inner product space (V, (·, ·)). Define vectors y1, …, yk as follows:

y1 = x1/||x1||

and

y_{i+1} = (x_{i+1} - Σ_{j=1}^{i} (x_{i+1}, yj) yj) / ||x_{i+1} - Σ_{j=1}^{i} (x_{i+1}, yj) yj||

for i = 1, …, k - 1. Then {y1, …, yk} is an orthonormal set and span{x1, …, xi} = span{y1, …, yi}, i = 1, …, k.

Proof. See Halmos (1958, Section 65). □
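The recursion of Proposition 1.9 translates directly into code. A classical, unpivoted sketch (for serious numerical work one would prefer a QR factorization, which is better conditioned):

```python
import numpy as np

def gram_schmidt(xs):
    # Subtract from each x its projections on the y's found so far,
    # then normalize; division by ~0 signals a dependent input set.
    ys = []
    for x in xs:
        v = x - sum(np.dot(x, y) * y for y in ys)
        ys.append(v / np.linalg.norm(v))
    return ys

xs = [np.array([1.0, 1.0, 0.0]),
      np.array([1.0, 0.0, 1.0]),
      np.array([0.0, 1.0, 1.0])]
ys = gram_schmidt(xs)
```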
An immediate consequence of Proposition 1.9 is that if {x1, …, xn} is a basis for V, then {y1, …, yn} constructed above is an orthonormal basis for (V, (·, ·)). If {y1, …, yn} is an orthonormal basis for (V, (·, ·)), then each x in V has the representation x = Σ (x, yi) yi in the given basis. To see this, we know x = Σ ai yi for unique scalars a1, …, an. Thus

(x, yj) = Σi ai (yi, yj) = Σi ai δij = aj.

Therefore, the coordinates of x in the orthonormal basis are (x, yi), i = 1, …, n. Also, it follows that (x, x) = Σ (x, yi)².

Recall that the dual space of V was defined to be the set of all real-valued linear functions on V and was denoted by L(V, R). Also dim(V) = dim(L(V, R)) when V is finite dimensional. The identification of V with L(V, R) via a given inner product is described in the following proposition.
Proposition 1.10. If (V, (·, ·)) is a finite dimensional inner product space and if f ∈ L(V, R), then there exists a vector x0 ∈ V such that f(x) = (x0, x) for x ∈ V. Conversely, (x0, ·) is a linear function on V for each x0 ∈ V.

Proof. Let x1, …, xn be an orthonormal basis for V and set ai = f(xi) for i = 1, …, n. For x0 = Σ ai xi, it is clear that (x0, xj) = aj = f(xj). Since the two linear functions f and (x0, ·) agree on a basis, they are the same function. Thus f(x) = (x0, x) for x ∈ V. The converse is clear. □
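The proof of Proposition 1.10 is constructive: over an orthonormal basis, x0 = Σ f(xi) xi. A sketch with the standard inner product on R^3 (the functional f is our own example):

```python
import numpy as np

# An arbitrary linear functional on R^3.
f = lambda x: 2.0 * x[0] - x[1] + 3.0 * x[2]

# Build the representing vector from f's values on the standard
# (orthonormal) basis, as in the proof of Proposition 1.10.
x0 = sum(f(e) * e for e in np.eye(3))
```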
Definition 1.15. If S is a subset of V, the orthogonal complement of S, denoted by S⊥, is S⊥ = {x | x ⊥ y for all y ∈ S}.

It is easily verified that S⊥ is a subspace of V for any set S, and S ⊥ S⊥. The next result provides a basic decomposition for a finite dimensional inner product space.
Proposition 1.11. Suppose M is a k-dimensional subspace of an n-dimensional inner product space (V, (·, ·)). Then

(i) M ∩ M⊥ = {0}.
(ii) M ⊕ M⊥ = V.
(iii) (M⊥)⊥ = M.
Proof. Let {x1, …, xn} be a basis for V such that {x1, …, xk} is a basis for M. Applying the Gram-Schmidt process to {x1, …, xn}, we get an orthonormal basis {y1, …, yn} such that {y1, …, yk} is a basis for M. Let N = span{y_{k+1}, …, yn}. We claim that N = M⊥. It is clear that N ⊆ M⊥ since yj ⊥ M for j = k + 1, …, n. But if x ∈ M⊥, then x = Σ_{i=1}^{n} (x, yi) yi and (x, yi) = 0 for i = 1, …, k since x ∈ M⊥; that is, x = Σ_{i=k+1}^{n} (x, yi) yi ∈ N. Therefore, M⊥ = N. Assertions (i) and (ii) now follow easily. For (iii), M⊥ is spanned by {y_{k+1}, …, yn} and, arguing as above, (M⊥)⊥ must be spanned by {y1, …, yk}, which is just M. □
The decomposition V = M ⊕ M⊥ of an inner product space is called an orthogonal direct sum decomposition. More generally, if M1, …, Mk are subspaces of V such that Mi ⊥ Mj for i ≠ j and V = M1 ⊕ M2 ⊕ … ⊕ Mk, we also speak of the orthogonal direct sum decomposition of V. As we have seen, every direct sum decomposition of a finite dimensional vector space has associated with it two projections. When V is an inner product space and V = M ⊕ M⊥, then the projection on M along M⊥ is called the orthogonal projection onto M. If P is the orthogonal projection onto M, then I - P is the orthogonal projection onto M⊥. The thing that makes a projection an orthogonal projection is that its null space must be the orthogonal complement of its range. After introducing adjoints of linear transformations, a useful characterization of orthogonal projections is given.
When (V, (·, ·)) is an inner product space, a number of special types of linear transformations in L(V, V) arise. First, we discuss the adjoint of a linear transformation. For A ∈ L(V, V), consider (x, Ay). For x fixed, (x, Ay) is a linear function of y and, by Proposition 1.10, there exists a unique vector (which depends on x) z(x) ∈ V such that (x, Ay) = (z(x), y) for all y ∈ V. Thus z defines a function from V to V that takes x into z(x). The verification that z(a1 x1 + a2 x2) = a1 z(x1) + a2 z(x2) is routine. Thus the function z is a linear transformation on V to V, and this leads to the following definition.
Definition 1.16. For A ∈ L(V, V), the unique linear transformation in L(V, V), denoted by A', which satisfies (x, Ay) = (A'x, y) for all x, y ∈ V, is called the adjoint (or transpose) of A.

The uniqueness of A' in Definition 1.16 follows from the observation that if (Bx, y) = (Cx, y) for all x, y ∈ V, then ((B - C)x, y) = 0. Taking y = (B - C)x yields ((B - C)x, (B - C)x) = 0 for all x, so (B - C)x = 0 for all x. Hence B = C.
Proposition 1.12. If A, B ∈ L(V, V), then (AB)' = B'A', and if A is invertible, then (A⁻¹)' = (A')⁻¹. Also, (A')' = A.

Proof. (AB)' is the transformation in L(V, V) that satisfies ((AB)'x, y) = (x, ABy). Using the definition of A' and B', (x, ABy) = (A'x, By) = (B'A'x, y). Thus (AB)' = B'A'. The other assertions are proved similarly. □
Definition 1.17. A linear transformation A in L(V, V) is called:

(i) self-adjoint (or symmetric) if A = A';
(ii) skew-symmetric if A' = -A;
(iii) orthogonal if (Ax, Ay) = (x, y) for x, y ∈ V.

For self-adjoint transformations, A is:

(iv) non-negative definite (or positive semidefinite) if (x, Ax) ≥ 0 for x ∈ V;
(v) positive definite if (x, Ax) > 0 for all x ≠ 0.
The remainder of this section is concerned with a variety of descriptions
and characterizations of the classes of transformations defined above.
Proposition 1.13. Let A ∈ L(V, V). Then

(i) R(A) = (N(A'))⊥.
(ii) R(A) = R(AA').
(iii) N(A) = N(A'A).
(iv) r(A) = r(A').
Proof. Assertion (i) is equivalent to (R(A))⊥ = N(A'). But x ∈ N(A') means that 0 = (y, A'x) for all y ∈ V, and this is equivalent to x ⊥ R(A) since (y, A'x) = (Ay, x). This proves (i). For (ii), it is clear that R(AA') ⊆ R(A). If x ∈ R(A), then x = Ay for some y ∈ V. Write y = y1 + y2 where y1 ∈ R(A') and y2 ∈ (R(A'))⊥. From (i), (R(A'))⊥ = N(A), so Ay2 = 0. Since y1 ∈ R(A'), y1 = A'z for some z ∈ V. Thus x = Ay = Ay1 = AA'z, so x ∈ R(AA').

To prove (iii), if Ax = 0, then A'Ax = 0, so N(A) ⊆ N(A'A). Conversely, if A'Ax = 0, then 0 = (x, A'Ax) = (Ax, Ax), so Ax = 0, and N(A'A) ⊆ N(A). For (iv), since dim(R(A)) + dim(N(A)) = dim(V), dim(R(A')) + dim(N(A')) = dim(V), and R(A) = (N(A'))⊥, it follows that r(A) = r(A'). □
If A ∈ L(V, V) and r(A) = 0, then A = 0 since A must map everything into 0 ∈ V. We now discuss the rank one linear transformations and show that these can be thought of as the "building blocks" for L(V, V).
Proposition 1.14. For A ∈ L(V, V), the following are equivalent:

(i) r(A) = 1.
(ii) There exist x0 ≠ 0 and y0 ≠ 0 in V such that Ax = (y0, x) x0 for x ∈ V.

Proof. That (ii) implies (i) is clear since, if Ax = (y0, x) x0, then R(A) = span{x0}, which is one-dimensional. Thus suppose r(A) = 1. Since R(A) is one-dimensional, there exists x0 ∈ R(A) with x0 ≠ 0 and R(A) = span{x0}. As Ax ∈ R(A) for all x, Ax = a(x) x0 where a(x) is some scalar that depends on x. The linearity of A implies that a(β1 x1 + β2 x2) = β1 a(x1) + β2 a(x2). Thus a is a linear function on V and, by Proposition 1.10, a(x) = (y0, x) for some y0 ∈ V. Since a(x) ≠ 0 for some x ∈ V, y0 ≠ 0. Therefore, (i) implies (ii). □
This description of the rank one linear transformations leads to the following definition.

Definition 1.18. Given x, y ∈ V, the outer product of x and y, denoted by x □ y, is the linear transformation on V to V whose value at z is (x □ y)z = (y, z)x.

Thus x □ y ∈ L(V, V), and x □ y = 0 iff x or y is zero. When x ≠ 0 and y ≠ 0, R(x □ y) = span{x} and N(x □ y) = (span{y})⊥. The result of Proposition 1.14 shows that every rank one transformation is an outer product of two nonzero vectors. The following properties of outer products are easily verified:

(i) x □ (a1 y1 + a2 y2) = a1 x □ y1 + a2 x □ y2.
(ii) (a1 x1 + a2 x2) □ y = a1 x1 □ y + a2 x2 □ y.
(iii) (x □ y)' = y □ x.
(iv) (x1 □ y1)(x2 □ y2) = (y1, x2) x1 □ y2.

One word of caution: the definition of the outer product depends on the inner product on V. When there is more than one inner product for V, care must be taken to indicate which inner product is being used to define the outer product. The claim that rank one linear transformations are the building blocks for L(V, V) is partially justified by the following proposition.
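With the standard inner product on R^n, x □ y is the matrix x y', i.e. np.outer(x, y), since (x □ y)z = (y, z)x = x(y'z). Properties (iii) and (iv) then become matrix identities, checked below (our own numerical sketch):

```python
import numpy as np

x1, y1 = np.array([1.0, 2.0]), np.array([3.0, 4.0])
x2, y2 = np.array([0.0, 1.0]), np.array([5.0, 6.0])
z = np.array([7.0, 8.0])

box = np.outer                      # x box y  =  x y'
# Defining property: (x box y) z = (y, z) x.
lhs_action = box(x1, y1) @ z
rhs_action = np.dot(y1, z) * x1
# Property (iv): (x1 box y1)(x2 box y2) = (y1, x2) x1 box y2.
lhs_iv = box(x1, y1) @ box(x2, y2)
rhs_iv = np.dot(y1, x2) * box(x1, y2)
```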
Proposition 1.15. Let {x1, …, xn} be an orthonormal basis for (V, (·, ·)). Then {xi □ xj | i, j = 1, …, n} is a basis for L(V, V).

Proof. If A ∈ L(V, V), A is determined by the n² numbers aij = (xi, Axj). But the linear transformation B = Σk Σl akl xk □ xl satisfies

(xi, Bxj) = (xi, (Σk Σl akl xk □ xl) xj) = Σk Σl akl (xl, xj)(xi, xk) = aij.

Thus B = A, so every A ∈ L(V, V) is a linear combination of {xi □ xj | i, j = 1, …, n}. Since dim(L(V, V)) = n², the result follows. □
Using outer products, it is easy to give examples of self-adjoint linear transformations. First, since linear combinations of self-adjoint linear transformations are again self-adjoint, the set M of self-adjoint transformations is a subspace of L(V, V). Also, the set N of skew-symmetric transformations is a subspace of L(V, V). It is clear that the only transformation that is both self-adjoint and skew-symmetric is 0, so M ∩ N = {0}. But if A ∈ L(V, V), then

A = (A + A')/2 + (A - A')/2, with (A + A')/2 ∈ M and (A - A')/2 ∈ N.

This shows that L(V, V) = M ⊕ N. To give examples of elements of M, let x1, …, xn be an orthonormal basis for (V, (·, ·)). For each i, xi □ xi is self-adjoint, so for scalars ai, B = Σ ai xi □ xi is self-adjoint. The geometry associated with the transformation B is interesting and easy to describe. Since ||xi|| = 1, (xi □ xi)² = xi □ xi, so xi □ xi is a projection on span{xi} along (span{xi})⊥; that is, xi □ xi is the orthogonal projection on span{xi}, as the null space of xi □ xi is the orthogonal complement of its range. Let Mi = span{xi}, i = 1, …, n. Each Mi is a one-dimensional subspace of (V, (·, ·)), Mi ⊥ Mj if i ≠ j, and M1 ⊕ M2 ⊕ … ⊕ Mn = V. Hence, V is the direct sum of n mutually orthogonal subspaces, and each x ∈ V has the unique representation x = Σ (x, xi) xi where (x, xi) xi = (xi □ xi)x is the projection of x onto Mi, i = 1, …, n. Since B is linear, the value of Bx is completely determined by the value of B on each Mi, i = 1, …, n. However, if y ∈ Mj, then y = a xj for some a ∈ R and By = a Bxj = a Σ ai (xi □ xi) xj = a aj xj = aj y. Thus when B is restricted to Mj, B is aj times the identity transformation, and understanding how B transforms vectors has become particularly simple. In summary, take x ∈ V and write x = Σ (x, xi) xi; then Bx = Σ ai (x, xi) xi. What is especially fascinating and useful is that every self-adjoint transformation in L(V, V) has the representation Σ ai xi □ xi for some orthonormal basis for V and some scalars a1, …, an. This fact is known as the spectral theorem and is discussed in more detail later in this chapter. For the time being, we are content with the following observation about the self-adjoint transformation B = Σ ai xi □ xi: B is positive definite iff ai > 0, i = 1, …, n. This follows since (x, Bx) = Σ ai (x, xi)² and x = 0 iff (x, xi) = 0 for all i = 1, …, n. For exactly the same reasons, B is non-negative definite iff ai ≥ 0 for i = 1, …, n. Proposition 1.16 introduces a useful property of self-adjoint transformations.
Proposition 1.16. If A1 and A2 are self-adjoint linear transformations in L(V, V) such that (x, A1x) = (x, A2x) for all x, then A1 = A2.

Proof. It suffices to show that (x, A1y) = (x, A2y) for all x, y ∈ V. But

(x + y, A1(x + y)) = (x, A1x) + (y, A1y) + 2(x, A1y)
= (x + y, A2(x + y)) = (x, A2x) + (y, A2y) + 2(x, A2y).

Since (z, A1z) = (z, A2z) for all z ∈ V, we see that (x, A1y) = (x, A2y). □
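The representation B = Σ ai xi □ xi described above can be sketched numerically: for a symmetric matrix, np.linalg.eigh returns the scalars ai and an orthonormal basis of eigenvectors xi, and B is recovered as a sum of rank one outer products (our own small example):

```python
import numpy as np

B = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # self-adjoint (symmetric)
a, X = np.linalg.eigh(B)            # eigenvalues a_i, orthonormal columns x_i
# Reassemble B = sum_i a_i * (x_i box x_i).
B_rebuilt = sum(a[i] * np.outer(X[:, i], X[:, i]) for i in range(len(a)))
```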
In the above discussion, it has been observed that, if x ∈ V and ||x|| = 1, then x □ x is the orthogonal projection onto the one-dimensional subspace span{x}. Recall that P ∈ L(V, V) is an orthogonal projection if P is a projection (i.e., P² = P) and if N(P) = (R(P))⊥. The next result characterizes orthogonal projections as those projections that are self-adjoint.
Proposition 1.17. If P ∈ L(V, V), the following are equivalent:
(i) P is an orthogonal projection.
(ii) P² = P = P'.
Proof. If (ii) holds, then P is a projection and P is self-adjoint. By Proposition 1.13, R(P) = (N(P'))⊥ = (N(P))⊥ since P = P'. Thus P is an orthogonal projection. Conversely, if (i) holds, then P² = P since P is a projection. We must show that if P is a projection and R(P) = (N(P))⊥, then P = P'. Since V = R(P) ⊕ N(P), consider x, y ∈ V and write x = x_1 + x_2, y = y_1 + y_2 with x_1, y_1 ∈ R(P) and x_2, y_2 ∈ N(P) = (R(P))⊥. Using the fact that P is the identity on R(P), compute as follows:

(P'x, y) = (x, Py) = (x_1 + x_2, Py_1) = (x_1, y_1) = (Px_1, y_1)
= (P(x_1 + x_2), y_1 + y_2)
= (Px, y).
Since P' is the unique linear transformation that satisfies (x, Py) = (P'x, y), we have P = P'. □
It is sometimes convenient to represent an orthogonal projection in terms of outer products. If P is the orthogonal projection onto M, let {x_1, ..., x_k} be an orthonormal basis for M in (V, (·,·)). Set A = Σx_i □ x_i so A is self-adjoint. If x ∈ M, then x = Σ(x, x_i)x_i and Ax = (Σx_i □ x_i)x = Σ(x, x_i)x_i = x. If x ∈ M⊥, then Ax = 0. Since A agrees with P on M and M⊥, A = P = Σx_i □ x_i. Thus all orthogonal projections are sums of rank one orthogonal projections (given by outer products) and different terms in the sum are orthogonal to each other (i.e., (x_i □ x_i)(x_j □ x_j) = 0 if i ≠ j).
Generalizing this a little bit, two orthogonal projections P_1 and P_2 are called orthogonal if P_1P_2 = 0. It is not hard to show that P_1 and P_2 are orthogonal to each other iff the range of P_1 and the range of P_2 are orthogonal to each other, as subspaces. The next result shows that a sum of orthogonal projections is an orthogonal projection iff each pair of summands is orthogonal.
Proposition 1.18. Let P_1, ..., P_k be orthogonal projections on (V, (·,·)). Then P = P_1 + ··· + P_k is an orthogonal projection iff P_iP_j = 0 for i ≠ j.
Proof. See Halmos (1958, Section 76). □
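For matrices this is easy to illustrate. In the numpy sketch below (the coordinate subspaces of R⁴ are arbitrary illustrative choices), P_1 and P_2 project onto orthogonal subspaces, so P_1P_2 = 0 and their sum is again an orthogonal projection:

```python
import numpy as np

# Orthogonal projections onto span{e1} and span{e2, e3} in R^4.
e = np.eye(4)
P1 = np.outer(e[:, 0], e[:, 0])                        # rank-one projection
P2 = np.outer(e[:, 1], e[:, 1]) + np.outer(e[:, 2], e[:, 2])

assert np.allclose(P1 @ P2, 0)                         # orthogonal to each other

# Hence P1 + P2 is an orthogonal projection: P^2 = P = P'.
P = P1 + P2
assert np.allclose(P @ P, P)
assert np.allclose(P, P.T)
```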
We now turn to a discussion of orthogonal linear transformations on an inner product space (V, (·,·)). Basically, an orthogonal transformation is one that preserves the geometric structure (distance and angles) of the inner product. A variety of characterizations of orthogonal transformations is possible.
Proposition 1.19. If (V, (·,·)) is a finite dimensional inner product space and if A ∈ L(V, V), then the following are equivalent:
(i) (Ax, Ay) = (x, y) for all x, y ∈ V.
(ii) ||Ax|| = ||x|| for all x ∈ V.
(iii) AA' = A'A = I.
(iv) If {x_1, ..., x_n} is an orthonormal basis for (V, (·,·)), then {Ax_1, ..., Ax_n} is also an orthonormal basis for (V, (·,·)).
Proof. Recall that (i) is our definition of an orthogonal transformation. We prove that (i) implies (ii), (ii) implies (iii), (iii) implies (i), and then show that (i) implies (iv) and (iv) implies (ii). That (i) implies (ii) is clear since ||Ax||² = (Ax, Ax). For (ii) implies (iii), (x, x) = (Ax, Ax) = (x, A'Ax) implies that A'A = I since A'A and I are self-adjoint (see Proposition 1.16). But, by the uniqueness of inverses, this shows that A' = A⁻¹, so I = AA⁻¹ = AA' and (iii) holds. Assuming (iii), we have (x, y) = (x, A'Ay) = (Ax, Ay) and (i) holds. If (i) holds and {x_1, ..., x_n} is an orthonormal basis for (V, (·,·)), then δ_ij = (x_i, x_j) = (Ax_i, Ax_j), which implies that {Ax_1, ..., Ax_n} is an orthonormal basis. Now, assume (iv) holds. For x ∈ V, we have x = Σ(x, x_i)x_i and ||x||² = Σ(x, x_i)². Thus

||Ax||² = (Ax, Ax) = (Σ_i(x, x_i)Ax_i, Σ_j(x, x_j)Ax_j)
= Σ_iΣ_j(x, x_i)(x, x_j)(Ax_i, Ax_j) = Σ_iΣ_j(x, x_i)(x, x_j)δ_ij
= Σ_i(x, x_i)² = ||x||².

Therefore (ii) holds. □
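These characterizations are easy to check numerically for matrices. The numpy sketch below (with orthogonal matrices obtained from QR factorizations of arbitrary random matrices) verifies (ii) and (iii), and also the closure properties noted next:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two orthogonal matrices: Q factors of random 3 x 3 matrices.
A1, _ = np.linalg.qr(rng.normal(size=(3, 3)))
A2, _ = np.linalg.qr(rng.normal(size=(3, 3)))
I = np.eye(3)

assert np.allclose(A1 @ A1.T, I) and np.allclose(A1.T @ A1, I)   # (iii)
x = rng.normal(size=3)
assert np.isclose(np.linalg.norm(A1 @ x), np.linalg.norm(x))     # (ii)

# The inverse A1' and the product A1 A2 are again orthogonal.
assert np.allclose(A1.T @ A1, I)
assert np.allclose((A1 @ A2) @ (A1 @ A2).T, I)
```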
Some immediate consequences of the preceding proposition are: if A is orthogonal, so is A⁻¹ = A', and if A_1 and A_2 are orthogonal, then A_1A_2 is orthogonal. Let O(V) denote all the orthogonal transformations on the inner product space (V, (·,·)). Then O(V) is closed under inverses, I ∈ O(V), and O(V) is closed under products of linear transformations. In other words, O(V) is a group of linear transformations on (V, (·,·)), and O(V) is called the orthogonal group of (V, (·,·)). This and many other groups of linear transformations are studied in later chapters.
One characterization of orthogonal transformations on (V, (·,·)) is that they map orthonormal bases into orthonormal bases. Thus given two orthonormal bases, there exists a unique orthogonal transformation that maps one basis onto the other. This leads to the following question. Suppose {x_1, ..., x_k} and {y_1, ..., y_k} are two finite sets of vectors in (V, (·,·)). Under what conditions will there exist an orthogonal transformation A such that Ax_i = y_i for i = 1, ..., k? If such an A ∈ O(V) exists, then (x_i, x_j) = (Ax_i, Ax_j) = (y_i, y_j) for all i, j = 1, ..., k. That this condition is also sufficient for the existence of an A ∈ O(V) that maps x_i to y_i, i = 1, ..., k, is the content of the next result.
Proposition 1.20. Let {x_1, ..., x_k} and {y_1, ..., y_k} be finite sets in (V, (·,·)). The following are equivalent:
(i) (x_i, x_j) = (y_i, y_j) for i, j = 1, ..., k.
(ii) There exists an A ∈ O(V) such that Ax_i = y_i for i = 1, ..., k.
Proof. That (ii) implies (i) is clear, so assume that (i) holds. Let M = span{x_1, ..., x_k}. The idea of the proof is to define A on M using linearity and then extend the definition of A to V using linearity again. Of course, it must be verified that all this makes sense and that the A so defined is orthogonal. The details of this, which are primarily computational, follow.
First, by (i), Σα_i x_i = 0 iff Σα_i y_i = 0 since

(Σα_i x_i, Σα_j x_j) = ΣΣα_iα_j(x_i, x_j) = ΣΣα_iα_j(y_i, y_j) = (Σα_i y_i, Σα_j y_j).

Let N = span{y_1, ..., y_k} and define B on M to N by B(Σα_i x_i) = Σα_i y_i. B is well defined since Σα_i x_i = Σβ_i x_i implies that Σα_i y_i = Σβ_i y_i, and the linearity of B on M is easy to check. Since B maps M onto N, dim(N) ≤ dim(M). But if B(Σα_i x_i) = 0, then Σα_i y_i = 0, so Σα_i x_i = 0. Therefore the null space of B is {0} ⊆ M and dim(M) = dim(N). Let M⊥ and N⊥ be the orthogonal complements of M and N, respectively, and let {u_1, ..., u_s} and {v_1, ..., v_s} be orthonormal bases for M⊥ and N⊥, respectively. Extend the definition of B to V by first defining B(u_i) = v_i for i = 1, ..., s and then extending by linearity. Let A be the linear transformation so defined. We now claim that ||Aw||² = ||w||² for all w ∈ V. To see this, write w = w_1 + w_2 where w_1 ∈ M and w_2 ∈ M⊥. Then Aw_1 ∈ N and Aw_2 ∈ N⊥. Thus ||Aw||² = ||Aw_1 + Aw_2||² = ||Aw_1||² + ||Aw_2||². But w_1 = Σα_i x_i for some scalars α_i. Thus

||Aw_1||² = (A(Σα_i x_i), A(Σα_j x_j)) = ΣΣα_iα_j(Ax_i, Ax_j)
= ΣΣα_iα_j(y_i, y_j) = ΣΣα_iα_j(x_i, x_j)
= (Σα_i x_i, Σα_j x_j) = ||w_1||².

Similarly, ||Aw_2||² = ||w_2||². Since ||w||² = ||w_1||² + ||w_2||², the claim that ||Aw||² = ||w||² is established. By Proposition 1.19, A is orthogonal. □
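In the special case where the x_i span V, no extension step is needed: A is determined by Ax_i = y_i, and condition (i) forces it to be orthogonal. The numpy sketch below illustrates this special case only (the vectors and the orthogonal matrix R used to manufacture the y_i are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 3))                    # columns x_1, x_2, x_3 span R^3
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a known orthogonal map
Y = R @ X                                      # y_i = R x_i

# Condition (i): the two sets have the same inner products (Gram matrices).
assert np.allclose(X.T @ X, Y.T @ Y)

# Since the x_i span R^3, A is determined by AX = Y, and A is orthogonal.
A = Y @ np.linalg.inv(X)
assert np.allclose(A @ X, Y)                   # A x_i = y_i
assert np.allclose(A.T @ A, np.eye(3))         # A is orthogonal
```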
* Example 1.6. Consider the real vector space Rⁿ with the standard basis and the usual inner product. Also, let L_{n,n} be the real vector space of all n × n real matrices. Thus each element of L_{n,n} determines a linear transformation on Rⁿ and vice versa. More precisely, if A is a linear transformation on Rⁿ to Rⁿ and [A] denotes the matrix of A in the standard basis on both the range and domain of A, then [Ax] = [A]x for x ∈ Rⁿ. Here, [Ax] ∈ Rⁿ is the vector of coordinates of Ax in the standard basis and [A]x means the matrix [A] = (a_ij) times the coordinate vector x ∈ Rⁿ. Conversely, if [A] ∈ L_{n,n} and we define a linear transformation A by Ax = [A]x, then the matrix of A is [A]. It is easy to show that if A is a linear
transformation on Rⁿ to Rⁿ with the standard inner product, then [A'] = [A]', where A' denotes the adjoint of A and [A]' denotes the transpose of the matrix [A]. Now, we are in a position to relate the notions of self-adjointness and skew-symmetry of linear transformations to properties of matrices. Proofs of the following two assertions are straightforward and are left to the reader. Let A be a linear transformation on Rⁿ to Rⁿ with matrix [A].
(i) A is self-adjoint iff [A] = [A]'.
(ii) A is skew-symmetric iff [A]' = −[A].
Elements of L_{n,n} that satisfy B = B' are usually called symmetric matrices, while the term skew-symmetric is used if B' = −B, B ∈ L_{n,n}. Also, the matrix B is called positive definite if x'Bx > 0 for all x ∈ Rⁿ, x ≠ 0. Of course, x'Bx is just the standard inner product of x with Bx. Clearly, B is positive definite iff the linear transformation it defines is positive definite.
If A is an orthogonal transformation on Rⁿ to Rⁿ, then [A] must satisfy [A][A]' = [A]'[A] = I_n where I_n is the n × n identity matrix. Thus a matrix B ∈ L_{n,n} is called orthogonal if BB' = B'B = I_n. An interesting geometric interpretation of the condition BB' = B'B = I_n follows. If B = (b_ij), the vectors b_j ∈ Rⁿ with coordinates b_ij, i = 1, ..., n, are the column vectors of B and the vectors c_i ∈ Rⁿ with coordinates b_ij, j = 1, ..., n, are the row vectors of B. The matrix BB' has elements c_i'c_j, and the condition BB' = I_n means that c_i'c_j = δ_ij; that is, the vectors c_1, ..., c_n form an orthonormal basis for Rⁿ in the usual inner product. Similarly, the condition B'B = I_n holds iff the vectors b_1, ..., b_n form an orthonormal basis for Rⁿ. Hence a matrix B is orthogonal iff both its rows and columns determine an orthonormal basis for Rⁿ with the standard inner product.
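A small numpy check of this interpretation (the rotation matrix below is an arbitrary illustrative choice of orthogonal B):

```python
import numpy as np

# An orthogonal 3 x 3 matrix: a rotation about the third coordinate axis.
t = 0.7
B = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])
I = np.eye(3)

assert np.allclose(B @ B.T, I)   # rows c_1, ..., c_n are orthonormal
assert np.allclose(B.T @ B, I)   # columns b_1, ..., b_n are orthonormal
```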
1.4. THE CAUCHY-SCHWARZ INEQUALITY
The form of the Cauchy-Schwarz Inequality given here is general enough to be applicable to both finite and infinite dimensional vector spaces. The examples below illustrate that the generality is needed to treat some standard situations that arise in analysis and in the study of random variables. In a finite dimensional inner product space (V, (·,·)), the inequality established in this section shows that |(x, y)| ≤ ||x|| ||y||, where ||x||² = (x, x). Thus −1 ≤ (x, y)/(||x|| ||y||) ≤ 1 and the quantity (x, y)/(||x|| ||y||) is defined to be the cosine of the angle between the vectors x and y. A variety of applications of the Cauchy-Schwarz Inequality arise in later chapters.
We now proceed with the technical discussion. Suppose that V is a real vector space, not necessarily finite dimensional. Let [·,·] denote a non-negative definite symmetric bilinear function on V × V; that is, [·,·] is a real-valued function on V × V that satisfies (i) [x, y] = [y, x], (ii) [α_1x_1 + α_2x_2, y] = α_1[x_1, y] + α_2[x_2, y], and (iii) [x, x] ≥ 0. It is clear that (i) and (ii) imply that [x, α_1y_1 + α_2y_2] = α_1[x, y_1] + α_2[x, y_2]. The Cauchy-Schwarz Inequality states that [x, y]² ≤ [x, x][y, y]. We also give necessary and sufficient conditions for equality to hold in this inequality. First, a preliminary result.
Proposition 1.21. Let M = {x | [x, x] = 0}. Then M is a subspace of V.
Proof. If x ∈ M and α ∈ R, then [αx, αx] = α²[x, x] = 0, so αx ∈ M. Thus we must show that if x_1, x_2 ∈ M, then x_1 + x_2 ∈ M. For α ∈ R,

0 ≤ [x_1 + αx_2, x_1 + αx_2] = [x_1, x_1] + 2α[x_1, x_2] + α²[x_2, x_2] = 2α[x_1, x_2]

since x_1, x_2 ∈ M. But if 2α[x_1, x_2] ≥ 0 for all α ∈ R, then [x_1, x_2] = 0, and this implies that 0 = [x_1 + αx_2, x_1 + αx_2] for all α ∈ R by the above equality. Therefore, x_1 + αx_2 ∈ M for all α when x_1, x_2 ∈ M, and thus M is a subspace. □
Theorem 1.1 (Cauchy-Schwarz Inequality). Let [·,·] be a non-negative definite symmetric bilinear function on V × V and set M = {x | [x, x] = 0}. Then:
(i) [x, y]² ≤ [x, x][y, y] for x, y ∈ V.
(ii) [x, y]² = [x, x][y, y] iff αx + βy ∈ M for some real α and β not both zero.
Proof. To prove (i), we consider two cases. If x ∈ M, then 0 ≤ [y + αx, y + αx] = [y, y] + 2α[x, y] for all α ∈ R, so [x, y] = 0 and (i) holds. Similarly, if y ∈ M, (i) holds. If x ∉ M and y ∉ M, let x_1 = x/[x, x]^{1/2} and let y_1 = y/[y, y]^{1/2}. Then we must show that |[x_1, y_1]| ≤ 1. This follows from the two inequalities

0 ≤ [x_1 − y_1, x_1 − y_1] = 2 − 2[x_1, y_1]
and
0 ≤ [x_1 + y_1, x_1 + y_1] = 2 + 2[x_1, y_1].
The proof of (i) is now complete.
To prove (ii), first assume that [x, y]² = [x, x][y, y]. If either x ∈ M or y ∈ M, then αx + βy ∈ M for some α, β not both zero. Thus consider x ∉ M and y ∉ M. An examination of the proof of (i) shows that we can have equality in (i) iff either 0 = [x_1 − y_1, x_1 − y_1] or 0 = [x_1 + y_1, x_1 + y_1], and, in either case, this implies that αx + βy ∈ M for some real α, β not both zero. Now, assume αx + βy ∈ M for some real α, β not both zero. If α = 0 or β = 0 or x ∈ M or y ∈ M, we clearly have equality in (i). For the case when αβ ≠ 0, x ∉ M, and y ∉ M, our assumption implies that x_1 + γy_1 ∈ M for some γ ≠ 0, since M is a subspace. Thus there is a real γ ≠ 0 such that 0 = [x_1 + γy_1, x_1 + γy_1] = 1 + 2γ[x_1, y_1] + γ². The equation for the roots of a quadratic shows that this can hold only if |[x_1, y_1]| = 1. Hence equality in (i) holds. □
* Example 1.7. Let (V, (·,·)) be a finite dimensional inner product space and suppose A is a non-negative definite linear transformation on V to V. Then [x, y] = (x, Ay) is a non-negative definite symmetric bilinear function. The set M = {x | (x, Ax) = 0} is equal to N(A); this follows easily from Theorem 1.1(i). Theorem 1.1 shows that (x, Ay)² ≤ (x, Ax)(y, Ay) and provides conditions for equality. In particular, when A is nonsingular, M = {0} and equality holds iff x and y are linearly dependent. Of course, if A = I, then we have (x, y)² ≤ ||x||²||y||², which is one classical form of the Cauchy-Schwarz Inequality.
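A numerical sketch of this example (the non-negative definite A is manufactured as GG' from an arbitrary random G, and the vectors are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(size=(4, 4))
A = G @ G.T                           # A = GG' is non-negative definite
x, y = rng.normal(size=4), rng.normal(size=4)

# (x, Ay)^2 <= (x, Ax)(y, Ay)
lhs = (x @ A @ y) ** 2
rhs = (x @ A @ x) * (y @ A @ y)
assert lhs <= rhs

# Equality when x and y are linearly dependent (here y = 2x).
assert np.isclose((x @ A @ (2 * x)) ** 2,
                  (x @ A @ x) * ((2 * x) @ A @ (2 * x)))
```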
* Example 1.8. In this example, take V to be the set of all continuous real-valued functions defined on a closed bounded interval, say a to b, of the real line. It is easily verified that

[x_1, x_2] = ∫_a^b x_1(t)x_2(t) dt

is symmetric, bilinear, and non-negative definite. Also, [x, x] > 0 unless x = 0 since x is continuous. Hence M = {0}. The Cauchy-Schwarz Inequality yields

(∫_a^b x_1(t)x_2(t) dt)² ≤ ∫_a^b x_1²(t) dt ∫_a^b x_2²(t) dt.
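Approximating the integrals by Riemann sums gives a quick numerical check (the interval [0, 1] and the two functions are arbitrary illustrative choices; the discrete sums themselves satisfy the finite dimensional Cauchy-Schwarz Inequality, so the assertion holds exactly, not just up to discretization error):

```python
import numpy as np

# Riemann sums on a fine grid over [0, 1].
t = np.linspace(0.0, 1.0, 10001)
dt = t[1] - t[0]
x1, x2 = np.sin(3 * t), np.exp(t)

lhs = (np.sum(x1 * x2) * dt) ** 2                      # (integral x1 x2)^2
rhs = (np.sum(x1 ** 2) * dt) * (np.sum(x2 ** 2) * dt)  # integral x1^2 * integral x2^2
assert lhs <= rhs
```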
* Example 1.9. The following example has its origins in the study of the covariance between two real-valued random variables. Consider a probability space (Ω, ℱ, P_0) where Ω is a set, ℱ is a σ-algebra of subsets of Ω, and P_0 is a probability measure on ℱ. A random variable X is a real-valued function defined on Ω such that the inverse image of each Borel set in R is an element of ℱ; symbolically, X⁻¹(B) ∈ ℱ for each Borel set B of R. Sums and products of random variables are random variables, and the constant functions on Ω are random variables. If X is a random variable such that ∫|X(ω)| P_0(dω) < +∞, then X is integrable and we write ℰX for ∫X(ω) P_0(dω).
Now, let V be the collection of all real-valued random variables X such that ℰX² < +∞. It is clear that if X ∈ V, then αX ∈ V for all real α. Since (X_1 + X_2)² ≤ 2(X_1² + X_2²), if X_1 and X_2 are in V, then X_1 + X_2 is in V. Thus V is a real vector space with addition being the pointwise addition of random variables and scalar multiplication being pointwise multiplication of random variables by scalars. For X_1, X_2 ∈ V, the inequality |X_1X_2| ≤ X_1² + X_2² implies that X_1X_2 is integrable. In particular, setting X_2 = 1, X_1 is integrable. Define [·,·] on V × V by [X_1, X_2] = ℰ(X_1X_2). That [·,·] is symmetric and bilinear is clear. Since [X_1, X_1] = ℰX_1² ≥ 0, [·,·] is non-negative definite. The Cauchy-Schwarz Inequality yields (ℰX_1X_2)² ≤ ℰX_1²ℰX_2², and setting X_2 = 1, this gives (ℰX_1)² ≤ ℰX_1². Of course, this is just a verification that the variance of a random variable is non-negative. For future use, let var(X_1) = ℰX_1² − (ℰX_1)². To discuss conditions for equality in the Cauchy-Schwarz Inequality, the subspace M = {X | [X, X] = 0} needs to be described. Since [X, X] = ℰX², X ∈ M iff X is zero, except on a set of P_0 measure zero; that is, X = 0 a.e. (P_0). Therefore, (ℰX_1X_2)² = ℰX_1²ℰX_2² iff αX_1 + βX_2 = 0 a.e. (P_0) for some real α, β not both zero. In particular, var(X_1) = 0 iff X_1 − ℰX_1 = 0 a.e. (P_0).
a.e. (PO). A somewhat more interesting non-negative definite symmetric
bilinear function on V X V is
cov(Xl, X2) -9XIX2 - "I"29
and is called the covariance between X, and X2. Symmetry is clear
and bilinearity is easily checked. Since cov(X,, X,) = X412 -
(;X 1)2 = var(X,), cov(, }) is non-negative definite and M, =
{ Xlcov( X, X) = 0) is just the set of random variables in V that have
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
THE SPACE f (V, W) 29
variance zero. For this case, the Cauchy-Schwarz Inequality is
(CoV{X1, X2))2 I var(Xl)var(X2).
Equality holds iff there exist a, /3, not both zero, such that var(aX, + 13X2) = 0; or equivalently, a(Xl - EX1) + ,B(X2 - &X2) = 0
a.e. (PO) for some a, /B not both zero. The properties of cov{-, *}
given here are used in the next chapter to define the covariance of a
random vector.
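Replacing expectations by sample averages gives a concrete check of the covariance inequality (the distributions below are arbitrary illustrative choices; the sample version of the inequality holds exactly because averages over a sample also define a non-negative definite bilinear function):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)      # correlated with X1

# Sample analogs of cov and var, following cov(X1, X2) = E X1 X2 - (E X1)(E X2).
cov12 = np.mean(X1 * X2) - np.mean(X1) * np.mean(X2)
var1 = np.mean(X1 ** 2) - np.mean(X1) ** 2
var2 = np.mean(X2 ** 2) - np.mean(X2) ** 2
assert cov12 ** 2 <= var1 * var2

# Equality when X3 is an affine function of X1 (an exact algebraic identity).
X3 = 2.0 * X1 + 1.0
cov13 = np.mean(X1 * X3) - np.mean(X1) * np.mean(X3)
var3 = np.mean(X3 ** 2) - np.mean(X3) ** 2
assert np.isclose(cov13 ** 2, var1 * var3)
```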
1.5. THE SPACE L(V, W)
When (V, (·,·)) is an inner product space, the adjoint of a linear transformation in L(V, V) was introduced in Section 1.3 and used to define some special linear transformations in L(V, V). Here, some of the notions discussed in relation to L(V, V) are extended to the case of linear transformations in L(V, W) where (V, (·,·)) and (W, [·,·]) are two inner product spaces. In particular, adjoints and outer products are defined, bilinear functions on V × W are characterized, and Kronecker products are introduced. Of course, all the results in this section apply to L(V, V) by taking (W, [·,·]) = (V, (·,·)), and the reader should take particular notice of this special case.
There is one point that needs some clarification. Given (V, (·,·)) and (W, [·,·]), the adjoint of A ∈ L(V, W), to be defined below, depends on both the inner products (·,·) and [·,·]. However, in the previous discussion of adjoints in L(V, V), it was assumed that the inner product was the same on both the range and the domain of the linear transformation (i.e., V is the domain and range). Whenever we discuss adjoints of A ∈ L(V, V), it is assumed that only one inner product is involved, unless the contrary is explicitly stated; that is, when specializing results from L(V, W) to L(V, V), we take W = V and [·,·] = (·,·).
The first order of business is to define the adjoint of A ∈ L(V, W) where (V, (·,·)) and (W, [·,·]) are inner product spaces. For a fixed w ∈ W, [w, Ax] is a linear function of x ∈ V and, by Proposition 1.10, there exists a unique vector y(w) ∈ V such that [w, Ax] = (y(w), x) for all x ∈ V. It is easy to verify that y(α_1w_1 + α_2w_2) = α_1y(w_1) + α_2y(w_2). Hence y(·) determines a linear transformation on W to V, say A', which satisfies [w, Ax] = (A'w, x) for all w ∈ W and x ∈ V.
Definition 1.19. Given inner product spaces (V, (·,·)) and (W, [·,·]), if A ∈ L(V, W), the unique linear transformation A' ∈ L(W, V) that satisfies [w, Ax] = (A'w, x) for all w ∈ W and x ∈ V is called the adjoint of A.
The existence and uniqueness of A' were demonstrated in the discussion preceding Definition 1.19. It is not hard to show that (A + B)' = A' + B', (A')' = A, and (αA)' = αA'. In the present context, Proposition 1.13 becomes Proposition 1.22.
Proposition 1.22. Suppose A ∈ L(V, W). Then:
(i) R(A) = (N(A'))⊥.
(ii) R(A) = R(AA').
(iii) N(A) = N(A'A).
(iv) r(A) = r(A').
Proof. The proof here is essentially the same as that given for Proposition 1.13 and is left to the reader. □
The notion of an outer product has a natural extension to L(V, W).
Definition 1.20. For x ∈ (V, (·,·)) and w ∈ (W, [·,·]), the outer product w □ x is that linear transformation in L(V, W) given by (w □ x)(y) = (x, y)w for all y ∈ V.
If w = 0 or x = 0, then w □ x = 0. When both w and x are not zero, then w □ x has rank one, R(w □ x) = span{w}, and N(w □ x) = (span{x})⊥. Also, a minor modification of the proof of Proposition 1.14 shows that, if A ∈ L(V, W), then r(A) = 1 iff A = w □ x for some nonzero w and x.
Proposition 1.23. The outer product has the following properties:
(i) (α_1w_1 + α_2w_2) □ x = α_1 w_1 □ x + α_2 w_2 □ x.
(ii) w □ (α_1x_1 + α_2x_2) = α_1 w □ x_1 + α_2 w □ x_2.
(iii) (w □ x)' = x □ w ∈ L(W, V).
If (V_1, (·,·)_1), (V_2, (·,·)_2), and (V_3, (·,·)_3) are inner product spaces with x_1 ∈ V_1, x_2, y_2 ∈ V_2, and y_3 ∈ V_3, then
(iv) (y_3 □ y_2)(x_2 □ x_1) = (x_2, y_2)_2 y_3 □ x_1 ∈ L(V_1, V_3).
Proof. Assertions (i), (ii), and (iii) follow easily. For (iv), consider x ∈ V_1. Then (x_2 □ x_1)x = (x_1, x)_1 x_2, so (y_3 □ y_2)(x_2 □ x_1)x = (x_1, x)_1(y_3 □ y_2)x_2 = (x_1, x)_1(y_2, x_2)_2 y_3 ∈ V_3. However, (x_2, y_2)_2(y_3 □ x_1)x = (x_2, y_2)_2(x_1, x)_1 y_3. Thus (iv) holds. □
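In coordinates, w □ x is the rank-one matrix wx' (this matrix form is made explicit in Example 1.10 below). A numpy sketch with arbitrary illustrative vectors checks the defining equation and properties (iii) and (iv):

```python
import numpy as np

rng = np.random.default_rng(5)
w, x = rng.normal(size=4), rng.normal(size=3)   # w in W = R^4, x in V = R^3

box = np.outer(w, x)                            # the matrix of w [] x is w x'
y = rng.normal(size=3)
assert np.allclose(box @ y, np.dot(x, y) * w)   # (w [] x)y = (x, y)w
assert np.linalg.matrix_rank(box) == 1          # rank one
assert np.allclose(box.T, np.outer(x, w))       # (iii): (w [] x)' = x [] w

# (iv): (y3 [] y2)(x2 [] x1) = (x2, y2) y3 [] x1
x1 = rng.normal(size=2)
x2, y2 = rng.normal(size=3), rng.normal(size=3)
y3 = rng.normal(size=4)
assert np.allclose(np.outer(y3, y2) @ np.outer(x2, x1),
                   np.dot(x2, y2) * np.outer(y3, x1))
```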
There is a natural way to construct an inner product on L(V, W) from inner products on V and W. This construction and its relation to outer products are described in the next proposition.
Proposition 1.24. Let {x_1, ..., x_m} be an orthonormal basis for (V, (·,·)) and let {w_1, ..., w_n} be an orthonormal basis for (W, [·,·]). Then:
(i) {w_i □ x_j | i = 1, ..., n, j = 1, ..., m} is a basis for L(V, W).
Let a_ij = [w_i, Ax_j]. Then:
(ii) A = ΣΣ a_ij w_i □ x_j and the matrix of A is [A] = (a_ij) in the given bases.
If A = ΣΣ a_ij w_i □ x_j and B = ΣΣ b_ij w_i □ x_j, define ⟨A, B⟩ = ΣΣ a_ij b_ij. Then:
(iii) ⟨·,·⟩ is an inner product on L(V, W) and {w_i □ x_j | i = 1, ..., n, j = 1, ..., m} is an orthonormal basis for (L(V, W), ⟨·,·⟩).
Proof. Since dim(L(V, W)) = mn, to prove (i) it suffices to prove (ii). Let B = ΣΣ a_ij w_i □ x_j. Then

[w_k, Bx_l] = Σ_iΣ_j a_ij [w_k, (w_i □ x_j)x_l] = Σ_iΣ_j a_ij δ_ik δ_jl = a_kl,

so [w_i, Bx_j] = [w_i, Ax_j] for i = 1, ..., n and j = 1, ..., m. Therefore, [w, Bx] = [w, Ax] for all w ∈ W and x ∈ V, which implies that [w, (B − A)x] = 0. Choosing w = (B − A)x, we see that (B − A)x = 0 for all x ∈ V and, therefore, B = A. To show that the matrix of A is [A] = (a_ij), recall that the matrix of A consists of the scalars b_kj defined by Ax_j = Σ_k b_kj w_k. The inner product of w_i with each side of this equation is

a_ij = [w_i, Ax_j] = Σ_k b_kj [w_i, w_k] = b_ij,

and the proof of (ii) is complete. For (iii), ⟨·,·⟩ is clearly symmetric and bilinear. Since ⟨A, A⟩ = ΣΣ a_ij², the positivity of ⟨·,·⟩ follows. That {w_i □ x_j | i = 1, ..., n, j = 1, ..., m} is an orthonormal basis for (L(V, W), ⟨·,·⟩) follows immediately from the definition of ⟨·,·⟩. □
A few words are in order concerning the inner product ⟨·,·⟩ on L(V, W). Since {w_i □ x_j | i = 1, ..., n, j = 1, ..., m} is an orthonormal basis, we know that if A ∈ L(V, W), then

A = ΣΣ ⟨A, w_i □ x_j⟩ w_i □ x_j,

since this is the unique expansion of a vector in any orthonormal basis. However, A = ΣΣ [w_i, Ax_j] w_i □ x_j by (ii) of Proposition 1.24. Thus ⟨A, w_i □ x_j⟩ = [w_i, Ax_j] for i = 1, ..., n and j = 1, ..., m. Since both sides of this relation are linear in w_i and x_j, we have ⟨A, w □ x⟩ = [w, Ax] for all w ∈ W and x ∈ V. In particular, if A = w̃ □ x̃, then

⟨w̃ □ x̃, w □ x⟩ = [w, (w̃ □ x̃)x] = [w, (x̃, x)w̃] = [w, w̃](x̃, x).

This relation has some interesting implications.
Proposition 1.25. The inner product ⟨·,·⟩ on L(V, W) satisfies
(i) ⟨w̃ □ x̃, w □ x⟩ = [w̃, w](x̃, x)
for all w, w̃ ∈ W and x, x̃ ∈ V, and ⟨·,·⟩ is the unique inner product with this property. Further, if {z_1, ..., z_n} and {y_1, ..., y_m} are any orthonormal bases for W and V, respectively, then {z_i □ y_j | i = 1, ..., n, j = 1, ..., m} is an orthonormal basis for (L(V, W), ⟨·,·⟩).
Proof. Equation (i) has been verified. If {·,·} is another inner product on L(V, W) that satisfies (i), then

{w_{i_1} □ x_{j_1}, w_{i_2} □ x_{j_2}} = ⟨w_{i_1} □ x_{j_1}, w_{i_2} □ x_{j_2}⟩

for all i_1, i_2 = 1, ..., n and j_1, j_2 = 1, ..., m, where {x_1, ..., x_m} and {w_1, ..., w_n} are the orthonormal bases used to define ⟨·,·⟩. Using (i) of Proposition 1.24 and the bilinearity of inner products, this implies that {A, B} = ⟨A, B⟩ for all A, B ∈ L(V, W). Therefore, the two inner products are the same. The verification that {z_i □ y_j | i = 1, ..., n, j = 1, ..., m} is an orthonormal basis follows easily from (i). □
The result of Proposition 1.25 is a formal statement of the fact that ⟨·,·⟩ does not depend on the particular orthonormal bases used to define it; rather, ⟨·,·⟩ is determined by the inner products on V and W. Whenever V and W are inner product spaces, the symbol ⟨·,·⟩ always means the inner product on L(V, W) as defined above.
* Example 1.10. Consider V = R^m and W = R^n with the usual inner products and the standard bases. Thus we have the inner product ⟨·,·⟩ on L_{m,n}, the linear space of n × m real matrices. For A = (a_ij) and B = (b_ij) in L_{m,n},

⟨A, B⟩ = Σ_{i=1}^n Σ_{j=1}^m a_ij b_ij.

If C = AB' : n × n, then

c_ii = Σ_j a_ij b_ij, i = 1, ..., n,

so ⟨A, B⟩ = Σ c_ii. In other words, ⟨A, B⟩ is just the sum of the diagonal elements of the n × n matrix AB'. This observation leads to the definition of the trace of any square matrix. If C : k × k is a real matrix, the trace of C, denoted by tr C, is the sum of the diagonal elements of C. The identity ⟨A, B⟩ = ⟨B, A⟩ shows that tr AB' = tr B'A for all A, B ∈ L_{m,n}. In the present example, it is clear that w □ x = wx' for x ∈ R^m and w ∈ R^n, so w □ x is just the n × 1 matrix w times the 1 × m matrix x'. Also, the identity in Proposition 1.25 is a reflection of the fact that

tr w̃x̃'xw' = (w̃'w)(x̃'x)

for w, w̃ ∈ R^n and x, x̃ ∈ R^m.
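A numpy check of the trace formulas in this example (with n = 3, m = 2 and arbitrary random matrices as illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
A, B = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))   # n = 3, m = 2

inner = np.sum(A * B)                            # <A, B> = sum_ij a_ij b_ij
assert np.isclose(inner, np.trace(A @ B.T))      # <A, B> = tr AB'
assert np.isclose(np.trace(A @ B.T), np.trace(B.T @ A))   # tr AB' = tr B'A

# <w~ [] x~, w [] x> = (w~'w)(x~'x) in matrix form
w, wt = rng.normal(size=3), rng.normal(size=3)
x, xt = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(np.trace(np.outer(wt, xt) @ np.outer(w, x).T),
                  np.dot(wt, w) * np.dot(xt, x))
```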
If (V, (·,·)) and (W, [·,·]) are inner product spaces and A ∈ L(V, W), then [Ax, w] is linear in x for fixed w and linear in w for fixed x. This observation leads to the following definition.
Definition 1.21. A function f defined on V × W to R is called bilinear if:
(i) f(α_1x_1 + α_2x_2, w) = α_1f(x_1, w) + α_2f(x_2, w).
(ii) f(x, α_1w_1 + α_2w_2) = α_1f(x, w_1) + α_2f(x, w_2).
These conditions apply for scalars α_1 and α_2; x, x_1, x_2 ∈ V and w, w_1, w_2 ∈ W.
Our next result shows there is a natural one-to-one correspondence between bilinear functions on V × W and L(V, W).
Proposition 1.26. If f is a bilinear function on V × W to R, then there exists an A ∈ L(V, W) such that f(x, w) = [Ax, w] for all x ∈ V and w ∈ W. Conversely, each A ∈ L(V, W) determines the bilinear function [Ax, w] on V × W.
Proof. Let {x_1, ..., x_m} be an orthonormal basis for (V, (·,·)) and {w_1, ..., w_n} be an orthonormal basis for (W, [·,·]). Set a_ij = f(x_j, w_i) for i = 1, ..., n and j = 1, ..., m, and let A = ΣΣ a_ij w_i □ x_j. By Proposition 1.24, we have

a_ij = [Ax_j, w_i] = f(x_j, w_i).

The bilinearity of f and of [Ax, w] implies [Ax, w] = f(x, w) for all x ∈ V and w ∈ W. The converse is obvious. □
Thus far, we have seen that L(V, W) is a real vector space and that, if V and W have inner products (·,·) and [·,·], respectively, then L(V, W) has a natural inner product determined by (·,·) and [·,·]. Since L(V, W) is a vector space, there are linear transformations on L(V, W) to other vector spaces, and there is not much more to say in general. However, L(V, W) is built from outer products and it is natural to ask if there are special linear transformations on L(V, W) that transform outer products into outer products. For example, if A ∈ L(V, V) and B ∈ L(W, W), suppose we define B ⊗ A on L(V, W) by (B ⊗ A)C = BCA' where A' denotes the transpose of A ∈ L(V, V). Clearly, B ⊗ A is a linear transformation. If C = w □ x, then (B ⊗ A)(w □ x) = B(w □ x)A' ∈ L(V, W). But for v ∈ V,

(B(w □ x)A')v = B(w □ x)(A'v) = B((x, A'v)w)
= (Ax, v)Bw
= ((Bw) □ (Ax))v.

This calculation shows that (B ⊗ A)(w □ x) = (Bw) □ (Ax), so outer products get mapped into outer products by B ⊗ A. Generalizing this a bit, we have the following definition.
Definition 1.22. Let (V_1, (·,·)_1), (V_2, (·,·)_2), (W_1, [·,·]_1), and (W_2, [·,·]_2) be inner product spaces. For A ∈ L(V_1, V_2) and B ∈ L(W_1, W_2), the Kronecker product of B and A, denoted by B ⊗ A, is the linear transformation on L(V_1, W_1) to L(V_2, W_2) defined by

(B ⊗ A)C = BCA'

for all C ∈ L(V_1, W_1).
In most applications of Kronecker products, V_1 = V_2 and W_1 = W_2, so B ⊗ A is a linear transformation on L(V_1, W_1) to L(V_1, W_1). It is not easy to say in a few words why the transpose of A should appear in the definition of the Kronecker product, but the result below should convince the reader that the definition is the "right" one. Of course, by A', we mean the linear transformation on V_2 to V_1 that satisfies (x_2, Ax_1)_2 = (A'x_2, x_1)_1 for x_1 ∈ V_1 and x_2 ∈ V_2.
Proposition 1.27. In the notation of Definition 1.22,
(i) (B ⊗ A)(w_1 □ v_1) = (Bw_1) □ (Av_1) ∈ L(V_2, W_2).
Also,
(ii) (B ⊗ A)' = B' ⊗ A',
where (B ⊗ A)' denotes the transpose of the linear transformation B ⊗ A on (L(V_1, W_1), ⟨·,·⟩_1) to (L(V_2, W_2), ⟨·,·⟩_2).
Proof. To verify (i), for v_2 ∈ V_2, compute as follows:

[(B ⊗ A)(w_1 □ v_1)](v_2) = B(w_1 □ v_1)A'v_2 = B(v_1, A'v_2)_1 w_1
= (Av_1, v_2)_2 Bw_1 = [(Bw_1) □ (Av_1)](v_2).

Since this holds for all v_2 ∈ V_2, assertion (i) holds. The proof of (ii) requires that we show that B' ⊗ A' satisfies the defining equation of the adjoint; that is, for C_1 ∈ L(V_1, W_1) and C_2 ∈ L(V_2, W_2),

⟨C_2, (B ⊗ A)C_1⟩_2 = ⟨(B' ⊗ A')C_2, C_1⟩_1.

Since outer products generate L(V_1, W_1), it is enough to show the above holds for C_1 = w_1 □ x_1 with w_1 ∈ W_1 and x_1 ∈ V_1. But, by (i) and the definition of the transpose,

⟨C_2, (B ⊗ A)(w_1 □ x_1)⟩_2 = ⟨C_2, Bw_1 □ Ax_1⟩_2 = [C_2Ax_1, Bw_1]_2
= [B'C_2Ax_1, w_1]_1
= ⟨B'C_2A, w_1 □ x_1⟩_1
= ⟨(B' ⊗ A')C_2, w_1 □ x_1⟩_1,

and this completes the proof of (ii). □
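In matrix form, (B ⊗ A)C = BCA' is easy to check in numpy (the dimensions and random matrices below are arbitrary illustrative choices). As an aside not in the text: if matrices in L(V_1, W_1) are flattened in row-major order, the matrix of B ⊗ A is `np.kron(B, A)`; this convention-dependent identity is my addition:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(3, 3))    # acts on V = R^3
B = rng.normal(size=(4, 4))    # acts on W = R^4
C = rng.normal(size=(4, 3))    # an element of L(V, W): a 4 x 3 matrix

# Proposition 1.27(i): outer products map to outer products.
w, v = rng.normal(size=4), rng.normal(size=3)
assert np.allclose(B @ np.outer(w, v) @ A.T, np.outer(B @ w, A @ v))

# With row-major vec (numpy's default flatten), B (x) A is np.kron(B, A).
assert np.allclose((B @ C @ A.T).flatten(), np.kron(B, A) @ C.flatten())
```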
We now turn to the case when A ∈ L(V, V) and B ∈ L(W, W), so B ⊗ A is a linear transformation on L(V, W) to L(V, W). First note that if A is self-adjoint relative to the inner product on V and B is self-adjoint relative to the inner product on W, then Proposition 1.27 shows that B ⊗ A is self-adjoint relative to the natural induced inner product on L(V, W).
Proposition 1.28. For A_i ∈ L(V, V), i = 1, 2, and B_i ∈ L(W, W), i = 1, 2, we have:
(i) (B_1 ⊗ A_1)(B_2 ⊗ A_2) = (B_1B_2) ⊗ (A_1A_2).
(ii) If A_1⁻¹ and B_1⁻¹ exist, then (B_1 ⊗ A_1)⁻¹ = B_1⁻¹ ⊗ A_1⁻¹.
(iii) If A_1 and B_1 are orthogonal projections, then B_1 ⊗ A_1 is an orthogonal projection.
Proof. The proof of (i) goes as follows. For C ∈ L(V, W),

(B_1 ⊗ A_1)(B_2 ⊗ A_2)C = (B_1 ⊗ A_1)(B_2CA_2') = B_1B_2CA_2'A_1'
= B_1B_2C(A_1A_2)' = ((B_1B_2) ⊗ (A_1A_2))C.

Now, (ii) follows immediately from (i). For (iii), it needs to be shown that (B_1 ⊗ A_1)² = B_1 ⊗ A_1 = (B_1 ⊗ A_1)'. The second equality has been verified. The first follows from (i) and the fact that B_1² = B_1 and A_1² = A_1. □
Other properties of Kronecker products are given as the need arises. One issue to think about is this: if C ∈ L(V, W) and B ∈ L(W, W), then BC can be thought of as the product of the two linear transformations B and C. However, BC can also be interpreted as (B ⊗ I)C, I ∈ L(V, V); that is, BC is the value of the linear transformation B ⊗ I at C. Of course, the particular situation determines the appropriate way to think about BC.
Linear isometries are the final subject of discussion in this section, and are a natural generalization of orthogonal transformations on (V, (·,·)). Consider finite dimensional inner product spaces V and W with inner products (·,·) and [·,·], and assume that dim V ≤ dim W. The reason for this assumption is made clear in a moment.
Definition 1.23. A linear transformation A e C&(V, W) is a linear isometry
if (vI, v2) = [Av,, Av2l for all vI, v2 E V.
If A is a linear isometry and v E V, v * 0, then 0 < (v, v) = [Av, Av].
This implies that %)t(A) = {0), so necessarily dim V < dim W. When W = V
and [·, ·] = (·, ·), linear isometries are simply orthogonal transformations. As with orthogonal transformations, a number of equivalent descriptions of linear isometries are available.

Proposition 1.29. For A ∈ ℒ(V, W) (dim V ≤ dim W), the following are equivalent:

(i) A is a linear isometry.
(ii) A'A = I ∈ ℒ(V, V).
(iii) [Av, Av] = (v, v) for all v ∈ V.
Proof. The proof is similar to the proof of Proposition 1.19 and is left to the reader. □
The next proposition is an analog of Proposition 1.20 that covers linear isometries and that has a number of applications.
Proposition 1.30. Let v₁, ..., vₖ be vectors in (V, (·, ·)), let w₁, ..., wₖ be vectors in (W, [·, ·]), and assume dim V ≤ dim W. There exists a linear isometry A ∈ ℒ(V, W) such that Avᵢ = wᵢ, i = 1, ..., k, iff (vᵢ, vⱼ) = [wᵢ, wⱼ] for i, j = 1, ..., k.
Proof. The proof is a minor modification of that given for Proposition 1.20 and the details are left to the reader. □
Proposition 1.31. Suppose A ∈ ℒ(V, W₁) and B ∈ ℒ(V, W₂) where dim W₂ ≤ dim W₁, and (·, ·), [·, ·]₁, and [·, ·]₂ are inner products on V, W₁, and W₂. Then A'A = B'B iff there exists a linear isometry Ψ ∈ ℒ(W₂, W₁) such that A = ΨB.

Proof. If A = ΨB, then A'A = B'Ψ'ΨB = B'B, since Ψ'Ψ = I ∈ ℒ(W₂, W₂). Conversely, suppose A'A = B'B and let {v₁, ..., vₘ} be a basis for V. With xᵢ = Avᵢ ∈ W₁ and yᵢ = Bvᵢ ∈ W₂, i = 1, ..., m, we have

[xᵢ, xⱼ]₁ = [Avᵢ, Avⱼ]₁ = (vᵢ, A'Avⱼ) = (vᵢ, B'Bvⱼ) = [Bvᵢ, Bvⱼ]₂ = [yᵢ, yⱼ]₂

for i, j = 1, ..., m. Applying Proposition 1.30, there exists a linear isometry Ψ ∈ ℒ(W₂, W₁) such that Ψyᵢ = xᵢ for i = 1, ..., m. Therefore ΨBvᵢ = Avᵢ for i = 1, ..., m and, since {v₁, ..., vₘ} is a basis for V, ΨB = A. □
• Example 1.11. Take V = Rᵐ and W = Rⁿ with the usual inner products and assume m ≤ n. Then a matrix A = {aᵢⱼ}: n × m is a linear isometry iff A'A = Iₘ where Iₘ is the m × m identity matrix. If a₁, ..., aₘ denote the columns of the matrix A, then A'A is just the m × m matrix with elements aᵢ'aⱼ, i, j = 1, ..., m. Thus the condition A'A = Iₘ means that aᵢ'aⱼ = δᵢⱼ, so A is a linear isometry on Rᵐ to Rⁿ iff the columns of A are an orthonormal set of vectors in Rⁿ. Now, let 𝓕_{m,n} be the set of all n × m real matrices that are linear isometries; that is, A ∈ 𝓕_{m,n} iff A'A = Iₘ. The set 𝓕_{m,n} is sometimes called the space of m-frames in Rⁿ as the columns of A form an m-dimensional orthonormal "frame" in Rⁿ. When m = 1, 𝓕_{1,n} is just the set of vectors in Rⁿ of length one, and when m = n, 𝓕_{n,n} is the set of all n × n orthogonal matrices. We have much more to say about 𝓕_{m,n} in later chapters.

An immediate application of Proposition 1.31 shows that, if A: n₁ × m and B: n₂ × m are real matrices with n₂ ≤ n₁, then A'A = B'B iff A = ΨB where Ψ: n₁ × n₂ satisfies Ψ'Ψ = I_{n₂}. In particular, when n₁ = n₂, A'A = B'B iff there exists an orthogonal matrix Ψ: n₁ × n₁ such that A = ΨB.
1.6. DETERMINANTS AND EIGENVALUES
At this point in our discussion we are forced, by mathematical necessity, to introduce complex numbers and complex matrices. Eigenvalues are defined as the roots of a certain polynomial and, to ensure the existence of roots, complex numbers arise. This section begins with complex matrices, determinants, and their basic properties. After defining eigenvalues, the properties of the eigenvalues of linear transformations on real vector spaces are described.

In what follows, ℂ denotes the field of complex numbers and the symbol i is reserved for √−1. If α ∈ ℂ, say α = a + ib, then ᾱ = a − ib is the complex conjugate of α. Let ℂⁿ be the set of all n-tuples (henceforth called vectors) of complex numbers; that is, x ∈ ℂⁿ iff x is a column vector with coordinates x₁, ..., xₙ, where xⱼ ∈ ℂ for j = 1, ..., n. The number xⱼ is called the jth coordinate of x, j = 1, ..., n. For x, y ∈ ℂⁿ, x + y is defined to be the vector with coordinates xⱼ + yⱼ, j = 1, ..., n, and for α ∈ ℂ, αx is the vector with coordinates αxⱼ, j = 1, ..., n. Replacing R by ℂ in Definition 1.1, we see that ℂⁿ satisfies all the axioms of a vector space where scalars are now taken to be complex numbers, rather than real numbers. More generally, if we replace R by ℂ in (II) of Definition 1.1, we have the definition of a complex vector space. All of the definitions, results, and proofs in Sections 1.1 and 1.2 are valid, without change, for complex vector spaces. In particular, ℂⁿ is an n-dimensional complex vector space and the standard basis for ℂⁿ is {ε₁, ..., εₙ} where εⱼ has its jth coordinate equal to one and the remaining coordinates are zero.
As with real matrices, an m × n array A = {aⱼₖ}, j = 1, ..., m, k = 1, ..., n, where aⱼₖ ∈ ℂ, is called an m × n complex matrix. If A = {aⱼₖ}: m × n and B = {bₖₗ}: n × p are complex matrices, then C = AB is the m × p complex matrix with entries cⱼₗ = Σₖ aⱼₖbₖₗ for j = 1, ..., m and l = 1, ..., p. The matrix C is called the product of A and B (in that order). In particular, when p = 1, the matrix B is n × 1 so B is an element of ℂⁿ. Thus if x ∈ ℂⁿ (x now plays the role of B) and A: m × n is a complex matrix, Ax ∈ ℂᵐ. Clearly, each A: m × n determines a linear transformation on ℂⁿ to ℂᵐ via the definition of Ax for x ∈ ℂⁿ. For an m × n complex matrix A = {aⱼₖ}, the conjugate transpose of A, denoted by A*, is the n × m matrix whose (k, j) element is āⱼₖ, the complex conjugate of aⱼₖ, for k = 1, ..., n and j = 1, ..., m. In particular, if x ∈ ℂⁿ, x* denotes the conjugate transpose of x. The following relation is easily verified:

conj(y*Ax) = x*A*y

where y ∈ ℂᵐ, x ∈ ℂⁿ, A is an m × n complex matrix, and conj(·) denotes the complex conjugate of the scalar y*Ax.

With the preliminaries out of the way, we now want to define determinant functions. Let 𝒞ₙ denote the set of all n × n complex matrices, so 𝒞ₙ is an n²-dimensional complex vector space. If A ∈ 𝒞ₙ, write A = (a₁, a₂, ..., aₙ) where aⱼ is the jth column of A.
Definition 1.24. A function D defined on 𝒞ₙ and taking values in ℂ is called a determinant function if:

(i) D(A) = D(a₁, ..., aₙ) is linear in each column vector aⱼ when the other columns are held fixed. That is,

D(a₁, ..., αaⱼ + βbⱼ, ..., aₙ) = αD(a₁, ..., aⱼ, ..., aₙ) + βD(a₁, ..., bⱼ, ..., aₙ)

for α, β ∈ ℂ.

(ii) For any two indices j and k, j < k,

D(a₁, ..., aⱼ, ..., aₖ, ..., aₙ) = −D(a₁, ..., aₖ, ..., aⱼ, ..., aₙ).
Functions D on 𝒞ₙ to ℂ that satisfy (i) are called n-linear since they are linear in each of the n vectors a₁, ..., aₙ when the remaining ones are held fixed. If D is n-linear and satisfies (ii), D is sometimes called an alternating n-linear function, since D(A) changes sign if two columns of A are interchanged. The basic result that relates all determinant functions is the following.
Proposition 1.32. The set of determinant functions is a one-dimensional complex vector space. If D is a determinant function and D ≠ 0, then D(I) ≠ 0 where I is the n × n identity matrix in 𝒞ₙ.
Proof. We briefly outline the proof of this proposition since the proof is instructive and yields the classical formula defining the determinant of an n × n matrix. Suppose D(A) = D(a₁, ..., aₙ) is a determinant function. For each k = 1, ..., n, aₖ = Σⱼ aⱼₖεⱼ where {ε₁, ..., εₙ} is the standard basis for ℂⁿ and A = {aⱼₖ}: n × n. Since D is n-linear and a₁ = Σⱼ aⱼ₁εⱼ,

D(a₁, ..., aₙ) = Σ_{j₁} a_{j₁1} D(ε_{j₁}, a₂, ..., aₙ).

Applying this same argument to a₂ = Σⱼ aⱼ₂εⱼ,

D(a₁, ..., aₙ) = Σ_{j₁} Σ_{j₂} a_{j₁1} a_{j₂2} D(ε_{j₁}, ε_{j₂}, a₃, ..., aₙ).

Continuing in the obvious way,

D(a₁, ..., aₙ) = Σ_{j₁,...,jₙ} a_{j₁1} a_{j₂2} ⋯ a_{jₙn} D(ε_{j₁}, ..., ε_{jₙ})

where the summation extends over all j₁, ..., jₙ with 1 ≤ jᵢ ≤ n for i = 1, ..., n. The above formula shows that a determinant function is determined by the nⁿ numbers D(ε_{j₁}, ..., ε_{jₙ}) for 1 ≤ jᵢ ≤ n, and this fact followed solely from the assumption that D is n-linear. But since D is alternating, it is clear that, if two columns of A are the same, then D(A) = 0. In particular, if two indices jᵢ and jₖ are the same, then D(ε_{j₁}, ..., ε_{jₙ}) = 0. Thus the summation above extends only over those indices where j₁, ..., jₙ are all distinct. In other words, the summation extends over all permutations of the set {1, 2, ..., n}. If π denotes a permutation of 1, 2, ..., n, then

D(a₁, ..., aₙ) = Σ_π a_{π(1)1} ⋯ a_{π(n)n} D(ε_{π(1)}, ..., ε_{π(n)})

where the summation now extends over all n! permutations. But for a fixed permutation π(1), ..., π(n) of 1, ..., n, there is a sequence of pairwise interchanges of the elements of π(1), ..., π(n) that results in the order 1, 2, ..., n. In fact there are many such sequences of interchanges, but the number of interchanges is always odd or always even (see Hoffman and Kunze, 1971, Section 5.3). Using this, let sgn(π) = 1 if the number of interchanges required to put π(1), ..., π(n) into the order 1, 2, ..., n is even and let sgn(π) = −1 otherwise. Now, since D is alternating, it is clear that

D(ε_{π(1)}, ..., ε_{π(n)}) = sgn(π) D(ε₁, ..., εₙ).

Therefore, we have arrived at the formula

D(a₁, ..., aₙ) = D(I) Σ_π sgn(π) a_{π(1)1} ⋯ a_{π(n)n}

since D(I) = D(ε₁, ..., εₙ). It is routine to verify that, for any complex number α, the function defined by

D_α(a₁, ..., aₙ) = α Σ_π sgn(π) a_{π(1)1} ⋯ a_{π(n)n}

is a determinant function, and the argument given above shows that every determinant function is a D_α for some α ∈ ℂ. This completes the proof; for more details, the reader is referred to Hoffman and Kunze (1971, Chapter 5). □
Definition 1.25. If A ∈ 𝒞ₙ, the determinant of A, denoted by det(A) (or det A), is defined to be D₁(A) where D₁ is the unique determinant function with D₁(I) = 1.

The proof of Proposition 1.32 gives the formula for det(A), but that is not of much concern to us. The properties of det(·) given below are most easily established using the fact that det(·) is an alternating n-linear function of the columns of A.
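For small n, the permutation formula from the proof of Proposition 1.32 can be evaluated directly. A Python sketch comparing it with numpy's determinant; counting inversions is one standard way to compute sgn(π):

```python
import numpy as np
from itertools import permutations

def sgn(perm):
    """Sign of a permutation, via the parity of its inversion count."""
    inv = sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
              if perm[i] > perm[j])
    return -1 if inv % 2 else 1

def det_by_permutations(A):
    """det(A) = sum over permutations pi of sgn(pi) * a_{pi(1)1} ... a_{pi(n)n}."""
    n = A.shape[0]
    return sum(sgn(p) * np.prod([A[p[k], k] for k in range(n)])
               for p in permutations(range(n)))

A = np.array([[1.0, 2.0, 0.0],
              [3.0, -1.0, 4.0],
              [0.0, 5.0, 2.0]])
print(np.isclose(det_by_permutations(A), np.linalg.det(A)))  # True
```

Since the sum has n! terms, this is only a pedagogical check; numerical software computes determinants by factorization instead.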
Proposition 1.33. For A, B ∈ 𝒞ₙ:

(i) det(AB) = det A det B.
(ii) det A* = conj(det A).
(iii) det A ≠ 0 iff the columns of A are linearly independent vectors in the complex vector space ℂⁿ.

If A₁₁: n₁ × n₁, A₁₂: n₁ × n₂, A₂₁: n₂ × n₁, and A₂₂: n₂ × n₂ are complex matrices, then:

(iv) det([A₁₁ 0; A₂₁ A₂₂]) = det([A₁₁ A₁₂; 0 A₂₂]) = det A₁₁ det A₂₂, where the blocks of a partitioned matrix are listed row by row.

(v) If A is a real matrix, then det(A) is real and det(A) = 0 iff the columns of A are linearly dependent vectors over the real vector space Rⁿ.
Proof. The proofs of these assertions can be found in Hoffman and Kunze (1971, Chapter 5). □
These properties of det(·) have a number of useful and interesting implications. If A has columns a₁, ..., aₙ, then the range of the linear transformation determined by A is just span{a₁, ..., aₙ}. Thus A is invertible iff span{a₁, ..., aₙ} = ℂⁿ iff det A ≠ 0. If det A ≠ 0, then 1 = det AA⁻¹ = det A det A⁻¹, so det A⁻¹ = 1/det A. Consider complex matrices B₁₁: n₁ × n₁, B₁₂: n₁ × n₂, B₂₁: n₂ × n₁, and B₂₂: n₂ × n₂. Then it is easy to verify the identity

[A₁₁ A₁₂; A₂₁ A₂₂][B₁₁ B₁₂; B₂₁ B₂₂] = [A₁₁B₁₁ + A₁₂B₂₁  A₁₁B₁₂ + A₁₂B₂₂; A₂₁B₁₁ + A₂₂B₂₁  A₂₁B₁₂ + A₂₂B₂₂]

where A₁₁, A₁₂, A₂₁, and A₂₂ are as in Proposition 1.33 and blocks are listed row by row. This tells us how to multiply the two (n₁ + n₂) × (n₁ + n₂) complex matrices in terms of their blocks. Of course, such matrices are called partitioned matrices.
Proposition 1.34. Let A be a complex matrix, partitioned as above. If det A₁₁ ≠ 0, then:

(i) det([A₁₁ A₁₂; A₂₁ A₂₂]) = det A₁₁ det(A₂₂ − A₂₁A₁₁⁻¹A₁₂).

If det A₂₂ ≠ 0, then:

(ii) det([A₁₁ A₁₂; A₂₁ A₂₂]) = det A₂₂ det(A₁₁ − A₁₂A₂₂⁻¹A₂₁).

Proof. For (i), first note that

det([I_{n₁}  −A₁₁⁻¹A₁₂; 0  I_{n₂}]) = 1

by Proposition 1.33 (iv). Therefore, by (i) of Proposition 1.33,

det([A₁₁ A₁₂; A₂₁ A₂₂]) = det([A₁₁ A₁₂; A₂₁ A₂₂]) det([I_{n₁}  −A₁₁⁻¹A₁₂; 0  I_{n₂}])
= det([A₁₁  0; A₂₁  A₂₂ − A₂₁A₁₁⁻¹A₁₂])
= det A₁₁ det(A₂₂ − A₂₁A₁₁⁻¹A₁₂).

The proof of (ii) is similar. □
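Proposition 1.34 (i) is easy to check numerically. A numpy sketch; the random A₁₁ is almost surely nonsingular:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 3, 2
A11 = rng.standard_normal((n1, n1))
A12 = rng.standard_normal((n1, n2))
A21 = rng.standard_normal((n2, n1))
A22 = rng.standard_normal((n2, n2))
A = np.block([[A11, A12], [A21, A22]])

# Proposition 1.34 (i): det(A) = det(A11) * det(A22 - A21 A11^{-1} A12)
schur = A22 - A21 @ np.linalg.inv(A11) @ A12
print(np.isclose(np.linalg.det(A),
                 np.linalg.det(A11) * np.linalg.det(schur)))  # True
```

The second factor, A₂₂ − A₂₁A₁₁⁻¹A₁₂, is what later literature calls the Schur complement of A₁₁.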
Proposition 1.35. Let A: n × m and B: m × n be complex matrices. Then

det(Iₙ + AB) = det(Iₘ + BA).

Proof. Apply the previous proposition to the partitioned matrix

[Iₙ  −A; B  Iₘ]. □
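A quick numerical check of Proposition 1.35; note that the two identity matrices have different sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 2
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, n))

# det(I_n + AB) = det(I_m + BA), even though AB is n x n and BA is m x m.
print(np.isclose(np.linalg.det(np.eye(n) + A @ B),
                 np.linalg.det(np.eye(m) + B @ A)))  # True
```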
We now turn to a discussion of the eigenvalues of an n × n complex matrix. The definition of an eigenvalue is motivated by the following considerations. Let A ∈ 𝒞ₙ. To analyze the linear transformation determined by A, we would like to find a basis x₁, ..., xₙ of ℂⁿ such that Axⱼ = λⱼxⱼ, j = 1, ..., n, where λⱼ ∈ ℂ. If this were possible, then the matrix of the linear transformation in the basis {x₁, ..., xₙ} would simply be the diagonal matrix with diagonal elements λ₁, λ₂, ..., λₙ and all other elements zero. Of course, this says that the linear transformation is λᵢ times the identity transformation when restricted to span{xᵢ}. Unfortunately, it is not possible to find such a basis for each linear transformation. However, the numbers λ₁, ..., λₙ, which are called eigenvalues after we have an appropriate definition, can be interpreted in another way. Given λ ∈ ℂ, Ax = λx for some nonzero vector x iff (A − λI)x = 0, and this is equivalent to saying that A − λI is a singular matrix, that is, det(A − λI) = 0. In other words, A − λI is singular iff there exists x ≠ 0 such that Ax = λx. However, using the formula for det(·), a bit of calculation shows that

det(A − λI) = (−1)ⁿλⁿ + a_{n−1}λ^{n−1} + ⋯ + a₁λ + a₀

where a₀, a₁, ..., a_{n−1} are complex numbers. Thus det(A − λI) is a polynomial of degree n in the complex variable λ, and it has n roots (counting multiplicities). This leads to the following definition.
Definition 1.26. Let A ∈ 𝒞ₙ and set

p(λ) = det(A − λI).

The nth degree polynomial p is called the characteristic polynomial of A, and the n roots of the polynomial (counting multiplicities) are called the eigenvalues of A.

If p(λ) = det(A − λI) has roots λ₁, ..., λₙ, then it is clear that

p(λ) = ∏_{j=1}^{n} (λⱼ − λ)

since the right-hand side of the above equation is an nth degree polynomial with roots λ₁, ..., λₙ and the coefficient of λⁿ is (−1)ⁿ. In particular,

p(0) = ∏_{j=1}^{n} λⱼ = det(A),

so the determinant of A is the product of its eigenvalues.

There is a particular case when the characteristic polynomial of A can be computed explicitly. If A ∈ 𝒞ₙ, A = {aⱼₖ} is called lower triangular if aⱼₖ = 0 when k > j. Thus A is lower triangular if all the elements above the diagonal of A are zero. An application of Proposition 1.33 (iv) shows that, when A is lower triangular,

det(A) = ∏_{j=1}^{n} aⱼⱼ.

But when A is lower triangular with diagonal elements aⱼⱼ, j = 1, ..., n, then A − λI is lower triangular with diagonal elements (aⱼⱼ − λ), j = 1, ..., n. Thus

p(λ) = det(A − λI) = ∏_{j=1}^{n} (aⱼⱼ − λ),

so A has eigenvalues a₁₁, ..., aₙₙ.

Before returning to real vector spaces, we first establish the existence of eigenvectors (to be defined below).
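Both facts just derived (the eigenvalues of a triangular matrix sit on its diagonal, and det(A) is the product of the eigenvalues) can be illustrated with numpy:

```python
import numpy as np

# A lower triangular matrix: its eigenvalues are its diagonal elements,
# and det(A) is the product of those eigenvalues.
A = np.array([[2.0, 0.0, 0.0],
              [1.0, -3.0, 0.0],
              [4.0, 5.0, 0.5]])

eig = np.linalg.eigvals(A)
print(np.allclose(np.sort(eig), np.sort(np.diag(A))))  # True
print(np.isclose(np.linalg.det(A), np.prod(eig)))      # True
```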
Proposition 1.36. If λ is an eigenvalue of A ∈ 𝒞ₙ, then there exists a nonzero vector x ∈ ℂⁿ such that Ax = λx.

Proof. Since λ is an eigenvalue of A, the matrix A − λI is singular, so the dimension of the range of A − λI is less than n. Thus the dimension of the null space of A − λI is greater than 0. Hence there is a nonzero vector in the null space of A − λI, say x, and (A − λI)x = 0. □

Definition 1.27. If A ∈ 𝒞ₙ, a nonzero vector x ∈ ℂⁿ is called an eigenvector of A if there exists a complex number λ ∈ ℂ such that Ax = λx.

If x ≠ 0 is an eigenvector of A and Ax = λx, then (A − λI)x = 0 so A − λI is singular. Therefore, λ must be an eigenvalue of A. Conversely, if λ ∈ ℂ is an eigenvalue, Proposition 1.36 shows there is an eigenvector x such that Ax = λx.
Now, suppose V is an n-dimensional real vector space and B is a linear transformation on V to V. We want to define the characteristic polynomial, and hence the eigenvalues, of B. Let {v₁, ..., vₙ} be a basis for V so the matrix of B is [B] = {bⱼₖ} where the bⱼₖ's satisfy Bvₖ = Σⱼ bⱼₖvⱼ. The characteristic polynomial of [B] is

p(λ) = det([B] − λI)

where I is the n × n identity matrix and λ ∈ ℂ. If we could show that p(λ) does not depend on the particular basis for V, then we would have a reasonable definition of the characteristic polynomial of B.

Proposition 1.37. Suppose {v₁, ..., vₙ} and {y₁, ..., yₙ} are bases for the real vector space V, and let B ∈ ℒ(V, V). Let [B] = {bⱼₖ} be the matrix of B in the basis {v₁, ..., vₙ} and let [B]₁ = {aⱼₖ} be the matrix of B in the basis {y₁, ..., yₙ}. Then there exists a nonsingular real matrix C = {cⱼₖ} such that

[B]₁ = C⁻¹[B]C.
Proof. The numbers aⱼₖ are uniquely determined by the relations

Byₖ = Σⱼ aⱼₖyⱼ,  k = 1, ..., n.

Define the linear transformation C₁ on V to V by C₁vⱼ = yⱼ, j = 1, ..., n. Then C₁ is nonsingular since C₁ maps a basis onto a basis. Therefore,

BC₁vₖ = Σⱼ aⱼₖC₁vⱼ = C₁(Σⱼ aⱼₖvⱼ)

and this yields

C₁⁻¹BC₁vₖ = Σⱼ aⱼₖvⱼ.

Thus the matrix of C₁⁻¹BC₁ in the basis {v₁, ..., vₙ} is {aⱼₖ}. From Proposition 1.5, we have

[B]₁ = {aⱼₖ} = [C₁⁻¹BC₁] = [C₁⁻¹][B][C₁] = [C₁]⁻¹[B][C₁]

where [C₁] is the matrix of C₁ in the basis {v₁, ..., vₙ}. Setting C = [C₁], the conclusion follows. □
The above proposition implies that

p(λ) = det([B] − λI) = det(C⁻¹([B] − λI)C) = det(C⁻¹[B]C − λI) = det([B]₁ − λI).

Thus p(λ) does not depend on the particular basis we use to represent B and, therefore, we call p the characteristic polynomial of the linear transformation B. The suggestive notation

p(λ) = det(B − λI)

is often used. Notice that Proposition 1.37 also shows that it makes sense to define det(B) for B ∈ ℒ(V, V) as the value of det[B] in any basis, since the value does not depend on the basis. Of course, the roots of the polynomial p(λ) = det(B − λI) are called the eigenvalues of the linear transformation B. Even though [B] is a real matrix in any basis for V, some or all of the eigenvalues of B may be complex numbers. Proposition 1.37 also allows us to define the trace of A ∈ ℒ(V, V). If {v₁, ..., vₙ} is a basis for V, let tr A = tr[A] where [A] is the matrix of A in the given basis. For any nonsingular matrix C,

tr[A] = tr CC⁻¹[A] = tr C⁻¹[A]C,

which shows that our definition of tr A does not depend on the particular basis chosen.
The next result summarizes the properties of eigenvalues for linear transformations on a real inner product space.

Proposition 1.38. Suppose (V, (·, ·)) is a finite dimensional real inner product space and let A ∈ ℒ(V, V).

(i) If λ ∈ ℂ is an eigenvalue of A, then λ̄ is an eigenvalue of A.
(ii) If A is symmetric, the eigenvalues of A are real.
(iii) If A is skew-symmetric, then the eigenvalues of A are pure imaginary.
(iv) If A is orthogonal and λ is an eigenvalue of A, then λλ̄ = 1.
Proof. If A ∈ ℒ(V, V), then the characteristic polynomial of A is

p(λ) = det([A] − λI),  λ ∈ ℂ,

where [A] is the matrix of A in a basis for V. An examination of the formula for det(·) shows that

p(λ) = (−1)ⁿλⁿ + a_{n−1}λ^{n−1} + ⋯ + a₁λ + a₀

where a₀, ..., a_{n−1} are real numbers since [A] is a real matrix. Thus p(λ̄) = conj(p(λ)), so whenever p(λ) = 0, also p(λ̄) = 0. This establishes assertion (i).

For (ii), let λ be an eigenvalue of A, and let {v₁, ..., vₙ} be an orthonormal basis for (V, (·, ·)). Thus the matrix of A, say [A], is a real symmetric matrix and [A] − λI is singular as a matrix acting on ℂⁿ. By Proposition 1.36, there exists a nonzero vector x ∈ ℂⁿ such that [A]x = λx. Thus x*[A]x = λx*x. But since [A] is real and symmetric,

λ̄x*x = conj(x*[A]x) = x*[A]*x = x*[A]x = λx*x.

Thus λ̄x*x = λx*x and, since x ≠ 0, λ̄ = λ, so λ is real.
To prove (iii), again let [A] be the matrix of A in the orthonormal basis {v₁, ..., vₙ} so [A]' = [A]* = −[A]. If λ is an eigenvalue of A, then there exists x ∈ ℂⁿ, x ≠ 0, such that [A]x = λx. Thus x*[A]x = λx*x and

λ̄x*x = conj(x*[A]x) = x*[A]*x = −x*[A]x = −λx*x.

Since x ≠ 0, λ̄ = −λ, which implies that λ = ib for some real number b; that is, λ is pure imaginary and this proves (iii).

If A is orthogonal, then [A] is an n × n orthogonal matrix in the orthonormal basis {v₁, ..., vₙ}. Again, if λ is an eigenvalue of A, then [A]x = λx for some x ∈ ℂⁿ, x ≠ 0. Thus λ̄x* = x*[A]* = x*[A]' since [A] is a real matrix. Therefore

λλ̄x*x = x*[A]'[A]x = x*x

as [A]'[A] = I. Hence λλ̄ = 1 and the proof of Proposition 1.38 is complete. □
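All three special cases in Proposition 1.38 are easy to observe numerically. A numpy sketch; the QR factorization of a random matrix is used only as a convenient source of an orthogonal matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n))

S = M + M.T              # symmetric: eigenvalues are real
K = M - M.T              # skew-symmetric: eigenvalues are pure imaginary
Q, _ = np.linalg.qr(M)   # orthogonal: eigenvalues satisfy |lambda| = 1

print(np.allclose(np.linalg.eigvals(S).imag, 0))     # True
print(np.allclose(np.linalg.eigvals(K).real, 0))     # True
print(np.allclose(np.abs(np.linalg.eigvals(Q)), 1))  # True
```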
It has just been shown that if (V, (·, ·)) is a finite dimensional inner product space and if A ∈ ℒ(V, V) is self-adjoint, then the eigenvalues of A are real. The spectral theorem, to be established in the next section, provides much more useful information about self-adjoint transformations. For example, one application of the spectral theorem shows that a self-adjoint transformation is positive definite iff all its eigenvalues are positive.

If A ∈ ℒ(V, W) and B ∈ ℒ(W, V), the next result compares the eigenvalues of AB ∈ ℒ(W, W) with those of BA ∈ ℒ(V, V).
Proposition 1.39. The nonzero eigenvalues of AB are the same as the nonzero eigenvalues of BA, including multiplicities. If W = V, then AB and BA have the same eigenvalues and multiplicities.

Proof. Let m = dim V and n = dim W. The characteristic polynomial of BA is

p₁(λ) = det(BA − λIₘ).

Now, for λ ≠ 0, compute as follows, using Proposition 1.35 for the middle step:

det(BA − λIₘ) = det((−λ)(Iₘ − (1/λ)BA))
= (−λ)ᵐ det(Iₘ − (1/λ)BA) = (−λ)ᵐ det(Iₙ − (1/λ)AB)
= (−λ)ᵐ det((−1/λ)(AB − λIₙ)) = (−λ)ᵐ(−λ)⁻ⁿ det(AB − λIₙ).

Therefore, the characteristic polynomial of AB, say p₂(λ) = det(AB − λIₙ), is related to p₁(λ) by

p₁(λ) = (−λ)^{m−n} p₂(λ),  λ ≠ 0.

Both of the assertions follow from this relationship.
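Proposition 1.39 can be illustrated with rectangular matrices in numpy; the tolerance 1e-8 below is an arbitrary cutoff for deciding which computed eigenvalues count as zero:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 3, 5                       # dim V = 3, dim W = 5
A = rng.standard_normal((n, m))   # A: V -> W
B = rng.standard_normal((m, n))   # B: W -> V

ab = np.linalg.eigvals(A @ B)     # 5 eigenvalues
ba = np.linalg.eigvals(B @ A)     # 3 eigenvalues

# The nonzero eigenvalues agree; AB carries n - m = 2 extra zero eigenvalues.
nz_ab = np.sort_complex(ab[np.abs(ab) > 1e-8])
nz_ba = np.sort_complex(ba[np.abs(ba) > 1e-8])
print(np.allclose(nz_ab, nz_ba))  # True
print(np.sum(np.abs(ab) < 1e-8))
```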
1.7. THE SPECTRAL THEOREM
The spectral theorem for self-adjoint linear transformations on a finite dimensional real inner product space provides a basic theoretical tool not only for understanding self-adjoint transformations but also for establishing a variety of useful facts about general linear transformations. The form of
the spectral theorem given below is slightly weaker than that given in
Halmos (1958, see Section 79), but it suffices for most of our purposes. Applications of this result include a necessary and sufficient condition that a self-adjoint transformation be positive definite and a demonstration that positive definite transformations possess square roots. The singular value decomposition theorem, which follows from the spectral theorem, provides a useful decomposition result for linear transformations on one inner product space to another. This section ends with a description of the relationship between the singular value decomposition theorem and angles between two subspaces of an inner product space.
Let (V, (·, ·)) be a finite dimensional real inner product space. The spectral theorem follows from the two results below. If A ∈ ℒ(V, V) and M is a subspace of V, M is called invariant under A if A(M) = {Ax | x ∈ M} ⊆ M.
Proposition 1.40. Suppose A ∈ ℒ(V, V) is self-adjoint and let M be a subspace of V. If A(M) ⊆ M, then A(M⊥) ⊆ M⊥.

Proof. Suppose v ∈ A(M⊥). It must be shown that (v, x) = 0 for all x ∈ M. Since v ∈ A(M⊥), v = Av₁ for some v₁ ∈ M⊥. Therefore,

(v, x) = (Av₁, x) = (v₁, Ax) = 0

since A is self-adjoint and x ∈ M implies Ax ∈ M by assumption. □
Proposition 1.41. Suppose A ∈ ℒ(V, V) is self-adjoint and λ is an eigenvalue of A. Then there exists a v ∈ V, v ≠ 0, such that Av = λv.
Proof. Since A is self-adjoint, the eigenvalues of A are real. Let {v₁, ..., vₙ} be a basis for V and let [A] be the matrix of A in this basis. By Proposition 1.36, there exists a nonzero vector z ∈ ℂⁿ such that [A]z = λz. Write z = z₁ + iz₂ where z₁ ∈ Rⁿ is the real part of z and z₂ ∈ Rⁿ is the imaginary part of z. Since [A] is real and λ is real, we have [A]z₁ = λz₁ and [A]z₂ = λz₂. But z₁ and z₂ cannot both be zero as z ≠ 0. For definiteness, say z₁ ≠ 0, and let v ∈ V be the vector whose vector of coordinates in the basis {v₁, ..., vₙ} is z₁. Then v ≠ 0 and [A][v] = λ[v]. Therefore Av = λv. □
Theorem 1.2 (Spectral Theorem). If A ∈ ℒ(V, V) is self-adjoint, then there exist an orthonormal basis {x₁, ..., xₙ} for V and real numbers λ₁, ..., λₙ such that

A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ.

Further, λ₁, ..., λₙ are the eigenvalues of A and Axᵢ = λᵢxᵢ, i = 1, ..., n.
Proof. The proof of the first assertion is by induction on dimension. For n = 1, the result is obvious. Assume the result is true for integers 1, 2, ..., n − 1 and consider A ∈ ℒ(V, V), which is self-adjoint on the inner product space (V, (·, ·)), n = dim V. Let λ be an eigenvalue of A. By Proposition 1.41, there exists v ∈ V, v ≠ 0, such that Av = λv. Set xₙ = v/‖v‖ and λₙ = λ. Then Axₙ = λₙxₙ. With M = span{xₙ}, it is clear that A(M) ⊆ M, so A(M⊥) ⊆ M⊥ by Proposition 1.40. However, if we let A₁ be the restriction of A to the (n − 1)-dimensional inner product space (M⊥, (·, ·)), then A₁ is clearly self-adjoint. By the induction hypothesis there is an orthonormal basis {x₁, ..., xₙ₋₁} for M⊥ and real numbers λ₁, ..., λₙ₋₁ such that

A₁ = Σᵢ₌₁ⁿ⁻¹ λᵢ xᵢ □ xᵢ.

It is clear that {x₁, ..., xₙ} is an orthonormal basis for V and we claim that

A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ.

To see this, consider v₀ ∈ V and write v₀ = v₁ + v₂ with v₁ ∈ M and v₂ ∈ M⊥. Then

Av₀ = Av₁ + Av₂ = λₙv₁ + A₁v₂ = λₙv₁ + Σᵢ₌₁ⁿ⁻¹ λᵢ(xᵢ □ xᵢ)v₂.
However,

Σᵢ₌₁ⁿ λᵢ(xᵢ □ xᵢ)(v₁ + v₂) = λₙ(v₁, xₙ)xₙ + Σᵢ₌₁ⁿ⁻¹ λᵢ(xᵢ □ xᵢ)v₂

since v₁ ∈ M and v₂ ∈ M⊥. But (v₁, xₙ)xₙ = v₁ since v₁ ∈ span{xₙ}. Therefore A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ, which establishes the first assertion.

For the second assertion, if A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ where {x₁, ..., xₙ} is an orthonormal basis for (V, (·, ·)), then

Axⱼ = Σᵢ λᵢ(xᵢ □ xᵢ)xⱼ = Σᵢ λᵢ(xᵢ, xⱼ)xᵢ = λⱼxⱼ.

Thus the matrix of A, say [A], in this basis has diagonal elements λ₁, ..., λₙ and all other elements of [A] are zero. Therefore the characteristic polynomial of A is

p(λ) = det([A] − λI) = ∏ᵢ₌₁ⁿ (λᵢ − λ),

which has roots λ₁, ..., λₙ. The proof of the spectral theorem is complete. □
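The decomposition of Theorem 1.2 is what `np.linalg.eigh` computes for a symmetric matrix; a numpy sketch reconstructing A from eigenvalues and rank-one outer products:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                  # a self-adjoint (symmetric) matrix

lam, X = np.linalg.eigh(A)         # eigenvalues and an orthonormal eigenbasis

# Reconstruct A = sum_i lambda_i x_i x_i' (each term is the outer product x_i [] x_i).
recon = sum(lam[i] * np.outer(X[:, i], X[:, i]) for i in range(n))
print(np.allclose(A, recon))            # True
print(np.allclose(X.T @ X, np.eye(n)))  # True: the x_i are orthonormal
```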
When A = Σλᵢ xᵢ □ xᵢ, A is particularly easy to understand. Namely, A is λᵢ times the identity transformation when restricted to span{xᵢ}. Also, if x ∈ V, then x = Σ(xᵢ, x)xᵢ so Ax = Σλᵢ(xᵢ, x)xᵢ. In the case when A is an orthogonal projection onto the subspace M, we know that A = Σᵢ₌₁ᵏ xᵢ □ xᵢ where k = dim M and {x₁, ..., xₖ} is an orthonormal basis for M. Thus A has eigenvalues of zero and one, and one occurs with multiplicity k = dim M. Conversely, the spectral theorem implies that, if A is self-adjoint and has only zero and one as eigenvalues, then A is an orthogonal projection onto a subspace of dimension equal to the multiplicity of the eigenvalue one. We now begin to reap the benefits of the spectral theorem.
Proposition 1.42. If A ∈ ℒ(V, V) is self-adjoint, then A is positive definite iff all the eigenvalues of A are strictly positive. Also, A is positive semidefinite iff the eigenvalues of A are non-negative.

Proof. Write A in spectral form:

A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ

where {x₁, ..., xₙ} is an orthonormal basis for (V, (·, ·)). Then (x, Ax) = Σλᵢ(xᵢ, x)². If λᵢ > 0 for i = 1, ..., n, then x ≠ 0 implies that Σλᵢ(xᵢ, x)² > 0 and A is positive definite. Conversely, if A is positive definite, set x = xⱼ and we have 0 < (xⱼ, Axⱼ) = λⱼ. Thus all the eigenvalues of A are strictly positive. The other assertion is proved similarly. □
The representation of A in spectral form suggests a way to define various functions of A. If A = Σλᵢ xᵢ □ xᵢ, then

A² = (Σᵢ λᵢ xᵢ □ xᵢ)(Σⱼ λⱼ xⱼ □ xⱼ) = Σᵢ Σⱼ λᵢλⱼ(xᵢ □ xᵢ)(xⱼ □ xⱼ)
= Σᵢ Σⱼ λᵢλⱼ(xᵢ, xⱼ) xᵢ □ xⱼ = Σᵢ λᵢ² xᵢ □ xᵢ.

More generally, if k is a positive integer, a bit of calculation shows that

Aᵏ = Σᵢ λᵢᵏ xᵢ □ xᵢ,  k = 1, 2, ....

For k = 0, we adopt the convention that A⁰ = I since Σᵢ xᵢ □ xᵢ = I. Now if p is any polynomial on R, the above equation forces us to define p(A) by

p(A) = Σᵢ p(λᵢ) xᵢ □ xᵢ.

This suggests that, if f is any real-valued function that is defined at λ₁, ..., λₙ, we should define f(A) by

f(A) = Σᵢ f(λᵢ) xᵢ □ xᵢ.

Adopting this suggestive definition shows that if λ₁, ..., λₙ are the eigenvalues of A, then f(λ₁), ..., f(λₙ) are the eigenvalues of f(A). In particular, if λᵢ ≠ 0 for all i and f(t) = t⁻¹ for t ≠ 0, then it is clear that f(A) = A⁻¹.
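This recipe for f(A) is straightforward to implement for symmetric matrices; a numpy sketch, using f(t) = 1/t to recover the inverse as just noted (the helper name `apply_f` is ours, not the book's):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # self-adjoint with strictly positive eigenvalues

def apply_f(A, f):
    """f(A) = sum_i f(lambda_i) x_i x_i' for a symmetric matrix A."""
    lam, X = np.linalg.eigh(A)
    return (X * f(lam)) @ X.T      # scales column x_i by f(lambda_i), then recombines

# With f(t) = 1/t this is the inverse of A.
print(np.allclose(apply_f(A, lambda t: 1.0 / t), np.linalg.inv(A)))  # True
```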
Another useful choice for f is given in the following proposition.
Proposition 1.43. If A ∈ ℒ(V, V) is positive semidefinite, then there exists a B ∈ ℒ(V, V) that is positive semidefinite and satisfies B² = A.

Proof. Choose f(t) = t^{1/2} and let

B = f(A) = Σᵢ₌₁ⁿ λᵢ^{1/2} xᵢ □ xᵢ.

The square root is well defined since λᵢ ≥ 0 for i = 1, ..., n as A is positive semidefinite. Since B is self-adjoint and has non-negative eigenvalues, B is positive semidefinite. That B² = A is clear. □
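The same construction gives a concrete square root numerically; the `np.clip` guards against tiny negative eigenvalues produced by round-off on a positive semidefinite input:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T                        # positive semidefinite

lam, X = np.linalg.eigh(A)
B = (X * np.sqrt(np.clip(lam, 0, None))) @ X.T   # B = sum_i lambda_i^{1/2} x_i x_i'

print(np.allclose(B @ B, A))                     # True: B^2 = A
print(np.allclose(B, B.T))                       # True: B is self-adjoint
print(np.all(np.linalg.eigvalsh(B) >= -1e-10))   # True: B is positive semidefinite
```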
There is a technical problem with our definition of f(A) that is caused by the nonuniqueness of the representation

A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ

for self-adjoint transformations. For example, if the first n₁ λᵢ's are equal and the last n − n₁ λᵢ's are equal, then

A = λ₁(Σ_{i=1}^{n₁} xᵢ □ xᵢ) + λₙ Σ_{i=n₁+1}^{n} xᵢ □ xᵢ.

However, Σ_{i=1}^{n₁} xᵢ □ xᵢ is the orthogonal projection onto M ≡ span{x₁, ..., x_{n₁}}. If {y₁, ..., yₙ} is any other orthonormal basis for (V, (·, ·)) such that span{x₁, ..., x_{n₁}} = span{y₁, ..., y_{n₁}}, it is clear that

A = λ₁ Σ_{i=1}^{n₁} yᵢ □ yᵢ + λₙ Σ_{i=n₁+1}^{n} yᵢ □ yᵢ = Σᵢ₌₁ⁿ λᵢ yᵢ □ yᵢ.

Obviously, λ₁, ..., λₙ are uniquely defined as the eigenvalues of A (counting multiplicities), but the orthonormal basis {x₁, ..., xₙ} providing the spectral form for A is not unique. It is therefore necessary either to verify that the definition of f(A) does not depend on the particular orthonormal basis in the representation for A or to provide an alternative representation for A. It is this latter alternative that we follow. The result below is also called the spectral theorem.
Theorem 1.2a (Spectral Theorem). Suppose A is a self-adjoint linear transformation on V to V where n = dim V. Let λ₁ > ⋯ > λᵣ be the distinct eigenvalues of A and let nᵢ be the multiplicity of λᵢ, i = 1, ..., r. Then there exist orthogonal projections P₁, ..., Pᵣ with PᵢPⱼ = 0 for i ≠ j, nᵢ = rank(Pᵢ), and Σᵢ Pᵢ = I such that

A = Σᵢ₌₁ʳ λᵢPᵢ.

Further, this decomposition is unique in the following sense. If μ₁ > ⋯ > μₖ and Q₁, ..., Qₖ are orthogonal projections such that QᵢQⱼ = 0 for i ≠ j, Σᵢ Qᵢ = I, and

A = Σᵢ₌₁ᵏ μᵢQᵢ,

then k = r, μᵢ = λᵢ, and Qᵢ = Pᵢ for i = 1, ..., k.

Proof. The first assertion follows immediately from the spectral representation given in Theorem 1.2. For a proof of the uniqueness assertion, see Halmos (1958, Section 79). □
Now, our definition of f(A) is

f(A) = Σ_{i=1}^r f(λ_i) P_i

when A = Σ_{i=1}^r λ_i P_i. Of course, it is assumed that f is defined at λ_1,..., λ_r. This is exactly the same definition as before, but the problem about the nonuniqueness of the representation of A has disappeared. One application of the uniqueness part of the above theorem is that the positive semidefinite square root given in Proposition 1.43 is unique. The proof of this is left to the reader (see Halmos, 1958, Section 82). Other functions of self-adjoint linear transformations come up later and we consider them as the need arises.

Another application of the spectral theorem solves an interesting extremal problem. To motivate this problem, suppose A is self-adjoint on (V, (·,·)) with eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_n.
Thus A = Σ λ_i x_i □ x_i where {x_1,..., x_n} is an orthonormal basis for V. For x ∈ V with ‖x‖ = 1, we ask how large (x, Ax) can be. To answer this, write (x, Ax) = Σ λ_i (x, (x_i □ x_i)x) = Σ λ_i (x, x_i)^2, and note that 0 ≤ (x, x_i)^2 and 1 = ‖x‖^2 = Σ (x, x_i)^2. Therefore, (x, Ax) ≤ λ_1 with equality for x = x_1. The conclusion is

sup_{x, ‖x‖=1} (x, Ax) = λ_1

where λ_1 is the largest eigenvalue of A. This result also shows that λ_1(A), the largest eigenvalue of the self-adjoint transformation A, is a convex function of A. In other words, if A_1 and A_2 are self-adjoint and α ∈ [0, 1], then

λ_1(αA_1 + (1 − α)A_2) ≤ αλ_1(A_1) + (1 − α)λ_1(A_2).

To prove this, first notice that for each x ∈ V, (x, Ax) is a linear, and hence convex, function of A. Since the supremum of a family of convex functions is a convex
function, it follows that

λ_1(A) = sup_{x, ‖x‖=1} (x, Ax)

is a convex function defined on the real linear space of self-adjoint linear transformations. An interesting generalization of this is the following.
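Both facts are easy to check numerically; the sketch below (not from the text) samples random unit vectors and random symmetric matrices as arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(1)

def lam1(A):
    """Largest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(A)[-1]      # eigvalsh returns ascending order

def sym(n):
    M = rng.standard_normal((n, n))
    return (M + M.T) / 2

A = sym(5)
# (x, Ax) over many random unit vectors never exceeds lambda_1.
xs = rng.standard_normal((10_000, 5))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
rayleigh = np.einsum('ij,jk,ik->i', xs, A, xs)   # x_i' A x_i for each row
assert rayleigh.max() <= lam1(A) + 1e-12

# Convexity: lam1(a A1 + (1-a) A2) <= a lam1(A1) + (1-a) lam1(A2).
A1, A2 = sym(5), sym(5)
for a in np.linspace(0.0, 1.0, 11):
    assert lam1(a * A1 + (1 - a) * A2) <= a * lam1(A1) + (1 - a) * lam1(A2) + 1e-10
```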
Proposition 1.44. Consider a self-adjoint transformation A defined on the n-dimensional inner product space (V, (·,·)) and let λ_1 ≥ λ_2 ≥ ··· ≥ λ_n be the ordered eigenvalues of A. For 1 ≤ k ≤ n, let B_k be the collection of all k-tuples {v_1,..., v_k} such that {v_1,..., v_k} is an orthonormal set in (V, (·,·)). Then

sup_{{v_1,...,v_k} ∈ B_k} Σ_{i=1}^k (v_i, Av_i) = Σ_{i=1}^k λ_i.
Proof. Recall that ⟨·,·⟩ is the inner product on £(V, V) induced by the inner product (·,·) on V, and (x, Ax) = ⟨x □ x, A⟩ for x ∈ V. Thus

Σ_{i=1}^k (v_i, Av_i) = ⟨Σ_{i=1}^k v_i □ v_i, A⟩.

Write A in spectral form, A = Σ_{i=1}^n λ_i x_i □ x_i. For {v_1,..., v_k} ∈ B_k, P_k = Σ_{i=1}^k v_i □ v_i is the orthogonal projection onto span{v_1,..., v_k}. Thus for {v_1,..., v_k} ∈ B_k,

⟨Σ_{i=1}^k v_i □ v_i, A⟩ = ⟨P_k, Σ_{i=1}^n λ_i x_i □ x_i⟩ = Σ_{i=1}^n λ_i ⟨P_k, x_i □ x_i⟩ = Σ_{i=1}^n λ_i (x_i, P_k x_i).

Since P_k is an orthogonal projection and ‖x_i‖ = 1, i = 1,..., n, we have 0 ≤ (x_i, P_k x_i) ≤ 1. Also,

Σ_{i=1}^n (x_i, P_k x_i) = ⟨P_k, Σ_{i=1}^n x_i □ x_i⟩ = ⟨P_k, I⟩

because Σ_{i=1}^n x_i □ x_i = I ∈ £(V, V). But P_k = Σ_{i=1}^k v_i □ v_i, so

⟨P_k, I⟩ = Σ_{i=1}^k ⟨v_i □ v_i, I⟩ = Σ_{i=1}^k (v_i, v_i) = k.
Therefore, the real numbers a_i = (x_i, P_k x_i), i = 1,..., n, satisfy 0 ≤ a_i ≤ 1 and Σ_{i=1}^n a_i = k. A moment's reflection shows that, for any numbers a_1,..., a_n satisfying these conditions, we have

Σ_{i=1}^n λ_i a_i ≤ Σ_{i=1}^k λ_i

since λ_1 ≥ ··· ≥ λ_n. Therefore,

⟨Σ_{i=1}^k v_i □ v_i, A⟩ ≤ Σ_{i=1}^k λ_i

for {v_1,..., v_k} ∈ B_k. However, setting v_i = x_i, i = 1,..., k, yields equality in the above inequality. ∎
For a self-adjoint A ∈ £(V, V), define tr_k A = Σ_{i=1}^k λ_i where λ_1 ≥ ··· ≥ λ_n are the ordered eigenvalues of A. The symbol tr_k A is read "trace sub-k of A." Since ⟨Σ_{i=1}^k v_i □ v_i, A⟩ is a linear function of A and tr_k A is the supremum over all {v_1,..., v_k} ∈ B_k, it follows that tr_k A is a convex function of A. Of course, when k = n, tr_k A is just the trace of A.
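As an illustrative sketch (not from the text), tr_k and the extremal property of Proposition 1.44 can be checked numerically; random orthonormal k-frames are obtained here from QR factorizations of random matrices.

```python
import numpy as np

def tr_k(A, k):
    """Sum of the k largest eigenvalues of a symmetric matrix A."""
    lam = np.linalg.eigvalsh(A)           # ascending order
    return lam[::-1][:k].sum()

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2

# Proposition 1.44: sum_i (v_i, A v_i) over any orthonormal k-frame
# never exceeds tr_k(A); note sum_i v_i' A v_i = tr(Q' A Q).
k = 3
for _ in range(200):
    Q, _ = np.linalg.qr(rng.standard_normal((6, k)))  # orthonormal columns
    assert np.trace(Q.T @ A @ Q) <= tr_k(A, k) + 1e-10

assert np.isclose(tr_k(A, 6), np.trace(A))  # k = n recovers the trace
```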
For completeness, a statement of the spectral theorem for n × n symmetric matrices is in order.

Proposition 1.45. Suppose A is an n × n real symmetric matrix. Then there exists an n × n orthogonal matrix Γ and an n × n diagonal matrix D such that A = ΓDΓ'. The columns of Γ are the eigenvectors of A and the diagonal elements of D, say λ_1,..., λ_n, are the eigenvalues of A.
Proof. This is nothing more than a disguised version of the spectral theorem. To see this, write

A = Σ_{i=1}^n λ_i x_i x_i'

where x_i ∈ R^n, λ_i ∈ R, and {x_1,..., x_n} is an orthonormal basis for R^n with the usual inner product (here x_i □ x_i is x_i x_i' since we have the usual inner product on R^n). Let Γ have columns x_1,..., x_n and let D have diagonal elements λ_1,..., λ_n. Then a straightforward computation shows that

Σ_{i=1}^n λ_i x_i x_i' = ΓDΓ'.

The remaining assertions follow immediately from the spectral theorem. ∎
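In matrix terms this is exactly what `numpy.linalg.eigh` computes; the sketch below (not from the text) verifies A = ΓDΓ' on an arbitrary random symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                       # real symmetric

lam, G = np.linalg.eigh(A)              # eigenvalues (ascending), eigenvectors
D = np.diag(lam)

assert np.allclose(G @ G.T, np.eye(5))  # Gamma is orthogonal
assert np.allclose(G @ D @ G.T, A)      # A = Gamma D Gamma'

# Equivalently, A = sum_i lambda_i x_i x_i' with x_i the columns of Gamma.
A_sum = sum(lam[i] * np.outer(G[:, i], G[:, i]) for i in range(5))
assert np.allclose(A_sum, A)
```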
Our final application of the spectral theorem in this chapter deals with a representation theorem for a linear transformation A ∈ £(V, W) where (V, (·,·)) and (W, [·,·]) are finite dimensional inner product spaces. In this context, eigenvalues and eigenvectors of A make no sense, but something can be salvaged by considering A'A ∈ £(V, V). First, A'A is non-negative definite and N(A'A) = N(A). Let k = rank(A) = rank(A'A) and let λ_1 ≥ ··· ≥ λ_k > 0 be the nonzero eigenvalues of A'A. There must be exactly k positive eigenvalues of A'A as rank(A) = k. The spectral theorem shows that

A'A = Σ_{i=1}^k λ_i x_i □ x_i

where {x_1,..., x_n} is an orthonormal basis for V, A'A x_i = λ_i x_i for i = 1,..., k, and A'A x_i = 0 for i = k + 1,..., n. Therefore, N(A) = N(A'A) = (span{x_1,..., x_k})⊥.
Proposition 1.46. In the notation above, let w_i = (1/√λ_i) A x_i for i = 1,..., k. Then {w_1,..., w_k} is an orthonormal basis for R(A) ⊆ W and A = Σ_{i=1}^k √λ_i w_i □ x_i.

Proof. Since dim R(A) = k, {w_1,..., w_k} is a basis for R(A) if {w_1,..., w_k} is an orthonormal set. But

[w_i, w_j] = (λ_i λ_j)^{-1/2} [A x_i, A x_j] = (λ_i λ_j)^{-1/2} (x_i, A'A x_j) = (λ_i λ_j)^{-1/2} λ_j (x_i, x_j) = δ_ij

and the first assertion holds. To show A = Σ_{i=1}^k √λ_i w_i □ x_i, we verify that the two linear transformations agree on the basis {x_1,..., x_n}. For 1 ≤ j ≤ k, A x_j = √λ_j w_j by definition and

(Σ_{i=1}^k √λ_i w_i □ x_i) x_j = Σ_{i=1}^k √λ_i (x_i, x_j) w_i = √λ_j w_j.

For k + 1 ≤ j ≤ n, A x_j = 0 since N(A) = (span{x_1,..., x_k})⊥. Also

(Σ_{i=1}^k √λ_i w_i □ x_i) x_j = Σ_{i=1}^k √λ_i (x_i, x_j) w_i = 0

as j > k. ∎
Some immediate consequences of the above representation are: (i) AA' = Σ_{i=1}^k λ_i w_i □ w_i; (ii) A' = Σ_{i=1}^k √λ_i x_i □ w_i and A' w_i = √λ_i x_i for i = 1,..., k. In summary, we have the following.

Theorem 1.3 (Singular Value Decomposition Theorem). Given A ∈ £(V, W) of rank k, there exist orthonormal vectors x_1,..., x_k in V and w_1,..., w_k in W and positive numbers μ_1,..., μ_k such that

A = Σ_{i=1}^k μ_i w_i □ x_i.

Also, R(A) = span{w_1,..., w_k}, N(A) = (span{x_1,..., x_k})⊥, A x_i = μ_i w_i for i = 1,..., k, A' = Σ_{i=1}^k μ_i x_i □ w_i, A'A = Σ_{i=1}^k μ_i^2 x_i □ x_i, and AA' = Σ_{i=1}^k μ_i^2 w_i □ w_i. The numbers μ_1^2,..., μ_k^2 are the positive eigenvalues of both AA' and A'A.
For matrices, this result takes the following form.
Proposition 1.47. If A is a real n × m matrix of rank k, then there exist matrices Γ: n × k, D: k × k, and Ψ: k × m that satisfy Γ'Γ = I_k, ΨΨ' = I_k, D is a diagonal matrix with positive diagonal elements, and

A = ΓDΨ.

Proof. Take V = R^m, W = R^n and apply Theorem 1.3 to get

A = Σ_{i=1}^k μ_i w_i x_i'

where x_1,..., x_k are orthonormal in R^m, w_1,..., w_k are orthonormal in R^n, and μ_i > 0, i = 1,..., k. Let Γ have columns w_1,..., w_k, let Ψ have rows x_1',..., x_k', and let D be diagonal with diagonal elements μ_1,..., μ_k. An easy calculation shows that

Σ_{i=1}^k μ_i w_i x_i' = ΓDΨ. ∎
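The decomposition A = ΓDΨ of Proposition 1.47 corresponds to the reduced singular value decomposition; the following sketch (not from the text) obtains it from `numpy.linalg.svd` on an arbitrary rank-3 example.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))  # 6 x 5, rank 3

U, s, Vt = np.linalg.svd(A)
k = int(np.sum(s > 1e-10))              # numerical rank
G   = U[:, :k]                          # Gamma: n x k, G'G = I_k
D   = np.diag(s[:k])                    # k x k, positive diagonal
Psi = Vt[:k, :]                         # Psi: k x m, Psi Psi' = I_k

assert np.allclose(G.T @ G, np.eye(k))
assert np.allclose(Psi @ Psi.T, np.eye(k))
assert np.allclose(G @ D @ Psi, A)      # A = Gamma D Psi
# The mu_i^2 = s_i^2 are the positive eigenvalues of A'A (and of AA').
assert np.allclose(np.sort(np.linalg.eigvalsh(A.T @ A))[-k:],
                   np.sort(s[:k] ** 2))
```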
In the case that A ∈ £(V, V) has rank k, Theorem 1.3 shows that there exist orthonormal sets {x_1,..., x_k} and {w_1,..., w_k} in V such that

A = Σ_{i=1}^k μ_i w_i □ x_i

where μ_i > 0, i = 1,..., k. Also, R(A) = span{w_1,..., w_k} and N(A) =
(span{x_1,..., x_k})⊥.

Now, consider two subspaces M_1 and M_2 of the inner product space (V, (·,·)) and let P_1 and P_2 be the orthogonal projections onto M_1 and M_2. In what follows, the geometrical relationship between the two subspaces (measured in terms of angles, which are defined below) is related to the singular value decomposition of the linear transformation P_2 P_1 ∈ £(V, V). It is clear that R(P_2 P_1) ⊆ M_2 and N(P_2 P_1) ⊇ M_1⊥. Let k = rank(P_2 P_1) so k ≤ dim(M_i), i = 1, 2. Theorem 1.3 implies that

P_2 P_1 = Σ_{i=1}^k μ_i w_i □ x_i

where μ_i > 0, i = 1,..., k, R(P_2 P_1) = span{w_1,..., w_k} ⊆ M_2, and (N(P_2 P_1))⊥ = span{x_1,..., x_k} ⊆ M_1. Also, {w_1,..., w_k} and {x_1,..., x_k} are orthonormal sets. Since P_2 P_1 x_j = μ_j w_j and (P_2 P_1)'P_2 P_1 = P_1 P_2 P_2 P_1 = P_1 P_2 P_1 = Σ_{l=1}^k μ_l^2 x_l □ x_l, we have

μ_j (x_i, w_j) = (x_i, P_2 P_1 x_j) = (P_1 x_i, P_2 P_1 x_j) = (x_i, P_1 P_2 P_1 x_j) = (x_i, Σ_{l=1}^k μ_l^2 (x_l □ x_l) x_j) = (x_i, x_j) μ_j^2.

Therefore, for i, j = 1,..., k,

(x_i, w_j) = δ_ij μ_j

since μ_j > 0. Furthermore, if x ∈ M_1 ∩ (span{x_1,..., x_k})⊥ and w ∈ M_2, then (x, w) = (P_1 x, P_2 w) = (P_2 P_1 x, w) = 0 since P_2 P_1 x = 0. Similarly, if w ∈ M_2 ∩ (span{w_1,..., w_k})⊥ and x ∈ M_1, then (x, w) = 0. The above discussion yields the following proposition.
Proposition 1.48. Suppose M_1 and M_2 are subspaces of (V, (·,·)) and let P_1 and P_2 be the orthogonal projections onto M_1 and M_2. If k = rank(P_2 P_1), then there exist orthonormal sets {x_1,..., x_k} ⊆ M_1 and {w_1,..., w_k} ⊆ M_2 and positive numbers μ_1 ≥ ··· ≥ μ_k such that:

(i) P_2 P_1 = Σ_{i=1}^k μ_i w_i □ x_i.
(ii) P_1 P_2 P_1 = Σ_{i=1}^k μ_i^2 x_i □ x_i.
(iii) P_2 P_1 P_2 = Σ_{i=1}^k μ_i^2 w_i □ w_i.
(iv) 0 < μ_j ≤ 1 and (x_i, w_j) = δ_ij μ_j for i, j = 1,..., k.
(v) If x ∈ M_1 ∩ (span{x_1,..., x_k})⊥ and w ∈ M_2, then (x, w) = 0. If w ∈ M_2 ∩ (span{w_1,..., w_k})⊥ and x ∈ M_1, then (x, w) = 0.
Proof. Assertions (i), (ii), (iii), and (v) have been verified, as has the relationship (x_i, w_j) = δ_ij μ_j. Since 0 < μ_j = (x_j, w_j), the Cauchy-Schwarz inequality yields (x_j, w_j) ≤ ‖x_j‖ ‖w_j‖ = 1. ∎

In Proposition 1.48, if k = rank(P_2 P_1) = 0, then M_1 and M_2 are orthogonal to each other and P_1 P_2 = P_2 P_1 = 0. The next result provides the framework in which to relate the numbers μ_1 ≥ ··· ≥ μ_k to angles.
Proposition 1.49. In the notation of Proposition 1.48, let M_11 = M_1, M_21 = M_2,

M_1i = (span{x_1,..., x_{i−1}})⊥ ∩ M_1

and

M_2i = (span{w_1,..., w_{i−1}})⊥ ∩ M_2

for i = 2,..., k + 1. Also, for i = 1,..., k, let

D_1i = {x | x ∈ M_1i, ‖x‖ = 1}

and

D_2i = {w | w ∈ M_2i, ‖w‖ = 1}.

Then

sup_{x ∈ D_1i} sup_{w ∈ D_2i} (x, w) = (x_i, w_i) = μ_i

for i = 1,..., k. Also, M_1(k+1) ⊥ M_2 and M_2(k+1) ⊥ M_1.
Proof. Since x_i ∈ D_1i and w_i ∈ D_2i, the iterated supremum is at least (x_i, w_i), and (x_i, w_i) = μ_i by (iv) of Proposition 1.48. Thus it suffices to show that for each x ∈ D_1i and w ∈ D_2i, we have the inequality (x, w) ≤ μ_i. However, for x ∈ D_1i and w ∈ D_2i,

(x, w) = (P_1 x, P_2 w) = (P_2 P_1 x, w) ≤ ‖P_2 P_1 x‖ ‖w‖ = ‖P_2 P_1 x‖

since ‖w‖ = 1 as w ∈ D_2i. Thus

(x, w) ≤ ‖P_2 P_1 x‖ = (P_2 P_1 x, P_2 P_1 x)^{1/2} = (P_1 P_2 P_1 x, x)^{1/2} = [Σ_{j=1}^k μ_j^2 ((x_j □ x_j)x, x)]^{1/2} = [Σ_{j=1}^k μ_j^2 (x_j, x)^2]^{1/2}.

Since x ∈ D_1i, (x, x_j) = 0 for j = 1,..., i − 1. Also, the numbers a_j
= (x_j, x)^2 satisfy 0 ≤ a_j ≤ 1 and Σ_j a_j ≤ 1 as ‖x‖ = 1. Therefore,

(x, w) ≤ [Σ_{j=1}^k μ_j^2 (x_j, x)^2]^{1/2} = [Σ_{j=i}^k μ_j^2 a_j]^{1/2} ≤ μ_i.

The last inequality follows from the fact that μ_i ≥ ··· ≥ μ_k > 0 and the conditions on the a_j's. Hence,

sup_{x ∈ D_1i} sup_{w ∈ D_2i} (x, w) = (x_i, w_i) = μ_i

and the first assertion holds. The second assertion is simply a restatement of (v) of Proposition 1.48. ∎
Definition 1.28. Let M_1 and M_2 be subspaces of (V, (·,·)). Given the numbers μ_1 ≥ ··· ≥ μ_k > 0, whose existence is guaranteed by Proposition 1.48, define θ_i ∈ [0, π/2) by

cos θ_i = μ_i, i = 1,..., k.

Let t = min(dim M_1, dim M_2) and set θ_i = π/2 for i = k + 1,..., t. The numbers θ_1 ≤ θ_2 ≤ ··· ≤ θ_t are called the ordered angles between M_1 and M_2.
The following discussion is intended to provide motivation, explanation, and a geometric interpretation of the above definition. Recall that if y_1 and y_2 are two vectors in (V, (·,·)) of length 1, then the cosine of the angle between y_1 and y_2 is defined by cos θ = (y_1, y_2) where 0 ≤ θ ≤ π. However, if we want to define the angle between the two lines span{y_1} and span{y_2}, then a choice must be made between two angles that sum to π. The convention adopted here is to choose the angle in [0, π/2]. Thus the cosine of the angle between span{y_1} and span{y_2} is just |(y_1, y_2)|. To show this agrees with the definition above, we have M_i = span{y_i} and P_i = y_i □ y_i is the orthogonal projection onto M_i, i = 1, 2. The rank of P_2 P_1 is either zero or one, and the rank is zero iff y_1 ⊥ y_2. If y_1 ⊥ y_2, then the angle between M_1 and M_2 is π/2, which agrees with Definition 1.28. When the rank of P_2 P_1 is one, P_1 P_2 P_1 = (y_1, y_2)^2 y_1 □ y_1, whose only nonzero eigenvalue is (y_1, y_2)^2. Thus μ_1^2 = (y_1, y_2)^2, so μ_1 = |(y_1, y_2)| = cos θ_1, and again we have agreement with Definition 1.28.
Now consider the case when M_1 = span{y_1}, ‖y_1‖ = 1, and M_2 is an arbitrary subspace of (V, (·,·)). Geometrically, it is clear that the angle between M_1 and M_2 is just the angle between M_1 and the orthogonal projection of M_1 onto M_2, say M_2* = span{P_2 y_1}, where P_2 is the orthogonal projection onto M_2. Thus the cosine of the angle between M_1 and M_2 is

cos θ = |(y_1, P_2 y_1 / ‖P_2 y_1‖)| = ‖P_2 y_1‖.

If P_2 y_1 = 0, then M_1 ⊥ M_2 and cos θ = 0 so θ = π/2, in agreement with Definition 1.28. When P_2 y_1 ≠ 0, then P_1 P_2 P_1 = (y_1, P_2 y_1) y_1 □ y_1, whose only nonzero eigenvalue is μ_1^2 = (y_1, P_2 y_1) = (P_2 y_1, P_2 y_1) = ‖P_2 y_1‖^2. Therefore, μ_1 = ‖P_2 y_1‖ and again we have agreement with Definition 1.28.
In the general case when dim(M_i) > 1 for i = 1, 2, it is not entirely clear how we should define the angles between M_1 and M_2. However, the following considerations should provide some justification for Definition 1.28. First, consider x ∈ M_1 and w ∈ M_2 with ‖x‖ = ‖w‖ = 1. The cosine of the angle between span{x} and span{w} is |(x, w)|. Thus the largest cosine of any angle (equivalently, the smallest angle in [0, π/2]) between a one-dimensional subspace of M_1 and a one-dimensional subspace of M_2 is

sup_{x ∈ D_11} sup_{w ∈ D_21} |(x, w)| = sup_{x ∈ D_11} sup_{w ∈ D_21} (x, w).

The sets D_11 and D_21 are defined in Proposition 1.49. By Proposition 1.49, this iterated supremum is μ_1 and is achieved for x = x_1 ∈ D_11 and w = w_1 ∈ D_21. Thus the cosine of the angle between span{x_1} and span{w_1} is μ_1.

Now, remove span{x_1} from M_1 to get M_12 = (span{x_1})⊥ ∩ M_1 and remove span{w_1} from M_2 to get M_22 = (span{w_1})⊥ ∩ M_2. The second largest cosine of any angle between M_1 and M_2 is defined to be the largest cosine of any angle between M_12 and M_22 and is given by

sup_{x ∈ D_12} sup_{w ∈ D_22} (x, w) = (x_2, w_2) = μ_2.

Next, span{x_2} is removed from M_12 and span{w_2} is removed from M_22, yielding M_13 and M_23. The third largest cosine of any angle between M_1 and M_2 is defined to be the largest cosine of any angle between M_13 and M_23, and so on. After k steps, we are left with M_1(k+1) and M_2(k+1), which are orthogonal to each other. Thus the remaining angles are π/2. The above is precisely the content of Definition 1.28, given the results of Propositions 1.48 and 1.49.
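Numerically, the cosines μ_1 ≥ ··· ≥ μ_k are the singular values of Q_1'Q_2 for orthonormal bases Q_1 of M_1 and Q_2 of M_2; the sketch below (not from the text, using arbitrary random subspaces) checks this against the eigenvalues of P_1 P_2 P_1 from Proposition 1.48.

```python
import numpy as np

def principal_cosines(A, B):
    """Cosines of the ordered angles between span(columns of A)
    and span(columns of B), via singular values of Q1' Q2."""
    Q1, _ = np.linalg.qr(A)             # orthonormal basis for M1
    Q2, _ = np.linalg.qr(B)             # orthonormal basis for M2
    return np.linalg.svd(Q1.T @ Q2, compute_uv=False)

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 2))         # M1 = span of 2 random vectors in R^5
B = rng.standard_normal((5, 3))         # M2 = span of 3 random vectors in R^5
cos = principal_cosines(A, B)           # mu_1 >= mu_2, each in [0, 1]
assert np.all(cos >= 0) and np.all(cos <= 1 + 1e-12)

# Proposition 1.48 (ii): the mu_i^2 are the nonzero eigenvalues of P1 P2 P1.
P1 = A @ np.linalg.inv(A.T @ A) @ A.T
P2 = B @ np.linalg.inv(B.T @ B) @ B.T
ev = np.linalg.eigvalsh(P1 @ P2 @ P1)
assert np.allclose(np.sort(ev)[-2:], np.sort(cos**2), atol=1e-10)
```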
The statistical interpretation of the angles between subspaces is given in a later chapter. In a statistical context, the cosines of these angles are called canonical correlation coefficients and are a measure of the affine dependence between random vectors.
PROBLEMS
All vector spaces are finite dimensional unless specified otherwise.
1. Let V_{n+1} be the set of all polynomials of degree at most n (in the real variable t) with real coefficients. With the usual definitions of addition and scalar multiplication, prove that V_{n+1} is an (n + 1)-dimensional real vector space.
2. For A ∈ £(V, W), suppose that M is any subspace of V such that M ⊕ N(A) = V.
(i) Show that R(A) = A(M) where A(M) = {w | w = Ax for some x ∈ M}.
(ii) If x_1,..., x_k is any linearly independent set in V such that span{x_1,..., x_k} ∩ N(A) = {0}, prove that Ax_1,..., Ax_k is linearly independent.
3. For A ∈ £(V, W), fix w_0 ∈ W and consider the linear equation Ax = w_0. If w_0 ∉ R(A), there is no solution to this equation. If w_0 ∈ R(A), let x_0 be any solution, so Ax_0 = w_0. Prove that N(A) + x_0 is the set of all solutions to Ax = w_0.
4. For the direct sum space V_1 ⊕ V_2, suppose A_ij ∈ £(V_j, V_i) and let

   T = ( A_11  A_12 )
       ( A_21  A_22 )

be defined by

   T{v_1, v_2} = {A_11 v_1 + A_12 v_2, A_21 v_1 + A_22 v_2}

for {v_1, v_2} ∈ V_1 ⊕ V_2.
(i) Prove that T is a linear transformation.
(ii) Conversely, prove that every T ∈ £(V_1 ⊕ V_2, V_1 ⊕ V_2) has such a representation.
(iii) If

   T = ( A_11  A_12 )   and   U = ( B_11  B_12 )
       ( A_21  A_22 )             ( B_21  B_22 ),

prove that the representation of TU is

   ( A_11 B_11 + A_12 B_21   A_11 B_12 + A_12 B_22 )
   ( A_21 B_11 + A_22 B_21   A_21 B_12 + A_22 B_22 ).
5. Let x_1,..., x_r, x_{r+1} be vectors in V with x_1,..., x_r linearly independent. For w_1,..., w_r, w_{r+1} in W, give a necessary and sufficient condition for the existence of an A ∈ £(V, W) that satisfies A x_i = w_i, i = 1,..., r + 1.
6. Suppose A ∈ £(V, V) satisfies A^2 = cA where c ≠ 0. Find a constant k so that B = kA is a projection.
7. Suppose A is an m × n matrix with columns a_1,..., a_n and B is an n × k matrix with rows b_1',..., b_n'. Show that AB = Σ_{i=1}^n a_i b_i'.
8. Let x_1,..., x_k be vectors in R^n, set M = span{x_1,..., x_k}, and let A be the n × k matrix with columns x_1,..., x_k so A ∈ £(R^k, R^n).
(i) Show M = R(A).
(ii) Show dim(M) = rank(A'A).
9. For linearly independent x_1,..., x_k in (V, (·,·)), let y_1,..., y_k be the vectors obtained by applying the Gram-Schmidt (G-S) process to x_1,..., x_k. Show that if z_i = A x_i, i = 1,..., k, where A ∈ O(V), then the vectors obtained by the G-S process from z_1,..., z_k are A y_1,..., A y_k. (In other words, the G-S process commutes with orthogonal transformations.)
10. In (V, (·,·)), let x_1,..., x_k be vectors with x_1 ≠ 0. Form y_1^1,..., y_k^1 by y_1^1 = x_1/‖x_1‖ and y_i^1 = x_i − (x_i, y_1^1) y_1^1, i = 2,..., k.
(i) Show span{x_1,..., x_r} = span{y_1^1,..., y_r^1} for r = 1, 2,..., k.
(ii) Show y_1^1 ⊥ span{y_2^1,..., y_r^1} so span{y_1^1,..., y_r^1} = span{y_1^1} ⊕ span{y_2^1,..., y_r^1} for r = 2,..., k.
(iii) Now, form y_2^2,..., y_k^2 from y_2^1,..., y_k^1 as the y_i^1's were formed from the x_i's (reordering if necessary to achieve y_2^1 ≠ 0). Show span{x_1,..., x_k} = span{y_1^1} ⊕ span{y_2^2} ⊕ span{y_3^2,..., y_k^2}.
(iv) Let m = dim(span{x_1,..., x_k}). Show that after applying the above procedure m times, we get an orthonormal basis y_1^1, y_2^2,..., y_m^m for span{x_1,..., x_k}.
(v) If x_1,..., x_k are linearly independent, show that span{x_1,..., x_r} = span{y_1^1,..., y_r^r} for r = 1,..., k.
11. Let x_1,..., x_m be a basis for (V, (·,·)) and w_1,..., w_n be a basis for (W, [·,·]). For A, B ∈ £(V, W), show that [A x_i, w_j] = [B x_i, w_j] for i = 1,..., m and j = 1,..., n implies that A = B.
12. For x_i ∈ (V, (·,·)) and y_i ∈ (W, [·,·]), i = 1, 2, suppose that x_1 □ y_1 = x_2 □ y_2 ≠ 0. Prove that x_1 = c x_2 for some scalar c ≠ 0 and that y_1 = c^{-1} y_2.
13. Given two inner products on V, say (·,·) and [·,·], show that there exist positive constants c_1 and c_2 such that c_1[x, x] ≤ (x, x) ≤ c_2[x, x] for x ∈ V. Using this, show that for any open ball in (V, (·,·)), say B = {x | (x, x)^{1/2} < a}, there exist open balls in (V, [·,·]), say B_i = {x | [x, x]^{1/2} < β_i}, i = 1, 2, such that B_1 ⊆ B ⊆ B_2.
14. In (V, (·,·)), prove that ‖x + y‖ ≤ ‖x‖ + ‖y‖. Using this, prove that h(x) = ‖x‖ is a convex function.
15. For positive integers I and J, consider the IJ-dimensional real vector space, V, of all real-valued functions defined on {1, 2,..., I} × {1, 2,..., J}. Denote the value of y ∈ V at (i, j) by y_ij. The inner product on V is taken to be (y, ỹ) = Σ_i Σ_j y_ij ỹ_ij. The symbol 1 ∈ V denotes the vector all of whose coordinates are one.
(i) Define A on V to V by (Ay)_ij = ȳ.. where ȳ.. = (IJ)^{-1} Σ_i Σ_j y_ij. Show that A is the orthogonal projection onto span{1}.
(ii) Define linear transformations B_1, B_2, and B_3 on V by

(B_1 y)_ij = ȳ_i. − ȳ..
(B_2 y)_ij = ȳ._j − ȳ..
(B_3 y)_ij = y_ij − ȳ_i. − ȳ._j + ȳ..

where

ȳ_i. = J^{-1} Σ_j y_ij   and   ȳ._j = I^{-1} Σ_i y_ij.

Show that B_1, B_2, and B_3 are orthogonal projections and the
following hold:

A B_k = 0, k = 1, 2, 3,
B_1 B_2 = B_1 B_3 = B_2 B_3 = 0,
(A + B_1 + B_2 + B_3) y = y, y ∈ V.

(iii) Show that

‖y‖^2 = ‖Ay‖^2 + ‖B_1 y‖^2 + ‖B_2 y‖^2 + ‖B_3 y‖^2.
16. For Γ ∈ O(V) and M a subspace of V, suppose that Γ(M) ⊆ M. Prove that Γ(M⊥) ⊆ M⊥.
17. Given a subspace M of (V, (·,·)), show the following are equivalent:
(i) |(x, y)| ≤ c‖x‖ for all x ∈ M.
(ii) ‖P_M y‖ ≤ c.
Here c is a fixed positive constant and P_M is the orthogonal projection onto M.
18. In (V, (·,·)), suppose A and B are positive semidefinite. For C, D ∈ £(V, V) prove that (tr ACBD')^2 ≤ (tr ACBC')(tr ADBD').
19. Show that C^n is a 2n-dimensional real vector space.
20. Let A be an n × n real matrix. Prove:
(i) If λ_0 is a real eigenvalue of A, then there exists a corresponding real eigenvector.
(ii) If λ_0 is an eigenvalue that is not real, then any corresponding eigenvector cannot be real or purely imaginary.
21. In an n-dimensional space (V, (·,·)), suppose P is a rank r orthogonal projection. For α, β ∈ R, let A = αP + β(I − P). Find the eigenvalues, the eigenvectors, and the characteristic polynomial of A. Show that A is positive definite iff α > 0 and β > 0. What is A^{-1} when it exists?
22. Suppose A and B are self-adjoint and A − B ≥ 0. Let λ_1 ≥ ··· ≥ λ_n and μ_1 ≥ ··· ≥ μ_n be the eigenvalues of A and B. Show that λ_i ≥ μ_i, i = 1,..., n.
23. If S, T ∈ £(V, V) and S > 0, T ≥ 0, prove that ⟨S, T⟩ = 0 implies T = 0.
24. For A ∈ (£(V, V), ⟨·,·⟩), show that ⟨A, I⟩ = tr A.
25. Suppose A and B in £(V, V) are self-adjoint and write A ≥ B to mean A − B ≥ 0.
(i) If A ≥ B, show that CAC' ≥ CBC' for all C ∈ £(V, V).
(ii) Show I ≥ A iff all the eigenvalues of A are less than or equal to one.
(iii) Assume A ≥ 0, B ≥ 0, and A ≥ B. Is A^{1/2} ≥ B^{1/2}? Is A^2 ≥ B^2?
26. If P is an orthogonal projection, show that tr P is the rank of P.
27. Let x_1,..., x_n be an orthonormal basis for (V, (·,·)) and consider the vector space (£(V, V), ⟨·,·⟩). Let M be the subspace of £(V, V) consisting of all self-adjoint linear transformations and let N be the subspace of all skew-symmetric linear transformations. Prove:
(i) {x_i □ x_j + x_j □ x_i | i ≤ j} is an orthogonal basis for M.
(ii) {x_i □ x_j − x_j □ x_i | i < j} is an orthogonal basis for N.
(iii) M is orthogonal to N and M ⊕ N = £(V, V).
(iv) The orthogonal projection onto M is A → (A + A')/2, A ∈ £(V, V).
28. Consider £_{n,n} with the usual inner product ⟨A, B⟩ = tr AB', and let S_n be the subspace of symmetric matrices. Then (S_n, ⟨·,·⟩) is an inner product space. Show dim S_n = n(n + 1)/2 and, for S, T ∈ S_n, ⟨S, T⟩ = Σ_i s_ii t_ii + 2 Σ Σ_{i<j} s_ij t_ij.
29. For A ∈ £(V, W), one definition of the norm of A is

|||A||| = sup_{‖v‖=1} ‖Av‖

where ‖·‖ denotes the given norms on V and W.
(i) Show that |||A||| is the square root of the largest eigenvalue of A'A.
(ii) Show that |||αA||| = |α| |||A|||, α ∈ R, and |||A + B||| ≤ |||A||| + |||B|||.

30. In the inner product spaces (V, (·,·)) and (W, [·,·]), consider A ∈ £(V, V) and B ∈ £(W, W), which are both self-adjoint. Write these in spectral form as

A = Σ_{i=1}^m λ_i x_i □ x_i,   B = Σ_{j=1}^n μ_j w_j □ w_j.

(Note: The symbol □ has a different meaning in these two equations since the definition of □ depends on the inner product.) Of course, {x_1,..., x_m} is an orthonormal basis for (V, (·,·)) and {w_1,..., w_n} is an orthonormal basis for (W, [·,·]). Also, {x_i □ w_j | i = 1,..., m, j = 1,..., n} is an orthonormal basis for (£(W, V), ⟨·,·⟩), and A ⊗ B is a linear transformation on £(W, V) to £(W, V).
(i) Show that (A ⊗ B)(x_i □ w_j) = λ_i μ_j (x_i □ w_j), so λ_i μ_j is an eigenvalue of A ⊗ B.
(ii) Show that A ⊗ B = Σ_i Σ_j λ_i μ_j (x_i □ w_j) □ (x_i □ w_j) and that this is a spectral decomposition for A ⊗ B. What are the eigenvalues and corresponding eigenvectors of A ⊗ B?
(iii) If A and B are positive definite (semidefinite), show that A ⊗ B is positive definite (semidefinite).
(iv) Show that tr A ⊗ B = (tr A)(tr B) and det A ⊗ B = (det A)^n (det B)^m.
31. Let x_1,..., x_p be linearly independent vectors in R^n, set M = span{x_1,..., x_p}, and let A: n × p have columns x_1,..., x_p. Thus R(A) = M and A'A is positive definite.
(i) Show that Ψ = A(A'A)^{-1/2} is a linear isometry whose columns form an orthonormal basis for M. Here, (A'A)^{-1/2} denotes the inverse of the positive definite square root of A'A.
(ii) Show that ΨΨ' = A(A'A)^{-1}A' is the orthogonal projection onto M.
32. Consider two subspaces, M_1 and M_2, of R^n with bases x_1,..., x_q and y_1,..., y_r. Let A (B) have columns x_1,..., x_q (y_1,..., y_r). Then P_1 = A(A'A)^{-1}A' and P_2 = B(B'B)^{-1}B' are the orthogonal projections onto M_1 and M_2, respectively. The cosines of the angles between M_1 and M_2 can be obtained by computing the nonzero eigenvalues of P_1 P_2 P_1. Show that these are the same as the nonzero eigenvalues of

(A'A)^{-1} A'B (B'B)^{-1} B'A: q × q

and of

(B'B)^{-1} B'A (A'A)^{-1} A'B: r × r.
33. In R^4, set x_1 = (1, 0, 0, 0), x_2 = (0, 1, 0, 0), y_1 = (1, 1, 1, 1), and y_2 = (1, −1, 1, −1). Find the cosines of the angles between M_1 = span{x_1, x_2} and M_2 = span{y_1, y_2}.
34. For two subspaces M_1 and M_2 of (V, (·,·)), argue that the angles between M_1 and M_2 are the same as the angles between Γ(M_1) and Γ(M_2) for any Γ ∈ O(V).
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
NOTES AND REFERENCES 69
35. This problem has to do with the vector space V of Example 1.9, and V may be infinite dimensional. The results in this problem are not used in the sequel. Write X_1 ≡ X_2 if X_1 = X_2 a.e. (P_0) for X_1 and X_2 in V. It is easy to verify that ≡ is an equivalence relation on V. Let M = {X | X ∈ V, X = 0 a.e. (P_0)} so X_1 ≡ X_2 iff X_1 − X_2 ∈ M. Let L^2 be the set of equivalence classes in V.
(i) Show that L^2 is a real vector space with the obvious definitions of addition and scalar multiplication.
Define (·,·) on L^2 by (y_1, y_2) = E X_1 X_2 where X_i is an element of the equivalence class y_i, i = 1, 2.
(ii) Show that (·,·) is well defined and is an inner product on L^2.
Now, let F_0 be a sub-σ-algebra of F. For y ∈ L^2, let Py denote the equivalence class of the conditional expectation given F_0 of any element in y.
(iii) Show that P is well defined and is a linear transformation on L^2 to L^2.
Let N be the set of equivalence classes of all F_0-measurable functions in V. Clearly, N is a subspace of L^2.
(iv) Show that P^2 = P, P is the identity on N, and R(P) = N. Also show that P is self-adjoint, that is, (y_1, P y_2) = (P y_1, y_2). Would you say that P is the orthogonal projection onto N?
NOTES AND REFERENCES
1. The first half of this chapter follows Halmos (1958) very closely. After
this, the material was selected primarily for its use in later chapters. The
material on outer products and Kronecker products follows the author's tastes more than anything else.
2. The detailed discussion of angles between subspaces resulted from unsuccessful attempts to find a source that meshed with the treatment of canonical correlations given in Chapter 10. A different development can be found in Dempster (1969, Chapter 5).
3. Besides Halmos (1958) and Hoffman and Kunze (1971), I have found the book by Noble and Daniel (1977) useful for standard material on
linear algebra.
4. Rao (1973, Chapter 1) gives many useful linear algebra facts not
discussed here.
CHAPTER 2
Random Vectors
The basic object of study in this book is the random vector and its induced distribution in an inner product space. Here, utilizing the results outlined in Chapter 1, we introduce random vectors, mean vectors, and covariances. Characteristic functions are discussed and used to give the well known factorization criterion for the independence of random vectors. Two special classes of distributions, the orthogonally invariant distributions and the weakly spherical distributions, are used for illustrative purposes. The vector spaces that occur in this chapter are all finite dimensional.
2.1. RANDOM VECTORS
Before a random vector can be defined, it is necessary to first introduce the Borel sets of a finite dimensional inner product space (V, (·,·)). Setting ‖x‖ = (x, x)^{1/2}, the open ball of radius r about x_0 is the set defined by S_r(x_0) = {x | ‖x − x_0‖ < r}.
Definition 2.1. The Borel σ-algebra of (V, (·,·)), denoted by B(V), is the smallest σ-algebra that contains all of the open balls.
Since any two inner products on V are related by a positive definite linear transformation, it follows that B(V) does not depend on the inner product on V; that is, if we start with two inner products on V and use these inner products to generate a Borel σ-algebra, the two σ-algebras are the same. Thus we simply call B(V) the Borel σ-algebra of V without mentioning the inner product.
A probability space is a triple (Ω, F, P_0) where Ω is a set, F is a σ-algebra of subsets of Ω, and P_0 is a probability measure defined on F.
Definition 2.2. A random vector X ∈ V is a function mapping Ω into V such that X^{-1}(B) ∈ F for each Borel set B ∈ B(V). Here, X^{-1}(B) is the inverse image of the set B.
Since the space on which a random vector is defined is usually not of interest here, the argument of a random vector X is ordinarily suppressed. Further, it is the induced distribution of X on V that most interests us. To define this, consider a random vector X defined on Ω to V where (Ω, F, P_0) is a probability space. For each Borel set B ∈ B(V), let Q(B) = P_0(X^{-1}(B)). Clearly, Q is a probability measure on B(V) and Q is called the induced distribution of X; that is, Q is induced by X and P_0. The following result shows that any probability measure Q on B(V) is the induced distribution of some random vector.
Proposition 2.1. Let Q be a probability measure on B(V) where V is a finite dimensional inner product space. Then there exist a probability space (Ω, F, P_0) and a random vector X on Ω to V such that Q is the induced distribution of X.

Proof. Take Ω = V, F = B(V), P_0 = Q, and let X(ω) = ω for ω ∈ V. Clearly, the induced distribution of X is Q. ∎
Henceforth, we write things like: "Let X be a random vector in V with distribution Q," to mean that X is a random vector and its induced distribution is Q. Alternatively, the notation £(X) = Q is also used; this is read: "The distributional law of X is Q."

A function f defined on V to W is called Borel measurable if the inverse image of each set B ∈ B(W) is in B(V). Of course, if X is a random vector in V, then f(X) is a random vector in W when f is Borel measurable. In particular, when f is continuous, f is Borel measurable. If W = R and f is Borel measurable on V to R, then f(X) is a real-valued random variable.

Definition 2.3. Suppose X is a random vector in V with distribution Q and f is a real-valued Borel measurable function defined on V. If ∫_V |f(x)| Q(dx) < +∞, then we say that f(X) has finite expectation and we write E f(X) for ∫_V f(x) Q(dx).
In the above definition and throughout this book, all integrals are Lebesgue integrals, and all functions are assumed Borel measurable.
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
72 RANDOM VECTORS
* Example 2.1. Take V to be the coordinate space Rⁿ with the usual inner product (·,·) and let dx denote standard Lebesgue measure on Rⁿ. If q is a non-negative function on Rⁿ such that ∫q(x) dx = 1, then q is called a density function. It is clear that the measure Q given by Q(B) = ∫_B q(x) dx is a probability measure on Rⁿ, so Q is the distribution of some random vector X on Rⁿ. If ε₁,...,εₙ is the standard basis for Rⁿ, then (εᵢ, X) = Xᵢ is the ith coordinate of X. Assume that Xᵢ has a finite expectation for i = 1,...,n. Then EXᵢ = ∫_{Rⁿ} (εᵢ, x)q(x) dx = μᵢ is called the mean value of Xᵢ and the vector μ ∈ Rⁿ with coordinates μ₁,...,μₙ is the mean vector of X. Notice that for any vector x ∈ Rⁿ, E(x, X) = E(Σxᵢεᵢ, X) = ΣxᵢE(εᵢ, X) = Σxᵢμᵢ = (x, μ). Thus the mean vector μ satisfies the equation E(x, X) = (x, μ) for all x ∈ Rⁿ and μ is clearly unique. It is exactly this property of μ that we use to define the mean vector of a random vector in an arbitrary inner product space V.
Suppose X is a random vector in an inner product space (V, (·,·)) and assume that for each x ∈ V, the random variable (x, X) has a finite expectation. Let f(x) = E(x, X), so f is a real-valued function defined on V. Also, f(α₁x₁ + α₂x₂) = E(α₁x₁ + α₂x₂, X) = E[α₁(x₁, X) + α₂(x₂, X)] = α₁E(x₁, X) + α₂E(x₂, X) = α₁f(x₁) + α₂f(x₂). Thus f is a linear function on V. Therefore, there exists a unique vector μ ∈ V such that f(x) = (x, μ) for all x ∈ V. Summarizing, there exists a unique vector μ ∈ V that satisfies E(x, X) = (x, μ) for all x ∈ V. The vector μ is called the mean vector of X and is denoted by EX. This notation leads to the suggestive equation E(x, X) = (x, EX), which we know is valid in the coordinate case.
Proposition 2.2. Suppose X ∈ (V, (·,·)) and assume X has a mean vector μ. Let (W, [·,·]) be an inner product space and consider A ∈ L(V, W) and w₀ ∈ W. Then the random vector Y = AX + w₀ has the mean vector Aμ + w₀; that is, EY = A(EX) + w₀.
Proof. The proof is a computation. For w ∈ W,

E[w, Y] = E[w, AX + w₀] = E[w, AX] + [w, w₀]
= E(A'w, X) + [w, w₀] = (A'w, μ) + [w, w₀]
= [w, Aμ] + [w, w₀] = [w, Aμ + w₀].

Thus Aμ + w₀ satisfies the defining equation for the mean vector of Y and by the uniqueness of mean vectors, EY = Aμ + w₀. □
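For a distribution with finite support, the defining property of the mean vector and the affine rule of Proposition 2.2 can be verified exactly, since all expectations reduce to weighted sums. Below is a minimal numerical sketch; the support points, probabilities, and the map A are illustrative choices, not from the text.

```python
import numpy as np

# Hypothetical discrete distribution on R^2: three support points
# with the given probabilities (illustrative numbers only).
values = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
probs = np.array([0.5, 0.3, 0.2])

# The mean vector mu is characterized by E(x, X) = (x, mu) for all x.
mu = probs @ values

x = np.array([2.0, -1.0])
assert np.isclose(probs @ (values @ x), x @ mu)   # E(x, X) = (x, mu)

# Proposition 2.2: Y = AX + w0 has mean vector A mu + w0.
A = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]])   # A in L(R^2, R^3)
w0 = np.array([1.0, 1.0, 1.0])
EY = probs @ (values @ A.T + w0)                  # direct expectation of Y
assert np.allclose(EY, A @ mu + w0)
```

The two assertions are exact (up to floating point): the first is the defining equation of μ, the second is the conclusion of Proposition 2.2.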
If X₁ and X₂ are both random vectors in (V, (·,·)) which have mean vectors, then it is easy to show that E(X₁ + X₂) = EX₁ + EX₂. The following proposition shows that the mean vector μ of a random vector does not depend on the inner product on V.
Proposition 2.3. If X is a random vector in (V, (·,·)) with mean vector μ satisfying E(x, X) = (x, μ) for all x ∈ V, then μ satisfies Ef(x, X) = f(x, μ) for every bilinear function f on V × V.

Proof. Every bilinear function f is given by f(x₁, x₂) = (x₁, Ax₂) for some A ∈ L(V, V). Thus Ef(x, X) = E(x, AX) = (x, Aμ) = f(x, μ), where the second equality follows from Proposition 2.2. □
When the bilinear function f is an inner product on V, the above result
establishes that the mean vector is inner product free. At times, a convenient
choice of an inner product can simplify the calculation of a mean vector.
The definition and basic properties of the covariance between two real-valued random variables were covered in Example 1.9. Before defining the covariance of a random vector, a review of covariance matrices for
coordinate random vectors in Rn is in order.
* Example 2.2. In the notation of Example 2.1, consider a random vector X in Rⁿ with coordinates Xᵢ = (εᵢ, X) where ε₁,...,εₙ is the standard basis for Rⁿ and (·,·) is the standard inner product. Assume that EXᵢ² < +∞, i = 1,...,n. Then cov(Xᵢ, Xⱼ) = σᵢⱼ exists for all i, j = 1,...,n. Let Σ be the n × n matrix with elements σᵢⱼ. Of course, σᵢᵢ is the variance of Xᵢ and σᵢⱼ is the covariance between Xᵢ and Xⱼ. The symmetric matrix Σ is called the covariance matrix of X. Consider vectors x, y ∈ Rⁿ with coordinates xᵢ and yᵢ, i = 1,...,n. Then

cov{(x, X), (y, X)} = cov{Σᵢ xᵢXᵢ, Σⱼ yⱼXⱼ}
= ΣᵢΣⱼ xᵢyⱼ cov(Xᵢ, Xⱼ) = ΣᵢΣⱼ xᵢyⱼσᵢⱼ
= (x, Σy).

Hence cov{(x, X), (y, X)} = (x, Σy). It is this property of Σ that is used to define the covariance of a random vector.

With the above example in mind, consider a random vector X in an inner product space (V, (·,·)) and assume that E(x, X)² < ∞ for all x ∈ V. Thus (x, X) has a finite variance and the covariance between (x, X) and (y, X) is well defined for each x, y ∈ V.
Proposition 2.4. For x, y ∈ V, define f(x, y) by

f(x, y) = cov{(x, X), (y, X)}.

Then f is a non-negative definite bilinear function on V × V.

Proof. Clearly, f(x, y) = f(y, x) and f(x, x) = var{(x, X)} ≥ 0, so it remains to show that f is bilinear. Since f is symmetric, it suffices to verify that f(α₁x₁ + α₂x₂, y) = α₁f(x₁, y) + α₂f(x₂, y). This verification goes as follows:

f(α₁x₁ + α₂x₂, y) = cov{(α₁x₁ + α₂x₂, X), (y, X)}
= cov{α₁(x₁, X) + α₂(x₂, X), (y, X)}
= α₁ cov{(x₁, X), (y, X)} + α₂ cov{(x₂, X), (y, X)}
= α₁f(x₁, y) + α₂f(x₂, y).
By Proposition 1.26, there exists a unique non-negative definite linear transformation Σ such that f(x, y) = (x, Σy). □
Definition 2.4. The unique non-negative definite linear transformation Σ on V to V that satisfies

cov{(x, X), (y, X)} = (x, Σy)

is called the covariance of X and is denoted by Cov(X).
Implicit in the above definition is the assumption that E(x, X)² < +∞ for all x ∈ V. Whenever we discuss covariances of random vectors, E(x, X)² is always assumed finite.
It should be emphasized that the covariance of a random vector in
(V, (·,·)) depends on the given inner product. The next result shows how
the covariance changes as a function of the inner product.
Proposition 2.5. Consider a random vector X in (V, (·,·)) and suppose Cov(X) = Σ. Let [·,·] be another inner product on V given by [x, y] = (x, Ay) where A is positive definite on (V, (·,·)). Then the covariance of X in the inner product space (V, [·,·]) is ΣA.
Proof. To verify that ΣA is the covariance for X in (V, [·,·]), we must show that cov{[x, X], [y, X]} = [x, ΣAy] for all x, y ∈ V. To do this, use the definition of [·,·] and compute:

cov{[x, X], [y, X]} = cov{(x, AX), (y, AX)} = cov{(Ax, X), (Ay, X)}
= (Ax, ΣAy) = (x, AΣAy) = [x, ΣAy]. □
Two immediate consequences of Proposition 2.5 are: (i) if Cov(X) exists in one inner product, then it exists in all inner products, and (ii) if Cov(X) = Σ in (V, (·,·)) and if Σ is positive definite, then the covariance of X in the inner product [x, y] = (x, Σ⁻¹y) is the identity linear transformation. The result below often simplifies a computation involving the derivation of a covariance.
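Consequence (ii) is the familiar whitening device. In matrix terms (a sketch with an arbitrary positive definite Σ, not from the text), the covariance computed in the new inner product [x, y] = (x, Ay) is ΣA, and the choice A = Σ⁻¹ yields the identity:

```python
import numpy as np

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative, positive definite
A = np.linalg.inv(Sigma)                     # new inner product [x, y] = (x, A y)

# Proposition 2.5: the covariance in (V, [.,.]) is Sigma A; here it is I.
assert np.allclose(Sigma @ A, np.eye(2))

# Defining equation: cov{[x,X],[y,X]} = (Ax, Sigma A y) = [x, (Sigma A) y],
# an exact matrix identity since A is self-adjoint.
x = np.array([1.0, 2.0]); y = np.array([3.0, -1.0])
assert np.isclose((A @ x) @ Sigma @ (A @ y), x @ A @ (Sigma @ A @ y))
```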
Proposition 2.6. Suppose Cov(X) = Σ in (V, (·,·)). If Σ₁ is a self-adjoint linear transformation on (V, (·,·)) to (V, (·,·)) that satisfies

(2.1) var{(x, X)} = (x, Σ₁x) for x ∈ V,

then Σ₁ = Σ.

Proof. Equation (2.1) implies that (x, Σ₁x) = (x, Σx), x ∈ V. Since Σ₁ and Σ are self-adjoint, Proposition 1.16 yields the conclusion Σ₁ = Σ. □
When Cov(X) = Σ is singular, then the random vector X takes values in the translate of a subspace of (V, (·,·)). To make this precise, let us consider the following.
Proposition 2.7. Let X be a random vector in (V, (·,·)) and suppose Cov(X) = Σ exists. With μ = EX and R(Σ) denoting the range of Σ,

P{X ∈ R(Σ) + μ} = 1.
Proof. The set R(Σ) + μ is the set of vectors of the form x + μ for x ∈ R(Σ); that is, R(Σ) + μ is the translate, by μ, of the subspace R(Σ). The statement P{X ∈ R(Σ) + μ} = 1 is equivalent to the statement P{X − μ ∈ R(Σ)} = 1. The random vector Y = X − μ has mean zero and, by Proposition 2.6, Cov(Y) = Cov(X) = Σ since var{(x, X − μ)} = var{(x, X)} for x ∈ V. Thus it must be shown that P{Y ∈ R(Σ)} = 1. If Σ is nonsingular, then R(Σ) = V and there is nothing to show. Thus assume that the null space of Σ, N(Σ), has dimension k > 0 and let {x₁,...,x_k} be an orthonormal basis for N(Σ). Since R(Σ) and N(Σ) are perpendicular and R(Σ) ⊕ N(Σ) = V, a vector x is not in R(Σ) iff for some index i = 1,...,k, (xᵢ, x) ≠ 0. Thus

P{Y ∉ R(Σ)} = P{(xᵢ, Y) ≠ 0 for some i = 1,...,k} ≤ Σᵢ₌₁ᵏ P{(xᵢ, Y) ≠ 0}.

But (xᵢ, Y) has mean zero and var{(xᵢ, Y)} = (xᵢ, Σxᵢ) = 0 since xᵢ ∈ N(Σ). Thus (xᵢ, Y) is zero with probability one, so P{(xᵢ, Y) ≠ 0} = 0. Therefore P{Y ∉ R(Σ)} = 0. □
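Proposition 2.7 can be seen in a small simulation. With a rank-one covariance Σ = vv' (an illustrative construction, not from the text), every draw of X − μ lies in R(Σ) = span{v}, so its inner product with any vector in the null space N(Σ) vanishes:

```python
import numpy as np

rng = np.random.default_rng(1)
v = np.array([1.0, 2.0, -1.0])
Sigma = np.outer(v, v)               # rank-one covariance; R(Sigma) = span{v}
mu = np.array([1.0, 0.0, 3.0])

# X = mu + Z v with Z standard normal has mean mu and covariance Sigma.
Z = rng.standard_normal(10_000)
X = mu + Z[:, None] * v

# Vectors orthogonal to v span the null space N(Sigma).
x1 = np.array([2.0, -1.0, 0.0])      # (x1, v) = 0
x2 = np.array([1.0, 0.0, 1.0])       # (x2, v) = 0
assert np.allclose((X - mu) @ x1, 0.0)   # (x_i, X - mu) = 0 with probability one
assert np.allclose((X - mu) @ x2, 0.0)
```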
Proposition 2.2 describes how the mean vector changes under linear transformations. The next result shows what happens to the covariance
under linear transformations.
Proposition 2.8. Suppose X is a random vector in (V, (·,·)) with Cov(X) = Σ. If A ∈ L(V, W) where (W, [·,·]) is an inner product space, then

Cov(AX + w₀) = AΣA'

for all w₀ ∈ W.

Proof. By Proposition 2.6, it suffices to show that for each w ∈ W, var[w, AX + w₀] = [w, AΣA'w]. However,

var[w, AX + w₀] = var{[w, AX] + [w, w₀]} = var[w, AX]
= var(A'w, X) = (A'w, ΣA'w) = [w, AΣA'w].

Thus Cov(AX + w₀) = AΣA'. □
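Proposition 2.8 has an exact finite-sample analogue: the empirical covariance of a data matrix transforms by A(·)A' under any affine map, because the empirical distribution is itself a probability distribution. A sketch (all matrices illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5_000, 3))                     # samples in R^3
A = np.array([[1.0, -1.0, 0.0], [2.0, 0.0, 1.0]])       # A in L(R^3, R^2)
w0 = np.array([5.0, -3.0])

S = np.cov(X, rowvar=False)                             # empirical Cov(X)
S_Y = np.cov(X @ A.T + w0, rowvar=False)                # empirical Cov(AX + w0)
assert np.allclose(S_Y, A @ S @ A.T)                    # exact; w0 drops out
```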
2.2. INDEPENDENCE OF RANDOM VECTORS
With the basic properties of mean vectors and covariances established, the
next topic of discussion is characteristic functions and independence of
random vectors. Let X be a random vector in (V, (·,·)) with distribution Q.
Definition 2.5. The complex valued function on V defined by

φ(v) = E exp[i(v, X)] = ∫_V exp[i(v, x)] Q(dx)

is the characteristic function of X.
In the above definition, e^{it} = cos t + i sin t where i = √−1 and t ∈ R. Since e^{it} is a bounded continuous function of t, characteristic functions are well defined for all distributions Q on (V, (·,·)). Forthcoming applications of characteristic functions include the derivation of distributions of certain functions of random vectors and a characterization of the independence of two or more random vectors.
One basic property of characteristic functions is their uniqueness; that is, if Q₁ and Q₂ are probability distributions on (V, (·,·)) with characteristic functions φ₁ and φ₂, and if φ₁(x) = φ₂(x) for all x ∈ V, then Q₁ = Q₂. A proof of this is based on the multidimensional Fourier inversion formula, which can be found in Cramér (1946). A consequence of this uniqueness is that, if X₁ and X₂ are random vectors in (V, (·,·)) such that L((x, X₁)) = L((x, X₂)) for all x ∈ V, then L(X₁) = L(X₂). This follows by observing that L((x, X₁)) = L((x, X₂)) for all x implies the characteristic functions of X₁ and X₂ are the same and hence their distributions are the same.
To define independence, consider a probability space (Ω, F, P₀) and let X ∈ (V, (·,·)) and Y ∈ (W, [·,·]) be two random vectors defined on Ω.
Definition 2.6. The random vectors X and Y are independent if for any Borel sets B₁ ∈ B(V) and B₂ ∈ B(W),

P₀{X⁻¹(B₁) ∩ Y⁻¹(B₂)} = P₀{X⁻¹(B₁)}P₀{Y⁻¹(B₂)}.
In order to describe what independence means in terms of the induced distributions of X ∈ (V, (·,·)) and Y ∈ (W, [·,·]), it is necessary to define what is meant by the joint induced distribution of X and Y. The natural vector space in which to have X and Y take values is the direct sum V ⊕ W defined in Chapter 1. For {vᵢ, wᵢ} ∈ V ⊕ W, i = 1, 2, define the inner product ⟨·,·⟩ by

⟨{v₁, w₁}, {v₂, w₂}⟩ = (v₁, v₂) + [w₁, w₂].

That ⟨·,·⟩ is an inner product on V ⊕ W is routine to check. Thus {X, Y} takes values in the inner product space V ⊕ W. However, it must be shown that {X, Y} is a Borel measurable function. Briefly, this argument goes as follows. The space V ⊕ W is a Cartesian product space; that is, V ⊕ W consists of all pairs {v, w} with v ∈ V and w ∈ W. Thus one way to get a σ-algebra on V ⊕ W is to form the product σ-algebra B(V) × B(W), which is the smallest σ-algebra containing all the product Borel sets B₁ × B₂ ⊆ V ⊕ W where B₁ ∈ B(V) and B₂ ∈ B(W). It is not hard to verify that inverse images, under {X, Y}, of sets in B(V) × B(W) are in the σ-algebra F. But the product σ-algebra B(V) × B(W) is just the σ-algebra B(V ⊕ W) defined earlier. Thus {X, Y} ∈ V ⊕ W is a random vector and hence has an
induced distribution Q defined on B(V ⊕ W). In addition, let Q₁ be the induced distribution of X on B(V) and let Q₂ be the induced distribution of Y on B(W). It is clear that Q₁(B₁) = Q(B₁ × W) for B₁ ∈ B(V) and Q₂(B₂) = Q(V × B₂) for B₂ ∈ B(W). Also, the characteristic function of {X, Y} ∈ V ⊕ W is

φ({v, w}) = E exp[i⟨{v, w}, {X, Y}⟩] = E exp(i(v, X) + i[w, Y])

and the marginal characteristic functions of X and Y are

φ₁(v) = E exp[i(v, X)]

and

φ₂(w) = E exp(i[w, Y]).
Proposition 2.9. Given random vectors X ∈ (V, (·,·)) and Y ∈ (W, [·,·]), the following are equivalent:

(i) X and Y are independent.
(ii) Q(B₁ × B₂) = Q₁(B₁)Q₂(B₂) for all B₁ ∈ B(V) and B₂ ∈ B(W).
(iii) φ({v, w}) = φ₁(v)φ₂(w) for all v ∈ V and w ∈ W.
Proof. By definition,

Q(B₁ × B₂) = P₀{{X, Y} ∈ B₁ × B₂} = P₀{X ∈ B₁, Y ∈ B₂}.

The equivalence of (i) and (ii) follows immediately from the above equation. To show (ii) implies (iii), first note that, if f₁ and f₂ are integrable complex valued functions on V and W, then when (ii) holds,

∫_{V⊕W} f₁(v)f₂(w) Q(dv, dw) = ∫∫ f₁(v)f₂(w) Q₁(dv)Q₂(dw)
= ∫ f₁(v) Q₁(dv) ∫ f₂(w) Q₂(dw)

by Fubini's Theorem (see Chung, 1968). Taking f₁(v) = exp[i(v₁, v)] for v₁, v ∈ V, and f₂(w) = exp(i[w₁, w]) for w₁, w ∈ W, we have

φ({v₁, w₁}) = ∫ exp(i(v₁, v) + i[w₁, w]) Q(dv, dw)
= ∫ exp[i(v₁, v)] Q₁(dv) ∫ exp(i[w₁, w]) Q₂(dw) = φ₁(v₁)φ₂(w₁).
Thus (ii) implies (iii). For (iii) implies (ii), note that the product measure Q₁ × Q₂ has characteristic function φ₁φ₂. The uniqueness of characteristic functions then implies that Q = Q₁ × Q₂. □
Of course, all of the discussion above extends to the case of more than two random vectors. For completeness, we briefly describe the situation.
Given a probability space (Ω, F, P₀) and random vectors Xⱼ ∈ (Vⱼ, (·,·)ⱼ), j = 1,...,k, let Qⱼ be the induced distribution of Xⱼ and let φⱼ be the characteristic function of Xⱼ. The random vectors X₁,...,X_k are independent if for all Bⱼ ∈ B(Vⱼ),

P₀{Xⱼ ∈ Bⱼ, j = 1,...,k} = Πⱼ₌₁ᵏ P₀{Xⱼ ∈ Bⱼ}.

To construct one random vector from X₁,...,X_k, consider the direct sum V₁ ⊕ ··· ⊕ V_k with the inner product ⟨·,·⟩ = Σⱼ₌₁ᵏ (·,·)ⱼ. In other words, if {v₁,...,v_k} and {w₁,...,w_k} are elements of V₁ ⊕ ··· ⊕ V_k, then the inner product between these vectors is Σⱼ₌₁ᵏ (vⱼ, wⱼ)ⱼ. An argument analogous to that given earlier shows that {X₁,...,X_k} is a random vector in V₁ ⊕ ··· ⊕ V_k and the Borel σ-algebra of V₁ ⊕ ··· ⊕ V_k is just the product σ-algebra B(V₁) × ··· × B(V_k). If Q denotes the induced distribution of {X₁,...,X_k}, then the independence of X₁,...,X_k is equivalent to the assertion that

Q(B₁ × ··· × B_k) = Πⱼ₌₁ᵏ Qⱼ(Bⱼ)

for all Bⱼ ∈ B(Vⱼ), j = 1,...,k, and this is equivalent to

E exp[i Σⱼ₌₁ᵏ (vⱼ, Xⱼ)ⱼ] = Πⱼ₌₁ᵏ φⱼ(vⱼ).

Of course, when X₁,...,X_k are independent and fⱼ is an integrable real valued function on Vⱼ, j = 1,...,k, then

E Πⱼ₌₁ᵏ fⱼ(Xⱼ) = Πⱼ₌₁ᵏ Efⱼ(Xⱼ).

This equality follows from the fact that

Q(B₁ × ··· × B_k) = Πⱼ₌₁ᵏ Qⱼ(Bⱼ)

and Fubini's Theorem.
* Example 2.3. Consider the coordinate space Rᵖ with the usual inner product and let Q₀ be a fixed distribution on Rᵖ. Suppose X₁,...,Xₙ are independent with each Xᵢ ∈ Rᵖ, i = 1,...,n, and L(Xᵢ) = Q₀. That is, there is a probability space (Ω, F, P₀), each Xᵢ is a random vector on Ω with values in Rᵖ, and for Borel sets,

P₀{Xᵢ ∈ Bᵢ, i = 1,...,n} = Πᵢ₌₁ⁿ Q₀(Bᵢ).

Thus {X₁,...,Xₙ} is a random vector in the direct sum Rᵖ ⊕ ··· ⊕ Rᵖ with n terms in the sum. However, there are a variety of ways to think about the above direct sum. One possibility is to form the coordinate random vector

Y = (X₁', X₂',...,Xₙ')' ∈ Rⁿᵖ

obtained by stacking X₁,...,Xₙ, and simply consider Y as a random vector in Rⁿᵖ with the usual inner product. A disadvantage of this representation is that the independence of X₁,...,Xₙ becomes slightly camouflaged by the notation. An alternative is to form the random matrix X ∈ L_{p,n} whose rows are X₁',...,Xₙ'. Thus X has rows Xᵢ', i = 1,...,n, which are independent and each has distribution Q₀. The inner product on L_{p,n} is just that inherited from the standard inner products on Rⁿ and Rᵖ. Therefore X is a random vector in the inner product space (L_{p,n}, ⟨·,·⟩). In the sequel, we ordinarily represent X₁,...,Xₙ by the random vector X ∈ L_{p,n}. The advantages of this representation are far from clear at this point, but the reader should be convinced by the end of this book that such a choice is not unreasonable. The derivation of the mean and covariance of X ∈ L_{p,n} given in the next section should provide some evidence that the above representation is useful.
2.3. SPECIAL COVARIANCE STRUCTURES
In this section, we derive the covariances of some special random vectors. The orthogonally invariant probability distributions on a vector space are shown to have covariances that are a constant times the identity transformation. In addition, the covariance of the random vector given in Example 2.3 is shown to be a Kronecker product. The final example provides an expression for the covariance of an outer product of a random vector with itself. Suppose (V, (·,·)) is an inner product space and recall that O(V) is the group of orthogonal transformations on V to V.

Definition 2.7. A random vector X in (V, (·,·)) with distribution Q has an orthogonally invariant distribution if L(X) = L(ΓX) for all Γ ∈ O(V), or equivalently if Q(B) = Q(ΓB) for all Borel sets B and Γ ∈ O(V).
Many properties of orthogonally invariant distributions follow from the following proposition.
Proposition 2.10. Let x₀ ∈ V with ||x₀|| = 1. If L(X) = L(ΓX) for Γ ∈ O(V), then for x ∈ V, L((x, X)) = L(||x||(x₀, X)).

Proof. The assertion is that the distribution of the real-valued random variable (x, X) is the same as the distribution of ||x||(x₀, X). Thus knowing the distribution of (x, X) for one particular nonzero x ∈ V gives us the distribution of (x, X) for all x ∈ V. If x = 0, the assertion of the proposition is trivial. For x ≠ 0, choose Γ ∈ O(V) such that Γx₀ = x/||x||. This is possible since x₀ and x/||x|| both have norm 1. Thus

L((x, X)) = L(||x||(Γx₀, X)) = L(||x||(x₀, Γ'X)) = L(||x||(x₀, X)),

where the last equality follows from the assumption that L(X) = L(ΓX) for all Γ ∈ O(V) and the fact that Γ ∈ O(V) implies Γ' ∈ O(V). □
Proposition 2.11. Let x₀ ∈ V with ||x₀|| = 1. Suppose the distribution of X is orthogonally invariant. Then:

(i) φ(x) = E exp[i(x, X)] = φ(||x||x₀).
(ii) If EX exists, then EX = 0.
(iii) If Cov(X) exists, then Cov(X) = σ²I where σ² = var{(x₀, X)}, and I is the identity linear transformation.
Proof. Assertion (i) follows from Proposition 2.10 and

E exp[i(x, X)] = E exp[i||x||(x₀, X)] = E exp[i(||x||x₀, X)] = φ(||x||x₀).

For (ii), let μ = EX. Since L(X) = L(ΓX), μ = EX = E(ΓX) = Γ(EX) = Γμ for all Γ ∈ O(V). The only vector μ that satisfies μ = Γμ for all Γ ∈ O(V) is μ = 0. To prove (iii), we must show that σ²I satisfies the defining equation for Cov(X). But by Proposition 2.10,

var{(x, X)} = var{||x||(x₀, X)} = ||x||² var{(x₀, X)} = σ²(x, x) = (x, σ²Ix),

so Cov(X) = σ²I by Proposition 2.6. □
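The standard normal distribution on Rⁿ is orthogonally invariant (this is verified in Example 2.4 below), so Proposition 2.11 can be checked by simulation. The sketch below verifies EX ≈ 0, Cov(X) ≈ I, and that the variance of (x, X) depends on x only through ||x||; sample sizes and tolerances are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200_000, 3))   # orthogonally invariant on R^3

# (ii) EX = 0 and (iii) Cov(X) = sigma^2 I with sigma^2 = var{(x0, X)} = 1.
assert np.allclose(X.mean(axis=0), 0.0, atol=0.02)
assert np.allclose(np.cov(X, rowvar=False), np.eye(3), atol=0.02)

# Proposition 2.10: (x, X) has the law of ||x||(x0, X), so its variance
# is ||x||^2 sigma^2 for every direction x.
x = np.array([1.0, 2.0, 2.0])           # ||x|| = 3
assert abs(np.var(X @ x) - 9.0) < 0.2
```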
Assertion (i) of Proposition 2.11 shows that the characteristic function φ of an orthogonally invariant distribution satisfies φ(Γx) = φ(x) for all x ∈ V and Γ ∈ O(V). Any function f defined on V and taking values in some set is called orthogonally invariant if f(x) = f(Γx) for all Γ ∈ O(V). A characterization of orthogonally invariant functions is given by the following proposition.
Proposition 2.12. A function f defined on (V, (·,·)) is orthogonally invariant iff f(x) = f(||x||x₀) where x₀ ∈ V, ||x₀|| = 1.

Proof. If f(x) = f(||x||x₀), then f(Γx) = f(||Γx||x₀) = f(||x||x₀) = f(x), so f is orthogonally invariant. Conversely, suppose f is orthogonally invariant and x₀ ∈ V with ||x₀|| = 1. For x = 0, f(0) = f(||x||x₀) since ||x|| = 0. If x ≠ 0, let Γ ∈ O(V) be such that Γx₀ = x/||x||. Then f(x) = f(Γ||x||x₀) = f(||x||x₀). □
If X has an orthogonally invariant distribution in (V, (·,·)) and h is a function on R to R, then

f(x) = Eh((x, X))

clearly satisfies f(Γx) = f(x) for Γ ∈ O(V). Thus f(x) = f(||x||x₀) = Eh(||x||(x₀, X)), so to calculate f(x), one only needs to calculate f(ax₀) for a ∈ [0, ∞). We have more to say about orthogonally invariant distributions in later chapters.
A random vector X ∈ (V, (·,·)) is called orthogonally invariant about x₀ if X − x₀ has an orthogonally invariant distribution. It is not difficult to show, using characteristic functions, that if X is orthogonally invariant about both x₀ and x₁, then x₀ = x₁. Further, if X is orthogonally invariant about x₀ and if EX exists, then E(X − x₀) = 0 by Proposition 2.11. Thus x₀ = EX when EX exists.
It has been shown that if X has an orthogonally invariant distribution and if Cov(X) exists, then Cov(X) = σ²I for some σ² ≥ 0. Of course there are distributions other than orthogonally invariant distributions for which the covariance is a constant times the identity. Such distributions arise in the chapter on linear models.
Definition 2.8. If X ∈ (V, (·,·)) and

Cov(X) = σ²I for some σ² ≥ 0,

X has a weakly spherical distribution.
The justification for the above definition is provided by Proposition 2.13.
Proposition 2.13. Suppose X is a random vector in (V, (·,·)) and Cov(X) exists. The following are equivalent:

(i) Cov(X) = σ²I for some σ² ≥ 0.
(ii) Cov(X) = Cov(ΓX) for all Γ ∈ O(V).
Proof. That (i) implies (ii) follows from Proposition 2.8. To show (ii) implies (i), let Σ = Cov(X). From (ii) and Proposition 2.8, the non-negative definite linear transformation Σ must satisfy Σ = ΓΣΓ' for all Γ ∈ O(V). Thus for all x ∈ V, ||x|| = 1,

(x, Σx) = (x, ΓΣΓ'x) = (Γ'x, ΣΓ'x).

But Γ'x can be any vector in V with length one since Γ' can be any element of O(V). Thus for all x, y with ||x|| = ||y|| = 1,

(x, Σx) = (y, Σy).

From the spectral theorem, write Σ = Σᵢ₌₁ⁿ λᵢ xᵢ□xᵢ and choose x = xⱼ and y = x_k. Then we have

λⱼ = (xⱼ, Σxⱼ) = (x_k, Σx_k) = λ_k

for all j, k. Setting σ² = λ₁,

Σ = Σᵢ₌₁ⁿ σ² xᵢ□xᵢ = σ² Σᵢ₌₁ⁿ xᵢ□xᵢ = σ²I.

That σ² ≥ 0 follows from the positive semidefiniteness of Σ. □
Orthogonally invariant distributions are sometimes called spherical distributions. The term weakly spherical results from weakening the assumption that the entire distribution is orthogonally invariant to the assumption that just the covariance structure is orthogonally invariant (condition (ii) of Proposition 2.13). A slight generalization of Proposition 2.13, given in its algebraic context, is needed for use later in this chapter.
Proposition 2.14. Suppose f is a bilinear function on V × V where (V, (·,·)) is an inner product space. If f[Γx₁, Γx₂] = f[x₁, x₂] for all x₁, x₂ ∈ V and Γ ∈ O(V), then f[x₁, x₂] = c(x₁, x₂) where c is some real constant. If A is a linear transformation on V to V that satisfies Γ'AΓ = A for all Γ ∈ O(V), then A = cI for some real c.

Proof. Every bilinear function on V × V has the form (x₁, Ax₂) for some linear transformation A on V to V. The assertion that f[Γx₁, Γx₂] = f[x₁, x₂] is clearly equivalent to the assertion that Γ'AΓ = A for all Γ ∈ O(V). Thus it suffices to verify the assertion concerning the linear transformation A. Suppose Γ'AΓ = A for all Γ ∈ O(V). Then for x₁, x₂ ∈ V,

(x₁, Ax₂) = (x₁, Γ'AΓx₂) = (Γx₁, AΓx₂).

By Proposition 1.20, there exists a Γ such that

Γ(x₁/||x₁||) = x₂/||x₂|| and Γ(x₂/||x₂||) = x₁/||x₁||

when x₁ and x₂ are not zero. Thus for x₁ and x₂ not zero,

(x₁, Ax₂) = (Γx₁, AΓx₂) = (x₂, Ax₁) = (Ax₁, x₂).

However, this relationship clearly holds if either x₁ or x₂ is zero. Thus for all x₁, x₂ ∈ V, (x₁, Ax₂) = (Ax₁, x₂), so A must be self-adjoint. Now, using the spectral theorem, we can argue as in the proof of Proposition 2.13 to conclude that A = cI for some real number c. □
* Example 2.4. Consider coordinate space Rⁿ with the usual inner product. Let f be a function on [0, ∞) to [0, ∞) so that

∫_{Rⁿ} f(||x||²) dx = 1.

Thus f(||x||²) is a density on Rⁿ. If the coordinate random vector X ∈ Rⁿ has f(||x||²) as its density, then for Γ ∈ Oₙ (the group of n × n orthogonal matrices), the density of ΓX is again f(||x||²). This follows since ||Γ'x|| = ||x|| and the Jacobian of the linear transformation determined by Γ is equal to one. Hence the distribution determined by the density is Oₙ invariant. One particular choice for f is f(u) = (2π)^{−n/2}e^{−u/2} and the density for X is then

f(||x||²) = (2π)^{−n/2} exp[−½Σᵢxᵢ²] = Πᵢ₌₁ⁿ (2π)^{−1/2} exp[−½xᵢ²].

Each of the factors in the above product is a density on R (corresponding to a normal distribution with mean zero and variance one). Therefore, the coordinates of X are independent and each has the same distribution. An example of a distribution on Rⁿ that is weakly spherical, but not spherical, is provided by the density (with respect to Lebesgue measure)

p(x) = 2^{−n} exp[−Σᵢ|xᵢ|]

where x ∈ Rⁿ, x' = (x₁, x₂,...,xₙ). More generally, if the random variables X₁,...,Xₙ are independent with the same distribution on R, and σ² = var(X₁), then the random vector X with coordinates X₁,...,Xₙ is easily shown to satisfy Cov(X) = σ²Iₙ where Iₙ is the n × n identity matrix.
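The Laplace-type density above gives a concrete weakly spherical but non-spherical case. In this simulation sketch, each coordinate is standard Laplace (density (1/2)e^{−|t|}, so variance 2), giving Cov(X) = 2Iₙ; but a rotation changes the fourth moment of a coordinate, so the full distribution is not orthogonally invariant. The orthogonal matrix G used here is an arbitrary illustrative choice, not from the text.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
X = rng.laplace(size=(300_000, n))   # iid coordinates with density (1/2)e^{-|t|}

# Weakly spherical: Cov(X) = sigma^2 I_n with sigma^2 = var(X_1) = 2.
assert np.allclose(np.cov(X, rowvar=False), 2.0 * np.eye(n), atol=0.05)

# Not spherical: for standard Laplace, E X_1^4 = 4! = 24, while the rotated
# coordinate (X_1 + X_2)/sqrt(2) has fourth moment 72/4 = 18.
G = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, -1.0]]) / np.sqrt(2.0)
assert np.allclose(G @ G.T, np.eye(n))              # G is orthogonal
assert abs(np.mean(X[:, 0] ** 4) - 24.0) < 1.5
assert abs(np.mean((X @ G.T)[:, 0] ** 4) - 18.0) < 1.5
```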
The next topic in this section concerns the covariance between two random vectors. Suppose Xᵢ ∈ (Vᵢ, (·,·)ᵢ) for i = 1, 2 where X₁ and X₂ are defined on the same probability space. Then the random vector {X₁, X₂} takes values in the direct sum V₁ ⊕ V₂. Let [·,·] denote the usual inner product on V₁ ⊕ V₂ inherited from (·,·)ᵢ, i = 1, 2. Assume that Σᵢᵢ = Cov(Xᵢ), i = 1, 2, both exist. Then, let

f(x₁, x₂) = cov{(x₁, X₁)₁, (x₂, X₂)₂}

and note that the Cauchy-Schwarz Inequality (Example 1.9) shows that

f(x₁, x₂)² ≤ (x₁, Σ₁₁x₁)₁(x₂, Σ₂₂x₂)₂.

Further, it is routine to check that f(·,·) is a bilinear function on V₁ × V₂ so there exists a linear transformation Σ₁₂ ∈ L(V₂, V₁) such that

f(x₁, x₂) = (x₁, Σ₁₂x₂)₁.
The next proposition relates Σ₁₁, Σ₁₂, and Σ₂₂ to the covariance of {X₁, X₂} in the vector space (V₁ ⊕ V₂, [·,·]).

Proposition 2.15. Let Σ = Cov{X₁, X₂}. Define a linear transformation A on V₁ ⊕ V₂ to V₁ ⊕ V₂ by

A{x₁, x₂} = {Σ₁₁x₁ + Σ₁₂x₂, Σ₁₂'x₁ + Σ₂₂x₂}

where Σ₁₂' is the adjoint of Σ₁₂. Then A = Σ.
Proof. It is routine to check that

[A{x₁, x₂}, {x₃, x₄}] = [{x₁, x₂}, A{x₃, x₄}]

so A is self-adjoint. To show A = Σ, it is sufficient to verify

[{x₁, x₂}, A{x₁, x₂}] = [{x₁, x₂}, Σ{x₁, x₂}]

by Proposition 1.16. However,

[{x₁, x₂}, Σ{x₁, x₂}] = var[{x₁, x₂}, {X₁, X₂}]
= var{(x₁, X₁)₁ + (x₂, X₂)₂}
= var(x₁, X₁)₁ + var(x₂, X₂)₂ + 2cov{(x₁, X₁)₁, (x₂, X₂)₂}
= (x₁, Σ₁₁x₁)₁ + (x₂, Σ₂₂x₂)₂ + 2(x₁, Σ₁₂x₂)₁
= (x₁, Σ₁₁x₁)₁ + (x₂, Σ₂₂x₂)₂ + (x₁, Σ₁₂x₂)₁ + (Σ₁₂'x₁, x₂)₂
= [{x₁, x₂}, {Σ₁₁x₁ + Σ₁₂x₂, Σ₁₂'x₁ + Σ₂₂x₂}]
= [{x₁, x₂}, A{x₁, x₂}]. □
It is customary to write the linear transformation A in partitioned form as

(Σ₁₁  Σ₁₂ )
(Σ₁₂' Σ₂₂){x₁, x₂} = {Σ₁₁x₁ + Σ₁₂x₂, Σ₁₂'x₁ + Σ₂₂x₂}.

With this notation,

Cov{X₁, X₂} = (Σ₁₁  Σ₁₂ )
              (Σ₁₂' Σ₂₂).
Definition 2.9. The random vectors X₁ and X₂ are uncorrelated if Σ₁₂ = 0.
In the above definition, it is assumed that Cov(Xᵢ) exists for i = 1, 2. It is clear that X₁ and X₂ are uncorrelated iff

cov{(x₁, X₁)₁, (x₂, X₂)₂} = 0 for all xᵢ ∈ Vᵢ, i = 1, 2.

Also, if X₁ and X₂ are uncorrelated in the two given inner products, then they are uncorrelated in all inner products on V₁ and V₂. This follows from the fact that any two inner products are related by a positive definite linear transformation. Given Xᵢ ∈ (Vᵢ, (·,·)ᵢ) for i = 1, 2, suppose

Cov{X₁, X₂} = (Σ₁₁ Σ₁₂)
              (Σ₂₁ Σ₂₂).

We want to show that there is a linear transformation B ∈ L(V₂, V₁) such that X₁ + BX₂ and X₂ are uncorrelated random vectors. However, before this can be established, some preliminary technical results are needed.
Consider an inner product space (V, (·,·)) and suppose A ∈ L(V, V) is self-adjoint of rank k. Then, by the spectral theorem, A = Σᵢ₌₁ᵏ λᵢ xᵢ□xᵢ where λᵢ ≠ 0, i = 1,...,k, and {x₁,...,x_k} is an orthonormal set that is a basis for R(A). The linear transformation

A⁻ = Σᵢ₌₁ᵏ (1/λᵢ) xᵢ□xᵢ

is called the generalized inverse of A. If A is nonsingular, then it is clear that A⁻ is the inverse of A. Also, A⁻ is self-adjoint and AA⁻ = A⁻A = Σᵢ₌₁ᵏ xᵢ□xᵢ, which is just the orthogonal projection onto R(A). A routine computation shows that A⁻AA⁻ = A⁻ and AA⁻A = A.
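The generalized inverse can be computed directly from an eigendecomposition: invert only the nonzero eigenvalues. The sketch below (with an illustrative rank-2 matrix) checks the identities stated above, including agreement with the Moore-Penrose pseudoinverse, which coincides with A⁻ when A is self-adjoint.

```python
import numpy as np

def generalized_inverse(A, tol=1e-10):
    """A^- = sum over nonzero eigenvalues of (1/lambda_i) x_i x_i'."""
    lam, U = np.linalg.eigh(A)        # spectral decomposition of self-adjoint A
    inv = np.zeros_like(lam)
    nz = np.abs(lam) > tol
    inv[nz] = 1.0 / lam[nz]
    return (U * inv) @ U.T

# Illustrative rank-2 self-adjoint A on R^3.
B = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]])
A = B @ B.T                           # non-negative definite, rank 2
Am = generalized_inverse(A)

assert np.allclose(A @ Am @ A, A)          # A A^- A = A
assert np.allclose(Am @ A @ Am, Am)        # A^- A A^- = A^-
assert np.allclose(A @ Am, Am @ A)         # both equal the projection onto R(A)
assert np.allclose(Am, np.linalg.pinv(A))  # matches the Moore-Penrose inverse
```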
In the notation established previously (see Proposition 2.15), suppose {X₁, X₂} ∈ V₁ ⊕ V₂ has a covariance

Σ = Cov{X₁, X₂} = (Σ₁₁ Σ₁₂)
                  (Σ₂₁ Σ₂₂).

Proposition 2.16. For the covariance above, N(Σ₂₂) ⊆ N(Σ₁₂) and Σ₁₂ = Σ₁₂Σ₂₂⁻Σ₂₂.
Proof. For x₂ ∈ N(Σ₂₂), it must be shown that Σ₁₂x₂ = 0. Consider x₁ ∈ V₁ and a ∈ R. Then Σ₂₂(ax₂) = 0 and since Σ is positive semidefinite,

0 ≤ [{x₁, ax₂}, Σ{x₁, ax₂}] = [{x₁, ax₂}, {Σ₁₁x₁ + aΣ₁₂x₂, Σ₂₁x₁ + aΣ₂₂x₂}]
= (x₁, Σ₁₁x₁)₁ + a(x₁, Σ₁₂x₂)₁ + a(x₂, Σ₂₁x₁)₂
= (x₁, Σ₁₁x₁)₁ + 2a(x₁, Σ₁₂x₂)₁.

As this inequality holds for all a ∈ R, for each x₁ ∈ V₁, (x₁, Σ₁₂x₂)₁ = 0. Hence Σ₁₂x₂ = 0 and the first claim is proved. To verify that Σ₁₂ = Σ₁₂Σ₂₂⁻Σ₂₂, it suffices to establish the identity Σ₁₂(I − Σ₂₂⁻Σ₂₂) = 0. However, I − Σ₂₂⁻Σ₂₂ is the orthogonal projection onto N(Σ₂₂). Since N(Σ₂₂) ⊆ N(Σ₁₂), it follows that Σ₁₂(I − Σ₂₂⁻Σ₂₂) = 0. □
We are now in a position to show that X₁ − Σ₁₂Σ₂₂⁻X₂ and X₂ are uncorrelated.
Proposition 2.17. Suppose {X₁, X₂} ∈ V₁ ⊕ V₂ has a covariance

Σ = Cov{X₁, X₂} = (Σ₁₁ Σ₁₂)
                  (Σ₂₁ Σ₂₂).

Then X₁ − Σ₁₂Σ₂₂⁻X₂ and X₂ are uncorrelated, and Cov(X₁ − Σ₁₂Σ₂₂⁻X₂) = Σ₁₁ − Σ₁₂Σ₂₂⁻Σ₂₁ where Σ₂₁ = Σ₁₂'.

Proof. For xᵢ ∈ Vᵢ, i = 1, 2, it must be verified that

cov{(x₁, X₁ − Σ₁₂Σ₂₂⁻X₂)₁, (x₂, X₂)₂} = 0.

This calculation goes as follows:

cov{(x₁, X₁ − Σ₁₂Σ₂₂⁻X₂)₁, (x₂, X₂)₂}
= cov{(x₁, X₁)₁, (x₂, X₂)₂} − cov{((Σ₁₂Σ₂₂⁻)'x₁, X₂)₂, (x₂, X₂)₂}
= (x₁, Σ₁₂x₂)₁ − ((Σ₁₂Σ₂₂⁻)'x₁, Σ₂₂x₂)₂
= (x₁, Σ₁₂x₂)₁ − (x₁, Σ₁₂Σ₂₂⁻Σ₂₂x₂)₁
= (x₁, (Σ₁₂ − Σ₁₂Σ₂₂⁻Σ₂₂)x₂)₁ = 0.
The last equality follows from Proposition 2.16, since Σ12 = Σ12Σ22⁻Σ22. To verify the second assertion, we need to establish the identity

    var(x1, X1 − Σ12Σ22⁻X2)1 = (x1, (Σ11 − Σ12Σ22⁻Σ21)x1)1.

But

    var(x1, X1 − Σ12Σ22⁻X2)1
      = var(x1, X1)1 + var(Σ22⁻Σ21x1, X2)2 − 2cov{(x1, X1)1, (Σ22⁻Σ21x1, X2)2}
      = (x1, Σ11x1)1 + (x1, Σ12Σ22⁻Σ22Σ22⁻Σ21x1)1 − 2(x1, Σ12Σ22⁻Σ21x1)1
      = (x1, (Σ11 − Σ12Σ22⁻Σ21)x1)1.

In the above, the identity Σ12Σ22⁻Σ22Σ22⁻Σ21 = Σ12Σ22⁻Σ21, a consequence of Σ12Σ22⁻Σ22 = Σ12, has been used.  □
We now return to the situation considered in Example 2.4. Consider independent coordinate random vectors X1,..., Xn with each Xi ∈ Rp, and suppose that E Xi = μ ∈ Rp and Cov(Xi) = Σ for i = 1,..., n. Form the random matrix X ∈ L_{p,n} with rows X1',..., Xn'. Our purpose is to describe the mean vector and covariance of X in terms of Σ and μ. The inner product ⟨·,·⟩ on L_{p,n} is that inherited from the standard inner products on the coordinate spaces Rp and Rn. Recall that, for matrices A, B ∈ L_{p,n},

    ⟨A, B⟩ = tr AB' = tr B'A = tr A'B = tr BA'.

Let e denote the vector in Rn whose coordinates are all equal to 1.
Proposition 2.18. In the above notation,

    (i) E X = eμ';
    (ii) Cov(X) = In ⊗ Σ.

Here In is the n × n identity matrix and ⊗ denotes the Kronecker product.
Proof. The matrix eμ' has each row equal to μ' and, since each row of X has mean μ', the first assertion is fairly obvious. To verify (i) formally, it must be shown that, for A ∈ L_{p,n},

    E⟨A, X⟩ = ⟨A, eμ'⟩.
Let a1',..., an', ai ∈ Rp, be the rows of A. Then

    E⟨A, X⟩ = E tr AX' = E Σi ai'Xi = Σi ai'E Xi = Σi ai'μ = tr Aμe' = ⟨A, eμ'⟩.

Thus (i) holds. To verify (ii) it suffices to establish the identity

    var⟨A, X⟩ = ⟨A, (In ⊗ Σ)A⟩

for A ∈ L_{p,n}. In the notation above,

    var⟨A, X⟩ = var(Σi ai'Xi) = Σi var(ai'Xi) + ΣΣ_{i≠j} cov(ai'Xi, aj'Xj)
              = Σi ai'Σai = tr AΣA' = ⟨A, (In ⊗ Σ)A⟩.

The third equality follows from var(ai'Xi) = ai'Σai and, for i ≠ j, ai'Xi and aj'Xj are uncorrelated.  □
The assumption of the independence of X1,..., Xn was not used to its full extent in the proof of Proposition 2.18. In fact, the above proof shows that, if X1,..., Xn are random vectors in Rp with E Xi = μ, i = 1,..., n, then E X = eμ'. Further, if X1,..., Xn in Rp are uncorrelated with Cov(Xi) = Σ, i = 1,..., n, then Cov(X) = In ⊗ Σ. One application of this formula for Cov(X) describes how Cov(X) transforms under Kronecker products. For example, if A ∈ L_{n,n} and B ∈ L_{p,p}, then (A ⊗ B)X = AXB' is a random vector in L_{p,n}. Proposition 2.8 shows that

    Cov((A ⊗ B)X) = (A ⊗ B)Cov(X)(A ⊗ B)'.

In particular, if Cov(X) = In ⊗ Σ, then

    Cov((A ⊗ B)X) = (A ⊗ B)(In ⊗ Σ)(A ⊗ B)' = (AA') ⊗ (BΣB').
Since A ⊗ B = (A ⊗ Ip)(In ⊗ B), the interpretation of the above covariance formula reduces to an interpretation for A ⊗ Ip and In ⊗ B. First, (In ⊗ B)X is a random matrix with rows Xi'B' = (BXi)', i = 1,..., n. If Cov(Xi) = Σ, then Cov(BXi) = BΣB'. Thus it is clear from Proposition 2.18 that Cov((In ⊗ B)X) = In ⊗ (BΣB'). Second, (A ⊗ Ip) applied to X is the same as applying the linear transformation A to each column of X. When Cov(X) = In ⊗ Σ, the rows of X are uncorrelated and, if A is an n × n orthogonal matrix, then

    Cov((A ⊗ Ip)X) = In ⊗ Σ = Cov(X).
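A hedged numerical illustration of these transformation rules (NumPy assumed; not part of the text): the Kronecker factorization of the transformed covariance, and the invariance of In ⊗ Σ under an orthogonal transformation of the rows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 3, 2
S = rng.standard_normal((p, p)); S = S @ S.T
A = rng.standard_normal((n, n))
B = rng.standard_normal((p, p))

# (A (x) B)(I_n (x) Sigma)(A (x) B)' = (AA') (x) (B Sigma B')
lhs = np.kron(A, B) @ np.kron(np.eye(n), S) @ np.kron(A, B).T
rhs = np.kron(A @ A.T, B @ S @ B.T)
assert np.allclose(lhs, rhs)

# For an orthogonal A (here from a QR decomposition) and B = I_p,
# Cov((A (x) I_p)X) = I_n (x) Sigma is unchanged.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
cov = np.kron(Q, np.eye(p)) @ np.kron(np.eye(n), S) @ np.kron(Q, np.eye(p)).T
assert np.allclose(cov, np.kron(np.eye(n), S))
```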
Thus the absence of correlation between the rows is preserved by an orthogonal transformation of the columns of X. A converse to the observation that Cov((A ⊗ Ip)X) = In ⊗ Σ for all A ∈ O(n) is valid for random linear transformations. To be more precise, we have the following proposition.
Proposition 2.19. Suppose (Vi, (·,·)i), i = 1, 2, are inner product spaces and X is a random vector in (L(V1, V2), ⟨·,·⟩). The following are equivalent:

    (i) Cov(X) = I2 ⊗ Σ;
    (ii) Cov((Γ ⊗ I1)X) = Cov(X) for all Γ ∈ O(V2).

Here, Ii is the identity linear transformation on Vi, i = 1, 2, and Σ is a non-negative definite linear transformation on V1 to V1.
Proof. Let Φ = Cov(X), so Φ is a positive semidefinite linear transformation on L(V1, V2) to L(V1, V2) and Φ is characterized by the equation

    cov{⟨A, X⟩, ⟨B, X⟩} = ⟨A, ΦB⟩

for all A, B ∈ L(V1, V2). If (i) holds, then we have

    Cov((Γ ⊗ I1)X) = (Γ ⊗ I1)Cov(X)(Γ ⊗ I1)'
                   = (Γ ⊗ I1)(I2 ⊗ Σ)(Γ' ⊗ I1) = (ΓI2Γ') ⊗ (I1ΣI1)
                   = I2 ⊗ Σ = Cov(X),
so (ii) holds. Now, assume (ii) holds. Since outer products span L(V1, V2), it is sufficient to show that there exists a positive semidefinite Σ on V1 to V1 such that, for x1, x2 ∈ V1 and y1, y2 ∈ V2,

    ⟨y1 □ x1, Φ(y2 □ x2)⟩ = ⟨y1 □ x1, (I2 ⊗ Σ)(y2 □ x2)⟩.

Define H by

    H(x1, x2, y1, y2) = cov{⟨y1 □ x1, X⟩, ⟨y2 □ x2, X⟩}

for x1, x2 ∈ V1 and y1, y2 ∈ V2. From assumption (ii), we know that Φ
satisfies Φ = (Γ ⊗ I1)Φ(Γ ⊗ I1)' for all Γ ∈ O(V2). Thus

    H(x1, x2, y1, y2) = ⟨y1 □ x1, Φ(y2 □ x2)⟩
      = ⟨y1 □ x1, (Γ ⊗ I1)Φ(Γ ⊗ I1)'(y2 □ x2)⟩
      = ⟨(Γ ⊗ I1)'(y1 □ x1), Φ(Γ ⊗ I1)'(y2 □ x2)⟩
      = ⟨(Γ'y1) □ x1, Φ((Γ'y2) □ x2)⟩ = H(x1, x2, Γ'y1, Γ'y2)

for all Γ ∈ O(V2). It is clear that H is a linear function of each of its four arguments when the other three are held fixed. Therefore, for x1 and x2 fixed, H is a bilinear function on V2 × V2 and this bilinear function satisfies the assumption of Proposition 2.14. Thus there is a constant, which depends on x1 and x2, say c[x1, x2], such that

    H(x1, x2, y1, y2) = c[x1, x2](y1, y2)2.

However, for y1 = y2 ≠ 0, H, as a function of x1 and x2, is bilinear and non-negative definite on V1 × V1. In other words, c[x1, x2] is a non-negative definite bilinear function on V1 × V1, so

    c[x1, x2] = (x1, Σx2)1

for some non-negative definite Σ. Thus

    H(x1, x2, y1, y2) = (x1, Σx2)1(y1, y2)2 = ⟨y1 □ x1, (I2 ⊗ Σ)(y2 □ x2)⟩,

so Φ = I2 ⊗ Σ.  □
The next topic of consideration in this section concerns the calculation of means and covariances for outer products of random vectors. These results are used throughout the sequel to simplify proofs and provide convenient formulas. Suppose Xi is a random vector in (Vi, (·,·)i) for i = 1, 2, and let μi = E Xi and Σii = Cov(Xi) for i = 1, 2. Thus {X1, X2} takes values in V1 ⊕ V2 and

    Cov{X1, X2} = | Σ11  Σ12 |
                  | Σ21  Σ22 |,

where Σ12 is characterized by

    cov{(x1, X1)1, (x2, X2)2} = (x1, Σ12x2)1
for xi ∈ Vi, i = 1, 2. Of course, Cov{X1, X2} is expressed relative to the natural inner product on V1 ⊕ V2 inherited from (V1, (·,·)1) and (V2, (·,·)2).
Proposition 2.20. For Xi ∈ (Vi, (·,·)i), i = 1, 2, as above,

    E X1 □ X2 = Σ12 + μ1 □ μ2.
Proof. The random vector X1 □ X2 takes values in the inner product space (L(V2, V1), ⟨·,·⟩). To verify the above formula, it must be shown that

    E⟨A, X1 □ X2⟩ = ⟨A, Σ12⟩ + ⟨A, μ1 □ μ2⟩

for A ∈ L(V2, V1). However, it is sufficient to verify this equation for A = x1 □ x2, since both sides of the equation are linear in A and every A is a linear combination of elements of L(V2, V1) of the form x1 □ x2, xi ∈ Vi, i = 1, 2. For x1 □ x2 ∈ L(V2, V1),

    E⟨x1 □ x2, X1 □ X2⟩ = E(x1, X1)1(x2, X2)2
      = cov{(x1, X1)1, (x2, X2)2} + E(x1, X1)1 E(x2, X2)2
      = (x1, Σ12x2)1 + (x1, μ1)1(x2, μ2)2
      = ⟨x1 □ x2, Σ12⟩ + ⟨x1 □ x2, μ1 □ μ2⟩.  □
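In coordinates, Proposition 2.20 is the familiar identity E X1X2' = Σ12 + μ1μ2'. The sketch below verifies it exactly for a small discrete joint distribution (NumPy assumed; the support points and probabilities are invented purely for illustration).

```python
import numpy as np

# A small discrete joint distribution for (X1, X2) in R^2 x R^2.
vals1 = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 1.0]])   # possible X1 values
vals2 = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])   # matched X2 values
prob = np.array([0.5, 0.3, 0.2])        # P(X1 = vals1[i], X2 = vals2[i])

mu1 = prob @ vals1
mu2 = prob @ vals2
# E X1 X2'  (the outer product X1 [] X2 in coordinates)
E_outer = sum(p * np.outer(x1, x2) for p, x1, x2 in zip(prob, vals1, vals2))
# Sigma_12 = E (X1 - mu1)(X2 - mu2)'
S12 = sum(p * np.outer(x1 - mu1, x2 - mu2) for p, x1, x2 in zip(prob, vals1, vals2))

# Proposition 2.20: E X1 [] X2 = Sigma_12 + mu1 [] mu2.
assert np.allclose(E_outer, S12 + np.outer(mu1, mu2))
```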
A couple of interesting applications of Proposition 2.20 are given in the following proposition.
Proposition 2.21. For X1, X2 in (V, (·,·)), let μi = E Xi and Σii = Cov(Xi) for i = 1, 2. Also, let Σ12 be the unique linear transformation satisfying

    cov{(x1, X1), (x2, X2)} = (x1, Σ12x2)

for all x1, x2 ∈ V. Then:

    (i) E X1 □ X1 = Σ11 + μ1 □ μ1;
    (ii) E(X1, X2) = ⟨I, Σ12⟩ + (μ1, μ2);
    (iii) E(X1, X1) = ⟨I, Σ11⟩ + (μ1, μ1).

Here I ∈ L(V, V) is the identity linear transformation and ⟨·,·⟩ is the inner product on L(V, V) inherited from (V, (·,·)).
Proof. For (i), take X1 = X2 and (V1, (·,·)1) = (V2, (·,·)2) = (V, (·,·)) in Proposition 2.20. To verify (ii), first note that

    E X1 □ X2 = Σ12 + μ1 □ μ2

by the previous proposition. Thus, for I ∈ L(V, V),

    E⟨I, X1 □ X2⟩ = ⟨I, Σ12⟩ + ⟨I, μ1 □ μ2⟩.

However, ⟨I, X1 □ X2⟩ = (X1, X2) and ⟨I, μ1 □ μ2⟩ = (μ1, μ2), so (ii) holds. Assertion (iii) follows from (ii) by taking X1 = X2.  □
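Assertion (ii) reduces in coordinates to E X1'X2 = tr Σ12 + μ1'μ2, which can be checked exactly on a finite distribution. The sketch below assumes NumPy and uses invented illustrative values.

```python
import numpy as np

# A discrete joint distribution for X1, X2 both in R^3.
vals1 = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]])
vals2 = np.array([[0.5, 1.0, 0.0], [2.0, 0.0, 1.0]])
prob = np.array([0.4, 0.6])

mu1, mu2 = prob @ vals1, prob @ vals2
S12 = sum(p * np.outer(x1 - mu1, x2 - mu2) for p, x1, x2 in zip(prob, vals1, vals2))

# (ii) of Proposition 2.21: E (X1, X2) = <I, Sigma_12> + (mu1, mu2),
# where <I, Sigma_12> = tr Sigma_12.
E_ip = sum(p * (x1 @ x2) for p, x1, x2 in zip(prob, vals1, vals2))
assert np.isclose(E_ip, np.trace(S12) + mu1 @ mu2)
```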
One application of the preceding result concerns the affine prediction of one random vector by another random vector. By an affine function on a vector space V to W, we mean a function f given by f(v) = Av + w0, where A ∈ L(V, W) and w0 is a fixed vector in W. The term linear transformation is reserved for those affine functions that map zero into zero. In the notation of Proposition 2.21, consider Xi ∈ (Vi, (·,·)i) for i = 1, 2, let μi = E Xi, i = 1, 2, and suppose

    Σ = Cov{X1, X2} = | Σ11  Σ12 |
                      | Σ21  Σ22 |

exists. An affine predictor of X2 based on X1 is any function of the form AX1 + x0, where A ∈ L(V1, V2) and x0 is a fixed vector in V2. If we assume that μ1, μ2, and Σ are known, then A and x0 are allowed to depend on these known quantities. The statistical interpretation is that we observe X1, but not X2, and X2 is to be predicted by AX1 + x0. One intuitively reasonable criterion for selecting A and x0 is to ask that the choice of A and x0 minimize

    E‖X2 − (AX1 + x0)‖².

Here, the expectation is over the joint distribution of X1 and X2, and ‖·‖ is the norm in the vector space (V2, (·,·)2). The quantity E‖X2 − (AX1 + x0)‖² is the average squared distance of X2 − (AX1 + x0) from 0. Since AX1 + x0 is supposed to predict X2, it is reasonable that A and x0 be chosen to minimize this average squared distance. A solution to this minimization problem is given in Proposition 2.22.
Proposition 2.22. For X1 and X2 as above,

    E‖X2 − (AX1 + x0)‖² ≥ ⟨I2, Σ22 − Σ21Σ11⁻Σ12⟩

with equality for A = Σ21Σ11⁻ and x0 = μ2 − Σ21Σ11⁻μ1.
Proof. The proof is a calculation. It essentially consists of completing the square and applying (ii) of Proposition 2.21. Let Yi = Xi − μi for i = 1, 2. Then

    E‖X2 − (AX1 + x0)‖² = E‖Y2 − AY1 + μ2 − Aμ1 − x0‖²
      = E‖Y2 − AY1‖² + ‖μ2 − Aμ1 − x0‖² + 2E(Y2 − AY1, μ2 − Aμ1 − x0)2
      = E‖Y2 − AY1‖² + ‖μ2 − Aμ1 − x0‖².

The last equality holds since E(Y2 − AY1) = 0. Thus, for each A ∈ L(V1, V2),

    E‖X2 − (AX1 + x0)‖² ≥ E‖Y2 − AY1‖²

with equality for x0 = μ2 − Aμ1. For notational convenience, let Σ21 = Σ12'. Then

    E‖Y2 − AY1‖² = E‖Y2 − Σ21Σ11⁻Y1 + (Σ21Σ11⁻ − A)Y1‖²
      = E‖Y2 − Σ21Σ11⁻Y1‖² + E‖(Σ21Σ11⁻ − A)Y1‖²
        + 2E(Y2 − Σ21Σ11⁻Y1, (Σ21Σ11⁻ − A)Y1)2
      = E‖Y2 − Σ21Σ11⁻Y1‖² + E‖(Σ21Σ11⁻ − A)Y1‖²
      ≥ E‖Y2 − Σ21Σ11⁻Y1‖².

The third equality holds since E(Y2 − Σ21Σ11⁻Y1) = 0 and Y2 − Σ21Σ11⁻Y1 is uncorrelated with Y1 (Proposition 2.17) and hence is uncorrelated with (Σ21Σ11⁻ − A)Y1. By (ii) of Proposition 2.21, we then see that E(Y2 − Σ21Σ11⁻Y1, (Σ21Σ11⁻ − A)Y1)2 = 0. Therefore, for each A ∈ L(V1, V2),

    E‖Y2 − AY1‖² ≥ E‖Y2 − Σ21Σ11⁻Y1‖²

with equality for A = Σ21Σ11⁻. However, Cov(Y2 − Σ21Σ11⁻Y1) = Σ22 − Σ21Σ11⁻Σ12 and E(Y2 − Σ21Σ11⁻Y1) = 0, so (iii) of Proposition 2.21 shows that

    E‖Y2 − Σ21Σ11⁻Y1‖² = ⟨I2, Σ22 − Σ21Σ11⁻Σ12⟩.

Therefore,

    E‖X2 − (AX1 + x0)‖² ≥ ⟨I2, Σ22 − Σ21Σ11⁻Σ12⟩

with equality for A = Σ21Σ11⁻ and x0 = μ2 − Σ21Σ11⁻μ1.  □
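For any Y1, Y2 with mean zero and the given covariances, the objective E‖Y2 − AY1‖² equals tr(Σ22 − AΣ12 − Σ21A' + AΣ11A'), which depends only on Σ. The sketch below (NumPy assumed; Σ11 is nonsingular so Σ11⁻ is the ordinary inverse) evaluates this objective and checks that A = Σ21Σ11⁻¹ attains the stated minimum.

```python
import numpy as np

rng = np.random.default_rng(2)
q = 2                                       # dim V1 = dim V2 = 2
L = rng.standard_normal((2 * q, 2 * q))
S = L @ L.T                                 # joint covariance of (X1, X2)
S11, S12 = S[:q, :q], S[:q, q:]
S21, S22 = S[q:, :q], S[q:, q:]

def mse(A):
    # E||Y2 - A Y1||^2 = tr(S22 - A S12 - S21 A' + A S11 A'), Yi = Xi - mu_i
    return np.trace(S22 - A @ S12 - S21 @ A.T + A @ S11 @ A.T)

A_star = S21 @ np.linalg.inv(S11)
# Proposition 2.22: the minimum value is <I2, S22 - S21 S11^- S12> ...
assert np.isclose(mse(A_star), np.trace(S22 - S21 @ np.linalg.inv(S11) @ S12))
# ... and perturbing A away from A_star never does better.
for _ in range(5):
    assert mse(A_star) <= mse(A_star + 0.1 * rng.standard_normal((q, q))) + 1e-12
```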
The last topic in this section concerns the covariance of X □ X when X is a random vector in (V, (·,·)). The random vector X □ X is an element of the vector space (L(V, V), ⟨·,·⟩). However, X □ X is a self-adjoint linear transformation, so X □ X is also a random vector in (Ms, ⟨·,·⟩), where Ms is the linear subspace of self-adjoint transformations in L(V, V). In what follows, we regard X □ X as a random vector in (Ms, ⟨·,·⟩). Thus the covariance of X □ X is a positive semidefinite linear transformation on (Ms, ⟨·,·⟩). In general, this covariance is quite complicated, and we make some simplifying assumptions concerning the distribution of X.
Proposition 2.23. Suppose X has an orthogonally invariant distribution in (V, (·,·)) with E‖X‖⁴ < +∞. Let v1 and v2 be fixed vectors in V with ‖vi‖ = 1, i = 1, 2, and (v1, v2) = 0. Set c1 = var{(v1, X)²} and c2 = cov{(v1, X)², (v2, X)²}. Then

    Cov(X □ X) = (c1 − c2)I ⊗ I + c2T1,

where T1 is the linear transformation on Ms given by T1(A) = ⟨I, A⟩I. In other words, for A, B ∈ Ms,

    cov{⟨A, X □ X⟩, ⟨B, X □ X⟩} = ⟨A, ((c1 − c2)I ⊗ I + c2T1)B⟩
                                = (c1 − c2)⟨A, B⟩ + c2⟨I, A⟩⟨I, B⟩.
Proof. Since (c1 − c2)I ⊗ I + c2T1 is self-adjoint on (Ms, ⟨·,·⟩), Proposition 2.6 shows that it suffices to verify the equation

    var⟨A, X □ X⟩ = (c1 − c2)⟨A, A⟩ + c2⟨I, A⟩²

for A ∈ Ms in order to prove that

    Cov(X □ X) = (c1 − c2)I ⊗ I + c2T1.

First note that, for x ∈ V,

    var⟨x □ x, X □ X⟩ = var{(x, X)²} = ‖x‖⁴var{(x/‖x‖, X)²} = ‖x‖⁴var{(v1, X)²}.

This last equality follows from Proposition 2.10 as the distribution of X is
orthogonally invariant. Also, for x1, x2 ∈ V with (x1, x2) = 0,

    cov{(x1, X)², (x2, X)²} = ‖x1‖²‖x2‖²cov{(x1/‖x1‖, X)², (x2/‖x2‖, X)²}
                            = ‖x1‖²‖x2‖²cov{(v1, X)², (v2, X)²}.

Again, the last equality follows since L(X) = L(ΓX) for Γ ∈ O(V), so

    cov{(x1/‖x1‖, X)², (x2/‖x2‖, X)²} = cov{(Γ'(x1/‖x1‖), X)², (Γ'(x2/‖x2‖), X)²},

and Γ can be chosen so that Γ'(xi/‖xi‖) = vi, i = 1, 2. For A ∈ Ms, apply the spectral theorem and write A = Σi ai xi □ xi, where x1,..., xn is an orthonormal basis for (V, (·,·)). Then

    var⟨A, X □ X⟩ = var(Σi ai⟨xi □ xi, X □ X⟩)
      = Σi ai²var⟨xi □ xi, X □ X⟩
        + ΣΣ_{i≠j} aiaj cov{⟨xi □ xi, X □ X⟩, ⟨xj □ xj, X □ X⟩}
      = Σi ai²var{(xi, X)²} + ΣΣ_{i≠j} aiaj cov{(xi, X)², (xj, X)²}
      = c1Σi ai² + c2ΣΣ_{i≠j} aiaj = (c1 − c2)Σi ai² + c2(Σi ai)²
      = (c1 − c2)⟨A, A⟩ + c2⟨I, A⟩².  □
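For a concrete orthogonally invariant distribution, take X uniform on the unit circle in R²; there c1 = 1/8 and c2 = −1/8 follow from trigonometric moments. The sketch below (NumPy assumed; expectations are taken over a fine uniform grid in θ, which averages low-degree trigonometric polynomials essentially exactly) checks the covariance formula for two particular A, B ∈ Ms.

```python
import numpy as np

# X uniform on the unit circle in R^2, an orthogonally invariant distribution.
theta = np.linspace(0.0, 2 * np.pi, 20001)[:-1]    # uniform grid, endpoint dropped
X = np.stack([np.cos(theta), np.sin(theta)])       # 2 x N

def E(f):                                          # expectation over the grid
    return f.mean()

c1 = E(X[0] ** 4) - E(X[0] ** 2) ** 2                        # var{(v1, X)^2}
c2 = E(X[0] ** 2 * X[1] ** 2) - E(X[0] ** 2) * E(X[1] ** 2)  # cov of squares
assert np.isclose(c1, 1 / 8) and np.isclose(c2, -1 / 8)

# Check cov{<A, X[]X>, <B, X[]X>} = (c1 - c2)<A, B> + c2 <I, A><I, B>.
A = np.array([[1.0, 0.5], [0.5, -2.0]])
B = np.array([[0.0, 1.0], [1.0, 3.0]])
fA = np.einsum('ij,in,jn->n', A, X, X)   # <A, X [] X> = X'AX along the grid
fB = np.einsum('ij,in,jn->n', B, X, X)
lhs = E(fA * fB) - E(fA) * E(fB)
rhs = (c1 - c2) * np.trace(A @ B.T) + c2 * np.trace(A) * np.trace(B)
assert np.isclose(lhs, rhs)
```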
When X has an orthogonally invariant normal distribution, the constant c2 = 0, so Cov(X □ X) = c1 I ⊗ I. The following result provides a slight generalization of Proposition 2.23.
Proposition 2.24. Let X, v1, and v2 be as in Proposition 2.23. For C ∈ L(V, V), let Σ = CC' and suppose Y is a random vector in (V, (·,·)) with
L(Y) = L(CX). Then

    Cov(Y □ Y) = (c1 − c2)Σ ⊗ Σ + c2T2,

where T2(A) = ⟨A, Σ⟩Σ for A ∈ Ms.
Proof. We apply Proposition 2.8 and the calculational rules for Kronecker products. Since (CX) □ (CX) = (C ⊗ C)(X □ X),

    Cov(Y □ Y) = Cov((CX) □ (CX)) = Cov((C ⊗ C)(X □ X))
      = (C ⊗ C)Cov(X □ X)(C ⊗ C)'
      = (C ⊗ C)((c1 − c2)I ⊗ I + c2T1)(C' ⊗ C')
      = (c1 − c2)(C ⊗ C)(I ⊗ I)(C' ⊗ C') + c2(C ⊗ C)T1(C' ⊗ C')
      = (c1 − c2)Σ ⊗ Σ + c2(C ⊗ C)T1(C' ⊗ C').

It remains to show that (C ⊗ C)T1(C' ⊗ C') = T2. For A ∈ Ms,

    (C ⊗ C)T1(C' ⊗ C')(A) = (C ⊗ C)(⟨I, (C' ⊗ C')A⟩I)
      = ⟨(C ⊗ C)I, A⟩(C ⊗ C)(I) = ⟨CC', A⟩CC'
      = ⟨Σ, A⟩Σ = T2(A).  □
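The operator identity (C ⊗ C)T1(C' ⊗ C') = T2 can also be checked as a matrix identity once transformations on L(V, V) are represented in a vec basis. In the sketch below (NumPy assumed; row-major vec, under which A ↦ BAD' has matrix kron(B, D)), T1 and T2 become rank-one matrices built from vec(I) and vec(Σ).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
C = rng.standard_normal((n, n))
Sigma = C @ C.T

# With row-major vec, T1(A) = <I, A> I has matrix vec(I)vec(I)',
# and T2(A) = <A, Sigma> Sigma has matrix vec(Sigma)vec(Sigma)'.
vI = np.eye(n).reshape(-1)
vS = Sigma.reshape(-1)
T1 = np.outer(vI, vI)
T2 = np.outer(vS, vS)

K = np.kron(C, C)                 # the matrix of A -> C A C'
assert np.allclose(K @ vI, vS)    # (C (x) C)I = CC' = Sigma
assert np.allclose(K @ T1 @ K.T, T2)   # (C (x) C) T1 (C' (x) C') = T2
```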
PROBLEMS
1. If x1,..., xn is a basis for (V, (·,·)) and if (xi, X) has finite expectation for i = 1,..., n, show that (x, X) has finite expectation for all x ∈ V. Also, show that if (xi, X)² has finite expectation for i = 1,..., n, then Cov(X) exists.
2. Verify the claim that if X1 (X2) with values in V1 (V2) are uncorrelated for one pair of inner products on V1 and V2, then they are uncorrelated no matter what the inner products are on V1 and V2.
3. Suppose Xi ∈ Vi, i = 1, 2, are uncorrelated. If fi is a linear function on Vi, i = 1, 2, show that

    (2.2) cov{f1(X1), f2(X2)} = 0.

Conversely, if (2.2) holds for all linear functions f1 and f2, then X1 and X2 are uncorrelated (assuming the relevant expectations exist).
4. For X ∈ Rn, partition X as X = {X1, X2} with X1 ∈ Rk and X2 ∈ Rn−k, and suppose X has an orthogonally invariant distribution. Show that X1 has an orthogonally invariant distribution on Rk. Argue that the conditional distribution of X1 given X2 has an orthogonally invariant distribution.
5. Suppose X1,..., Xk in (V, (·,·)) are pairwise uncorrelated. Prove that

    Cov(Σ1k Xi) = Σ1k Cov(Xi).
6. In Rk, let ε1,..., εk denote the standard basis vectors. Define a random vector U in Rk by specifying that U takes on the value εi with probability pi, where 0 < pi < 1 and Σ1k pi = 1. (U represents one of k mutually exclusive and exhaustive events that can occur.) Let p ∈ Rk have coordinates p1,..., pk. Show that E U = p and Cov(U) = Dp − pp', where Dp is a diagonal matrix with diagonal entries p1,..., pk. When 0 < pi < 1, show that Cov(U) has rank k − 1 and identify the null space of Cov(U). Now, let X1,..., Xn be i.i.d., each with the distribution of U. The random vector Y = Σ1n Xi has a multinomial distribution (prove this) with parameters k (the number of cells), the vector of probabilities p, and the number of trials n. Show that E Y = np and Cov(Y) = n(Dp − pp').
7. Fix a vector x in Rn and let π denote a permutation of 1, 2,..., n (there are n! such permutations). Define the permuted vector πx to be the vector whose ith coordinate is x(π⁻¹(i)), where x(j) denotes the jth coordinate of x. (This choice is justified in Chapter 7.) Let X be a random vector such that Pr{X = πx} = 1/n! for each possible permutation π. Find E X and Cov(X).
8. Consider a random vector X ∈ Rn and suppose that L(X) = L(DX) for each diagonal matrix D with diagonal elements dii = ±1, i = 1,..., n. If E‖X‖² < +∞, show that E X = 0 and Cov(X) is a diagonal matrix (the coordinates of X are uncorrelated).
9. Given X ∈ (V, (·,·)) with Cov(X) = Σ, let Ai be a linear transformation on (V, (·,·)) to (Wi, [·,·]i), i = 1, 2. Form Y = {A1X, A2X} with values in the direct sum W1 ⊕ W2. Show that

    Cov(Y) = | A1ΣA1'  A1ΣA2' |
             | A2ΣA1'  A2ΣA2' |

in W1 ⊕ W2 with its usual inner product.
10. For X in (V, (·,·)) with μ = E X and Σ = Cov(X), show that E(X, AX) = ⟨A, Σ⟩ + (μ, Aμ) for any A ∈ L(V, V).
11. In (L_{p,n}, ⟨·,·⟩), suppose the n × p random matrix X has the covariance In ⊗ Σ for some p × p positive semidefinite Σ. Show that the rows of X are uncorrelated. If μ = E X and A is an n × n matrix, show that E X'AX = (tr A)Σ + μ'Aμ.
12. The usual inner product on the space of p × p symmetric matrices, denoted by Sp, is ⟨·,·⟩ given by ⟨A, B⟩ = tr AB'. (This is the natural inner product inherited from (L_{p,p}, ⟨·,·⟩) by regarding Sp as a subspace of L_{p,p}.) Let S be a random matrix with values in Sp, and suppose that L(ΓSΓ') = L(S) for all Γ ∈ O_p. (For example, if X ∈ Rp has an orthogonally invariant distribution and S = XX', then L(ΓSΓ') = L(S).) Show that E S = cIp where c is a constant.
13. Given a random vector X in (L(V, W), ⟨·,·⟩), suppose that L(X) = L((Γ ⊗ ψ)X) for all Γ ∈ O(W) and ψ ∈ O(V).

    (i) If X has a covariance, show that E X = 0 and Cov(X) = c IW ⊗ IV, where c ≥ 0.
    (ii) If Y ∈ L(V, W) has a density (with respect to Lebesgue measure) given by f(y) = p(⟨y, y⟩), y ∈ L(V, W), show that L(Y) = L((Γ ⊗ ψ)Y) for Γ ∈ O(W) and ψ ∈ O(V).
14. Let X1,..., Xn be uncorrelated random vectors in Rp with Cov(Xi) = Σ, i = 1,..., n. Form the n × p random matrix X with rows X1',..., Xn' and values in (L_{p,n}, ⟨·,·⟩). Thus Cov(X) = In ⊗ Σ.

    (i) Form X in the coordinate space Rnp with the coordinate inner product, where the coordinates of X1,..., Xn are stacked in order:

        X = | X1 |
            | .. |
            | Xn |.

    In the space Rnp, show that Cov(X) is block diagonal with each of the n diagonal blocks equal to Σ, where each block is p × p.
    (ii) Now, form X in the space Rnp where

        X = | Z1 |
            | .. |
            | Zp |

    and Zi ∈ Rn has coordinates X1i,..., Xni for i = 1,..., p. Show that

        Cov(X) = | σ11 In  σ12 In  ...  σ1p In |
                 | σ21 In  σ22 In  ...  σ2p In |
                 | ...                         |
                 | σp1 In  σp2 In  ...  σpp In |,

    where each block is n × n and Σ = (σij).
15. The unit sphere in Rn is the set S = {x | x ∈ Rn, ‖x‖ = 1}. A random vector X with values in S has a uniform distribution on S if L(X) = L(ΓX) for all Γ ∈ O_n. (There is one and only one uniform distribution on S; this is discussed in detail in Chapters 6 and 7.)

    (i) Show that E X = 0 and Cov(X) = (1/n)In.
    (ii) Let X1 be the first coordinate of X and let X2 ∈ Rn−1 be the remaining n − 1 coordinates. What is the best affine predictor of X1 based on X2? How would you predict X1 on the basis of X2?
16. Show that the linear transformation T2 in Proposition 2.24 is Σ □ Σ, where □ denotes the outer product in the vector space (Ms, ⟨·,·⟩). Here, ⟨·,·⟩ is the natural inner product on L(V, V).
17. Suppose X ∈ R² has coordinates X1 and X2 that are independent with a standard normal distribution. Let S = XX' and denote the elements of S by s11, s22, and s12 = s21.

    (i) What is the covariance matrix of

        | s11 |
        | s12 |  ∈ R³?
        | s22 |

    (ii) Regard S as a random vector in (S2, ⟨·,·⟩) (see Problem 12). What is Cov(S) in the space (S2, ⟨·,·⟩)?
    (iii) How do you reconcile your answers to (i) and (ii)?
NOTES AND REFERENCES
1. In the first two sections of this chapter, we have simply translated well-known coordinate space results into their inner product space versions. The coordinate space results can be found in Billingsley (1979). The inner product space versions were used by Kruskal (1961) in his work on missing and extra values in analysis of variance problems.
2. In the third section, topics with a multivariate flavor emerge. The reader may find it helpful to formulate coordinate versions of each proposition. If nothing else, this exercise will soon explain my acquired preference for vector space, as opposed to coordinate, methods and notation.
3. Proposition 2.14 is a special case of Schur's lemma, a basic result in group representation theory. The book by Serre (1977) is an excellent place to begin a study of group representations.
CHAPTER 3
The Normal Distribution on a Vector Space
The univariate normal distribution occupies a central position in the statistical theory of analyzing random samples consisting of one-dimensional observations. This situation is even more pronounced in multivariate analysis due to the paucity of analytically tractable multivariate distributions, one notable exception being the multivariate normal distribution. Ordinarily, the nonsingular multivariate normal distribution is defined on Rn by specifying the density function of the distribution with respect to Lebesgue measure. For our purposes, this procedure poses some problems. First, it is desirable to have a definition that does not require the covariance to be nonsingular. In addition, we have not, as yet, constructed what will be called Lebesgue measure on a finite dimensional inner product space. The definition of the multivariate normal distribution we have chosen circumvents the above technical difficulties by specifying the distribution of each linear function of the random vector. Of course, this necessitates a proof that such normal distributions exist.

After defining the normal distribution on a finite dimensional vector space and establishing some basic properties of the normal distribution, we derive the distribution of a quadratic form in a normal random vector. Conditions for the independence of two quadratic forms are then presented, followed by a discussion of conditional distributions for normal random vectors. The chapter ends with a derivation of Lebesgue measure on a finite dimensional vector space and of the density function of a nonsingular normal distribution on a vector space.
3.1. THE NORMAL DISTRIBUTION
Recall that a random variable Z0 ∈ R has a normal distribution with mean zero and variance one if the density function of Z0 is

    p(z) = (2π)^(−1/2) exp[−z²/2], z ∈ R,
with respect to Lebesgue measure. We write L(Z0) = N(0, 1) when Z0 has density p. More generally, a random variable Z ∈ R has a normal distribution with mean μ ∈ R and variance σ² ≥ 0 if L(Z) = L(σZ0 + μ) where L(Z0) = N(0, 1). In this case, we write L(Z) = N(μ, σ²). When σ² = 0, the distribution N(μ, σ²) is to be interpreted as the distribution degenerate at μ. If L(Z) = N(μ, σ²), then the characteristic function of Z is easily shown to be

    φ(t) = exp[itμ − σ²t²/2], t ∈ R.
The phrase "Z has a normal distribution" means that for some μ and some σ² ≥ 0, L(Z) = N(μ, σ²). If Z1,..., Zk are independent with L(Zj) = N(μj, σj²), then L(Σj ajZj) = N(Σj ajμj, Σj aj²σj²). To see this, consider the characteristic function

    E exp[itΣj ajZj] = E Π_{j=1}^k exp[itajZj] = Π_{j=1}^k E exp[itajZj]
                     = Π_{j=1}^k exp[itajμj − t²aj²σj²/2]
                     = exp[it(Σj ajμj) − t²(Σj aj²σj²)/2].
Thus the characteristic function of Σj ajZj is that of a normal distribution with mean Σj ajμj and variance Σj aj²σj². In summary, linear combinations of independent normal random variables are normal. We are now in a position to define the normal distribution on a finite dimensional inner product space (V, (·,·)).
Definition 3.1. A random vector X ∈ V has a normal distribution if, for each x ∈ V, the random variable (x, X) has a normal distribution on R.
To show that a normal distribution exists on (V, (·,·)), let {x1,..., xn} be an orthonormal basis for (V, (·,·)). Also, let Z1,..., Zn be independent
N(0, 1) random variables. Then X = Σi Zixi is a random vector and (x, X) = Σi (x, xi)Zi, which is a linear combination of independent normals. Thus (x, X) has a normal distribution for each x ∈ V. Since E(x, X) = Σi (xi, x)E Zi = 0, the mean vector of X is 0 ∈ V. Also,

    var(x, X) = var(Σi (x, xi)Zi) = Σi (x, xi)²var(Zi) = Σi (x, xi)² = (x, x).

Therefore, Cov(X) = I ∈ L(V, V). The particular normal distribution we have constructed on (V, (·,·)) has mean zero and covariance equal to the identity linear transformation.
identity linear transformation. Now, we want to describe all the normal distributions on (V, (*, )). The
first result in this direction shows that linear transformations of normal random vectors are again normal random vectors.
Proposition 3.1. Suppose X has a normal distribution on (V, (·,·)) and let A ∈ L(V, W), w0 ∈ W. Then AX + w0 has a normal distribution on (W, [·,·]).

Proof. It must be shown that, for each w ∈ W, [w, AX + w0] has a normal distribution on R. But [w, AX + w0] = [w, AX] + [w, w0] = (A'w, X) + [w, w0]. By assumption, (A'w, X) is normal. Since [w, w0] is a constant, (A'w, X) + [w, w0] is normal.  □
If X has a normal distribution on (V, (·,·)) with mean zero and covariance I, consider A ∈ L(V, V) and μ ∈ V. Then AX + μ has a normal distribution on (V, (·,·)), and we know E(AX + μ) = A(E X) + μ = μ and Cov(AX + μ) = A Cov(X)A' = AA'. However, every positive semidefinite linear transformation Σ can be expressed as AA' (take A to be the positive semidefinite square root of Σ). Thus, given μ ∈ V and a positive semidefinite Σ, there is a random vector that has a normal distribution on V with mean vector μ and covariance Σ. If X has such a distribution, we write L(X) = N(μ, Σ). To show that all the normal distributions on V have been described, suppose X ∈ V has a normal distribution. Since (x, X) is normal on R, var(x, X) exists for each x ∈ V. Thus μ = E X and Σ = Cov(X) both exist, and L(X) = N(μ, Σ). Also, L((x, X)) = N((x, μ), (x, Σx)) for x ∈ V. Hence the characteristic function of (x, X) is

    φ(t) = E exp[it(x, X)] = exp[it(x, μ) − t²(x, Σx)/2].

Setting t = 1, we obtain the characteristic function of X:

    φ(x) = E exp[i(x, X)] = exp[i(x, μ) − (x, Σx)/2].

Summarizing this discussion yields the following.
Proposition 3.2. Given μ ∈ V and a positive semidefinite Σ ∈ L(V, V), there exists a random vector X ∈ V with distribution N(μ, Σ) and characteristic function

    φ(x) = exp[i(x, μ) − (x, Σx)/2].

Conversely, if X has a normal distribution on V, then with μ = E X and Σ = Cov(X), L(X) = N(μ, Σ) and the characteristic function of X is given by φ.
Consider random vectors Xi with values in (Vi, (·,·)i) for i = 1, 2. Then {X1, X2} is a random vector in the direct sum V1 ⊕ V2. The inner product on V1 ⊕ V2 is [·,·], where

    [{v1, v2}, {v3, v4}] = (v1, v3)1 + (v2, v4)2,

v1, v3 ∈ V1 and v2, v4 ∈ V2. If Cov(Xi) = Σii, i = 1, 2, exists, then E{X1, X2} = {μ1, μ2}, where μi = E Xi, i = 1, 2. Also,

    Cov{X1, X2} = | Σ11  Σ12 |
                  | Σ21  Σ22 |

as defined in Chapter 2, and Σ21 = Σ12'.
Proposition 3.3. If {X1, X2} has a normal distribution on V1 ⊕ V2, then X1 and X2 are independent iff Σ12 = 0.
Proof. If X1 and X2 are independent, then clearly Σ12 = 0. Conversely, if Σ12 = 0, the characteristic function of {X1, X2} is

    E exp(i[{v1, v2}, {X1, X2}])
      = exp(i[{v1, v2}, {μ1, μ2}] − ½[{v1, v2}, Σ{v1, v2}])
      = exp(i(v1, μ1)1 + i(v2, μ2)2 − ½(v1, Σ11v1)1 − ½(v2, Σ22v2)2)
      = exp(i(v1, μ1)1 − ½(v1, Σ11v1)1) × exp(i(v2, μ2)2 − ½(v2, Σ22v2)2)
since Σ12 = Σ21' = 0. However, for v1 ∈ V1, (v1, X1)1 = [{v1, 0}, {X1, X2}], which has a normal distribution for all v1 ∈ V1. Thus L(X1) = N(μ1, Σ11) on V1 and, similarly, L(X2) = N(μ2, Σ22) on V2. The characteristic function of {X1, X2} is just the product of the characteristic functions of X1 and X2. Thus independence follows and the proof is complete.  □
The result of Proposition 3.3 is often paraphrased as "for normal random
vectors, X1 and X2 are independent iff they are uncorrelated." A useful consequence of Proposition 3.3 is shown in Proposition 3.4.
Proposition 3.4. Suppose L(X) = N(μ, Σ) on (V, (·,·)), and consider A ∈ L(V, W1) and B ∈ L(V, W2), where (W1, [·,·]1) and (W2, [·,·]2) are inner product spaces. Then AX and BX are independent iff AΣB' = 0.
Proof. We apply the previous proposition to X1 = AX and X2 = BX. That {X1, X2} has a normal distribution on W1 ⊕ W2 follows from

    [w1, X1]1 + [w2, X2]2 = (A'w1, X) + (B'w2, X) = (A'w1 + B'w2, X)

and the normality of (x, X) for all x ∈ V. However,

    cov{[w1, X1]1, [w2, X2]2} = cov{(A'w1, X), (B'w2, X)}
                              = (A'w1, ΣB'w2) = [w1, AΣB'w2]1.

Thus X1 = AX and X2 = BX are uncorrelated iff AΣB' = 0. Since {X1, X2} has a normal distribution, the condition AΣB' = 0 is equivalent to the independence of X1 and X2.  □
One special case of Proposition 3.4 is worthy of mention. If L(X) = N(μ, I) on (V, (·,·)) and P is an orthogonal projection in L(V, V), then PX and (I − P)X are independent since P(I − P) = 0. Also, it should be mentioned that the result of Proposition 3.3 extends to the case of k random vectors; that is, if {X1, X2,..., Xk} has a normal distribution on the direct sum space V1 ⊕ V2 ⊕ ··· ⊕ Vk, then X1, X2,..., Xk are independent iff Xi and Xj are uncorrelated for all i ≠ j. The proof of this is essentially the same as that given for the case of k = 2 and is left to the reader.
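A small numerical illustration of the projection special case (NumPy assumed; not part of the text): with Σ = I the cross covariance PΣ(I − P)' = P(I − P) vanishes, while for a general Σ it need not.

```python
import numpy as np

# P the orthogonal projection onto span{v} in R^3.
v = np.array([1.0, 2.0, 2.0])
P = np.outer(v, v) / (v @ v)
I = np.eye(3)

# With Sigma = I: cross covariance of PX and (I - P)X is P(I - P) = 0,
# so for normal X the two are independent.
assert np.allclose(P @ (I - P), 0)

# For a general Sigma the product P Sigma (I - P)' need not vanish.
Sigma = np.diag([1.0, 2.0, 3.0])
assert not np.allclose(P @ Sigma @ (I - P).T, 0)
```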
A particularly useful result for the multivariate normal distribution is the
following.
Proposition 3.5. Suppose L(X) = N(μ, Σ) on the n-dimensional vector space (V, (·,·)). Write Σ = Σi λi xi □ xi in spectral form, and let Xi = (xi, X), i = 1,..., n. Then X1,..., Xn are independent random variables that have normal distributions on R with E Xi = (xi, μ) and var(Xi) = λi, i = 1,..., n. In particular, if Σ = I, then for any orthonormal basis {x1,..., xn} for V, the random variables Xi = (xi, X) are independent and normal with E Xi = (xi, μ) and var(Xi) = 1.
Proof. For any scalars a1, ..., an in R, ∑_{i=1}^n aiXi = ∑_{i=1}^n ai(xi, X) = (∑_{i=1}^n aixi, X), which has a normal distribution. Thus the random vector X ∈ Rⁿ with coordinates X1, ..., Xn has a normal distribution in the coordinate vector space Rⁿ. Thus X1, ..., Xn are independent iff they are uncorrelated. However,

cov{Xj, Xk} = cov{(xj, X), (xk, X)} = (xj, Σxk) = (xj, (∑_{i=1}^n λi xi□xi)xk) = λj δjk.

Thus independence follows. It is clear that each Xi is normal with 𝓔Xi = (xi, μ) and var(Xi) = λi, i = 1, ..., n. When Σ = I, then ∑_{i=1}^n xi□xi = I for any orthonormal basis x1, ..., xn. This completes the proof. □
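The covariance computation in the proof is a matrix identity that can be verified directly. Below is a minimal Python sketch (our own construction, not the book's): a covariance is built in spectral form and the matrix of (xj, Σxk) values is checked to be diagonal with entries λj.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
lam = rng.uniform(0.5, 2.0, size=n)               # eigenvalues lambda_i
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # columns x_i: an orthonormal basis
Sigma = (Q * lam) @ Q.T                           # Sigma = sum_i lambda_i x_i (x_i)'
# entry (j, k) of C is (x_j, Sigma x_k); it should equal lambda_j delta_jk,
# so the coordinates X_i = (x_i, X) are uncorrelated with variances lambda_i
C = Q.T @ Sigma @ Q
```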
The following is a technical discussion having to do with representations of the normal distribution that are useful when establishing properties of the normal distribution. It seems preferable to dispose of the issues here rather than repeat the same argument in a variety of contexts later. Suppose X ∈ (V, (·,·)) has a normal distribution, say ℒ(X) = N(μ, Σ), and let Q be the probability distribution of X on (V, (·,·)). If we are interested in the distribution of some function of X, say f(X) ∈ (W, [·,·]), then the underlying space on which X is defined is irrelevant since the distribution Q determines the distribution of f(X); that is, if B ∈ 𝔅(W), then

P{f(X) ∈ B} = P{X ∈ f⁻¹(B)} = Q(f⁻¹(B)).

Therefore, if Y is another random vector in (V, (·,·)) with ℒ(X) = ℒ(Y), then f(X) and f(Y) have the same distribution. At times, it is convenient to represent ℒ(X) by ℒ(CZ + μ) where ℒ(Z) = N(0, I) and CC' = Σ. Thus
ℒ(X) = ℒ(CZ + μ), so f(X) and f(CZ + μ) have the same distribution. A slightly more subtle point arises when we discuss the independence of two functions of X, say f1(X) and f2(X), taking values in (W1, [·,·]1) and (W2, [·,·]2). To show that the independence of f1(X) and f2(X) depends only on Q, consider Bi ∈ 𝔅(Wi) for i = 1, 2. Then independence is equivalent to

P{f1(X) ∈ B1, f2(X) ∈ B2} = P{f1(X) ∈ B1}P{f2(X) ∈ B2}.

But both of these probabilities can be calculated from Q:

P{f1(X) ∈ B1, f2(X) ∈ B2} = P{X ∈ f1⁻¹(B1) ∩ f2⁻¹(B2)} = Q(f1⁻¹(B1) ∩ f2⁻¹(B2))

and

P{fi(X) ∈ Bi} = Q(fi⁻¹(Bi)), i = 1, 2.

Again, if ℒ(Y) = ℒ(X), then f1(X) and f2(X) are independent iff f1(Y) and f2(Y) are independent. More generally, if we are trying to prove something about the random vector X, ℒ(X) = N(μ, Σ), and if what we are trying to prove depends only on the distribution Q of X, then we can represent X by any other random vector Y as long as ℒ(Y) = ℒ(X). In particular, we can take Y = CZ + μ where ℒ(Z) = N(0, I) and CC' = Σ. This representation of X is often used in what follows.
3.2. QUADRATIC FORMS
The problem in this section is to derive, or at least describe, the distribution of (X, AX) where X ∈ (V, (·,·)), A is self-adjoint in ℒ(V, V), and ℒ(X) = N(μ, Σ). First, consider the special case of Σ = I and, by the spectral theorem, write A = ∑_{i=1}^n λi xi□xi. Thus

(X, AX) = (X, (∑_{i=1}^n λi xi□xi)X) = ∑_{i=1}^n λi(xi, X)².

But Xi = (xi, X), i = 1, ..., n, are independent since Σ = I (Proposition 3.5) and ℒ(Xi) = N((xi, μ), 1). Thus our first task is to derive the distribution of Xi² when ℒ(Xi) = N((xi, μ), 1).
Recall that a random variable Z has a chi-square distribution with m degrees of freedom, written ℒ(Z) = χ²_m, if Z has a density on (0, ∞) given
by

p_m(z) = z^(m/2−1) exp[−z/2] / (Γ(m/2) 2^(m/2)),  z > 0.

Here m is a positive integer and Γ(·) is the gamma function. The characteristic function of a χ²_m random variable is easily shown to be

𝓔 exp[itZ] = (1 − 2it)^(−m/2),  t ∈ R¹.

Thus, if ℒ(Z1) = χ²_m, ℒ(Z2) = χ²_n, and Z1 and Z2 are independent, then

𝓔 exp[it(Z1 + Z2)] = 𝓔 exp[itZ1] 𝓔 exp[itZ2] = (1 − 2it)^(−m/2)(1 − 2it)^(−n/2) = (1 − 2it)^(−(m+n)/2).

Therefore, ℒ(Z1 + Z2) = χ²_{m+n}. This argument clearly extends to more than two summands. In particular, if ℒ(Z) = χ²_m, then, for independent random variables Z1, ..., Zm with ℒ(Zi) = χ²_1, ℒ(∑_{i=1}^m Zi) = ℒ(Z). It is not difficult to show that if ℒ(X) = N(0, 1) on R, then ℒ(X²) = χ²_1. However, if ℒ(X) = N(a, 1) on R, the distribution of X² is a bit harder to derive. To this end, we make the following definition.
Definition 3.2. Let p_m, m = 1, 2, ..., be the density of a χ²_m random variable and, for λ > 0, let

qj = exp[−λ/2] (λ/2)^j / j!,  j = 0, 1, ....

For λ = 0, set q0 = 1 and qj = 0 for j > 0. A random variable with density

h(z) = ∑_{j=0}^∞ qj p_{m+2j}(z),  z > 0,

is said to have a noncentral chi-square distribution with m degrees of freedom and noncentrality parameter λ. If Z has such a distribution, we write ℒ(Z) = χ²_m(λ).

When λ = 0, it is clear that ℒ(χ²_m(0)) = χ²_m. The weights qj, j = 0, 1, ..., are Poisson probabilities with parameter λ/2 (the reason for the 2 becomes clear in a bit). The characteristic function of a χ²_m(λ) random variable is
calculated as follows:

𝓔 exp[it χ²_m(λ)] = ∑_{j=0}^∞ qj ∫₀^∞ exp(itx) p_{m+2j}(x) dx
= ∑_{j=0}^∞ qj (1 − 2it)^(−(m/2+j))
= (1 − 2it)^(−m/2) ∑_{j=0}^∞ qj (1 − 2it)^(−j)
= (1 − 2it)^(−m/2) exp(−λ/2) ∑_{j=0}^∞ (1/j!)((λ/2)(1 − 2it)^(−1))^j
= (1 − 2it)^(−m/2) exp[−λ/2 + (λ/2)(1 − 2it)^(−1)]
= (1 − 2it)^(−m/2) exp[(λ/2)(2it/(1 − 2it))].

From this expression for the characteristic function, it follows that if ℒ(Zi) = χ²_{mi}(λi), i = 1, 2, with Z1 and Z2 independent, then ℒ(Z1 + Z2) = χ²_{m1+m2}(λ1 + λ2). This result clearly extends to the sum of k independent noncentral chi-square variables. The reason for introducing the noncentral chi-square distribution is provided in the next result.
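The Poisson-mixture definition of the noncentral chi-square density can be checked numerically. The Python sketch below (ours, not the book's; the truncation at 200 mixture terms is an arbitrary but ample choice) compares ∑ qj p_{m+2j} against SciPy's noncentral chi-square density.

```python
import numpy as np
from scipy import stats

m, lam = 3, 2.5
z = np.linspace(0.1, 20.0, 50)
# Poisson(lam/2)-weighted mixture of central chi-square densities p_{m+2j}
mix = sum(stats.poisson.pmf(j, lam / 2) * stats.chi2.pdf(z, m + 2 * j)
          for j in range(200))
ref = stats.ncx2.pdf(z, m, lam)   # SciPy's noncentral chi-square density
```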
Proposition 3.6. Suppose ℒ(X) = N(a, 1) on R. Then ℒ(X²) = χ²_1(a²).
Proof. The proof consists of calculating the characteristic function of X². A justification of the change of variable in the calculation below can be given using contour integration. The characteristic function of X² is

𝓔 exp(itX²) = (2π)^(−1/2) ∫ exp[itx² − ½(x − a)²] dx
= (2π)^(−1/2) ∫ exp[−½(1 − 2it)x² + ax − ½a²] dx
= (1 − 2it)^(−1/2) (2π)^(−1/2) ∫ exp[−½w² + a(1 − 2it)^(−1/2)w − ½a²] dw
= (1 − 2it)^(−1/2) (2π)^(−1/2) ∫ exp[−½(w − a(1 − 2it)^(−1/2))² + a²it/(1 − 2it)] dw
= (1 − 2it)^(−1/2) exp[a²(it/(1 − 2it))].

By the uniqueness of characteristic functions, ℒ(X²) = χ²_1(a²). □
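A quick sanity check on Proposition 3.6 (a sketch of ours using SciPy): if X ~ N(a, 1), then 𝓔X² = 1 + a² and var(X²) = 2 + 4a², which must agree with the mean and variance of χ²_1(a²).

```python
from scipy import stats

a = 1.7
# moments of chi^2_1(a^2) versus the direct moments of X^2 for X ~ N(a, 1)
mean_err = abs(stats.ncx2.mean(1, a**2) - (1 + a**2))
var_err = abs(stats.ncx2.var(1, a**2) - (2 + 4 * a**2))
```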
Proposition 3.7. Suppose the random vector X in (V, (·,·)) has a N(μ, I) distribution. If A ∈ ℒ(V, V) is an orthogonal projection of rank k, then ℒ((X, AX)) = χ²_k((μ, Aμ)).

Proof. Let x1, ..., xk be an orthonormal basis for the range of A. Thus A = ∑_{i=1}^k xi□xi and

(X, AX) = ∑_{i=1}^k (xi, X)².

But the random variables (xi, X)², i = 1, ..., k, are independent (Proposition 3.5) and, by Proposition 3.6, ℒ((xi, X)²) = χ²_1((xi, μ)²). From the additivity of independent noncentral chi-square variables,

ℒ(∑_{i=1}^k (xi, X)²) = χ²_k(∑_{i=1}^k (xi, μ)²).

Noting that (μ, Aμ) = ∑_{i=1}^k (xi, μ)², the proof is complete. □
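Proposition 3.7 can be illustrated by simulation. The Python sketch below (our construction; the projection and mean are arbitrary) draws X ~ N(μ, I), forms (X, AX) for a rank-k orthogonal projection A, and compares the empirical moments with those of χ²_k((μ, Aμ)).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 6, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q[:, :k] @ Q[:, :k].T                    # orthogonal projection of rank k
mu = rng.standard_normal(n)
X = mu[:, None] + rng.standard_normal((n, 100_000))   # draws of X ~ N(mu, I)
q = np.einsum('in,ij,jn->n', X, A, X)        # (X, AX) for each draw
nc = mu @ A @ mu                             # noncentrality (mu, A mu)
```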
When ℒ(X) = N(μ, Σ), the distribution of the quadratic form (X, AX), with A self-adjoint, is reasonably complicated, but there is something that can be said. Let B be the positive semidefinite square root of Σ and assume that μ ∈ ℛ(Σ). Thus μ ∈ ℛ(B) since ℛ(B) = ℛ(Σ). Therefore, for some vector τ ∈ V, μ = Bτ. Thus ℒ(X) = ℒ(BY) where ℒ(Y) = N(τ, I) and it suffices to describe the distribution of (BY, ABY) = (Y, BABY). Since A and B are self-adjoint, BAB is self-adjoint. Write BAB in spectral form:

BAB = ∑_{i=1}^n λi xi□xi

where x1, ..., xn is an orthonormal basis for (V, (·,·)). Then

(Y, BABY) = ∑_{i=1}^n λi(xi, Y)²
and the random variables (xi, Y), i = 1, ..., n, are independent with ℒ((xi, Y)²) = χ²_1((xi, τ)²). It follows that the quadratic form (Y, BABY) has the same distribution as a linear combination of independent noncentral chi-square random variables. Symbolically,

ℒ((Y, BABY)) = ℒ(∑_{i=1}^n λi χ²_1((xi, τ)²)).

In general not much more can be said about this distribution without some assumptions concerning the eigenvalues λ1, ..., λn. However, when BAB is an orthogonal projection of rank k, then Proposition 3.7 is applicable and

ℒ((Y, BABY)) = χ²_k((τ, BABτ)) = χ²_k((Bτ, ABτ)) = χ²_k((μ, Aμ)).
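One checkable consequence of the representation above is its mean: since 𝓔χ²_1((xi, τ)²) = 1 + (xi, τ)², the quadratic form has mean ∑ λi(1 + (xi, τ)²) = tr(AΣ) + (μ, Aμ). The Python sketch below (our own arbitrary Σ, A, and τ) verifies this identity deterministically.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
G = rng.standard_normal((n, n))
Sigma = G @ G.T / n                       # an arbitrary covariance
A0 = rng.standard_normal((n, n))
A = (A0 + A0.T) / 2                       # an arbitrary self-adjoint A
w, U = np.linalg.eigh(Sigma)
B = (U * np.sqrt(w)) @ U.T                # positive semidefinite square root of Sigma
tau = rng.standard_normal(n)
mu = B @ tau                              # mu lies in the range of Sigma
lam, xs = np.linalg.eigh(B @ A @ B)       # spectral form of BAB
# mean of sum_i lam_i * chi^2_1((x_i, tau)^2) equals tr(A Sigma) + (mu, A mu)
lhs = np.sum(lam * (1.0 + (xs.T @ tau) ** 2))
rhs = np.trace(A @ Sigma) + mu @ A @ mu
```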
In summary, we have the following.
Proposition 3.8. Suppose ℒ(X) = N(μ, Σ) where μ ∈ ℛ(Σ), and let B be the positive semidefinite square root of Σ. If A is self-adjoint and BAB is a rank k orthogonal projection, then

ℒ((X, AX)) = χ²_k((μ, Aμ)).
We can use a slightly different set of assumptions and reach the same
conclusion as Proposition 3.8, as follows.
Proposition 3.9. Suppose ℒ(X) = N(μ, Σ) and let B be the positive semidefinite square root of Σ. Write μ = μ1 + μ2 where μ1 ∈ ℛ(Σ) and μ2 ∈ 𝒩(Σ). If A is self-adjoint, Aμ2 = 0, and BAB is a rank k orthogonal projection, then

ℒ((X, AX)) = χ²_k((μ, Aμ)).

Proof. Since Aμ2 = 0, (X, AX) = (X − μ2, A(X − μ2)). Let Y = X − μ2 so ℒ(Y) = N(μ1, Σ) and ℒ((X, AX)) = ℒ((Y, AY)). Since μ1 ∈ ℛ(Σ), Proposition 3.8 shows that

ℒ((Y, AY)) = χ²_k((μ1, Aμ1)).

However, (μ, Aμ) = (μ1, Aμ1) as Aμ2 = 0. □
3.3. INDEPENDENCE OF QUADRATIC FORMS
Thus far, necessary and sufficient conditions for the independence of different linear transformations of a normal random vector have been given
and the distribution of a quadratic form in a normal random vector has been described. In this section, we give sufficient conditions for the independence of different quadratic forms in normal random vectors.
Suppose X ∈ (V, (·,·)) has an N(μ, Σ) distribution and consider two self-adjoint linear transformations, A1 and A2, on V to V. To discuss the independence of (X, A1X) and (X, A2X), it is convenient to first reduce the discussion to the case when μ = 0 and Σ = I. Let B be the positive semidefinite square root of Σ so, if ℒ(Y) = N(0, I), then ℒ(X) = ℒ(BY + μ). Thus it suffices to discuss the independence of (BY + μ, A1(BY + μ)) and (BY + μ, A2(BY + μ)) when ℒ(Y) = N(0, I). However,

(BY + μ, Ai(BY + μ)) = (Y, BAiBY) + 2(BAiμ, Y) + (μ, Aiμ)

for i = 1, 2. Let Ci = BAiB, i = 1, 2, and let xi = 2BAiμ. Then we want to know conditions under which (Y, C1Y) + (x1, Y) and (Y, C2Y) + (x2, Y) are independent when ℒ(Y) = N(0, I). Clearly, the constants (μ, Aiμ), i = 1, 2, do not affect the independence of the two quadratic forms. It is this problem, in reduced form, that is treated now. Before stating the principal result, the following technical proposition is needed.
Proposition 3.10. For self-adjoint linear transformations A1 and A2 on (V, (·,·)) to (V, (·,·)), the following are equivalent:

(i) A1A2 = 0.
(ii) ℛ(A1) ⊥ ℛ(A2).

Proof. If A1A2 = 0, then A1A2x = 0 for all x ∈ V so ℛ(A2) ⊆ 𝒩(A1). Since 𝒩(A1) ⊥ ℛ(A1), ℛ(A2) ⊥ ℛ(A1). Conversely, if ℛ(A1) ⊥ ℛ(A2), then ℛ(A2) ⊆ ℛ(A1)⊥ = 𝒩(A1), and this implies that A1A2x = 0 for all x ∈ V. Therefore, A1A2 = 0. □
Proposition 3.11. Let Y ∈ (V, (·,·)) have a N(0, I) distribution and suppose Zi = (Y, AiY) + (xi, Y) where Ai is self-adjoint and xi ∈ V, i = 1, 2. If A1A2 = 0, A1x2 = 0, A2x1 = 0, and (x1, x2) = 0, then Z1 and Z2 are independent random variables.

Proof. The idea of the proof is to show that Z1 and Z2 are functions of two different independent random vectors. To this end, let Pi be the orthogonal projection onto ℛ(Ai) for i = 1, 2. It is clear that PiAiPi = Ai for i = 1, 2. Thus Zi = (PiY, AiPiY) + (xi, Y) for i = 1, 2. The random vector {P1Y, (x1, Y)} takes values in the direct sum V ⊕ R and Z1 is a function of
this vector. Also, {P2Y, (x2, Y)} takes values in V ⊕ R and Z2 is a function of this vector. The remainder of the proof is devoted to showing that {P1Y, (x1, Y)} and {P2Y, (x2, Y)} are independent random vectors. This is done by verifying that the random vectors are jointly normal and that they are uncorrelated. Let [·,·] denote the induced inner product on the direct sum V ⊕ R. The inner product of the vector {{y1, a1}, {y2, a2}} in (V ⊕ R) ⊕ (V ⊕ R) with {{P1Y, (x1, Y)}, {P2Y, (x2, Y)}} is

(y1, P1Y) + a1(x1, Y) + (y2, P2Y) + a2(x2, Y) = (P1y1 + a1x1 + P2y2 + a2x2, Y),

which has a normal distribution since Y is normal. Thus {{P1Y, (x1, Y)}, {P2Y, (x2, Y)}} has a normal distribution. The independence of these two vectors follows from the calculation below, which shows the vectors are uncorrelated. For {y1, a1} ∈ V ⊕ R and {y2, a2} ∈ V ⊕ R,

cov([{y1, a1}, {P1Y, (x1, Y)}], [{y2, a2}, {P2Y, (x2, Y)}])
= cov{(y1, P1Y) + a1(x1, Y), (y2, P2Y) + a2(x2, Y)}
= cov{(P1y1, Y), (P2y2, Y)} + a1 cov{(x1, Y), (P2y2, Y)}
+ a2 cov{(P1y1, Y), (x2, Y)} + a1a2 cov{(x1, Y), (x2, Y)}
= (P1y1, P2y2) + a1(x1, P2y2) + a2(x2, P1y1) + a1a2(x1, x2)
= (y1, P1P2y2) + a1(P2x1, y2) + a2(P1x2, y1) + a1a2(x1, x2).

However, P1P2 = 0 since ℛ(A1) ⊥ ℛ(A2). Also, P2x1 = 0 as x1 ∈ 𝒩(A2) = ℛ(A2)⊥ and, similarly, P1x2 = 0. Further, (x1, x2) = 0 by assumption. Thus the above covariance is zero so Z1 and Z2 are independent. □
A useful consequence of Proposition 3.11 is Proposition 3.12.
Proposition 3.12. Suppose ℒ(X) = N(μ, Σ) on (V, (·,·)) and let Ci, i = 1, 2, be self-adjoint linear transformations. If C1ΣC2 = 0, then (X, C1X) and (X, C2X) are independent.
Proof. Let B denote the positive semidefinite square root of Σ, and suppose ℒ(Y) = N(0, I). It suffices to show that Z1 = (BY + μ, C1(BY +
μ)) is independent of Z2 = (BY + μ, C2(BY + μ)) since ℒ(X) = ℒ(BY + μ). But

Zi = (Y, BCiBY) + 2(BCiμ, Y) + (μ, Ciμ)

for i = 1, 2. Proposition 3.11 can now be applied with Ai = BCiB and xi = 2BCiμ for i = 1, 2. Since Σ = BB, A1A2 = BC1BBC2B = BC1ΣC2B = 0 as C1ΣC2 = 0 by assumption. Also, A1x2 = 2BC1BBC2μ = 2BC1ΣC2μ = 0. Similarly, A2x1 = 0 and (x1, x2) = 4(BC1μ, BC2μ) = 4(μ, C1ΣC2μ) = 0. Thus (Y, BC1BY) + 2(BC1μ, Y) and (Y, BC2BY) + 2(BC2μ, Y) are independent. Hence Z1 and Z2 are independent. □
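Proposition 3.12 can be illustrated by simulation. In the Python sketch below (our own setup, with Σ = I for simplicity), C1 and C2 project onto orthogonal subspaces, so C1ΣC2 = 0, and the two quadratic forms come out empirically uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
C1 = Q[:, :2] @ Q[:, :2].T                # projections onto orthogonal subspaces,
C2 = Q[:, 2:5] @ Q[:, 2:5].T              # so C1 Sigma C2 = 0 when Sigma = I
mu = rng.standard_normal(n)
X = mu[:, None] + rng.standard_normal((n, 100_000))   # draws of X ~ N(mu, I)
Z1 = np.einsum('in,ij,jn->n', X, C1, X)   # (X, C1 X)
Z2 = np.einsum('in,ij,jn->n', X, C2, X)   # (X, C2 X)
r = np.corrcoef(Z1, Z2)[0, 1]
```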
The results of this section are general enough to handle most situations that arise when dealing with quadratic forms. However, in some cases we need a sufficient condition for the independence of k quadratic forms. An examination of the proof of Proposition 3.11 shows that when ℒ(Y) = N(0, I), the quadratic forms Zi = (Y, AiY) + (xi, Y), i = 1, ..., k, are mutually independent if, for each i ≠ j, AiAj = 0, Aixj = 0, Ajxi = 0, and (xi, xj) = 0. The details of this verification are left to the reader.
3.4. CONDITIONAL DISTRIBUTIONS
The basic result of this section gives the conditional distribution of one normal random vector given another normal random vector. It is this result that underlies many of the important distributional and independence properties of the normal and related distributions that are established in later chapters.

Consider random vectors Xi ∈ (Vi, (·,·)i), i = 1, 2, and assume that the random vector {X1, X2} in the direct sum V1 ⊕ V2 has a normal distribution with mean vector {μ1, μ2} ∈ V1 ⊕ V2 and covariance given by

cov({X1, X2}) = ( Σ11  Σ12 )
                ( Σ21  Σ22 ).

Thus ℒ(Xi) = N(μi, Σii) on (Vi, (·,·)i) for i = 1, 2. The conditional distribution of X1 given X2 = x2 ∈ V2 is described in the next result.

Proposition 3.13. Let ℒ(X1|X2 = x2) denote the conditional distribution of X1 given X2 = x2. Then, under the above normality assumptions,

ℒ(X1|X2 = x2) = N(μ1 + Σ12Σ22⁻(x2 − μ2), Σ11 − Σ12Σ22⁻Σ21).

Here, Σ22⁻ denotes the generalized inverse of Σ22.
Proof. The proof consists of calculating the conditional characteristic function of X1 given X2 = x2. To do this, first note that X1 − Σ12Σ22⁻X2 and X2 are jointly normal on V1 ⊕ V2 and are uncorrelated by Proposition 2.17. Thus X1 − Σ12Σ22⁻X2 and X2 are independent. Therefore, for x ∈ V1,

φ(x) = 𝓔(exp[i(x, X1)1]|X2 = x2)
= 𝓔(exp[i(x, X1)1 − i(x, Σ12Σ22⁻X2)1 + i(x, Σ12Σ22⁻X2)1]|X2 = x2)
= exp[i(x, Σ12Σ22⁻x2)1] 𝓔(exp[i(x, X1 − Σ12Σ22⁻X2)1]|X2 = x2)
= exp[i(x, Σ12Σ22⁻x2)1] 𝓔 exp[i(x, X1 − Σ12Σ22⁻X2)1]

where the last equality follows from the independence of X2 and X1 − Σ12Σ22⁻X2. However, it is clear that

ℒ(X1 − Σ12Σ22⁻X2) = N(μ1 − Σ12Σ22⁻μ2, Σ11 − Σ12Σ22⁻Σ21)

as X1 − Σ12Σ22⁻X2 is normal on V1 and has the given mean vector and covariance (Proposition 2.17). Thus

φ(x) = exp[i(x, Σ12Σ22⁻x2)1] exp[i(x, μ1 − Σ12Σ22⁻μ2)1 − ½(x, (Σ11 − Σ12Σ22⁻Σ21)x)1]
= exp[i(x, μ1 + Σ12Σ22⁻(x2 − μ2))1 − ½(x, (Σ11 − Σ12Σ22⁻Σ21)x)1].

The uniqueness of characteristic functions yields the desired conclusion. □
For normal random vectors Xi ∈ (Vi, (·,·)i), i = 1, 2, Proposition 3.13 shows that the conditional mean of X1 given X2 = x2 is an affine function of x2 (affine means a linear transformation plus a constant vector, so zero does not necessarily get mapped into zero). In other words,

𝓔(X1|X2 = x2) = μ1 + Σ12Σ22⁻(x2 − μ2).

Further, the conditional covariance of X1 does not depend on the value of X2. Also, this conditional covariance is the same as the unconditional covariance of the normal random vector X1 − Σ12Σ22⁻X2. Of course, the specification of the conditional mean vector and covariance specifies the conditional distribution of X1 given X2 = x2 as this conditional distribution is normal.
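The two matrix facts used above can be checked deterministically. The Python sketch below (an illustration of ours, using a nonsingular Σ22 so the generalized inverse is the ordinary inverse) verifies that X1 − Σ12Σ22⁻¹X2 is uncorrelated with X2 and that the conditional covariance is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = 3, 2
G = rng.standard_normal((p + q, p + q))
S = G @ G.T                               # joint covariance of {X1, X2}
S11, S12 = S[:p, :p], S[:p, p:]
S21, S22 = S[p:, :p], S[p:, p:]
K = S12 @ np.linalg.inv(S22)              # Sigma_12 Sigma_22^{-1}
resid_cross = S12 - K @ S22               # cov(X1 - K X2, X2), should be 0
cond_cov = S11 - K @ S21                  # conditional covariance of X1 given X2
```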
* Example 3.1. Let W1, ..., Wn be independent coordinate random vectors in Rᵖ where Rᵖ has the usual inner product. Assume that ℒ(Wi) = N(μ, Σ) so μ ∈ Rᵖ is the coordinate mean vector of each Wi and Σ is the p × p covariance matrix of each Wi. Form the random matrix X ∈ ℒp,n with rows Wi', i = 1, ..., n. We know that

𝓔X = eμ'

and

cov(X) = In ⊗ Σ

where e ∈ Rⁿ is the vector of ones. To show X has a normal distribution on the inner product space (ℒp,n, ⟨·,·⟩), it must be verified that, for each A ∈ ℒp,n, ⟨A, X⟩ has a normal distribution. To do this, let the rows of A be a1', ..., an', ai ∈ Rᵖ. Then

⟨A, X⟩ = tr AX' = ∑_{i=1}^n ai'Wi.

However, ai'Wi has a normal distribution on R since ℒ(Wi) = N(μ, Σ) on Rᵖ. Also, since W1, ..., Wn are independent, a1'W1, ..., an'Wn are independent. Since a linear combination of independent normal random variables is normal, ⟨A, X⟩ has a normal distribution for each A ∈ ℒp,n. Thus

ℒ(X) = N(eμ', In ⊗ Σ)

on the inner product space (ℒp,n, ⟨·,·⟩). We now want to describe the conditional distribution of the first q columns of X given the last r columns of X where q + r = p. After some relabeling and a bit of manipulation, this conditional distribution follows from Proposition 3.13. Partition each Wi into Yi and Zi where Yi ∈ R^q consists of the first q coordinates of Wi and Zi ∈ R^r consists of the last r coordinates of Wi. Let X1 ∈ ℒq,n have rows Y1', ..., Yn' and let X2 ∈ ℒr,n have rows Z1', ..., Zn'. Also, partition μ into μ1 ∈ R^q and μ2 ∈ R^r so 𝓔Yi = μ1 and 𝓔Zi = μ2, i = 1, ..., n. Further, partition the covariance matrix Σ of each Wi so that

cov{Yi, Zi} = ( Σ11  Σ12 )
              ( Σ21  Σ22 )

where Σ21 = Σ12'. From the independence of W1, ..., Wn, it follows
that

ℒ(X1) = N(eμ1', In ⊗ Σ11),
ℒ(X2) = N(eμ2', In ⊗ Σ22),

and {X1, X2} has a normal distribution on ℒq,n ⊕ ℒr,n with mean vector {eμ1', eμ2'} and covariance

( In ⊗ Σ11   In ⊗ Σ12 )
( In ⊗ Σ21   In ⊗ Σ22 ).

Now, Proposition 3.13 is directly applicable to {X1, X2} where we make the parameter correspondence

μi ↔ eμi', i = 1, 2

and

Σij ↔ In ⊗ Σij.

Therefore, the conditional distribution of X1 given X2 = x2 ∈ ℒr,n is normal with mean vector

𝓔(X1|X2 = x2) = eμ1' + (In ⊗ Σ12)(In ⊗ Σ22)⁻(x2 − eμ2')

and covariance

cov(X1|X2 = x2) = In ⊗ Σ11 − (In ⊗ Σ12)(In ⊗ Σ22)⁻(In ⊗ Σ21).

However, it is not difficult to show that (In ⊗ Σ22)⁻ = In ⊗ Σ22⁻. Using the manipulation rules for Kronecker products, we have

𝓔(X1|X2 = x2) = eμ1' + (x2 − eμ2')Σ22⁻Σ21

and

cov(X1|X2 = x2) = In ⊗ (Σ11 − Σ12Σ22⁻Σ21).

This result is used in a variety of contexts in later chapters.
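The Kronecker-product manipulations invoked above are easy to confirm numerically. A small Python sketch (our own arbitrary Σ12 and nonsingular Σ22) checks the two identities used: (In ⊗ Σ22)⁻¹ = In ⊗ Σ22⁻¹ and (In ⊗ Σ12)(In ⊗ Σ22⁻¹) = In ⊗ (Σ12Σ22⁻¹).

```python
import numpy as np

rng = np.random.default_rng(6)
n, q, r = 4, 2, 3
G = rng.standard_normal((r, r))
S22 = G @ G.T                             # a nonsingular Sigma_22
S12 = rng.standard_normal((q, r))         # an arbitrary Sigma_12
I = np.eye(n)
S22inv = np.linalg.inv(S22)
lhs_inv = np.linalg.inv(np.kron(I, S22))  # (I_n ⊗ Σ22)^{-1}
rhs_inv = np.kron(I, S22inv)              # I_n ⊗ Σ22^{-1}
lhs_prod = np.kron(I, S12) @ np.kron(I, S22inv)
rhs_prod = np.kron(I, S12 @ S22inv)
```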
3.5. THE DENSITY OF THE NORMAL DISTRIBUTION
The problem considered here is how to define the density function of a nonsingular normal distribution on an inner product space (V, (·,·)). By nonsingular, we mean that the covariance of the distribution is nonsingular. To motivate the technical considerations given below, the density function of a nonsingular normal distribution is first given for the standard coordinate space Rⁿ with the usual inner product.

Consider a random vector X in Rⁿ with coordinates X1, ..., Xn and assume that X1, ..., Xn are independent with ℒ(Xi) = N(0, 1). The symbol dx denotes Lebesgue measure on Rⁿ. Since X1, ..., Xn are independent, the joint density of X1, ..., Xn in Rⁿ is just the product of the marginal densities; that is, X has a density with respect to dx given by

p(x) = ∏_{i=1}^n (2π)^(−1/2) exp[−½xi²] = (2π)^(−n/2) exp[−½ ∑_{i=1}^n xi²]

where x ∈ Rⁿ has coordinates x1, ..., xn. Thus

p(x) = (2π)^(−n/2) exp[−½ x'x]

and x'x is just the inner product of x with x in Rⁿ. To derive the density of an arbitrary nonsingular normal distribution in Rⁿ, let A be an n × n nonsingular matrix and set Y = AX + μ where μ ∈ Rⁿ. Since ℒ(X) = N(0, In), ℒ(Y) = N(μ, Σ) where Σ = AA' is positive definite. Thus X = A⁻¹(Y − μ) and the Jacobian of the nonsingular linear transformation on Rⁿ to Rⁿ sending x into A⁻¹(x − μ) is |det(A⁻¹)| where |·| denotes absolute value. Therefore, the density function of Y with respect to dy is

p1(y) = |det(A⁻¹)| p(A⁻¹(y − μ)) = (det Σ)^(−1/2) (2π)^(−n/2) exp[−½(y − μ)'A'⁻¹A⁻¹(y − μ)]
= (det Σ)^(−1/2) (2π)^(−n/2) exp[−½(y − μ)'Σ⁻¹(y − μ)].

Thus we have the density function with respect to dy of any nonsingular normal distribution on Rⁿ. Of course, this expression makes no sense when Σ is singular.
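The density just derived can be checked against a library implementation. The Python sketch below (our own arbitrary A, μ, and evaluation point) evaluates the formula for p1 directly and compares it with SciPy's multivariate normal density.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 3
A = rng.standard_normal((n, n))
Sigma = A @ A.T                           # Sigma = A A' is positive definite
mu = rng.standard_normal(n)
y = rng.standard_normal(n)                # an arbitrary evaluation point
d = y - mu
p1 = (np.linalg.det(Sigma) ** -0.5 * (2 * np.pi) ** (-n / 2)
      * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))
ref = stats.multivariate_normal.pdf(y, mean=mu, cov=Sigma)
```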
Now, suppose Y is a random vector in an n-dimensional vector space (V, (·,·)) and ℒ(Y) = N(μ, Σ) where Σ is positive definite. The expression

p2(y) = (2π)^(−n/2) (det Σ)^(−1/2) exp[−½(y − μ, Σ⁻¹(y − μ))]
for y ∈ V, certainly makes sense and it is tempting to call this the density function of Y ∈ (V, (·,·)). The problem is: What is the measure on (V, (·,·)) with respect to which p2 is a density? In other words, what is the analog of Lebesgue measure on (V, (·,·))? To answer the question, we now show that there is a natural measure on (V, (·,·)), which is constructed from Lebesgue measure on Rⁿ, and p2 is the density function of Y with respect to this measure.

The details of the construction of "Lebesgue measure" on an n-dimensional inner product space (V, (·,·)) follow. First, we review some basic topological notions for (V, (·,·)). Recall that Sr(x0) = {x | ‖x − x0‖ < r} is called the open ball of radius r with center x0. A set B ⊆ V is called open if, for each x0 ∈ B, there is an r > 0 such that Sr(x0) ⊆ B. Since all inner products on V are related by positive definite linear transformations, the definition of open does not depend on the given inner product. A set is closed iff its complement is open and a set is bounded iff it is contained in Sr(0) for some r > 0. Just as in Rⁿ, a set is compact iff it is closed and bounded (see Rudin, 1953, for the definition and characterization of compact sets in Rⁿ). As with openness, the definitions and characterizations of closedness, boundedness, and compactness do not depend on the particular inner product on V. Let l denote standard Lebesgue measure on Rⁿ. To move l over to the space V, let x1, ..., xn be a fixed orthonormal basis in (V, (·,·)) and define the linear transformation T on Rⁿ to V by

T(a) = ∑_{i=1}^n ai xi

where a ∈ Rⁿ has coordinates a1, ..., an. Clearly, T is one-to-one, onto, and maps open, closed, bounded, and compact sets of Rⁿ into open, closed, bounded, and compact sets of V. Also, T⁻¹ on V to Rⁿ maps x ∈ V into the vector with coordinates (xi, x), i = 1, ..., n. Now, define the measure ν0 on Borel sets B ∈ 𝔅(V) by

ν0(B) = l(T⁻¹(B)).

Notice that ν0(B + x) = l(T⁻¹(B + x)) = l(T⁻¹(B) + T⁻¹x) = l(T⁻¹(B)) = ν0(B) since Lebesgue measure is invariant under translations. Also, ν0(B) < +∞ if B is a compact set. This leads to the following definition.

Definition 3.3. A nonzero measure ν defined on the Borel sets 𝔅(V) of (V, (·,·)) is invariant if:

(i) ν(B + x) = ν(B) for x ∈ V and B ∈ 𝔅(V).
(ii) ν(B) < +∞ for all compact sets B.
The measure ν0 defined above is invariant and it is shown below that, if ν is any invariant measure on 𝔅(V), then ν = cν0 for some constant c > 0. Condition (ii) of Definition 3.3 relates the topology of V to the measure ν. The measure that counts the number of points in a set satisfies (i) but not (ii) of Definition 3.3, and this measure is not equal to a positive constant times ν0. Before characterizing the measure ν0, it is now shown that ν0 is a dominating measure for the density function of a nonsingular normal distribution on (V, (·,·)).
Proposition 3.14. Suppose ℒ(Y) = N(μ, Σ) on the inner product space (V, (·,·)) where Σ is nonsingular. The density function of Y with respect to the measure ν0 is given by

p(y) = (2π)^(−n/2) (det Σ)^(−1/2) exp[−½(y − μ, Σ⁻¹(y − μ))]

for y ∈ V.
Proof. It must be shown that, for each Borel set B,

P{Y ∈ B} = ∫ IB(y) p(y) ν0(dy)

where IB is the indicator function of the set B. From the definition of the measure ν0, it follows that (see Lehmann, 1959, p. 38)

∫ IB(y) p(y) ν0(dy) = ∫ IB(T(a)) p(T(a)) l(da).

Let X = T⁻¹(Y) ∈ Rⁿ so X is a random vector with coordinates (xi, Y), i = 1, ..., n. Thus X has a normal distribution in Rⁿ with mean vector T⁻¹(μ) and covariance matrix [Σ] where [Σ] is the matrix of Σ in the given orthonormal basis x1, ..., xn. Therefore,

P{Y ∈ B} = P{T⁻¹(Y) ∈ T⁻¹(B)} = P{X ∈ T⁻¹(B)}
= ∫ I_{T⁻¹(B)}(a) (2π)^(−n/2) (det[Σ])^(−1/2) exp[−½(a − T⁻¹(μ))'[Σ]⁻¹(a − T⁻¹(μ))] l(da)
= ∫ IB(T(a)) p(T(a)) l(da).
The last equality follows since I_{T⁻¹(B)}(a) = IB(T(a)) and

p(T(a)) = (2π)^(−n/2) (det Σ)^(−1/2) exp[−½(T(a) − μ, Σ⁻¹(T(a) − μ))]
= (2π)^(−n/2) (det[Σ])^(−1/2) exp[−½(a − T⁻¹(μ))'[Σ]⁻¹(a − T⁻¹(μ))].

Thus

P{Y ∈ B} = ∫ IB(T(a)) p(T(a)) l(da) = ∫ IB(y) p(y) ν0(dy). □
We now want to show that the measure ν0, constructed from Lebesgue measure on Rⁿ, is the unique translation invariant measure that satisfies

∫ p(y) ν0(dy) = 1.

Let 𝒦⁺ be the collection of all bounded non-negative Borel measurable functions f defined on V that satisfy the following: given f ∈ 𝒦⁺, there is a compact set B such that f(v) = 0 if v ∉ B. If ν is any invariant measure on V and f ∈ 𝒦⁺, then ∫ f(v) ν(dv) < +∞ since f is bounded and the ν-measure of every compact set is finite. It is clear that, if ν1 and ν2 are invariant measures such that

∫ f(v) ν1(dv) = ∫ f(v) ν2(dv) for all f ∈ 𝒦⁺,

then ν1 = ν2. From the definition of an invariant measure, we also have

∫ f(v + x) ν(dv) = ∫ f(v) ν(dv)
for all f ∈ 𝒦⁺ and x ∈ V. Furthermore, the definition of ν0 shows that

∫ f(x) ν0(dx) = ∫ f(T(a)) l(da) = ∫ f(T(−a)) l(da) = ∫ f(−T(a)) l(da) = ∫ f(−x) ν0(dx)

for all f ∈ 𝒦⁺. Here, we have used the linearity of T and the invariance of Lebesgue measure under multiplication of the argument of integration by minus one.
Proposition 3.15. If ν is an invariant measure on 𝔅(V), then there exists a positive constant c such that ν = cν0.

Proof. For f, g ∈ 𝒦⁺, we have

∫ f(x) ν(dx) ∫ g(y) ν0(dy) = ∫∫ f(x − y) g(y) ν(dx) ν0(dy)
= ∫∫ f(−(y − x)) g(y − x + x) ν0(dy) ν(dx)
= ∫∫ f(−w) g(w + x) ν0(dw) ν(dx)
= ∫∫ f(−w) g(w + x) ν(dx) ν0(dw)
= ∫ f(−w) ν0(dw) ∫ g(x) ν(dx)
= ∫ f(w) ν0(dw) ∫ g(x) ν(dx).

Therefore,

∫ f(x) ν(dx) ∫ g(y) ν0(dy) = ∫ f(w) ν0(dw) ∫ g(y) ν(dy)

for all f, g ∈ 𝒦⁺. Fix f ∈ 𝒦⁺ such that ∫ f(w) ν0(dw) = 1 and set c = ∫ f(x) ν(dx). Then

∫ g(y) ν(dy) = c ∫ g(y) ν0(dy)
for all g ∈ 𝒦⁺. The constant c cannot be zero as the measure ν is not zero. Thus c > 0 and ν = cν0. □
The measure ν0 is called the Lebesgue measure on V and is henceforth denoted by dv or dx, as is the Lebesgue measure on Rⁿ. It is possible to show that ν0 does not depend on the particular orthonormal basis used to define it by using a Jacobian argument in Rⁿ. However, the argument given above contains more information than this. In fact, some minor technical modifications of the proof of Proposition 3.15 yield the uniqueness (up to a positive constant) of invariant measures on locally compact topological groups. This topic is discussed in detail in Chapter 6.

An application of Proposition 3.14 to the situation treated in Example 3.1 follows.
* Example 3.2. For independent coordinate random vectors Wi ∈ Rᵖ, i = 1, ..., n, with ℒ(Wi) = N(μ, Σ), form the random matrix X ∈ ℒp,n with rows Wi', i = 1, ..., n. As shown in Example 3.1,

ℒ(X) = N(eμ', In ⊗ Σ)

on the inner product space (ℒp,n, ⟨·,·⟩), where e ∈ Rⁿ is the vector of ones. Let dX denote Lebesgue measure on the vector space ℒp,n. If Σ is nonsingular, then In ⊗ Σ is nonsingular and (In ⊗ Σ)⁻¹ = In ⊗ Σ⁻¹. Thus when Σ is nonsingular, the density of X with respect to dX is

(3.1) p(X) = (2π)^(−np/2) (det(In ⊗ Σ))^(−1/2) exp[−½⟨X − eμ', (In ⊗ Σ⁻¹)(X − eμ')⟩].

It is shown in Chapter 5 that det(In ⊗ Σ) = (det Σ)ⁿ. Since the inner product ⟨·,·⟩ is given by the trace, the density p can be written

p(X) = (2π)^(−np/2) (det Σ)^(−n/2) exp[−½ tr(X − eμ')'(X − eμ')Σ⁻¹].

However, this form of the density is somewhat less revealing, from a statistical point of view, than (3.1). In order to make this statement more precise and to motivate some future statistical considerations, we now think of μ ∈ Rᵖ and Σ as unknown parameters. Thus, we
can write (3.1) as

(3.2) p(X|μ, Σ) = (2π)^(−np/2) (det Σ)^(−n/2) exp[−½⟨X − eμ', (In ⊗ Σ⁻¹)(X − eμ')⟩]

where μ ranges over Rᵖ and Σ ranges over all p × p positive definite matrices. Thus we have a parametric family of densities for the distribution of the random vector X. As a first step in analyzing this parametric family, let

M = {x ∈ ℒp,n | x = eμ', μ ∈ Rᵖ}.

It is clear that M is a p-dimensional linear subspace of ℒp,n and M is simply the space of possible values for the mean vector of X. Let Pe = (1/n)ee' so Pe is the orthogonal projection onto span{e} ⊆ Rⁿ. Thus Pe ⊗ Ip is an orthogonal projection and it is easily verified that the range of Pe ⊗ Ip is M. Therefore, the orthogonal projection onto M is Pe ⊗ Ip. Let Qe = In − Pe so Qe ⊗ Ip is the orthogonal projection onto M⊥ and (Qe ⊗ Ip)(Pe ⊗ Ip) = 0. We now decompose X into the part of X in M and the part of X in M⊥; that is, write X = (Pe ⊗ Ip)X + (Qe ⊗ Ip)X. Substituting this into the exponential part of (3.2) and using the relation (Pe ⊗ Ip)(In ⊗ Σ⁻¹)(Qe ⊗ Ip) = 0, we have

⟨X − eμ', (In ⊗ Σ⁻¹)(X − eμ')⟩
= ⟨Pe(X − eμ'), (In ⊗ Σ⁻¹)Pe(X − eμ')⟩ + ⟨QeX, (In ⊗ Σ⁻¹)QeX⟩
= ⟨PeX − eμ', (In ⊗ Σ⁻¹)(PeX − eμ')⟩ + tr QeXΣ⁻¹(QeX)'
= ⟨PeX − eμ', (In ⊗ Σ⁻¹)(PeX − eμ')⟩ + tr X'QeXΣ⁻¹.

Thus the density p(X|μ, Σ) is a function of the pair (PeX, X'QeX), so PeX and X'QeX form a sufficient statistic for the parametric family (3.2). Proposition 3.4 shows that (Pe ⊗ Ip)X and (Qe ⊗ Ip)X are independent since (Pe ⊗ Ip)(In ⊗ Σ)(Qe ⊗ Ip) = (PeQe) ⊗ Σ = 0 as PeQe = 0. Therefore, PeX and X'QeX are independent since PeX = (Pe ⊗ Ip)X and X'QeX = ((Qe ⊗ Ip)X)'((Qe ⊗ Ip)X). To interpret the sufficient statistic in terms of the original random
vectors W_1, ..., W_n, first note that

P_eX = (1/n)ee'X = eW̄'

where W̄ = (1/n)Σ_i W_i is the sample mean. Also,

X'Q_eX = (Q_eX)'(Q_eX) = ((I_n - P_e)X)'((I_n - P_e)X)
       = (X - eW̄')'(X - eW̄') = Σ_{i=1}^n (W_i - W̄)(W_i - W̄)'.

The quantity (1/n)X'Q_eX is often called the sample covariance matrix. Since eW̄' and W̄ are one-to-one functions of each other, we have that the sample mean and sample covariance matrix form a sufficient statistic and they are independent. It is clear that

ℒ(W̄) = N(μ, (1/n)Σ).
The distribution of X'Q_eX, commonly called the Wishart distribution, is derived later. The procedure of decomposing X into the projection onto the mean space (the subspace M) and the projection onto the orthogonal complement of the mean space is fundamental in multivariate analysis, just as it is in univariate statistical analysis. In fact, this procedure is at the heart of the analysis of linear models, a topic to be considered in the next chapter.
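The two identities above, P_eX = eW̄' and X'Q_eX = Σ(W_i - W̄)(W_i - W̄)', are easy to check numerically. The following sketch is our illustration, not part of the text; it assumes numpy and uses random data standing in for W_1, ..., W_n:

```python
import numpy as np

# Sketch (not from the text): random data standing in for W_1, ..., W_n.
rng = np.random.default_rng(0)
n, p = 5, 3
X = rng.normal(size=(n, p))   # rows of X are W_1', ..., W_n'

e = np.ones((n, 1))
P_e = e @ e.T / n             # orthogonal projection onto span{e} in R^n
Q_e = np.eye(n) - P_e         # projection onto the orthogonal complement

W_bar = X.mean(axis=0)        # the sample mean W-bar

# P_e X = e W-bar', the projection of X onto the mean space M
assert np.allclose(P_e @ X, e @ W_bar.reshape(1, p))

# X' Q_e X = sum_i (W_i - W-bar)(W_i - W-bar)', n times the sample covariance
S = sum(np.outer(X[i] - W_bar, X[i] - W_bar) for i in range(n))
assert np.allclose(X.T @ Q_e @ X, S)
```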
PROBLEMS
1. Suppose X_1, ..., X_n are independent with values in (V, (·, ·)) and ℒ(X_i) = N(μ_i, A_i), i = 1, ..., n. Show that ℒ(Σ X_i) = N(Σ μ_i, Σ A_i).
2. Let X and Y be random vectors in R^n whose joint distribution is normal with mean zero and covariance

Cov(X, Y) = ( I_n    ρI_n )
            ( ρI_n   I_n  )

where ρ is a scalar. Show that |ρ| ≤ 1, and that the covariance is positive definite iff |ρ| < 1. Let Q(Y) = I_n - (Y'Y)^{-1}YY'. Prove that W = X'Q(Y)X has the distribution of (1 - ρ²)χ²_{n-1} (the constant 1 - ρ² times a chi-squared random variable with n - 1 degrees of freedom).
3. When X ∈ R^n and ℒ(X) = N(0, Σ) with Σ nonsingular, then ℒ(X) = ℒ(CZ) where ℒ(Z) = N(0, I) and CC' = Σ. Hence ℒ(C^{-1}X) = ℒ(Z), so C^{-1} transforms X into a vector of i.i.d. N(0, 1) random variables. There are many C^{-1}'s that do this. The problem at hand concerns the construction of one such C^{-1}. Given any p × p positive definite matrix A, p ≥ 2, partition A as

A = ( a_{11}  A_{12} )
    ( A_{21}  A_{22} )

where a_{11} ∈ R^1 and A_{21} = A'_{12} ∈ R^{p-1}. Define T_p(A) by

T_p(A) = ( a_{11}^{-1/2}        0       )
         ( -A_{21}a_{11}^{-1}   I_{p-1} )
(i) Partition Σ : n × n as A is partitioned and set X^{(1)} = T_n(Σ)X. Show that

Cov(X^{(1)}) = ( 1  0       )
               ( 0  Σ^{(1)} )

where Σ^{(1)} = Σ_{22} - Σ_{21}Σ_{12}/σ_{11}.

(ii) For k = 1, 2, ..., n - 2, define X^{(k+1)} by

X^{(k+1)} = ( I_k  0                 ) X^{(k)}.
            ( 0    T_{n-k}(Σ^{(k)}) )

Prove that

Cov(X^{(k+1)}) = ( I_{k+1}  0         )
                 ( 0        Σ^{(k+1)} )

for some positive definite Σ^{(k+1)}.

(iii) For k = 0, ..., n - 2, let

T^{(k)} = ( I_k  0                 )
          ( 0    T_{n-k}(Σ^{(k)}) )

where Σ^{(0)} = Σ (so T^{(0)} = T_n(Σ)). With T = T^{(n-2)} ··· T^{(0)}, show that X^{(n-1)} = TX and Cov(X^{(n-1)}) = I_n. Also, show that T is lower triangular and Σ^{-1} = T'T.
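A numerical sketch of the construction in Problem 3 follows (our illustration, not part of the text). One caveat: since T_p normalizes only the leading coordinate of each block, the sketch also applies the 1 × 1 case of T_p as a final scaling step so that the last variance equals one; with that convention, the claims of part (iii) can be checked directly:

```python
import numpy as np

def T_block(A):
    # The transformation T_p(A) from the problem; for a 1x1 block this
    # reduces to scaling by a11^{-1/2} (used here as a final normalization
    # step -- an assumption of this sketch).
    p = A.shape[0]
    T = np.eye(p)
    T[0, 0] = A[0, 0] ** -0.5
    if p > 1:
        T[1:, 0] = -A[1:, 0] / A[0, 0]
    return T

def build_T(Sigma):
    """Accumulate T = T^{(n-1)} ... T^{(0)} so that T Sigma T' = I."""
    n = Sigma.shape[0]
    T = np.eye(n)
    S = Sigma.copy()
    for k in range(n):
        Tk = np.eye(n)
        Tk[k:, k:] = T_block(S[k:, k:])
        S = Tk @ S @ Tk.T          # covariance of the transformed vector
        T = Tk @ T
    return T

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
Sigma = B @ B.T + 4 * np.eye(4)    # an arbitrary positive definite Sigma

T = build_T(Sigma)
assert np.allclose(T @ Sigma @ T.T, np.eye(4))      # Cov(TX) = I_n
assert np.allclose(T, np.tril(T))                   # T is lower triangular
assert np.allclose(T.T @ T, np.linalg.inv(Sigma))   # Sigma^{-1} = T'T
```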
4. Suppose X ∈ R² has coordinates X₁ and X₂ and has density

p(x) = (1/π) exp[-½(x₁² + x₂²)]  if x₁x₂ > 0,  and p(x) = 0 otherwise,

so p is zero in the second and fourth quadrants. Show that X₁ and X₂ are both normal but X is not normal.
5. Let X₁, ..., X_n be i.i.d. N(μ, σ²) random variables. Show that U = Σ(X_i - X̄)² and W = ΣX_i are independent. What is the distribution of U?
6. For X ∈ (V, (·, ·)) with ℒ(X) = N(0, I), suppose (X, AX) and (X, BX) are independent. If A and B are both positive semidefinite, prove that AB = 0. Hint: Show that tr AB = 0 by using cov{(X, AX), (X, BX)} = 0. Then use the positive semidefiniteness of A and B together with tr AB = 0 to conclude that AB = 0.
7. The method used to define the normal distribution on (V, (·, ·)) consisted of three steps: (i) first, an N(0, 1) distribution was defined on R¹; (ii) next, if ℒ(Z) = N(0, 1), then W is N(μ, σ²) if ℒ(W) = ℒ(σZ + μ); and (iii) X with values in (V, (·, ·)) is normal if (x, X) is normal on R¹ for each x ∈ V. It is natural to ask if this procedure can be used to define other types of distributions on (V, (·, ·)). Here is an attempt for the Cauchy distribution. For Z ∈ R¹, say Z is standard Cauchy (which we write as ℒ(Z) = C(0, 1)) if the density of Z is

f(z) = (1/π) · 1/(1 + z²),  z ∈ R¹.

Say W has a Cauchy distribution on R¹ if ℒ(W) = ℒ(σZ + μ) for some μ ∈ R¹ and σ > 0; in this case write ℒ(W) = C(μ, σ). Finally, say X ∈ (V, (·, ·)) is Cauchy if (x, X) is Cauchy on R¹ for each x ∈ V.

(i) Let W_1, ..., W_n be independent with ℒ(W_j) = C(μ_j, σ_j), j = 1, ..., n. Show that ℒ(Σ a_jW_j) = C(Σ a_jμ_j, Σ |a_j|σ_j). Hint: The characteristic function of a C(0, 1) distribution is exp[-|t|], t ∈ R¹.

(ii) Let Z₁, ..., Z_n be i.i.d. C(0, 1) and let x₁, ..., x_n be any basis for (V, (·, ·)). Show that X = Σ Z_jx_j has a Cauchy distribution on (V, (·, ·)).
8. Consider a density on R¹ given by

f(u) = ∫₀^∞ t^{-1}φ(u/t) G(dt)
where φ is the density of an N(0, 1) distribution and G is a distribution function with G(0) = 0. The distribution defined by f is called a scale mixture of normals.

(i) Let Z₀ be N(0, 1) and let R be independent of Z₀ with ℒ(R) = G. Show that U = RZ₀ has f as its density function.

If ℒ(Y) = ℒ(cU) for some c > 0, we say that Y has a type-f distribution.

(ii) In (V, (·, ·)), suppose ℒ(Z) = N(0, I) and form X = RZ where R and Z are independent and ℒ(R) = G. For each x ∈ V, show that (x, X) has a type-f distribution.

Remark. The distribution of X in (V, (·, ·)) provides a possible vector space generalization of a type-f distribution on R¹.
9. In the notation of Example 3.1, assume that μ = 0 so ℒ(X) = N(0, I_n ⊗ Σ) on (ℒ_{p,n}, ⟨·, ·⟩). Also,

ℒ(X₁|X₂ = x₂) = N(x₂Σ₂₂^{-1}Σ₂₁, I_n ⊗ Σ₁₁·₂)

where Σ₁₁·₂ = Σ₁₁ - Σ₁₂Σ₂₂^{-1}Σ₂₁. Show that the conditional distribution of X₂'X₁ given X₂ is the same as the conditional distribution of X₂'X₁ given X₂'X₂.
10. The map T of Section 3.5 was defined from R^n to (V, (·, ·)) by Ta = Σ a_ix_i where x₁, ..., x_n is an orthonormal basis for (V, (·, ·)). Also, we defined ν₀ by ν₀(B) = l(T^{-1}(B)) for B ∈ 𝔅(V), where l is Lebesgue measure on R^n. Consider another orthonormal basis y₁, ..., y_n for (V, (·, ·)) and define T₁ by T₁a = Σ a_iy_i, a ∈ R^n. Define ν₁ by ν₁(B) = l(T₁^{-1}(B)) for B ∈ 𝔅(V). Prove that ν₀ = ν₁.
11. The measure ν₀ in Problem 10 depends on the inner product (·, ·) on V. Suppose [·, ·] is another inner product given by [x, y] = (x, Ay) where A > 0. Let ν₁ be the measure constructed on (V, [·, ·]) in the same manner that ν₀ was constructed on (V, (·, ·)). Show that ν₁ = cν₀ where c = (det A)^{1/2}.
12. Consider the space S_p of p × p symmetric matrices with the inner product given by ⟨S₁, S₂⟩ = tr S₁S₂. Show that the density function of an N(0, I) distribution on (S_p, ⟨·, ·⟩) with respect to the measure ν₀ is

p(S) = (2π)^{-p(p+1)/4} exp[-½(Σ_i s_{ii}² + 2Σ_{i<j} s_{ij}²)]

where S = {s_{ij}}, i, j = 1, ..., p. Explain your answer (what is ν₀?).
13. Consider X₁, ..., X_n, i.i.d. N(μ, Σ) on R^p. Let X ∈ ℒ_{p,n} have rows X₁', ..., X_n' so ℒ(X) = N(eμ', I_n ⊗ Σ). Assume that Σ has the form

Σ = σ² ( 1  ρ  ···  ρ )
        ( ρ  1  ···  ρ )
        ( ⋮  ⋮  ⋱   ⋮ )
        ( ρ  ρ  ···  1 )

where σ² > 0 and -1/(p - 1) < ρ < 1 so that Σ is positive definite. Such a covariance matrix is said to have intraclass covariance structure.

(i) On R^p, let A = (1/p)e₁e₁' where e₁ ∈ R^p is the vector of ones. Show that a positive definite covariance matrix has intraclass covariance structure iff Σ = αA + β(I - A) for some positive scalars α and β. In this case Σ^{-1} = α^{-1}A + β^{-1}(I - A).

(ii) Using the notation and methods of Example 3.2, show that when (μ, σ², ρ) are unknown parameters, (X̄, tr AX'Q_eX, tr(I - A)X'Q_eX) is a sufficient statistic.
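Part (i) of Problem 13 can be checked numerically for particular values of (p, σ², ρ); the values below (p = 4, σ² = 2, ρ = 0.3) are our assumptions for illustration. The scalars are α = σ²(1 + (p - 1)ρ) and β = σ²(1 - ρ), the eigenvalues of Σ on the ranges of A and I - A:

```python
import numpy as np

# Assumed illustrative values; any -1/(p-1) < rho < 1 would do.
p, sigma2, rho = 4, 2.0, 0.3
e1 = np.ones(p)
A = np.outer(e1, e1) / p                   # orthogonal projection onto span{e1}
Sigma = sigma2 * ((1 - rho) * np.eye(p) + rho * np.outer(e1, e1))

# Sigma = alpha*A + beta*(I - A) with the eigenvalues alpha, beta below
alpha = sigma2 * (1 + (p - 1) * rho)
beta = sigma2 * (1 - rho)
assert np.allclose(Sigma, alpha * A + beta * (np.eye(p) - A))

# and, since A and I - A are orthogonal projections,
# Sigma^{-1} = alpha^{-1} A + beta^{-1} (I - A)
assert np.allclose(np.linalg.inv(Sigma), A / alpha + (np.eye(p) - A) / beta)
```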
NOTES AND REFERENCES
1. A coordinate treatment of the normal distribution, similar to the treatment given here, can be found in Muirhead (1982).
2. Examples 3.1 and 3.2 indicate some of the advantages of vector space
techniques over coordinate techniques. For comparison, the reader may find it instructive to formulate coordinate versions of these examples.
3. The converse of Proposition 3.11 is true. The only proof I know involves characteristic functions. For a discussion of this, see Srivastava and
Khatri (1979, p. 64).
CHAPTER 4
Linear Statistical Models
The purpose of this chapter is to develop a theory of linear unbiased estimation that is sufficiently general to be applicable to the linear models arising in multivariate analysis. Our starting point is the classical regression
model where the Gauss-Markov Theorem is formulated in vector space language. The approach taken here is to first isolate the essential aspects of a regression model and then use the vector space machinery developed thus
far to derive the Gauss-Markov estimator of a mean vector.
After presenting a useful necessary and sufficient condition for the equality of the Gauss-Markov and least-squares estimators of a mean
vector, we then discuss the existence of Gauss-Markov estimators for what might be called generalized linear models. This discussion leads to a version
of the Gauss-Markov Theorem that is directly applicable to the general linear model of multivariate analysis.
4.1. THE CLASSICAL LINEAR MODEL
The linear regression model arises from the following considerations. Suppose we observe a random variable Y_i ∈ R and associated with Y_i are known numbers z_{i1}, ..., z_{ik}, i = 1, ..., n. The numbers z_{i1}, ..., z_{ik} might be indicator variables denoting the presence or absence of a treatment, as in an analysis of variance situation, or they might be the numerical levels of some physical parameters that affect the observed value of Y_i. It is assumed that the mean value of Y_i is ℰY_i = Σ_j β_j z_{ij}, where the β_j are unknown parameters. It is also assumed that var(Y_i) = σ² > 0 and cov(Y_i, Y_j) = 0 if i ≠ j. Let Y ∈ R^n be the random vector with coordinates Y₁, ..., Y_n, let Z = {z_{ij}} be the n × k matrix of z_{ij}'s, and let β ∈ R^k be the vector with coordinates β₁, ..., β_k. In vector form, the assumptions we have made
concerning Y are that ℰY = Zβ and Cov(Y) = σ²I_n. In summary, we observe the vector Y whose mean is Zβ, where Z is a known n × k matrix, β ∈ R^k is a vector of unknown parameters, and Cov(Y) = σ²I_n where σ² is an unknown parameter. The two essential features of this parametric model are: (i) the mean vector of Y is an unknown element of a known subspace of R^n; namely, ℰY is an element of the range of the known linear transformation determined by Z that maps R^k to R^n; (ii) Cov(Y) = σ²I_n; that is, the distribution of Y is weakly spherical. For a discussion of the classical statistical problems related to the above model, the reader is referred to Scheffé (1959).
Now, consider a finite dimensional inner product space (V, (·, ·)). With the above regression model in mind, we define a weakly spherical linear model for a random vector with values in (V, (·, ·)).

Definition 4.1. Let M be a subspace of V and let ε₀ be a random vector in V with a distribution that satisfies ℰε₀ = 0 and Cov(ε₀) = I. For each μ ∈ M and σ > 0, let Q_{μ,σ} denote the distribution of μ + σε₀. The family {Q_{μ,σ} | μ ∈ M, σ > 0} is a weakly spherical linear model for Y ∈ V if the distribution of Y is in {Q_{μ,σ} | μ ∈ M, σ > 0}.
This definition is just a very formal statement of the assumption that the mean vector of Y is an element of the subspace M and the distribution of Y is weakly spherical, so Cov(Y) = σ²I for some σ² > 0. In an abuse of notation, we often write Y = μ + ε for μ ∈ M, where ε is a random vector with ℰε = 0 and Cov(ε) = σ²I. This is to indicate the assumption that we have a weakly spherical linear parametric model for the distribution of Y. The unobserved random vector ε is often called the error vector. The subspace M is called the regression subspace (or manifold) and the subspace M⊥ is called the error subspace. Further, the parameter μ ∈ M is assumed unknown, as is the parameter σ². It is clear that the regression model used to motivate Definition 4.1 is a weakly spherical linear model for the observed random vector, and the subspace M is just the range of Z.
Given a linear model Y = μ + ε, μ ∈ M, ℰε = 0, Cov(ε) = σ²I, we now want to discuss the problem of estimating μ. The classical Gauss-Markov approach to estimating μ is to first restrict attention to linear transformations of Y that are unbiased estimators and then, within this class of estimators, find the estimator with minimum expected norm-squared deviation from μ. To make all of this precise, we proceed as follows. By a linear estimator of μ, we mean an estimator of the form AY where A ∈ ℒ(V, V). (We could consider affine estimators AY + v₀, v₀ ∈ V, but the unbiasedness restriction would imply v₀ = 0.) A linear estimator AY of μ is unbiased iff, when μ ∈ M is the mean of Y, we have ℰ(AY) = μ. This is equivalent
to the condition that Aμ = μ for all μ ∈ M since ℰAY = AℰY = Aμ. Thus AY is an unbiased estimator of μ iff Aμ = μ for all μ ∈ M. Let

𝒜 = {A | A ∈ ℒ(V, V), Aμ = μ for μ ∈ M}.

The linear unbiased estimators of μ are those estimators of the form AY with A ∈ 𝒜. We now want to choose the one estimator (i.e., A ∈ 𝒜) that minimizes the expected norm-squared deviation of the estimator from μ. In other words, the problem is to find an element A ∈ 𝒜 that minimizes ℰ‖AY - μ‖². The justification for choosing such an A is that ‖AY - μ‖² is the squared distance between AY and μ, so ℰ‖AY - μ‖² is the average squared distance between AY and μ. Since we would like AY to be close to μ, such a criterion for choosing A ∈ 𝒜 seems reasonable. The first result in this chapter, the Gauss-Markov Theorem, shows that the orthogonal projection onto M, say P, is the unique element in 𝒜 that minimizes ℰ‖AY - μ‖².
Theorem 4.1 (Gauss-Markov Theorem). For each A ∈ 𝒜, μ ∈ M, and σ² > 0,

ℰ‖AY - μ‖² ≥ ℰ‖PY - μ‖²

where P is the orthogonal projection onto M. There is equality in this inequality iff A = P.
Proof. Write A = P + C so C = A - P. Since Aμ = μ for μ ∈ M, Cμ = 0 for μ ∈ M and this implies that CP = 0. Therefore, C(Y - μ) and P(Y - μ) are uncorrelated random vectors, so ℰ(C(Y - μ), P(Y - μ)) = 0 (see Proposition 2.21). Now,

ℰ‖AY - μ‖² = ℰ‖A(Y - μ)‖² = ℰ‖P(Y - μ) + C(Y - μ)‖²
           = ℰ‖P(Y - μ)‖² + ℰ‖C(Y - μ)‖²
           ≥ ℰ‖P(Y - μ)‖² = ℰ‖PY - μ‖².

The third equality results from the fact that the cross product term is zero. This establishes the desired inequality. It is clear that there is equality in this inequality iff ℰ‖C(Y - μ)‖² = 0. However, C(Y - μ) has mean zero and covariance σ²CC', so

ℰ‖C(Y - μ)‖² = σ²(I, CC')

by Proposition 2.21. Since σ² > 0, there is equality iff (I, CC') = 0. But (I, CC') = (C, C) and this is zero iff C = A - P = 0. □
The estimator PY of μ ∈ M is called the Gauss-Markov estimator of the mean vector, and the notation μ̂ = PY is used here. A moment's reflection shows that the validity of Theorem 4.1 has nothing to do with the parameter σ², be it known or unknown, as long as σ² > 0. The estimator μ̂ = PY is also called the least-squares estimator of μ for the following reason. Given the observation vector Y, we ask for the vector in M that is closest, in the given norm, to Y; that is, we want to minimize, over x ∈ M, the expression ‖Y - x‖². But Y = PY + QY where Q = I - P so, for x ∈ M,

‖Y - x‖² = ‖PY - x + QY‖² = ‖PY - x‖² + ‖QY‖².

The second equality is a consequence of Qx = 0 and QP = 0. Thus

‖Y - x‖² ≥ ‖QY‖²

with equality iff x = PY. In other words, the point in M that is closest to Y is μ̂ = PY. When the vector space V is R^n with the usual inner product, ‖Y - x‖² is just a sum of squares and μ̂ = PY ∈ M minimizes this sum of squares, hence the term least-squares estimator.
* Example 4.1. Consider the regression model used to motivate Definition 4.1. Here, Y ∈ R^n has mean vector Zβ where β ∈ R^k and Z is a known n × k matrix with k ≤ n. Also, it is assumed that Cov(Y) = σ²I_n, σ² > 0. Therefore, we have a weakly spherical linear model for Y, and μ = Zβ is the mean vector of Y. The regression manifold M is just the range of Z. To compute the Gauss-Markov estimator of μ, the orthogonal projection onto M, relative to the usual inner product on R^n, must be found. To find this projection explicitly in terms of Z, it is now assumed that the rank of Z is k. The claim is that P = Z(Z'Z)^{-1}Z' is the orthogonal projection onto M. Clearly, P² = P and P is self-adjoint, so P is the orthogonal projection onto its range. However, Z' maps R^n onto R^k since the rank of Z' is k. Thus (Z'Z)^{-1}Z' maps R^n onto R^k. Therefore, the range of Z(Z'Z)^{-1}Z' is Z(R^k), which is just M, so P is the orthogonal projection onto M. Hence μ̂ = Z(Z'Z)^{-1}Z'Y is the Gauss-Markov and least-squares estimator of μ. Since μ = Zβ, Z'μ = Z'Zβ and thus β = (Z'Z)^{-1}Z'μ. There is the obvious temptation to call

β̂ = (Z'Z)^{-1}Z'μ̂ = (Z'Z)^{-1}Z'Z(Z'Z)^{-1}Z'Y = (Z'Z)^{-1}Z'Y

the Gauss-Markov and least-squares estimator of the parameter β.
Certainly, calling β̂ the least-squares estimator of β is justified since

‖Y - Zγ‖² ≥ ‖Y - Zβ̂‖²

for all γ ∈ R^k, as Zβ̂ = μ̂ and Zγ ∈ M. Thus β̂ minimizes the sum of squares ‖Y - Zγ‖² as a function of γ. However, it is not clear why β̂ should be called the Gauss-Markov estimator of β. The discussion below rectifies this situation.
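As a concrete sketch of Example 4.1 (our illustration; the matrices below are random stand-ins, assumed to have full rank), one can verify that P = Z(Z'Z)^{-1}Z' is an orthogonal projection and that β̂ agrees with a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
Z = rng.normal(size=(n, k))      # assumed to have rank k
Y = rng.normal(size=n)           # an observation vector

P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T   # projection onto M = range(Z)
assert np.allclose(P @ P, P)           # P^2 = P
assert np.allclose(P, P.T)             # P is self-adjoint

mu_hat = P @ Y                                  # Gauss-Markov / least-squares
beta_hat = np.linalg.inv(Z.T @ Z) @ Z.T @ Y
assert np.allclose(Z @ beta_hat, mu_hat)        # Z beta-hat = mu-hat

# beta-hat agrees with a standard least-squares solver
assert np.allclose(beta_hat, np.linalg.lstsq(Z, Y, rcond=None)[0])
```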
Again, consider the linear model in (V, (·, ·)), Y = μ + ε, where μ ∈ M, ℰε = 0, and Cov(ε) = σ²I. As usual, M is a linear subspace of V and ε is a random vector in V. Let (W, [·, ·]) be an inner product space. Motivated by the considerations in Example 4.1, consider the problem of estimating Bμ, B ∈ ℒ(V, W), by a linear unbiased estimator AY where A ∈ ℒ(V, W). That AY is an unbiased estimator of Bμ for each μ ∈ M is clearly equivalent to Aμ = Bμ for μ ∈ M since ℰAY = Aμ. Let

𝒜₁ = {A | A ∈ ℒ(V, W), Aμ = Bμ for μ ∈ M},

so AY is an unbiased estimator of Bμ, μ ∈ M, iff A ∈ 𝒜₁. The following result, which is a generalization of Theorem 4.1, shows that Bμ̂ is the Gauss-Markov estimator for Bμ in the sense that, for all A ∈ 𝒜₁,

ℰ‖AY - Bμ‖₁² ≥ ℰ‖BPY - Bμ‖₁².

Here ‖·‖₁ is the norm on the space (W, [·, ·]).
Proposition 4.1. For each A ∈ 𝒜₁,

ℰ‖AY - Bμ‖₁² ≥ ℰ‖BPY - Bμ‖₁²

where P is the orthogonal projection onto M. There is equality in this inequality iff A = BP.
Proof. The proof is very similar to the proof of Theorem 4.1. Define C ∈ ℒ(V, W) by C = A - BP and note that Cμ = Aμ - BPμ = Bμ - Bμ = 0 since A ∈ 𝒜₁ and Pμ = μ for μ ∈ M. Thus CP = 0, and this implies that BP(Y - μ) and C(Y - μ) are uncorrelated random vectors. Since these random vectors have zero means,

ℰ[BP(Y - μ), C(Y - μ)] = 0.
For A ∈ 𝒜₁,

ℰ‖AY - Bμ‖₁² = ℰ‖BP(Y - μ) + C(Y - μ)‖₁²
             = ℰ‖BP(Y - μ)‖₁² + ℰ‖C(Y - μ)‖₁²
             ≥ ℰ‖BP(Y - μ)‖₁² = ℰ‖BPY - Bμ‖₁².

This establishes the desired inequality. There is equality in this inequality iff ℰ‖C(Y - μ)‖₁² = 0. The argument used in Theorem 4.1 applies here, so there is equality iff C = A - BP = 0. □
Proposition 4.1 makes precise the statement that the Gauss-Markov estimator of a linear transformation of μ is just that linear transformation applied to the Gauss-Markov estimator of μ. In other words, the Gauss-Markov estimator of Bμ is Bμ̂ where B ∈ ℒ(V, W). There is one particular case of this that is especially interesting. When W = R, the real line, a linear transformation from V to W is just a linear functional on V. By Proposition 1.10, every linear functional on V has the form (x₀, x) for some x₀ ∈ V. Thus the Gauss-Markov estimator of (x₀, μ) is just (x₀, μ̂) = (x₀, PY) = (Px₀, Y). Further, a linear estimator of (x₀, μ), say (z, Y), is an unbiased estimator of (x₀, μ) iff (z, μ) = (x₀, μ) for all μ ∈ M. For any such vector z, Proposition 4.1 shows that

var(z, Y) ≥ var(Px₀, Y).

Thus the minimum of var(z, Y), over the class of all z's such that (z, Y) is an unbiased estimator of (x₀, μ), is achieved uniquely for z = Px₀. In particular, if x₀ ∈ M, then z = x₀ achieves the minimum variance.
In the definition of a linear model, Y = μ + ε, no distributional assumptions concerning ε were made other than the first and second moment assumptions ℰε = 0 and Cov(ε) = σ²I. One of the attractive features of Proposition 4.1 is its validity under these relatively weak assumptions. However, very little can be said concerning the distribution of μ̂ = PY other than ℰμ̂ = μ and Cov(μ̂) = σ²P. In the following example, some of the implications of assuming that ε has a normal distribution are discussed.
* Example 4.2. Consider the situation treated in Example 4.1. A coordinate random vector Y ∈ R^n has mean vector μ = Zβ where Z is a known n × k matrix of rank k (k ≤ n) and β ∈ R^k is a vector of unknown parameters. It is also assumed that Cov(Y) = σ²I_n. The Gauss-Markov estimator of μ is μ̂ = Z(Z'Z)^{-1}Z'Y. Since
β = (Z'Z)^{-1}Z'μ, Proposition 4.1 shows that the Gauss-Markov estimator of β is β̂ = (Z'Z)^{-1}Z'μ̂ = (Z'Z)^{-1}Z'Y. Now, add the assumption that Y has a normal distribution; that is, ℒ(Y) = N(μ, σ²I_n) where μ ∈ M and M is the range of Z. For this particular parametric model, we want to find a minimal sufficient statistic and the maximum likelihood estimators of the unknown parameters. The density function of Y, with respect to Lebesgue measure, is

p(y|μ, σ²) = (2πσ²)^{-n/2} exp[-(1/2σ²)‖y - μ‖²]

where y ∈ R^n, μ ∈ M, and σ² > 0. Let P denote the orthogonal projection onto M, so Q = I - P is the orthogonal projection onto M⊥. Since ‖y - μ‖² = ‖Py - μ‖² + ‖Qy‖², the density of Y can be written

p(y|μ, σ²) = (2πσ²)^{-n/2} exp[-(1/2σ²)‖Py - μ‖² - (1/2σ²)‖Qy‖²].

This shows that the pair {Py, ‖Qy‖²} is a sufficient statistic, as the density is a function of the pair {Py, ‖Qy‖²}. The normality assumption implies that PY and QY are independent random vectors as they are uncorrelated (see Proposition 3.4). Thus PY and ‖QY‖² are independent. That the pair {Py, ‖Qy‖²} is minimal sufficient and complete follows from results about exponential families (see Lehmann, 1959, Chapter 2). To find the maximum likelihood estimators of μ ∈ M and σ², the density p(y|μ, σ²) must be maximized over all values of μ ∈ M and σ². For each fixed σ² > 0,

p(y|μ, σ²) = (2πσ²)^{-n/2} exp[-(1/2σ²)‖Py - μ‖²] exp[-(1/2σ²)‖Qy‖²]
           ≤ (2πσ²)^{-n/2} exp[-(1/2σ²)‖Qy‖²]

with equality iff μ = Py. Therefore, the Gauss-Markov estimator μ̂ = PY is the maximum likelihood estimator for μ. Of course, this also shows that β̂ = (Z'Z)^{-1}Z'Y is the maximum likelihood estimator of β. To find the maximum likelihood estimator of σ², it remains to maximize

p(y|Py, σ²) = (2πσ²)^{-n/2} exp[-(1/2σ²)‖Qy‖²].
An easy differentiation argument shows that p(y|Py, σ²) is maximized at σ² equal to ‖Qy‖²/n. Thus σ̂² = ‖QY‖²/n is the maximum likelihood estimator of σ². From our previous observation, μ̂ = PY and σ̂² are independent. Since ℒ(Y) = N(μ, σ²I),

ℒ(μ̂) = ℒ(PY) = N(μ, σ²P)

and

ℒ(β̂) = ℒ((Z'Z)^{-1}Z'Y) = N(β, σ²(Z'Z)^{-1}).

Also,

ℒ(QY) = N(0, σ²Q)

since Qμ = 0 and Q² = Q = Q'. Hence, from Proposition 3.7,

ℒ(‖QY‖²/σ²) = χ²_{n-k}

since Q is a rank n - k orthogonal projection. Therefore,

ℰσ̂² = ((n - k)/n)σ².

It is common practice to replace the estimator σ̂² by the unbiased estimator

σ̃² = ‖QY‖²/(n - k).

It is clear that σ̃² is distributed as the constant σ²/(n - k) times a χ²_{n-k} random variable.
The final result of this section shows that the unbiased estimator of σ², derived in the example above, is in fact unbiased without the normality assumption. Let Y = μ + ε be a random vector in V where μ ∈ M ⊆ V, ℰε = 0, and Cov(ε) = σ²I. Given this linear model for Y, let P be the orthogonal projection onto M and set Q = I - P.
Proposition 4.2. Let n = dim V, k = dim M, and assume that k < n. Then the estimator

σ̃² = ‖QY‖²/(n - k)

is an unbiased estimator of σ².
Proof. The random vector QY has mean zero and Cov(QY) = σ²Q. By Proposition 2.21,

ℰ‖QY‖² = (I, σ²Q) = σ²(I, Q) = σ²(n - k).

The last equality follows from the observation that, for any self-adjoint operator S, (I, S) is just the sum of the eigenvalues of S. Specializing this to the projection Q yields (I, Q) = n - k. □
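Proposition 4.2 can be illustrated by simulation. The sketch below is ours; it uses uniform errors precisely because no normality is assumed, and it checks both the trace identity (I, Q) = n - k and, by Monte Carlo, that ‖QY‖²/(n - k) averages to σ²:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 3
Z = rng.normal(size=(n, k))                  # assumed rank k; M = range(Z)
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
Q = np.eye(n) - P

# (I, Q) = tr Q = n - k for the rank n - k projection Q
assert np.isclose(np.trace(Q), n - k)

# Monte Carlo: uniform errors on (-1, 1) have mean 0 and variance 1/3,
# so Cov(eps) = sigma^2 I with sigma^2 = 1/3 -- no normality anywhere.
sigma2 = 1 / 3
reps = 20000
beta = np.array([1.0, -2.0, 0.5])            # arbitrary "true" coefficients
eps = rng.uniform(-1.0, 1.0, size=(reps, n))
Y = eps + Z @ beta                           # reps simulated observation vectors
s2 = np.einsum('ij,jk,ik->i', Y, Q, Y) / (n - k)   # ||QY||^2 / (n - k)
assert abs(s2.mean() - sigma2) < 0.02        # unbiasedness, up to MC error
```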
4.2. MORE ABOUT THE GAUSS-MARKOV THEOREM
The purpose of this section is to investigate to what extent Theorem 4.1 depends on the weak sphericity assumption. In this regard, Proposition 4.1 provides some information. If we take W = V and B = I, then Proposition 4.1 implies that

ℰ‖AY - μ‖₁² ≥ ℰ‖PY - μ‖₁²

where ‖·‖₁ is the norm obtained from an inner product [·, ·]. Thus the orthogonal projection P minimizes ℰ‖AY - μ‖₁² over A ∈ 𝒜 no matter what inner product is used to measure deviations of AY from μ. The key to the proof of Theorem 4.1 is the relationship

ℰ[P(Y - μ), (A - P)(Y - μ)] = 0.

This follows from the fact that the random vectors P(Y - μ) and (A - P)(Y - μ) are uncorrelated and

ℰP(Y - μ) = ℰ(A - P)(Y - μ) = 0 for A ∈ 𝒜.

This observation is central to the presentation below. The following alternative development of linear estimation theory provides the generality needed to apply the theory to multivariate linear models.
Consider a random vector Y with values in an inner product space (V, (·, ·)) and assume that the mean vector of Y, say μ = ℰY, lies in a known regression manifold M ⊆ V. For the moment, we suppose that Cov(Y) = Σ where Σ is fixed and known (Σ is not necessarily nonsingular). As in the previous section, a linear estimator of μ, say AY, is unbiased iff

A ∈ 𝒜 = {A | Aμ = μ, μ ∈ M}.

Given any inner product [·, ·] on V, the problem is to choose A ∈ 𝒜 to
minimize

Ψ(A) = ℰ‖AY - μ‖₁² = ℰ[AY - μ, AY - μ]

where the expectation is computed under the assumption that ℰY = μ and Cov(Y) = Σ. Because of Proposition 4.1, it is reasonable to expect that the minimum of Ψ(A) occurs at a point P₀ ∈ 𝒜 where P₀ is a projection onto M along some subspace N such that M ∩ N = {0} and M + N = V. Of course, N is the null space of P₀ and the pair M, N determines P₀. To find the appropriate subspace N, write Ψ(A) as

Ψ(A) = ℰ‖AY - μ‖₁²
     = ℰ‖P₀(Y - μ) + (A - P₀)(Y - μ)‖₁²
     = ℰ‖P₀(Y - μ)‖₁² + ℰ‖(A - P₀)(Y - μ)‖₁² + 2ℰ[P₀(Y - μ), (A - P₀)(Y - μ)].

When the third term in the final expression for Ψ(A) is zero, then P₀ minimizes Ψ(A). If P₀(Y - μ) and (A - P₀)(Y - μ) are uncorrelated, the third term will be zero (shown below), so the proper choice of P₀, and hence N, will be to make P₀(Y - μ) and (A - P₀)(Y - μ) uncorrelated. Setting C = A - P₀, it follows that 𝒩(C) ⊇ M. The absence of correlation between P₀(Y - μ) and C(Y - μ) is equivalent to the condition

P₀ΣC' = 0.

Here, C' is the adjoint of C relative to the initial inner product (·, ·) on V. Since 𝒩(C) ⊇ M, we have

ℛ(C') = (𝒩(C))⊥ ⊆ M⊥

and

ℛ(ΣC') ⊆ Σ(M⊥).

The symbol ⊥ refers to the inner product (·, ·). Therefore, if the null space of P₀, namely N, is chosen so that N ⊇ Σ(M⊥), then P₀ΣC' = 0 and P₀ minimizes Ψ(A). Now it remains to clean up the technical details of the above argument. Obviously, the subspace Σ(M⊥) is going to play a role in what follows. First, a couple of preliminary results.
Proposition 4.3. Suppose Σ = Cov(Y) in (V, (·, ·)) and M is a linear subspace of V. Then:

(i) Σ(M⊥) ∩ M = {0}.
(ii) The subspace Σ(M⊥) does not depend on the inner product on V.

Proof. To prove (i), recall that the null space of Σ is

{x | (x, Σx) = 0}

since Σ is positive semidefinite. If u ∈ Σ(M⊥) ∩ M, then u = Σu₁ for some u₁ ∈ M⊥. Since Σu₁ ∈ M, (u₁, Σu₁) = 0 so u = Σu₁ = 0. Thus (i) holds.

For (ii), let [·, ·] be any other inner product on V. Then

[x, y] = (x, A₀y)

for some positive definite linear transformation A₀. The covariance transformation of Y with respect to the inner product [·, ·] is ΣA₀ (see Proposition 2.5). Further, the orthogonal complement of M relative to the inner product [·, ·] is

{y | [x, y] = 0 for all x ∈ M} = {y | (x, A₀y) = 0 for all x ∈ M}
  = {A₀^{-1}u | (x, u) = 0 for all x ∈ M} = A₀^{-1}(M⊥).

Thus (ΣA₀)(A₀^{-1}(M⊥)) = Σ(M⊥). Therefore, the image of the orthogonal complement of M under the covariance transformation of Y is the same no matter what inner product is used on V. □
Proposition 4.4. Suppose X₁ and X₂ are random vectors with values in (V, (·, ·)). If X₁ and X₂ are uncorrelated and ℰX₂ = 0, then

ℰf[X₁, X₂] = 0

for every bilinear function f defined on V × V.

Proof. Since X₁ and X₂ are uncorrelated and X₂ has mean zero, for x₁, x₂ ∈ V we have

0 = cov{(x₁, X₁), (x₂, X₂)} = ℰ(x₁, X₁)(x₂, X₂) - ℰ(x₁, X₁)ℰ(x₂, X₂)
  = ℰ(x₁, X₁)(x₂, X₂).
However, every bilinear form f on (V, (·, ·)) is given by

f[u₁, u₂] = (u₁, Bu₂)

where B ∈ ℒ(V, V). Also, every B can be written as

B = Σ_{ij} b_{ij} y_i □ y_j

where y₁, ..., y_n is a basis for V. Therefore,

ℰf[X₁, X₂] = ℰ Σ_{ij} b_{ij}(X₁, (y_i □ y_j)X₂) = Σ_{ij} b_{ij} ℰ(y_i, X₁)(y_j, X₂) = 0. □
We are now in a position to generalize Theorem 4.1. To review the assumptions: Y is a random vector in (V, (·, ·)) with ℰY = μ ∈ M and Cov(Y) = Σ. Here, M is a known subspace of V and Σ is the covariance of Y relative to the given inner product (·, ·). Let [·, ·] be another inner product on V and set

Ψ(A) = ℰ‖AY - μ‖₁²

for A ∈ 𝒜, where ‖·‖₁ is the norm defined by [·, ·].
Theorem 4.2. Let N be any subspace of V that is complementary to M and contains the subspace Σ(M⊥). Here M⊥ is the orthogonal complement of M relative to (·, ·). Let P₀ be the projection onto M along N. Then

(4.1) Ψ(A) ≥ Ψ(P₀) for A ∈ 𝒜.

If Σ is nonsingular, define a new inner product (·, ·)₁ by

(x, y)₁ = (x, Σ^{-1}y).

Then P₀ is the unique element of 𝒜 that minimizes Ψ(A). Further, P₀ is the orthogonal projection, relative to the inner product (·, ·)₁, onto M.
Proof. The existence of a subspace N ⊇ Σ(M⊥) that is complementary to M is guaranteed by Proposition 4.3. Let C ∈ ℒ(V, V) be such that 𝒩(C) ⊇ M. Therefore,

ℛ(C') = (𝒩(C))⊥ ⊆ M⊥
so

ℛ(ΣC') ⊆ Σ(M⊥).

This implies that

P₀ΣC' = 0

since 𝒩(P₀) = N ⊇ Σ(M⊥). However, the condition P₀ΣC' = 0 is equivalent to the condition that P₀(Y - μ) and C(Y - μ) are uncorrelated.

With these preliminaries out of the way, consider A ∈ 𝒜 and let C = A - P₀ so 𝒩(C) ⊇ M. Thus

Ψ(A) = ℰ‖A(Y - μ)‖₁² = ℰ‖P₀(Y - μ) + C(Y - μ)‖₁²
     = ℰ‖P₀(Y - μ)‖₁² + ℰ‖C(Y - μ)‖₁² + 2ℰ[P₀(Y - μ), C(Y - μ)]
     = ℰ‖P₀(Y - μ)‖₁² + ℰ‖C(Y - μ)‖₁².

The last equality follows by applying Proposition 4.4 to P₀(Y - μ) and C(Y - μ). Therefore,

Ψ(A) = Ψ(P₀) + ℰ‖C(Y - μ)‖₁²

so P₀ minimizes Ψ over A ∈ 𝒜.

Now, assume that Σ is nonsingular. Then the subspace N is uniquely defined (N = Σ(M⊥)) since dim(Σ(M⊥)) = dim(M⊥) and M + Σ(M⊥) = V. Therefore, P₀ is uniquely defined as its range and null space have been specified. To show that P₀ uniquely minimizes Ψ, for A ∈ 𝒜 we have

Ψ(A) = Ψ(P₀) + ℰ‖C(Y - μ)‖₁²

where C = A - P₀. Thus Ψ(A) ≥ Ψ(P₀) with equality iff

ℰ‖C(Y - μ)‖₁² = 0.

This expectation is zero iff C(Y - μ) = 0 (a.e.), and this happens iff the covariance transformation of C(Y - μ) is zero in some (and hence every) inner product. But in the inner product (·, ·),

Cov(C(Y - μ)) = CΣC'

and this is zero iff C = 0 as Σ is nonsingular. Therefore, P₀ is the unique
minimizer of Ψ. For the last assertion, let N₁ be the orthogonal complement of M relative to the inner product (·,·)₁. Then,

N₁ = {y | (x, y)₁ = 0 for all x ∈ M} = {y | (x, Σ⁻¹y) = 0 for all x ∈ M}
   = {Σy | (x, y) = 0 for all x ∈ M} = Σ(M⊥).

Since 𝒩(P₀) = Σ(M⊥), it follows that P₀ is the orthogonal projection onto M relative to (·,·)₁. □
In all of the applications of Theorem 4.2 in this book, the covariance of Y is nonsingular. Thus the projection P₀ is unique and μ̂ = P₀Y is called the Gauss-Markov estimator of μ ∈ M. In the context of Theorem 4.2, if Cov(Y) = σ²Σ where Σ is known and nonsingular and σ² > 0 is unknown, then P₀Y is still the Gauss-Markov estimator for μ ∈ M since (σ²Σ)(M⊥) = Σ(M⊥) for each σ² > 0. That is, the presence of an unknown scale parameter σ² does not affect the projection P₀. Thus P₀ still minimizes Ψ for each fixed σ² > 0.
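The coordinate form of this projection is worth recording. When V = Rⁿ and M is the column space of an n × k matrix Z of rank k, the projection onto M along Σ(M⊥) is P₀ = Z(Z'Σ⁻¹Z)⁻¹Z'Σ⁻¹ (the generalized least-squares form; see Problem 9). The following NumPy sketch, with a randomly generated Z and Σ standing in for a concrete model, checks the defining properties and the scale invariance noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
Z = rng.standard_normal((n, k))          # M = column space of Z
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)          # known positive definite Sigma

def gm_projection(Z, Sigma):
    """Projection onto M = range(Z) along Sigma(M-perp):
    the coordinate form P0 = Z (Z' Sigma^{-1} Z)^{-1} Z' Sigma^{-1}."""
    Si = np.linalg.inv(Sigma)
    return Z @ np.linalg.inv(Z.T @ Si @ Z) @ Z.T @ Si

P0 = gm_projection(Z, Sigma)

# P0 is a (generally non-orthogonal) projection: idempotent, identity on M.
assert np.allclose(P0 @ P0, P0)
assert np.allclose(P0 @ Z, Z)

# The null space of P0 contains Sigma(M-perp): for v with Z'v = 0,
# P0 (Sigma v) = 0.
v = np.linalg.qr(Z, mode="complete")[0][:, k:] @ rng.standard_normal(n - k)
assert np.allclose(Z.T @ v, 0)
assert np.allclose(P0 @ (Sigma @ v), 0)

# An unknown scale sigma^2 does not change P0.
assert np.allclose(gm_projection(Z, 3.7 * Sigma), P0)
```

The last assertion is the point made in the text: replacing Σ by σ²Σ leaves P₀ unchanged.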
Consider a random vector Y taking values in (V, (·,·)) with EY = μ ∈ M and

Cov(Y) = σ²Σ, σ² > 0.

Here, Σ is assumed known and positive definite while σ² > 0 is unknown. Theorem 4.2 implies that the Gauss-Markov estimator of μ is μ̂ = P₀Y where P₀ is the projection onto M along Σ(M⊥). Recall that the least-squares estimator of μ is PY where P is the orthogonal projection onto M in the given inner product, that is, P is the projection onto M along M⊥.
Proposition 4.5. The Gauss-Markov and least-squares estimators of μ are the same iff Σ(M) ⊆ M.

Proof. Since P₀ and P are both projections onto M, P₀Y = PY iff P₀ and P have the same null space; that is, the Gauss-Markov and least-squares estimators are the same iff

Σ(M⊥) = M⊥.

Since Σ is nonsingular and self-adjoint, this condition is equivalent to the condition Σ(M) ⊆ M. □
The above result shows that if Σ(M) ⊆ M, we are free to compute either P or P₀ to find μ̂. The implications of this observation become clearer in the next section.
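Proposition 4.5 is easy to illustrate numerically. In the NumPy sketch below (an arbitrary design matrix, not an example from the text), covariances of the form Σ = I + cZZ' satisfy Σ(M) ⊆ M for M the column space of Z, so the Gauss-Markov and least-squares projections coincide, while a generic Σ breaks the agreement:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 7, 3
Z = rng.standard_normal((n, k))                  # M = range(Z)
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T             # orthogonal projection onto M

# A covariance with Sigma(M) contained in M: Sigma = I + 2 Z Z'.
Sigma = np.eye(n) + 2.0 * Z @ Z.T
Si = np.linalg.inv(Sigma)
P0 = Z @ np.linalg.inv(Z.T @ Si @ Z) @ Z.T @ Si  # Gauss-Markov projection

assert np.allclose(P0, P)                        # the two estimators agree

# A generic Sigma violates Sigma(M) ⊆ M and the projections differ.
B = rng.standard_normal((n, n))
Sigma2 = B @ B.T + np.eye(n)
Si2 = np.linalg.inv(Sigma2)
P1 = Z @ np.linalg.inv(Z.T @ Si2 @ Z) @ Z.T @ Si2
assert not np.allclose(P1, P)
```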
4.3. GENERALIZED LINEAR MODELS
First, consider the linear model introduced in Section 4.2. The random vector Y in (V, (·,·)) has a mean vector μ ∈ M where M is a subspace of V and Cov(Y) = σ²Σ. Here, Σ is a fixed positive definite linear transformation and σ² > 0. The essential features of this linear model are: (i) the mean vector of Y is assumed to be an element of a known subspace M, and (ii) the covariance of Y is an element of the set {σ²Σ | σ² > 0}. The assumption concerning the mean vector of Y is not especially restrictive since no special assumptions have been made about the subspace M. However, the covariance structure of Y is quite restricted. The set {σ²Σ | σ² > 0} is an open half line from 0 ∈ ℬ(V, V) through the point Σ ∈ ℬ(V, V), so the set of possible covariances for Y is a one-dimensional set. It is this assumption concerning the covariance of Y that we want to modify so that linear models become general enough to include certain models in multivariate analysis. In particular, we would like to discuss Example 3.2 within the framework of linear models.

Now, let M be a fixed subspace of (V, (·,·)) and let γ be an arbitrary set of positive definite linear transformations on V to V. We say that {M, γ} is the parameter set of a linear model for Y if EY = μ ∈ M and Cov(Y) ∈ γ. For a general parameter set {M, γ}, not much can be said about a linear model for Y. In order to restrict the class of parameter sets under consideration, we now turn to the question of the existence of Gauss-Markov estimators (to be defined below) for μ. As in Section 4.1, let

𝒜 = {A | A ∈ ℬ(V, V), Aμ = μ for μ ∈ M}.

Thus a linear transformation of Y is an unbiased estimator of μ ∈ M iff it has the form AY for A ∈ 𝒜. The following definition is motivated by Theorem 4.2.
Definition 4.2. Let {M, γ} be the parameter set of a linear model for Y. For A₀ ∈ 𝒜, A₀Y is a Gauss-Markov estimator of μ iff

E_Σ‖AY − μ‖² ≥ E_Σ‖A₀Y − μ‖²

for all A ∈ 𝒜 and Σ ∈ γ. The subscript Σ on the expectation means that the expectation is computed when Cov(Y) = Σ.
When γ = {σ²I | σ² > 0}, Theorem 4.1 establishes the existence and uniqueness of a Gauss-Markov estimator for μ. More generally, when γ = {σ²Σ | σ² > 0}, Theorem 4.2 shows that the Gauss-Markov estimator for μ is P₁Y where P₁ is the orthogonal projection onto M relative to the inner product (·,·)₁ given by

(x, y)₁ = (x, Σ⁻¹y), x, y ∈ V.

The problem of the existence of a Gauss-Markov estimator for general γ is taken up in the next paragraph.
Suppose that {M, γ} is the parameter set for a linear model for Y. Consider a fixed element Σ₁ ∈ γ, and let (·,·)₁ be the inner product on V defined by

(x, y)₁ = (x, Σ₁⁻¹y), x, y ∈ V.

As asserted in Theorem 4.2, the unique element in 𝒜 that minimizes E_Σ₁‖AY − μ‖² is P₁, the orthogonal projection onto M relative to (·,·)₁. Thus if a Gauss-Markov estimator A₀Y exists according to Definition 4.2, A₀ must be P₁. However, exactly the same argument applies for Σ₂ ∈ γ, so A₀ must be P₂, the orthogonal projection onto M relative to the inner product defined by Σ₂. These two projections are the same iff Σ₁(M⊥) = Σ₂(M⊥); see Theorem 4.2. Since Σ₁ and Σ₂ were arbitrary elements of γ, the conclusion is that a Gauss-Markov estimator can exist iff Σ₁(M⊥) = Σ₂(M⊥) for all Σ₁, Σ₂ ∈ γ. Summarizing this leads to the following.
Proposition 4.6. Suppose that {M, γ} is the parameter set of a linear model for Y in (V, (·,·)). Let Σ₁ be a fixed element of γ. A Gauss-Markov estimator of μ exists iff

Σ(M⊥) = Σ₁(M⊥) for all Σ ∈ γ.

When a Gauss-Markov estimator of μ exists, it is μ̂ = PY where P is the orthogonal projection onto M relative to any inner product [·,·] given by [x, y] = (x, Σ⁻¹y) for some Σ ∈ γ.

Proof. It has been argued that a Gauss-Markov estimator for μ can exist iff Σ₁(M⊥) = Σ₂(M⊥) for all Σ₁, Σ₂ ∈ γ. This is clearly equivalent to Σ(M⊥) = Σ₁(M⊥) for all Σ ∈ γ. The second assertion follows from the observation that when Σ(M⊥) = Σ₁(M⊥), all the projections onto M, relative to the inner products determined by elements of γ, are the same. That μ̂ = PY is a consequence of Theorem 4.2. □
An interesting special case of Proposition 4.6 occurs when I ∈ γ. In this case, choose Σ₁ = I, so a Gauss-Markov estimator exists iff Σ(M⊥) = M⊥ for all Σ ∈ γ. This is clearly equivalent to Σ(M) = M for all Σ ∈ γ, which is equivalent to the condition

Σ(M) ⊆ M for all Σ ∈ γ

since each Σ ∈ γ is nonsingular. It is this condition that is verified in the examples that follow.
* Example 4.3. As motivation for the discussion of the general multivariate linear model, we first consider the multivariate version of the k-sample situation. Suppose the Xᵢⱼ, j = 1,…,nᵢ and i = 1,…,k, are random vectors in Rᵖ. It is assumed that EXᵢⱼ = μᵢ, Cov(Xᵢⱼ) = Σ, and different random vectors are uncorrelated. Form the random matrix X whose first n₁ rows are X₁ⱼ', j = 1,…,n₁, whose next n₂ rows are X₂ⱼ', j = 1,…,n₂, and so on. Then X is a random vector in (ℒ_{p,n}, ⟨·,·⟩) where n = n₁ + ⋯ + nₖ. It was argued in the discussion following Proposition 2.18 that

Cov(X) = Iₙ ⊗ Σ

relative to the inner product ⟨·,·⟩ on ℒ_{p,n}. The mean of X, say μ = EX, is an n × p matrix whose first n₁ rows are all μ₁', whose next n₂ rows are all μ₂', and so on. Let B be the k × p matrix with rows μ₁',…,μₖ'. Thus the mean of X can be written μ = ZB where Z is an n × k matrix with the following structure: the first column of Z consists of n₁ ones followed by n − n₁ zeroes, the second column of Z consists of n₁ zeroes followed by n₂ ones followed by n − n₁ − n₂ zeroes, and so on. Define the linear subspace M of ℒ_{p,n} by

M = {μ | μ = ZB, B ∈ ℒ_{p,k}}

so M is the range of Z ⊗ I_p as a linear transformation from ℒ_{p,k} to ℒ_{p,n}. Further, set

γ = {Iₙ ⊗ Σ | Σ ∈ 𝒮_p, Σ positive definite}

and note that γ is a set of positive definite linear transformations on ℒ_{p,n} to ℒ_{p,n}. Therefore, EX ∈ M and Cov(X) ∈ γ, and {M, γ} is a parameter set for a linear model for X. Since Iₙ ⊗ I_p is the identity
linear transformation on ℒ_{p,n} and Iₙ ⊗ I_p ∈ γ, to show that a Gauss-Markov estimator for μ ∈ M exists, it is sufficient to verify that, if x ∈ M, then (Iₙ ⊗ Σ)x ∈ M. For x ∈ M, x = ZB for some B ∈ ℒ_{p,k}. Therefore,

(Iₙ ⊗ Σ)(ZB) = ZBΣ = (Z ⊗ I_p)(BΣ),

which is an element of M. Thus M is invariant under each element of γ so a Gauss-Markov estimator for μ exists. Since the identity is an element of γ, the Gauss-Markov estimator is just the orthogonal projection of X onto M relative to the given inner product ⟨·,·⟩. To find this projection, we argue as in Example 4.1. The regression subspace M is the range of Z ⊗ I_p and, clearly, Z has rank k. Let

P = (Z ⊗ I_p)[(Z ⊗ I_p)'(Z ⊗ I_p)]⁻¹(Z ⊗ I_p)'
  = (Z ⊗ I_p)[(Z'Z)⁻¹ ⊗ I_p](Z' ⊗ I_p)
  = Z(Z'Z)⁻¹Z' ⊗ I_p,

which is an orthogonal projection; see Proposition 1.28. To verify that P is the orthogonal projection onto M, it suffices to show that the range of P is M. For any x ∈ ℒ_{p,n},

Px = (Z(Z'Z)⁻¹Z' ⊗ I_p)x = (Z ⊗ I_p)[((Z'Z)⁻¹Z' ⊗ I_p)x],

which is an element of M since ((Z'Z)⁻¹Z' ⊗ I_p)x ∈ ℒ_{p,k}. However, if x ∈ M, then x = ZB and Px = P(ZB) = ZB; that is, P is the identity on M. Hence, the range of P is M and the Gauss-Markov estimator of μ is

μ̂ = PX = Z(Z'Z)⁻¹Z'X.

Since μ = ZB,

B = (Z'Z)⁻¹Z'μ = ((Z'Z)⁻¹Z' ⊗ I_p)μ

and, by Proposition 4.1,

B̂ = ((Z'Z)⁻¹Z' ⊗ I_p)X = (Z'Z)⁻¹Z'X

is the Gauss-Markov estimator of the matrix B. Further, E(B̂) = B
and

Cov(B̂) = Cov[((Z'Z)⁻¹Z' ⊗ I_p)X]
        = ((Z'Z)⁻¹Z' ⊗ I_p)(Iₙ ⊗ Σ)((Z'Z)⁻¹Z' ⊗ I_p)'
        = ((Z'Z)⁻¹Z'Z(Z'Z)⁻¹) ⊗ Σ = (Z'Z)⁻¹ ⊗ Σ.

For the particular matrix Z, Z'Z is a k × k diagonal matrix with diagonal entries n₁,…,nₖ so (Z'Z)⁻¹ is diagonal with diagonal elements n₁⁻¹,…,nₖ⁻¹. A bit of calculation shows that the matrix B̂ = (Z'Z)⁻¹Z'X has rows X̄₁',…,X̄ₖ' where

X̄ᵢ = nᵢ⁻¹ Σⱼ Xᵢⱼ

is the sample mean in the ith sample. Thus the Gauss-Markov estimator of the ith mean μᵢ is X̄ᵢ, i = 1,…,k.
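The closing computation of the example is easy to check numerically. The NumPy sketch below (with arbitrary group sizes n₁, n₂, n₃ chosen purely for illustration) builds the indicator matrix Z, verifies that Z'Z is diagonal with entries nᵢ, and confirms that the rows of B̂ = (Z'Z)⁻¹Z'X are the sample means:

```python
import numpy as np

rng = np.random.default_rng(2)
ns = [3, 2, 4]                       # group sizes n_1, n_2, n_3
k, p = len(ns), 2
n = sum(ns)

# Design matrix Z of zeros and ones indicating sample membership.
Z = np.zeros((n, k))
row = 0
for i, ni in enumerate(ns):
    Z[row:row + ni, i] = 1.0
    row += ni

X = rng.standard_normal((n, p))      # stacked observations, one row per X_ij

# Z'Z is diagonal with entries n_1, ..., n_k.
assert np.allclose(Z.T @ Z, np.diag(ns))

# Gauss-Markov estimator of B: its rows are the sample means.
B_hat = np.linalg.inv(Z.T @ Z) @ Z.T @ X
means = np.array([X[Z[:, i] == 1.0].mean(axis=0) for i in range(k)])
assert np.allclose(B_hat, means)
```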
It is fairly clear that the explicit form of the matrix Z in the previous
example did not play a role in proving that a Gauss-Markov estimator for
the mean vector exists. This observation leads quite naturally to what is
usually called the general linear model of multivariate analysis. After
introducing this model in the next example, we then discuss the implications of adding the assumption of normality.
* Example 4.4 (Multivariate General Linear Model). As in Example 4.3, consider a random matrix X in (ℒ_{p,n}, ⟨·,·⟩) and assume that (i) EX = ZB where Z is a known n × k matrix of rank k and B is a k × p matrix of parameters, and (ii) Cov(X) = Iₙ ⊗ Σ where Σ is a p × p positive definite matrix; that is, the rows of X are uncorrelated and each row of X has covariance matrix Σ. It is clear we have simply abstracted the essential features of the linear model in Example 4.3 into assumptions for the linear model of this example. The similarity between the current example and Example 4.1 should also be noted. Each component of the observation vector in Example 4.1 has become a vector, the parameter vector has become a matrix, and the rows of the observation matrix are still uncorrelated. Of course, the rows of the observation vector in Example 4.1 are just scalars. For the example at hand, it is clear that

M = {μ | μ = ZB, B ∈ ℒ_{p,k}}
is a subspace of ℒ_{p,n} and is the range of Z ⊗ I_p. Setting

γ = {Iₙ ⊗ Σ | Σ is a p × p positive definite matrix},

{M, γ} is the parameter set of a linear model for X. More specifically, the linear model for X is that EX = μ ∈ M and Cov(X) ∈ γ. Just as in Example 4.3, M is invariant under each element of γ so a Gauss-Markov estimator of μ = EX exists and is PX where

P = Z(Z'Z)⁻¹Z' ⊗ I_p

is the orthogonal projection onto M relative to ⟨·,·⟩. Mimicking the argument given in Example 4.3 yields

B̂ = (Z'Z)⁻¹Z'X = ((Z'Z)⁻¹Z' ⊗ I_p)X

and

Cov(B̂) = (Z'Z)⁻¹ ⊗ Σ.

In addition to the linear model assumptions for X, we now assume that ℒ(X) = N(ZB, Iₙ ⊗ Σ) so X has a normal distribution in (ℒ_{p,n}, ⟨·,·⟩). As in Example 4.2, a discussion of sufficient statistics and maximum likelihood estimators follows. The density function of X with respect to Lebesgue measure is

p(x | μ, Σ) = (2π)^{−np/2} |Σ|^{−n/2} exp[−½⟨x − μ, (Iₙ ⊗ Σ⁻¹)(x − μ)⟩],

as discussed in Chapter 3. Let P₀ = Z(Z'Z)⁻¹Z' and Q₀ = Iₙ − P₀, so P = P₀ ⊗ I_p is the orthogonal projection onto M and Q = Q₀ ⊗ I_p is the orthogonal projection onto M⊥. Note that both P and Q commute with Iₙ ⊗ Σ for any Σ. Since μ ∈ M, we have

⟨x − μ, (Iₙ ⊗ Σ⁻¹)(x − μ)⟩
  = ⟨P(x − μ) + Qx, (Iₙ ⊗ Σ⁻¹)(P(x − μ) + Qx)⟩
  = ⟨P(x − μ), (Iₙ ⊗ Σ⁻¹)P(x − μ)⟩ + ⟨Qx, (Iₙ ⊗ Σ⁻¹)Qx⟩

because ⟨Qx, (Iₙ ⊗ Σ⁻¹)P(x − μ)⟩ = ⟨x, Q(Iₙ ⊗ Σ⁻¹)P(x − μ)⟩
= 0 since Q(Iₙ ⊗ Σ⁻¹)P = QP(Iₙ ⊗ Σ⁻¹) = 0. However,

⟨Qx, (Iₙ ⊗ Σ⁻¹)Qx⟩ = ⟨x, Q(Iₙ ⊗ Σ⁻¹)Qx⟩ = ⟨x, Q(Iₙ ⊗ Σ⁻¹)x⟩
  = ⟨x, (Q₀ ⊗ Σ⁻¹)x⟩ = tr(xΣ⁻¹x'Q₀) = tr(x'Q₀xΣ⁻¹).

Thus

⟨x − μ, (Iₙ ⊗ Σ⁻¹)(x − μ)⟩ = ⟨Px − μ, (Iₙ ⊗ Σ⁻¹)(Px − μ)⟩ + tr(x'Q₀xΣ⁻¹).

Therefore, the density p(x | μ, Σ) is a function of the pair (Px, x'Q₀x) so the pair {Px, x'Q₀x} is sufficient. That this pair is minimal sufficient and complete for the parametric family {p(· | μ, Σ); μ ∈ M, Σ positive definite} follows from exponential family theory. Since P(Iₙ ⊗ Σ)Q = PQ(Iₙ ⊗ Σ) = 0, the random vectors PX and QX are independent. Also, X'Q₀X = (QX)'(QX) so the random vectors PX and X'Q₀X are independent. In other words, {PX, X'Q₀X} is a sufficient statistic and PX and X'Q₀X are independent. To derive the maximum likelihood estimator of μ ∈ M, fix Σ. Then

p(x | μ, Σ) = (2π)^{−np/2} |Σ|^{−n/2}
  × exp[−½⟨Px − μ, (Iₙ ⊗ Σ⁻¹)(Px − μ)⟩ − ½ tr(x'Q₀xΣ⁻¹)]
  ≤ (2π)^{−np/2} |Σ|^{−n/2} exp[−½ tr(x'Q₀xΣ⁻¹)]

with equality iff μ = Px. Thus the maximum likelihood estimator of μ is μ̂ = PX, which is also the Gauss-Markov and least-squares estimator of μ. It follows immediately that

B̂ = (Z'Z)⁻¹Z'X

is the maximum likelihood estimator of B, and

ℒ(B̂) = N(B, (Z'Z)⁻¹ ⊗ Σ).
To find the maximum likelihood estimator of Σ, the function

p(x | μ̂, Σ) = (2π)^{−np/2} |Σ|^{−n/2} exp[−½ tr(x'Q₀xΣ⁻¹)]

must be maximized over all p × p positive definite matrices Σ. When x'Q₀x is positive definite, this maximum occurs uniquely at

Σ̂ = (1/n) x'Q₀x,

so the maximum likelihood estimator of Σ is stochastically independent of μ̂. A proof that Σ̂ is the maximum likelihood estimator of Σ and a derivation of the distribution of Σ̂ are deferred until later.
The principal result of this chapter, Proposition 4.6, gives necessary and sufficient conditions on the parameter set {M, γ} of a linear model in order that the Gauss-Markov estimator of μ ∈ M exist. Many of the classical parametric models in multivariate analysis are in fact linear models with a parameter set {M, γ} such that there is a Gauss-Markov estimator for μ ∈ M. For such models, the additional assumption of normality implies that μ̂ is also the maximum likelihood estimator of μ, and the estimation of μ is relatively easy if we are satisfied with the maximum likelihood estimator. For the time being, let us agree that the problem of estimating μ has been solved in these models. However, very little has been said about the estimation of the covariance other than in Example 4.4. To be specific, assume ℒ(X) = N(μ, Σ) where μ ∈ M ⊆ (V, (·,·)) and {M, γ} is the parameter set of this linear model for X. Assume that I ∈ γ and μ̂ = PX is the Gauss-Markov estimator for μ, so Σ(M) = M for all Σ ∈ γ. Here, P is the orthogonal projection onto M in the given inner product on V. It follows immediately from Proposition 4.6 that μ̂ = PX is also the maximum likelihood estimator of μ ∈ M. Substituting μ̂ into the density of X yields

p(x | μ̂, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp[−½(Qx, Σ⁻¹Qx)]

where n = dim V and Q = I − P is the orthogonal projection onto M⊥. Thus to find the maximum likelihood estimator of Σ ∈ γ, we must compute

sup_{Σ ∈ γ} p(x | μ̂, Σ) = p(x | μ̂, Σ̂),

assuming that the supremum is attained at a point Σ̂ ∈ γ. Although many
examples of explicit sets γ are known where Σ̂ is not too difficult to find, general conditions on γ that yield an explicit Σ̂ are not available. This overview of the maximum likelihood estimation problem in linear models where Gauss-Markov estimators exist has been given to provide the reader with a general framework in which to view many of the estimation and testing problems to be discussed in later chapters.
PROBLEMS
1. Let Z be an n × k matrix (not necessarily of full rank) so Z defines a linear transformation from Rᵏ to Rⁿ. Let M be the range of Z and let z₁,…,zₖ be the columns of Z.
(i) Show that M = span{z₁,…,zₖ}.
(ii) Show that Z(Z'Z)⁻Z' is the orthogonal projection onto M, where (Z'Z)⁻ is the generalized inverse of Z'Z.
2. Suppose X₁,…,Xₙ are i.i.d. from a density p(x | β) = f(x − β) where f is a symmetric density on R¹ and ∫x²f(x) dx = 1. Here, β is an unknown translation parameter. Let X ∈ Rⁿ have coordinates X₁,…,Xₙ.
(i) Show that ℒ(X) = ℒ(βe + ε) where ε₁,…,εₙ are i.i.d. with density f. Show that EX = βe and Cov(X) = Iₙ.
(ii) Based on (i), find the Gauss-Markov estimator of β.
(iii) Let U be the vector of order statistics for X (U₁ ≤ U₂ ≤ ⋯ ≤ Uₙ) so ℒ(U) = ℒ(βe + v) where v is the vector of order statistics of ε. Show that E(U) = βe + a₀ where a₀ = Ev is a known vector (f is assumed known), and Cov(U) = Σ₀ = Cov(v) where Σ₀ is also known. Thus ℒ(U − a₀) = ℒ(βe + (v − a₀)) where E(v − a₀) = 0 and Cov(v − a₀) = Σ₀. Based on this linear model, find the Gauss-Markov estimator for β.
(iv) How do these two estimators of β compare?
3. Consider the linear model Y = μ + ε where μ ∈ M, Eε = 0, and Cov(ε) = σ²Iₙ. At times, a submodel of this model is of interest. In particular, assume μ ∈ ω where ω is a linear subspace of M.
(i) Let M − ω = {x | x ∈ M, x ⊥ ω}. Show that M − ω = M ∩ ω⊥.
(ii) Show that P_M − P_ω is the orthogonal projection onto M − ω and verify that ‖(P_M − P_ω)x‖² = ‖P_M x‖² − ‖P_ω x‖².
4. For this problem, we use the notation of Problem 1.15. Consider the subspaces of R^{IJ} given by

M₀ = {y | yᵢⱼ = yₖₗ for all i, j, k, l}
M₁ = {y | yᵢⱼ = yᵢₖ for all j, k; i = 1,…,I}
M₂ = {y | yᵢⱼ = yₖⱼ for all i, k; j = 1,…,J}.

(i) Show that ℛ(A) = M₀, ℛ(B₁) = M₁ − M₀, and ℛ(B₂) = M₂ − M₀.
Let M₃ be the range of B₃.
(ii) Show that R^{IJ} = M₀ ⊕ (M₁ − M₀) ⊕ (M₂ − M₀) ⊕ M₃.
(iii) Show that a vector μ is in M = M₀ ⊕ (M₁ − M₀) ⊕ (M₂ − M₀) iff μ can be written as μᵢⱼ = α + βᵢ + γⱼ, i = 1,…,I, j = 1,…,J, where α, βᵢ, and γⱼ are scalars that satisfy Σᵢβᵢ = Σⱼγⱼ = 0.
5. (The F-test.) Most of the classical hypothesis testing problems in regression analysis or ANOVA can be described as follows. A linear model Y = μ + ε, μ ∈ M, Eε = 0, and Cov(ε) = σ²I is given in (V, (·,·)). A subspace ω of M (ω ≠ M) is given and the problem is to test H₀: μ ∈ ω versus H₁: μ ∉ ω, μ ∈ M. Assume that ℒ(Y) = N(μ, σ²I) in (V, (·,·)).
(i) Show that the likelihood ratio test of H₀ versus H₁ rejects for large values of F = ‖P_{M−ω}Y‖²/‖Q_M Y‖² where Q_M = I − P_M.
(ii) Under H₀, show that F is distributed as the ratio of two independent chi-squared variables.
6. In the notation of Problem 4, consider Y ∈ R^{IJ} with EY = μ ∈ M (M is given in (iii) of Problem 4). Under the assumption of normality, use the results of Problem 5 to show that the F-test for testing H₀: β₁ = β₂ = ⋯ = β_I = 0 rejects for large values of

J Σᵢ (Ȳᵢ· − Ȳ··)² / Σᵢ Σⱼ (Yᵢⱼ − Ȳᵢ· − Ȳ·ⱼ + Ȳ··)².

Identify ω for this problem.
7. (The normal equations.) Suppose the elements of the regression subspace M ⊆ Rⁿ are given by μ = Xβ where X is n × k and β ∈ Rᵏ. Given an observation vector y, the problem is to find P_M y. The
equations (in β)

(4.2) X'y = X'Xβ, β ∈ Rᵏ

are often called the normal equations.
(i) Show that (4.2) always has a solution b ∈ Rᵏ.
(ii) If b is any solution to (4.2), show that Xb = P_M y.
8. For Y ∈ Rⁿ, assume μ = EY ∈ M and Cov(Y) ∈ γ where γ = {Σ | Σ = αP_e + βQ_e, α > 0, β > 0}. As usual, e is the vector of ones, P_e is the orthogonal projection onto span{e}, and Q_e = I − P_e.
(i) If e ∈ M or e ∈ M⊥, show that the Gauss-Markov and least-squares estimators for μ are the same for each α and β.
(ii) If e ∉ M and e ∉ M⊥, show that there are values of α and β so that the least-squares and Gauss-Markov estimators of μ differ.
(iii) If ℒ(Y) = N(μ, Σ) with Σ ∈ γ and M ⊆ (span{e})⊥ (M ≠ (span{e})⊥), find the maximum likelihood estimates for μ, α, and β. What happens when M = span{e}?
9. In the linear model Y = Xβ + ε on Rⁿ with X: n × k of full rank, Eε = 0, and Cov(ε) = σ²Σ (Σ is positive definite and known), show that μ̂ = X(X'Σ⁻¹X)⁻¹X'Σ⁻¹Y and β̂ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹Y.
10. (Invariance in the simple linear model.) In (V, (·,·)), suppose that {M, γ} is the parameter set for a linear model for Y where γ = {Σ | Σ = σ²I, σ² > 0}. Thus EY = μ ∈ M and Cov(Y) ∈ γ. This problem has to do with the invariance of this linear model under affine transformations:
(i) If Γ ∈ 𝒪(V) satisfies Γ(M) ⊆ M, show that Γ'(M) ⊆ M.
Let 𝒪_M(V) be those Γ ∈ 𝒪(V) that satisfy Γ(M) ⊆ M.
(ii) For x₀ ∈ M, c > 0, and Γ ∈ 𝒪_M(V), define the function (c, Γ, x₀) on V to V by (c, Γ, x₀)y = cΓy + x₀. Show that this function is one-to-one and onto and find the inverse of this function. Show that this function maps M onto M.
(iii) Let Ỹ = (c, Γ, x₀)Y. Show that EỸ ∈ M and Cov(Ỹ) ∈ γ. Thus {M, γ} is the parameter set for Ỹ and we say that the linear model for Y is invariant under the transformation (c, Γ, x₀). Since EY = μ, it follows that EỸ = (c, Γ, x₀)μ for μ ∈ M. If t(Y) (t maps V into M) is any point estimator for μ, then it seems plausible to use t(Ỹ) as a point estimator for (c, Γ, x₀)μ = cΓμ + x₀. Solving for μ, it then seems plausible to use c⁻¹Γ'(t(Ỹ) − x₀) as a point estimator for μ. Equating these estimators of μ leads to t(Y) = c⁻¹Γ'(t(cΓY +
x₀) − x₀) or

(4.3) t(cΓY + x₀) = cΓt(Y) + x₀.

An estimator that satisfies (4.3) for all c > 0, Γ ∈ 𝒪_M(V), and x₀ ∈ M is called equivariant.
(iv) Show that t₀(Y) = P_M Y is equivariant.
(v) Show that if t maps V into M and satisfies the equation t(ΓY + x₀) = Γt(Y) + x₀ for all Γ ∈ 𝒪_M(V) and x₀ ∈ M, then t(Y) = P_M Y.
11. Consider U ∈ Rⁿ and V ∈ Rⁿ and assume ℒ(U) = N(Z₁β₁, σ₁₁Iₙ) and ℒ(V) = N(Z₂β₂, σ₂₂Iₙ). Here, Zᵢ is n × k of rank k and βᵢ ∈ Rᵏ is an unknown vector of parameters, i = 1, 2. Also, σᵢᵢ > 0 is unknown, i = 1, 2. Now, let X = (U, V): n × 2 so μ = EX has first column Z₁β₁ and second column Z₂β₂.
(i) When U and V are independent, then Cov(X) = Iₙ ⊗ Δ where

Δ = ( σ₁₁   0
       0   σ₂₂ ).

In this case, show that the Gauss-Markov and least-squares estimates for μ are the same. Further, show that the Gauss-Markov estimates for β₁ and β₂ are the same as what we obtain by treating the two regression problems separately.
(ii) Now, suppose Cov(X) = Iₙ ⊗ Σ where

Σ = ( σ₁₁  σ₁₂
      σ₁₂  σ₂₂ )

is positive definite and unknown. For general Z₁ and Z₂, show that the regression subspace of X is not invariant under all Iₙ ⊗ Σ, so the Gauss-Markov and least-squares estimators are not the same in general. However, if Z₁ = Z₂, show that the results given in Example 4.4 apply directly.
(iii) If the column space of Z₁ equals the column space of Z₂, show that the Gauss-Markov and least-squares estimators of μ are the same for each Iₙ ⊗ Σ.
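Several of these problems lend themselves to numerical experiments, Problem 7 in particular. In the NumPy sketch below (with a deliberately rank-deficient X chosen for illustration), a least-squares solution b satisfies the normal equations (4.2), and Xb equals P_M y computed from the pseudoinverse, illustrating part (ii):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 4
X = rng.standard_normal((n, k))
X[:, 3] = X[:, 0] + X[:, 1]        # make X rank deficient (rank 3)
y = rng.standard_normal(n)

# A least-squares solution b solves the normal equations X'X b = X'y.
b = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(X.T @ X @ b, X.T @ y)

# Orthogonal projection onto M = range(X) via the pseudoinverse;
# any solution of the normal equations gives the same fitted vector.
PM = X @ np.linalg.pinv(X)
assert np.allclose(X @ b, PM @ y)
```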
NOTES AND REFERENCES
1. Scheffé (1959) contains a coordinate account of what might be called univariate linear model theory. The material in the first section here follows Kruskal (1961) most closely.
2. The result of Proposition 4.5 is due to Kruskal (1968).
3. Proposition 4.3 suggests that a theory of best linear unbiased estimation
can be developed in vector spaces without inner products (i.e., dual
spaces are not identified with the vector space via the inner product). For a version of such a theory, see Eaton (1978).
4. The arguments used in Section 4.3 were used in Eaton (1970) to help answer the following question. Given X ∈ ℒ_{p,n} with Cov(X) = Iₙ ⊗ Σ where Σ is unknown but positive definite, for what subspaces M does there exist a Gauss-Markov estimator for μ ∈ M? In other words, with γ as in Example 4.4, for what M's can the parameter set {M, γ} admit a Gauss-Markov estimator? The answer to this question is that M must have the form of the subspaces considered in Example 4.4. Further details and other examples can be found in Eaton (1970).
CHAPTER 5
Matrix Factorizations
and Jacobians
This chapter contains a collection of results concerning the factorization of matrices and the Jacobians of certain transformations on Euclidean spaces. The factorizations and Jacobians established here do have some intrinsic interest. Rather than interrupt the flow of later material to present these results, we have chosen to collect them together for easy reference. The reader is asked to mentally file the results and await their application in future chapters.
5.1. MATRIX FACTORIZATIONS
We begin by fixing some notation. As usual, Rⁿ denotes n-dimensional coordinate space and ℒ_{m,n} is the space of n × m real matrices. The linear space of n × n symmetric real matrices, a subspace of ℒ_{n,n}, is denoted by 𝒮ₙ. If S ∈ 𝒮ₙ, we write S > 0 to mean S is positive definite and S ≥ 0 to mean that S is positive semidefinite.

Recall that ℱ_{p,n} is the set of all n × p linear isometries of Rᵖ into Rⁿ, that is, Ψ ∈ ℱ_{p,n} iff Ψ'Ψ = I_p. Also, if T ∈ ℒ_{n,n}, then T = {tᵢⱼ} is lower triangular if tᵢⱼ = 0 for i < j. The set of all n × n lower triangular matrices with tᵢᵢ > 0, i = 1,…,n, is denoted by G_T. The dependence of G_T on the dimension n is usually clear from context. A matrix U ∈ ℒ_{n,n} is upper triangular if U' is lower triangular, and G_U denotes the set of all n × n upper triangular matrices with positive diagonal elements.
Our first result shows that G_T and G_U are closed under matrix multiplication and matrix inverse. In other words, G_T and G_U are groups of matrices with the group operation being matrix multiplication.
Proposition 5.1. If T = {tᵢⱼ} ∈ G_T, then T⁻¹ ∈ G_T and the ith diagonal element of T⁻¹ is 1/tᵢᵢ, i = 1,…,n. If T₁ and T₂ ∈ G_T, then T₁T₂ ∈ G_T.

Proof. To prove the first assertion, we proceed by induction on n. Assume the result is true for integers 1, 2,…, n − 1. When T is n × n, partition T as

T = ( T₁₁   0
      T₂₁  tₙₙ )

where T₁₁ is (n − 1) × (n − 1), T₂₁ is 1 × (n − 1), and tₙₙ is the (n, n) diagonal element of T. In order to be T⁻¹, the matrix

A = ( A₁₁   0
      A₂₁  aₙₙ )

must satisfy the equation TA = Iₙ. Thus

TA = ( T₁₁A₁₁                 0
       T₂₁A₁₁ + tₙₙA₂₁    tₙₙaₙₙ ) = Iₙ

so A₁₁ = T₁₁⁻¹, aₙₙ = 1/tₙₙ, and

A₂₁ = −(1/tₙₙ) T₂₁T₁₁⁻¹.

The induction hypothesis implies that T₁₁⁻¹ is lower triangular with diagonal elements 1/tᵢᵢ, i = 1,…, n − 1. Thus the first assertion holds. The second assertion follows easily from the definition of matrix multiplication. □
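Proposition 5.1 is easy to confirm numerically. The following NumPy sketch (with randomly generated elements of G_T, an illustrative choice rather than anything from the text) checks that the inverse of a lower triangular matrix with positive diagonal is again of that form, with reciprocal diagonal entries, and that G_T is closed under products:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4

def random_GT(n, rng):
    """A random element of G_T: lower triangular, positive diagonal."""
    T = np.tril(rng.standard_normal((n, n)))
    np.fill_diagonal(T, np.abs(T.diagonal()) + 1.0)
    return T

T = random_GT(n, rng)
Tinv = np.linalg.inv(T)

# T^{-1} is again lower triangular, with diagonal entries 1/t_ii.
assert np.allclose(np.triu(Tinv, 1), 0.0)
assert np.allclose(Tinv.diagonal(), 1.0 / T.diagonal())

# G_T is closed under multiplication.
T1, T2 = random_GT(n, rng), random_GT(n, rng)
prod = T1 @ T2
assert np.allclose(np.triu(prod, 1), 0.0)
assert np.all(prod.diagonal() > 0)
```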
Arguing in exactly the same way, G_U is closed under matrix inverse and matrix multiplication. The first factorization result in this chapter is next.
Proposition 5.2. Suppose A ∈ ℒ_{p,n} where p ≤ n and A has rank p. Then A = ΨU where Ψ ∈ ℱ_{p,n} and U ∈ G_U is p × p. Further, Ψ and U are unique.

Proof. The idea of the proof is to apply the Gram-Schmidt orthogonalization procedure to the columns of the matrix A. Let a₁,…,a_p be the
columns of A so aᵢ ∈ Rⁿ, i = 1,…,p. Since A is of rank p, the vectors a₁,…,a_p are linearly independent. Let {b₁,…,b_p} be the orthonormal set of vectors obtained by applying the Gram-Schmidt process to a₁,…,a_p in the order 1, 2,…, p. Thus the matrix Ψ with columns b₁,…,b_p is an element of ℱ_{p,n} as Ψ'Ψ = I_p. Since span{a₁,…,aᵢ} = span{b₁,…,bᵢ} for i = 1,…,p, bⱼ'aᵢ = 0 if j > i, and an examination of the Gram-Schmidt process shows that bᵢ'aᵢ > 0 for i = 1,…,p. Thus the matrix U = Ψ'A is an element of G_U, and

ΨU = ΨΨ'A.

But ΨΨ' is the orthogonal projection onto span{b₁,…,b_p} = span{a₁,…,a_p} so ΨΨ'A = A, as ΨΨ' is the identity transformation on its range. This establishes the first assertion. For the uniqueness of Ψ and U, assume that A = Ψ₁U₁ for Ψ₁ ∈ ℱ_{p,n} and U₁ ∈ G_U. Then Ψ₁U₁ = ΨU, which implies that Ψ'Ψ₁ = UU₁⁻¹. Since A is of rank p, U₁ must have rank p so ℛ(A) = ℛ(Ψ₁) = ℛ(Ψ). Therefore, ΨΨ'Ψ₁ = Ψ₁ since ΨΨ' is the orthogonal projection onto its range. Thus Ψ₁'ΨΨ'Ψ₁ = I_p; that is, Ψ'Ψ₁ is a p × p orthogonal matrix. Therefore, UU₁⁻¹ = Ψ'Ψ₁ is an orthogonal matrix and UU₁⁻¹ ∈ G_U. However, a bit of reflection shows that the only matrix that is both orthogonal and an element of G_U is I_p. Thus U = U₁ so Ψ = Ψ₁ as U has rank p. □
The main statistical application of Proposition 5.2 is the decomposition of the random matrix Y discussed in Example 2.3. This decomposition is used to give a derivation of the Wishart density function and, under certain assumptions on the distribution of Y = ΨU, it can be proved that Ψ and U are independent. The above decomposition also has some numerical applications. For example, the proof of Proposition 5.2 shows that if A = ΨU, then the orthogonal projection onto the range of A is ΨΨ' = A(A'A)⁻¹A'. Hence this projection can be computed without computing (A'A)⁻¹. Also, if p = n and A = ΨU, then A⁻¹ = U⁻¹Ψ'. Thus to compute A⁻¹, we need only compute U⁻¹, and this computation can be done iteratively, as the proof of Proposition 5.1 shows.
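In numerical practice the decomposition A = ΨU of Proposition 5.2 is a QR factorization. A short NumPy sketch (using np.linalg.qr together with a sign fix to force positive diagonal entries, since the library does not guarantee them):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 6, 3
A = rng.standard_normal((n, p))     # full column rank with probability 1

def psi_u(A):
    """A = Psi U with Psi'Psi = I_p and U upper triangular with positive
    diagonal. np.linalg.qr gives A = QR; flip column/row signs so that the
    diagonal of R becomes positive."""
    Q, R = np.linalg.qr(A)          # reduced QR factorization
    s = np.sign(R.diagonal())
    return Q * s, s[:, None] * R

Psi, U = psi_u(A)
assert np.allclose(Psi @ U, A)
assert np.allclose(Psi.T @ Psi, np.eye(p))     # Psi is a linear isometry
assert np.allclose(np.tril(U, -1), 0.0)        # U is upper triangular
assert np.all(U.diagonal() > 0)                # U is in G_U

# Psi Psi' = A (A'A)^{-1} A', the orthogonal projection onto range(A).
assert np.allclose(Psi @ Psi.T, A @ np.linalg.inv(A.T @ A) @ A.T)
```

The last assertion is the numerical remark made above: the projection onto the range of A is obtained from Ψ alone, without inverting A'A.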
Our next decomposition result establishes a one-to-one correspondence between positive definite matrices and elements of G_T. First, a property of positive definite matrices is needed.
Proposition 5.3. For S ∈ 𝒮_p and S > 0, partition S as

S = ( S₁₁  S₁₂
      S₂₁  S₂₂ )
where S₁₁ and S₂₂ are both square matrices. Then S₁₁, S₂₂, S₁₁ − S₁₂S₂₂⁻¹S₂₁, and S₂₂ − S₂₁S₁₁⁻¹S₁₂ are all positive definite.

Proof. For x ∈ Rᵖ, partition x into y and z to be conformable with the partition of S. Then, for x ≠ 0,

0 < x'Sx = y'S₁₁y + 2z'S₂₁y + z'S₂₂z.

For y ≠ 0 and z = 0, x ≠ 0 so y'S₁₁y > 0, which shows that S₁₁ > 0. Similarly, S₂₂ > 0. For y ≠ 0 and z = −S₂₂⁻¹S₂₁y,

0 < x'Sx = y'(S₁₁ − S₁₂S₂₂⁻¹S₂₁)y,

which shows that S₁₁ − S₁₂S₂₂⁻¹S₂₁ > 0. Similarly, S₂₂ − S₂₁S₁₁⁻¹S₁₂ > 0. □
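A numerical spot check of Proposition 5.3 (NumPy, with a random S > 0 and an arbitrary partition point standing in for a concrete matrix):

```python
import numpy as np

rng = np.random.default_rng(7)
p, q = 5, 2                       # partition p into blocks of size q, p - q
B = rng.standard_normal((p, p))
S = B @ B.T + np.eye(p)           # S > 0

S11, S12 = S[:q, :q], S[:q, q:]
S21, S22 = S[q:, :q], S[q:, q:]

def is_pd(M):
    """Positive definiteness via the smallest eigenvalue."""
    return np.min(np.linalg.eigvalsh(M)) > 0

# Both diagonal blocks and both Schur complements are positive definite.
assert is_pd(S11) and is_pd(S22)
assert is_pd(S11 - S12 @ np.linalg.inv(S22) @ S21)
assert is_pd(S22 - S21 @ np.linalg.inv(S11) @ S12)
```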
Proposition 5.4. If S > 0, then S = TT' for a unique element T ∈ G_T.
Proof. First, we establish the existence of T and then prove it is unique. The proof is by induction on dimension. If S ∈ S_p with S > 0, partition S as

S = ( S11  S12 )
    ( S21  S22 )

where S11 is (p - 1) x (p - 1) and S22 ∈ (0, ∞). By the induction hypothesis, S11 = T11 T11' for some T11 ∈ G_T. Consider the equation

( S11  S12 )   ( T11  0   ) ( T11  0   )'
( S21  S22 ) = ( T21  T22 ) ( T21  T22 )

which is to be solved for T21: 1 x (p - 1) and T22 ∈ (0, ∞). This leads to the two equations T21 T11' = S21 and T21 T21' + T22^2 = S22. Thus T21 = S21 (T11')^{-1}, so

S22 = T22^2 + S21 (T11')^{-1} T11^{-1} S12 = T22^2 + S21 (T11 T11')^{-1} S12 = T22^2 + S21 S11^{-1} S12.

Therefore, T22^2 = S22 - S21 S11^{-1} S12, which is positive by Proposition 5.3. Hence, T22 = (S22 - S21 S11^{-1} S12)^{1/2} is the solution for T22 > 0. This shows that S = TT' for some T ∈ G_T. For uniqueness, if S = TT' = T1 T1', then T1^{-1} TT' (T1^{-1})' = I_p so T1^{-1} T is an orthogonal matrix. But T1^{-1} T ∈ G_T and the only matrix that is both orthogonal and in G_T is I_p. Hence, T1^{-1} T = I_p and uniqueness follows. □
Let S_p^+ denote the set of p x p positive definite matrices. Proposition 5.4 shows that the function F: G_T → S_p^+ defined by F(T) = TT' is both one-to-one and onto. Of course, the existence of F^{-1}: S_p^+ → G_T is also part of the content of Proposition 5.4. For T1 ∈ G_T, the uniqueness part of Proposition 5.4 yields F^{-1}(T1 S T1') = T1 F^{-1}(S). This relationship is used later in this chapter. It is clear that the above result holds with G_T replaced by G_U. In other words, every S ∈ S_p^+ has a unique decomposition S = UU' for U ∈ G_U.
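The inductive construction in the proof of Proposition 5.4 is exactly the Cholesky algorithm, and can be sketched in a few lines of Python (an editorial illustration, not part of the text):

```python
import math

def cholesky_lower(S):
    """Return the unique lower triangular T with positive diagonal, S = TT'.

    Mirrors the proof of Proposition 5.4: each row of T is solved for in
    turn, with the diagonal entry taken as a positive square root.
    """
    p = len(S)
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(T[i][k] * T[j][k] for k in range(j))
            if i == j:
                # T_ii^2 = S_ii - sum of squares already accounted for
                T[i][j] = math.sqrt(S[i][i] - s)
            else:
                T[i][j] = (S[i][j] - s) / T[j][j]
    return T
```

For S = [[4, 2], [2, 3]] this produces T = [[2, 0], [1, sqrt(2)]], and TT' recovers S.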
Proposition 5.5. Suppose A ∈ L_{p,n} where p ≤ n and A has rank p. Then A = ΨS where Ψ ∈ F_{p,n} is a linear isometry and S is positive definite. Furthermore, Ψ and S are unique.
Proof. Since A has rank p, A'A has rank p and is positive definite. Let S be the positive definite square root of A'A, so A'A = SS. From Proposition 1.31, there exists a linear isometry Ψ ∈ F_{p,n} such that A = ΨS. To establish the uniqueness of Ψ and S, suppose that A = ΨS = Ψ1 S1 where Ψ, Ψ1 ∈ F_{p,n} and S and S1 are both positive definite. Then R(A) = R(Ψ) = R(Ψ1). As in the proof of Proposition 5.2, this implies that ΨΨ'Ψ1 = Ψ1 since ΨΨ' is the orthogonal projection onto R(Ψ) = R(Ψ1). Therefore, S S1^{-1} = Ψ'Ψ1 is a p x p orthogonal matrix, so the eigenvalues of S S1^{-1} are all on the unit circle in the complex plane. But the eigenvalues of S S1^{-1} are the same as the eigenvalues of S^{1/2} S1^{-1} S^{1/2} (see Proposition 1.39) where S^{1/2} is the positive definite square root of S. Since S^{1/2} S1^{-1} S^{1/2} is positive definite, its eigenvalues are all positive. Therefore, the eigenvalues of S^{1/2} S1^{-1} S^{1/2} must all be equal to one, as this is the only point of intersection of (0, ∞) with the unit circle in the complex plane. Since the only symmetric p x p matrix with all eigenvalues equal to one is the identity, S^{1/2} S1^{-1} S^{1/2} = I_p so S = S1. Since S is nonsingular, Ψ = Ψ1. □
The factorizations established thus far were concerned with writing one matrix as the product of two other matrices with special properties. The results below are concerned with factorizations for two or more matrices simultaneously. Statistical applications of these factorizations occur in later chapters.
Proposition 5.6. Suppose A is a p x p positive definite matrix and B is a p x p symmetric matrix. There exists a nonsingular p x p matrix C and a p x p diagonal matrix D such that A = CC' and B = CDC'. The diagonal elements of D are the eigenvalues of A^{-1}B.

Proof. Let A^{1/2} be the positive definite square root of A and A^{-1/2} = (A^{1/2})^{-1}. By the spectral theorem for matrices, there exists a p x p orthogonal matrix Γ such that Γ'A^{-1/2} B A^{-1/2} Γ = D is diagonal (see Proposition 1.45), and the eigenvalues of A^{-1/2} B A^{-1/2} are the diagonal elements of D. Let C = A^{1/2} Γ. Then CC' = A^{1/2} ΓΓ' A^{1/2} = A and CDC' = B. Since the eigenvalues of A^{-1/2} B A^{-1/2} are the same as the eigenvalues of A^{-1}B, the proof is complete. □
Proposition 5.7. Suppose S is a p x p positive definite matrix and partition S as

S = ( S11  S12 )
    ( S12' S22 )

where S11 is p1 x p1 and S22 is p2 x p2 with p1 ≤ p2. Then there exist nonsingular matrices Aii of dimension pi x pi, i = 1, 2, such that Aii Sii Aii' = I_{pi}, i = 1, 2, and A11 S12 A22' = (D 0) where D is a p1 x p1 diagonal matrix and 0 is a p1 x (p2 - p1) matrix of zeroes. The diagonal elements of D^2 are the eigenvalues of S11^{-1} S12 S22^{-1} S21 where S21 = S12', and these eigenvalues are all in the interval [0, 1].
Proof. Since S is positive definite, S11 and S22 are positive definite. Let S11^{1/2} and S22^{1/2} be the positive definite square roots of S11 and S22. Using Proposition 1.46, write the matrix S11^{-1/2} S12 S22^{-1/2} in the form

S11^{-1/2} S12 S22^{-1/2} = Γ D Ψ

where Γ is a p1 x p1 orthogonal matrix, D is a p1 x p1 diagonal matrix, and Ψ is a p1 x p2 linear isometry. The p1 rows of Ψ form an orthonormal set in R^{p2}, and p2 - p1 orthonormal vectors can be adjoined to Ψ to obtain a p2 x p2 orthogonal matrix Ψ1 whose first p1 rows are the rows of Ψ. It is clear that

D Ψ = (D 0) Ψ1

where 0 is a p1 x (p2 - p1) matrix of zeroes. Set A11 = Γ' S11^{-1/2} and A22 = Ψ1 S22^{-1/2} so Aii Sii Aii' = I_{pi} for i = 1, 2. Obviously, A11 S12 A22' = (D 0). Since S11^{-1/2} S12 S22^{-1/2} = Γ D Ψ,

S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} = Γ D^2 Γ'

so the eigenvalues of S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} are the diagonal elements of D^2. Since the eigenvalues of S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} are the same as the eigenvalues of S11^{-1} S12 S22^{-1} S21, it remains to show that these eigenvalues are in [0, 1]. By Proposition 5.3, S11 - S12 S22^{-1} S21 is positive definite so I_{p1} - S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} is positive definite. Thus for x ∈ R^{p1},

0 ≤ x' S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} x ≤ x'x,

which implies that (see Proposition 1.44) the eigenvalues of S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} are in the interval [0, 1]. □
It is shown later that the eigenvalues of S11^{-1} S12 S22^{-1} S21 are related to the angles between two subspaces of R^p. However, it is also shown that these eigenvalues have a direct statistical interpretation in terms of correlation coefficients, and this establishes the connection between canonical correlation coefficients and angles between subspaces. The final decomposition result in this section provides a useful tool for evaluating integrals over the space of p x p positive definite matrices.
Proposition 5.8. Let S_p^+ denote the space of p x p positive definite matrices. For S ∈ S_p^+, partition S as

S = ( S11  S12 )
    ( S21  S22 )

where Sii is pi x pi, i = 1, 2, S12 is p1 x p2, and S21 = S12'. The function f defined on S_p^+ to S_{p1}^+ x S_{p2}^+ x L_{p2,p1} by

f(S) = (S11 - S12 S22^{-1} S21, S22, S12 S22^{-1})

is a one-to-one onto function. The function h on S_{p1}^+ x S_{p2}^+ x L_{p2,p1} to S_p^+ given by

h(A11, A22, A12) = ( A11 + A12 A22 A12'   A12 A22 )
                   ( A22 A12'             A22     )

is the inverse of f.

Proof. It is routine to verify that f ∘ h is the identity function on S_{p1}^+ x S_{p2}^+ x L_{p2,p1} and h ∘ f is the identity function on S_p^+. This implies the assertions of the proposition. □
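In the scalar-block case p1 = p2 = 1, the maps f and h of Proposition 5.8 reduce to formulas on triples of real numbers: f sends (s11, s12, s22) to (s11 - s12^2/s22, s22, s12/s22), and h rebuilds the matrix entries. This makes the inverse relationship easy to verify directly (an editorial Python sketch; the starting entries are arbitrary):

```python
def f(s11, s12, s22):
    """f of Proposition 5.8 in the scalar case p1 = p2 = 1."""
    return (s11 - s12 * s12 / s22, s22, s12 / s22)

def h(a11, a22, a12):
    """Its inverse h: rebuilds the entries (s11, s12, s22)."""
    return (a11 + a12 * a22 * a12, a12 * a22, a22)
```

Starting from the positive definite matrix with entries s11 = 3, s12 = 1, s22 = 2, f gives (2.5, 2.0, 0.5) and h maps that triple straight back.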
5.2. JACOBIANS
Jacobians provide the basic technical tool for describing how multivariate integrals over open subsets of R^n transform under a change of variable. To describe the situation more precisely, let B0 and B1 be fixed open subsets of R^n and let g be a one-to-one onto mapping from B0 to B1. Recall that the differential of g, assuming it exists, is a function Dg defined on B0 that takes values in L_{n,n} and satisfies

lim_{δ→0} ||g(x + δ) - g(x) - Dg(x)δ|| / ||δ|| = 0

for each x ∈ B0. Here δ is a vector in R^n chosen small enough so that x + δ ∈ B0. Also, Dg(x)δ is the matrix Dg(x) applied to the vector δ, and ||·|| denotes the standard norm on R^n. Let g1,..., gn denote the coordinate functions of the vector-valued function g. It is well known that the matrix Dg(x) is given by

Dg(x) = ( ∂gi(x)/∂xj ), x ∈ B0.

In other words, the (i, j) element of the matrix Dg(x) is the partial derivative of gi with respect to xj evaluated at x ∈ B0. The Jacobian of g is defined by

Jg(x) = |det Dg(x)|, x ∈ B0
so the Jacobian is the absolute value of the determinant of Dg. A formal
statement of the change of variables theorem goes as follows. Consider any real valued Borel measurable function f defined on the open set B1 such that
|If (y)I dy < + oo
JB,
where dy means Lebesgue measure. Introduce the change of variables y = g(x), x E Bo in the integral JB, f(y) dy. Then the change of variables
theorem asserts that
(5.1) f (y) dy = f (g(x))Jg(x) dx. I1B
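In one dimension, (5.1) is the familiar substitution rule, and it can be checked numerically. The sketch below (an editorial illustration; the choices of g, f, and the intervals are arbitrary) takes g(x) = x^2 on B0 = (1, 2), so that B1 = (1, 4) and Jg(x) = |2x|, and compares both sides of (5.1) for f(y) = 1/y with a midpoint rule:

```python
import math

def midpoint(fn, a, b, n=100000):
    """Midpoint-rule approximation of the integral of fn over (a, b)."""
    h = (b - a) / n
    return sum(fn(a + (k + 0.5) * h) for k in range(n)) * h

# y = g(x) = x^2 maps B0 = (1, 2) one-to-one onto B1 = (1, 4); Jg(x) = |2x|.
lhs = midpoint(lambda y: 1.0 / y, 1.0, 4.0)                       # ∫_{B1} f(y) dy
rhs = midpoint(lambda x: (1.0 / (x * x)) * abs(2 * x), 1.0, 2.0)  # ∫_{B0} f(g(x)) Jg(x) dx
```

Both integrals equal log 4, and the two numerical values agree to high accuracy.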
An alternative way to express (5.1) is by the formal expression

(5.2)    d(g(x)) = Jg(x) dx, x ∈ B0.

To give a precise meaning to (5.2), proceed as follows. For each Borel measurable function h defined on B0 such that ∫_{B0} |h(x)| Jg(x) dx < +∞, define

I1(h) = ∫_{B0} h(x) Jg(x) dx,

and define

I2(h) = ∫_{B0} h(x) d(g(x)) = ∫_{g(B0)} h(g^{-1}(x)) dx.

Then (5.2) means that I1(h) = I2(h) for all h such that I1(|h|) < +∞. To show that (5.1) and the equality of I1 and I2 are equivalent, simply set f = h ∘ g^{-1} so f ∘ g = h. Then I1(h) = I2(h) iff

∫_{B0} f(g(x)) Jg(x) dx = ∫_{B1} f(x) dx
since B1 = g(B0).

One property of Jacobians that is often useful in simplifying computations is the following. Let B0, B1, and B2 be open subsets of R^n, suppose g1 is a one-to-one onto map from B0 to B1, and suppose Dg1 exists. Also, suppose g2 is a one-to-one onto map from B1 to B2 and assume that Dg2 exists. Then g2 ∘ g1 is a one-to-one onto map from B0 to B2 and it is not difficult to show that

D(g2 ∘ g1)(x) = Dg2(g1(x)) Dg1(x), x ∈ B0.

Of course, the right-hand side of this equality means the matrix product of Dg2(g1(x)) and Dg1(x). From this equality, it follows that

J(g2 ∘ g1)(x) = Jg2(g1(x)) Jg1(x), x ∈ B0.

In particular, if B2 = B0 and g2 = g1^{-1}, then g2 ∘ g1 = g1^{-1} ∘ g1 is the identity function on B0 so its Jacobian is one. Thus

1 = J(g2 ∘ g1)(x) = Jg2(g1(x)) Jg1(x), x ∈ B0
and

Jg1^{-1}(y) = 1 / Jg1(g1^{-1}(y)), y ∈ B1.
We now turn to the problem of explicitly computing some Jacobians that are needed later. The first few results present Jacobians for linear transformations.
Proposition 5.9. Let A be an n x n nonsingular matrix and define g on R^n to R^n by g(x) = Ax. Then Jg(x) = |det(A)| for x ∈ R^n.

Proof. We must compute the differential matrix of g. It is clear that the ith coordinate function of g is gi where

gi(x) = Σ_{k=1}^n a_ik x_k.

Here A = (a_ij) and x has coordinates x1,..., xn. Thus

∂gi(x)/∂xj = a_ij

so Dg(x) = (a_ij). Thus Jg(x) = |det(A)|. □
Proposition 5.10. Let A be an n x n nonsingular matrix and let B be a p x p nonsingular matrix. Define g on the np-dimensional coordinate space L_{p,n} to L_{p,n} by

g(X) = AXB' = (A ⊗ B)X.

Then Jg(X) = |det A|^p |det B|^n.

Proof. First note that A ⊗ B = (In ⊗ B)(A ⊗ Ip). Setting g1(X) = (A ⊗ Ip)X and g2(X) = (In ⊗ B)X, it is sufficient to verify that

Jg1(X) = |det A|^p

and

Jg2(X) = |det B|^n.
Let x1,..., xp be the columns of the n x p matrix X so xi ∈ R^n. Form the np-dimensional vector

[X] = ( x1 )
      ( x2 )
      ( ⋮  )
      ( xp )

Since (A ⊗ Ip)X has columns Ax1,..., Axp, the matrix of A ⊗ Ip as a linear transformation on [X] is the (np) x (np) block diagonal matrix

( A           )
(    A        )
(       ⋱     )
(          A  )

where the elements not indicated are zero. Clearly, the determinant of this matrix is (det A)^p since A occurs p times on the diagonal. Since the determinant of a linear transformation is independent of the matrix representation, we have that

det(A ⊗ Ip) = (det A)^p.

Applying Proposition 5.9, it follows that

Jg1(X) = |det A|^p.

Using the rows of X instead of the columns, we find that

det(In ⊗ B) = (det B)^n,

so

Jg2(X) = |det B|^n. □
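The determinant identity behind Proposition 5.10, det(A ⊗ B) = (det A)^p (det B)^n, can be verified numerically for small matrices. The Python sketch below (an editorial illustration; the 2 x 2 matrices are arbitrary, so here n = p = 2) forms the Kronecker product explicitly and compares determinants:

```python
def det(M):
    """Determinant by Laplace expansion (adequate for small matrices)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    return sum(((-1) ** j) * M[0][j] *
               det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def kron(A, B):
    """Kronecker product A ⊗ B for matrices given as lists of lists."""
    return [[A[i][j] * B[k][l]
             for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

A = [[1.0, 2.0], [3.0, 4.0]]   # det A = -2
B = [[2.0, 1.0], [0.0, 3.0]]   # det B = 6
lhs = det(kron(A, B))
rhs = (det(A) ** 2) * (det(B) ** 2)   # (det A)^p (det B)^n with n = p = 2
```

Both sides equal 144, matching (det A)^2 (det B)^2 = 4 · 36.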
Proposition 5.11. Let A be a p x p nonsingular matrix and define the function g on the linear space S_p of p x p real symmetric matrices by

g(S) = ASA' = (A ⊗ A)S.

Then Jg(S) = |det A|^{p+1}.
Proof. The result of the previous proposition shows that det(A ⊗ A) = (det A)^{2p} when A ⊗ A is regarded as a linear transformation on L_{p,p}. However, this result is not applicable to the current case since we are considering the restriction of A ⊗ A to the subspace S_p of L_{p,p}.

To establish the present result, write A = Γ1 D Γ2 where Γ1 and Γ2 are p x p orthogonal matrices and D is a diagonal matrix with positive diagonal elements (see Proposition 1.47). Then

ASA' = (A ⊗ A)S = (Γ1 ⊗ Γ1)(D ⊗ D)(Γ2 ⊗ Γ2)S

so the linear transformation A ⊗ A has been decomposed into the composition of three linear transformations, two of which are determined by orthogonal matrices.

We now claim that if Γ is a p x p orthogonal matrix and g1 is defined on S_p by

g1(S) = ΓSΓ' = (Γ ⊗ Γ)S,

then Jg1 = 1. To see this, let <·,·> be the natural inner product on L_{p,p} restricted to S_p, that is, let

<S1, S2> = tr S1 S2.

Then

<(Γ ⊗ Γ)S1, (Γ ⊗ Γ)S2> = tr ΓS1Γ'ΓS2Γ' = tr ΓS1S2Γ' = tr Γ'ΓS1S2 = tr S1S2 = <S1, S2>.

Therefore, Γ ⊗ Γ is an orthogonal transformation on the inner product space (S_p, <·,·>), so the determinant of this linear transformation on S_p is ±1. Thus g1 is a linear transformation that is also orthogonal so Jg1 = 1 and the claim is established.

The next claim is that if D is a p x p diagonal matrix with positive diagonal elements and g2 is defined on S_p by

g2(S) = DSD,

then Jg2 = (det D)^{p+1}. In the [p(p + 1)/2]-dimensional space S_p, let s_ij, 1 ≤ j ≤ i ≤ p, denote the coordinates of S. Then it is routine to show that the (i, j) coordinate function of g2 is g2,ij(S) = λi λj s_ij where λ1,..., λp are the diagonal elements of D. Thus the matrix of the linear transformation g2 is a [p(p + 1)/2] x [p(p + 1)/2] diagonal matrix with diagonal entries λi λj for 1 ≤ j ≤ i ≤ p. Hence the determinant of this matrix is the product of the λi λj for 1 ≤ j ≤ i ≤ p. A bit of calculation shows this determinant is (Π λi)^{p+1}, since each λk appears with total exponent k + (p - k + 1) = p + 1. Since det D = Π λi, the second claim is established.

To complete the proof, note that

g(S) = ASA' = (Γ1 ⊗ Γ1)(D ⊗ D)(Γ2 ⊗ Γ2)S = h1(h2(h3(S)))

where h1(S) = (Γ1 ⊗ Γ1)S, h2(S) = (D ⊗ D)S, and h3(S) = (Γ2 ⊗ Γ2)S. A direct argument shows that

J(h1 ∘ h2 ∘ h3)(S) = Jh1(h2(h3(S))) J(h2 ∘ h3)(S) = Jh1(h2(h3(S))) Jh2(h3(S)) Jh3(S).

But Jh1 = 1 = Jh3 and Jh2 = (det D)^{p+1}. Since A = Γ1 D Γ2, |det A| = det D, which entails Jg = |det A|^{p+1}. □
Proposition 5.12. Let M_p be the linear space of p x p skew-symmetric matrices and define g on M_p to M_p by

g(S) = ASA'

where A is a p x p nonsingular matrix. Then Jg(S) = |det A|^{p-1}.

Proof. The proof is similar to that of Proposition 5.11 and is left to the reader. □
Proposition 5.13. Let G_T be the set of p x p lower triangular matrices with positive diagonal elements and let A be a fixed element of G_T. The function g defined on G_T to G_T by

g(T) = AT, T ∈ G_T

has Jacobian given by Jg(T) = Π_{i=1}^p a_ii^i, where a_11,..., a_pp are the diagonal elements of A.

Proof. The set G_T is an open subset of [p(p + 1)/2]-dimensional coordinate space and g is a one-to-one onto function by Proposition 5.1. For T ∈ G_T, form the vector [T] with coordinates t_11, t_21, t_22, t_31,..., t_pp and write the coordinate functions of g in the same order. Then the matrix of partial derivatives is lower triangular with diagonal elements a_11, a_22, a_22, a_33,..., a_pp, where a_ii occurs i times on the diagonal. Thus the determinant of this matrix of partial derivatives is Π_{i=1}^p a_ii^i, so Jg(T) = Π_{i=1}^p a_ii^i. □
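Proposition 5.13 can be checked by forming the matrix of the linear map T → AT on the p(p + 1)/2 lower triangular coordinates and computing its determinant directly (an editorial Python sketch; the matrix A below is an arbitrary element of G_T):

```python
def det(M):
    """Determinant by Laplace expansion (adequate for small matrices)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    return sum(((-1) ** j) * M[0][j] *
               det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def jacobian_left_mult(A):
    """det of T -> A T as a linear map on p x p lower triangular matrices."""
    p = len(A)
    idx = [(i, j) for i in range(p) for j in range(i + 1)]  # t11, t21, t22, ...
    dim = len(idx)
    cols = []
    for (i, j) in idx:
        # apply the map to the basis element E with a 1 in position (i, j)
        E = [[0.0] * p for _ in range(p)]
        E[i][j] = 1.0
        AE = [[sum(A[r][t] * E[t][c] for t in range(p)) for c in range(p)]
              for r in range(p)]
        cols.append([AE[r][c] for (r, c) in idx])  # coordinates of the image
    M = [[cols[c][r] for c in range(dim)] for r in range(dim)]
    return det(M)

A = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [0.0, 1.0, 4.0]]
# Proposition 5.13 predicts 2^1 * 3^2 * 4^3 = 1152 for this A.
```

The computed determinant is 1152, agreeing with Π a_ii^i.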
Proposition 5.14. In the notation of Proposition 5.13, define g on G_T to G_T by

g(T) = TB, T ∈ G_T

where B is a fixed element of G_T. Then Jg(T) = Π_{i=1}^p b_ii^{p-i+1}, where b_11,..., b_pp are the diagonal elements of B.

Proof. The proof is similar to that of Proposition 5.13 and is omitted. □
Proposition 5.15. Let G_U be the set of all p x p upper triangular matrices with positive diagonal elements. For fixed elements A and B of G_U, define g by

g(U) = AUB, U ∈ G_U.

Then

Jg(U) = Π_{i=1}^p a_ii^{p-i+1} Π_{i=1}^p b_ii^i

where a_11,..., a_pp and b_11,..., b_pp are the diagonal elements of A and B.

Proof. The proof is similar to that given for lower triangular matrices and is left to the reader. □
Thus far, only Jacobians of linear transformations have been computed explicitly, and, of course, these Jacobians have been constant functions. In the next proposition, the Jacobian of the nonlinear transformation described in Proposition 5.8 is computed.
Proposition 5.16. Let p1 and p2 be positive integers and set p = p1 + p2. Using the notation of Proposition 5.8, define h on S_{p1}^+ x S_{p2}^+ x L_{p2,p1} to S_p^+ by

h(A11, A22, A12) = ( A11 + A12 A22 A12'   A12 A22 )
                   ( A22 A12'             A22     ).

Then Jh(A11, A22, A12) = (det A22)^{p1}.
Proof. For notational convenience, set S = h(A11, A22, A12) and partition S as

S = ( S11  S12 )
    ( S12' S22 )

where Sij is pi x pj, i, j = 1, 2. The partial derivatives of the elements of S, as functions of the elements of A11, A12, and A22, need to be computed. Since S11 = A11 + A12 A22 A12', the matrix of partial derivatives of the p1(p1 + 1)/2 elements of S11 with respect to the p1(p1 + 1)/2 elements of A11 is just the [p1(p1 + 1)/2]-dimensional identity matrix. Since S12 = A12 A22, the matrix of partial derivatives of the p1 p2 elements of S12 with respect to the elements of A11 is a zero matrix. Also, since S22 = A22, the partial derivatives of the elements of S22 with respect to the elements of A11 or A12 are all zero, and the matrix of partial derivatives of the p2(p2 + 1)/2 elements of S22 with respect to the p2(p2 + 1)/2 elements of A22 is the identity matrix. Thus the matrix of partial derivatives has the form

        A11  A12  A22
S11   (  I    *    *  )
S12   (  0    B    *  )
S22   (  0    0    I  )

so the determinant of this matrix is just the determinant of the p1p2 x p1p2 matrix B, which must be found. However, B is the matrix of partial derivatives of the elements of S12 with respect to the elements of A12 where S12 = A12 A22. Hence the determinant of B is just the Jacobian of the transformation g(A12) = A12 A22 with A22 fixed. This Jacobian is (det A22)^{p1} by Proposition 5.10. □
As an application of Proposition 5.16, a special integral over the space S_p^+ is now evaluated.

* Example 5.1. Let dS denote Lebesgue measure on the set S_p^+. The integral below arises in our discussion of the Wishart distribution. For a positive integer p and a real number r > p - 1, let

c(r, p) = ∫_{S_p^+} |S|^{(r-p-1)/2} exp[-(1/2) tr S] dS.

In this example, the constant c(r, p) is calculated. When p = 1, S_1^+ = (0, ∞) so for r > 0,

c(r, 1) = ∫_0^∞ s^{(r-2)/2} exp[-s/2] ds = 2^{r/2} Γ(r/2)

where Γ(r/2) is the gamma function evaluated at r/2. The first claim is that

c(r, p + 1) = (2π)^{p/2} c(r - 1, p) c(r, 1)

for r > p and p ≥ 1. To verify this claim, consider S ∈ S_{p+1}^+ and partition S as

S = ( S11  S12 )
    ( S12' S22 )

where S11 ∈ S_p^+, S22 ∈ (0, ∞), and S12 is p x 1. Introduce the change of variables

( S11  S12 )   ( A11 + A12 A22 A12'   A12 A22 )
( S12' S22 ) = ( A22 A12'             A22     )

where A11 ∈ S_p^+, A22 ∈ (0, ∞), and A12 ∈ R^p. By Proposition 5.16, the Jacobian of this transformation is A22^p. Since det S = det(S11 - S12 S22^{-1} S12') det S22 = (det A11) A22, we have

c(r, p + 1) = ∫_{S_{p+1}^+} |S|^{(r-p-2)/2} exp[-(1/2) tr S] dS
= ∫∫∫ (det A11)^{(r-p-2)/2} A22^{(r-p-2)/2} exp[-(1/2)(tr A11 + A22 A12'A12 + A22)] A22^p dA11 dA12 dA22.

Integrating with respect to A12 yields

∫_{R^p} exp[-(1/2) A22 A12'A12] dA12 = (2π)^{p/2} A22^{-p/2}.

Substituting this into the second integral expression for c(r, p + 1) and then integrating on A22 shows that

c(r, p + 1) = (2π)^{p/2} c(r, 1) ∫_{S_p^+} |A11|^{(r-p-2)/2} exp[-(1/2) tr A11] dA11
= (2π)^{p/2} c(r, 1) c(r - 1, p).
This establishes the first claim. Now it is an easy matter to solve for c(r, p). A bit of manipulation shows that with

c(r, p) = π^{p(p-1)/4} 2^{rp/2} Π_{i=1}^p Γ((r - i + 1)/2)

for p = 1, 2,... and r > p - 1, the equation

c(r, p + 1) = (2π)^{p/2} c(r, 1) c(r - 1, p)

is satisfied. Further,

c(r, 1) = 2^{r/2} Γ(r/2).

Uniqueness of the solution to the above equation is clear. In summary,

∫_{S_p^+} |S|^{(r-p-1)/2} exp[-(1/2) tr S] dS = π^{p(p-1)/4} 2^{rp/2} Π_{i=1}^p Γ((r - i + 1)/2)

and this is valid for p = 1, 2,... and r > p - 1. The restriction that r be greater than p - 1 is necessary so that Γ[(r - p + 1)/2] is well defined. It is not difficult to show that the above integral is +∞ if r ≤ p - 1. Now, set w(r, p) = 1/c(r, p) so

f(S) = w(r, p) |S|^{(r-p-1)/2} exp[-(1/2) tr S]

is a density function on S_p^+. When r is an integer with r ≥ p, f turns out to be the density of the Wishart distribution.
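The closed form for c(r, p) and the recursion it satisfies are easy to check numerically with the gamma function (an editorial Python sketch):

```python
import math

def c(r, p):
    """Normalizing constant of Example 5.1, valid for r > p - 1:
    c(r, p) = pi^{p(p-1)/4} 2^{rp/2} prod_{i=1}^p Gamma((r - i + 1)/2)."""
    val = math.pi ** (p * (p - 1) / 4.0) * 2.0 ** (r * p / 2.0)
    for i in range(1, p + 1):
        val *= math.gamma((r - i + 1) / 2.0)
    return val
```

For instance, c(2, 1) = 2^1 Γ(1) = 2, and the recursion c(r, p + 1) = (2π)^{p/2} c(r, 1) c(r - 1, p) holds to machine precision, e.g. with r = 5 and p = 2.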
Proposition 5.4 shows that there is a one-to-one correspondence between elements of S_p^+ and elements of G_T. More precisely, the function g defined on G_T by

g(T) = TT', T ∈ G_T

is one-to-one and onto S_p^+. It is clear that g has a differential since each coordinate function of g is a polynomial in the elements of T. One way to find the Jacobian of g is to simply compute the matrix of partial derivatives and then find its determinant. As motivation for some considerations in the next chapter, a different derivation of the Jacobian of g is given here. The first observation is as follows.
Proposition 5.17. Let dS denote Lebesgue measure on S_p^+ and consider the measure μ on S_p^+ given by μ(dS) = dS/|S|^{(p+1)/2}. For each Borel measurable function f on S_p^+ that is integrable with respect to μ, and for each nonsingular matrix A,

∫ f(S) μ(dS) = ∫ f(ASA') μ(dS).

Proof. Set B = ASA'. By Proposition 5.11, the Jacobian of this transformation on S_p^+ to S_p^+ is |det A|^{p+1}. Thus

∫ f(ASA') μ(dS) = ∫ f(ASA') dS/|S|^{(p+1)/2}
= ∫ f(ASA') |det A|^{p+1} dS/|ASA'|^{(p+1)/2}
= ∫ f(B) dB/|B|^{(p+1)/2} = ∫ f(S) μ(dS). □
The result of Proposition 5.17 is often paraphrased by saying that the measure μ is invariant under each of the transformations gA defined on S_p^+ by gA(S) = ASA'. The following calculation gives a heuristic proof of this result:

μ(d(gA(S))) = d(gA(S))/|ASA'|^{(p+1)/2} = JgA(S) dS/|ASA'|^{(p+1)/2}
= |det A|^{p+1} dS/(|AA'|^{(p+1)/2} |S|^{(p+1)/2}) = dS/|S|^{(p+1)/2} = μ(dS).
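In the 1 x 1 case, μ(ds) = ds/s on (0, ∞), and the invariance of Proposition 5.17 says that ∫ f(s) ds/s = ∫ f(a^2 s) ds/s for any a ≠ 0. A numerical check (an editorial illustration; the choices f(s) = s e^{-s} and a = 2 are arbitrary):

```python
import math

def midpoint(fn, a, b, n=200000):
    """Midpoint-rule approximation of the integral of fn over (a, b)."""
    h = (b - a) / n
    return sum(fn(a + (k + 0.5) * h) for k in range(n)) * h

f = lambda s: s * math.exp(-s)   # integrable against mu(ds) = ds / s
a = 2.0                          # the 1 x 1 "matrix" A; gA(s) = a * s * a = 4s

lhs = midpoint(lambda s: f(s) / s, 1e-8, 60.0)           # ∫ f(s) mu(ds)
rhs = midpoint(lambda s: f(a * s * a) / s, 1e-8, 60.0)   # ∫ f(gA(s)) mu(ds)
```

Both integrals evaluate to 1 (the first is ∫ e^{-s} ds, the second ∫ 4 e^{-4s} ds), illustrating the invariance.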
In fact, a similar calculation suggests that μ is the only invariant measure on S_p^+ (up to multiplication of μ by a positive constant). Consider a measure ν of the form ν(dS) = h(S) dS where h is a positive Borel measurable function and dS is Lebesgue measure. In order that ν be invariant, we must have

h(S) dS = ν(dS) = ν(d(gA(S))) = h(gA(S)) d(gA(S)) = h(gA(S)) |det A|^{p+1} dS

so h should satisfy the equation

h(S) = h(ASA') |AA'|^{(p+1)/2}
since gA(S) = ASA' and |det A|^{p+1} = |AA'|^{(p+1)/2}. Set S = I_p, B = AA', and c = h(I_p). Then

h(B) = c/|B|^{(p+1)/2}, B ∈ S_p^+

so

ν(dS) = c μ(dS)

where c is a positive constant. Making this argument rigorous is one of the topics treated in the next chapter. The calculation of the Jacobian of g on G_T to S_p^+ is next.
Proposition 5.18. For g(T) = TT', T ∈ G_T,

Jg(T) = 2^p Π_{i=1}^p t_ii^{p-i+1}

where t_11,..., t_pp are the diagonal elements of T.

Proof. The Jacobian Jg is the unique continuous function defined on G_T that satisfies the equation

∫_{S_p^+} f(S) dS/|S|^{(p+1)/2} = ∫_{G_T} f(g(T)) Jg(T) dT/|TT'|^{(p+1)/2}
for all Borel measurable functions f for which the integral over S + exists. But the left-hand side of this equation is invariant under the replacement of
f(S) byf(ASA') for any nonsingularp x p matrix. Thus the right-hand side
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
178 MATRIX FACTORIZATIONS AND JACOBIANS
must have the same property. In particular, for A E GT we have
f (TT+) I (T) dT f ( TT)) Jg(TT) dTT.
In this second integral, we make the change of variable T = A -'B for A e GT fixed and B e GT. By Proposition 5.12, the Jacobian of this transformation is l/flPai where al ,..., a are the diagonal elements of A.
Thus
f (TT') f(~d= f(BB') hg(A - B) dB
J t (Tr))/Jg( T) dT =dGB.B'(+'z All 'lHa T ~~~~Hai1
Since this must hold for all Borel measurable f and since Jg is a continuous
function, it follows that for all T e GT and A E GT
Jg(T) =Jg(A-'T) IAp
rl aiii
Setting A = T and noting that det T = Π_{i=1}^p t_ii, we have

Jg(T) = Jg(I_p) Π_{i=1}^p t_ii^{p-i+1}.

Thus Jg(T) is a constant k = Jg(I_p) times Π_{i=1}^p t_ii^{p-i+1}. Hence

∫_{S_p^+} f(S) dS/|S|^{(p+1)/2} = k ∫_{G_T} f(TT') Π_{i=1}^p t_ii^{p-i+1} dT/|TT'|^{(p+1)/2}.
To evaluate the constant k, pick

f(S) = |S|^{r/2} exp[-(1/2) tr S], r > p - 1.

But

∫_{S_p^+} |S|^{r/2} exp[-(1/2) tr S] dS/|S|^{(p+1)/2} = ∫_{S_p^+} |S|^{(r-p-1)/2} exp[-(1/2) tr S] dS = c(r, p)

where c(r, p) is defined in Example 5.1. However,

k ∫_{G_T} |TT'|^{(r-p-1)/2} exp[-(1/2) tr TT'] Π_{i=1}^p t_ii^{p-i+1} dT
= k ∫_{G_T} Π_{i=1}^p t_ii^{r-i} exp[-(1/2) Σ_{j≤i} t_ij^2] dT = k 2^{-p} c(r, p)

so k = 2^p. The evaluation of the last integral is carried out by noting that t_ii ranges from 0 to ∞ and t_ij for j < i ranges from -∞ to ∞. Thus the integral is a product of p(p + 1)/2 one-dimensional integrals, each of which is easy to evaluate. □
A by-product of this proof is that

h(T) = (2^p/c(r, p)) Π_{i=1}^p t_ii^{r-i} exp[-(1/2) Σ_{j≤i} t_ij^2]

is a density function on G_T. Since the density h factors into a product of densities, the elements of T, t_ij for j ≤ i, are independent. Clearly,

L(t_ij) = N(0, 1) for j < i

and

L(t_ii^2) = χ²_{n-i+1}

when r is the integer n ≥ p.
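The factorization of the density h is the basis of the Bartlett decomposition method for simulating Wishart matrices: sample T coordinate by coordinate and set S = TT'. A minimal Python sketch (an editorial illustration; the chi-squared variate is drawn as a sum of squared standard normals, which assumes an integer n, and is simple rather than efficient):

```python
import math
import random

def bartlett_T(n, p, rng=None):
    """Sample T in G_T with t_ij ~ N(0, 1) for j < i and
    t_ii^2 ~ chi-squared with n - i + 1 degrees of freedom (1-based i).
    Then S = TT' has the Wishart distribution with n degrees of freedom
    and identity scale."""
    rng = rng or random.Random(0)
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):                 # i is 0-based, so df = n - (i + 1) + 1
        df = n - i
        T[i][i] = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)))
        for j in range(i):
            T[i][j] = rng.gauss(0.0, 1.0)
    return T
```

The independence of the t_ij is what makes this sampler valid: each coordinate is drawn from its own marginal distribution.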
Proposition 5.19. Define g on G_U to S_p^+ by g(U) = UU'. Then Jg(U) is given by

Jg(U) = 2^p Π_{i=1}^p u_ii^i

where u_11,..., u_pp are the diagonal elements of U.

Proof. The proof is essentially the same as the proof of Proposition 5.18 and is left to the reader. □
The technique used to prove Proposition 5.18 is an important one. Given g on G_T to S_p^+, the idea of the proof was to write down the equation the Jacobian satisfies, namely,

∫_{S_p^+} f(S) dS/|S|^{(p+1)/2} = ∫_{G_T} f(g(T)) Jg(T) dT/|TT'|^{(p+1)/2}

for all integrable f. Since this equation must hold for all integrable f, Jg is uniquely defined (up to sets of Lebesgue measure zero) by this equation. It is clear that any property satisfied by the left-hand integral must also be satisfied by the right-hand integral, and this was used to characterize Jg. In particular, it was noted that the left-hand integral remained the same if f(S) was replaced by f(ASA') for any nonsingular A. For A ∈ G_T, this led to the equation

Jg(T) = Jg(A^{-1}T) |det A|^{p+1}/Π_{i=1}^p a_ii^i,

which determined Jg. It should be noted that only Jacobians of the linear transformations discussed in Propositions 5.11 and 5.13 were used to determine the Jacobian of the nonlinear transformation g. Arguments similar to this are used throughout Chapter 6 to derive invariant integrals (measures) on matrix groups and spaces that are acted upon by matrix groups.
PROBLEMS
1. Given A ∈ L_{p,n} with rank(A) = p, show that A = ΨT where Ψ ∈ F_{p,n} and T ∈ G_T. Prove that Ψ and T are unique.
2. Define the function F on S_p^+ to G_T as follows. For each S ∈ S_p^+, F(S) is the unique element in G_T such that S = F(S)(F(S))'. Show that F(TST') = TF(S) for T ∈ G_T and S ∈ S_p^+.
3. Given S ∈ S_p^+, show there exists a unique U ∈ G_U such that S = UU'.
4. For S ∈ S_p^+, partition S as

S = ( S11  S12 )
    ( S21  S22 )

where Sij is pi x pj, i, j = 1, 2. Assume for definiteness that p1 ≤ p2. Show that S can be written as

S = ( A1  0  ) ( I_{p1}   (D 0)  ) ( A1  0  )'
    ( 0   A2 ) ( (D 0)'   I_{p2} ) ( 0   A2 )

where Ai is pi x pi and nonsingular, D is p1 x p1 and diagonal with diagonal elements in [0, 1).
5. Let L⁰_{p,n} be the set of elements in L_{p,n} that have rank p. Define F on F_{p,n} x G_U to L⁰_{p,n} by F(Ψ, U) = ΨU.

(i) Show that F is one-to-one and onto, and describe the inverse of F.

(ii) For Γ ∈ O_n and T ∈ G_T, define Γ ⊗ T on L⁰_{p,n} to L⁰_{p,n} by (Γ ⊗ T)A = ΓAT'. Show that (Γ ⊗ T)F(Ψ, U) = F(ΓΨ, UT'). Also, show that F^{-1}((Γ ⊗ T)A) = (ΓΨ, UT') where F^{-1}(A) = (Ψ, U).
6. Let B0 and B1 be open sets in R^n and fix x0 ∈ B0. Suppose g maps B0 into B1 and g(x) = g(x0) + A(x - x0) + R(x - x0) where A is an n x n matrix and R(·) is a function that satisfies

lim_{u→0} ||R(u)||/||u|| = 0.

Prove that A = Dg(x0) so Jg(x0) = |det(A)|.
7. Let V be the linear coordinate space of all p x p lower triangular real matrices, so V is of dimension p(p + 1)/2. Let S_p be the linear coordinate space of all p x p real symmetric matrices, so S_p is also of dimension p(p + 1)/2.

(i) Show that G_T is an open subset of V.

(ii) Define g on G_T to S_p by g(T) = TT'. For fixed T0 ∈ G_T, show that g(T) = g(T0) + L(T - T0) + (T - T0)(T - T0)' where L is defined on V to S_p by L(x) = xT0' + T0x', x ∈ V. Also show that R(T - T0) = (T - T0)(T - T0)' satisfies

lim_{x→0} ||R(x)||/||x|| = 0.

(iii) Prove by induction that det L = 2^p Π_{i=1}^p t_ii^{p-i+1} where t_11,..., t_pp are the diagonal elements of T0.

(iv) Using (iii) and Problem 6, show that Jg(T) = 2^p Π_{i=1}^p t_ii^{p-i+1}. (This is just Proposition 5.18.)
8. When S is a positive definite matrix, partition S and S^{-1} as

S = ( S11  S12 )     S^{-1} = ( S^11  S^12 )
    ( S21  S22 ),             ( S^21  S^22 ).

Show that

S^11 = (S11 - S12 S22^{-1} S21)^{-1},
S^12 = -S^11 S12 S22^{-1},
S^22 = (S22 - S21 S11^{-1} S12)^{-1},
S^21 = -S^22 S21 S11^{-1},

and verify the identity

S22^{-1} S21 S^11 = S^22 S21 S11^{-1}.
9. In coordinate space R^p, partition x as x = (y; z), and for Σ > 0, partition Σ: p x p conformably as

Σ = ( Σ11  Σ12 )
    ( Σ21  Σ22 ).

Define the inner product (·,·) on R^p by (u, v) = u'Σ^{-1}v.

(i) Show that the matrix

P = ( I   -Σ12 Σ22^{-1} )
    ( 0    0            )

defines an orthogonal projection in the inner product (·,·). What is R(P)?

(ii) Show that the identity

x'Σ^{-1}x = (y - Σ12 Σ22^{-1} z)'(Σ11 - Σ12 Σ22^{-1} Σ21)^{-1}(y - Σ12 Σ22^{-1} z) + z'Σ22^{-1} z

is the same as the identity

||x||² = ||Px||² + ||(I - P)x||²

where (x, x) = ||x||² and x = (y; z).
(iii) For a random vector

X = ( Y )
    ( Z )

with L(X) = N(0, Σ), Σ > 0, use part (ii) to give a direct proof via densities that the conditional distribution of Y given Z = z is N(Σ12 Σ22^{-1} z, Σ11 - Σ12 Σ22^{-1} Σ21).
10. Verify the equation

∫_{G_T} Π_{i=1}^p t_ii^{r-i} exp[-(1/2) Σ_{j≤i} t_ij^2] dT = 2^{-p} c(r, p)

where c(r, p) is given in Example 5.1. Here, r is real and r > p - 1.
NOTES AND REFERENCES
1. Other matrix factorizations of interest in statistical problems can be found in Anderson (1958), Rao (1973), and Muirhead (1982). Many matrix factorizations can be viewed as results that give a maximal invariant under the action of a group, a topic discussed in detail in Chapter 7.
2. Only the most elementary facts concerning the transformation of measures under a change of variable have been given in the second section. The Jacobians of other transformations that occur naturally in statistical problems can be found in Deemer and Olkin (1951), Anderson (1958), James (1954), Farrell (1976), and Muirhead (1982). Some of these transformations involve functions defined on manifolds (rather than open subsets of R^n) and the corresponding Jacobian calculations require a knowledge of differential forms on manifolds. Otherwise, the manipulations just look like magic that somehow yields answers we do not know how to check. Unfortunately, the amount of mathematics behind these calculations is substantial. The mastery of this material is no mean feat. Farrell (1976) provides one treatment of the calculus of differential forms. James (1954) and Muirhead (1982) contain some background material and references.
3. I have found Lang (1969, Part Six, Global Analysis) to be a very readable introduction to differential forms and manifolds.
CHAPTER 6
Topological Groups
and Invariant Measures
The language of vector spaces has been used in the previous chapters to describe a variety of properties of random vectors and their distributions. Apart from the discussion in Chapter 4, not much has been said concerning the structure of parametric probability models for distributions of random vectors. Groups of transformations acting on spaces provide a very useful framework in which to generate and describe many parametric statistical models. Furthermore, the derivation of induced distributions of a variety of functions of random vectors is often simplified and clarified using the existence and uniqueness of invariant measures on locally compact topological groups. The ideas and techniques presented in this chapter permeate the remainder of this book.
Most of the groups occurring in multivariate analysis are groups of nonsingular linear transformations or related groups of affine transformations. Examples of matrix groups are given in Section 6.1 to illustrate the definition of a group. Also, examples of quotient spaces that arise naturally in multivariate analysis are discussed.
In Section 6.2, locally compact topological groups are defined. The existence and uniqueness theorem concerning invariant measures (integrals) on these groups is stated and the matrix groups introduced in Section 6.1
are used as examples. Continuous homomorphisms and their relation to
relatively invariant measures are described with matrix groups again serving as examples. Some of the material in this section and the next is modeled
after Nachbin (1965). Rather than repeat the proofs given in Nachbin
(1965), we have chosen to illustrate the theory with numerous examples. Section 6.3 is concerned with the existence and uniqueness of relatively
invariant measures on spaces that are acted on transitively by groups of
transformations. In fact, this situation is probably more relevant to statistical problems than that discussed in Section 6.2. Of course, the examples are
selected with statistical applications in mind.
6.1. GROUPS
We begin with the definition of a group and then give examples of matrix
groups.
Definition 6.1. A group (G, ∘) is a set G together with a binary operation ∘ such that the following properties hold for all elements in G:

(i) (g1 ∘ g2) ∘ g3 = g1 ∘ (g2 ∘ g3).
(ii) There is a unique element of G, denoted by e, such that g ∘ e = e ∘ g = g for all g ∈ G. The element e is the identity in G.
(iii) For each g ∈ G, there is a unique element in G, denoted by g^{-1}, such that g ∘ g^{-1} = g^{-1} ∘ g = e. The element g^{-1} is the inverse of g.
Henceforth, the binary operation is ordinarily deleted and we write g1g2 for g1 ∘ g2. Also, parentheses are usually not used in expressions involving more than two group elements as these expressions are unambiguously defined by (i). A group G is called commutative if g1g2 = g2g1 for all g1, g2 ∈ G. It is clear that a vector space V is a commutative group where the group operation is addition, the identity element is 0 ∈ V, and the inverse of x is -x.
* Example 6.1. If (V, (·, ·)) is a finite dimensional inner product space, it has been shown that the set of all orthogonal transformations O(V) is a group. The group operation is the composition of linear transformations, the identity element is the identity linear transformation, and if Γ ∈ O(V), the inverse of Γ is Γ'. When V is the coordinate space R^n, O(V) is denoted by O_n, which is just the group of n × n orthogonal matrices.
* Example 6.2. Consider the coordinate space R^p and let G_T^+ be the set of all p × p lower triangular matrices with positive diagonal elements. The group operation in G_T^+ is taken to be matrix multiplication. It has been verified in Chapter 5 that G_T^+ is a group, the identity in G_T^+ is the p × p identity matrix, and if T ∈ G_T^+, T^{-1} is
just the matrix inverse of T. Similarly, the set G_U^+ of p × p upper triangular matrices with positive diagonal elements is a group under the operation of matrix multiplication.
* Example 6.3. Let V be an n-dimensional vector space and let Gl(V) be the set of all nonsingular linear transformations of V onto V. The group operation in Gl(V) is defined to be composition of linear transformations. With this operation, it is easy to verify that Gl(V) is a group, the identity in Gl(V) is the identity linear transformation, and if g ∈ Gl(V), g^{-1} is the inverse linear transformation of g. The group Gl(V) is often called the general linear group of V. When V is the coordinate space R^n, Gl(V) is denoted by Gl_n. Clearly, Gl_n is just the set of n × n nonsingular matrices and the group operation is matrix multiplication.
It should be noted that O(V) is a subset of Gl(V) and the group operation in O(V) is that of Gl(V). Further, G_T^+ and G_U^+ are subsets of Gl_n with the inherited group operations. This observation leads to the definition of a subgroup.
Definition 6.2. If (G, ∘) is a group and H is a subset of G such that (H, ∘) is also a group, then (H, ∘) is a subgroup of (G, ∘).
In all of the above examples, each element of the group is also a
one-to-one function defined on a set. Further, the group operation is in fact
function composition. To isolate the essential features of this situation, we define the following.
Definition 6.3. Let (G, ∘) be a group and let 𝒳 be a set. The group (G, ∘) acts on the left of 𝒳 if to each pair (g, x) ∈ G × 𝒳, there corresponds a unique element of 𝒳, denoted by gx, such that

(i) g1(g2x) = (g1 ∘ g2)x.
(ii) ex = x.
The content of Definition 6.3 is that there is a function on G × 𝒳 to 𝒳 whose value at (g, x) is denoted by gx and, under this mapping, (g1, g2x) and (g1 ∘ g2, x) are sent into the same element. Furthermore, (e, x) is mapped to x. Thus each g ∈ G can be thought of as a one-to-one onto function from 𝒳 to 𝒳 and the group operation in G is function composition. To make this claim precise, for each g ∈ G, define t_g on 𝒳 to 𝒳 by

t_g(x) = gx.
Proposition 6.1. Suppose G acts on the left of 𝒳. Then each t_g is a one-to-one onto function from 𝒳 to 𝒳 and:

(i) t_{g1} t_{g2} = t_{g1 ∘ g2}.
(ii) t_g^{-1} = t_{g^{-1}}.

Proof. To show t_g is onto, consider x ∈ 𝒳. Then t_g(g^{-1}x) = g(g^{-1}x) = (g ∘ g^{-1})x = ex = x, where (i) and (ii) of Definition 6.3 have been used. Thus t_g is onto. If t_g(x1) = t_g(x2), then gx1 = gx2 so

x1 = ex1 = (g^{-1} ∘ g)x1 = g^{-1}(gx1) = g^{-1}(gx2) = (g^{-1} ∘ g)x2 = ex2 = x2.

Thus t_g is one-to-one. Assertion (i) follows immediately from (i) of Definition 6.3. Since t_e is the identity function on 𝒳 and (i) implies that

t_g t_{g^{-1}} = t_{g^{-1}} t_g = t_e,

we have t_g^{-1} = t_{g^{-1}}. □
Henceforth, we dispense with t_g and simply regard each g as a function on 𝒳 to 𝒳 where function composition is group composition and e is the identity function on 𝒳. All of the examples considered thus far are groups of functions on a vector space to itself and the group operation is defined to be function composition. In particular, Gl(V) is the set of all one-to-one onto linear transformations of V to V and the group operation is function composition. In the next example, the motivation for the definition of the group operation is provided by thinking of each group element as a function.
* Example 6.4. Let V be an n-dimensional vector space and consider the set Al(V) that is the collection of all pairs (A, x) with A ∈ Gl(V) and x ∈ V. Each pair (A, x) defines a one-to-one onto function from V to V by

(A, x)v = Av + x,  v ∈ V.

The composition of (A1, x1) and (A2, x2) is

(A1, x1)(A2, x2)v = (A1, x1)(A2v + x2) = A1A2v + A1x2 + x1 = (A1A2, A1x2 + x1)v.
Also, (I, 0) ∈ Al(V) is the identity function on V and the inverse of (A, x) is (A^{-1}, -A^{-1}x). It is now an easy matter to verify that Al(V) is a group where the group operation in Al(V) is

(A1, x1)(A2, x2) = (A1A2, A1x2 + x1).

This group Al(V) is called the affine group of V. When V is the coordinate space R^n, Al(V) is denoted by Al_n.
An interesting and useful subgroup of Al(V) is given in the next example.
* Example 6.5. Suppose V is a finite dimensional vector space and let M be a subspace of V. Let H be the collection of all pairs (A, x) where x ∈ M, A(M) ⊆ M, and (A, x) ∈ Al(V). The group operation in H is that inherited from Al(V). It is a routine calculation to show that H is a subgroup of Al(V). As a particular case, suppose that V is R^n and M is the m-dimensional subspace of R^n consisting of those vectors x ∈ R^n whose last n - m coordinates are zero. An n × n matrix A ∈ Gl_n satisfies A(M) ⊆ M iff

A = (A11  A12)
    ( 0   A22)

where A11 is m × m and nonsingular, A12 is m × (n - m), and A22 is (n - m) × (n - m) and nonsingular. Thus H consists of all pairs (A, x) where A ∈ Gl_n has the above form and x has its last n - m coordinates zero.
* Example 6.6. In this example, we consider two finite groups that arise naturally in statistical problems. Consider the space R^n and let P be an n × n matrix that permutes the coordinates of a vector x ∈ R^n. Thus in each row and in each column of P, there is a single element that is one and the remaining elements are zero. Conversely, any such matrix permutes the coordinates of vectors in R^n. The set 𝒫_n of all such matrices is called the group of permutation matrices. It is clear that 𝒫_n is a group under matrix multiplication and 𝒫_n has n! elements. Also, let 𝒟_n be the set of all n × n diagonal matrices whose diagonal elements are plus or minus one. Obviously, 𝒟_n is a group under matrix multiplication and 𝒟_n has 2^n elements. The group 𝒟_n is called the group of sign changes on R^n. A bit of reflection shows that both 𝒫_n and 𝒟_n are subgroups of O_n. Now, let
H be the set

H = {PD | P ∈ 𝒫_n, D ∈ 𝒟_n}.

The claim is that H is a group under matrix multiplication. To see this, first note that for P ∈ 𝒫_n and D ∈ 𝒟_n, P'DP is an element of 𝒟_n. Thus if P1D1 and P2D2 are in H, then

P1D1P2D2 = P1P2(P2'D1P2)D2 = P3D3 ∈ H

where P3 = P1P2 and D3 = P2'D1P2D2. Also,

(PD)^{-1} = DP' = P'(PDP') ∈ H.

Therefore H is a group and clearly has 2^n n! elements.
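For small n the claims of Example 6.6 can be verified by brute force (an editorial check, not in the text): with n = 3, the products PD form a set of exactly 2^n n! = 48 distinct orthogonal matrices closed under products and inverses.

```python
# Enumerate H = {PD} for n = 3 and confirm it is a subgroup of O_n of order 2^n n!.
import itertools
import math
import numpy as np

n = 3
perms = [np.eye(n)[list(p)] for p in itertools.permutations(range(n))]
signs = [np.diag(d) for d in itertools.product([1, -1], repeat=n)]
H = [P @ D for P in perms for D in signs]

key = lambda M: tuple(np.rint(M).astype(int).ravel())   # exact integer entries
Hset = {key(M) for M in H}
assert len(Hset) == 2**n * math.factorial(n)            # 48 elements for n = 3

for M in H:
    assert np.allclose(M @ M.T, np.eye(n))              # H is a subset of O_n
    assert key(np.linalg.inv(M)) in Hset                # closed under inverses
    for N in H:
        assert key(M @ N) in Hset                       # closed under products
print("H has", len(Hset), "elements and is a subgroup of O_n")
```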
Suppose that G is a group and H is a subgroup of G. The quotient space G/H, to be defined next, is often a useful representation of spaces that arise in later considerations. The subgroup H of G defines an equivalence relation ≡ in G by g1 ≡ g2 iff g2 ∈ g1H. That ≡ is an equivalence relation is easily verified using the assumption that H is a subgroup of G. Also, it is not difficult to show that g1 ≡ g2 iff the set g1H = {g1h | h ∈ H} is equal to the set g2H. Thus the set of points in G equivalent to g1 is the set g1H.
Definition 6.4. If H is a subgroup of G, the quotient space G/H is defined to be the set whose elements are gH for g ∈ G.
The quotient space G/H is obviously the set of equivalence classes (defined by H) of elements of G. Under certain conditions on H, the quotient space G/H is in fact a group under a natural definition of a group operation.
Definition 6.5. A subgroup H of G is called a normal subgroup if g^{-1}Hg = H for all g ∈ G.
When H is a normal subgroup of G, and g_iH ∈ G/H for i = 1, 2, then

g1Hg2H = {g1h1g2h2 | h1, h2 ∈ H} = g1g2(g2^{-1}Hg2)H = g1g2HH = g1g2H

since HH = H.
Proposition 6.2. When H is a normal subgroup of G, the quotient space G/H is a group under the operation

(g1H)(g2H) = g1g2H.

Proof. This is a routine calculation and is left to the reader. □
* Example 6.7. Let Al(V) be the affine group of the vector space V. Then

H = {(I, x) | x ∈ V}

is easily shown to be a subgroup of G, since (I, x1)(I, x2) = (I, x1 + x2). To show H is normal in Al(V), consider (A, x) ∈ Al(V) and (I, x0) ∈ H. Then

(A, x)^{-1}(I, x0)(A, x) = (A^{-1}, -A^{-1}x)(A, x + x0) = (I, A^{-1}x + A^{-1}x0 - A^{-1}x) = (I, A^{-1}x0),

which is an element of H. Thus g^{-1}Hg ⊆ H for all g ∈ Al(V). But if (I, x0) ∈ H and (A, x) ∈ Al(V), then

(A, x)^{-1}(I, Ax0)(A, x) = (I, x0)

so g^{-1}Hg = H for g ∈ Al(V). Therefore, H is normal in Al(V). To describe the group Al(V)/H, we characterize the equivalence relation defined by H. For (A_i, x_i) ∈ Al(V), i = 1, 2,

(A1, x1)^{-1}(A2, x2) = (A1^{-1}, -A1^{-1}x1)(A2, x2) = (A1^{-1}A2, A1^{-1}x2 - A1^{-1}x1)

is an element of H iff A1^{-1}A2 = I or A1 = A2. Thus (A1, x1) is equivalent to (A2, x2) iff A1 = A2. From each equivalence class, select the element (A, 0). Then it is clear that the quotient group Al(V)/H can be identified with the group

K = {(A, 0) | A ∈ Gl(V)}
where the group operation is

(A1, 0)(A2, 0) = (A1A2, 0).
Now, suppose the group G acts on the left of the set 𝒳. We say G acts transitively on 𝒳 if, for each x1 and x2 in 𝒳, there exists a g ∈ G such that gx1 = x2. When G acts transitively on 𝒳, we want to show that there is a natural one-to-one correspondence between 𝒳 and a certain quotient space. Fix an element x0 ∈ 𝒳 and let

H = {h | hx0 = x0, h ∈ G}.

The subgroup H of G is called the isotropy subgroup of x0. Now, define the function τ on G/H to 𝒳 by τ(gH) = gx0.
Proposition 6.3. The function τ is one-to-one and onto. Further,

τ(g1gH) = g1τ(gH).

Proof. The definition of τ clearly makes sense as ghx0 = gx0 for all h ∈ H. Also, τ is an onto function since G acts transitively on 𝒳. If τ(g1H) = τ(g2H), then g1x0 = g2x0 so g2^{-1}g1 ∈ H. Therefore, g1H = g2H so τ is one-to-one. The rest is obvious. □
If H is any subgroup of G, then the group G acts transitively on 𝒳 = G/H where the group action is

g1(gH) = g1gH.

Thus we have a complete description of the spaces 𝒳 that are acted on transitively by G. Namely, these spaces are simply relabelings of the quotient spaces G/H where H is a subgroup of G. Further, the action of g on 𝒳 corresponds to the action of G on the quotient space described in Proposition 6.3. A few examples illustrate these ideas.
* Example 6.8. Take the set 𝒳 to be F_{p,n}, the set of n × p real matrices Ψ that satisfy Ψ'Ψ = I_p, 1 ≤ p ≤ n. The group G = O_n of all n × n orthogonal matrices acts on F_{p,n} by matrix multiplication. That is, if Γ ∈ O_n and Ψ ∈ F_{p,n}, then ΓΨ is the matrix product of Γ and Ψ. To show that this group action is transitive, consider Ψ1 and Ψ2 in F_{p,n}. Then, the columns of Ψ1 form a set of p orthonormal
vectors in R^n as do the columns of Ψ2. By Proposition 1.30, there exists an n × n orthogonal matrix Γ that maps the columns of Ψ1 into the columns of Ψ2. Thus ΓΨ1 = Ψ2 so O_n is transitive on F_{p,n}. A convenient choice of x0 ∈ F_{p,n} to define the map τ is

x0 = (I_p)
     ( 0 )

where 0 is a block of (n - p) × p zeroes. It is not difficult to show that the subgroup H = {Γ | Γx0 = x0, Γ ∈ O_n} is

H = {Γ | Γ = (I_p   0 ), Γ22 ∈ O_{n-p}}.
             ( 0   Γ22)

The function τ is

τ(ΓH) = Γx0,

which is the n × p matrix consisting of the first p columns of Γ. This gives an obvious representation of F_{p,n}.
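The transitivity argument of Example 6.8 can be made constructive (an editorial sketch; the construction via orthonormal-basis completion is mine, while the text appeals to Proposition 1.30): extend each Ψ_i to a full orthonormal basis and map one basis onto the other.

```python
# Construct Gamma in O_n with Gamma Psi1 = Psi2 for two frames in F_{p,n}.
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 2

def stiefel(n, p):
    Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
    return Q                                     # an element of F_{p,n}

Psi1, Psi2 = stiefel(n, p), stiefel(n, p)

def complete(Psi):
    # columns of Psi followed by an orthonormal basis of its orthogonal complement
    Q, _ = np.linalg.qr(np.hstack([Psi, rng.normal(size=(n, n - p))]))
    return np.hstack([Psi, Q[:, p:]])

B1, B2 = complete(Psi1), complete(Psi2)
Gamma = B2 @ B1.T
assert np.allclose(Gamma @ Gamma.T, np.eye(n))   # Gamma is orthogonal
assert np.allclose(Gamma @ Psi1, Psi2)           # Gamma maps Psi1 to Psi2
print("found Gamma in O_n with Gamma Psi1 = Psi2")
```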
* Example 6.9. Let 𝒳 be the set of all p × p positive definite matrices and let G = Gl_p. The transitive group action is given by A(x) = AxA' where A is a p × p nonsingular matrix, x ∈ 𝒳, and A' is the transpose of A. Choose x0 ∈ 𝒳 to be I_p. Obviously, H = O_p and the map τ is given by

τ(AH) = A(x0) = AA'.

The reader should compare this example with the assertion of Proposition 1.31.
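A numerical aside on Example 6.9 (editorial, not from the text): the Cholesky factorization exhibits, for any positive definite x, one A ∈ Gl_p with A(I_p) = x, which is exactly the transitivity of the action.

```python
# Every positive definite x equals A A' for some A in Gl_p (Cholesky),
# and the isotropy subgroup of I_p is O_p since A I A' = I iff A A' = I.
import numpy as np

rng = np.random.default_rng(4)
p = 4
B = rng.normal(size=(p, p))
x = B @ B.T + p * np.eye(p)          # a point in the space of p.d. matrices

A = np.linalg.cholesky(x)            # lower triangular with positive diagonal
assert np.allclose(A @ A.T, x)       # A(I_p) = A I_p A' = x

Q, _ = np.linalg.qr(rng.normal(size=(p, p)))   # a random element of O_p
assert np.allclose(Q @ np.eye(p) @ Q.T, np.eye(p))  # Q fixes I_p
print("every p.d. x is A A' for some A in Gl_p")
```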
* Example 6.10. In this example, take 𝒳 to be the set of all n × p real matrices of rank p, p ≤ n. Consider the group G defined by

G = {g | g = Γ ⊗ T, Γ ∈ O_n, T ∈ G_T^+}

where G_T^+ is the group of all p × p lower triangular matrices with positive diagonal elements. Of course, ⊗ denotes the Kronecker product and group composition is

(Γ1 ⊗ T1)(Γ2 ⊗ T2) = (Γ1Γ2) ⊗ (T1T2).
The action of G on 𝒳 is

(Γ ⊗ T)X = ΓXT',  X ∈ 𝒳.

To show G acts transitively on 𝒳, consider X1, X2 ∈ 𝒳 and write X_i = Ψ_iU_i, where Ψ_i ∈ F_{p,n} and U_i ∈ G_U^+, i = 1, 2 (see Proposition 5.2). From Example 6.8, there is a Γ ∈ O_n such that ΓΨ1 = Ψ2. Let T' = U1^{-1}U2 so

ΓX1T' = ΓΨ1U1U1^{-1}U2 = Ψ2U2 = X2.

Choose X0 ∈ 𝒳 to be

X0 = (I_p)
     ( 0 )

as in Example 6.8. Then the equation (Γ ⊗ T)X0 = X0 implies that

I_p = X0'X0 = ((Γ ⊗ T)X0)'(Γ ⊗ T)X0 = TX0'Γ'ΓX0T' = TT'

so T = I_p by Proposition 5.4. Then the equation (Γ ⊗ I_p)X0 = X0 is exactly the equation occurring in Example 6.8 for elements of the subgroup H. Thus for this example,

H = {Γ ⊗ I_p | Γ = (I_p   0 ), Γ22 ∈ O_{n-p}}.
                   ( 0   Γ22)

Therefore,

τ((Γ ⊗ T)H) = (Γ ⊗ T)X0 = ΓX0T'

is the representation for elements of 𝒳. Obviously, ΓX0 ∈ F_{p,n} (it consists of the first p columns of Γ), and the representation of elements of 𝒳 via the map τ is precisely the representation established in Proposition 5.2. This representation of 𝒳 is used on a number of occasions.
6.2. INVARIANT MEASURES AND INTEGRALS
Before beginning a discussion of invariant integrals on locally compact topological groups, we first outline the basic results of integration theory on locally compact topological spaces. Consider a set 𝒳 and let 𝒯 be a Hausdorff topology for 𝒳.
Definition 6.6. The topological space (𝒳, 𝒯) is a locally compact space if for each x ∈ 𝒳, there exists a compact neighborhood of x.
Most of the groups introduced in the examples of the previous section are subsets of the space R^m, for some m, and when these groups are given the topology of R^m, they are locally compact spaces. The verification of this is not difficult and is left to the reader. If (𝒳, 𝒯) is a locally compact space, K(𝒳) denotes the set of all continuous real-valued functions that have compact support. Thus f ∈ K(𝒳) if f is continuous and there is a compact set K such that f(x) = 0 if x ∉ K. It is clear that K(𝒳) is a real vector space with addition and scalar multiplication being defined in the obvious way.
Definition 6.7. A real-valued function J defined on K(𝒳) is called an integral if:

(i) J(α1f1 + α2f2) = α1J(f1) + α2J(f2) for α1, α2 ∈ R and f1, f2 ∈ K(𝒳).
(ii) J(f) ≥ 0 if f ≥ 0, f ∈ K(𝒳).

An integral J is simply a linear function on K(𝒳) that has the additional property that J(f) is nonnegative when f ≥ 0. Let ℬ(𝒳) be the σ-algebra generated by the compact subsets of 𝒳. If μ is a measure on ℬ(𝒳) such that μ(K) < +∞ for each compact set K, it is clear that

J(f) = ∫ f(x)μ(dx)

defines an integral on K(𝒳). Such measures μ are called Radon measures. Conversely, given an integral J, there is a measure μ on ℬ(𝒳) such that μ(K) < +∞ for all compact sets K and

J(f) = ∫ f(x)μ(dx)

for f ∈ K(𝒳). For a proof of this result, see Segal and Kunze (1978,
Chapter 5). In the special case when (𝒳, 𝒯) is a σ-compact space (that is, 𝒳 = ∪_1^∞ K_i where each K_i is compact), the correspondence between integrals J and measures μ that satisfy μ(K) < +∞ for K compact is one-to-one (see Segal and Kunze, 1978). All of the examples considered here are σ-compact spaces and we freely identify integrals with Radon measures and vice versa.
Now, assume (𝒳, 𝒯) is a σ-compact space. If an integral J on K(𝒳) corresponds to a Radon measure μ on ℬ(𝒳), then J has a natural extension to the class of all ℬ(𝒳)-measurable and μ-integrable functions. Namely, J is extended by the equation

J(f) = ∫ f(x)μ(dx)

for all f for which the right-hand side is defined. Obviously, the extension of J is unique and is determined by the values of J on K(𝒳). In many of the examples in this chapter, we use J to denote both an integral on K(𝒳) and its extension. With this convention, J is defined for any ℬ(𝒳)-measurable function that is μ-integrable where μ corresponds to J.
Suppose G is a group and 𝒯 is a topology on G.

Definition 6.8. Given the topology 𝒯 on G, (G, 𝒯) is a topological group if the mapping (x, y) → xy^{-1} is continuous from G × G to G. If (G, 𝒯) is a topological group and (G, 𝒯) is a locally compact topological space, (G, 𝒯) is called a locally compact topological group.

In what follows, all groups under consideration are locally compact topological groups. Examples of such groups include the vector space R^n, the general linear group Gl_n, the affine group Al_n, and G_T^+. The verification that these groups are locally compact topological groups with the Euclidean space topology is left to the reader.

If (G, 𝒯) is a locally compact topological group, K(G) denotes the real vector space of all continuous functions on G that have compact support. For s ∈ G and f ∈ K(G), the left translate of f by s, denoted by sf, is defined by (sf)(x) = f(s^{-1}x), x ∈ G. Clearly, sf ∈ K(G) for all s ∈ G. Similarly, the right translate of f ∈ K(G), denoted by fs, is (fs)(x) = f(xs^{-1}) and fs ∈ K(G).

Definition 6.9. An integral J ≠ 0 on K(G) is left invariant if J(sf) = J(f) for all f ∈ K(G) and s ∈ G. An integral J ≠ 0 on K(G) is right invariant if J(fs) = J(f) for all f ∈ K(G) and s ∈ G.
The basic properties of left and right invariant integrals are summarized in the following two results.
Theorem 6.1. If G is a locally compact topological group, then there exist left and right invariant integrals on K(G). If J1 and J2 are left (right) invariant integrals on K(G), then J2 = cJ1 for some positive constant c.

Proof. See Nachbin (1965, Section 4, Chapter 2). □
Theorem 6.2. Suppose that

J(f) = ∫ f(x)μ(dx)

is a left invariant integral on K(G). Then there exists a unique continuous function Δ_r mapping G into (0, ∞) such that

∫ f(xs^{-1})μ(dx) = Δ_r(s) ∫ f(x)μ(dx)

for all s ∈ G and f ∈ K(G). The function Δ_r, called the right-hand modulus of G, also satisfies:

(i) Δ_r(st) = Δ_r(s)Δ_r(t), s, t ∈ G.
(ii) ∫ f(x^{-1})μ(dx) = ∫ f(x)Δ_r(x^{-1})μ(dx).

Further, the integral

J1(f) = ∫ f(x)Δ_r(x^{-1})μ(dx)

is right invariant.
Proof. See Nachbin (1965, Section 5, Chapter 2). □

The two results above establish the existence and uniqueness of right and left invariant integrals and show how to construct right invariant integrals from left invariant integrals via the right-hand modulus Δ_r. The right-hand modulus is a continuous homomorphism from G into (0, ∞); that is, Δ_r is continuous and satisfies Δ_r(st) = Δ_r(s)Δ_r(t) for s, t ∈ G. (The definition of a homomorphism from one group to another group is given shortly.)
Before presenting examples of invariant integrals, it is convenient to introduce relatively left (and right) invariant integrals. Proposition 6.4, given
below, provides a useful method for constructing invariant integrals from
relatively invariant integrals.
Definition 6.10. A nonzero integral J on K(G) given by

J(f) = ∫ f(x)m(dx),  f ∈ K(G),

is called relatively left invariant if there exists a function χ on G to (0, ∞) such that

∫ f(s^{-1}x)m(dx) = χ(s) ∫ f(x)m(dx)

for all s ∈ G and f ∈ K(G). The function χ is the multiplier for J.
It can be shown that any multiplier χ is continuous (see Nachbin, 1965). Further, if J is relatively left invariant with multiplier χ, then for s, t ∈ G and f ∈ K(G),

χ(st) ∫ f(x)m(dx) = ∫ f((st)^{-1}x)m(dx) = ∫ (tf)(s^{-1}x)m(dx)
                  = χ(s) ∫ (tf)(x)m(dx) = χ(s) ∫ f(t^{-1}x)m(dx)
                  = χ(s)χ(t) ∫ f(x)m(dx).

Thus χ(st) = χ(s)χ(t). Hence all multipliers are continuous and are homomorphisms from G into (0, ∞). For any such homomorphism χ, it is clear that χ(e) = 1 and χ(s^{-1}) = 1/χ(s). Also, χ(G) = {χ(s) | s ∈ G} is a subgroup of the group (0, ∞) with multiplication as the group operation.
Proposition 6.4. Let χ be a continuous homomorphism on G to (0, ∞).

(i) If J(f) = ∫ f(x)μ(dx) is left invariant on K(G), then

J1(f) = ∫ f(x)χ(x)μ(dx)

is a relatively left invariant integral on K(G) with multiplier χ.
(ii) If J1(f) = ∫ f(x)m(dx) is relatively left invariant with multiplier χ, then

J(f) = ∫ f(x)χ(x^{-1})m(dx)

is a left invariant integral.

Proof. The proof is a calculation. For (i),

J1(sf) = ∫ (sf)(x)χ(x)μ(dx) = ∫ f(s^{-1}x)χ(ss^{-1}x)μ(dx)
       = χ(s) ∫ f(s^{-1}x)χ(s^{-1}x)μ(dx) = χ(s) ∫ f(x)χ(x)μ(dx)
       = χ(s)J1(f).

Thus J1 is relatively left invariant with multiplier χ. For (ii),

J(sf) = ∫ f(s^{-1}x)χ(x^{-1})m(dx) = χ(s^{-1}) ∫ f(s^{-1}x)χ((s^{-1}x)^{-1})m(dx)
      = χ(s^{-1})χ(s) ∫ f(x)χ(x^{-1})m(dx)
      = ∫ f(x)χ(x^{-1})m(dx) = J(f),

where the second equality uses χ((s^{-1}x)^{-1}) = χ(x^{-1}s) = χ(s)χ(x^{-1}), and the third applies the relative left invariance of m to the function x → f(x)χ(x^{-1}). Thus J is a left invariant integral and the proof is complete. □
If J is a relatively left invariant integral with multiplier χ, say

J(f) = ∫ f(x)m(dx),

the measure m is also called relatively left invariant with multiplier χ. A nonzero integral J1 on K(G) is relatively right invariant with multiplier χ if J1(fs) = χ(s)J1(f). Using the results given above, if J1 is relatively right invariant with multiplier χ, then J1 is relatively left invariant with multiplier
χ/Δ_r where Δ_r is the right-hand modulus of G. Thus all relatively right and left invariant integrals can be constructed from a given relatively left (or right) invariant integral once all the continuous homomorphisms are known. Also, if a relatively left invariant measure m can be found and its multiplier χ calculated, then a left invariant measure is given by m/χ according to Proposition 6.4. This observation is used in the examples below.
* Example 6.11. Consider the group Gl_n of all nonsingular n × n matrices. Let ds denote Lebesgue measure on Gl_n. Since Gl_n = {s | det(s) ≠ 0}, Gl_n is a nonempty open subset of n^2-dimensional Euclidean space and hence has positive Lebesgue measure. For f ∈ K(Gl_n), let

J(f) = ∫ f(t) dt.

To find a left invariant measure on Gl_n, it is now shown that J(sf) = |det(s)|^n J(f), so J is relatively left invariant with multiplier χ(s) = |det(s)|^n. From Proposition 5.10, the Jacobian of the transformation g(t) = st, s ∈ Gl_n, is |det(s)|^n. Thus

J(sf) = ∫ f(s^{-1}t) dt = |det(s)|^n ∫ f(t) dt = |det(s)|^n J(f).

From Proposition 6.4, it follows that the measure

μ(dt) = dt / |det(t)|^n

is a left invariant measure on Gl_n. A similar Jacobian argument shows that μ is also right invariant, so the right-hand modulus of Gl_n is Δ_r ≡ 1. To construct all of the relatively invariant measures on Gl_n, it is necessary that the continuous homomorphisms χ be characterized. For each α ∈ R, let

χ_α(s) = |det(s)|^α,  s ∈ Gl_n.

Obviously, each χ_α is a continuous homomorphism. However, it can be shown (see the problems at the end of this chapter) that if χ is a continuous homomorphism of Gl_n into (0, ∞), then χ = χ_α for some α ∈ R. Hence every relatively invariant measure on Gl_n is given by

m(dt) = c χ_α(t) dt

where c is a positive constant and α ∈ R.
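The Jacobian claim underlying Example 6.11 can be checked directly (an editorial sketch, not from the text): with row-major vectorization, the linear map t → st on n × n matrices has matrix s ⊗ I_n, whose determinant is det(s)^n.

```python
# Verify that left translation t -> s t on Gl_n has Jacobian |det s|^n, so
# Lebesgue measure dt is relatively left invariant with multiplier |det s|^n
# and dt / |det t|^n is left invariant.
import numpy as np

rng = np.random.default_rng(6)
n = 3
s = rng.normal(size=(n, n))
t = rng.normal(size=(n, n))

# with row-major vec, vec(s t) = (s kron I_n) vec(t)
M = np.kron(s, np.eye(n))
assert np.allclose(M @ t.ravel(), (s @ t).ravel())

jac = abs(np.linalg.det(M))
assert np.allclose(jac, abs(np.linalg.det(s))**n)
print("Jacobian of t -> s t equals |det s|^n")
```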
A group G for which Δ_r ≡ 1 is called unimodular. Clearly, all commutative groups are unimodular as a left invariant integral is also right invariant. In the following example, we consider the group G_T^+, which is not unimodular, even though G_T^+ is a subgroup of the unimodular group Gl_n.
* Example 6.12. Let G_T^+ be the group of all n × n lower triangular matrices with positive diagonal elements. Thus G_T^+ is a nonempty open subset of [n(n + 1)/2]-dimensional Euclidean space so G_T^+ has positive Lebesgue measure. Let dt denote [n(n + 1)/2]-dimensional Lebesgue measure restricted to G_T^+. Consider the integral

J(f) = ∫ f(t) dt

defined on K(G_T^+). The Jacobian of the transformation g(t) = st, s ∈ G_T^+, is equal to

χ_0(s) = ∏_{i=1}^n s_ii^i

where s has diagonal elements s_11, ..., s_nn (see Proposition 5.13). Thus

J(sf) = ∫ f(s^{-1}t) dt = χ_0(s) ∫ f(t) dt = χ_0(s)J(f).

Hence J is relatively left invariant with multiplier χ_0 so the measure

μ(dt) = dt / χ_0(t) = dt / ∏_{i=1}^n t_ii^i

is left invariant. To compute the right-hand modulus Δ_r for G_T^+, let

J1(f) = ∫ f(t)μ(dt)

so J1 is left invariant. Then

J1(fs) = ∫ f(ts^{-1})μ(dt) = ∫ f(ts^{-1}) dt/χ_0(t) = χ_0(s^{-1}) ∫ f(ts^{-1}) dt/χ_0(ts^{-1}).
By Proposition 5.14, the Jacobian of the transformation g(t) = ts is

χ_1(s) = ∏_{i=1}^n s_ii^{n-i+1}.

Therefore,

J1(fs) = χ_0(s^{-1})χ_1(s) ∫ f(t) dt/χ_0(t) = [χ_1(s)/χ_0(s)] J1(f).

By Theorem 6.2,

Δ_r(s) = χ_1(s)/χ_0(s) = ∏_{i=1}^n s_ii^{n-2i+1}

is the right-hand modulus for G_T^+. Therefore, the measure

ν(dt) = μ(dt)/Δ_r(t) = dt/[χ_0(t)Δ_r(t)] = dt / ∏_{i=1}^n t_ii^{n-i+1}

is right invariant. As in the previous example, a description of the relatively left invariant measures is simply a matter of describing all the continuous homomorphisms on G_T^+. For each vector c ∈ R^n with coordinates c_1, ..., c_n, let

χ_c(t) = ∏_{i=1}^n t_ii^{c_i}

where t ∈ G_T^+ has diagonal elements t_11, ..., t_nn. It is easy to verify that χ_c is a continuous homomorphism on G_T^+. It is known that if χ is a continuous homomorphism on G_T^+, then χ is given by χ_c for some c ∈ R^n (see Problems 6.4 and 6.9). Thus every relatively left invariant measure on G_T^+ has the form

m(dt) = k χ_c(t) dt / χ_0(t)

for some positive constant k and some vector c ∈ R^n.
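The two Jacobians χ_0 and χ_1 of Example 6.12 (and hence Δ_r) can be checked numerically by building the matrices of the linear maps t → st and t → ts on the n(n + 1)/2 lower-triangular coordinates (an editorial sketch, not from the text):

```python
# On lower-triangular coordinates, t -> s t has Jacobian prod_i s_ii^i and
# t -> t s has Jacobian prod_i s_ii^{n-i+1} (exponents written 1-indexed).
import numpy as np

rng = np.random.default_rng(7)
n = 4
s = np.tril(rng.normal(size=(n, n)))
s[np.diag_indices(n)] = rng.uniform(0.5, 2.0, size=n)   # positive diagonal

idx = [(i, j) for i in range(n) for j in range(i + 1)]  # lower-triangular coords

def jacobian(translate):
    J = np.zeros((len(idx), len(idx)))
    for col, (i, j) in enumerate(idx):
        E = np.zeros((n, n)); E[i, j] = 1.0
        image = translate(E)                            # stays lower triangular
        J[:, col] = [image[a, b] for (a, b) in idx]
    return abs(np.linalg.det(J))

d = np.diag(s)
chi0 = np.prod([d[i]**(i + 1) for i in range(n)])       # prod s_ii^i
chi1 = np.prod([d[i]**(n - i) for i in range(n)])       # prod s_ii^{n-i+1}
assert np.allclose(jacobian(lambda E: s @ E), chi0)
assert np.allclose(jacobian(lambda E: E @ s), chi1)
# the right-hand modulus Delta_r(s) = chi1/chi0 = prod s_ii^{n-2i+1}
assert np.allclose(chi1 / chi0, np.prod([d[i]**(n - 2 * i - 1) for i in range(n)]))
print("left and right Jacobians on G_T^+ match chi_0 and chi_1")
```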
The following two examples deal with the affine group and a subgroup of Gl_n related to the group introduced in Example 6.5.
* Example 6.13. Consider the group Al_n of all affine transformations on R^n. An element of Al_n is a pair (s, x) where s ∈ Gl_n and x ∈ R^n. Recall that the group operation in Al_n is

(s1, x1)(s2, x2) = (s1s2, s1x2 + x1)

so

(s, x)^{-1} = (s^{-1}, -s^{-1}x).

Let ds dx denote Lebesgue measure restricted to Al_n. In order to construct a left invariant measure on Al_n, it is shown that the integral

J(f) = ∫ f(t, y) dt dy

is relatively left invariant with multiplier

χ_0(s, x) = |det(s)|^{n+1}.

For (s, x) ∈ Al_n,

J((s, x)f) = ∫ f((s, x)^{-1}(t, y)) dt dy = ∫ f((s^{-1}, -s^{-1}x)(t, y)) dt dy
           = ∫ f(s^{-1}t, s^{-1}y - s^{-1}x) dt dy
           = |det(s)| ∫ f(s^{-1}t, u) dt du.

The last equality follows from the change of variable u = s^{-1}y - s^{-1}x, which has Jacobian |det(s)|. As in Example 6.11,

∫ f(s^{-1}t, u) dt = |det(s)|^n ∫ f(t, u) dt

for each fixed u ∈ R^n. Thus

J((s, x)f) = |det(s)|^{n+1} ∫∫ f(t, u) dt du = |det(s)|^{n+1} J(f) = χ_0(s, x)J(f)
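The multiplier χ_0(s, x) = |det(s)|^{n+1} can also be seen as the Jacobian of left translation on Al_n, checked numerically here (an editorial sketch, not from the text): the map (t, y) → (s, x)(t, y) = (st, sy + x) is affine in (t, y) with linear part block diagonal.

```python
# The linear part of (t, y) -> (s t, s y + x) on the n^2 + n coordinates has
# determinant |det s|^n * |det s| = |det s|^{n+1}; the shift by x does not
# enter the Jacobian.
import numpy as np

rng = np.random.default_rng(8)
n = 3
s = rng.normal(size=(n, n))

# block-diagonal matrix of the linear part (row-major vec for the t block)
M = np.block([[np.kron(s, np.eye(n)), np.zeros((n * n, n))],
              [np.zeros((n, n * n)),  s]])
t, y = rng.normal(size=(n, n)), rng.normal(size=n)
assert np.allclose(M @ np.concatenate([t.ravel(), y]),
                   np.concatenate([(s @ t).ravel(), s @ y]))

jac = abs(np.linalg.det(M))
assert np.allclose(jac, abs(np.linalg.det(s))**(n + 1))
print("Jacobian of left translation on Al_n is |det s|^(n+1)")
```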
so J is relatively left invariant with multiplier χ_0. Hence the measure

μ(ds, du) = ds du / χ_0(s, u) = ds du / |det(s)|^{n+1}
is left invariant. To find the right-hand modulus of Al_n, let

J_1(f) = ∫∫ f(t, u) dt du / χ_0(t, u)
be a left invariant integral. Then, using an argument similar to that above, we have

J_1(f(s, x)) = ∫∫ f((t, u)(s, x)^{-1}) dt du / χ_0(t, u)
             = ∫∫ f((t, u)(s^{-1}, -s^{-1}x)) dt du / χ_0(t, u)
             = ∫∫ f(ts^{-1}, u - ts^{-1}x) dt du / |det(t)|^{n+1}
             = ∫∫ f(ts^{-1}, v) dt dv / |det(t)|^{n+1}
             = |det(s)|^n ∫∫ f(t, v) dt dv / |det(ts)|^{n+1}
             = |det(s)|^n |det(s)|^{-(n+1)} ∫∫ f(t, v) dt dv / |det(t)|^{n+1}
             = |det(s)|^{-1} J_1(f).
Thus Δ_r(s, x) = |det(s)|^{-1}, so a right invariant measure on Al_n is

ν(ds, du) = μ(ds, du) / Δ_r(s, u) = ds du / |det(s)|^n.
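In vec coordinates the left translation (t, y) → (st, sy + x) is linear with block diagonal Jacobian (I_n ⊗ s, s), while the right translation (t, y) → (ts, tx + y) is block triangular with diagonal blocks (s' ⊗ I_n, I_n); their determinants give the multiplier and, by taking a ratio, the modulus. A numerical sketch of this bookkeeping (setup and seed are illustrative; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
s = rng.normal(size=(n, n))              # a generic element of Gl_n (nonsingular a.s.)
dets = abs(np.linalg.det(s))

# Left translation (t, y) -> (st, sy + x): vec Jacobian is block-diag(I kron s, s).
J_left = np.zeros((n * n + n,) * 2)
J_left[:n * n, :n * n] = np.kron(np.eye(n), s)   # vec(st) = (I kron s) vec(t)
J_left[n * n:, n * n:] = s                       # y -> sy (the shift x drops out)
left_jac = abs(np.linalg.det(J_left))
assert np.isclose(left_jac, dets ** (n + 1))     # chi_0(s, x) = |det s|^{n+1}

# Right translation (t, y) -> (ts, tx + y): block triangular Jacobian, so its
# determinant is |det(s' kron I_n)| * |det(I_n)| = |det s|^n.
right_jac = abs(np.linalg.det(np.kron(s.T, np.eye(n))))
assert np.isclose(right_jac, dets ** n)

# The ratio recovers the right-hand modulus Delta_r(s, x) = |det s|^{-1}.
assert np.isclose(right_jac / left_jac, 1 / dets)
```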
Now, suppose that χ is a continuous homomorphism on Al_n. Since

(s, x) = (s, 0)(e, s^{-1}x) = (e, x)(s, 0)
where e is the n × n identity matrix, χ must satisfy the equation

χ(s, x) = χ(s, 0)χ(e, s^{-1}x) = χ(s, 0)χ(e, x).

Thus for all s ∈ Gl_n,

χ(e, x) = χ(e, s^{-1}x).

Letting s^{-1} converge to the zero matrix, the continuity of χ implies that

χ(e, x) = χ(e, 0) = 1

since (e, 0) is the identity in Al_n. Therefore,

χ(s, x) = χ(s, 0), s ∈ Gl_n.

However,

χ((s_1, 0)(s_2, 0)) = χ(s_1 s_2, 0) = χ(s_1, 0)χ(s_2, 0)

so s → χ(s, 0) is a continuous homomorphism on Gl_n. But every continuous homomorphism on Gl_n is given by s → |det(s)|^α for some real α. In summary, χ is a continuous homomorphism on Al_n iff

χ(s, x) = |det(s)|^α

for some real number α. Thus we have a complete description of all the relatively invariant integrals on Al_n.
Example 6.14. In this example, the group G consists of all n × n nonsingular matrices s that have the form

s = ( s_11  s_12 )
    (  0    s_22 ),   s_11 ∈ Gl_p, s_22 ∈ Gl_q,

where p + q = n. Let M be the subspace of R^n consisting of those vectors whose last q coordinates are zero. Then G is the subgroup of Gl_n consisting of those elements s that satisfy s(M) ⊆ M. Let ds_11 ds_12 ds_22 denote Lebesgue measure restricted to G when G is regarded as a subset of (p² + q² + pq)-dimensional Euclidean space. Since G is a nonempty open subset of this space, G has positive Lebesgue measure. As in previous examples, it is shown
that the integral

J(f) = ∫ f(t) dt_11 dt_12 dt_22

is relatively left invariant. For s ∈ G,

J(sf) = ∫ f(s^{-1}t) dt_11 dt_12 dt_22.

A bit of calculation shows that

( s_11  s_12 )^{-1}   ( s_11^{-1}  -s_11^{-1} s_12 s_22^{-1} )
(  0    s_22 )      = (    0              s_22^{-1}          )

and

s^{-1}t = ( s_11^{-1} t_11   s_11^{-1} t_12 - s_11^{-1} s_12 s_22^{-1} t_22 )
          (       0                        s_22^{-1} t_22                   ).

Let

u_11 = s_11^{-1} t_11,   u_22 = s_22^{-1} t_22,   u_12 = s_11^{-1} t_12 - s_11^{-1} s_12 s_22^{-1} t_22.

The Jacobian of this transformation is

χ_0(s) = |det(s_11)|^p |det(s_22)|^q |det(s_11)|^q = |det(s_11)|^n |det(s_22)|^q.

Therefore,

J(sf) = χ_0(s)J(f)

so the measure

μ_l(dt_11, dt_12, dt_22) = dt_11 dt_12 dt_22 / (|det(t_11)|^n |det(t_22)|^q)

is left invariant. Setting

J_1(f) = ∫ f(t) μ_l(dt_11, dt_12, dt_22),
a calculation similar to that above yields

J_1(fs) = Δ_r(s)J_1(f)

where

Δ_r(s) = |det(s_11)|^{-q} |det(s_22)|^{p}.

Thus Δ_r is the right-hand modulus of G and the measure

ν_r(dt_11, dt_12, dt_22) = μ_l(dt_11, dt_12, dt_22) / Δ_r(t) = dt_11 dt_12 dt_22 / (|det(t_11)|^p |det(t_22)|^n)

is right invariant. For α, β ∈ R, let

χ_{α,β}(s) = |det(s_11)|^α |det(s_22)|^β.

Clearly, χ_{α,β} is a continuous homomorphism of G into (0, ∞). Conversely, it is not too difficult to show that every continuous homomorphism of G into (0, ∞) is equal to χ_{α,β} for some α, β ∈ R. Again, this gives a complete description of all the relatively invariant integrals on G.
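As a check on χ_0, the left translation t → st is linear in the block coordinates (t_11, t_12, t_22), so its Jacobian matrix can be assembled from Kronecker products and its determinant compared with |det(s_11)|^n |det(s_22)|^q. A numerical sketch (block sizes and seed are illustrative; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 2, 3
n = p + q
s11 = rng.normal(size=(p, p))
s12 = rng.normal(size=(p, q))
s22 = rng.normal(size=(q, q))

# Left translation in block coordinates:
#   u11 = s11 t11,  u12 = s11 t12 + s12 t22,  u22 = s22 t22.
# Assemble the Jacobian on (vec t11, vec t12, vec t22); it is block triangular.
d = p * p + p * q + q * q
J = np.zeros((d, d))
J[:p * p, :p * p] = np.kron(np.eye(p), s11)                  # vec(s11 t11)
J[p * p:p * p + p * q, p * p:p * p + p * q] = np.kron(np.eye(q), s11)  # vec(s11 t12)
J[p * p:p * p + p * q, p * p + p * q:] = np.kron(np.eye(q), s12)       # vec(s12 t22)
J[p * p + p * q:, p * p + p * q:] = np.kron(np.eye(q), s22)            # vec(s22 t22)

chi0 = abs(np.linalg.det(s11)) ** n * abs(np.linalg.det(s22)) ** q
assert np.isclose(abs(np.linalg.det(J)), chi0)
```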
In the four examples above, the same argument was used to derive the left and right invariant measures, the modular function, and all of the relatively invariant measures. Namely, the group G had positive Lebesgue measure when regarded as a subset of an obvious Euclidean space. The integral on 𝒦(G) defined by Lebesgue measure was relatively left invariant with a multiplier that we calculated. Thus a left invariant measure on G was simply Lebesgue measure divided by the multiplier. From this, the right-hand modulus and a right invariant measure were easily derived. The characterization of the relatively invariant integrals amounted to finding all the solutions of the functional equation χ(st) = χ(s)χ(t) where χ is a continuous function on G to (0, ∞). Of course, the above technique can be applied to many other matrix groups, for example, the matrix group considered in Example 6.5. However, there are important matrix groups for which this argument is not available because the group has Lebesgue measure zero in the "natural" Euclidean space of which the group is a subset. For example, consider the group of n × n orthogonal matrices O_n. When regarded as a subset of n²-dimensional Euclidean space, O_n has Lebesgue measure zero. But, without a fairly complicated parameterization of O_n, it is not possible to regard O_n as a set of positive Lebesgue measure in some Euclidean space.
For this reason, we do not demonstrate directly the existence of an invariant measure on O_n in this chapter. In the following chapter, a probabilistic proof of the existence of an invariant measure on O_n is given.

The group O_n and other groups to be considered later are in fact compact topological groups. A basic property of such groups is given next.
Proposition 6.5. Suppose G is a locally compact topological group. Then G
is compact iff there exists a left invariant probability measure on G.
Proof. See Nachbin (1965, Section 5, Chapter 2). □
The following result shows that when G is compact, left invariant
measures are right invariant measures and all relatively invariant measures
are in fact invariant.
Proposition 6.6. If G is compact and χ is a continuous homomorphism on G to (0, ∞), then χ(s) = 1 for all s ∈ G.

Proof. Since χ is continuous and G is compact, χ(G) = {χ(s) | s ∈ G} is a compact subset of (0, ∞). Since χ is a homomorphism, χ(G) is a subgroup of (0, ∞). However, the only compact subgroup of (0, ∞) is {1}. Thus χ(s) = 1 for all s ∈ G. □
The nonexistence of nontrivial continuous homomorphisms on compact groups shows that all compact groups are unimodular. Further, all relatively invariant measures are invariant. Whenever G is compact, the invariant
measure on G is always taken to be a probability measure.
6.3. INVARIANT MEASURES ON QUOTIENT SPACES
In this section, we consider the existence and uniqueness of invariant integrals on spaces that are acted on transitively by a group. Throughout this section, 𝒳 is a locally compact Hausdorff space and 𝒦(𝒳) denotes the set of continuous functions on 𝒳 that have compact support. Also, G is a locally compact topological group that acts on the left of 𝒳.

Definition 6.11. The group G acts topologically on 𝒳 if the function from G × 𝒳 to 𝒳 given by (g, x) → gx is continuous. When G acts topologically on 𝒳, 𝒳 is a left homogeneous space if for each x ∈ 𝒳, the function π_x on G to 𝒳 defined by π_x(g) = gx is continuous, open, and onto 𝒳.
The assumption that each π_x is an onto function is just another way to say that G acts transitively on 𝒳. Also, it is not difficult to show that if, for one x ∈ 𝒳, π_x is continuous, open, and onto 𝒳, then for all x, π_x is continuous, open, and onto 𝒳. To describe the structure of left homogeneous spaces 𝒳, fix an element x_0 ∈ 𝒳 and let

H_0 = {g | gx_0 = x_0, g ∈ G}.

That H_0 is a closed subgroup of G is easily verified. Further, the function T considered in Proposition 6.3 is now one-to-one and onto, and T and T^{-1} are both continuous. Thus we have a one-to-one, onto, bicontinuous mapping between 𝒳 and the quotient space G/H_0 endowed with the quotient topology. Conversely, let H be a closed subgroup of G and take 𝒳 = G/H with the quotient topology. The group G acts on G/H in the obvious way (g(g_1 H) = gg_1 H) and it is easily verified that G/H is a left homogeneous space (see Nachbin 1965, Section 3, Chapter 3). Thus we have a complete description of the left homogeneous spaces (up to relabelings by T) as quotient spaces G/H where H is a closed subgroup of G.
In the notation above, let 𝒳 be a left homogeneous space.

Definition 6.12. A nonzero integral J on 𝒦(𝒳),

J(f) = ∫ f(x) m(dx), f ∈ 𝒦(𝒳),

is relatively invariant with multiplier χ if, for each s ∈ G,

∫ f(s^{-1}x) m(dx) = χ(s) ∫ f(x) m(dx)

for all f ∈ 𝒦(𝒳).

For f ∈ 𝒦(𝒳), the function sf given by (sf)(x) = f(s^{-1}x) is the left translate of f by s ∈ G. Thus an integral J on 𝒦(𝒳) is relatively invariant with multiplier χ if J(sf) = χ(s)J(f). For such an integral,

χ(st)J(f) = J((st)f) = J(s(tf)) = χ(s)J(tf) = χ(s)χ(t)J(f)

so χ(st) = χ(s)χ(t). Also, any multiplier χ is continuous, which implies that a multiplier is a continuous homomorphism of G into the multiplicative group (0, ∞).
* Example 6.15. Let 𝒳 be the set of all p × p positive definite matrices. The group G = Gl_p acts transitively on 𝒳 as shown in Example 6.9. That 𝒳 is a left homogeneous space is easily verified. For α ∈ R, define the measure m_α by

m_α(dx) = (det(x))^{α/2} dx / (det(x))^{(p+1)/2}

where dx is Lebesgue measure on 𝒳. Let J_α(f) = ∫ f(x) m_α(dx). For s ∈ Gl_p, s(x) = sxs' is the group action on 𝒳. Therefore,

J_α(sf) = ∫ f(s^{-1}(x)) m_α(dx)
        = ∫ f(s^{-1}x(s')^{-1}) (det(x))^{α/2} dx / (det(x))^{(p+1)/2}
        = |det(s)|^α ∫ f(y) (det(y))^{α/2} dy / (det(y))^{(p+1)/2}
        = |det(s)|^α J_α(f).

The second equality follows from the change of variable x = sys', which has Jacobian |det(s)|^{p+1} (see Proposition 5.11). Hence

J_α(sf) = |det(s)|^α J_α(f)

for all s ∈ Gl_p, f ∈ 𝒦(𝒳), and J_α is relatively invariant with multiplier χ_α(s) = |det(s)|^α. For this example, it has been shown that for every continuous homomorphism χ on G, there is a relatively invariant integral with multiplier χ. That this is not the case in general is demonstrated in future examples.
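The Jacobian |det(s)|^{p+1} of x → sxs' can be verified directly by writing the map in the p(p+1)/2 coordinates x_ij, i ≤ j, of a symmetric matrix and computing the determinant of the resulting linear map. A numerical sketch (helper names are ours; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
A = rng.normal(size=(p, p))      # a generic element of Gl_p

# Coordinates on symmetric matrices: the p(p+1)/2 entries s_ij with i <= j.
idx = [(i, j) for i in range(p) for j in range(i, p)]
m = len(idx)

def sym_from_coords(v):
    S = np.zeros((p, p))
    for k, (i, j) in enumerate(idx):
        S[i, j] = S[j, i] = v[k]
    return S

# Matrix of the linear map S -> A S A' in these coordinates.
M = np.zeros((m, m))
for k in range(m):
    e = np.zeros(m); e[k] = 1.0
    T = A @ sym_from_coords(e) @ A.T
    M[:, k] = [T[i, j] for (i, j) in idx]

assert np.isclose(abs(np.linalg.det(M)), abs(np.linalg.det(A)) ** (p + 1))
```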
The problem of the existence and uniqueness of relatively invariant integrals on left homogeneous spaces 𝒳 is completely solved in the following result due to Weil (see Nachbin, 1965, Section 4, Chapter 3). Recall that x_0 is a fixed element of 𝒳 and

H_0 = {g | gx_0 = x_0, g ∈ G}

is a closed subgroup of G. Let Δ_r denote the right-hand modulus of G and let Δ_r^0 denote the right-hand modulus of H_0.
Theorem 6.3. In the notation above:
(i) If J(f) = ∫ f(x) m(dx) is relatively invariant with multiplier χ, then Δ_r^0(h) = χ(h)Δ_r(h) for all h ∈ H_0.

(ii) If χ is a continuous homomorphism of G to (0, ∞) that satisfies Δ_r^0(h) = χ(h)Δ_r(h), h ∈ H_0, then a relatively invariant integral with multiplier χ exists.

(iii) If J_1 and J_2 are relatively invariant with the same multiplier, then there exists a constant c > 0 such that J_2 = cJ_1.
Before turning to applications of Theorem 6.3, a few general comments are in order. If the subgroup H_0 is compact, then Δ_r^0(h) = 1 for all h ∈ H_0. Since the restrictions of χ and of Δ_r to H_0 are both continuous homomorphisms on H_0, Δ_r(h) = χ(h) = 1 for all h ∈ H_0 as H_0 is compact. Thus when H_0 is compact, any continuous homomorphism χ is a multiplier for a relatively invariant integral, and the description of all the relatively invariant integrals reduces to finding all the continuous homomorphisms of G. Further, when G is compact, only an invariant integral on 𝒦(𝒳) can exist, as χ ≡ 1 is the only continuous homomorphism. When G and H_0 are not compact, the situation is a bit more complicated. Both Δ_r and Δ_r^0 must be calculated, and then the continuous homomorphisms χ on G to (0, ∞) that satisfy (ii) of Theorem 6.3 must be found. Only then do we have a description of the relatively invariant integrals on 𝒦(𝒳). Of course, the condition for the existence of an invariant integral (χ ≡ 1) is that Δ_r^0(h) = Δ_r(h) for all h ∈ H_0. If J is a relatively invariant integral (with multiplier χ) given by
J(f) = ∫ f(x) m(dx), f ∈ 𝒦(𝒳),

then the measure m is called relatively invariant with multiplier χ. In Example 6.15, it was shown that for each α ∈ R, the measure m_α is relatively invariant under Gl_p with multiplier χ_α. Theorem 6.3 implies that any relatively invariant measure on the space of p × p positive definite matrices is equal to a positive constant times an m_α for some α ∈ R. We now proceed with further examples.
* Example 6.16. Let 𝒳 = 𝔉_{p,n} and let G = O_n. It was shown in Example 6.8 that O_n acts transitively on 𝔉_{p,n}. The verification that
𝔉_{p,n} is a left homogeneous space is left to the reader. Since O_n is compact, Theorem 6.3 implies that there is a unique probability measure μ on 𝔉_{p,n} that is invariant under the action of O_n. Also, any relatively invariant measure on 𝔉_{p,n} will be equal to a positive constant times μ. The distribution μ is sometimes called the uniform distribution on 𝔉_{p,n}. When p = 1,

𝔉_{1,n} = {x | x ∈ R^n, ||x|| = 1},

which is the rim of the unit sphere in R^n. The uniform distribution on 𝔉_{1,n} is just surface Lebesgue measure normalized so that it is a probability measure. When p = n, then 𝔉_{n,n} = O_n and μ is the uniform distribution on the orthogonal group. A different argument, probabilistic in nature, is given in the next chapter; it also establishes the existence of the uniform distribution on 𝔉_{p,n}.
* Example 6.17. Take 𝒳 = R^p - {0} and let G = Gl_p. The action of Gl_p on 𝒳 is that of a matrix acting on a vector, and this action is obviously transitive. The verification that 𝒳 is a left homogeneous space is routine. Consider the integral

J(f) = ∫ f(x) dx, f ∈ 𝒦(𝒳),

where dx is Lebesgue measure on 𝒳. For s ∈ Gl_p, it is clear that J(sf) = |det(s)|J(f), so J is relatively invariant with multiplier χ_1(s) = |det(s)|. We now show that J is the only relatively invariant integral on 𝒦(𝒳). This is done by proving that χ_1 is the only possible multiplier for relatively invariant integrals on 𝒦(𝒳). A convenient choice of x_0 ∈ 𝒳 is x_0 = ε_1 where ε_1 = (1, 0, …, 0)'. Then

H_0 = {h | hε_1 = ε_1, h ∈ Gl_p}.

A bit of reflection shows that h ∈ H_0 iff

h = ( 1  h_12 )
    ( 0  h_22 )

where h_22 ∈ Gl_{p-1} and h_12 is 1 × (p - 1). A calculation similar to
that in Example 6.14 yields

μ_l(dh_12, dh_22) = dh_12 dh_22 / |det(h_22)|^{p-1}

as a left invariant measure on H_0. Then the integral

J_1(f) = ∫ f(h) μ_l(dh_12, dh_22)

is left invariant on 𝒦(H_0), and a standard Jacobian argument yields

J_1(fh) = Δ_r^0(h)J_1(f), f ∈ 𝒦(H_0),

where

Δ_r^0(h) = |det(h_22)|, h ∈ H_0.

Every continuous homomorphism on Gl_p has the form χ_α(s) = |det(s)|^α for some α ∈ R. Since Δ_r ≡ 1 for Gl_p, χ_α can be a multiplier for a relatively invariant integral iff

Δ_r^0(h) = χ_α(h), h ∈ H_0.

But Δ_r^0(h) = |det(h_22)| and, for h ∈ H_0, χ_α(h) = |det(h_22)|^α, so the only value of α for which χ_α can be a multiplier is α = 1. Further, the integral J is relatively invariant with multiplier χ_1. Thus Lebesgue measure on 𝒳 is the only (up to a positive constant) relatively invariant measure on 𝒳 under the action of Gl_p.
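The compatibility condition of Theorem 6.3 can be spot-checked: for h ∈ H_0 in the block form above, expanding det(h) along the first column gives det(h) = det(h_22), so χ_1(h) = Δ_r^0(h) automatically. A small numerical sketch (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 4
h = np.eye(p)
h[0, 1:] = rng.normal(size=p - 1)            # h_12, a 1 x (p-1) row
h[1:, 1:] = rng.normal(size=(p - 1, p - 1))  # h_22 in Gl_{p-1} (nonsingular a.s.)

e1 = np.eye(p)[:, 0]
assert np.allclose(h @ e1, e1)               # h fixes epsilon_1, so h is in H_0
# chi_1(h) = |det(h)| equals Delta_r^0(h) = |det(h_22)|
assert np.isclose(abs(np.linalg.det(h)), abs(np.linalg.det(h[1:, 1:])))
```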
Before turning to the next example, it is convenient to introduce the direct product of two groups. If G_1 and G_2 are groups, the direct product of G_1 and G_2, denoted by G = G_1 × G_2, is the group consisting of all pairs (g_1, g_2) with g_i ∈ G_i, i = 1, 2, and group operation

(g_1, g_2)(h_1, h_2) = (g_1 h_1, g_2 h_2).

If e_i is the identity in G_i, i = 1, 2, then (e_1, e_2) is the identity in G and (g_1, g_2)^{-1} = (g_1^{-1}, g_2^{-1}). When G_1 and G_2 are locally compact topological groups, G_1 × G_2 is a locally compact topological group when endowed with the product topology. The next two results describe all the continuous homomorphisms and relatively left invariant measures on G_1 × G_2 in terms
of continuous homomorphisms and relatively left invariant measures on G_1 and G_2.

Proposition 6.7. Suppose G_1 and G_2 are locally compact topological groups. Then χ is a continuous homomorphism on G_1 × G_2 iff χ((g_1, g_2)) = χ_1(g_1)χ_2(g_2), (g_1, g_2) ∈ G_1 × G_2, where χ_i is a continuous homomorphism on G_i, i = 1, 2.

Proof. If χ((g_1, g_2)) = χ_1(g_1)χ_2(g_2), clearly χ is a continuous homomorphism on G_1 × G_2. Conversely, since (g_1, g_2) = (g_1, e_2)(e_1, g_2), if χ is a continuous homomorphism on G_1 × G_2, then

χ((g_1, g_2)) = χ(g_1, e_2)χ(e_1, g_2).

Setting χ_1(g_1) = χ(g_1, e_2) and χ_2(g_2) = χ(e_1, g_2), the desired result follows. □

Proposition 6.8. Suppose χ is a continuous homomorphism on G_1 × G_2 with χ(g_1, g_2) = χ_1(g_1)χ_2(g_2) where χ_i is a continuous homomorphism on G_i, i = 1, 2. If m is a relatively left invariant measure with multiplier χ, then there exist relatively left invariant measures m_i on G_i with multipliers χ_i, i = 1, 2, and m is the product measure m_1 × m_2. Conversely, if m_i is a relatively left invariant measure on G_i with multiplier χ_i, i = 1, 2, then m_1 × m_2 is a relatively left invariant measure on G_1 × G_2 with multiplier χ, which satisfies χ(g_1, g_2) = χ_1(g_1)χ_2(g_2).

Proof. This result is a direct consequence of Fubini's theorem and the existence and uniqueness of relatively left invariant integrals. □
The following example illustrates many of the results presented in this chapter and has a number of applications in multivariate analysis. For example, one of the derivations of the Wishart distribution is quite easy given the results of this example.
* Example 6.18. As in Example 6.10, 𝒳 is the set of all n × p matrices with rank p and G is the direct product group O_n × G_T. The action of (Γ, T) ∈ O_n × G_T on 𝒳 is

(Γ, T)X = (Γ ⊗ T)X = ΓXT', X ∈ 𝒳.

Since 𝒳 = {X | X ∈ ℒ_{p,n}, det(X'X) > 0}, 𝒳 is a nonempty open
subset of ℒ_{p,n}. Let dX be Lebesgue measure on 𝒳 and define a measure on 𝒳 by

m(dX) = dX / (det(X'X))^{n/2}.

Using Proposition 5.10, it is an easy calculation to show that the integral

J(f) = ∫ f(X) m(dX)

is invariant, that is, J((Γ, T)f) = J(f) for (Γ, T) ∈ O_n × G_T and f ∈ 𝒦(𝒳). However, it takes a bit more work to characterize all the relatively invariant measures on 𝒳. First, it was shown in Example 6.10 that, if X_0 is

X_0 = ( I_p )
      (  0  ) ∈ 𝒳,

then H_0 = {(Γ, T) | (Γ, T)X_0 = X_0} is a closed subgroup of O_n × G_T and is compact. By Theorem 6.3, every continuous homomorphism on O_n × G_T is the multiplier for a relatively invariant integral. But every continuous homomorphism χ on O_n × G_T has the form χ(Γ, T) = χ_1(Γ)χ_2(T) where χ_1 and χ_2 are continuous homomorphisms on O_n and G_T. Since O_n is compact, χ_1 ≡ 1. From Example 6.12,

χ_2(T) = ∏_{i=1}^p (t_ii)^{c_i} = χ_c(T)

where c ∈ R^p has coordinates c_1, …, c_p. Now that all the possible multipliers have been described, we want to exhibit the relatively invariant integrals on 𝒦(𝒳). To this end, consider the space 𝒴 = 𝔉_{p,n} × G_U, so points in 𝒴 are (ψ, U) where ψ is an n × p linear isometry and U is a p × p upper triangular matrix in G_U. The group O_n × G_T acts transitively on 𝒴 under the group action

(Γ, T)(ψ, U) = (Γψ, UT').

Let μ_0 be the unique probability measure on 𝔉_{p,n} that is O_n-invariant and let ν_r be the particular right invariant measure on the group G_U
given by

ν_r(dU) = dU / ∏_{i=1}^p u_ii^i.
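That ν_r is right invariant comes down to the Jacobian of the right translation U → UV on G_U being ∏_j v_jj^j (column j of UV has j free coordinates), which exactly cancels the change in ∏ u_ii^i. A numerical sketch of that Jacobian (the helper rand_gu is ours; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
p = 3

def rand_gu(rng, p):
    # a random element of G_U: upper triangular with positive diagonal
    u = np.triu(rng.normal(size=(p, p)))
    np.fill_diagonal(u, rng.uniform(0.5, 2.0, size=p))
    return u

v = rand_gu(rng, p)

# Jacobian of U -> U V on the coordinates u_ij, i <= j.
idx = [(i, j) for i in range(p) for j in range(i, p)]
m = len(idx)
M = np.zeros((m, m))
for k, (i, j) in enumerate(idx):
    E = np.zeros((p, p)); E[i, j] = 1.0
    T = E @ v                       # image of the basis element e_{ij}
    M[:, k] = [T[a, b] for (a, b) in idx]

expected = np.prod(np.diag(v) ** np.arange(1, p + 1))   # prod_j v_jj^j
assert np.isclose(abs(np.linalg.det(M)), expected)
```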
Obviously, the integral

J_1(f) = ∫∫ f(ψ, U) μ_0(dψ) ν_r(dU)

is invariant under the action of O_n × G_T on 𝔉_{p,n} × G_U, f ∈ 𝒦(𝔉_{p,n} × G_U). Consider the integral

J_2(f) = ∫∫ f(ψ, U) χ_c(U') μ_0(dψ) ν_r(dU)

defined on 𝒦(𝔉_{p,n} × G_U) where χ_c is a continuous homomorphism on G_T. The claim is that J_2((Γ, T)f) = χ_c(T)J_2(f), so J_2 is relatively invariant with multiplier χ_c. To see this, compute as follows:

J_2((Γ, T)f) = ∫∫ f((Γ, T)^{-1}(ψ, U)) χ_c(U') μ_0(dψ) ν_r(dU)
            = ∫∫ f(Γ'ψ, U(T')^{-1}) χ_c(TT^{-1}U') μ_0(dψ) ν_r(dU)
            = χ_c(T) ∫∫ f(Γ'ψ, U(T')^{-1}) χ_c((U(T')^{-1})') μ_0(dψ) ν_r(dU)
            = χ_c(T)J_2(f).

The last equality follows from the invariance of μ_0 and ν_r. Thus all the relatively invariant integrals on 𝒦(𝔉_{p,n} × G_U) have been explicitly described. To do the same for 𝒦(𝒳), the basic idea is to move the integral J_2 over to 𝒦(𝒳). It was mentioned earlier that the map φ_0 on 𝔉_{p,n} × G_U to 𝒳 given by

φ_0(ψ, U) = ψU ∈ 𝒳

is one-to-one, onto, and satisfies

φ_0((Γ, T)(ψ, U)) = (Γ, T)φ_0(ψ, U)
for group elements (Γ, T). For f ∈ 𝒦(𝒳), consider the integral

J_3(f) = ∫∫ f(φ_0(ψ, U)) μ_0(dψ) ν_r(dU).

Then for (Γ, T) ∈ O_n × G_T,

J_3((Γ, T)f) = ∫∫ f((Γ, T)^{-1}φ_0(ψ, U)) μ_0(dψ) ν_r(dU)
            = ∫∫ f(φ_0((Γ, T)^{-1}(ψ, U))) μ_0(dψ) ν_r(dU)
            = ∫∫ f(φ_0(Γ'ψ, U(T')^{-1})) μ_0(dψ) ν_r(dU) = J_3(f)

since μ_0 and ν_r are invariant. Therefore, J_3 is an invariant integral on 𝒦(𝒳). Since J is also an invariant integral on 𝒦(𝒳), Theorem 6.3 shows that there is a positive constant k such that

J(f) = kJ_3(f), f ∈ 𝒦(𝒳).

More explicitly, we have the equation

∫ f(X) dX / (det(X'X))^{n/2} = k ∫∫ f(ψU) μ_0(dψ) ν_r(dU)

for all f ∈ 𝒦(𝒳). This equation is a formal way to state the very nontrivial fact that the measure m on 𝒳 gets transformed into the measure k(μ_0 × ν_r) on 𝔉_{p,n} × G_U under the mapping φ_0^{-1}. To evaluate the constant k, it is sufficient to find one particular function so that both sides of the above equality can be evaluated. Consider

f_0(X) = |X'X|^{n/2} (2π)^{-np/2} exp[-½ tr(X'X)].

Clearly,

∫ f_0(X) dX / (det(X'X))^{n/2} = (2π)^{-np/2} ∫ exp[-½ tr(X'X)] dX = 1
so

k = ∫∫ f_0(ψU) μ_0(dψ) ν_r(dU)
  = (2π)^{-np/2} ∫ |U'U|^{n/2} exp[-½ tr(U'U)] ν_r(dU)
  = (2π)^{-np/2} ∫ ∏_{i=1}^p u_ii^{n-i} exp[-½ tr(U'U)] dU
  = (2π)^{-np/2} 2^{-p} c(n, p).

The last equality follows from the result in Example 5.1, where c(n, p) is defined. Therefore,

(6.1)  ∫ f(X) dX / (det(X'X))^{n/2} = (2π)^{-np/2} 2^{-p} c(n, p) ∫∫ f(ψU) μ_0(dψ) ν_r(dU).
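The evaluation of k factors into one-dimensional integrals: p(p-1)/2 Gaussian integrals for the off-diagonal entries of U and p moment integrals ∫_0^∞ u^{n-i} e^{-u²/2} du for the diagonal. Assuming the closed form c(n, p) = 2^{np/2} π^{p(p-1)/4} ∏_{i=1}^p Γ((n-i+1)/2) for the constant of Example 5.1 (our reading, not restated here), the identity k = (2π)^{-np/2} 2^{-p} c(n, p) can be checked numerically:

```python
import math

def quad_0_inf(f, upper=40.0, steps=100000):
    # simple midpoint quadrature, adequate for these rapidly decaying integrands
    h = upper / steps
    return sum(f((j + 0.5) * h) for j in range(steps)) * h

n, p = 5, 3

# k = (2 pi)^{-np/2} * (2 pi)^{p(p-1)/4} * prod_i int_0^inf u^{n-i} e^{-u^2/2} du
k = (2 * math.pi) ** (-n * p / 2) * (2 * math.pi) ** (p * (p - 1) / 4)
for i in range(1, p + 1):
    k *= quad_0_inf(lambda u, m=n - i: u ** m * math.exp(-u * u / 2))

# assumed closed form for c(n, p)
c_np = 2 ** (n * p / 2) * math.pi ** (p * (p - 1) / 4)
for i in range(1, p + 1):
    c_np *= math.gamma((n - i + 1) / 2)

rhs = (2 * math.pi) ** (-n * p / 2) * 2 ** (-p) * c_np
assert abs(k - rhs) / rhs < 1e-5
```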
It is now an easy matter to derive all the relatively invariant integrals on 𝒦(𝒳). Let χ_c be a given continuous homomorphism on G_T. For each X ∈ 𝒳, let U(X) be the unique element in G_U such that X = ψU(X) for some ψ ∈ 𝔉_{p,n} (see Proposition 5.2). It is clear that U(ΓXT') = U(X)T' for Γ ∈ O_n and T ∈ G_T. We have shown that

J_2(f) = ∫∫ f(ψ, U) χ_c(U') μ_0(dψ) ν_r(dU)

is relatively invariant with multiplier χ_c on 𝒦(𝔉_{p,n} × G_U). For h ∈ 𝒦(𝒳), define an integral J_4 by

J_4(h) = ∫∫ h(ψU) χ_c(U') μ_0(dψ) ν_r(dU).

Clearly, J_4 is relatively invariant with multiplier χ_c since J_4(h) = J_2(h̃) where h̃(ψ, U) = h(ψU). Now, we move J_4 over to 𝒳 by (6.1). In (6.1), take f(X) = h(X)χ_c(U'(X)) so f(ψU) = h(ψU)χ_c(U'). Thus the integral

J_5(h) = ∫ h(X)χ_c(U'(X)) dX / (det(X'X))^{n/2}
is relatively invariant with multiplier χ_c. Of course, any relatively invariant integral with multiplier χ_c on 𝒦(𝒳) is equal to a positive constant times J_5.
6.4. TRANSFORMATIONS AND FACTORIZATIONS OF MEASURES
The results of Example 6.18 describe how an invariant measure on the set of n × p matrices is transformed into an invariant measure on 𝔉_{p,n} × G_U under a particular mapping. The first problem to be discussed in this section is an abstraction of this situation. The notion of a group homomorphism plays a role in what follows.

Definition 6.13. Let G and H be groups. A function η from G onto H is a homomorphism if:

(i) η(g_1 g_2) = η(g_1)η(g_2), g_1, g_2 ∈ G.
(ii) η(g^{-1}) = (η(g))^{-1}, g ∈ G.

When there is a homomorphism from G onto H, H is called a homomorphic image of G.

For notational convenience, a homomorphic image of G is often denoted by Ḡ and the value of the homomorphism at g is ḡ. In this case, the image of g_1 g_2 is ḡ_1 ḡ_2 and the image of g^{-1} is (ḡ)^{-1}. Also, if e is the identity in G, then ē is the identity in Ḡ.

Suppose 𝒳 and 𝒴 are locally compact spaces, and G and Ḡ are locally compact topological groups that act topologically on 𝒳 and 𝒴, respectively. It is assumed that Ḡ is a homomorphic image of G.

Definition 6.14. A measurable function φ from 𝒳 onto 𝒴 is called equivariant if φ(gx) = ḡφ(x) for all g ∈ G and x ∈ 𝒳.

Now, consider an integral

J(f) = ∫ f(x) μ(dx), f ∈ 𝒦(𝒳),

which is invariant under the action of G on 𝒳, that is,

J(gf) = ∫ f(g^{-1}x) μ(dx) = ∫ f(x) μ(dx) = J(f)
for g ∈ G and f ∈ 𝒦(𝒳). Given an equivariant function φ from 𝒳 onto 𝒴, there is a natural measure ν induced on 𝒴: if B is a measurable subset of 𝒴, ν(B) = μ(φ^{-1}(B)). The result below shows that, under a regularity condition on φ, the measure ν defines an invariant (under Ḡ) integral on 𝒦(𝒴).
Proposition 6.9. If φ is an equivariant function from 𝒳 onto 𝒴 that satisfies μ(φ^{-1}(K)) < +∞ for all compact sets K ⊆ 𝒴, then the integral

J_1(f) = ∫ f(y) ν(dy), f ∈ 𝒦(𝒴),

is invariant under Ḡ.

Proof. First note that J_1 is well defined and finite since μ(φ^{-1}(K)) < +∞ for all compact sets K ⊆ 𝒴. From the definition of the measure ν, it follows immediately that

J_1(f) = ∫ f(y) ν(dy) = ∫ f(φ(x)) μ(dx), f ∈ 𝒦(𝒴).

Using the equivariance of φ and the invariance of μ, we have

J_1(ḡf) = ∫ f(ḡ^{-1}y) ν(dy) = ∫ f(ḡ^{-1}φ(x)) μ(dx)
        = ∫ f(φ(g^{-1}x)) μ(dx) = ∫ f(φ(x)) μ(dx) = J_1(f)

so J_1 is invariant under Ḡ. □
Before presenting some applications of Proposition 6.9, a few remarks are in order. The groups G and Ḡ are not assumed to act transitively on 𝒳 and 𝒴, respectively. However, if Ḡ does act transitively on 𝒴 and if 𝒴 is a left homogeneous space, then the measure ν is uniquely determined up to a positive constant. Thus if we happen to know an invariant measure on 𝒴, the identity

∫ f(y) ν(dy) = ∫ f(φ(x)) μ(dx), f ∈ 𝒦(𝒴),

relates the G-invariant measure μ to the Ḡ-invariant measure ν. It was this
line of reasoning that led to (6.1) in Example 6.18. We now consider some
further examples.
* Example 6.19. As in Example 6.18, let 𝒳 be the set of all n × p matrices of rank p, and let 𝒴 be the space S_p^+ of p × p positive definite matrices. Consider the map φ on 𝒳 to S_p^+ defined by

φ(X) = X'X, X ∈ 𝒳.

The group O_n × Gl_p acts on 𝒳 by

(Γ, A)X = (Γ ⊗ A)X = ΓXA'

and the measure

μ(dX) = dX / |X'X|^{n/2}

is invariant under O_n × Gl_p. Further,

φ((Γ, A)X) = AX'XA' = Aφ(X)A',

and this defines an action of Gl_p on S_p^+. It is routine to check that the mapping

(Γ, A) → A

is a homomorphism of O_n × Gl_p onto Gl_p. Obviously,

φ((Γ, A)X) = Aφ(X)A'

since the action of Gl_p on S_p^+ is

A(S) = ASA'; S ∈ S_p^+, A ∈ Gl_p.

Since Gl_p acts transitively on S_p^+, the invariant measure

ν(dS) = dS / |S|^{(p+1)/2}

is unique up to a positive constant. The remaining assumption to verify in order to apply Proposition 6.9 is that φ^{-1}(K) has finite μ measure for compact sets K ⊆ S_p^+. To do this, we show that
φ^{-1}(K) is compact in 𝒳. Recall that the mapping h on 𝔉_{p,n} × S_p^+ onto 𝒳 given by

h(ψ, S) = ψS ∈ 𝒳

is one-to-one and is obviously continuous. Given the compact set K ⊆ S_p^+, let

K_1 = {S | S ∈ S_p^+, S² ∈ K}.

Then K_1 is compact, so 𝔉_{p,n} × K_1 is a compact subset of 𝔉_{p,n} × S_p^+. It is now routine to show that

φ^{-1}(K) = {X | X'X ∈ K} = h(𝔉_{p,n} × K_1),

which is compact since h is continuous and the continuous image of a compact set is compact. By Proposition 6.9, we conclude that the measure ν_0 = μ ∘ φ^{-1} is invariant under Gl_p and satisfies

∫ f(X'X) dX / |X'X|^{n/2} = ∫ f(S) ν_0(dS)

for all f ∈ 𝒦(S_p^+). Since ν_0 is invariant under Gl_p, ν_0 = cν, where c is a positive constant. Thus we have the identity

(6.2)  ∫ f(X'X) dX / |X'X|^{n/2} = c ∫ f(S) dS / |S|^{(p+1)/2}.

To find the constant c, it is sufficient to evaluate both sides of (6.2) for a particular function f_0. For f_0, take the function

f_0(S) = (√(2π))^{-np} |S|^{n/2} exp[-½ tr S],

so

f_0(X'X) = (√(2π))^{-np} |X'X|^{n/2} exp[-½ tr(X'X)].

Clearly, the left-hand side of (6.2) integrates to one and this yields the equation

c(√(2π))^{-np} ∫ |S|^{(n-p-1)/2} exp[-½ tr S] dS = 1.
The result of Example 5.1 gives

c(√(2π))^{-np} c(n, p) = 1

so

c = (√(2π))^{np} / c(n, p) = (√(2π))^{np} ω(n, p),

where ω(n, p) = 1/c(n, p). In conclusion, the identity

(6.3)  ∫ f(X'X) dX / |X'X|^{n/2} = (√(2π))^{np} ω(n, p) ∫ f(S) dS / |S|^{(p+1)/2}

has been established for all f ∈ 𝒦(S_p^+), and thus for all measurable f for which either side exists.
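For p = 1, identity (6.3) can be verified directly in polar coordinates, again assuming c(n, 1) = 2^{n/2} Γ(n/2) (our reading of Example 5.1): both sides reduce to (π^{n/2}/Γ(n/2)) ∫_0^∞ f(s)/s ds. A numerical sketch:

```python
import math

n = 5  # p = 1: the space is R^n - {0} and S = x'x is a positive scalar

def quad(f, lo, hi, steps=100000):
    # midpoint quadrature (never evaluates at the endpoints)
    h = (hi - lo) / steps
    return sum(f(lo + (j + 0.5) * h) for j in range(steps)) * h

f = lambda s: s ** 2 * math.exp(-s)
I = quad(lambda s: f(s) / s, 0.0, 60.0)

# LHS of (6.3) in polar coordinates: area(S^{n-1}) * (1/2) * int f(s)/s ds
area = 2 * math.pi ** (n / 2) / math.gamma(n / 2)
lhs = area * 0.5 * I

# RHS of (6.3): (sqrt(2 pi))^n * omega(n, 1) * int f(s)/s ds,
# with omega(n, 1) = 1/c(n, 1) and c(n, 1) = 2^{n/2} Gamma(n/2) assumed
c_n1 = 2 ** (n / 2) * math.gamma(n / 2)
rhs = (2 * math.pi) ** (n / 2) / c_n1 * I

assert math.isclose(lhs, rhs, rel_tol=1e-9)
```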
* Example 6.20. Again let 𝒳 be the set of n × p matrices of rank p, so the group O_n × G_T acts on 𝒳 by

(Γ, T)X = (Γ ⊗ T)X = ΓXT'.

Each element X ∈ 𝒳 has a unique representation X = ψU where ψ ∈ 𝔉_{p,n} and U ∈ G_U. Define φ on 𝒳 onto G_U by letting φ(X) be the unique element U ∈ G_U such that X = ψU for some ψ ∈ 𝔉_{p,n}. If φ(X) = U, then φ((Γ, T)X) = UT', since when X = ψU, (Γ, T)X = ΓψUT'; this shows that UT' is the unique element in G_U such that (Γ, T)X = (Γψ)UT' with Γψ ∈ 𝔉_{p,n}. The mapping (Γ, T) → T is clearly a homomorphism of O_n × G_T onto G_T, and the action of G_T on G_U is

T(U) = UT'; U ∈ G_U, T ∈ G_T.

Therefore, φ((Γ, T)X) = T(φ(X)), so φ is equivariant. The measure

μ(dX) = dX / |X'X|^{n/2}

is O_n × G_T invariant. To show that φ^{-1}(K) has finite μ measure when K ⊆ G_U is compact, note that h(ψ, U) = ψU is a continuous function on 𝔉_{p,n} × G_U onto 𝒳. It is easily verified that

φ^{-1}(K) = h(𝔉_{p,n} × K).
But 𝔉_{p,n} × K is compact, which shows that φ^{-1}(K) is compact since h is continuous. Thus μ(φ^{-1}(K)) < +∞. Proposition 6.9 shows that ν = μ ∘ φ^{-1} is a G_T-invariant measure on G_U and we have the identity

∫ f(φ(X)) dX / |X'X|^{n/2} = ∫ f(U) ν(dU)

for all f ∈ 𝒦(G_U). However, the measure

ν_1(dU) = dU / ∏_{i=1}^p u_ii^i

is a right invariant measure on G_U, and therefore ν_1 is invariant under the transitive action of G_T on G_U. The uniqueness of invariant measures implies that ν = cν_1 for some positive constant c and

∫ f(φ(X)) dX / |X'X|^{n/2} = c ∫ f(U) dU / ∏_{i=1}^p u_ii^i.

The constant c is evaluated by choosing f to be

f(U) = (√(2π))^{-np} |U'U|^{n/2} exp[-½ tr(U'U)].

Since (φ(X))'φ(X) = X'X,

f(φ(X)) = (√(2π))^{-np} |X'X|^{n/2} exp[-½ tr(X'X)]

and

∫ f(φ(X)) dX / |X'X|^{n/2} = 1.

Therefore,

1 = c(√(2π))^{-np} ∫ |U'U|^{n/2} exp[-½ tr(U'U)] dU / ∏_{i=1}^p u_ii^i
  = c(√(2π))^{-np} ∫ ∏_{i=1}^p u_ii^{n-i} exp[-½ tr(U'U)] dU
  = c(√(2π))^{-np} 2^{-p} c(n, p)
where c(n, p) is defined in Example 5.1. This yields the identity

∫ f(φ(X)) dX / |X'X|^{n/2} = 2^p (√(2π))^{np} ω(n, p) ∫ f(U) dU / ∏_{i=1}^p u_ii^i

for all f ∈ 𝒦(G_U). In particular, when f(U) = f_1(U'U), we have

(6.4)  ∫ f_1(X'X) dX / |X'X|^{n/2} = 2^p (√(2π))^{np} ω(n, p) ∫ f_1(U'U) dU / ∏_{i=1}^p u_ii^i

whenever either integral exists. Combining this with (6.3) yields the identity

(6.5)  ∫ f(S) dS / |S|^{(p+1)/2} = 2^p ∫ f(U'U) dU / ∏_{i=1}^p u_ii^i

for all measurable f for which either integral exists. Setting T = U' in (6.5) yields the assertion of Proposition 5.18.
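Identity (6.5) can be checked for f(S) = |S|^a exp[-tr S]: the left side is then the multivariate gamma integral π^{p(p-1)/4} ∏_{i=1}^p Γ(a - (i-1)/2) (a standard closed form, valid for a > (p-1)/2), while the right side factors into one-dimensional integrals over the entries of U. A numerical sketch:

```python
import math

p, a = 3, 4.0   # need a > (p - 1)/2 for convergence

def quad(f, lo, hi, steps=100000):
    # midpoint quadrature for the one-dimensional factors
    h = (hi - lo) / steps
    return sum(f(lo + (j + 0.5) * h) for j in range(steps)) * h

# LHS of (6.5) with f(S) = |S|^a exp(-tr S):
# int_{S>0} |S|^{a-(p+1)/2} exp(-tr S) dS = pi^{p(p-1)/4} prod_i Gamma(a-(i-1)/2)
lhs = math.pi ** (p * (p - 1) / 4)
for i in range(1, p + 1):
    lhs *= math.gamma(a - (i - 1) / 2)

# RHS of (6.5): 2^p int f(U'U) dU / prod u_ii^i; off-diagonal entries give
# Gaussian factors sqrt(pi), the diagonal entries give moment integrals.
rhs = 2 ** p * math.pi ** (p * (p - 1) / 4)
for i in range(1, p + 1):
    rhs *= quad(lambda u, m=2 * a - i: u ** m * math.exp(-u * u), 0.0, 30.0)

assert math.isclose(lhs, rhs, rel_tol=1e-4)
```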
The final topic in this chapter has to do with the factorization of a Radon measure on a product space. Suppose 𝒳 and 𝒴 are locally compact and σ-compact Hausdorff spaces and assume that G is a locally compact topological group that acts on 𝒳 in such a way that 𝒳 is a homogeneous space. It is also assumed that μ_1 is a G-invariant Radon measure on 𝒳, so that the integral

J_1(f_1) = ∫ f_1(x) μ_1(dx), f_1 ∈ 𝒦(𝒳),

is G-invariant, and is unique up to a positive constant.
Proposition 6.10. Assume the conditions above on %, '4, G, and J,. Define
G acting on the locally compact and a-compact space 9 x 64 by g(x, y) =
(gx, y). If m is a G-invariant Radon measure on 9X x 'J, then m = yI x v
for some Radon measure v on J.
Proof. By assumption, the integral

    J(f) = ∫∫ f(x, y) m(dx, dy),  f ∈ K(𝒳 × 𝒴),
satisfies

    J(gf) = ∫∫ f(g^{-1}x, y) m(dx, dy) = J(f).

For f₂ ∈ K(𝒴) and f₁ ∈ K(𝒳), the product f₁f₂, defined by (f₁f₂)(x, y) = f₁(x)f₂(y), is in K(𝒳 × 𝒴) and

    J(f₁f₂) = ∫∫ f₁(x)f₂(y) m(dx, dy).

Fix f₂ ∈ K(𝒴) such that f₂ ≥ 0 and let

    H(f₁) = ∫∫ f₁(x)f₂(y) m(dx, dy),  f₁ ∈ K(𝒳).

Since J(gf) = J(f), it follows that

    H(gf₁) = H(f₁) for g ∈ G and f₁ ∈ K(𝒳).

Therefore, H is a G-invariant integral on K(𝒳). Hence there exists a non-negative constant c(f₂) depending on f₂ such that

    H(f₁) = c(f₂)J₁(f₁)

and c(f₂) = 0 iff H(f₁) = 0 for all f₁ ∈ K(𝒳). For an arbitrary f₂ ∈ K(𝒴), write f₂ = f₂⁺ − f₂⁻ where f₂⁺ = max(f₂, 0) and f₂⁻ = max(−f₂, 0) are in K(𝒴). For such an f₂, it is easy to show

    J(f₁f₂) = c(f₂⁺)J₁(f₁) − c(f₂⁻)J₁(f₁) = (c(f₂⁺) − c(f₂⁻))J₁(f₁).

Thus defining c on K(𝒴) by c(f₂) = c(f₂⁺) − c(f₂⁻), it is easy to show that c is an integral on K(𝒴). Hence

    c(f₂) = ∫ f₂(y) ν(dy)

for some Radon measure ν. Therefore,

    ∫∫ f₁(x)f₂(y) m(dx, dy) = ∫∫ f₁(x)f₂(y) μ₁(dx) ν(dy).

A standard approximation argument now implies that m is the product measure μ₁ × ν. □
Proposition 6.10 provides one technique for establishing the stochastic independence of two random vectors. This technique is used in the next chapter. The one application of Proposition 6.10 given here concerns the space of positive definite matrices.
* Example 6.21. Let 𝒵 be the set of all p × p positive definite matrices that have distinct eigenvalues. That 𝒵 is an open subset of S_p^+ follows from the fact that the eigenvalues of S ∈ S_p^+ are continuous functions of the elements of the matrix S. Thus 𝒵 has nonzero Lebesgue measure in S_p^+. Also, let 𝒴 be the set of p × p diagonal matrices Y with diagonal elements y₁, ..., y_p that satisfy y₁ > y₂ > ... > y_p. Further, let 𝒳 be the quotient space O_p/D_p, where D_p is the group of sign changes introduced in Example 6.6. We now construct a natural one-to-one onto map from 𝒳 × 𝒴 to 𝒵.

For X ∈ 𝒳, X = ΓD_p for some Γ ∈ O_p. Define φ by

    φ(X, Y) = ΓYΓ',  X = ΓD_p, Y ∈ 𝒴.

To verify that φ is well defined, suppose that X = Γ₁D_p = Γ₂D_p. Then

    φ(X, Y) = Γ₁YΓ₁' = Γ₂(Γ₂'Γ₁)Y(Γ₂'Γ₁)'Γ₂' = Γ₂YΓ₂'

since Γ₂'Γ₁ ∈ D_p and every element D ∈ D_p satisfies DYD' = Y for all Y ∈ 𝒴. It is clear that φ(X, Y) has ordered eigenvalues y₁ > y₂ > ... > y_p > 0, the diagonal elements of Y. Clearly, the function φ is onto and continuous. To show φ is one-to-one, first note that, if Y is any element of 𝒴, then the equation

    ΓYΓ' = Y,  Γ ∈ O_p,

implies that Γ ∈ D_p (ΓYΓ' = Y implies that ΓY = YΓ, and equating the elements of these two matrices shows that Γ must be diagonal, so Γ ∈ D_p). If

    φ(X₁, Y₁) = φ(X₂, Y₂),

then Y₁ = Y₂ by the uniqueness of eigenvalues and the ordering of the diagonal elements of Y ∈ 𝒴. Thus

    Γ₁Y₁Γ₁' = Γ₂Y₁Γ₂'
when

    φ(X₁, Y₁) = φ(X₂, Y₁).

Therefore,

    (Γ₂'Γ₁)Y₁(Γ₂'Γ₁)' = Y₁,

which implies that Γ₂'Γ₁ ∈ D_p. Since Xᵢ = ΓᵢD_p for i = 1, 2, this shows that X₁ = X₂ and that φ is one-to-one. Therefore, φ has an inverse and the spectral theorem for matrices specifies just what φ^{-1} is. Namely, for Z ∈ 𝒵, let y₁ > ... > y_p > 0 be the ordered eigenvalues of Z and write Z as

    Z = ΓYΓ',  Γ ∈ O_p,

where Y ∈ 𝒴 has diagonal elements y₁ > ... > y_p > 0. The problem is that Γ ∈ O_p is not unique since

    ΓYΓ' = (ΓD)Y(ΓD)' for D ∈ D_p.

To obtain uniqueness, we have simply "quotiented out" the subgroup D_p in order that φ^{-1} be well defined. Now, let

    μ(dZ) = dZ

be Lebesgue measure on 𝒵, and consider ν = μ ∘ φ, the induced measure on 𝒳 × 𝒴. The problem is to obtain some information about the measure ν. Since φ is continuous, ν is a Radon measure on 𝒳 × 𝒴, and ν satisfies

    ∫ f(X, Y) ν(dX, dY) = ∫ f(φ^{-1}(Z)) dZ

for f ∈ K(𝒳 × 𝒴). The claim is that the measure ν is invariant under the action of O_p on 𝒳 × 𝒴 defined by

    Γ(X, Y) = (ΓX, Y).

To see this, we have

    ∫ f(Γ(X, Y)) ν(dX, dY) = ∫ f(Γφ^{-1}(Z)) dZ.
But a bit of reflection shows that Γφ^{-1}(Z) = φ^{-1}(ΓZΓ'). Since the Jacobian of the transformation Z → ΓZΓ' is equal to one, it follows that ν is O_p-invariant. By Proposition 6.10, the measure ν is a product measure ν₁ × ν₂ where ν₁ is an O_p-invariant measure on 𝒳. Since O_p is compact and 𝒳 is compact, the measure ν₁ is finite and we take ν₁(𝒳) = 1 as a normalization. Therefore,

    ∫ f(φ^{-1}(Z)) dZ = ∫∫ f(X, Y) ν₁(dX) ν₂(dY)

for all f ∈ K(𝒳 × 𝒴). Setting h = f ∘ φ^{-1} yields

    ∫ h(Z) dZ = ∫∫ h(φ(X, Y)) ν₁(dX) ν₂(dY)

for h ∈ K(𝒵). In particular, if h ∈ K(𝒵) satisfies h(Z) = h(ΓZΓ') for all Γ ∈ O_p and Z ∈ 𝒵, then h(φ(X, Y)) = h(Y) and we have the identity

    ∫ h(Z) dZ = ∫ h(Y) ν₂(dY).

It is quite difficult to give a rigorous derivation of the measure ν₂ without the theory of differential forms. In fact, it is not obvious that ν₂ is absolutely continuous with respect to Lebesgue measure on 𝒴. The subject of this example is considered again in later chapters.
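The inverse map φ^{-1} of this example is just the spectral decomposition with decreasingly ordered eigenvalues, the eigenvector matrix being determined only up to a sign change of each column. A minimal numerical sketch in Python with NumPy (the sign convention used below to pick a representative of the coset in O_p/D_p is our own choice, not the text's):

```python
import numpy as np

def spectral_rep(Z):
    """Return (Gamma, Y) with Z = Gamma @ Y @ Gamma.T, the eigenvalues of Y
    in decreasing order, and a sign convention that fixes one representative
    of the coset Gamma D_p (valid when the eigenvalues are distinct)."""
    vals, vecs = np.linalg.eigh(Z)        # eigh returns increasing eigenvalues
    order = np.argsort(vals)[::-1]        # reorder to decreasing
    Y = np.diag(vals[order])
    Gamma = vecs[:, order]
    # make the largest-magnitude entry of each eigenvector positive,
    # which resolves the D_p sign ambiguity
    for j in range(Gamma.shape[1]):
        i = np.argmax(np.abs(Gamma[:, j]))
        if Gamma[i, j] < 0:
            Gamma[:, j] = -Gamma[:, j]
    return Gamma, Y

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Z = A @ A.T + 1e-3 * np.eye(4)            # positive definite, distinct eigenvalues a.s.
Gamma, Y = spectral_rep(Z)
```

Flipping the sign of any eigenvector column leaves ΓYΓ' unchanged, which is exactly why the example works with the quotient O_p/D_p rather than O_p itself.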
PROBLEMS
1. Let M be a proper subspace of V and set

    G(M) = {g | g ∈ Gl(V), g(M) = M}

where g(M) = {x | x = gv for some v ∈ M}.

(i) Show that g(M) = M iff g(M) ⊆ M for g ∈ Gl(V), and show that G(M) is a group.

Now, assume V = R^p and, for x ∈ R^p, write x = (y, z) with y ∈ R^q and z ∈ R^r, q + r = p. Let M = {x | x = (y, 0), y ∈ R^q}.
(ii) For g ∈ Gl_p, partition g as

    g = [g₁₁ g₁₂; g₂₁ g₂₂],  g₁₁: q × q, g₂₂: r × r.

Show that g ∈ G(M) iff g₁₁ ∈ Gl_q, g₂₂ ∈ Gl_r, and g₂₁ = 0. For such g, show that

    g^{-1} = [g₁₁^{-1}  −g₁₁^{-1}g₁₂g₂₂^{-1}; 0  g₂₂^{-1}].

(iii) Verify that G₁ = {g ∈ G(M) | g₁₁ = I_q, g₁₂ = 0} and G₂ = {g ∈ G(M) | g₂₂ = I_r} are subgroups of G(M) and that G₂ is a normal subgroup of G(M).

(iv) Show that G₁ ∩ G₂ = {I} and show that each g can be written uniquely as g = hk with h ∈ G₁ and k ∈ G₂. Conclude that, if gᵢ = hᵢkᵢ, i = 1, 2, then g₁g₂ = h₃k₃, where h₃ = h₁h₂ and k₃ = h₂^{-1}k₁h₂k₂, is the unique representation of g₁g₂ with h₃ ∈ G₁ and k₃ ∈ G₂.
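A numerical check of parts (ii) and (iv) of Problem 1 (Python/NumPy sketch; the sizes q = 2, r = 3 are an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
q, r = 2, 3
p = q + r

def block_upper(g11, g12, g22):
    """Assemble an element of G(M): block upper triangular, g21 = 0."""
    g = np.zeros((p, p))
    g[:q, :q] = g11
    g[:q, q:] = g12
    g[q:, q:] = g22
    return g

# small perturbations of the identity are certainly invertible
g11 = np.eye(q) + 0.1 * rng.standard_normal((q, q))
g12 = rng.standard_normal((q, r))
g22 = np.eye(r) + 0.1 * rng.standard_normal((r, r))
g = block_upper(g11, g12, g22)

# closed-form inverse from part (ii)
g_inv = block_upper(np.linalg.inv(g11),
                    -np.linalg.inv(g11) @ g12 @ np.linalg.inv(g22),
                    np.linalg.inv(g22))

# unique factorization g = h k from part (iv):
# h in G1 (g11 = I_q, g12 = 0), k in G2 (g22 = I_r)
h = block_upper(np.eye(q), np.zeros((q, r)), g22)
k = block_upper(g11, g12, np.eye(r))
```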
2. Let G(M) be as in Problem 1. Does G(M) act transitively on V − {0}? Does G(M) act transitively on V ∩ M^c where M^c is the complement of the set M in V?
3. Show that O_n is a compact subset of R^m with m = n². Show that O_n is a topological group when O_n has the topology inherited from R^m. If χ is a continuous homomorphism from O_n to the multiplicative group (0, ∞), show that χ(Γ) = 1 for all Γ ∈ O_n.
4. Suppose χ is a continuous homomorphism from (0, ∞) to (0, ∞). Show that χ(x) = x^α for some real number α.
5. Show that O_n is a compact subgroup of Gl_n and show that G_U^+ (the group of n × n upper triangular matrices with positive diagonal elements) is a closed subgroup of Gl_n. Show that the uniqueness of the representation A = ΓU (A ∈ Gl_n, Γ ∈ O_n, U ∈ G_U^+) is equivalent to O_n ∩ G_U^+ = {I_n}. Show that neither O_n nor G_U^+ is a normal subgroup of Gl_n.
6. Let (V, (·, ·)) be an inner product space.
(i) For fixed v ∈ V, show that χ defined by χ(x) = exp[(v, x)] is a continuous homomorphism from V to (0, ∞). Here V is a group under addition.
(ii) If χ is a continuous homomorphism on V, show that x ↦ log χ(x) is a linear function on V. Conclude that χ(x) = exp[(v, x)] for some v ∈ V.
7. Suppose χ is a continuous homomorphism defined on Gl_n to (0, ∞). Using the steps outlined below, show that χ(A) = |det A|^α for some real α.
(i) First show that χ(Γ) = 1 for Γ ∈ O_n.
(ii) Write A = ΓDΔ with Γ, Δ ∈ O_n and D diagonal with positive diagonal elements λ₁, ..., λ_n. Show that χ(A) = χ(D).
(iii) Next, write D = ∏ Dᵢ(λᵢ) where Dᵢ(c) is diagonal with all diagonal elements equal to one except the ith diagonal element, which is c. Conclude that χ(D) = ∏ χ(Dᵢ(λᵢ)).
(iv) Show that Dᵢ(c) = PD₁(c)P' for some permutation matrix P ∈ O_n. Using this, show that χ(D) = χ(D₁(λ)) where λ = ∏ λᵢ.
(v) For λ ∈ (0, ∞), set ξ(λ) = χ(D₁(λ)) and show that ξ is a continuous homomorphism from (0, ∞) to (0, ∞), so ξ(λ) = λ^β for some real β. Now, complete the proof of χ(A) = |det A|^α.
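Step (ii) of Problem 7 can be illustrated numerically with the singular value decomposition, and the homomorphism property of χ(A) = |det A|^α checked directly (Python/NumPy sketch; the exponent α = 0.7 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Step (ii): A = Gamma D Delta with Gamma, Delta in O_n and
# D diagonal with positive diagonal elements (the singular values)
Gamma, lam, Delta = np.linalg.svd(A)      # A = Gamma @ diag(lam) @ Delta
D = np.diag(lam)

# chi(A) = |det A|^alpha is a continuous homomorphism into (0, inf)
alpha = 0.7
chi = lambda M: abs(np.linalg.det(M)) ** alpha
```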
8. Let 𝒳 be the set of all rank r orthogonal projections on Rⁿ to Rⁿ (1 ≤ r ≤ n − 1).
(i) Show that O_n acts transitively on 𝒳 via the action x → ΓxΓ', Γ ∈ O_n. For

    x₀ = [I_r 0; 0 0] ∈ 𝒳,

what is the isotropy subgroup? Show that the representation of x in this case is x = ψψ' where ψ: n × r consists of the first r columns of Γ ∈ O_n.
(ii) The group O_r acts on F_{r,n} by ψ → ψA', A ∈ O_r. This induces an equivalence relation on F_{r,n} (ψ₁ ≡ ψ₂ iff ψ₁ = ψ₂A' for some A ∈ O_r) and hence defines a quotient space. Show that the map [ψ] → ψψ' defines a one-to-one onto map from this quotient space to 𝒳. Here [ψ] is the equivalence class of ψ.
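Part (ii) of Problem 8 can be checked numerically: the projection ψψ' does not change when ψ is replaced by ψA' with A ∈ O_r, and the O_n action produces another rank r projection (Python/NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 5, 2

# psi in F_{r,n}: first r columns of a random orthogonal matrix
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
psi = Q[:, :r]
x = psi @ psi.T                        # a rank r orthogonal projection

# replacing psi by psi A' (A in O_r) leaves the projection unchanged
A, _ = np.linalg.qr(rng.standard_normal((r, r)))
psi2 = psi @ A.T
x2 = psi2 @ psi2.T

# the O_n action Gamma x Gamma' yields another element of the orbit
Gamma, _ = np.linalg.qr(rng.standard_normal((n, n)))
y = Gamma @ x @ Gamma.T
```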
9. Following the steps outlined below, show that every continuous homomorphism on G_T^+ to (0, ∞) has the form χ(T) = ∏_{i=1}^p (t_{ii})^{c_i} where T: p × p has diagonal elements t₁₁, ..., t_pp and c₁, ..., c_p are real numbers.
(i) Let

    G₁ = {T | T = [T₁₁ 0; 0 1], T₁₁: (p − 1) × (p − 1)}

and

    G₂ = {T | T = [I_{p−1} 0; T₂₁ t_pp]}.

Show that G₁ and G₂ are subgroups of G_T^+ and that G₂ is normal. Show that every T has a unique representation as T = hk with h ∈ G₁, k ∈ G₂.
(ii) An induction assumption yields χ(h) = ∏_{i=1}^{p−1} (t_{ii})^{c_i}. Also, for T = hk, χ(T) = χ(h)χ(k).
(iii) Show that χ(k) = (t_pp)^{c_p} for some real c_p.
10. Evaluate the integral ∫ |X'X|^γ exp[-½ tr X'X] dX where X ranges over all n × p matrices of rank p. In particular, for what values of γ is this integral finite?
11. In the notation of Problems 1 and 2, find all of the relatively invariant integrals on R^p ∩ M^c under the action of G(M).
12. In Rⁿ, let 𝒳 = {x | x ∈ Rⁿ, x ∉ span{e}} where e ∈ Rⁿ is the vector of ones. Also, let S_{n−1}(e) = {x | ‖x‖ = 1, x ∈ Rⁿ, x'e = 0} and let 𝒴 = R¹ × (0, ∞) × S_{n−1}(e). For x ∈ 𝒳, set x̄ = n^{-1}e'x and set s²(x) = Σᵢ(xᵢ − x̄)². Define a mapping T on 𝒳 to 𝒴 by T(x) = (x̄, s, (x − x̄e)/s).
(i) Show that T is one-to-one, onto, and find T^{-1}. Let O_n(e) = {Γ | Γ ∈ O_n, Γe = e} and consider a group G defined by G = {(a, b, Γ) | a ∈ (0, ∞), b ∈ R¹, Γ ∈ O_n(e)} with group composition given by (a₁, b₁, Γ₁)(a₂, b₂, Γ₂) = (a₁a₂, a₁b₂ + b₁, Γ₁Γ₂). Define G acting on 𝒳 and 𝒴 by (a, b, Γ)x = aΓx + be for x ∈ 𝒳, and (a, b, Γ)(u, v, w) = (au + b, av, Γw) for (u, v, w) ∈ 𝒴.
(ii) Show that T(gx) = gT(x), g ∈ G.
(iii) Show that the measure μ(dx) = dx/sⁿ(x) is an invariant measure on 𝒳.
(iv) Let γ(dw) be the unique O_n(e)-invariant probability measure on S_{n−1}(e). Show that the measure

    ν(d(u, v, w)) = du (dv/v²) γ(dw)

is an invariant measure on 𝒴.
(v) Prove that ∫ f(x) μ(dx) = k ∫ f(T^{-1}(y)) ν(dy) for all integrable f where k is a fixed constant. Find k.
(vi) Suppose a random vector X ∈ 𝒳 has a density (with respect to dx) given by

    f(x) = σ^{-n} f₀((x − δe)/σ),  x ∈ 𝒳,

where δ ∈ R¹ and σ > 0 are parameters. Find the joint density of X̄ and s.
13. Let 𝒳 = Rⁿ − {0} and consider X ∈ 𝒳 with an O_n-invariant distribution. Define φ on 𝒳 to (0, ∞) × F_{1,n} by φ(x) = (‖x‖, x/‖x‖). The group O_n acts on (0, ∞) × F_{1,n} by Γ(u, v) = (u, Γv). Show that φ(Γx) = Γφ(x) and use this to prove that:
(i) ‖X‖ and X/‖X‖ are independent.
(ii) X/‖X‖ has a uniform distribution on F_{1,n}.
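A Monte Carlo sketch of Problem 13 in Python/NumPy (the N(0, I₃) distribution is one convenient O_n-invariant choice, and the checks below are necessary conditions only, not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 3, 20000

# an O_n-invariant distribution on R^n - {0}: the standard normal
X = rng.standard_normal((N, n))
norms = np.linalg.norm(X, axis=1)
dirs = X / norms[:, None]              # samples of X/||X|| on the unit sphere

# necessary conditions for uniformity of the direction:
# each coordinate has mean 0 and mean square 1/n
mean_dir = dirs.mean(axis=0)
mean_sq = (dirs ** 2).mean(axis=0)

# necessary condition for independence: ||X|| is uncorrelated
# with every coordinate of X/||X||
corr = np.array([np.corrcoef(norms, dirs[:, j])[0, 1] for j in range(n)])
```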
14. Let 𝒳 = {x ∈ Rⁿ | xᵢ ≠ xⱼ for all i ≠ j} and let 𝒴 = {y ∈ Rⁿ | y₁ < y₂ < ... < y_n}. Also, let P_n be the group of n × n permutation matrices, so P_n ⊆ O_n, and P_n acts on 𝒳 by x → gx.
(i) Show that the map φ(g, y) = gy is one-to-one and onto from P_n × 𝒴 to 𝒳. Describe φ^{-1}.
(ii) Let X ∈ 𝒳 be a random vector such that L(X) = L(gX) for g ∈ P_n. Write φ^{-1}(X) = (P(X), Y(X)) where P(X) ∈ P_n and Y(X) ∈ 𝒴. Show that P(X) and Y(X) are independent and that P(X) has a uniform distribution on P_n.
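The map φ^{-1} of Problem 14 is computed by sorting: Y(x) is the ordered vector and P(x) is the permutation matrix that undoes the sort. A Python/NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
x = rng.standard_normal(n)            # coordinates are distinct with probability one

# phi^{-1}(x) = (P(x), Y(x)): Y(x) is the increasingly ordered vector,
# and P(x) is the permutation matrix with x = P(x) Y(x)
order = np.argsort(x)                 # x[order] is increasing
Y = x[order]
P = np.zeros((n, n))
P[order, np.arange(n)] = 1.0          # column j of P has its 1 in row order[j]
```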
NOTES AND REFERENCES
1. For an alternative to Nachbin's treatment of invariant integrals, see
Segal and Kunze (1978).
2. Proposition 6.10 is the Radon measure version of a result due to Farrell (see Farrell, 1976). The extension of Proposition 6.10 to relatively invariant integrals that are unique up to constant is immediate; the proof of Proposition 6.10 is valid.
3. For the form of the measure v2 in Example 6.21, see Deemer and Olkin
(1951), Farrell (1976), or Muirhead (1982).
CHAPTER 7
First Applications
of Invariance
We now begin to reap some of the benefits of the labor of Chapter 6. The
one unifying notion throughout this chapter is that of a group of transfor
mations acting on a space. Within this framework independence and
distributional properties of random vectors are discussed and a variety of structural problems are considered. In particular, invariant probability
models are introduced and the invariance of likelihood ratio tests and maximum likelihood estimators is established. Further, maximal invariant statistics are discussed in detail.
7.1. LEFT O_n-INVARIANT DISTRIBUTIONS ON n × p MATRICES
The main concern of this section is conditions under which the two matrices Ψ and U in the decomposition X = ΨU (see Example 6.20) are stochastically independent when X is a random n × p matrix. Before discussing this problem, a useful construction of the uniform distribution on F_{p,n} is presented. Throughout this section, 𝒳 denotes the space of n × p matrices of rank p, so n ≥ p. First, a technical result.
Proposition 7.1. Let X ∈ L_{p,n} have a normal distribution with mean zero and Cov(X) = I_n ⊗ I_p. Then P{X ∈ 𝒳} = 1 and the complement of 𝒳 in L_{p,n} has Lebesgue measure zero.

Proof. Let X₁, ..., X_p denote the p columns of X. Thus X₁, ..., X_p are independent random vectors in Rⁿ and L(Xᵢ) = N(0, I_n), i = 1, ..., p. It is shown that P{X ∈ 𝒳^c} = 0. To say that X ∈ 𝒳^c is to say that, for some
index i,

    Xᵢ ∈ span{Xⱼ | j ≠ i}.

Therefore,

    P{X ∈ 𝒳^c} = P{∪ᵢ [Xᵢ ∈ span{Xⱼ | j ≠ i}]} ≤ Σᵢ₌₁^p P{Xᵢ ∈ span{Xⱼ | j ≠ i}}.

However, Xᵢ is independent of the set of random vectors {Xⱼ | j ≠ i}, and the probability that Xᵢ falls in any fixed subspace M of dimension less than n is zero. Since p ≤ n, the subspace span{Xⱼ | j ≠ i} has dimension less than n. Thus, conditioning on Xⱼ for j ≠ i, we have

    P{Xᵢ ∈ span{Xⱼ | j ≠ i}} = E P{Xᵢ ∈ span{Xⱼ | j ≠ i} | Xⱼ, j ≠ i} = 0.

Hence P{X ∈ 𝒳^c} = 0. Since 𝒳^c has probability zero under the normal distribution on L_{p,n} and since the normal density function with respect to Lebesgue measure is strictly positive on L_{p,n}, it follows that 𝒳^c has Lebesgue measure zero. □
If X ∈ L_{p,n} is a random vector that has a density with respect to Lebesgue measure, the previous result shows that P{X ∈ 𝒳} = 1 since 𝒳^c has Lebesgue measure zero. In particular, if X ∈ L_{p,n} has a normal distribution with a nonsingular covariance, then P{X ∈ 𝒳} = 1, and we often restrict such normal distributions to 𝒳 in order to ensure that X has rank p. For many of the results below, it is assumed that X is a random vector in 𝒳, and in applications X is a random vector in L_{p,n} that has been restricted to 𝒳 after it has been verified that 𝒳^c has probability zero under the distribution of X.
Proposition 7.2. Suppose X ∈ 𝒳 has a normal distribution with L(X) = N(0, I_n ⊗ I_p). Let X₁, ..., X_p be the columns of X and let Ψ ∈ F_{p,n} be the random matrix whose p columns are obtained by applying the Gram-Schmidt orthogonalization procedure to X₁, ..., X_p. Then Ψ has the uniform distribution on F_{p,n}; that is, the distribution of Ψ is the unique probability measure on F_{p,n} that is invariant under the action of O_n on F_{p,n} (see Example 6.16).
Proof. Let Q be the probability distribution of Ψ on F_{p,n}. It must be verified that

    Q(ΓB) = Q(B),  Γ ∈ O_n,

for all Borel sets B of F_{p,n}. If Γ ∈ O_n, it is clear that L(ΓX) = L(X). Also, it is not difficult to verify that Ψ, which we now write as a function of X, say Ψ(X), satisfies

    Ψ(ΓX) = ΓΨ(X),  Γ ∈ O_n.

This follows by looking at the Gram-Schmidt procedure, which defined the columns of Ψ. Thus

    Q(B) = P{Ψ(X) ∈ B} = P{Ψ(ΓX) ∈ B} = P{ΓΨ(X) ∈ B} = P{Ψ(X) ∈ Γ'B} = Q(Γ'B)

for all Γ ∈ O_n. The second equality above follows from the observation that L(X) = L(ΓX). Hence Q is an O_n-invariant probability measure on F_{p,n}, and the uniqueness of such a measure shows that Q is what was called the uniform distribution on F_{p,n}. □
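Proposition 7.2 suggests a standard recipe for simulating the uniform distribution on F_{p,n}: apply Gram-Schmidt to an n × p matrix of independent N(0, 1) entries. A Python/NumPy sketch, including a numerical check of the equivariance Ψ(ΓX) = ΓΨ(X) used in the proof:

```python
import numpy as np

def gram_schmidt(X):
    """Column-wise Gram-Schmidt: returns Psi in F_{p,n} whose first j
    columns span the same subspace as the first j columns of X."""
    n, p = X.shape
    Psi = np.zeros((n, p))
    for j in range(p):
        v = X[:, j] - Psi[:, :j] @ (Psi[:, :j].T @ X[:, j])
        Psi[:, j] = v / np.linalg.norm(v)
    return Psi

rng = np.random.default_rng(6)
n, p = 6, 3
X = rng.standard_normal((n, p))       # L(X) = N(0, I_n ⊗ I_p)
Psi = gram_schmidt(X)                 # uniform on F_{p,n} by Proposition 7.2

# a random orthogonal matrix for the equivariance check
Gamma, _ = np.linalg.qr(rng.standard_normal((n, n)))
```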
Now, consider the two spaces 𝒳 and F_{p,n} × G_U^+. Let φ be the function on 𝒳 to F_{p,n} × G_U^+ that maps X into the unique pair (Ψ, U) such that X = ΨU. Obviously, φ^{-1}(Ψ, U) = ΨU ∈ 𝒳.
Definition 7.1. If X ∈ 𝒳 is a random vector with distribution P, then P is left invariant under O_n if L(X) = L(ΓX) for all Γ ∈ O_n.
The remainder of this section is devoted to a characterization of the O_n-left invariant distributions on 𝒳. It is shown that, if X ∈ 𝒳 has an O_n-left invariant distribution, then for φ(X) = (Ψ, U) ∈ F_{p,n} × G_U^+, Ψ and U are stochastically independent and Ψ has a uniform distribution on F_{p,n}. This assertion and its converse are given in the following proposition.

Proposition 7.3. Suppose X ∈ 𝒳 is a random vector with an O_n-left invariant distribution P and write (Ψ, U) = φ(X). Then Ψ and U are stochastically independent and Ψ has a uniform distribution on F_{p,n}. Conversely, if Ψ ∈ F_{p,n} and U ∈ G_U^+ are independent and if Ψ has a uniform distribution on F_{p,n}, then X = ΨU has an O_n-left invariant distribution on 𝒳.
Proof. The joint distribution Q of (Ψ, U) is determined by

    Q(B₁ × B₂) = P(φ^{-1}(B₁ × B₂))

where B₁ is a Borel subset of F_{p,n} and B₂ is a Borel subset of G_U^+. Also,

    ∫ f(Ψ, U) Q(dΨ, dU) = ∫ f(φ(X)) P(dX)

for any Borel measurable function f that is integrable. The group O_n acts on the left of F_{p,n} × G_U^+ by

    Γ(Ψ, U) = (ΓΨ, U)

and it is clear that

    φ(ΓX) = Γφ(X) for X ∈ 𝒳, Γ ∈ O_n.

We now show that Q is invariant under this group action and apply Proposition 6.10. For Γ ∈ O_n,

    ∫ f(Γ(Ψ, U)) Q(dΨ, dU) = ∫ f(Γφ(X)) P(dX) = ∫ f(φ(ΓX)) P(dX)
      = ∫ f(φ(X)) P(dX) = ∫ f(Ψ, U) Q(dΨ, dU).

Therefore, Q is O_n-invariant and, by Proposition 6.10, Q is a product measure Q₁ × Q₂ where Q₁ is taken to be the uniform distribution on F_{p,n}. That Q₂ is a probability measure is clear since Q is a probability measure. The first assertion has been established. For the converse, let Q₁ and Q₂ be the distributions of Ψ and U, so Q₁ is the uniform distribution on F_{p,n} and Q₁ × Q₂ is the joint distribution of (Ψ, U) in F_{p,n} × G_U^+. The distribution P of X = ΨU = φ^{-1}(Ψ, U) is determined by the equation

    ∫ f(X) P(dX) = ∫∫ f(φ^{-1}(Ψ, U)) Q₁(dΨ) Q₂(dU)

for all integrable f. To show P is O_n-left invariant, it must be verified that

    ∫ f(ΓX) P(dX) = ∫ f(X) P(dX)
for all integrable f and Γ ∈ O_n. But

    ∫ f(ΓX) P(dX) = ∫∫ f(Γφ^{-1}(Ψ, U)) Q₁(dΨ) Q₂(dU)
      = ∫∫ f(φ^{-1}(ΓΨ, U)) Q₁(dΨ) Q₂(dU)
      = ∫∫ f(φ^{-1}(Ψ, U)) Q₁(dΨ) Q₂(dU) = ∫ f(X) P(dX),

where the next to the last equality follows from the O_n-invariance of Q₁. Thus P is O_n-left invariant. □
When p = 1, Proposition 7.3 is interesting. In this case 𝒳 = Rⁿ − {0} and the O_n-left invariant distributions on 𝒳 are exactly the orthogonally invariant distributions on Rⁿ that have no probability at 0 ∈ Rⁿ. If X ∈ Rⁿ − {0} has an orthogonally invariant distribution, then Ψ = X/‖X‖ ∈ F_{1,n} is independent of U = ‖X‖, and Ψ has a uniform distribution on F_{1,n}.
There is an analogue of Proposition 7.3 for the decomposition of X ∈ 𝒳 into (Ψ, A) where Ψ ∈ F_{p,n} and A ∈ S_p^+ (see Proposition 5.5).
Proposition 7.4. Suppose X ∈ 𝒳 is a random vector with an O_n-left invariant distribution and write ψ(X) = (Ψ, A) where Ψ ∈ F_{p,n} and A ∈ S_p^+ are the unique matrices such that X = ΨA. Then Ψ and A are independent and Ψ has a uniform distribution on F_{p,n}. Conversely, if Ψ ∈ F_{p,n} and A ∈ S_p^+ are independent and if Ψ has a uniform distribution on F_{p,n}, then X = ΨA has an O_n-left invariant distribution on 𝒳.

Proof. The proof is essentially the same as that of Proposition 7.3 and is left to the reader. □
Thus far, it has been shown that if X ∈ 𝒳 has an O_n-left invariant distribution, then for X = ΨU, Ψ and U are independent and Ψ has a uniform distribution. However, nothing has been said about the distribution of U ∈ G_U^+. The next result gives the density function of U with respect to the right invariant measure

    ν_r(dU) = dU/∏_{i=1}^p u_{ii}^i

in the case that X has a density of a special form.
Proposition 7.5. Suppose X ∈ 𝒳 has a distribution P given by a density function

    f₀(X'X),  X ∈ 𝒳,

with respect to the measure

    μ(dX) = dX/|X'X|^{n/2}

on 𝒳. Then the density function of U (with respect to ν_r) in the representation X = ΨU is

    g₀(U) = 2^p (√2π)^{np} ω(n, p) f₀(U'U).

Proof. If X ∈ 𝒳, U(X) denotes the unique element of G_U^+ such that X = ΨU(X) for some Ψ ∈ F_{p,n}. To show g₀ is the density function of U, it is sufficient to verify that

    ∫ h(U(X)) f₀(X'X) μ(dX) = ∫ h(U) g₀(U) ν_r(dU)

for all integrable functions h. Since X'X = U'(X)U(X), the results of Example 6.20 show that

    ∫ h(U(X)) f₀(U'(X)U(X)) μ(dX) = c ∫ h(U) f₀(U'U) ν_r(dU),

where c = 2^p (√2π)^{np} ω(n, p). Since g₀(U) = cf₀(U'U), g₀ is the density of U. □
A similar argument gives the density of S = X'X.
Proposition 7.6. Suppose X ∈ 𝒳 has distribution P given by a density function

    f₀(X'X),  X ∈ 𝒳,

with respect to the measure μ. Then the density of S = X'X is

    g₀(S) = (√2π)^{np} ω(n, p) f₀(S)
with respect to the measure

    ν(dS) = dS/|S|^{(p+1)/2}.

Proof. With the notation S(X) = X'X, it is sufficient to verify that

    ∫ h(S(X)) f₀(X'X) μ(dX) = ∫ h(S) g₀(S) ν(dS)

for all integrable functions h. Combining the identities (6.4) and (6.5), we have

    ∫ h(S(X)) f₀(X'X) μ(dX) = c ∫ h(S) f₀(S) ν(dS)

where c = (√2π)^{np} ω(n, p). Since g₀ = cf₀, the proof is complete. □
When X ∈ 𝒳 has the density assumed in Propositions 7.5 and 7.6, it is clear that the distribution of X is O_n-left invariant. In this case, for X = ΨU, Ψ and U are independent, Ψ has a uniform distribution on F_{p,n}, and U has the density given in Proposition 7.5. Thus the joint distribution of Ψ and U has been completely described. Similar remarks apply to the situation treated in Proposition 7.6. The reader has probably noticed that the distribution of S = X'X was derived rather than the distribution of A in the representation X = ΨA for Ψ ∈ F_{p,n} and A ∈ S_p^+. Of course, S = A², so A is the unique positive definite square root of S. The reason for giving the distribution of S rather than that of A is quite simple: the distribution of A is substantially more complicated than that of S and harder to derive.

In the following example, we derive the distributions of U and S when X ∈ 𝒳 has a nonsingular O_n-left invariant normal distribution.
* Example 7.1. Suppose X ∈ 𝒳 has a normal distribution with a nonsingular covariance and also assume that L(X) = L(ΓX) for all Γ ∈ O_n. Thus EX = Γ EX for all Γ ∈ O_n, which implies that EX = 0. Also, Cov(X) must satisfy Cov((Γ ⊗ I_p)X) = Cov(X) since L(X) = L((Γ ⊗ I_p)X). From Proposition 2.19, this implies that

    Cov(X) = I_n ⊗ Σ

for some positive definite Σ, as Cov(X) is assumed to be nonsingular. In summary, if X has a normal distribution on 𝒳 that is O_n-left
invariant, then

    L(X) = N(0, I_n ⊗ Σ).

Conversely, if X is normal with mean zero and Cov(X) = I_n ⊗ Σ, then L(X) = L(ΓX) for all Γ ∈ O_n. Now that the O_n-left invariant normal distributions on 𝒳 have been described, we turn to the distributions of S = X'X and U described in Propositions 7.5 and 7.6. When L(X) = N(0, I_n ⊗ Σ), the density function of X with respect to the measure μ(dX) = dX/|X'X|^{n/2} is

    f₀(X'X) = (√2π)^{-np} |Σ|^{-n/2} |X'X|^{n/2} exp[-½ tr Σ^{-1}X'X].

Therefore, the density of S with respect to the measure

    ν(dS) = dS/|S|^{(p+1)/2},  S ∈ S_p^+,

is given by

    g₀(S) = ω(n, p) |Σ|^{-n/2} |S|^{n/2} exp[-½ tr Σ^{-1}S]

according to Proposition 7.6. This density is called the Wishart density with parameters Σ, p, and n. Here, p is the dimension of S and n is called the degrees of freedom. When S has such a density function, we write L(S) = W(Σ, p, n), which is read "the distribution of S is Wishart with parameters Σ, p, and n." A slightly more general definition of the Wishart distribution is given in the next chapter, where a thorough discussion of the Wishart distribution is presented. A direct application of Proposition 7.5 yields the density

    g₁(U) = 2^p ω(n, p) |Σ|^{-n/2} |U'U|^{n/2} exp[-½ tr Σ^{-1}U'U]

with respect to the measure

    ν_r(dU) = dU/∏_{i=1}^p u_{ii}^i

when X = ΨU, Ψ ∈ F_{p,n}, and U ∈ G_U^+. Here, the nonzero elements of U are u_{ij}, 1 ≤ i ≤ j ≤ p. When Σ = I_p, g₁ becomes

    g₁(U)ν_r(dU) = 2^p ω(n, p) ∏_{i=1}^p u_{ii}^n exp[-½ tr U'U] ν_r(dU)
      = 2^p ω(n, p) ∏_{i=1}^p u_{ii}^{n-i} exp[-½ Σ_{i≤j} u_{ij}²] dU.
In G_U^+, the diagonal elements of U range between 0 and ∞ and the elements above the diagonal range between −∞ and +∞. Writing the density above as

    g₁(U)ν_r(dU) = 2^p ω(n, p) ∏_{i=1}^p (u_{ii}^{n-i} exp[-½u_{ii}²] du_{ii}) × ∏_{i<j} (exp[-½u_{ij}²] du_{ij}),

we see that this density factors into a product of functions that are, when normalized by a constant, density functions. It is clear by inspection that

    L(u_{ij}) = N(0, 1) for i < j.

Further, a simple change of variable shows that

    L(u_{ii}²) = χ²_{n−i+1},  i = 1, ..., p.

Thus, when Σ = I_p, the nonzero elements of U are independent, the elements above the diagonal are all N(0, 1), and the square of the ith diagonal element has a chi-square distribution with n − i + 1 degrees of freedom. This result is sometimes useful for deriving the distribution of functions of S = U'U.
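The conclusions of this example (a Bartlett-type decomposition of a Wishart matrix with Σ = I_p) are easy to check by simulation: in S = U'U the upper triangular factor U has independent entries, N(0, 1) above the diagonal, with squared diagonals distributed χ²_{n−i+1}. A Python/NumPy Monte Carlo sketch (the sizes n = 8, p = 3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, N = 8, 3, 20000

diag_sq = np.zeros((N, p))
offdiag = np.zeros(N)
for t in range(N):
    X = rng.standard_normal((n, p))    # L(X) = N(0, I_n ⊗ I_p), i.e., Sigma = I_p
    S = X.T @ X                        # a W(I_p, p, n) matrix
    L = np.linalg.cholesky(S)          # S = L L', so U = L' is the upper factor
    U = L.T
    diag_sq[t] = np.diag(U) ** 2       # should be chi-square with n - i + 1 d.f.
    offdiag[t] = U[0, 1]               # an above-diagonal element, should be N(0, 1)

mean_diag_sq = diag_sq.mean(axis=0)    # expect approximately (8, 7, 6)
m_off, v_off = offdiag.mean(), offdiag.var()
```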
7.2. GROUPS ACTING ON SETS
Suppose 𝒳 is a set and G is a group that acts on the left of 𝒳 according to Definition 6.3. The group G defines a natural equivalence relation between elements of 𝒳: write x₁ ≡ x₂ if there exists a g ∈ G such that x₁ = gx₂. It is easy to check that ≡ is in fact an equivalence relation. Thus the group G partitions the set 𝒳 into disjoint equivalence classes, say

    𝒳 = ∪_{a∈A} 𝒳_a

where A is an index set and the equivalence classes 𝒳_a are disjoint. For each x ∈ 𝒳, the set {gx | g ∈ G} is the orbit of x under the action of G. From the definition of the equivalence relation, it is clear that, if x ∈ 𝒳_a, then 𝒳_a is just the orbit of x. Thus the decomposition of 𝒳 into equivalence classes is
simply a decomposition of 𝒳 into disjoint orbits, and two points are equivalent iff they are in the same orbit.
Definition 7.2. Suppose G acts on the left of 𝒳. A function f on 𝒳 to 𝒴 is invariant if f(x) = f(gx) for all x ∈ 𝒳 and g ∈ G. The function f is maximal invariant if f is invariant and f(x₁) = f(x₂) implies that x₁ = gx₂ for some g ∈ G.
Obviously, f is invariant iff f is constant on each orbit in 𝒳. Also, f is maximal invariant iff it is constant on each orbit and takes different values on different orbits.
Proposition 7.7. Suppose f maps 𝒳 onto 𝒴 and f is maximal invariant. Then a function h, mapping 𝒳 into 𝒵, is invariant iff h(x) = k(f(x)) for some function k mapping 𝒴 into 𝒵.

Proof. If h(x) = k(f(x)), then h is invariant as f is invariant. Conversely, suppose h is invariant. Given y ∈ 𝒴, the set {x | f(x) = y} is exactly one orbit in 𝒳 since f is maximal invariant. Let z ∈ 𝒵 be the value of h on this orbit (h is invariant), and define k(y) = z. Obviously, k is well defined and k(f(x)) = h(x). □
Proposition 7.7 is ordinarily paraphrased by saying that a function is invariant iff it is a function of a maximal invariant. Once a maximal invariant function has been constructed, all the invariant functions are known: they are the functions of the maximal invariant function. If the group G acts transitively on 𝒳, then there is just one orbit and the only invariant functions are the constants. We now turn to some examples.
* Example 7.2. Let 𝒳 = Rⁿ − {0} and let G = O_n act on 𝒳 as a group of matrices acts on a vector space. Given x ∈ 𝒳, it is clear that the orbit of x is {y | ‖y‖ = ‖x‖}. Let S_r = {x | ‖x‖ = r}, so

    𝒳 = ∪_{r>0} S_r

is the decomposition of 𝒳 into equivalence classes. The real number r > 0 indexes the orbits. That f(x) = ‖x‖ is a maximal invariant function follows from the invariance of f and the fact that f takes a different value on each orbit. Thus a function is invariant under the action of G on 𝒳 iff it is a function of ‖x‖. Now, consider the space S₁ × (0, ∞) and define the function φ on 𝒳 to S₁ × (0, ∞) by
φ(x) = (x/‖x‖, ‖x‖). Obviously, φ is one-to-one, onto, and φ^{-1}(u, r) = ru for (u, r) ∈ S₁ × (0, ∞). Further, the group action on 𝒳 corresponds to the group action on S₁ × (0, ∞) given by

    Γ(u, r) = (Γu, r),  Γ ∈ O_n.

In other words, φ(Γx) = Γφ(x), so φ is an equivariant function (see Definition 6.14). Since O_n acts transitively on S₁, a function h on S₁ × (0, ∞) is invariant iff h(u, r) does not depend on u. For this example, the space 𝒳 has been mapped onto S₁ × (0, ∞) by φ so that the group action on 𝒳 corresponds to a special group action on S₁ × (0, ∞): O_n acts transitively on S₁ and is the identity on (0, ∞). The whole point of introducing S₁ × (0, ∞) is that the function h₀(u, r) = r is obviously a maximal invariant function due to the special way in which O_n acts on S₁ × (0, ∞). To say it another way, the orbits in S₁ × (0, ∞) are S₁ × {r}, r > 0, so the product space structure provides a convenient way to index the orbits and hence to give a maximal invariant function. This type of product space structure occurs in many other examples.
The following example provides a useful generalization of the example above.
* Example 7.3. Suppose 𝒳 is the space of all n × p matrices of rank p, p ≤ n. Then O_n acts on the left of 𝒳 by matrix multiplication. The first claim is that f₀(X) = X'X is a maximal invariant function. That f₀ is invariant is clear, so assume that f₀(X₁) = f₀(X₂). Thus X₁'X₁ = X₂'X₂ and, by Proposition 1.31, there exists a Γ ∈ O_n such that ΓX₁ = X₂. This proves that f₀ is a maximal invariant. Now, the question is: where did f₀ come from? To answer this question, recall that each X ∈ 𝒳 has a unique representation as X = ΨA where Ψ ∈ F_{p,n} and A ∈ S_p^+. Let φ denote the map that sends X into the pair (Ψ, A) ∈ F_{p,n} × S_p^+ such that X = ΨA. The group O_n acts on F_{p,n} × S_p^+ by

    Γ(Ψ, A) = (ΓΨ, A)

and φ satisfies

    φ(ΓX) = Γφ(X).

It is clear that h₀(Ψ, A) = A is a maximal invariant function on F_{p,n} × S_p^+ under the action of O_n since O_n acts transitively on F_{p,n}.
Also, the orbits in ℱ_{p,n} × S_p^+ are ℱ_{p,n} × {A} for A ∈ S_p^+. It follows
immediately from the equivariance of φ that
φ^{-1}(ℱ_{p,n} × {A}) = {X | X = ΨA for some Ψ ∈ ℱ_{p,n}}
are the orbits in 𝒳 under the action of O_n. Thus we have a
convenient indexing of the orbits in 𝒳 given by A. A maximal invariant function on 𝒳 must be a one-to-one function of an orbit index, namely A ∈ S_p^+. However, f_0(X) = X'X = A² when
X ∈ {X | X = ΨA for some Ψ ∈ ℱ_{p,n}}.
Since A is the unique positive definite square root of A² = X'X, we have explicitly shown why f_0 is a one-to-one function of the orbit index A. A similar orbit indexing in 𝒳 can be given by elements U ∈ G_U^+ by representing each X ∈ 𝒳 as X = ΨU, with Ψ ∈ ℱ_{p,n} and
U ∈ G_U^+. The details of this are left to the reader.
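The decomposition X = ΨA and the invariance of f_0(X) = X'X can be illustrated numerically. The following sketch is an added illustration, not part of the original text; it uses NumPy, and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3
X1 = rng.standard_normal((n, p))          # rank p with probability one

# A random element of O_n, obtained from the QR factorization
# of a Gaussian matrix.
Gamma, _ = np.linalg.qr(rng.standard_normal((n, n)))
X2 = Gamma @ X1

# Invariance of f0(X) = X'X under the left action of O_n.
assert np.allclose(X2.T @ X2, X1.T @ X1)

# The orbit index: A = (X'X)^{1/2}, the positive definite square root.
w, V = np.linalg.eigh(X1.T @ X1)
A = V @ np.diag(np.sqrt(w)) @ V.T

# Psi = X A^{-1} has orthonormal columns, so X = Psi A with Psi in F_{p,n}.
Psi = X1 @ np.linalg.inv(A)
assert np.allclose(Psi.T @ Psi, np.eye(p))
assert np.allclose(Psi @ A, X1)
```

The square root A is computed from the spectral decomposition of X'X, which is exactly the orbit index described above.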
* Example 7.4. In this example, the set 𝒳 is (R^p − {0}) × S_p^+. The
group Gl_p acts on the left of 𝒳 in the following manner:
A(y, S) = (Ay, ASA')
for (y, S) ∈ 𝒳 and A ∈ Gl_p. A useful method for finding a maximal invariant function is to consider a point (y, S) ∈ 𝒳 and then
"reduce" (y, S) to a convenient representative in the orbit of (y, S). The orbit of (y, S) is {A(y, S) | A ∈ Gl_p}. To reduce a given
point (y, S), first choose A = ΓS^{-1/2} where Γ ∈ O_p and S^{-1/2} is the inverse of the positive definite square root of S.
Then
ASA' = ΓS^{-1/2}SS^{-1/2}Γ' = ΓΓ' = I_p
and
A(y, S) = (ΓS^{-1/2}y, I_p),
which is in the orbit of (y, S). Since S^{-1/2}y and ‖S^{-1/2}y‖ε_1 have
the same length (ε_1 = (1, 0, ..., 0)'), we can choose Γ ∈ O_p such that
ΓS^{-1/2}y = ‖S^{-1/2}y‖ε_1.
Therefore, for each (y, S) ∈ 𝒳, the point
(‖S^{-1/2}y‖ε_1, I_p)
is in the orbit of (y, S). Let
f_0(y, S) = y'S^{-1}y = ‖S^{-1/2}y‖².
The above reduction argument suggests, but does not prove, that f_0 is maximal invariant. However, the reduction argument does provide a method for checking that f_0 is maximal invariant. First, f_0 is
invariant. To show f_0 is maximal invariant, if f_0(y_1, S_1) = f_0(y_2, S_2),
we must show there exists an A ∈ Gl_p such that A(y_1, S_1) = (y_2, S_2).
From the reduction argument, there exists A_i ∈ Gl_p such that
A_i(y_i, S_i) = (‖S_i^{-1/2}y_i‖ε_1, I_p), i = 1, 2.
Since f_0(y_1, S_1) = f_0(y_2, S_2),
‖S_1^{-1/2}y_1‖ = ‖S_2^{-1/2}y_2‖
and this shows that
A_1(y_1, S_1) = A_2(y_2, S_2).
Setting A = A_2^{-1}A_1, we see that A(y_1, S_1) = (y_2, S_2), so f_0 is maximal invariant. As in the previous two examples, it is possible to represent 𝒳 as a product space where a maximal invariant is
obvious. Let
𝒴 = {(u, S) | u ∈ R^p, S ∈ S_p^+, u'S^{-1}u = 1}.
Then Gl_p acts on the left of 𝒴 by
A(u, S) = (Au, ASA').
The reduction argument used above shows that the action of Gl_p is transitive on 𝒴. Consider the map φ from 𝒳 to 𝒴 × (0, ∞) given by
φ(x, S) = (((x'S^{-1}x)^{-1/2}x, S), (x'S^{-1}x)^{1/2}).
The group action of Gl_p on 𝒴 × (0, ∞) is
A((u, S), r) = (A(u, S), r)
and a maximal invariant function is
f_1((u, S), r) = r
since Gl_p is transitive on 𝒴. Clearly, φ is a one-to-one onto function and satisfies
φ(A(x, S)) = Aφ(x, S).
Thus f_1(φ(x, S)) = (x'S^{-1}x)^{1/2}, and hence x'S^{-1}x, is maximal invariant.
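As a numerical illustration of this example (added here, not from the text; NumPy-based with illustrative names), one can check both the invariance of f_0(y, S) = y'S^{-1}y under the Gl_p action and the reduction of (y, S) to (S^{-1/2}y, I_p):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
y = rng.standard_normal(p)
B = rng.standard_normal((p, p))
S = B @ B.T + p * np.eye(p)                 # a point in S_p^+
A = rng.standard_normal((p, p))             # a generic A lies in Gl_p

def f0(y, S):
    # y'S^{-1}y, computed via a linear solve rather than an explicit inverse
    return y @ np.linalg.solve(S, y)

# Invariance under the Gl_p action (y, S) -> (Ay, ASA').
assert np.isclose(f0(A @ y, A @ S @ A.T), f0(y, S))

# The reduction: S^{-1/2} carries (y, S) to (S^{-1/2}y, I_p),
# and ||S^{-1/2}y||^2 equals f0(y, S).
w, V = np.linalg.eigh(S)
S_inv_half = V @ np.diag(w ** -0.5) @ V.T
assert np.allclose(S_inv_half @ S @ S_inv_half, np.eye(p))
assert np.isclose(np.sum((S_inv_half @ y) ** 2), f0(y, S))
```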
In the three examples above, the space 𝒳 has been represented as a product space 𝒴 × 𝒵 in such a way that the group action on 𝒳 corresponds
to a group action on 𝒴 × 𝒵, namely
g(y, z) = (gy, z),
and G acts transitively on 𝒴. Thus it is obvious that
f_1(y, z) = z
is maximal invariant for G acting on 𝒴 × 𝒵. However, the correspondence
φ, a one-to-one onto mapping, satisfies
φ(gx) = gφ(x) for g ∈ G, x ∈ 𝒳.
The conclusion is that f_1(φ(x)) is a maximal invariant function on 𝒳. A
direct proof in the present generality is easy. Since
f_1(φ(gx)) = f_1(gφ(x)) = f_1(φ(x)),
f_1(φ(x)) is invariant. If f_1(φ(x_1)) = f_1(φ(x_2)), then there is a g ∈ G such that gφ(x_1) = φ(x_2) since f_1 is maximal invariant on 𝒴 × 𝒵. But gφ(x_1) =
φ(gx_1) = φ(x_2), so gx_1 = x_2 as φ is one-to-one. Thus f_1(φ(x)) is maximal
invariant. In the next example, a maximal invariant function is easily found but the product space representation in the form just discussed is not
available.
* Example 7.5. The group O_p acts on S_p^+ by
Γ(S) = ΓSΓ', Γ ∈ O_p.
A maximal invariant function is easily found using a reduction argument similar to that given in Example 7.4. From the spectral
theorem for matrices, every S ∈ S_p^+ can be written in the form S = Γ_1DΓ_1' where Γ_1 ∈ O_p and D is a diagonal matrix whose diagonal elements are the ordered eigenvalues of S, say
λ_1(S) ≥ λ_2(S) ≥ ... ≥ λ_p(S).
Thus Γ_1'SΓ_1 = D, which shows that D is in the orbit of S. Let f_0 on
S_p^+ to R^p be defined by: f_0(S) is the vector of ordered eigenvalues
of S. Obviously, f_0 is O_p-invariant and, to show f_0 is maximal
invariant, suppose f_0(S_1) = f_0(S_2). Then S_1 and S_2 have the same
eigenvalues and we have
S_i = Γ_iDΓ_i', i = 1, 2,
where D is the diagonal matrix of eigenvalues of S_i, i = 1, 2. Thus
Γ_2Γ_1'S_1(Γ_2Γ_1')' = S_2, so f_0 is maximal invariant. To describe the
technical difficulty when we try to write S_p^+ as a product space, first consider the case p = 2. Then S_2^+ = 𝒳_1 ∪ 𝒳_2 where
𝒳_1 = {S | S ∈ S_2^+, λ_1(S) = λ_2(S)}
and
𝒳_2 = {S | S ∈ S_2^+, λ_1(S) > λ_2(S)}.
That O_2 acts on both 𝒳_1 and 𝒳_2 is clear. The function φ_1 defined
on 𝒳_1 by φ_1(S) = λ_1(S) ∈ (0, ∞) is maximal invariant and φ_1
establishes a one-to-one correspondence between 𝒳_1 and (0, ∞). For 𝒳_2, define φ_2 by
φ_2(S) = diag(λ_1(S), λ_2(S)).
So φ_2 is a maximal invariant function and takes values in the set 𝒟
of all 2 × 2 diagonal matrices with diagonal elements y_1 and y_2,
y_1 > y_2 > 0. Let 𝒟_2 be the subgroup of O_2 consisting of those diagonal matrices with ±1 for each diagonal element. The argument given in Example 6.21 shows that the mapping constructed there establishes a one-to-one onto correspondence between 𝒳_2 and (O_2/𝒟_2) × 𝒟, and O_2 acts on (O_2/𝒟_2) × 𝒟 by
Γ(z, y) = (Γz, y); (z, y) ∈ (O_2/𝒟_2) × 𝒟.
Further, this correspondence is equivariant: it intertwines the action of O_2 on 𝒳_2 with the action Γ(z, y) = (Γz, y) on (O_2/𝒟_2) × 𝒟.
Thus for p = 2, S_2^+ has been decomposed into 𝒳_1 and 𝒳_2, which
are both invariant under O_2. The action of O_2 on 𝒳_1 is trivial in that
Γx = x for all x ∈ 𝒳_1, and a maximal invariant function on 𝒳_1 is
the identity function. Also, 𝒳_2 was decomposed into a product space where O_2 acted transitively on the first component of the product space and trivially on the second component. From this decomposition, a maximal invariant function was obvious. Similar decompositions can be given for S_p^+ when p > 2, but the number of
component spaces increases. For example, when p = 3, let λ_1(S) ≥
λ_2(S) ≥ λ_3(S) denote the ordered eigenvalues of S ∈ S_3^+. The
relevant decomposition for S_3^+ is
S_3^+ = 𝒳_1 ∪ 𝒳_2 ∪ 𝒳_3 ∪ 𝒳_4
where
𝒳_1 = {S | λ_1(S) = λ_2(S) = λ_3(S)}
𝒳_2 = {S | λ_1(S) = λ_2(S) > λ_3(S)}
𝒳_3 = {S | λ_1(S) > λ_2(S) = λ_3(S)}
𝒳_4 = {S | λ_1(S) > λ_2(S) > λ_3(S)}.
Each of the four components is acted on by O_3 and can be written as
a product space with the structure described previously. The details
of this are left to the reader. In some situations, it is sufficient to
consider the subset 𝒮 of S_p^+ where
𝒮 = {S | λ_1(S) > λ_2(S) > ... > λ_p(S)}.
The argument given in Example 6.21 shows how to write 𝒮 as a
product space so that a maximal invariant function is obvious under
the action of O_p on 𝒮.
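The fact that the ordered eigenvalues form a maximal invariant under the O_p action of Example 7.5 is easy to check numerically. A small sketch (an illustration added here, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
B = rng.standard_normal((p, p))
S = B @ B.T + np.eye(p)                     # S in S_p^+

# A random Gamma in O_p.
Gamma, _ = np.linalg.qr(rng.standard_normal((p, p)))

# f0(S) = vector of ordered eigenvalues; invariant under S -> Gamma S Gamma'.
lam1 = np.sort(np.linalg.eigvalsh(S))[::-1]
lam2 = np.sort(np.linalg.eigvalsh(Gamma @ S @ Gamma.T))[::-1]
assert np.allclose(lam1, lam2)

# The reduction S = Gamma_1 D Gamma_1' with D the diagonal matrix of
# ordered eigenvalues, so D is in the orbit of S.
D = np.diag(lam1)
w, G1 = np.linalg.eigh(S)                   # eigh returns ascending order
G1 = G1[:, ::-1]                            # reorder so S = G1 D G1'
assert np.allclose(G1 @ D @ G1.T, S)
```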
Further examples of maximal invariants are given as the need arises. We
end this section with a brief discussion of equivariant functions. Recall (see Definition 6.14) that a function φ on 𝒳 onto 𝒴 is called equivariant if
φ(gx) = ḡφ(x), where G acts on 𝒳, Ḡ acts on 𝒴, and Ḡ is a homomorphic
image of G. If Ḡ = {ē} consists only of the identity, then equivariant functions are invariant under G. In this case, we have a complete description of all the equivariant functions, namely, a function is equivariant iff it is a function of a maximal invariant function on 𝒳. In the general case
when Ḡ is not the trivial group, a useful description of all the equivariant functions appears to be rather difficult. However, there is one special case
when the equivariant functions can be characterized. Assume that G acts transitively on 𝒳 and Ḡ acts transitively on 𝒴, where
Ḡ is a homomorphic image of G. Fix x_0 ∈ 𝒳 and let
H_0 = {g | gx_0 = x_0}.
The subgroup H_0 of G is called the isotropy subgroup of x_0. Also, fix y_0 ∈ 𝒴
and let
K_0 = {ḡ | ḡy_0 = y_0}
be the isotropy subgroup of y_0.
Proposition 7.8. In order that there exist an equivariant function φ on 𝒳 to 𝒴 such that φ(x_0) = y_0, it is necessary and sufficient that H̄_0 ⊆ K_0. Here
H̄_0 ⊆ Ḡ is the image of H_0 under the given homomorphism.
Proof. First, suppose that φ is equivariant and satisfies φ(x_0) = y_0. Then,
for g ∈ H_0,
φ(x_0) = φ(gx_0) = ḡφ(x_0) = ḡy_0,
so ḡy_0 = y_0, that is, ḡ ∈ K_0. Thus H̄_0 ⊆ K_0. Conversely, suppose that H̄_0 ⊆ K_0. For x ∈ 𝒳, the transitivity of G on 𝒳 implies that x = gx_0 for some g. Define φ on 𝒳 to
𝒴 by
φ(x) = ḡy_0 where x = gx_0.
It must be shown that φ is well defined and is equivariant. If x = g_1x_0 =
g_2x_0, then g_2^{-1}g_1 ∈ H_0 so ḡ_2^{-1}ḡ_1 ∈ K_0. Thus
φ(x) = ḡ_1y_0 = ḡ_2y_0
since
ḡ_2^{-1}ḡ_1y_0 = y_0.
Therefore φ is well defined and is onto 𝒴 since Ḡ acts transitively on 𝒴.
That φ is equivariant is easily checked. □
The proof of Proposition 7.8 shows that an equivariant function is determined by its value at one point when G acts transitively on 𝒳. More precisely, if φ_1 and φ_2 are equivariant functions on 𝒳 such that φ_1(x_0) =
φ_2(x_0) for some x_0 ∈ 𝒳, then φ_1(x) = φ_2(x) for all x. To see this, write
x = gx_0 so
φ_1(x) = φ_1(gx_0) = ḡφ_1(x_0) = ḡφ_2(x_0) = φ_2(gx_0) = φ_2(x).
Thus, to characterize all the equivariant functions, it is sufficient to determine the possible values of φ(x_0) for some fixed x_0 ∈ 𝒳. The following example illustrates these ideas.
* Example 7.6. Suppose 𝒳 = 𝒴 = S_p^+ and G = Ḡ = Gl_p, where the homomorphism is the identity. The action of Gl_p on S_p^+ is
A(S) = ASA'; A ∈ Gl_p, S ∈ S_p^+.
To characterize the equivariant functions, pick x_0 = I_p ∈ S_p^+. An equivariant function φ must satisfy
φ(I_p) = φ(ΓI_pΓ') = Γφ(I_p)Γ'
for all Γ ∈ O_p. By Proposition 2.13, a matrix φ(I_p) satisfies this
equation iff φ(I_p) = kI_p for some real constant k. Since φ(I_p) ∈ S_p^+, k > 0. Thus
φ(I_p) = kI_p, k > 0,
and for S ∈ S_p^+,
φ(S) = φ(S^{1/2}S^{1/2}) = S^{1/2}φ(I_p)S^{1/2} = kS.
Therefore, every equivariant function has the form φ(S) = kS for some k > 0.
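A numerical check of Example 7.6 (an added illustration, not from the text): φ(S) = kS is equivariant under the Gl_p action, while a function not of this form, such as the diagonal part of S, fails equivariance for a generic A:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
B = rng.standard_normal((p, p))
S = B @ B.T + np.eye(p)                     # S in S_p^+
A = rng.standard_normal((p, p))             # a generic A lies in Gl_p

k = 2.5                                     # any k > 0
phi = lambda S: k * S
# Equivariance: phi(ASA') = A phi(S) A'.
assert np.allclose(phi(A @ S @ A.T), A @ phi(S) @ A.T)

# A map that is not of the form kS, e.g. the diagonal part of S,
# is generically not equivariant.
psi = lambda S: np.diag(np.diag(S))
assert not np.allclose(psi(A @ S @ A.T), A @ psi(S) @ A.T)
```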
Further applications of the above ideas occur in the following sections after it is shown that, under certain conditions, maximum likelihood estimators are equivariant functions.
7.3. INVARIANT PROBABILITY MODELS
Invariant probability models provide the mathematical framework in which the connection between statistical problems and invariance can be studied.
Suppose (𝒳, ℬ) is a measurable space and G is a group of transformations
acting on 𝒳 such that each g ∈ G is a one-to-one onto measurable function
from 𝒳 to 𝒳. If P is a probability measure on (𝒳, ℬ) and g ∈ G, the
probability measure gP on (𝒳, ℬ) is defined by
(gP)(B) = P(g^{-1}B); g ∈ G, B ∈ ℬ.
It is easily verified that (g_1g_2)P = g_1(g_2P), so the group G acts on the space
of all probability measures defined on (𝒳, ℬ).
Definition 7.3. Let 𝒫 be a set of probability measures defined on (𝒳, ℬ). The set 𝒫 is invariant under G if for each P ∈ 𝒫, gP ∈ 𝒫 for all g ∈ G. Sets
of probability measures 𝒫 are called probability models, and when 𝒫 is
invariant under G, we speak of a G-invariant probability model.
If X ∈ 𝒳 is a random vector with ℒ(X) = P, then ℒ(gX) = gP for
g ∈ G since
Pr{gX ∈ B} = Pr{X ∈ g^{-1}B} = P(g^{-1}B) = (gP)(B).
Thus 𝒫 is invariant under G iff whenever ℒ(X) ∈ 𝒫, ℒ(gX) ∈ 𝒫 for all g ∈ G.
There are a variety of ways to construct invariant probability models from other invariant probability models. For example, if 𝒫_α, α ∈ A, are
G-invariant probability models, it is clear that
⋃_{α∈A} 𝒫_α and ⋂_{α∈A} 𝒫_α
are both G-invariant. Now, given (𝒳, ℬ) and a G-invariant probability model 𝒫, form the product space
𝒳^{(n)} = 𝒳 × 𝒳 × ... × 𝒳
and the product σ-algebra ℬ^{(n)} on 𝒳^{(n)}. For P ∈ 𝒫, define P^{(n)} on ℬ^{(n)} by first defining
P^{(n)}(B_1 × B_2 × ... × B_n) = ∏_{i=1}^n P(B_i)
where B_i ∈ ℬ. Once P^{(n)} is defined on sets of the form B_1 × ... × B_n, its
extension to ℬ^{(n)} is unique. Also, define G acting on 𝒳^{(n)} by
g(x_1, ..., x_n) = (gx_1, ..., gx_n)
for x = (x_1, ..., x_n) ∈ 𝒳^{(n)}.
Proposition 7.9. Let 𝒫^{(n)} = {P^{(n)} | P ∈ 𝒫}. Then 𝒫^{(n)} is a G-invariant probability model on (𝒳^{(n)}, ℬ^{(n)}) when 𝒫 is G-invariant.
Proof. It must be shown that gP^{(n)} ∈ 𝒫^{(n)} for g ∈ G and P^{(n)} ∈ 𝒫^{(n)}.
However, P^{(n)} is the product measure
P^{(n)} = P × P × ... × P,
and gP^{(n)} is determined by its values on sets of the form B_1 × ... × B_n. But
(gP^{(n)})(B_1 × ... × B_n) = P^{(n)}(g^{-1}B_1 × g^{-1}B_2 × ... × g^{-1}B_n)
= ∏_{i=1}^n P(g^{-1}B_i) = ∏_{i=1}^n (gP)(B_i),
where the first equality follows from the definition of the action of G on
𝒳^{(n)}. Thus gP^{(n)} is the product measure
gP^{(n)} = (gP) × (gP) × ... × (gP),
which is in 𝒫^{(n)} as gP ∈ 𝒫. □
For an application of Proposition 7.9, suppose X is a random vector with ℒ(X) ∈ 𝒫 where 𝒫 is a G-invariant probability model on 𝒳. If X_1, ..., X_n are independent and identically distributed with ℒ(X_i) ∈ 𝒫, then the random vector
Y = (X_1, ..., X_n) ∈ 𝒳^{(n)}
has distribution P^{(n)} ∈ 𝒫^{(n)} when ℒ(X_i) = P, i = 1, ..., n. Thus 𝒫^{(n)} is a G-invariant probability model for Y.
In most applications, probability models 𝒫 are described in the form
𝒫 = {P_θ | θ ∈ Θ} where θ is a parameter and Θ is the parameter space. When
discussing indexed families of probability measures, the term "parameter space" is used only in the case that the indexing is one-to-one, that is,
P_{θ_1} = P_{θ_2} implies that θ_1 = θ_2. Now, suppose 𝒫 = {P_θ | θ ∈ Θ} is G-invariant.
Then for each g ∈ G and θ ∈ Θ, gP_θ ∈ 𝒫, so gP_θ = P_{θ'} for some unique
θ' ∈ Θ. Define a function ḡ on Θ to Θ by
gP_θ = P_{ḡθ}, θ ∈ Θ.
In other words, ḡθ is the unique point in Θ that satisfies the above equation.
Proposition 7.10. Each ḡ is a one-to-one onto function from Θ to Θ. Let
Ḡ = {ḡ | g ∈ G}. Then Ḡ is a group under the group operation of function
composition and the mapping g → ḡ is a group homomorphism from G to
Ḡ, that is:
(i) (g_1g_2)‾ = ḡ_1ḡ_2.
(ii) (g^{-1})‾ = (ḡ)^{-1}.
Proof. To show that ḡ is one-to-one, suppose ḡθ_1 = ḡθ_2. Then
gP_{θ_1} = P_{ḡθ_1} = P_{ḡθ_2} = gP_{θ_2},
which implies that P_{θ_1} = P_{θ_2} so θ_1 = θ_2. The verification that ḡ is onto goes
as follows. If θ ∈ Θ, let θ' = (g^{-1})‾θ. Then
P_{ḡθ'} = gP_{θ'} = g(g^{-1}P_θ) = (gg^{-1})P_θ = P_θ,
so ḡθ' = θ. Equations (i) and (ii) follow by calculations similar to those above. This shows that Ḡ is the homomorphic image of G and Ḡ is a group.
□
An important special case of a G-invariant parametric model is the following. Suppose G acts on (𝒳, ℬ) and assume that ν is a σ-finite measure on (𝒳, ℬ) that is relatively invariant with multiplier χ, that is,
∫ f(g^{-1}x)ν(dx) = χ(g)∫ f(x)ν(dx), g ∈ G,
for all integrable functions f. Assume that 𝒫 = {P_θ | θ ∈ Θ} is a parametric
model and
P_θ(B) = ∫ I_B(x)p(x|θ)ν(dx)
for all measurable sets B. Thus p(·|θ) is a density for P_θ with respect to ν. If
𝒫 is G-invariant, then
gP_θ = P_{ḡθ} for g ∈ G, θ ∈ Θ.
Therefore,
(gP_θ)(B) = P_θ(g^{-1}B) = ∫ I_B(gx)p(x|θ)ν(dx)
= χ(g^{-1})∫ I_B(x)p(g^{-1}x|θ)ν(dx)
and
P_{ḡθ}(B) = ∫ I_B(x)p(x|ḡθ)ν(dx)
for all measurable sets B. Thus the density p must satisfy
χ(g^{-1})p(g^{-1}x|θ) = p(x|ḡθ) a.e. (ν)
or, equivalently,
p(x|θ) = p(gx|ḡθ)χ(g) a.e. (ν).
It should be noted that the null set where the above equality does not hold
may depend on both θ and g. However, in most applications, a version of
the density is available so the above equality is valid everywhere. This leads
to the following definition.
Definition 7.4. The family of densities {p(·|θ) | θ ∈ Θ} with respect to the
relatively invariant measure ν with multiplier χ is (G–Ḡ)-invariant if
p(x|θ) = p(gx|ḡθ)χ(g)
for all x, θ, and g.
It is clear that if a family of densities is (G–Ḡ)-invariant, where Ḡ is a
homomorphic image of G that acts on Θ, then the family of probability measures defined by these densities is a G-invariant probability model. A
few examples illustrate these notions.
* Example 7.7. Let 𝒳 = R^n and suppose f(‖x‖²) is a density with
respect to Lebesgue measure on R^n. For μ ∈ R^n and Σ ∈ S_n^+, set
p(x|μ, Σ) = |Σ|^{-1/2} f((x − μ)'Σ^{-1}(x − μ)).
For each μ and Σ, p(·|μ, Σ) is a density on R^n. The affine group Al_n acts on R^n by (A, b)x = Ax + b and Lebesgue measure is relatively
invariant with multiplier
χ(A, b) = |det(A)|
where (A, b) ∈ Al_n. Consider the parameter space R^n × S_n^+ and the
family of densities
{p(·|μ, Σ) | (μ, Σ) ∈ R^n × S_n^+}.
The group Al_n acts on the parameter space R^n × S_n^+ by
(A, b)(μ, Σ) = (Aμ + b, AΣA').
It is now verified that the family of densities above is (G–Ḡ)-invariant where G = Ḡ = Al_n. For (A, b) ∈ Al_n,
p((A, b)x|(A, b)(μ, Σ)) = p(Ax + b|(Aμ + b, AΣA'))
= |AΣA'|^{-1/2} f((Ax + b − Aμ − b)'(AΣA')^{-1}(Ax + b − Aμ − b))
= |det A|^{-1}|Σ|^{-1/2} f((x − μ)'Σ^{-1}(x − μ))
= χ^{-1}(A, b) p(x|μ, Σ).
Therefore, the parametric model determined by the family of densities is Al_n-invariant.
A useful method for generating a G-invariant probability model on a measurable space (𝒳, ℬ) is to consider a fixed probability measure P_0 on (𝒳, ℬ) and set
𝒫 = {gP_0 | g ∈ G}.
Obviously, 𝒫 is G-invariant. However, in many situations, the group G does
not serve as a parameter space for 𝒫 since g_1P_0 = g_2P_0 does not necessarily
imply that g_1 = g_2. For example, consider 𝒳 = R^n and let P_0 be given by
P_0(B) = ∫_{R^n} I_B(x)f(‖x‖²) dx
where f(‖x‖²) is the density on R^n of Example 7.7. Also, let G = Al_n. To obtain the density of gP_0, suppose X is a random vector with ℒ(X) = P_0. For g = (A, b) ∈ Al_n, (A, b)X = AX + b has a density given by
p(x|b, AA') = |det(AA')|^{-1/2} f((x − b)'(AA')^{-1}(x − b))
and this is the density of (A, b)P_0. Thus the parameter space for
𝒫 = {(A, b)P_0 | (A, b) ∈ Al_n}
is R^n × S_n^+. Of course, the reason that Al_n is not a parameter space for 𝒫 is
that
(Γ, 0)P_0 = P_0
for all n × n orthogonal matrices Γ. In other words, P_0 is an orthogonally
invariant probability on R^n. Some of the linear models introduced in Chapter 4 provide interesting
examples of parametric models that are generated by groups of transformations.
* Example 7.8. Consider an inner product space (V, [·,·]) and let P_0 be a probability measure on V so that if ℒ(X) = P_0, then ℰX = 0
and Cov(X) = I. Given a subspace M of V, form the group G
whose elements consist of pairs (a, x) with a > 0 and x ∈ M. The
group operation is
(a_1, x_1)(a_2, x_2) = (a_1a_2, a_1x_2 + x_1).
The probability model 𝒫 = {gP_0 | g ∈ G} consists of all the distributions of (a, x)X = aX + x where ℒ(X) = P_0. Clearly,
ℰ(aX + x) = x and Cov(aX + x) = a²I.
Therefore, if ℒ(Y) ∈ 𝒫, then ℰY ∈ M and Cov(Y) = a²I for some
a² > 0, so 𝒫 is a linear model for Y. For this particular example the
group G is a parameter space for 𝒫. This linear model is generated by G in the sense that 𝒫 is obtained by transforming a fixed
probability measure P_0 by elements of G.
An argument similar to that in Example 7.8 shows that the multivariate
linear model introduced in Example 4.4 is also generated by a group of
transformations.
* Example 7.9. Let ℒ_{p,n} be the linear space of real n × p matrices
with the usual inner product ⟨·,·⟩ on ℒ_{p,n}. Assume that P_0 is a probability measure on ℒ_{p,n} so that, if ℒ(X) = P_0, then ℰX = 0
and Cov(X) = I_n ⊗ I_p. To define a regression subspace M, let Z be
a fixed n × k real matrix and set
M = {y | y = ZB, B ∈ ℒ_{p,k}}.
Obviously, M is a subspace of ℒ_{p,n}. Consider the group G whose elements are pairs (A, y) with A ∈ Gl_p and y ∈ M. Then G acts on
ℒ_{p,n} by
(A, y)x = xA' + y = (I_n ⊗ A)x + y,
and the group operation is
(A_1, y_1)(A_2, y_2) = (A_1A_2, y_2A_1' + y_1).
The probability model 𝒫 = {gP_0 | g ∈ G} consists of the distributions of (A, y)X = (I_n ⊗ A)X + y where ℒ(X) = P_0. Since
ℰ((I_n ⊗ A)X + y) = y ∈ M
and
Cov((I_n ⊗ A)X + y) = I_n ⊗ AA',
if ℒ(Y) ∈ 𝒫, then ℰY ∈ M and Cov(Y) = I_n ⊗ Σ for some p × p
positive definite matrix Σ. Thus 𝒫 is a multivariate linear model as described in Example 4.4. If p > 1, the group G is not a parameter
space for 𝒫, but G does generate 𝒫.
Most of the probability models discussed in later chapters are examples of probability models generated by groups of transformations. Thus these
models are G-invariant and this invariance can be used in a variety of ways.
First, invariance can be used to give easy derivations of maximum likelihood estimators and to suggest test statistics in some situations. In addition, distributional and independence properties of certain statistics are often best explained in terms of invariance.
7.4. THE INVARIANCE OF LIKELIHOOD METHODS
In this section, it is shown that under certain conditions maximum likelihood estimators are equivariant functions and likelihood ratio tests are invariant functions. Throughout this section, G is a group of transformations that act measurably on (𝒳, ℬ) and ν is a σ-finite relatively invariant
measure on (𝒳, ℬ) with multiplier χ. Suppose that 𝒫 = {P_θ | θ ∈ Θ} is a
G-invariant parametric model such that each P_θ has a density p(·|θ), which
satisfies
p(x|θ) = p(gx|ḡθ)χ(g)
for all x ∈ 𝒳, θ ∈ Θ, and g ∈ G. The group Ḡ = {ḡ | g ∈ G} is the homomorphic image of G described in Proposition 7.10. In the present context, a
point estimator of θ, say t, mapping 𝒳 into Θ, is equivariant (see Definition
6.14) if
t(gx) = ḡt(x), g ∈ G, x ∈ 𝒳.
Proposition 7.11. Given the (G–Ḡ)-invariant family of densities
{p(·|θ) | θ ∈ Θ}, assume there exists a unique function θ̂ mapping 𝒳 into Θ
that satisfies
sup_{θ∈Θ} p(x|θ) = p(x|θ̂(x)).
Then θ̂ is an equivariant function, that is,
θ̂(gx) = ḡθ̂(x), x ∈ 𝒳, g ∈ G.
Proof. By assumption, θ̂(gx) is the unique point in Θ that satisfies
sup_{θ∈Θ} p(gx|θ) = p(gx|θ̂(gx)).
But
p(gx|θ) = χ(g^{-1})p(x|ḡ^{-1}θ)
so
sup_θ p(gx|θ) = χ(g^{-1}) sup_θ p(x|ḡ^{-1}θ) = χ(g^{-1}) sup_θ p(x|θ)
= χ(g^{-1})p(x|θ̂(x)) = p(gx|ḡθ̂(x)).
Thus
p(gx|θ̂(gx)) = p(gx|ḡθ̂(x))
and, by the uniqueness assumption,
θ̂(gx) = ḡθ̂(x). □
Of course, the estimator θ̂(x) whose existence and uniqueness is assumed in Proposition 7.11 is the maximum likelihood estimator of θ. That θ̂ is an equivariant function is useful information about the maximum likelihood estimator, but the above result does not indicate how to use invariance to find the maximum likelihood estimator. The next result rectifies this situation.
Proposition 7.12. Let {p(·|θ) | θ ∈ Θ} be a (G–Ḡ)-invariant family of
densities on (𝒳, ℬ). Fix a point x_0 ∈ 𝒳 and let 𝒪_{x_0} be the orbit of x_0. Assume that
sup_{θ∈Θ} p(x_0|θ) = p(x_0|θ_0)
and that θ_0 is unique. For x ∈ 𝒪_{x_0}, define θ̂(x) by
θ̂(x) = ḡ_xθ_0 where x = g_xx_0.
Then θ̂ is well defined on 𝒪_{x_0} and satisfies
(i) θ̂(gx) = ḡθ̂(x), x ∈ 𝒪_{x_0}.
(ii) sup_{θ∈Θ} p(x|θ) = p(x|θ̂(x)), x ∈ 𝒪_{x_0}.
Furthermore, θ̂ is unique.
Proof. The density p(·|θ) satisfies
p(x|θ) = p(gx|ḡθ)χ(g)
where χ is a multiplier on G. To show θ̂ is well defined on 𝒪_{x_0}, it must be
verified that if x = g_xx_0 = h_xx_0, then ḡ_xθ_0 = h̄_xθ_0. Set k = h_x^{-1}g_x, so kx_0 =
x_0 and we need to show that k̄θ_0 = θ_0. But, since kx_0 = x_0,
p(x_0|θ_0) = sup_θ p(kx_0|θ) = χ(k^{-1}) sup_θ p(x_0|k̄^{-1}θ)
= χ(k^{-1}) sup_θ p(x_0|θ) = χ(k^{-1})p(x_0|θ_0)
= p(kx_0|k̄θ_0)
= p(x_0|k̄θ_0).
By the uniqueness assumption, k̄θ_0 = θ_0, so θ̂ is well defined on 𝒪_{x_0}. To
establish (i), if x = g_xx_0, then gx = (gg_x)x_0 so
θ̂(gx) = (gg_x)‾θ_0 = ḡ(ḡ_xθ_0) = ḡθ̂(x).
For (ii), x = g_xx_0 so
sup_θ p(x|θ) = sup_θ p(g_xx_0|θ) = χ(g_x^{-1}) sup_θ p(x_0|ḡ_x^{-1}θ)
= χ(g_x^{-1}) sup_θ p(x_0|θ) = χ(g_x^{-1})p(x_0|θ_0)
= p(g_xx_0|ḡ_xθ_0) = p(x|θ̂(x)).
To establish the uniqueness of θ̂, fix x ∈ 𝒪_{x_0} and consider θ_1 ≠ ḡ_xθ_0. Then
p(x|θ_1) = p(g_xx_0|ḡ_xḡ_x^{-1}θ_1) = χ(g_x^{-1})p(x_0|ḡ_x^{-1}θ_1)
< χ(g_x^{-1})p(x_0|θ_0) = p(x|θ̂(x)).
The strict inequality follows from the uniqueness assumption concerning θ_0. □
In applications, Proposition 7.12 is used as follows. From each orbit in
the sample space 𝒳, we pick a convenient point x_0 and show that p(x_0|θ) is
uniquely maximized at θ_0. Then for other points x in this orbit, write x = g_xx_0 and set θ̂(x) = ḡ_xθ_0. The function θ̂ is then the maximum likelihood estimator of θ and is equivariant. In some situations, there is only one
orbit in 𝒳 so this method is relatively easy to apply.
* Example 7.10. Consider 𝒳 = Θ = S_p^+ and let
p(S|Σ) = ω(n, p)|Σ|^{-n/2}|S|^{n/2} exp[−(1/2) tr Σ^{-1}S]
for S ∈ S_p^+ and Σ ∈ S_p^+. The constant ω(n, p), n ≥ p, was defined
in Example 5.1. That p(·|Σ) is a density with respect to the measure
ν(dS) = dS/|S|^{(p+1)/2}
follows from Example 5.1. The group Gl_p acts on S_p^+ by
A(S) = ASA'
for A ∈ Gl_p and S ∈ S_p^+, and the measure ν is invariant. Also, it is clear that the density p(·|Σ) satisfies
p(ASA'|AΣA') = p(S|Σ).
To find the maximum likelihood estimator of Σ ∈ S_p^+, we apply the
technique described above. Consider the point I_p ∈ S_p^+ and note that the orbit of I_p under the action of Gl_p is S_p^+, so in this case there is only one orbit. Thus to apply Proposition 7.12, it must be verified that
sup_Σ p(I_p|Σ) = p(I_p|Σ_0)
where Σ_0 is unique. Taking the logarithm of p(I_p|Σ) and ignoring the constant term, maximizing p(I_p|Σ) over Σ is equivalent to computing
sup_Σ [n log|Σ^{-1}| − tr Σ^{-1}] = sup_{B∈S_p^+} [n log|B| − tr B]
= sup_{B∈S_p^+} ∑_{i=1}^p [n log λ_i − λ_i]
where λ_1, ..., λ_p are the eigenvalues of B = Σ^{-1} ∈ S_p^+. However, for λ > 0, n log λ − λ is a strictly concave function of λ and is
uniquely maximized at λ = n. Thus the function
∑_{i=1}^p [n log λ_i − λ_i]
is uniquely maximized at λ_1 = ... = λ_p = n, which means that
n log|B| − tr B
is uniquely maximized at B = nI_p. Therefore,
sup_Σ p(I_p|Σ) = p(I_p|(1/n)I_p)
and (1/n)I_p is the unique point in S_p^+ that achieves this supremum.
To find the maximum likelihood estimator of Σ, say Σ̂(S), write
S = AA' for A ∈ Gl_p. Then
Σ̂(S) = A((1/n)I_p)A' = (1/n)AA' = (1/n)S.
In summary,
Σ̂ = (1/n)S
is the unique maximum likelihood estimator of Σ and
sup_Σ p(S|Σ) = p(S|(1/n)S)
= ω(n, p)|(1/n)S|^{-n/2}|S|^{n/2} exp[−(1/2) tr((1/n)S)^{-1}S]
= ω(n, p)n^{np/2} exp[−np/2].
The results of this example are used later to derive the maximum
likelihood estimator of a covariance matrix in a variety of multivariate normal models.
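The conclusion of Example 7.10, that Σ̂ = (1/n)S uniquely maximizes p(S|Σ), can be checked numerically by comparing the log likelihood at Σ̂ with its value at random symmetric perturbations. An added illustration, not from the text:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 3, 10
B = rng.standard_normal((p, p))
S = B @ B.T + np.eye(p)                        # the observed S in S_p^+

def log_lik(Sigma):
    # log of |Sigma|^{-n/2} exp(-tr(Sigma^{-1} S)/2); terms free of
    # Sigma (including |S|^{n/2} and omega(n, p)) are dropped.
    sign, logdet = np.linalg.slogdet(Sigma)
    return -n / 2 * logdet - 0.5 * np.trace(np.linalg.solve(Sigma, S))

Sigma_hat = S / n
best = log_lik(Sigma_hat)
for _ in range(200):
    C = 0.1 * rng.standard_normal((p, p))
    Sigma = Sigma_hat + 0.5 * (C + C.T)        # symmetric perturbation
    if np.all(np.linalg.eigvalsh(Sigma) > 0):  # stay inside S_p^+
        assert log_lik(Sigma) < best           # Sigma_hat is the unique max
```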
We now turn to the invariance of likelihood ratio tests. First, invariant
testing problems need to be defined. Let 𝒫 = {P_θ | θ ∈ Θ} be a parametric
probability model on (𝒳, ℬ) and suppose that G acts measurably on 𝒳.
Let Θ_0 and Θ_1 be two disjoint subsets of Θ. On the basis of an observation
vector X ∈ 𝒳 with ℒ(X) ∈ 𝒫_0 ∪ 𝒫_1 where
𝒫_i = {P_θ | θ ∈ Θ_i}, i = 0, 1,
suppose it is desired to test the hypothesis
H_0: ℒ(X) ∈ 𝒫_0
against the alternative
H_1: ℒ(X) ∈ 𝒫_1.
Definition 7.5. The above hypothesis testing problem is invariant under G if 𝒫_0 and 𝒫_1 are both G-invariant probability models.
Now suppose that 𝒫_0 = {P_θ | θ ∈ Θ_0} and 𝒫_1 = {P_θ | θ ∈ Θ_1} are disjoint families of probability measures on (𝒳, ℬ) such that each P_θ has a density
p(·|θ) with respect to a σ-finite measure ν. Consider
Λ(x) = sup_{θ∈Θ_0} p(x|θ) / sup_{θ∈Θ_0∪Θ_1} p(x|θ).
For testing the null hypothesis that ℒ(X) ∈ 𝒫_0 versus the alternative that ℒ(X) ∈ 𝒫_1, the test that rejects the null hypothesis iff Λ(x) < k, where k is
chosen to control the level of the test, is commonly called the likelihood ratio test.
Proposition 7.13. Given the family of densities {p(·|θ) | θ ∈ Θ_0 ∪ Θ_1}, assume the testing problem for ℒ(X) ∈ 𝒫_0 versus ℒ(X) ∈ 𝒫_1 is invariant
under a group G and suppose that
p(x|θ) = p(gx|ḡθ)χ(g)
for some multiplier χ. Then the likelihood ratio
Λ(x) = sup_{θ∈Θ_0} p(x|θ) / sup_{θ∈Θ_0∪Θ_1} p(x|θ)
is an invariant function.
Proof. It must be shown that Λ(x) = Λ(gx) for x ∈ 𝒳 and g ∈ G. For
g ∈ G,
Λ(gx) = sup_{θ∈Θ_0} p(gx|θ) / sup_{θ∈Θ_0∪Θ_1} p(gx|θ)
= χ(g^{-1}) sup_{θ∈Θ_0} p(x|ḡ^{-1}θ) / [χ(g^{-1}) sup_{θ∈Θ_0∪Θ_1} p(x|ḡ^{-1}θ)]
= sup_{θ∈Θ_0} p(x|θ) / sup_{θ∈Θ_0∪Θ_1} p(x|θ) = Λ(x).
The next to the last equality follows from the positivity of χ and the invariance of Θ_0 and Θ_0 ∪ Θ_1. □
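Proposition 7.13 can be illustrated with a familiar special case: testing μ = 0 versus μ ≠ 0 for a N(μ, σ²) sample with σ² unknown, a problem invariant under the scale group x → cx, c ≠ 0. The closed-form likelihood ratio below and the check of its invariance are an added illustration, not from the text:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(8) + 1.0            # a sample of size 8

def lam(x):
    # Likelihood ratio for H0: mu = 0 in a N(mu, sigma^2) sample.
    # Both sups are available in closed form: substituting the MLE of
    # sigma^2 leaves (2 pi sigma_hat^2)^{-n/2} e^{-n/2} in each case,
    # so the ratio reduces to (s1 / s0)^{n/2}.
    n = len(x)
    s0 = np.sum(x ** 2) / n                 # MLE of sigma^2 under H0
    s1 = np.sum((x - x.mean()) ** 2) / n    # unrestricted MLE of sigma^2
    return (s1 / s0) ** (n / 2)

# Invariance: lam(c x) = lam(x) for every c != 0, since both s0 and s1
# are multiplied by c^2 under x -> c x.
for c in (2.0, -0.5, 10.0):
    assert np.isclose(lam(c * x), lam(x))
```

Note that Λ here is a monotone function of the usual t statistic, which is itself invariant under scaling.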
For invariant testing problems, Proposition 7.13 shows that the test function determined by Λ, namely
φ_0(x) = 1 if Λ(x) < k, φ_0(x) = 0 if Λ(x) ≥ k,
is an invariant function. More generally, any test function φ is invariant if
φ(x) = φ(gx) for all x ∈ 𝒳 and g ∈ G. The whole point of the above
discussion is to show that, when attention is restricted to invariant tests for
invariant testing problems, the likelihood ratio test is never excluded from consideration. Furthermore, if a particular invariant test has been shown to
have an optimal property among invariant tests, then this test has been
compared to the likelihood ratio test. Illustrations of these comments are
given later in this section when we consider testing problems for the
multivariate normal distribution.
multivariate normal distribution. Comments similar to those above apply to equivariant estimators. Sup
pose (p(-1j0)1 Eje 8) is a (G - G)-invariant family of densities and satisfies
p(xIO) = p(gxlg0)X(g)
for some multiplier X. If the conditions of Proposition 7.12 hold, then an
equivariant maximum likelihood estimator exists. Thus if an equivariant
estimator t with some optimal property (relative to the class of all equiv
ariant estimators) has been found, then this property holds when t is
compared to the maximum likelihood estimator. The Pitman estimator, derived in the next example, is an illustration of this situation.
* Example 7.11. Let f be a density on R^p with respect to Lebesgue
measure and consider the translation family of densities {p(·|θ) | θ ∈ R^p} defined by
p(x|θ) = f(x − θ), x, θ ∈ R^p.
For this example, 𝒳 = Θ = G = R^p and the group action is
g(x) = x + g, x, g ∈ R^p.
It is clear that
p(gx|ḡθ) = p(x|θ),
so the family of densities is invariant and the multiplier is unity. It
is assumed that
∫ xf(x) dx = 0 and ∫ ‖x‖²f(x) dx < +∞.
Initially, assume we have one observation X with ℒ(X) ∈ {p(·|θ) | θ ∈ R^p}. The problem is to estimate the parameter θ. As a measure of how well an estimator t performs, consider
R(t, θ) = ℰ_θ‖t(X) − θ‖².
If t(X) is close to θ on the average, then R(t, θ) should be small.
We now want to show that, if t is an equivariant estimator of θ, then
R(t, θ) = R(t, 0)
and the equivariant estimator t_0(X) = X minimizes R(t, 0) over all equivariant estimators. If t is an equivariant estimator, then
t(x + g) = t(x) + g
so, with g = −x,
t(x) = x + t(0).
Therefore, every equivariant estimator has the form t(x) = x + c
where c E RP is a constant. Conversely, any such estimator t(x) =
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
266 FIRST APPLICATIONS OF INVARIANCE
x + c is equivariant. For t(x) = x + c,
R(t, 9) = &911t(X) - 9112 = lilx + c - 9112f(x - 9) dx
= flix + C12f (x) dx = R(t, O).
To minimize $R(t,0)$ over all equivariant $t$, the integral
$$\int\|x+c\|^2 f(x)\,dx$$
must be minimized by an appropriate choice of $c$. But
$$\mathcal E\|X+c\|^2 = \mathcal E\|X-\mathcal E(X)\|^2 + \|\mathcal E(X)+c\|^2,$$
so
$$c = -\mathcal E(X) = -\int x f(x)\,dx = 0$$
minimizes the above integral. Hence $t_0(X) = X$ minimizes $R(t,0)$ over all equivariant estimators. Now, we want to generalize this result to the case when $X_1,\ldots,X_n$ are independent and identically distributed with $\mathcal L(X_i)\in\{p(\cdot\mid\theta)\mid\theta\in R^p\}$, $i = 1,\ldots,n$. The argument is essentially the same as when $n = 1$. An estimator $t$ is equivariant if
$$t(x_1+g,\ldots,x_n+g) = t(x_1,\ldots,x_n) + g$$
so, setting $g = -x_1$,
$$t(x_1,\ldots,x_n) = x_1 + t(0,\,x_2-x_1,\ldots,x_n-x_1).$$
Conversely, if
$$t(x_1,\ldots,x_n) = x_1 + \psi(x_2-x_1,\ldots,x_n-x_1)$$
then $t$ is equivariant. Here, $\psi$ is some measurable function taking values in $R^p$. Thus a complete description of the equivariant estimators has been given. For such an estimator,
$$R(t,\theta) = \mathcal E_\theta\|t(X_1,\ldots,X_n)-\theta\|^2 = \mathcal E_\theta\|t(X_1-\theta,\ldots,X_n-\theta)\|^2 = \mathcal E_0\|t(X_1,\ldots,X_n)\|^2 = R(t,0).$$
To minimize $R(t,0)$, we need to choose the function $\psi$ to minimize
$$R(t,0) = \mathcal E_0\|X_1 + \psi(X_2-X_1,\ldots,X_n-X_1)\|^2.$$
Let $U_i = X_i - X_1$, $i = 2,\ldots,n$. Then
$$R(t,0) = \mathcal E_0\|X_1 + \psi(U_2,\ldots,U_n)\|^2 = \mathcal E\!\left(\mathcal E\!\left(\|X_1 + \psi(U_2,\ldots,U_n)\|^2\mid U_2,\ldots,U_n\right)\right).$$
However, conditional on $(U_2,\ldots,U_n) = U$,
$$\mathcal E\!\left(\|X_1+\psi(U)\|^2\mid U\right) = \mathcal E\!\left(\|X_1-\mathcal E(X_1\mid U)\|^2\mid U\right) + \|\mathcal E(X_1\mid U)+\psi(U)\|^2.$$
Thus it is clear that
$$\psi_0(U) = -\mathcal E(X_1\mid U)$$
minimizes $R(t,0)$. Hence the equivariant estimator
$$t_0(X_1,\ldots,X_n) = X_1 - \mathcal E_0(X_1\mid X_2-X_1,\ldots,X_n-X_1)$$
satisfies
$$R(t_0,\theta) = R(t_0,0) \le R(t,0) = R(t,\theta)$$
for all $\theta\in R^p$ and all equivariant estimators $t$. The estimator $t_0$ is commonly called the Pitman estimator.
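An aside on computation: after a change of variable, the conditional-expectation form of the Pitman estimator is equivalent to the ratio
$$t_0(x) = \frac{\int \theta \prod_{i=1}^n f(x_i-\theta)\,d\theta}{\int \prod_{i=1}^n f(x_i-\theta)\,d\theta}.$$
The following Python sketch (an editorial illustration, not part of the text; the function name, data, and grid settings are arbitrary choices) evaluates this ratio by grid integration for the univariate case $p = 1$ and checks that, for a standard normal $f$, the Pitman estimator coincides with the sample mean:

```python
import math

def pitman_location(x, f, pad=10.0, n_grid=8001):
    # Pitman (minimum risk equivariant) estimator of a location parameter:
    #   t0(x) = ( int theta * L(theta) dtheta ) / ( int L(theta) dtheta ),
    # where L(theta) = prod_i f(x_i - theta).  This ratio form follows from
    # t0(x) = x1 - E0(X1 | X2 - X1, ..., Xn - X1) by a change of variable.
    lo, hi = min(x) - pad, max(x) + pad
    step = (hi - lo) / (n_grid - 1)
    num = den = 0.0
    for k in range(n_grid):
        theta = lo + k * step
        L = 1.0
        for xi in x:
            L *= f(xi - theta)
        num += theta * L
        den += L
    return num / den  # the common factor `step` cancels in the ratio

# For a standard normal density f, the Pitman estimator is the sample mean.
phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
data = [0.3, -1.2, 2.5, 0.7]
est = pitman_location(data, phi)
```

The grid bounds assume $f$ places essentially all of its mass within `pad` of the data; for a heavy-tailed $f$, the grid would have to be widened.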
7.5. DISTRIBUTION THEORY AND INVARIANCE
When a family of distributions is invariant under a group of transformations, useful information can often be obtained about the distribution of invariant functions by using the invariance. For example, some of the results in Section 7.1 are generalized here.
The first result shows that the distribution of an invariant function depends invariantly on a parameter. Suppose $(\mathcal X,\mathcal B)$ is a measurable space acted on measurably by a group $G$. Consider an invariant probability model $\mathcal P = \{P_\theta\mid\theta\in\Theta\}$ and let $\bar G$ be the induced group of transformations on $\Theta$. Thus
$$\bar g P_\theta = P_{\bar g\theta},\qquad \theta\in\Theta,\ g\in G.$$
A measurable mapping $T$ on $(\mathcal X,\mathcal B)$ to $(\mathcal Y,\mathcal C)$ induces a family of distributions $\{Q_\theta\mid\theta\in\Theta\}$ on $(\mathcal Y,\mathcal C)$ given by
$$Q_\theta(C) = P_\theta(T^{-1}(C)),\qquad C\in\mathcal C,\ \theta\in\Theta.$$
Proposition 7.14. If $T$ is $G$-invariant, then $Q_\theta = Q_{\bar g\theta}$ for $\theta\in\Theta$ and $g\in G$.

Proof. For each $C\in\mathcal C$, it must be shown that
$$Q_\theta(C) = Q_{\bar g\theta}(C)$$
or, equivalently, that
$$P_\theta(T^{-1}(C)) = P_{\bar g\theta}(T^{-1}(C)).$$
But
$$P_{\bar g\theta}(T^{-1}(C)) = (gP_\theta)(T^{-1}(C)) = P_\theta(g^{-1}T^{-1}(C)) = P_\theta((Tg)^{-1}(C)).$$
Since $Tg = T$ as $T$ is invariant,
$$Q_{\bar g\theta}(C) = P_\theta((Tg)^{-1}(C)) = P_\theta(T^{-1}(C)) = Q_\theta(C). \qquad\square$$
An alternative formulation of Proposition 7.14 is useful. If $\mathcal L(X)\in\{P_\theta\mid\theta\in\Theta\}$ and if $T$ is $G$-invariant, then the induced distribution of $Y = T(X)$, which is $Q_\theta$, satisfies $Q_\theta = Q_{\bar g\theta}$. In other words, the distribution of an invariant function depends only on a maximal invariant parameter. By definition, a maximal invariant parameter is any function defined on $\Theta$ that is maximal invariant under the action of $\bar G$ on $\Theta$. Of course, $\Theta$ is usually not a parameter space for the family $\{Q_\theta\mid\theta\in\Theta\}$ as $Q_\theta = Q_{\bar g\theta}$, but any maximal $\bar G$-invariant function on $\Theta$ often serves as a parameter index for the distribution of $Y = T(X)$.
* Example 7.12. In this example, we establish a property of the distribution of the bivariate sample correlation coefficient. Consider a family of densities $p(\cdot\mid\mu,\Sigma)$ on $R^2$ given by
$$p(x\mid\mu,\Sigma) = |\Sigma|^{-1/2} f_0\!\left((x-\mu)'\Sigma^{-1}(x-\mu)\right)$$
where $\mu\in R^2$ and $\Sigma\in S_2^+$. Here, $f_0$ is a fixed nonnegative function for which $p(\cdot\mid\mu,\Sigma)$ is a density, and it is assumed that
$$\int \|x\|^2 f_0(\|x\|^2)\,dx < +\infty.$$
Since the distribution on $R^2$ determined by $f_0$ is orthogonally invariant, if $Z\in R^2$ has density $f_0(\|x\|^2)$, then
$$\mathcal E Z = 0 \quad\text{and}\quad \operatorname{Cov}(Z) = cI_2$$
for some $c>0$ (see Proposition 2.13). Also, $Z_1 = \Sigma^{1/2}Z + \mu$ has density $p(\cdot\mid\mu,\Sigma)$ when $Z$ has density $f_0(\|x\|^2)$. Thus
$$\mathcal E Z_1 = \mu \quad\text{and}\quad \operatorname{Cov}(Z_1) = c\Sigma.$$
The group $\mathrm{Al}_2$ acts on $R^2$ by
$$(A,b)x = Ax + b$$
and it is clear that the family of distributions, say $\mathcal P = \{P_{\mu,\Sigma}\mid(\mu,\Sigma)\in R^2\times S_2^+\}$, having the densities $p(\cdot\mid\mu,\Sigma)$, $\mu\in R^2$, $\Sigma\in S_2^+$, is invariant under this group action. Lebesgue measure on $R^2$ is relatively invariant with multiplier
$$\chi(A,b) = |\det(A)|$$
and
$$p(x\mid\mu,\Sigma) = p((A,b)x\mid A\mu+b,\, A\Sigma A')\,\chi(A,b).$$
Obviously, the group action on the parameter space is
$$(A,b)(\mu,\Sigma) = (A\mu+b,\, A\Sigma A')$$
and $(A,b)P_{\mu,\Sigma} = P_{A\mu+b,\,A\Sigma A'}$.
Now, let $X_1,\ldots,X_n$, $n\ge 3$, be a random sample with $\mathcal L(X_i)\in\mathcal P$ so the probability model for the random sample is $\mathrm{Al}_2$-invariant by Proposition 7.9. Consider $\bar X = \sum_{i=1}^n X_i/n$ and $S = \sum_{i=1}^n (X_i-\bar X)(X_i-\bar X)'$ so $\bar X$ is the sample mean and $S$ is the sample covariance matrix (not normalized). Obviously, $S = S(X_1,\ldots,X_n)$ is a function of $X_1,\ldots,X_n$ and
$$S(AX_1+b,\ldots,AX_n+b) = A\,S(X_1,\ldots,X_n)\,A'.$$
That is, $S$ is an equivariant function on $(R^2)^n$ to $S_2^+$ where the group action on $S_2^+$ is
$$(A,b)(S) = ASA'.$$
Writing $S\in S_2^+$ as
$$S = \begin{pmatrix} s_{11} & s_{12}\\ s_{21} & s_{22}\end{pmatrix},\qquad s_{12} = s_{21},$$
the sample correlation coefficient is
$$r = \frac{s_{12}}{\sqrt{s_{11}s_{22}}}.$$
Also, the population correlation coefficient is
$$\rho = \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}$$
when the distribution under consideration is $P_{\mu,\Sigma}$ and
$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12}\\ \sigma_{21} & \sigma_{22}\end{pmatrix}.$$
Now, given that the random sample is from $P_{\mu,\Sigma}$, the question is: how does the distribution of $r$ depend on $(\mu,\Sigma)$? To show that the distribution of $r$ depends only on $\rho$, we use an invariance argument. Let $G$ be the subgroup of $\mathrm{Al}_2$ defined by
$$G = \left\{(A,b)\,\middle|\,(A,b)\in\mathrm{Al}_2,\ A = \begin{pmatrix} a_1 & 0\\ 0 & a_2\end{pmatrix},\ a_i>0,\ i=1,2\right\}.$$
For $(A,b)\in G$, a bit of calculation shows that $r = r(X_1,\ldots,X_n) = r(AX_1+b,\ldots,AX_n+b)$ so $r$ is a $G$-invariant function of $X_1,\ldots,X_n$. By Proposition 7.14, the distribution of $r$, say $Q_{\mu,\Sigma}$, satisfies
$$Q_{\mu,\Sigma} = Q_{(A,b)(\mu,\Sigma)},\qquad (A,b)\in G.$$
Thus $Q_{\mu,\Sigma}$ depends on $(\mu,\Sigma)$ only through a maximal invariant function on the parameter space $R^2\times S_2^+$ under the action of $G$. Of course, the action of $G$ is
$$(A,b)(\mu,\Sigma) = (A\mu+b,\, A\Sigma A'),\qquad (A,b)\in G.$$
We now claim that
$$\rho = \rho(\mu,\Sigma) = \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}$$
is a maximal $G$-invariant function. To see this, consider $(\mu,\Sigma)\in R^2\times S_2^+$. By choosing
$$A = \begin{pmatrix} \sigma_{11}^{-1/2} & 0\\ 0 & \sigma_{22}^{-1/2}\end{pmatrix}$$
and $b = -A\mu$, we have $(A,b)\in G$ and
$$(A,b)(\mu,\Sigma) = \left(\begin{pmatrix}0\\0\end{pmatrix},\ \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\right)$$
so this point is in the orbit of $(\mu,\Sigma)$ and an orbit index is $\rho$. Thus $\rho$ is maximal invariant and the distribution of $r$ depends only on $(\mu,\Sigma)$ through the maximal invariant function $\rho$. Obviously, the distribution of $r$ also depends on the function $f_0$, but $f_0$ was considered fixed in this discussion.
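The "bit of calculation" showing that $r$ is $G$-invariant can be carried out numerically: rescaling each coordinate by a positive constant and translating leaves the sample correlation unchanged. The Python sketch below (an editorial illustration with arbitrary data; not part of the text) checks this exactly:

```python
import math

def corr(xs, ys):
    # sample correlation coefficient r = s12 / sqrt(s11 * s22)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s11 = sum((u - mx) ** 2 for u in xs)
    s22 = sum((v - my) ** 2 for v in ys)
    s12 = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
    return s12 / math.sqrt(s11 * s22)

# a sample of n = 5 points in R^2
xs = [1.0, 2.0, 4.0, 3.5, 0.5]
ys = [2.0, 1.5, 5.0, 4.0, 1.0]

# group element (A, b) with A = diag(a1, a2), a_i > 0
a1, a2, b1, b2 = 3.0, 0.25, -7.0, 11.0
r_before = corr(xs, ys)
r_after = corr([a1 * u + b1 for u in xs], [a2 * v + b2 for v in ys])
```

Under $(A,b)$, $s_{12}$ is scaled by $a_1 a_2$ while $s_{11}$ and $s_{22}$ are scaled by $a_1^2$ and $a_2^2$, so the ratio is unchanged up to floating-point error.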
Proposition 7.14 asserts that the distribution of an invariant function depends only on a maximal invariant parameter, but this result is not
especially useful if the exact distribution of an invariant function is desired.
The remainder of the section is concerned with using invariance arguments, when G is compact, to derive distributions of maximal invariants and to characterize the G-invariant distributions.
First, we consider the distribution of a maximal invariant function when a compact topological group $G$ acts measurably on a space $(\mathcal X,\mathcal B)$. Suppose that $\mu_0$ is a $\sigma$-finite $G$-invariant measure on $(\mathcal X,\mathcal B)$ and $f$ is a density with respect to $\mu_0$. Let $T$ be a measurable mapping from $(\mathcal X,\mathcal B)$ onto $(\mathcal Y,\mathcal C)$. Then $T$ induces a measure on $(\mathcal Y,\mathcal C)$, say $\nu_0$, given by
$$\nu_0(C) = \mu_0(T^{-1}(C))$$
and the equation
$$\int h(T(x))\,\mu_0(dx) = \int h(y)\,\nu_0(dy)$$
holds for all integrable functions $h$ on $(\mathcal Y,\mathcal C)$. Since the group $G$ is compact, there exists a unique probability measure, say $\delta$, that is left and right invariant.

Proposition 7.15. Suppose the mapping $T$ from $(\mathcal X,\mathcal B)$ onto $(\mathcal Y,\mathcal C)$ is maximal invariant under the action of $G$ on $\mathcal X$. If $X\in\mathcal X$ has density $f$ with respect to $\mu_0$, then the density of $Y = T(X)$ with respect to $\nu_0$ is given by
$$q(T(x)) = \int f(gx)\,\delta(dg).$$
Proof. First, the integral
$$\int f(gx)\,\delta(dg)$$
is a $G$-invariant function of $x$ and thus can be written as a function of the maximal invariant $T$. This defines the function $q$ on $\mathcal Y$. To show that $q$ is the density of $Y$, it suffices to show that
$$\mathcal E k(Y) = \int k(y)\,q(y)\,\nu_0(dy)$$
for all bounded measurable functions $k$. But
$$\mathcal E k(Y) = \mathcal E k(T(X)) = \int k(T(x))f(x)\,\mu_0(dx) = \int k(T(x))f(gx)\,\mu_0(dx).$$
The last equality holds since $\mu_0$ is $G$-invariant and $T$ is $G$-invariant. Since $\delta$ is a probability measure,
$$\mathcal E k(Y) = \iint k(T(x))f(gx)\,\mu_0(dx)\,\delta(dg).$$
Using Fubini's Theorem, the definition of $q$, and the relationship between $\mu_0$ and $\nu_0$, we have
$$\mathcal E k(Y) = \int k(T(x))\,q(T(x))\,\mu_0(dx) = \int k(y)\,q(y)\,\nu_0(dy). \qquad\square$$
In most situations, the compact group $G$ will be the orthogonal group or some subgroup of the orthogonal group. Concrete applications of Proposition 7.15 involve two separate steps. First, the function $q$ must be calculated by evaluating
$$\int f(gx)\,\delta(dg).$$
Also, given $\mu_0$ and the maximal invariant $T$, the measure $\nu_0$ must be found.
* Example 7.13. Take $\mathcal X = R^n$ and let $\mu_0$ be Lebesgue measure. The orthogonal group $\mathcal O_n$ acts on $R^n$ and a maximal invariant function is $T(x) = \|x\|^2$, so $\mathcal Y = [0,\infty)$. If a random vector $X\in R^n$ has a density $f$ with respect to Lebesgue measure, Proposition 7.15 tells us how to find the density of $Y = \|X\|^2$ with respect to the measure $\nu_0$. To find $\nu_0$, consider the particular density
$$f_0(x) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\|x\|^2\right].$$
Thus $\mathcal L(X) = N(0, I_n)$, so $\mathcal L(Y) = \chi_n^2$ and the density of $Y$ with respect to Lebesgue measure $dy$ on $[0,\infty)$ is
$$p_0(y) = \frac{y^{n/2-1}\exp[-\tfrac{1}{2}y]}{2^{n/2}\Gamma(n/2)}.$$
Therefore,
$$p_0(y)\,dy = q_0(y)\,\nu_0(dy)$$
where
$$q_0(T(x)) = \int f_0(\Gamma x)\,\delta(d\Gamma).$$
Since $f_0(\Gamma x) = f_0(x)$, the integration of $f_0$ over $\mathcal O_n$ is trivial and
$$q_0(y) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}y\right].$$
Solving for $\nu_0(dy)$, we have
$$\nu_0(dy) = \frac{\pi^{n/2}}{\Gamma(n/2)}\,y^{n/2-1}\,dy$$
since $\Gamma(\tfrac12) = \sqrt{\pi}$. Now that $\nu_0$ has been found, consider a general density $f$ on $R^n$. Then
$$q(T(x)) = \int f(\Gamma x)\,\delta(d\Gamma)$$
and $q(y)$ is the density of $Y = \|X\|^2$ with respect to $\nu_0$. When the density $f$ is given by
$$f(x) = h(\|x\|^2),$$
then it is clear that
$$q(y) = h(y),\qquad y\in[0,\infty),$$
so the distribution of $Y$ has been found in this case.

The noncentral chi-square distribution of $Y = \|X\|^2$ provides an interesting example where the integration over $\mathcal O_n$ is not trivial. Suppose $\mathcal L(X) = N(\mu, I_n)$ so
$$f(x) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\|x-\mu\|^2\right] = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\left(\|x\|^2 - 2x'\mu + \|\mu\|^2\right)\right].$$
Thus
$$q(T(x)) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\|\mu\|^2\right]\exp\left[-\tfrac{1}{2}\|x\|^2\right]\int \exp\left[(\Gamma x)'\mu\right]\delta(d\Gamma).$$
Since $x$ and $\|x\|\varepsilon_1$ have the same length, $x = \|x\|\Gamma_1\varepsilon_1$ for some $\Gamma_1\in\mathcal O_n$ where $\varepsilon_1$ is the first standard unit vector in $R^n$. Similarly, $\mu = \|\mu\|\Gamma_2\varepsilon_1$ for some $\Gamma_2\in\mathcal O_n$. Setting $\lambda = \|\mu\|^2$ and $y = \|x\|^2$,
$$q(y) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\lambda\right]\exp\left[-\tfrac{1}{2}y\right]\int \exp\left[\sqrt{\lambda y}\,(\Gamma_1\varepsilon_1)'\Gamma'\Gamma_2\varepsilon_1\right]\delta(d\Gamma).$$
Thus to evaluate $q$, we need to calculate
$$H(u) = \int \exp\left[u\,(\Gamma_1\varepsilon_1)'\Gamma'\Gamma_2\varepsilon_1\right]\delta(d\Gamma).$$
Since $\delta$ is left and right invariant,
$$H(u) = \int \exp\left[u\,\varepsilon_1'\Gamma\varepsilon_1\right]\delta(d\Gamma) = \int \exp\left[u\gamma_{11}\right]\delta(d\Gamma)$$
where $\gamma_{11}$ is the $(1,1)$ element of $\Gamma$. The representation of the uniform distribution on $\mathcal O_n$ given in Proposition 7.2 shows that when $\Gamma$ is uniform on $\mathcal O_n$, then
$$\mathcal L(\gamma_{11}) = \mathcal L\!\left(\frac{Z_1}{\|Z\|}\right) = \mathcal L(U_1)$$
where $\mathcal L(Z) = N(0, I_n)$ and $Z_1$ is the first coordinate of $Z$. Expanding the exponential in a power series, we have
$$H(u) = \sum_{j=0}^{\infty}\frac{u^j}{j!}\int \gamma_{11}^{\,j}\,\delta(d\Gamma) = \sum_{j=0}^{\infty}\frac{u^j}{j!}\,\mathcal E U_1^{\,j}.$$
Thus the moments of $U_1 = Z_1/\|Z\|$ need to be found. Obviously, $\mathcal L(U_1) = \mathcal L(-U_1)$, so all odd moments of $U_1$ are zero. Also, $U_1^2 = Z_1^2/(Z_1^2 + \sum_{i=2}^{n}Z_i^2)$, which has a beta distribution with parameters $\tfrac12$ and $(n-1)/2$. Therefore,
$$\mathcal E U_1^{2j} = \frac{\Gamma(n/2)\,\Gamma(j+\tfrac12)}{\Gamma(n/2+j)\,\Gamma(\tfrac12)}$$
so
$$H(u) = \sum_{j=0}^{\infty}\frac{u^{2j}}{(2j)!}\cdot\frac{\Gamma(n/2)\,\Gamma(j+\tfrac12)}{\Gamma(n/2+j)\,\Gamma(\tfrac12)}.$$
Hence
$$q(y) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\lambda\right]\exp\left[-\tfrac{1}{2}y\right]\sum_{j=0}^{\infty}\frac{(\lambda y)^j}{(2j)!}\cdot\frac{\Gamma(n/2)\,\Gamma(j+\tfrac12)}{\Gamma(n/2+j)\,\Gamma(\tfrac12)}$$
is the density of $Y$ with respect to the measure $\nu_0$. A bit of algebra and some manipulation with the gamma function shows that
$$q(y)\,\nu_0(dy) = \left\{\sum_{j=0}^{\infty} e^{-\lambda/2}\,\frac{(\lambda/2)^j}{j!}\,h_{n+2j}(y)\right\}dy$$
where
$$h_m(y) = \frac{y^{m/2-1}\exp[-\tfrac{1}{2}y]}{2^{m/2}\Gamma(m/2)}$$
is the density of a $\chi_m^2$ distribution. This is the expression for the density of the noncentral chi-square distribution discussed in Chapter 3.
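The Poisson-mixture form of the noncentral chi-square density lends itself to direct computation. The Python sketch below (an editorial illustration, not part of the text; the truncation level and grid settings are arbitrary choices) evaluates $\sum_j e^{-\lambda/2}(\lambda/2)^j/j!\,h_{n+2j}(y)$ and checks numerically that it integrates to one and has mean $n+\lambda$:

```python
import math

def ncx2_pdf(y, n, lam, j_max=40):
    # Noncentral chi-square density as a Poisson(lam/2) mixture of central
    # chi-square densities with n + 2j degrees of freedom.
    total = 0.0
    for j in range(j_max):
        w = math.exp(-lam / 2.0) * (lam / 2.0) ** j / math.factorial(j)
        m = n + 2 * j
        h = y ** (m / 2.0 - 1.0) * math.exp(-y / 2.0) / (2.0 ** (m / 2.0) * math.gamma(m / 2.0))
        total += w * h
    return total

n, lam = 3, 2.5
step = 0.01
grid = [step * (k + 0.5) for k in range(6000)]   # midpoints of [0, 60]
vals = [ncx2_pdf(y, n, lam) for y in grid]
mass = sum(vals) * step                           # should be close to 1
mean = sum(y * v for y, v in zip(grid, vals)) * step   # should be close to n + lam
```

The truncation at `j_max` is adequate here because the Poisson weights with rate $\lambda/2$ decay superexponentially; a much larger $\lambda$ would require a larger truncation level.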
* Example 7.14. In this example, we derive the density function of the order statistic of a random vector $X\in R^n$. Suppose $X$ has a density $f$ with respect to Lebesgue measure and let $X_1,\ldots,X_n$ be the coordinates of $X$. Consider the space $\mathcal Y\subseteq R^n$ defined by
$$\mathcal Y = \{y\mid y\in R^n,\ y_1\le y_2\le\cdots\le y_n\}.$$
The order statistic of $X$ is the random vector $Y\in\mathcal Y$ consisting of the ordered values of the coordinates of $X$. More precisely, $Y_1$ is the smallest coordinate of $X$, $Y_2$ is the next smallest coordinate of $X$, and so on. Thus $Y = T(X)$ where $T$ maps each $x\in R^n$ into the ordered coordinates of $x$, say $T(x)\in\mathcal Y$. To derive the density function of $Y$, we show that $Y$ is a maximal invariant under a compact group operating on $R^n$ and then apply Proposition 7.15. Let $G$ be the group of all one-to-one onto functions from $\{1,2,\ldots,n\}$ to $\{1,2,\ldots,n\}$; that is, $G$ is the permutation group of $\{1,2,\ldots,n\}$. Of course, the group operation is function composition, the group inverse is function inverse, and $G$ has $n!$ elements. The group $G$ acts on the left of $R^n$ in the following way. For $x\in R^n$ and $\pi\in G$, define $\pi x\in R^n$ to have $i$th coordinate $x(\pi^{-1}(i))$. Thus the $i$th coordinate of $\pi x$ is the $\pi^{-1}(i)$ coordinate of $x$, so
$$(\pi x)(i) = x(\pi^{-1}(i)).$$
The reason for the inverse on $\pi$ in this definition is so that $G$ acts on the left of $R^n$; that is,
$$(\pi_1\pi_2)x = \pi_1(\pi_2 x).$$
It is routine to verify that the function $T$ on $\mathcal X$ to $\mathcal Y$ is a maximal invariant under the action of $G$ on $R^n$. Also, Lebesgue measure, say $l$, is invariant, so Proposition 7.15 is applicable as $G$ is a finite group and hence compact. Obviously, the density $q$ of $Y = T(X)$ satisfies
$$q(T(x)) = \frac{1}{n!}\sum_{\pi\in G} f(\pi x)$$
so
$$q(y) = \frac{1}{n!}\sum_{\pi\in G} f(\pi y)$$
for $y\in\mathcal Y$. To derive the measure $\nu_0$ on $\mathcal Y$, consider a measurable subset $C\subseteq\mathcal Y$. Then
$$T^{-1}(C) = \bigcup_{\pi\in G}(\pi C)$$
and
$$\nu_0(C) = l(T^{-1}(C)) = l\!\left(\bigcup_{\pi\in G}(\pi C)\right) = \sum_{\pi\in G} l(\pi C) = n!\,l(C).$$
The third equality follows since $(\pi_1 C)\cap(\pi_2 C)$ has Lebesgue measure zero for $\pi_1\ne\pi_2$, as the boundary of $\mathcal Y$ in $R^n$ has Lebesgue measure zero. Thus $\nu_0$ is just $n!$ times $l$ restricted to $\mathcal Y$. Therefore, the density of the order statistic $Y$, with respect to $\nu_0$ restricted to $\mathcal Y$, is
$$q(y) = \frac{1}{n!}\sum_{\pi\in G} f(\pi y).$$
When $f$ is invariant under permutations, as is the case when $X_1,\ldots,X_n$ are independent and identically distributed, we have
$$q(y) = f(y),\qquad y\in\mathcal Y.$$
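Both the left-action convention $(\pi x)(i) = x(\pi^{-1}(i))$ and the collapse of $q$ to $f$ for permutation-invariant $f$ can be verified by direct enumeration. The Python sketch below is an editorial illustration (the data and the iid Laplace density are arbitrary choices, not from the text):

```python
import itertools
import math

def act(pi, x):
    # (pi x)(i) = x(pi^{-1}(i)); pi is a tuple with pi[i] = pi(i), 0-indexed
    n = len(x)
    inv = [0] * n
    for i in range(n):
        inv[pi[i]] = i          # inv is the inverse permutation
    return tuple(x[inv[i]] for i in range(n))

def compose(p1, p2):
    # function composition: (p1 p2)(i) = p1(p2(i))
    return tuple(p1[p2[i]] for i in range(len(p1)))

x = (0.5, -1.0, 2.0)
p1, p2 = (1, 2, 0), (2, 0, 1)
left_action_ok = act(compose(p1, p2), x) == act(p1, act(p2, x))

# For a permutation-invariant density f (iid coordinates), the average
# q(y) = (1/n!) * sum over pi of f(pi y) collapses to f(y).
f = lambda v: math.prod(math.exp(-abs(t)) / 2.0 for t in v)  # iid Laplace
y = (-1.0, 0.2, 0.7)
q = sum(f(act(pi, y)) for pi in itertools.permutations(range(3))) / math.factorial(3)
```

The inverse in `act` is exactly what makes `act(compose(p1, p2), x)` agree with `act(p1, act(p2, x))`; with `x[pi[i]]` instead, the composition order would be reversed.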
The next example is an extension of Example 7.13 and is related to the results in Proposition 7.6.
* Example 7.15. Suppose $X$ is a random vector in $\mathcal L_{p,n}$, $n\ge p$, which has a density $f$ with respect to Lebesgue measure $dx$ on $\mathcal L_{p,n}$. Let $T$ map $\mathcal L_{p,n}$ onto the space of $p\times p$ positive semidefinite matrices, say $S_p^+$, by $T(x) = x'x$. The problem in this example is to derive the density of $S = T(X) = X'X$. The compact group $\mathcal O_n$ acts on $\mathcal L_{p,n}$ and a group element $\Gamma\in\mathcal O_n$ sends $x$ into $\Gamma x$. It follows immediately from Proposition 1.31 that $T$ is a maximal invariant function under the action of $\mathcal O_n$ on $\mathcal L_{p,n}$. Since $dx$ is invariant under $\mathcal O_n$, Proposition 7.15 shows that the density of $S$ is
$$q(T(x)) = \int f(\Gamma x)\,\delta(d\Gamma)$$
with respect to the measure $\nu_0$ on $S_p^+$ induced by $dx$ and $T$. To find the measure $\nu_0$, we argue as in Example 7.13. Consider the particular density
$$f_0(x) = (\sqrt{2\pi})^{-np}\exp\left[-\tfrac{1}{2}\operatorname{tr}(x'x)\right]$$
on $\mathcal L_{p,n}$ so $\mathcal L(X) = N(0, I_n\otimes I_p)$. For this $f_0$, the density of $S$ is
$$q_0(S) = q_0(T(x)) = \int f_0(\Gamma x)\,\delta(d\Gamma) = (\sqrt{2\pi})^{-np}\exp\left[-\tfrac{1}{2}\operatorname{tr}(S)\right]$$
with respect to $\nu_0$. However, by Proposition 7.6, the density of $S$ with respect to $dS/|S|^{(p+1)/2}$ is
$$q_1(S) = \omega(n,p)\,|S|^{n/2}\exp\left[-\tfrac{1}{2}\operatorname{tr}(S)\right].$$
Therefore,
$$q_1(S)\,\frac{dS}{|S|^{(p+1)/2}} = q_0(S)\,\nu_0(dS)$$
so
$$\omega(n,p)\,|S|^{(n-p-1)/2}\exp\left[-\tfrac{1}{2}\operatorname{tr}(S)\right]dS = (\sqrt{2\pi})^{-np}\exp\left[-\tfrac{1}{2}\operatorname{tr}(S)\right]\nu_0(dS),$$
which shows that
$$\nu_0(dS) = (\sqrt{2\pi})^{\,np}\,\omega(n,p)\,|S|^{(n-p-1)/2}\,dS.$$
In the above argument, we have ignored the set of Lebesgue measure zero where $x\in\mathcal L_{p,n}$ has rank less than $p$. The justification for this is left to the reader. Now that $\nu_0$ has been found, the density of $S$ for a general density $f$ is obtained by calculating
$$q(T(x)) = \int f(\Gamma x)\,\delta(d\Gamma).$$
When $f(x) = h(x'x)$, then $f(\Gamma x) = h(x'x) = h(T(x))$ and $q(S) = h(S)$. In this case, the integration over $\mathcal O_n$ is trivial. Another example where the integration over $\mathcal O_n$ is not trivial is given in the next chapter when we discuss the noncentral Wishart distribution.
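The $\mathcal O_n$-invariance of $T(x) = x'x$ that drives this example is elementary: $(\Gamma x)'(\Gamma x) = x'\Gamma'\Gamma x = x'x$. The Python sketch below (an editorial illustration; the matrix and the Givens rotation are arbitrary choices, not from the text) confirms this numerically for a small case:

```python
import math

def matmul(A, B):
    # plain triple-loop matrix product for lists of lists
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# x is a 4 x 2 matrix (n = 4, p = 2)
x = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.5]]

# Gamma: a Givens rotation in coordinates (0, 2) of R^4, an element of O_4
c, s = math.cos(0.7), math.sin(0.7)
G = [[c, 0.0, -s, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [s, 0.0, c, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

S_before = matmul(transpose(x), x)        # T(x) = x'x
Gx = matmul(G, x)
S_after = matmul(transpose(Gx), Gx)       # T(Gamma x) = x' Gamma' Gamma x
```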
As motivation for the next result of this section, consider the situation discussed in Proposition 7.3. This result gives a characterization of the $\mathcal O_n$-left invariant distributions by representing each of these distributions as a product measure where one measure is a fixed $\mathcal O_n$-invariant distribution and the other measure is arbitrary. The decomposition of the space $\mathcal X$ into the product space $\mathcal F_{p,n}\times G_U^+$ provided the framework in which to state this representation of $\mathcal O_n$-left invariant distributions. In some situations, this product space structure is not available (see Example 7.5), but a product measure representation for $\mathcal O_n$-invariant distributions can be obtained. It is established below that, under some mild regularity conditions, such a representation can be given for probability measures that are invariant under any compact topological group that acts on the sample space. We now turn to the technical details.

In what follows, $G$ is a compact topological group that acts measurably on a measure space $(\mathcal X,\mathcal B)$ and $P$ is a $G$-invariant probability measure on $(\mathcal X,\mathcal B)$. The unique invariant probability measure on $G$ is denoted by $\mu$ and the symbol $U\in G$ denotes a random variable with values in $G$ and distribution $\mu$. The $\sigma$-algebra for $G$ is the Borel $\sigma$-algebra of open sets, so $U$ is a measurable function defined on some probability space with induced distribution $\mu$. Since $G$ acts on $\mathcal X$, $\mathcal X$ can be written as a disjoint union of orbits, say
$$\mathcal X = \bigcup_{\alpha\in A}\mathcal X_\alpha$$
where $A$ is an index set for the orbits and $\mathcal X_\alpha\cap\mathcal X_{\alpha'} = \varnothing$ if $\alpha\ne\alpha'$. Let $x_\alpha$ be a fixed element of $\mathcal X_\alpha = \{gx_\alpha\mid g\in G\}$. Also, set
$$\mathcal Y = \{x_\alpha\mid\alpha\in A\}\subseteq\mathcal X$$
and assume that $\mathcal Y$ is a measurable subset of $\mathcal X$. The function $T$ defined on $\mathcal X$ to $\mathcal Y$ by
$$T(x) = x_\alpha \quad\text{iff}\quad x\in\mathcal X_\alpha$$
is obviously a maximal invariant function under the action of $G$ on $\mathcal X$. It is assumed that $T$ is a measurable function from $\mathcal X$ to $\mathcal Y$ where $\mathcal Y$ has the $\sigma$-algebra inherited from $\mathcal X$. A subset $B_1\subseteq\mathcal Y$ is measurable iff $B_1 = \mathcal Y\cap B$ for some $B\in\mathcal B$. If $X\in\mathcal X$ has distribution $P$, then the maximal invariant $Y = T(X)$ has the induced distribution $Q$ defined by
$$Q(B_1) = P(T^{-1}(B_1))$$
for measurable subsets $B_1\subseteq\mathcal Y$. What we would like to show is that $P$ is represented by the product measure $\mu\times Q$ on $G\times\mathcal Y$ in the following sense. If $Y\in\mathcal Y$ has the distribution $Q$ and is independent of $U\in G$, then the random variable $Z = UY\in\mathcal X$ has the distribution $P$. In other words, $\mathcal L(X) = \mathcal L(UY)$ where $U$ and $Y$ are independent. Here, $UY$ means the group element $U$ operating on the point $Y\in\mathcal X$. The intuitive argument that suggests this representation is the following. The distribution of $X$, conditional on $T(X) = x_\alpha$, should be $G$-invariant on $\mathcal X_\alpha$ as the distribution of $X$ is $G$-invariant. But $G$ acts transitively on $\mathcal X_\alpha$ and, since $G$ is compact, there should be a unique invariant probability distribution on $\mathcal X_\alpha$ that is induced by $\mu$ on $G$. In other words, conditional on $T(X) = x_\alpha$, $X$ should have the same distribution as $Ux_\alpha$ where $U$ is "uniform" on $G$. The next result makes all of this precise.
Proposition 7.16. Consider $\mathcal X$, $\mathcal Y$, and $G$ to be as above with their respective $\sigma$-algebras. Assume that the mapping $h$ on $G\times\mathcal Y$ to $\mathcal X$ given by $h(g,y) = gy$ is measurable.

(i) If $U\in G$ and $Y\in\mathcal Y$ are independent with $\mathcal L(U) = \mu$ and $\mathcal L(Y) = Q$, then the distribution of $X = UY$ is a $G$-invariant distribution on $\mathcal X$.

(ii) If $X\in\mathcal X$ has a $G$-invariant distribution, say $P$, let the maximal invariant $Y = T(X)$ have an induced distribution $Q$ on $\mathcal Y$. Let $U\in G$ have the distribution $\mu$ and be independent of $X$. Then $\mathcal L(X) = \mathcal L(UY)$.
Proof. For the proof of (i), it suffices to show that
$$\mathcal E f(X) = \mathcal E f(gX)$$
for all integrable functions $f$ and all $g\in G$. But
$$\mathcal E f(gX) = \mathcal E f(g(UY)) = \mathcal E f((gU)Y) = \mathcal E\,\mathcal E\left[f((gU)Y)\mid Y\right] = \mathcal E\,\mathcal E\left[f(UY)\mid Y\right] = \mathcal E f(UY) = \mathcal E f(X).$$
In the above calculation, we have used the assumption that $U$ and $Y$ are independent, so conditional on $Y$, $\mathcal L(U) = \mathcal L(gU)$ for $g\in G$. To prove (ii), it suffices to show that
$$\mathcal E f(X) = \mathcal E f(UY)$$
for all integrable $f$. Since the distribution of $X$ is $G$-invariant,
$$\mathcal E f(X) = \mathcal E f(gX),\qquad g\in G.$$
Therefore,
$$\mathcal E f(X) = \mathcal E_U\,\mathcal E_X f(UX),$$
as $U$ and $X$ are independent. Thus
$$\int f(x)\,P(dx) = \iint f(gx)\,P(dx)\,\mu(dg) = \iint f(gx)\,\mu(dg)\,P(dx).$$
However, for $x\in\mathcal X_\alpha$ there exists an element $k\in G$ such that $x = kx_\alpha$. Using the definition of $T$ and the right invariance of $\mu$, we have
$$\int f(gx)\,\mu(dg) = \int f(gkx_\alpha)\,\mu(dg) = \int f(gx_\alpha)\,\mu(dg) = \int f(gT(x))\,\mu(dg).$$
Hence
$$\int f(x)\,P(dx) = \iint f(gT(x))\,\mu(dg)\,P(dx) = \iint f(gy)\,\mu(dg)\,Q(dy)$$
where the second equality follows from the definition of the induced measure $Q$. In terms of the random variables,
$$\mathcal E f(X) = \mathcal E_{U,Y} f(UY)$$
where $U$ and $Y$ are independent, as $U$ and $X$ are independent. $\square$
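Proposition 7.16 can be illustrated with the simplest compact group: the two-element sign group acting on $R$ by $x\mapsto \pm x$. In the Python sketch below (an editorial illustration, not part of the text; the support and probabilities are arbitrary choices), $P$ is a sign-symmetric distribution on $\{\pm1,\pm2\}$, $T(x) = |x|$ is the maximal invariant, and exact enumeration with rational arithmetic confirms that $\mathcal L(UY) = P$:

```python
from fractions import Fraction

# A G-invariant distribution P on {-2, -1, 1, 2} for G = {+1, -1} (sign change)
P = {-2: Fraction(1, 10), -1: Fraction(4, 10),
     1: Fraction(4, 10), 2: Fraction(1, 10)}

# Orbit representatives Y = {1, 2}; maximal invariant T(x) = |x|
Q = {1: P[-1] + P[1], 2: P[-2] + P[2]}        # induced distribution of Y = T(X)
mu = {1: Fraction(1, 2), -1: Fraction(1, 2)}  # the uniform distribution on G

# Distribution of Z = U * Y with U and Y independent
Z = {}
for u, pu in mu.items():
    for y, qy in Q.items():
        Z[u * y] = Z.get(u * y, Fraction(0)) + pu * qy
```

Using `Fraction` keeps the enumeration exact, so the product-measure representation can be checked by strict equality rather than a numerical tolerance.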
The technical advantage of Proposition 7.16 over the method discussed in Section 7.1 is that the space $\mathcal X$ is not assumed to be in one-to-one correspondence with the product space $G\times\mathcal Y$. Obviously, the mapping $h$ on $G\times\mathcal Y$ to $\mathcal X$ is onto, but $h$ will ordinarily not be one-to-one.
* Example 7.16. In this example, take $\mathcal X = S_p$, the set of all $p\times p$ symmetric matrices. The group $G = \mathcal O_p$ acts on $S_p$ by
$$\Gamma(S) = \Gamma S\Gamma',\qquad S\in S_p,\ \Gamma\in\mathcal O_p.$$
For $S\in S_p$, let
$$T(S) = Y = \operatorname{diag}(y_1,\ldots,y_p)$$
where $y_1\ge\cdots\ge y_p$ are the ordered eigenvalues of $S$ and the off-diagonal elements of $Y$ are zero. Also, let $\mathcal Y = \{Y\mid Y = T(S),\ S\in S_p\}$. The spectral theorem shows that $T$ is a maximal invariant function under the action of $\mathcal O_p$ and the elements of $\mathcal Y$ index the orbits in $S_p$. The measurability assumptions of Proposition 7.16 are easily verified, so every $\mathcal O_p$-invariant distribution on $S_p$, say $P$, has the representation given by
$$\int_{S_p} f(S)\,P(dS) = \iint f(\Gamma Y\Gamma')\,Q(dY)\,\mu(d\Gamma)$$
where $\mu$ is the uniform distribution on $\mathcal O_p$ and $Q$ is the induced distribution of $Y$. In terms of random variables, if $\mathcal L(S) = P$ and $\mathcal L(\Gamma S\Gamma') = \mathcal L(S)$ for all $\Gamma\in\mathcal O_p$, then
$$\mathcal L(S) = \mathcal L(\Psi\,T(S)\,\Psi')$$
where $\Psi$ is uniform on $\mathcal O_p$ and is independent of the matrix of eigenvalues of $S$. As a particular case, consider the probability measure $P_0$ on $S_p^+\subseteq S_p$ with the Wishart density
$$p_0(S) = \omega(n,p)\,|S|^{(n-p-1)/2}\exp\left[-\tfrac{1}{2}\operatorname{tr} S\right]I(S)$$
where $n\ge p$ and $I(S) = 1$ if $S\in S_p^+$ and is zero otherwise. That $p_0$ is a density on $S_p$ with respect to Lebesgue measure $dS$ on $S_p$ follows from Example 5.1. Also, $p_0$ is $\mathcal O_p$-invariant since $dS$ is $\mathcal O_p$-invariant and $p_0(\Gamma S\Gamma') = p_0(S)$ for all $S\in S_p$ and $\Gamma\in\mathcal O_p$. Thus the above results are applicable to this particular Wishart distribution.
The final example of this section deals with the singular value decomposition of a random $n\times p$ matrix.
* Example 7.17. The compact group $\mathcal O_n\times\mathcal O_p$ acts on the space $\mathcal L_{p,n}$ by
$$(\Gamma,\Delta)x = \Gamma x\Delta';\qquad (\Gamma,\Delta)\in\mathcal O_n\times\mathcal O_p,\ x\in\mathcal L_{p,n}.$$
For definiteness, we take $p\le n$. Define $T$ on $\mathcal L_{p,n}$ by
$$T(x) = \begin{pmatrix}\operatorname{diag}(\lambda_1,\ldots,\lambda_p)\\ 0\end{pmatrix}$$
where $\lambda_1\ge\cdots\ge\lambda_p\ge 0$ and $\lambda_1^2,\ldots,\lambda_p^2$ are the ordered eigenvalues of $x'x$. Let $\mathcal Y\subseteq\mathcal L_{p,n}$ be the range of $T$ so $\mathcal Y$ is a closed subset of $\mathcal L_{p,n}$. It is clear that $T(\Gamma x\Delta') = T(x)$ for $\Gamma\in\mathcal O_n$ and $\Delta\in\mathcal O_p$, so $T$ is invariant. That $T$ is a maximal invariant follows easily from the singular value decomposition theorem. Thus the elements of $\mathcal Y$ index the orbits in $\mathcal L_{p,n}$ and every $x\in\mathcal L_{p,n}$ can be written as
$$x = \Gamma y\Delta' = (\Gamma,\Delta)y$$
for some $y\in\mathcal Y$ and $(\Gamma,\Delta)\in\mathcal O_n\times\mathcal O_p$. The measurability assumptions of Proposition 7.16 are easily checked. Thus if $P$ is an $(\mathcal O_n\times\mathcal O_p)$-invariant probability measure on $\mathcal L_{p,n}$ and $\mathcal L(X) = P$, then
$$\mathcal L(X) = \mathcal L(\Gamma Y\Delta')$$
where $(\Gamma,\Delta)$ has a uniform distribution on $\mathcal O_n\times\mathcal O_p$, $Y$ has the distribution $Q$ induced by $T$ and $P$, and $Y$ and $(\Gamma,\Delta)$ are independent. However, we can say a bit more. Since $\mathcal O_n\times\mathcal O_p$ is a product group, the unique invariant probability measure on $\mathcal O_n\times\mathcal O_p$ is the product measure $\mu_1\times\mu_2$ where $\mu_1$ ($\mu_2$) is the unique invariant probability measure on $\mathcal O_n$ ($\mathcal O_p$). Thus $\Gamma$ and $\Delta$ are independent and each is uniform in its respective group. In summary,
$$\mathcal L(X) = \mathcal L(\Gamma Y\Delta')$$
where $\Gamma$, $Y$, and $\Delta$ are mutually independent with the distributions given above. As a particular case, consider the density
$$f_0(x) = (\sqrt{2\pi})^{-np}\exp\left[-\tfrac{1}{2}\operatorname{tr}(x'x)\right]$$
with respect to Lebesgue measure on $\mathcal L_{p,n}$. Since $f_0(\Gamma x\Delta') = f_0(x)$ and Lebesgue measure is $(\mathcal O_n\times\mathcal O_p)$-invariant, the probability measure defined by $f_0$ is $(\mathcal O_n\times\mathcal O_p)$-invariant. Therefore, when $\mathcal L(X) = N(0, I_n\otimes I_p)$, $X$ has the same distribution as $\Gamma Y\Delta'$ where $\Gamma$ and $\Delta$ are uniform and $Y$ has the induced distribution $Q$ on $\mathcal Y$.
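The maximal invariance of $T$ rests on the fact that the singular values of $\Gamma x\Delta'$ equal those of $x$. For $p = 2$ the singular values can be computed from the eigenvalues of $x'x$ by the quadratic formula, and the following Python sketch (an editorial illustration; the matrices and rotation angles are arbitrary choices, not from the text) checks the invariance:

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

def singular_values_p2(x):
    # For p = 2: lambda_1 >= lambda_2 >= 0 with lambda_i^2 the eigenvalues
    # of the 2 x 2 matrix x'x, obtained from the quadratic formula.
    S = matmul(transpose(x), x)
    tr = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    d = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return (math.sqrt((tr + d) / 2.0), math.sqrt(max((tr - d) / 2.0, 0.0)))

x = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]   # n = 3, p = 2

cg, sg = math.cos(0.4), math.sin(0.4)
G = [[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]]   # Gamma in O_3
cd, sd = math.cos(-1.1), math.sin(-1.1)
D = [[cd, -sd], [sd, cd]]                               # Delta in O_2

y = matmul(matmul(G, x), transpose(D))   # (Gamma, Delta) x = Gamma x Delta'
sv_x = singular_values_p2(x)
sv_y = singular_values_p2(y)
```

The check works because $y'y = \Delta (x'x)\Delta'$ is similar to $x'x$ and so has the same eigenvalues.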
7.6. INDEPENDENCE AND INVARIANCE
Considerations that imply the stochastic independence of an invariant function and an equivariant function are the subject of this section. To motivate the abstract discussion to follow, we begin with the familiar random sample from a univariate normal distribution. Consider $X\in\mathcal X$ with $\mathcal L(X) = N(\mu e, \sigma^2 I_n)$ where $\mu\in R$, $\sigma^2 > 0$, and $e$ is the vector of ones in $R^n$. The set $\mathcal X$ is $R^n - \operatorname{span}\{e\}$ and the reason for choosing this as the sample space is to guarantee that $\sum_{i=1}^n (x_i-\bar x)^2 > 0$ for $x\in\mathcal X$. The coordinates of $X$, say $X_1,\ldots,X_n$, are independent and $\mathcal L(X_i) = N(\mu,\sigma^2)$ for $i = 1,\ldots,n$. When $\mu$ and $\sigma^2$ are unknown parameters, the statistic $t(X) = (s,\bar X)$, where
$$\bar X = \frac{1}{n}\sum_{i=1}^n X_i,\qquad s^2 = \sum_{i=1}^n (X_i-\bar X)^2,$$
is minimal sufficient and complete. The reason for using $s$ rather than $s^2$ in the definition of $t(X)$ is based on invariance considerations. The affine group $\mathrm{Al}_1$ acts on $\mathcal X$ by
$$(a,b)x = ax + be$$
for $(a,b)\in\mathrm{Al}_1$. Let $G$ be the subgroup of $\mathrm{Al}_1$ given by $G = \{(a,b)\mid (a,b)\in\mathrm{Al}_1,\ a>0\}$ so $G$ also acts on $\mathcal X$.
The probability model for $X\in\mathcal X$ is generated by $G$ in the sense that if $Z\in\mathcal X$ and $\mathcal L(Z) = N(0, I_n)$, then
$$\mathcal L((a,b)Z) = \mathcal L(aZ + be) = N(be, a^2 I_n).$$
Thus the set of distributions $\mathcal P = \{N(\mu e, \sigma^2 I_n)\mid \mu\in R,\ \sigma^2>0\}$ is obtained from an $N(0,I_n)$ distribution by a group operation. For this example, the group $G$ serves as a parameter space for $\mathcal P$. Further, the statistic $t$ takes its values in $G$ and satisfies
$$t((a,b)X) = (a,b)(s,\bar X),$$
that is, $t$ evaluated at $(a,b)X = aX + be$ is the same as the group element $(a,b)$ composed with the group element $(s,\bar X)$. Thus $t$ is an equivariant function defined on $\mathcal X$ to $G$, and $G$ acts on both $\mathcal X$ and $G$. Now, which functions of $X$, say $h(X)$, might be independent of $t(X)$? Intuitively, since $t(X)$ is sufficient, $t(X)$ "contains all the information in $X$ about the parameters." Thus if $h(X)$ has a distribution that does not depend on the parameter value (such an $h(X)$ will be called ancillary), there is some reason to believe that $h(X)$ and $t(X)$ might be independent. However, the group structure given above provides a method for constructing ancillary statistics. If $h$ is an invariant function of $X$, then the distribution of $h$ is an invariant function of the parameter $(\mu,\sigma^2)$. But the group $G$ acts transitively on the parameter space (i.e., $G$), so any invariant function will be ancillary. Also, $h$ is invariant iff $h$ is a function of a maximal invariant statistic. This suggests that a maximal invariant statistic will be independent of $t(X)$. Consider the statistic
$$Z(X) = (t(X))^{-1}X = \frac{X - \bar X e}{s},$$
where the inverse on $t(X)$ denotes the group inverse in $G$. The verification that $Z(X)$ is maximal invariant partially justifies choosing $t$ to have values in $G$. For $(a,b)\in G$,
$$Z((a,b)X) = (t((a,b)X))^{-1}(a,b)X = ((a,b)t(X))^{-1}(a,b)X = (t(X))^{-1}(a,b)^{-1}(a,b)X = (t(X))^{-1}X = Z(X),$$
so $Z$ is invariant. Also, if
$$(t(x))^{-1}x = Z(x) = Z(y) = (t(y))^{-1}y,$$
then
$$y = t(y)(t(x))^{-1}x,$$
so $x$ and $y$ are in the same orbit. Thus $Z$ is maximal invariant and is an ancillary statistic. That $Z(X)$ and $t(X)$ are stochastically independent for each value of $\mu$ and $\sigma^2$ follows from Basu's Theorem given in the Appendix. The whole purpose of this discussion was to show that sufficiency coupled with invariance suggested the independence of $Z(X)$ and $t(X)$. The role of the equivariance of $t$ is not completely clear, but it is essential in the more abstract treatment that follows.
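The invariance computation for $Z(X)$ can also be checked numerically: standardizing a sample by its mean and by $s$ removes the effect of any $(a,b)$ with $a>0$. A Python sketch (an editorial illustration with arbitrary data; not part of the text):

```python
import math

def t_stat(x):
    # t(x) = (s, xbar) with xbar the sample mean and s^2 = sum (x_i - xbar)^2
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((xi - xbar) ** 2 for xi in x))
    return s, xbar

def Z(x):
    # Z(x) = (t(x))^{-1} x = (x - xbar * e) / s, the maximal invariant
    s, xbar = t_stat(x)
    return [(xi - xbar) / s for xi in x]

x = [1.0, 2.5, -0.5, 4.0]
a, b = 3.0, -2.0                      # a group element (a, b) with a > 0
gx = [a * xi + b for xi in x]         # (a, b)x = ax + be
z1, z2 = Z(x), Z(gx)
```

Affine transformation scales both the deviations and $s$ by $a$ and shifts the mean by $b$, so the standardized vector is unchanged up to floating-point error.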
Let $P_0$ be a fixed probability on $(\mathcal X,\mathcal B)$ and suppose that $G$ is a group that acts measurably on $(\mathcal X,\mathcal B)$. Consider a measurable function $t$ on $(\mathcal X,\mathcal B)$ to $(\mathcal Y,\mathcal C_1)$ and assume that $\bar G$ is a homomorphic image of $G$ that acts transitively on $(\mathcal Y,\mathcal C_1)$ and that
$$t(gx) = \bar g\,t(x);\qquad x\in\mathcal X,\ g\in G.$$
Thus $t$ is an equivariant function. For technical reasons that become apparent later, it is assumed that $G$ is a locally compact and $\sigma$-compact topological group endowed with the Borel $\sigma$-algebra. Also, the mapping $(g,y)\to\bar g y$ from $G\times\mathcal Y$ to $\mathcal Y$ is assumed to be jointly measurable.
Now, let $h$ be a measurable function on $(\mathcal X,\mathcal B)$ to $(\mathcal Z,\mathcal C_2)$ that is $G$-invariant. If $X\in\mathcal X$ and $\mathcal L(X) = P_0$, we want to find conditions under which $Y = t(X)$ and $Z = h(X)$ are stochastically independent. The following informal argument, which is made precise later, suggests the conditions needed. To show that $Y$ and $Z$ are independent, it is sufficient to verify that, for all bounded measurable functions $f$ on $(\mathcal Z,\mathcal C_2)$,
$$H(y) = \mathcal E_{P_0}\left(f(h(X))\mid t(X) = y\right)$$
is constant for $y\in\mathcal Y$. That this condition is sufficient follows by integrating $H$ with respect to the induced distribution of $Y$, say $Q_0$. More precisely, if $k$ is a bounded function on $(\mathcal Y,\mathcal C_1)$ and $H(y) = H(y_0)$ for $y\in\mathcal Y$, then
$$\mathcal E_{P_0}\left[k(t(X))f(h(X))\right] = \int \mathcal E_{P_0}\left[k(t(X))f(h(X))\mid t(X)=y\right]Q_0(dy) = \int k(y)\,\mathcal E_{P_0}\left[f(h(X))\mid t(X)=y\right]Q_0(dy) = \int k(y)H(y)\,Q_0(dy) = H(y_0)\int k(y)\,Q_0(dy)$$
and this implies independence. The assumption that H is constant justifies the next to the last equality, while the last equality follows from

H(y₀) = ∫ H(y) Q₀(dy) = E_P₀[f(h(X))]

when H is constant. Thus under what conditions will H be constant? Since Ḡ acts transitively on 𝒴, if H is Ḡ-invariant, then H must be constant, and conversely. However,

H(ḡ⁻¹y) = E_P₀[f(h(X))|t(X) = ḡ⁻¹y] = E_P₀[f(h(X))|ḡt(X) = y]
  = E_P₀[f(h(X))|t(gX) = y] = E_P₀[f(h(gX))|t(gX) = y]
  = E_gP₀[f(h(X))|t(X) = y].
The equivariance of t and the invariance of h justify the third and fourth equalities, while the last equality is a consequence of ℒ(gX) = gP₀ when ℒ(X) = P₀. Now, if t(X) is a sufficient statistic for the family 𝒫 = {gP₀|g ∈ G}, then the last member of the above string of equalities is just H(y). Under this sufficiency assumption, H(y) = H(ḡ⁻¹y) so H is invariant and hence is a constant. The technical problem with this argument is caused by the nonuniqueness of conditional expectations. The conclusion that H(y) = H(ḡ⁻¹y) should really be H(y) = H(ḡ⁻¹y) except for y ∈ N_g where N_g is a set of Q₀ measure zero. Since this null set can depend on g, even the conclusion that H is a constant a.e. (Q₀) is not justified without some further work. Once these technical problems are overcome, we prove that, if t(X) is sufficient for {gP₀|g ∈ G}, then for each g ∈ G, h(X) and t(X) are stochastically independent when ℒ(X) = gP₀. The first gap to fill concerns almost invariant functions.

Definition 7.6. Let (𝒳₁, 𝓑₁) be a measurable space that is acted on measurably by a group G₁. If μ is a σ-finite measure on (𝒳₁, 𝓑₁) and f is a real-valued Borel measurable function, f is almost G₁-invariant if for each g ∈ G₁, the set N_g = {x|f(x) ≠ f(gx)} has μ measure zero.

The following result shows that under certain conditions, an almost G₁-invariant function is equal a.e. (μ) to a G₁-invariant function.
Proposition 7.17. Suppose that G₁ acts measurably on (𝒳₁, 𝓑₁) and that G₁ is a locally compact and σ-compact topological group with the Borel σ-algebra. Assume that the mapping (g, x) → gx from G₁ × 𝒳₁ to 𝒳₁ is measurable. If μ is a σ-finite measure on (𝒳₁, 𝓑₁) and f is a bounded almost G₁-invariant function, then there exists a measurable invariant function f₁ such that f = f₁ a.e. (μ).

Proof. This follows from Theorem 4, p. 227 of Lehmann (1959) and the proof is not repeated here. □
The next technical problem has to do with conditional expectations.
Proposition 7.18. In the notation introduced earlier, suppose (𝒳, 𝓑) and (𝒴, 𝓒₁) are measurable spaces acted on by groups G and Ḡ where Ḡ is a homomorphic image of G. Assume that T is an equivariant function from 𝒳 to 𝒴. Let P₀ be a probability measure on (𝒳, 𝓑) and let Q₀ be the induced distribution of T(X) when ℒ(X) = P₀. If f is a bounded G-invariant function on 𝒳, let

H(y) = E_P₀(f(X)|T(X) = y)

and

H₁(y) = E_gP₀(f(X)|T(X) = y).

Then H₁(ḡy) = H(y) a.e. (Q₀) for each fixed g ∈ G.

Proof. The conditional expectations are well defined since f is bounded. H(y) is the unique a.e. (Q₀) function that satisfies the equation

∫ k(y)H(y) Q₀(dy) = ∫ k(T(x))f(x) P₀(dx)

for all bounded measurable k. The probability measure gP₀ satisfies the equation

∫ f₁(x) (gP₀)(dx) = ∫ f₁(gx) P₀(dx)

for all bounded f₁. Since T is equivariant, this implies that if ℒ(X) = gP₀, then ℒ(T(X)) = ḡQ₀. Using this, the invariance of f, and the characterizing
property of conditional expectation, we have for all bounded k,

∫ H(y)k(y) Q₀(dy) = ∫ f(x)k(T(x)) P₀(dx)
  = ∫ f(gx)k(ḡ⁻¹T(gx)) P₀(dx)
  = ∫ f(x)k(ḡ⁻¹T(x)) (gP₀)(dx)
  = ∫ H₁(y)k(ḡ⁻¹y) (ḡQ₀)(dy)
  = ∫ H₁(ḡy)k(ḡ⁻¹ḡy) Q₀(dy)
  = ∫ H₁(ḡy)k(y) Q₀(dy).

Since the first and the last terms in this equality are equal for all bounded k, we have that H(y) = H₁(ḡy) a.e. (Q₀). □
With the technical problems out of the way, the main result of this section can be proved.
Proposition 7.19. Consider measurable spaces (𝒳, 𝓑) and (𝒴, 𝓒₁), which are acted on measurably by groups G and Ḡ where Ḡ is a homomorphic image of G. It is assumed that the conditions of Proposition 7.17 hold for the group Ḡ and the space (𝒴, 𝓒₁), and that Ḡ acts transitively on 𝒴. Let T on 𝒳 to 𝒴 be measurable and equivariant. Also let (𝒵, 𝓒₂) be a measurable space and let h be a G-invariant measurable function from 𝒳 to 𝒵. For a random variable X ∈ 𝒳 with ℒ(X) = P₀, set Y = T(X) and Z = h(X) and assume that T(X) is a sufficient statistic for the family {gP₀|g ∈ G} of distributions on (𝒳, 𝓑). Under these assumptions, Y and Z are independent when ℒ(X) = gP₀, g ∈ G.

Proof. First we prove that Y and Z are independent when ℒ(X) = P₀. Fix a bounded measurable function f on 𝒵 and let

H_g(y) = E_gP₀(f(h(X))|T(X) = y).

Since T(X) is a sufficient statistic, there is a measurable function H on 𝒴 such that

H_g(y) = H(y) for y ∉ N_g

where N_g is a set of ḡQ₀-measure zero. Thus (ḡQ₀)(N_g) = Q₀(ḡ⁻¹N_g) = 0.
Let e denote the identity in G. We now claim that H is a Q₀ almost Ḡ-invariant function. By Proposition 7.18, H_e(y) = H_g(ḡy) a.e. (Q₀). However, H(y) = H_e(y) a.e. Q₀ and H_g(ḡy) = H(ḡy) for ḡy ∉ N_g, where Q₀(ḡ⁻¹N_g) = 0. Thus H_g(ḡy) = H(ḡy) a.e. Q₀, and this implies that H(y) = H(ḡy) a.e. Q₀. Therefore, there exists a Ḡ-invariant measurable function, say H̄, such that H = H̄ a.e. Q₀. Since Ḡ is transitive on 𝒴, H̄ must be a constant, so H is a constant a.e. Q₀. Therefore,

H_e(y) = E_P₀(f(h(X))|T(X) = y)

is a constant a.e. Q₀ and, as noted earlier, this implies that Z = h(X) and Y = T(X) are independent when ℒ(X) = P₀. When ℒ(X) = g₁P₀, let P̃₀ = g₁P₀ and note that {gP̃₀|g ∈ G} = {gP₀|g ∈ G} so T(X) is sufficient for {gP̃₀|g ∈ G}. The argument given for P₀ now applies for P̃₀. Thus Z and Y are independent when ℒ(X) = g₁P₀. □
A few comments concerning this result are in order. Since G acts transitively on {gP₀|g ∈ G} and Z = h(X) is G-invariant, the distribution of Z is the same under each gP₀, g ∈ G. In other words, Z is an ancillary statistic. Basu's Theorem, given in the Appendix, asserts that a sufficient statistic, whose induced family of distributions is complete, is independent of an ancillary statistic. Although no assumptions concerning invariance are made in the statement of Basu's Theorem, most applications are to problems where invariance is used to show a statistic is ancillary. In Proposition 7.19, the completeness assumption of Basu's Theorem has been replaced by the invariance assumptions and, most particularly, by the assumption that the group Ḡ acts transitively on the space 𝒴.
* Example 7.18. The normal distribution example at the beginning of this section provided a situation where the sample mean and sample variance are independent of a scale and translation invariant statistic. We now consider a generalization of that situation. Let 𝒳 = Rⁿ − span{e} where e is the vector of ones in Rⁿ, and suppose that a random vector X ∈ 𝒳 has a density f(‖x‖²) with respect to Lebesgue measure dx on 𝒳. The group G in the example at the beginning of this section acts on 𝒳 by

(a, b)x = ax + be, (a, b) ∈ G.

Consider the statistic t(X) = (s, X̄) where

X̄ = n⁻¹ Σᵢ₌₁ⁿ Xᵢ and s² = Σᵢ₌₁ⁿ (Xᵢ − X̄)².
Then t takes values in G and satisfies

t((a, b)X) = (a, b)t(X)

for (a, b) ∈ G. It is shown that t(X) and the G-invariant statistic

Z(X) = (t(X))⁻¹X = s⁻¹(X − X̄e)

are independent. The verification that Z(X) is invariant goes as follows:

Z((a, b)X) = (t((a, b)X))⁻¹(a, b)X
  = ((a, b)t(X))⁻¹(a, b)X = (t(X))⁻¹X = Z(X).

To apply Proposition 7.19, let P₀ be the probability measure with density f(‖x‖²) on 𝒳 and let 𝒴 = Ḡ = G. Thus t(X) is equivariant and Z(X) is invariant. The sufficiency of t(X) for the parametric family {gP₀|g ∈ G} is established by using the factorization theorem. For (a, b) ∈ G, it is not difficult to show that (a, b)P₀ has a density k(x|a, b) with respect to dx given by

k(x|a, b) = a⁻ⁿ f(‖(x − be)/a‖²), x ∈ 𝒳.

Since

‖(x − be)/a‖² = a⁻²(Σᵢ₌₁ⁿ xᵢ² − 2b Σᵢ₌₁ⁿ xᵢ + nb²),

the density k(x|a, b) is a function of Σxᵢ² and Σxᵢ, so the pair (ΣXᵢ², ΣXᵢ) is a sufficient statistic for the family {gP₀|g ∈ G}. However, t(X) = (s, X̄) is a one-to-one function of (ΣXᵢ², ΣXᵢ) so t(X) is a sufficient statistic. The remaining assumptions of Proposition 7.19 are easily verified. Therefore, t(X) and Z(X) are independent under each of the measures (a, b)P₀ for (a, b) ∈ G.
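Since the argument above uses only the form f(‖x‖²) of the density, the independence holds for any spherical P₀, not just the normal. The following sketch (an assumption-laden illustration, not from the text: a multivariate t vector built as a normal scale mixture is spherical, and n and the degrees of freedom are arbitrary choices) checks this numerically.

```python
# For a spherical multivariate t sample, t(X) = (s, Xbar) and
# Z(X) = (X - Xbar e)/s should still be independent, so the sample
# correlations below should vanish up to Monte Carlo error.
import numpy as np

rng = np.random.default_rng(1)
n, df, reps = 8, 5, 200_000
G = rng.standard_normal((reps, n))
W = rng.chisquare(df, size=reps)
X = G / np.sqrt(W / df)[:, None]     # X has density f(||x||^2) (multivariate t)

xbar = X.mean(axis=1)
s = np.sqrt(((X - xbar[:, None]) ** 2).sum(axis=1))
Z1 = (X[:, 0] - xbar) / s

print(abs(np.corrcoef(Z1, xbar)[0, 1]) < 0.01)
print(abs(np.corrcoef(Z1, s)[0, 1]) < 0.01)
```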
Before proceeding with the next example, an extension of Proposition 7.1 is needed.
Proposition 7.20. Consider the space ℒ_{p,n}, n ≥ p, and let Q be an n × n rank k orthogonal projection. If k ≥ p, then the set

B = {x|x ∈ ℒ_{p,n}, rank(Qx) < p}

has Lebesgue measure zero.
Proof. Let X ∈ ℒ_{p,n} be a random vector with ℒ(X) = N(0, Iₙ ⊗ I_p) = P₀. It suffices to show that P₀(B) = 0 since P₀ and Lebesgue measure are mutually absolutely continuous. Also, write Q as

Q = Γ′DΓ, Γ ∈ Oₙ,

where

D = ( I_k  0 )
    (  0   0 ).

Since

rank(Γ′DΓx) = rank(DΓx)

and ℒ(ΓX) = ℒ(X), it suffices to show that

P₀(rank(DX) < p) = 0.

Now, partition X as

X = ( X₁ )
    ( X₂ ),  X₁: k × p,

so

DX = ( X₁ )
     (  0 ).

Thus rank(DX) = rank(X₁). Since k ≥ p and ℒ(X₁) = N(0, I_k ⊗ I_p), Proposition 7.1 implies that X₁ has rank p with probability one. Thus P₀(B) = 0 so B has Lebesgue measure zero. □
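A quick numerical illustration of Proposition 7.20 (a sketch, with n, p, k chosen arbitrarily subject to k ≥ p): for a rank k orthogonal projection Q and a Gaussian matrix X, the product QX comes out with full rank p in every trial.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 10, 3, 5                     # requires k >= p
V, _ = np.linalg.qr(rng.standard_normal((n, k)))
Q = V @ V.T                            # rank k orthogonal projection on R^n

ranks = [np.linalg.matrix_rank(Q @ rng.standard_normal((n, p)))
         for _ in range(100)]
print(all(r == p for r in ranks))
```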
* Example 7.19. This is a generalization of Example 7.18 and deals with the general multivariate linear model discussed in Chapter 4.
As in Example 4.4, let M be a linear subspace of ℒ_{p,n} defined by

M = {x|x ∈ ℒ_{p,n}, x = ZB, B ∈ ℒ_{p,k}}

where Z is a fixed n × k matrix of rank k. For reasons that are apparent in a moment, it is assumed that n − k ≥ p. The orthogonal projection onto M relative to the natural inner product ⟨·, ·⟩ on ℒ_{p,n} is P_M = P_Z ⊗ I_p where

P_Z = Z(Z′Z)⁻¹Z′

is a rank k orthogonal projection on Rⁿ. Also, Q_M = Q_Z ⊗ I_p is the orthogonal projection onto M⊥, where Q_Z = Iₙ − P_Z is a rank n − k orthogonal projection on Rⁿ. For this example, the sample space 𝒳 is

𝒳 = {x|x ∈ ℒ_{p,n}, rank(Q_Z x) = p}.

Since n − k ≥ p, Proposition 7.20 implies that the complement of 𝒳 has Lebesgue measure zero in ℒ_{p,n}. In this example, the group G has elements that are pairs (T, u) with T ∈ G_T⁺, where T is p × p, and u ∈ M. The group operation is

(T₁, u₁)(T₂, u₂) = (T₁T₂, u₁ + u₂T₁′)

and the action of G on 𝒳 is

(T, u)x = xT′ + u.

For this example, 𝒴 = Ḡ = G and t on 𝒳 to G is defined by

t(x) = (T(x), P_M x) ∈ G

where T(x) is the unique element in G_T⁺ such that x′Q_Z x = T(x)T′(x). The assumption that n − k ≥ p insures that x′Q_Z x has rank p. It is now routine to verify that

t((T, u)x) = (T, u)t(x)

for x ∈ 𝒳 and (T, u) ∈ G. Using this relationship, the function

h(x) = (t(x))⁻¹x
is easily shown to be a maximal invariant under the action of G on 𝒳. Now consider a random vector X ∈ 𝒳 with ℒ(X) = P₀ where P₀ has a density f(⟨x, x⟩) with respect to Lebesgue measure on 𝒳. We apply Proposition 7.19 to show that t(X) and h(X) are independent under gP₀ for each g ∈ G. Since 𝒴 = Ḡ = G, t is an equivariant function and Ḡ acts transitively on 𝒴. The measurability assumptions of Proposition 7.19 are easily checked. It remains to show that t(X) is a sufficient statistic for the family {gP₀|g ∈ G}. For g = (T, μ) ∈ G, gP₀ has a density given by

p(x|(T, μ)) = |TT′|^{-n/2} f(⟨(x − μ)(TT′)⁻¹, x − μ⟩).

Letting Σ = TT′ and arguing as in Example 4.4, it follows that

⟨(x − μ)Σ⁻¹, x − μ⟩ = ⟨(P_M x − μ)Σ⁻¹, P_M x − μ⟩ + tr(Σ⁻¹x′Q_Z x)

since μ ∈ M. Therefore, the density p(x|(T, μ)) is a function of the pair (x′Q_Z x, P_M x) so this pair is a sufficient statistic for the family {gP₀|g ∈ G}. However, T(x) is a one-to-one function of x′Q_Z x, so

t(x) = (T(x), P_M x)

is also a sufficient statistic. Thus Proposition 7.19 implies that t(X) and h(X) are stochastically independent under each probability measure gP₀ for g ∈ G. Of course, the choice of f that motivated this example is

f(w) = (√(2π))^{-np} exp[−w/2]

so that P₀ is the probability measure of a N(0, Iₙ ⊗ I_p) distribution on 𝒳.

One consequence of Proposition 7.19 is that the statistic h(X) is ancillary. But for the case at hand, we now describe the distribution of h(X) and show that its distribution does not even depend on the particular density f used to define P₀. Recall that

h(x) = (t(x))⁻¹x = (x − P_M x)(T′(x))⁻¹ = (Q_Z x)(T′(x))⁻¹

where T(x)T′(x) = x′Q_Z x and T(x) ∈ G_T⁺. Fix x ∈ 𝒳 and set
Ψ = (Q_Z x)(T′(x))⁻¹. First note that

Ψ′Ψ = (T(x))⁻¹x′Q_Z x(T′(x))⁻¹ = I_p

so Ψ is a linear isometry. Let N be the orthogonal complement in Rⁿ of the linear subspace spanned by the columns of the matrix Z. Clearly, dim(N) = n − k and the range of Ψ is contained in N since Q_Z is the orthogonal projection onto N. Therefore, Ψ is an element of the space

𝓕_p(N) = {Ψ|Ψ′Ψ = I_p, range(Ψ) ⊆ N}.

Further, the group

H = {Γ|Γ ∈ Oₙ, Γ(N) = N}

is compact and acts transitively on 𝓕_p(N) under the group action

Ψ → ΓΨ, Ψ ∈ 𝓕_p(N), Γ ∈ H.

Now, we return to the original problem of describing the distribution of W = h(X) when ℒ(X) = P₀. The above argument shows that W ∈ 𝓕_p(N). Since the compact group H acts transitively on 𝓕_p(N), there is a unique H-invariant probability measure ν on 𝓕_p(N). It will be shown that ℒ(W) = ν by proving ℒ(ΓW) = ℒ(W) for all Γ ∈ H. It is not difficult to verify that ΓQ_Z = Q_ZΓ for Γ ∈ H. Since ℒ(ΓX) = ℒ(X) and T(ΓX) = T(X), we have

ℒ(ΓW) = ℒ(Γh(X)) = ℒ(ΓQ_Z X(T′(X))⁻¹)
  = ℒ(Q_Z ΓX(T′(ΓX))⁻¹) = ℒ(Q_Z X(T′(X))⁻¹)
  = ℒ(h(X)) = ℒ(W).

Therefore, the distribution of W is H-invariant so ℒ(W) = ν.
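The algebra of Example 7.19 can be traced in a few lines of code. This is a sketch under assumed dimensions (n, k, p arbitrary with n − k ≥ p), taking T(x) to be the Cholesky factor of x′Q_Z x as one concrete choice of the element of G_T⁺ with x′Q_Z x = T(x)T′(x); it confirms that W = (Q_Z x)(T′(x))⁻¹ is a linear isometry whose range lies in N, the orthogonal complement of the column space of Z.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, p = 12, 3, 4                       # needs n - k >= p
Z = rng.standard_normal((n, k))
Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # P_Z = Z (Z'Z)^{-1} Z'
Qz = np.eye(n) - Pz                      # projection onto N = (col Z)^perp

x = rng.standard_normal((n, p))
T = np.linalg.cholesky(x.T @ Qz @ x)     # lower triangular, x'Q_Z x = T T'
W = Qz @ x @ np.linalg.inv(T.T)          # W = (Q_Z x)(T')^{-1}

print(np.allclose(W.T @ W, np.eye(p)))   # W'W = I_p: a linear isometry
print(np.allclose(Pz @ W, 0))            # range(W) is contained in N
```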
Further applications of Proposition 7.19 occur in the next three chapters. In particular, this result is used to derive the distribution of the determinant of a certain matrix that arises in testing problems for the general linear model.
PROBLEMS
1. Suppose the random n × p matrix X ∈ 𝒳 (𝒳 as in Section 7.1) has a density given by f(x) = k|x′x|^γ exp[−½ tr x′x] with respect to dx. The constant k depends on n, p, and γ (see Problem 6.10). Derive the density of S = X′X and the density of U in the representation X = ΨU with U ∈ G_T⁺ and Ψ ∈ 𝓕_{p,n}.
2. Suppose X ∈ 𝒳 has an Oₙ-left invariant distribution. Let P(X) = X(X′X)⁻¹X′ and S(X) = X′X. Prove that P(X) and S(X) are independent.
3. Let Q be an n × n non-negative definite matrix of rank r and set A = {x|x ∈ ℒ_{p,n}, x′Qx has rank p}. Show that, if r ≥ p, then Aᶜ has Lebesgue measure zero.
4. With 𝒳 as in Section 7.1, Oₙ × Gl_p acts on 𝒳 by x → ΓxA′ for Γ ∈ Oₙ and A ∈ Gl_p. Also, Oₙ × Gl_p acts on 𝒮_p⁺ by S → ASA′. Show that φ(x) = kx′x is equivariant for each constant k > 0. Are these the only equivariant functions?
5. The permutation group 𝒫ₙ acts on Rⁿ via matrix multiplication x → gx, g ∈ 𝒫ₙ. Let 𝒴 = {y|y ∈ Rⁿ, y₁ ≤ y₂ ≤ ⋯ ≤ yₙ}. Define f: Rⁿ → 𝒴 by f(x) is the vector of ordered values of the set {x₁,…, xₙ} with multiple values listed.
(i) Show f is a maximal invariant.
(ii) Set I₀(u) = 1 if u ≥ 0 and 0 if u < 0. Define Fₙ(t) = n⁻¹ Σᵢ₌₁ⁿ I₀(t − xᵢ) for t ∈ R¹. Show Fₙ is also a maximal invariant.
6. Let M be a proper subspace of the inner product space (V, (·, ·)). Let A₀ be defined by A₀x = −x for x ∈ M and A₀x = x for x ∈ M⊥.
(i) Verify that the set of pairs (B, y), with y ∈ M and B either A₀ or the identity I, forms a subgroup of the affine group Al(V). Let G be this group.
(ii) Show that G acts on M and on V.
(iii) Suppose t: V → M is equivariant (t(Bx + y) = Bt(x) + y for (B, y) ∈ G and x ∈ V). Prove that t(x) = P_M x.
7. Let M be a subspace of Rⁿ (M ≠ Rⁿ) so the complement of 𝒳 = Rⁿ − M has Lebesgue measure zero. Suppose X ∈ 𝒳 has a density given by

p(x|μ, σ) = σ⁻ⁿ f₀(‖x − μ‖²/σ²)

where μ ∈ M and σ > 0. Assume that ∫‖x‖²f₀(‖x‖²) dx < +∞. For a > 0, Γ ∈ O(M), and b ∈ M, the affine transformation (a, Γ, b)x = aΓx + b acts on 𝒳.
(i) Show that the probability model for X (μ ∈ M, σ > 0) is invariant under the above affine transformations. What is the induced group action on (μ, σ²)?
(ii) Show that the only equivariant estimator of μ is P_M X. Show that any equivariant estimator of σ² has the form k‖Q_M X‖² for some k > 0, where Q_M = I − P_M.
8. With 𝒳 as in Section 7.1, suppose f is a function defined on Gl_p to [0, ∞), which satisfies f(AB) = f(BA) and

∫ f(x′x) dx/|x′x|^{n/2} = 1.

(i) Show that f(x′xΣ⁻¹), Σ ∈ 𝒮_p⁺, is a density on 𝒳 with respect to dx/|x′x|^{n/2} and that under this density, the covariance (assuming it exists) is cIₙ ⊗ Σ where c > 0.
(ii) Show that the family of distributions of (i) indexed by Σ ∈ 𝒮_p⁺ is invariant under the group Oₙ × Gl_p acting on 𝒳 by (Γ, A)x = ΓxA′. Also, show that (Γ, A)Σ = AΣA′.
(iii) Show that the equivariant estimators of Σ all have the form kX′X, k > 0.
Now, assume that

sup_{C ∈ 𝒮_p⁺} f(C) = f(C₀)

where C₀ ∈ 𝒮_p⁺ is unique.
(iv) Show C₀ = αI_p for some α > 0.
(v) Find the maximum likelihood estimator of Σ (expressed in terms of X and α in (iv)).
9. In an inner product space (V, (·, ·)), suppose X has a distribution P₀.
(i) Show that ℒ(‖X‖) = ℒ(‖Y‖) whenever ℒ(Y) = gP₀, g ∈ O(V).
(ii) In the special case that ℒ(X) = ℒ(μ + Z) where μ is a fixed vector and Z has an O(V)-invariant distribution, how does the distribution of ‖X‖ depend on μ?
10. Under the assumptions of Problem 4.5, use an invariance argument to show that the distribution of F depends on (μ, σ²) only through the parameter (‖μ‖² − ‖P_ω μ‖²)/σ². What happens when μ ∈ ω?
11. Suppose X₁,…, Xₙ is a random sample from a distribution on Rᵖ (n > p) with density p(x|μ, Σ) = |Σ|^{-1/2} f((x − μ)′Σ⁻¹(x − μ)) where μ ∈ Rᵖ and Σ ∈ 𝒮_p⁺. The parameter θ = det(Σ) is sometimes called the population generalized variance. The sample generalized variance is V = det((1/n)S) where S = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Xᵢ − X̄)′. Show that the distribution of V depends on (μ, Σ) only through θ.
12. Assume the conditions under which Proposition 7.16 was proved. Given a probability Q on 𝒴, let Q̃ denote the extension of Q to 𝒳, that is, Q̃(B) = Q(B ∩ 𝒴) for B ∈ 𝓑. For g ∈ G, gQ̃ is defined in the usual way: (gQ̃)(B) = Q̃(g⁻¹B).
(i) Assume that P is a probability measure on 𝒳 and

(7.1) P = ∫_G gQ̃ μ(dg),

that is,

P(B) = ∫_G (gQ̃)(B) μ(dg); B ∈ 𝓑.

Show that P is G-invariant.
(ii) If P is G-invariant, show that (7.1) holds for some Q.
13. Under the assumptions used to prove Proposition 7.16, let 𝒫 be the set of all G-invariant distributions. Prove that T(X) is a sufficient statistic for the family 𝒫.
14. Suppose X ∈ Rⁿ has coordinates X₁,…, Xₙ that are i.i.d. N(μ, 1), μ ∈ R¹. Thus the parameter space for the distributions of X is the additive group G = R¹. The function t: Rⁿ → G given by t(x) = x̄ gives a complete sufficient statistic for the model for X. Also, G acts on Rⁿ by gx = x + ge where e ∈ Rⁿ is the vector of ones.
(i) Show that t(gx) = gt(x) and that Z(X) = (t(X))⁻¹X is an ancillary statistic. Here, (t(X))⁻¹ means the group inverse of t(X) ∈ G so (t(X))⁻¹X = X − X̄e. What is the distribution of Z(X)?
(ii) Suppose we want to find a minimum variance unbiased estimator (MVUE) of h(μ) = E_μ f(X₁) where f is a given function. The Rao-Blackwell Theorem asserts that the MVUE is E(f(X₁)|t(X) = t). Show that this conditional expectation is

∫_{−∞}^{∞} f(z + t) (√(2π) δ)⁻¹ exp[−z²/(2δ²)] dz
where δ² = var(X₁ − X̄) = (n − 1)/n. Evaluate this for f(x) = 1 if x ≤ u₀ and f(x) = 0 if x > u₀.
(iii) What is the MVUE of the parametric function (√(2π))⁻¹ exp[−½(x₀ − μ)²] where x₀ is a fixed number?
15. Using the notation, results, and assumptions of Example 7.18, find an unbiased estimator based on t(X) of the parametric function h(a, b) = ((a, b)P₀)(X₁ ≤ u₀) where u₀ is a fixed number and X₁ is the first coordinate of X. Express the answer in terms of the distribution of Z₁, the first coordinate of Z. What is this distribution? In the case that P₀ is the N(0, Iₙ) distribution, show this gives a MVUE for h(a, b).
16. This problem contains an abstraction of the technique developed in Problems 14 and 15. Under the conditions used to prove Proposition 7.19, assume the space (𝒴, 𝓒₁) is (G, 𝓑_G), where 𝓑_G is the Borel σ-algebra of G, and Ḡ = G. The equivariance assumption on T then becomes T(gx) = g ∘ T(x) since T(x) ∈ G. Of course, T(X) is assumed to be a sufficient statistic for {gP₀|g ∈ G}.
(i) Let Z(X) = (T(X))⁻¹X where (T(X))⁻¹ is the group inverse of T(X). Show that Z(X) is a maximal invariant and Z(X) is ancillary. Hence Proposition 7.19 applies.
(ii) Let Q₀ denote the distribution of Z when ℒ(X) is one of the distributions gP₀, g ∈ G. Show that a version of the conditional expectation E(f(X)|T(X) = g) is E_{Q₀} f(gZ) for any bounded measurable f.
(iii) Apply the above to the case when P₀ is N(0, Iₙ ⊗ I_p) on 𝒳 (as in Section 7.1) and take G = G_T⁺. The group action is x → xT′ for x ∈ 𝒳 and T ∈ G_T⁺. The map T is T(x) = T in the representation x = ΨT′ with Ψ ∈ 𝓕_{p,n} and T ∈ G_T⁺. What is Q₀?
(iv) When X ∈ 𝒳 is N(0, Iₙ ⊗ Σ) with Σ ∈ 𝒮_p⁺, use (iii) to find a
MVUE of the parametric function
h(Σ) = (√(2π))⁻ᵖ |Σ|^{-1/2} exp[−½ u₀′Σ⁻¹u₀]

where u₀ is a fixed vector in Rᵖ.
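As an illustration of the Rao-Blackwell computation in Problem 14(ii) (a sketch; μ, u₀, and n are arbitrary choices), evaluating the integral there for the indicator f(x) = 1{x ≤ u₀} gives the estimator Φ((u₀ − X̄)/δ) with δ² = (n − 1)/n, and its unbiasedness for h(μ) = P_μ(X₁ ≤ u₀) can be verified by simulation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu, u0, n, reps = 0.7, 1.5, 6, 400_000
delta = np.sqrt((n - 1) / n)                       # delta^2 = var(X_1 - Xbar)

xbar = rng.normal(mu, 1 / np.sqrt(n), size=reps)   # Xbar ~ N(mu, 1/n)
estimate = norm.cdf((u0 - xbar) / delta).mean()    # E[ Phi((u0 - Xbar)/delta) ]
target = norm.cdf(u0 - mu)                         # h(mu) = P_mu(X_1 <= u0)

print(abs(estimate - target) < 0.005)
```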
NOTES AND REFERENCES
1. For some material related to Proposition 7.3, see Dawid (1978). The extension of Proposition 7.3 to arbitrary compact groups (Proposition 7.16) is due to Farrell (1962). A related paper is Das Gupta (1979).
2. If G acts on 𝒳 and t is a function from 𝒳 onto 𝒴, it is natural to ask if we can define a group action on 𝒴 (using t and G) so that t becomes equivariant. The obvious thing to do is to pick y ∈ 𝒴, write y = t(x), and then define gy to be t(gx). In order that this definition make sense, it is necessary (and sufficient) that whenever t(x₁) = t(x₂), then t(gx₁) = t(gx₂) for all g ∈ G. When this condition holds, it is easy to show that G then acts on 𝒴 via the above definition and t is equivariant. For some further discussion, see Hall, Wijsman, and Ghosh (1965).
3. Some of the early work on invariance by Stein, and Hunt and Stein, first appeared in print in the work of other authors. For example, the
famous Hunt-Stein Theorem given in Lehmann (1959) was established in 1946 but was never published. This early work laid the foundation
for much of the material in this chapter. Other early invariance works
include Hotelling (1931), Pitman (1939), and Peisakoff (1950). The paper by Kiefer (1957) contains a generalization of the Hunt-Stein
Theorem. For some additional discussion on the development of invariance arguments, see Hall, Wijsman, and Ghosh (1965).
4. Proposition 7.15 is probably due to Stein, but I do not know a
reference.
5. Make the assumptions on 𝒳, 𝒴, and G that lead to Proposition 7.16, and note that 𝒴 is just a particular representation of the quotient space 𝒳/G. If ν is any σ-finite G-invariant measure on 𝒳, let δ be the measure on 𝒴 defined by

δ(C) = ν(t⁻¹(C)), C ∈ 𝓒₁.

Then (see Lehmann, 1959, p. 39),

∫ h(t(x)) ν(dx) = ∫ h(y) δ(dy)

for all measurable functions h. The proof of Proposition 7.16 shows that for any ν-integrable function f, the equation

(7.2) ∫ f(x) ν(dx) = ∫∫ f(gy) μ(dg) δ(dy)

holds. In an attempt to make sense of (7.2) when G is not compact, let
μᵣ denote a right invariant measure on G. For f ∈ 𝒦(𝒳), the continuous functions on 𝒳 with compact support, set

f̃(x) = ∫ f(gx) μᵣ(dg).

Assuming this integral is well defined (it may not be in certain examples, e.g., 𝒳 = Rⁿ − {0} and G = Glₙ), it follows that f̃(hx) = f̃(x) for h ∈ G. Thus f̃ is invariant and can be regarded as a function on 𝒴 = 𝒳/G. For any measure δ on 𝒴, write ∫ f̃ dδ to mean the integral of f̃, expressed as a function of y, with respect to the measure δ. In this case, the right-hand side of (7.2) becomes

J(f) = ∫(∫ f(gy) μᵣ(dg)) δ(dy) = ∫ f̃ dδ.

However, for h ∈ G, it is easy to show that

J(hf) = Δᵣ⁻¹(h)J(f)

so J is a relatively invariant integral. As usual, Δᵣ is the right-hand modulus of G. Thus the left-hand side of (7.2) must also be relatively invariant with multiplier Δᵣ⁻¹. The argument thus far shows that when μ in (7.2) is replaced by μᵣ (this choice looks correct so that the inside integral defines an invariant function), the resulting integral J is relatively invariant with multiplier Δᵣ⁻¹. Hence the only possible measures ν for which (7.2) can hold must be relatively invariant with multiplier Δᵣ⁻¹. However, given such a ν, further assumptions are needed in order that (7.2) hold for some δ (when G is not compact and μ is replaced by μᵣ). Some examples where (7.2) is valid for noncompact groups are given in Stein (1956), but the first systematic account of such a result is Wijsman (1966), who uses some Lie group theory. A different approach due to Schwarz is reported in Farrell (1976). The description here follows Andersson (1982) most closely.
6. Proposition 7.19 is a special case of a result in Hall, Wijsman, and
Ghosh (1965). Some version of this result was known to Stein but never
published by him. The development here is a modification of that which
I learned from Bondesson (1977).
CHAPTER 8
The Wishart Distribution
The Wishart distribution arises in a natural way as a matrix generalization of the chi-square distribution. If X₁,…, Xₙ are independent with ℒ(Xᵢ) = N(0, 1), then Σᵢ₌₁ⁿ Xᵢ² has a chi-square distribution with n degrees of freedom. When the Xᵢ are random vectors rather than real-valued random variables, say Xᵢ ∈ Rᵖ with ℒ(Xᵢ) = N(0, I_p), one possible way to generalize the above sum of squares is to form the p × p positive semidefinite matrix S = Σᵢ₌₁ⁿ XᵢXᵢ′. Essentially, this representation of S is used to define a Wishart distribution. As with the definition of the multivariate normal distribution, our definition of the Wishart distribution is not in terms of a density function and allows for Wishart distributions that are singular. In fact, most of the properties of the Wishart distribution are derived without reference to densities by exploiting the representation of the Wishart in terms of normal random vectors. For example, the distribution of a partitioned Wishart matrix is obtained by using properties of conditioned normal random vectors.

After formally defining the Wishart distribution, the characteristic function and convolution properties of the Wishart are derived. Certain generalized quadratic forms in normal random vectors are shown to have Wishart distributions and the basic decomposition of the Wishart into submatrices is given. The remainder of the chapter is concerned with the noncentral Wishart distribution in the rank one case and certain distributions that arise in connection with likelihood ratio tests.
8.1. BASIC PROPERTIES
The Wishart distribution, or more precisely, the family of Wishart distributions, is indexed by a p × p positive semidefinite symmetric matrix Σ, by a
dimension parameter p, and by a degrees of freedom parameter n. Formally, we have the following definition.
Definition 8.1. A random p × p symmetric matrix S has a Wishart distribution with parameters Σ, p, and n if there exist independent random vectors X₁,…, Xₙ in Rᵖ such that ℒ(Xᵢ) = N(0, Σ), i = 1,…, n, and

ℒ(S) = ℒ(Σᵢ₌₁ⁿ XᵢXᵢ′).

In this case, we write ℒ(S) = W(Σ, p, n).
In the above definition, p and n are positive integers and Σ is a p × p positive semidefinite matrix. When p = 1, it is clear that the Wishart distribution is just a chi-square distribution with n degrees of freedom and scale parameter Σ ≥ 0. When Σ = 0, then Xᵢ = 0 with probability one, so S = 0 with probability one. Since Σᵢ₌₁ⁿ XᵢXᵢ′ is positive semidefinite, the Wishart distribution has all of its mass on the set of positive semidefinite matrices. In an abuse of notation, we often write

S = Σᵢ₌₁ⁿ XᵢXᵢ′

when ℒ(S) = W(Σ, p, n). As distributional questions are the primary concern in this chapter, this abuse causes no technical problems. If X ∈ ℒ_{p,n} has rows X₁′,…, Xₙ′, it is clear that ℒ(X) = N(0, Iₙ ⊗ Σ) and X′X = Σᵢ₌₁ⁿ XᵢXᵢ′. Thus if ℒ(S) = W(Σ, p, n), then ℒ(S) = ℒ(X′X) where ℒ(X) = N(0, Iₙ ⊗ Σ) on ℒ_{p,n}. Also, the converse statement is clear. Some further properties of the Wishart distribution follow.
Proposition 8.1. If ℒ(S) = W(Σ, p, n) and A is an r × p matrix, then ℒ(ASA′) = W(AΣA′, r, n).

Proof. Since ℒ(S) = W(Σ, p, n),

ℒ(S) = ℒ(X′X)

where ℒ(X) = N(0, Iₙ ⊗ Σ) on ℒ_{p,n}. Thus ℒ(ASA′) = ℒ(AX′XA′) = ℒ[((Iₙ ⊗ A)X)′(Iₙ ⊗ A)X]. But Y = (Iₙ ⊗ A)X satisfies ℒ(Y) = N(0, Iₙ ⊗ (AΣA′)) on ℒ_{r,n} and ℒ(Y′Y) = ℒ(ASA′). The conclusion follows from the definition of the Wishart distribution. □
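A Monte Carlo sketch of Proposition 8.1 (Σ, A, n, and the replication count are arbitrary choices): realizing S as X′X with the rows of X i.i.d. N(0, Σ), the transformed matrix ASA′ should have mean nAΣA′, which is the mean of a W(AΣA′, r, n) distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 3, 7, 100_000
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])              # an arbitrary r x p matrix, r = 2
C = np.linalg.cholesky(Sigma)

X = rng.standard_normal((reps, n, p)) @ C.T   # each X[i] has rows N(0, Sigma)
S = np.einsum('rni,rnj->rij', X, X)           # S[i] = X[i]'X[i] ~ W(Sigma, p, n)
mean_ASA = A @ S.mean(axis=0) @ A.T

print(np.allclose(mean_ASA, n * A @ Sigma @ A.T, rtol=0.05, atol=0.3))
```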
One consequence of Proposition 8.1 is that, for fixed p and n, the family of distributions {W(Σ, p, n)|Σ ≥ 0} can be generated from the W(I_p, p, n) distribution and p × p matrices. Here, the notation Σ ≥ 0 (Σ > 0) means that Σ is positive semidefinite (positive definite). To see this, if ℒ(S) = W(I_p, p, n) and Σ = AA′, then

ℒ(ASA′) = W(AA′, p, n) = W(Σ, p, n).

In particular, the family {W(Σ, p, n)|Σ ≥ 0} is generated by the W(I_p, p, n) distribution and the group Gl_p acting on 𝒮_p by A(S) = ASA′. Many proofs are simplified by using the above representation of the Wishart distribution. The question of the nonsingularity of the Wishart distribution is a good example. If ℒ(S) = W(Σ, p, n), then S has a nonsingular Wishart distribution if S is positive definite with probability one.
Proposition 8.2. Suppose ℒ(S) = W(Σ, p, n). Then S has a nonsingular Wishart distribution iff n ≥ p and Σ > 0. If S has a nonsingular Wishart distribution, then S has a density with respect to the measure ν(dS) = dS/|S|^{(p+1)/2} given by

p(S|Σ) = ω(p, n)|Σ|^{-n/2}|S|^{n/2} exp[−½ tr Σ⁻¹S].

Here, ω(p, n) is the Wishart constant defined in Example 5.1.
Proof. Represent the W(Σ, p, n) distribution as ℒ(AS₁A′) where ℒ(S₁) = W(I_p, p, n) and AA′ = Σ. Obviously, the rank of A is the rank of Σ, and Σ > 0 iff the rank of Σ is p. If n < p, then by Proposition 7.1, if ℒ(Xᵢ) = N(0, I_p), i = 1,…, n, the rank of Σᵢ₌₁ⁿ XᵢXᵢ′ is n with probability one. Thus S₁ = Σᵢ₌₁ⁿ XᵢXᵢ′ has rank n, which is less than p, and S = AS₁A′ has rank less than p with probability one. Also, if the rank of Σ is r < p, then A has rank r so AS₁A′ has rank at most r no matter what n happens to be. Therefore, if n < p or if Σ is singular, then S is singular with probability one. Now, consider the case when n ≥ p and Σ is positive definite. Then S₁ = Σᵢ₌₁ⁿ XᵢXᵢ′ has rank p with probability one by Proposition 7.1, and A has rank p. Therefore, S = AS₁A′ has rank p with probability one.
When Σ > 0, the density of X ∈ ℒ_{p,n} is

f(X) = (√(2π))^{-np}|Σ|^{-n/2} exp[−½ tr Σ⁻¹X′X]

when ℒ(X) = N(0, Iₙ ⊗ Σ). When n ≥ p, it follows from Proposition 7.6 that the density of S with respect to ν(dS) is p(S|Σ). □
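The rank dichotomy in the proof can be seen directly (a sketch with Σ = I_p and arbitrary dimensions): S = X′X has rank min(n, p) with probability one, so S is nonsingular exactly when n ≥ p.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 4
results = {}
for n in (2, 4, 9):                     # n < p, n = p, n > p
    X = rng.standard_normal((n, p))     # rows N(0, I_p), so X'X ~ W(I_p, p, n)
    results[n] = np.linalg.matrix_rank(X.T @ X)
print(results)                          # rank is min(n, p) in each case
```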
Recall that the natural inner product on 𝒮_p, when 𝒮_p is regarded as a subspace of ℒ_{p,p}, is

⟨S₁, S₂⟩ = tr S₁S₂, Sᵢ ∈ 𝒮_p, i = 1, 2.

The mean vector, covariance, and characteristic function of a Wishart distribution on the inner product space (𝒮_p, ⟨·, ·⟩) are given next.
Proposition 8.3. Suppose E(S) = W(2, p, n) on (sp, ( , )). Then
(i) &S = nl.
(ii) Cov(S) = 2n X E2.
(iii) +(A)=&exp[i(A, S)] = lIp-2i2A[j/2.
Proof. To prove (i), write S = Σᵢ₌₁ⁿ XᵢXᵢ' where 𝓛(Xᵢ) = N(0, Σ) and X₁,…, Xₙ are independent. Since ℰXᵢXᵢ' = Σ, it is clear that ℰS = nΣ. For (ii), the independence of X₁,…, Xₙ implies that
Cov(S) = Cov(Σᵢ₌₁ⁿ XᵢXᵢ') = Σᵢ₌₁ⁿ Cov(XᵢXᵢ') = n Cov(X₁X₁') = n Cov(X₁ □ X₁)
where X₁ □ X₁ is the outer product of X₁ with itself relative to the standard inner product on R^p. Since 𝓛(X₁) = 𝓛(CZ) where 𝓛(Z) = N(0, I_p) and CC' = Σ, it follows from Proposition 2.24 that Cov(X₁ □ X₁) = 2(Σ ⊗ Σ). Thus (ii) holds. To establish (iii), first write C'AC = ΓDΓ' where A ∈ 𝒮_p, CC' = Σ, Γ ∈ 𝒪_p, and D is a diagonal matrix with diagonal entries λ₁,…, λ_p. Then
φ(A) = ℰ exp[i tr(AS)] = ℰ exp[i tr A(Σⱼ₌₁ⁿ XⱼXⱼ')]
= ℰ ∏ⱼ₌₁ⁿ exp[i tr AXⱼXⱼ'] = ∏ⱼ₌₁ⁿ ℰ exp[i tr AXⱼXⱼ']
= (ℰ exp[i tr AX₁X₁'])ⁿ = (ℰ exp[iX₁'AX₁])ⁿ ≡ (φ₁(A))ⁿ.
Again, 𝓛(X₁) = 𝓛(CZ) where 𝓛(Z) = N(0, I_p). Also, 𝓛(ΓZ) = 𝓛(Z) for
Γ ∈ 𝒪_p. Therefore,
φ₁(A) = ℰ exp[iX₁'AX₁] = ℰ exp[iZ'C'ACZ] = ℰ exp[iZ'DZ] = ℰ exp[i Σⱼ₌₁ᵖ λⱼZⱼ²]
where Z₁,…, Z_p are the coordinates of Z. Since Z₁,…, Z_p are independent with 𝓛(Zⱼ) = N(0, 1), Zⱼ² has a χ₁² distribution and we have
φ₁(A) = ∏ⱼ₌₁ᵖ ℰ exp[iλⱼZⱼ²] = ∏ⱼ₌₁ᵖ (1 − 2iλⱼ)^{-1/2}
= |I_p − 2iD|^{-1/2} = |I_p − 2iΓDΓ'|^{-1/2} = |I_p − 2iC'AC|^{-1/2}
= |I_p − 2iCC'A|^{-1/2} = |I_p − 2iΣA|^{-1/2}.
The next to the last equality is a consequence of Proposition 1.35. Raising φ₁(A) to the power n shows that (iii) holds. □
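A Monte Carlo check of (i) and (ii) of Proposition 8.3 (an illustrative sketch, not from the text; the matrices Σ, A, and B are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 2, 8, 200_000

Sigma = np.array([[2.0, 0.7],
                  [0.7, 1.0]])
C = np.linalg.cholesky(Sigma)

# Fixed symmetric A, B for checking cov(<A,S>, <B,S>) = 2n tr(A Sigma B Sigma).
A = np.array([[1.0, 0.3], [0.3, 0.0]])
B = np.array([[0.5, -0.2], [-0.2, 1.0]])

X = rng.standard_normal((reps, n, p)) @ C.T     # rows of each replicate ~ N(0, Sigma)
S = np.einsum('rni,rnj->rij', X, X)             # S = X'X, one p x p matrix per replicate

mean_err = np.abs(S.mean(axis=0) - n * Sigma).max()   # checks E S = n Sigma
a = np.einsum('ij,rij->r', A, S)                # <A, S> = tr(AS) for symmetric A
b = np.einsum('ij,rij->r', B, S)
cov_ratio = np.cov(a, b)[0, 1] / (2 * n * np.trace(A @ Sigma @ B @ Sigma))
print(float(mean_err), float(cov_ratio))
```

Both the empirical mean of S and the empirical covariance of the two linear functionals should agree with the proposition up to simulation error.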
Proposition 8.4. If 𝓛(Sᵢ) = W(Σ, p, nᵢ) for i = 1, 2 and if S₁ and S₂ are independent, then 𝓛(S₁ + S₂) = W(Σ, p, n₁ + n₂).
Proof. An application of (iii) of Proposition 8.3 yields this convolution result. Specifically,
φ(A) = ℰ exp[i⟨A, S₁ + S₂⟩] = ∏ⱼ₌₁² ℰ exp[i⟨A, Sⱼ⟩]
= ∏ⱼ₌₁² |I_p − 2iΣA|^{-nⱼ/2} = |I_p − 2iΣA|^{-(n₁+n₂)/2}.
The uniqueness of characteristic functions shows that 𝓛(S₁ + S₂) = W(Σ, p, n₁ + n₂). □
It should be emphasized that ⟨·, ·⟩ is not what we might call the standard inner product on 𝒮_p when 𝒮_p is regarded as a [p(p + 1)/2]-dimensional coordinate space. For example, if p = 2 and S, T ∈ 𝒮_p, then
⟨S, T⟩ = tr ST = s₁₁t₁₁ + s₂₂t₂₂ + 2s₁₂t₁₂
while the three-dimensional coordinate space inner product between S and
T would be s₁₁t₁₁ + s₂₂t₂₂ + s₁₂t₁₂. In this connection, equation (ii) of Proposition 8.3 means that
cov(⟨A, S⟩, ⟨B, S⟩) = 2n⟨A, (Σ ⊗ Σ)B⟩ = 2n⟨A, ΣBΣ⟩ = 2n tr(AΣBΣ);
that is, (ii) depends on the inner product ⟨·, ·⟩ on 𝒮_p and is not valid for other inner products. In Chapter 3, quadratic forms in normal random vectors were shown to have chi-square distributions under certain conditions. Similar results are available for generalized quadratic forms and the Wishart distribution. The following proposition is not the most general possible, but it suffices in most situations.
Proposition 8.5. Consider X ∈ ℒ_{p,n} where 𝓛(X) = N(μ, Ω ⊗ Σ). Let S = X'PX where P is n × n and positive semidefinite, and write P = A² with A positive semidefinite. If AΩA is a rank k orthogonal projection and if Pμ = 0, then
𝓛(S) = W(Σ, p, k).
Proof. With Y = AX, it is clear that S = Y'Y and
𝓛(Y) = N(Aμ, (AΩA) ⊗ Σ).
Since A and P = A² have the same null space and Pμ = 0, Aμ = 0, so
𝓛(Y) = N(0, (AΩA) ⊗ Σ).
By assumption, B = AΩA is a rank k orthogonal projection. Also, S = Y'Y = Y'BY + Y'(I − B)Y, and 𝓛((I − B)Y) = N(0, 0 ⊗ Σ), so Y'(I − B)Y is zero with probability one. Thus it remains to show that if 𝓛(Y) = N(0, B ⊗ Σ) where B is a rank k orthogonal projection, then S = Y'BY has a W(Σ, p, k) distribution. Without loss of generality (make an orthogonal transformation),
B = (I_k 0; 0 0) : n × n.
Partitioning Y into Y₁ : k × p and Y₂ : (n − k) × p, it follows that S = Y₁'Y₁
and
𝓛(Y₁) = N(0, I_k ⊗ Σ).
Thus 𝓛(S) = W(Σ, p, k). □
Example 8.1. We again return to the multivariate normal linear model introduced in Example 4.4. Consider X ∈ ℒ_{p,n} with
𝓛(X) = N(μ, I_n ⊗ Σ)
where μ is an element of the subspace M ⊆ ℒ_{p,n} defined by
M = {x | x ∈ ℒ_{p,n}, x = ZB, B ∈ ℒ_{p,k}}.
Here, Z is an n × k matrix of rank k and it is assumed that n − k ≥ p. With P_Z = Z(Z'Z)⁻¹Z', P_M = P_Z ⊗ I_p is the orthogonal projection onto M, and Q_M = Q_Z ⊗ I_p, Q_Z = I − P_Z, is the orthogonal projection onto M⊥. We know that
μ̂ = P_M X = (P_Z ⊗ I_p)X = P_Z X
is the maximum likelihood estimator of μ. As demonstrated in Example 4.4, the maximum likelihood estimator of Σ is found by maximizing
|Σ|^{-n/2} exp[-½ tr Σ⁻¹X'Q_Z X].
Since n − k ≥ p, X'Q_Z X has rank p with probability one. When X'Q_Z X has rank p, Example 7.10 shows that
Σ̂ = (1/n) X'Q_Z X
is the maximum likelihood estimator of Σ. The conditions of Proposition 8.5 are easily checked to verify that S = X'Q_Z X has a W(Σ, p, n − k) distribution. In summary, for the multivariate linear model, μ̂ = P_M X and Σ̂ = n⁻¹X'Q_Z X are the maximum likelihood estimators of μ and Σ. Further, μ̂ and Σ̂ are independent.
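The estimators in this example can be sketched numerically (the design Z and the parameter values below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, p = 12, 2, 3

# Hypothetical design Z (n x k, rank k) and parameters; all values are illustrative.
Z = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
B = np.array([[1.0, 0.0, -1.0],
              [0.5, 2.0, 0.3]])                   # k x p coefficient matrix
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])
C = np.linalg.cholesky(Sigma)

X = Z @ B + rng.standard_normal((n, p)) @ C.T     # L(X) = N(ZB, I_n ⊗ Sigma)

P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)           # projection onto the column space of Z
Q_Z = np.eye(n) - P_Z

mu_hat = P_Z @ X                                  # maximum likelihood estimator of mu = ZB
S = X.T @ Q_Z @ X                                 # residual cross-product, W(Sigma, p, n - k)
Sigma_hat = S / n                                 # maximum likelihood estimator of Sigma

# Since E S = (n - k) Sigma, Sigma_hat has expectation ((n - k)/n) Sigma.
print(mu_hat.shape, Sigma_hat.shape)
```

Because n − k = 10 ≥ p = 3 here, S is positive definite with probability one, as the example requires.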
8.2. PARTITIONING A WISHART MATRIX
The partitioning of the Wishart distribution considered here is motivated partly by the transformation described in Proposition 5.8. If 𝓛(S) = W(Σ, p, n) where n ≥ p, partition S as
S = (S₁₁ S₁₂; S₂₁ S₂₂)
where S₂₁ = S₁₂', and let
S₁₁.₂ = S₁₁ − S₁₂S₂₂⁻¹S₂₁.
Here, S_ij is pᵢ × pⱼ for i, j = 1, 2, so p₁ + p₂ = p. The primary result of this section describes the joint distribution of (S₁₁.₂, S₂₁, S₂₂) when Σ is nonsingular. This joint distribution is derived by representing the Wishart distribution in terms of the normal distribution. Since 𝓛(S) = W(Σ, p, n), S = X'X where 𝓛(X) = N(0, I_n ⊗ Σ). Discarding a set of Lebesgue measure zero, X is assumed to take values in 𝒳, the set of all n × p matrices of rank p. With
X = (X₁ X₂),  Xᵢ : n × pᵢ, i = 1, 2,
it is clear that
S_ij = Xᵢ'Xⱼ for i, j = 1, 2.
Thus
S₁₁.₂ = X₁'X₁ − X₁'X₂(X₂'X₂)⁻¹X₂'X₁ = X₁'QX₁
where
Q = I_n − X₂(X₂'X₂)⁻¹X₂' ≡ I_n − P
is an orthogonal projection of rank n − p₂ for each value of X₂ when X ∈ 𝒳. To obtain the desired result for the Wishart distribution, it is useful to first give the joint distribution of (QX₁, PX₁, X₂).
Proposition 8.6. The joint distribution of (QX₁, PX₁, X₂) can be described as follows. Conditional on X₂, QX₁ and PX₁ are independent with
𝓛(QX₁|X₂) = N(0, Q ⊗ Σ₁₁.₂)
and
𝓛(PX₁|X₂) = N(X₂Σ₂₂⁻¹Σ₂₁, P ⊗ Σ₁₁.₂).
Also,
𝓛(X₂) = N(0, I_n ⊗ Σ₂₂).
Proof. From Example 3.1, the conditional distribution of X₁ given X₂, say 𝓛(X₁|X₂), is
𝓛(X₁|X₂) = N(X₂Σ₂₂⁻¹Σ₂₁, I_n ⊗ Σ₁₁.₂).
Thus, conditional on X₂, the random vector
W = (QX₁; PX₁) : (2n) × p₁
is a linear transformation of X₁, so W has a normal distribution with mean vector
(QX₂Σ₂₂⁻¹Σ₂₁; PX₂Σ₂₂⁻¹Σ₂₁) = (0; X₂Σ₂₂⁻¹Σ₂₁),
since QX₂ = 0 and PX₂ = X₂. Also, using the calculational rules for partitioned linear transformations, the covariance of W is the block diagonal matrix
(Q ⊗ Σ₁₁.₂  0; 0  P ⊗ Σ₁₁.₂),
since QP = 0. The conditional independence and the conditional distributions of QX₁ and PX₁ follow immediately. That X₂ has the claimed marginal distribution is obvious. □
Proposition 8.7. Suppose 𝓛(S) = W(Σ, p, n) with n ≥ p and Σ > 0. Partition S into S_ij, i, j = 1, 2, where S_ij is pᵢ × pⱼ, p₁ + p₂ = p, and partition Σ similarly. With S₁₁.₂ = S₁₁ − S₁₂S₂₂⁻¹S₂₁, S₁₁.₂ and (S₂₁, S₂₂) are stochastically independent. Further,
𝓛(S₁₁.₂) = W(Σ₁₁.₂, p₁, n − p₂)
and, conditional on S₂₂,
𝓛(S₂₁|S₂₂) = N(S₂₂Σ₂₂⁻¹Σ₂₁, S₂₂ ⊗ Σ₁₁.₂).
The marginal distribution of S₂₂ is W(Σ₂₂, p₂, n).
Proof. In the notation of Proposition 8.6, consider X ∈ 𝒳 with 𝓛(X) = N(0, I_n ⊗ Σ) and S = X'X. Then S_ij = Xᵢ'Xⱼ for i, j = 1, 2 and S₁₁.₂ = X₁'QX₁. Since PX₂ = X₂ and S₂₁ = X₂'X₁, we see that S₂₁ = (PX₂)'X₁ = X₂'PX₁, and conditional on X₂,
𝓛(S₂₁|X₂) = N(X₂'X₂Σ₂₂⁻¹Σ₂₁, (X₂'X₂) ⊗ Σ₁₁.₂).
To show that S₁₁.₂ and (S₂₁, S₂₂) are independent, it suffices to show that
ℰf(S₁₁.₂)h(S₂₁, S₂₂) = ℰf(S₁₁.₂)ℰh(S₂₁, S₂₂)
for bounded measurable functions f and h with the appropriate domains of definition. Using Proposition 8.6, we argue as follows. For fixed X₂, QX₁ and PX₁ are independent, so S₁₁.₂ = X₁'QX₁ and S₂₁ = X₂'PX₁ are conditionally independent. Also,
𝓛(QX₁|X₂) = N(0, Q ⊗ Σ₁₁.₂)
and Q is a rank n − p₂ orthogonal projection. By Proposition 8.5,
𝓛(X₁'QX₁|X₂) = W(Σ₁₁.₂, p₁, n − p₂) for each X₂, so X₁'QX₁ and X₂ are independent. Conditioning on X₂, we have
ℰf(S₁₁.₂)h(S₂₁, S₂₂) = ℰ[ℰ(f(X₁'QX₁)h(X₂'PX₁, X₂'X₂)|X₂)]
= ℰ[ℰ(f(X₁'QX₁)|X₂)ℰ(h(X₂'PX₁, X₂'X₂)|X₂)]
= ℰ[ℰf(X₁'QX₁)ℰ(h(X₂'PX₁, X₂'X₂)|X₂)]
= ℰf(X₁'QX₁)ℰ[ℰ(h(X₂'PX₁, X₂'X₂)|X₂)]
= ℰf(X₁'QX₁)ℰh(X₂'PX₁, X₂'X₂)
= ℰf(S₁₁.₂)ℰh(S₂₁, S₂₂).
Therefore, S₁₁.₂ and (S₂₁, S₂₂) are stochastically independent. To describe
the joint distribution of S₂₁ and S₂₂, again condition on X₂. Then
𝓛(S₂₁|X₂) = N(X₂'X₂Σ₂₂⁻¹Σ₂₁, (X₂'X₂) ⊗ Σ₁₁.₂)
and this conditional distribution depends on X₂ only through S₂₂ = X₂'X₂. Thus
𝓛(S₂₁|S₂₂) = N(S₂₂Σ₂₂⁻¹Σ₂₁, S₂₂ ⊗ Σ₁₁.₂).
That S₂₂ has the claimed marginal distribution is obvious. □
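A simulation sketch of the degrees-of-freedom reduction in Proposition 8.7 (not from the text), using the fact that a W(Σ₁₁.₂, p₁, n − p₂) matrix has mean (n − p₂)Σ₁₁.₂; the Σ below is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
p1, p2, n, reps = 2, 1, 9, 50_000
p = p1 + p2

Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
C = np.linalg.cholesky(Sigma)
# Sigma_{11.2} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
S11_2 = Sigma[:p1, :p1] - Sigma[:p1, p1:] @ np.linalg.solve(Sigma[p1:, p1:], Sigma[p1:, :p1])

acc = np.zeros((p1, p1))
for _ in range(reps):
    X = rng.standard_normal((n, p)) @ C.T
    S = X.T @ X
    acc += S[:p1, :p1] - S[:p1, p1:] @ np.linalg.solve(S[p1:, p1:], S[p1:, :p1])

# If S_{11.2} ~ W(Sigma_{11.2}, p1, n - p2), its mean is (n - p2) Sigma_{11.2}.
ratio = (acc / reps) / ((n - p2) * S11_2)
print(np.round(ratio, 2))
```

The entrywise ratio should be close to one, reflecting the loss of p₂ degrees of freedom.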
By simply permuting the indices in Proposition 8.7, we obtain the following proposition.
Proposition 8.8. With the notation and assumptions of Proposition 8.7, let S₂₂.₁ = S₂₂ − S₂₁S₁₁⁻¹S₁₂. Then S₂₂.₁ and (S₁₁, S₁₂) are stochastically independent and
𝓛(S₂₂.₁) = W(Σ₂₂.₁, p₂, n − p₁).
Conditional on S₁₁,
𝓛(S₁₂|S₁₁) = N(S₁₁Σ₁₁⁻¹Σ₁₂, S₁₁ ⊗ Σ₂₂.₁)
and the marginal distribution of S₁₁ is W(Σ₁₁, p₁, n).
Proposition 8.7 is one of the most useful results for deriving distributions of functions of Wishart matrices. Applications occur in this and the remaining chapters. For example, the following assertion provides a simple proof of the distribution of Hotelling's T², discussed in the next chapter.
Proposition 8.9. Suppose S₀ has a nonsingular Wishart distribution, say W(Σ, p, n), and let A be an r × p matrix of rank r. Then
𝓛((AS₀⁻¹A')⁻¹) = W((AΣ⁻¹A')⁻¹, r, n − p + r).
Proof. First, an invariance argument shows that it is sufficient to consider the case when Σ = I_p. More precisely, write Σ = B² with B > 0 and let C = AB⁻¹. With S = B⁻¹S₀B⁻¹, 𝓛(S) = W(I, p, n) and the assertion is that
𝓛((CS⁻¹C')⁻¹) = W((CC')⁻¹, r, n − p + r).
Now, let Ψ = C'(CC')^{-1/2}, so the assertion becomes
𝓛((Ψ'S⁻¹Ψ)⁻¹) = W(I_r, r, n − p + r).
However, Ψ is p × r and satisfies Ψ'Ψ = I_r; that is, Ψ is a linear isometry. Since 𝓛(Γ'SΓ) = 𝓛(S) for all Γ ∈ 𝒪_p,
𝓛((Ψ'S⁻¹Ψ)⁻¹) = 𝓛((Ψ'Γ'S⁻¹ΓΨ)⁻¹).
Choose Γ so that
ΓΨ = (I_r; 0) : p × r.
For this choice of Γ, the matrix (Ψ'Γ'S⁻¹ΓΨ)⁻¹ is just the inverse of the r × r upper left corner of S⁻¹, and this matrix is
S₁₁ − S₁₂S₂₂⁻¹S₂₁ ≡ V
where V is r × r. By Proposition 8.7,
𝓛(V) = W(I_r, r, n − p + r)
since 𝓛(S) = W(I, p, n). This establishes the assertion of the proposition. □
When r = 1 in Proposition 8.9, the matrix A' is a nonzero vector, say A' = a ∈ R^p. In this case,
𝓛(a'Σ⁻¹a / a'S⁻¹a) = χ²_{n−p+1}
when 𝓛(S) = W(Σ, p, n).
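The chi-square representation for r = 1 can be checked by simulation (an illustrative sketch, not from the text; Σ and a are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 3, 7, 40_000

Sigma = np.array([[1.5, 0.4, 0.0],
                  [0.4, 1.0, 0.3],
                  [0.0, 0.3, 2.0]])
C = np.linalg.cholesky(Sigma)
a = np.array([1.0, -2.0, 0.5])
num = a @ np.linalg.solve(Sigma, a)                # a' Sigma^{-1} a

vals = np.empty(reps)
for i in range(reps):
    X = rng.standard_normal((n, p)) @ C.T
    S = X.T @ X
    vals[i] = num / (a @ np.linalg.solve(S, a))    # a'Sigma^{-1}a / a'S^{-1}a

# The ratio should be chi-square with n - p + 1 = 5 degrees of freedom:
# sample mean near 5 and sample variance near 10.
print(float(vals.mean()), float(vals.var()))
```

Note the loss of p − 1 degrees of freedom relative to the naive guess χ²_n.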
Another decomposition result for the Wishart distribution, which is sometimes useful, follows.
Lemma 8.10. Suppose S has a nonsingular Wishart distribution, say 𝓛(S) = W(Σ, p, n), and let S = TT' where T ∈ G_T⁺, the group of lower triangular p × p matrices with positive diagonal elements. Then the density of T with respect to the left invariant measure ν(dT) = dT/∏ᵢ tᵢᵢⁱ is
p(T|Σ) = 2^p ω(n, p)|Σ|^{-n/2}|TT'|^{n/2} exp[-½ tr Σ⁻¹TT'].
If S and T are partitioned as
S = (S₁₁ S₁₂; S₂₁ S₂₂),  T = (T₁₁ 0; T₂₁ T₂₂),
where S_ij is pᵢ × pⱼ and p₁ + p₂ = p, then S₁₁ = T₁₁T₁₁', S₁₂ = T₁₁T₂₁', and S₂₂.₁ = T₂₂T₂₂'. Further, the pair (T₁₁, T₂₁) is independent of T₂₂ and
𝓛(S₂₂.₁) = 𝓛(T₂₂T₂₂') = W(Σ₂₂.₁, p₂, n − p₁),
𝓛(T₂₁'|T₁₁) = N(T₁₁'Σ₁₁⁻¹Σ₁₂, I_{p₁} ⊗ Σ₂₂.₁).
Proof. The expression for the density of T is a consequence of Proposition 7.5, and a bit of algebra shows that S₁₁ = T₁₁T₁₁', S₁₂ = T₁₁T₂₁', and S₂₂.₁ = T₂₂T₂₂'. The independence of (T₁₁, T₂₁) and T₂₂ follows from Proposition 8.8 and the fact that the mapping between S and T is one-to-one and onto. Also,
𝓛(S₁₂|S₁₁) = N(S₁₁Σ₁₁⁻¹Σ₁₂, S₁₁ ⊗ Σ₂₂.₁).
Since S₁₁ and T₁₁ are one-to-one functions of each other and S₁₂ = T₁₁T₂₁',
𝓛(T₁₁T₂₁'|T₁₁) = N(T₁₁T₁₁'Σ₁₁⁻¹Σ₁₂, (T₁₁T₁₁') ⊗ Σ₂₂.₁).
Thus
𝓛(T₂₁'|T₁₁) = N(T₁₁'Σ₁₁⁻¹Σ₁₂, I_{p₁} ⊗ Σ₂₂.₁),
as T₂₁' = T₁₁⁻¹(T₁₁T₂₁') and T₁₁ is fixed. □
Proposition 8.11. Suppose S has a nonsingular Wishart distribution with 𝓛(S) = W(Σ, p, n), and assume that Σ is diagonal with diagonal elements σ₁₁,…, σ_pp. If S = TT' with T ∈ G_T⁺, then the random variables {t_ij | i ≥ j} are mutually independent and
𝓛(t_ij) = N(0, σᵢᵢ) for i > j
and
𝓛(tᵢᵢ²) = σᵢᵢ χ²_{n−i+1}.
Proof. First, partition S, Σ, and T as
S = (s₁₁ S₁₂; S₂₁ S₂₂),  Σ = (σ₁₁ 0; 0 Σ₂₂),  T = (t₁₁ 0; T₂₁ T₂₂),
where s₁₁ is 1 × 1. Since Σ₁₂ = 0, the conditional distribution of T₂₁ given t₁₁ does not depend on t₁₁, and Σ₂₂ has diagonal elements σ₂₂,…, σ_pp. It follows from Lemma 8.10 that t₁₁, T₂₁, and T₂₂ are mutually independent and
𝓛(T₂₁) = N(0, Σ₂₂).
The elements of T₂₁ are t₂₁, t₃₁,…, t_p1, and since Σ₂₂ is diagonal, these are independent with
𝓛(tᵢ₁) = N(0, σᵢᵢ),  i = 2,…, p.
Also,
𝓛(t₁₁²) = 𝓛(s₁₁) = σ₁₁χ²_n
and
𝓛(S₂₂.₁) = 𝓛(T₂₂T₂₂') = W(Σ₂₂, p − 1, n − 1).
The conclusion of the proposition follows by an induction argument on the dimension parameter p. □
When 𝓛(S) = W(Σ, p, n) is a nonsingular Wishart distribution, the random variable |S| is called the generalized variance. The distribution of |S| is easily derived using Proposition 8.11. First, write Σ = B² with B > 0 and let S₁ = B⁻¹SB⁻¹. Then 𝓛(S₁) = W(I, p, n) and |S| = |Σ||S₁|. Also, if TT' = S₁ with T ∈ G_T⁺, then 𝓛(tᵢᵢ²) = χ²_{n−i+1} for i = 1,…, p, and t₁₁,…, t_pp are mutually independent. Thus
𝓛(|S|) = 𝓛(|Σ||S₁|) = 𝓛(|Σ| ∏ᵢ₌₁ᵖ tᵢᵢ²) = 𝓛(|Σ| ∏ᵢ₌₁ᵖ χ²_{n−i+1}).
Therefore, the distribution of |S| is the same as that of the constant |Σ| times a product of p independent chi-square random variables with n − i + 1 degrees of freedom for i = 1,…, p.
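A quick numerical check of the generalized variance representation when Σ = I, in which case ℰ|S| = ∏ᵢ₌₁ᵖ (n − i + 1) (an illustrative sketch, not from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 3, 8, 60_000

dets = np.empty(reps)
for i in range(reps):
    X = rng.standard_normal((n, p))
    dets[i] = np.linalg.det(X.T @ X)       # |S| with S ~ W(I_p, p, n)

expected = 1.0
for i in range(1, p + 1):
    expected *= (n - i + 1)                # E chi2_{n-i+1} = n - i + 1

print(float(dets.mean() / expected))
```

For p = 3 and n = 8 the expected value is 8 · 7 · 6 = 336, and the sample mean of the determinants should be close to it.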
8.3. THE NONCENTRAL WISHART DISTRIBUTION
Just as the Wishart distribution is a matrix generalization of the chi-square distribution, the noncentral Wishart distribution is a matrix analog of the noncentral chi-square distribution. Also, the noncentral Wishart distribution arises in a natural way in the study of distributional properties of test statistics in multivariate analysis.
Definition 8.2. Let X ∈ ℒ_{p,n} have a normal distribution N(μ, I_n ⊗ Σ). A random matrix S ∈ 𝒮_p⁺ has a noncentral Wishart distribution with parameters Σ, p, n, and Δ = μ'μ if 𝓛(S) = 𝓛(X'X). In this case, we write
𝓛(S) = W(Σ, p, n; Δ).
In this definition, it is not obvious that the distribution of X'X depends on μ only through Δ = μ'μ. However, an invariance argument establishes this. The group 𝒪_n acts on ℒ_{p,n} by sending x into Γx for x ∈ ℒ_{p,n} and Γ ∈ 𝒪_n. A maximal invariant under this action is x'x. When 𝓛(X) = N(μ, I_n ⊗ Σ), 𝓛(ΓX) = N(Γμ, I_n ⊗ Σ), and we know the distribution of X'X depends only on a maximal invariant parameter. But the group action on the parameter space is (μ, Σ) → (Γμ, Σ) and a maximal invariant is obviously (μ'μ, Σ). Thus the distribution of X'X depends only on (μ'μ, Σ).
When Δ = 0, the noncentral Wishart distribution is just the W(Σ, p, n) distribution. Let X₁',…, Xₙ' be the rows of X in the above definition, so X₁,…, Xₙ are independent and 𝓛(Xᵢ) = N(μᵢ, Σ) where μ₁',…, μₙ' are the rows of μ. Obviously,
𝓛(XᵢXᵢ') = W(Σ, p, 1; Δᵢ)
where Δᵢ = μᵢμᵢ'. Thus Sᵢ = XᵢXᵢ', i = 1,…, n, are independent and it is clear that, if S = X'X, then
𝓛(S) = 𝓛(Σᵢ₌₁ⁿ Sᵢ).
In other words, the noncentral Wishart distribution with n degrees of freedom can be represented as the convolution of n noncentral Wishart distributions, each with one degree of freedom. This argument shows that, if 𝓛(Sᵢ) = W(Σ, p, nᵢ; Δᵢ) for i = 1, 2 and if S₁ and S₂ are independent, then 𝓛(S₁ + S₂) = W(Σ, p, n₁ + n₂; Δ₁ + Δ₂). Since
ℰXᵢXᵢ' = Σ + μᵢμᵢ',
it follows that
ℰS = nΣ + Δ
when 𝓛(S) = W(Σ, p, n; Δ). Also,
Cov(S) = Σᵢ₌₁ⁿ Cov(Sᵢ),
but an explicit expression for Cov(Sᵢ) is not needed here. As with the central Wishart distribution, it is not difficult to prove that, when 𝓛(S) = W(Σ, p, n; Δ), S is positive definite with probability one iff n ≥ p and Σ > 0. Further, it is clear that if 𝓛(S) = W(Σ, p, n; Δ) and A is an r × p matrix, then 𝓛(ASA') = W(AΣA', r, n; AΔA'). The next result provides an expression for the density function of S in a special case.
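Before turning to the density, the moment formula ℰS = nΣ + Δ can be checked by simulation (a sketch, not from the text; the mean matrix μ and Σ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 2, 6, 80_000

Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
C = np.linalg.cholesky(Sigma)
mu = np.zeros((n, p))                       # illustrative n x p mean matrix
mu[0] = [1.0, -1.0]
mu[1] = [0.5, 0.0]
mu[5] = [2.0, 1.0]
Delta = mu.T @ mu                           # noncentrality parameter mu'mu

acc = np.zeros((p, p))
for _ in range(reps):
    X = mu + rng.standard_normal((n, p)) @ C.T    # L(X) = N(mu, I_n ⊗ Sigma)
    acc += X.T @ X

ratio = (acc / reps) / (n * Sigma + Delta)
print(np.round(ratio, 2))
```

The entrywise ratio of the empirical mean of S = X'X to nΣ + Δ should be close to one.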
Proposition 8.12. Suppose 𝓛(S) = W(Σ, p, n; Δ) where n ≥ p and Σ > 0, and assume that Δ has rank one, say Δ = ηη' with η ∈ R^p. The density of S with respect to ν(dS) = dS/|S|^{(p+1)/2} is given by
p₁(S|Σ, Δ) = p(S|Σ) exp[-½ η'Σ⁻¹η] H((η'Σ⁻¹SΣ⁻¹η)^{1/2})
where p(S|Σ) is the density of a W(Σ, p, n) distribution given in Proposition 8.2 and the function H is defined in Example 7.13.
Proof. Consider X ∈ ℒ_{p,n} with 𝓛(X) = N(μ, I_n ⊗ Σ) where μ ∈ ℒ_{p,n} and μ'μ = Δ. Since S = X'X is a maximal invariant under the action of 𝒪_n on ℒ_{p,n}, the results of Example 7.15 show that the density of S with respect to the measure ν₀(dS) = (√2π)^{np}ω(n, p)|S|^{(n−p−1)/2} dS is
h(S) = ∫ f(ΓX)μ₀(dΓ),
where X is any point with X'X = S. Here, f is the density of X and μ₀ is the unique invariant probability measure on 𝒪_n. The density of X is
f(X) = (√2π)^{-np}|Σ|^{-n/2} exp[-½ tr(X − μ)Σ⁻¹(X − μ)'].
Substituting this into the expression for h(S) and doing a bit of algebra shows that the density p₁(S|Σ, Δ) with respect to ν is
p₁(S|Σ, Δ) = p(S|Σ) exp[-½ tr Σ⁻¹Δ] ∫ exp[tr ΓXΣ⁻¹μ'] μ₀(dΓ).
The problem is now to evaluate the integral over 𝒪_n. It is here that we use the assumption that Δ has rank one. Since Δ = μ'μ, μ must have rank one, so μ = ξη' where ξ ∈ Rⁿ, ‖ξ‖ = 1, and η ∈ R^p. Since ‖ξ‖ = 1, ξ = Γ₁ε₁ for some Γ₁ ∈ 𝒪_n, where ε₁ ∈ Rⁿ is the first standard unit vector. Setting u = (η'Σ⁻¹SΣ⁻¹η)^{1/2}, XΣ⁻¹η = uΓ₂ε₁ for some Γ₂ ∈ 𝒪_n, as uε₁ and XΣ⁻¹η have the same length. Therefore,
∫ exp[tr ΓXΣ⁻¹μ'] μ₀(dΓ) = ∫ exp[ξ'ΓXΣ⁻¹η] μ₀(dΓ)
= ∫ exp[uε₁'Γ₁'ΓΓ₂ε₁] μ₀(dΓ) = ∫ exp[uε₁'Γε₁] μ₀(dΓ)
= ∫ exp[uγ₁₁] μ₀(dΓ) = H(u).
The right and left invariance of μ₀ was used in the third to the last equality, and γ₁₁ is the (1, 1) element of Γ. The function H was evaluated in Example 7.13. Therefore, when Δ = ηη',
p₁(S|Σ, Δ) = p(S|Σ) exp[-½ η'Σ⁻¹η] H((η'Σ⁻¹SΣ⁻¹η)^{1/2}). □
The final result of this section is the analog of Proposition 8.5 for the noncentral Wishart distribution.
Proposition 8.13. Consider X ∈ ℒ_{p,n} where 𝓛(X) = N(μ, Ω ⊗ Σ) and let S = X'PX where P ≥ 0 is n × n. Write P = A² with A ≥ 0. If B = AΩA is a rank k orthogonal projection and if AΩPμ = Aμ, then
𝓛(S) = W(Σ, p, k; μ'Pμ).
Proof. The proof of this result is quite similar to that of Proposition 8.5 and is left to the reader. □
It should be noted that there is no analog of Proposition 8.7 for the noncentral Wishart distribution, at least as far as I know. Certainly, Proposition 8.7 is false as stated when S is noncentral Wishart.
8.4. DISTRIBUTIONS RELATED TO LIKELIHOOD RATIO TESTS
In the next two chapters, statistics that are ratios of determinants of Wishart matrices arise as test statistics related to likelihood ratio tests.
Since the techniques for deriving the distributions of these statistics are intimately connected with properties of the Wishart distribution, we have chosen to treat this topic here rather than interrupt the flow of the succeeding chapters with such considerations. Let X ∈ ℒ_{p,m} and S ∈ 𝒮_p⁺ be independent and suppose that 𝓛(X) = N(μ, I_m ⊗ Σ) and 𝓛(S) = W(Σ, p, n) where n ≥ p and Σ > 0. We are interested in deriving the distribution of the random variable
U = |S| / |S + X'X|
for some special values of the mean matrix μ of X. The argument below shows that the distribution of U depends on (μ, Σ) only through Σ^{-1/2}μ'μΣ^{-1/2}, where Σ^{1/2} is the positive definite square root of Σ. Let S₁ = Σ^{-1/2}SΣ^{-1/2} and Y = XΣ^{-1/2}. Then S₁ and Y are independent, 𝓛(S₁) = W(I, p, n), and 𝓛(Y) = N(μΣ^{-1/2}, I_m ⊗ I_p). Also,
U = |S|/|S + X'X| = |S₁|/|S₁ + Y'Y|.
However, the discussion of the previous section shows that Y'Y has a noncentral Wishart distribution, say 𝓛(Y'Y) = W(I, p, m; Δ) where Δ = Σ^{-1/2}μ'μΣ^{-1/2}. In the following discussion we take Σ = I_p and denote the distribution of U by
𝓛(U) = U(n, m, p; Δ)
where Δ = μ'μ. When μ = 0, the notation
𝓛(U) = U(n, m, p)
is used. In the case that p = 1,
U = s/(s + X'X)
where 𝓛(s) = χ²_n. Since 𝓛(X) = N(μ, I_m), 𝓛(X'X) = χ²_m(Δ) where Δ = μ'μ ≥ 0. Thus
U = 1/(1 + χ²_m(Δ)/χ²_n).
When χ²_m(Δ) and χ²_n are independent, the distribution of the ratio
F(m, n; Δ) = χ²_m(Δ)/χ²_n
is called a noncentral F distribution with parameters m, n, and Δ. When Δ = 0, the distribution F(m, n; 0) is denoted by F_{m,n} and is simply called an F distribution with (m, n) degrees of freedom. It should be noted that this usage is not standard, as the above ratio has not been normalized by the constant n/m. At times, the relationship between the F distribution and the beta distribution is useful. It is not difficult to show that, when χ²_n and χ²_m are independent, the random variable
V = χ²_n/(χ²_n + χ²_m)
has a beta distribution with parameters n/2 and m/2, and this is written as 𝓛(V) = 𝔅(n/2, m/2). In other words, V has a density on (0, 1) given by
f(v) = [Γ(α + β)/(Γ(α)Γ(β))] v^{α−1}(1 − v)^{β−1}
where α = n/2 and β = m/2. More generally, the distribution of the random variable
V(Δ) = χ²_n/(χ²_n + χ²_m(Δ))
is called a noncentral beta distribution, and the notation 𝓛(V(Δ)) = 𝔅(n/2, m/2; Δ) is used. In summary, when p = 1,
𝓛(U) = 𝔅(n/2, m/2; Δ)
where Δ = μ'μ ≥ 0.
Now, we consider the distribution of U when m = 1. In this case, 𝓛(X') = N(μ', I_p) where X' ∈ R^p, and
U = |S|/|S + X'X| = |I_p + S⁻¹X'X|⁻¹ = (1 + XS⁻¹X')⁻¹.
The last equality follows from Proposition 1.35.
Proposition 8.14. When m = 1,
𝓛(U) = 𝔅((n − p + 1)/2, p/2; δ)
where δ = μμ' ≥ 0.
Proof. It must be shown that
𝓛(XS⁻¹X') = F(p, n − p + 1; δ).
For X fixed, X ≠ 0, Proposition 8.9 shows that
𝓛(XX'/XS⁻¹X') = χ²_{n−p+1}
when 𝓛(S) = W(I, p, n). Since this distribution does not depend on X, we have that (XX')/(XS⁻¹X') and XX' are independent. Further,
𝓛(XX') = χ²_p(δ)
since 𝓛(X') = N(μ', I_p). Thus
𝓛(XS⁻¹X') = 𝓛(XX' / [XX'/XS⁻¹X']) = F(p, n − p + 1; δ). □
The next step in studying 𝓛(U) is the case when m > 1, p > 1, but μ = 0.
Proposition 8.15. Suppose X and S are independent where 𝓛(S) = W(I, p, n) and 𝓛(X) = N(0, I_m ⊗ I_p). Then
𝓛(U) = 𝓛(∏ᵢ₌₁ᵐ Uᵢ)
where U₁,…, U_m are independent and 𝓛(Uᵢ) = 𝔅((n − p + i)/2, p/2).
Proof. The proof is by induction on m and, when m = 1, we know
𝓛(U) = 𝔅((n − p + 1)/2, p/2).
Since X'X = Σᵢ₌₁ᵐ XᵢXᵢ' where X has rows X₁',…, X_m',
U = |S|/|S + X'X| = [|S|/|S + X₁X₁'|] × [|S + X₁X₁'|/|S + X₁X₁' + Σᵢ₌₂ᵐ XᵢXᵢ'|].
The first claim is that
U₁ = |S|/|S + X₁X₁'|
and
W = |S + X₁X₁'|/|S + X₁X₁' + Σᵢ₌₂ᵐ XᵢXᵢ'|
are independent random variables. Since X₁,…, X_m are independent and independent of S, to show U₁ and W are independent, it suffices to show that U₁ and S + X₁X₁' are independent. To do this, Proposition 7.19 is applicable. The group Gl_p acts on (S, X₁) by
A(S, X₁) = (ASA', AX₁)
and the induced group action on T = S + X₁X₁' sends T into ATA'. The induced group action is clearly transitive. Obviously, T is an equivariant function and U₁ is an invariant function under the group action on (S, X₁). That T is a sufficient statistic for the parametric family generated by Gl_p and the fixed joint distribution of (S, X₁) is easily checked via the factorization criterion. By Proposition 7.19, U₁ and S + X₁X₁' are independent. Therefore,
𝓛(U) = 𝓛(U₁W)
where U₁ and W are independent and
𝓛(U₁) = 𝔅((n − p + 1)/2, p/2).
However, 𝓛(S + X₁X₁') = W(I, p, n + 1) and the induction hypothesis applied to W yields
𝓛(W) = 𝓛(∏ᵢ₌₁^{m−1} Wᵢ)
where W₁,…, W_{m−1} are independent with
𝓛(Wᵢ) = 𝔅((n + 1 − p + i)/2, p/2).
Setting Uᵢ = Wᵢ₋₁ for i = 2,…, m, we have
𝓛(U) = 𝓛(∏ᵢ₌₁ᵐ Uᵢ)
where U₁,…, U_m are independent and
𝓛(Uᵢ) = 𝔅((n − p + i)/2, p/2). □
The above proof shows that the Uᵢ are given by
Uᵢ = |S + Σⱼ₌₁^{i−1} XⱼXⱼ'| / |S + Σⱼ₌₁^{i} XⱼXⱼ'|,  i = 1,…, m,
and that these random variables are independent. Since 𝓛(S + Σⱼ₌₁^{i−1} XⱼXⱼ') = W(I, p, n + i − 1), Proposition 8.14 yields
𝓛(Uᵢ) = 𝔅((n − p + i)/2, p/2).
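The product-of-betas representation gives, in particular, ℰU = ∏ᵢ₌₁ᵐ (n − p + i)/(n + i), which is easy to check by simulation (an illustrative sketch with small n, m, p; not from the text):

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, m, reps = 2, 6, 3, 60_000

us = np.empty(reps)
for i in range(reps):
    Y = rng.standard_normal((n, p))
    S = Y.T @ Y                                    # S ~ W(I_p, p, n)
    X = rng.standard_normal((m, p))                # mu = 0
    us[i] = np.linalg.det(S) / np.linalg.det(S + X.T @ X)

# E U = prod_{i=1}^m (n - p + i)/(n + i), using E Beta(a, b) = a/(a + b).
expected = 1.0
for i in range(1, m + 1):
    expected *= (n - p + i) / (n + i)

print(float(us.mean() / expected))
```

The ratio of the simulated mean of U to the product of beta means should be close to one.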
In the special case that Δ has rank one, the distribution of U can be derived by an argument similar to that in the proof of Proposition 8.15.
Proposition 8.16. Suppose X and S are independent where 𝓛(S) = W(I, p, n) and 𝓛(X) = N(μ, I_m ⊗ I_p). Assume that μ = ξη' with ξ ∈ R^m, ‖ξ‖ = 1, and η ∈ R^p. Then
𝓛(U) = 𝓛(∏ᵢ₌₁ᵐ Uᵢ)
where U₁,…, U_m are independent,
𝓛(Uᵢ) = 𝔅((n − p + i)/2, p/2),  i = 1,…, m − 1,
and
𝓛(U_m) = 𝔅((n − p + m)/2, p/2; η'η).
Proof. Let ε_m be the m-th standard unit vector in R^m. Then Γξ = ε_m for some Γ ∈ 𝒪_m since ‖ξ‖ = 1. Since
|S|/|S + X'X| = |S|/|S + X'Γ'ΓX|
and 𝓛(ΓX) = N(ε_mη', I_m ⊗ I_p), we can take ξ = ε_m without loss of generality. As in the proof of Proposition 8.15, X'X = Σᵢ₌₁ᵐ XᵢXᵢ' where X₁,…, X_m are independent. Obviously, 𝓛(Xᵢ) = N(0, I_p), i = 1,…, m − 1, and 𝓛(X_m) = N(η, I_p). Now, write U = ∏ᵢ₌₁ᵐ Uᵢ where
Uᵢ = |S + Σⱼ₌₁^{i−1} XⱼXⱼ'| / |S + Σⱼ₌₁^{i} XⱼXⱼ'|,  i = 1,…, m.
The argument given in the proof of Proposition 8.15 shows that
U₁ = |S|/|S + X₁X₁'|
and (S + X₁X₁', X₂,…, X_m) are independent. The assumption that X₁ has mean zero is essential here in order to verify the sufficiency condition necessary to apply Proposition 7.19. Since U₂,…, U_m are functions of (S + X₁X₁', X₂,…, X_m), U₁ is independent of (U₂,…, U_m). Now, we simply repeat this argument m − 1 times to conclude that U₁,…, U_m are independent, keeping in mind that X₁,…, X_{m−1} all have mean zero, but X_m need not have mean zero. As noted earlier,
𝓛(Uᵢ) = 𝔅((n − p + i)/2, p/2),  i = 1,…, m − 1.
By Proposition 8.14,
𝓛(U_m) = 𝓛(|S + Σⱼ₌₁^{m−1} XⱼXⱼ'| / |S + Σⱼ₌₁^{m} XⱼXⱼ'|) = 𝔅((n − p + m)/2, p/2; η'η). □
Now, we return to the case when μ = 0. In terms of the notation 𝓛(U) = U(n, m, p), Proposition 8.14 asserts that
U(n, 1, p) = 𝔅((n − p + 1)/2, p/2).
Further, Proposition 8.15 can be written
U(n, m, p) = ∏ᵢ₌₁ᵐ U(n + i − 1, 1, p)
where this equation means that the distribution U(n, m, p) can be represented as the distribution of the product of m independent random variables with distributions U(n + i − 1, 1, p) for i = 1,…, m. An alternative representation of U(n, m, p) in terms of p independent random variables when m ≥ p follows. If m ≥ p and
U = |S|/|S + X'X|
with 𝓛(S) = W(I, p, n) and 𝓛(X) = N(0, I_m ⊗ I_p), the matrix T = X'X has a nonsingular Wishart distribution, 𝓛(T) = W(I, p, m). The following technical result provides the basic step for decomposing U(n, m, p) into a product of p independent factors.
Proposition 8.17. Partition S into S_ij where S_ij is pᵢ × pⱼ, i, j = 1, 2, and p₁ + p₂ = p. Partition T similarly and let
Z = S₁₂'S₁₁⁻¹S₁₂ + T₁₂'T₁₁⁻¹T₁₂ − (S₁₂ + T₁₂)'(S₁₁ + T₁₁)⁻¹(S₁₂ + T₁₂).
Then the five random vectors S₁₁, T₁₁, S₂₂.₁, T₂₂.₁, and Z are mutually independent. Further,
𝓛(Z) = W(I, p₂, p₁).
Proof. Since S and T are independent by assumption, (S₁₁, S₁₂, S₂₂.₁) and (T₁₁, T₁₂, T₂₂.₁) are independent. Also, Proposition 8.8 shows that (S₁₁, S₁₂) and S₂₂.₁ are independent with
𝓛(S₂₂.₁) = W(I, p₂, n − p₁),
𝓛(S₁₂|S₁₁) = N(0, S₁₁ ⊗ I_{p₂}),
and
𝓛(S₁₁) = W(I, p₁, n).
Similar remarks hold for (T₁₁, T₁₂) and T₂₂.₁ with n replaced by m. Thus the
four random vectors (S₁₁, S₁₂), S₂₂.₁, (T₁₁, T₁₂), and T₂₂.₁ are mutually independent. Since Z is a function of (S₁₁, S₁₂) and (T₁₁, T₁₂), the proposition follows if we show that Z is independent of the vector (S₁₁, T₁₁). Conditional on (S₁₁, T₁₁),
𝓛((S₁₂; T₁₂)|(S₁₁, T₁₁)) = N(0, (S₁₁ 0; 0 T₁₁) ⊗ I_{p₂}).
Let A (respectively, B) be the positive definite square root of S₁₁ (respectively, T₁₁). With V = A⁻¹S₁₂ and W = B⁻¹T₁₂,
𝓛((V; W)|(S₁₁, T₁₁)) = N(0, I_{2p₁} ⊗ I_{p₂}).
Also,
Z = S₁₂'S₁₁⁻¹S₁₂ + T₁₂'T₁₁⁻¹T₁₂ − (S₁₂ + T₁₂)'(S₁₁ + T₁₁)⁻¹(S₁₂ + T₁₂)
= (V; W)'(V; W) − (V; W)'(A; B)(A² + B²)⁻¹(A; B)'(V; W)
= (V; W)'Q(V; W)
where
Q = I_{2p₁} − (A; B)(A² + B²)⁻¹(A; B)'.
However, Q is easily shown to be an orthogonal projection of rank p₁. By Proposition 8.5,
𝓛(Z|(S₁₁, T₁₁)) = W(I, p₂, p₁)
for each value of (S₁₁, T₁₁). Therefore, Z is independent of (S₁₁, T₁₁) and the proof is complete. □
Proposition 8.18. If m ≥ p, then
U(n, m, p) = ∏ᵢ₌₁ᵖ U(n − p + i, m, 1).
Proof. By definition,
U(n, m, p) = 𝓛(|S|/|S + T|)
where S and T are independent, 𝓛(T) = W(I, p, m), and 𝓛(S) = W(I, p, n)
with n ≥ p. In the notation of Proposition 8.17, partition S and T with p₁ = 1 and p₂ = p − 1. Then s₁₁, t₁₁, S₂₂.₁, T₂₂.₁, and
Z = S₁₂'s₁₁⁻¹S₁₂ + T₁₂'t₁₁⁻¹T₁₂ − (S₁₂ + T₁₂)'(s₁₁ + t₁₁)⁻¹(S₁₂ + T₁₂)
are mutually independent. However,
|S| = s₁₁|S₂₂.₁|
and
|S + T| = (s₁₁ + t₁₁)|(S + T)₂₂.₁| = (s₁₁ + t₁₁)|S₂₂.₁ + T₂₂.₁ + Z|.
Thus
|S|/|S + T| = [s₁₁/(s₁₁ + t₁₁)] × [|S₂₂.₁|/|S₂₂.₁ + T₂₂.₁ + Z|]
and the two factors on the right side of this equality are independent by Proposition 8.17. Obviously,
𝓛(s₁₁/(s₁₁ + t₁₁)) = U(n, m, 1).
Since 𝓛(T₂₂.₁) = W(I, p − 1, m − 1), 𝓛(Z) = W(I, p − 1, 1), and T₂₂.₁ and Z are independent, it follows that
𝓛(T₂₂.₁ + Z) = W(I, p − 1, m).
Therefore,
𝓛(|S₂₂.₁|/|S₂₂.₁ + T₂₂.₁ + Z|) = U(n − 1, m, p − 1),
which implies the relation
U(n, m, p) = U(n, m, 1)·U(n − 1, m, p − 1).
Now, an easy induction argument establishes
U(n, m, p) = ∏ᵢ₌₁ᵖ U(n − i + 1, m, 1),
which implies that
U(n, m, p) = ∏ᵢ₌₁ᵖ U(n − p + i, m, 1),
and this completes the proof. □
Combining Propositions 8.15 and 8.18 leads to the following.
Proposition 8.19. If m ≥ p, then
U(n, m, p) = U(n − p + m, p, m).
Proof. For arbitrary m, Proposition 8.15 yields
U(n, m, p) = ∏ᵢ₌₁ᵐ 𝔅((n − p + i)/2, p/2)
where this notation means that the distribution U(n, m, p) can be represented as the product of m independent beta random variables, the factors in the product having 𝔅((n − p + i)/2, p/2) distributions. Since
U(n − p + i, m, 1) = 𝔅((n − p + i)/2, m/2),
Proposition 8.18 implies that
U(n, m, p) = ∏ᵢ₌₁ᵖ U(n − p + i, m, 1) = ∏ᵢ₌₁ᵖ 𝔅((n − p + i)/2, m/2).
Applying Proposition 8.15 to U(n − p + m, p, m) yields
U(n − p + m, p, m) = ∏ᵢ₌₁ᵖ 𝔅((n − p + m − m + i)/2, m/2) = ∏ᵢ₌₁ᵖ 𝔅((n − p + i)/2, m/2),
which is the distribution U(n, m, p). □
In practice, the relationship U(n, m, p) = U(n − p + m, p, m) shows that it is sufficient to deal with the case m ≤ p when tabulating the distribution U(n, m, p). Rather accurate approximations to the percentage points of the distribution U(n, m, p) are available; these are discussed in detail in Anderson (1958, Chapter 8). This topic is not pursued further here.
PROBLEMS
1. Suppose S is W(Σ, 2, n), n ≥ 2, Σ > 0. Show that the density of r = s₁₂/√(s₁₁s₂₂) can be written as

p(r|ρ) = c(n)(1 − ρ²)^{n/2}(1 − r²)^{(n−3)/2}ψ(ρr)

where ρ = σ₁₂/√(σ₁₁σ₂₂), c(n) is a normalizing constant depending only on n, and ψ is defined as follows. Let X₁ and X₂ be independent chi-square random variables, each with n degrees of freedom. Then ψ(t) = E exp[t(X₁X₂)^{1/2}] for |t| < 1. Using this representation, prove that p(r|ρ) has a monotone likelihood ratio.
2. The gamma distribution with parameters α > 0 and λ > 0, denoted by G(α, λ), has the density

p(x|α, λ) = x^{α−1} exp[−x/λ]/(λ^α Γ(α)), x > 0,

with respect to Lebesgue measure on (0, ∞).
(i) Show that the characteristic function of this distribution is (1 − iλt)^{−α}.
(ii) Show that a G(n/2, 2) distribution is a χ²_n distribution.
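Part (i) can be checked numerically by integrating e^{itx} against the gamma density. A minimal sketch assuming only NumPy (the grid and parameter values are arbitrary):

```python
import math
import numpy as np

alpha, lam = 2.5, 1.5                 # shape alpha and scale lambda of G(alpha, lam)
x = np.linspace(1e-9, 150.0, 300001)  # fine grid; the density is negligible beyond
dens = x ** (alpha - 1) * np.exp(-x / lam) / (lam ** alpha * math.gamma(alpha))

def cf_numeric(t):
    # E exp(itX) approximated by trapezoidal integration of exp(itx) p(x) dx
    f = np.exp(1j * t * x) * dens
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x))

def cf_closed(t):
    # claimed closed form (1 - i lambda t)^{-alpha}
    return (1 - 1j * lam * t) ** (-alpha)

errs = [abs(cf_numeric(t) - cf_closed(t)) for t in (-1.0, -0.3, 0.5, 2.0)]
```

The numerical and closed-form values should agree to several decimal places at each test point.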
3. The above problem suggests that it is natural to view the gamma family as an extension of the chi-squared family by allowing nonintegral degrees of freedom. Since the W(Σ, p, n) distribution is a generalization of the chi-squared distribution, it is reasonable to ask if we can define a Wishart distribution for nonintegral degrees of freedom. One way to pose this question is to ask for what values of α

φ_α(A) = |I_p − 2iA|^{−α}, A ∈ S_p,

is a characteristic function. (We have taken Σ = I_p for convenience.)
(i) Using Proposition 8.3 and Problem 7.1, show that φ_α is a characteristic function for α = 1/2, 1, 3/2, …, (p − 1)/2 and all real α > (p − 1)/2. Give the density that corresponds to φ_α for α > (p − 1)/2. W(I_p, p, 2α) denotes such a distribution.
(ii) For any Σ > 0 and the values of α given in (i), show that φ_α(Σ^{1/2}AΣ^{1/2}), A ∈ S_p, is a characteristic function.
4. Let S be a random element of the inner product space (S_p, ⟨·,·⟩) where ⟨·,·⟩ is the usual trace inner product on S_p. Say that S has an O_p-invariant distribution if ℒ(S) = ℒ(ΓSΓ') for each Γ ∈ O_p. Assume S has an O_p-invariant distribution.
(i) Assuming ES exists, show that ES = cI_p where c = Es₁₁ and s_{ij} is the i, j element of S.
(ii) Let D ∈ S_p be diagonal with diagonal elements d₁, …, d_p. Show that var(⟨D, S⟩) = (γ − β)∑₁ᵖ dᵢ² + β(∑₁ᵖ dᵢ)² where γ = var(s₁₁) and β = cov(s₁₁, s₂₂).
(iii) For A ∈ S_p, show that var(⟨A, S⟩) = (γ − β)⟨A, A⟩ + β⟨I_p, A⟩². From this, conclude that Cov(S) = (γ − β)I_p ⊗ I_p + β I_p □ I_p.

5. Suppose S ∈ S_p⁺ has a density f with respect to Lebesgue measure dS restricted to S_p⁺. For each n ≥ p, show there exists a random matrix X ∈ ℒ_{p,n} that has a density with respect to Lebesgue measure on ℒ_{p,n} and ℒ(X'X) = ℒ(S).
6. Show that Proposition 8.4 holds for all n₁, n₂ equal to 1, 2, …, p − 1 or any real number greater than p − 1.
7. (The inverse Wishart distribution.) Say that a positive definite S ∈ S_p⁺ has an inverse Wishart distribution with parameters Λ, p, and ν if ℒ(S⁻¹) = W(Λ⁻¹, p, ν + p − 1). Here Λ ∈ S_p⁺ and ν is a positive integer. The notation ℒ(S) = IW(Λ, p, ν) signifies that ℒ(S⁻¹) = W(Λ⁻¹, p, ν + p − 1).
(i) If ℒ(S) = IW(Λ, p, ν) and A is r × p of rank r, show that ℒ(ASA') = IW(AΛA', r, ν).
(ii) If ℒ(S) = IW(I_p, p, ν) and Γ ∈ O_p, show that ℒ(ΓSΓ') = ℒ(S).
(iii) If ℒ(S) = IW(Λ, p, ν), show that ES = (ν − 2)⁻¹Λ. Show that Cov(S) has the form c₁Λ ⊗ Λ + c₂Λ □ Λ; what are c₁ and c₂?
(iv) Now, partition S into S₁₁: q × q, S₁₂: q × r, and S₂₂: r × r with S as in (iii). Show that ℒ(S₁₁) = IW(Λ₁₁, q, ν). Also show that ℒ(S₂₂·₁) = IW(Λ₂₂·₁, r, ν + q).
8. (The matric t distribution.) Suppose X is N(0, I_r ⊗ I_p) and S is W(I_p, p, m), m ≥ p. Let S^{−1/2} denote the inverse of the positive definite square root of S. When S and X are independent, the matrix T = XS^{−1/2} is said to have a matric t distribution, and this is denoted by ℒ(T) = T(m − p + 1, I_r, I_p).
(i) Show that the density of T with respect to Lebesgue measure on ℒ_{p,r} is given by

f(T) = [ω(m, p)/((√(2π))^{rp} ω(m + r, p))] |I_p + T'T|^{−(m+r)/2}.

Also, show that ℒ(T) = ℒ(ΓTΔ') for Γ ∈ O_r and Δ ∈ O_p. Using this, show ET = 0 and Cov(T) = c₁I_r ⊗ I_p when these exist. Here, c₁ is a constant equal to the variance of any element of T.
(ii) Suppose V is IW(I_p, p, ν) and that T given V is N(0, I_r ⊗ V). Show that the unconditional distribution of T is T(ν, I_r, I_p).
(iii) Using Problem 7 and (ii), show that if T is T(ν, I_r, I_p) and T₁₁ is the k × q upper left-hand corner of T, then T₁₁ is T(ν, I_k, I_q).
9. (Multivariate F distribution.) Suppose S₁ is W(I_p, p, m) (for m = 1, 2, …) and is independent of S₂, which is W(I_p, p, ν + p − 1) (for ν = 1, 2, …). The matrix F = S₂^{−1/2}S₁S₂^{−1/2} has a matric F distribution that is denoted by F(m, ν, I_p).
(i) If S is IW(I_p, p, ν) and V given S is W(S, p, m), show that the unconditional distribution of V is F(m, ν, I_p).
(ii) Suppose T is T(ν, I_r, I_p). Show that T'T is F(r, ν, I_p).
(iii) When r ≥ p, show that the F(r, ν, I_p) distribution has a density with respect to dF/|F|^{(p+1)/2} given by

p(F) = [ω(r, p)ω(ν + p − 1, p)/ω(r + ν + p − 1, p)] |F|^{r/2}/|I_p + F|^{(ν+p+r−1)/2}.

(iv) For r ≥ p, show that, if F is F(r, ν, I_p), then F⁻¹ is F(ν + p − 1, r − p + 1, I_p).
(v) If F is F(r, ν, I_p) and F₁₁ is the q × q upper left block of F, use (ii) to show that F₁₁ is F(r, ν, I_q).
(vi) Suppose X is N(0, I_r ⊗ I_p) with r ≤ p and S is W(I_p, p, m) with m ≥ p, X and S independent. Show that XS⁻¹X' is F(p, m − p + 1, I_r).
10. (Multivariate beta distribution.) Let S₁ and S₂ be independent and suppose ℒ(Sᵢ) = W(I_p, p, mᵢ), i = 1, 2, with m₁ + m₂ ≥ p. The random matrix B = (S₁ + S₂)^{−1/2}S₁(S₁ + S₂)^{−1/2} has a p-dimensional multivariate beta distribution with parameters m₁ and m₂. This is written ℒ(B) = B(m₁, m₂, I_p) (when p = 1, this is the univariate beta distribution with parameters m₁/2 and m₂/2).
(i) If B is B(m₁, m₂, I_p), show that ℒ(ΓBΓ') = ℒ(B) for all Γ ∈ O_p. Use Example 7.16 to conclude that ℒ(B) = ℒ(ΨDΨ') where Ψ ∈ O_p is uniform and is independent of the diagonal matrix D with elements λ₁ ≥ … ≥ λ_p ≥ 0. The distribution of D is determined by specifying the distribution of λ₁, …, λ_p, and this is the distribution of the ordered roots of (S₁ + S₂)^{−1/2}S₁(S₁ + S₂)^{−1/2}.
(ii) With S₁ and S₂ as in the definition of B, show that S₁^{1/2}(S₁ + S₂)⁻¹S₁^{1/2} is B(m₁, m₂, I_p).
(iii) Suppose F is F(m, ν, I_p). Use (i) and (ii) to show that (I + F)⁻¹ is B(ν + p − 1, m, I_p) and F(I + F)⁻¹ is B(m, ν + p − 1, I_p).
(iv) Suppose that X is N(0, I_r ⊗ I_p) and that it is independent of S, which is W(I_p, p, m). When r ≤ p and m ≥ p, show that X(S + X'X)⁻¹X' is B(p, r + m − p, I_r).
(v) If B is B(m₁, m₂, I_p) and m₁ ≥ p, show that det(B) is distributed as U(m₁, m₂, p) in the notation of Section 7.4.
NOTES AND REFERENCES
1. The Wishart distribution was first derived in Wishart (1928).
2. For some alternative discussions of the Wishart distribution, see Anderson (1958), Dempster (1969), Rao (1973), and Muirhead (1982).
3. The density function of the noncentral Wishart distribution in the general case is obtained by "evaluating"

(8.1) ∫_{O_n} exp[tr ΓxΣ⁻¹μ'] μ₀(dΓ)

(see the proof of Proposition 8.12); here μ₀ denotes the uniform probability distribution on O_n. The problem of evaluating

ψ(A) = ∫_{O_n} exp[tr ΓA] μ₀(dΓ)

for A ∈ ℒ_{n,n} has received much attention since the paper of James (1954). Anderson (1946) first gave the noncentral Wishart density when
the mean matrix μ has rank 1 or rank 2. Much of the theory surrounding the evaluation of ψ and series expansions for ψ can be found in Muirhead (1982).

4. Wilks (1932) first proved Proposition 8.15 by calculating all the moments of U and showing that these matched the moments of the product of independent beta random variables. Anderson (1958) also uses the moment method to find the distribution of U. This method was used by Box (1949) to provide asymptotic expansions for the distribution of U (see Anderson, 1958, Chapter 8).
CHAPTER 9

Inference for Means in Multivariate Linear Models
Essentially, this chapter consists of a number of examples of estimation and testing problems for means when an observation vector has a normal distribution. Invariance is used throughout to describe the structure of the models considered and to suggest possible testing procedures. Because of space limitations, maximum likelihood estimators are the only type of estimators discussed. Further, likelihood ratio tests are calculated for most of the examples considered.
Before turning to the concrete examples, it is useful to have a general model within which we can view the results of this chapter. Consider an n-dimensional inner product space (V, (·,·)) and suppose that X is a random vector in V. To describe the type of parametric models we consider for X, let f be a decreasing function on [0, ∞) to [0, ∞) such that f[(x, x)] is a density with respect to Lebesgue measure on (V, (·,·)). For convenience, it is assumed that f has been normalized so that, if Z ∈ V has density f, then Cov(Z) = I. Obviously, such a Z has mean zero. Now, let M be a subspace of V and let γ be a set of positive definite linear transformations on V to V such that I ∈ γ. The pair (M, γ) serves as the parameter space for a model for X. For μ ∈ M and Σ ∈ γ,

p(x|μ, Σ) = |Σ|^{−1/2} f[(x − μ, Σ⁻¹(x − μ))]

is a density on V. The family

{p(·|μ, Σ)|μ ∈ M, Σ ∈ γ}
determines a parametric model for X. It is clear that if p(·|μ, Σ) is the density of X, then EX = μ and Cov(X) = Σ. In particular, when

f(u) = (√(2π))^{−n} exp[−u/2], u ≥ 0,

then X has a normal distribution with mean μ ∈ M and covariance Σ ∈ γ. The parametric model for X is in fact a linear model for X with parameter set (M, γ). Now, assume that Σ(M) = M for all Σ ∈ γ. Since I ∈ γ, the least-squares and Gauss-Markov estimators of μ are equal to PX where P is the orthogonal projection onto M. Further, μ̂ = PX is also the maximum likelihood estimator of μ. To see this, first note that PΣ = ΣP for Σ ∈ γ since M is invariant under each Σ ∈ γ. With Q = I − P, we have

(x − μ, Σ⁻¹(x − μ)) = (P(x − μ) + Qx, Σ⁻¹(P(x − μ) + Qx))
= (Px − μ, Σ⁻¹(Px − μ)) + (Qx, Σ⁻¹Qx).

The last equality is a consequence of

(Qx, Σ⁻¹P(x − μ)) = (x, QΣ⁻¹P(x − μ)) = (x, QPΣ⁻¹(x − μ)) = 0

as QP = 0 and Σ⁻¹P = PΣ⁻¹. Therefore, for each Σ ∈ γ,

(x − μ, Σ⁻¹(x − μ)) ≥ (Qx, Σ⁻¹Qx)

with equality iff μ = Px. Since the function f was assumed to be decreasing, it follows that μ̂ = PX is a maximum likelihood estimator of μ, and μ̂ is unique if f is strictly decreasing. Thus, under the assumptions made so far, μ̂ = PX is the maximum likelihood estimator for μ. These assumptions hold for most of the examples considered in this chapter. To find the maximum likelihood estimator of Σ, it is necessary to compute

sup_{Σ∈γ} |Σ|^{−1/2} f[(Qx, Σ⁻¹Qx)]

and find the point Σ̂ ∈ γ where the supremum is achieved, assuming it exists. The solution to this problem depends crucially on the set γ, and this is what generates the infinite variety of possible models, even with the assumption that ΣM = M for Σ ∈ γ. The examples of this chapter are generated by simply choosing some γ's for which Σ̂ can be calculated explicitly.

We end this rather lengthy introduction with a few general comments about testing problems. In the notation of the previous paragraph, consider a parameter set (M, γ) with I ∈ γ and assume ΣM = M for Σ ∈ γ. Also, let
M₀ ⊆ M be a subspace of V and assume that ΣM₀ = M₀ for Σ ∈ γ. Consider the problem of testing the null hypothesis that μ ∈ M₀ versus the alternative that μ ∈ (M − M₀). Under the null hypothesis, the maximum likelihood estimator for μ is μ̂₀ = P₀X where P₀ is the orthogonal projection onto M₀. Thus the likelihood ratio test rejects the null hypothesis for small values of

Λ(x) = sup_{Σ∈γ} |Σ|^{−1/2} f[(Q₀x, Σ⁻¹Q₀x)] / sup_{Σ∈γ} |Σ|^{−1/2} f[(Qx, Σ⁻¹Qx)]

where Q₀ = I − P₀. Again, the set γ is the major determinant with regard to the distribution, invariance, and other properties of Λ(x). The examples in this chapter illustrate some of the properties of γ that lead to tractable solutions to the estimation problem for Σ and the testing problem described above.
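The quadratic-form decomposition underlying the maximum likelihood argument above is easy to verify numerically for a covariance of the form aP + bQ, which automatically satisfies ΣM = M. A minimal sketch, assuming only NumPy and an arbitrary subspace M = range(Z):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
Z = rng.standard_normal((n, k))
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T   # orthogonal projection onto M = range(Z)
Q = np.eye(n) - P

Sigma = 2.0 * P + 0.5 * Q              # Sigma M = M since Sigma is built from P, Q
Sigma_inv = np.linalg.inv(Sigma)

x = rng.standard_normal(n)
mu = Z @ rng.standard_normal(k)        # an arbitrary point of M

# (x - mu, Sigma^{-1}(x - mu)) = (Px - mu, Sigma^{-1}(Px - mu)) + (Qx, Sigma^{-1}Qx)
lhs = (x - mu) @ Sigma_inv @ (x - mu)
rhs = (P @ x - mu) @ Sigma_inv @ (P @ x - mu) + (Q @ x) @ Sigma_inv @ (Q @ x)
resid = (Q @ x) @ Sigma_inv @ (Q @ x)
```

The two sides agree to machine precision, and lhs is always at least resid, with equality when μ = Px.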
9.1. THE MANOVA MODEL
The multivariate general linear model introduced in Example 4.4, also known as the multivariate analysis of variance model (the MANOVA model), is the subject of this section. The vector space under consideration is ℒ_{p,n} with the usual inner product ⟨·,·⟩, and the subspace M of ℒ_{p,n} is

M = {x|x = Zβ, β ∈ ℒ_{p,k}}

where Z is a fixed n × k matrix of rank k. Consider an observation vector X ∈ ℒ_{p,n} and assume that

ℒ(X) = N(μ, I_n ⊗ Σ)

where μ ∈ M and Σ is an unknown p × p positive definite matrix. Thus the set of covariances for X is

γ = {I_n ⊗ Σ|Σ ∈ S_p⁺}

and (M, γ) is the parameter set of the linear model for X. It was verified in Example 4.4 that M is invariant under each element of γ. Also, the orthogonal projection onto M is P = P_z ⊗ I_p where

P_z = Z(Z'Z)⁻¹Z'.

Further, Q = I − P = Q_z ⊗ I_p is the orthogonal projection onto M⊥ where
Q_z = I_n − P_z. Thus

μ̂ = PX = (P_z ⊗ I_p)X = P_zX

is the maximum likelihood estimator of μ ∈ M and, from Example 7.10,

Σ̂ = (1/n)X'Q_zX

is the maximum likelihood estimator of Σ when n − k ≥ p, which we assume throughout this discussion. Thus for the MANOVA model, the maximum likelihood estimators have been derived. The reader should check that the MANOVA model is a special case of the linear model described at the beginning of this chapter. We now turn to a discussion of the classical MANOVA testing problem.
Let K be a fixed r × k matrix of rank r and consider the problem of testing

H₀: Kβ = 0 versus H₁: Kβ ≠ 0

where μ = Zβ is the mean of X. It is not obvious that this testing problem is of the general type described in the introduction to this chapter. However, before proceeding further, it is convenient to transform this problem into what is called the canonical form of the MANOVA testing problem. The essence of the argument below is that it suffices to take

Z = Z₀ ≡ ( I_k )
          (  0 ),   K = K₀ ≡ (I_r 0)

in the above problem. In other words, a transformation of the original problem results in a problem where Z = Z₀ and K = K₀. We now proceed with the details. The parametric model for X ∈ ℒ_{p,n} is

ℒ(X) = N(Zβ, I_n ⊗ Σ)

and the statistical problem is to test H₀: Kβ = 0 versus H₁: Kβ ≠ 0. Since Z has rank k, Z = ΨU for some linear isometry Ψ ∈ ℱ_{k,n} and some k × k matrix U ∈ G_U^+. The k columns of Ψ are the first k columns of some Γ ∈ O_n, so

Ψ = Γ( I_k ) = ΓZ₀.
      (  0 )

Setting X̃ = Γ'X, β̃ = Uβ, and K̃ = KU⁻¹, we have
ℒ(X̃) = N(Z₀β̃, I_n ⊗ Σ)

and the testing problem is H₀: K̃β̃ = 0 versus H₁: K̃β̃ ≠ 0. Applying the same argument to K̃' as we did to Z,

K̃ = U₁(I_r 0)Δ' = U₁K₀Δ'

for some Δ ∈ O_k and some r × r matrix U₁ ∈ G_U^+. Let

Γ₁ = ( Δ   0       )
     ( 0   I_{n−k} )

and set Y = Γ₁'X̃, B = Δ'β̃. Since

Γ₁'Z₀β̃ = ( Δ'β̃ ) = Z₀B,
          (  0  )

it follows that

ℒ(Y) = N(Z₀B, I_n ⊗ Σ)

and the testing problem is H₀: K₀B = 0 versus H₁: K₀B ≠ 0. Thus, after two transformations, the original problem has been transformed into a problem with Z = Z₀ and K = K₀. Since K₀ = (I_r 0), the null hypothesis is that the first r rows of B are zero. Partition B into

B = ( B₁ );  B₁: r × p, B₂: (k − r) × p
    ( B₂ )

and partition Y into

Y = ( Y₁ )
    ( Y₂ );  Y₁: r × p, Y₂: (k − r) × p, Y₃: (n − k) × p.
    ( Y₃ )

Because Cov(Y) = I_n ⊗ Σ, the matrices Y₁, Y₂, and Y₃ are mutually independent and it is clear that

ℒ(Y₁) = N(B₁, I_r ⊗ Σ)
ℒ(Y₂) = N(B₂, I_{k−r} ⊗ Σ)
ℒ(Y₃) = N(0, I_{n−k} ⊗ Σ).
Also, the testing problem is H₀: B₁ = 0 versus H₁: B₁ ≠ 0. It is this form of the problem that is called the canonical MANOVA testing problem. The only reason for transforming from the original problem to the canonical problem is that certain expressions become simpler and the invariance of the MANOVA testing problem is more easily articulated when the problem is expressed in canonical form.

We now proceed to analyze the canonical MANOVA testing problem. To simplify some later formulas, the notation is changed a bit. Let Y₁, Y₂, and Y₃ be independent random matrices that satisfy

ℒ(Y₁) = N(B₁, I_r ⊗ Σ)
ℒ(Y₂) = N(B₂, I_s ⊗ Σ)
ℒ(Y₃) = N(0, I_m ⊗ Σ)

so B₁ is r × p and B₂ is s × p. As usual, Σ is a p × p unknown positive definite matrix. To ensure the existence of a maximum likelihood estimator for Σ, it is assumed that m ≥ p and the sample space for Y₃ is taken to be the set of all m × p real matrices of rank p. A set of Lebesgue measure zero has been deleted from the natural sample space ℒ_{p,m} of Y₃. The testing problem is

H₀: B₁ = 0 versus H₁: B₁ ≠ 0.

Setting n = r + s + m and

Y = ( Y₁ )
    ( Y₂ ),
    ( Y₃ )

ℒ(Y) = N(μ, I_n ⊗ Σ) where μ is an element of the subspace

M = { μ | μ = ( B₁ )
              ( B₂ ), B₁ ∈ ℒ_{p,r}, B₂ ∈ ℒ_{p,s} }.
              (  0 )

In this notation, the null hypothesis is that μ ∈ M₀ ⊆ M where

M₀ = { μ | μ = (  0 )
               ( B₂ ); B₂ ∈ ℒ_{p,s} }.
               (  0 )
Since M and M₀ are both invariant under I_n ⊗ Σ for all Σ > 0, the testing problem under consideration is of the type described in general terms earlier, and

γ = {I_n ⊗ Σ|Σ > 0}.

When the model for Y is (M, γ), the density of Y is

p(Y|B₁, B₂, Σ) = (√(2π))^{−np}|Σ|^{−n/2}
× exp[−½ tr(Y₁ − B₁)Σ⁻¹(Y₁ − B₁)' − ½ tr(Y₂ − B₂)Σ⁻¹(Y₂ − B₂)' − ½ tr Y₃Σ⁻¹Y₃'].

In this case, the maximum likelihood estimators of B₁, B₂, and Σ are easily seen to be

B̂₁ = Y₁, B̂₂ = Y₂, Σ̂ = (1/n)Y₃'Y₃.

When the model for Y is (M₀, γ), the density of Y is p(Y|0, B₂, Σ) and the maximum likelihood estimators of B₂ and Σ are

B̂₂ = Y₂, Σ̂₀ = (1/n)(Y₃'Y₃ + Y₁'Y₁).

Therefore, the likelihood ratio test rejects for small values of

Λ(Y) = p(Y|0, B̂₂, Σ̂₀)/p(Y|B̂₁, B̂₂, Σ̂) = |Y₃'Y₃|^{n/2}/|Y₃'Y₃ + Y₁'Y₁|^{n/2}.

Summarizing this, we have the following result.

Proposition 9.1. For the canonical MANOVA testing problem, the likelihood ratio test rejects the null hypothesis for small values of the statistic

U = |Y₃'Y₃|/|Y₃'Y₃ + Y₁'Y₁|.

Under H₀, ℒ(U) = U(m, r, p) where the distribution U(m, r, p) is given in Proposition 8.15.
Proof. The first assertion is clear. Under H₀, ℒ(Y₁) = N(0, I_r ⊗ Σ) and ℒ(Y₃) = N(0, I_m ⊗ Σ). Therefore, ℒ(Y₁'Y₁) = W(Σ, p, r) and ℒ(Y₃'Y₃) = W(Σ, p, m). Since m ≥ p, Proposition 8.18 implies the result. □
Before attempting to interpret the likelihood ratio test, it is useful to see first what implications can be obtained from invariance considerations in the canonical MANOVA problem. In the notation of the previous paragraph, (M, γ) is the parameter set for the model for Y and, under the null hypothesis, (M₀, γ) is the parameter set for Y. In order that the testing problem be invariant under a group of transformations, both of the parameter sets (M, γ) and (M₀, γ) must be invariant. With this in mind, consider the group G defined by

G = {g|g = (Γ₁, Γ₂, Γ₃, ξ, A); Γ₁ ∈ O_r, Γ₂ ∈ O_s, Γ₃ ∈ O_m, ξ ∈ ℒ_{p,s}, A ∈ Gl_p}

where the group action on the sample space is given by

(Γ₁, Γ₂, Γ₃, ξ, A)(Y₁, Y₂, Y₃) = (Γ₁Y₁A', Γ₂Y₂A' + ξ, Γ₃Y₃A').

The group composition, defined so that the above action is a left action on the sample space, is

(Γ₁, Γ₂, Γ₃, ξ, A)(Δ₁, Δ₂, Δ₃, η, C) = (Γ₁Δ₁, Γ₂Δ₂, Γ₃Δ₃, Γ₂ηA' + ξ, AC).

Further, the induced group action on the parameter set (M, γ) is

(Γ₁, Γ₂, Γ₃, ξ, A)(B₁, B₂, Σ) = (Γ₁B₁A', Γ₂B₂A' + ξ, AΣA'),

where the point

μ = ( B₁ ) ∈ M,  I_n ⊗ Σ ∈ γ
    ( B₂ )
    (  0 )

has been represented simply by (B₁, B₂, Σ). Now it is routine to check that when Y has a normal distribution with EY ∈ M (EY ∈ M₀) and Cov(Y) ∈ γ, then E(gY) ∈ M (E(gY) ∈ M₀) and Cov(gY) ∈ γ for g ∈ G. Thus the
hypothesis testing problem is G-invariant and the likelihood ratio test is a G-invariant function of Y. To describe the invariant tests, a maximal invariant under the action of G on the sample space needs to be computed. The following result provides one form of a maximal invariant.

Proposition 9.2. Let t = min{r, p}, and define h(Y₁, Y₂, Y₃) to be the t-dimensional vector (λ₁, …, λ_t)' where λ₁ ≥ … ≥ λ_t are the t largest eigenvalues of Y₁'Y₁(Y₃'Y₃)⁻¹. Then h is a maximal invariant under the action of G on the sample space of Y.
Proof. Note that Y₁'Y₁(Y₃'Y₃)⁻¹ has at most t nonzero eigenvalues, and these t eigenvalues are nonnegative. First, consider the case when r ≤ p, so t = r. By Proposition 1.39, the nonzero eigenvalues of Y₁'Y₁(Y₃'Y₃)⁻¹ are the same as the nonzero eigenvalues of Y₁(Y₃'Y₃)⁻¹Y₁', and these eigenvalues are obviously invariant under the action of g on Y. To show that h is maximal invariant for this case, a reduction argument similar to that in Example 7.4 is used. Given

Y = ( Y₁ )
    ( Y₂ ),
    ( Y₃ )

we claim that there exists a g₀ ∈ G such that

g₀Y = ( (D 0) )
      (   0   )
      (  I_p  )
      (   0   )

where the first block (D 0) is r × p, the second block is the s × p zero matrix, the last two blocks together form the m × p matrix with I_p atop an (m − p) × p zero block, and D is r × r and diagonal with diagonal elements √λ₁, …, √λ_r. For g = (Γ₁, Γ₂, Γ₃, ξ, A),

gY = (Γ₁Y₁A', Γ₂Y₂A' + ξ, Γ₃Y₃A').

By Proposition 5.2, Y₃ = Ψ₃U₃ where Ψ₃ ∈ ℱ_{p,m} and U₃ ∈ G_U^+ is p × p. Choose A' = U₃⁻¹Δ where Δ ∈ O_p is, as yet, unspecified. Then

Γ₁Y₁A' = Γ₁Y₁U₃⁻¹Δ

and, by the singular value decomposition theorem for matrices, there exist
a Γ₁ ∈ O_r and a Δ ∈ O_p such that

Γ₁Y₁U₃⁻¹Δ = (D 0)

where D is an r × r diagonal matrix whose diagonal elements are the square roots of the eigenvalues of Y₁(U₃'U₃)⁻¹Y₁' = Y₁(Y₃'Y₃)⁻¹Y₁'. With this choice for Δ ∈ O_p, it is clear that Y₃A' = Y₃U₃⁻¹Δ = Ψ₃Δ ∈ ℱ_{p,m}, so there exists a Γ₃ ∈ O_m such that

Γ₃Y₃U₃⁻¹Δ = ( I_p )
            (  0  ).

Choosing Γ₂ = I_s, ξ = −Y₂A', and setting

g₀ = (Γ₁, I_s, Γ₃, −Y₂U₃⁻¹Δ, (U₃⁻¹Δ)'),

g₀Y has the representation claimed. To show h is maximal invariant, suppose h(Y₁, Y₂, Y₃) = h(Z₁, Z₂, Z₃). Let D be the r × r diagonal matrix, the squares of whose diagonal elements are the eigenvalues of Y₁(Y₃'Y₃)⁻¹Y₁' and of Z₁(Z₃'Z₃)⁻¹Z₁'. Then there exist g₀ and g₁ ∈ G such that

g₀Y = ( (D 0) )
      (   0   ) = g₁Z
      (  I_p  )
      (   0   )

so Y = g₀⁻¹g₁Z. Thus Y and Z are in the same orbit and h is a maximal invariant function.

When r > p, basically the same argument establishes that h is a maximal invariant. To show h is invariant, if g = (Γ₁, Γ₂, Γ₃, ξ, A), then the matrix Y₁'Y₁(Y₃'Y₃)⁻¹ gets transformed into AY₁'Y₁(Y₃'Y₃)⁻¹A⁻¹ when Y is transformed to gY. By Proposition 1.39, the eigenvalues of AY₁'Y₁(Y₃'Y₃)⁻¹A⁻¹ are the same as the eigenvalues of Y₁'Y₁(Y₃'Y₃)⁻¹, so h is invariant. To show h is maximal invariant, first show that, for each Y, there exists a g₀ ∈ G such that

g₀Y = (  D  )
      (  0  )
      (  0  )
      ( I_p )
      (  0  )

where the first two blocks form the r × p matrix with D atop an (r − p) × p zero block, the third block is the s × p zero matrix, the last two blocks form the m × p matrix with I_p atop an (m − p) × p zero block, and D is the p × p diagonal matrix of square roots of the eigenvalues of (Y₁'Y₁)(Y₃'Y₃)⁻¹. The argument for this is similar to that given previously and is left to the reader. Now, by mimicking the proof for the case r ≤ p, it follows that h is maximal invariant. □
Proposition 9.3. The distribution of the maximal invariant h(Y₁, Y₂, Y₃) depends on the parameters (B₁, B₂, Σ) only through the vector of the t largest eigenvalues of B₁'B₁Σ⁻¹.

Proof. Since h is a G-invariant function, the distribution of h depends on (B₁, B₂, Σ) only through a maximal invariant parameter under the induced action of G on the parameter space. This action, given earlier, is

(Γ₁, Γ₂, Γ₃, ξ, A)(B₁, B₂, Σ) = (Γ₁B₁A', Γ₂B₂A' + ξ, AΣA').

However, an argument similar to that used to prove Proposition 9.2 shows that the vector of the t largest eigenvalues of B₁'B₁Σ⁻¹ is maximal invariant in the parameter space. □
An alternative form of the maximal invariant is sometimes useful.
Proposition 9.4. Let t = min{r, p) and define h,(Y,, Y2, Y3) to be the
t-dimensional vector (Os,..., 0,)' where 01 < -.. < 0, are the t smallest
eigenvalues of Y3Y3(Y3Y3 + YjY1)-'. Then 0i = 1/(1 + Xi), i = 1 t, where X 's are defined in Proposition 9.2. Further, h,(Y1, Y2, Y3) is a
maximal invariant.
Proof For A E [0, oo), let 0 = 1/(1 + A). If X satisfies the equation
Y,Y,(Y3Y3)'- XIp= =,
then a bit of algebra shows that 0 satisfies the equation
Y3Y3(Y3Y3 + - oip = 0,
and conversely. Thus 0i = 1/(1 + AX), i = 1,..., t, are the t smallest eigen values of Y3Y3(Y3Y3 + YlY1 1. Since hI(Y,, Y2, Y3) is a one-to-one function
of h (Y,, Y2, Y3), it is clear that h, (Y,, Y2, Y3) is a maximal invariant. El
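The eigenvalue correspondence θᵢ = 1/(1 + λᵢ) is easy to confirm numerically; the following sketch uses arbitrary simulated matrices (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
r, p, m = 4, 3, 9                  # here t = min(r, p) = p
Y1 = rng.standard_normal((r, p))
Y3 = rng.standard_normal((m, p))

W1 = Y1.T @ Y1
W3 = Y3.T @ Y3
# lambda_1 >= ... >= lambda_p: eigenvalues of Y1'Y1 (Y3'Y3)^{-1}
lam = np.sort(np.linalg.eigvals(W1 @ np.linalg.inv(W3)).real)[::-1]
# theta_1 <= ... <= theta_p: eigenvalues of Y3'Y3 (Y3'Y3 + Y1'Y1)^{-1}
theta = np.sort(np.linalg.eigvals(W3 @ np.linalg.inv(W3 + W1)).real)
```

Sorting λ in decreasing order and θ in increasing order lines the two vectors up so that θᵢ = 1/(1 + λᵢ) entrywise.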
Since every G-invariant test is a function of a maximal invariant, the problem of choosing a reasonable invariant test boils down to studying tests based on a maximal invariant. When t = min{p, r} = 1, the following result shows that there is only one sensible choice for an invariant test.
Proposition 9.5. If t = 1 in the MANOVA problem, then the test that rejects for large values of λ₁ is uniformly most powerful within the class of G-invariant tests. Further, this test is equivalent to the likelihood ratio test.

Proof. First consider the case when p = 1. Then Y₁'Y₁(Y₃'Y₃)⁻¹ is a nonnegative scalar and

λ₁ = Y₁'Y₁/Y₃'Y₃.

Also, ℒ(Y₁) = N(B₁, σ²I_r) and ℒ(Y₃) = N(0, σ²I_m) where Σ has been set equal to σ² to conform to classical notation when p = 1. By Proposition 8.14,

ℒ(λ₁) = F(r, m; δ)

where δ = B₁'B₁/σ², and the null hypothesis is that δ = 0. Since the noncentral F distribution has a monotone likelihood ratio, it follows that the test that rejects for large values of λ₁ is uniformly most powerful for testing δ = 0 versus δ > 0. As every invariant test is a function of λ₁, the case for p = 1 follows.

Now, suppose r = 1. Then the only nonzero eigenvalue of Y₁'Y₁(Y₃'Y₃)⁻¹ is Y₁(Y₃'Y₃)⁻¹Y₁' by Proposition 1.39. Thus

λ₁ = Y₁(Y₃'Y₃)⁻¹Y₁'

and, by Proposition 8.14,

ℒ(λ₁) = F(p, m − p + 1; δ)

where δ = B₁Σ⁻¹B₁' ≥ 0. The problem is to test δ = 0 versus δ > 0. Again, the noncentral F distribution has a monotone likelihood ratio and the test that rejects for large values of λ₁ is uniformly most powerful among tests based on λ₁.

The likelihood ratio test rejects H₀ for small values of

Λ = |Y₃'Y₃|/|Y₃'Y₃ + Y₁'Y₁| = 1/|I_p + Y₁'Y₁(Y₃'Y₃)⁻¹|.

If p = 1, then Λ = (1 + λ₁)⁻¹ and rejecting for small values of Λ is equivalent to rejecting for large values of λ₁. When r = 1, then

|I_p + Y₁'Y₁(Y₃'Y₃)⁻¹| = 1 + Y₁(Y₃'Y₃)⁻¹Y₁' = 1 + λ₁

so again Λ = (1 + λ₁)⁻¹. □
When t > 1, the situation is not so simple. In terms of the eigenvalues λ₁, …, λ_t, the likelihood ratio criterion rejects H₀ for small values of

Λ = |Y₃'Y₃|/|Y₃'Y₃ + Y₁'Y₁| = 1/|I_p + Y₁'Y₁(Y₃'Y₃)⁻¹| = ∏_{i=1}^t 1/(1 + λᵢ).

However, there are no compelling reasons to believe that other tests based on λ₁, …, λ_t would not be reasonable. Before discussing possible alternatives to the likelihood ratio test, it is helpful to write the maximal invariant statistic in terms of the original variables that led to the canonical MANOVA problem. In the original MANOVA problem, we had an observation vector X ∈ ℒ_{p,n} such that

ℒ(X) = N(Zβ, I_n ⊗ Σ)

and the problem was to test Kβ = 0. We know that

β̂ = (Z'Z)⁻¹Z'X

and

Σ̂ = (1/n)X'Q_zX ≡ (1/n)S

are the maximum likelihood estimators of β and Σ.
Proposition 9.6. Let t = min{p, r}. Suppose the original MANOVA problem is reduced to a canonical MANOVA problem. Then a maximal invariant in the canonical problem, expressed in terms of the original variables, is the vector (λ₁, …, λ_t)', λ₁ ≥ … ≥ λ_t, of the t largest eigenvalues of

V ≡ [(Kβ̂)'(K(Z'Z)⁻¹K')⁻¹(Kβ̂)]S⁻¹.

Proof. The transformations that reduced the original problem to canonical form led to the three matrices Y₁, Y₂, and Y₃ where Y₁ is r × p, Y₂ is (k − r) × p, and Y₃ is (n − k) × p. Expressing Y₁ and Y₃ in terms of X, Z, and K, it is not too difficult (but certainly tedious) to show that

Y₁'Y₁(Y₃'Y₃)⁻¹ = V.

By Proposition 9.2, the vector (λ₁, …, λ_t)' of the t largest eigenvalues of Y₁'Y₁(Y₃'Y₃)⁻¹ is a maximal invariant. Thus the vector of the t largest eigenvalues of V is a maximal invariant. □
In terms of X, Z, and K, the likelihood ratio test rejects the null hypothesis if

Λ = |S|/|(Kβ̂)'(K(Z'Z)⁻¹K')⁻¹Kβ̂ + S|

is too small. Also, the distribution of Λ under H₀ is given in Proposition 9.1 as U(n − k, r, p). The distribution of Λ when Kβ ≠ 0 is quite complicated when t > 1, except in the case when β has rank one. In this case, the distribution of Λ is given in Proposition 8.16. We now turn to the question of possible alternatives to the likelihood
distribution of A is given in Proposition 8.16. We now turn to the question of possible alternatives to the likelihood
ratio test. For notational convenience, the canonical form of the MANOVA problem is treated. However, the reader can express statistics in terms of the original variables by applying Proposition 9.6. Since our interest is in invariant tests, consider Y, and Y3, which are independent, and satisfy
E(YI) = N(B1, In C Y)
e (Y3) = N(O, Im X ).
The random vector Y2 need not be considered as invariant tests do not
involve Y2. Intuitively, the null hypothesis Ho: B1 = 0 should be rejected, on the basis of an invariant test, if the nonzero eigenvalues X, > * * * > XA of Y,Y1(Y3Y3)' are too large in some sense. Since E(Y1) = N(B1, Ir ? 2),
=YBYl=BBI + r.
Also, it is not difficult to verify that (see the problems at the end of this chapter)
Y3'Y3 m-p-1
when m - p - 1 > 0. Since Y, and Y3 are independent,
'3-' =
'r1+ I
BI-. F9 Yl'Yl ( Y3 =m-p
- P m - I
I i
Therefore, the further B, is away from zero, the larger we expect the
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
348 INFERENCE FOR MEANS IN MULTIVARIATE LINEAR MODELS
eigenvalues of B,B1 I- to be, and hence the larger we expect the eigen values of Y,YI(Y3Y3)- 1to be. In particular,
tr Yl Y1 (Y'Y3 )'= m-p-1+m - I + t B B I I I
and tr B'B l is just the sum of the eigenvalues of B'B I-'.
The test that rejects for large values of the statistic

∑_{i=1}^t λᵢ = tr Y₁'Y₁(Y₃'Y₃)⁻¹

is called the Lawley-Hotelling trace test and is one possible alternative to the likelihood ratio test. Also, the test that rejects for large values of

∑_{i=1}^t λᵢ/(1 + λᵢ) = tr Y₁'Y₁(Y₃'Y₃ + Y₁'Y₁)⁻¹

was introduced by Pillai as a competitor to the likelihood ratio test. A third competitor is based on the following considerations. The null hypothesis
Ho: B1 = 0 is equivalent to the intersection over u E Rr, liull = 1, of the
null hypotheses Hu: u'B, = 0. Combining Propositions 9.5 and 9.6, it
follows that the test that accepts HU iff
U'yl ( Y3Y3 ) Y1u < C
is a uniformly most powerful test within the class of tests that are invariant
under the group of transformations preserving Hu. Here, c is a constant.
Under H.,
e (u'Y(3'Y3) 'Y; U) = Fp,m-p+I
so it seems reasonable to require that c not depend on u. Since Ho is
equivalent to ∩{H_u : ||u|| = 1, u ∈ R^r}, H0 should be accepted iff all the H_u are accepted; that is, H0 should be accepted iff

sup_{||u||=1} u'Y1(Y3'Y3)⁻¹Y1'u ≤ c.
However, this supremum is just the largest eigenvalue of Y1(Y3'Y3)⁻¹Y1', which is λ1. Thus the proposed test is to accept H0 iff λ1 ≤ c or, equivalently,
to reject H0 for large values of λ1. This test is called Roy's maximum root test.
Unfortunately, very little is known about the comparative behavior of the tests described above. A few numerical studies have been done for small values of r, m, and p, but no single test stands out as dominating the others over a substantial portion of the set of alternatives. Since very accurate approximations are available for the null distribution of the likelihood ratio test, this test is easier to apply than the above competitors. Further, there is an interesting decomposition of the test statistic

Λ = |Y3'Y3| / |Y3'Y3 + Y1'Y1|,
which has some applications in practice. Let S = Y3'Y3 so L(S) = W(Σ, p, m), and let X1',..., Xr' denote the rows of Y1. Under H0: B1 = 0, X1,..., Xr are independent and L(Xi) = N(0, Σ). Further,
Λ = |S| / |S + ∑_{i=1}^r XiXi'| = ∏_{i=1}^r Λi

where

Λ1 = |S| / |S + X1X1'|

and

Λi = |S + ∑_{j=1}^{i−1} XjXj'| / |S + ∑_{j=1}^{i} XjXj'|,  i = 2,..., r.
Proposition 8.15 gives the distribution of Λi under H0 and shows that Λ1,..., Λr are independent under H0. Let β1',..., βr' denote the rows of B1 and consider the r testing problems given by the null hypotheses

H_{0i}: β1 = ··· = βi = 0

and the alternatives

H_{1i}: β1 = ··· = β_{i−1} = 0, βi ≠ 0,

for i = 1,..., r. Obviously, H0 = ∩_{i=1}^r H_{0i}, and the likelihood ratio test for
testing H_{0i} against H_{1i} rejects H_{0i} iff Λi is too small. Thus the likelihood ratio test for H0 can be thought of as one possible way of combining the r independent test statistics Λ1,..., Λr into an overall test of ∩_{i=1}^r H_{0i}.
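The identities relating these statistics to the eigenvalues λ1 ≥ ··· ≥ λt, and the telescoping product Λ = ∏Λi, are easy to check numerically. The following sketch is illustrative only: the matrices are simulated with numpy, and the dimensions r, p, m are arbitrary choices, not anything fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
r, p, m = 3, 4, 12                     # illustrative dimensions with m > p
Y1 = rng.standard_normal((r, p))       # stands in for Y1 (simulated under H0)
Y3 = rng.standard_normal((m, p))       # stands in for Y3

A = Y1.T @ Y1                          # Y1'Y1
S = Y3.T @ Y3                          # Y3'Y3, nonsingular since m > p
lam = np.sort(np.linalg.eigvals(A @ np.linalg.inv(S)).real)[::-1]

# The three competitors and the likelihood ratio statistic as functions of the
# eigenvalues lam alone.
lawley_hotelling = lam.sum()
pillai = np.sum(lam / (1.0 + lam))
roy = lam[0]
wilks = np.prod(1.0 / (1.0 + lam))

assert np.isclose(lawley_hotelling, np.trace(A @ np.linalg.inv(S)))
assert np.isclose(pillai, np.trace(A @ np.linalg.inv(S + A)))
assert np.isclose(wilks, np.linalg.det(S) / np.linalg.det(S + A))

# Telescoping decomposition Lambda = prod_i Lambda_i over the rows of Y1.
lams, T = [], S.copy()
for x in Y1:                           # rows X1', ..., Xr'
    T_next = T + np.outer(x, x)
    lams.append(np.linalg.det(T) / np.linalg.det(T_next))
    T = T_next
assert np.isclose(np.prod(lams), wilks)
```

Each Λi lies in (0, 1], and the product telescopes because the numerator of one factor cancels the denominator of the previous one.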
9.2. MANOVA PROBLEMS WITH BLOCK DIAGONAL COVARIANCE STRUCTURE
The parameter set of the MANOVA model considered in the previous section consisted of a subspace

M = {μ | μ = ZB, B ∈ L_{p,k}} ⊆ L_{p,n}

and a set of covariance matrices

γ = {I_n ⊗ Σ | Σ ∈ S_p⁺}.
It was assumed that the matrix Σ was completely unknown. In this section, we consider estimation and testing problems when certain things are known about Σ. For example, if Σ = σ²I_p with σ² unknown and positive, then we have the linear model discussed in Section 3.1. In this case, the linear model with parameter set (M, γ) is just a univariate linear model in the sense that I_n ⊗ Σ = σ²(I_n ⊗ I_p) and I_n ⊗ I_p is the identity linear transformation on the vector space L_{p,n}. This model is just the linear model of Section 9.1 when p = 1 and np plays the role of n. Of course, when Σ = σ²I_p, the subspace M need not have the structure above in order for Proposition 4.6 to hold.
In what follows, we consider another assumption concerning Σ and treat certain estimation and testing problems. For the models treated, it is shown that these models are actually "products" of the MANOVA models discussed in Section 9.1. Suppose Y ∈ L_{p,n} is a random vector with E(Y) ∈ M where

M = {μ | μ = ZB, B ∈ L_{p,k}}

and Z is a known n × k matrix of rank k. Write p = p1 + p2, pi ≥ 1, for i = 1, 2. The covariance of Y is assumed to be an element of
γ0 = { I_n ⊗ Σ | Σ = [ Σ11  0 ; 0  Σ22 ],  Σii ∈ S_{pi}⁺, i = 1, 2 }.
Thus the rows of Y, say Y1',..., Yn', are uncorrelated. Further, if Yi' is partitioned into Xi ∈ R^{p1} and Wi ∈ R^{p2}, Yi' = (Xi', Wi'), then Xi and Wi are also uncorrelated, since

Cov(Yi) = [ Σ11  0 ; 0  Σ22 ].
Thus the interpretation of the assumed structure of γ0 is that the rows of Y are uncorrelated and, within each row, the first p1 coordinates are uncorrelated with the last p2 coordinates. This suggests that we decompose Y into X ∈ L_{p1,n} and W ∈ L_{p2,n} where

Y = (X, W).

Obviously, the rows of X (of W) are X1,..., Xn (W1,..., Wn). Also, partition B ∈ L_{p,k} into B1 ∈ L_{p1,k} and B2 ∈ L_{p2,k}. It is clear that

E(X) ∈ M1 = {μ1 | μ1 = ZB1, B1 ∈ L_{p1,k}}

and

E(W) ∈ M2 = {μ2 | μ2 = ZB2, B2 ∈ L_{p2,k}}.
Further,

Cov(X) ∈ γ1 = {I_n ⊗ Σ11 | Σ11 ∈ S_{p1}⁺}

and

Cov(W) ∈ γ2 = {I_n ⊗ Σ22 | Σ22 ∈ S_{p2}⁺}.
Since X and W are uncorrelated, if Y has a normal distribution, then X and W are independent and normal, and we have a MANOVA model of Section 9.1 for both X and W (with parameter sets (M1, γ1) and (M2, γ2)). In summary, when Y has a normal distribution, Y can be partitioned into X and W, which are independent. Therefore, the density of Y is

f(Y|μ, Σ) = f1(X|μ1, Σ11) f2(W|μ2, Σ22)

where f, f1, and f2 are normal densities on the appropriate spaces. Since we have MANOVA models for both X and W, the maximum likelihood estimators of μ1, μ2, Σ11, and Σ22 follow from the results of the first section.
For testing the null hypothesis H0: KB = 0, K: r × k of rank r, a similar decomposition occurs. As B = (B1, B2), H0: KB = 0 is equivalent to the two null hypotheses H01: KB1 = 0 and H02: KB2 = 0.
Proposition 9.7. Assume that n − k > max(p1, p2). For testing H0: KB = 0, the likelihood ratio test rejects for small values of Λ = Λ1Λ2 where

Λ1 = |X'Q_Z X| / |X'Q_Z X + (KB̂1)'(K(Z'Z)⁻¹K')⁻¹KB̂1|

and

Λ2 = |W'Q_Z W| / |W'Q_Z W + (KB̂2)'(K(Z'Z)⁻¹K')⁻¹KB̂2|.

Here, Q_Z = I − P_Z where P_Z = Z(Z'Z)⁻¹Z', and

B̂1 = (Z'Z)⁻¹Z'X,  B̂2 = (Z'Z)⁻¹Z'W.
Proof. We need to calculate

Λ(Y) = sup_{(μ,Σ)∈H0} f(Y|μ, Σ) / sup_{(μ,Σ)∈Θ} f(Y|μ, Σ)

where Θ is the set of (μ, Σ) such that μ ∈ M and I_n ⊗ Σ ∈ γ0. As noted earlier,

f(Y|μ, Σ) = f1(X|μ1, Σ11) f2(W|μ2, Σ22).

Also, (μ, Σ) ∈ H0 iff (μ1, Σ11) ∈ H01 and (μ2, Σ22) ∈ H02. Further, (μ, Σ) ∈ Θ iff (μi, Σii) ∈ Θi, where Θi is the set of (μi, Σii) such that μi ∈ Mi and I_n ⊗ Σii ∈ γi, for i = 1, 2. From these remarks, it follows that

Λ(Y) = Ψ1(X) Ψ2(W)
where

Ψ1(X) = sup_{(μ1,Σ11)∈H01} f1(X|μ1, Σ11) / sup_{(μ1,Σ11)∈Θ1} f1(X|μ1, Σ11)

and

Ψ2(W) = sup_{(μ2,Σ22)∈H02} f2(W|μ2, Σ22) / sup_{(μ2,Σ22)∈Θ2} f2(W|μ2, Σ22).
However, Ψ1(X) is simply the likelihood ratio statistic for testing H01 in the MANOVA model for X. The results of Propositions 9.6 and 9.1 show that Ψ1(X) = (Λ1)^{n/2}. Similarly, Ψ2(W) = (Λ2)^{n/2}. Thus Λ(Y) = (Λ1Λ2)^{n/2}, so the test that rejects for small values of Λ = Λ1Λ2 is equivalent to the likelihood ratio test. □
Since X and W are independent, the statistics Λ1 and Λ2 are independent. The distribution of Λi when H0 is true is U(n − pi, r, pi) for i = 1, 2. Therefore, when H0 is true, Λ1Λ2 is distributed as a product of independent beta random variables, and the results in Anderson (1958) provide an approximation to the null distribution of Λ1Λ2.
We now turn to a discussion of the invariance aspects of testing H0: KB = 0 on the basis of the observation vector Y. The argument used to reduce the MANOVA model of Section 9.1 to canonical form is valid here, and this leads to a group of transformations G1 that preserves the testing problem H01 for the MANOVA model for X. Similarly, there is a group G2 that preserves the testing problem H02 for the MANOVA model for W. Since Y = (X, W), we can define the product group G1 × G2 acting on Y by

(g1, g2)Y = (g1X, g2W),

and the testing problem H0 is clearly invariant under this group action. A maximal invariant is derived as follows. Let ti = min(r, pi) for i = 1, 2, and, in the notation of Proposition 9.7, let
V1 = [(KB̂1)'(K(Z'Z)⁻¹K')⁻¹KB̂1](X'Q_Z X)⁻¹

and

V2 = [(KB̂2)'(K(Z'Z)⁻¹K')⁻¹KB̂2](W'Q_Z W)⁻¹.

Let η1 ≥ ··· ≥ η_{t1} be the t1 largest eigenvalues of V1 and θ1 ≥ ··· ≥ θ_{t2} be the t2 largest eigenvalues of V2.
Proposition 9.8. A maximal invariant under the action of G1 × G2 on Y is the (t1 + t2)-dimensional vector (η1,..., η_{t1}; θ1,..., θ_{t2}) = h(Y) = (h1(X); h2(W)). Here, h1(X) = (η1,..., η_{t1}) and h2(W) = (θ1,..., θ_{t2}).

Proof. By Proposition 9.6, h1(X) (h2(W)) is maximal invariant under the action of G1 (G2) on X (W). Thus h is G-invariant. If h(Y1) = h(Y2) where Y1 = (X1, W1) and Y2 = (X2, W2), then h1(X1) = h1(X2) and h2(W1) = h2(W2). Thus there exist g1 ∈ G1 (g2 ∈ G2) such that g1X1 = X2 (g2W1 = W2). Therefore,

(g1, g2)Y1 = (g1X1, g2W1) = (X2, W2) = Y2,

so h is maximal invariant. □
As a function of h(Y), the likelihood ratio test rejects H0 if

Λ = Λ1Λ2 = ∏_{i=1}^{t1} (1 + ηi)⁻¹ ∏_{i=1}^{t2} (1 + θi)⁻¹

is too small. Since t1 + t2 > 1, the maximal invariant h(Y) is always of dimension greater than one. Thus the situation described in Proposition 9.5 cannot arise in the present context. In no case will there exist a uniformly most powerful invariant test of H0: KB = 0, even if K has rank 1. This
completes our discussion of the present linear model.
It should be clear by now that the results described above can be easily extended to the case when Σ has the form

Σ = diag(Σ11, Σ22,..., Σss)

where the off-diagonal blocks of Σ are zero. Here Σ ∈ S_p⁺ and Σii ∈ S_{pi}⁺, with ∑_{i=1}^s pi = p. In this case, the set of covariances for Y ∈ L_{p,n} is the set γ0, which consists of all I_n ⊗ Σ where Σ has the above form and each Σii is
unknown. The mean space for Y is M as before. For this model, Y can be decomposed into s independent pieces, and we have a MANOVA model in L_{pi,n} for each piece. Also, the matrix B (E(Y) = ZB) decomposes into B1,..., Bs, Bi ∈ L_{pi,k}, and the null hypothesis H0: KB = 0 is equivalent to the intersection of the s null hypotheses H_{0i}: KBi = 0, i = 1,..., s. The likelihood ratio test of H0 is now based on a product of s independent statistics, say Λ = ∏_{i=1}^s Λi, where L(Λi) = U(n − pi, r, pi), and thus Λ is distributed as a product of independent beta random variables when H0 is true. Further, invariance considerations lead to an s-fold product group that preserves the testing problem, and a maximal invariant is of dimension t1 + ··· + ts where ti = min(r, pi), i = 1,..., s. The details of all this, which are mainly notational, are left to the reader.
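The product form of the likelihood ratio statistic reflects the fact that determinants of block diagonal matrices factor. The sketch below is a hypothetical illustration with s = 2 blocks; the per-block residual and hypothesis cross-product matrices are simulated with numpy rather than derived from a fitted model.

```python
import numpy as np

def block_diag(*blocks):
    """Assemble a block diagonal matrix from square blocks."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

rng = np.random.default_rng(1)
sizes = [2, 3]                         # p1, p2 (so p = 5); arbitrary choices
resid, hyp = [], []
for p_i in sizes:
    E = rng.standard_normal((10, p_i))
    H = rng.standard_normal((3, p_i))
    resid.append(E.T @ E)              # per-block residual cross products
    hyp.append(H.T @ H)                # per-block hypothesis cross products

# Per-block Wilks-type ratios Lambda_i ...
lams = [np.linalg.det(R) / np.linalg.det(R + G) for R, G in zip(resid, hyp)]

# ... multiply to the determinant ratio of the assembled block diagonal matrices.
R_full = block_diag(*resid)
H_full = block_diag(*hyp)
assert np.isclose(np.prod(lams),
                  np.linalg.det(R_full) / np.linalg.det(R_full + H_full))
```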
In this section, it has been shown that the linear model with a block diagonal covariance matrix can be decomposed into independent component models, each of which is a MANOVA model of the type treated in Section 9.1. This decomposition technique also appears in the next two sections in which we treat linear models with different types of covariance structure.
9.3. INTRACLASS COVARIANCE STRUCTURE
In some instances, it is natural to assume that the covariance matrix of a random vector possesses certain symmetry properties that are suggested by the sampling situation. For example, if p measurements are taken under the same experimental conditions, it may be reasonable to suppose that the order in which the observations are taken is immaterial. In other words, if X1,..., Xp denote the observations and X' = (X1,..., Xp) is the observation vector, then X and any permutation of X have the same distribution. Symbolically, this means that L(X) = L(gX) where g is a permutation matrix. If Σ = Cov(X) exists, this implies that Σ = gΣg' for g ∈ 𝒫_p, where 𝒫_p denotes the group of p × p permutation matrices. Our first task is to characterize those covariance matrices that are invariant under 𝒫_p, that is, those covariance matrices that satisfy Σ = gΣg' for all g ∈ 𝒫_p. Let e ∈ R^p be the vector of ones and set P_e = (1/p)ee' so that P_e is the orthogonal projection onto span{e}. Also, let Q_e = I_p − P_e.
Proposition 9.9. Let Σ be a positive definite p × p matrix. The following are equivalent:

(i) Σ = gΣg' for g ∈ 𝒫_p.
(ii) Σ = αP_e + βQ_e for α > 0 and β > 0.
(iii) Σ = σ²A(ρ) where σ² > 0, −1/(p − 1) < ρ < 1, and A(ρ) is a p × p matrix with elements a_ii = 1, i = 1,..., p, and a_ij = ρ for i ≠ j.
Proof. Since

A(ρ) = (1 − ρ)I_p + ρee' = (1 − ρ)I_p + ρpP_e = (1 − ρ)Q_e + (1 + (p − 1)ρ)P_e,

the equivalence of (ii) and (iii) follows by taking α = σ²(1 + (p − 1)ρ) and β = σ²(1 − ρ). Since ge = e for g ∈ 𝒫_p, gP_e = P_e g. Thus if (ii) holds, then

gΣg' = αgP_e g' + βgQ_e g' = αP_e + βQ_e = Σ
so (i) holds. To show (i) implies (ii), let X ∈ R^p be a random vector with Cov(X) = Σ. Then (i) implies that Cov(X) = Cov(gX) for g ∈ 𝒫_p. Therefore,

var(X_i) = var(X_j),  i, j = 1,..., p,

and

cov(X_i, X_j) = cov(X_{i'}, X_{j'});  i ≠ j, i' ≠ j'.

Let γ = var(X_1) and δ = cov(X_1, X_2). Then

Σ = δee' + (γ − δ)I_p = pδP_e + (γ − δ)(P_e + Q_e)
  = (γ + (p − 1)δ)P_e + (γ − δ)Q_e = αP_e + βQ_e

where α = γ + (p − 1)δ and β = γ − δ. The positivity of α and β follows from the assumption that Σ is positive definite. □
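Proposition 9.9 can be checked numerically. In the sketch below (numpy, with arbitrary illustrative values of p, σ², and ρ), the matrix σ²A(ρ) is assembled, the relations α = σ²(1 + (p − 1)ρ) and β = σ²(1 − ρ) are verified, and invariance under a permutation is confirmed:

```python
import numpy as np

p, sigma2, rho = 4, 2.0, 0.3           # illustrative values; -1/(p-1) < rho < 1
e = np.ones(p)
Pe = np.outer(e, e) / p                # orthogonal projection onto span{e}
Qe = np.eye(p) - Pe

A_rho = (1 - rho) * np.eye(p) + rho * np.outer(e, e)   # A(rho)
Sigma = sigma2 * A_rho

# (ii) <-> (iii): Sigma = alpha*Pe + beta*Qe with the stated alpha, beta.
alpha = sigma2 * (1 + (p - 1) * rho)
beta = sigma2 * (1 - rho)
assert np.allclose(Sigma, alpha * Pe + beta * Qe)

# (i): invariance under an arbitrary permutation matrix g.
g = np.eye(p)[[2, 0, 3, 1]]
assert np.allclose(g @ Sigma @ g.T, Sigma)
```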
A covariance matrix Σ that satisfies one of the conditions of Proposition 9.9 is called an intraclass covariance matrix and is said to have intraclass covariance structure. Now that intraclass covariance matrices have been described, suppose that X ∈ L_{p,n} has a normal distribution with μ = E(X) ∈ M and Cov(X) ∈ γ, where M is a linear subspace of L_{p,n} and

γ = {I_n ⊗ Σ | Σ ∈ S_p⁺, Σ = αP_e + βQ_e, α > 0, β > 0}.
The covariance structure assumed for X means that the rows of X are independent and each row of X has the same intraclass covariance structure. In terms of invariance, if Γ ⊗ g ∈ 𝒪_n ⊗ 𝒫_p and I_n ⊗ Σ ∈ γ, it is clear that

Cov((Γ ⊗ g)X) = Cov(X)

since

(Γ ⊗ g)(I_n ⊗ Σ)(Γ ⊗ g)' = (ΓI_nΓ') ⊗ (gΣg') = I_n ⊗ Σ.

Conversely, if T is a positive definite linear transformation on L_{p,n} that satisfies

(Γ ⊗ g)T(Γ ⊗ g)' = T for Γ ⊗ g ∈ 𝒪_n ⊗ 𝒫_p,

it is not difficult to show that T ∈ γ. The proof of this is left to the reader.
Since the identity linear transformation is an element of γ, in order that the least-squares estimator of μ ∈ M be the maximum likelihood estimator, it is sufficient that

(I_n ⊗ Σ)M ⊆ M for I_n ⊗ Σ ∈ γ.

Our next task is to describe a class of linear subspaces M that satisfy the above condition.
Proposition 9.10. Let C be an r × p real matrix of rank r with rows c1',..., cr'. If u1,..., ur is any basis for N = span{c1,..., cr} and U is the r × p matrix with rows u1',..., ur', then there exists an r × r nonsingular matrix A such that AU = C.

Proof. Since u1,..., ur is a basis for N,

ci = ∑_{k=1}^r a_{ik} u_k,  i = 1,..., r,

for some real numbers a_{ik}. Setting A = (a_{ik}), AU = C follows. As the basis {u1,..., ur} is mapped onto the basis {c1,..., cr} by the linear transformation defined by A, the matrix A is nonsingular. □
Given positive integers n and p, let k and r be positive integers that satisfy k < n and r < p. Define a subspace M ⊆ L_{p,n} by

M = {μ | μ = Z1BZ2; B ∈ L_{r,k}}

where Z1 is n × k of rank k and Z2 is r × p of rank r, and assume that e ∈ R^p is an element of the subspace spanned by the rows of Z2, say e ∈ N = span{z1,..., zr} where the rows of Z2 are z1',..., zr'. At this point, it is convenient to relabel things a bit. Let u1 = e/√p, u2,..., ur be an orthonormal basis for N and let U: r × p have rows u1',..., ur'. By Proposition 9.10, Z2 = AU for some r × r nonsingular matrix A, so (absorbing the nonsingular A into B)

M = {μ | μ = Z1BU, B ∈ L_{r,k}}.
Summarizing, X ∈ L_{p,n} is assumed to have a normal distribution with E(X) ∈ M and Cov(X) ∈ γ, where M and γ are given above. To decompose this model for X into the product of two simple univariate linear models, let Γ ∈ 𝒪_p have u1',..., ur' as its first r rows. With Y = (I_n ⊗ Γ)X,

E(Y) = (E(X))Γ' = Z1BUΓ'
and

Cov(Y) = (I_n ⊗ Γ)Cov(X)(I_n ⊗ Γ)'
       = (I_n ⊗ Γ)(I_n ⊗ (αP_e + βQ_e))(I_n ⊗ Γ)'
       = I_n ⊗ (αΓP_eΓ' + βΓQ_eΓ').

However,

UΓ' = (I_r 0),  ΓP_eΓ' = ε1ε1',  and  ΓQ_eΓ' = I_p − ε1ε1',

where ε1' = (1, 0,..., 0) ∈ R^p. Therefore, the matrix D = αΓP_eΓ' + βΓQ_eΓ' is
diagonal with diagonal elements d1,..., dp given by d1 = α and d2 = ··· = dp = β. Let Y1,..., Yp be the columns of Y and let b1,..., br be the columns of B. Then it is clear that Y1,..., Yp are independent,

L(Y1) = N(Z1b1, αI_n),

L(Yi) = N(Z1bi, βI_n),  i = 2,..., r,

and

L(Yi) = N(0, βI_n),  i = r + 1,..., p.
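The diagonalization that produces these independent columns is easy to exhibit. In the sketch below, Γ is any orthogonal matrix whose first row is e'/√p; the remaining rows are an arbitrary completion obtained from a QR factorization (an implementation choice, not part of the text):

```python
import numpy as np

p, alpha, beta = 5, 3.0, 1.0           # illustrative values
e = np.ones(p)
Pe = np.outer(e, e) / p
Qe = np.eye(p) - Pe
Sigma = alpha * Pe + beta * Qe         # intraclass covariance

# Build an orthogonal Gamma whose first row is e'/sqrt(p); the other rows come
# from a QR factorization and are one arbitrary completion among many.
M = np.column_stack([e / np.sqrt(p), np.eye(p)[:, :p - 1]])
Q, _ = np.linalg.qr(M)
Gamma = Q.T
Gamma[0] = e / np.sqrt(p)              # pin down the sign of the first row

D = Gamma @ Sigma @ Gamma.T
assert np.allclose(Gamma @ Gamma.T, np.eye(p))                  # Gamma orthogonal
assert np.allclose(D, np.diag(np.r_[alpha, beta * np.ones(p - 1)]))
```

The rotated covariance is diag(α, β,..., β), which is exactly why the first column of Y separates from the rest.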
To piece things back together, set m = n(p − 1) and let V ∈ R^m be given by V = (Y2', Y3',..., Yp')'. Then

L(V) = N(Z̃δ, βI_m)

where δ ∈ R^{(r−1)k}, δ' = (b2',..., br'), and Z̃ is the m × ((r − 1)k) matrix

Z̃ = [ Z1 0 ··· 0
       0 Z1 ··· 0
       ⋮        ⋮
       0 0 ··· Z1
       0 0 ···  0 ]

with r − 1 diagonal blocks Z1 followed by n(p − r) rows of zeros.
Thus X has been decomposed into the two independent random vectors Y1 and V, and the linear models for Y1 and V are given by the parameter sets (M1, γ1) and (M2, γ2) where

M1 = {μ1 | μ1 = Z1b1; b1 ∈ R^k},

γ1 = {αI_n | α > 0},

M2 = {μ2 | μ2 = Z̃δ; δ ∈ R^{(r−1)k}},

and

γ2 = {βI_m | β > 0}.

Both of these linear models are univariate in the sense that γ1 and γ2 consist of a constant times an identity matrix.
It is obvious that the general theory developed in Section 9.1 for the MANOVA model applies directly to the above two linear models individually. In particular, the maximum likelihood estimators of b1, α, δ, and β can simply be written down. Also, linear hypotheses about b1 or δ can be tested separately, and uniformly most powerful invariant tests will exist for such testing problems when the two linear models are treated separately. However, an interesting phenomenon occurs when we test a joint hypothesis about b1 and δ. For example, suppose the null hypothesis H0 is that b1 = 0 and δ = 0, and the alternative is that b1 ≠ 0 or δ ≠ 0. This null hypothesis is equivalent to the hypothesis that B = 0 in the original model for X. By simply writing down the densities of Y1 and V and substituting in the maximum likelihood estimators of the parameters, the likelihood ratio test for H0 rejects if

Λ = (||Y1 − Z1b̂1||² / ||Y1||²)^{n/2} (||V − Z̃δ̂||² / ||V||²)^{m/2}
is too small. Here, ||·|| denotes the standard norm on the coordinate Euclidean space under consideration. Let

W1 = ||Y1 − Z1b̂1||² / ||Y1||²

and

W2 = ||V − Z̃δ̂||² / ||V||²
so W1 and W2 are independent and each has a beta distribution. When p ≥ 3, then m = n(p − 1) > n, and it follows that Λ^{2/n} = W1W2^{m/n} is not in general distributed as a product of independent beta random variables. This is in contrast to the situation treated in Section 9.2 of this chapter.
We end this section with a brief description of what might be called multivariate intraclass covariance matrices. If X ∈ R^p and Cov(X) = Σ, then Σ is an intraclass covariance matrix iff Cov(gX) = Cov(X) for all g ∈ 𝒫_p. When the random vector X is replaced by the random matrix Y: p × q, then the expression gY = (g ⊗ I_q)Y still makes sense for g ∈ 𝒫_p, and it is natural to seek a characterization of Cov(Y) when Cov(Y) = Cov((g ⊗ I_q)Y) for all g ∈ 𝒫_p. For g ∈ 𝒫_p, the linear transformation g ⊗ I_q just permutes the rows of Y and, to characterize T = Cov(Y), we must describe how permutations of the rows of Y affect T. The condition that Cov(Y) = Cov((g ⊗ I_q)Y) is equivalent to the condition

T = (g ⊗ I_q)T(g ⊗ I_q)',  g ∈ 𝒫_p.
For A and B in S_q⁺, consider

T0 = P_e ⊗ A + Q_e ⊗ B.

Then T0 is a self-adjoint and positive definite linear transformation on L_{q,p} to L_{q,p}. It is readily verified that

T0 = (g ⊗ I_q)T0(g ⊗ I_q)',  g ∈ 𝒫_p.

That T0 is a possible generalization of an intraclass covariance matrix is fairly clear: the positive scalars α and β of Proposition 9.9 have become the positive definite matrices A and B. The following result shows that if T is (𝒫_p ⊗ I_q)-invariant, that is, if T satisfies T = (g ⊗ I_q)T(g ⊗ I_q)' for all g ∈ 𝒫_p, then T must be a T0 for some positive definite A and B.
Proposition 9.11. If T is positive definite and (𝒫_p ⊗ I_q)-invariant, then there exist q × q positive definite matrices A and B such that

T = P_e ⊗ A + Q_e ⊗ B.

Proof. The proof of this is left to the reader. □
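A quick numerical check of the invariance of T0 = P_e ⊗ A + Q_e ⊗ B, with numpy's kron standing in for ⊗ and arbitrary positive definite A and B (the sizes p and q are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 4, 2                            # illustrative sizes
e = np.ones(p)
Pe = np.outer(e, e) / p
Qe = np.eye(p) - Pe

def spd(k):
    """A random symmetric positive definite k x k matrix."""
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

A, B = spd(q), spd(q)
T0 = np.kron(Pe, A) + np.kron(Qe, B)   # Pe (x) A + Qe (x) B

g = np.eye(p)[[1, 3, 0, 2]]            # a permutation in P_p
G = np.kron(g, np.eye(q))              # g (x) I_q
assert np.allclose(T0, T0.T)           # self-adjoint
assert np.allclose(G @ T0 @ G.T, T0)   # (P_p (x) I_q)-invariance
```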
Unfortunately, space limitations prevent a detailed description of linear models that have covariances of the form I_n ⊗ T where T is given in Proposition 9.11. However, the analysis of these models proceeds along the lines of that given for intraclass covariance models and, as usual, these models can be decomposed into independent pieces, each of which is a MANOVA model.
9.4. SYMMETRY MODELS: AN EXAMPLE
The covariance structures studied thus far in this chapter are special cases of a class of covariance models called symmetry models. To describe these, let (V, (·, ·)) be an inner product space and let G be a compact subgroup of 𝒪(V). Define the class of positive definite transformations γ_G by

γ_G = {Σ | Σ ∈ L(V, V), Σ > 0, gΣg' = Σ for all g ∈ G}.

Thus γ_G is the set of positive definite covariances that are invariant under G in the sense that Σ = gΣg' for g ∈ G. To justify the term symmetry model for γ_G, first observe that the notion of "symmetry" is most often expressed in terms of a group acting on a set. Further, if X is a random vector in V with Cov(X) = Σ, then Cov(gX) = gΣg'. Thus the condition that Σ = gΣg' is precisely the condition that X and gX have the same covariance; hence the term symmetry model.
Most of the covariance sets considered in this book have been symmetry models for a particular choice of (V, (·, ·)) and G. For example, if G = 𝒪(V), then

γ_G = {Σ | Σ = σ²I, σ² > 0},

as Proposition 2.13 shows. Hence 𝒪(V) generates the weakly spherical symmetry model. The result of Proposition 2.19 establishes that when (V, (·, ·)) = (L_{p,n}, ⟨·, ·⟩) and

G = {g | g = Γ ⊗ I_p, Γ ∈ 𝒪_n},

then

γ_G = {Σ | Σ = I_n ⊗ A, A ∈ S_p⁺}.

Of course, this symmetry model has occurred throughout this book. Using techniques similar to those in Proposition 2.19, the covariance models considered in Section 9.2 are easily shown to be symmetry models for an appropriate group. Moreover, Propositions 9.9 and 9.11 describe sets of
covariances (the intraclass covariances and their multivariate extensions) in exactly the manner in which the set γ_G was defined. Thus symmetry models are not unfamiliar objects.
Now, given a closed group G ⊆ 𝒪(V), how can we explicitly describe the model γ_G? Unfortunately, there is no one method or approach that is appropriate for all groups G. For example, the results of Proposition 2.19 and Proposition 9.9 were proved by quite different means. However, there is a general structure theory known for the models γ_G (see Andersson, 1975), but we do not discuss that here. The general theory tells us what γ_G should look like, but it does not tell us how to derive the particular form of γ_G.
The remainder of this section is devoted to an example where the methods are a bit different from those encountered thus far. To motivate the considerations below, consider observations X1,..., Xp, which are taken at p equally spaced points on a circle and are numbered sequentially around the circle. For example, the observations might be temperatures at a fixed cross section on a cylindrical rod when a heat source is present at the center of the rod. Impurities in the rod and the interaction of adjacent measuring devices may make an exchangeability assumption concerning the joint distribution of X1,..., Xp unreasonable. However, it may be quite reasonable to assume that the covariance between Xj and Xk depends only on how far apart Xj and Xk are on the circle; that is, cov(Xj, X_{j+1}) does not depend on j, j = 1,..., p, where X_{p+1} = X1; cov(Xj, X_{j+2}) does not depend on j, j = 1,..., p, where X_{p+2} = X2; and so on. Assuming that cov(Xj, Xj) does not depend on j, these assumptions can be succinctly expressed as follows. Let X ∈ R^p have coordinates X1,..., Xp and let C be the p × p matrix with elements

c_{j,j+1} = 1, j = 1,..., p − 1;  c_{p,1} = 1;

and the remaining elements of C zero. A bit of reflection will convince the reader that the conditions assumed on the covariances are equivalent to the condition that Cov(CX) = Cov(X). The matrix C is called a cyclic permutation matrix since, if x ∈ R^p has coordinates x1,..., xp, then Cx has coordinates x2, x3,..., xp, x1. In the case that p = 5, a direct calculation
shows that
Σ = Cov(X) = Cov(CX) = CΣC'

iff Σ has the form

Σ = σ² [ 1   ρ1  ρ2  ρ2  ρ1
         ρ1  1   ρ1  ρ2  ρ2
         ρ2  ρ1  1   ρ1  ρ2
         ρ2  ρ2  ρ1  1   ρ1
         ρ1  ρ2  ρ2  ρ1  1  ]
where σ² > 0. The conditions on ρ1 and ρ2 so that Σ is positive definite are given later. Covariances that satisfy the condition Σ = CΣC' are called cyclic covariances. Some further motivation for the study of cyclic covariances can be found in Olkin and Press (1969).
To begin the formal treatment of cyclic covariances, first observe that C^p = I_p, so the group generated by C is

G0 = {I_p, C, C²,..., C^{p−1}}.
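The matrix C and the invariance condition are easy to exhibit numerically; the sketch below uses p = 5 and arbitrary values of σ², ρ1, ρ2 to check that C^p = I_p and that the displayed circulant Σ satisfies CΣC' = Σ:

```python
import numpy as np

p = 5
C = np.zeros((p, p))
for j in range(p - 1):
    C[j, j + 1] = 1.0                  # c_{j,j+1} = 1
C[p - 1, 0] = 1.0                      # c_{p,1} = 1

# C^p = I_p, so {I, C, ..., C^{p-1}} is a group of order p.
assert np.allclose(np.linalg.matrix_power(C, p), np.eye(p))

# A cyclic covariance: entry (j, k) depends only on (k - j) mod p.
sigma2, r1, r2 = 1.5, 0.4, 0.2         # illustrative values
first_row = np.array([1.0, r1, r2, r2, r1])
Sigma = sigma2 * np.array([np.roll(first_row, j) for j in range(p)])
assert np.allclose(Sigma, Sigma.T)
assert np.allclose(C @ Sigma @ C.T, Sigma)
```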
Since C generates G0, it is clear that CΣC' = Σ iff gΣg' = Σ for all g ∈ G0. In what follows, only the case of p = 2q + 1, q ≥ 1, is treated. When p is even, slightly different expressions are obtained, but the analyses are similar. Rather than characterize the covariance set γ_{G0} directly, it is useful and instructive to first describe the set

𝒞_{G0} = {B | BC = CB, B ∈ ℂ_p}.

Recall that ℂ^p is the complex vector space of p-dimensional coordinate complex vectors and ℂ_p is the set of all p × p complex matrices. Consider the complex number r = exp[2πi/p] and define complex column vectors w_k ∈ ℂ^p with jth coordinate given by

w_{jk} = p^{−1/2} exp[(2πi/p)(j − 1)(k − 1)],  j = 1,..., p,
for k = 1,..., p. A direct calculation shows that

w_k* w_l = δ_{kl},  k, l = 1,..., p,

so w1,..., wp is an orthonormal basis for ℂ^p. For future reference, note that

w1 = p^{−1/2}e,  w̄_k = w_{p−k+2}, k = 2,..., q + 1,

where p = 2q + 1, q ≥ 1. Here, the bar over w_k denotes complex conjugate, and e is the vector of ones in ℂ^p.
The basic relation

Cw_k = r^{k−1}w_k,  k = 1,..., p,

shows that

(9.1)  C = ∑_{k=1}^p r^{k−1} w_k w_k*.
As usual, * denotes conjugate transpose. Obviously, 1, r,..., r^{p−1} are eigenvalues of C with corresponding eigenvectors w1,..., wp. Let D0 ∈ ℂ_p be diagonal with d_{kk} = r^{k−1}, k = 1,..., p, and let U ∈ ℂ_p have columns w1,..., wp. The relation (9.1) can be written C = UD0U*. Since UU* = I_p, U is a unitary complex matrix.
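In coordinates, U is the discrete Fourier transform matrix, and the relation C = UD0U* can be verified directly (0-based indices in the code, versus 1-based in the text; p = 5 is an illustrative choice):

```python
import numpy as np

p = 5
idx = np.arange(p)
# Columns w_k (0-based): U[j, k] = p^{-1/2} exp[(2 pi i / p) j k].
U = np.exp(2j * np.pi * np.outer(idx, idx) / p) / np.sqrt(p)
assert np.allclose(U @ U.conj().T, np.eye(p))      # U is unitary

C = np.roll(np.eye(p), 1, axis=1)                  # cyclic permutation, (Cx)_j = x_{j+1}
r = np.exp(2j * np.pi / p)
D0 = np.diag(r ** idx)                             # d_kk = r^{k-1} in the text's indexing
assert np.allclose(U @ D0 @ U.conj().T, C)         # relation (9.1): C = U D0 U*
```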
Proposition 9.12. The set 𝒞_{G0} consists of those B ∈ ℂ_p that have the form

(9.2)  B = ∑_{k=1}^p β_k w_k w_k*

where β1,..., βp are arbitrary complex numbers.
Proof. If B has the form (9.2), the identity BC = CB follows easily from (9.1). Conversely, suppose BC = CB. Then

BUD0U* = UD0U*B

so

U*BUD0 = D0U*BU

since U*U = I_p. In other words, U*BU commutes with D0. But D0 is a diagonal matrix with distinct nonzero diagonal elements. This implies that U*BU must be diagonal, say D, with diagonal elements β1,..., βp. Thus U*BU = D so B = UDU*. Then B has the form (9.2). □
The next step is to identify those elements of 𝒞_{G0} that are real and symmetric. Consider B ∈ 𝒞_{G0} so

B = ∑_{k=1}^p β_k w_k w_k*.

Now, suppose that B is real and symmetric. Then the eigenvalues of B, namely β1,..., βp, are real. Since β1,..., βp are real and B is real, we have

∑_{k=1}^p β_k w_k w_k* = B = B̄ = ∑_{k=1}^p β_k w̄_k w̄_k*.

The relationship w̄_k = w_{p−k+2}, k = 2,..., q + 1, implies that β_k = β_{p−k+2},
k = 2,..., q + 1, so

(9.3)  B = β1 w1 w1* + ∑_{k=2}^{q+1} β_k (w_k w_k* + w̄_k w̄_k*).
But any B given by (9.3) is real, symmetric, and commutes with C, and conversely. We now show that (9.3) yields a spectral form for the real symmetric elements of 𝒞_{G0}. Write w_k = x_k + iy_k with x_k, y_k ∈ R^p, and define u_k ∈ R^p by

u_k = x_k + y_k,  k = 1,..., p.

The two identities

w_k* w_l = δ_{kl},  k, l = 1,..., p,

w̄_k = w_{p−k+2},  k = 2,..., p,

and the reality of w1 yield the identities

u_k'u_l = δ_{kl},  k, l = 1,..., p,

w_k w_k* + w̄_k w̄_k* = u_k u_k' + u_{p−k+2} u_{p−k+2}',  k = 2,..., p.

Thus u1,..., up is an orthonormal basis for R^p. Hence any B of the form (9.3) can be written

B = β1 u1 u1' + ∑_{k=2}^{q+1} β_k (u_k u_k' + u_{p−k+2} u_{p−k+2}')

and this is a spectral form for B. Such a B is positive definite iff β_k > 0 for k = 1,..., q + 1. This discussion yields the following.
Proposition 9.13. The symmetry model γ_{G0} consists of those covariances Σ that have the form

(9.4)  Σ = α1 u1 u1' + ∑_{k=2}^{q+1} α_k (u_k u_k' + u_{p−k+2} u_{p−k+2}')

where α_k > 0 for k = 1,..., q + 1.
Let Γ have rows u1',..., up'. Then Γ is a p × p symmetric orthogonal matrix with elements

γ_{jk} = p^{−1/2}( cos[(2π/p)(j − 1)(k − 1)] + sin[(2π/p)(j − 1)(k − 1)] )

for j, k = 1,..., p. Further, any Σ given by (9.4) will be diagonalized by Γ; that is, ΓΣΓ' is diagonal, say D, with diagonal elements

d_k = α_k, k = 1,..., q + 1;  d_{p−k+2} = α_k, k = 2,..., q + 1.

Since Γ simultaneously diagonalizes all the elements of γ_{G0}, Γ can sometimes be used to simplify the analysis of certain models with covariances in γ_{G0}. This is done in the following example.
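Both claims about Γ (symmetry, orthogonality, and simultaneous diagonalization of γ_{G0}) can be checked numerically. The sketch below uses p = 5 (so q = 2) and arbitrary positive values α1, α2, α3; indices are 0-based, so the book's u_{p−k+2} becomes row p − k:

```python
import numpy as np

p, q = 5, 2
Jm, Km = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
theta = 2 * np.pi * Jm * Km / p                      # (j-1)(k-1) with 0-based indices
Gamma = (np.cos(theta) + np.sin(theta)) / np.sqrt(p) # row k is u_k'

assert np.allclose(Gamma, Gamma.T)                   # symmetric
assert np.allclose(Gamma @ Gamma.T, np.eye(p))       # orthogonal

# A covariance of the form (9.4); the alphas are arbitrary positive values.
alphas = np.array([2.0, 1.0, 0.5])
u = Gamma
Sigma = alphas[0] * np.outer(u[0], u[0])
for k in range(1, q + 1):
    Sigma += alphas[k] * (np.outer(u[k], u[k]) + np.outer(u[p - k], u[p - k]))

C = np.roll(np.eye(p), 1, axis=1)                    # cyclic permutation
assert np.allclose(C @ Sigma @ C.T, Sigma)           # Sigma is cyclic
d = np.r_[alphas, alphas[:0:-1]]                     # (a1, a2, a3, a3, a2)
assert np.allclose(Gamma @ Sigma @ Gamma.T, np.diag(d))
```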
As an application of the foregoing analysis, suppose Y1,..., Yn are independent with Yj ∈ R^p, p = 2q + 1, and L(Yj) = N(μ, Σ), j = 1,..., n. It is assumed that Σ is a cyclic covariance, so Σ ∈ γ_{G0}. In what follows, we derive the likelihood ratio test for testing H0, the null hypothesis that the coordinates of μ are all equal, versus H1, the alternative that μ is completely unknown. As usual, form the matrix Y: n × p with rows Yj', j = 1,..., n, so

L(Y) = N(eμ', I_n ⊗ Σ)

where μ ∈ R^p, Σ ∈ γ_{G0}, and e is the vector of ones in R^n. Consider the new random vector Z = (I_n ⊗ Γ)Y where Γ is defined in the previous paragraph. Setting ν = Γμ, we have

L(Z) = N(eν', I_n ⊗ D)

where D = ΓΣΓ'. As noted earlier, D is diagonal with diagonal elements

d_k = α_k, k = 1,..., q + 1;  d_{p−k+2} = α_k, k = 2,..., q + 1.
Since Σ was assumed to be a completely unknown element of γ_{G0}, the diagonal elements of D are unknown parameters subject only to the restriction that α_j > 0, j = 1,..., q + 1. In terms of ν = Γμ, the null hypothesis is H0: ν2 = ··· = νp = 0. Because of the structure of D, it is convenient to relabel things once more. Denote the columns of Z by Z1,..., Zp and consider W1,..., W_{q+1} defined by

W1 = Z1;  W_j = (Z_j, Z_{p−j+2}),  j = 2,..., q + 1.

Thus W1 ∈ R^n and W_j ∈ L_{2,n} for j = 2,..., q + 1. Define vectors ξ_j ∈ R²
by

ξ_j' = (ν_j, ν_{p−j+2}),  j = 2,..., q + 1.

Now, it is clear that W1,..., W_{q+1} are independent and

L(W1) = N(ν1e, α1I_n);  L(W_j) = N(eξ_j', α_j(I_n ⊗ I_2)),  j = 2,..., q + 1.

Further, the null hypothesis is H0: ξ_j = 0, j = 2,..., q + 1, and the alternative is that ξ_j ≠ 0 for some j = 2,..., q + 1. With the model written in this
form, a derivation of the likelihood ratio test is routine. Let P_e = ee'/n and let ||·|| denote the usual norm on L_{2,n}. Then the likelihood ratio test rejects H0 for small values of

Λ = ∏_{j=2}^{q+1} ||W_j − P_eW_j||² / ||W_j||².
Of course, the likelihood ratio test of H_{0j}: ξ_j = 0 versus H_{1j}: ξ_j ≠ 0 rejects for small values of

Λ_j = ||W_j − P_eW_j||² / ||W_j||²,  j = 2,..., q + 1.
The random variables Λ2,..., Λ_{q+1} are independent and, under H_{0j},

L(Λ_j) = Beta(n − 1, 1).

Thus, under H0, Λ is distributed as a product of independent beta random variables, each with parameters n − 1 and 1.
We end this section with a discussion that leads to a new type of structured covariance, namely, the complex covariance structure that is discussed more fully in the next section. This covariance structure arises when we search for an analog of Proposition 9.11 for the cyclic group G0. To keep things simple, assume p = 3 (i.e., q = 1), so G0 has three elements and is a subgroup of the permutation group 𝒫_3, which has six elements. Since p = 3, Propositions 9.9 and 9.13 yield that γ_{𝒫_3} = γ_{G0}, and these symmetry models consist of those covariances of the form

Σ = αP_e + βQ_e,  α > 0, β > 0,

where P_e = (1/3)ee' and Q_e = I_3 − P_e.
Now, consider the two groups 𝒫3 ⊗ I_r and G0 ⊗ I_r acting on ℒ_{r,3} by

(g ⊗ I_r)(x) = gx,   g ∈ 𝒫3 or g ∈ G0, x ∈ ℒ_{r,3}.

Proposition 9.11 states that a covariance T on ℒ_{r,3} is 𝒫3 ⊗ I_r invariant iff

(9.5)   T = P_e ⊗ A + Q_e ⊗ B

for some r × r positive definite A and B. We now claim that for r > 1, there are covariances on ℒ_{r,3} that cannot be written in the form (9.5), but that are G0 ⊗ I_r invariant.

To establish the above claim, recall that the vectors u1, u2, and u3 defined earlier are an orthonormal basis for R^3 and

P_e = u1u1',   Q_e = u2u2' + u3u3'.

These vectors were defined from the vectors w_k = x_k + iy_k, k = 1, 2, 3, by u_k = x_k + y_k, k = 1, 2, 3. Define the matrix J by

J = i[w2w2* − w3w3*].

By Proposition 9.12, J commutes with C. Consider vectors v2 and v3 given by

v2 = (1/√2)(u2 + u3),   v3 = (1/√2)(u2 − u3)

so {v2, v3} is an orthonormal basis for span{u2, u3}. Since w3 = w̄2, we have u3 = x2 − y2, which implies that v2 = √2 x2 and v3 = √2 y2. This readily implies that

J = v2v3' − v3v2'
so J is skew-symmetric, nonzero, and Ju1 = 0. Now, consider the linear transformation T0 on ℒ_{r,3} to ℒ_{r,3} given by

T0 = P_e ⊗ A + Q_e ⊗ B + J ⊗ F

where A and B are r × r and positive definite and F is skew-symmetric. It is now a routine matter to show that (C ⊗ I_r)T0 = T0(C ⊗ I_r) since CP_e = P_eC, CQ_e = Q_eC, and JC = CJ. Thus T0 commutes with each element of G0 ⊗ I_r, and T0 is symmetric as both J and F are skew-symmetric. We now make two claims: first, that a nonzero F exists such that T0 is positive definite, and second, that such a T0 cannot be written in the form (9.5). Since P_e ⊗ A + Q_e ⊗ B is positive definite, it follows that for all skew-symmetric F's that are sufficiently small,

P_e ⊗ A + Q_e ⊗ B + J ⊗ F

is positive definite. Thus there exists a nonzero skew-symmetric F so that T0 is positive definite. To establish the second claim, we have the following.
Proposition 9.14. Suppose that

P_e ⊗ A1 + Q_e ⊗ B1 + J ⊗ F1 = P_e ⊗ A2 + Q_e ⊗ B2 + J ⊗ F2

where A_j and B_j, j = 1, 2, are symmetric and F_j, j = 1, 2, is skew-symmetric. This implies that A1 = A2, B1 = B2, and F1 = F2.
Proof. Recall that {u1, v2, v3} is an orthonormal basis for R^3. The relation Q_eu1 = Ju1 = 0 implies that for x ∈ R^r,

(P_e ⊗ A_j + Q_e ⊗ B_j + J ⊗ F_j)(u1□x) = u1□(A_jx)

for j = 1, 2, so u1□(A1x) = u1□(A2x). With ⟨·, ·⟩ denoting the natural inner product on ℒ_{r,3}, we have

x'A1x = ⟨u1□x, u1□(A1x)⟩ = ⟨u1□x, u1□(A2x)⟩ = x'A2x

for all x ∈ R^r. The symmetry of A1 and A2 yields A1 = A2. Since P_ev2 = 0, Q_ev2 = v2, and Jv2 = −v3, we have

(P_e ⊗ A1 + Q_e ⊗ B1 + J ⊗ F1)(v2□x) = v2□(B1x) − v3□(F1x) = v2□(B2x) − v3□(F2x)

for all x ∈ R^r. Thus

x'B1x = ⟨v2□x, v2□(B1x) − v3□(F1x)⟩ = x'B2x,

which implies that B1 = B2. Further,

−y'F1x = ⟨v3□y, v2□(B1x) − v3□(F1x)⟩ = −y'F2x

for all x, y ∈ R^r. Thus F1 = F2.  □
In summary, we have produced a covariance

T0 = P_e ⊗ A + Q_e ⊗ B + J ⊗ F

that is (G0 ⊗ I_r)-invariant but is not (𝒫3 ⊗ I_r)-invariant when r > 1. Of course, when r = 1, the two symmetry models 𝒴_{𝒫3} and 𝒴_{G0} are the same. At this point, it is instructive to write out the matrix of T0 in a special ordered basis for ℒ_{r,3}. Let ε1,..., εr be the standard basis for R^r so

{u1□ε1,..., u1□εr, v2□ε1,..., v2□εr, v3□ε1,..., v3□εr}

is an orthonormal basis for (ℒ_{r,3}, ⟨·, ·⟩). A straightforward calculation shows that the matrix of T0 in this basis is

[T0] = ( A   0   0 )
       ( 0   B   F )
       ( 0  −F   B ).

Since [T0] is symmetric and positive definite, the 2r × 2r matrix

(  B   F )
( −F   B )

has these properties also. In other words, for each positive definite B, there is a nonzero skew-symmetric F (in fact, there exist infinitely many such skew-symmetric F's) such that this 2r × 2r matrix is positive definite. This special type of structured covariance has not arisen heretofore. However, it arises again in a very natural way in the next section, where we discuss the complex normal distribution. It is not proved here, but the symmetry model of G0 ⊗ I_r when p = 3 consists of all covariances of the form

T0 = P_e ⊗ A + Q_e ⊗ B + J ⊗ F

where A and B are positive definite and F is skew-symmetric.
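The block form of [T0] is easy to exercise numerically. The sketch below is ours (NumPy assumed, with our own choices of A, B, and F): for a small skew-symmetric F, the matrix [T0] is symmetric and positive definite, as is its lower-right 2r × 2r block.

```python
import numpy as np

# Sketch: build [T0] in the ordered basis built from u1, v2, v3 and confirm
# it is symmetric positive definite for a small skew-symmetric F.
rng = np.random.default_rng(1)
r = 3
M = rng.normal(size=(r, r))
A = M @ M.T + np.eye(r)                  # r x r positive definite
B = M @ M.T + 2 * np.eye(r)              # r x r positive definite
K = rng.normal(size=(r, r))
F = 0.05 * (K - K.T)                     # small skew-symmetric perturbation
Z = np.zeros((r, r))

T0 = np.block([[A, Z, Z],
               [Z, B, F],
               [Z, -F, B]])

is_symmetric = np.allclose(T0, T0.T)     # holds because F' = -F
min_eig = np.linalg.eigvalsh(T0).min()   # > 0 when F is small enough
```

Making F larger eventually destroys positive definiteness, which is why the text requires F sufficiently small.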
9.5. COMPLEX COVARIANCE STRUCTURES
This section contains an introduction to complex covariance structures. One situation where this type of covariance structure arises was described at the end of the last section. To provide further motivation for the study of such
models, we begin this section with a brief discussion of the complex normal distribution. The complex normal distribution arises in a variety of contexts
and it seems appropriate to include the definition and the elementary properties of this distribution.
The notation introduced in Section 1.6 is used here. In particular, ℂ is the field of complex numbers, ℂ^n is the n-dimensional complex vector space of n-tuples (columns) of complex numbers, and C_n is the set of all n × n complex matrices. For x, y ∈ ℂ^n, the inner product between x and y is

(x, y) = Σ_{j=1}^n x̄_j y_j = x*y
where x* denotes the conjugate transpose of x. Each x ∈ ℂ^n has the unique representation x = u + iv with u, v ∈ R^n. Of course, u is the real part of x, v is the imaginary part of x, and i = √−1 is the imaginary unit. This representation of x defines a real vector space isomorphism between ℂ^n and R^{2n}. More precisely, for x ∈ ℂ^n, let

[x] = (u)
      (v) ∈ R^{2n}

where x = u + iv. Then [ax + by] = a[x] + b[y] for x, y ∈ ℂ^n, a, b ∈ R, and obviously, [·] is a one-to-one onto function. In particular, this shows that ℂ^n is a 2n-dimensional real vector space. If C ∈ C_n, then C = A + iB where A and B are n × n real matrices. Thus for x = u + iv ∈ ℂ^n,

Cx = (A + iB)(u + iv) = Au − Bv + i(Av + Bu)

so

[Cx] = (Au − Bv)   (A  −B)(u)
       (Bu + Av) = (B   A)(v).

This suggests that we let {C} be the (2n) × (2n) partitioned matrix given by

{C} = (A  −B)
      (B   A) : (2n) × (2n).

With this definition, [Cx] = {C}[x]. The whole point is that the matrix C ∈ C_n applied to x ∈ ℂ^n can be represented by applying the real matrix
{C} to the real vector [x] ∈ R^{2n}.

A complex matrix C ∈ C_n is called Hermitian if C = C*. Writing C = A + iB with A and B both real, C is Hermitian iff

A + iB = A' − iB',

which is equivalent to the two conditions

A = A',   B = −B'.

Thus C is Hermitian iff {C} is a symmetric real matrix. A Hermitian matrix C is positive definite if x*Cx > 0 for all x ∈ ℂ^n, x ≠ 0. However, for Hermitian C,

x*Cx = [x]'{C}[x]

so C is positive definite iff {C} is a positive definite real matrix. Of course, a Hermitian matrix C is positive semidefinite if x*Cx ≥ 0 for x ∈ ℂ^n, and C is
positive semidefinite iff {C} is positive semidefinite.

Now consider a random variable X with values in ℂ. Then X = U + iV where U and V are real random variables. It is clear that the mean value of X must be defined by

ℰX = ℰU + iℰV,

assuming ℰU and ℰV both exist. The variance of X, assuming it exists, is defined by

var(X) = ℰ[(X − ℰX)(X − ℰX)‾]

where the bar denotes complex conjugate. Since X is a complex random variable, the complex conjugate is necessary if we want the variance of X to be a nonnegative real number. In terms of U and V,

var(X) = var(U) + var(V).
It also follows that

var(aX + b) = aā var(X)

for a, b ∈ ℂ. For two random variables X1 and X2 in ℂ, define the covariance between X1 and X2 (in that order) to be

cov(X1, X2) = ℰ[(X1 − ℰX1)(X2 − ℰX2)‾],

assuming the expectations in question exist. With this definition it is clear that cov(X1, X1) = var(X1), cov(X2, X1) = cov(X1, X2)‾, and

cov(X1, X2 + X3) = cov(X1, X2) + cov(X1, X3).
Further,

cov(a1X1 + b1, a2X2 + b2) = a1ā2 cov(X1, X2)

for a1, a2, b1, b2 ∈ ℂ.
We now turn to the problem of defining a normal distribution on ℂ^n. Basically, the procedure is the same as defining a normal distribution on R^n. Step one is to define a normal distribution with mean zero and variance one on ℂ; then define an arbitrary normal distribution on ℂ by an affine transformation of the distribution defined in step one; and finally we say that Z ∈ ℂ^n has a complex normal distribution if (a, Z) = a*Z has a normal distribution in ℂ for each a ∈ ℂ^n. However, it is not entirely obvious how to carry out step one. Consider X ∈ ℂ and let ℂN(0, 1) denote the distribution, yet to be defined, called the complex normal distribution with mean zero and variance one. Writing X = U + iV, we have

[X] = (U)
      (V) ∈ R^2

so the distribution of X on ℂ determines the joint distribution of U and V on R^2 and, conversely, as [·] is one-to-one and onto. If ℒ(X) = ℂN(0, 1), then the following two conditions should hold:

(i) ℒ(aX) = ℂN(0, 1) for a ∈ ℂ with aā = 1.
(ii) [X] has a bivariate normal distribution on R^2.

When aā = 1 and X has mean zero and variance one, then aX has mean zero and variance one, so condition (i) simply says that a scalar multiple of a complex normal is again complex normal. Condition (ii) is the requirement that a normal distribution on ℂ be transformed into a normal distribution on R^2 under the real linear mapping [·]. It can now be shown that conditions (i) and (ii) uniquely define the distribution of [X] and hence provide us with the definition of a ℂN(0, 1) distribution. Since ℰX = 0, we have ℰ[X] = 0. Condition (i) implies that

ℒ([X]) = ℒ([aX]),   aā = 1.

However, writing a = α + iβ,

[aX] = (α  −β)
       (β   α) [X] = Γ[X]
where Γ is a 2 × 2 orthogonal matrix with determinant equal to one since α² + β² = 1. Therefore,

ℒ([X]) = ℒ(Γ[X])

for all such orthogonal matrices. Using this together with the fact that 1 = var(X) = var(U) + var(V) implies that

Cov([X]) = ½I_2.

Hence

ℒ([X]) = N(0, ½I_2)

so the real and imaginary parts of X are independent normals with mean zero and variance one half.
Definition 9.1. A random variable X = U + iV ∈ ℂ has a complex normal distribution with mean zero and variance one, written ℒ(X) = ℂN(0, 1), if

ℒ((U, V)') = N(0, ½I_2).

With this definition, it is clear that when ℒ(X) = ℂN(0, 1), the density of X on ℂ with respect to two-dimensional Lebesgue measure on ℂ is

p(x) = (1/π)exp[−x̄x],   x ∈ ℂ.
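As a quick numerical illustration (our sketch, not the text's; NumPy assumed), a ℂN(0, 1) draw is U + iV with U and V independent N(0, ½), and the density (1/π)exp(−x̄x) is exactly the product of the two N(0, ½) densities:

```python
import math
import numpy as np

# Sketch: sample CN(0,1) as U + iV with U, V independent N(0, 1/2) and
# check var(X) = E|X|^2 = 1, plus the density factorization.
rng = np.random.default_rng(2)
m = 200000
x = rng.normal(0, math.sqrt(0.5), m) + 1j * rng.normal(0, math.sqrt(0.5), m)
var_hat = np.mean(x * np.conj(x)).real       # should be close to 1

def cn_density(z):
    # p(z) = (1/pi) exp(-z zbar), the CN(0,1) density on the plane
    return math.exp(-abs(z) ** 2) / math.pi

def real_density(u, v):
    # product of two N(0, 1/2) densities: (1/sqrt(pi)) exp(-t^2) each
    return math.exp(-u * u - v * v) / math.pi

gap = abs(cn_density(0.3 + 0.4j) - real_density(0.3, 0.4))
```

The factorization is exact; only the variance check is Monte Carlo.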
Given μ ∈ ℂ and σ², σ > 0, a random variable X1 ∈ ℂ has a complex normal distribution with mean μ and variance σ² if ℒ(X1) = ℒ(σX + μ) where ℒ(X) = ℂN(0, 1). In such a case, we write ℒ(X1) = ℂN(μ, σ²). It is clear that X1 = U1 + iV1 has a ℂN(μ, σ²) distribution iff U1 and V1 are independent and normal with variance σ²/2 and means ℰU1 = μ1, ℰV1 = μ2, where μ = μ1 + iμ2. As in the real case, a basic result is the following.
Proposition 9.15. Suppose X1,..., Xm are independent random variables in ℂ with ℒ(X_j) = ℂN(μ_j, σ_j²), j = 1,..., m. Then

ℒ( Σ_{j=1}^m (a_jX_j + b_j) ) = ℂN( Σ_{j=1}^m (a_jμ_j + b_j), Σ_{j=1}^m a_jā_jσ_j² )

for a_j, b_j ∈ ℂ, j = 1,..., m.

Proof. This is proved by considering the real and imaginary parts of each X_j. The details are left to the reader.  □
Suppose Y is a random vector in ℂ^n with coordinates Y1,..., Yn and that var(Y_j) < +∞ for j = 1,..., n. Define a complex matrix H with elements h_jk given by

h_jk = cov(Y_j, Y_k).

Since h_jk = h̄_kj, H is a Hermitian matrix. For a, b ∈ ℂ^n, a bit of algebra shows that

cov(a*Y, b*Y) = a*Hb = (a, Hb).

As in the real case, H is the covariance matrix of Y and is denoted by Cov(Y) = H. Since a*Ha = var(a*Y) ≥ 0, H is positive semidefinite. If H = Cov(Y) and A ∈ C_n, it is readily verified that Cov(AY) = AHA*.
We now turn to the definition of a complex normal distribution on the n-dimensional complex vector space ℂ^n.

Definition 9.2. A random vector X ∈ ℂ^n has a complex normal distribution if, for each a ∈ ℂ^n, (a, X) = a*X has a complex normal distribution on ℂ.
If X ∈ ℂ^n has a complex normal distribution and if A ∈ C_n, it is clear that AX also has a complex normal distribution since (a, AX) = (A*a, X). In order to describe all the complex normal distributions on ℂ^n, we proceed as in the real case. Let X1,..., Xn be independent with ℒ(X_j) = ℂN(0, 1) on ℂ and let X ∈ ℂ^n have coordinates X1,..., Xn. Since

a*X = Σ_{j=1}^n ā_jX_j,

Proposition 9.15 shows that ℒ(a*X) = ℂN(0, Σ_{j=1}^n ā_ja_j). Thus X has a complex normal distribution. Further, ℰX = 0 and

cov(X_j, X_k) = δ_jk

so Cov(X) = I_n. For A ∈ C_n and μ ∈ ℂ^n, it follows that Y = AX + μ has a complex normal distribution and

ℰY = μ,   Cov(Y) = AA* ≡ H.
Since every nonnegative definite Hermitian matrix can be written as AA* for some A ∈ C_n, we have produced a complex normal distribution on ℂ^n with an arbitrary mean vector μ ∈ ℂ^n and an arbitrary nonnegative definite Hermitian covariance matrix. However, it still must be shown that, if X and X̃ in ℂ^n are complex normal with ℰX = ℰX̃ and Cov(X) = Cov(X̃), then ℒ(X) = ℒ(X̃). The proof of this assertion is left to the reader. Given this fact, it makes sense to speak of the complex normal distribution on ℂ^n with mean vector μ and covariance matrix H, as this specifies a unique probability distribution. If X has such a distribution, the notation

ℒ(X) = ℂN(μ, H)

is used. Writing X = U + iV, it is useful to describe the joint distribution of U and V when ℒ(X) = ℂN(μ, H) on ℂ^n. First, consider X = U + iV where ℒ(X) = ℂN(μ, I_n). Then the coordinates of X are independent and it follows that

ℒ([X]) = N([μ], ½I_{2n})
where μ = μ1 + iμ2 and [μ] = (μ1', μ2')'. For a general nonnegative definite Hermitian matrix H, write H = AA* with A ∈ C_n. Then

ℒ(X) = ℒ(AX0 + μ)

where ℒ(X0) = ℂN(0, I_n). Since

[X0] = (U0)
       (V0)

and

[AX0 + μ] = {A}[X0] + [μ] = (B  −C)(U0)   (μ1)
                            (C   B)(V0) + (μ2)

where A = B + iC, it follows that

ℒ([X]) = ℒ({A}[X0] + [μ]).

But H = Σ + iF where Σ is positive semidefinite, F is skew-symmetric, and the real matrix

{H} = (Σ  −F)
      (F   Σ)
is positive semidefinite. Since H = AA*, {H} = {A}{A}', and therefore,

ℒ([X]) = ℒ({A}[X0] + [μ]) = N([μ], ½{H}).

In summary, we have the following result.

Proposition 9.16. Suppose ℒ(X) = ℂN(μ, H) and write X = U + iV, μ = μ1 + iμ2, and H = Σ + iF. Then

ℒ([X]) = N([μ], ½{H})

where

{H} = (Σ  −F)
      (F   Σ).

Conversely, with U and V jointly distributed as above, set X = U + iV, μ = μ1 + iμ2, and H = Σ + iF. Then ℒ(X) = ℂN(μ, H).
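Proposition 9.16 is easy to exercise numerically. The sketch below is ours (NumPy assumed, with an arbitrary choice of H and μ): it draws (U, V) from N([μ], ½{H}) and recovers the complex covariance H = Σ + iF of X = U + iV empirically.

```python
import numpy as np

# Sketch of Proposition 9.16: sample the real pair (U, V) and recover the
# complex covariance H = Sigma + iF of X = U + iV.
rng = np.random.default_rng(3)
n, m = 2, 100000
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
H = A @ A.conj().T + np.eye(n)               # Hermitian positive definite
Sigma, F = H.real, H.imag                    # Sigma symmetric, F skew
half_H = 0.5 * np.block([[Sigma, -F], [F, Sigma]])   # (1/2){H}

mu = np.array([1.0 + 2.0j, -0.5 + 0.5j])
mean_real = np.concatenate([mu.real, mu.imag])       # [mu]

UV = rng.multivariate_normal(mean_real, half_H, size=m)
X = UV[:, :n] + 1j * UV[:, n:]
D = X - mu
H_hat = (D.T @ D.conj()) / m                 # empirical E (X - mu)(X - mu)*
max_err = np.abs(H_hat - H).max()
```

The empirical H_hat matches H up to Monte Carlo error, illustrating the one-to-one correspondence described next.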
The above proposition establishes a one-to-one correspondence between n-dimensional complex normal distributions, say ℂN(μ, H), and 2n-dimensional real normal distributions with a special covariance structure given by

½{H} = ½ (Σ  −F)
         (F   Σ)

where H = Σ + iF. Given a sample of independent complex normal random vectors, the above correspondence provides us with the option of either analyzing the sample in the complex domain or representing everything in the real domain and performing the analysis there. Of course, the advantage of the real domain analysis is that we have developed a large body of theory that can be applied to this problem. However, this advantage is a bit illusory. As it turns out, many results for the complex normal distribution are clumsy to prove and difficult to understand when expressed in the real domain. From the point of view of understanding, the proper approach is simply to develop a theory of the complex normal distribution that parallels the development already given for the real normal distribution. Because of space limitations, this theory is not given in detail. Rather, we provide a brief list of results for the complex normal with the hope that the reader can see the parallel development. The proofs of many of these results are minor modifications of the corresponding real results.

Consider X ∈ ℂ^p such that ℒ(X) = ℂN(μ, H) where H is nonsingular.
Then the density of X with respect to Lebesgue measure on ℂ^p is

f(x) = π^{−p}(det H)^{−1} exp[−(x − μ)*H^{−1}(x − μ)].

When H = I_p, then

ℒ(2X*X) = χ²_{2p}(2μ*μ).

With this result and the spectral theorem for Hermitian matrices (see Halmos, 1958, Section 79), the distribution of quadratic forms, say X*AX for A Hermitian, can be described in terms of linear combinations of independent noncentral chi-square random variables.
As in the real case, independence of jointly complex normal random vectors is equivalent to the absence of correlation. More precisely, if ℒ(X) = ℂN(μ, H) and if A: q × p and B: r × p are complex matrices, then AX and BX are independent iff AHB* = 0. In particular, if X is partitioned as

X = (X1)
    (X2),   X_j ∈ ℂ^{p_j}, j = 1, 2,

and H is partitioned similarly as

H = (H11  H12)
    (H21  H22)

where H_jk is p_j × p_k, then X1 and X2 are independent iff H12 = 0. When H22 is nonsingular, this implies that X1 − H12H22^{−1}X2 and X2 are independent. This result yields the conditional distribution of X1 given X2, namely,

ℒ(X1|X2) = ℂN(μ1 + H12H22^{−1}(X2 − μ2), H11·2)

where H11·2 = H11 − H12H22^{−1}H21 and μ_j = ℰX_j, j = 1, 2.
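The conditional-distribution formula can be checked by simulation. The sketch below is ours (NumPy assumed): the residual X1 − H12H22^{−1}X2 should be uncorrelated with X2 and have covariance equal to the Schur complement H11·2.

```python
import numpy as np

# Sketch: verify Cov(X1 - H12 H22^{-1} X2) ~ H11.2 and that this residual
# is uncorrelated with X2, for X complex normal with covariance H.
rng = np.random.default_rng(4)
p1, p2, m = 1, 2, 100000
p = p1 + p2
A = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))
H = A @ A.conj().T + np.eye(p)
L = np.linalg.cholesky(H)                       # H = L L*

Zstd = (rng.normal(size=(p, m)) + 1j * rng.normal(size=(p, m))) / np.sqrt(2)
X = L @ Zstd                                    # columns are CN(0, H) draws

H12, H22 = H[:p1, p1:], H[p1:, p1:]
H11_2 = H[:p1, :p1] - H12 @ np.linalg.solve(H22, H12.conj().T)

R = X[:p1] - H12 @ np.linalg.solve(H22, X[p1:]) # residual of X1 given X2
cov_R = (R @ R.conj().T) / m                    # should be near H11.2
cross = (R @ X[p1:].conj().T) / m               # should be near 0
```

Zero correlation here means full independence, by the complex normal analogue of the real-case result quoted above.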
The complex Wishart distribution arises in a natural way, just as the real Wishart distribution did.

Definition 9.3. A p × p random Hermitian matrix S has a complex Wishart distribution with parameters H, p, and n if

ℒ(S) = ℒ( Σ_{j=1}^n X_jX_j* )

where X1,..., Xn ∈ ℂ^p are independent with

ℒ(X_j) = ℂN(0, H).
In such a case, we write

ℒ(S) = ℂW(H, p, n).

In this definition, p is the dimension, n is the degrees of freedom, and H is a p × p nonnegative definite Hermitian matrix. It is clear that S is always nonnegative definite and, as in the real case, S is positive definite with probability one iff H is positive definite and n ≥ p. When p = 1 and H = 1, it is clear that

ℒ(2S) = χ²_{2n}.

Further, complex analogues of Propositions 8.8, 8.9, and 8.13 show that if ℒ(S) = ℂW(H, p, n) with n ≥ p and H positive definite, and if ℒ(X) = ℂN(0, H) with X and S independent, then

ℒ(X*S^{−1}X) = F_{2p,2(n−p+1)}.
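Definition 9.3 lends itself to a direct simulation check. The sketch below is ours (NumPy assumed): averaging draws of S = Σ_j X_jX_j* with ℒ(X_j) = ℂN(0, H) should recover the complex Wishart mean ℰS = nH.

```python
import numpy as np

# Sketch: the complex Wishart CW(H, p, n) has mean n H; estimate E S by
# averaging S = sum_j X_j X_j* over many replications.
rng = np.random.default_rng(5)
p, n, reps = 2, 8, 5000
A = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))
H = A @ A.conj().T + np.eye(p)
L = np.linalg.cholesky(H)

# draw n * reps iid CN(0, H) columns at once; each block of n columns
# contributes one Wishart draw S = sum_j X_j X_j*
Xall = L @ (rng.normal(size=(p, n * reps)) +
            1j * rng.normal(size=(p, n * reps))) / np.sqrt(2)
S_bar = (Xall @ Xall.conj().T) / reps        # average of reps CW(H, p, n) draws
max_rel_err = np.abs(S_bar - n * H).max() / np.abs(n * H).max()
```

S_bar is exactly Hermitian by construction, mirroring the fact that each S is a Hermitian nonnegative definite matrix.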
We now turn to a brief discussion of one special case of the complex MANOVA problem. Suppose X1,..., Xn ∈ ℂ^p are independent with

ℒ(X_j) = ℂN(μ, H)

and assume that H > 0, that is, H is positive definite. The joint density of X1,..., Xn with respect to 2np-dimensional Lebesgue measure is

p(X|μ, H) = ∏_{j=1}^n π^{−p}|H|^{−1} exp[−(X_j − μ)*H^{−1}(X_j − μ)]

          = π^{−np}|H|^{−n} exp[− Σ_{j=1}^n (X_j − μ)*H^{−1}(X_j − μ)]

          = π^{−np}|H|^{−n} exp[−n(X̄ − μ)*H^{−1}(X̄ − μ) − tr( Σ_{j=1}^n (X_j − X̄)(X_j − X̄)* )H^{−1}]

where X̄ = n^{−1} Σ_j X_j and tr denotes the trace. Here, X is the np-dimensional
vector in ℂ^{np} consisting of X1, X2,..., Xn. Setting

S = Σ_{j=1}^n (X_j − X̄)(X_j − X̄)*,

we have

p(X|μ, H) = π^{−np}|H|^{−n} exp[−n(X̄ − μ)*H^{−1}(X̄ − μ) − tr SH^{−1}].

It follows that (X̄, S) is a sufficient statistic for this parametric family and μ̂ = X̄ is the maximum likelihood estimator of μ. Thus

p(X|μ̂, H) = π^{−np}|H|^{−n} exp[−tr SH^{−1}].
A minor modification of the argument given in Example 7.10 shows that when S > 0, p(X|μ̂, H) is maximized uniquely, over all positive definite H, at Ĥ = n^{−1}S. When n ≥ p + 1, then S is positive definite with probability one, so in this case the maximum likelihood estimator of H is Ĥ = n^{−1}S. If μ = 0, then

p(X|0, H) = π^{−np}|H|^{−n} exp[− Σ_{j=1}^n X_j*H^{−1}X_j] = π^{−np}|H|^{−n} exp[−tr S̃H^{−1}]

where

S̃ = Σ_{j=1}^n X_jX_j* = S + nX̄X̄*.

Obviously, p(X|0, H) is maximized at H̃ = n^{−1}S̃. Thus the likelihood ratio test for testing μ = 0 versus μ ≠ 0 rejects for small values of

Λ = p(X|0, H̃)/p(X|μ̂, Ĥ) = |S|^n/|S̃|^n = |S|^n/|S + nX̄X̄*|^n.

As in the real case, X̄ and S are independent,

ℒ(S) = ℂW(H, p, n − 1)

and

ℒ(√n X̄) = ℂN(√n μ, H).
Setting Y = √n X̄,

Λ^{1/n} = |S|/|S + YY*| = 1/(1 + Y*S^{−1}Y)

so the likelihood ratio test rejects for large values of Y*S^{−1}Y ≡ T². Arguments paralleling those in the real case can be used to show that

ℒ(T²) = F(2p, 2(n − p), δ)

where δ = nμ*H^{−1}μ is the noncentrality parameter in the F distribution. Further, the monotone likelihood ratio property of the F distribution can be used to show that the likelihood ratio test is uniformly most powerful among tests that are invariant under the group of complex linear transformations that preserve the above testing problem.

In the preceding discussion, we have outlined one possible analysis of the one-sample problem for the complex normal distribution. A theory for the complex MANOVA problem similar to that given in Section 9.1 for the real MANOVA problem would require complex analogues of many results given in the first eight chapters of this book. Of course, it is possible to represent everything in terms of real random vectors. This representation consists of an n × 2p random matrix Y ∈ ℒ_{2p,n} where

ℒ(Y) = N(ZB, I_n ⊗ Ψ).

As usual, Z is n × r of rank r and B: r × 2p is a real matrix of unknown parameters. The distinguishing feature of the model is that the 2p × 2p covariance Ψ is assumed to have the form

Ψ = (Σ  −F)
    (F   Σ)

where Σ: p × p is positive definite and F: p × p is skew-symmetric. For reasons that should be obvious by now, Ψ's of the above form are said to have complex covariance structure. This model can now be analyzed using the results developed for the real normal linear model. However, as stated earlier, certain results are clumsy to prove and more difficult to understand when expressed in the real domain rather than the complex domain. Although not at all obvious, these models are not equivalent to a product of real MANOVA models of the type discussed in Section 9.1.
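The one-sample statistic discussed in this section can be illustrated numerically. The sketch below is ours (NumPy assumed, data simulated from ℂN(0, I) for convenience): the relation Λ^{1/n} = |S|/|S + YY*| = 1/(1 + Y*S^{−1}Y) is a rank-one determinant identity and holds exactly for any data set with S positive definite.

```python
import numpy as np

# Sketch: compute Xbar, S, Y = sqrt(n) Xbar and T^2 = Y* S^{-1} Y for
# simulated CN(0, I) data, and verify |S| / |S + Y Y*| = 1 / (1 + T^2).
rng = np.random.default_rng(6)
p, n = 3, 20
X = (rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p))) / np.sqrt(2)

Xbar = X.mean(axis=0)
D = X - Xbar
S = D.T @ D.conj()                       # sum_j (X_j - Xbar)(X_j - Xbar)*
Y = np.sqrt(n) * Xbar
T2 = (Y.conj() @ np.linalg.solve(S, Y)).real

lhs = (np.linalg.det(S) / np.linalg.det(S + np.outer(Y, Y.conj()))).real
rhs = 1.0 / (1.0 + T2)
```

Since n > p, S is positive definite with probability one and T² is a positive real number, so rejecting for small Λ is the same as rejecting for large T².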
9.6. ADDITIONAL EXAMPLES OF LINEAR MODELS
The examples of this section have been chosen to illustrate how conditioning can sometimes be helpful in finding maximum likelihood estimators and
also to further illustrate the use of invariance in analyzing linear models. The linear models considered now are not products of MANOVA models, and the regression subspaces are not invariant under the covariance transformations of the model. Thus finding the maximum likelihood estimator of the mean vector is not just a matter of computing the orthogonal projection onto the regression subspace. For the models below, we derive maximum likelihood estimators and likelihood ratio tests and then discuss the problem of finding a good invariant test.
The first model we consider consists of a variation on the one-sample problem. Suppose X1,..., Xn are independent with ℒ(X_i) = N(μ, Σ) where X_i ∈ R^p, i = 1,..., n. As usual, form the n × p matrix X whose rows are X_i', i = 1,..., n. Then

ℒ(X) = N(eμ', I_n ⊗ Σ)

where e ∈ R^n is the vector of ones. When μ and Σ are unknown, the linear model for X is a MANOVA model and the results in Section 9.1 apply directly. To transform this model to canonical form, let Γ be an n × n orthogonal matrix with first row e'/√n. Setting Y = ΓX and β = √n μ',

ℒ(Y) = N(ε1β, I_n ⊗ Σ)

where ε1 is the first unit vector in R^n and β ∈ ℒ_{p,1}. Partition Y as

Y = (Y1)
    (Y2)

where Y1 ∈ ℒ_{p,1}, Y2 ∈ ℒ_{p,m}, and m = n − 1. Then

ℒ(Y1) = N(β, Σ)

and

ℒ(Y2) = N(0, I_m ⊗ Σ).

For testing H0: β = 0, the results of Section 9.1 show that the test that rejects for large values of Y1(Y2'Y2)^{−1}Y1' (assuming m ≥ p) is equivalent to the likelihood ratio test, and this test is most powerful within the class of invariant tests.

We now turn to a testing problem to which the MANOVA results do not
apply. With the above discussion in mind, consider U ∈ ℒ_{p,1} and Z ∈ ℒ_{p,m}, where U and Z are independent with

ℒ(U) = N(β, Σ)

and

ℒ(Z) = N(0, I_m ⊗ Σ).

Here, β ∈ ℒ_{p,1} and Σ > 0 is a completely unknown p × p covariance matrix. Partition β into β1 and β2 where

β_i ∈ ℒ_{p_i,1}, i = 1, 2,   p1 + p2 = p.

Consider the problem of testing the null hypothesis H0: β1 = 0 versus H1: β1 ≠ 0 where β2 and Σ are unknown. Under H0, the regression subspace of the random matrix

(U)
(Z) ∈ ℒ_{p,m+1}

is

M0 = {x ∈ ℒ_{p,m+1} : the first row of x is (0, β2) for some β2 ∈ ℒ_{p2,1} and the remaining m rows of x are zero},

and the set of covariances is

γ = {I_{m+1} ⊗ Σ : Σ > 0}.

It is easy to verify that M0 is not invariant under all the elements of γ, so the maximum likelihood estimator of β2 under H0 cannot be found by least squares (ignoring Σ). To calculate the likelihood ratio test for H0 versus H1, it is convenient to partition U and Z as

U = (U1, U2),   U_i ∈ ℒ_{p_i,1}, i = 1, 2,
Z = (Z1, Z2),   Z_i ∈ ℒ_{p_i,m}, i = 1, 2,
and then condition on U1 and Z1. Since U and Z are independent, the joint distribution of U and Z is specified by the two conditional distributions, ℒ(U2|U1) and ℒ(Z2|Z1), together with the two marginal distributions, ℒ(U1) and ℒ(Z1). Our results for the normal distribution show that these distributions are

ℒ(U2|U1) = N(β2 + (U1 − β1)Σ11^{−1}Σ12, Σ22·1)

ℒ(U1) = N(β1, Σ11)

ℒ(Z2|Z1) = N(Z1Σ11^{−1}Σ12, I_m ⊗ Σ22·1)

ℒ(Z1) = N(0, I_m ⊗ Σ11)
where Σ is partitioned as

Σ = (Σ11  Σ12)
    (Σ21  Σ22)

with Σij being p_i × p_j, i, j = 1, 2. As usual, Σ22·1 = Σ22 − Σ21Σ11^{−1}Σ12. By Proposition 5.8, the reparameterization defined by Ψ11 = Σ11, Ψ12 = Σ11^{−1}Σ12, and Ψ22 = Σ22·1 is one-to-one and onto. To calculate the likelihood ratio test for H0 versus H1, we need to find the maximum likelihood estimators under H0 and H1.
Proposition 9.17. The likelihood ratio test of H0: β1 = 0 versus H1: β1 ≠ 0 rejects H0 if the statistic

Λ = U1S11^{−1}U1'

is too large. Here, S = Z'Z and

S = (S11  S12)
    (S21  S22)

where Sij is p_i × p_j.
Proof. Let f1(U1|β1, Ψ11) be the density of ℒ(U1), let f2(U2|U1, β1, β2, Ψ12, Ψ22) be the conditional density of ℒ(U2|U1), let f3(Z1|Ψ11) be the density of ℒ(Z1), and let f4(Z2|Z1, Ψ12, Ψ22) be the density of ℒ(Z2|Z1). Under H0, β1 = 0 and the unique value of β2 that maximizes f2(U2|U1, 0, β2, Ψ12, Ψ22) is

β̂2 = U2 − U1Ψ12

for Ψ12 fixed. It is clear that

f2(U2|U1, 0, β̂2, Ψ12, Ψ22) ∝ |Ψ22|^{−1/2}

where the symbol ∝ means "is proportional to." We now maximize with respect to Ψ12. With β2 = β̂2, Ψ12 occurs only in the density of Z2 given Z1. Since ℒ(Z2|Z1) = N(Z1Ψ12, I_m ⊗ Ψ22), it follows from our treatment of the MANOVA problem that

Ψ̂12 = (Z1'Z1)^{−1}Z1'Z2 = S11^{−1}S12
and

f4(Z2|Z1, Ψ̂12, Ψ22) ∝ |Ψ22|^{−m/2} exp[−½ tr S22·1Ψ22^{−1}].

Since β1 = 0, it is now clear that

Ψ̂11 = (1/(m+1))[Z1'Z1 + U1'U1] = (1/(m+1))[S11 + U1'U1]

and

Ψ̂22 = (1/(m+1))S22·1.

Substituting these values into the product of the four densities shows that the maximum under H0 is proportional to

Λ0 = |S22·1|^{−(m+1)/2}|S11 + U1'U1|^{−(m+1)/2}.
Under the alternative H1, we again maximize the likelihood function by first noting that

β̂2 = U2 − (U1 − β1)Ψ12

maximizes the density of U2 given U1. Also,

f2(U2|U1, β1, β̂2, Ψ12, Ψ22) ∝ |Ψ22|^{−1/2}.

With this choice of β2, β1 occurs only in the density of U1, so β̂1 = U1 maximizes the density of U1 and

f1(U1|β̂1, Ψ11) ∝ |Ψ11|^{−1/2}.

It now follows easily that the maximum likelihood estimators of Ψ12, Ψ11, and Ψ22 are

Ψ̂12 = S11^{−1}S12,

Ψ̂11 = (1/(m+1))Z1'Z1 = (1/(m+1))S11,

Ψ̂22 = (1/(m+1))S22·1.
Substituting these into the product of the four densities shows that the maximum under H1 is proportional to

Λ1 = |S22·1|^{−(m+1)/2}|S11|^{−(m+1)/2}.

Hence the likelihood ratio test will reject H0 for small values of

Λ0/Λ1 = |S11|^{(m+1)/2}/|S11 + U1'U1|^{(m+1)/2} = (1 + U1S11^{−1}U1')^{−(m+1)/2}.

Thus the likelihood ratio test rejects for large values of

Λ = U1S11^{−1}U1'

and the proof is complete.  □
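The last step of the proof rests on the rank-one determinant identity |S11 + U1'U1| = |S11|(1 + U1S11^{−1}U1'). The sketch below is ours (NumPy assumed, with arbitrary simulated data): it checks the identity, and hence that the likelihood ratio is a decreasing function of Λ.

```python
import numpy as np

# Sketch of Proposition 9.17's last step: with U a 1 x p row, Z m x p, and
# S = Z'Z, verify |S11 + U1'U1| = |S11| (1 + Lambda), so the likelihood
# ratio (1 + Lambda)^{-(m+1)/2} is decreasing in Lambda = U1 S11^{-1} U1'.
rng = np.random.default_rng(7)
p1, p2, m = 2, 3, 12
p = p1 + p2
U = rng.normal(size=(1, p))
Z = rng.normal(size=(m, p))

S11 = (Z.T @ Z)[:p1, :p1]
U1 = U[:, :p1]
lam = (U1 @ np.linalg.solve(S11, U1.T)).item()   # the test statistic Lambda

lhs = np.linalg.det(S11 + U1.T @ U1)
rhs = np.linalg.det(S11) * (1.0 + lam)
```

Because the identity is exact, rejecting for small Λ0/Λ1 and rejecting for large Λ define the same test.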
We now want to show that the test derived above is a uniformly most powerful invariant test under a suitable group of affine transformations. Recall that U and Z are independent and

ℒ(U) = N(β, Σ),   ℒ(Z) = N(0, I_m ⊗ Σ).

The problem is to test H0: β1 = 0 where β = (β1, β2) with β_i ∈ ℒ_{p_i,1}, i = 1, 2. Consider the group G with elements g = (Γ, A, (0, a)) where

Γ ∈ 𝒪_m,   (0, a) ∈ ℒ_{p,1},   a ∈ ℒ_{p2,1},

and

A = (A11    0 )
    (A21  A22)

where Aij is p_i × p_j and Aii is nonsingular for i = 1, 2. The action of g = (Γ, A, (0, a)) is

g (U) = (UA' + (0, a))
  (Z)   (    ΓZA'    ).

The group operation, defined so G acts on the left of the sample space, is

(Γ1, A1, (0, a1))(Γ2, A2, (0, a2)) = (Γ1Γ2, A1A2, (0, a2)A1' + (0, a1)).

It is routine to verify that the testing problem is invariant under G. Further,
it is clear that the induced action of G on the parameter space is

(Γ, A, (0, a))(β, Σ) = (βA' + (0, a), AΣA').

To characterize the invariant tests for the testing problem, a maximal invariant under the action of G on the sample space is needed.

Proposition 9.18. In the notation of Proposition 9.17, a maximal invariant is

Λ = U1S11^{−1}U1'.
Proof. As usual, the proof consists of showing that Λ = U1S11^{−1}U1' is an orbit index. Since m ≥ p, we deal with those Z's that have rank p, a set of probability one. The first claim is that for a given U ∈ ℒ_{p,1} and Z ∈ ℒ_{p,m} of rank p, there exists a g ∈ G such that

g (U) = ((√Λ ε1, 0))
  (Z)   (    Z0    )

where ε1 = (1, 0,..., 0) ∈ ℒ_{p1,1} and Z0 is a fixed element of 𝔉_{p,m}, the set of m × p linear isometries. Write Z = ΨV where Ψ ∈ 𝔉_{p,m} and V is a p × p upper triangular matrix, so S = Z'Z = V'V. Then consider

A = (ξ1   0)
    ( 0  ξ2) (V')^{−1}

where ξ_i ∈ 𝒪_{p_i}, i = 1, 2, and note that A is of the form

A = (A11    0 )
    (A21  A22)

since (V')^{−1} is lower triangular. The values of ξ_i, i = 1, 2, are specified in a moment. With this choice of A,

ZA' = ΨVV^{−1} (ξ1'   0 )     (ξ1'   0 )
               ( 0   ξ2') = Ψ ( 0   ξ2')
which is in 𝔉_{p,m} for any choice of ξ_i ∈ 𝒪_{p_i}, i = 1, 2. Hence there is a Γ ∈ 𝒪_m such that

ΓZA' = Z0.

Since V is upper triangular, write

V^{−1} = (V^{11}  V^{12})
         (  0    V^{22})

with V^{ij} being p_i × p_j. Then

UA' = (U1V^{11}ξ1', (U1V^{12} + U2V^{22})ξ2').

As S = V'V and V ∈ G_T^+, it follows that S11^{−1} = V^{11}(V^{11})', so the vector U1V^{11} has squared length Λ = U1V^{11}(V^{11})'U1' = U1S11^{−1}U1'. Thus there exists ξ1 ∈ 𝒪_{p1} such that

U1V^{11}ξ1' = √Λ ε1

where ε1 = (1, 0,..., 0) ∈ ℒ_{p1,1}. Now choose a ∈ ℒ_{p2,1} to be

a = −(U1V^{12} + U2V^{22})ξ2'

so

UA' + (0, a) = (√Λ ε1, 0).

The above choices for A, ξ_i, Γ, and a yield g = (Γ, A, (0, a)), which satisfies

g (U) = ((√Λ ε1, 0))
  (Z)   (    Z0    )

and this establishes the claim. To show that

Λ = U1S11^{−1}U1'

is maximal invariant, first notice that Λ is invariant. Further, if

(U^(1))     (U^(2))
(Z^(1)) and (Z^(2))
both yield the same value of Λ, then there exist g_i ∈ G such that

g_i (U^(i)) = ((√Λ ε1, 0))
    (Z^(i))   (    Z0    ),   i = 1, 2.

Therefore,

g2^{−1}g1 (U^(1)) = (U^(2))
          (Z^(1))   (Z^(2))

and Λ is maximal invariant.  □
To show that a uniformly most powerful G-invariant test exists, the distribution of Λ = U₁S₁₁⁻¹U₁′ is needed. However,

𝓛(U₁) = N(β₁, Σ₁₁),
𝓛(S₁₁) = W(Σ₁₁, p₁, m),

and U₁ and S₁₁ are independent. From Proposition 8.14, we see that

𝓛(Λ) = F(p₁, m - p₁ + 1; δ)

where δ = β₁Σ₁₁⁻¹β₁′, and the null hypothesis is H₀: δ = 0. Since the noncentral F distribution has a monotone likelihood ratio, the test that rejects for large values of Λ is uniformly most powerful within the class of tests based on Λ. Since all G-invariant tests are functions of Λ, we conclude that the likelihood ratio test is uniformly most powerful invariant. □
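The null distribution of Λ is easy to check numerically. The sketch below is an added illustration, not part of the text: it simulates Λ = U₁S₁₁⁻¹U₁′ under H₀ with Σ₁₁ = I (invariance makes this choice harmless) and compares the scaled statistic with a standard F law; the factor (m - p₁ + 1)/p₁ translates the F(p₁, m - p₁ + 1) convention used here into the usual F parameterization.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p1, m, reps = 3, 20, 20000

# Simulate Lambda = U1 S11^{-1} U1' under H0 (beta1 = 0, Sigma11 = I).
lam = np.empty(reps)
for i in range(reps):
    u1 = rng.standard_normal(p1)           # U1 ~ N(0, I)
    z = rng.standard_normal((m, p1))       # S11 = Z'Z ~ W(I, p1, m)
    lam[i] = u1 @ np.linalg.solve(z.T @ z, u1)

# ((m - p1 + 1)/p1) * Lambda should follow a standard F(p1, m - p1 + 1) law.
scaled = (m - p1 + 1) / p1 * lam
ks = stats.kstest(scaled, stats.f(p1, m - p1 + 1).cdf)
print(round(ks.statistic, 3))
```

With 20,000 replications the Kolmogorov–Smirnov statistic should be small and its p-value comfortably nonsignificant, consistent with the distributional claim above.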
The final problem to be considered in this chapter is a variation of the
problem just solved. Again, the testing problem of interest is H₀: β₁ = 0 versus H₁: β₁ ≠ 0, but it is assumed that the value of β₂ is known to be zero under both H₀ and H₁. Thus our model for U and Z is that U and Z are independent with

𝓛(U) = N((β₁, 0), Σ),
𝓛(Z) = N(0, I_m ⊗ Σ),

where U ∈ ℝ^p, β₁ ∈ ℝ^{p₁}, Z ∈ L_{p,m}, and m ≥ p. In what follows, the likelihood ratio test of H₀ versus H₁ is derived and an invariance argument shows that there is no uniformly most powerful invariant test under a natural group that leaves the problem invariant. As usual, we partition U
into U₁ and U₂, Uᵢ ∈ ℝ^{pᵢ}, and Z is partitioned into Z₁ ∈ L_{p₁,m} and Z₂ ∈ L_{p₂,m}:

U = (U₁, U₂), Z = (Z₁, Z₂).

Also,

S = Z′Z = ( Z₁′Z₁ Z₁′Z₂; Z₂′Z₁ Z₂′Z₂ ) = ( S₁₁ S₁₂; S₂₁ S₂₂ )

and S₁₁·₂ = S₁₁ - S₁₂S₂₂⁻¹S₂₁.
Proposition 9.19. The likelihood ratio test of H₀ versus H₁ rejects for large values of the statistic

Λ = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′ / (1 + U₂S₂₂⁻¹U₂′).
Proof. Under H₀,

𝓛(U) = N(0, Σ),
𝓛(Z) = N(0, I_m ⊗ Σ),

so the maximum likelihood estimator of Σ is

Σ̂ = (1/(m + 1))(Z′Z + U′U) = (1/(m + 1))(S + U′U).

The value of the maximized likelihood function is proportional to

Λ̂₀ = |Σ̂|^{-(m+1)/2}.
Under H₁, the situation is a bit more complicated and it is helpful to consider conditional distributions. Under H₁,

𝓛(U₁|U₂) = N(β₁ + U₂Σ₂₂⁻¹Σ₂₁, Σ₁₁·₂),
𝓛(U₂) = N(0, Σ₂₂),
𝓛(Z₁|Z₂) = N(Z₂Σ₂₂⁻¹Σ₂₁, I_m ⊗ Σ₁₁·₂),

and

𝓛(Z₂) = N(0, I_m ⊗ Σ₂₂).
The reparameterization defined by Ψ₁₁ = Σ₁₁·₂, Ψ₂₁ = Σ₂₂⁻¹Σ₂₁, and Ψ₂₂ = Σ₂₂ is one-to-one and onto. Let f₁(U₁|U₂, β₁, Ψ₂₁, Ψ₁₁), f₂(U₂|Ψ₂₂), f₃(Z₁|Z₂, Ψ₂₁, Ψ₁₁), and f₄(Z₂|Ψ₂₂) be the density functions, with respect to Lebesgue measure dU₁ dU₂ dZ₁ dZ₂, of the four distributions above. It is clear that

β̂₁ = U₁ - U₂Ψ₂₁

maximizes f₁(U₁|U₂, β₁, Ψ₂₁, Ψ₁₁) and that f₁(U₁|U₂, β̂₁, Ψ₂₁, Ψ₁₁) ∝ |Ψ₁₁|^{-1/2}. With β̂₁ substituted into f₁, the parameter Ψ₂₁ occurs only in the density f₃(Z₁|Z₂, Ψ₂₁, Ψ₁₁). Since

𝓛(Z₁|Z₂) = N(Z₂Ψ₂₁, I_m ⊗ Ψ₁₁),

our results for the MANOVA model show that

Ψ̂₂₁ = (Z₂′Z₂)⁻¹Z₂′Z₁ = S₂₂⁻¹S₂₁

maximizes f₃(Z₁|Z₂, Ψ₂₁, Ψ₁₁) for each value of Ψ₁₁. When Ψ̂₂₁ is substituted into f₃, an inspection of the resulting four density functions shows that the maximum likelihood estimators of Ψ₁₁ and Ψ₂₂ are

Ψ̂₁₁ = (1/(m + 1)) S₁₁·₂

and

Ψ̂₂₂ = (1/(m + 1))(Z₂′Z₂ + U₂′U₂) = (1/(m + 1))(S₂₂ + U₂′U₂).

Under H₁, this yields a maximized likelihood function proportional to

Λ̂₁ = |Ψ̂₁₁|^{-(m+1)/2} |Ψ̂₂₂|^{-(m+1)/2}.
Therefore the likelihood ratio test rejects H₀ for small values of

Λ₃ = Λ̂₀/Λ̂₁ = [ |S₁₁·₂||S₂₂ + U₂′U₂| / |S + U′U| ]^{(m+1)/2}.

However,

|S₂₂ + U₂′U₂| = |S₂₂|(1 + U₂S₂₂⁻¹U₂′)

and

|S| = |S₂₂||S₁₁·₂|.

Thus

[Λ₃]^{2/(m+1)} = |S|(1 + U₂S₂₂⁻¹U₂′)/|S + U′U| = (1 + U₂S₂₂⁻¹U₂′)/(1 + US⁻¹U′).

Now, the identity

US⁻¹U′ = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′ + U₂S₂₂⁻¹U₂′

follows from the problems in Chapter 5. Hence rejecting for small values of

[Λ₃]^{2/(m+1)} = 1/(1 + Λ),

where Λ is given in the statement of this proposition, is equivalent to rejecting for large values of Λ. □
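The determinant identities and the quadratic-form decomposition used in this proof are easy to confirm numerically. The check below is an added illustration with arbitrary simulated data; it verifies each algebraic step leading to [Λ₃]^{2/(m+1)} = 1/(1 + Λ).

```python
import numpy as np

rng = np.random.default_rng(1)
p1, p2, m = 2, 3, 12
p = p1 + p2

U = rng.standard_normal(p)            # U is 1 x p
Z = rng.standard_normal((m, p))       # Z is m x p, so S = Z'Z is p x p
S = Z.T @ Z
S11, S12 = S[:p1, :p1], S[:p1, p1:]
S21, S22 = S[p1:, :p1], S[p1:, p1:]
U1, U2 = U[:p1], U[p1:]
S112 = S11 - S12 @ np.linalg.solve(S22, S21)      # S_{11.2}

c = U2 @ np.linalg.solve(S22, U2)                 # U2 S22^{-1} U2'
r = U1 - U2 @ np.linalg.solve(S22, S21)
Q = r @ np.linalg.solve(S112, r)                  # regression-residual term
total = U @ np.linalg.solve(S, U)                 # U S^{-1} U'

det1 = np.isclose(np.linalg.det(S22 + np.outer(U2, U2)),
                  np.linalg.det(S22) * (1 + c))
det2 = np.isclose(np.linalg.det(S),
                  np.linalg.det(S22) * np.linalg.det(S112))
decomp = np.isclose(total, Q + c)                 # the Chapter 5 identity
ratio = np.isclose((1 + c) / (1 + total), 1 / (1 + Q / (1 + c)))
print(det1, det2, decomp, ratio)
```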
The above testing problem is now analyzed via invariance. The group G consists of elements g = (Γ, A) where Γ ∈ O_m and

A = ( A₁₁ A₁₂; 0 A₂₂ ), Aᵢᵢ ∈ Gl_{pᵢ}, i = 1, 2.

The group action is

(Γ, A)(U; Z) = (UA′; ΓZA′)

and group composition is

(Γ₁, A₁)(Γ₂, A₂) = (Γ₁Γ₂, A₁A₂).

The action of the group on the parameter space is

(Γ, A)(β₁, Σ) = (β₁A₁₁′, AΣA′).

It is clear that the testing problem is invariant under the group G.
Proposition 9.20. Under the action of G on the sample space, a maximal invariant is the pair (W₁, W₂) where

W₁ = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′ / (1 + U₂S₂₂⁻¹U₂′)

and

W₂ = U₂S₂₂⁻¹U₂′.

A maximal invariant in the parameter space is

δ = β₁Σ₁₁·₂⁻¹β₁′.
Proof. As usual, the method of proof is a reduction argument that provides a convenient index for the orbits in the sample space. Since m ≥ p, a set of measure zero can be deleted from the sample space so that Z has rank p on the complement of this set. Let

Z₀ = ( I_p, 0 )′ ∈ F_{p,m}

and set u₁ = (1, 0,…, 0) ∈ ℝ^p and u₂ = (0,…, 0, 1, 0,…, 0) ∈ ℝ^p, where the one occurs in the (p₁ + 1)st coordinate of u₂. Now, given U and Z, we claim that there exists a g = (Γ, A) ∈ G such that

g(U; Z) = (X₁u₁ + X₂u₂; Z₀)

where

X₁² = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′

and

X₂² = U₂S₂₂⁻¹U₂′.
To establish this claim, write Z = ΨT where Ψ ∈ F_{p,m} and T ∈ G_T is a p × p lower triangular matrix. A modification of the proof of Proposition 5.2 establishes this representation for Z. Consider

A = ξ(T′)⁻¹ = ( ξ₁ 0; 0 ξ₂ )(T′)⁻¹

where ξᵢ ∈ O_{pᵢ}, i = 1, 2, so ξ ∈ O_p and

ZA′ = ΨTA′ = Ψξ′ ∈ F_{p,m}.

Note that A has the required block triangular form since (T′)⁻¹ is upper triangular. Thus for any such ξ and Γ ∈ O_m, (Γ, A) ∈ G. Also, Γ can be chosen so that

ΓZA′ = Z₀.
Now,

UA′ = (U₁, U₂)T⁻¹ξ′ = (U₁, U₂)( T¹¹ 0; T²¹ T²² )( ξ₁ 0; 0 ξ₂ )′ = ( (U₁T¹¹ + U₂T²¹)ξ₁′, U₂T²²ξ₂′ )

where Tⁱʲ is pᵢ × pⱼ and

T⁻¹ = ( T¹¹ 0; T²¹ T²² ).

Since S = Z′Z = T′T, a bit of algebra shows that

(U₁T¹¹ + U₂T²¹)(U₁T¹¹ + U₂T²¹)′ = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′ = X₁²

and

(U₂T²²)(U₂T²²)′ = U₂S₂₂⁻¹U₂′ = X₂².
Let ε₁ = (1, 0,…, 0) ∈ ℝ^{p₁} and ε₂ = (1, 0,…, 0) ∈ ℝ^{p₂}. Since the vectors X₁ε₁ and U₁T¹¹ + U₂T²¹ have the same length, there exists ξ₁ ∈ O_{p₁} such that

(U₁T¹¹ + U₂T²¹)ξ₁′ = X₁ε₁.

For similar reasons, there exists a ξ₂ ∈ O_{p₂} such that

U₂T²²ξ₂′ = X₂ε₂.

With these choices for ξ₁ and ξ₂,

UA′ = ( (U₁T¹¹ + U₂T²¹)ξ₁′, U₂T²²ξ₂′ ) = X₁u₁ + X₂u₂.
Thus there is a g = (Γ, A) ∈ G such that

g(U; Z) = (X₁u₁ + X₂u₂; Z₀).

This establishes the claim. It is now routine to show that (X₁, X₂) = (X₁(U, Z), X₂(U, Z)) is an invariant function. To show that (X₁, X₂) is maximal invariant, suppose (U, Z) and (Ū, Z̄) yield the same (X₁, X₂) values. Then there exist g and ḡ in G such that

g(U; Z) = (X₁u₁ + X₂u₂; Z₀) = ḡ(Ū; Z̄)

so

ḡ⁻¹g(U; Z) = (Ū; Z̄).

This shows that (X₁, X₂) is maximal invariant. Since the pair (W₁, W₂) is a one-to-one function of (X₁, X₂), namely W₁ = X₁²/(1 + X₂²) and W₂ = X₂², it follows that (W₁, W₂) is maximal invariant. The proof that δ is a maximal invariant in the parameter space is similar and is left to the reader. □
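The invariance of (W₁, W₂) can also be seen computationally. The sketch below is illustrative code, not from the text: it applies a randomly chosen group element g = (Γ, A), with Γ orthogonal and A block upper triangular, and confirms that (W₁, W₂) is unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
p1, p2, m = 2, 3, 10
p = p1 + p2

def max_invariant(U, Z):
    # (W1, W2) of Proposition 9.20, computed from U (1 x p) and Z (m x p).
    S = Z.T @ Z
    S12, S21, S22 = S[:p1, p1:], S[p1:, :p1], S[p1:, p1:]
    S112 = S[:p1, :p1] - S12 @ np.linalg.solve(S22, S21)
    U1, U2 = U[:p1], U[p1:]
    W2 = U2 @ np.linalg.solve(S22, U2)
    r = U1 - U2 @ np.linalg.solve(S22, S21)
    W1 = (r @ np.linalg.solve(S112, r)) / (1 + W2)
    return W1, W2

U = rng.standard_normal(p)
Z = rng.standard_normal((m, p))

# A random group element: Gamma in O_m and A block upper triangular.
Gamma, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = rng.standard_normal((p, p))
A[p1:, :p1] = 0.0          # zero out the lower-left block

w = max_invariant(U, Z)
w_g = max_invariant(U @ A.T, Gamma @ Z @ A.T)
print(np.allclose(w, w_g))
```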
In order to suggest an invariant test for H₀: β₁ = 0 based on (W₁, W₂), the distribution of (W₁, W₂) is needed. Since

𝓛((U₁, U₂)) = N((β₁, 0), Σ)

and

𝓛(S) = W(Σ, p, m)

with S and U independent,

𝓛(W₂) = 𝓛(U₂S₂₂⁻¹U₂′) = F(p₂, m - p₂ + 1).

Therefore, W₂ is an ancillary statistic as its distribution does not depend on any parameters under H₀ or H₁. We now compute the conditional distribution of W₁ given W₂. Proposition 8.7 shows that

𝓛(S₁₁·₂) = W(Σ₁₁·₂, p₁, m - p₂),
𝓛(S₂₁|S₂₂) = N(S₂₂Σ₂₂⁻¹Σ₂₁, S₂₂ ⊗ Σ₁₁·₂),
and

𝓛(S₂₂) = W(Σ₂₂, p₂, m),

where S₁₁·₂ is independent of (S₂₁, S₂₂). Thus

𝓛(U₂S₂₂⁻¹S₂₁|S₂₂, U₂) = N(U₂Σ₂₂⁻¹Σ₂₁, (U₂S₂₂⁻¹U₂′)Σ₁₁·₂)

and, conditional on (S₂₂, U₂),

𝓛(U₁|S₂₂, U₂) = N(β₁ + U₂Σ₂₂⁻¹Σ₂₁, Σ₁₁·₂).

Further, U₁ and U₂S₂₂⁻¹S₂₁ are conditionally independent given (S₂₂, U₂). Therefore,

𝓛(U₁ - U₂S₂₂⁻¹S₂₁|S₂₂, U₂) = N(β₁, (1 + U₂S₂₂⁻¹U₂′)Σ₁₁·₂)

so

𝓛( (U₁ - U₂S₂₂⁻¹S₂₁)/(1 + U₂S₂₂⁻¹U₂′)^{1/2} | S₂₂, U₂ ) = N( β₁/(1 + U₂S₂₂⁻¹U₂′)^{1/2}, Σ₁₁·₂ ).
Since S₁₁·₂ is independent of all other variables under consideration, and since

W₁ = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′ / (1 + W₂),

it follows from Proposition 8.14 that

𝓛(W₁|S₂₂, U₂) = F(p₁, m - p + 1; δ/(1 + W₂))

where δ = β₁Σ₁₁·₂⁻¹β₁′. However, the conditional distribution of W₁ given (S₂₂, U₂) depends on (S₂₂, U₂) only through the function W₂ = U₂S₂₂⁻¹U₂′. Thus

𝓛(W₁|W₂) = F(p₁, m - p + 1; δ/(1 + W₂))

and

𝓛(W₂) = F(p₂, m - p₂ + 1).
Further, the null hypothesis is H₀: δ = 0 versus the alternative H₁: δ > 0. Under H₀, it is clear that W₁ and W₂ are independent. The likelihood ratio test rejects H₀ for large values of W₁ and ignores W₂. Of course, the level of this test is computed from a standard F-table, but the power of the test involves the marginal distribution of W₁ when δ > 0. This marginal distribution, obtained by averaging the conditional distribution 𝓛(W₁|W₂) with respect to the distribution of W₂, is rather complicated.
To show that a uniformly most powerful test of H₀ versus H₁ does not exist, consider a particular alternative δ = δ₀ > 0. Let f₁(w₁|w₂, δ) denote the conditional density function of W₁ given W₂ and let f₂(w₂) denote the density of W₂. For testing H₀: δ = 0 versus H₁: δ = δ₀, the Neyman–Pearson Lemma asserts that the most powerful test of level α rejects if

f₁(w₁|w₂, δ₀)/f₁(w₁|w₂, 0) > c(α)

where c(α) is chosen to make the test have level α. However, the rejection region for this test depends on the particular alternative δ₀, so a uniformly most powerful test cannot exist. Since W₂ is ancillary, we can argue that the test of H₀ should be carried out conditional on W₂; that is, the level and the power of tests should be compared only for the conditional distribution of W₁ given W₂. In this case, for w₂ fixed, the ratio

f₁(w₁|w₂, δ₀)/f₁(w₁|w₂, 0)

is an increasing function of w₁, so rejecting for large values of the ratio (w₂ fixed) is equivalent to rejecting for W₁ > k. If k is chosen to make the test have level α, this argument leads to the level α likelihood ratio test.
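The distributional claims above, the F law for W₁, the ancillarity of W₂, and their independence under H₀, can be checked by simulation. The sketch below is an added illustration; the factors (m - p + 1)/p₁ and (m - p₂ + 1)/p₂ convert the F convention of the text into the standard F distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p1, p2, m, reps = 2, 3, 12, 5000
p = p1 + p2

w1s, w2s = np.empty(reps), np.empty(reps)
for i in range(reps):
    U = rng.standard_normal(p)              # H0: beta1 = 0 (and beta2 = 0)
    Z = rng.standard_normal((m, p))
    S = Z.T @ Z
    S12, S21, S22 = S[:p1, p1:], S[p1:, :p1], S[p1:, p1:]
    S112 = S[:p1, :p1] - S12 @ np.linalg.solve(S22, S21)
    U1, U2 = U[:p1], U[p1:]
    w2s[i] = U2 @ np.linalg.solve(S22, U2)
    r = U1 - U2 @ np.linalg.solve(S22, S21)
    w1s[i] = (r @ np.linalg.solve(S112, r)) / (1 + w2s[i])

ks1 = stats.kstest((m - p + 1) / p1 * w1s, stats.f(p1, m - p + 1).cdf)
ks2 = stats.kstest((m - p2 + 1) / p2 * w2s, stats.f(p2, m - p2 + 1).cdf)
corr = np.corrcoef(w1s, w2s)[0, 1]
print(round(corr, 3))
```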
PROBLEMS
1. Consider independent random vectors Xᵢⱼ with 𝓛(Xᵢⱼ) = N(μᵢ, Σ) for j = 1,…, n and i = 1,…, k. For scalars a₁,…, a_k, consider testing H₀: a₁μ₁ + ⋯ + a_kμ_k = 0 versus H₁: a₁μ₁ + ⋯ + a_kμ_k ≠ 0. With τ² = n⁻¹(a₁² + ⋯ + a_k²), let bᵢ = τ⁻¹aᵢ, set X̄ᵢ = n⁻¹ΣⱼXᵢⱼ, and let Sᵢ = Σⱼ(Xᵢⱼ - X̄ᵢ)(Xᵢⱼ - X̄ᵢ)′. Write this problem in the canonical form of Section 9.1 and prove that the test that rejects for large values of Λ = (ΣᵢbᵢX̄ᵢ)′S⁻¹(ΣᵢbᵢX̄ᵢ) is UMP invariant. Here S = S₁ + ⋯ + S_k. What is the distribution of Λ under H₀?
2. Given Y ∈ L_{p,n} and X ∈ L_{k,n} of rank k, the least-squares estimate B̂ = (X′X)⁻¹X′Y of B can be characterized as the B that minimizes tr(Y - XB)′(Y - XB) over all k × p matrices.
(i) Show that for any k × p matrix B,

(Y - XB)′(Y - XB) = (Y - XB̂)′(Y - XB̂) + (X(B̂ - B))′(X(B̂ - B)).

(ii) A real-valued function φ defined for p × p nonnegative definite matrices is nondecreasing if φ(S₁) ≤ φ(S₁ + S₂) for any S₁ and S₂ (Sᵢ ≥ 0, i = 1, 2). Using (i), show that, if φ is nondecreasing, then φ((Y - XB)′(Y - XB)) is minimized by B = B̂.
(iii) For A that is p × p and nonnegative definite, show that φ(S) = tr AS is nondecreasing. Also, show that φ(S) = det(A + S) is nondecreasing.
(iv) Suppose φ(S) = φ(ΓSΓ′) for S ≥ 0 and Γ ∈ O_p so φ(S) can be written as φ(S) = ψ(λ(S)) where λ(S) is the vector of ordered characteristic roots of S. Show that, if ψ is nondecreasing in each argument, then φ is nondecreasing.
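A quick numerical companion to this problem (illustrative code with arbitrary simulated Y and X) verifies the decomposition in (i) and the resulting optimality of B̂ for the trace and determinant criteria of (iii):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, p = 15, 4, 3
X = rng.standard_normal((n, k))
Y = rng.standard_normal((n, p))
Bhat = np.linalg.solve(X.T @ X, X.T @ Y)       # least-squares estimate

def R(B):
    # Residual cross-product matrix (Y - XB)'(Y - XB).
    E = Y - X @ B
    return E.T @ E

B = rng.standard_normal((k, p))
# Part (i): exact matrix decomposition.
ok_i = np.allclose(R(B), R(Bhat) + (X @ (Bhat - B)).T @ (X @ (Bhat - B)))

# Parts (ii)-(iii): Bhat minimizes nondecreasing criteria such as tr(AS) and det(A+S).
A = np.eye(p)
ok_iii = all(
    np.trace(A @ R(Bhat)) <= np.trace(A @ R(rng.standard_normal((k, p)))) + 1e-9
    and np.linalg.det(A + R(Bhat)) <= np.linalg.det(A + R(rng.standard_normal((k, p)))) + 1e-9
    for _ in range(200)
)
print(ok_i, ok_iii)
```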
3. (The MANOVA model under non-normality.) Let E be a random n × p matrix that satisfies 𝓛(ΓEΔ′) = 𝓛(E) for all Γ ∈ O_n and Δ ∈ O_p. Assume that Cov(E) = I_n ⊗ I_p and consider a linear model for Y ∈ L_{p,n} generated by Y = ZB + EA′ where Z is a fixed n × k matrix of rank k, B is a k × p matrix of unknown parameters, and A is an element of Gl_p.
(i) Show that the distribution of Y depends on (B, A) only through (B, AA′).
(ii) Let M = {μ | μ = ZB, B ∈ L_{p,k}} and γ = {Σ | Σ is p × p, Σ > 0}. Show that {M, γ} serves as a parameter space for the linear model (the distribution of E is assumed fixed).
(iii) Consider the problem of testing H₀: RB = 0 where R is r × k of rank r. Show that the reduction to canonical form given in Section 9.1 can be used here to give a model of the form

(9.6)  ( Ỹ₁; Ỹ₂; Ỹ₃ ) = ( B̃₁; B̃₂; 0 ) + ẼA′

where Ỹ₁ is r × p, Ỹ₂ is (k - r) × p, Ỹ₃ is (n - k) × p, B̃₁ is r × p, B̃₂ is (k - r) × p, Ẽ is n × p, and A is as in the original model. Further, Ẽ and E have the same distribution and the null hypothesis is H₀: B̃₁ = 0.
(iv) Now, assume the form of the model in (9.6) and drop the tildes. Using the invariance argument given in Section 9.1, the testing problem is invariant and any invariant test is a function of the t largest eigenvalues of Y₁(Y₃′Y₃)⁻¹Y₁′, where t = min(r, p). Assume n - k ≥ p and partition E as Y is partitioned. Under H₀, show that

W ≡ Y₁(Y₃′Y₃)⁻¹Y₁′ = E₁(E₃′E₃)⁻¹E₁′.

(v) Using Proposition 7.3, show that W has the same distribution no matter what the distribution of E, as long as 𝓛(ΓE) = 𝓛(E) for all Γ ∈ O_n and E₃ has rank p with probability one. This distribution of W is the distribution obtained by assuming the elements of E are i.i.d. N(0, 1). In particular, any invariant test of H₀ has the same distribution under H₀ as when E is N(0, I_n ⊗ I_p).
4. When Y₁ is N(B₁, I_r ⊗ Σ) and Y₃ is N(0, I_m ⊗ Σ) with m ≥ p + 2, verify the claim that

𝔼 Y₁(Y₃′Y₃)⁻¹Y₁′ = (1/(m - p - 1))(B₁Σ⁻¹B₁′ + pI_r).
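A Monte Carlo check of this moment identity can be run as follows (illustrative code; the stated formula follows from 𝔼(Y₃′Y₃)⁻¹ = Σ⁻¹/(m - p - 1) together with 𝔼 Y₁Σ⁻¹Y₁′ = B₁Σ⁻¹B₁′ + pI_r):

```python
import numpy as np

rng = np.random.default_rng(6)
r, p, m, reps = 2, 3, 10, 40000

Sigma = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]])
C = np.linalg.cholesky(Sigma)
B1 = rng.standard_normal((r, p))           # a fixed, arbitrary mean matrix

acc = np.zeros((r, r))
for _ in range(reps):
    Y1 = B1 + rng.standard_normal((r, p)) @ C.T     # N(B1, I_r (x) Sigma)
    Y3 = rng.standard_normal((m, p)) @ C.T          # N(0, I_m (x) Sigma)
    acc += Y1 @ np.linalg.solve(Y3.T @ Y3, Y1.T)
est = acc / reps

claim = (B1 @ np.linalg.solve(Sigma, B1.T) + p * np.eye(r)) / (m - p - 1)
print(np.round(est - claim, 3))
```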
5. Consider a data matrix Y: n × 2 and assume 𝓛(Y) = N(ZB, I_n ⊗ Σ) where Z is n × 2 of rank two so B is 2 × 2. In some situations, it is reasonable to assume that σ₁₁ = σ₂₂, that is, the diagonal elements of Σ are the same. Under this assumption, use the results of Section 9.2 to derive the likelihood ratio test for H₀: b₁₁ = b₁₂, b₂₁ = b₂₂ versus H₁: b₁₁ ≠ b₁₂ or b₂₁ ≠ b₂₂. Is this test UMP invariant?
6. Consider a "two-way layout" situation with observations Yᵢⱼ, i = 1,…, m and j = 1,…, n. Assume Yᵢⱼ = μ + αᵢ + βⱼ + eᵢⱼ where μ, αᵢ, and βⱼ are constants that satisfy Σαᵢ = Σβⱼ = 0. The eᵢⱼ are random errors with mean zero (but not necessarily uncorrelated). Let Y be the m × n matrix of Yᵢⱼ's, u₁ be the vector of ones in ℝᵐ, u₂ be the vector of ones in ℝⁿ, α ∈ ℝᵐ be the vector with coordinates αᵢ, and β ∈ ℝⁿ be the vector with coordinates βⱼ. Let E be the matrix of eᵢⱼ's.
(i) Show the model is Y = μu₁u₂′ + αu₂′ + u₁β′ + E in the vector space L_{n,m}. Here, α ∈ ℝᵐ with α′u₁ = 0 and β ∈ ℝⁿ with β′u₂ = 0. Let

M₁ = {x | x ∈ L_{n,m}, x = μu₁u₂′, μ ∈ ℝ},
M₂ = {x | x = αu₂′, α ∈ ℝᵐ, α′u₁ = 0},
M₃ = {x | x = u₁β′, β ∈ ℝⁿ, β′u₂ = 0}.

Also, let ⟨·,·⟩ be the usual inner product on L_{n,m}.
(ii) Show M₁ ⊥ M₂, M₂ ⊥ M₃, and M₃ ⊥ M₁ in (L_{n,m}, ⟨·,·⟩).
Now, assume Cov(E) = I_m ⊗ A where A = γP + δQ with P = n⁻¹u₂u₂′, Q = I_n - P, and γ > 0 and δ > 0 unknown parameters.
(iii) Show the regression subspace M = M₁ ⊕ M₂ ⊕ M₃ is invariant under each I_m ⊗ A. Find the Gauss–Markov estimates for μ, α, and β.
(iv) Now, assume E is N(0, I_m ⊗ A). Use an invariance argument to show that for testing H₀: α = 0 versus H₁: α ≠ 0, the test that rejects for large values of W = ‖P_{M₂}Y‖²/‖Q_M Y‖² is a UMP invariant test. Here, Q_M = I - P_M. What is the distribution of W?
7. The regression subspace for the MANOVA model was described as M = {μ | μ = ZB, B ∈ L_{p,k}} ⊆ L_{p,n} where Z is n × k of rank k. The subspace of M associated with the null hypothesis H₀: RB = 0 (R is r × k of rank r) is ω = {μ | μ = ZB, B ∈ L_{p,k}, RB = 0}. We know that P_M = P_Z ⊗ I_p where P_Z = Z(Z′Z)⁻¹Z′ (P_M is the orthogonal projection onto M in (L_{p,n}, ⟨·,·⟩)). This problem gives one form for P_ω. Let

W = Z(Z′Z)⁻¹R′.

(i) Show that W has rank r.
Let P_W = W(W′W)⁻¹W′ so P_W ⊗ I_p is an orthogonal projection.
(ii) Show that 𝓡(P_W ⊗ I_p) ⊆ M - ω where M - ω = M ∩ ω^⊥. Also, show dim(𝓡(P_W ⊗ I_p)) = rp.
(iii) Show that dim ω = (k - r)p.
(iv) Now, show that P_W ⊗ I_p is the orthogonal projection onto M - ω, so P_Z ⊗ I_p - P_W ⊗ I_p is the orthogonal projection onto ω.
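The conclusion of part (iv) can be checked numerically without the ⊗I_p factor, since the Kronecker structure acts coordinatewise. The sketch below is an added illustration: it verifies that P_Z - P_W is an orthogonal projection that fixes every mean ZB with RB = 0, while P_W annihilates such means.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, r, p = 10, 4, 2, 3
Z = rng.standard_normal((n, k))
R = rng.standard_normal((r, k))

W = Z @ np.linalg.solve(Z.T @ Z, R.T)          # n x r, rank r
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
PW = W @ np.linalg.solve(W.T @ W, W.T)
Pomega = PZ - PW

# Pomega is an orthogonal projection (idempotent and symmetric) ...
ok1 = np.allclose(Pomega @ Pomega, Pomega) and np.allclose(Pomega, Pomega.T)

# ... and it fixes every mean ZB with RB = 0, while PW annihilates it.
# Build B with RB = 0 by projecting a random B onto the null space of R.
B = rng.standard_normal((k, p))
B = B - R.T @ np.linalg.solve(R @ R.T, R @ B)
mu = Z @ B
ok2 = np.allclose(Pomega @ mu, mu) and np.allclose(PW @ mu, np.zeros_like(mu))
print(ok1, ok2)
```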
8. Assume X₁,…, Xₙ are i.i.d. from a five-dimensional N(0, Σ) where Σ is a cyclic covariance matrix (Σ is written out explicitly at the beginning of Section 9.4). Find the maximum likelihood estimators of σ², ρ₁, and ρ₂.
9. Suppose X₁,…, Xₙ are i.i.d. N(0, Ω) of dimension 2p and assume Ω has the complex form

Ω = ( Σ  F; -F  Σ ).

Let S = ΣᵢXᵢXᵢ′ and partition S as Ω is partitioned. Show that Σ̂ = (2n)⁻¹(S₁₁ + S₂₂) and F̂ = (2n)⁻¹(S₁₂ - S₂₁) are the maximum likelihood estimates of Σ and F.
10. Let X₁,…, Xₙ be i.i.d. N(μ, Σ) p-dimensional random vectors where μ and Σ are unknown, Σ > 0. Suppose R is r × p of rank r and consider testing H₀: Rμ = 0 versus H₁: Rμ ≠ 0. Let X̄ = (1/n)ΣᵢXᵢ and S = Σᵢ(Xᵢ - X̄)(Xᵢ - X̄)′. Show that the test that rejects for large values of T = (RX̄)′(RSR′)⁻¹(RX̄) is equivalent to the likelihood ratio test. Also, show this test is UMP invariant under a suitable group of transformations. Apply this to the problem of testing μ₁ = ⋯ = μ_p where μ₁,…, μ_p are the coordinates of μ.
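For the equal-means application, a natural choice is the contrast matrix R with rows eᵢ′ - e_{i+1}′, i = 1,…, p - 1. The simulation below is illustrative: under H₀, the factor n(n - r)/r with r = p - 1 converts T into a variable with a standard F(r, n - r) distribution, which the code checks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p, reps = 30, 4, 4000
r = p - 1

# Successive-difference contrasts for H0: mu_1 = ... = mu_p, i.e. R mu = 0.
R = np.eye(r, p) - np.eye(r, p, k=1)

vals = np.empty(reps)
for i in range(reps):
    X = 2.0 + rng.standard_normal((n, p))      # equal means: H0 holds
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)
    v = R @ xbar
    T = v @ np.linalg.solve(R @ S @ R.T, v)
    vals[i] = n * (n - r) / r * T

ks = stats.kstest(vals, stats.f(r, n - r).cdf)
print(round(ks.statistic, 3))
```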
11. Consider a linear model of the form Y = ZB + E with Z: n × k of rank k, B: k × p unknown, and E a matrix of errors. Assume the first column of Z is the vector e of ones (the regression equation has the constant term in it). Assume Cov(E) = A(ρ) ⊗ Σ where A(ρ) has ones on the diagonal and ρ off the diagonal (-1/(n - 1) < ρ < 1).
(i) Show that the Gauss–Markov and least-squares estimates of B are the same.
(ii) When 𝓛(E) = N(0, A(ρ) ⊗ Σ) with Σ and ρ unknown, argue via invariance to construct tests for hypotheses of the form RB̃ = 0, where R is r × (k - 1) of rank r and B̃: (k - 1) × p consists of the last k - 1 rows of B.
NOTES AND REFERENCES
1. The material in Section 9.1 is fairly standard and can be found in many texts on multivariate analysis, although the treatment and emphasis may be different than here. The likelihood ratio test in the MANOVA setting was originally proposed by Wilks (1932). Various competitors to the likelihood ratio test were proposed in Lawley (1938), Hotelling (1947), Roy (1953), and Pillai (1955).
2. Arnold (1973) applied the theory of products of problems (which he had
developed in his Ph.D. dissertation at Stanford) to situations involving patterned covariance matrices. This notion appears in both this chapter and Chapter 10.
3. Given the covariance structure assumed in Section 9.2, the regression
subspaces considered there are not the most general for which the Gauss-Markov and least-squares estimators are the same. See Eaton
(1970) for a discussion.
4. Andersson (1975) provides a complete description of all symmetry models.
5. Cyclic covariance models were first studied systematically in Olkin and
Press (1969).
6. For early papers on the complex normal distribution, see Goodman
(1963) and Giri (1965a). Also, see Andersson (1975).
7. Some of the material in Section 9.6 comes from Giri (1964, 1965b).
8. In Proposition 9.5, when r = 1, the test statistic is commonly known as Hotelling's T² (see Hotelling (1931)).
CHAPTER 10
Canonical Correlation Coefficients
This final chapter is concerned with the interpretation of canonical correlation coefficients and their relationship to affine dependence and independence between two random vectors. After using an invariance argument to show that population canonical correlations are a natural measure of affine dependence, these population coefficients are interpreted as cosines of the angles between subspaces (as defined in Chapter 1). Next, the sample canonical correlations are defined and interpreted as cosines of angles. The distribution theory associated with the sample coefficients is discussed briefly.

When two random vectors have a joint normal distribution, independence between the vectors is equivalent to the population canonical correlations all being zero. The problem of testing for independence is treated in the fourth section of this chapter. The relationship between the MANOVA testing problem and testing for independence is discussed in the fifth and final section of the chapter.
10.1. POPULATION CANONICAL CORRELATION COEFFICIENTS
There are a variety of ways to introduce canonical correlation coefficients and three of these are considered in this section. We begin our discussion with the notion of affine dependence between two random vectors. Let X ∈ (V, (·,·)₁) and Y ∈ (W, (·,·)₂) be two random vectors defined on the same probability space so the random vector Z = {X, Y} takes values in the vector space V ⊕ W. It is assumed that Cov(X) = Σ₁₁ and Cov(Y) = Σ₂₂ both exist and are nonsingular. Therefore, Cov(Z) exists (see Proposition 2.15) and is given by

Σ = Cov(Z) = ( Σ₁₁  Σ₁₂; Σ₂₁  Σ₂₂ ).

Also, the mean vector of Z is

μ = 𝔼Z = {𝔼X, 𝔼Y} = {μ₁, μ₂}.
Definition 10.1. Two random vectors U and Ũ in (V, (·,·)) are affinely equivalent if Ũ = AU + a for some nonsingular linear transformation A and some vector a ∈ V.
It is clear that affine equivalence is an equivalence relation among random vectors defined on the same probability space and taking values in V.
We now consider measures of affine dependence between X and Y, which are functions of μ = {𝔼X, 𝔼Y} and Σ = Cov(Z) where Z = {X, Y}. Let m(μ, Σ) be some real-valued function of μ and Σ that is supposed to measure affine dependence. If instead of X we observe X̃, which is affinely equivalent to X, then the affine dependence between X̃ and Y should be the same as the affine dependence between X and Y. Similarly, if Ỹ is affinely equivalent to Y, then the affine dependence between X and Ỹ should be the same as the affine dependence between X and Y. These remarks imply that m(μ, Σ) should be invariant under affine transformations of both X and Y.
If (A, a) is an affine transformation on V, then (A, a)v = Av + a where A is nonsingular on V to V. Recall that the group of all affine transformations on V to V is denoted by Al(V) and the group operation is given by

(A₁, a₁)(A₂, a₂) = (A₁A₂, A₁a₂ + a₁).

Also, let Al(W) be the affine group for W. The product group Al(V) × Al(W) acts on the vector space V ⊕ W in the obvious way:

((A, a), (B, b))(v, w) = (Av + a, Bw + b).
The argument given above suggests that the affine dependence between X and Y should be the same as the affine dependence between (A, a)X and (B, b)Y for all (A, a) ∈ Al(V) and (B, b) ∈ Al(W). We now need to interpret this requirement as a condition on m(μ, Σ). The random vector

((A, a), (B, b)){X, Y} = {AX + a, BY + b}

has a mean vector given by

((A, a), (B, b)){μ₁, μ₂} = {Aμ₁ + a, Bμ₂ + b}

and a covariance given by

( AΣ₁₁A′  AΣ₁₂B′; BΣ₂₁A′  BΣ₂₂B′ ).

Therefore, the group Al(V) × Al(W) acts on the set

Θ = {(μ, Σ) | μ ∈ V ⊕ W, Σ ≥ 0, Σᵢᵢ > 0, i = 1, 2}.

For g = ((A, a), (B, b)) ∈ Al(V) × Al(W), the group action is given by

(μ, Σ) → (gμ, g(Σ))

where

gμ = {Aμ₁ + a, Bμ₂ + b}

and

g(Σ) = ( AΣ₁₁A′  AΣ₁₂B′; BΣ₂₁A′  BΣ₂₂B′ ).
Requiring the affine dependence between X and Y to be equal to the affine dependence between (A, a)X and (B, b)Y simply means that the function m defined on Θ must be invariant under the group action given above. Therefore, m must be a function of a maximal invariant function under the action of Al(V) × Al(W) on Θ. The following proposition gives one form of a maximal invariant.
Proposition 10.1. Let q = dim V, r = dim W, and let t = min{q, r}. Given

Σ = ( Σ₁₁  Σ₁₂; Σ₂₁  Σ₂₂ ),

which is positive definite on V ⊕ W, let λ₁ ≥ ⋯ ≥ λ_t ≥ 0 be the t largest eigenvalues of

Δ(Σ) = Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁.

Define a function h on Θ by

h(μ, Σ) = (λ₁, λ₂,…, λ_t)

where λ₁ ≥ ⋯ ≥ λ_t are defined in terms of Σ as above. Then h is a maximal invariant function under the action of G = Al(V) × Al(W) on Θ.
Proof. Let {v₁,…, v_t} and {w₁,…, w_t} be fixed orthonormal sets in V and W. For each Σ, define Q₁₂(Σ) by

Q₁₂(Σ) = λ₁^{1/2} v₁□w₁ + ⋯ + λ_t^{1/2} v_t□w_t

where λ₁ ≥ ⋯ ≥ λ_t are the t largest eigenvalues of Δ(Σ). Given (μ, Σ) ∈ Θ, we first claim that there exists a g ∈ G such that gμ = 0 and

g(Σ) = ( I_V  Q₁₂(Σ); Q₁₂(Σ)′  I_W ).

The proof of this claim follows. For g = ((A, a), (B, b)), we have

g(Σ) = ( AΣ₁₁A′  AΣ₁₂B′; BΣ₂₁A′  BΣ₂₂B′ ).

Choose A = ΓΣ₁₁^{-1/2} and B = ΔΣ₂₂^{-1/2} where Γ ∈ O(V), Δ ∈ O(W), and Σᵢᵢ^{-1/2} is the inverse of the positive definite square root of Σᵢᵢ, i = 1, 2. For each Γ and Δ,

AΣ₁₁A′ = ΓΣ₁₁^{-1/2}Σ₁₁Σ₁₁^{-1/2}Γ′ = I_V,
BΣ₂₂B′ = ΔΣ₂₂^{-1/2}Σ₂₂Σ₂₂^{-1/2}Δ′ = I_W,

and

AΣ₁₂B′ = ΓΣ₁₁^{-1/2}Σ₁₂Σ₂₂^{-1/2}Δ′.

Using the singular value decomposition, write

Λ₁₂ ≡ Σ₁₁^{-1/2}Σ₁₂Σ₂₂^{-1/2} = λ₁^{1/2} x₁□y₁ + ⋯ + λ_t^{1/2} x_t□y_t
where {x₁,…, x_t} and {y₁,…, y_t} are orthonormal sets in V and W, respectively. This representation follows by noting that the rank of Λ₁₂ is at most t and

Λ₁₂Λ₁₂′ = Σ₁₁^{-1/2}Σ₁₂Σ₂₂⁻¹Σ₂₁Σ₁₁^{-1/2}

has the same eigenvalues as Δ(Σ), which are λ₁ ≥ ⋯ ≥ λ_t ≥ 0. For A and B as above, it now follows that

AΣ₁₂B′ = λ₁^{1/2}(Γx₁)□(Δy₁) + ⋯ + λ_t^{1/2}(Γx_t)□(Δy_t).

Choose Γ so that Γxᵢ = vᵢ and choose Δ so that Δyᵢ = wᵢ. Then we have

AΣ₁₂B′ = Q₁₂(Σ)

so g(Σ) has the form claimed. With these choices for A and B, now choose a = -Aμ₁ and b = -Bμ₂. Then

gμ = g({μ₁, μ₂}) = ((A, a), (B, b)){μ₁, μ₂} = {Aμ₁ + a, Bμ₂ + b} = {0, 0} = 0.
The proof of the claim is now complete. To finish the proof of Proposition 10.1, first note that Proposition 1.39 implies that h is a G-invariant function. For the maximality of h, suppose that h(μ, Σ) = h(ν, Σ̄). Thus

Q₁₂(Σ) = Q₁₂(Σ̄),

which implies that there exist g and ḡ such that

gμ = 0, ḡν = 0,

and

g(Σ) = ( I_V  Q₁₂(Σ); Q₁₂(Σ)′  I_W ) = ḡ(Σ̄).

Therefore,

ḡ⁻¹g(μ, Σ) = (ν, Σ̄)

so h is maximal invariant. □
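In matrix coordinates the maximal invariant is easy to compute: the λᵢ are the eigenvalues of Δ(Σ) = Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁, and they are unchanged when X and Y are relabeled affinely. The sketch below is an added illustration computing ρᵢ = λᵢ^{1/2} and confirming the invariance.

```python
import numpy as np

rng = np.random.default_rng(9)
q, r = 3, 4
t = min(q, r)

# A random positive definite covariance on V (+) W, partitioned as in the text.
M = rng.standard_normal((q + r, q + r))
Sig = M @ M.T + (q + r) * np.eye(q + r)
S11, S12 = Sig[:q, :q], Sig[:q, q:]
S21, S22 = Sig[q:, :q], Sig[q:, q:]

def cancorrs(S11, S12, S21, S22):
    # Squared canonical correlations: eigenvalues of Delta = S11^{-1} S12 S22^{-1} S21.
    Delta = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S21))
    lam = np.sort(np.linalg.eigvals(Delta).real)[::-1][:t]
    return np.sqrt(np.clip(lam, 0, None))

rho = cancorrs(S11, S12, S21, S22)

# Invariance: replace X by AX + a and Y by BY + b (nonsingular A, B);
# the covariance blocks transform as in the text and the rho_i do not change.
A = rng.standard_normal((q, q)) + 3 * np.eye(q)
B = rng.standard_normal((r, r)) + 3 * np.eye(r)
rho_g = cancorrs(A @ S11 @ A.T, A @ S12 @ B.T, B @ S21 @ A.T, B @ S22 @ B.T)
print(np.allclose(rho, rho_g))
```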
The form of the singular value decomposition used in the proof of Proposition 10.1 is slightly different from that given in Theorem 1.3. For a linear transformation C of rank k defined on (V, (·,·)₁) to (W, (·,·)₂), Theorem 1.3 asserts that

C = μ₁ w₁□x₁ + ⋯ + μ_k w_k□x_k

where μᵢ > 0 and {x₁,…, x_k} and {w₁,…, w_k} are orthonormal sets in V and W. With q = dim V, r = dim W, and t = min{q, r}, obviously k ≤ t. When k < t, it is clear that the orthonormal sets above can be extended to {x₁,…, x_t} and {w₁,…, w_t}, which are still orthonormal sets in V and W. Also, setting μᵢ = 0 for i = k + 1,…, t, we have

C = μ₁ w₁□x₁ + ⋯ + μ_t w_t□x_t,

and μ₁² ≥ ⋯ ≥ μ_t² are the t largest eigenvalues of both CC′ and C′C. This form of the singular value decomposition is somewhat more convenient in this chapter since the rank of C is not explicitly mentioned. However, the rank of C is just the number of μᵢ that are strictly positive. The corresponding modification of Proposition 1.48 should now be clear.

Returning to our original problem of describing measures of affine dependence, say m(μ, Σ), Proposition 10.1 demonstrates that m is invariant under affine relabelings of X and Y iff m is a function of the t largest eigenvalues, λ₁,…, λ_t, of Δ(Σ). Since the rank of Δ(Σ) is at most t, the remaining eigenvalues of Δ(Σ), if there are any, must be zero. Before suggesting some particular measures m(μ, Σ), the canonical correlation coefficients are discussed.
Definition 10.2. In the notation of Proposition 10.1, let ρᵢ = λᵢ^{1/2}, i = 1,…, t. The numbers ρ₁ ≥ ρ₂ ≥ ⋯ ≥ ρ_t ≥ 0 are called the population canonical correlation coefficients.
Since ρᵢ is a one-to-one function of λᵢ, it follows that the vector (ρ₁,…, ρ_t) also determines a maximal invariant function under the action of G on Θ. In particular, any measure of affine dependence should be a function of the canonical correlation coefficients.

The canonical correlation coefficients have a natural interpretation as cosines of angles between subspaces in a vector space. Recall that Z = {X, Y} takes values in the vector space V ⊕ W where (V, (·,·)₁) and (W, (·,·)₂)
are inner product spaces. The covariance of Z, with respect to the natural inner product, say (·,·), on V ⊕ W, is

Σ = ( Σ₁₁  Σ₁₂; Σ₂₁  Σ₂₂ ).

In the discussion that follows, it is assumed that Σ is positive definite. Let (·,·)_Σ denote the inner product on V ⊕ W defined by

(z₁, z₂)_Σ = (z₁, Σz₂) = cov{(z₁, Z), (z₂, Z)}

for z₁, z₂ ∈ V ⊕ W. The vector space V can be thought of as a subspace of V ⊕ W; namely, just identify V with V ⊕ {0} ⊆ V ⊕ W. Similarly, W is a subspace of V ⊕ W. The next result interprets the canonical correlations as the cosines of angles between the subspaces V and W when the inner product on V ⊕ W is (·,·)_Σ.
Proposition 10.2. Given Σ, the canonical correlation coefficients ρ₁ ≥ ⋯ ≥ ρ_t are the cosines of the angles between V and W as subspaces of the inner product space (V ⊕ W, (·,·)_Σ).

Proof. Let P₁ and P₂ be the orthogonal projections (relative to (·,·)_Σ) onto V ⊕ {0} and {0} ⊕ W, respectively. In view of Proposition 1.48 and Definition 1.28, it suffices to show that the t largest eigenvalues of P₁P₂P₁ are λᵢ = ρᵢ², i = 1,…, t. We claim that

C₁ = ( I_V  Σ₁₁⁻¹Σ₁₂; 0  0 )

is the orthogonal projection onto V ⊕ {0}. For (v, w) ∈ V ⊕ W,

C₁(v, w) = (v + Σ₁₁⁻¹Σ₁₂w, 0)

so the range of C₁ is V ⊕ {0} and C₁ is the identity on V ⊕ {0}. That C₁² = C₁ is easily verified. Also, since

ΣC₁ = ( Σ₁₁  Σ₁₂; Σ₂₁  Σ₂₁Σ₁₁⁻¹Σ₁₂ )

is symmetric, the identity C₁′Σ = ΣC₁ holds. Here C₁′ is the adjoint of C₁ relative to the inner product (·,·); namely,

C₁′ = ( I_V  0; Σ₂₁Σ₁₁⁻¹  0 ).

This shows that C₁ is self-adjoint relative to the inner product (·,·)_Σ. Hence C₁ is the orthogonal projection onto V ⊕ {0} in (V ⊕ W, (·,·)_Σ). A similar argument yields

C₂ = ( 0  0; Σ₂₂⁻¹Σ₂₁  I_W )

as the orthogonal projection onto {0} ⊕ W in (V ⊕ W, (·,·)_Σ). Therefore Pᵢ = Cᵢ, i = 1, 2, and a bit of algebra shows that

P₁P₂P₁ = ( Δ(Σ)  C; 0  0 )

where Δ(Σ) = Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁ and

C = Δ(Σ)Σ₁₁⁻¹Σ₁₂.

Thus the characteristic polynomial of P₁P₂P₁ is given by

p(a) = det[P₁P₂P₁ - aI] = (-a)^r det[Δ(Σ) - aI_V]

where r = dim W. Since t = min{q, r} where q = dim V, it follows that the t largest eigenvalues of P₁P₂P₁ are the t largest eigenvalues of Δ(Σ). These are ρ₁² ≥ ⋯ ≥ ρ_t², so the proof is complete. □
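The projection argument can be verified directly in coordinates. The code below is illustrative: scipy's subspace_angles computes principal angles in the standard inner product, so the subspaces are first embedded with Σ^{1/2}; the check confirms that C₁ and C₂ are (·,·)_Σ-orthogonal projections and that the top eigenvalues of P₁P₂P₁ are the squared cosines of the angles between V and W.

```python
import numpy as np
from scipy.linalg import subspace_angles, sqrtm

rng = np.random.default_rng(11)
q, r = 2, 3
t = min(q, r)
M = rng.standard_normal((q + r, q + r))
Sig = M @ M.T + (q + r) * np.eye(q + r)
S11, S12, S21, S22 = Sig[:q, :q], Sig[:q, q:], Sig[q:, :q], Sig[q:, q:]

# C1 and C2 from the proof of Proposition 10.2, as block matrices on V (+) W.
C1 = np.zeros((q + r, q + r)); C1[:q, :q] = np.eye(q)
C1[:q, q:] = np.linalg.solve(S11, S12)
C2 = np.zeros((q + r, q + r)); C2[q:, q:] = np.eye(r)
C2[q:, :q] = np.linalg.solve(S22, S21)

# Idempotent, and self-adjoint for ( . , . )_Sigma, i.e. C' Sigma = Sigma C.
proj_ok = (np.allclose(C1 @ C1, C1) and np.allclose(C1.T @ Sig, Sig @ C1)
           and np.allclose(C2 @ C2, C2) and np.allclose(C2.T @ Sig, Sig @ C2))

# Eigenvalues of P1 P2 P1 agree with the eigenvalues of Delta(Sigma) and with
# the squared cosines of the principal angles between the embedded subspaces.
lam = np.sort(np.linalg.eigvals(C1 @ C2 @ C1).real)[::-1][:t]
Delta = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S21))
lam_delta = np.sort(np.linalg.eigvals(Delta).real)[::-1][:t]

root = sqrtm(Sig).real
E1 = np.vstack([np.eye(q), np.zeros((r, q))])
E2 = np.vstack([np.zeros((q, r)), np.eye(r)])
cos2 = np.sort(np.cos(subspace_angles(root @ E1, root @ E2)) ** 2)[::-1]
print(proj_ok, np.allclose(lam, lam_delta), np.allclose(lam, cos2))
```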
Another interpretation of the canonical correlation coefficients can be given using Proposition 1.49 and the discussion following Definition 1.28. Using the notation adopted in the proof of Proposition 10.2, write

P₂P₁ = ρ₁ ζ₁□η₁ + ⋯ + ρ_t ζ_t□η_t

where {η₁,…, η_t} is an orthonormal set in V ⊕ {0} and {ζ₁,…, ζ_t} is an orthonormal set in {0} ⊕ W. Here orthonormal refers to the inner product (·,·)_Σ on V ⊕ W, as does the symbol □ in the expression for P₂P₁; that is, for z₁, z₂ ∈ V ⊕ W,

(z₁□z₂)z = (z₂, z)_Σ z₁ = (z₂, Σz)z₁.
The existence of this representation for P₂P₁ follows from Proposition 1.48, as does the relationship

$$(\eta_i, \xi_j)_\Sigma = \delta_{ij}\rho_j$$

for i, j = 1,..., t. Define the sets D₁ᵢ and D₂ᵢ, i = 1,..., t, as in Proposition 1.49 (with M₁ = V ⊕ {0} and M₂ = {0} ⊕ W), so

$$\sup_{\eta \in D_{1i}}\ \sup_{\xi \in D_{2i}} (\eta, \xi)_\Sigma = (\eta_i, \xi_i)_\Sigma = \rho_i$$

for i = 1,..., t. To interpret ρᵢ, first consider the case i = 1. A vector η is in D₁₁ iff

$$\eta = \{v, 0\}, \quad v \in V$$

and

$$1 = (\eta, \eta)_\Sigma = (v, \Sigma_{11}v)_1 = \mathrm{var}(v, X)_1.$$

Similarly, ξ ∈ D₂₁ iff

$$\xi = \{0, w\}, \quad w \in W$$

and

$$1 = (\xi, \xi)_\Sigma = (w, \Sigma_{22}w)_2 = \mathrm{var}(w, Y)_2.$$

However, for η = {v, 0} ∈ D₁₁ and ξ = {0, w} ∈ D₂₁,

$$(\eta, \xi)_\Sigma = (v, \Sigma_{12}w)_1 = \mathrm{cov}\{(v, X)_1, (w, Y)_2\}.$$

This is just the ordinary correlation between (v, X)₁ and (w, Y)₂, as v and w have been normalized so that 1 = var(v, X)₁ = var(w, Y)₂. Since (η, ξ)_Σ ≤ ρ₁ for all η ∈ D₁₁ and ξ ∈ D₂₁, it follows that for every x ∈ V, x ≠ 0, and y ∈ W, y ≠ 0, the correlation between (x, X)₁ and (y, Y)₂ is no greater than ρ₁. Further, writing η₁ = {v₁, 0} and ξ₁ = {0, w₁}, we have

$$\rho_1 = (\eta_1, \xi_1)_\Sigma = (v_1, \Sigma_{12}w_1)_1 = \mathrm{cov}\{(v_1, X)_1, (w_1, Y)_2\},$$

which is the correlation between (v₁, X)₁ and (w₁, Y)₂. Therefore, ρ₁ is the
maximum correlation between (x, X)₁ and (y, Y)₂ over all nonzero x ∈ V and y ∈ W. Further, this maximum correlation is achieved by choosing x = v₁ and y = w₁.
The second largest canonical correlation coefficient, ρ₂, satisfies the equality

$$\sup_{\eta \in D_{12}}\ \sup_{\xi \in D_{22}} (\eta, \xi)_\Sigma = (\eta_2, \xi_2)_\Sigma = \rho_2.$$

A vector η is in D₁₂ iff

$$\eta = \{v, 0\},\quad v \in V, \qquad 1 = (\eta, \eta)_\Sigma = (v, \Sigma_{11}v)_1,$$

and

$$0 = (\eta, \eta_1)_\Sigma = (v, \Sigma_{11}v_1)_1.$$

Also, a vector ξ is in D₂₂ iff

$$\xi = \{0, w\},\quad w \in W, \qquad 1 = (\xi, \xi)_\Sigma = (w, \Sigma_{22}w)_2,$$

and

$$0 = (\xi, \xi_1)_\Sigma = (w, \Sigma_{22}w_1)_2.$$
These relationships provide the following interpretation of ρ₂. The maximum correlation between (x, X)₁ and (y, Y)₂ is ρ₁ and is

$$\rho_1 = \mathrm{cov}\{(v_1, X)_1, (w_1, Y)_2\}$$

since 1 = var(v₁, X)₁ = var(w₁, Y)₂. Suppose we now want to find the maximum correlation between (x, X)₁ and (y, Y)₂ subject to the condition

(i)  cov{(x, X)₁, (v₁, X)₁} = 0,  cov{(y, Y)₂, (w₁, Y)₂} = 0.

Clearly, (i) is equivalent to

(ii)  (x, Σ₁₁v₁)₁ = 0,  (y, Σ₂₂w₁)₂ = 0.
Since correlation is invariant under multiplication of the random variables by positive constants, to find the maximum correlation between (x, X)₁ and (y, Y)₂ subject to (ii), it suffices to maximize cov{(x, X)₁, (y, Y)₂} over those x's and y's that satisfy

(iii)  (x, Σ₁₁x)₁ = 1, (x, Σ₁₁v₁)₁ = 0;  (y, Σ₂₂y)₂ = 1, (y, Σ₂₂w₁)₂ = 0.
However, x ∈ V satisfies (iii) iff η = {x, 0} is in D₁₂, and y ∈ W satisfies (iii) iff ξ = {0, y} is in D₂₂. Further, for such x, y, η, and ξ,

$$\mathrm{cov}\{(x, X)_1, (y, Y)_2\} = (\eta, \xi)_\Sigma.$$

Thus maximizing this covariance subject to (iii) is the same as maximizing (η, ξ)_Σ for η ∈ D₁₂ and ξ ∈ D₂₂. Of course, this maximum is ρ₂ and is achieved at η₂ ∈ D₁₂ and ξ₂ ∈ D₂₂. Writing η₂ = {v₂, 0} and ξ₂ = {0, w₂}, it is clear that v₂ ∈ V and w₂ ∈ W satisfy (iii) and

$$\mathrm{cov}\{(v_2, X)_1, (w_2, Y)_2\} = \rho_2.$$

Furthermore, Proposition 1.48 shows that

$$0 = (\eta_1, \xi_2)_\Sigma = (\eta_2, \xi_1)_\Sigma,$$

which implies that

$$0 = \mathrm{cov}\{(v_1, X)_1, (w_2, Y)_2\} = \mathrm{cov}\{(v_2, X)_1, (w_1, Y)_2\}.$$

Therefore, the problem of maximizing the correlation between (x, X)₁ and (y, Y)₂ (subject to the condition that the correlation between (x, X)₁ and (v₁, X)₁ be zero and the correlation between (y, Y)₂ and (w₁, Y)₂ be zero) has been solved.
It should now be fairly clear how to interpret the remaining canonical correlation coefficients. The easiest way to describe the coefficients is by induction. The coefficient ρ₁ is the largest possible correlation between (x, X)₁ and (y, Y)₂ for nonzero vectors x ∈ V and y ∈ W. Further, there exist vectors v₁ ∈ V and w₁ ∈ W such that

$$\mathrm{cov}\{(v_1, X)_1, (w_1, Y)_2\} = \rho_1$$

and

$$1 = \mathrm{var}(v_1, X)_1 = \mathrm{var}(w_1, Y)_2.$$
These vectors came from η₁ and ξ₁ in the representation

$$P_2P_1 = \sum_{i=1}^{t} \rho_i\, \xi_i \square \eta_i$$

given earlier. Since ηᵢ ∈ V ⊕ {0}, we can write ηᵢ = {vᵢ, 0}, i = 1,..., t. Similarly, ξᵢ = {0, wᵢ}, i = 1,..., t. Using Proposition 1.48, it is easy to check that

$$\mathrm{cov}\{(v_j, X)_1, (w_k, Y)_2\} = \rho_j\delta_{jk}$$
$$\mathrm{cov}\{(v_j, X)_1, (v_k, X)_1\} = \delta_{jk}$$
$$\mathrm{cov}\{(w_j, Y)_2, (w_k, Y)_2\} = \delta_{jk}$$

for j, k = 1,..., t. Of course, these relationships are simply a restatement of the properties of ξ₁,..., ξₜ and η₁,..., ηₜ. For example,

$$\mathrm{cov}\{(v_j, X)_1, (w_k, Y)_2\} = (v_j, \Sigma_{12}w_k)_1 = (\eta_j, \xi_k)_\Sigma = \rho_j\delta_{jk}.$$

However, as argued in the case of ρ₂, we can say more. Given ρ₁,..., ρₜ and the vectors v₁,..., vᵢ₋₁ and w₁,..., wᵢ₋₁ obtained from η₁,..., ηᵢ₋₁ and ξ₁,..., ξᵢ₋₁, consider the problem of maximizing the correlation between (x, X)₁ and (y, Y)₂ subject to the conditions

cov{(x, X)₁, (vⱼ, X)₁} = 0, j = 1,..., i − 1
cov{(y, Y)₂, (wⱼ, Y)₂} = 0, j = 1,..., i − 1.

By simply unravelling the notation and using Proposition 1.49, this maximum correlation is ρᵢ and is achieved for x = vᵢ and y = wᵢ. This successive maximization of correlation is often a useful interpretation of the canonical correlation coefficients.

The vectors v₁,..., vₜ and w₁,..., wₜ lead to what are called the canonical variates. Recall that q = dim V, r = dim W, and t = min{q, r}. For definiteness, assume that q ≤ r so t = q. Thus {v₁,..., v_q} is a basis for V and satisfies

$$(v_j, \Sigma_{11}v_k)_1 = \delta_{jk}$$

for j, k = 1,..., q, so {v₁,..., v_q} is an orthonormal basis for V relative to
the inner product determined by Σ₁₁. Further, the linearly independent set {w₁,..., w_q} satisfies

$$(w_j, \Sigma_{22}w_k)_2 = \delta_{jk}$$

so {w₁,..., w_q} is an orthonormal set relative to the inner product determined by Σ₂₂. Now, extend this set to {w₁,..., w_r} so that it is an orthonormal basis for W in the Σ₂₂ inner product.
Definition 10.3. The real-valued random variables defined by

$$X_i = (v_i, X)_1, \quad i = 1,..., q$$

and

$$Y_i = (w_i, Y)_2, \quad i = 1,..., r$$

are called the canonical variates of X and Y, respectively.

Proposition 10.3. The canonical variates satisfy the relationships

(i) var Xⱼ = var Yₖ = 1,
(ii) cov{Xⱼ, Yₖ} = ρⱼδⱼₖ.

These relationships hold for j = 1,..., q and k = 1,..., r. Here, ρ₁,..., ρ_q are the canonical correlation coefficients.
Proof. This is just a restatement of part of what we have established above.
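One standard way to compute canonical vectors vᵢ and wᵢ in coordinates (a construction assumed here for illustration, not spelled out in the text) is through a singular value decomposition of the whitened cross-covariance Σ₁₁^{-1/2}Σ₁₂Σ₂₂^{-1/2}; the resulting columns satisfy exactly the relationships of Proposition 10.3. A sketch with illustrative numbers:

```python
import numpy as np

Sigma11 = np.array([[2.0, 0.4], [0.4, 1.0]])
Sigma22 = np.array([[1.0, 0.2], [0.2, 2.0]])
Sigma12 = np.array([[0.6, 0.1], [0.0, 0.3]])

def inv_sqrt(M):
    # Symmetric inverse square root via the spectral decomposition.
    w, P = np.linalg.eigh(M)
    return P @ np.diag(w ** -0.5) @ P.T

R11, R22 = inv_sqrt(Sigma11), inv_sqrt(Sigma22)
U, rho, Vt = np.linalg.svd(R11 @ Sigma12 @ R22)
V = R11 @ U       # columns v_i:  (v_j, Sigma11 v_k)_1 = delta_jk
W = R22 @ Vt.T    # columns w_i:  (w_j, Sigma22 w_k)_2 = delta_jk

# The relationships of Proposition 10.3 for the canonical variates:
assert np.allclose(V.T @ Sigma11 @ V, np.eye(2))     # var X_j = 1
assert np.allclose(W.T @ Sigma22 @ W, np.eye(2))     # var Y_k = 1
assert np.allclose(V.T @ Sigma12 @ W, np.diag(rho))  # cov{X_j, Y_k} = rho_j delta_jk
print(rho)
```

The singular values come out ordered ρ₁ ≥ ρ₂, matching the ordering convention for the canonical correlation coefficients.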
Let us briefly review what has been established thus far about the population canonical correlation coefficients ρ₁,..., ρₜ. These coefficients were defined in terms of a maximal invariant under a group action, and this group action arose quite naturally in an attempt to define measures of affine dependence. Using Proposition 1.48 and Definition 1.28, it was then shown that ρ₁,..., ρₜ are cosines of angles between subspaces with respect to an inner product defined by Σ. The statistical interpretation of the coefficients came from the detailed information given in Proposition 1.49, and this interpretation closely resembled the discussion following Definition 1.28.
Given X in (V, (·,·)₁) and Y in (W, (·,·)₂) with a nonsingular covariance

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
the existence of special bases {v₁,..., v_q} and {w₁,..., w_r} for V and W was established. In terms of the canonical variates

$$X_i = (v_i, X)_1, \quad Y_j = (w_j, Y)_2,$$

the properties of these bases can be written

$$1 = \mathrm{var}\, X_i = \mathrm{var}\, Y_j$$

and

$$\mathrm{cov}\{X_i, Y_j\} = \rho_i\delta_{ij}$$

for i = 1,..., q and j = 1,..., r. Here, the convention that ρᵢ = 0 for i > t = min{q, r} has been used, although ρᵢ is not defined for i > t. When q ≤ r, the covariance matrix of the variates X₁,..., X_q, Y₁,..., Y_r (in that order) is

$$\begin{pmatrix} I_q & (D\ \ 0) \\ (D\ \ 0)' & I_r \end{pmatrix}$$

where D is a q × q diagonal matrix with diagonal entries ρ₁ ≥ ··· ≥ ρ_q and 0 is a q × (r − q) block of zeroes. The reader should compare this matrix representation of Σ to the assertion of Proposition 5.7.
The final point of this section is to relate a prediction problem to that of suggesting a particular measure of affine dependence. Using the ideas developed in Chapter 4, a slight generalization of Proposition 2.22 is presented below. Again, consider X ∈ (V, (·,·)₁) and Y ∈ (W, (·,·)₂) with

$$\mathscr{E}X = \mu_1, \quad \mathscr{E}Y = \mu_2$$

and

$$\mathrm{Cov}\{X, Y\} = \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

It is assumed that Σ₁₁ and Σ₂₂ are both nonsingular. Consider the problem of predicting X by an affine function of Y, say CY + v₀ where C ∈ ℒ(W, V) and v₀ ∈ V. Let [·,·] be any inner product on V and let ‖·‖ be the norm defined by [·,·]. The following result shows how to choose C and v₀ to minimize

$$\mathscr{E}\|X - (CY + v_0)\|^2.$$
Of course, the inner product [·,·] on V is related to the inner product (·,·)₁
by

$$[v_1, v_2] = (v_1, A_0v_2)_1$$

for some positive definite A₀.
Proposition 10.4. For any C ∈ ℒ(W, V) and v₀ ∈ V, the inequality

$$\mathscr{E}\|X - (CY + v_0)\|^2 \geq \langle A_0,\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\rangle$$

holds. There is equality in this inequality iff

$$v_0 = \hat{v}_0 \equiv \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}\mu_2$$

and

$$C = \hat{C} \equiv \Sigma_{12}\Sigma_{22}^{-1}.$$

Here, ⟨·,·⟩ is the natural inner product on ℒ(V, V) inherited from (V, (·,·)₁).
Proof. First, write

$$X - (CY + v_0) = U_1 + U_2$$

where

$$U_1 = X - (\hat{C}Y + \hat{v}_0) = X - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(Y - \mu_2)$$

and

$$U_2 = (\hat{C} - C)Y + \hat{v}_0 - v_0.$$

Clearly, U₁ has mean zero. It follows from Proposition 2.17 that U₁ and U₂ are uncorrelated and

$$\mathrm{Cov}(U_1) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

Further, from Proposition 4.3 we have ℰ[U₁, U₂] = 0. Therefore,

$$\mathscr{E}\|X - (CY + v_0)\|^2 = \mathscr{E}\|U_1 + U_2\|^2 = \mathscr{E}\|U_1\|^2 + \mathscr{E}\|U_2\|^2$$
$$= \mathscr{E}(U_1, A_0U_1)_1 + \mathscr{E}\|U_2\|^2 = \mathscr{E}\langle A_0,\ U_1 \square U_1\rangle + \mathscr{E}\|U_2\|^2$$
$$= \langle A_0,\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\rangle + \mathscr{E}\|U_2\|^2,$$
where the last equality follows from the identity

$$\mathscr{E}\, U_1 \square U_1 = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

established in Proposition 2.21. Thus the desired inequality holds, and there is equality iff ℰ‖U₂‖² = 0. But ℰ‖U₂‖² is zero iff U₂ is zero with probability one. This holds iff v₀ = v̂₀ and C = Ĉ since Cov(Y) = Σ₂₂ is positive definite. This completes the proof. □
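In coordinates, the minimizing predictor of Proposition 10.4 does not depend on the inner product [·,·]: it is always v̂₀ + ĈY with Ĉ = Σ₁₂Σ₂₂⁻¹ and v̂₀ = μ₁ − Σ₁₂Σ₂₂⁻¹μ₂. A small sketch with illustrative, assumed numbers:

```python
import numpy as np

mu1 = np.array([1.0, -1.0])         # mean of X (q = 2)
mu2 = np.array([0.5, 2.0, 0.0])     # mean of Y (r = 3)
Sigma12 = np.array([[0.4, 0.0, 0.1],
                    [0.0, 0.3, 0.0]])
Sigma22 = np.array([[1.0, 0.2, 0.0],
                    [0.2, 1.5, 0.1],
                    [0.0, 0.1, 1.0]])

C_hat = Sigma12 @ np.linalg.inv(Sigma22)   # C = Sigma_12 Sigma_22^{-1}
v0_hat = mu1 - C_hat @ mu2                 # v0 = mu_1 - Sigma_12 Sigma_22^{-1} mu_2

y = np.array([0.7, 1.8, -0.2])             # a realized value of Y
x_pred = v0_hat + C_hat @ y
print(x_pred)
```

At y = μ₂ the prediction is exactly μ₁, as it must be for the best affine predictor passing through the means.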
Now, choose A₀ to be Σ₁₁⁻¹ in Proposition 10.4. Then the mean squared error due to predicting X by ĈY + v̂₀, measured relative to ‖·‖, is

$$\phi(\Sigma) \equiv \langle \Sigma_{11}^{-1},\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\rangle = \mathscr{E}\|X - (\hat{C}Y + \hat{v}_0)\|^2.$$

Here, ‖·‖ is obtained from the inner product defined by

$$[v_1, v_2] = (v_1, \Sigma_{11}^{-1}v_2)_1.$$
We now claim that φ is invariant under the group of transformations discussed in Proposition 10.1, and thus φ is a possible measure of affine dependence between X and Y. To see this, first recall that ⟨·,·⟩ is just the trace inner product for linear transformations. Using properties of the trace, we have

$$\phi(\Sigma) = \langle I,\ I - \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}\rangle = \mathrm{tr}(I - \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}) = \sum_{i=1}^{q}(1 - \lambda_i)$$

where λ₁ ≥ ··· ≥ λ_q ≥ 0 are the eigenvalues of Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁. However, at most t = min{q, r} of these eigenvalues are nonzero and, by definition, ρᵢ = λᵢ^{1/2}, i = 1,..., t, are the canonical correlation coefficients. Thus

$$\phi(\Sigma) = \sum_{i=1}^{t}(1 - \rho_i^2) + (q - t)$$

is a function of ρ₁,..., ρₜ and hence is an invariant measure of affine
dependence. Since the constant q − t is irrelevant, it is customary to use

$$\phi_1(\Sigma) = \sum_{i=1}^{t}(1 - \rho_i^2)$$

rather than φ(Σ) as a measure of affine dependence.
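Computationally, φ₁(Σ) requires nothing beyond the eigenvalues of A(Σ); equivalently φ(Σ) = q − tr A(Σ), so φ₁(Σ) = φ(Σ) − (q − t). A sketch with an illustrative, assumed Σ:

```python
import numpy as np

Sigma = np.array([[1.0, 0.0, 0.6, 0.2],
                  [0.0, 1.0, 0.1, 0.3],
                  [0.6, 0.1, 1.0, 0.0],
                  [0.2, 0.3, 0.0, 1.0]])
q = r = t = 2
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

A = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
lam = np.sort(np.linalg.eigvals(A).real)[::-1]   # lam_i = rho_i^2

phi1 = float(np.sum(1.0 - lam[:t]))              # phi_1 = sum (1 - rho_i^2)
phi = q - float(np.trace(A))                     # phi   = tr(I - A(Sigma))
assert abs(phi1 - (phi - (q - t))) < 1e-9
print(phi1)
```

Small values of φ₁ indicate strong affine dependence (all ρᵢ near one); the maximum value t is attained when X and Y are uncorrelated.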
10.2. SAMPLE CANONICAL CORRELATIONS
To introduce the sample canonical correlation coefficients, again consider inner product spaces (V, (·,·)₁) and (W, (·,·)₂) and let (V ⊕ W, (·,·)) be the direct sum space with the natural inner product (·,·). The observations consist of n random vectors Zᵢ = {Xᵢ, Yᵢ} ∈ V ⊕ W, i = 1,..., n. It is assumed that these random vectors are uncorrelated with each other and that ℒ(Zᵢ) = ℒ(Zⱼ) for all i, j. Although these assumptions are not essential in much of what follows, it is difficult to interpret canonical correlations without them. Given Z₁,..., Zₙ, define the random vector Z by specifying that Z takes on the value Zᵢ with probability 1/n. Obviously, the distribution of Z is discrete in V ⊕ W and places mass 1/n at Zᵢ for i = 1,..., n. Unless otherwise specified, when we speak of the distribution of Z, we mean the conditional distribution of Z given Z₁,..., Zₙ as described above. Since the distribution of Z is nothing but the sample probability measure of Z₁,..., Zₙ, we can think of Z as a sample approximation to a random vector whose distribution is ℒ(Z₁). Now, write Z = {X, Y} with X ∈ V and Y ∈ W, so X is Xᵢ with probability 1/n and Y is Yᵢ with probability 1/n. Given Z₁,..., Zₙ, the mean vector of Z is

$$\bar{Z} = \frac{1}{n}\sum_{i=1}^{n} Z_i$$

and the covariance of Z is

$$\mathrm{Cov}\, Z = S \equiv \frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z}) \square (Z_i - \bar{Z}).$$

This last assertion follows from Proposition 2.21 by noting that

$$\mathrm{Cov}\, Z = \mathscr{E}(Z - \bar{Z}) \square (Z - \bar{Z})$$

since the mean of Z is Z̄. When V = R^q and W = R^r are the standard
coordinate spaces with the usual inner products, then S is just the sample covariance matrix. Since S is a linear transformation on V ⊕ W to V ⊕ W, S can be written as

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}.$$

It is routine to show that

$$S_{11} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}) \square (X_i - \bar{X})$$
$$S_{12} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}) \square (Y_i - \bar{Y})$$
$$S_{22} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y}) \square (Y_i - \bar{Y})$$

and S₂₁ = S₁₂′. The reader should note that the symbol □ appearing in the expressions for S₁₁, S₁₂, and S₂₂ has a different meaning in each of the three expressions; namely, the outer product depends on the inner products on the spaces in question. Since it is clear which vectors are in which spaces, this multiple use of □ should cause no confusion.
Now, to define the sample canonical correlation coefficients, the results of Section 10.1 are applied to the random vector Z. For this reason, we assume that S = Cov Z is nonsingular. With q = dim V, r = dim W, and t = min{q, r}, the canonical correlation coefficients are the square roots of the t largest eigenvalues of

$$A(S) = S_{11}^{-1}S_{12}S_{22}^{-1}S_{21}.$$

In the sampling situation under discussion, these roots are denoted by r₁ ≥ ··· ≥ rₜ ≥ 0 and are called the sample canonical correlation coefficients. The justification for such nomenclature is that r₁²,..., rₜ² are the t largest eigenvalues of A(S), where S is the sample covariance based on Z₁,..., Zₙ. Of course, all of the discussion of the previous section applies directly to the situation at hand. In particular, the vector (r₁,..., rₜ) is a maximal invariant under the group action described in Proposition 10.1. Also, r₁,..., rₜ are the cosines of the angles between the subspaces V ⊕ {0} and {0} ⊕ W in the vector space V ⊕ W relative to the inner product determined by S.
Now, let {v₁,..., v_q} and {w₁,..., w_r} be the canonical bases for V and W. Then we have

$$\mathrm{cov}\{(v_i, X)_1, (w_j, Y)_2\} = r_i\delta_{ij}$$

for i = 1,..., q and j = 1,..., r. The convention that rᵢ = 0 for i > t is being used. To interpret what this means in terms of the sample Z₁,..., Zₙ, consider r₁. For nonzero x ∈ V and y ∈ W, the maximum correlation between (x, X)₁ and (y, Y)₂ is r₁ and is achieved for x = v₁ and y = w₁. However, given Z₁,..., Zₙ, we have

$$\mathrm{var}(x, X)_1 = \mathrm{var}(\{x, 0\}, Z) = (\{x, 0\}, S\{x, 0\}) = (x, S_{11}x)_1 = \frac{1}{n}\sum_{i=1}^{n}(x, X_i - \bar{X})_1^2$$

and, similarly,

$$\mathrm{var}(y, Y)_2 = \frac{1}{n}\sum_{i=1}^{n}(y, Y_i - \bar{Y})_2^2.$$

An analogous calculation shows that

$$\mathrm{cov}\{(x, X)_1, (y, Y)_2\} = \frac{1}{n}\sum_{i=1}^{n}(x, X_i - \bar{X})_1(y, Y_i - \bar{Y})_2.$$

Thus var(x, X)₁ is just the sample variance of the random variables (x, Xᵢ)₁, i = 1,..., n, and var(y, Y)₂ is the sample variance of (y, Yᵢ)₂, i = 1,..., n. Also, cov{(x, X)₁, (y, Y)₂} is the sample covariance of the random variables (x, Xᵢ)₁, (y, Yᵢ)₂, i = 1,..., n. Therefore, the correlation between (x, X)₁ and (y, Y)₂ is the ordinary sample correlation coefficient for the random variables (x, Xᵢ)₁, (y, Yᵢ)₂, i = 1,..., n. This observation implies that the maximum possible sample correlation coefficient for (x, Xᵢ)₁, (y, Yᵢ)₂, i = 1,..., n, is the largest sample canonical correlation coefficient, r₁, and this maximum is attained by choosing x = v₁ and y = w₁. The interpretation of r₂,..., rₜ should now be fairly obvious. Given i, 2 ≤ i ≤ t, and given r₁,..., rᵢ₋₁, v₁,..., vᵢ₋₁, and w₁,..., wᵢ₋₁, consider the problem of maximizing the correlation between (x, X)₁ and (y, Y)₂ subject to the conditions

cov{(x, X)₁, (vⱼ, X)₁} = 0, j = 1,..., i − 1
cov{(y, Y)₂, (wⱼ, Y)₂} = 0, j = 1,..., i − 1.
These conditions are easily shown to be equivalent to the conditions that the sample correlation for

(x, Xₖ)₁, (vⱼ, Xₖ)₁, k = 1,..., n

be zero for j = 1,..., i − 1, with a similar statement concerning the Y's. Further, the correlation between (x, X)₁ and (y, Y)₂ is the sample correlation for (x, Xₖ)₁, (y, Yₖ)₂, k = 1,..., n. The maximum sample correlation is rᵢ and is attained by choosing x = vᵢ and y = wᵢ. Thus the sample interpretation of r₁,..., rₜ is completely analogous to the population interpretation of the population canonical correlation coefficients.
For the remainder of this section, it is assumed that V = R^q and W = R^r are the standard coordinate spaces with the usual inner products, so V ⊕ W is just R^p where p = q + r. Thus our sample is Z₁,..., Zₙ with Zᵢ ∈ R^p, and we write

$$Z_i = \begin{pmatrix} X_i \\ Y_i \end{pmatrix}$$

with Xᵢ ∈ R^q and Yᵢ ∈ R^r, i = 1,..., n. The sample covariance matrix, assumed to be nonsingular, is

$$S = \frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(Z_i - \bar{Z})' = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}$$

where

$$S_{11} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})'$$
$$S_{22} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})(Y_i - \bar{Y})'$$
$$S_{12} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})'$$

and S₂₁ = S₁₂'. Now, form the random matrix Z̃: n × p whose rows are (Zᵢ − Z̄)' and partition Z̃ into U: n × q and V: n × r so that

$$\tilde{Z} = (U\ \ V).$$
The rows of U are (Xᵢ − X̄)' and the rows of V are (Yᵢ − Ȳ)', i = 1,..., n. Obviously, we have nS = Z̃'Z̃, nS₁₁ = U'U, nS₂₂ = V'V, and nS₁₂ = U'V. The sample canonical correlation coefficients r₁ ≥ ··· ≥ rₜ are the square roots of the t largest eigenvalues of

$$A(S) = S_{11}^{-1}S_{12}S_{22}^{-1}S_{21} = (U'U)^{-1}U'V(V'V)^{-1}V'U.$$

However, the t largest eigenvalues of A(S) are the same as the t largest eigenvalues of P_XP_Y where

$$P_X = U(U'U)^{-1}U'$$

and

$$P_Y = V(V'V)^{-1}V'.$$

Now, P_X is the orthogonal projection onto the q-dimensional subspace of Rⁿ, say M_X, spanned by the columns of U. Also, P_Y is the orthogonal projection onto the r-dimensional subspace of Rⁿ, say M_Y, spanned by the columns of V. It follows from Proposition 1.48 and Definition 1.28 that the sample canonical correlation coefficients r₁,..., rₜ are the cosines of the angles between the two subspaces M_X and M_Y contained in Rⁿ. Summarizing, we have the following proposition.
Proposition 10.5. Given random vectors

$$Z_i = \begin{pmatrix} X_i \\ Y_i \end{pmatrix} \in R^p, \quad i = 1,..., n,$$

where Xᵢ ∈ R^q and Yᵢ ∈ R^r, form the matrices U: n × q and V: n × r as above. Let M_X ⊆ Rⁿ be the subspace spanned by the columns of U and let M_Y ⊆ Rⁿ be the subspace spanned by the columns of V. Assume that the sample covariance matrix

$$S = \frac{1}{n}\tilde{Z}'\tilde{Z}$$

is nonsingular. Then the sample canonical correlation coefficients are the cosines of the angles between M_X and M_Y.
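Proposition 10.5 is easy to verify numerically: the squared sample canonical correlations can be computed either as eigenvalues of A(S) or as squared cosines of the angles between M_X and M_Y (singular values of Q_x'Q_y for orthonormal bases Q_x, Q_y of the two column spaces). A sketch with simulated, illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, r = 50, 2, 3
X = rng.standard_normal((n, q))
Y = rng.standard_normal((n, r)) + X @ rng.standard_normal((q, r))

U = X - X.mean(axis=0)    # rows (X_i - Xbar)'
V = Y - Y.mean(axis=0)    # rows (Y_i - Ybar)'
t = min(q, r)

# Route 1: square roots of the t largest eigenvalues of
# A(S) = (U'U)^{-1} U'V (V'V)^{-1} V'U.
A = np.linalg.inv(U.T @ U) @ (U.T @ V) @ np.linalg.inv(V.T @ V) @ (V.T @ U)
r_eig = np.sqrt(np.sort(np.linalg.eigvals(A).real)[::-1][:t])

# Route 2: cosines of the angles between M_X and M_Y.
Qx, _ = np.linalg.qr(U)
Qy, _ = np.linalg.qr(V)
r_ang = np.linalg.svd(Qx.T @ Qy, compute_uv=False)[:t]

assert np.allclose(np.sort(r_eig), np.sort(r_ang))
print(r_ang)   # r_1 >= r_2, the sample canonical correlations
```

The second route is the numerically preferable one in practice, since it avoids forming and inverting U'U and V'V explicitly.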
The sample coefficients r₁,..., rₜ have been shown to be the cosines of angles between subspaces in two different vector spaces. In the first case, the interpretation followed from the material developed in Section 10.1 of this chapter: namely, r₁,..., rₜ are the cosines of the angles between R^q ⊕ {0} ⊆ R^p and {0} ⊕ R^r ⊆ R^p when R^p has the inner product determined by the sample covariance matrix. In the second case, described in Proposition 10.5, r₁,..., rₜ are the cosines of the angles between M_X and M_Y in Rⁿ when Rⁿ has the standard inner product. The subspace M_X is spanned by the columns of U, where U has rows (Xᵢ − X̄)', i = 1,..., n. Thus the coordinates of the jth column of U are Xᵢⱼ − X̄ⱼ for i = 1,..., n, where Xᵢⱼ is the jth coordinate of Xᵢ ∈ R^q and X̄ⱼ is the jth coordinate of X̄. This is the reason for the subscript X on the subspace M_X. Of course, similar remarks apply to M_Y.
The vector (r₁,..., rₜ) can also be interpreted as a maximal invariant under a group action on the sample matrix. Given

$$Z_i = \begin{pmatrix} X_i \\ Y_i \end{pmatrix} \in R^p, \quad i = 1,..., n,$$

let X: n × q have rows Xᵢ', i = 1,..., n, and let Y: n × r have rows Yᵢ', i = 1,..., n. Then the data matrix of the whole sample is

$$Z = (X\ \ Y): n \times p,$$

which has rows Zᵢ', i = 1,..., n. Let e ∈ Rⁿ be the vector of all ones. It is assumed that Z ∈ 𝒵 ⊆ ℰ_{p,n}, where 𝒵 is the set of all n × p matrices such that the sample covariance mapping

$$s(Z) = \frac{1}{n}(Z - e\bar{Z}')'(Z - e\bar{Z}')$$

has rank p. Assuming that n ≥ p + 1, the complement of 𝒵 in ℰ_{p,n} has Lebesgue measure zero. To describe the group action on Z, let G be the set of elements g = (Γ, c, C) where

$$\Gamma \in \mathscr{O}(e) = \{\Gamma \mid \Gamma \in \mathscr{O}_n,\ \Gamma e = e\}, \quad c \in R^p,$$

and

$$C = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}, \quad A \in Gl_q,\ B \in Gl_r.$$

For g = (Γ, c, C), the value of g at Z is

$$gZ = \Gamma ZC' + ec'.$$
Since

$$s(gZ) = Cs(Z)C',$$

it follows that each g ∈ G is a one-to-one onto mapping of 𝒵 to 𝒵. The composition in G, defined so G acts on the left of 𝒵, is

$$(\Gamma_1, c_1, C_1)(\Gamma_2, c_2, C_2) = (\Gamma_1\Gamma_2,\ c_1 + C_1c_2,\ C_1C_2).$$
Proposition 10.6. Under the action of G on 𝒵, a maximal invariant is the vector of sample canonical correlation coefficients (r₁,..., rₜ) where t = min{q, r}.

Proof. Let 𝒮_p⁺ be the space of p × p positive definite matrices, so the sample covariance mapping s: 𝒵 → 𝒮_p⁺ is onto. Given S ∈ 𝒮_p⁺, partition S as

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}$$

where S₁₁ is q × q, S₂₂ is r × r, and S₁₂ is q × r. Define h on 𝒮_p⁺ by letting h(S) be the vector (λ₁,..., λₜ)' of the t largest eigenvalues of

$$A(S) = S_{11}^{-1}S_{12}S_{22}^{-1}S_{21}.$$

Since rᵢ = λᵢ^{1/2}, i = 1,..., t, the proposition will be proved if it is shown that

$$\varphi(Z) = h(s(Z))$$

is a maximal invariant function. This follows since h(s(Z)) = (λ₁,..., λₜ)', which is a one-to-one function of (r₁,..., rₜ). The proof that φ is maximal invariant proceeds as follows. Consider the two subgroups G₁ and G₂ of G defined by

$$G_1 = \{g \mid g = (\Gamma, c, I_p) \in G\}$$

and

$$G_2 = \{g \mid g = (I_n, 0, C) \in G\}.$$

Note that G₂ acts on the space 𝒮_p⁺ in the obvious way; namely, if g₂ = (Iₙ, 0, C), then

$$g_2(S) = CSC', \quad S \in \mathscr{S}_p^+.$$
Further, since

$$(\Gamma, c, C) = (\Gamma, c, I_p)(I_n, 0, C),$$

it follows that each g ∈ G can be written as g = g₁g₂ where gᵢ ∈ Gᵢ, i = 1, 2. Now, we make two claims:

(i) s: 𝒵 → 𝒮_p⁺ is a maximal invariant under the action of G₁ on 𝒵.
(ii) h: 𝒮_p⁺ → Rᵗ is a maximal invariant under the action of G₂ on 𝒮_p⁺.

Assuming (i) and (ii), we now show that φ(Z) = h(s(Z)) is maximal invariant. For g ∈ G, write g = g₁g₂ with gᵢ ∈ Gᵢ, i = 1, 2. Since

$$s(g_1Z) = s(Z), \quad g_1 \in G_1$$

and

$$s(g_2Z) = g_2(s(Z)), \quad g_2 \in G_2,$$

we have

$$\varphi(gZ) = h(s(g_1g_2Z)) = h(s(g_2Z)) = h(g_2s(Z)) = h(s(Z)) = \varphi(Z).$$

It follows that φ is invariant. To show that φ is maximal invariant, assume φ(Z₁) = φ(Z₂). A g ∈ G must be found so that gZ₁ = Z₂. Since h is maximal invariant under G₂ and

$$h(s(Z_1)) = h(s(Z_2)),$$

there is a g₂ ∈ G₂ such that

$$g_2(s(Z_1)) = s(Z_2).$$

However,

$$g_2(s(Z_1)) = s(g_2Z_1)$$

and s is maximal invariant under G₁, so there exists a g₁ ∈ G₁ such that

$$g_1(g_2Z_1) = Z_2.$$

This completes the proof that φ, and hence (r₁,..., rₜ), is a maximal invariant
assuming claims (i) and (ii). The proof that s: 𝒵 → 𝒮_p⁺ is a maximal invariant is an easy application of Proposition 1.20 and is left to the reader. That h: 𝒮_p⁺ → Rᵗ is maximal invariant follows from an argument similar to that given in the proof of Proposition 10.1. □
The group action on Z treated in Proposition 10.6 is suggested by the following considerations. Assuming that the observations Z₁,..., Zₙ in R^p are uncorrelated random vectors and ℒ(Zᵢ) = ℒ(Z₁) for i = 1,..., n, it follows that

$$\mathscr{E}Z = e\mu'$$

and

$$\mathrm{Cov}\, Z = I_n \otimes \Sigma$$

where μ = ℰZ₁ and Cov Z₁ = Σ. When Z is transformed by g = (Γ, c, C), we have

$$\mathscr{E}gZ = e(C\mu + c)'$$

and

$$\mathrm{Cov}\, gZ = I_n \otimes (C\Sigma C').$$

Thus the induced action of g on (μ, Σ) is exactly the group action considered in Proposition 10.1. The special structure of ℰZ and Cov Z is reflected by the fact that, for g = (Γ, 0, I_p), we have ℰgZ = ℰZ and Cov gZ = Cov Z.
10.3. SOME DISTRIBUTION THEORY
The distribution theory associated with the sample canonical correlation coefficients is, to say the least, rather complicated. Most of the results in this section are derived under the assumption of normality and the assumption that the population canonical correlations are zero. However, the distribution of the sample multiple correlation coefficient is given in the general case of a nonzero population multiple correlation coefficient.

Our first result is a generalization of Example 7.12. Let Z₁,..., Zₙ be a random sample of vectors in R^p and partition Zᵢ as

$$Z_i = \begin{pmatrix} X_i \\ Y_i \end{pmatrix}, \quad X_i \in R^q,\ Y_i \in R^r.$$
Assume that Z₁ has a density on R^p given by

$$p(z \mid \mu, \Sigma) = |\Sigma|^{-1/2} f((z - \mu)'\Sigma^{-1}(z - \mu))$$

where f has been normalized so that

$$\int zz'f(z'z)\,dz = I_p.$$

Thus when the density of Z₁ is p(·|μ, Σ), then

$$\mathscr{E}Z_1 = \mu, \quad \mathrm{Cov}\, Z_1 = \Sigma.$$

Assuming that n ≥ p + 1, the sample covariance matrix

$$S = \frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(Z_i - \bar{Z})' = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}$$

is positive definite with probability one. Here S₁₁ is q × q, S₂₂ is r × r, and S₁₂ is q × r. Partitioning Σ as S is partitioned, we have

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Thus the squared sample coefficients, r₁² ≥ ··· ≥ rₜ², are the t largest eigenvalues of S₁₁⁻¹S₁₂S₂₂⁻¹S₂₁, and the squared population coefficients, ρ₁² ≥ ··· ≥ ρₜ², are the t largest eigenvalues of Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁. In the present generality, an invariance argument is given to show that the joint distribution of (r₁,..., rₜ) depends on (μ, Σ) only through (ρ₁,..., ρₜ). Consider the group G whose elements are g = (C, c) where c ∈ R^p and

$$C = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}, \quad A \in Gl_q,\ B \in Gl_r.$$

The action of G on R^p is

$$(C, c)z = Cz + c$$

and group composition is

$$(C_1, c_1)(C_2, c_2) = (C_1C_2,\ C_1c_2 + c_1).$$
The group action on the sample is

$$g(z_1,..., z_n) = (gz_1,..., gz_n).$$

With the induced group action on (μ, Σ) given by

$$g(\mu, \Sigma) = (C\mu + c,\ C\Sigma C')$$

where g = (C, c), it is clear that the family of distributions of (Z₁,..., Zₙ) indexed by elements of

$$\Theta = \{(\mu, \Sigma) \mid \mu \in R^p,\ \Sigma \in \mathscr{S}_p^+\}$$

is a G-invariant family of probability measures.

Proposition 10.7. The joint distribution of (r₁,..., rₜ) depends on (μ, Σ) only through (ρ₁,..., ρₜ).

Proof. From Proposition 10.6, we know that (r₁,..., rₜ) is a G-invariant function of (Z₁,..., Zₙ). Thus the distribution of (r₁,..., rₜ) will depend on the parameter θ = (μ, Σ) only through a maximal invariant in the parameter space. However, Proposition 10.1 shows that (ρ₁,..., ρₜ) is a maximal invariant under the action of G on Θ. □
Before discussing the distribution of canonical correlation coefficients, even for t = 1, it is instructive to consider the bivariate correlation coefficient. Consider pairs of random variables (Xᵢ, Yᵢ), i = 1,..., n, and let X ∈ Rⁿ and Y ∈ Rⁿ have coordinates Xᵢ and Yᵢ, i = 1,..., n. With e ∈ Rⁿ being the vector of ones, P_e = ee'/n, and Q_e = I − P_e, the sample correlation coefficient is defined by

$$r = \left(\frac{Q_eY}{\|Q_eY\|}\right)'\frac{Q_eX}{\|Q_eX\|}.$$

The next result describes the distribution of r when (Xᵢ, Yᵢ), i = 1,..., n, is a random sample from a bivariate normal distribution.
Proposition 10.8. Suppose (Xᵢ, Yᵢ)' ∈ R², i = 1,..., n, are independent random vectors with

$$\mathcal{L}\{(X_i, Y_i)'\} = N(\mu, \Sigma)$$

where μ ∈ R² and

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}$$

is positive definite. Consider random variables (U₁, U₂, U₃) with:

(i) (U₁, U₂) independent of U₃.
(ii) ℒ(U₃) = χ²_{n−2}.
(iii) ℒ(U₂) = χ²_{n−1}.
(iv) ℒ(U₁ | U₂) = N(ρ(1 − ρ²)^{−1/2}U₂^{1/2}, 1),

where ρ = σ₁₂/(σ₁₁σ₂₂)^{1/2} is the correlation coefficient. Then we have

$$\mathcal{L}\!\left(\frac{r}{\sqrt{1 - r^2}}\right) = \mathcal{L}\!\left(\frac{U_1}{\sqrt{U_3}}\right).$$
Proof. The assumption of independence and normality implies that the matrix (X Y) ∈ ℰ_{2,n} has a distribution given by

$$\mathcal{L}((X\ Y)) = N(e\mu', I_n \otimes \Sigma).$$

It follows from Proposition 10.7 that we may assume, without loss of generality, that Σ has the form

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

When Σ has this form, the conditional distribution of X given Y is

$$\mathcal{L}(X \mid Y) = N((\mu_1 - \rho\mu_2)e + \rho Y,\ (1 - \rho^2)I_n)$$

so

$$\mathcal{L}(Q_eX \mid Y) = N(\rho Q_eY,\ (1 - \rho^2)Q_e).$$

Now, let v₁,..., vₙ be an orthonormal basis for Rⁿ with v₁ = e/√n and

$$v_2 = \frac{Q_eY}{\|Q_eY\|}.$$
Expressing Q_eX in this basis leads to

$$Q_eX = \sum_{i=2}^{n}(v_i'Q_eX)v_i$$

since Q_ee = 0. Setting

$$\xi_i = \frac{v_i'Q_eX}{\sqrt{1 - \rho^2}}, \quad i = 2,..., n,$$

it is easily seen that, conditional on Y, the variables ξ₂,..., ξₙ are independent with

$$\mathcal{L}(\xi_2 \mid Y) = N(\rho(1 - \rho^2)^{-1/2}\|Q_eY\|,\ 1)$$

and

$$\mathcal{L}(\xi_i \mid Y) = N(0, 1), \quad i = 3,..., n.$$

Since

$$\|Q_eX\|^2 = \sum_{i=2}^{n}(v_i'Q_eX)^2 = (1 - \rho^2)\sum_{i=2}^{n}\xi_i^2,$$

the identity

$$r^2 = \frac{\xi_2^2}{\xi_2^2 + \sum_{i=3}^{n}\xi_i^2}$$

holds. This leads to

$$\frac{r}{\sqrt{1 - r^2}} = \frac{\xi_2}{\left(\sum_{i=3}^{n}\xi_i^2\right)^{1/2}}.$$

Setting U₁ = ξ₂, U₂ = ‖Q_eY‖², and U₃ = Σᵢ₌₃ⁿ ξᵢ² yields the assertion of the proposition. □
The result of this proposition has a couple of interesting consequences. When ρ = 0, the statistic

$$W = \sqrt{n - 2}\,\frac{r}{\sqrt{1 - r^2}} = \sqrt{n - 2}\,\frac{U_1}{\sqrt{U_3}}$$

has a Student's t distribution with n − 2 degrees of freedom. In the general case, the distribution of W can be described by saying that, conditional on U₂, W has a noncentral t distribution with n − 2 degrees of freedom and noncentrality parameter

$$\delta = \frac{\rho}{\sqrt{1 - \rho^2}}\,U_2^{1/2}$$

where ℒ(U₂) = χ²_{n−1}. Let p_m(·|δ) denote the density function of a noncentral t distribution with m degrees of freedom and noncentrality parameter δ. The results in the Appendix show that p_m(·|δ) has a monotone likelihood ratio. It is clear that the density of W is

$$h(w \mid \rho) = \int_0^\infty p_m\!\left(w \mid \rho(1 - \rho^2)^{-1/2}u^{1/2}\right)f(u)\,du$$

where f is the density of U₂ and m = n − 2. From this representation and the results in the Appendix, it is not difficult to show that h(·|ρ) has a monotone likelihood ratio. The details of this are left to the reader.

In the case that the two random vectors X and Y in Rⁿ are independent, the conditions under which W has a t_{n−2} distribution can be considerably weakened.
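As a purely data-analytic matter, r and W are immediate to compute; the distributional statements above then calibrate W. A sketch with small, assumed toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
n = len(x)

# Q_e v = v - vbar removes the mean; r is the cosine of the angle
# between the two centered vectors.
qx = x - x.mean()
qy = y - y.mean()
r = (qy / np.linalg.norm(qy)) @ (qx / np.linalg.norm(qx))

# Under rho = 0 and normality, W has a t distribution with
# n - 2 degrees of freedom.
W = np.sqrt(n - 2) * r / np.sqrt(1 - r * r)
print(r, W)
```

Note that r computed this way agrees with the usual sample correlation coefficient, since centering followed by normalization is exactly the Q_e construction in the text.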
Proposition 10.9. Suppose X and Y in Rⁿ are independent and both ‖Q_eX‖ and ‖Q_eY‖ are positive with probability one. Also assume that, for some number μ₁ ∈ R, the distribution of X − μ₁e is orthogonally invariant. Under these assumptions, the distribution of

$$W = \sqrt{n - 2}\,\frac{r}{\sqrt{1 - r^2}},$$

where

$$r = \left(\frac{Q_eY}{\|Q_eY\|}\right)'\frac{Q_eX}{\|Q_eX\|},$$

is a t_{n−2} distribution.

Proof. The two random vectors Q_eX and Q_eY take values in the (n − 1)-dimensional subspace

$$M = \{x \mid x \in R^n,\ x'e = 0\}.$$
Fix Y so the vector

$$\frac{Q_eY}{\|Q_eY\|} \in M$$

has length one. Since the distribution of X − μ₁e is 𝒪ₙ-invariant, it follows that the distribution of Q_eX is invariant under the group

$$G = \{\Gamma \mid \Gamma \in \mathscr{O}_n,\ \Gamma e = e\},$$

which acts on M. Therefore, the distribution of Q_eX/‖Q_eX‖ is G-invariant on the set

$$\mathscr{X} = \{x \mid x \in M,\ \|x\| = 1\}.$$

But G is compact and acts transitively on 𝒳, so there is a unique G-invariant distribution for Q_eX/‖Q_eX‖ on 𝒳. From this uniqueness it follows that

$$\mathcal{L}\!\left(\frac{Q_eX}{\|Q_eX\|}\right) = \mathcal{L}\!\left(\frac{Q_eZ}{\|Q_eZ\|}\right)$$

where ℒ(Z) = N(0, Iₙ) on Rⁿ. Therefore, we have

$$\mathcal{L}(r) = \mathcal{L}\!\left(\left(\frac{Q_eY}{\|Q_eY\|}\right)'\frac{Q_eZ}{\|Q_eZ\|}\right)$$

and, for each Y, the claimed result follows from the argument given to prove Proposition 10.8. □
We now turn to the canonical correlation coefficients in the special case that t = 1. Consider random vectors X_i and Y_i with X_i ∈ R¹ and Y_i ∈ R^r, i = 1, …, n. Let X ∈ R^n have coordinates X₁, …, X_n and let Y ∈ ℒ_{r,n} have rows Y₁′, …, Y_n′. Assume that Q_e Y has rank r so

P = Q_e Y[(Q_e Y)′Q_e Y]^{−1}(Q_e Y)′

is the orthogonal projection onto the subspace spanned by the columns of Q_e Y. Since t = 1, the canonical correlation coefficient is the square root of the largest, and only nonzero, eigenvalue of

(Q_e X)(Q_e X)′P / ‖Q_e X‖²,
which is

r₁² = (Q_e X)′P(Q_e X) / ‖Q_e X‖² = ‖PQ_e X‖² / ‖Q_e X‖².

For the case at hand, r₁ is commonly called the multiple correlation coefficient. The distribution of r₁² is described next under the assumption of normality.
Proposition 10.10. Assume that the distribution of (X Y) ∈ ℒ_{r+1,n} is given by

ℒ((X Y)) = N(eμ′, I_n ⊗ Σ)

and partition Σ as

Σ = ( σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂ ),  (r + 1) × (r + 1),

where σ₁₁ > 0, Σ₁₂ is 1 × r, and Σ₂₂ is r × r. Consider random variables U₁, U₂, and U₃ whose joint distribution is specified by:

(i) (U₁, U₂) and U₃ are independent.
(ii) ℒ(U₃) = χ²_{n−r−1}.
(iii) ℒ(U₂) = χ²_{n−1}.
(iv) ℒ(U₁|U₂) = χ²_r(λ), where λ = ρ²(1 − ρ²)^{−1}U₂.

Here ρ = (Σ₁₂Σ₂₂^{−1}Σ₂₁/σ₁₁)^{1/2} is the population multiple correlation coefficient. Then we have

ℒ(r₁²/(1 − r₁²)) = ℒ(U₁/U₃).
Proof. Combining the results of Proposition 10.1 and Proposition 5.7, without loss of generality Σ can be assumed to have the form

Σ = ( 1  ρε₁′ ; ρε₁  I_r ),

where ε₁ ∈ R^r and ε₁ = (1, 0, …, 0)′. When Σ has this form, the conditional
distribution of X given Y is

ℒ(X|Y) = N((μ₁ − ρμ₂′ε₁)e + ρYε₁, (1 − ρ²)I_n),

where ℰX = μ₁e and ℰY = eμ₂′. Since Q_e e = 0, we have
ℒ(Q_e X|Y) = N(ρQ_e Yε₁, (1 − ρ²)Q_e).

The subspace spanned by the columns of Q_e Y is contained in the range of Q_e and this implies that Q_e P = PQ_e = P. So

‖Q_e X‖² = ‖(Q_e − P)Q_e X‖² + ‖PQ_e X‖².
Since

r₁² = ‖PQ_e X‖² / ‖Q_e X‖²,

it follows that

r₁² / (1 − r₁²) = ‖PQ_e X‖² / ‖(Q_e − P)Q_e X‖².
Given Y, the conditional covariance of Q_e X is (1 − ρ²)Q_e and, therefore, the identity PQ_e(Q_e − P) = 0 implies that PQ_e X and (Q_e − P)Q_e X are conditionally independent. It is clear that

ℒ((Q_e − P)Q_e X|Y) = N(0, (1 − ρ²)(Q_e − P)),

so we have

ℒ(‖(Q_e − P)Q_e X‖² | Y) = (1 − ρ²)χ²_{n−r−1}

since Q_e − P is an orthogonal projection of rank n − r − 1. Again, conditioning on Y,

ℒ(PQ_e X|Y) = N(ρQ_e Yε₁, (1 − ρ²)P)

since PQ_e = P and Q_e Yε₁ is in the range of P. It follows from Proposition 3.8 that

ℒ(‖PQ_e X‖² | Y) = (1 − ρ²)χ²_r(λ),
where the noncentrality parameter λ is given by

λ = ρ²(1 − ρ²)^{−1} ε₁′Y′Q_e Yε₁.
That U₂ = ε₁′Y′Q_e Yε₁ has a χ²_{n−1} distribution is clear. Defining U₁ and U₃ by

U₁ = (1 − ρ²)^{−1}‖PQ_e X‖²

and

U₃ = (1 − ρ²)^{−1}‖(Q_e − P)Q_e X‖²,

the identity

r₁²/(1 − r₁²) = U₁/U₃

holds. That U₃ is independent of (U₁, U₂) follows by conditioning on Y. Since

ℒ(U₁|Y) = χ²_r(λ)

where

λ = ρ²(1 − ρ²)^{−1} ε₁′Y′Q_e Yε₁ = ρ²(1 − ρ²)^{−1} U₂,

the conditional distribution of U₁ given Y is the same as the conditional distribution of U₁ given U₂. This completes the proof. □
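A quick Monte Carlo check of Proposition 10.10 (illustrative parameters only; numpy/scipy are assumed): simulate r₁²/(1 − r₁²) directly from normal data with Σ of the form used in the proof, and compare it with the (U₁, U₂, U₃) construction via a two-sample Kolmogorov–Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, r, rho, reps = 15, 3, 0.6, 4000
e = np.ones((n, 1))
Q = np.eye(n) - e @ e.T / n                     # Q_e

lhs = np.empty(reps)
for i in range(reps):
    # Sigma = (1, rho*eps1'; rho*eps1, I_r): x = rho*Y[:,0] + sqrt(1-rho^2)*z
    Y = rng.standard_normal((n, r))
    x = rho * Y[:, 0] + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
    QY, qx = Q @ Y, Q @ x
    P = QY @ np.linalg.solve(QY.T @ QY, QY.T)   # projection onto col(Q_e Y)
    r2 = qx @ P @ qx / (qx @ qx)                # squared multiple correlation
    lhs[i] = r2 / (1 - r2)

# Right-hand side: U1/U3 with (i)-(iv) of Proposition 10.10.
U2 = stats.chi2.rvs(n - 1, size=reps, random_state=rng)
U1 = stats.ncx2.rvs(r, rho ** 2 / (1 - rho ** 2) * U2, random_state=rng)
U3 = stats.chi2.rvs(n - r - 1, size=reps, random_state=rng)
print(stats.ks_2samp(lhs, U1 / U3).pvalue)
```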
When ρ = 0, Proposition 10.10 shows that

ℒ(r₁²/(1 − r₁²)) = ℒ(χ²_r / χ²_{n−r−1}) = F(r, n − r − 1),

which is the unnormalized F-distribution on (0, ∞). More generally,

ℒ(r₁²/(1 − r₁²)) = F(r, n − r − 1; λ),
where

λ = ρ²(1 − ρ²)^{−1} χ²_{n−1}

is random. This means that, conditioning on λ = δ,

ℒ(r₁²/(1 − r₁²) | δ) = F(r, n − r − 1; δ).

Let f(·|δ) denote the density function of an F(r, n − r − 1; δ) distribution, and let h(·) be the density of a χ²_{n−1} distribution. Then the density of r₁²/(1 − r₁²) is

k(w|ρ) = ∫₀^∞ f(w | ρ²(1 − ρ²)^{−1}u) h(u) du.
From this representation, it can be shown, using the results in the Appendix, that k(w|ρ) has a monotone likelihood ratio.

The final exact distributional result of this section concerns the function of the sample canonical correlations given by

W = ∏_{i=1}^t (1 − r_i²)

when the random sample (X_i′, Y_i′)′, i = 1, …, n, is from a normal distribution and the population coefficients are all zero. This statistic arises in testing for independence, which is discussed in detail in the next section. To be precise, it is assumed that the random sample

Z_i = ( X_i ; Y_i ) ∈ R^p,  i = 1, …, n,

satisfies

ℒ(Z_i) = N(μ, Σ).

As usual, X_i ∈ R^q, Y_i ∈ R^r, and the sample covariance matrix

S = Σ_{i=1}^n (Z_i − Z̄)(Z_i − Z̄)′
is partitioned as

S = ( S₁₁  S₁₂ ; S₂₁  S₂₂ ),

where S₁₁ is q × q and S₂₂ is r × r. Under the assumptions made thus far, S has a Wishart distribution, namely,

ℒ(S) = W(Σ, p, n − 1).

Partitioning Σ, we have

Σ = ( Σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂ ),

and the population canonical correlation coefficients, say ρ₁ ≥ ⋯ ≥ ρ_t, are all zero iff Σ₁₂ = 0.
Proposition 10.11. Assume n − 1 > p and let r₁ ≥ ⋯ ≥ r_t be the sample canonical correlations. When Σ₁₂ = 0, then

ℒ(∏_{i=1}^t (1 − r_i²)) = U(n − r − 1, r, q),

where the distribution U(n − r − 1, r, q) is described in Proposition 8.14.

Proof. Since r₁², …, r_t² are the t largest eigenvalues of

A(S) = S₁₁^{−1}S₁₂S₂₂^{−1}S₂₁

and the remaining q − t eigenvalues of A(S) are zero, it follows that

W = ∏_{i=1}^t (1 − r_i²) = |I_q − S₁₁^{−1}S₁₂S₂₂^{−1}S₂₁|.

Since W is a function of the sample canonical correlations and Σ₁₂ = 0, Proposition 10.1 implies that we can take

Σ = ( I_q  0 ; 0  I_r )

without loss of generality to find the distribution of W. Using properties of
determinants, we have

W = |S₁₁|^{−1}|S₁₁ − S₁₂S₂₂^{−1}S₂₁| = |S₁₁.₂| / |S₁₁.₂ + S₁₂S₂₂^{−1}S₂₁|.

Proposition 8.7 implies that

ℒ(S₁₁.₂) = W(I_q, q, n − r − 1),

ℒ(S₂₂^{−1/2}S₂₁ | S₂₂) = N(0, I_r ⊗ I_q),

and S₁₁.₂ and S₁₂S₂₂^{−1}S₂₁ are independent. Therefore,

ℒ(S₁₂S₂₂^{−1}S₂₁) = W(I_q, q, r),

and by definition, it follows that

ℒ(W) = U(n − r − 1, r, q). □
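Proposition 10.11 can be checked by simulation. The sketch below assumes, beyond numpy/scipy, the standard product-of-independent-betas form of U(n − r − 1, r, q), namely ∏_{i=1}^q Beta((n − r − i)/2, r/2); Proposition 8.14 itself is not reproduced in this section, so that representation is an assumption of the example, not a quotation from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, q, r, reps = 20, 2, 3, 4000
p = q + r

w = np.empty(reps)
for i in range(reps):
    Z = rng.standard_normal((n, p))        # Sigma = I_p, so Sigma_12 = 0
    Zc = Z - Z.mean(axis=0)
    S = Zc.T @ Zc
    S11, S22 = S[:q, :q], S[q:, q:]
    # W = prod(1 - r_i^2) = |S| / (|S11| |S22|)
    w[i] = np.linalg.det(S) / (np.linalg.det(S11) * np.linalg.det(S22))

# Assumed product-of-betas form of U(n-r-1, r, q).
betas = np.prod(
    stats.beta.rvs((n - r - np.arange(1, q + 1)) / 2, r / 2,
                   size=(reps, q), random_state=rng),
    axis=1)
print(stats.ks_2samp(w, betas).pvalue)
```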
Since

W = |S₂₂.₁| / |S₂₂.₁ + S₂₁S₁₁^{−1}S₁₂|,

the proof of Proposition 10.11 shows that ℒ(W) = U(n − q − 1, q, r), so

U(n − q − 1, q, r) = U(n − r − 1, r, q)

as long as n − 1 > q + r. Using the ideas in the proof of Proposition 8.15, the distribution of W can be derived when Σ₁₂ has rank one, that is, when one population canonical correlation is positive and the rest are zero. The details of this are left to the reader.

We close this section with a discussion that provides some qualitative information about the distribution of r₁ ≥ ⋯ ≥ r_t when the data matrices X ∈ ℒ_{q,n} and Y ∈ ℒ_{r,n} are independent. As usual, let P_X and P_Y denote the orthogonal projections onto the column spaces of Q_e X and Q_e Y. Then the squared sample canonical correlations are the t largest eigenvalues of P_Y P_X, say

φ(P_Y P_X) ∈ R^t.

It is assumed that Q_e X has rank q and Q_e Y has rank r. Since the distribution of φ(P_Y P_X) is of interest, it is reasonable to investigate the
distributional properties of the two random projections P_X and P_Y. Since X and Y are assumed to be independent, it suffices to focus our attention on P_X. First note that P_X is an orthogonal projection onto a q-dimensional subspace contained in

M = {x | x ∈ R^n, x′e = 0}.

Therefore, P_X is an element of

𝒫_{q,n}(e) = {P | P is an n × n rank q orthogonal projection, Pe = 0}.
Furthermore, the space 𝒫_{q,n}(e) is a compact subset of R^{n²} and is acted on by the compact group

O_n(e) = {Γ | Γ ∈ O_n, Γe = e},

with the group action given by P → ΓPΓ′. Since O_n(e) acts transitively on 𝒫_{q,n}(e), there is a unique O_n(e)-invariant probability distribution on 𝒫_{q,n}(e). This is called the uniform distribution on 𝒫_{q,n}(e).

Proposition 10.12. If ℒ(X) = ℒ(ΓX) for Γ ∈ O_n(e), then P_X has a uniform distribution on 𝒫_{q,n}(e).
Proof. It is readily verified that

P_{ΓX} = ΓP_XΓ′,  Γ ∈ O_n(e).

Therefore, if ℒ(ΓX) = ℒ(X), then

ℒ(P_X) = ℒ(ΓP_XΓ′),

which implies that the distribution ℒ(P_X) on 𝒫_{q,n}(e) is O_n(e)-invariant. The uniqueness of the uniform distribution on 𝒫_{q,n}(e) yields the result. □
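The key identity in this proof, P_{ΓX} = ΓP_XΓ′ for Γ ∈ O_n(e), is easy to verify numerically. The numpy sketch below constructs one convenient random Γ fixing e (by rotating only the orthogonal complement of e); the construction is illustrative, not unique.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(4)
n, q = 7, 3
e = np.ones((n, 1))
Q = np.eye(n) - e @ e.T / n              # Q_e

def proj(M):
    # Orthogonal projection onto the column space of M.
    return M @ np.linalg.solve(M.T @ M, M.T)

# Gamma in O_n(e): an orthonormal basis whose first vector spans e, with a
# random rotation acting on the remaining n-1 directions.
B, _ = np.linalg.qr(np.hstack([e, rng.standard_normal((n, n - 1))]))
R, _ = np.linalg.qr(rng.standard_normal((n - 1, n - 1)))
Gam = B @ block_diag(1.0, R) @ B.T

X = rng.standard_normal((n, q))
PX = proj(Q @ X)                          # P_X: projection onto col(Q_e X)
PGX = proj(Q @ (Gam @ X))                 # P_{Gamma X}

print(np.allclose(PGX, Gam @ PX @ Gam.T), np.allclose(Gam @ e, e))
```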
When ℒ(X) = N(eμ′, I_n ⊗ Σ), then

ℒ(X) = ℒ(ΓX)

for Γ ∈ O_n(e), so Proposition 10.12 applies to this case. For any two n × n positive semidefinite matrices B₁ and B₂, define the function φ(B₁B₂) to be the vector of the t largest eigenvalues of B₁B₂. In particular, φ(P_Y P_X) is the vector of squared sample canonical correlations.
Proposition 10.13. Assume X and Y are independent, ℒ(ΓX) = ℒ(X) for Γ ∈ O_n(e), Q_e X has rank q, and Q_e Y has rank r. Then

ℒ(φ(P_Y P_X)) = ℒ(φ(P₀P_X)),

where P₀ is any fixed rank r projection in 𝒫_{r,n}(e).
Proof. First note that

φ(P_Y ΓP_XΓ′) = φ(Γ′P_Y ΓP_X),

since the eigenvalues of P_Y ΓP_XΓ′ are the same as the eigenvalues of Γ′P_Y ΓP_X. From Proposition 10.12, we have

ℒ(P_X) = ℒ(ΓP_XΓ′),  Γ ∈ O_n(e).

Conditioning on Y, the independence of X and Y implies that

ℒ(φ(P_Y P_X)|Y) = ℒ(φ(P_Y ΓP_XΓ′)|Y) = ℒ(φ(Γ′P_Y ΓP_X)|Y)

for all Γ ∈ O_n(e). The group O_n(e) acts transitively on 𝒫_{r,n}(e), so for Y fixed, there exists a Γ ∈ O_n(e) such that Γ′P_Y Γ = P₀. Therefore, the equation

ℒ(φ(P_Y P_X)|Y) = ℒ(φ(P₀P_X)|Y) = ℒ(φ(P₀P_X))

holds for each Y since X and Y are independent. Averaging ℒ(φ(P_Y P_X)|Y) over Y yields ℒ(φ(P_Y P_X)), which must then equal ℒ(φ(P₀P_X)). This completes the proof. □
The preceding result shows that ℒ(φ(P_Y P_X)) does not depend on the distribution of Y as long as X and Y are independent and ℒ(X) = ℒ(ΓX) for Γ ∈ O_n(e). In this case, the distribution of φ(P_Y P_X) can be derived under the assumption that ℒ(X) = N(0, I_n ⊗ I_q) and ℒ(Y) = N(0, I_n ⊗ I_r). Suppose that q ≤ r so t = q. Then ℒ(φ(P_Y P_X)) is the distribution of λ₁ ≥ ⋯ ≥ λ_q, where λ_i = r_i², i = 1, …, q, are the eigenvalues of S₁₁^{−1}S₁₂S₂₂^{−1}S₂₁, and

S = ( S₁₁  S₁₂ ; S₂₁  S₂₂ )

is the sample covariance matrix. To find the distribution of r₁, …, r_q, it
would obviously suffice to find the distribution of γ_i = 1 − λ_i, i = 1, …, q, which are the eigenvalues of

I − S₁₁^{−1}S₁₂S₂₂^{−1}S₂₁ = (T₁ + T₂)^{−1}T₁,

where

T₁ = S₁₁ − S₁₂S₂₂^{−1}S₂₁;  T₂ = S₁₂S₂₂^{−1}S₂₁.

It was shown in the proof of Proposition 10.11 that T₁ and T₂ are independent and

ℒ(T₁) = W(I_q, q, n − r − 1)

and

ℒ(T₂) = W(I_q, q, r).

Since the matrix

B = (T₁ + T₂)^{−1/2}T₁(T₁ + T₂)^{−1/2}

has the same eigenvalues as (T₁ + T₂)^{−1}T₁, it suffices to find the distribution of the eigenvalues of B. It is not too difficult to show (see the Problems at the end of this chapter) that the density of B is

p(B) = [ω(n − r − 1, q)ω(r, q) / ω(n − 1, q)] |B|^{(n−r−q−2)/2} |I_q − B|^{(r−q−1)/2}

with respect to Lebesgue measure dB restricted to the set

𝔛 = {B | B ∈ 𝒮_q^+, I_q − B ∈ 𝒮_q^+}.

Here, ω(·, ·) is the Wishart constant defined in Example 5.1. Now, the ordered eigenvalues of B are a maximal invariant under the action of the group O_q on 𝔛 given by B → ΓBΓ′, Γ ∈ O_q. Let λ be the vector of ordered eigenvalues of B so λ ∈ R^q, 1 > λ₁ > ⋯ > λ_q > 0. Since p(ΓBΓ′) = p(B), Γ ∈ O_q, it follows from Proposition 7.15 that the density of λ is q(λ) = p(D_λ), where D_λ is a q × q diagonal matrix with diagonal entries λ₁, …, λ_q. Of course, q(·) is the density of λ with respect to the measure ν(dλ) induced by the maximal invariant mapping. More precisely, let

𝔉 = {λ | λ ∈ R^q, 1 > λ₁ > ⋯ > λ_q > 0}
and consider the mapping φ from 𝔛 to 𝔉 defined by φ(B) = λ, where λ is the vector of ordered eigenvalues of B. For any Borel set C ⊆ 𝔉, ν(C) is defined by

ν(C) = ∫_{φ^{−1}(C)} dB.

Since q(λ) has been calculated, the only step left to determine the distribution of λ is to find the measure ν. However, it is rather nontrivial to find ν and the details are not given here. We have included the above argument to show that the only step in obtaining ℒ(λ) that we have not solved is the calculation of ν. This completes our discussion of distributional problems associated with canonical correlations.

The measure ν above is just the restriction to 𝔉 of the measure ν₂ discussed in Example 6.1. For one derivation of ν₂, see Muirhead (1982, p. 104).
10.4. TESTING FOR INDEPENDENCE
In this section, we consider the problem of testing for independence based on a random sample from a normal distribution. Again, let Z₁, …, Z_n be independent random vectors in R^p and partition Z_i as

Z_i = ( X_i ; Y_i ),  X_i ∈ R^q, Y_i ∈ R^r.

It is assumed that ℒ(Z_i) = N(μ, Σ), so

Cov(Z_i) = Σ = ( Σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂ )

for i = 1, …, n. The problem is to test the null hypothesis H₀: Σ₁₂ = 0 against the alternative H₁: Σ₁₂ ≠ 0. As usual, let Z have rows Z_i′, i = 1, …, n, so ℒ(Z) = N(eμ′, I_n ⊗ Σ). Assuming n > p + 1, the set 𝒵 ⊆ ℒ_{p,n} where

S = (Z − eZ̄′)′(Z − eZ̄′) = ( S₁₁  S₁₂ ; S₂₁  S₂₂ )

has rank p is a set of probability one and 𝒵 is taken as the sample space for Z. The group G considered in Proposition 10.6 acts on 𝒵 and a maximal invariant is the vector of canonical correlation coefficients r₁, …, r_t, where t = min(q, r).
Proposition 10.14. The problem of testing H₀: Σ₁₂ = 0 versus H₁: Σ₁₂ ≠ 0 is invariant under G. Every G-invariant test is a function of the sample canonical correlation coefficients r₁, …, r_t. When t = 1, the test that rejects for large values of r₁ is a uniformly most powerful invariant test.

Proof. That the testing problem is G-invariant is easily checked. From Proposition 10.6, the function mapping Z into r₁, …, r_t is a maximal invariant, so every invariant test is a function of r₁, …, r_t. When t = 1, the test that rejects for large values of r₁ is equivalent to the test that rejects for large values of U = r₁²/(1 − r₁²). It was argued in the last section (see Proposition 10.10) that the density of U, say k(u|ρ), has a monotone likelihood ratio, where ρ is the only nonzero population canonical correlation coefficient. Since the null hypothesis is that ρ = 0 and since every invariant test is a function of U, it follows that the test that rejects for large values of U is a uniformly most powerful invariant test. □

When t = 1, the distribution of U is specified in Proposition 10.10, and this can be used to construct a test of level α for H₀. For example, if q = 1, then ℒ(U) = F(r, n − r − 1) and a constant c(α) can be found from standard tables of the normalized F-distribution such that, under H₀, P{U > c(α)} = α.
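In computational practice, the q = 1 case is the usual multiple correlation (overall F) test. The sketch below (assuming numpy/scipy; the function name and the simulated data are illustrative) converts U = r₁²/(1 − r₁²) to the normalized F scale and returns a p-value:

```python
import numpy as np
from scipy import stats

def multiple_correlation_test(x, Y):
    """Squared multiple correlation r1^2 and p-value of the invariant test
    that rejects for large r1 (q = 1); under H0, (n-r-1)/r * U has a
    normalized F(r, n-r-1) distribution, where U = r1^2/(1 - r1^2)."""
    n, r = Y.shape
    xc = x - x.mean()
    Yc = Y - Y.mean(axis=0)
    P = Yc @ np.linalg.solve(Yc.T @ Yc, Yc.T)    # projection onto col(Q_e Y)
    r2 = xc @ P @ xc / (xc @ xc)
    F = (n - r - 1) / r * r2 / (1 - r2)
    return r2, stats.f.sf(F, r, n - r - 1)

rng = np.random.default_rng(6)
Y = rng.standard_normal((50, 3))
x0 = rng.standard_normal(50)                     # independent: H0 true
x1 = Y[:, 0] + 0.1 * rng.standard_normal(50)     # dependent: H0 false
r2_0, p0 = multiple_correlation_test(x0, Y)
r2_1, p1 = multiple_correlation_test(x1, Y)
print(r2_1 > r2_0)
```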
In the case that t > 1, there is no obvious function of r₁, …, r_t that provides an optimum test of H₀ versus H₁. Intuitively, if some of the r_i's are "too big," there is reason to suspect that H₀ is not true. The likelihood ratio test provides one possible criterion for testing Σ₁₂ = 0.
Proposition 10.15. The likelihood ratio test of H₀ versus H₁ rejects if the statistic

W = ∏_{i=1}^t (1 − r_i²)

is too small. Under H₀, ℒ(W) = U(n − r − 1, r, q), which is the distribution described in Proposition 8.14.

Proof. The density function of Z is

p(Z|μ, Σ) = (2π)^{−np/2} |Σ|^{−n/2} exp[−½ tr Σ^{−1}(Z − eμ′)′(Z − eμ′)].

Under both H₀ and H₁, the maximum likelihood estimate of μ is μ̂ = Z̄. Under H₁, the maximum likelihood estimate of Σ is Σ̂ = (1/n)S. Partitioning S as Σ is partitioned, we have

S = ( S₁₁  S₁₂ ; S₂₁  S₂₂ ),

where S₁₁ is q × q, S₁₂ is q × r, and S₂₂ is r × r. Under H₀, Σ has the form

Σ = ( Σ₁₁  0 ; 0  Σ₂₂ ),

so

Σ^{−1} = ( Σ₁₁^{−1}  0 ; 0  Σ₂₂^{−1} ).
When Σ has this form,

sup_μ p(Z|μ, Σ) = (2π)^{−np/2} |Σ₁₁|^{−n/2} |Σ₂₂|^{−n/2} exp[−½ tr S₁₁Σ₁₁^{−1}] exp[−½ tr S₂₂Σ₂₂^{−1}].

From this it is clear that, under H₀, Σ̂₁₁ = (1/n)S₁₁ and Σ̂₂₂ = (1/n)S₂₂. Substituting these estimates into the densities under H₀ and H₁ leads to a likelihood ratio of
Λ(Z) = ( |S| / (|S₁₁||S₂₂|) )^{n/2}.

Rejecting H₀ for small values of Λ(Z) is equivalent to rejecting for small values of

W = (Λ(Z))^{2/n} = |S| / (|S₁₁||S₂₂|).
The identity |S| = |S₁₁||S₂₂ − S₂₁S₁₁^{−1}S₁₂| shows that

|S₂₂ − S₂₁S₁₁^{−1}S₁₂| = |S₂₂||I_r − S₂₂^{−1}S₂₁S₁₁^{−1}S₁₂| = |S₂₂| ∏_{i=1}^t (1 − r_i²),

where r₁², …, r_t² are the t largest eigenvalues of S₂₂^{−1}S₂₁S₁₁^{−1}S₁₂. Thus the
likelihood ratio test is equivalent to the test that rejects for small values of W. That ℒ(W) = U(n − r − 1, r, q) under H₀ follows from Proposition 10.11. □
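The two expressions for the likelihood ratio statistic in this proof, W = |S|/(|S₁₁||S₂₂|) and W = ∏(1 − r_i²), can be checked against each other numerically (numpy sketch; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, q, r = 30, 3, 4
Z = rng.standard_normal((n, q + r))
Zc = Z - Z.mean(axis=0)
S = Zc.T @ Zc
S11, S12, S22 = S[:q, :q], S[:q, q:], S[q:, q:]

# Determinant form: W = |S| / (|S11| |S22|).
W_det = np.linalg.det(S) / (np.linalg.det(S11) * np.linalg.det(S22))

# Canonical-correlation form: r_i^2 are the eigenvalues of
# S11^{-1} S12 S22^{-1} S21, and W = prod(1 - r_i^2).
M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)
r2 = np.sort(np.linalg.eigvals(M).real)[::-1][:min(q, r)]
W_cc = np.prod(1 - r2)

print(np.isclose(W_det, W_cc))
```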
The distribution of W under H₁ is quite complicated to describe except in the case that Σ₁₂ has rank one. As mentioned in the last section, when Σ₁₂ has rank one, the methods used in Proposition 8.15 yield a description of the distribution of W.

Rather than discuss possible alternatives to the likelihood ratio test, in the next section we show that the testing problem above is a special case of the MANOVA testing problem considered in Chapter 9. Thus the alternatives to the likelihood ratio test for the MANOVA problem are also alternatives to the likelihood ratio test for independence.
We now turn to a slight generalization of the problem of testing that Σ₁₂ = 0. Again suppose that Z ∈ ℒ_{p,n} satisfies ℒ(Z) = N(eμ′, I_n ⊗ Σ), where μ ∈ R^p and Σ are both unknown parameters and n > p + 1. Given an integer k ≥ 2, let p₁, …, p_k be positive integers such that Σ_{i=1}^k p_i = p. Partition Σ into blocks Σ_ij of dimension p_i × p_j for i, j = 1, …, k. We now discuss the likelihood ratio test for testing H₀: Σ_ij = 0 for all i, j with i ≠ j. For example, when k = p and each p_i = 1, then the null hypothesis is that Σ is diagonal with unknown diagonal elements. By mimicking the proof of Proposition 10.15, it is not difficult to show that the likelihood ratio test for testing H₀ versus the alternative that Σ is completely unknown rejects for small values of

Λ = |S| / ∏_{i=1}^k |S_ii|.

Here, S = (Z − eZ̄′)′(Z − eZ̄′) is partitioned into S_ij: p_i × p_j as Σ was partitioned. Further, for i = 1, …, k, define S_(ii) by

S_(ii) = ( S_ii  S_{i(i+1)}  ⋯  S_ik ; ⋮ ; S_ki  ⋯  S_kk ),

so S_(ii) is (p_i + ⋯ + p_k) × (p_i + ⋯ + p_k). Noting that S_(11) = S, we can write

Λ = |S| / ∏_{i=1}^k |S_ii| = ∏_{i=1}^{k−1} |S_(ii)| / (|S_ii||S_(i+1,i+1)|).
Define W_i, i = 1, …, k − 1, by

W_i = |S_(ii)| / (|S_ii||S_(i+1,i+1)|).

Under the null hypothesis, it follows from Proposition 10.11 that

ℒ(W_i) = U(n − 1 − Σ_{j=i+1}^k p_j, Σ_{j=i+1}^k p_j, p_i).

To derive the distribution of Λ under H₀, we now show that W₁, …, W_{k−1} are independent random variables under H₀. From this it follows that, under H₀,

ℒ(Λ) = ∏_{i=1}^{k−1} U(n − 1 − Σ_{j=i+1}^k p_j, Σ_{j=i+1}^k p_j, p_i),

so Λ is distributed as a product of independent beta random variables. The independence of W₁, …, W_{k−1} for a general k follows easily by induction once independence has been verified for k = 3.
For k = 3, we have

Λ = W₁W₂ = [|S| / (|S₁₁||S_(22)|)] · [|S_(22)| / (|S₂₂||S₃₃|)]

and, under H₀,

ℒ(S) = W(Σ, p, n − 1),

where Σ has the form

Σ = ( Σ₁₁  0  0 ; 0  Σ₂₂  0 ; 0  0  Σ₃₃ ).

To show that W₁ and W₂ are independent, Proposition 7.19 is applied. The sample space for S is 𝒮_p^+, the space of p × p positive definite matrices. Fix Σ of the above form and let P₀ denote the probability measure of S, so P₀ is the probability measure of a W(Σ, p, n − 1) distribution on 𝒮_p^+. Consider the group G whose elements are (A, B) where A ∈ Gl_{p₁} and B ∈ Gl_{p₂+p₃},
and the group composition is

(A₁, B₁)(A₂, B₂) = (A₁A₂, B₁B₂).

It is easy to show that the action

(A, B)[S] = ( A  0 ; 0  B ) S ( A  0 ; 0  B )′

defines a left action of G on 𝒮_p^+. If ℒ(S) = W(Σ, p, n − 1), then

ℒ((A, B)[S]) = W((A, B)[Σ], p, n − 1),

where

(A, B)[Σ] = ( A  0 ; 0  B )( Σ₁₁  0 ; 0  Σ_(22) )( A  0 ; 0  B )′ = ( AΣ₁₁A′  0 ; 0  BΣ_(22)B′ ).

This last equality follows from the special form of Σ. The first thing to notice is that

W₁ = W₁(S) = |S| / (|S₁₁||S_(22)|)

is invariant under the action of G on 𝒮_p^+. Also, because of the special form of Σ, the statistic

T(S) = (S₁₁, S_(22)) ∈ 𝒮_{p₁}^+ × 𝒮_{p₂+p₃}^+

is a sufficient statistic for the family of distributions {gP₀ | g ∈ G}. This follows from the factorization criterion applied to the family {gP₀ | g ∈ G}, which is the Wishart family

{ W(( Ψ₁₁  0 ; 0  Ψ₂₂ ), p, n − 1) | Ψ₁₁ ∈ 𝒮_{p₁}^+, Ψ₂₂ ∈ 𝒮_{p₂+p₃}^+ }.

However, G acts transitively on 𝒮_{p₁}^+ × 𝒮_{p₂+p₃}^+ in the obvious way:

(A, B)[S₁, S₂] = (AS₁A′, BS₂B′)

for [S₁, S₂] ∈ 𝒮_{p₁}^+ × 𝒮_{p₂+p₃}^+. Further, the sufficient statistic T(S) ∈ 𝒮_{p₁}^+ × 𝒮_{p₂+p₃}^+ satisfies

T((A, B)[S]) = (A, B)[T(S)],
so T(·) is an equivariant function. It now follows from Proposition 7.19 that the invariant statistic W₁(S) is independent of the sufficient statistic T(S). But

W₂(S) = |S_(22)| / (|S₂₂||S₃₃|)

is a function of S_(22) and so is a function of T(S) = [S₁₁, S_(22)]. Thus W₁ and W₂ are independent for each value of Σ in the null hypothesis. Summarizing, we have the following result.
Proposition 10.16. Assume k = 3 and Σ has the form specified under H₀. Then, under the action of the group G on both 𝒮_p^+ and 𝒮_{p₁}^+ × 𝒮_{p₂+p₃}^+, the invariant statistic

W₁(S) = |S| / (|S₁₁||S_(22)|)

and the equivariant statistic

T(S) = [S₁₁, S_(22)]

are independent. In particular, the statistic

W₂(S) = |S_(22)| / (|S₂₂||S₃₃|),

being a function of T(S), is independent of W₁.
The application and interpretation of the previous paragraph for general k should be fairly clear. The details are briefly outlined. Under the null hypothesis that Σ_ij = 0 for i, j = 1, …, k and i ≠ j, we want to describe the distribution of

Λ = ∏_{i=1}^{k−1} |S_(ii)| / (|S_ii||S_(i+1,i+1)|) = ∏_{i=1}^{k−1} W_i.

It was remarked earlier that each W_i is distributed as a product of independent beta random variables. To see that W₁, …, W_{k−1} are independent, Proposition 10.16 shows that

W₁ = |S| / (|S₁₁||S_(22)|)
and S_(22) are independent. Since (W₂, …, W_{k−1}) is a function of S_(22), W₁ and (W₂, …, W_{k−1}) are independent. Next, apply Proposition 10.16 to S_(22) to conclude that

W₂ = |S_(22)| / (|S₂₂||S_(33)|)

and S_(33) are independent. Since (W₃, …, W_{k−1}) is a function of S_(33), W₂ and (W₃, …, W_{k−1}) are independent. The conclusion that W₁, …, W_{k−1} are independent now follows easily. Thus the distribution of Λ under H₀ has been described.
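The telescoping identity Λ = ∏_{i=1}^{k−1} W_i underlying this decomposition is easy to verify numerically (numpy sketch with k = 3 blocks of illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(8)
n, dims = 40, [2, 3, 2]               # k = 3 blocks, p = 7
p = sum(dims)
Z = rng.standard_normal((n, p))
Zc = Z - Z.mean(axis=0)
S = Zc.T @ Zc
det = np.linalg.det

edges = np.cumsum([0] + dims)          # block boundaries: [0, 2, 5, 7]

def S_trail(i):                        # S_(ii): rows/cols from block i on
    a = edges[i]
    return S[a:, a:]

def S_block(i):                        # diagonal block S_ii
    a, b = edges[i], edges[i + 1]
    return S[a:b, a:b]

Lam = det(S) / np.prod([det(S_block(i)) for i in range(len(dims))])
W = [det(S_trail(i)) / (det(S_block(i)) * det(S_trail(i + 1)))
     for i in range(len(dims) - 1)]
print(np.isclose(Lam, np.prod(W)))
```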
To interpret the decomposition of Λ into the product ∏W_i, first consider the null hypothesis

H₀^(1): Σ₁ⱼ = 0 for j = 2, …, k.

An application of Proposition 10.15 shows that the likelihood ratio test of H₀^(1) versus the alternative that Σ is unknown rejects for small values of

W₁ = |S| / (|S₁₁||S_(22)|).

Assuming H₀^(1) to be true, consider testing

H₀^(2): Σ₂ⱼ = 0 for j = 3, …, k

versus

H₁^(2): Σ₂ⱼ ≠ 0 for some j = 3, …, k.

A minor variation of the proof of Proposition 10.15 yields a likelihood ratio test of H₀^(2) versus H₁^(2) (given H₀^(1)) that rejects for small values of

W₂ = |S_(22)| / (|S₂₂||S_(33)|).
Proceeding by induction, assume the null hypotheses H₀^(i), i = 1, …, m − 1, to be true and consider testing

H₀^(m): Σ_mj = 0, j = m + 1, …, k

versus

H₁^(m): Σ_mj ≠ 0 for some j = m + 1, …, k.
Given the null hypotheses H₀^(i), i = 1, …, m − 1, the likelihood ratio test of H₀^(m) versus H₁^(m) rejects for small values of

W_m = |S_(mm)| / (|S_mm||S_(m+1,m+1)|).

The overall likelihood ratio test is one possible way of combining the likelihood ratio tests of H₀^(m) versus H₁^(m), given that H₀^(i), i = 1, …, m − 1, is true.
10.5. MULTIVARIATE REGRESSION
The purpose of this section is to show that testing for independence can be viewed as a special case of the general MANOVA testing problem treated in Chapter 9. In fact, the results below extend those of the previous section by allowing a more general mean structure for the observations. In the notation of the previous section, consider a data matrix Z: n × p that is partitioned as Z = (X Y), where X is n × q and Y is n × r, so p = q + r. It is assumed that

ℒ(Z) = N(TB, I_n ⊗ Σ),

where T is an n × k known matrix of rank k and B is a k × p matrix of unknown parameters. As usual, Σ is a p × p positive definite matrix. This is precisely the linear model discussed in Section 9.1 and clearly includes the model of previous sections of this chapter as a special case.
To test that X and Y are independent, it is illuminating to first calculate the conditional distribution of Y given X. Partition the matrix B as B = (B₁ B₂), where B₁ is k × q and B₂ is k × r. In describing the conditional distribution of Y given X, say ℒ(Y|X), the notation

Σ₂₂.₁ = Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂

is used. Following Example 3.1, we have

ℒ(Y|X) = N(TB₂ + (X − TB₁)Σ₁₁^{−1}Σ₁₂, I_n ⊗ Σ₂₂.₁)
       = N(T(B₂ − B₁Σ₁₁^{−1}Σ₁₂) + XΣ₁₁^{−1}Σ₁₂, I_n ⊗ Σ₂₂.₁),

and the marginal distribution of X is ℒ(X) = N(TB₁, I_n ⊗ Σ₁₁).
Let W be the n × (q + k) matrix (X T) and let C be the (q + k) × r matrix of parameters

C = ( C₁ ; C₂ ) = ( Σ₁₁^{−1}Σ₁₂ ; B₂ − B₁Σ₁₁^{−1}Σ₁₂ ),

so

XΣ₁₁^{−1}Σ₁₂ + T(B₂ − B₁Σ₁₁^{−1}Σ₁₂) = (X T)( C₁ ; C₂ ) = WC.

In this notation, we have

ℒ(Y|X) = N(WC, I_n ⊗ Σ₂₂.₁)

and

ℒ(X) = N(TB₁, I_n ⊗ Σ₁₁).
Assuming n > p + k, the matrix W has rank q + k with probability one, so the conditional model for Y is of the MANOVA type. Further, testing H₀: Σ₁₂ = 0 versus H₁: Σ₁₂ ≠ 0 is equivalent to testing H̃₀: C₁ = 0 versus H̃₁: C₁ ≠ 0. In other words, based on the model for Z,

ℒ(Z) = N(TB, I_n ⊗ Σ),

the null hypothesis concerns the covariance matrix. But in terms of the conditional model, the null hypothesis concerns the matrix of regression parameters.
With the above discussion and models in mind, we now want to discuss various approaches to testing H₀ and H̃₀. In terms of the model

ℒ(Z) = N(TB, I_n ⊗ Σ)

and assuming H₁, the maximum likelihood estimators of B and Σ are

B̂ = (T′T)^{−1}T′Z,  Σ̂ = (1/n)S,

where

S = (Z − TB̂)′(Z − TB̂),
so

ℒ(S) = W(Σ, p, n − k).

Under H₀, the maximum likelihood estimator of B is still B̂ as above and, since

Σ = ( Σ₁₁  0 ; 0  Σ₂₂ ),

it is readily verified that

Σ̂_ii = (1/n)S_ii,  i = 1, 2,

where

S = ( S₁₁  S₁₂ ; S₂₁  S₂₂ ).
Substituting these estimators into the density of Z under H₀ and H₁ demonstrates that the likelihood ratio test rejects for small values of

Λ(Z) = |S| / (|S₁₁||S₂₂|).

Under H₀, the proof of Proposition 10.11 shows that the distribution of Λ(Z) is U(n − k − r, r, q) as described in Proposition 8.14. Of course, symmetry in r and q implies that U(n − k − r, r, q) = U(n − k − q, q, r).

An alternative derivation of this likelihood ratio test can be given using the conditional distribution of Y given X and the marginal distribution of X. This follows from two facts: (i) the density of Z is proportional to the conditional density of Y given X multiplied by the marginal density of X, and (ii) the relabeling of the parameters is one-to-one, namely, the mapping from (B, Σ) to (C, B₁, Σ₁₁, Σ₂₂.₁) is a one-to-one onto mapping of ℒ_{p,k} × 𝒮_p^+ to ℒ_{r,(q+k)} × ℒ_{q,k} × 𝒮_q^+ × 𝒮_r^+. We now turn to the likelihood ratio test of H̃₀ versus H̃₁ based on the conditional model

ℒ(Y|X) = N(WC, I_n ⊗ Σ₂₂.₁),

where X is treated as fixed. With X fixed, testing H̃₀ versus H̃₁ is a special
case of the MANOVA testing problem and the results in Chapter 9 are
directly applicable. To express H̃₀ in the MANOVA testing problem form, let K be the q × (q + k) matrix K = (I_q 0), so the null hypothesis H̃₀ is

H̃₀: KC = 0.

Recall that

Ĉ = (W′W)^{−1}W′Y

is the maximum likelihood estimator of C under H̃₁. Let P_W = W(W′W)^{−1}W′ denote the orthogonal projection onto the column space of W, let Q_W = I_n − P_W, and define V ∈ 𝒮_r^+ by

V = Y′Q_W Y = (Y − WĈ)′(Y − WĈ).
As shown in Section 9.1, based on the model

ℒ(Y|X) = N(WC, I_n ⊗ Σ₂₂.₁),

the likelihood ratio test of H̃₀: KC = 0 versus H̃₁: KC ≠ 0 rejects H̃₀ for small values of

Λ₁(Y) = |V| / |V + (KĈ)′(K(W′W)^{−1}K′)^{−1}(KĈ)|.

For each fixed X, Proposition 9.1 shows that under H̃₀, the distribution of Λ₁(Y) is U(n − q − k, q, r), which is the distribution (unconditional) of Λ(Z) under H₀. In fact, much more is true.
Proposition 10.17. In the notation above:

(i) V = S₂₂.₁.
(ii) (KĈ)′(K(W′W)^{−1}K′)^{−1}(KĈ) = S₂₁S₁₁^{−1}S₁₂.
(iii) Λ₁(Y) = Λ(Z).

Further, under H₀, the conditional (given X) and unconditional distributions of Λ₁(Y) and Λ(Z) are the same.
Proof. To establish (i), first write S as

S = (Z − TB̂)′(Z − TB̂) = Z′(I − P_T)Z,
where P_T = T(T′T)^{−1}T′ is the orthogonal projection onto the column space of T. Setting Q_T = I − P_T and writing Z = (X Y), we have

S = Z′Q_T Z = ( X′Q_T X  X′Q_T Y ; Y′Q_T X  Y′Q_T Y ) = ( S₁₁  S₁₂ ; S₂₁  S₂₂ ).

This yields the identity

S₂₂.₁ = Y′Q_T Y − Y′Q_T X(X′Q_T X)^{−1}X′Q_T Y = Y′(I − P_T)Y − Y′P₀Y,

where P₀ = Q_T X(X′Q_T X)^{−1}X′Q_T is the orthogonal projection onto the column space of Q_T X. However, a bit of reflection reveals that P₀ = P_W − P_T, so

S₂₂.₁ = Y′(I − P_T)Y − Y′(P_W − P_T)Y = Y′(I − P_W)Y = Y′Q_W Y = V.
This establishes assertion (i). For (ii), we have

S₂₁S₁₁^{−1}S₁₂ = Y′P₀Y

and

(KĈ)′(K(W′W)^{−1}K′)^{−1}(KĈ)
  = Y′W(W′W)^{−1}K′(K(W′W)^{−1}W′W(W′W)^{−1}K′)^{−1}K(W′W)^{−1}W′Y
  = Y′U(U′U)^{−1}U′Y = Y′P_U Y,

where U = W(W′W)^{−1}K′ and P_U is the orthogonal projection onto the column space of U. Thus it must be shown that P_U = P₀ or, equivalently, that the column space of U is the same as the column space of Q_T X. Since W = (X T), the relationship

W′U = W′W(W′W)^{−1}K′ = K′ = ( I_q ; 0 )

proves that the q columns of U are orthogonal to the k columns of T. Thus the columns of U span a q-dimensional subspace contained in the column space of W and orthogonal to the column space of T. But there is only one
subspace with these properties. Since the column space of Q_T X also has these properties, it follows that P_U = P₀, so (ii) holds. Relationship (iii) is a consequence of (i), (ii), and the identity

|S| / (|S₁₁||S₂₂|) = |S₂₂.₁| / |S₂₂.₁ + S₂₁S₁₁^{−1}S₁₂|.

The validity of the final assertion concerning the distributions of Λ₁(Y) and Λ(Z) was established earlier. □
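Assertions (i)-(iii) of Proposition 10.17 are finite-sample algebraic identities, so they can be verified directly on simulated data (numpy sketch; T, B, and the dimensions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
n, q, r, k = 30, 2, 3, 2
p = q + r
T = rng.standard_normal((n, k))
B = rng.standard_normal((k, p))
Z = T @ B + rng.standard_normal((n, p))
X, Y = Z[:, :q], Z[:, q:]
det, inv = np.linalg.det, np.linalg.inv

# Unconditional model quantities: S = Z'(I - P_T)Z and its blocks.
PT = T @ inv(T.T @ T) @ T.T
S = Z.T @ (np.eye(n) - PT) @ Z
S11, S12, S22 = S[:q, :q], S[:q, q:], S[q:, q:]
S22_1 = S22 - S12.T @ inv(S11) @ S12

# Conditional (MANOVA) model quantities: W = (X T), V = Y'Q_W Y.
W = np.hstack([X, T])
PW = W @ inv(W.T @ W) @ W.T
V = Y.T @ (np.eye(n) - PW) @ Y

K = np.hstack([np.eye(q), np.zeros((q, k))])
C_hat = inv(W.T @ W) @ W.T @ Y
KC = K @ C_hat
H = KC.T @ inv(K @ inv(W.T @ W) @ K.T) @ KC

Lam1 = det(V) / det(V + H)                 # conditional-model LRT statistic
LamZ = det(S) / (det(S11) * det(S22))      # unconditional LRT statistic
print(np.allclose(V, S22_1), np.isclose(Lam1, LamZ))
```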
The results of Proposition 10.17 establish the connection between testing for independence and the MANOVA testing problem. Further, under H̃₀, the conditional distribution of Λ₁(Y) is U(n − q − k, q, r) for each value of X, so the marginal distribution of X is irrelevant. In other words, as long as the conditional model for Y given X is valid, we can test H̃₀ using the likelihood ratio test, and under H̃₀, the distribution of the test statistic does not depend on the value of X. Of course, this implies that the conditional (given X) distribution of Λ(Z) is the same as the unconditional distribution of Λ(Z) under H₀. However, under H₁, the conditional and unconditional distributions of Λ(Z) are not the same.
PROBLEMS
1. Given positive integers t, q, and r with t < q, r, consider random vectors U ∈ R^t, V ∈ R^q, and W ∈ R^r where Cov(U) = I_t and U, V, and W are uncorrelated. For A: q × t and B: r × t, construct X = AU + V and Y = BU + W.

(i) With Λ₁₁ = Cov(V) and Λ₂₂ = Cov(W), show that

Cov(X) = AA′ + Λ₁₁,
Cov(Y) = BB′ + Λ₂₂,

and the cross covariance between X and Y is AB′. Conclude that the number of nonzero population canonical correlations between X and Y is at most t.

(ii) Conversely, given X ∈ R^q and Y ∈ R^r with t nonzero population canonical correlations, construct U, V, W, A, and B as above so that X = AU + V and Y = BU + W have the same joint covariance as X and Y.
2. Consider X ∈ R^q and Y ∈ R^r and assume that Cov(X) = Σ11 and Cov(Y) = Σ22 exist. Let Σ12 be the cross covariance of X with Y. Recall that 𝒫_n denotes the group of n × n permutation matrices.

(i) If gΣ12h' = Σ12 for all g ∈ 𝒫_q and h ∈ 𝒫_r, show that

Σ12 = δ e1 e2'

for some δ ∈ R^1, where e1 (e2) is the vector of ones in R^q (R^r).

(ii) Under the assumptions in (i), show that there is at most one nonzero canonical correlation and it is |δ| (e1' Σ11^{-1} e1)^{1/2} (e2' Σ22^{-1} e2)^{1/2}. What is a set of canonical coordinates?
3. Consider X ∈ R^p with Cov(X) = Σ > 0 (for simplicity, assume μ = 0). This problem has to do with the approximation of X by a lower dimensional random vector, say Y = BX where B is a t × p matrix of rank t.

(i) In the notation of Proposition 10.4, suppose A0: p × p is used to define the inner product [·, ·] on R^p and prediction error is measured by ||X − CY||² where || · || is defined by [·, ·] and C is p × t. Show that the minimum prediction error (B fixed) is

δ(B) = tr A0(Σ − ΣB'(BΣB')^{-1}BΣ)

and the minimum is achieved for C = Ĉ = ΣB'(BΣB')^{-1}.

(ii) Let A = Σ^{1/2} A0 Σ^{1/2} and write A in spectral form as A = Σ_1^p λ_i a_i a_i' where λ1 ≥ ··· ≥ λp > 0 and a_1, ..., a_p is an orthonormal basis for R^p. Show that δ(B) = tr A(I − Q(B)) where Q(B) = Σ^{1/2}B'(BΣB')^{-1}BΣ^{1/2} is a rank t orthogonal projection. Using this, show that δ(B) is minimized by choosing Q = Q̂ = Σ_1^t a_i a_i', and the minimum is λ_{t+1} + ··· + λ_p. What is a corresponding B̂ and X̂ = ĈB̂X that gives the minimum? Show that X̂ = ĈB̂X = Σ^{1/2}Q̂Σ^{-1/2}X.

(iii) In the special case that A0 = I_p, show that

X̂ = Σ_1^t (a_i'X) a_i

where a_1, ..., a_p are the eigenvectors of Σ with Σa_i = λ_i a_i and λ1 ≥ ··· ≥ λp. (The random variables a_i'X, i = 1, ..., p, are often called the principal components of X. It is easily verified that cov(a_i'X, a_j'X) = δ_ij λ_i.)
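Both facts in part (iii) can be verified at the population level. The sketch below is editorial (assuming Python with NumPy; the covariance matrix and seed are arbitrary): it checks that the principal components are uncorrelated with variances λ_i, and that the rank-t truncation error equals λ_{t+1} + ··· + λ_p.

```python
import numpy as np

rng = np.random.default_rng(1)
p, t = 5, 2
G = rng.normal(size=(p, p))
Sigma = G @ G.T + np.eye(p)          # an arbitrary positive definite covariance

lam, a = np.linalg.eigh(Sigma)       # eigh returns eigenvalues in ascending order
lam, a = lam[::-1], a[:, ::-1]       # reorder so lam[0] >= ... >= lam[p-1]

# cov(a_i'X, a_j'X) = a_i' Sigma a_j = delta_ij * lam_i:
C = a.T @ Sigma @ a
assert np.allclose(C, np.diag(lam))

# Prediction error of the rank-t approximation (case A0 = I):
# tr Sigma - tr(Q Sigma) = lam_{t+1} + ... + lam_p.
Q = a[:, :t] @ a[:, :t].T            # projection onto the top-t eigenvectors
err = np.trace(Sigma) - np.trace(Q @ Sigma)
print(err, lam[t:].sum())            # the two numbers agree
```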
4. In R^p, consider a translated subspace M + a0 where a0 ∈ R^p. Such a set in R^p is called a flat, and the dimension of the flat is the dimension of M.

(i) Given any flat M + a0, show that M + a0 = M + b0 for some unique b0 ∈ M^⊥.

Consider a flat M + a0, and define the orthogonal projection onto M + a0 by x → P(x − a0) + a0 where P is the orthogonal projection onto M. Given n points x1, ..., xn in R^p, consider the problem of finding the "closest" k-dimensional flat M + a0 to the n points. As a measure of distance of the n points from M + a0, we use Δ(M, a0) = Σ_1^n ||x_i − x̂_i||², where || · || is the usual Euclidean norm and x̂_i = P(x_i − a0) + a0 is the projection of x_i onto M + a0. The problem is to find M and a0 to minimize Δ(M, a0) over all k-dimensional subspaces M and all a0.

(ii) First, regard a0 as fixed, and set S(b) = Σ_1^n (x_i − b)(x_i − b)' for any b ∈ R^p. With Q = I − P, show that Δ(M, a0) = tr S(a0)Q = tr S(x̄)Q + n(a0 − x̄)'Q(a0 − x̄), where x̄ = n^{-1} Σ_1^n x_i.

(iii) Write S(x̄) = Σ_1^p λ_i v_i v_i' in spectral form, where λ1 ≥ ··· ≥ λp ≥ 0 and v_1, ..., v_p is an orthonormal basis for R^p. Using (ii), show that Δ(M, a0) ≥ Σ_{k+1}^p λ_i with equality for a0 = x̄ and for M = span{v_1, ..., v_k}.
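The minimizing flat of part (iii) can be checked directly. The sketch below is editorial (assuming Python with NumPy; data, seed, and dimensions are arbitrary): it projects the points onto the flat through x̄ spanned by the top-k eigenvectors of S(x̄) and compares the residual sum of squares with the sum of the p − k smallest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 30, 4, 2
xs = rng.normal(size=(n, p)) @ rng.normal(size=(p, p)) + 3.0

xbar = xs.mean(axis=0)
S = (xs - xbar).T @ (xs - xbar)       # S(xbar) = sum (x_i - xbar)(x_i - xbar)'
lam, v = np.linalg.eigh(S)            # eigenvalues in ascending order

# Project onto the flat xbar + span of the top-k eigenvectors:
P = v[:, -k:] @ v[:, -k:].T
resid = xs - ((xs - xbar) @ P + xbar)             # x_i - xhat_i
delta = (resid ** 2).sum()

print(delta, lam[:-k].sum())          # Delta equals the sum of the p-k smallest eigenvalues
```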
5. Consider a sample covariance matrix

S = ( S11  S12 )
    ( S21  S22 )

with Sii > 0 for i = 1, 2. With t = min{dim Sii, i = 1, 2}, show that the t sample canonical correlations are the t largest solutions (in λ) to the equation

|S12 S22^{-1} S21 − λ² S11| = 0,   λ ∈ [0, ∞).
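This determinantal characterization agrees with the singular-value computation of canonical correlations. A numerical sketch (editorial, assuming Python with NumPy; sample size, dimensions, and seed are arbitrary): the roots λ are the square roots of the eigenvalues of S11^{-1} S12 S22^{-1} S21, which match the singular values of S11^{-1/2} S12 S22^{-1/2}.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, r = 50, 3, 4
Z = rng.normal(size=(n, q + r))
Zc = Z - Z.mean(axis=0)
S = Zc.T @ Zc
S11, S12, S22 = S[:q, :q], S[:q, q:], S[q:, q:]
S21 = S12.T

# lam^2 solving |S12 S22^{-1} S21 - lam^2 S11| = 0 are the eigenvalues
# of S11^{-1} S12 S22^{-1} S21 (a generalized eigenvalue problem).
lam2 = np.linalg.eigvals(np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S21)))
roots = np.sort(np.sqrt(np.clip(np.real(lam2), 0.0, None)))[::-1]

# The same numbers as singular values of S11^{-1/2} S12 S22^{-1/2}:
def inv_sqrt(M):
    w, v = np.linalg.eigh(M)
    return v @ np.diag(w ** -0.5) @ v.T

svals = np.sort(np.linalg.svd(inv_sqrt(S11) @ S12 @ inv_sqrt(S22),
                              compute_uv=False))[::-1]
print(np.allclose(roots, svals))   # → True
```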
6. (The Eckart–Young Theorem, 1936.) Given a matrix A: n × p (say n ≥ p), let k < p. The problem is to find a matrix B: n × p of rank no greater than k that is "closest" to A in the usual trace inner product on ℒ_{p,n}. Let 𝔅_k be all the n × p matrices of rank no larger than k, so the problem is to find

inf_{B ∈ 𝔅_k} ||A − B||²

where ||M||² = tr MM' for M ∈ ℒ_{p,n}.

(i) Show that every B ∈ 𝔅_k can be written ψC where ψ is n × k, ψ'ψ = I_k, and C is k × p. Conversely, ψC ∈ 𝔅_k for each such ψ and C.

(ii) Using the results of Example 4.4, show that, for A and ψ fixed,

inf_{C ∈ ℒ_{p,k}} ||A − ψC||² = ||A − ψψ'A||².

(iii) With Q = I − ψψ', Q is a rank n − k orthogonal projection. Show that, for each B ∈ 𝔅_k,

||A − B||² ≥ inf_Q ||QA||² = inf_Q tr QAA' = Σ_{k+1}^p λ_i²

where λ1 ≥ ··· ≥ λp are the singular values of A. Here Q ranges over all rank n − k orthogonal projections.

(iv) Write A = Σ_1^p λ_i u_i v_i' as the singular value decomposition of A. Show that B̂ = Σ_1^k λ_i u_i v_i' achieves the infimum of part (iii).
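The theorem is easy to exercise numerically. An editorial sketch (assuming Python with NumPy; sizes and seed are arbitrary): the truncated SVD attains squared error λ_{k+1}² + ··· + λ_p², and any other rank-k candidate does at least as badly.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 8, 5, 2
A = rng.normal(size=(n, p))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Bhat = (U[:, :k] * s[:k]) @ Vt[:k, :]      # truncated SVD: rank k

err = ((A - Bhat) ** 2).sum()              # ||A - Bhat||^2 = tr (A-Bhat)(A-Bhat)'
print(err, (s[k:] ** 2).sum())             # equals lam_{k+1}^2 + ... + lam_p^2

# An arbitrary competing rank-k matrix is no better:
C = rng.normal(size=(n, p))
Uc, sc, Vct = np.linalg.svd(C, full_matrices=False)
B2 = (Uc[:, :k] * sc[:k]) @ Vct[:k, :]     # some other rank-k matrix
assert ((A - B2) ** 2).sum() >= err - 1e-9
```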
7. In the case of a random sample from a bivariate normal distribution N(μ, Σ), use Proposition 10.8 and Karlin's Lemma in the Appendix to show that the density of W = (n − 2)^{1/2} r(1 − r²)^{-1/2} (r is the sample correlation coefficient) has a monotone likelihood ratio in θ = ρ(1 − ρ²)^{-1/2}. Conclude that the density of r has a monotone likelihood ratio in ρ.
8. Let f_{p,q} denote the density function on (0, ∞) of an unnormalized F_{p,q} random variable. Under the assumptions of Proposition 10.10, show that the distribution of W = r²(1 − r²)^{-1} has a density given by

f(w|ρ) = Σ_{k=0}^∞ f_{r+2k, n−r−1}(w) h(k|ρ)

where

h(k|ρ) = (1 − ρ²)^{(n−1)/2} [Γ((n−1)/2 + k) / (k! Γ((n−1)/2))] (ρ²)^k,   k = 0, 1, ....

Note that h(·|ρ) is the probability mass function of a negative binomial distribution, so f(w|ρ) is a mixture of F distributions. Show that f(·|ρ) has a monotone likelihood ratio.
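That h(·|ρ) is a probability mass function can be confirmed by summing the series. An editorial sketch (assuming Python; the values of n and ρ are arbitrary, and the series is truncated at 500 terms):

```python
import math

def h(k, rho, n):
    # h(k|rho) = (1 - rho^2)^{(n-1)/2} * Gamma((n-1)/2 + k) / (k! Gamma((n-1)/2)) * rho^{2k}
    return ((1 - rho ** 2) ** ((n - 1) / 2)
            * math.exp(math.lgamma((n - 1) / 2 + k)
                       - math.lgamma((n - 1) / 2)
                       - math.lgamma(k + 1))
            * rho ** (2 * k))

n, rho = 10, 0.6
total = sum(h(k, rho, n) for k in range(500))
print(total)   # → 1.0 (up to truncation of the series)
```

The sum is the negative binomial series Σ_k Γ(a + k)/(k! Γ(a)) x^k = (1 − x)^{−a} with a = (n − 1)/2 and x = ρ², which exactly cancels the leading factor.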
9. (A generalization of Proposition 10.12.) Consider the space R^n and an integer k with 1 ≤ k < n. Fix an orthogonal projection P of rank k, and for s ≤ n − k, let 𝒴_s be the set of all n × n orthogonal projections R of rank s that satisfy RP = 0. Also, consider the group 𝒪(P) = {Γ | Γ ∈ 𝒪_n, ΓP = PΓ}.

(i) Show that the group 𝒪(P) acts transitively on 𝒴_s under the action R → ΓRΓ'.

(ii) Argue that there is a unique 𝒪(P)-invariant probability distribution on 𝒴_s.

(iii) Let Γ have a uniform distribution on 𝒪(P) and fix R0 ∈ 𝒴_s. Show that ΓR0Γ' has the unique 𝒪(P)-invariant distribution on 𝒴_s.
10. Suppose Z ∈ ℒ_{p,n} has an 𝒪_n-left invariant distribution and has rank p with probability one. Let Q be a rank n − k orthogonal projection with p + k ≤ n, and form W = QZ.

(i) Show that W has rank p with probability one.

(ii) Show that R = W(W'W)^{-1}W' has the uniform distribution on 𝒴_p (in the notation of Problem 9 above with P = I − Q and s = p).
11. After the proof of Proposition 10.13, it was argued that, when q ≤ r, to find the distribution of r1 ≥ ··· ≥ rq, it suffices to find the distribution of the eigenvalues of the matrix B = (T1 + T2)^{-1/2} T1 (T1 + T2)^{-1/2} where T1 and T2 are independent with ℒ(T1) = W(Iq, q, n − r − 1) and ℒ(T2) = W(Iq, q, r). It is assumed that q ≤ n − r − 1. Let f(·|m) denote the density function of the W(Iq, q, m) distribution (m ≥ q) with respect to Lebesgue measure dS on 𝒮_q. Thus

f(S|m) = ω(m, q) |S|^{(m−q−1)/2} exp[−(1/2) tr S] I(S)

where

I(S) = 1 if S > 0, and I(S) = 0 otherwise.

(i) With W1 = T1 and W2 = T1 + T2, show that the joint density of W1 and W2 with respect to dW1 dW2 is f(W1|n − r − 1) f(W2 − W1|r).

(ii) On the set where W1 > 0 and W2 > 0, define B = W2^{-1/2} W1 W2^{-1/2} and V = W2. Using Proposition 5.11, show that the Jacobian of this transformation is (det V)^{(q+1)/2}. Show that the joint density of B and V on the set where B > 0 and V > 0 is given by

f(V^{1/2}BV^{1/2}|n − r − 1) f(V^{1/2}(I − B)V^{1/2}|r) (det V)^{(q+1)/2}.

(iii) Now, integrate out V to show that the density of B on the set 0 < B < Iq is

[ω(n − r − 1, q) ω(r, q) / ω(n − 1, q)] |B|^{(n−r−q−2)/2} |Iq − B|^{(r−q−1)/2}.
12. Suppose the random orthogonal transformation Γ has a uniform distribution on 𝒪_n. Let A be the upper left-hand k × p block of Γ and assume p ≤ k. Under the additional assumption that p ≤ n − k, the following argument shows that A has a density with respect to Lebesgue measure on ℒ_{p,k}.

(i) Let ψ: n × p consist of the first p columns of Γ so A: k × p has rows that are the first k rows of ψ. Show that ψ has a uniform distribution on ℱ_{p,n}. Conclude that ψ has the same distribution as Z(Z'Z)^{-1/2} where Z: n × p is N(0, In ⊗ Ip).

(ii) Now partition Z as Z = (X; Y) where X is k × p (the first k rows) and Y is (n − k) × p. Show that Z'Z = X'X + Y'Y and that A has the same distribution as X(X'X + Y'Y)^{-1/2}.

(iii) Using (ii) and Problem 11, show that B = A'A has the density

p(B) = [ω(k, p) ω(n − k, p) / ω(n, p)] |B|^{(k−p−1)/2} |Ip − B|^{(n−k−p−1)/2}

with respect to Lebesgue measure on the set 0 < B < Ip.

(iv) Consider a random matrix L: k × p with a density with respect to Lebesgue measure given by

h(L) = c |Ip − L'L|^{(n−k−p−1)/2} I(L'L)

where, for B ∈ 𝒮_p,

I(B) = 1 if 0 < B < Ip, and I(B) = 0 otherwise,

and

c = ω(n − k, p) / ((√(2π))^{kp} ω(n, p)).

Show that B = L'L has the density p(B) given in part (iii) (use Proposition 7.6).

(v) Now, to conclude that A has h as its density, first prove the following proposition: Suppose 𝒳 is acted on measurably by a compact group G and τ: 𝒳 → 𝒴 is a maximal invariant. If P1 and P2 are both G-invariant measures on 𝒳 such that P1(τ^{-1}(C)) = P2(τ^{-1}(C)) for all measurable C ⊆ 𝒴, then P1 = P2.

(vi) Now, apply the proposition above with 𝒳 = ℒ_{p,k}, G = 𝒪_k, τ(x) = x'x, P1 the distribution of A, and P2 the distribution of L as given in (iv). This shows that A has density h.
13. Consider a random matrix Z: n × p with a density given by f(Z|B, Σ) = |Σ|^{-n/2} h(tr(Z − TB)Σ^{-1}(Z − TB)') where T: n × k of rank k is known, B: k × p is a matrix of unknown parameters, and Σ: p × p is positive definite and unknown. Assume that n ≥ p + k, that

sup_{C ∈ 𝒮_p^+} |C|^{n/2} h(tr C) < +∞,

and that h is a nonincreasing function defined on [0, ∞). Partition Z into X: n × q and Y: n × r, q + r = p, so Z = (X Y). Also, partition Σ into Σij, i, j = 1, 2, where Σ11 is q × q, Σ22 is r × r, and Σ12 is q × r.

(i) Show that the maximum likelihood estimator of B is B̂ = (T'T)^{-1}T'Z and f(Z|B̂, Σ) = |Σ|^{-n/2} h(tr SΣ^{-1}) where S = Z'QZ with Q = I − P and P = T(T'T)^{-1}T'.

(ii) Derive the likelihood ratio test of H0: Σ12 = 0 versus H1: Σ12 ≠ 0. Show that the test rejects for small values of

Λ(Z) = |S| / (|S11| |S22|).

(iii) For U: n × q and V: n × r, establish the identity

tr(U V)Σ^{-1}(U V)' = tr(V − UΣ11^{-1}Σ12)Σ22.1^{-1}(V − UΣ11^{-1}Σ12)' + tr UΣ11^{-1}U'.

Use this identity to derive the conditional distribution of Y given X in the above model. Using the notation of Section 10.5, show that the conditional density of Y given X is

f1(Y|C, B1, Σ22.1, X) = |Σ22.1|^{-n/2} h(tr(Y − WC)Σ22.1^{-1}(Y − WC)' + η) φ(η)

where η = tr(X − TB1)Σ11^{-1}(X − TB1)' and

(φ(η))^{-1} = ∫ h(tr uu' + η) du,

the integral being over ℒ_{r,n}.

(iv) The null hypothesis is now that C1 = 0. Show that, for each fixed η, the likelihood ratio test (with C and Σ22.1 as parameters) based on the conditional density rejects for large values of Λ(Z). Verify (i), (ii), and (iii) of Proposition 10.17.

(v) Now, assume that

sup_{η > 0} sup_{C ∈ 𝒮_r^+} |C|^{n/2} h(tr C + η) φ(η) < +∞.

Show that the likelihood ratio test for C1 = 0 (with C, Σ22.1, B1, and Σ11 as parameters) rejects for large values of Λ(Z).

(vi) Show that, under H0, the sample canonical correlations based on S11, S12, S22 (here S = Z'QZ) have the same distribution as when Z is N(TB, In ⊗ Σ). Conclude that under H0, Λ(Z) has the same distribution as when Z is N(TB, In ⊗ Σ).
NOTES AND REFERENCES
1. Canonical correlation analysis was first proposed in Hotelling (1935,
1936). There are as many approaches to canonical correlation analysis as there are books covering the subject. For a sample of these, see
Anderson (1958), Dempster (1969), Kshirsagar (1972), Rao (1973), Mardia, Kent, and Bibby (1979), Srivastava and Khatri (1979), and
Muirhead (1982).
2. See Eaton and Kariya (1981) for some material related to Proposition 10.13.
Appendix
We begin this appendix with a statement and proof of a result due to Basu (1955). Consider a measurable space (𝒳, ℬ1) and a probability model {Pθ | θ ∈ Θ} defined on (𝒳, ℬ1). Consider a statistic T defined on (𝒳, ℬ1) to (𝒴, ℬ2), and let ℬT = {T^{-1}(B) | B ∈ ℬ2}. Thus ℬT is a σ-algebra and ℬT ⊆ ℬ1. Conditional expectation given ℬT is denoted by ℰ(·|ℬT).

Definition A.1. The statistic T is a sufficient statistic for the family {Pθ | θ ∈ Θ} if, for each bounded ℬ1-measurable function f, there exists a ℬT-measurable function f̂ such that ℰ(f|ℬT) = f̂ a.e. Pθ for all θ ∈ Θ.

Note that the null set where the above equality does not hold is allowed to depend on both θ and f. The usual intuitive description of sufficiency is that the conditional distribution of X ∈ 𝒳 (ℒ(X) = Pθ for some θ ∈ Θ) given T(X) = t does not depend on θ. Indeed, if P(·|t) is such a version of the conditional distribution of X given T(X) = t, then f̂ defined by f̂(x) = h(T(x)) where

h(t) = ∫ f(x) P(dx|t)

serves as a version of ℰθ(f|ℬT) for each θ ∈ Θ.
Now, consider a statistic U defined on (𝒳, ℬ1) to (𝒵, ℬ2).

Definition A.2. The statistic U is called an ancillary statistic for the family {Pθ | θ ∈ Θ} if the distribution of U on (𝒵, ℬ2) does not depend on θ ∈ Θ; that is, if for all B ∈ ℬ2,

Pθ(U^{-1}(B)) = Pη(U^{-1}(B))

for all θ, η ∈ Θ.
In many instances, ancillary statistics are functions of maximal invariant statistics in a situation where a group acts transitively on the family of probabilities in question; see Section 7.5 for a discussion.
Finally, given a statistic T on (𝒳, ℬ1) to (𝒴, ℬ2) and the parametric family {Pθ | θ ∈ Θ}, let {Qθ | θ ∈ Θ} be the induced family of distributions of T on (𝒴, ℬ2); that is,

Qθ(B) = Pθ(T^{-1}(B)),   B ∈ ℬ2.
Definition A.3. The family {Qθ | θ ∈ Θ} is called boundedly complete if the only bounded solution to the equation

∫ h(y) Qθ(dy) = 0,   θ ∈ Θ,

is the function h = 0 a.e. Qθ for all θ ∈ Θ.

At times, a statistic T is called boundedly complete; this means that the induced family of distributions of T is boundedly complete according to the above definition. If the family {Qθ | θ ∈ Θ} is an exponential family on a Euclidean space and if Θ contains a nonempty open set, then {Qθ | θ ∈ Θ} is boundedly complete; see Lehmann (1959, page 132).
Theorem (Basu, 1955). If T is a boundedly complete sufficient statistic and if U is an ancillary statistic, then, for each θ, T(X) and U(X) are independent.
Proof. It suffices to show that, for bounded measurable functions h and k on 𝒴 and 𝒵, we have

(A.1) ℰθ h(T(X)) k(U(X)) = ℰθ h(T(X)) ℰθ k(U(X)).

Since U is ancillary, α = ℰθ k(U(X)) does not depend on θ, so ℰθ(k(U) − α) = 0 for all θ. Hence

ℰθ[ℰ((k(U) − α)|ℬT)] = 0 for all θ.

Since T is sufficient, there is a ℬT-measurable function, say f̂, such that ℰ((k(U) − α)|ℬT) = f̂ a.e. Pθ. But since f̂ is ℬT-measurable, we can write f̂(x) = ψ(T(x)) (see Lehmann, 1959, Lemma 1, page 37). Also, since k is bounded, f̂ can be taken to be bounded. Hence

ℰθ ψ(T) = 0 for all θ,

and ψ is bounded. The bounded completeness of T implies that ψ is 0 a.e. Qθ, where Qθ(B) = Pθ(T^{-1}(B)), B ∈ ℬ2. Thus h(T)ψ(T) = 0 a.e. Qθ, so

0 = ℰθ h(T)ψ(T) = ℰθ[h(T) ℰ((k(U) − α)|ℬT)]
  = ℰθ[ℰ(h(T)(k(U) − α)|ℬT)]
  = ℰθ h(T)(k(U) − α).

Thus (A.1) holds. □
This theorem can be used in many of the examples in the text where we have used Proposition 7.19.
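The classical normal-location example illustrates the theorem: for a sample from N(θ, 1), the mean x̄ is a boundedly complete sufficient statistic and the residuals x − x̄ (hence the sample standard deviation) are ancillary, so Basu's theorem gives independence. A Monte Carlo sketch (an editorial illustration, assuming Python with NumPy; seed and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 5, 200_000
X = rng.normal(theta, 1.0, size=(reps, n))

T = X.mean(axis=1)                  # boundedly complete sufficient statistic
U = X.std(axis=1, ddof=1)           # a function of the ancillary residuals X - Xbar

# Independence implies zero correlation and factoring moments:
print(abs(np.corrcoef(T, U)[0, 1]))                        # close to 0
print(abs(np.mean(T * U**2) - T.mean() * np.mean(U**2)))   # close to 0
```

This only samples two consequences of independence; the theorem of course asserts much more.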
The second topic in this Appendix concerns monotone likelihood ratio and its implications. Let 𝒳 and 𝒴 be subsets of the real line.

Definition A.4. A nonnegative function k defined on 𝒳 × 𝒴 is totally positive of order 2 (TP-2) if, for x1 < x2 and y1 < y2, we have

(A.2) k(x1, y1) k(x2, y2) ≥ k(x1, y2) k(x2, y1).
In the case that 𝒴 is a parameter space and k(·, y) is a density with respect to some fixed measure, it is customary to say that k has a monotone likelihood ratio when k is TP-2. This nomenclature arises from the observation that, when k is TP-2 and y1 < y2, the ratio k(·, y2)/k(·, y1) is nondecreasing in x, assuming that k(·, y1) does not vanish. Some obvious examples of TP-2 functions are: exp[xy]; x^y for x > 0; y^x for y > 0. If x = g(s) and y = h(t) where g and h are both increasing or both decreasing, then k1(s, t) = k(g(s), h(t)) is TP-2 whenever k is TP-2. Further, if ψ1(x) ≥ 0, ψ2(y) ≥ 0, and k is TP-2, then k1(x, y) = ψ1(x)ψ2(y)k(x, y) is also TP-2.

The following result, due to Karlin (1956), is of use in verifying that some of the more complicated densities that arise in statistics are TP-2. Here is the setting. Let 𝒳, 𝒴, and 𝒵 be Borel subsets of R^1 and let μ be a σ-finite measure on the Borel subsets of 𝒴.
Lemma (Karlin, 1956). Suppose g is TP-2 on 𝒳 × 𝒴 and h is TP-2 on 𝒴 × 𝒵. If

k(x, z) = ∫ g(x, y) h(y, z) μ(dy)

is finite for all x ∈ 𝒳 and z ∈ 𝒵, then k is TP-2.

Proof. For x1 < x2 and z1 < z2, the difference

Δ = k(x1, z1)k(x2, z2) − k(x1, z2)k(x2, z1)

can be written

Δ = ∫∫ g(x1, y1)g(x2, y2)[h(y1, z1)h(y2, z2) − h(y1, z2)h(y2, z1)] μ(dy1) μ(dy2).

Now, write Δ as the double integral over the set {y1 < y2} plus the double integral over the set {y1 > y2}. In the integral over the set {y1 > y2}, interchange y1 and y2 and then combine with the integral over {y1 < y2}. This yields

Δ = ∫∫_{{y1 < y2}} [g(x1, y1)g(x2, y2) − g(x1, y2)g(x2, y1)] [h(y1, z1)h(y2, z2) − h(y1, z2)h(y2, z1)] μ(dy1) μ(dy2).

On the set {y1 < y2}, both of the bracketed expressions are nonnegative as g and h are TP-2. Hence Δ ≥ 0, so k is TP-2. □
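The lemma applies to any σ-finite μ, in particular to counting measure on a finite grid, which makes it easy to exercise numerically. An editorial sketch (assuming Python with NumPy; the kernels and test points are arbitrary): with g(x, y) = exp(xy) and h(y, z) = exp(yz), both TP-2, the composition against counting measure is again TP-2.

```python
import numpy as np

# k(x, z) = sum_y g(x, y) h(y, z) with g = exp(xy), h = exp(yz),
# the "integral" being a sum over a finite grid of y values.
ys = np.linspace(0.0, 1.0, 101)

def k(x, z):
    return np.exp(x * ys + ys * z).sum()

for (x1, x2) in [(0.0, 1.0), (-1.0, 2.0)]:
    for (z1, z2) in [(0.5, 1.5), (-2.0, 0.0)]:
        det = k(x1, z1) * k(x2, z2) - k(x1, z2) * k(x2, z1)
        assert det >= 0.0      # the TP-2 "2 x 2 determinant" inequality (A.2)
print("TP-2 inequality holds on the test grid")
```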
Here are some examples.

Example A.1. With 𝒳 = (0, ∞), let

f(x, m) = x^{(m/2)−1} exp[−x/2] / (2^{m/2} Γ(m/2))

be the density of a chi-squared distribution with m degrees of freedom. Since x^{m/2}, x ∈ 𝒳 and m > 0, is TP-2, f(x, m) is TP-2. Recall that the density of a noncentral chi-squared distribution with p degrees of freedom and noncentrality parameter λ > 0 is given by

h(x, λ) = Σ_{j=0}^∞ [(λ/2)^j exp[−λ/2] / j!] f(x, p + 2j).

Observe that f(x, p + 2j) is TP-2 in x and j, and (λ/2)^j exp[−λ/2]/j! is TP-2 in j and λ. With 𝒴 = {0, 1, ...} and μ as counting measure, Karlin's Lemma implies that h(x, λ) is TP-2.
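The conclusion of Example A.1 can be spot-checked by evaluating the Poisson-mixture series directly. An editorial sketch (assuming Python; degrees of freedom, grid points, and truncation are arbitrary, and log-gamma is used to avoid overflow in the series):

```python
import math

def chi2_logpdf(x, m):
    # log of the central chi-squared density f(x, m)
    return ((m / 2 - 1) * math.log(x) - x / 2
            - (m / 2) * math.log(2) - math.lgamma(m / 2))

def ncx2_pdf(x, p, lam, terms=200):
    # h(x, lam) = sum_j (lam/2)^j exp(-lam/2) / j! * f(x, p + 2j)
    tot = 0.0
    for j in range(terms):
        logw = -lam / 2 + j * math.log(lam / 2) - math.lgamma(j + 1)
        tot += math.exp(logw + chi2_logpdf(x, p + 2 * j))
    return tot

p = 3
for (x1, x2) in [(1.0, 4.0), (2.0, 10.0)]:
    for (l1, l2) in [(0.5, 2.0), (1.0, 6.0)]:
        det = (ncx2_pdf(x1, p, l1) * ncx2_pdf(x2, p, l2)
               - ncx2_pdf(x1, p, l2) * ncx2_pdf(x2, p, l1))
        assert det >= 0.0          # TP-2 in (x, lam) on this grid
print("h(x, lam) is TP-2 on the test grid")
```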
Example A.2. Recall that, if χ²_m and χ²_n are independent random variables, then Y = χ²_m/χ²_n has a density given by

f(y|m, n) = [Γ((m + n)/2) / (Γ(m/2) Γ(n/2))] y^{(m/2)−1} / (1 + y)^{(m+n)/2}.

If the numerator is noncentral chi-squared, say χ²_p(λ), rather than central chi-squared, then Y = χ²_p(λ)/χ²_n has a density

h(y|λ) = Σ_{j=0}^∞ [(λ/2)^j exp[−λ/2] / j!] f(y|p + 2j, n).

Of course, Y has an unnormalized F(p, n; λ) distribution according to our usage in Section 8.4. Since f(y|p + 2j, n) is TP-2 in y and j, it follows as in Example A.1 that h is TP-2.
The next result yields the useful fact that the noncentral Student's t distribution is TP-2.

Proposition A.1. Suppose f > 0 defined on (0, ∞) satisfies:

(i) ∫_0^∞ e^{ux} f(x) dx < +∞ for u ∈ R^1;

(ii) f(x/θ) is TP-2 on (0, ∞) × (0, ∞).

For θ ∈ R^1 and t ∈ R^1, define k by k(t, θ) = ∫_0^∞ e^{θtx} f(x) dx. Then k is TP-2.

Proof. First consider t ∈ R^1 and θ > 0. Set v = θx in the integral defining k to obtain

k(t, θ) = θ^{-1} ∫_0^∞ e^{tv} f(v/θ) dv.

Now apply Karlin's Lemma to conclude that k is TP-2 on R^1 × (0, ∞). A similar argument shows that k is TP-2 on R^1 × (−∞, 0). Since k(t, 0) is a constant, it is now easy to show that k is TP-2 on R^1 × R^1. □
Example A.3. Suppose X is N(μ, 1) and Y is χ²_n. The random variable T = X/√Y, which is, up to a factor of √n, a noncentral Student's t random variable, has a density that depends on μ, the noncentrality parameter. The density of T (derived by writing down the joint density of X and Y, changing variables to T and W = √Y, and integrating out W) can be written

p(t|μ) = [2^{1−n/2} / (√(2π) Γ(n/2))] exp[−μ²/2] (1 + t²)^{−(n+1)/2} ∫_0^∞ exp[ψ(t)μx] exp[−x²/2] x^n dx

where

ψ(t) = t(1 + t²)^{−1/2}

is an increasing function of t. Consider the function

k(v, μ) = ∫_0^∞ exp[vμx] exp[−x²/2] x^n dx.

With f(x) = exp[−x²/2] x^n, Proposition A.1 shows k, and hence the density of T, is TP-2.
We conclude this appendix with a brief description of the role of TP-2 in one-sided testing problems. Consider a TP-2 density p(x|θ) for x ∈ 𝒳 ⊆ R^1 and θ ∈ Θ ⊆ R^1. Suppose we want to test the null hypothesis H0: θ ∈ (−∞, θ0] ∩ Θ versus H1: θ ∈ (θ0, ∞) ∩ Θ. The following basic result is due to Karlin and Rubin (1956).

Proposition A.2. Given any test φ0 of H0 versus H1, there exists a test φ of the form

φ(x) = 1 if x > x0;  φ(x) = γ if x = x0;  φ(x) = 0 if x < x0,

with 0 ≤ γ ≤ 1, such that ℰθφ ≤ ℰθφ0 for θ ≤ θ0 and ℰθφ ≥ ℰθφ0 for θ > θ0. For any such test φ, ℰθφ is nondecreasing in θ.
Comments on Selected Problems
CHAPTER 1
4. This problem gives the direct sum version of partitioned matrices. For (ii), identify V1 with vectors of the form (v1, 0) ∈ V1 ⊕ V2 and restrict T to these. This restriction is a map from V1 to V1 ⊕ V2, so T(v1, 0) = (z1(v1), z2(v1)) where z1(v1) ∈ V1 and z2(v1) ∈ V2. Show that z1 is a linear transformation on V1 to V1 and z2 is a linear transformation on V1 to V2. This gives A11 and A21. A similar argument gives A12 and A22. Part (iii) is a routine computation.
5. If x_{r+1} = Σ_1^r c_i x_i, then w_{r+1} = Σ_1^r c_i w_i.
8. If u ∈ R^k has coordinates u1, ..., uk, then Au = Σ_1^k u_i x_i and all such vectors are just span{x1, ..., xk}. For (ii), r(A) = r(A') so dim ℛ(A'A) = dim ℛ(AA').
10. The algorithm of projecting x2, ..., xk onto (span{x1})^⊥ is known as Björck's algorithm (Björck, 1967) and is an alternative method of doing Gram–Schmidt. Once you see that y2, ..., yk are perpendicular to y1, this problem is not hard.
11. The assumptions and linearity imply that [Ax, w] = [Bx, w] for all x ∈ V and w ∈ W. Thus [(A − B)x, w] = 0 for all w. Choose w = (A − B)x so (A − B)x = 0.
12. Choose z such that [y1, z] ≠ 0. Then [y1, z]x1 = [y2, z]x2, so set c = [y2, z]/[y1, z]. Thus cx2 □ y1 = x2 □ y2, so cy1 □ x2 = y2 □ x2. Hence c||x2||² y1 = ||x2||² y2, so y1 = c^{-1} y2.
13. This problem shows the topologies generated by inner products are all the same. We know [x, y] = (x, Ay) for some A > 0. Let c1 be the minimum eigenvalue of A, and let c2 be the maximum eigenvalue of A.
14. This is just the Cauchy-Schwarz Inequality.
15. The classical two-way ANOVA table is a consequence of this problem. That A, B1, B2, and B3 are orthogonal projections is a routine but useful calculation. Just keep the notation straight and verify that P² = P = P', which characterizes orthogonal projections.
16. To show that Γ(M^⊥) ⊆ M^⊥, verify that (u, Γv) = 0 for all u ∈ M when v ∈ M^⊥. Use the fact that Γ'Γ = I and u = Γu1 for some u1 ∈ M (since Γ(M) ⊆ M and Γ is nonsingular).
17. Use Cauchy–Schwarz and the fact that P_M x = x for x ∈ M.
18. This is Cauchy–Schwarz for the non-negative definite bilinear form [C, D] = tr ACBD'.
20. Use Proposition 1.36 and the assumption that A is real.
21. The representation αP + β(I − P) is a spectral type representation; see Theorem 1.2a. If M = ℛ(P), let x1, ..., xr, x_{r+1}, ..., xn be any orthonormal basis such that M = span{x1, ..., xr}. Then Ax_i = αx_i, i = 1, ..., r, and Ax_i = βx_i, i = r + 1, ..., n. The characteristic polynomial of A must be (α − λ)^r (β − λ)^{n−r}.
22. Since λ1 = sup_{||x||=1}(x, Ax), μ1 = sup_{||x||=1}(x, Bx), and (x, Ax) ≥ (x, Bx), obviously λ1 ≥ μ1. Now, argue by contradiction: let j be the smallest index such that λj < μj. Consider eigenvectors x1, ..., xn and y1, ..., yn with Ax_i = λ_i x_i and By_i = μ_i y_i, i = 1, ..., n. Let M = span{x_j, x_{j+1}, ..., x_n} and let N = span{y1, ..., y_j}. Since dim M = n − j + 1, dim(M ∩ N) ≥ 1. Using the identities λ_j = sup_{x ∈ M, ||x||=1}(x, Ax) and μ_j = inf_{x ∈ N, ||x||=1}(x, Bx), for any x ∈ M ∩ N with ||x|| = 1 we have (x, Ax) ≤ λ_j < μ_j ≤ (x, Bx), which is a contradiction.

23. Write S = Σ_1^n λ_i x_i □ x_i in spectral form where λ_i > 0, i = 1, ..., n. Then 0 = (S, T) = Σ_1^n λ_i (x_i, Tx_i), which implies (x_i, Tx_i) = 0 for i = 1, ..., n as T ≥ 0. This implies T = 0.
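The eigenvalue comparison of Problem 22 (A ≥ B implies λ_i(A) ≥ λ_i(B) for every i) is easy to test numerically. An editorial sketch (assuming Python with NumPy; the matrices and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
G = rng.normal(size=(n, n)); B = G @ G.T            # B >= 0
H = rng.normal(size=(n, n)); A = B + H @ H.T        # A = B + (PSD), so A >= B

lamA = np.sort(np.linalg.eigvalsh(A))[::-1]         # descending eigenvalues
lamB = np.sort(np.linalg.eigvalsh(B))[::-1]
print(np.all(lamA >= lamB - 1e-10))   # → True: lam_i(A) >= lam_i(B) for all i
```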
24. Since tr A and (A, I) are both linear in A, it suffices to show equality for A's of the form A = x □ y. But (x □ y, I) = (x, y). However, that tr x □ y = (x, y) is easily verified by choosing a coordinate system.
25. Parts (i) and (ii) are easy but (iii) is not. It is false that A² ≥ B², and a 2 × 2 matrix counterexample is not hard to construct. It is true that A^{1/2} ≥ B^{1/2}. To see this, let C = B^{1/2}A^{−1/2}, so by hypothesis, I ≥ C'C. Note that the eigenvalues of C are real and positive, being the same as those of B^{1/4}A^{−1/2}B^{1/4}, which is positive definite. If λ is any eigenvalue of C, there is a corresponding eigenvector, say x, such that ||x|| = 1 and Cx = λx. The relation I ≥ C'C implies λ² ≤ 1, so 0 < λ ≤ 1 as λ is positive. Thus all the eigenvalues of C are in (0, 1], so the same is true of A^{−1/4}B^{1/2}A^{−1/4}. Hence A^{−1/4}B^{1/2}A^{−1/4} ≤ I, so B^{1/2} ≤ A^{1/2}.
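A concrete instance of both claims in Problem 25 (an editorial illustration, assuming Python with NumPy; the particular 2 × 2 matrices are one standard counterexample, not taken from the text):

```python
import numpy as np

def psd_sqrt(M):
    # symmetric PSD square root via the spectral decomposition
    w, v = np.linalg.eigh(M)
    return v @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ v.T

A = np.array([[2.0, 1.0], [1.0, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 0.0]])

assert np.linalg.eigvalsh(A - B).min() >= -1e-12    # A >= B
print(np.linalg.eigvalsh(A @ A - B @ B).min())      # negative: A^2 >= B^2 fails
d = psd_sqrt(A) - psd_sqrt(B)
print(np.linalg.eigvalsh(d).min())                  # >= 0: A^{1/2} >= B^{1/2} holds
```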
26. Since P is an orthogonal projection, all its eigenvalues are zero or one and the multiplicity of one is the rank of P. But tr P is just the sum of the eigenvalues of P.
28. Since any A ∈ ℒ(V, V) can be written as (A + A')/2 + (A − A')/2, it follows that M + N = ℒ(V, V). If A ∈ M ∩ N, then A = A' = −A, so A = 0. Thus ℒ(V, V) is the direct sum of M and N, so dim M + dim N = n². A direct calculation shows that {x_i □ x_j + x_j □ x_i | i ≤ j} ∪ {x_i □ x_j − x_j □ x_i | i < j} is an orthogonal set of vectors, none of which is zero, and hence the set is linearly independent. Since the set has n² elements, it forms a basis for ℒ(V, V). Because x_i □ x_j + x_j □ x_i ∈ M and x_i □ x_j − x_j □ x_i ∈ N, dim M ≥ n(n + 1)/2 and dim N ≥ n(n − 1)/2. Assertions (i), (ii), and (iii) now follow. For (iv), just verify that the map A → (A + A')/2 is idempotent and self-adjoint.
29. Part (i) is a consequence of sup_{||v||=1} ||Av|| = sup_{||v||=1} [Av, Av]^{1/2} = sup_{||v||=1} (v, A'Av)^{1/2} and the spectral theorem. The triangle inequality follows from |||A + B||| = sup_{||v||=1} ||Av + Bv|| ≤ sup_{||v||=1} (||Av|| + ||Bv||) ≤ sup_{||v||=1} ||Av|| + sup_{||v||=1} ||Bv||.
30. This problem is easy, but it is worth some careful thought; it provides more evidence that A ⊗ B has been defined properly and ⟨·, ·⟩ is an appropriate inner product on ℒ(W, V). Assertion (i) is easy since (A ⊗ B)(x_i □ w_j) = (Ax_i) □ (Bw_j) = (λ_i x_i) □ (μ_j w_j) = λ_i μ_j x_i □ w_j. Obviously, x_i □ w_j is an eigenvector of the eigenvalue λ_i μ_j. Part (ii) follows since the two linear transformations agree on the basis {x_i □ w_j | i = 1, ..., m, j = 1, ..., n} for ℒ(W, V). For (iii), if the eigenvalues of A and B are positive, so are the eigenvalues of A ⊗ B. Since the trace of a self-adjoint linear transformation is the sum of the eigenvalues (this is true even without self-adjointness, but the proof requires a bit more than we have established here), we have tr A ⊗ B = Σ_{i,j} λ_i μ_j = (Σ_i λ_i)(Σ_j μ_j) = (tr A)(tr B). Since the determinant is the product of the eigenvalues, det(A ⊗ B) = Π_{i,j} (λ_i μ_j) = (Π_i λ_i)^n (Π_j μ_j)^m = (det A)^n (det B)^m.

31. Since ψ'ψ = I_p, ψ is a linear isometry and its columns form an orthonormal set. Since ℛ(ψ) ⊆ M and the two subspaces have the same dimension, (i) follows. (ii) is immediate.
32. If C is n x k and D is k X n, the set of nonzero eigenvalues of CD is
the same as the set of nonzero eigenvalues of DC.
33. Apply Problem 32.
34. Orthogonal transformations preserve angles.
35. This problem requires that you have a facility in dealing with conditional expectation. If you do, the problem requires a bit of calculation
but not much more. If you don't, proceed to Chapter 2.
CHAPTER 2
1. Write x = Σ c_i x_i so (x, X) = Σ c_i (x_i, X). Thus ℰ|(x, X)| ≤ Σ |c_i| ℰ|(x_i, X)| and ℰ|(x, X)| is finite by assumption. To show that Cov(X) exists, it suffices to verify that var(x, X) exists for each x ∈ V. But var(x, X) = var{Σ c_i (x_i, X)} = Σ_{i,j} cov{c_i(x_i, X), c_j(x_j, X)}. Then var{c_i(x_i, X)} = ℰ[c_i(x_i, X)]² − [ℰ c_i(x_i, X)]², which exists by assumption. The Cauchy–Schwarz Inequality shows that [cov{c_i(x_i, X), c_j(x_j, X)}]² ≤ var{c_i(x_i, X)} var{c_j(x_j, X)}. But var{c_i(x_i, X)} exists by the above argument.
2. All inner products on a finite dimensional vector space are related via
the positive definite quadratic forms. An easy calculation yields the result of this problem.
3. Let (·, ·)_i be an inner product on V_i, i = 1, 2. Since f_i is linear on V_i, f_i(x) = (x_i, x)_i for some x_i ∈ V_i, i = 1, 2. Thus if X1 and X2 are uncorrelated (the choice of inner product is irrelevant by Problem 2), (2.2) holds. Conversely, if (2.2) holds, then Cov((x1, X1)1, (x2, X2)2) = 0 for x_i ∈ V_i, i = 1, 2, since (x1, ·)1 and (x2, ·)2 are linear functions.
4. Let s = n − r, write X for the first r coordinates and X̃ for the last s, and consider Γ1 ∈ 𝒪_r and a Borel set B1 of R^r. Then

Pr{Γ1X ∈ B1} = Pr{Γ1X ∈ B1, X̃ ∈ R^s} = Pr{(X; X̃) ∈ B1 × R^s} = Pr{X ∈ B1}.

The second equality holds since the block-diagonal matrix

( Γ1  0 )
( 0   I_s )

is in 𝒪_n. Thus X has an 𝒪_r-invariant distribution. That X given X̃ has an 𝒪_r-invariant distribution is easy to prove when X has a density with respect to Lebesgue measure on R^n (the density has a version that satisfies f(x) = f(Γx) for x ∈ R^n, Γ ∈ 𝒪_n). The general case requires some fiddling with conditional expectations; this is left to the interested reader.
5. Let Λ_i = Cov(X_i), i = 1, ..., n. It suffices to show that var(x, Σ X_i) = Σ(x, Λ_i x). But (x, X_i), i = 1, ..., n, are uncorrelated, so var[Σ(x, X_i)] = Σ var(x, X_i) = Σ(x, Λ_i x).
6. ℰU = Σ p_i e_i = p. Let U have coordinates U1, ..., Uk. Then Cov(U) = ℰUU' − pp', and UU' is a k × k matrix with elements U_i U_j. For i ≠ j, U_i U_j = 0 and for i = j, U_i U_j = U_i. Since ℰU_i = p_i, ℰUU' = D_p. When 0 < p_i < 1, D_p has rank k and the rank of Cov(U) is the rank of I_k − D_p^{−1/2} pp' D_p^{−1/2}. Let u = D_p^{−1/2} p, so u ∈ R^k has length one. Thus I_k − uu' is a rank k − 1 orthogonal projection. The null space of Cov(U) is span{e} where e is the vector of ones in R^k. The rest is easy.
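The rank and null-space claims above are easy to verify numerically. An editorial sketch (assuming Python with NumPy; the probability vector is arbitrary):

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])          # cell probabilities, 0 < p_i < 1
k = len(p)
Cov = np.diag(p) - np.outer(p, p)      # Cov(U) = D_p - pp'

e = np.ones(k)
print(np.linalg.matrix_rank(Cov))      # → k - 1
print(np.abs(Cov @ e).max())           # ≈ 0: e spans the null space
```

Indeed Cov(U) e = p − p (p'e) = p − p = 0 exactly, since the p_i sum to one.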
7. The random variable X takes on n! values, namely the n! permutations of x, each with probability 1/n!. A direct calculation gives ℰX = x̄e where x̄ = n^{−1} Σ_1^n x_i. The distribution of X is permutation invariant, which implies that Cov(X) has the form σ²A where a_ii = 1 and a_ij = ρ for i ≠ j with −1/(n − 1) ≤ ρ ≤ 1. Since var(e'X) = 0, we see that ρ = −1/(n − 1). Thus σ² = var(X1) = n^{−1} Σ_1^n (x_i − x̄)², where X1 is the first coordinate of X.
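For a small x the n! permutations can be enumerated and the covariance computed exactly, confirming σ² and ρ = −1/(n − 1). An editorial sketch (assuming Python with NumPy; the data vector is arbitrary):

```python
import itertools
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
n = len(x)
perms = np.array(list(itertools.permutations(x)))   # all n! equally likely values

Cov = np.cov(perms, rowvar=False, bias=True)        # exact Cov(X) over the n! points
sigma2 = ((x - x.mean()) ** 2).mean()               # n^{-1} sum (x_i - xbar)^2

print(np.isclose(Cov[0, 0], sigma2))                       # → True
print(np.isclose(Cov[0, 1] / Cov[0, 0], -1 / (n - 1)))     # → True: rho = -1/(n-1)
```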
8. Setting D = −I, ℰX = −ℰX, so ℰX = 0. For i ≠ j, cov{X_i, X_j} = cov{−X_i, X_j} = −cov{X_i, X_j}, so X_i and X_j are uncorrelated. The first equality is obtained by choosing D with d_ii = −1 and d_jj = 1 in the relation ℒ(X) = ℒ(DX).
9. This is a direct calculation.

10. It suffices to verify the equality for A = x □ y, as both sides of the equality are linear in A. For A = x □ y, (A, Σ) = (x, Σy) and (μ, Aμ) = (μ, x)(μ, y), so the equality is obvious.
11. To say Cov(X) = I_n ⊗ Σ is to say that cov{tr AX', tr BX'} = tr AΣB'. To show rows 1 and 2 are uncorrelated, pick A = e1v' and B = e2u' where u, v ∈ R^p. Let X1 and X2 be the first two rows of X. Then tr AX' = v'X1, tr BX' = u'X2, and tr AΣB' = 0. The desired equality is established by first showing that it is valid for A = xy', x, y ∈ R^n, and using linearity. When A = xy', a useful equality is X'AX = Σ_i Σ_j x_i y_j X_i X_j', where the rows of X are X1, ..., Xn.
12. The equation ΓAΓ′ = A for all Γ ∈ O_p implies that A = cI_p for some c.
13. Cov((Γ ⊗ I)X) = Cov(X) implies Cov(X) = I ⊗ Σ for some Σ. Cov((I ⊗ ψ)X) = Cov(X) then implies ψΣψ′ = Σ, which necessitates Σ = cI for some c > 0. Part (ii) is immediate since Γ ⊗ ψ is an orthogonal transformation on (ℒ(V, W), ⟨·,·⟩).
COMMENTS ON SELECTED PROBLEMS
14. This problem is a nasty calculation intended to inspire an appreciation for the equation Cov(X) = I_n ⊗ Σ.
15. Since L(X) = L(−X), EX = 0. Also, L(X) = L(ΓX) implies Cov(X) = cI for some c ≥ 0. But ||X||² = 1 implies c = 1/n. The best affine predictor of X_1 given the remaining coordinates is 0. I would predict X_1 by saying that X_1 is (1 − X̃′X̃)^{1/2} with probability 1/2 and X_1 is −(1 − X̃′X̃)^{1/2} with probability 1/2, where X̃ = (X_2,…, X_n)′.
16. This is just the definition of ⊗.
17. For (i), just calculate. For (ii), Cov(S) = 2I_2 ⊗ I_2 by Proposition 2.23. The coordinate inner product on R³ is not the inner product ⟨·,·⟩ on S_2.
CHAPTER 3
2. Since var(X_1) = var(Y_1) = 1 and cov(X_1, Y_1) = ρ, |ρ| ≤ 1. Form Z = (X Y), an n × 2 matrix. Then Cov(Z) = I_n ⊗ A where

A = [ 1  ρ ]
    [ ρ  1 ].

When |ρ| < 1, A is positive definite, so I_n ⊗ A is positive definite. Conditioning on Y, L(X|Y) = N(ρY, (1 − ρ²)I_n), so L(Q(Y)X|Y) = N(0, (1 − ρ²)Q(Y)) as Q(Y)Y = 0 and Q(Y) is an orthogonal projection. Now apply Proposition 3.8 for Y fixed to get L(W) = (1 − ρ²)χ²_{n−1}.
3. Just do the calculations.
4. Since p(x) is zero in the second and fourth quadrants, X cannot be normal. Just find the marginal density of X_1 to show that X_1 is normal.
5. Write U in the form X′AX where A is symmetric. Then apply Propositions 3.8 and 3.11.
6. Note that Cov(X □ X) = 2I ⊗ I by Proposition 2.23. Since (X, AX) = ⟨X □ X, A⟩, and similarly for (X, BX), 0 = cov{(X, AX), (X, BX)} = cov{⟨X □ X, A⟩, ⟨X □ X, B⟩} = ⟨A, 2(I ⊗ I)B⟩ = 2 tr AB. Thus 0 = tr A^{1/2}BA^{1/2}, so A^{1/2}BA^{1/2} = 0, which shows A^{1/2}B^{1/2} = 0 and hence AB = 0.
7. Since E[exp(itW_j)] = exp[itμ_j − σ_j|t|], E[exp(itΣ_j a_jW_j)] = exp[itΣ_j a_jμ_j − (Σ_j |a_j|σ_j)|t|], so L(Σ_j a_jW_j) = C(Σ_j a_jμ_j, Σ_j |a_j|σ_j). Part (ii) is immediate from (i).
8. For (i), use the independence of R and Z_0 to compute as follows: P{U ≤ u} = P{Z_0 ≤ u/R} = ∫_0^∞ P{Z_0 ≤ u/t}G(dt) = ∫_0^∞ Φ(u/t)G(dt), where Φ is the distribution function of Z_0. Now, differentiate. Part (ii) is clear.
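Differentiating under the integral in Problem 8 gives the density of U as the scale mixture p(u) = ∫ t⁻¹φ(u/t)G(dt), with φ the standard normal density. A small numerical sketch (the two-point mixing distribution G below is an arbitrary illustrative choice) confirms that this mixture density integrates to one:

```python
import numpy as np

def normal_pdf(z):
    # standard normal density phi(z)
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

# Two-point mixing distribution G for the scale R: an arbitrary choice.
ts = np.array([0.5, 2.0])   # support points of G
ws = np.array([0.3, 0.7])   # masses G({t})

def mixture_pdf(u):
    # p(u) = sum_t G({t}) * (1/t) * phi(u/t), the derivative in u of
    # P{U <= u} = sum_t G({t}) * Phi(u/t)
    return sum(w * normal_pdf(u / t) / t for t, w in zip(ts, ws))

u = np.linspace(-20.0, 20.0, 200001)
dens = mixture_pdf(u)
total = np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(u))  # trapezoid rule
print(total)   # close to 1
```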
9. Let B_1 be the sub-σ-algebra induced by T_1(X) = X_2 and let B_2 be the sub-σ-algebra induced by T_2(X) = X_2′X_2. Since B_2 ⊆ B_1, for any bounded function f(X) we have E(f(X)|B_2) = E(E(f(X)|B_1)|B_2). But for f(X) = h(X_2′X_1), the conditional expectation given B_1 can be computed via the conditional distribution of X_2′X_1 given X_2, which is

(3.3) L(X_2′X_1|X_2) = N(X_2′X_2Σ_22^{-1}Σ_21, X_2′X_2 ⊗ Σ_{11·2}).

Hence E(h(X_2′X_1)|B_1) is B_2-measurable, so E(h(X_2′X_1)|B_2) = E(h(X_2′X_1)|B_1). This implies that the conditional distribution (3.3) serves as a version of the conditional distribution of X_2′X_1 given X_2′X_2.
10. Show that T^{-1}T_1: R^n → R^n is an orthogonal transformation, so l(C) = l((T^{-1}T_1)(C)). Setting B = T_1(C), we have P_0(B) = ν_1(B) for Borel B.
11. The measures ν_0 and ν_1 are equal up to a constant, so all that needs to be calculated is ν_0(C)/ν_1(C) for some set C with 0 < ν_1(C) < +∞. Do the calculation for C = {v | [v, v] ≤ 1}.
12. The inner product ⟨·,·⟩ on S_p is not the coordinate inner product. The "Lebesgue measure" on (S_p, ⟨·,·⟩) given by our construction is not l(dS) = Π_{i≤j} ds_ij, but is ν_0(dS) = (√2)^{p(p−1)/2} l(dS).
13. Any matrix M of the form

M = a [ 1  b  …  b ]
      [ b  1  …  b ]
      [ …        … ]
      [ b  …  b  1 ]   (p × p)

can be written as M = a[(p − 1)b + 1]A + a(1 − b)(I − A). This is a spectral decomposition for M, so M has eigenvalues a((p − 1)b + 1) and a(1 − b) (of multiplicity p − 1). Setting α = a[(p − 1)b + 1] and β = a(1 − b) solves (i). Clearly, M^{-1} = α^{-1}A + β^{-1}(I − A) whenever α and β are not zero. To do part (ii), use the parameterization (μ, α, β) given above (a = σ² and b = ρ). Then use the factorization criterion on the likelihood function.
CHAPTER 4
1. Part (i) is clear since Zβ = Σ_i β_i z_i for β ∈ R^k. For (ii), use the singular value decomposition to write Z = Σ_{i=1}^r λ_i x_i u_i′, where r is the rank of Z, {x_1,…, x_r} is an orthonormal set in R^n, {u_1,…, u_r} is an orthonormal set in R^k, M = span{x_1,…, x_r}, and N(Z) = (span{u_1,…, u_r})⊥.
Thus (Z′Z)⁻ = Σ_{i=1}^r λ_i^{-2}u_iu_i′, and a direct calculation shows that Z(Z′Z)⁻Z′ = Σ_{i=1}^r x_ix_i′, which is the orthogonal projection onto M.
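The identity Z(Z′Z)⁻Z′ = Σ x_ix_i′ (the orthogonal projection onto the column space M) is easy to verify numerically; the matrix Z below is an arbitrary full-rank example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
Z = rng.standard_normal((n, k))        # arbitrary full-rank design matrix

# Singular value decomposition Z = sum_i lambda_i x_i u_i'
X, lam, Ut = np.linalg.svd(Z, full_matrices=False)

P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T   # Z (Z'Z)^{-1} Z'
P_svd = X @ X.T                        # sum of x_i x_i' over the r = k left singular vectors

print(np.allclose(P, P_svd))           # the two expressions agree
print(np.allclose(P @ P, P))           # P is idempotent
print(np.allclose(P, P.T))             # and symmetric: an orthogonal projection
```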
2. Since L(X_i) = L(β + ε_i) where Eε_i = 0 and var(ε_i) = 1, it follows that L(X) = L(βe + ε) where Eε = 0 and Cov(ε) = I_n. A direct application of least-squares yields β̂ = X̄ for this linear model. For (iii), since the same β is added to each coordinate of ε, the vector of ordered X's has the same distribution as βe + ν, where ν is the vector of ordered ε's. Thus L(U) = L(βe + ν), so EU = βe + a_0 and Cov(U) = Cov(ν) = Σ. Hence L(U − a_0) = L(βe + (ν − a_0)). Based on this model, the Gauss-Markov estimator for β is β̃ = (e′Σ^{-1}e)^{-1}e′Σ^{-1}(U − a_0). Since β̂ = X̄ = (1/n)e′(U − a_0) (show e′a_0 = 0 using the symmetry of f), it follows from the Gauss-Markov Theorem that var(β̃) ≤ var(β̂).
3. That M − ω = M ∩ ω⊥ is clear since ω ⊆ M. The condition (P_M − P_ω)² = P_M − P_ω follows from observing that P_MP_ω = P_ωP_M = P_ω. Thus P_M − P_ω is an orthogonal projection onto its range. That R(P_M − P_ω) = M − ω is easily verified by writing x ∈ V as x = x_1 + x_2 + x_3 where x_1 ∈ ω, x_2 ∈ M − ω, and x_3 ∈ M⊥. Then (P_M − P_ω)(x_1 + x_2 + x_3) = x_1 + x_2 − x_1 = x_2. Writing P_M = (P_M − P_ω) + P_ω and noting that (P_M − P_ω)P_ω = 0 yields the final identity.
4. That R(A_0) = M_0 is clear. To show R(B_1) = M_1 − M_0, first consider the transformation C defined by (Cy)_ij = ȳ_i·, i = 1,…, I, j = 1,…, J. Then C² = C = C′, and clearly, R(C) ⊆ M_1. But if y ∈ M_1, then Cy = y, so C is the orthogonal projection onto M_1. From Problem 3 (with M = M_1 and ω = M_0), we see that C − A_0 is the orthogonal projection onto M_1 − M_0. But ((C − A_0)y)_ij = ȳ_i· − ȳ··, which is just (B_1y)_ij. Thus B_1 = C − A_0, so R(B_1) = M_1 − M_0. A similar argument shows R(B_2) = M_2 − M_0. For (ii), use the fact that A_0 + B_1 + B_2 + B_3 is the identity and the four orthogonal projections are perpendicular to each other. For (iii), first observe that M = M_1 + M_2 and M_1 ∩ M_2 = M_0. If μ has the assumed representation, let ν be the vector with ν_ij = α + β_i and let ξ be the vector with ξ_ij = γ_j. Then ν ∈ M_1 and ξ ∈ M_2, so μ = ν + ξ ∈ M_1 + M_2. Conversely, suppose μ ∈ M_0 ⊕ (M_1 − M_0) ⊕ (M_2 − M_0), say μ = δ + ν + ξ. Since δ ∈ M_0, δ_ij = δ·· for all i, j, so set α = δ··. Since ν ∈ M_1 − M_0, ν_ij − ν_ik = 0 for all j, k for each fixed i, and ν·· = 0. Take j = 1 and set β_i = ν_i1. Then ν_ij = β_i for j = 1,…, J and, since ν·· = 0, Σ_i β_i = 0. Similarly, setting γ_j = ξ_1j, ξ_ij = γ_j for all i, j, and since ξ·· = 0, Σ_j γ_j = 0. Thus μ_ij = α + β_i + γ_j where Σ_i β_i = Σ_j γ_j = 0.
5. With n = dim V, the density of Y is (up to constants) f(y|μ, σ²) = σ^{-n}exp[−(1/2σ²)||y − μ||²]. Using the results and notation of Problem 3, write V = ω ⊕ (M − ω) ⊕ M⊥, so (M − ω) ⊕ M⊥ = ω⊥. Under H_0, μ ∈ ω, so μ̂_0 = P_ω y is the maximum likelihood estimator of μ and

(4.4) f(y|μ̂_0, σ²) = σ^{-n}exp[−(1/2σ²)||Q_ω y||²],

where Q_ω = I − P_ω. Maximizing (4.4) over σ² yields σ̂_0² = n^{-1}||Q_ω y||². A similar analysis under H_1 shows that the maximum likelihood estimator of μ is μ̂ = P_M y and that σ̂² = n^{-1}||Q_M y||² is the maximum likelihood estimator of σ². Thus the likelihood ratio test rejects for small values of the ratio

Λ(y) = f(y|μ̂_0, σ̂_0²)/f(y|μ̂, σ̂²) = (σ̂²/σ̂_0²)^{n/2} = (||Q_M y||²/||Q_ω y||²)^{n/2}.

But Q_ω = Q_M + P_{M−ω} and Q_MP_{M−ω} = 0, so ||Q_ω y||² = ||Q_M y||² + ||P_{M−ω}y||². Rejecting for small values of Λ(y) is equivalent to rejecting for large values of (Λ(y))^{−2/n} − 1 = ||P_{M−ω}y||²/||Q_M y||². Under H_0, μ ∈ ω, so L(P_{M−ω}Y) = N(0, σ²P_{M−ω}) and L(Q_MY) = N(0, σ²Q_M). Since Q_MP_{M−ω} = 0, Q_MY and P_{M−ω}Y are independent, and L(||P_{M−ω}Y||²) = σ²χ²_r where r = dim M − dim ω. Also, L(||Q_MY||²) = σ²χ²_k where k = n − dim M.
6. We use the notation of Problems 4 and 5. In the parameterization described in (iii) of Problem 4, β_1 = β_2 = ⋯ = β_I iff μ ∈ M_2. Thus ω = M_2, so M − ω = M_1 − M_0. Since M⊥ is the range of B_3 (Problem 1.15), ||B_3y||² = ||Q_My||², and it is clear that ||B_3y||² = Σ_iΣ_j(y_ij − ȳ_i· − ȳ_·j + ȳ··)². Also, since M − ω = M_1 − M_0, P_{M−ω} = P_{M_1} − P_{M_0} and ||P_{M−ω}y||² = ||P_{M_1}y||² − ||P_{M_0}y||² = Σ_i J(ȳ_i· − ȳ··)².
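The decomposition used in Problems 4-6 can be sketched numerically for an I × J layout: the pieces corresponding to A_0y, B_1y, B_2y, B_3y sum back to y, are mutually orthogonal, and ||P_{M_1}y||² − ||P_{M_0}y||² equals ΣJ(ȳ_i· − ȳ··)². The layout below is an arbitrary example (the array names are mine, not the text's):

```python
import numpy as np

I, J = 3, 4
rng = np.random.default_rng(1)
y = rng.standard_normal((I, J))        # arbitrary two-way layout

ybar = y.mean()
yi = y.mean(axis=1, keepdims=True)     # row means  ybar_{i.}
yj = y.mean(axis=0, keepdims=True)     # column means ybar_{.j}

A0y = np.full((I, J), ybar)            # projection onto M0 (grand mean)
B1y = (yi - ybar) * np.ones((I, J))    # row effects: M1 - M0
B2y = (yj - ybar) * np.ones((I, J))    # column effects: M2 - M0
B3y = y - yi - yj + ybar               # residual: projection onto M-perp

ok_identity = np.allclose(A0y + B1y + B2y + B3y, y)   # A0 + B1 + B2 + B3 = I

parts = [A0y, B1y, B2y, B3y]
ok_orthogonal = all(np.isclose(np.sum(a * b), 0.0)
                    for i, a in enumerate(parts) for b in parts[i + 1:])

# ||P_{M1} y||^2 - ||P_{M0} y||^2 = sum_i J (ybar_i. - ybar..)^2
lhs = np.sum((yi * np.ones((I, J))) ** 2) - np.sum(A0y ** 2)
rhs = J * np.sum((yi.flatten() - ybar) ** 2)

print(ok_identity, ok_orthogonal, np.isclose(lhs, rhs))
```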
7. Since R(X′) = R(X′X) and X′y is in the range of X′, there exists a b ∈ R^k such that X′Xb = X′y. Now suppose that b is any solution. First note that P_MX = X since each column of X is in M. Since X′Xb = X′y, we have X′[Xb − P_My] = X′Xb − X′P_My = X′Xb − (P_MX)′y = X′Xb − X′y = 0. Thus the vector v = Xb − P_My is perpendicular to each column of X (X′v = 0), so v ∈ M⊥. But Xb ∈ M and, obviously, P_My ∈ M, so v ∈ M. Hence v = 0, so Xb = P_My.
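A sketch of Problem 7's conclusion: any solution b of the normal equations X′Xb = X′y gives the same fitted vector Xb = P_My, even when X is rank deficient. The design matrix below, with a duplicated column, is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
z = rng.standard_normal((n, 2))
X = np.hstack([z, z[:, :1]])           # rank-deficient: column 3 repeats column 1
y = rng.standard_normal(n)

# Two different solutions of the normal equations X'Xb = X'y:
b1 = np.linalg.pinv(X.T @ X) @ X.T @ y              # minimum-norm solution
b2 = b1 + np.array([1.0, 0.0, -1.0])                # another solution: X b2 = X b1

# Orthogonal projection onto M = column space of X (spanned by z alone).
Q, _ = np.linalg.qr(z)
P_M = Q @ Q.T

print(np.allclose(X.T @ X @ b1, X.T @ y))   # b1 solves the normal equations
print(np.allclose(X.T @ X @ b2, X.T @ y))   # so does b2
print(np.allclose(X @ b1, P_M @ y))         # both yield the fitted vector P_M y
print(np.allclose(X @ b2, X @ b1))
```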
8. Since I is of the form αP_e + βQ_e (take α = β = 1), Gauss-Markov and least-squares agree iff

(4.5) (αP_e + βQ_e)M ⊆ M for all α, β > 0.

But (4.5) is equivalent to the two conditions P_eM ⊆ M and Q_eM ⊆ M. If e ∈ M, then M = span{e} ⊕ M_1, where M_1 ⊆ (span{e})⊥. Thus P_eM = span{e} ⊆ M and Q_eM = M_1 ⊆ M, so Gauss-Markov equals least-squares. If e ∈ M⊥, then M ⊆ (span{e})⊥, so P_eM = {0} and Q_eM = M, so again Gauss-Markov equals least-squares. For (ii), if e ∉ M⊥ and e ∉ M, then one of the two conditions P_eM ⊆ M or Q_eM ⊆ M is violated, so least-squares and Gauss-Markov cannot agree for all α and β. For (iii), since M ⊆ (span{e})⊥ and M ≠ (span{e})⊥, we can write R^n = span{e} ⊕ M ⊕ M_1, where M_1 = (span{e})⊥ − M and M_1 ≠ {0}. Let P_1 be the orthogonal projection onto M_1. Then the exponent in the density for Y is (ignoring the factor −1/2)

(y − μ)′(α^{-1}P_e + β^{-1}Q_e)(y − μ) = (P_ey + P_1y + P_M(y − μ))′(α^{-1}P_e + β^{-1}Q_e)(P_ey + P_1y + P_M(y − μ)) = α^{-1}y′P_ey + β^{-1}y′P_1y + β^{-1}(y − μ)′P_M(y − μ),

where we have used the fact that Q_e = P_1 + P_M and P_1P_M = 0. Since det(αP_e + βQ_e) = αβ^{n−1}, the usual arguments yield μ̂ = P_My, α̂ = y′P_ey, and β̂ = (n − 1)^{-1}y′P_1y as maximum likelihood estimators. When M = span{e}, the maximum likelihood estimators for (α, μ) do not exist, other than the solution μ̂ = P_ey and α̂ = 0 (which is outside the parameter space). The whole point is that when e ∈ M, you must have replications to estimate α when the covariance structure is αP_e + βQ_e.
9. Define the inner product (·,·) on R^n by (x, y) = x′Σ^{-1}y. In the inner product space (R^n, (·,·)), EY = Xβ and Cov(Y) = σ²I. The transformation P defined by the matrix X(X′Σ^{-1}X)^{-1}X′Σ^{-1} satisfies P² = P and is self-adjoint in (R^n, (·,·)). Thus P is an orthogonal projection onto its range, which is easily shown to be the column space of X. The Gauss-Markov Theorem implies that μ̂ = PY as claimed. Since μ = Xβ, X′μ = X′Xβ, so β = (X′X)^{-1}X′μ. Hence β̂ = (X′X)^{-1}X′μ̂, which is just the expression given.
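The argument in Problem 9 can be illustrated numerically: P = X(X′Σ⁻¹X)⁻¹X′Σ⁻¹ is idempotent, self-adjoint in the inner product (x, y) = x′Σ⁻¹y, acts as the identity on the column space of X, and β̂ = (X′X)⁻¹X′μ̂ reproduces the generalized least-squares estimator. The Σ and X below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 7, 2
X = rng.standard_normal((n, k))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)        # arbitrary positive definite covariance
Si = np.linalg.inv(Sigma)
y = rng.standard_normal(n)

P = X @ np.linalg.inv(X.T @ Si @ X) @ X.T @ Si

print(np.allclose(P @ P, P))           # P^2 = P
# self-adjointness in (x, y) = x' Si y holds iff P' Si = Si P
print(np.allclose(P.T @ Si, Si @ P))
print(np.allclose(P @ X, X))           # P is the identity on the column space of X

mu_hat = P @ y
beta_via_mu = np.linalg.inv(X.T @ X) @ X.T @ mu_hat
beta_gls = np.linalg.inv(X.T @ Si @ X) @ X.T @ Si @ y
print(np.allclose(beta_via_mu, beta_gls))
```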
10. For (i), each Γ ∈ O(V) is nonsingular, so Γ(M) ⊆ M is equivalent to Γ(M) = M; hence Γ^{-1}(M) = M and Γ^{-1} = Γ′. Parts (ii) and (iii) are easy. To verify (iv), t_0(cΓy + x_0) = P_M(cΓy + x_0) = cP_MΓy + x_0 = cΓP_My + x_0 = cΓt_0(y) + x_0. The identity P_MΓ = ΓP_M for Γ ∈ O_M(V) was used to obtain the third equality. For (v), first set Γ = I and x_0 = −P_My to obtain

(4.6) t(y) = t(Q_My) + P_My.

Then to calculate t, we need only know t for vectors u ∈ M⊥, as Q_My ∈ M⊥. Fix u ∈ M⊥ and let z = t(u), so z ∈ M by assumption. Then there exists a Γ ∈ O_M(V) such that Γu = u and Γz = −z. For this Γ, we have z = t(u) = t(Γu) = Γt(u) = Γz = −z, so z = 0. Hence t(u) = 0 for all u ∈ M⊥ and the result follows.
11. Part (i) follows by showing directly that the regression subspace M is invariant under each I_n ⊗ A. For (ii), an element of M has the form μ = (Z_1β_1, Z_2β_2) ∈ ℒ_{2,n} for some β_1 ∈ R^k and β_2 ∈ R^k. To obtain an example where M is not invariant under all I_n ⊗ Σ, take k = 1, Z_1 = ε_1, and Z_2 = ε_2, so μ is

μ = [ β_1   0  ]
    [  0   β_2 ]
    [  0    0  ]   (n × 2, remaining rows zero).

That the set of such μ's is not invariant under all I_n ⊗ Σ is easily verified. When Z_1 = Z_2, then μ = Z_1B where B is k × 2 with ith column β_i, i = 1, 2. Thus Example 4.4 applies. For (iii), first observe that Z_1 and Z_2 have the same column space (when they are of full rank) iff Z_2 = Z_1C where C is k × k and nonsingular. Now apply part (ii) with β_2 replaced by Cβ_2, so M is the set of μ's of the form μ = Z_1B where B ∈ ℒ_{2,k}.
CHAPTER 5
1. Let a_1,…, a_p be the columns of A and apply Gram-Schmidt to these vectors in the order a_p, a_{p−1},…, a_1. Now argue as in Proposition 5.2.
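The decomposition underlying Proposition 5.2 (A = ΨU with Ψ′Ψ = I_p and U upper triangular with positive diagonal) can be sketched with a QR routine, flipping signs so the diagonal of U is positive; the matrix A below is an arbitrary full-rank choice:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 6, 3
A = rng.standard_normal((n, p))        # arbitrary full-rank n x p matrix

Q, R = np.linalg.qr(A)                 # A = QR with R upper triangular
signs = np.sign(np.diag(R))            # flip signs so diag(U) > 0
Psi = Q * signs                        # multiply column j of Q by sign_j
U = signs[:, None] * R                 # multiply row j of R by sign_j

print(np.allclose(A, Psi @ U))                 # A = Psi U
print(np.allclose(Psi.T @ Psi, np.eye(p)))     # Psi is a linear isometry
print(np.all(np.diag(U) > 0))                  # U has positive diagonal
print(np.allclose(U, np.triu(U)))              # U is upper triangular
```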
2. Follows easily from the uniqueness of F(S).
3. Just modify the proof of Proposition 5.4.
4. Apply Proposition 5.7.
5. That F is one-to-one and onto follows from Proposition 5.2. Given A ∈ 𝓔, F^{-1}(A) ∈ 𝓕_{p,n} × G_T⁺ is the pair (Ψ, U) where A = ΨU. For (ii), F(ΓΨ, UT′) = ΓΨUT′ = (Γ ⊗ T)(ΨU) = (Γ ⊗ T)(F(Ψ, U)). If F^{-1}(A) = (Ψ, U), then A = ΨU and Ψ and U are unique. Then (Γ ⊗ T)A = ΓAT′ = ΓΨUT′, and ΓΨ ∈ 𝓕_{p,n} and UT′ ∈ G_T⁺. Uniqueness implies that F^{-1}(ΓΨUT′) = (ΓΨ, UT′).
6. When Dg(x_0) exists, it is the unique n × n matrix that satisfies

(5.3) lim_{x→x_0} ||g(x) − g(x_0) − Dg(x_0)(x − x_0)|| / ||x − x_0|| = 0.

But by assumption, (5.3) is satisfied by A (for Dg(x_0)). By definition, Jg(x_0) = det(Dg(x_0)).
7. With t_ii denoting the ith diagonal element of T, the set {T | t_ii > 0} is open since the function T → t_ii is continuous on V to R¹. But G_T⁺ = ∩_i {T | t_ii > 0}, which is open. That g has the given representation is just a matter of doing a little algebra. To establish the fact that lim_{x→0}(||R(x)||/||x||) = 0, we are free to use any norm we want on V and S_p (all norms defined by inner products define the same topology). Using the trace inner product on V and S_p, ||R(x)||² = ||xx′||² = tr xx′xx′ and ||x||² = tr xx′, x ∈ V. But for S ≥ 0, tr S² ≤ (tr S)², so ||R(x)||/||x|| ≤ (tr xx′)^{1/2}, which converges to zero as x → 0. For (iii), write S = L(x), string the S coordinates out as a column vector in the order s_11, s_21, s_22, s_31, s_32, s_33,…, and string the x coordinates out in the same order. Then the matrix of L is lower triangular and its determinant is easily computed by induction. Part (iv) is immediate from Problem 6.
8. Just write out the equations SS^{-1} = I in terms of the blocks and solve.
9. That P² = P is easily checked. Also, some algebra and Problem 8 show that (Pu, v) = (u, Pv), so P is self-adjoint in the inner product (·,·). Thus P is an orthogonal projection on (R^p, (·,·)). Obviously,

R(P) = {x | x = (y, 0), y ∈ R^q}.

Since

Px = (y − Σ_12Σ_22^{-1}z, 0),

||Px||² = (Px, Px) = (y − Σ_12Σ_22^{-1}z)′(Σ_11 − Σ_12Σ_22^{-1}Σ_21)^{-1}(y − Σ_12Σ_22^{-1}z).

A similar calculation yields ||(I − P)x||² = z′Σ_22^{-1}z. For (iii), the exponent in the density of X is −(1/2)(x, x) = −(1/2)||Px||² − (1/2)||(I − P)x||². Marginally, Z is N(0, Σ_22), so the exponent in Z's density is −(1/2)||(I − P)x||². Thus dividing shows that the exponent in the conditional density of Y given Z is −(1/2)||Px||², which corresponds to a normal distribution with mean Σ_12Σ_22^{-1}z and covariance (Σ^{11})^{-1} = Σ_11 − Σ_12Σ_22^{-1}Σ_21.
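The identity (Σ¹¹)⁻¹ = Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁ used at the end of Problem 9 (the Schur complement relation behind Problem 8's block inverse) is easy to confirm; the partitioned Σ below is an arbitrary positive definite example:

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = 5, 2                            # x = (y, z) with y in R^q
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)        # arbitrary positive definite covariance

S11 = Sigma[:q, :q]
S12 = Sigma[:q, q:]
S21 = Sigma[q:, :q]
S22 = Sigma[q:, q:]

Prec = np.linalg.inv(Sigma)
Prec11 = Prec[:q, :q]                  # the (1,1) block Sigma^{11} of Sigma^{-1}

cond_cov = S11 - S12 @ np.linalg.inv(S22) @ S21   # conditional covariance of Y given Z

# (Sigma^{11})^{-1} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
print(np.allclose(np.linalg.inv(Prec11), cond_cov))
```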
10. On G_T⁺, for j < i, t_ij ranges from −∞ to +∞ and each integral contributes √(2π); there are p(p − 1)/2 of these. For j = i, t_ii ranges from 0 to ∞ and the change of variable u_ii = t_ii²/2 shows that the integral over t_ii is (√2)^{r−i−1}Γ((r − i + 1)/2). Hence the integral is equal to

(2π)^{p(p−1)/4} Π_{i=1}^p (√2)^{r−i−1} Γ((r − i + 1)/2),

which is just 2^{-p}c(r, p).
CHAPTER 6
1. Each g ∈ Gl(V) maps a linearly independent set into a linearly independent set. Thus g(M) ⊆ M implies g(M) = M, as g(M) and M have the same dimension. That G(M) is a group is clear. For (ii),

[ g_11  g_12 ] [ y ]
[ g_21  g_22 ] [ 0 ]  ∈ M  for all y ∈ R^q

iff g_21y = 0 for all y ∈ R^q, iff g_21 = 0. But

[ g_11  g_12 ]
[  0    g_22 ]

is nonsingular iff both g_11 and g_22 are nonsingular. That G_1 and G_2 are subgroups of G(M) is obvious. To show G_2 is normal, consider h ∈ G_2 and g ∈ G(M). Then ghg^{-1} has I_r as its (2, 2) element, so it is in G_2. For (iv), that G_1 ∩ G_2 = {I} is clear. Each g ∈ G can be written as

[ g_11  g_12 ]   [ g_11   0   ] [ I_q  g_11^{-1}g_12 ]
[  0    g_22 ] = [  0    g_22 ] [  0        I_r      ],

which has the form g = hk with h ∈ G_1 and k ∈ G_2. The representation is unique as G_1 ∩ G_2 = {I}. Also, g_1g_2 = h_1k_1h_2k_2 = (h_1h_2)(h_2^{-1}k_1h_2)k_2 = h_3k_3 by the uniqueness of the representation.
2. G(M) does not act transitively on V − {0} since the vector (y, 0)′, y ≠ 0, remains in M under the action of each g ∈ G. To show G(M) is transitive on V ∩ M^c, consider

x_i = (y_i, z_i)′, i = 1, 2,

with z_1 ≠ 0 and z_2 ≠ 0. It is easy to argue there is a g ∈ G(M) such that gx_1 = x_2 (since z_1 ≠ 0 and z_2 ≠ 0).
3. Each n × n matrix Γ ∈ O_n can be regarded as an n²-dimensional vector. A sequence {Γ_j} converges to a point x ∈ R^{n²} iff each element of Γ_j converges to the corresponding element of x. It is clear that the limit of a sequence of orthogonal matrices is another orthogonal matrix. To show O_n is a topological group, it must be shown that the map (Γ, ψ) → Γψ is continuous from O_n × O_n to O_n; this is routine. To show χ(Γ) = 1 for all Γ, first observe that H = {χ(Γ) | Γ ∈ O_n} is a subgroup of the multiplicative group (0, ∞) and H is compact as it is the continuous image of a compact set. Suppose r ∈ H and r ≠ 1. Then r^j ∈ H for j = 1, 2,… as H is a group, but {r^j} has no convergent subsequence in H; this contradicts the compactness of H. Hence r = 1.
4. Set x = e^u and ξ(u) = log χ(e^u), u ∈ R¹. Then ξ(u_1 + u_2) = ξ(u_1) + ξ(u_2), so ξ is a continuous homomorphism on R¹ to R¹. It must be shown that ξ(u) = νu for some fixed real ν. This follows from the solution to Problem 6 below in the special case that V = R¹.
5. This problem is easy, but the result is worth noting.
6. Part (i) is easy, and for part (ii) all that needs to be shown is that φ is linear. First observe that

(6.6) φ(v_1 + v_2) = φ(v_1) + φ(v_2),

so it remains to verify that φ(λv) = λφ(v) for λ ∈ R¹. (6.6) implies φ(0) = 0 and φ(nv) = nφ(v) for n = 1, 2,…. Also, φ(−v) = −φ(v) follows from (6.6). Setting w = nv and dividing by n, we have φ(w/n) = (1/n)φ(w) for n = 1, 2,…. Now φ((m/n)v) = mφ((1/n)v) = (m/n)φ(v) and, by continuity, φ(λv) = λφ(v) for λ ≥ 0. The rest is easy.
7. Not hard with the outline given.
8. By the spectral theorem, every rank r orthogonal projection can be written Γx_0Γ′ for some Γ ∈ O_n. Hence transitivity holds. The equation Γx_0Γ′ = x_0 holds for Γ ∈ O_n iff Γ has the form

Γ = [ Γ_11   0   ]
    [  0    Γ_22 ],

and this gives the isotropy subgroup of x_0. For Γ ∈ O_n, Γx_0Γ′ = (Γx_0)(Γx_0)′ and Γx_0 has the form (ψ 0) where ψ is n × r with columns the first r columns of Γ. Thus Γx_0Γ′ = ψψ′. Part (ii) follows by observing that ψ_1ψ_1′ = ψ_2ψ_2′ iff ψ_1 = ψ_2A for some A ∈ O_r.
9. The only difficulty here is (iii). The problem is to show that the only continuous homomorphisms χ on G_2 to (0, ∞) are χ(k) = u^α for some real α. Consider the subgroups G_3 and G_4 of G_2 given by

G_3 = {[ I_{p−1} 0; x′ 1 ] | x ∈ R^{p−1}},  G_4 = {[ I_{p−1} 0; 0 u ] | u ∈ (0, ∞)}.

The group G_3 is isomorphic to R^{p−1}, so the only homomorphisms are x → exp[Σ_i a_ix_i], and G_4 is isomorphic to (0, ∞), so the only homomorphisms are u → u^α for some real α. For k ∈ G_2, write

k = [ I_{p−1} 0; x′ u ] = [ I_{p−1} 0; x′ 1 ][ I_{p−1} 0; 0 u ],

so χ(k) = exp[Σ_i a_ix_i]u^α. Now use the condition χ(k_1k_2) = χ(k_1)χ(k_2) to conclude a_1 = a_2 = ⋯ = a_{p−1} = 0, so χ has the claimed form.
10. Use (6.4) to conclude that

I_γ = 2^p(√2π)^{np}ω(n, p)∫_{G_T⁺} Π_i t_ii^{2γ+n−i} exp[−(1/2)Σ_{i≤j} t_ij²] dT,

and then use Problem 5.10 to evaluate the integral over G_T⁺. You will find that, for 2γ + n > p − 1, the integral is finite and is I_γ = (√2π)^{np}ω(n, p)/ω(2γ + n, p). If 2γ + n ≤ p − 1, the integral diverges.
11. Examples 6.14 and 6.17 give Δ_r for G(M) and all the continuous homomorphisms for G(M). Pick x_0 ∈ R^p ∩ M^c to be

x_0 = (y_0, z_0)′,

where z_0′ = (1, 0,…, 0), z_0 ∈ R^r. Then H_0 consists of those g's with the first column of g_12 being 0 and the first column of g_22 being z_0. To apply Theorem 6.3, all that remains is to calculate the right-hand modulus of H_0, say Δ_0. This is routine given the calculations of Examples 6.14 and 6.17. You will find that the only possible multipliers are χ(g) = |g_11||g_33|, and Lebesgue measure on R^p ∩ M^c is the only (up to a positive constant) invariant measure.
12. Parts (i), (ii), (iii), and (iv) are routine. For (v), J_1(f) = ∫f(x)μ(dx) and J_2(f) = ∫f(T^{-1}(y))ν(dy) are both invariant integrals on 𝒦(𝒳). By Theorem 6.3, J_1 = kJ_2 for some constant k. To find k, take f(x) = (√2π)^{-n}s^n(x)exp[−(1/2)x′x], so J_1(f) = 1. Since s(T^{-1}(y)) = v for y = (u, v, w),

J_2(f) = (√2π)^{-n}∫ v^{n−2} exp[−(1/2)v² − (n/2)u²] du dv ν(dw) = Γ((n − 1)/2)/(2√n(√π)^{n−1}) = 1/k.

For (vi), the expected value of any function of x̄ and s(x), say q(x̄, s(x)), is

E q(x̄, s(x)) = ∫q(x̄, s(x))f(x)s^n(x)μ(dx)
= k∫q(u, v)f(T^{-1}(u, v, w))v^{n−2} du dv ν(dw)
= k∫q(u, v)v^{n−2}h(v² + n(u − μ)²) du dv.

Thus the joint density of x̄ and s(x) is

p(u, v) = kv^{n−2}h(v² + n(u − μ)²) (with respect to du dv).
13. We need to show that, with Y(X) = X/||X||, P{||X|| ∈ B, Y ∈ C} = P{||X|| ∈ B}P{Y ∈ C}. If P{||X|| ∈ B} = 0, the above is obvious. If not, set ν(C) = P{Y ∈ C, ||X|| ∈ B}/P{||X|| ∈ B}, so ν is a probability measure on the Borel sets of {y | ||y|| = 1} ⊆ R^n. But the relation Y(Γx) = ΓY(x) and the O_n-invariance of L(X) imply that ν is an O_n-invariant probability measure and hence is unique (for all Borel B); namely, ν is the uniform probability measure on {y | ||y|| = 1}.
14. Each x ∈ 𝒳 can be uniquely written as gy with g ∈ 𝒫_n and y ∈ 𝒴 (of course, y is the order statistic of x). Define 𝒫_n acting on 𝒫_n × 𝒴 by g(P, y) = (gP, y). Then φ^{-1}(gx) = gφ^{-1}(x). Since P(gx) = gP(x), the argument used in Problem 13 shows that P(X) and Y(X) are independent and P(X) is uniform on 𝒫_n.
CHAPTER 7
1. Apply Propositions 7.5 and 7.6.
2. Write X = ΨU as in Proposition 7.3, so Ψ and U are independent. Then P(X) = ΨΨ′ and S(X) = U′U, and the independence is obvious.
3. First, write Q in the form given in the problem, where M is n × n and nonsingular. Since M is nonsingular, it suffices to show that (M^{-1}(A))^c has measure zero. Write x = (x_1, x_2)′ where x_1 is r × p. It then suffices to show that B^c = {x | x ∈ ℒ_{p,n}, rank(x_1) = p}^c has measure zero. For this, use the argument given in Proposition 7.1.
4. That the φ's are the only equivariant functions follows as in Example 7.6.
5. Part (i) is obvious. For (ii), just observe that knowledge of F_n allows you to write down the order statistic, and conversely.
6. Parts (i) and (ii) are clear. For (iii), write x = Px + Qx. If t is equivariant, t(x + y) = t(x) + y for y ∈ M. This implies that t(Qx) = t(x) − Px (pick y = −Px). Thus t(x) = Px + t(Qx). Since Q = I − P, Qx ∈ M⊥, so BQx = Qx for any B with (B, y) ∈ G. Since t(Qx) ∈ M, pick B such that Bx = −x for x ∈ M. The equivariance of t then gives t(Qx) = t(BQx) = Bt(Qx) = −t(Qx), so t(Qx) = 0.
7. Part (i) is routine, as is the first part of (ii) (use Problem 6). An equivariant estimator of σ² must satisfy t(ax + b) = a²t(x). G acts transitively on the parameter space and on (0, ∞), so Proposition 7.8 and the argument given in Example 7.6 apply.
8. When X has density f(x′x), then Y = XΣ^{1/2} = (I_n ⊗ Σ^{1/2})X has density f(Σ^{-1/2}x′xΣ^{-1/2}), since dx/|x′x|^{n/2} is invariant under x → xA for A ∈ Gl_p. Also, when X has density f, then L((Γ ⊗ Δ)X) = L(X) for all Γ ∈ O_n and Δ ∈ O_p. This implies (see Proposition 2.19) that Cov(X) = cI_n ⊗ I_p for some c > 0. Hence Cov((I_n ⊗ Σ^{1/2})X) = cI_n ⊗ Σ. Part (ii) is clear and (iii) follows from Proposition 7.8 and Example 7.6. For (iv), the definition of C_0 and the assumption on f imply f(ΓC_0Γ′) = f(C_0Γ′Γ) = f(C_0) for each Γ ∈ O_p. The uniqueness of C_0 implies C_0 = aI_p for some a > 0. Thus the maximum likelihood estimator of Σ must be aX′X (see Proposition 7.12 and Example 7.10).
9. If L(X) = P_0, then E(||X||) is the same whenever L(X) ∈ {P | P = gP_0, g ∈ O(V)}, since x → ||x|| is a maximal invariant under the action of O(V) on V. For (ii), E(||X||) depends on μ through ||μ||.
10. Write V = ω ⊕ (M − ω) ⊕ M⊥. Remove a set of Lebesgue measure zero from V and show the F ratio is a maximal invariant under the group action x → aΓx + b, where a > 0, b ∈ ω, and Γ ∈ O(V) satisfies Γ(ω) ⊆ ω and Γ(M − ω) ⊆ M − ω. The group action on the parameter (μ, σ²) is μ → aΓμ + b and σ² → a²σ². A maximal invariant parameter is ||P_{M−ω}μ||²/σ², which is zero when μ ∈ ω.
11. The statistic V is invariant under x_i → Ax_i + b, i = 1,…, n, where b ∈ R^p, A ∈ Gl_p, and det A = 1. The model is invariant under this group action, where the induced group action on (μ, Σ) is μ → Aμ + b and Σ → AΣA′. A direct calculation shows θ = det(Σ) is a maximal invariant under the group action. Hence the distribution of V depends on (μ, Σ) only through θ.
12. For (i), if h ∈ G and B ∈ ℬ, (hP)(B) = P(h^{-1}B) = ∫_G(gQ)(h^{-1}B)μ(dg) = ∫_GQ(g^{-1}h^{-1}B)μ(dg) = ∫_GQ((hg)^{-1}B)μ(dg) = ∫_GQ(g^{-1}B)μ(dg) = P(B), so hP = P for h ∈ G and P is G-invariant. For (ii), let Q be the distribution described in Proposition 7.16 (ii), so if L(X) = P, then L(X) = L(UY) where U is uniform on G and is independent of Y. Thus for any bounded ℬ-measurable function f,

∫f(x)P(dx) = ∫∫f(gy)μ(dg)Q(dy) = ∫∫f(gx)μ(dg)Q(dx).

Set f = I_B and we have P(B) = ∫_GQ(g^{-1}B)μ(dg), so (7.1) holds.
13. For y ∈ 𝒴 and B ∈ ℬ, define R(B|y) by R(B|y) = ∫_G I_B(gy)μ(dg). For each y, R(·|y) is a probability measure on (𝒳, ℬ) and, for fixed B, R(B|·) is (𝒴, 𝒞)-measurable. For P ∈ 𝒫, (ii) of Proposition 7.16 shows that

(7.2) ∫h(x)P(dx) = ∫∫h(gy)μ(dg)Q(dy).

But by definition of R(·|·), ∫_G h(gy)μ(dg) = ∫h(x)R(dx|y), so (7.2) becomes

∫h(x)P(dx) = ∫∫h(x)R(dx|y)Q(dy).

This shows that R(·|y) serves as a version of the conditional distribution of X given T(X). Since R does not depend on P ∈ 𝒫, T(X) is sufficient.
14. For (i), that t(gx) = g ∘ t(x) is clear. Also, X − X̄e = Q_eX, which is N(0, Q_e), so it is ancillary. For (ii), E(f(X_1)|X̄ = t) = E(f(X_1 − X̄ + X̄)|X̄ = t) = E(f(ε_1′Z(X) + X̄)|X̄ = t), since Z(X) has coordinates X_i − X̄, i = 1,…, n. Since Z and X̄ are independent, this last conditional expectation (given X̄ = t) is just the integral over the distribution of Z with X̄ = t. But ε_1′Z(X) = X_1 − X̄ is N(0, δ²), so the claimed integral expression holds. When f(x) = 1 for x ≤ u_0 and 0 otherwise, the integral is just Φ((u_0 − t)/δ), where Φ is the normal cumulative distribution function.
15. Let B be the set (−∞, u_0], so I_B(X_1) is an unbiased estimator of h(a, b) when L(X) = (a, b)P_0. Thus ĥ(t(X)) = E(I_B(X_1)|t(X)) is an unbiased estimator of h(a, b) based on t(X). To compute ĥ, we have E(I_B(X_1)|t(X)) = P{X_1 ≤ u_0|t(X)} = P{(X_1 − X̄)/s ≤ (u_0 − X̄)/s | (s, X̄)}. But (X_1 − X̄)/s = Z_1 is the first coordinate of Z(X), so it is independent of (s, X̄). Thus ĥ(s, X̄) = P{Z_1 ≤ (u_0 − X̄)/s} = F((u_0 − X̄)/s), where F is the distribution function of the first coordinate of Z. To find F, first observe that Z takes values in 𝒮 = {x | x ∈ R^n, x′e = 0, ||x|| = 1} and the compact group O_n(e) acts transitively on 𝒮. Since Z(ΓX) = ΓZ(X) for Γ ∈ O_n(e), it follows that Z has a uniform distribution on 𝒮 (see the argument in Example 7.19). Let U be N(0, I_n), so Z has the same distribution as Q_eU/||Q_eU||, and Z_1 is distributed as ε_1′Q_eU/||Q_eU|| = (Q_eε_1)′U/||Q_eU||. Since ||Q_eε_1||² = (n − 1)/n and Q_eU is N(0, Q_e), it follows that L(Z_1) = L(((n − 1)/n)^{1/2}W_1), where W_1 = U_1/(Σ_{i=1}^{n−1}U_i²)^{1/2}. The rest is a routine computation.
16. Part (i) is obvious and (ii) follows from

(7.3) E(f(X)|T(X) = g) = E(f(T(X)(T(X))^{-1}X)|T(X) = g) = E(f(T(X)Z(X))|T(X) = g).

Since Z(X) and T(X) are independent and T(X) = g, the last member of (7.3) is just the expectation over Z of f(gZ). Part (iii) is just an application, and Q_0 is the uniform distribution on 𝓕_{p,n}. For (iv), let B be a fixed Borel set in R^p and consider the parametric function

h(Σ) = P_Σ{X_1 ∈ B} = ∫I_B(x)(√2π)^{-p}|Σ|^{-1/2}exp[−(1/2)x′Σ^{-1}x]dx,

where X_1 is the first row of X. Since T(X) is a complete sufficient statistic, the MVUE of h(Σ) is

(7.4) ĥ(T) = E(I_B(X_1)|T(X) = T) = P{T(T(X))^{-1}X_1 ∈ B | T(X) = T}.

But Z_1′ = ((T(X))^{-1}X_1)′ is the first row of Z(X), so it is independent of T(X). Hence ĥ(T) = P_1{Z_1 ∈ T^{-1}(B)}, where P_1 is the distribution of Z_1 when Z has a uniform distribution on 𝓕_{p,n}. Since Z_1 is the first p coordinates of a random vector that is uniform on {x | ||x|| = 1, x ∈ R^n}, it follows that Z_1 has a density φ(||u||²) for u ∈ R^p, where φ is given by

φ(v) = c(1 − v)^{(n−p−2)/2}, 0 ≤ v ≤ 1,
φ(v) = 0, otherwise,

where c = Γ(n/2)/(π^{p/2}Γ((n − p)/2)). Therefore

ĥ(T) = ∫_{R^p}I_B(Tu)φ(||u||²)du = (det T)^{-1}∫_{R^p}I_B(u)φ(||T^{-1}u||²)du.

Now let B shrink to the point u_0 to get that (det T)^{-1}φ(||T^{-1}u_0||²) is the MVUE for (√2π)^{-p}|Σ|^{-1/2}exp[−(1/2)u_0′Σ^{-1}u_0].
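The normalizing constant c = Γ(n/2)/(π^{p/2}Γ((n − p)/2)) in Problem 16 can be checked numerically: for p = 1 and n = 5 (arbitrary illustrative values, not from the text) the density c(1 − u²)^{(n−3)/2} on (−1, 1) should integrate to one:

```python
import math
import numpy as np

n, p = 5, 1                            # arbitrary illustrative dimensions with n > p
c = math.gamma(n / 2) / (math.pi ** (p / 2) * math.gamma((n - p) / 2))

u = np.linspace(-1.0, 1.0, 200001)
density = c * (1.0 - u ** 2) ** ((n - p - 2) / 2)

# trapezoid rule; for n = 5, p = 1 the density is c (1 - u^2) with c = 3/4
total = np.sum((density[1:] + density[:-1]) / 2 * np.diff(u))
print(total)   # ≈ 1.0
```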
CHAPTER 8
1. Make a change of variables to r, x_1 = s_11/σ_11, and x_2 = s_22/σ_22, and then integrate out x_1 and x_2. That p(r|ρ) has the claimed form follows by inspection. Karlin's Lemma (see Appendix) implies that p(r|ρ) has a monotone likelihood ratio.
3. For α = 1/2, 1,…, (p − 1)/2, let X_1,…, X_r be i.i.d. N(0, I_p) with r = 2α. Then S = Σ_i X_iX_i′ has φ_α as its characteristic function. For α > (p − 1)/2, the function p_α(s) = k(α)|s|^α exp[−(1/2)tr s] is a density with respect to ds/|s|^{(p+1)/2} on S_p⁺. The characteristic function of p_α is φ_α. To show that φ_α(ΣA) is a characteristic function, let S satisfy E exp(i⟨A, S⟩) = φ_α(A) = |I_p − 2iA|^{-α}. Then Σ^{1/2}SΣ^{1/2} has φ_α(ΣA) as its characteristic function.
4. L(S) = L(ΓSΓ′) implies that Λ = ES satisfies Λ = ΓΛΓ′ for all Γ ∈ O_p. This implies Λ = cI_p for some constant c. Obviously, c = Es_11. For (ii), var(tr DS) = var(Σ_i d_is_ii) = Σ_i d_i² var(s_ii) + Σ_{i≠j} d_id_j cov(s_ii, s_jj). Noting that L(S) = L(ΓSΓ′) for Γ ∈ O_p, and in particular for permutation matrices, it follows that γ = var(s_ii) does not depend on i and β = cov(s_ii, s_jj) does not depend on i and j (i ≠ j). Thus var⟨D, S⟩ = γΣ_i d_i² + βΣ_{i≠j} d_id_j = (γ − β)Σ_i d_i² + β(Σ_i d_i)². For (iii), write A ∈ S_p as ΓDΓ′, so var⟨A, S⟩ = var⟨ΓDΓ′, S⟩ = var⟨D, Γ′SΓ⟩ = var⟨D, S⟩ = (γ − β)Σ_i d_i² + β(Σ_i d_i)² = (γ − β)⟨A, A⟩ + β⟨I, A⟩². With T = (γ − β)I_p ⊗ I_p + βI_p □ I_p, it follows that var⟨A, S⟩ = ⟨A, TA⟩, and since T is self-adjoint, this implies that Cov(S) = T.
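Problem 4's identity var⟨A, S⟩ = (γ − β)⟨A, A⟩ + β⟨I, A⟩² can be checked against a case where second moments are known in closed form: for S ~ W(I_p, p, n), the standard fourth-moment formula cov(s_ij, s_kl) = n(δ_ikδ_jl + δ_ilδ_jk) (an outside fact, not derived in the text) gives γ = 2n and β = 0:

```python
import numpy as np

p, n = 4, 7                            # arbitrary dimension and degrees of freedom
d = np.eye(p)
# cov(s_ij, s_kl) = n (d_ik d_jl + d_il d_jk) for S ~ W(I_p, p, n)
cov = n * (np.einsum('ik,jl->ijkl', d, d) + np.einsum('il,jk->ijkl', d, d))
gamma, beta = 2.0 * n, 0.0             # gamma = var(s_ii), beta = cov(s_ii, s_jj)

rng = np.random.default_rng(6)
B = rng.standard_normal((p, p))
A = (B + B.T) / 2                      # arbitrary symmetric test matrix

var_AS = np.einsum('ij,ijkl,kl->', A, cov, A)       # var <A, S>
formula = (gamma - beta) * np.trace(A @ A) + beta * np.trace(A) ** 2
print(np.isclose(var_AS, formula))
```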
5. Use Proposition 7.6.
6. Immediate from Problem 3.
7. For (i), it suffices to show that L((ASA′)^{-1}) = W((AΛA′)^{-1}, r, ν + r − 1). Since L(S^{-1}) = W(Λ^{-1}, p, ν + p − 1), Proposition 8.9 implies the desired result. (ii) follows immediately from (i). For (iii), (i) implies that S̃ = Λ^{-1/2}SΛ^{-1/2} is IW(I_p, p, ν) and L(S̃) = L(ΓS̃Γ′) for all Γ ∈ O_p. Now apply Problem 4 to conclude that ES̃ = cI_p where c = Es̃_11. That c = (ν − 2)^{-1} is an easy application of (i). Hence (ν − 2)^{-1}I_p = ES̃ = Λ^{-1/2}(ES)Λ^{-1/2}, so ES = (ν − 2)^{-1}Λ. Also, Cov(S̃) = (γ − β)I_p ⊗ I_p + βI_p □ I_p as in Problem 4. Thus Cov(S) = (Λ^{1/2} ⊗ Λ^{1/2})(Cov(S̃))(Λ^{1/2} ⊗ Λ^{1/2}) = (γ − β)Λ ⊗ Λ + βΛ □ Λ. For (iv), that L(S_11) = IW(Λ_11, q, ν) follows by taking A = (I_q 0) in part (i). To show L((S_{22·1})^{-1}) = W((Λ_{22·1})^{-1}, r, ν + q + r − 1), use Proposition 8.8 on S^{-1}, which is W(Λ^{-1}, p, ν + p − 1).
8. For (i), let p_1(x)p_2(s) denote the joint density of X and S with respect to the measure dx ds/|s|^{(p+1)/2}. Setting T = XS^{-1/2} and V = S, the joint density of T and V is p_1(tv^{1/2})p_2(v)|v|^{r/2} with respect to dt dv/|v|^{(p+1)/2}; the Jacobian of x → tv^{1/2} is |v|^{r/2} (see Proposition 5.10). Now, integrate out v to get the claimed density. That ℒ(T) = ℒ(ΓTΔ′) is clear from the form of the density (also from (ii) below). Use Proposition 2.19 to show Cov(T) = cI_r ⊗ I_p. Part (ii) follows by integrating out v from the conditional density of T to obtain the marginal density of T as given in (i). For (iii), represent T as: T given V is N(0, I_r ⊗ V) where V is IW(I_p, p, ν). Thus T_1 given V is N(0, I_r ⊗ V_{11}) where V_{11} is the q × q upper left-hand corner of V. Since ℒ(V_{11}) = IW(I_q, q, ν), the claimed result follows from (ii).
9. With V = S_2^{-1/2}S_1S_2^{-1/2} and S = S_2^{-1}, the conditional distribution of V given S is W(S, p, m) and ℒ(S) = IW(I_p, p, ν). Since V is then unconditionally F(m, ν, I_p), (i) follows. For (ii), ℒ(T) = T(ν, I_r, I_p) means that ℒ(T) = ℒ(XS^{1/2}) where ℒ(X) = N(0, I_r ⊗ I_p) and ℒ(S) = IW(I_p, p, ν). Thus ℒ(T′T) = ℒ(S^{1/2}X′XS^{1/2}). Since ℒ(X′X) = W(I_p, p, r), (ii) follows by definition of F(r, ν, I_p). For (iii), write F = T′T where ℒ(T) = T(ν, I_r, I_p), which has the density given in (i) of Problem 8. Since r ≥ p, Proposition 7.6 is directly applicable to yield the density of F. To establish (iv), first note that ℒ(F) = ℒ(ΓFΓ′)
for all Γ ∈ O_p. Using Example 7.16, F has the same distribution as ΨDΨ′ where Ψ is uniform on O_p and is independent of the diagonal matrix D whose diagonal elements λ_1 ≥ ⋯ ≥ λ_p are distributed as the eigenvalues of F. Thus λ_1, ..., λ_p are distributed as the eigenvalues of S_2^{-1}S_1 where S_1 is W(I_p, p, r) and S_2^{-1} is IW(I_p, p, ν). Hence ℒ(F^{-1}) = ℒ(ΨD^{-1}Ψ′) = ℒ(ΨD̃Ψ′) where the diagonal elements of D̃, say λ̃_1 ≥ ⋯ ≥ λ̃_p, are the eigenvalues of S_1^{-1}S_2. Since S_2 is W(I_p, p, ν + p − 1), it follows that ΨD̃Ψ′ has the same distribution as an F(ν + p − 1, r − p + 1, I_p) matrix by just repeating the orthogonal invariance argument given above. (v) is established by writing F = T′T as in (ii) and partitioning T as T_1: r × q and T_2: r × (p − q) so
T′T = (T_1′T_1  T_1′T_2; T_2′T_1  T_2′T_2).
Since ℒ(T_1) = T(ν, I_r, I_q) and F_{11} = T_1′T_1, (ii) implies that ℒ(F_{11}) = F(r, ν, I_q). (vi) can be established by deriving the density of XS^{-1}X′ directly and using (iii), but an alternative argument is more instructive. First, apply Proposition 7.4 to X′ and write X = V^{1/2}Ψ′ where V = XX′ is W(I_r, r, p) and is independent of Ψ: p × r, which is uniform on F_{r,p}. Then XS^{-1}X′ = V^{1/2}W^{-1}V^{1/2} where W = (Ψ′S^{-1}Ψ)^{-1} and is independent of V. Proposition 8.1 implies that ℒ(W) = W(I_r, r, m − p + r). Thus ℒ(W^{-1}) = IW(I_r, r, m − p + 1). Now, use the orthogonal invariance of the distribution of XS^{-1}X′ to conclude that ℒ(XS^{-1}X′) = ℒ(ΓDΓ′) where Γ and D are independent, Γ is uniform on O_r, and the diagonal elements of D are distributed as the ordered eigenvalues of W^{-1}V. As in the proof of (iv), conclude that
ℒ(ΓDΓ′) = F(p, m − p + 1, I_r).
10. The function S → S^{1/2} on S_p^+ to S_p^+ satisfies (ΓSΓ′)^{1/2} = ΓS^{1/2}Γ′ for Γ ∈ O_p. With B(S_1, S_2) = (S_1 + S_2)^{-1/2}S_1(S_1 + S_2)^{-1/2}, it follows that B(ΓS_1Γ′, ΓS_2Γ′) = ΓB(S_1, S_2)Γ′. Since ℒ(ΓS_iΓ′) = ℒ(S_i), i = 1, 2, and S_1 and S_2 are independent, the above implies that ℒ(B) = ℒ(ΓBΓ′) for Γ ∈ O_p. The rest of (i) is clear from Example 7.16. For (ii), let B_1 = S_1^{1/2}(S_1 + S_2)^{-1}S_1^{1/2} so ℒ(B_1) = ℒ(ΓB_1Γ′) for Γ ∈ O_p. Thus ℒ(B_1) = ℒ(ΨDΨ′) where Ψ and D are independent and Ψ is uniform on O_p. Also, the diagonal elements of D, say λ_1 ≥ ⋯ ≥ λ_p ≥ 0, are distributed as the ordered eigenvalues of S_1(S_1 + S_2)^{-1} so B_1 is B(m_1, m_2, I_p). (iii) is easy using (i) and (ii) and the fact that F(I + F)^{-1} is symmetric. For (iv), let B = X(S + X′X)^{-1}X′ and observe that ℒ(B) = ℒ(ΓBΓ′), Γ ∈ O_r. Since m ≥ p, S^{-1} exists so B = XS^{-1/2}(I_p + S^{-1/2}X′XS^{-1/2})^{-1}S^{-1/2}X′. Hence T = XS^{-1/2} is T(m − p + 1, I_r, I_p). Thus ℒ(B) = ℒ(ΨDΨ′) where Ψ is uniform on O_r and
is independent of D. The diagonal elements of D, say λ_1, ..., λ_r, are the eigenvalues of T(I_p + T′T)^{-1}T′. These are the same as the eigenvalues of TT′(I_r + TT′)^{-1} (use the singular value decomposition for T). But ℒ(TT′) = ℒ(XS^{-1}X′) = F(p, m − p + 1, I_r) by Problem 9 (vi). Now use (iii) above and the orthogonal invariance of ℒ(B). (v) is trivial.
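The eigenvalue claim used in (iv), that T(I_p + T′T)^{-1}T′ and TT′(I_r + TT′)^{-1} have the same eigenvalues, can be checked numerically. A NumPy sketch with arbitrary illustrative dimensions (no distributional assumptions; in fact the push-through identity makes the two matrices equal):

```python
import numpy as np

rng = np.random.default_rng(1)
r, p = 3, 5
T = rng.standard_normal((r, p))

B1 = T @ np.linalg.inv(np.eye(p) + T.T @ T) @ T.T   # T (I_p + T'T)^{-1} T'
B2 = T @ T.T @ np.linalg.inv(np.eye(r) + T @ T.T)   # TT' (I_r + TT')^{-1}

# both are r x r; symmetrize before eigvalsh to absorb rounding
ev1 = np.sort(np.linalg.eigvalsh((B1 + B1.T) / 2))
ev2 = np.sort(np.linalg.eigvalsh((B2 + B2.T) / 2))
```

Since (I_r + TT′)^{-1}T = T(I_p + T′T)^{-1}, the two expressions coincide as matrices, not just in spectrum.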
CHAPTER 9
1. Let B have rows β_1′, ..., β_k′ and form X in the usual way (see Example 4.3) so μ = ZB with an appropriate Z: n × k. Let R: 1 × k have entries a_1, ..., a_k. Then RB = Σ_i a_iβ_i′ and H_0 holds iff RB = 0. Now apply the results in Section 9.1.
2. For (i), just do the algebra. For (ii), apply (i) with S_1 = (Y − XB̂)′(Y − XB̂) and S_2 = (X(B − B̂))′(X(B − B̂)), so ψ(S_1) ≤ ψ(S_1 + S_2) for every B. Since A > 0, tr A(S_1 + S_2) = tr AS_1 + tr AS_2 ≥ tr AS_1 since tr AS_2 ≥ 0 as S_2 ≥ 0. To show det(A + S) is nondecreasing in S ≥ 0, first note that A + S_1 ≤ A + S_1 + S_2 in the sense of positive definiteness as S_2 ≥ 0. Thus the ordered eigenvalues of A + S_1 + S_2, say λ_1, ..., λ_p, satisfy λ_i ≥ μ_i, i = 1, ..., p, where μ_1, ..., μ_p are the ordered eigenvalues of A + S_1. Thus det(A + S_1 + S_2) ≥ det(A + S_1). This same argument solves (iv).
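The determinant monotonicity in Problem 2 rests on the eigenvalue inequality λ_i ≥ μ_i; a small NumPy check (with an arbitrary A > 0 and arbitrary positive semidefinite S_1, S_2) illustrates both steps:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = np.eye(p)                                      # any A > 0
G1 = rng.standard_normal((p, p)); S1 = G1 @ G1.T   # S1 >= 0
G2 = rng.standard_normal((p, 2)); S2 = G2 @ G2.T   # S2 >= 0 (rank 2)

lam = np.sort(np.linalg.eigvalsh(A + S1 + S2))     # ordered eigenvalues lambda_i
mu = np.sort(np.linalg.eigvalsh(A + S1))           # ordered eigenvalues mu_i

eig_mono = bool(np.all(lam >= mu - 1e-9))          # lambda_i >= mu_i for each i
det_mono = np.linalg.det(A + S1 + S2) >= np.linalg.det(A + S1)
```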
3. Since ℒ(EΨ′A′) = ℒ(EA′) for Ψ ∈ O_p, the distribution of EA′ depends only on a maximal invariant under the action A → AΨ of O_p on Gl_p. This maximal invariant is AA′. (ii) is clear and (iii) follows since the reduction to canonical form is achieved via an orthogonal transformation Ỹ = ΓY where Γ ∈ O_n. Thus Ỹ = Γμ + ΓEA′. Γ is chosen so Γμ has the claimed form and H_0 is B_1 = 0. Setting Ẽ = ΓE, the model has the claimed form and ℒ(Ẽ) = ℒ(E) by assumption. The arguments given in Section 9.1 show that the testing problem is invariant and a maximal invariant is the vector of the t largest eigenvalues of Y_1(Y_3′Y_3)^{-1}Y_1′. Under H_0, Y_1 = E_1A′ and Y_3 = E_3A′, so Y_1(Y_3′Y_3)^{-1}Y_1′ = E_1(E_3′E_3)^{-1}E_1′ ≡ W. When ℒ(ΓE) = ℒ(E) for all Γ ∈ O_n, write E = ΨU according to Proposition 7.3 where Ψ and U are independent and Ψ is uniform on F_{p,n}. Partitioning Ψ as E is partitioned, E_i = Ψ_iU, i = 1, 2, 3, so W = Ψ_1U((Ψ_3U)′Ψ_3U)^{-1}U′Ψ_1′ = Ψ_1(Ψ_3′Ψ_3)^{-1}Ψ_1′. The rest is obvious as the distribution of W depends only on the distribution of Ψ.
4. Use the independence of Y_1 and Y_3 and the fact that E(Y_3′Y_3)^{-1} = (m − p − 1)^{-1}Σ^{-1}.
5. Let Γ ∈ O_2 be given by
and set Ỹ = YΓ. Then ℒ(Ỹ) = N(ZBΓ, I_n ⊗ Γ′ΣΓ). Now, let BΓ have columns β̃_1 and β̃_2. Then H_0 is that β̃_1 = 0. Also Γ′ΣΓ is diagonal with unknown diagonal elements. The results of Section 9.2 apply directly to yield the likelihood ratio test. A standard invariance argument shows the test is UMP invariant.
6. For (i), look at the i, j elements of the equation for Y. To show M_2 ⊥ M_3, compute as follows: ⟨αu_2′, u_1β′⟩ = tr αu_2′βu_1′ = (u_2′β)(u_1′α) = 0 from the side conditions on α and β. The remaining relations M_1 ⊥ M_2 and M_1 ⊥ M_3 are verified similarly. For (iii) consider
(I_m ⊗ A)(μu_1u_2′ + αu_2′ + u_1β′) = μu_1(Au_2)′ + α(Au_2)′ + u_1(Aβ)′ = γμu_1u_2′ + γαu_2′ + δu_1β′ ∈ M,
where the relations Pu_2 = u_2 and Qβ = β when u_2′β = 0 have been used. This shows that M is invariant under each I_m ⊗ A. It is now readily verified that μ̂ = Ȳ.., α̂_i = Ȳ_i. − Ȳ.., and β̂_j = Ȳ.j − Ȳ... For (iv), first note that the subspace ω = {x | x ∈ M, α = 0} defined by H_0 is invariant under each I_m ⊗ A. Obviously, ω = M_1 ⊕ M_3. Consider the group whose elements are g = (c, E, b) where c is a positive scalar, b ∈ M_1 ⊕ M_3, and E is an orthogonal transformation with invariant subspaces M_2, M_1 ⊕ M_3, and M^⊥. The testing problem is invariant under x → cEx + b and a maximal invariant is W (up to a set of measure zero). Since W has a noncentral F-distribution, the test that rejects for large values of W is UMP invariant.
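The orthogonality relations and closed-form estimators in Problem 6 can be illustrated numerically. The sketch below (NumPy, with an arbitrary data matrix; the estimators automatically satisfy the side conditions Σ_i α̂_i = Σ_j β̂_j = 0) checks that the three fitted pieces are mutually orthogonal:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 6
Y = rng.standard_normal((m, n))

mu_hat = Y.mean()
alpha_hat = Y.mean(axis=1) - mu_hat     # row effects; sum to 0
beta_hat = Y.mean(axis=0) - mu_hat      # column effects; sum to 0

u1, u2 = np.ones(m), np.ones(n)
M1 = mu_hat * np.outer(u1, u2)          # mu u1 u2'
M2 = np.outer(alpha_hat, u2)            # alpha u2'
M3 = np.outer(u1, beta_hat)             # u1 beta'

ip23 = float(np.sum(M2 * M3))           # <alpha u2', u1 beta'>
ip12 = float(np.sum(M1 * M2))
ip13 = float(np.sum(M1 * M3))
```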
7. (i) is clear. The column space of W is contained in the column space of Z and has dimension r. Let x_1, ..., x_r, x_{r+1}, ..., x_k, x_{k+1}, ..., x_n be an orthonormal basis for R^n such that span{x_1, ..., x_r} = column space of W and span{x_1, ..., x_k} = column space of Z. Also, let y_1, ..., y_p be any orthonormal basis for R^p. Then {x_i □ y_j | i = 1, ..., r, j = 1, ..., p} is a basis for ℛ(P_W ⊗ I_p), which has dimension rp. Obviously, ℛ(P_W ⊗ I_p) ⊆ M. Consider x ∈ ω so x = ZB with RB = 0. Thus (P_W ⊗ I_p)x = P_WZB = W(W′W)^{-1}W′ZB = W(W′W)^{-1}R(Z′Z)^{-1}(Z′Z)B = W(W′W)^{-1}RB = 0. Thus ℛ(P_W ⊗ I_p) ⊥ ω, which implies ℛ(P_W ⊗ I_p) ⊆ ω^⊥. Hence ℛ(P_W ⊗ I_p) ⊆ M ∩ ω^⊥. That dim ω = (k − r)p can be shown by a reduction to canonical form as was done in Section 9.1. Since ω ⊆ M, dim(M ∩ ω^⊥) = dim M − dim ω = rp, which entails ℛ(P_W ⊗ I_p) = M ∩ ω^⊥. Hence P_Z ⊗ I_p − P_W ⊗ I_p is the orthogonal projection onto ω.
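Problem 7's key computations, that P_W annihilates ω while P_Z − P_W is a projection fixing ω, can be checked numerically. A NumPy sketch (dimensions are illustrative, and B is constructed to satisfy RB = 0):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, r, p = 10, 5, 2, 3
Z = rng.standard_normal((n, k))
R = rng.standard_normal((r, k))

# B with RB = 0: remove the component of B in the row space of R
B0 = rng.standard_normal((k, p))
B = B0 - R.T @ np.linalg.solve(R @ R.T, R @ B0)

W = Z @ np.linalg.solve(Z.T @ Z, R.T)           # W = Z (Z'Z)^{-1} R'
PW = W @ np.linalg.solve(W.T @ W, W.T)          # projection onto col(W)
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)          # projection onto col(Z)

killed = np.allclose(PW @ Z @ B, 0.0)           # P_W Z B = 0 when RB = 0
proj = PZ - PW
idem = np.allclose(proj @ proj, proj)           # P_Z - P_W is a projection
fixes = np.allclose(proj @ Z @ B, Z @ B)        # and it leaves omega fixed
```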
8. Use the fact that Γ′ΣΓ is diagonal with diagonal entries a_1, a_2, a_3, a_3, a_2 (see Proposition 9.13 ff.) so the maximum likelihood estimators â_1, â_2,
and â_3 are easy to find: just transform the data by Γ. Let D̂ have diagonal entries â_1, â_2, â_3, â_3, â_2 so Σ̂ = ΓD̂Γ′ gives the maximum likelihood estimators of Σ, ρ_1, and ρ_2.
9. Do the problem in the complex domain first to show that if Z_1, ..., Z_n are i.i.d. CN(0, 2H), then Ĥ = (1/2n) Σ_j Z_jZ̄_j′. But if Z_j = U_j + iV_j and X_j is the real 2p-vector formed from U_j and V_j, then
Ĥ = (1/2n) Σ_j (U_j + iV_j)(U_j − iV_j)′ = (1/2n)[(S_{11} + S_{22}) + i(S_{21} − S_{12})].
This gives the desired result.
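The block identity used in Problem 9, Σ_j Z_jZ̄_j′ = (S_{11} + S_{22}) + i(S_{21} − S_{12}), can be verified numerically. A NumPy sketch with arbitrary real parts U and V (columns are the observations):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 3, 20
U = rng.standard_normal((p, n))
V = rng.standard_normal((p, n))
Z = U + 1j * V

H_hat = (Z @ Z.conj().T) / (2 * n)              # (1/2n) sum_j Z_j Zbar_j'

S11 = U @ U.T; S22 = V @ V.T                    # blocks of sum_j X_j X_j'
S12 = U @ V.T; S21 = V @ U.T
H_blocks = ((S11 + S22) + 1j * (S21 - S12)) / (2 * n)

same = np.allclose(H_hat, H_blocks)
```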
10. Write R = M(I_r 0)Γ where M is r × r of rank r and Γ ∈ O_k. With Θ = ΓB, the null hypothesis is (I_r 0)Θ = 0. Now, transform the data by Γ and proceed with the analysis as in the first testing problem considered in Section 9.6.
11. First write P_Z = P_1 + P_2 where P_1 is the orthogonal projection onto span{e} and P_2 is the orthogonal projection onto (column space of Z) ∩ (span{e})^⊥. Thus P_M = P_1 ⊗ I_p + P_2 ⊗ I_p. Also, write A(ρ) = γP_1 + δQ_1 where γ = 1 + (n − 1)ρ, δ = 1 − ρ, and Q_1 = I_n − P_1. The relations P_1P_2 = 0 = Q_1P_1 and P_2Q_1 = Q_1P_2 = P_2 show that M is invariant under A(ρ) ⊗ Σ for each value of ρ and Σ. Write ZB = eb_1′ + Σ_{j=2}^k z_jb_j′ so Q_1Y is N(Σ_{j=2}^k (Q_1z_j)b_j′, (Q_1A(ρ)Q_1) ⊗ Σ). Now, Q_1A(ρ)Q_1 = δQ_1 so Q_1Y is N(Σ_{j=2}^k (Q_1z_j)b_j′, δQ_1 ⊗ Σ). Also, P_1Y is N(eb_1′, γP_1 ⊗ Σ). Since hypotheses of the form RB = 0 involve only b_2, ..., b_k, an invariance argument shows that invariant tests of H_0 will not involve P_1Y, so just ignore P_1Y. But the model for Q_1Y is of the MANOVA type; change coordinates so
Now, the null hypothesis is of the type discussed in Section 9.1.
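The decomposition A(ρ) = γP_1 + δQ_1 and the reduction Q_1A(ρ)Q_1 = δQ_1 can be checked numerically, assuming (as the values γ = 1 + (n − 1)ρ, δ = 1 − ρ suggest) that A(ρ) is the intraclass matrix (1 − ρ)I_n + ρee′:

```python
import numpy as np

n = 5
rho = 0.3
e = np.ones((n, 1))
P1 = e @ e.T / n                # projection onto span{e}
Q1 = np.eye(n) - P1

A = (1 - rho) * np.eye(n) + rho * (e @ e.T)   # assumed intraclass form of A(rho)
gamma = 1 + (n - 1) * rho
delta = 1 - rho

decomp = np.allclose(A, gamma * P1 + delta * Q1)   # A(rho) = gamma P1 + delta Q1
reduced = np.allclose(Q1 @ A @ Q1, delta * Q1)     # Q1 A(rho) Q1 = delta Q1
```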
CHAPTER 10
1. Part (i) is clear since the number of nonzero canonical correlations is always the rank of Σ_{12} in the partitioned covariance of (X, Y). For (ii), write
Cov(X, Y) = (Σ_{11}  Σ_{12}; Σ_{21}  Σ_{22})
where Σ_{12} has rank t, and Σ_{11} > 0, Σ_{22} > 0. First, consider the case when q ≤ r, Σ_{11} = I_q, Σ_{22} = I_r, and
Σ_{12} = (D  0; 0  0),
where D > 0 is t × t and diagonal. Set
A = (D^{1/2}; 0): q × t,    B = (D^{1/2}; 0): r × t
so AB′ = Σ_{12}. Now, set Λ_{11} = I_q − AA′, Λ_{22} = I_r − BB′, and the problem is solved for this case. The general case is solved by using Proposition 5.7 to reduce the problem to the case above.
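The construction in Problem 1 can be checked numerically: with Σ_{11} = I_q, Σ_{22} = I_r, and Σ_{12} = AB′ as above, the nonzero canonical correlations should be exactly the diagonal of D (this reading of the construction is an assumption; q, r, t below are illustrative):

```python
import numpy as np

q, r, t = 4, 5, 2
d = np.array([0.9, 0.4])                 # intended nonzero canonical correlations
D = np.diag(d)

A = np.vstack([np.sqrt(D), np.zeros((q - t, t))])   # q x t, top block D^{1/2}
B = np.vstack([np.sqrt(D), np.zeros((r - t, t))])   # r x t, top block D^{1/2}
Sigma12 = A @ B.T                                    # has D as its top-left block

# with identity marginal covariances, the canonical correlations are the
# singular values of Sigma12
sv = np.linalg.svd(Sigma12, compute_uv=False)
cors = sv[:t]
```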
2. That Σ_{12} = δe_1e_2′ for some δ ∈ R^1 is clear, and hence Σ_{12} has rank one, so there is at most one nonzero canonical correlation. It is the square root of the largest eigenvalue of Σ_{11}^{-1}Σ_{12}Σ_{22}^{-1}Σ_{21} = δ^2 Σ_{11}^{-1}e_1e_2′Σ_{22}^{-1}e_2e_1′. The only (possibly) nonzero eigenvalue is δ^2(e_1′Σ_{11}^{-1}e_1)(e_2′Σ_{22}^{-1}e_2). To describe canonical coordinates, let
ξ_1 = Σ_{11}^{-1/2}e_1/‖Σ_{11}^{-1/2}e_1‖,    η_1 = Σ_{22}^{-1/2}e_2/‖Σ_{22}^{-1/2}e_2‖,
and then form orthonormal bases {ξ_1, ξ_2, ..., ξ_q} and {η_1, ..., η_r} for R^q and R^r. Now, set v_i = Σ_{11}^{-1/2}ξ_i and w_j = Σ_{22}^{-1/2}η_j for i = 1, ..., q, j = 1, ..., r. Then verify that X_i = v_i′X and Y_j = w_j′Y form a set of canonical coordinates for X and Y.
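The eigenvalue formula in Problem 2 can be checked numerically; in the NumPy sketch below, Σ_{11}, Σ_{22}, e_1, e_2, and δ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
q, r = 4, 3
delta = 0.2
e1 = rng.standard_normal(q)
e2 = rng.standard_normal(r)

G1 = rng.standard_normal((q, q)); Sigma11 = G1 @ G1.T + q * np.eye(q)   # > 0
G2 = rng.standard_normal((r, r)); Sigma22 = G2 @ G2.T + r * np.eye(r)   # > 0
Sigma12 = delta * np.outer(e1, e2)                                       # rank one

inv = np.linalg.inv
prod = inv(Sigma11) @ Sigma12 @ inv(Sigma22) @ Sigma12.T
top = np.max(np.linalg.eigvals(prod).real)          # its one nonzero eigenvalue

claimed = delta**2 * (e1 @ inv(Sigma11) @ e1) * (e2 @ inv(Sigma22) @ e2)
```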
3. Part (i) follows immediately from Proposition 10.4 and the form of the covariance for (X, Y). That δ(B) = tr Λ(I − Q(B)) is clear and the minimization of δ(B) follows from Proposition 1.44. To describe B̂, let Ψ: p × t have columns a_1, ..., a_t so Ψ′Ψ = I_t and Q = ΨΨ′. Then show directly that B̂ = Ψ′Σ^{1/2} is the minimizer, so that Q(B̂) = Q determines the best predictor. (iii) is an immediate application of (ii).
4. Part (i) is easy. For (ii), with u_i = x_i − a_0,
Δ(M, a_0) = Σ_{i=1}^n ‖x_i − (P(x_i − a_0) + a_0)‖^2 = Σ_{i=1}^n ‖u_i − Pu_i‖^2
= Σ_{i=1}^n ‖Qu_i‖^2 = Σ_{i=1}^n tr Qu_iu_i′ = tr Q(Σ_{i=1}^n u_iu_i′) = tr S(a_0)Q.
Since S(a_0) = S(x̄) + n(x̄ − a_0)(x̄ − a_0)′, (ii) follows. (iii) is an application of Proposition 1.44.
6. Part (i) follows from the singular value decomposition. For (ii), {x | x = ΨC, C: k × n} is a linear subspace and the orthogonal projection onto this subspace is (ΨΨ′) ⊗ I. Thus the closest point to A is ((ΨΨ′) ⊗ I)A = ΨΨ′A, and the C that achieves the minimum is C = Ψ′A. For B of rank at most k, write B = ΨC as in (i). Then
‖A − B‖^2 ≥ inf_Ψ inf_C ‖A − ΨC‖^2 = inf_Ψ ‖A − ΨΨ′A‖^2 = inf_Q ‖QA‖^2.
The last equality follows as each Ψ determines a Q = I − ΨΨ′ and conversely. Since ‖QA‖^2 = tr QA(QA)′ = tr QAA′Q = tr QAA′,
‖A − B‖^2 ≥ inf_Q tr QAA′.
Writing A = Σ_{i=1}^p λ_i^{1/2} u_iv_i′ (the singular value decomposition for A), AA′ = Σ_i λ_i u_iu_i′ is a spectral decomposition for AA′. Using Proposition 1.44, it follows easily that
inf_Q tr QAA′ = Σ_{i=k+1}^p λ_i.
That B̂ achieves the infimum is a routine calculation.
7. From Proposition 10.8, the density of W is
h(w|θ) = ∫ p_{n−2}(w|θu^{1/2}) f(u) du
where p_{n−2} is the density of a noncentral t-distribution and f is the density of a chi-squared distribution. For θ > 0, set v = θu^{1/2} so
h(w|θ) = (2/θ^2) ∫ p_{n−2}(w|v) f(v^2/θ^2) v dv.
Since p_{n−2}(w|v) has a monotone likelihood ratio in w and v and f(v^2/θ^2) has a monotone likelihood ratio in v and θ, Karlin's Lemma implies that h(w|θ) has a monotone likelihood ratio. For θ < 0, set v = −θu^{1/2}, change variables, and use Karlin's Lemma again. The last assertion is clear.
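Problem 6's conclusion is the Eckart–Young theorem: the squared error of the best rank-k approximation equals the sum of the p − k smallest λ_i, where the λ_i are the squared singular values of A. A NumPy sketch, with the truncated singular value decomposition playing the role of B̂:

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, k = 5, 8, 2
A = rng.standard_normal((p, n))

U, sv, Vt = np.linalg.svd(A, full_matrices=False)
B_hat = U[:, :k] @ np.diag(sv[:k]) @ Vt[:k, :]   # truncated SVD, rank k

err = float(np.sum((A - B_hat) ** 2))            # ||A - B_hat||^2
tail = float(np.sum(sv[k:] ** 2))                # sum of the p - k smallest lambda_i
```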
8. For U^2 fixed, the conditional distribution of W given U^2 can be described as the ratio of two independent random variables: the numerator has a χ^2_{r+2K} distribution (given K) where K is Poisson with parameter λ/2 and λ = ρ^2(1 − ρ^2)^{-1}U^2, and the denominator is a χ^2_{n−r−1}. Hence, given U^2, this ratio is F_{r+2K, n−r−1} with K described above, so the conditional density of W is
f_1(w|ρ, U^2) = Σ_k f_{r+2k, n−r−1}(w) ψ(k|λ/2)
where ψ(k|λ/2) is the Poisson probability function. Integrating out U^2 gives the unconditional density of W (at ρ). Thus it must be shown that E_{U^2} ψ(k|λ/2) = h(k|ρ); this is a calculation. That f(t|ρ) has a monotone likelihood ratio is a direct application of Karlin's Lemma.
9. Let M be the range of P. Each R ∈ ℛ_s can be represented as R = ΨΨ′ where Ψ is n × s, Ψ′Ψ = I_s, and PΨ = 0. In other words, R corresponds to orthonormal vectors ψ_1, ..., ψ_s (the columns of Ψ) and these vectors are in M^⊥ (of course, these vectors are not unique). But given any two such sets, say ψ_1, ..., ψ_s and δ_1, ..., δ_s, there is a Γ ∈ O(P) such that Γψ_i = δ_i, i = 1, ..., s. This shows O(P) is compact and acts transitively on ℛ_s, so there is a unique O(P)-invariant probability distribution on ℛ_s. For (iii), ΓR_0Γ′ has an O(P)-invariant distribution on ℛ_s; uniqueness does the rest.
10. For (i), use Proposition 7.3 to write Z = ΨU with probability one where Ψ and U are independent, Ψ is uniform on F_{p,n}, and U ∈ G_U^+. Thus with probability one, rank(QZ) = rank(QΨ). Let S > 0 be independent of Ψ with ℒ(S^2) = W(I_p, p, n) so S has rank p with probability one. Thus rank(QΨ) = rank(QΨS) with probability one. But ΨS is N(0, I_n ⊗ I_p), which implies that QΨS has rank p. Part (ii) is a direct application of Problem 9.
12. That Ψ is uniform follows from the uniformity of Γ on O_n. For (ii), ℒ(Ψ) = ℒ(Z(Z′Z)^{-1/2}) and Δ = (I_k 0)Ψ implies that ℒ(Δ) = ℒ(X(X′X + Y′Y)^{-1/2}). (iii) is immediate from Problem 11, and (iv) is an application of Proposition 7.6. For (v), it suffices to show that ∫f(x)P_1(dx) = ∫f(x)P_2(dx) for all bounded measurable f. The invariance of P_i implies that for i = 1, 2, ∫f(x)P_i(dx) = ∫f(gx)P_i(dx), g ∈ G. Let ν be the uniform probability measure on G and integrate the above to get ∫f(x)P_i(dx) = ∫(∫_G f(gx)ν(dg))P_i(dx). But the function x → ∫_G f(gx)ν(dg) is G-invariant and so can be written k(τ(x)), where τ is a maximal invariant. Since P_1(τ^{-1}(C)) = P_2(τ^{-1}(C)) for all measurable C, we have ∫k(τ(x))P_1(dx) = ∫k(τ(x))P_2(dx) for all bounded
measurable k. Putting things together, we have ∫f(x)P_1(dx) = ∫k(τ(x))P_1(dx) = ∫k(τ(x))P_2(dx) = ∫f(x)P_2(dx), so P_1 = P_2. Part (vi) is immediate from (v).
13. For (i), argue as in Example 4.4:
tr(Z − TB)Σ^{-1}(Z − TB)′
= tr(Z − TB̂ + T(B̂ − B))Σ^{-1}(Z − TB̂ + T(B̂ − B))′
= tr(QZ + T(B̂ − B))Σ^{-1}(QZ + T(B̂ − B))′
= tr(QZ)Σ^{-1}(QZ)′ + tr T(B̂ − B)Σ^{-1}(B̂ − B)′T′
≥ tr(QZ)Σ^{-1}(QZ)′ = tr Z′QZΣ^{-1}.
The third equality follows from the relation QT = 0 as in the normal case. Since h is nonincreasing, this shows that for each Σ > 0,
sup_B f(Z|B, Σ) = f(Z|B̂, Σ),
and it is obvious that f(Z|B̂, Σ) = |Σ|^{-n/2} h(tr SΣ^{-1}). For (ii), first note that S > 0 with probability one. Then, for S > 0,
sup_{H_0 ∪ H_1} f(Z|B, Σ) = sup_{Σ > 0} f(Z|B̂, Σ) = sup_{Σ > 0} |Σ|^{-n/2} h(tr SΣ^{-1}) = |S|^{-n/2} sup_{C > 0} |C|^{n/2} h(tr C).
Under H_0, we have
sup_{H_0} f(Z|B, Σ) = sup_{Σ_{ii} > 0, i = 1, 2} |Σ_{11}|^{-n/2}|Σ_{22}|^{-n/2} h(tr Σ_{11}^{-1}S_{11} + tr Σ_{22}^{-1}S_{22})
= |S_{11}|^{-n/2}|S_{22}|^{-n/2} sup_{C_{ii} > 0, i = 1, 2} |C_{11}|^{n/2}|C_{22}|^{n/2} h(tr C_{11} + tr C_{22}).
This latter sup is bounded above by
k_1 ≡ sup_{C > 0} |C|^{n/2} h(tr C),
which is finite by assumption. Hence the likelihood ratio test rejects for small values of k_1|S_{11}|^{-n/2}|S_{22}|^{-n/2}|S|^{n/2}, which is equivalent to rejecting for small values of Λ(Z). The identity of part (iii) follows from the equations relating the blocks of Σ to the blocks of Σ^{-1}. Partition B into B_1: k × q and B_2: k × r so EX = TB_1 and EY = TB_2. Apply the identity with U = X − TB_1 and V = Y − TB_2 to give
f(Z|B, Σ) = |Σ_{22·1}|^{-n/2}|Σ_{11}|^{-n/2}
× h[tr(Y − TB_2 − (X − TB_1)Σ_{11}^{-1}Σ_{12})Σ_{22·1}^{-1}(Y − TB_2 − (X − TB_1)Σ_{11}^{-1}Σ_{12})′
+ tr(X − TB_1)Σ_{11}^{-1}(X − TB_1)′].
Using the notation of Section 10.5, write
f(X, Y|B, Σ) = |Σ_{11}|^{-n/2}|Σ_{22·1}|^{-n/2}
× h[tr(Y − WC)Σ_{22·1}^{-1}(Y − WC)′ + tr(X − TB_1)Σ_{11}^{-1}(X − TB_1)′].
Hence the conditional density of Y given X is
f_1(Y|C, B_1, Σ_{22·1}, Σ_{11}, X) = |Σ_{22·1}|^{-n/2} h(tr(Y − WC)Σ_{22·1}^{-1}(Y − WC)′ + η)/p(η),
where η = tr(X − TB_1)Σ_{11}^{-1}(X − TB_1)′ and p(η) = ∫ h(tr uu′ + η) du. For (iv), argue as in (ii) and use the identities established in Proposition 10.17. Part (v) is easy, given the results of (iv): just note that the sup over Σ_{11} and B_1 is equal to the sup over η > 0. Part (vi) is interesting; Proposition 10.13 is not applicable. Fix X, B_1, and Σ_{11} and note that under H_0, the conditional density of Y is
f_2(Y|C_2, Σ_{22·1}, η) = |Σ_{22·1}|^{-n/2} h(tr(Y − TC_2)Σ_{22·1}^{-1}(Y − TC_2)′ + η)/p(η).
This shows that Y has the same distribution (conditionally) as Ỹ =
TC_2 + EΣ_{22·1}^{1/2}, where the n × r matrix E has density h(tr ee′ + η)/p(η). Note that ℒ(ΓEΔ) = ℒ(E) for all Γ ∈ O_n and Δ ∈ O_r. Let t = min(q, r) and, given any n × n matrix A with real eigenvalues, let λ(A) be the vector of the t largest eigenvalues of A. Thus the squares of the sample canonical correlations are the elements of the vector λ(R_YR_X) where R_Y = (QY)(Y′QY)^{-1}(QY)′ and R_X = (QX)(X′QX)^{-1}(QX)′, since
S = (X′QX  X′QY; Y′QX  Y′QY).
(You may want to look at the discussion preceding Proposition 10.5.) Now, we use Problem 9 and the notation there; P = I − Q. First, R_Y ∈ ℛ_r, R_X ∈ ℛ_q, and O(P) acts transitively on ℛ_r and ℛ_q. Under H_0 (and X fixed), ℒ(QY) = ℒ(QEΣ_{22·1}^{1/2}), which implies that ℒ(ΓR_YΓ′) = ℒ(R_Y), Γ ∈ O(P). Hence R_Y is uniform on ℛ_r for each X. Fix R_0 ∈ ℛ_q and choose Γ_0 ∈ O(P) so that Γ_0R_0Γ_0′ = R_X. Then, for each X,
ℒ(λ(R_YR_X)) = ℒ(λ(R_YΓ_0R_0Γ_0′)) = ℒ(λ(Γ_0′R_YΓ_0R_0)) = ℒ(λ(R_YR_0)).
This shows that for each X, λ(R_YR_X) has the same distribution as λ(R_YR_0) for R_0 fixed, where R_Y is uniform on ℛ_r. Since the distribution of λ(R_YR_0) does not depend on X and agrees with what we get in the normal case, the solution is complete.
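The representation of the squared sample canonical correlations as eigenvalues of the product R_YR_X of two projections can be checked numerically. The sketch below (NumPy, arbitrary data; Q is the projection complementary to the column space of T) compares them with the standard form, the eigenvalues of (Y′QY)^{-1}Y′QX(X′QX)^{-1}X′QY:

```python
import numpy as np

rng = np.random.default_rng(8)
n, q, r, k = 20, 3, 4, 2
T = rng.standard_normal((n, k))
Q = np.eye(n) - T @ np.linalg.solve(T.T @ T, T.T)   # Q = I - P_T

X = rng.standard_normal((n, q))
Y = rng.standard_normal((n, r))

def proj(M):
    # orthogonal projection onto the column space of M
    return M @ np.linalg.solve(M.T @ M, M.T)

RX = proj(Q @ X)
RY = proj(Q @ Y)

t = min(q, r)
ev_proj = np.sort(np.linalg.eigvals(RY @ RX).real)[::-1][:t]

# standard form of the squared canonical correlations
M = np.linalg.solve(Y.T @ Q @ Y, Y.T @ Q @ X) @ np.linalg.solve(X.T @ Q @ X, X.T @ Q @ Y)
ev_std = np.sort(np.linalg.eigvals(M).real)[::-1][:t]
```

The agreement follows from the cyclic invariance of nonzero eigenvalues and the idempotence of Q.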
BIBLIOGRAPHY
Anderson, T. W. (1946). The noncentral Wishart distribution and certain problems of multi
variate statistics. Ann. Math. Stat., 17, 409-431.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Anderson, T. W., S. Das Gupta, and G. H. P. Styan (1972). A Bibliography of Multivariate
Analysis. Oliver and Boyd, Edinburgh.
Andersson, S. A. (1975). Invariant normal models. Ann. Stat., 3, 132-154.
Andersson, S. A. (1982). Distributions of maximal invariants using quotient measures. Ann.
Stat., 10, 955-961.
Arnold, S. F. (1973). Applications of the theory of products of problems to certain patterned covariance matrices. Ann. Stat., 1, 682-699.
Arnold, S. F. (1979). A coordinate free approach to finding optimal procedures for repeated measures designs. Ann. Stat., 7, 812-822.
Arnold, S. F. (1981). The Theory of Linear Models and Multivariate Analysis. Wiley, New York.
Basu, D. (1955). On statistics independent of a complete sufficient statistic. Sankhya, 15, 377-380.
Billingsley, P. (1979). Probability and Measure. Wiley, New York.
Björck, A. (1967). Solving linear least squares problems by Gram-Schmidt orthogonalization.
BIT, 7, 1-21.
Blackwell, D. and M. Girshick (1954). Theory of Games and Statistical Decision Functions.
Wiley, New York.
Bondesson, L. (1977). A note on sufficiency and independence. Preprint, University of Lund,
Lund, Sweden.
Box, G. E. P. (1949). A general distribution theory for a class of likelihood criteria. Biometrika,
36, 317-346.
Chung, K. L. (1974). A Course in Probability Theory, second edition. Academic Press, New
York.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton, N.J.
Das Gupta, S. (1979). A note on ancillarity and independence via measure-preserving transfor
mations. Sankhya, 41, Series A, 117-123.
Dawid, A. P. (1978). Spherical matrix distributions and a multivariate model. J. Roy. Stat. Soc.
B, 39, 254-261.
Dawid, A. P. (1981). Some matrix-variate distribution theory: Notational considerations and a
Bayesian application. Biometrika, 68, 265-274.
Deemer, W. L. and I. Olkin (1951). The Jacobians of certain matrix transformations useful in
multivariate analysis. Biometrika, 38, 345-367.
Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Addison-Wesley, Read
ing, Mass.
Eaton, M. L. (1970). Gauss-Markov estimation for multivariate linear models: A coordinate
free approach. Ann. Math. Stat., 41, 528-538.
Eaton, M. L. (1972). Multivariate Statistical Analysis. Institute of Mathematical Statistics,
University of Copenhagen, Copenhagen, Denmark.
Eaton, M. L. (1978). A note on the Gauss-Markov Theorem. Ann. Inst. Stat. Math., 30, 181-184.
Eaton, M. L. (1981). On the projections of isotropic distributions. Ann. Stat., 9, 391-400.
Eaton, M. L. and T. Kariya (1981). On a general condition for null robustness. University of
Minnesota Technical Report No. 388, Minneapolis.
Eckart, C. and G. Young (1936). The approximation of one matrix by another of lower rank.
Psychometrika, 1, 211-218.
Farrell, R. H. (1962). Representations of invariant measures. Ill. J. Math., 6, 447-467.
Farrell, R. H. (1976). Techniques of Multivariate Calculation. Lecture Notes in Mathematics
#520. Springer-Verlag, Berlin.
Giri, N. (1964). On the likelihood ratio test of a normal multivariate testing problem. Ann.
Math. Stat., 35, 181-190.
Giri, N. (1965a). On the complex analogues of T² and R² tests. Ann. Math. Stat., 36, 664-670.
Giri, N. (1965b). On the likelihood ratio test of a multivariate testing problem, II. Ann. Math.
Stat., 36, 1061-1065.
Giri, N. (1975). Invariance and Minimax Statistical Tests. Hindustan Publishing Corporation,
Delhi, India.
Giri, N. C. (1977). Multivariate Statistical Inference. Academic Press, New York.
Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations.
Wiley, New York.
Goodman, N. R. (1963). Statistical analysis based on a certain multivariate complex Gaussian
distribution (An introduction). Ann. Math. Stat., 34, 152-177.
Hall, W. J., R. A. Wijsman, and J. K. Ghosh (1965). The relationship between sufficiency and
invariance with applications in sequential analysis. Ann. Math. Stat., 36, 575-614.
Halmos, P. R. (1950). Measure Theory. D. Van Nostrand Company, Princeton, N.J.
Halmos, P. R. (1958). Finite Dimensional Vector Spaces. Undergraduate Texts in Mathematics,
Springer-Verlag, New York.
Hoffman, K. and R. Kunze (1971). Linear Algebra, second edition. Prentice Hall, Englewood
Cliffs, N.J.
Hotelling, H. (1931). The generalization of Student's ratio. Ann. Math. Stat., 2, 360-378.
Hotelling, H. (1935). The most predictable criterion. J. Educ. Psych., 26, 139-142.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321-377.
Hotelling, H. (1947). Multivariate quality control, illustrated by the air testing of sample
bombsights, in Techniques of Statistical Analysis. McGraw-Hill, New York, pp. 111-184.
James, A. T. (1954). Normal multivariate analysis and the orthogonal group. Ann. Math. Stat.,
25, 40-75.
Kariya, T. (1978). The general MANOVA problem. Ann. Stat., 6, 200-214.
Karlin, S. (1956). Decision theory for Polya-type distributions. Case of two actions, I, in Proc.
Third Berkeley Symp. Math. Stat. Prob., Vol. 1. University of California Press, Berkeley,
pp. 115-129.
Karlin, S. and H. Rubin (1956). The theory of decision procedures for distributions with
monotone likelihood ratio. Ann. Math. Stat., 27, 272-299.
Kiefer, J. (1957). Invariance, minimax sequential estimation, and continuous time processes.
Ann. Math. Stat., 28, 573-601.
Kiefer, J. (1966). Multivariate optimality results. In Multivariate Analysis, edited by P. R.
Krishnaiah. Academic Press, New York.
Kruskal, W. (1961). The coordinate free approach to Gauss-Markov estimation and its
application to missing and extra observations, in Proc. Fourth Berkeley Symp. Math. Stat.
Prob., Vol. 1. University of California Press, Berkeley, pp. 435-461.
Kruskal, W. (1968). When are Gauss-Markov and least squares estimators identical? A
co-ordinate free approach. Ann. Math. Stat., 39, 70-75.
Kshirsagar, A. M. (1972). Multivariate Analysis. Marcel Dekker, New York.
Lang, S. (1969). Analysis II. Addison-Wesley, Reading, Massachusetts.
Lawley, D. N. (1938). A generalization of Fisher's z-test. Biometrika, 30, 180-187.
Lehmann, E. L. (1959). Testing Statistical Hypotheses. Wiley, New York.
Mallows, C. L. (1960). Latent vectors of random symmetric matrices. Biometrika, 48, 133-149.
Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis. Academic Press, New
York.
Muirhead, R. J. (1982). Aspects of Multivariate Statistical Theory. Wiley, New York.
Nachbin, L. (1965). The Haar Integral. D. Van Nostrand Company, Princeton, N.J.
Noble, B. and J. Daniel (1977). Applied Linear Algebra, second edition. Prentice Hall,
Englewood Cliffs, N.J.
Olkin, I. and S. J. Press (1969). Testing and estimation for a circular stationary model. Ann.
Math. Stat., 40, 1358-1373.
Olkin, I. and H. Rubin (1964). Multivariate beta distributions and independence properties of
the Wishart distribution. Ann. Math. Stat., 35, 261-269.
Olkin, I. and A. R. Sampson (1972). Jacobians of matrix transformations and induced
functional equations. Linear Algebra Appl., 5, 257-276.
Peisakoff, M. (1950). Transformation parameters. Thesis, Princeton University, Princeton, N.J.
Pillai, K. C. S. (1955). Some new test criteria in multivariate analysis. Ann. Math. Stat., 26, 117-121.
Pitman, E. J. G. (1939). The estimation of location and scale parameters of a continuous
population of any form. Biometrika, 30, 391-421.
Potthoff, R. F. and S. N. Roy (1964). A generalized multivariate analysis of variance model
useful especially for growth curve problems. Biometrika, 51, 313-326.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, second edition. Wiley, New
York.
Roy, S. N. (1953). On a heuristic method of test construction and its use in multivariate
analysis. Ann. Math. Stat., 24, 220-238.
Roy, S. N. (1957). Some Aspects of Multivariate Analysis. Wiley, New York.
Rudin, W. (1953). Principles of Mathematical Analysis. McGraw-Hill, New York.
Scheff?, H. (1959). The Analysis of Variance. Wiley, New York.
Segal, I. E. and Kunze, R. A. (1978). Integrals and Operators, second revised and enlarged edition. Springer-Verlag, New York.
Serre, J. P. (1977). Linear Representations of Finite Groups. Springer-Verlag, New York.
Srivastava, M. S. and C. G. Khatri (1979). An Introduction to Multivariate Statistics. North
Holland, Amsterdam.
Stein, C. (1956). Some problems in multivariate analysis, Part I. Stanford University Technical
Report No. 6, Stanford, Calif.
Wijsman, R. A. (1957). Random orthogonal transformations and their use in some classical
distribution problems in multivariate analysis. Ann. Math. Stat., 28, 415-423.
Wijsman, R. A. (1966). Cross-sections of orbits and their applications to densities of maximal
invariants, in Proc. Fifth Berkeley Symp. Math. Stat. Probl., Vol. 1. University of
California Press, Berkeley, pp. 389-400.
Wilks, S. S. (1932). Certain generalizations in the analysis of variance. Biometrika, 24, 471-494.
Wishart, J. (1928). The generalized product moment distribution in samples from a normal
multivariate population. Biometrika, 20, 32-52.
Index
Affine dependence:
invariance of, 405
measure of, 404, 418, 419
between random vectors, 403
Affinely equivalent, 404
Almost invariant function, 287
Ancillary, 285
Ancillary statistic, 465
Angles between subspaces:
definition, 61
geometric interpretation, 61
Action group, see Group
Beta distribution:
definition, 320
noncentral, 320
relation to F, 320
Beta random variables, products of, 236, 321, 323
Bilinear, 33
Bivariate correlation coefficient:
density has monotone likelihood ratio, 459
distribution of, 429, 432
Borel σ-algebra, 70
Borel measurable, 71
Canonical correlation coefficients:
as angles between subspaces, 408, 409
definition, 408
density of, 442
interpretation of sample, 421
as maximal correlations, 413
model interpretation, 456
and prediction, 418
population, 408
sample, as maximal invariant, 425
Canonical variates:
definition, 415
properties of, 415
Cauchy-Schwarz Inequality, 26
Characteristic function, 76
Characteristic polynomial, 44
Chi-square distribution:
definition, 109, 110
density, 110, 111
Compact group, invariant integral on, 207
Completeness:
bounded, 466
independence, sufficiency and, 466
Complex covariance structure:
discussion of, 381
example of, 370
Complex normal distribution:
definition, 374, 375
discussion of, 373
independence in, 378
relation to real normal distribution, 377
Complex random variables:
covariance of, 372
covariance matrix of, 375
mean of, 372
variance of, 372
Complex vector space, 39
Complex Wishart distribution, 378
Conditional distribution:
for normal variables, 116, 117
in normal random matrix, 118
Conjugate transpose, 39
Correlation coefficient, density of in normal sample, 329
Covariance: characterization of, 75
  of complex random variables, 372
  definition, 74
  of outer products, 96, 97
  partitioned, 86
  properties of, 74
  of random sample, 89
  between two random variables, 28
Covariance matrix, 73
Cyclic covariance: characterization of, 365
  definition, 362
  diagonalization of, 366
  multivariate, 368
Density, of maximal invariant, 272, 273-277
Density function, 72
Determinant: definition, 41
  properties of, 41
Determinant function: alternating, 40
  characterization of, 40
  definition, 39
  as n-linear function, 40
Direct product, 212
Distribution, induced, 71
Eigenvalue: and angles between subspaces, 61
  definition, 44
  of real linear transformations, 47
Eigenvector, 45
Equivariant, 218
Equivariant estimator, in simple linear model, 157
Equivariant function: description of, 249
  example of, 250
Error subspace, see Linear model
Estimation: Gauss-Markov Theorem, 134
  linear, 133
  of variance in linear model, 139
Expectation, 71
Factorization, see Matrix
F Distribution: definition, 320
  noncentral, 320
  relation to beta, 320
F test, in simple linear model, 155
Gauss-Markov estimator: definition, 135
  definition in general linear model, 146
  discussion of, 140-143
  equal to least squares, 145
  existence of, 147
  in k-sample problem, 148
  for linear functions, 136
  in MANOVA, 151
  in regression model, 135
  in weakly spherical linear model, 134
Generalized inverse, 87
Generalized variance: definition, 315
  example of, 298
Gram-Schmidt Orthogonalization, 15
Group: action, 186, 187
  affine, 187, 188
  commutative, 185
  definition, 185
  direct product, 212
  general linear, 186
  isotropy subgroup, 191
  lower triangular, 185, 186
  normal subgroup, 189, 190
  orthogonal, 185
  permutation matrices, 188, 189
  sign changes, 188, 189
  subgroup, 186
  topological, 195
  transitive, 191
  unimodular, 200
  upper triangular, 186
Hermitian matrix, 371
Homomorphism: definition, 218
  on lower triangular matrices, 230, 231
  on non-singular matrices, 230
Hotelling's T²: complex case of, 381
  as likelihood ratio statistic, 402
Hypothesis testing, invariance in, 263
Independence: in blocks, testing for, 446
  characterization of, 78
  completeness, sufficiency and, 466
  decomposition of test for, 449
  distribution of likelihood ratio test for, 447
  likelihood ratio test for, 444, 446
  MANOVA model and, 453, 454
  of normal variables, 106, 107
  of quadratic forms, 114
  of random vectors, 77
  regression model and, 451
  sample mean and sample covariance, 126, 127
  testing for, 443, 444
Inner product: definition, 14
  for linear transformations, 32
  norm defined by, 14
  standard, 15
Inner product space, 15
Integral: definition, 194
  left invariant, 195
  right invariant, 195
Intraclass covariance: characterization of, 355
  definition, 131, 356
  multivariate version of, 360
Invariance: in hypothesis testing, 263
  and independence, 289
  of likelihood ratio, 263
  in linear model, 156, 157, 256, 257
  in MANOVA model with block covariance, 353
  in MANOVA testing problem, 341
  of maximum likelihood estimators, 258
Invariance and independence, example of, 290-291, 292-295
Invariant densities: definition, 254
  example of, 255
Invariant distribution: example of, 282-283
  on n × p matrices, 235
  representation of, 280
Invariant function: definition, 242
  maximal invariant, 242
Invariant integral: on affine group, 202, 203
  on compact group, 207
  existence of, 196
  on homogeneous space, 208, 210
  on lower triangular matrices, 200
  on matrices (n × p of rank p), 213-218
  on m-frames, 210, 211
  and multiplier, 197
  on non-singular matrices, 199
  on positive definite matrices, 209, 210
  relatively left, 197
  relatively right, 197, 198
  uniqueness of, 196
  see also Integral
Invariant measure, on a vector space, 121-122
Invariant probability model, 251
Invariant subspace, 49
Inverse Wishart distribution: definition, 330
  properties of, 330
Isotropy subgroup, 191
Jacobian: definition, 166
  example of, 168, 169, 171, 172, 177
Kronecker product: definition, 34
  determinant of, 67
  properties of, 36, 68
  trace of, 67
Lawley-Hotelling trace test, 348
Least squares estimator: definition, 135
  equal to Gauss-Markov estimator, 145
  in k-sample problem, 148
  in MANOVA, 151
  in regression model, 135
Lebesgue measure, on a vector space, 121-125
Left homogeneous space: definition, 207
  relatively invariant integral on, 208-210
Left translate, 195
Likelihood ratio test: decomposition of in MANOVA, 349
  definition, 263
  in MANOVA model with block covariance, 351
  in MANOVA problem, 340
  in mean testing problem, 384, 390
Linear isometry: definition, 36
  properties of, 37
Linear model: error subspace, 133
  error vector, 133
  invariance in, 156, 157, 256, 257
  with normal errors, 137
  regression model, 132
  regression subspace, 133
  weakly spherical, 133
Linear transformation: adjoint of, 17, 29
  definition, 7
  eigenvalues, see Eigenvalues
  invertible, 9
  matrix of, 10
  non-negative definite, 18
  null space of, 9
  orthogonal, 18
  positive definite, 18
  range of, 9
  rank, 9
  rank one, 19
  self-adjoint, 18
  skew symmetric, 18
  transpose of, 17
  vector space of, 7
Locally compact, 194
MANOVA: definition, 150
  maximum likelihood estimator in, 151
  with normal errors, 151
MANOVA model: with block diagonal covariance, 350
  canonical form of, 339
  with cyclic covariance, 366
  description of, 336
  example of, 308
  and independence, 453, 454
  with intraclass covariance, 356
  maximum likelihood estimators in, 337
  under non-normality, 398
  with non-normal density, 462
  testing problem in, 337
MANOVA testing problem: canonical form of, 339
  complex case of, 379
  description of, 337
  with intraclass covariance structure, 359
  invariance in, 341
  likelihood ratio test in, 340, 347
  maximal invariant in, 342, 346
  maximal invariant parameter in, 344
  uniformly most powerful test in, 345
Matric t distribution: definition, 330
  properties of, 330
Matrix: definition, 10
  eigenvalue of, 44
  factorization, 160, 162, 163, 164
  lower triangular, 44, 159
  orthogonal, 25
  partitioned positive definite, 161, 162
  positive definite, 25
  product, 10
  skew symmetric, 25
  symmetric, 25
  upper triangular, 159
Maximal invariant: density of, 278-279
  example of, 242, 243, 246
  parameter, 268
  and product spaces, 246
  representing density of, 272
Maximum likelihood estimator: of covariance matrix, 261
  invariance of, 258
  in MANOVA model, 151
  in simple linear model, 138
Mean value, of random variable, 72
Mean vector: of coordinate random vector, 72
  definition, 72
  for outer products, 93
  properties of, 72
  of random sample, 89
M-frame, 38
Modulus, right hand, 196
Monotone likelihood ratio: for non-central chi-squared, 468, 469
  for non-central F, 469
  for non-central Student's t, 470
  and totally positive of order 2, 467
Multiple correlation coefficient:
  definition, 434
  distribution of, 434
Multiplier: on affine group, 204
  definition, 197
  and invariant integral, 197
  on lower triangular matrices, 201
  on non-singular matrices, 199
Multivariate beta distribution: definition, 331
  properties of, 331, 332
Multivariate F distribution: definition, 331
  properties of, 331
Maximal invariant, see Invariant function
Multivariate General Linear Model, see MANOVA
Non-central chi-squared distribution: definition, 110, 111
  for quadratic forms, 112
Noncentral Wishart distribution: covariance of, 317
  definition, 316
  density of, 317
  mean of, 317
  properties of, 316
  as quadratic form in normals, 318
Normal distribution: characteristic function of, 105, 106
  complex, see Complex normal distribution
  conditional distribution in, 116, 117
  covariance of, 105, 106
  definition, 104
  density of, 120-126
  density of normal matrix, 125
  existence of, 105, 106
  independence in, 106, 107
  mean of, 105, 106
  and non-central chi-square, 111
  and quadratic forms, 109
  relation to Wishart, 307
  representation of, 108
  for random matrix, 118
  scale mixture of, 129, 130
  sufficient statistic in, 126, 127, 131
  of symmetric matrices, 130
Normal equations, 155, 156
Orbit, 241
Order statistic, 276-277
Orthogonal: complement, 16
  decomposition, 17
  definition, 15
Orthogonal group, definition, 23
Orthogonally invariant distribution, 81
Orthogonally invariant function, 82
Orthogonal projection: characterization of, 21
  definition, 17
  in Gauss-Markov Theorem, 134
  random, 439, 440
Orthogonal transformation, characterization of, 22
Orthonormal: basis, 15
  set, 15
Outer product: definition, 19
  properties of, 19, 30
Parameter, maximal invariant, 268
Parameter set: definition, 146
  in linear models, 146
Parameter space, 252
Partitioning, a Wishart matrix, 310
Pillai trace test, 348
Pitman estimator, 264-267
Prediction: affine, 94
  and affine dependence, 416
Principal components: and closest flat, 457, 458
  definition, 457
  low rank approximation, 457
Probability model, invariant, 251
Projection: characterization of, 13
  definition, 12
Quadratic forms: independence of, 114, 115
  in normal variables, 109
Radon measure: definition, 194
  factorization of, 224
Random vector, 71
Regression: multivariate, 451
  and testing for independence, 451
Regression model, see Linear model
Regression subspace, see Linear model
Relatively invariant integral, see Invariant integral
Right translate, 195
Roy maximum root test, 348, 349
Sample correlation coefficient, as a maximal invariant, 268-271
Scale mixture of normals, 129, 130
Self-adjoint transformations, functions of, 52
Singular Value Decomposition Theorem, 58
Spectral Theorem: and positive definiteness, 51
  statement of, 50, 53
Spherical distributions, 84
Sufficiency: completeness, independence and, 466
  definition, 465
Symmetry model: definition, 361
  examples of, 361
Topological group, 195
Trace: of linear transformation, 47
  of matrix, 33
  sub-k, 56
Transitive group action, 191
Two-way layout, 155
Uncorrelated: characterization of, 98
  definition, 87
  random vectors, 88
Uniform distribution: on M-frames, 234
  on unit sphere, 101
Uniformly most powerful invariant test, in MANOVA problem, 345
Vector space: basis, 3
  complementary subspaces, 5
  definition, 2
  dimension, 4
  direct sum, 6
  dual space, 8
  finite dimensional, 3
  linearly dependent, 3
  linearly independent, 3
  linear manifold, 4
  subspace, 4
Weakly spherical: characterization of, 83
  definition, 83
  linear model, 133
Wishart constant, 175
Wishart density, 239, 240
Wishart distribution: characteristic function of, 305
  convolution of two, 306
  covariance of, 305
  definition, 303
  density of, 239, 240
  for nonintegral degrees of freedom, 329
  in MANOVA model, 308
  mean of, 305
  noncentral, see Noncentral Wishart distribution
  nonsingular, 304
  partitioned matrix and, 310
  of quadratic form, 307
  representation of, 303
  triangular decomposition of, 313, 314
Wishart matrix: distribution of partitioned, 311
  ratio of determinants of, 319