Multivariate Statistics: A Vector Space Approach
Author: Morris L. Eaton
Source: Lecture Notes-Monograph Series, Vol. 53, Multivariate Statistics: A Vector Space Approach (2007), pp. i-viii, 1-463, 465-501, 503-512
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/20461449
Accessed: 14/06/2014 17:27
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PM. All use subject to JSTOR Terms and Conditions.
Institute of Mathematical Statistics
LECTURE NOTES-MONOGRAPH SERIES
Volume 53
Multivariate Statistics A Vector Space Approach
Morris L. Eaton
Institute of Mathematical Statistics, Beachwood, Ohio, USA
Institute of Mathematical Statistics Lecture Notes-Monograph Series
Series Editor: R. A. Vitale
The production of the Institute of Mathematical Statistics Lecture Notes-Monograph Series is managed by the
IMS Office: Jiayang Sun, Treasurer and Elyse Gustafson, Executive Director.
Library of Congress Control Number: 2006940290
International Standard Book Number 978-0-940600-69-0, 0-940600-69-2
International Standard Serial Number 0749-2170
Copyright © 2007 Institute of Mathematical Statistics
All rights reserved
Printed in Lithuania
Contents

Preface ......... v
Notation ......... viii

1. VECTOR SPACE THEORY ......... 1
1.1. Vector Spaces ......... 2
1.2. Linear Transformations ......... 6
1.3. Inner Product Spaces ......... 13
1.4. The Cauchy-Schwarz Inequality ......... 25
1.5. The Space L(V, W) ......... 29
1.6. Determinants and Eigenvalues ......... 38
1.7. The Spectral Theorem ......... 49
Problems ......... 63
Notes and References ......... 69

2. RANDOM VECTORS ......... 70
2.1. Random Vectors ......... 70
2.2. Independence of Random Vectors ......... 76
2.3. Special Covariance Structures ......... 81
Problems ......... 98
Notes and References ......... 102

3. THE NORMAL DISTRIBUTION ON A VECTOR SPACE ......... 103
3.1. The Normal Distribution ......... 104
3.2. Quadratic Forms ......... 109
3.3. Independence of Quadratic Forms ......... 113
3.4. Conditional Distributions ......... 116
3.5. The Density of the Normal Distribution ......... 120
Problems ......... 127
Notes and References ......... 131

4. LINEAR STATISTICAL MODELS ......... 132
4.1. The Classical Linear Model ......... 132
4.2. More About the Gauss-Markov Theorem ......... 140
4.3. Generalized Linear Models ......... 146
Problems ......... 154
Notes and References ......... 157

5. MATRIX FACTORIZATIONS AND JACOBIANS ......... 159
5.1. Matrix Factorizations ......... 159
5.2. Jacobians ......... 166
Problems ......... 180
Notes and References ......... 183

6. TOPOLOGICAL GROUPS AND INVARIANT MEASURES ......... 184
6.1. Groups ......... 185
6.2. Invariant Measures and Integrals ......... 194
6.3. Invariant Measures on Quotient Spaces ......... 207
6.4. Transformations and Factorizations of Measures ......... 218
Problems ......... 228
Notes and References ......... 232

7. FIRST APPLICATIONS OF INVARIANCE ......... 233
7.1. Left O_n Invariant Distributions on n × p Matrices ......... 233
7.2. Groups Acting on Sets ......... 241
7.3. Invariant Probability Models ......... 251
7.4. The Invariance of Likelihood Methods ......... 258
7.5. Distribution Theory and Invariance ......... 267
7.6. Independence and Invariance ......... 284
Problems ......... 296
Notes and References ......... 299

8. THE WISHART DISTRIBUTION ......... 302
8.1. Basic Properties ......... 302
8.2. Partitioning a Wishart Matrix ......... 309
8.3. The Noncentral Wishart Distribution ......... 316
8.4. Distributions Related to Likelihood Ratio Tests ......... 318
Problems ......... 329
Notes and References ......... 332

9. INFERENCE FOR MEANS IN MULTIVARIATE LINEAR MODELS ......... 334
9.1. The MANOVA Model ......... 336
9.2. MANOVA Problems with Block Diagonal Covariance Structure ......... 350
9.3. Intraclass Covariance Structure ......... 355
9.4. Symmetry Models: An Example ......... 361
9.5. Complex Covariance Structures ......... 370
9.6. Additional Examples of Linear Models ......... 381
Problems ......... 397
Notes and References ......... 401

10. CANONICAL CORRELATION COEFFICIENTS ......... 403
10.1. Population Canonical Correlation Coefficients ......... 403
10.2. Sample Canonical Correlations ......... 419
10.3. Some Distribution Theory ......... 427
10.4. Testing for Independence ......... 443
10.5. Multivariate Regression ......... 451
Problems ......... 456
Notes and References ......... 463

APPENDIX ......... 465
COMMENTS ON SELECTED PROBLEMS ......... 471
BIBLIOGRAPHY ......... 503
INDEX ......... 507
Preface
The purpose of this book is to present a version of multivariate statistical theory in which vector space and invariance methods replace, to a large extent, more traditional multivariate methods. The book is a text. Over the past ten years, various versions have been used for graduate multivariate courses at the University of Chicago, the University of Copenhagen, and the University of Minnesota. Designed for a one year lecture course or for independent study, the book contains a full complement of problems and problem solutions.

My interest in using vector space methods in multivariate analysis was aroused by William Kruskal's success with such methods in univariate linear model theory. In the late 1960s, I had the privilege of teaching from Kruskal's lecture notes, where a coordinate free (vector space) approach to univariate analysis of variance was developed. (Unfortunately, Kruskal's notes have not been published.) This approach provided an elegant unification of linear model theory together with many useful geometric insights. In addition, I found the pedagogical advantages of the approach far outweighed the extra effort needed to develop the vector space machinery. Extending the vector space approach to multivariate situations became a goal, which is realized here. Basic material on vector spaces, random vectors, the normal distribution, and linear models takes up most of the first half of this book.

Invariance (group theoretic) arguments have long been an important research tool in multivariate analysis as well as in other areas of statistics. In fact, invariance considerations shed light on most multivariate hypothesis testing, estimation, and distribution theory problems. When coupled with vector space methods, invariance provides an important complement to the traditional distribution theory-likelihood approach to multivariate analysis. Applications of invariance to multivariate problems occur throughout the second half of this book.

A brief summary of the contents and flavor of the ten chapters herein follows. In Chapter 1, the elements of vector space theory are presented. Since my approach to the subject is geometric rather than algebraic, there is an emphasis on inner product spaces, where the notions of length, angle, and orthogonal projection make sense. Geometric topics of particular importance in multivariate analysis include singular value decompositions and angles between subspaces. Random vectors taking values in inner product spaces is the general topic of Chapter 2. Here, induced distributions, means, covariances, and independence are introduced in the inner product space setting. These results are then used to establish many traditional properties of the multivariate normal distribution in Chapter 3. In Chapter 4, a theory of linear models is given that applies directly to multivariate problems. This development, suggested by Kruskal's treatment of univariate linear models, contains results that identify all the linear models to which the Gauss-Markov Theorem applies.

Chapter 5 contains some standard matrix factorizations and some elementary Jacobians that are used in later chapters. In Chapter 6, the theory of invariant integrals (measures) is outlined. The many examples here were chosen to illustrate the theory and prepare the reader for the statistical applications to follow. A host of statistical applications of invariance, ranging from the invariance of likelihood methods to the use of invariance in deriving distributions and establishing independence, are given in Chapter 7. Invariance arguments are used throughout the remainder of the book.

The last three chapters are devoted to a discussion of some traditional and not so traditional problems in multivariate analysis. Here, I have stressed the connections between classical likelihood methods, linear model considerations, and invariance arguments. In Chapter 8, the Wishart distribution is defined via its representation in terms of normal random vectors. This representation, rather than the form of the Wishart density, is used to derive properties of the Wishart distribution. Chapter 9 begins with a thorough discussion of the multivariate analysis of variance (MANOVA) model. Variations on the MANOVA model, including multivariate linear models with structured covariances, are the main topic of the rest of Chapter 9. An invariance argument that leads to the relationship between canonical correlations and angles between subspaces is the lead topic in Chapter 10. After a discussion of some distribution theory, the chapter closes with the connection between testing for independence and testing in multivariate regression models.

Throughout the book, I have assumed that the reader is familiar with the basic ideas of matrix and vector algebra in coordinate spaces and has some knowledge of measure and integration theory. As for statistical prerequisites, a solid first year graduate course in mathematical statistics should suffice. The book is probably best read and used as it was written, from front to back. However, I have taught short (one quarter) courses on topics in MANOVA using the material in Chapters 1, 2, 3, 4, 8, and 9 as a basis.
It is very difficult to compare this text with others on multivariate analysis. Although there may be a moderate amount of overlap with other texts, the approach here is sufficiently different to make a direct comparison inappropriate. Upon reflection, my attraction to vector space and invariance methods was, I think, motivated by a desire for a more complete understanding of multivariate statistical models and techniques. Over the years, I have found vector space ideas and invariance arguments have served me well in this regard. There are many multivariate topics not even mentioned here. These include discrimination and classification, factor analysis, Bayesian multivariate analysis, asymptotic results, and decision theory results. Discussions of these topics can be found in one or more of the books listed in the Bibliography.

As multivariate analysis is a relatively old subject within statistics, a bibliography of the subject is very large. For example, the entries in A Bibliography of Multivariate Analysis by T. W. Anderson, S. Das Gupta, and G. H. P. Styan, published in 1972, number over 6000. The condensed bibliography here contains a few of the important early papers plus a sample of some recent work that reflects my bias. A more balanced view of the subject as a whole can be obtained by perusing the bibliographies of the multivariate texts listed in the Bibliography.

My special thanks go to the staff of the Institute of Mathematical Statistics at the University of Copenhagen for support and encouragement. It was at their invitation that I spent the 1971-1972 academic year at the University of Copenhagen lecturing on multivariate analysis. These lectures led to Multivariate Statistical Analysis, which contains some of the ideas and the flavor of this book. Much of the work herein was completed during a second visit to Copenhagen in 1977-1978. Portions of the work have been supported by the National Science Foundation and the University of Minnesota. This generous support is gratefully acknowledged.

A number of people have read different versions of my manuscript and have made a host of constructive suggestions. Particular thanks go to Michael Meyer, whose good sense of pedagogy led to major revisions in a number of places. Others whose help I would like to acknowledge are Murray Clayton, Siu Chuen Ho, and Takeaki Kariya.

Most of the typing of the manuscript was done by Hanne Hansen. Her efforts are very much appreciated. For their typing of various corrections, addenda, changes, and so on, I would like to thank Melinda Hutson, Catherine Stepnes, and Victoria Wagner.
Morris L. Eaton
Minneapolis, Minnesota
May 1983
Notation
(V, (·, ·))   an inner product space: vector space V with inner product (·, ·)
L(V, W)   the vector space of linear transformations on V to W
Gl(V)   the group of nonsingular linear transformations on V to V
O(V)   the orthogonal group of the inner product space (V, (·, ·))
R^n   Euclidean coordinate space of all n-dimensional column vectors
L_{p,n}   the linear space of all n × p real matrices
Gl_n   the group of n × n nonsingular matrices
O_n   the group of n × n orthogonal matrices
F_{p,n}   the space of n × p real matrices whose p columns form an orthonormal set in R^n
G_T^+   the group of lower triangular matrices with positive diagonal elements (dimension implied by context)
G_U^+   the group of upper triangular matrices with positive diagonal elements (dimension implied by context)
S_p^+   the set of p × p real symmetric positive definite matrices
A > 0   the matrix or linear transformation A is positive definite
A ≥ 0   A is positive semidefinite (non-negative definite)
det   determinant
tr   trace
x □ y   the outer product of the vectors x and y
A ⊗ B   the Kronecker product of the linear transformations A and B
Δ_r   the right-hand modulus of a locally compact topological group
L(·)   the distributional law of "·"
N(μ, Σ)   the normal distribution with mean μ and covariance Σ on an inner product space
W(Σ, p, n)   the Wishart distribution with n degrees of freedom and p × p parameter matrix Σ
CHAPTER 1
Vector Space Theory
In order to understand the structure and geometry of multivariate distributions and associated statistical problems, it is essential that we be able to distinguish those aspects of multivariate distributions that can be described without reference to a coordinate system and those that cannot. Finite dimensional vector space theory provides us with a framework in which it becomes relatively easy to distinguish between coordinate free and coordinate concepts. It is fair to say that the material presented in this chapter furnishes the language we use in the rest of this book to describe many of the geometric (coordinate free) and coordinate properties of multivariate probability models. The treatment of vector spaces here is far from complete, but those aspects of the theory that arise in later chapters are covered. Halmos (1958) has been followed quite closely in the first two sections of this chapter, and because of space limitations, proofs sometimes read "see Halmos (1958)."

The material in this chapter runs from the elementary notions of basis, dimension, linear transformation, and matrix to inner product space, orthogonal projection, and the spectral theorem for self-adjoint linear transformations. In particular, the linear space of linear transformations is studied in detail, and the chapter ends with a discussion of what is commonly known as the singular value decomposition theorem. Most of the vector spaces here are finite dimensional real vector spaces, although excursions into infinite dimensions occur via applications of the Cauchy-Schwarz Inequality. As might be expected, we introduce complex coordinate spaces in the discussion of determinants and eigenvalues.

Multilinear algebra and tensors are not covered systematically, although the outer product of vectors and the Kronecker product of linear transformations are covered. It was felt that the simplifications and generality obtained by introducing tensors were not worth the price in terms of added notation, vocabulary, and abstractness.
1.1. VECTOR SPACES
Let R denote the set of real numbers. Elements of R, called scalars, are denoted by α, β, ....
Definition 1.1. A set V, whose elements are called vectors, is called a real vector space if:

(I) To each pair of vectors x, y ∈ V, there is a vector x + y ∈ V, called the sum of x and y, and for all vectors in V,
(i) x + y = y + x.
(ii) (x + y) + z = x + (y + z).
(iii) There exists a unique vector 0 ∈ V such that x + 0 = x for all x.
(iv) For each x ∈ V, there is a unique vector -x such that x + (-x) = 0.

(II) For each α ∈ R and x ∈ V, there is a vector denoted by αx ∈ V, called the product of α and x, and for all scalars and vectors,
(i) α(βx) = (αβ)x.
(ii) 1x = x.
(iii) (α + β)x = αx + βx.
(iv) α(x + y) = αx + αy.

In II(iii), (α + β)x means the sum of the two scalars, α and β, times x, while αx + βx means the sum of the two vectors, αx and βx. This multiple use of the plus sign should not cause any confusion. The reason for calling V a real vector space is that multiplication of vectors by real numbers is permitted.

A classical example of a real vector space is the set R^n of all ordered n-tuples of real numbers. An element of R^n, say x, is represented as

        [x_1]
    x = [x_2] ,   x_i ∈ R, i = 1, ..., n,
        [ ⋮ ]
        [x_n]

and x_i is called the ith coordinate of x. The vector x + y has ith coordinate x_i + y_i, and αx, α ∈ R, is the vector with coordinates αx_i, i = 1, ..., n. With 0 ∈ R^n representing the vector of all zeroes, it is routine to check that R^n is a real vector space. Vectors in the coordinate space R^n are always represented by a column of n real numbers as indicated above. For typographical convenience, a vector is often written as a row and appears as x' = (x_1, ..., x_n). The prime denotes the transpose of the vector x ∈ R^n.

The following example provides a method of constructing real vector spaces and yields the space R^n as a special case.
* Example 1.1. Let 𝒳 be a set. The set V is the collection of all the real-valued functions defined on 𝒳. For any two elements x_1, x_2 ∈ V, define x_1 + x_2 as the function on 𝒳 whose value at t is x_1(t) + x_2(t). Also, if α ∈ R and x ∈ V, αx is the function on 𝒳 given by (αx)(t) = αx(t). The symbol 0 ∈ V is the zero function. It is easy to verify that V is a real vector space with these definitions of addition and scalar multiplication. When 𝒳 = {1, 2, ..., n}, then V is just the real vector space R^n, and x ∈ R^n has as its ith coordinate the value of x at i ∈ 𝒳. Every vector space discussed in the sequel is either V (for some set 𝒳) or a linear subspace (to be defined in a moment) of some V.
Before defining the dimension of a vector space, we need to discuss linear dependence and independence. The treatment here follows Halmos (1958, Sections 5-9). Let V be a real vector space.

Definition 1.2. A finite set of vectors {x_i | i = 1, ..., k} is linearly dependent if there exist real numbers α_1, ..., α_k, not all zero, such that Σα_i x_i = 0. Otherwise, {x_i | i = 1, ..., k} is linearly independent.

A brief word about summation notation. Ordinarily, we do not indicate indices of summation on a summation sign when the range of summation is clear from the context. For example, in Definition 1.2, the index i was specified to range between 1 and k before the summation on i appeared; hence, no range was indicated on the summation sign.

An arbitrary subset S ⊆ V is linearly independent if every finite subset of S is linearly independent. Otherwise, S is linearly dependent.
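Definition 1.2 has a convenient numerical counterpart: vectors x_1, ..., x_k in R^n are linearly independent exactly when the n × k matrix with the x_i as columns has rank k. A sketch with NumPy (the helper name is ours; `matrix_rank` uses an SVD-based tolerance, so this is a floating-point test, not an exact one):

```python
# Numerical check of linear independence via matrix rank.
import numpy as np

def is_linearly_independent(vectors):
    """vectors: list of 1-D arrays of equal length (elements of R^n)."""
    A = np.column_stack(vectors)
    return bool(np.linalg.matrix_rank(A) == len(vectors))

e1, e2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(is_linearly_independent([e1, e2]))           # True
print(is_linearly_independent([e1, e2, e1 + e2]))  # False: the sum is dependent
```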
Definition 1.3. A basis for a vector space V is a linearly independent set S such that every vector in V is a linear combination of elements of S. V is
finite dimensional if it has a finite set S that is a basis.
* Example 1.2. Take V = R^n and let ε_i = (0, ..., 0, 1, 0, ..., 0)', where the one occurs as the ith coordinate of ε_i, i = 1, ..., n. For x ∈ R^n, it is clear that x = Σx_i ε_i, where x_i is the ith coordinate of x. Thus every vector in R^n is a linear combination of ε_1, ..., ε_n. To show that {ε_i | i = 1, ..., n} is a linearly independent set, suppose Σα_i ε_i = 0 for some scalars α_i, i = 1, ..., n. Then x = Σα_i ε_i = 0 has α_i as its ith coordinate, so α_i = 0, i = 1, ..., n. Thus {ε_i | i = 1, ..., n} is a basis for R^n, and R^n is finite dimensional. The basis {ε_i | i = 1, ..., n} is called the standard basis for R^n.
Let V be a finite dimensional real vector space. The basic properties of linearly independent sets and bases are:

(i) If {x_1, ..., x_m} is a linearly independent set in V, then there exist vectors x_{m+1}, ..., x_{m+k} such that {x_1, ..., x_{m+k}} is a basis for V.
(ii) All bases for V have the same number of elements. The dimension of V is defined to be the number of elements in any basis.
(iii) Every set of n + 1 vectors in an n-dimensional vector space is linearly dependent.

Proofs of the above assertions can be found in Halmos (1958, Sections 5-8). The dimension of a finite dimensional vector space is denoted by dim(V). If {x_1, ..., x_n} is a basis for V, then every x ∈ V is a unique linear combination of {x_1, ..., x_n}, say x = Σα_i x_i. That every x can be so expressed follows from the definition of a basis, and the uniqueness follows from the linear independence of {x_1, ..., x_n}. The numbers α_1, ..., α_n are called the coordinates of x in the basis {x_1, ..., x_n}. Clearly, the coordinates of x depend on the order in which we write the basis. Thus by a basis we always mean an ordered basis.
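Finding the coordinates of x in an ordered basis amounts to solving a linear system: if B has the basis vectors as its columns, the coordinate vector a satisfies Ba = x. A brief NumPy sketch (the helper name is ours; it assumes the supplied vectors really form a basis, so B is nonsingular):

```python
# Coordinates of x in an (ordered) basis {b_1, ..., b_n} of R^n.
import numpy as np

def coordinates(basis, x):
    B = np.column_stack(basis)
    return np.linalg.solve(B, x)  # unique solution since B is nonsingular

b1, b2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
x = np.array([3.0, 1.0])
a = coordinates([b1, b2], x)
# x = 2*b1 + 1*b2, so the coordinates in this ordered basis are (2, 1).
print(a)
```

Reversing the order of b1 and b2 reverses the coordinates, which is the point of insisting on an ordered basis.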
We now introduce the notion of a subspace of a vector space.

Definition 1.4. A nonempty subset M ⊆ V is a subspace (or linear manifold) of V if, for each x, y ∈ M and α, β ∈ R, αx + βy ∈ M.

A subspace M of a real vector space V is easily shown to satisfy the vector space axioms (with addition and scalar multiplication inherited from V), so subspaces are real vector spaces. It is not difficult to verify the following assertions (Halmos, 1958, Sections 10-12):

(i) The intersection of subspaces is a subspace.
(ii) If M is a subspace of a finite dimensional vector space V, then dim(M) ≤ dim(V).
(iii) Given an m-dimensional subspace M of an n-dimensional vector space V, there is a basis {x_1, ..., x_m, ..., x_n} for V such that {x_1, ..., x_m} is a basis for M.
Given any set S ⊆ V, span(S) is defined to be the intersection of all the subspaces that contain S; that is, span(S) is the smallest subspace that contains S. It is routine to show that span(S) is equal to the set of all linear combinations of elements of S. The subspace span(S) is often called the subspace spanned by the set S.

If M and N are subspaces of V, then span(M ∪ N) is the set of all vectors of the form x + y where x ∈ M and y ∈ N. The suggestive notation M + N ≡ {z | z = x + y, x ∈ M, y ∈ N} is used for span(M ∪ N) when M and N are subspaces. Using the fact that a linearly independent set can be extended to a basis in a finite dimensional vector space, we have the following. Let V be finite dimensional and suppose M and N are subspaces of V.
(i) Let m = dim(M), n = dim(N), and k = dim(M ∩ N). Then there exist vectors x_1, ..., x_k, y_{k+1}, ..., y_m, and z_{k+1}, ..., z_n such that {x_1, ..., x_k} is a basis for M ∩ N, {x_1, ..., x_k, y_{k+1}, ..., y_m} is a basis for M, {x_1, ..., x_k, z_{k+1}, ..., z_n} is a basis for N, and {x_1, ..., x_k, y_{k+1}, ..., y_m, z_{k+1}, ..., z_n} is a basis for M + N. If k = 0, then {x_1, ..., x_k} is interpreted as the empty set.
(ii) dim(M + N) = dim(M) + dim(N) - dim(M ∩ N).
(iii) There exists a subspace M_1 ⊆ V such that M ∩ M_1 = {0} and M + M_1 = V.
Definition 1.5. If M and N are subspaces of V that satisfy M ∩ N = {0} and M + N = V, then M and N are complementary subspaces.

The technique of decomposing a vector space into two (or more) complementary subspaces arises again and again in the sequel. The basic property of such a decomposition is given in the following proposition.
Proposition 1.1. Suppose M and N are complementary subspaces in V. Then each x ∈ V has a unique representation x = y + z with y ∈ M and z ∈ N.

Proof. Since M + N = V, each x ∈ V can be written x = y_1 + z_1 with y_1 ∈ M and z_1 ∈ N. If x = y_2 + z_2 with y_2 ∈ M and z_2 ∈ N, then 0 = x - x = (y_1 - y_2) + (z_1 - z_2). Hence (y_2 - y_1) = (z_1 - z_2), so (y_2 - y_1) ∈ M ∩ N = {0}. Thus y_1 = y_2. Similarly, z_1 = z_2. □
The above proposition shows that we can decompose the vector space V into two vector spaces M and N, and each x in V has a unique piece in M and in N. Thus x can be represented as (y, z) with y ∈ M and z ∈ N. Also, note that if x_1, x_2 ∈ V have the representations (y_1, z_1), (y_2, z_2), then αx_1 + βx_2 has the representation (αy_1 + βy_2, αz_1 + βz_2), for α, β ∈ R. In other words, the function that maps x into its decomposition (y, z) is linear. To make this a bit more precise, we now define the direct sum of two vector spaces.

Definition 1.6. Let V_1 and V_2 be two real vector spaces. The direct sum of V_1 and V_2, denoted by V_1 ⊕ V_2, is the set of all ordered pairs {x, y}, x ∈ V_1, y ∈ V_2, with the linear operations defined by α_1{x_1, y_1} + α_2{x_2, y_2} = {α_1x_1 + α_2x_2, α_1y_1 + α_2y_2}.

That V_1 ⊕ V_2 is a real vector space with the above operations can easily be verified. Further, identifying V_1 with {{x, 0} | x ∈ V_1} and V_2 with {{0, y} | y ∈ V_2}, we can think of V_1 and V_2 as complementary subspaces of V_1 ⊕ V_2, since V_1 + V_2 = V_1 ⊕ V_2 and V_1 ∩ V_2 = {0, 0}, which is the zero element in V_1 ⊕ V_2. The relation of the direct sum to our previous decomposition of a vector space should be clear.
* Example 1.3. Consider V = R^n, n ≥ 2, and let p and q be positive integers such that p + q = n. Then R^p and R^q are both real vector spaces. Each element of R^n is an n-tuple of real numbers, and we can construct subspaces of R^n by setting some of these coordinates equal to zero. For example, consider M = {x ∈ R^n | x = (y, 0)' with y ∈ R^p, 0 ∈ R^q} and N = {x ∈ R^n | x = (0, z)' with 0 ∈ R^p and z ∈ R^q}. It is clear that dim(M) = p, dim(N) = q, M ∩ N = {0}, and M + N = R^n. The identification of R^p with M and R^q with N shows that it is reasonable to write R^p ⊕ R^q = R^{p+q}.
1.2. LINEAR TRANSFORMATIONS
Linear transformations occupy a central position, both in vector space theory and in multivariate analysis. In this section, we discuss the basic properties of linear transformations, leaving the deeper results for consideration after the introduction of inner products. Let V and W be real vector spaces.
Definition 1.7. Any function A defined on V and taking values in W is called a linear transformation if A(α_1x_1 + α_2x_2) = α_1A(x_1) + α_2A(x_2) for all x_1, x_2 ∈ V and α_1, α_2 ∈ R.

Frequently, A(x) is written Ax when there is no danger of confusion. Let L(V, W) be the set of all linear transformations on V to W. For two linear transformations A_1 and A_2 in L(V, W), A_1 + A_2 is defined by (A_1 + A_2)(x) = A_1x + A_2x, and (αA)(x) = αAx for α ∈ R. The zero linear transformation is denoted by 0. It should be clear that L(V, W) is a real vector space with these definitions of addition and scalar multiplication.
* Example 1.4. Suppose dim(V) = m and let x1, …, xm be a basis for V. Also, let y1, …, ym be arbitrary vectors in W. The claim is that there is a unique linear transformation A such that Axi = yi, i = 1, …, m. To see this, consider x ∈ V and express x as a unique linear combination of the basis vectors, x = Σ ai xi. Define A by

Ax = Σ ai Axi = Σ ai yi.

The linearity of A is easy to check. To show that A is unique, let B be another linear transformation with Bxi = yi, i = 1, …, m. Then (A - B)(xi) = 0 for i = 1, …, m, and (A - B)(x) = (A - B)(Σ ai xi) = Σ ai (A - B)(xi) = 0 for all x ∈ V. Thus A = B.
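Numerically, the construction in Example 1.4 amounts to solving for a matrix: if the basis vectors are stored as the columns of an invertible matrix X and their prescribed images as the columns of Y, the unique A with Axi = yi is Y X⁻¹. A sketch (the matrices are our own illustration, not from the text):

```python
import numpy as np

# Basis x1, x2 of R^2 as columns of X; prescribed images y1, y2 as columns of Y.
X = np.array([[1.0, 1.0],
              [0.0, 1.0]])
Y = np.array([[2.0, 3.0],
              [4.0, 5.0]])
# The unique linear A with A @ X[:, i] = Y[:, i] for each i:
A = Y @ np.linalg.inv(X)
```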
The above example illustrates a general principle: a linear transformation is completely determined by its values on a basis. This principle is used often to construct linear transformations with specified properties. A modification of the construction in Example 1.4 yields a basis for L(V, W) when V and W are both finite dimensional. This basis is given in the proof of the following proposition.
Proposition 1.2. If dim(V) = m and dim(W) = n, then dim(L(V, W)) = mn.

Proof. Let x1, …, xm be a basis for V and let y1, …, yn be a basis for W. Define a linear transformation Aji, i = 1, …, m and j = 1, …, n, by

Aji(xk) = 0 if k ≠ i,  Aji(xk) = yj if k = i.

For each (j, i), Aji has been defined on a basis in V, so the linear transformation Aji is uniquely determined. We now claim that {Aji | i = 1, …, m; j = 1, …, n} is a basis for L(V, W). To show linear independence, suppose Σj Σi aji Aji = 0. Then for each k = 1, …, m,

0 = Σj Σi aji Aji(xk) = Σj ajk yj.

Since {y1, …, yn} is a linearly independent set, this implies that ajk = 0 for all j and k. Thus linear independence holds. To show every A ∈ L(V, W) is a linear combination of the Aji, first note that Axk is a vector in W and thus is a unique linear combination of y1, …, yn, say Axk = Σj ajk yj where ajk ∈ R. However, the linear transformation Σj Σi aji Aji evaluated at xk is

Σj Σi aji Aji(xk) = Σj ajk yj.

Since A and Σj Σi aji Aji agree on a basis in V, they are equal. This completes the proof since there are mn elements in the basis {Aji | i = 1, …, m; j = 1, …, n} for L(V, W). □
Since L(V, W) is a vector space, general results about vector spaces, of course, apply to L(V, W). However, linear transformations have many interesting properties not possessed by vectors in general. For example, consider vector spaces Vi, i = 1, 2, 3. If A ∈ L(V1, V2) and B ∈ L(V2, V3), then we can compose the functions B and A by defining (BA)(x) = B(A(x)). The linearity of A and B implies that BA is a linear transformation on V1 to V3; that is, BA ∈ L(V1, V3). Usually, BA is called the product of B and A.

There are two special cases of L(V, W) that are of particular interest. First, if A, B ∈ L(V, V), then AB ∈ L(V, V) and BA ∈ L(V, V), so we have a multiplication defined in L(V, V). However, this multiplication is not commutative; that is, AB is not, in general, equal to BA. Clearly, A(B + C) = AB + AC for A, B, C ∈ L(V, V). The identity linear transformation in L(V, V), usually denoted by I, satisfies AI = IA = A for all A ∈ L(V, V), since Ix = x for all x ∈ V. Thus L(V, V) is not only a vector space, but there is a multiplication defined in L(V, V).

The second special case of L(V, W) we wish to consider is when W = R; that is, W is the one-dimensional real vector space R. The space L(V, R) is called the dual space of V and, if dim(V) = n, then dim(L(V, R)) = n. Clearly, L(V, R) is the vector space of all real-valued linear functions defined on V. We have more to say about L(V, R) after the introduction of inner products on V.
Understanding the geometry of linear transformations usually begins with a specification of the range and null space of the transformation. These objects are now defined. Let A ∈ L(V, W) where V and W are finite dimensional.

Definition 1.8. The range of A, denoted by R(A), is

R(A) = {u | u ∈ W, Ax = u for some x ∈ V}.

The null space of A, denoted by N(A), is

N(A) = {x | x ∈ V, Ax = 0}.
It is routine to verify that R(A) is a subspace of W and N(A) is a subspace of V. The rank of A, denoted by r(A), is the dimension of R(A).

Proposition 1.3. If A ∈ L(V, W) and n = dim(V), then r(A) + dim(N(A)) = n.
Proof. Let M be a subspace of V such that M ⊕ N(A) = V, and consider a basis {x1, …, xk} for M. Since dim(M) + dim(N(A)) = n, we need to show that k = r(A). To do this, it is sufficient to show that {Ax1, …, Axk} is a basis for R(A). If 0 = Σ ai Axi = A(Σ ai xi), then Σ ai xi ∈ M ∩ N(A), so Σ ai xi = 0. Hence a1 = … = ak = 0 as {x1, …, xk} is a basis for M. Thus {Ax1, …, Axk} is a linearly independent set. To verify that {Ax1, …, Axk} spans R(A), suppose w ∈ R(A). Then w = Ax for some x ∈ V. Write x = y + z where y ∈ M and z ∈ N(A). Then w = A(y + z) = Ay. Since y ∈ M, y = Σ ai xi for some scalars a1, …, ak. Therefore, w = A(Σ ai xi) = Σ ai Axi. □
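Proposition 1.3 is easy to check numerically: for any matrix, the rank plus the dimension of the null space equals the dimension of the domain. A sketch using the singular values to count the null space dimension (the example matrix is our own):

```python
import numpy as np

# A 3 x 4 matrix: the second row is twice the first, so r(A) = 2.
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [1.0, 0.0, 1.0, 0.0]])
rank = np.linalg.matrix_rank(A)
# dim N(A): count (near-)zero singular values, plus the "missing" ones
# when the domain dimension exceeds the number of singular values.
s = np.linalg.svd(A, compute_uv=False)
null_dim = int(np.sum(s < 1e-10)) + (A.shape[1] - len(s))
```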
Definition 1.9. A linear transformation A ∈ L(V, V) is called invertible if there exists a linear transformation, denoted by A⁻¹, such that AA⁻¹ = A⁻¹A = I.

The following assertions hold; see Halmos (1958, Section 36):

(i) A is invertible iff R(A) = V iff Ax = 0 implies x = 0.
(ii) If A, B, C ∈ L(V, V) and if AB = CA = I, then A is invertible and B = C = A⁻¹.
(iii) If A and B are invertible, then AB is invertible and (AB)⁻¹ = B⁻¹A⁻¹. If A is invertible and a ≠ 0, then (aA)⁻¹ = a⁻¹A⁻¹ and (A⁻¹)⁻¹ = A.
In terms of bases, invertible transformations are characterized by the following.
Proposition 1.4. Let A ∈ L(V, V) and suppose {x1, …, xn} is a basis for V. The following are equivalent:

(i) A is invertible.
(ii) {Ax1, …, Axn} is a basis for V.

Proof. Suppose A is invertible. Since dim(V) = n, we must show {Ax1, …, Axn} is a linearly independent set. Thus if 0 = Σ ai Axi = A(Σ ai xi), then Σ ai xi = 0 since A is invertible. Hence ai = 0, i = 1, …, n, as {x1, …, xn} is a basis for V. Therefore, {Ax1, …, Axn} is a basis.

Conversely, suppose {Ax1, …, Axn} is a basis. We show that Ax = 0 implies x = 0. First, write x = Σ ai xi, so Ax = 0 implies Σ ai Axi = 0. Hence ai = 0, i = 1, …, n, as {Ax1, …, Axn} is a basis. Thus x = 0, so A is invertible. □
We now introduce real matrices and consider their relation to linear transformations. Consider vector spaces V and W of dimension m and n, respectively, and bases {x1, …, xm} and {y1, …, yn} for V and W. Each x ∈ V has a unique representation x = Σ ai xi. Let [x] denote the column vector of coordinates of x in the given basis. Thus [x] ∈ R^m and the ith coordinate of [x] is ai, i = 1, …, m. Similarly, [y] ∈ R^n is the column vector of coordinates of y ∈ W in the basis {y1, …, yn}. Consider A ∈ L(V, W) and express Axj in the given basis of W: Axj = Σi aij yi for unique scalars aij, i = 1, …, n, j = 1, …, m. The n × m rectangular array of real scalars

[A] = (aij) =

a11 a12 … a1m
a21 a22 … a2m
 .   .      .
an1 an2 … anm

is called the matrix of A relative to the two given bases. Conversely, given any n × m rectangular array of real scalars (aij), i = 1, …, n, j = 1, …, m, the linear transformation A defined by Axj = Σi aij yi has as its matrix [A] = (aij).

Definition 1.10. A rectangular array (aij): m × n of real scalars is called an m × n matrix. If A = (aij): m × n is a matrix and B = (bij): n × p is a matrix, then C = AB, called the matrix product of A and B (in that order), is defined to be the matrix (cij): m × p with cij = Σk aik bkj.
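The row-by-column rule of Definition 1.10, written out explicitly and checked against NumPy's built-in product (a didactic sketch; in practice one would simply write A @ B):

```python
import numpy as np

def matprod(A, B):
    # c_ij = sum_k a_ik * b_kj, exactly as in Definition 1.10.
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must agree"
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))
    return C

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # 3 x 2
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])     # 2 x 3
C = matprod(A, B)                   # 3 x 3
```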
In this book, the distinction between linear transformations, matrices, and the matrix of a linear transformation is always made. The notation [A] means the matrix of a linear transformation with respect to two given bases. However, symbols like A, B, or C may represent either linear transformations or real matrices; care is taken to clearly indicate which case is under consideration.

Each matrix A = (aij): m × n defines a linear transformation on R^n to R^m as follows. For x ∈ R^n with coordinates x1, …, xn, Ax is the vector y in R^m with coordinates yi = Σj aij xj, i = 1, …, m. Of course, this is the usual row-by-column rule of a matrix operating on a vector. The matrix of this linear transformation in the standard bases for R^n and R^m is just the matrix A. However, if the bases are changed, then the matrix of the linear transformation changes. When m = n, the matrix A = (aij) determines a linear transformation on R^n to R^n via the above definition of a matrix times a vector. The matrix A is called nonsingular (or invertible) if there exists a matrix, denoted by A⁻¹, such that AA⁻¹ = A⁻¹A = In, where In is the n × n identity matrix consisting of ones on the diagonal and zeroes off the diagonal. As with linear transformations, A⁻¹ is unique and exists iff Ax = 0 implies x = 0.
The symbol L_{n,m} denotes the real vector space of m × n real matrices with the usual operations of addition and scalar multiplication. In other words, if A = (aij) and B = (bij) are elements of L_{n,m}, then A + B = (aij + bij) and aA = (a aij). Notice that L_{n,m} is the set of m × n matrices (m and n are in reverse order). The reason for writing L_{n,m} is that an m × n matrix determines a linear transformation from R^n to R^m. We have made the choice of writing L(V, W) for linear transformations from V to W, and it is an unpleasant fact that the dimensions of a matrix occur in reverse order to the dimensions of the spaces V and W. The next result summarizes the relations between linear transformations and matrices.
Proposition 1.5. Consider vector spaces V1, V2, and V3 with bases {x1, …, xn1}, {y1, …, yn2}, and {z1, …, zn3}, respectively. For x ∈ V1, y ∈ V2, and z ∈ V3, let [x], [y], and [z] denote the vectors of coordinates of x, y, and z in the given bases, so [x] ∈ R^{n1}, [y] ∈ R^{n2}, and [z] ∈ R^{n3}. For A ∈ L(V1, V2) and B ∈ L(V2, V3), let [A] ([B]) denote the matrix of A (B) relative to the bases {x1, …, xn1} and {y1, …, yn2} ({y1, …, yn2} and {z1, …, zn3}). Then:

(i) [Ax] = [A][x].
(ii) [BA] = [B][A].
(iii) If V1 = V2 and A is invertible, [A⁻¹] = [A]⁻¹. Here, [A⁻¹] and [A] are matrices in the bases {x1, …, xn1} and {x1, …, xn1}.
Proof. A few words are in order concerning the notation in (i), (ii), and (iii). In (i), [Ax] is the vector of coordinates of Ax ∈ V2 with respect to the basis {y1, …, yn2}, and [A][x] means the matrix [A] times the coordinate vector [x] as defined previously. Since both sides of (i) are linear in x, it suffices to verify (i) for x = xj, j = 1, …, n1. But [A][xj] is just the column vector with coordinates aij, i = 1, …, n2, and Axj = Σi aij yi, so [Axj] is the column vector with coordinates aij, i = 1, …, n2. Hence (i) holds.

For (ii), [B][A] is just the matrix product of [B] and [A]. Also, [BA] is the matrix of the linear transformation BA ∈ L(V1, V3) with respect to the bases {x1, …, xn1} and {z1, …, zn3}. To show that [BA] = [B][A], we must verify that, for all x ∈ V1, [BA][x] = [B][A][x]. But by (i), [BA][x] = [BAx] and, using (i) twice, [B][A][x] = [B][Ax] = [BAx]. Thus (ii) is established.

In (iii), [A]⁻¹ denotes the inverse of the matrix [A]. Since A is invertible, AA⁻¹ = A⁻¹A = I where I is the identity linear transformation on V1 to V1. Thus by (ii), with In denoting the n × n identity matrix, In = [I] = [AA⁻¹] = [A][A⁻¹] = [A⁻¹A] = [A⁻¹][A]. By the uniqueness of the matrix inverse, [A⁻¹] = [A]⁻¹. □
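Proposition 1.5(ii) can be verified numerically. Storing each basis as the columns of an invertible matrix, the matrix of T relative to a pair of bases has as its jth column the coordinates of T(xj) in the output basis. A sketch with random bases (our own construction, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 3, 4, 2
# Random (almost surely invertible) bases for V1, V2, V3, as columns.
X = rng.normal(size=(n1, n1))
Y = rng.normal(size=(n2, n2))
Z = rng.normal(size=(n3, n3))
# Random linear transformations A: V1 -> V2 and B: V2 -> V3.
A = rng.normal(size=(n2, n1))
B = rng.normal(size=(n3, n2))

def mat(T, basis_out, basis_in):
    # Column j = coordinates of T(x_j) in basis_out, i.e. basis_out^{-1} T x_j.
    return np.linalg.solve(basis_out, T @ basis_in)

lhs = mat(B @ A, Z, X)              # [BA]
rhs = mat(B, Z, Y) @ mat(A, Y, X)   # [B][A]
```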
Projections are the final topic in this section. If V is a finite dimensional vector space and M and N are subspaces of V such that M ⊕ N = V, we have seen that each x ∈ V has a unique piece in M and a unique piece in N. In other words, x = y + z where y ∈ M, z ∈ N, and y and z are unique.

Definition 1.11. Given subspaces M and N in V such that M ⊕ N = V, if x = y + z with y ∈ M and z ∈ N, then y is called the projection of x on M along N and z is called the projection of x on N along M.
Since M and N play symmetric roles in the above definition, we concentrate on the projection on M.
Proposition 1.6. The function P mapping V into V whose value at x is the projection of x on M along N is a linear transformation that satisfies

(i) R(P) = M, N(P) = N.
(ii) P² = P.
Proof. We first show that P is linear. If x = y + z with y ∈ M, z ∈ N, then by definition, Px = y. Also, if x1 = y1 + z1 and x2 = y2 + z2 are the decompositions of x1 and x2, respectively, then a1 x1 + a2 x2 = (a1 y1 + a2 y2) + (a1 z1 + a2 z2) is the decomposition of a1 x1 + a2 x2. Thus P(a1 x1 + a2 x2) = a1 Px1 + a2 Px2, so P is linear. By definition Px ∈ M, so R(P) ⊆ M. But if x ∈ M, Px = x, and R(P) = M. Also, if x ∈ N, Px = 0, so N(P) ⊇ N. However, if Px = 0, then x = 0 + x, and therefore x ∈ N. Thus N(P) = N. To show P² = P, note that Px ∈ M and Px = x for x ∈ M. Hence, Px = P(Px) = P²x, which implies that P = P². □
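A projection on M along N can be computed concretely: express x in a basis adapted to M ⊕ N = V and keep only the M-coordinates. A small sketch in R^2 (our own example; note that M and N need not be perpendicular, so P is in general not symmetric):

```python
import numpy as np

M = np.array([[1.0], [1.0]])    # M = span{(1, 1)}
N = np.array([[1.0], [0.0]])    # N = span{(1, 0)};  M (+) N = R^2
basis = np.hstack([M, N])
# Keep the M-coordinate, send the N-coordinate to zero:
P = np.hstack([M, np.zeros_like(N)]) @ np.linalg.inv(basis)
```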
A converse to Proposition 1.6 gives a complete description of all linear transformations on V to V that satisfy A² = A.

Proposition 1.7. If A ∈ L(V, V) and satisfies A² = A, then R(A) ⊕ N(A) = V and A is the projection on R(A) along N(A).
Proof. To show R(A) ⊕ N(A) = V, we must verify that R(A) ∩ N(A) = {0} and that each x ∈ V is the sum of a vector in R(A) and a vector in N(A). If x ∈ R(A) ∩ N(A), then x = Ay for some y ∈ V and Ax = 0. Since A² = A, 0 = Ax = A²y = Ay = x, and R(A) ∩ N(A) = {0}. For x ∈ V, write x = Ax + (I - A)x and let y = Ax and z = (I - A)x. Then y ∈ R(A) by definition and Az = A(I - A)x = (A - A²)x = 0, so z ∈ N(A). Thus R(A) ⊕ N(A) = V.

The verification that A is the projection on R(A) along N(A) goes as follows. A is zero on N(A) by definition. Also, for x ∈ R(A), x = Ay for some y ∈ V. Thus Ax = A²y = Ay = x, so Ax = x. However, the projection on R(A) along N(A), say P, also satisfies Px = x for x ∈ R(A) and Px = 0 for x ∈ N(A). This implies that P = A since R(A) ⊕ N(A) = V. □
The above proof shows that the projection on M along N is the unique linear transformation that is the identity on M and zero on N. Also, it is clear that P is the projection on M along N iff I - P is the projection on N along M.
1.3. INNER PRODUCT SPACES
The discussion of the previous section was concerned mainly with the linear aspects of vector spaces. Here, we introduce inner products on vector spaces so that the geometric notions of length, angle, and orthogonality become
meaningful. Let us begin with an example.
* Example 1.5. Consider coordinate space R^n with the standard basis {ε1, …, εn}. For x, y ∈ R^n, define x'y = Σ xi yi where x and y have coordinates x1, …, xn and y1, …, yn. Of course, x' is the transpose of the vector x, and x'y can be thought of as the 1 × n matrix x' times the n × 1 matrix y. The real number x'y is sometimes called the scalar product (or inner product) of x and y. Some properties of the scalar product are:

(i) x'y = y'x (symmetry).
(ii) x'y is linear in y for fixed x and linear in x for fixed y.
(iii) x'x = Σ xi² ≥ 0 and is zero iff x = 0.

The norm of x, defined by ||x|| = (x'x)^{1/2}, can be thought of as the distance between x and 0 ∈ R^n. Hence, ||x - y|| = (Σ(xi - yi)²)^{1/2} is usually called the distance between x and y. When x and y are both not zero, then the cosine of the angle between x and y is x'y/(||x|| ||y||) (see Halmos, 1958, p. 118). Thus we have a geometric interpretation of the scalar product. In particular, the angle between x and y is π/2 (cos π/2 = 0) iff x'y = 0. Thus we say x and y are orthogonal (perpendicular) iff x'y = 0.
Let V be a real vector space. An inner product on V is obtained by simply abstracting the properties of the scalar product on R^n.
Definition 1.12. An inner product on a real vector space V is a real-valued function on V × V, denoted by (·, ·), with the following properties:

(i) (x, y) = (y, x) (symmetry).
(ii) (a1 x1 + a2 x2, y) = a1 (x1, y) + a2 (x2, y) (linearity).
(iii) (x, x) ≥ 0 and (x, x) = 0 only if x = 0 (positivity).
From (i) and (ii) it follows that (x, a1 y1 + a2 y2) = a1 (x, y1) + a2 (x, y2). In other words, inner products are linear in each variable when the other variable is fixed. The norm of x, denoted by ||x||, is defined to be ||x|| = (x, x)^{1/2}, and the distance between x and y is ||x - y||. Hence geometrically meaningful names and properties related to the scalar product on R^n have become definitions on V. To establish the existence of inner products on finite dimensional vector spaces, we have the following proposition.
Proposition 1.8. Suppose {x1, …, xn} is a basis for the real vector space V. The function (·, ·) defined on V × V by (x, y) = Σ αi βi, where x = Σ αi xi and y = Σ βi xi, is an inner product on V.

Proof. Clearly (x, y) = (y, x). If x = Σ αi xi and z = Σ γi xi, then for scalars a and c, (ax + cz, y) = Σ (a αi + c γi) βi = a Σ αi βi + c Σ γi βi = a(x, y) + c(z, y). This establishes the linearity. Also, (x, x) = Σ αi², which is zero iff all the αi are zero, and this is equivalent to x being zero. Thus (·, ·) is an inner product on V. □
A vector space V with a given inner product (·, ·) is called an inner product space.
Definition 1.13. Two vectors x and y in an inner product space (V, (·, ·)) are orthogonal, written x ⊥ y, if (x, y) = 0. Two subsets S1 and S2 of V are orthogonal, written S1 ⊥ S2, if x ⊥ y for all x ∈ S1 and y ∈ S2.

Definition 1.14. Let (V, (·, ·)) be a finite dimensional inner product space. A set of vectors {x1, …, xk} is called an orthonormal set if (xi, xj) = δij for i, j = 1, …, k, where δij = 1 if i = j and 0 if i ≠ j. A set {x1, …, xk} is called an orthonormal basis if the set is both a basis and an orthonormal set.
First note that an orthonormal set {x1, …, xk} is linearly independent. To see this, suppose 0 = Σ ai xi. Then 0 = (0, xj) = (Σ ai xi, xj) = Σ ai (xi, xj) = Σi ai δij = aj. Hence aj = 0 for j = 1, …, k, and the set {x1, …, xk} is linearly independent.
In Proposition 1.8, the basis used to define the inner product is, in fact, an orthonormal basis for that inner product. Also, the standard basis for R^n is an orthonormal basis for the scalar product on R^n; this scalar product is called the standard inner product on R^n. An algorithm for constructing orthonormal sets from linearly independent sets is now given. It is known as the Gram-Schmidt orthogonalization procedure.
Proposition 1.9. Let {x1, …, xk} be a linearly independent set in the inner product space (V, (·, ·)). Define vectors y1, …, yk as follows:

y1 = x1/||x1||

and

y_{i+1} = (x_{i+1} - Σ_{j=1}^{i} (x_{i+1}, yj) yj) / ||x_{i+1} - Σ_{j=1}^{i} (x_{i+1}, yj) yj||

for i = 1, …, k - 1. Then {y1, …, yk} is an orthonormal set and span{x1, …, xi} = span{y1, …, yi}, i = 1, …, k.

Proof. See Halmos (1958, Section 65). □
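The recursion of Proposition 1.9 translates directly into code. A classical, unpivoted sketch (for serious numerical work one would prefer a QR factorization, which is better conditioned):

```python
import numpy as np

def gram_schmidt(xs):
    # Subtract from each x its projections on the y's found so far,
    # then normalize; division by ~0 signals a dependent input set.
    ys = []
    for x in xs:
        v = x - sum(np.dot(x, y) * y for y in ys)
        ys.append(v / np.linalg.norm(v))
    return ys

xs = [np.array([1.0, 1.0, 0.0]),
      np.array([1.0, 0.0, 1.0]),
      np.array([0.0, 1.0, 1.0])]
ys = gram_schmidt(xs)
```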
An immediate consequence of Proposition 1.9 is that if {x1, …, xn} is a basis for V, then {y1, …, yn} constructed above is an orthonormal basis for (V, (·, ·)). If {y1, …, yn} is an orthonormal basis for (V, (·, ·)), then each x in V has the representation x = Σ (x, yi) yi in the given basis. To see this, we know x = Σ ai yi for unique scalars a1, …, an. Thus

(x, yj) = Σi ai (yi, yj) = Σi ai δij = aj.

Therefore, the coordinates of x in the orthonormal basis are (x, yi), i = 1, …, n. Also, it follows that (x, x) = Σ (x, yi)².

Recall that the dual space of V was defined to be the set of all real-valued linear functions on V and was denoted by L(V, R). Also dim(V) = dim(L(V, R)) when V is finite dimensional. The identification of V with L(V, R) via a given inner product is described in the following proposition.
Proposition 1.10. If (V, (·, ·)) is a finite dimensional inner product space and if f ∈ L(V, R), then there exists a vector x0 ∈ V such that f(x) = (x0, x) for x ∈ V. Conversely, (x0, ·) is a linear function on V for each x0 ∈ V.

Proof. Let x1, …, xn be an orthonormal basis for V and set ai = f(xi) for i = 1, …, n. For x0 = Σ ai xi, it is clear that (x0, xj) = aj = f(xj). Since the two linear functions f and (x0, ·) agree on a basis, they are the same function. Thus f(x) = (x0, x) for x ∈ V. The converse is clear. □
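The proof of Proposition 1.10 is constructive: over an orthonormal basis, x0 = Σ f(xi) xi. A sketch with the standard inner product on R^3 (the functional f is our own example):

```python
import numpy as np

# An arbitrary linear functional on R^3.
f = lambda x: 2.0 * x[0] - x[1] + 3.0 * x[2]

# Build the representing vector from f's values on the standard
# (orthonormal) basis, as in the proof of Proposition 1.10.
x0 = sum(f(e) * e for e in np.eye(3))
```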
Definition 1.15. If S is a subset of V, the orthogonal complement of S, denoted by S⊥, is S⊥ = {x | x ⊥ y for all y ∈ S}.

It is easily verified that S⊥ is a subspace of V for any set S, and S ⊥ S⊥. The next result provides a basic decomposition for a finite dimensional inner product space.
Proposition 1.11. Suppose M is a k-dimensional subspace of an n-dimensional inner product space (V, (·, ·)). Then

(i) M ∩ M⊥ = {0}.
(ii) M ⊕ M⊥ = V.
(iii) (M⊥)⊥ = M.
Proof. Let {x1, …, xn} be a basis for V such that {x1, …, xk} is a basis for M. Applying the Gram-Schmidt process to {x1, …, xn}, we get an orthonormal basis {y1, …, yn} such that {y1, …, yk} is a basis for M. Let N = span{y_{k+1}, …, yn}. We claim that N = M⊥. It is clear that N ⊆ M⊥ since yj ⊥ M for j = k + 1, …, n. But if x ∈ M⊥, then x = Σ_{i=1}^{n} (x, yi) yi and (x, yi) = 0 for i = 1, …, k since x ∈ M⊥; that is, x = Σ_{i=k+1}^{n} (x, yi) yi ∈ N. Therefore, M⊥ = N. Assertions (i) and (ii) now follow easily. For (iii), M⊥ is spanned by {y_{k+1}, …, yn} and, arguing as above, (M⊥)⊥ must be spanned by {y1, …, yk}, which is just M. □
The decomposition V = M ⊕ M⊥ of an inner product space is called an orthogonal direct sum decomposition. More generally, if M1, …, Mk are subspaces of V such that Mi ⊥ Mj for i ≠ j and V = M1 ⊕ M2 ⊕ … ⊕ Mk, we also speak of the orthogonal direct sum decomposition of V. As we have seen, every direct sum decomposition of a finite dimensional vector space has associated with it two projections. When V is an inner product space and V = M ⊕ M⊥, then the projection on M along M⊥ is called the orthogonal projection onto M. If P is the orthogonal projection onto M, then I - P is the orthogonal projection onto M⊥. The thing that makes a projection an orthogonal projection is that its null space must be the orthogonal complement of its range. After introducing adjoints of linear transformations, a useful characterization of orthogonal projections is given.
When (V, (·, ·)) is an inner product space, a number of special types of linear transformations in L(V, V) arise. First, we discuss the adjoint of a linear transformation. For A ∈ L(V, V), consider (x, Ay). For x fixed, (x, Ay) is a linear function of y and, by Proposition 1.10, there exists a unique vector (which depends on x) z(x) ∈ V such that (x, Ay) = (z(x), y) for all y ∈ V. Thus z defines a function from V to V that takes x into z(x). The verification that z(a1 x1 + a2 x2) = a1 z(x1) + a2 z(x2) is routine. Thus the function z is a linear transformation on V to V, and this leads to the following definition.
Definition 1.16. For A ∈ L(V, V), the unique linear transformation in L(V, V), denoted by A', which satisfies (x, Ay) = (A'x, y) for all x, y ∈ V, is called the adjoint (or transpose) of A.

The uniqueness of A' in Definition 1.16 follows from the observation that if (Bx, y) = (Cx, y) for all x, y ∈ V, then ((B - C)x, y) = 0. Taking y = (B - C)x yields ((B - C)x, (B - C)x) = 0 for all x, so (B - C)x = 0 for all x. Hence B = C.
Proposition 1.12. If A, B ∈ L(V, V), then (AB)' = B'A', and if A is invertible, then (A⁻¹)' = (A')⁻¹. Also, (A')' = A.

Proof. (AB)' is the transformation in L(V, V) that satisfies ((AB)'x, y) = (x, ABy). Using the definition of A' and B', (x, ABy) = (A'x, By) = (B'A'x, y). Thus (AB)' = B'A'. The other assertions are proved similarly. □
Definition 1.17. A linear transformation A in L(V, V) is called:

(i) self-adjoint (or symmetric) if A = A';
(ii) skew-symmetric if A' = -A;
(iii) orthogonal if (Ax, Ay) = (x, y) for x, y ∈ V.

For self-adjoint transformations, A is:

(iv) non-negative definite (or positive semidefinite) if (x, Ax) ≥ 0 for x ∈ V;
(v) positive definite if (x, Ax) > 0 for all x ≠ 0.
The remainder of this section is concerned with a variety of descriptions
and characterizations of the classes of transformations defined above.
Proposition 1.13. Let A ∈ L(V, V). Then

(i) R(A) = (N(A'))⊥.
(ii) R(A) = R(AA').
(iii) N(A) = N(A'A).
(iv) r(A) = r(A').
Proof. Assertion (i) is equivalent to (R(A))⊥ = N(A'). But x ∈ N(A') means that 0 = (y, A'x) for all y ∈ V, and this is equivalent to x ⊥ R(A) since (y, A'x) = (Ay, x). This proves (i). For (ii), it is clear that R(AA') ⊆ R(A). If x ∈ R(A), then x = Ay for some y ∈ V. Write y = y1 + y2 where y1 ∈ R(A') and y2 ∈ (R(A'))⊥. From (i), (R(A'))⊥ = N(A), so Ay2 = 0. Since y1 ∈ R(A'), y1 = A'z for some z ∈ V. Thus x = Ay = Ay1 = AA'z, so x ∈ R(AA').

To prove (iii), if Ax = 0, then A'Ax = 0, so N(A) ⊆ N(A'A). Conversely, if A'Ax = 0, then 0 = (x, A'Ax) = (Ax, Ax), so Ax = 0, and N(A'A) ⊆ N(A). For (iv), since dim(R(A)) + dim(N(A)) = dim(V), dim(R(A')) + dim(N(A')) = dim(V), and R(A) = (N(A'))⊥, it follows that r(A) = r(A'). □
If A ∈ L(V, V) and r(A) = 0, then A = 0 since A must map everything into 0 ∈ V. We now discuss the rank one linear transformations and show that these can be thought of as the "building blocks" for L(V, V).
Proposition 1.14. For A ∈ L(V, V), the following are equivalent:

(i) r(A) = 1.
(ii) There exist x0 ≠ 0 and y0 ≠ 0 in V such that Ax = (y0, x) x0 for x ∈ V.

Proof. That (ii) implies (i) is clear since, if Ax = (y0, x) x0, then R(A) = span{x0}, which is one-dimensional. Thus suppose r(A) = 1. Since R(A) is one-dimensional, there exists x0 ∈ R(A) with x0 ≠ 0 and R(A) = span{x0}. As Ax ∈ R(A) for all x, Ax = a(x) x0 where a(x) is some scalar that depends on x. The linearity of A implies that a(β1 x1 + β2 x2) = β1 a(x1) + β2 a(x2). Thus a is a linear function on V and, by Proposition 1.10, a(x) = (y0, x) for some y0 ∈ V. Since a(x) ≠ 0 for some x ∈ V, y0 ≠ 0. Therefore, (i) implies (ii). □
This description of the rank one linear transformations leads to the following definition.

Definition 1.18. Given x, y ∈ V, the outer product of x and y, denoted by x □ y, is the linear transformation on V to V whose value at z is (x □ y)z = (y, z)x.

Thus x □ y ∈ L(V, V), and x □ y = 0 iff x or y is zero. When x ≠ 0 and y ≠ 0, R(x □ y) = span{x} and N(x □ y) = (span{y})⊥. The result of Proposition 1.14 shows that every rank one transformation is an outer product of two nonzero vectors. The following properties of outer products are easily verified:

(i) x □ (a1 y1 + a2 y2) = a1 x □ y1 + a2 x □ y2.
(ii) (a1 x1 + a2 x2) □ y = a1 x1 □ y + a2 x2 □ y.
(iii) (x □ y)' = y □ x.
(iv) (x1 □ y1)(x2 □ y2) = (y1, x2) x1 □ y2.

One word of caution: the definition of the outer product depends on the inner product on V. When there is more than one inner product for V, care must be taken to indicate which inner product is being used to define the outer product. The claim that rank one linear transformations are the building blocks for L(V, V) is partially justified by the following proposition.
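With the standard inner product on R^n, x □ y is the matrix x y', i.e. np.outer(x, y), since (x □ y)z = (y, z)x = x(y'z). Properties (iii) and (iv) then become matrix identities, checked below (our own numerical sketch):

```python
import numpy as np

x1, y1 = np.array([1.0, 2.0]), np.array([3.0, 4.0])
x2, y2 = np.array([0.0, 1.0]), np.array([5.0, 6.0])
z = np.array([7.0, 8.0])

box = np.outer                      # x box y  =  x y'
# Defining property: (x box y) z = (y, z) x.
lhs_action = box(x1, y1) @ z
rhs_action = np.dot(y1, z) * x1
# Property (iv): (x1 box y1)(x2 box y2) = (y1, x2) x1 box y2.
lhs_iv = box(x1, y1) @ box(x2, y2)
rhs_iv = np.dot(y1, x2) * box(x1, y2)
```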
Proposition 1.15. Let {x1, …, xn} be an orthonormal basis for (V, (·, ·)). Then {xi □ xj | i, j = 1, …, n} is a basis for L(V, V).

Proof. If A ∈ L(V, V), A is determined by the n² numbers aij = (xi, Axj). But the linear transformation B = Σk Σl akl xk □ xl satisfies

(xi, Bxj) = (xi, (Σk Σl akl xk □ xl) xj) = Σk Σl akl (xl, xj)(xi, xk) = aij.

Thus B = A, so every A ∈ L(V, V) is a linear combination of {xi □ xj | i, j = 1, …, n}. Since dim(L(V, V)) = n², the result follows. □
Using outer products, it is easy to give examples of self-adjoint linear transformations. First, since linear combinations of self-adjoint linear transformations are again self-adjoint, the set M of self-adjoint transformations is a subspace of L(V, V). Also, the set N of skew-symmetric transformations is a subspace of L(V, V). It is clear that the only transformation that is both self-adjoint and skew-symmetric is 0, so M ∩ N = {0}. But if A ∈ L(V, V), then

A = (A + A')/2 + (A - A')/2, with (A + A')/2 ∈ M and (A - A')/2 ∈ N.

This shows that L(V, V) = M ⊕ N. To give examples of elements of M, let x1, …, xn be an orthonormal basis for (V, (·, ·)). For each i, xi □ xi is self-adjoint, so for scalars ai, B = Σ ai xi □ xi is self-adjoint. The geometry associated with the transformation B is interesting and easy to describe. Since ||xi|| = 1, (xi □ xi)² = xi □ xi, so xi □ xi is a projection on span{xi} along (span{xi})⊥; that is, xi □ xi is the orthogonal projection on span{xi}, as the null space of xi □ xi is the orthogonal complement of its range. Let Mi = span{xi}, i = 1, …, n. Each Mi is a one-dimensional subspace of (V, (·, ·)), Mi ⊥ Mj if i ≠ j, and M1 ⊕ M2 ⊕ … ⊕ Mn = V. Hence, V is the direct sum of n mutually orthogonal subspaces, and each x ∈ V has the unique representation x = Σ (x, xi) xi where (x, xi) xi = (xi □ xi)x is the projection of x onto Mi, i = 1, …, n. Since B is linear, the value of Bx is completely determined by the value of B on each Mi, i = 1, …, n. However, if y ∈ Mj, then y = a xj for some a ∈ R and By = a Bxj = a Σ ai (xi □ xi) xj = a aj xj = aj y. Thus when B is restricted to Mj, B is aj times the identity transformation, and understanding how B transforms vectors has become particularly simple. In summary, take x ∈ V and write x = Σ (x, xi) xi; then Bx = Σ ai (x, xi) xi. What is especially fascinating and useful is that every self-adjoint transformation in L(V, V) has the representation Σ ai xi □ xi for some orthonormal basis for V and some scalars a1, …, an. This fact is known as the spectral theorem and is discussed in more detail later in this chapter. For the time being, we are content with the following observation about the self-adjoint transformation B = Σ ai xi □ xi: B is positive definite iff ai > 0, i = 1, …, n. This follows since (x, Bx) = Σ ai (x, xi)² and x = 0 iff (x, xi) = 0 for all i = 1, …, n. For exactly the same reasons, B is non-negative definite iff ai ≥ 0 for i = 1, …, n. Proposition 1.16 introduces a useful property of self-adjoint transformations.
Proposition 1.16. If A1 and A2 are self-adjoint linear transformations in L(V, V) such that (x, A1x) = (x, A2x) for all x, then A1 = A2.

Proof. It suffices to show that (x, A1y) = (x, A2y) for all x, y ∈ V. But

(x + y, A1(x + y)) = (x, A1x) + (y, A1y) + 2(x, A1y)
= (x + y, A2(x + y)) = (x, A2x) + (y, A2y) + 2(x, A2y).

Since (z, A1z) = (z, A2z) for all z ∈ V, we see that (x, A1y) = (x, A2y). □
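The representation B = Σ ai xi □ xi described above can be sketched numerically: for a symmetric matrix, np.linalg.eigh returns the scalars ai and an orthonormal basis of eigenvectors xi, and B is recovered as a sum of rank one outer products (our own small example):

```python
import numpy as np

B = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # self-adjoint (symmetric)
a, X = np.linalg.eigh(B)            # eigenvalues a_i, orthonormal columns x_i
# Reassemble B = sum_i a_i * (x_i box x_i).
B_rebuilt = sum(a[i] * np.outer(X[:, i], X[:, i]) for i in range(len(a)))
```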
In the above discussion, it has been observed that, if x ∈ V and ||x|| = 1, then x □ x is the orthogonal projection onto the one-dimensional subspace span{x}. Recall that P ∈ L(V, V) is an orthogonal projection if P is a projection (i.e., P² = P) and if N(P) = (R(P))⊥. The next result characterizes orthogonal projections as those projections that are self-adjoint.
Proposition 1.17. If P ∈ L(V, V), the following are equivalent:
(i) P is an orthogonal projection.
(ii) P² = P = P'.
Proof. If (ii) holds, then P is a projection and P is self-adjoint. By Proposition 1.13, R(P) = (N(P'))⊥ = (N(P))⊥ since P = P'. Thus P is an orthogonal projection. Conversely, if (i) holds, then P² = P since P is a projection. We must show that if P is a projection and R(P) = (N(P))⊥, then P = P'. Since V = R(P) ⊕ N(P), consider x, y ∈ V and write x = x_1 + x_2, y = y_1 + y_2 with x_1, y_1 ∈ R(P) and x_2, y_2 ∈ N(P) = (R(P))⊥. Using the fact that P is the identity on R(P), compute as follows:

(P'x, y) = (x, Py) = (x_1 + x_2, Py_1) = (x_1, y_1) = (Px_1, y_1)
= (P(x_1 + x_2), y_1 + y_2)
= (Px, y).
Since P' is the unique linear transformation that satisfies (x, Py) = (P'x, y), we have P = P'. □
It is sometimes convenient to represent an orthogonal projection in terms of outer products. If P is the orthogonal projection onto M, let {x_1, ..., x_k} be an orthonormal basis for M in (V, (·,·)). Set A = Σx_i □ x_i so A is self-adjoint. If x ∈ M, then x = Σ(x, x_i)x_i and Ax = (Σx_i □ x_i)x = Σ(x, x_i)x_i = x. If x ∈ M⊥, then Ax = 0. Since A agrees with P on M and M⊥, A = P = Σx_i □ x_i. Thus all orthogonal projections are sums of rank one orthogonal projections (given by outer products) and different terms in the sum are orthogonal to each other (i.e., (x_i □ x_i)(x_j □ x_j) = 0 if i ≠ j).
Generalizing this a little bit, two orthogonal projections P_1 and P_2 are called orthogonal if P_1P_2 = 0. It is not hard to show that P_1 and P_2 are orthogonal to each other iff the range of P_1 and the range of P_2 are orthogonal to each other, as subspaces. The next result shows that a sum of orthogonal projections is an orthogonal projection iff each pair of summands is orthogonal.
Proposition 1.18. Let P_1, ..., P_k be orthogonal projections on (V, (·,·)). Then P = P_1 + ··· + P_k is an orthogonal projection iff P_iP_j = 0 for i ≠ j.
Proof. See Halmos (1958, Section 76). □
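For matrices this is easy to illustrate. In the numpy sketch below (the coordinate subspaces of R⁴ are arbitrary illustrative choices), P_1 and P_2 project onto orthogonal subspaces, so P_1P_2 = 0 and their sum is again an orthogonal projection:

```python
import numpy as np

# Orthogonal projections onto span{e1} and span{e2, e3} in R^4.
e = np.eye(4)
P1 = np.outer(e[:, 0], e[:, 0])                        # rank-one projection
P2 = np.outer(e[:, 1], e[:, 1]) + np.outer(e[:, 2], e[:, 2])

assert np.allclose(P1 @ P2, 0)                         # orthogonal to each other

# Hence P1 + P2 is an orthogonal projection: P^2 = P = P'.
P = P1 + P2
assert np.allclose(P @ P, P)
assert np.allclose(P, P.T)
```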
We now turn to a discussion of orthogonal linear transformations on an inner product space (V, (·,·)). Basically, an orthogonal transformation is one that preserves the geometric structure (distance and angles) of the inner product. A variety of characterizations of orthogonal transformations is possible.
Proposition 1.19. If (V, (·,·)) is a finite dimensional inner product space and if A ∈ L(V, V), then the following are equivalent:
(i) (Ax, Ay) = (x, y) for all x, y ∈ V.
(ii) ||Ax|| = ||x|| for all x ∈ V.
(iii) AA' = A'A = I.
(iv) If {x_1, ..., x_n} is an orthonormal basis for (V, (·,·)), then {Ax_1, ..., Ax_n} is also an orthonormal basis for (V, (·,·)).
Proof. Recall that (i) is our definition of an orthogonal transformation. We prove that (i) implies (ii), (ii) implies (iii), (iii) implies (i), and then show that (i) implies (iv) and (iv) implies (ii). That (i) implies (ii) is clear since ||Ax||² = (Ax, Ax). For (ii) implies (iii), (x, x) = (Ax, Ax) = (x, A'Ax) implies that A'A = I since A'A and I are self-adjoint (see Proposition 1.16). But, by the uniqueness of inverses, this shows that A' = A⁻¹, so I = AA⁻¹ = AA' and (iii) holds. Assuming (iii), we have (x, y) = (x, A'Ay) = (Ax, Ay) and (i) holds. If (i) holds and {x_1, ..., x_n} is an orthonormal basis for (V, (·,·)), then δ_ij = (x_i, x_j) = (Ax_i, Ax_j), which implies that {Ax_1, ..., Ax_n} is an orthonormal basis. Now, assume (iv) holds. For x ∈ V, we have x = Σ(x, x_i)x_i and ||x||² = Σ(x, x_i)². Thus

||Ax||² = (Ax, Ax) = (Σ_i(x, x_i)Ax_i, Σ_j(x, x_j)Ax_j)
= Σ_iΣ_j(x, x_i)(x, x_j)(Ax_i, Ax_j) = Σ_iΣ_j(x, x_i)(x, x_j)δ_ij
= Σ_i(x, x_i)² = ||x||².

Therefore (ii) holds. □
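These characterizations are easy to check numerically for matrices. The numpy sketch below (with orthogonal matrices obtained from QR factorizations of arbitrary random matrices) verifies (ii) and (iii), and also the closure properties noted next:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two orthogonal matrices: Q factors of random 3 x 3 matrices.
A1, _ = np.linalg.qr(rng.normal(size=(3, 3)))
A2, _ = np.linalg.qr(rng.normal(size=(3, 3)))
I = np.eye(3)

assert np.allclose(A1 @ A1.T, I) and np.allclose(A1.T @ A1, I)   # (iii)
x = rng.normal(size=3)
assert np.isclose(np.linalg.norm(A1 @ x), np.linalg.norm(x))     # (ii)

# The inverse A1' and the product A1 A2 are again orthogonal.
assert np.allclose(A1.T @ A1, I)
assert np.allclose((A1 @ A2) @ (A1 @ A2).T, I)
```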
Some immediate consequences of the preceding proposition are: if A is orthogonal, so is A⁻¹ = A', and if A_1 and A_2 are orthogonal, then A_1A_2 is orthogonal. Let O(V) denote all the orthogonal transformations on the inner product space (V, (·,·)). Then O(V) is closed under inverses, I ∈ O(V), and O(V) is closed under products of linear transformations. In other words, O(V) is a group of linear transformations on (V, (·,·)), and O(V) is called the orthogonal group of (V, (·,·)). This and many other groups of linear transformations are studied in later chapters.
One characterization of orthogonal transformations on (V, (·,·)) is that they map orthonormal bases into orthonormal bases. Thus given two orthonormal bases, there exists a unique orthogonal transformation that maps one basis onto the other. This leads to the following question. Suppose {x_1, ..., x_k} and {y_1, ..., y_k} are two finite sets of vectors in (V, (·,·)). Under what conditions will there exist an orthogonal transformation A such that Ax_i = y_i for i = 1, ..., k? If such an A ∈ O(V) exists, then (x_i, x_j) = (Ax_i, Ax_j) = (y_i, y_j) for all i, j = 1, ..., k. That this condition is also sufficient for the existence of an A ∈ O(V) that maps x_i to y_i, i = 1, ..., k, is the content of the next result.
Proposition 1.20. Let {x_1, ..., x_k} and {y_1, ..., y_k} be finite sets in (V, (·,·)). The following are equivalent:
(i) (x_i, x_j) = (y_i, y_j) for i, j = 1, ..., k.
(ii) There exists an A ∈ O(V) such that Ax_i = y_i for i = 1, ..., k.
Proof. That (ii) implies (i) is clear, so assume that (i) holds. Let M = span{x_1, ..., x_k}. The idea of the proof is to define A on M using linearity and then extend the definition of A to V using linearity again. Of course, it must be verified that all this makes sense and that the A so defined is orthogonal. The details of this, which are primarily computational, follow.
First, by (i), Σα_i x_i = 0 iff Σα_i y_i = 0 since

(Σα_i x_i, Σα_j x_j) = ΣΣα_iα_j(x_i, x_j) = ΣΣα_iα_j(y_i, y_j) = (Σα_i y_i, Σα_j y_j).

Let N = span{y_1, ..., y_k} and define B on M to N by B(Σα_i x_i) = Σα_i y_i. B is well defined since Σα_i x_i = Σβ_i x_i implies that Σα_i y_i = Σβ_i y_i, and the linearity of B on M is easy to check. Since B maps M onto N, dim(N) ≤ dim(M). But if B(Σα_i x_i) = 0, then Σα_i y_i = 0, so Σα_i x_i = 0. Therefore the null space of B is {0} ⊆ M and dim(M) = dim(N). Let M⊥ and N⊥ be the orthogonal complements of M and N, respectively, and let {u_1, ..., u_s} and {v_1, ..., v_s} be orthonormal bases for M⊥ and N⊥, respectively. Extend the definition of B to V by first defining B(u_i) = v_i for i = 1, ..., s and then extending by linearity. Let A be the linear transformation so defined. We now claim that ||Aw||² = ||w||² for all w ∈ V. To see this, write w = w_1 + w_2 where w_1 ∈ M and w_2 ∈ M⊥. Then Aw_1 ∈ N and Aw_2 ∈ N⊥. Thus ||Aw||² = ||Aw_1 + Aw_2||² = ||Aw_1||² + ||Aw_2||². But w_1 = Σα_i x_i for some scalars α_i. Thus

||Aw_1||² = (A(Σα_i x_i), A(Σα_j x_j)) = ΣΣα_iα_j(Ax_i, Ax_j)
= ΣΣα_iα_j(y_i, y_j) = ΣΣα_iα_j(x_i, x_j)
= (Σα_i x_i, Σα_j x_j) = ||w_1||².

Similarly, ||Aw_2||² = ||w_2||². Since ||w||² = ||w_1||² + ||w_2||², the claim that ||Aw||² = ||w||² is established. By Proposition 1.19, A is orthogonal. □
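In the special case where the x_i span V, no extension step is needed: A is determined by Ax_i = y_i, and condition (i) forces it to be orthogonal. The numpy sketch below illustrates this special case only (the vectors and the orthogonal matrix R used to manufacture the y_i are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 3))                    # columns x_1, x_2, x_3 span R^3
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a known orthogonal map
Y = R @ X                                      # y_i = R x_i

# Condition (i): the two sets have the same inner products (Gram matrices).
assert np.allclose(X.T @ X, Y.T @ Y)

# Since the x_i span R^3, A is determined by AX = Y, and A is orthogonal.
A = Y @ np.linalg.inv(X)
assert np.allclose(A @ X, Y)                   # A x_i = y_i
assert np.allclose(A.T @ A, np.eye(3))         # A is orthogonal
```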
* Example 1.6. Consider the real vector space Rⁿ with the standard basis and the usual inner product. Also, let L_{n,n} be the real vector space of all n × n real matrices. Thus each element of L_{n,n} determines a linear transformation on Rⁿ and vice versa. More precisely, if A is a linear transformation on Rⁿ to Rⁿ and [A] denotes the matrix of A in the standard basis on both the range and domain of A, then [Ax] = [A]x for x ∈ Rⁿ. Here, [Ax] ∈ Rⁿ is the vector of coordinates of Ax in the standard basis and [A]x means the matrix [A] = (a_ij) times the coordinate vector x ∈ Rⁿ. Conversely, if [A] ∈ L_{n,n} and we define a linear transformation A by Ax = [A]x, then the matrix of A is [A]. It is easy to show that if A is a linear
transformation on Rⁿ to Rⁿ with the standard inner product, then [A'] = [A]', where A' denotes the adjoint of A and [A]' denotes the transpose of the matrix [A]. Now, we are in a position to relate the notions of self-adjointness and skew-symmetry of linear transformations to properties of matrices. Proofs of the following two assertions are straightforward and are left to the reader. Let A be a linear transformation on Rⁿ to Rⁿ with matrix [A].
(i) A is self-adjoint iff [A] = [A]'.
(ii) A is skew-symmetric iff [A]' = −[A].
Elements of L_{n,n} that satisfy B = B' are usually called symmetric matrices, while the term skew-symmetric is used if B' = −B, B ∈ L_{n,n}. Also, the matrix B is called positive definite if x'Bx > 0 for all x ∈ Rⁿ, x ≠ 0. Of course, x'Bx is just the standard inner product of x with Bx. Clearly, B is positive definite iff the linear transformation it defines is positive definite.
If A is an orthogonal transformation on Rⁿ to Rⁿ, then [A] must satisfy [A][A]' = [A]'[A] = I_n where I_n is the n × n identity matrix. Thus a matrix B ∈ L_{n,n} is called orthogonal if BB' = B'B = I_n. An interesting geometric interpretation of the condition BB' = B'B = I_n follows. If B = (b_ij), the vectors b_j ∈ Rⁿ with coordinates b_ij, i = 1, ..., n, are the column vectors of B and the vectors c_i ∈ Rⁿ with coordinates b_ij, j = 1, ..., n, are the row vectors of B. The matrix BB' has elements c_i'c_j, and the condition BB' = I_n means that c_i'c_j = δ_ij; that is, the vectors c_1, ..., c_n form an orthonormal basis for Rⁿ in the usual inner product. Similarly, the condition B'B = I_n holds iff the vectors b_1, ..., b_n form an orthonormal basis for Rⁿ. Hence a matrix B is orthogonal iff both its rows and columns determine an orthonormal basis for Rⁿ with the standard inner product.
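A small numpy check of this interpretation (the rotation matrix below is an arbitrary illustrative choice of orthogonal B):

```python
import numpy as np

# An orthogonal 3 x 3 matrix: a rotation about the third coordinate axis.
t = 0.7
B = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])
I = np.eye(3)

assert np.allclose(B @ B.T, I)   # rows c_1, ..., c_n are orthonormal
assert np.allclose(B.T @ B, I)   # columns b_1, ..., b_n are orthonormal
```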
1.4. THE CAUCHY-SCHWARZ INEQUALITY
The form of the Cauchy-Schwarz Inequality given here is general enough to be applicable to both finite and infinite dimensional vector spaces. The examples below illustrate that the generality is needed to treat some standard situations that arise in analysis and in the study of random variables. In a finite dimensional inner product space (V, (·,·)), the inequality established in this section shows that |(x, y)| ≤ ||x|| ||y||, where ||x||² = (x, x). Thus −1 ≤ (x, y)/(||x|| ||y||) ≤ 1 and the quantity (x, y)/(||x|| ||y||) is defined to be the cosine of the angle between the vectors x and y. A variety of applications of the Cauchy-Schwarz Inequality arise in later chapters.
We now proceed with the technical discussion. Suppose that V is a real vector space, not necessarily finite dimensional. Let [·,·] denote a non-negative definite symmetric bilinear function on V × V; that is, [·,·] is a real-valued function on V × V that satisfies (i) [x, y] = [y, x], (ii) [α_1x_1 + α_2x_2, y] = α_1[x_1, y] + α_2[x_2, y], and (iii) [x, x] ≥ 0. It is clear that (i) and (ii) imply that [x, α_1y_1 + α_2y_2] = α_1[x, y_1] + α_2[x, y_2]. The Cauchy-Schwarz Inequality states that [x, y]² ≤ [x, x][y, y]. We also give necessary and sufficient conditions for equality to hold in this inequality. First, a preliminary result.
Proposition 1.21. Let M = {x | [x, x] = 0}. Then M is a subspace of V.
Proof. If x ∈ M and α ∈ R, then [αx, αx] = α²[x, x] = 0, so αx ∈ M. Thus we must show that if x_1, x_2 ∈ M, then x_1 + x_2 ∈ M. For α ∈ R,

0 ≤ [x_1 + αx_2, x_1 + αx_2] = [x_1, x_1] + 2α[x_1, x_2] + α²[x_2, x_2] = 2α[x_1, x_2]

since x_1, x_2 ∈ M. But if 2α[x_1, x_2] ≥ 0 for all α ∈ R, then [x_1, x_2] = 0, and this implies that 0 = [x_1 + αx_2, x_1 + αx_2] for all α ∈ R by the above equality. Therefore, x_1 + αx_2 ∈ M for all α when x_1, x_2 ∈ M, and thus M is a subspace. □
Theorem 1.1 (Cauchy-Schwarz Inequality). Let [·,·] be a non-negative definite symmetric bilinear function on V × V and set M = {x | [x, x] = 0}. Then:
(i) [x, y]² ≤ [x, x][y, y] for x, y ∈ V.
(ii) [x, y]² = [x, x][y, y] iff αx + βy ∈ M for some real α and β not both zero.
Proof. To prove (i), we consider two cases. If x ∈ M, then 0 ≤ [y + αx, y + αx] = [y, y] + 2α[x, y] for all α ∈ R, so [x, y] = 0 and (i) holds. Similarly, if y ∈ M, (i) holds. If x ∉ M and y ∉ M, let x_1 = x/[x, x]^{1/2} and let y_1 = y/[y, y]^{1/2}. Then we must show that |[x_1, y_1]| ≤ 1. This follows from the two inequalities

0 ≤ [x_1 − y_1, x_1 − y_1] = 2 − 2[x_1, y_1]
and
0 ≤ [x_1 + y_1, x_1 + y_1] = 2 + 2[x_1, y_1].
The proof of (i) is now complete.
To prove (ii), first assume that [x, y]² = [x, x][y, y]. If either x ∈ M or y ∈ M, then αx + βy ∈ M for some α, β not both zero. Thus consider x ∉ M and y ∉ M. An examination of the proof of (i) shows that we can have equality in (i) iff either 0 = [x_1 − y_1, x_1 − y_1] or 0 = [x_1 + y_1, x_1 + y_1], and, in either case, this implies that αx + βy ∈ M for some real α, β not both zero. Now, assume αx + βy ∈ M for some real α, β not both zero. If α = 0 or β = 0 or x ∈ M or y ∈ M, we clearly have equality in (i). For the case when αβ ≠ 0, x ∉ M, and y ∉ M, our assumption implies that x_1 + γy_1 ∈ M for some γ ≠ 0, since M is a subspace. Thus there is a real γ ≠ 0 such that 0 = [x_1 + γy_1, x_1 + γy_1] = 1 + 2γ[x_1, y_1] + γ². The equation for the roots of a quadratic shows that this can hold only if |[x_1, y_1]| = 1. Hence equality in (i) holds. □
* Example 1.7. Let (V, (·,·)) be a finite dimensional inner product space and suppose A is a non-negative definite linear transformation on V to V. Then [x, y] = (x, Ay) is a non-negative definite symmetric bilinear function. The set M = {x | (x, Ax) = 0} is equal to N(A); this follows easily from Theorem 1.1(i). Theorem 1.1 shows that (x, Ay)² ≤ (x, Ax)(y, Ay) and provides conditions for equality. In particular, when A is nonsingular, M = {0} and equality holds iff x and y are linearly dependent. Of course, if A = I, then we have (x, y)² ≤ ||x||²||y||², which is one classical form of the Cauchy-Schwarz Inequality.
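A numerical sketch of this example (the non-negative definite A is manufactured as GG' from an arbitrary random G, and the vectors are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(size=(4, 4))
A = G @ G.T                           # A = GG' is non-negative definite
x, y = rng.normal(size=4), rng.normal(size=4)

# (x, Ay)^2 <= (x, Ax)(y, Ay)
lhs = (x @ A @ y) ** 2
rhs = (x @ A @ x) * (y @ A @ y)
assert lhs <= rhs

# Equality when x and y are linearly dependent (here y = 2x).
assert np.isclose((x @ A @ (2 * x)) ** 2,
                  (x @ A @ x) * ((2 * x) @ A @ (2 * x)))
```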
* Example 1.8. In this example, take V to be the set of all continuous real-valued functions defined on a closed bounded interval, say a to b, of the real line. It is easily verified that

[x_1, x_2] = ∫_a^b x_1(t)x_2(t) dt

is symmetric, bilinear, and non-negative definite. Also, [x, x] > 0 unless x = 0 since x is continuous. Hence M = {0}. The Cauchy-Schwarz Inequality yields

(∫_a^b x_1(t)x_2(t) dt)² ≤ ∫_a^b x_1²(t) dt ∫_a^b x_2²(t) dt.
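Approximating the integrals by Riemann sums gives a quick numerical check (the interval [0, 1] and the two functions are arbitrary illustrative choices; the discrete sums themselves satisfy the finite dimensional Cauchy-Schwarz Inequality, so the assertion holds exactly, not just up to discretization error):

```python
import numpy as np

# Riemann sums on a fine grid over [0, 1].
t = np.linspace(0.0, 1.0, 10001)
dt = t[1] - t[0]
x1, x2 = np.sin(3 * t), np.exp(t)

lhs = (np.sum(x1 * x2) * dt) ** 2                      # (integral x1 x2)^2
rhs = (np.sum(x1 ** 2) * dt) * (np.sum(x2 ** 2) * dt)  # integral x1^2 * integral x2^2
assert lhs <= rhs
```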
* Example 1.9. The following example has its origins in the study of the covariance between two real-valued random variables. Consider a probability space (Ω, ℱ, P_0) where Ω is a set, ℱ is a σ-algebra of subsets of Ω, and P_0 is a probability measure on ℱ. A random variable X is a real-valued function defined on Ω such that the inverse image of each Borel set in R is an element of ℱ; symbolically, X⁻¹(B) ∈ ℱ for each Borel set B of R. Sums and products of random variables are random variables, and the constant functions on Ω are random variables. If X is a random variable such that ∫|X(ω)| P_0(dω) < +∞, then X is integrable and we write ℰX for ∫X(ω) P_0(dω).
Now, let V be the collection of all real-valued random variables X such that ℰX² < +∞. It is clear that if X ∈ V, then αX ∈ V for all real α. Since (X_1 + X_2)² ≤ 2(X_1² + X_2²), if X_1 and X_2 are in V, then X_1 + X_2 is in V. Thus V is a real vector space with addition being the pointwise addition of random variables and scalar multiplication being pointwise multiplication of random variables by scalars. For X_1, X_2 ∈ V, the inequality |X_1X_2| ≤ X_1² + X_2² implies that X_1X_2 is integrable. In particular, setting X_2 = 1, X_1 is integrable. Define [·,·] on V × V by [X_1, X_2] = ℰ(X_1X_2). That [·,·] is symmetric and bilinear is clear. Since [X_1, X_1] = ℰX_1² ≥ 0, [·,·] is non-negative definite. The Cauchy-Schwarz Inequality yields (ℰX_1X_2)² ≤ ℰX_1²ℰX_2², and setting X_2 = 1, this gives (ℰX_1)² ≤ ℰX_1². Of course, this is just a verification that the variance of a random variable is non-negative. For future use, let var(X_1) = ℰX_1² − (ℰX_1)². To discuss conditions for equality in the Cauchy-Schwarz Inequality, the subspace M = {X | [X, X] = 0} needs to be described. Since [X, X] = ℰX², X ∈ M iff X is zero, except on a set of P_0 measure zero; that is, X = 0 a.e. (P_0). Therefore, (ℰX_1X_2)² = ℰX_1²ℰX_2² iff αX_1 + βX_2 = 0 a.e. (P_0) for some real α, β not both zero. In particular, var(X_1) = 0 iff X_1 − ℰX_1 = 0 a.e. (P_0).
a.e. (PO). A somewhat more interesting non-negative definite symmetric
bilinear function on V X V is
cov(Xl, X2) -9XIX2 - "I"29
and is called the covariance between X, and X2. Symmetry is clear
and bilinearity is easily checked. Since cov(X,, X,) = X412 -
(;X 1)2 = var(X,), cov(, }) is non-negative definite and M, =
{ Xlcov( X, X) = 0) is just the set of random variables in V that have
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
THE SPACE f (V, W) 29
variance zero. For this case, the Cauchy-Schwarz Inequality is
(CoV{X1, X2))2 I var(Xl)var(X2).
Equality holds iff there exist a, /3, not both zero, such that var(aX, + 13X2) = 0; or equivalently, a(Xl - EX1) + ,B(X2 - &X2) = 0
a.e. (PO) for some a, /B not both zero. The properties of cov{-, *}
given here are used in the next chapter to define the covariance of a
random vector.
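Replacing expectations by sample averages gives a concrete check of the covariance inequality (the distributions below are arbitrary illustrative choices; the sample version of the inequality holds exactly because averages over a sample also define a non-negative definite bilinear function):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)      # correlated with X1

# Sample analogs of cov and var, following cov(X1, X2) = E X1 X2 - (E X1)(E X2).
cov12 = np.mean(X1 * X2) - np.mean(X1) * np.mean(X2)
var1 = np.mean(X1 ** 2) - np.mean(X1) ** 2
var2 = np.mean(X2 ** 2) - np.mean(X2) ** 2
assert cov12 ** 2 <= var1 * var2

# Equality when X3 is an affine function of X1 (an exact algebraic identity).
X3 = 2.0 * X1 + 1.0
cov13 = np.mean(X1 * X3) - np.mean(X1) * np.mean(X3)
var3 = np.mean(X3 ** 2) - np.mean(X3) ** 2
assert np.isclose(cov13 ** 2, var1 * var3)
```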
1.5. THE SPACE L(V, W)
When (V, (·,·)) is an inner product space, the adjoint of a linear transformation in L(V, V) was introduced in Section 1.3 and used to define some special linear transformations in L(V, V). Here, some of the notions discussed in relation to L(V, V) are extended to the case of linear transformations in L(V, W) where (V, (·,·)) and (W, [·,·]) are two inner product spaces. In particular, adjoints and outer products are defined, bilinear functions on V × W are characterized, and Kronecker products are introduced. Of course, all the results in this section apply to L(V, V) by taking (W, [·,·]) = (V, (·,·)), and the reader should take particular notice of this special case.
There is one point that needs some clarification. Given (V, (·,·)) and (W, [·,·]), the adjoint of A ∈ L(V, W), to be defined below, depends on both the inner products (·,·) and [·,·]. However, in the previous discussion of adjoints in L(V, V), it was assumed that the inner product was the same on both the range and the domain of the linear transformation (i.e., V is the domain and range). Whenever we discuss adjoints of A ∈ L(V, V), it is assumed that only one inner product is involved, unless the contrary is explicitly stated; that is, when specializing results from L(V, W) to L(V, V), we take W = V and [·,·] = (·,·).
The first order of business is to define the adjoint of A ∈ L(V, W) where (V, (·,·)) and (W, [·,·]) are inner product spaces. For a fixed w ∈ W, [w, Ax] is a linear function of x ∈ V and, by Proposition 1.10, there exists a unique vector y(w) ∈ V such that [w, Ax] = (y(w), x) for all x ∈ V. It is easy to verify that y(α_1w_1 + α_2w_2) = α_1y(w_1) + α_2y(w_2). Hence y(·) determines a linear transformation on W to V, say A', which satisfies [w, Ax] = (A'w, x) for all w ∈ W and x ∈ V.
Definition 1.19. Given inner product spaces (V, (·,·)) and (W, [·,·]), if A ∈ L(V, W), the unique linear transformation A' ∈ L(W, V) that satisfies [w, Ax] = (A'w, x) for all w ∈ W and x ∈ V is called the adjoint of A.
The existence and uniqueness of A' were demonstrated in the discussion preceding Definition 1.19. It is not hard to show that (A + B)' = A' + B', (A')' = A, and (αA)' = αA'. In the present context, Proposition 1.13 becomes Proposition 1.22.
Proposition 1.22. Suppose A ∈ L(V, W). Then:
(i) R(A) = (N(A'))⊥.
(ii) R(A) = R(AA').
(iii) N(A) = N(A'A).
(iv) r(A) = r(A').
Proof. The proof here is essentially the same as that given for Proposition 1.13 and is left to the reader. □
The notion of an outer product has a natural extension to L(V, W).
Definition 1.20. For x ∈ (V, (·,·)) and w ∈ (W, [·,·]), the outer product w □ x is that linear transformation in L(V, W) given by (w □ x)(y) = (x, y)w for all y ∈ V.
If w = 0 or x = 0, then w □ x = 0. When both w and x are not zero, then w □ x has rank one, R(w □ x) = span{w}, and N(w □ x) = (span{x})⊥. Also, a minor modification of the proof of Proposition 1.14 shows that, if A ∈ L(V, W), then r(A) = 1 iff A = w □ x for some nonzero w and x.
Proposition 1.23. The outer product has the following properties:
(i) (α_1w_1 + α_2w_2) □ x = α_1 w_1 □ x + α_2 w_2 □ x.
(ii) w □ (α_1x_1 + α_2x_2) = α_1 w □ x_1 + α_2 w □ x_2.
(iii) (w □ x)' = x □ w ∈ L(W, V).
If (V_1, (·,·)_1), (V_2, (·,·)_2), and (V_3, (·,·)_3) are inner product spaces with x_1 ∈ V_1, x_2, y_2 ∈ V_2, and y_3 ∈ V_3, then
(iv) (y_3 □ y_2)(x_2 □ x_1) = (x_2, y_2)_2 y_3 □ x_1 ∈ L(V_1, V_3).
Proof. Assertions (i), (ii), and (iii) follow easily. For (iv), consider x ∈ V_1. Then (x_2 □ x_1)x = (x_1, x)_1 x_2, so (y_3 □ y_2)(x_2 □ x_1)x = (x_1, x)_1(y_3 □ y_2)x_2 = (x_1, x)_1(y_2, x_2)_2 y_3 ∈ V_3. However, (x_2, y_2)_2(y_3 □ x_1)x = (x_2, y_2)_2(x_1, x)_1 y_3. Thus (iv) holds. □
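In coordinates, w □ x is the rank-one matrix wx' (this matrix form is made explicit in Example 1.10 below). A numpy sketch with arbitrary illustrative vectors checks the defining equation and properties (iii) and (iv):

```python
import numpy as np

rng = np.random.default_rng(5)
w, x = rng.normal(size=4), rng.normal(size=3)   # w in W = R^4, x in V = R^3

box = np.outer(w, x)                            # the matrix of w [] x is w x'
y = rng.normal(size=3)
assert np.allclose(box @ y, np.dot(x, y) * w)   # (w [] x)y = (x, y)w
assert np.linalg.matrix_rank(box) == 1          # rank one
assert np.allclose(box.T, np.outer(x, w))       # (iii): (w [] x)' = x [] w

# (iv): (y3 [] y2)(x2 [] x1) = (x2, y2) y3 [] x1
x1 = rng.normal(size=2)
x2, y2 = rng.normal(size=3), rng.normal(size=3)
y3 = rng.normal(size=4)
assert np.allclose(np.outer(y3, y2) @ np.outer(x2, x1),
                   np.dot(x2, y2) * np.outer(y3, x1))
```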
There is a natural way to construct an inner product on L(V, W) from inner products on V and W. This construction and its relation to outer products are described in the next proposition.
Proposition 1.24. Let {x_1, ..., x_m} be an orthonormal basis for (V, (·,·)) and let {w_1, ..., w_n} be an orthonormal basis for (W, [·,·]). Then:
(i) {w_i □ x_j | i = 1, ..., n, j = 1, ..., m} is a basis for L(V, W).
Let a_ij = [w_i, Ax_j]. Then:
(ii) A = ΣΣ a_ij w_i □ x_j and the matrix of A is [A] = (a_ij) in the given bases.
If A = ΣΣ a_ij w_i □ x_j and B = ΣΣ b_ij w_i □ x_j, define ⟨A, B⟩ = ΣΣ a_ij b_ij. Then:
(iii) ⟨·,·⟩ is an inner product on L(V, W) and {w_i □ x_j | i = 1, ..., n, j = 1, ..., m} is an orthonormal basis for (L(V, W), ⟨·,·⟩).
Proof. Since dim(L(V, W)) = mn, to prove (i) it suffices to prove (ii). Let B = ΣΣ a_ij w_i □ x_j. Then

[w_k, Bx_l] = Σ_iΣ_j a_ij [w_k, (w_i □ x_j)x_l] = Σ_iΣ_j a_ij δ_ik δ_jl = a_kl,

so [w_i, Bx_j] = [w_i, Ax_j] for i = 1, ..., n and j = 1, ..., m. Therefore, [w, Bx] = [w, Ax] for all w ∈ W and x ∈ V, which implies that [w, (B − A)x] = 0. Choosing w = (B − A)x, we see that (B − A)x = 0 for all x ∈ V and, therefore, B = A. To show that the matrix of A is [A] = (a_ij), recall that the matrix of A consists of the scalars b_kj defined by Ax_j = Σ_k b_kj w_k. The inner product of w_i with each side of this equation is

a_ij = [w_i, Ax_j] = Σ_k b_kj [w_i, w_k] = b_ij,

and the proof of (ii) is complete. For (iii), ⟨·,·⟩ is clearly symmetric and bilinear. Since ⟨A, A⟩ = ΣΣ a_ij², the positivity of ⟨·,·⟩ follows. That {w_i □ x_j | i = 1, ..., n, j = 1, ..., m} is an orthonormal basis for (L(V, W), ⟨·,·⟩) follows immediately from the definition of ⟨·,·⟩. □
A few words are in order concerning the inner product ⟨·,·⟩ on L(V, W). Since {w_i □ x_j | i = 1, ..., n, j = 1, ..., m} is an orthonormal basis, we know that if A ∈ L(V, W), then

A = ΣΣ ⟨A, w_i □ x_j⟩ w_i □ x_j,

since this is the unique expansion of a vector in any orthonormal basis. However, A = ΣΣ [w_i, Ax_j] w_i □ x_j by (ii) of Proposition 1.24. Thus ⟨A, w_i □ x_j⟩ = [w_i, Ax_j] for i = 1, ..., n and j = 1, ..., m. Since both sides of this relation are linear in w_i and x_j, we have ⟨A, w □ x⟩ = [w, Ax] for all w ∈ W and x ∈ V. In particular, if A = w̃ □ x̃, then

⟨w̃ □ x̃, w □ x⟩ = [w, (w̃ □ x̃)x] = [w, (x̃, x)w̃] = [w, w̃](x̃, x).

This relation has some interesting implications.
Proposition 1.25. The inner product ⟨·,·⟩ on L(V, W) satisfies
(i) ⟨w̃ □ x̃, w □ x⟩ = [w̃, w](x̃, x)
for all w, w̃ ∈ W and x, x̃ ∈ V, and ⟨·,·⟩ is the unique inner product with this property. Further, if {z_1, ..., z_n} and {y_1, ..., y_m} are any orthonormal bases for W and V, respectively, then {z_i □ y_j | i = 1, ..., n, j = 1, ..., m} is an orthonormal basis for (L(V, W), ⟨·,·⟩).
Proof. Equation (i) has been verified. If {·,·} is another inner product on L(V, W) that satisfies (i), then

{w_{i_1} □ x_{j_1}, w_{i_2} □ x_{j_2}} = ⟨w_{i_1} □ x_{j_1}, w_{i_2} □ x_{j_2}⟩

for all i_1, i_2 = 1, ..., n and j_1, j_2 = 1, ..., m, where {x_1, ..., x_m} and {w_1, ..., w_n} are the orthonormal bases used to define ⟨·,·⟩. Using (i) of Proposition 1.24 and the bilinearity of inner products, this implies that {A, B} = ⟨A, B⟩ for all A, B ∈ L(V, W). Therefore, the two inner products are the same. The verification that {z_i □ y_j | i = 1, ..., n, j = 1, ..., m} is an orthonormal basis follows easily from (i). □
The result of Proposition 1.25 is a formal statement of the fact that ⟨·,·⟩ does not depend on the particular orthonormal bases used to define it; rather, ⟨·,·⟩ is determined by the inner products on V and W. Whenever V and W are inner product spaces, the symbol ⟨·,·⟩ always means the inner product on L(V, W) as defined above.
* Example 1.10. Consider V = R^m and W = R^n with the usual inner products and the standard bases. Thus we have the inner product ⟨·,·⟩ on L_{m,n}, the linear space of n × m real matrices. For A = (a_ij) and B = (b_ij) in L_{m,n},

⟨A, B⟩ = Σ_{i=1}^n Σ_{j=1}^m a_ij b_ij.

If C = AB' : n × n, then

c_ii = Σ_j a_ij b_ij, i = 1, ..., n,

so ⟨A, B⟩ = Σ c_ii. In other words, ⟨A, B⟩ is just the sum of the diagonal elements of the n × n matrix AB'. This observation leads to the definition of the trace of any square matrix. If C : k × k is a real matrix, the trace of C, denoted by tr C, is the sum of the diagonal elements of C. The identity ⟨A, B⟩ = ⟨B, A⟩ shows that tr AB' = tr B'A for all A, B ∈ L_{m,n}. In the present example, it is clear that w □ x = wx' for x ∈ R^m and w ∈ R^n, so w □ x is just the n × 1 matrix w times the 1 × m matrix x'. Also, the identity in Proposition 1.25 is a reflection of the fact that

tr w̃x̃'xw' = (w̃'w)(x̃'x)

for w, w̃ ∈ R^n and x, x̃ ∈ R^m.
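A numpy check of the trace formulas in this example (with n = 3, m = 2 and arbitrary random matrices as illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
A, B = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))   # n = 3, m = 2

inner = np.sum(A * B)                            # <A, B> = sum_ij a_ij b_ij
assert np.isclose(inner, np.trace(A @ B.T))      # <A, B> = tr AB'
assert np.isclose(np.trace(A @ B.T), np.trace(B.T @ A))   # tr AB' = tr B'A

# <w~ [] x~, w [] x> = (w~'w)(x~'x) in matrix form
w, wt = rng.normal(size=3), rng.normal(size=3)
x, xt = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(np.trace(np.outer(wt, xt) @ np.outer(w, x).T),
                  np.dot(wt, w) * np.dot(xt, x))
```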
If (V, (·,·)) and (W, [·,·]) are inner product spaces and A ∈ L(V, W), then [Ax, w] is linear in x for fixed w and linear in w for fixed x. This observation leads to the following definition.
Definition 1.21. A function f defined on V × W to R is called bilinear if:
(i) f(α_1x_1 + α_2x_2, w) = α_1f(x_1, w) + α_2f(x_2, w).
(ii) f(x, α_1w_1 + α_2w_2) = α_1f(x, w_1) + α_2f(x, w_2).
These conditions apply for scalars α_1 and α_2; x, x_1, x_2 ∈ V and w, w_1, w_2 ∈ W.
Our next result shows there is a natural one-to-one correspondence between bilinear functions on V × W and L(V, W).
Proposition 1.26. If f is a bilinear function on V × W to R, then there exists an A ∈ L(V, W) such that f(x, w) = [Ax, w] for all x ∈ V and w ∈ W. Conversely, each A ∈ L(V, W) determines the bilinear function [Ax, w] on V × W.
Proof. Let {x_1, ..., x_m} be an orthonormal basis for (V, (·,·)) and {w_1, ..., w_n} be an orthonormal basis for (W, [·,·]). Set a_ij = f(x_j, w_i) for i = 1, ..., n and j = 1, ..., m, and let A = ΣΣ a_ij w_i □ x_j. By Proposition 1.24, we have

a_ij = [Ax_j, w_i] = f(x_j, w_i).

The bilinearity of f and of [Ax, w] implies [Ax, w] = f(x, w) for all x ∈ V and w ∈ W. The converse is obvious. □
Thus far, we have seen that L(V, W) is a real vector space and that, if V and W have inner products (·,·) and [·,·], respectively, then L(V, W) has a natural inner product determined by (·,·) and [·,·]. Since L(V, W) is a vector space, there are linear transformations on L(V, W) to other vector spaces, and there is not much more to say in general. However, L(V, W) is built from outer products and it is natural to ask if there are special linear transformations on L(V, W) that transform outer products into outer products. For example, if A ∈ L(V, V) and B ∈ L(W, W), suppose we define B ⊗ A on L(V, W) by (B ⊗ A)C = BCA' where A' denotes the transpose of A ∈ L(V, V). Clearly, B ⊗ A is a linear transformation. If C = w □ x, then (B ⊗ A)(w □ x) = B(w □ x)A' ∈ L(V, W). But for v ∈ V,

(B(w □ x)A')v = B(w □ x)(A'v) = B((x, A'v)w)
= (Ax, v)Bw
= ((Bw) □ (Ax))v.

This calculation shows that (B ⊗ A)(w □ x) = (Bw) □ (Ax), so outer products get mapped into outer products by B ⊗ A. Generalizing this a bit, we have the following definition.
Definition 1.22. Let (V_1, (·,·)_1), (V_2, (·,·)_2), (W_1, [·,·]_1), and (W_2, [·,·]_2) be inner product spaces. For A ∈ L(V_1, V_2) and B ∈ L(W_1, W_2), the Kronecker product of B and A, denoted by B ⊗ A, is the linear transformation on L(V_1, W_1) to L(V_2, W_2) defined by

(B ⊗ A)C = BCA'

for all C ∈ L(V_1, W_1).
In most applications of Kronecker products, V_1 = V_2 and W_1 = W_2, so B ⊗ A is a linear transformation on L(V_1, W_1) to L(V_1, W_1). It is not easy to say in a few words why the transpose of A should appear in the definition of the Kronecker product, but the result below should convince the reader that the definition is the "right" one. Of course, by A', we mean the linear transformation on V_2 to V_1 that satisfies (x_2, Ax_1)_2 = (A'x_2, x_1)_1 for x_1 ∈ V_1 and x_2 ∈ V_2.
Proposition 1.27. In the notation of Definition 1.22,
(i) (B ⊗ A)(w_1 □ v_1) = (Bw_1) □ (Av_1) ∈ L(V_2, W_2).
Also,
(ii) (B ⊗ A)' = B' ⊗ A',
where (B ⊗ A)' denotes the transpose of the linear transformation B ⊗ A on (L(V_1, W_1), ⟨·,·⟩_1) to (L(V_2, W_2), ⟨·,·⟩_2).
Proof. To verify (i), for v_2 ∈ V_2, compute as follows:

[(B ⊗ A)(w_1 □ v_1)](v_2) = B(w_1 □ v_1)A'v_2 = B(v_1, A'v_2)_1 w_1
= (Av_1, v_2)_2 Bw_1 = [(Bw_1) □ (Av_1)](v_2).

Since this holds for all v_2 ∈ V_2, assertion (i) holds. The proof of (ii) requires that we show that B' ⊗ A' satisfies the defining equation of the adjoint; that is, for C_1 ∈ L(V_1, W_1) and C_2 ∈ L(V_2, W_2),

⟨C_2, (B ⊗ A)C_1⟩_2 = ⟨(B' ⊗ A')C_2, C_1⟩_1.

Since outer products generate L(V_1, W_1), it is enough to show the above holds for C_1 = w_1 □ x_1 with w_1 ∈ W_1 and x_1 ∈ V_1. But, by (i) and the definition of the transpose,

⟨C_2, (B ⊗ A)(w_1 □ x_1)⟩_2 = ⟨C_2, Bw_1 □ Ax_1⟩_2 = [C_2Ax_1, Bw_1]_2
= [B'C_2Ax_1, w_1]_1
= ⟨B'C_2A, w_1 □ x_1⟩_1
= ⟨(B' ⊗ A')C_2, w_1 □ x_1⟩_1,

and this completes the proof of (ii). □
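In matrix form, (B ⊗ A)C = BCA' is easy to check in numpy (the dimensions and random matrices below are arbitrary illustrative choices). As an aside not in the text: if matrices in L(V_1, W_1) are flattened in row-major order, the matrix of B ⊗ A is `np.kron(B, A)`; this convention-dependent identity is my addition:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(3, 3))    # acts on V = R^3
B = rng.normal(size=(4, 4))    # acts on W = R^4
C = rng.normal(size=(4, 3))    # an element of L(V, W): a 4 x 3 matrix

# Proposition 1.27(i): outer products map to outer products.
w, v = rng.normal(size=4), rng.normal(size=3)
assert np.allclose(B @ np.outer(w, v) @ A.T, np.outer(B @ w, A @ v))

# With row-major vec (numpy's default flatten), B (x) A is np.kron(B, A).
assert np.allclose((B @ C @ A.T).flatten(), np.kron(B, A) @ C.flatten())
```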
We now turn to the case when A ∈ L(V, V) and B ∈ L(W, W), so B ⊗ A is a linear transformation on L(V, W) to L(V, W). First note that if A is self-adjoint relative to the inner product on V and B is self-adjoint relative to the inner product on W, then Proposition 1.27 shows that B ⊗ A is self-adjoint relative to the natural induced inner product on L(V, W).
Proposition 1.28. For A_i ∈ L(V, V), i = 1, 2, and B_i ∈ L(W, W), i = 1, 2, we have:
(i) (B_1 ⊗ A_1)(B_2 ⊗ A_2) = (B_1B_2) ⊗ (A_1A_2).
(ii) If A_1⁻¹ and B_1⁻¹ exist, then (B_1 ⊗ A_1)⁻¹ = B_1⁻¹ ⊗ A_1⁻¹.
(iii) If A_1 and B_1 are orthogonal projections, then B_1 ⊗ A_1 is an orthogonal projection.
Proof. The proof of (i) goes as follows. For C ∈ L(V, W),

(B_1 ⊗ A_1)(B_2 ⊗ A_2)C = (B_1 ⊗ A_1)(B_2CA_2') = B_1B_2CA_2'A_1'
= B_1B_2C(A_1A_2)' = ((B_1B_2) ⊗ (A_1A_2))C.

Now, (ii) follows immediately from (i). For (iii), it needs to be shown that (B_1 ⊗ A_1)² = B_1 ⊗ A_1 = (B_1 ⊗ A_1)'. The second equality has been verified. The first follows from (i) and the fact that B_1² = B_1 and A_1² = A_1. □
Other properties of Kronecker products are given as the need arises. One issue to think about is this: if C ∈ L(V, W) and B ∈ L(W, W), then BC can be thought of as the product of the two linear transformations B and C. However, BC can also be interpreted as (B ⊗ I)C, I ∈ L(V, V); that is, BC is the value of the linear transformation B ⊗ I at C. Of course, the particular situation determines the appropriate way to think about BC.
Linear isometries are the final subject of discussion in this section, and are a natural generalization of orthogonal transformations on (V, (·,·)). Consider finite dimensional inner product spaces V and W with inner products (·,·) and [·,·], and assume that dim V ≤ dim W. The reason for this assumption is made clear in a moment.
Definition 1.23. A linear transformation A e C&(V, W) is a linear isometry
if (vI, v2) = [Av,, Av2l for all vI, v2 E V.
If A is a linear isometry and v E V, v * 0, then 0 < (v, v) = [Av, Av].
This implies that %)t(A) = {0), so necessarily dim V < dim W. When W = V
and [·, ·] = (·, ·), linear isometries are simply orthogonal transformations. As with orthogonal transformations, a number of equivalent descriptions of linear isometries are available.

Proposition 1.29. For A ∈ ℒ(V, W) (dim V ≤ dim W), the following are equivalent:

(i) A is a linear isometry.
(ii) A'A = I ∈ ℒ(V, V).
(iii) [Av, Av] = (v, v) for all v ∈ V.
Proof. The proof is similar to the proof of Proposition 1.19 and is left to the reader. □
The next proposition is an analog of Proposition 1.20 that covers linear isometries and that has a number of applications.
Proposition 1.30. Let v₁, ..., vₖ be vectors in (V, (·, ·)), let w₁, ..., wₖ be vectors in (W, [·, ·]), and assume dim V ≤ dim W. There exists a linear isometry A ∈ ℒ(V, W) such that Avᵢ = wᵢ, i = 1, ..., k, iff (vᵢ, vⱼ) = [wᵢ, wⱼ] for i, j = 1, ..., k.
Proof. The proof is a minor modification of that given for Proposition 1.20 and the details are left to the reader. □
Proposition 1.31. Suppose A ∈ ℒ(V, W₁) and B ∈ ℒ(V, W₂) where dim W₂ ≤ dim W₁, and (·, ·), [·, ·]₁, and [·, ·]₂ are inner products on V, W₁, and W₂. Then A'A = B'B iff there exists a linear isometry Ψ ∈ ℒ(W₂, W₁) such that A = ΨB.

Proof. If A = ΨB, then A'A = B'Ψ'ΨB = B'B, since Ψ'Ψ = I ∈ ℒ(W₂, W₂). Conversely, suppose A'A = B'B and let {v₁, ..., vₘ} be a basis for V. With xᵢ = Avᵢ ∈ W₁ and yᵢ = Bvᵢ ∈ W₂, i = 1, ..., m, we have

[xᵢ, xⱼ]₁ = [Avᵢ, Avⱼ]₁ = (vᵢ, A'Avⱼ) = (vᵢ, B'Bvⱼ) = [Bvᵢ, Bvⱼ]₂ = [yᵢ, yⱼ]₂

for i, j = 1, ..., m. Applying Proposition 1.30, there exists a linear isometry Ψ ∈ ℒ(W₂, W₁) such that Ψyᵢ = xᵢ for i = 1, ..., m. Therefore ΨBvᵢ = Avᵢ for i = 1, ..., m and, since {v₁, ..., vₘ} is a basis for V, ΨB = A. □
• Example 1.11. Take V = Rᵐ and W = Rⁿ with the usual inner products and assume m ≤ n. Then a matrix A = {aᵢⱼ}: n × m is a linear isometry iff A'A = Iₘ where Iₘ is the m × m identity matrix. If a₁, ..., aₘ denote the columns of the matrix A, then A'A is just the m × m matrix with elements aᵢ'aⱼ, i, j = 1, ..., m. Thus the condition A'A = Iₘ means that aᵢ'aⱼ = δᵢⱼ, so A is a linear isometry on Rᵐ to Rⁿ iff the columns of A are an orthonormal set of vectors in Rⁿ. Now, let 𝓕_{m,n} be the set of all n × m real matrices that are linear isometries; that is, A ∈ 𝓕_{m,n} iff A'A = Iₘ. The set 𝓕_{m,n} is sometimes called the space of m-frames in Rⁿ as the columns of A form an m-dimensional orthonormal "frame" in Rⁿ. When m = 1, 𝓕_{1,n} is just the set of vectors in Rⁿ of length one, and when m = n, 𝓕_{n,n} is the set of all n × n orthogonal matrices. We have much more to say about 𝓕_{m,n} in later chapters.

An immediate application of Proposition 1.31 shows that, if A: n₁ × m and B: n₂ × m are real matrices with n₂ ≤ n₁, then A'A = B'B iff A = ΨB where Ψ: n₁ × n₂ satisfies Ψ'Ψ = I_{n₂}. In particular, when n₁ = n₂, A'A = B'B iff there exists an orthogonal matrix Ψ: n₁ × n₁ such that A = ΨB.
1.6. DETERMINANTS AND EIGENVALUES
At this point in our discussion we are forced, by mathematical necessity, to introduce complex numbers and complex matrices. Eigenvalues are defined as the roots of a certain polynomial and, to ensure the existence of roots, complex numbers arise. This section begins with complex matrices, determinants, and their basic properties. After defining eigenvalues, the properties of the eigenvalues of linear transformations on real vector spaces are described.

In what follows, ℂ denotes the field of complex numbers and the symbol i is reserved for √−1. If α ∈ ℂ, say α = a + ib, then ᾱ = a − ib is the complex conjugate of α. Let ℂⁿ be the set of all n-tuples (henceforth called vectors) of complex numbers; that is, x ∈ ℂⁿ iff x is a column vector with coordinates x₁, ..., xₙ, where xⱼ ∈ ℂ for j = 1, ..., n. The number xⱼ is called the jth coordinate of x, j = 1, ..., n. For x, y ∈ ℂⁿ, x + y is defined to be the vector with coordinates xⱼ + yⱼ, j = 1, ..., n, and for α ∈ ℂ, αx is the vector with coordinates αxⱼ, j = 1, ..., n. Replacing R by ℂ in Definition 1.1, we see that ℂⁿ satisfies all the axioms of a vector space where scalars are now taken to be complex numbers, rather than real numbers. More generally, if we replace R by ℂ in (II) of Definition 1.1, we have the definition of a complex vector space. All of the definitions, results, and proofs in Sections 1.1 and 1.2 are valid, without change, for complex vector spaces. In particular, ℂⁿ is an n-dimensional complex vector space and the standard basis for ℂⁿ is {ε₁, ..., εₙ} where εⱼ has its jth coordinate equal to one and the remaining coordinates are zero.
As with real matrices, an m × n array A = {aⱼₖ}, j = 1, ..., m, k = 1, ..., n, where aⱼₖ ∈ ℂ, is called an m × n complex matrix. If A = {aⱼₖ}: m × n and B = {bₖₗ}: n × p are complex matrices, then C = AB is the m × p complex matrix with entries cⱼₗ = Σₖ aⱼₖbₖₗ for j = 1, ..., m and l = 1, ..., p. The matrix C is called the product of A and B (in that order). In particular, when p = 1, the matrix B is n × 1 so B is an element of ℂⁿ. Thus if x ∈ ℂⁿ (x now plays the role of B) and A: m × n is a complex matrix, Ax ∈ ℂᵐ. Clearly, each A: m × n determines a linear transformation on ℂⁿ to ℂᵐ via the definition of Ax for x ∈ ℂⁿ. For an m × n complex matrix A = {aⱼₖ}, the conjugate transpose of A, denoted by A*, is the n × m matrix whose (k, j) element is āⱼₖ, the complex conjugate of aⱼₖ, for k = 1, ..., n and j = 1, ..., m. In particular, if x ∈ ℂⁿ, x* denotes the conjugate transpose of x. The following relation is easily verified:

conj(y*Ax) = x*A*y

where y ∈ ℂᵐ, x ∈ ℂⁿ, A is an m × n complex matrix, and conj(·) denotes the complex conjugate of the scalar y*Ax.

With the preliminaries out of the way, we now want to define determinant functions. Let 𝒞ₙ denote the set of all n × n complex matrices, so 𝒞ₙ is an n²-dimensional complex vector space. If A ∈ 𝒞ₙ, write A = (a₁, a₂, ..., aₙ) where aⱼ is the jth column of A.
Definition 1.24. A function D defined on 𝒞ₙ and taking values in ℂ is called a determinant function if:

(i) D(A) = D(a₁, ..., aₙ) is linear in each column vector aⱼ when the other columns are held fixed. That is,

D(a₁, ..., αaⱼ + βbⱼ, ..., aₙ) = αD(a₁, ..., aⱼ, ..., aₙ) + βD(a₁, ..., bⱼ, ..., aₙ)

for α, β ∈ ℂ.

(ii) For any two indices j and k, j < k,

D(a₁, ..., aⱼ, ..., aₖ, ..., aₙ) = −D(a₁, ..., aₖ, ..., aⱼ, ..., aₙ).
Functions D on 𝒞ₙ to ℂ that satisfy (i) are called n-linear since they are linear in each of the n vectors a₁, ..., aₙ when the remaining ones are held fixed. If D is n-linear and satisfies (ii), D is sometimes called an alternating n-linear function, since D(A) changes sign if two columns of A are interchanged. The basic result that relates all determinant functions is the following.
Proposition 1.32. The set of determinant functions is a one-dimensional complex vector space. If D is a determinant function and D ≠ 0, then D(I) ≠ 0 where I is the n × n identity matrix in 𝒞ₙ.
Proof. We briefly outline the proof of this proposition since the proof is instructive and yields the classical formula defining the determinant of an n × n matrix. Suppose D(A) = D(a₁, ..., aₙ) is a determinant function. For each k = 1, ..., n, aₖ = Σⱼ aⱼₖεⱼ where {ε₁, ..., εₙ} is the standard basis for ℂⁿ and A = {aⱼₖ}: n × n. Since D is n-linear and a₁ = Σⱼ aⱼ₁εⱼ,

D(a₁, ..., aₙ) = Σ_{j₁} a_{j₁1} D(ε_{j₁}, a₂, ..., aₙ).

Applying this same argument to a₂ = Σⱼ aⱼ₂εⱼ,

D(a₁, ..., aₙ) = Σ_{j₁} Σ_{j₂} a_{j₁1} a_{j₂2} D(ε_{j₁}, ε_{j₂}, a₃, ..., aₙ).

Continuing in the obvious way,

D(a₁, ..., aₙ) = Σ_{j₁,...,jₙ} a_{j₁1} a_{j₂2} ⋯ a_{jₙn} D(ε_{j₁}, ..., ε_{jₙ})

where the summation extends over all j₁, ..., jₙ with 1 ≤ jᵢ ≤ n for i = 1, ..., n. The above formula shows that a determinant function is determined by the nⁿ numbers D(ε_{j₁}, ..., ε_{jₙ}) for 1 ≤ jᵢ ≤ n, and this fact followed solely from the assumption that D is n-linear. But since D is alternating, it is clear that, if two columns of A are the same, then D(A) = 0. In particular, if two indices jᵢ and jₖ are the same, then D(ε_{j₁}, ..., ε_{jₙ}) = 0. Thus the summation above extends only over those indices where j₁, ..., jₙ are all distinct. In other words, the summation extends over all permutations of the set {1, 2, ..., n}. If π denotes a permutation of 1, 2, ..., n, then

D(a₁, ..., aₙ) = Σ_π a_{π(1)1} ⋯ a_{π(n)n} D(ε_{π(1)}, ..., ε_{π(n)})

where the summation now extends over all n! permutations. But for a fixed permutation π(1), ..., π(n) of 1, ..., n, there is a sequence of pairwise interchanges of the elements of π(1), ..., π(n) that results in the order 1, 2, ..., n. In fact there are many such sequences of interchanges, but the number of interchanges is always odd or always even (see Hoffman and Kunze, 1971, Section 5.3). Using this, let sgn(π) = 1 if the number of interchanges required to put π(1), ..., π(n) into the order 1, 2, ..., n is even and let sgn(π) = −1 otherwise. Now, since D is alternating, it is clear that

D(ε_{π(1)}, ..., ε_{π(n)}) = sgn(π) D(ε₁, ..., εₙ).

Therefore, we have arrived at the formula

D(a₁, ..., aₙ) = D(I) Σ_π sgn(π) a_{π(1)1} ⋯ a_{π(n)n}

since D(I) = D(ε₁, ..., εₙ). It is routine to verify that, for any complex number α, the function defined by

D_α(a₁, ..., aₙ) = α Σ_π sgn(π) a_{π(1)1} ⋯ a_{π(n)n}

is a determinant function, and the argument given above shows that every determinant function is a D_α for some α ∈ ℂ. This completes the proof; for more details, the reader is referred to Hoffman and Kunze (1971, Chapter 5). □
Definition 1.25. If A ∈ 𝒞ₙ, the determinant of A, denoted by det(A) (or det A), is defined to be D₁(A) where D₁ is the unique determinant function with D₁(I) = 1.

The proof of Proposition 1.32 gives the formula for det(A), but that is not of much concern to us. The properties of det(·) given below are most easily established using the fact that det(·) is an alternating n-linear function of the columns of A.
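For small n, the permutation formula from the proof of Proposition 1.32 can be evaluated directly. A Python sketch comparing it with numpy's determinant; counting inversions is one standard way to compute sgn(π):

```python
import numpy as np
from itertools import permutations

def sgn(perm):
    """Sign of a permutation, via the parity of its inversion count."""
    inv = sum(1 for i in range(len(perm)) for j in range(i + 1, len(perm))
              if perm[i] > perm[j])
    return -1 if inv % 2 else 1

def det_by_permutations(A):
    """det(A) = sum over permutations pi of sgn(pi) * a_{pi(1)1} ... a_{pi(n)n}."""
    n = A.shape[0]
    return sum(sgn(p) * np.prod([A[p[k], k] for k in range(n)])
               for p in permutations(range(n)))

A = np.array([[1.0, 2.0, 0.0],
              [3.0, -1.0, 4.0],
              [0.0, 5.0, 2.0]])
print(np.isclose(det_by_permutations(A), np.linalg.det(A)))  # True
```

Since the sum has n! terms, this is only a pedagogical check; numerical software computes determinants by factorization instead.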
Proposition 1.33. For A, B ∈ 𝒞ₙ:

(i) det(AB) = det A det B.
(ii) det A* = conj(det A).
(iii) det A ≠ 0 iff the columns of A are linearly independent vectors in the complex vector space ℂⁿ.

If A₁₁: n₁ × n₁, A₁₂: n₁ × n₂, A₂₁: n₂ × n₁, and A₂₂: n₂ × n₂ are complex matrices, then:

(iv) det([A₁₁ 0; A₂₁ A₂₂]) = det([A₁₁ A₁₂; 0 A₂₂]) = det A₁₁ det A₂₂, where the blocks of a partitioned matrix are listed row by row.

(v) If A is a real matrix, then det(A) is real and det(A) = 0 iff the columns of A are linearly dependent vectors over the real vector space Rⁿ.
Proof. The proofs of these assertions can be found in Hoffman and Kunze (1971, Chapter 5). □
These properties of det(·) have a number of useful and interesting implications. If A has columns a₁, ..., aₙ, then the range of the linear transformation determined by A is just span{a₁, ..., aₙ}. Thus A is invertible iff span{a₁, ..., aₙ} = ℂⁿ iff det A ≠ 0. If det A ≠ 0, then 1 = det AA⁻¹ = det A det A⁻¹, so det A⁻¹ = 1/det A. Consider complex matrices B₁₁: n₁ × n₁, B₁₂: n₁ × n₂, B₂₁: n₂ × n₁, and B₂₂: n₂ × n₂. Then it is easy to verify the identity

[A₁₁ A₁₂; A₂₁ A₂₂][B₁₁ B₁₂; B₂₁ B₂₂] = [A₁₁B₁₁ + A₁₂B₂₁  A₁₁B₁₂ + A₁₂B₂₂; A₂₁B₁₁ + A₂₂B₂₁  A₂₁B₁₂ + A₂₂B₂₂]

where A₁₁, A₁₂, A₂₁, and A₂₂ are as in Proposition 1.33 and blocks are listed row by row. This tells us how to multiply the two (n₁ + n₂) × (n₁ + n₂) complex matrices in terms of their blocks. Of course, such matrices are called partitioned matrices.
Proposition 1.34. Let A be a complex matrix, partitioned as above. If det A₁₁ ≠ 0, then:

(i) det([A₁₁ A₁₂; A₂₁ A₂₂]) = det A₁₁ det(A₂₂ − A₂₁A₁₁⁻¹A₁₂).

If det A₂₂ ≠ 0, then:

(ii) det([A₁₁ A₁₂; A₂₁ A₂₂]) = det A₂₂ det(A₁₁ − A₁₂A₂₂⁻¹A₂₁).

Proof. For (i), first note that

det([I_{n₁}  −A₁₁⁻¹A₁₂; 0  I_{n₂}]) = 1

by Proposition 1.33 (iv). Therefore, by (i) of Proposition 1.33,

det([A₁₁ A₁₂; A₂₁ A₂₂]) = det([A₁₁ A₁₂; A₂₁ A₂₂]) det([I_{n₁}  −A₁₁⁻¹A₁₂; 0  I_{n₂}])
= det([A₁₁  0; A₂₁  A₂₂ − A₂₁A₁₁⁻¹A₁₂])
= det A₁₁ det(A₂₂ − A₂₁A₁₁⁻¹A₁₂).

The proof of (ii) is similar. □
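Proposition 1.34 (i) is easy to check numerically. A numpy sketch; the random A₁₁ is almost surely nonsingular:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 3, 2
A11 = rng.standard_normal((n1, n1))
A12 = rng.standard_normal((n1, n2))
A21 = rng.standard_normal((n2, n1))
A22 = rng.standard_normal((n2, n2))
A = np.block([[A11, A12], [A21, A22]])

# Proposition 1.34 (i): det(A) = det(A11) * det(A22 - A21 A11^{-1} A12)
schur = A22 - A21 @ np.linalg.inv(A11) @ A12
print(np.isclose(np.linalg.det(A),
                 np.linalg.det(A11) * np.linalg.det(schur)))  # True
```

The second factor, A₂₂ − A₂₁A₁₁⁻¹A₁₂, is what later literature calls the Schur complement of A₁₁.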
Proposition 1.35. Let A: n × m and B: m × n be complex matrices. Then

det(Iₙ + AB) = det(Iₘ + BA).

Proof. Apply the previous proposition to the partitioned matrix

[Iₙ  −A; B  Iₘ]. □
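A quick numerical check of Proposition 1.35; note that the two identity matrices have different sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 2
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, n))

# det(I_n + AB) = det(I_m + BA), even though AB is n x n and BA is m x m.
print(np.isclose(np.linalg.det(np.eye(n) + A @ B),
                 np.linalg.det(np.eye(m) + B @ A)))  # True
```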
We now turn to a discussion of the eigenvalues of an n × n complex matrix. The definition of an eigenvalue is motivated by the following considerations. Let A ∈ 𝒞ₙ. To analyze the linear transformation determined by A, we would like to find a basis x₁, ..., xₙ of ℂⁿ such that Axⱼ = λⱼxⱼ, j = 1, ..., n, where λⱼ ∈ ℂ. If this were possible, then the matrix of the linear transformation in the basis {x₁, ..., xₙ} would simply be the diagonal matrix with diagonal elements λ₁, λ₂, ..., λₙ and all other elements zero. Of course, this says that the linear transformation is λᵢ times the identity transformation when restricted to span{xᵢ}. Unfortunately, it is not possible to find such a basis for each linear transformation. However, the numbers λ₁, ..., λₙ, which are called eigenvalues after we have an appropriate definition, can be interpreted in another way. Given λ ∈ ℂ, Ax = λx for some nonzero vector x iff (A − λI)x = 0, and this is equivalent to saying that A − λI is a singular matrix, that is, det(A − λI) = 0. In other words, A − λI is singular iff there exists x ≠ 0 such that Ax = λx. However, using the formula for det(·), a bit of calculation shows that

det(A − λI) = (−1)ⁿλⁿ + a_{n−1}λ^{n−1} + ⋯ + a₁λ + a₀

where a₀, a₁, ..., a_{n−1} are complex numbers. Thus det(A − λI) is a polynomial of degree n in the complex variable λ, and it has n roots (counting multiplicities). This leads to the following definition.
Definition 1.26. Let A ∈ 𝒞ₙ and set

p(λ) = det(A − λI).

The nth degree polynomial p is called the characteristic polynomial of A, and the n roots of the polynomial (counting multiplicities) are called the eigenvalues of A.

If p(λ) = det(A − λI) has roots λ₁, ..., λₙ, then it is clear that

p(λ) = ∏_{j=1}^{n} (λⱼ − λ)

since the right-hand side of the above equation is an nth degree polynomial with roots λ₁, ..., λₙ and the coefficient of λⁿ is (−1)ⁿ. In particular,

p(0) = ∏_{j=1}^{n} λⱼ = det(A),

so the determinant of A is the product of its eigenvalues.

There is a particular case when the characteristic polynomial of A can be computed explicitly. If A ∈ 𝒞ₙ, A = {aⱼₖ} is called lower triangular if aⱼₖ = 0 when k > j. Thus A is lower triangular if all the elements above the diagonal of A are zero. An application of Proposition 1.33 (iv) shows that, when A is lower triangular,

det(A) = ∏_{j=1}^{n} aⱼⱼ.

But when A is lower triangular with diagonal elements aⱼⱼ, j = 1, ..., n, then A − λI is lower triangular with diagonal elements (aⱼⱼ − λ), j = 1, ..., n. Thus

p(λ) = det(A − λI) = ∏_{j=1}^{n} (aⱼⱼ − λ),

so A has eigenvalues a₁₁, ..., aₙₙ.

Before returning to real vector spaces, we first establish the existence of eigenvectors (to be defined below).
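Both facts just derived (the eigenvalues of a triangular matrix sit on its diagonal, and det(A) is the product of the eigenvalues) can be illustrated with numpy:

```python
import numpy as np

# A lower triangular matrix: its eigenvalues are its diagonal elements,
# and det(A) is the product of those eigenvalues.
A = np.array([[2.0, 0.0, 0.0],
              [1.0, -3.0, 0.0],
              [4.0, 5.0, 0.5]])

eig = np.linalg.eigvals(A)
print(np.allclose(np.sort(eig), np.sort(np.diag(A))))  # True
print(np.isclose(np.linalg.det(A), np.prod(eig)))      # True
```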
Proposition 1.36. If λ is an eigenvalue of A ∈ 𝒞ₙ, then there exists a nonzero vector x ∈ ℂⁿ such that Ax = λx.

Proof. Since λ is an eigenvalue of A, the matrix A − λI is singular, so the dimension of the range of A − λI is less than n. Thus the dimension of the null space of A − λI is greater than 0. Hence there is a nonzero vector in the null space of A − λI, say x, and (A − λI)x = 0. □

Definition 1.27. If A ∈ 𝒞ₙ, a nonzero vector x ∈ ℂⁿ is called an eigenvector of A if there exists a complex number λ ∈ ℂ such that Ax = λx.

If x ≠ 0 is an eigenvector of A and Ax = λx, then (A − λI)x = 0 so A − λI is singular. Therefore, λ must be an eigenvalue of A. Conversely, if λ ∈ ℂ is an eigenvalue, Proposition 1.36 shows there is an eigenvector x such that Ax = λx.
Now, suppose V is an n-dimensional real vector space and B is a linear transformation on V to V. We want to define the characteristic polynomial, and hence the eigenvalues, of B. Let {v₁, ..., vₙ} be a basis for V so the matrix of B is [B] = {bⱼₖ} where the bⱼₖ's satisfy Bvₖ = Σⱼ bⱼₖvⱼ. The characteristic polynomial of [B] is

p(λ) = det([B] − λI)

where I is the n × n identity matrix and λ ∈ ℂ. If we could show that p(λ) does not depend on the particular basis for V, then we would have a reasonable definition of the characteristic polynomial of B.

Proposition 1.37. Suppose {v₁, ..., vₙ} and {y₁, ..., yₙ} are bases for the real vector space V, and let B ∈ ℒ(V, V). Let [B] = {bⱼₖ} be the matrix of B in the basis {v₁, ..., vₙ} and let [B]₁ = {aⱼₖ} be the matrix of B in the basis {y₁, ..., yₙ}. Then there exists a nonsingular real matrix C = {cⱼₖ} such that

[B]₁ = C⁻¹[B]C.
Proof. The numbers aⱼₖ are uniquely determined by the relations

Byₖ = Σⱼ aⱼₖyⱼ,  k = 1, ..., n.

Define the linear transformation C₁ on V to V by C₁vⱼ = yⱼ, j = 1, ..., n. Then C₁ is nonsingular since C₁ maps a basis onto a basis. Therefore,

BC₁vₖ = Σⱼ aⱼₖC₁vⱼ = C₁(Σⱼ aⱼₖvⱼ)

and this yields

C₁⁻¹BC₁vₖ = Σⱼ aⱼₖvⱼ.

Thus the matrix of C₁⁻¹BC₁ in the basis {v₁, ..., vₙ} is {aⱼₖ}. From Proposition 1.5, we have

[B]₁ = {aⱼₖ} = [C₁⁻¹BC₁] = [C₁⁻¹][B][C₁] = [C₁]⁻¹[B][C₁]

where [C₁] is the matrix of C₁ in the basis {v₁, ..., vₙ}. Setting C = [C₁], the conclusion follows. □
The above proposition implies that

p(λ) = det([B] − λI) = det(C⁻¹([B] − λI)C) = det(C⁻¹[B]C − λI) = det([B]₁ − λI).

Thus p(λ) does not depend on the particular basis we use to represent B and, therefore, we call p the characteristic polynomial of the linear transformation B. The suggestive notation

p(λ) = det(B − λI)

is often used. Notice that Proposition 1.37 also shows that it makes sense to define det(B) for B ∈ ℒ(V, V) as the value of det[B] in any basis, since the value does not depend on the basis. Of course, the roots of the polynomial p(λ) = det(B − λI) are called the eigenvalues of the linear transformation B. Even though [B] is a real matrix in any basis for V, some or all of the eigenvalues of B may be complex numbers. Proposition 1.37 also allows us to define the trace of A ∈ ℒ(V, V). If {v₁, ..., vₙ} is a basis for V, let tr A = tr[A] where [A] is the matrix of A in the given basis. For any nonsingular matrix C,

tr[A] = tr CC⁻¹[A] = tr C⁻¹[A]C,

which shows that our definition of tr A does not depend on the particular basis chosen.
The next result summarizes the properties of eigenvalues for linear transformations on a real inner product space.

Proposition 1.38. Suppose (V, (·, ·)) is a finite dimensional real inner product space and let A ∈ ℒ(V, V).

(i) If λ ∈ ℂ is an eigenvalue of A, then λ̄ is an eigenvalue of A.
(ii) If A is symmetric, the eigenvalues of A are real.
(iii) If A is skew-symmetric, then the eigenvalues of A are pure imaginary.
(iv) If A is orthogonal and λ is an eigenvalue of A, then λλ̄ = 1.
Proof. If A ∈ ℒ(V, V), then the characteristic polynomial of A is

p(λ) = det([A] − λI),  λ ∈ ℂ,

where [A] is the matrix of A in a basis for V. An examination of the formula for det(·) shows that

p(λ) = (−1)ⁿλⁿ + a_{n−1}λ^{n−1} + ⋯ + a₁λ + a₀

where a₀, ..., a_{n−1} are real numbers since [A] is a real matrix. Thus p(λ̄) = conj(p(λ)), so whenever p(λ) = 0, also p(λ̄) = 0. This establishes assertion (i).

For (ii), let λ be an eigenvalue of A, and let {v₁, ..., vₙ} be an orthonormal basis for (V, (·, ·)). Thus the matrix of A, say [A], is a real symmetric matrix and [A] − λI is singular as a matrix acting on ℂⁿ. By Proposition 1.36, there exists a nonzero vector x ∈ ℂⁿ such that [A]x = λx. Thus x*[A]x = λx*x. But since [A] is real and symmetric,

λ̄x*x = conj(x*[A]x) = x*[A]*x = x*[A]x = λx*x.

Thus λ̄x*x = λx*x and, since x ≠ 0, λ̄ = λ, so λ is real.
To prove (iii), again let [A] be the matrix of A in the orthonormal basis {v₁, ..., vₙ} so [A]' = [A]* = −[A]. If λ is an eigenvalue of A, then there exists x ∈ ℂⁿ, x ≠ 0, such that [A]x = λx. Thus x*[A]x = λx*x and

λ̄x*x = conj(x*[A]x) = x*[A]*x = −x*[A]x = −λx*x.

Since x ≠ 0, λ̄ = −λ, which implies that λ = ib for some real number b; that is, λ is pure imaginary and this proves (iii).

If A is orthogonal, then [A] is an n × n orthogonal matrix in the orthonormal basis {v₁, ..., vₙ}. Again, if λ is an eigenvalue of A, then [A]x = λx for some x ∈ ℂⁿ, x ≠ 0. Thus λ̄x* = x*[A]* = x*[A]' since [A] is a real matrix. Therefore

λλ̄x*x = x*[A]'[A]x = x*x

as [A]'[A] = I. Hence λλ̄ = 1 and the proof of Proposition 1.38 is complete. □
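All three special cases in Proposition 1.38 are easy to observe numerically. A numpy sketch; the QR factorization of a random matrix is used only as a convenient source of an orthogonal matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n))

S = M + M.T              # symmetric: eigenvalues are real
K = M - M.T              # skew-symmetric: eigenvalues are pure imaginary
Q, _ = np.linalg.qr(M)   # orthogonal: eigenvalues satisfy |lambda| = 1

print(np.allclose(np.linalg.eigvals(S).imag, 0))     # True
print(np.allclose(np.linalg.eigvals(K).real, 0))     # True
print(np.allclose(np.abs(np.linalg.eigvals(Q)), 1))  # True
```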
It has just been shown that if (V, (·, ·)) is a finite dimensional inner product space and if A ∈ ℒ(V, V) is self-adjoint, then the eigenvalues of A are real. The spectral theorem, to be established in the next section, provides much more useful information about self-adjoint transformations. For example, one application of the spectral theorem shows that a self-adjoint transformation is positive definite iff all its eigenvalues are positive.

If A ∈ ℒ(V, W) and B ∈ ℒ(W, V), the next result compares the eigenvalues of AB ∈ ℒ(W, W) with those of BA ∈ ℒ(V, V).
Proposition 1.39. The nonzero eigenvalues of AB are the same as the nonzero eigenvalues of BA, including multiplicities. If W = V, then AB and BA have the same eigenvalues and multiplicities.

Proof. Let m = dim V and n = dim W. The characteristic polynomial of BA is

p₁(λ) = det(BA − λIₘ).

Now, for λ ≠ 0, compute as follows, using Proposition 1.35 for the middle step:

det(BA − λIₘ) = det((−λ)(Iₘ − (1/λ)BA))
= (−λ)ᵐ det(Iₘ − (1/λ)BA) = (−λ)ᵐ det(Iₙ − (1/λ)AB)
= (−λ)ᵐ det((−1/λ)(AB − λIₙ)) = (−λ)ᵐ(−λ)⁻ⁿ det(AB − λIₙ).

Therefore, the characteristic polynomial of AB, say p₂(λ) = det(AB − λIₙ), is related to p₁(λ) by

p₁(λ) = (−λ)^{m−n} p₂(λ),  λ ≠ 0.

Both of the assertions follow from this relationship.
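Proposition 1.39 can be illustrated with rectangular matrices in numpy; the tolerance 1e-8 below is an arbitrary cutoff for deciding which computed eigenvalues count as zero:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 3, 5                       # dim V = 3, dim W = 5
A = rng.standard_normal((n, m))   # A: V -> W
B = rng.standard_normal((m, n))   # B: W -> V

ab = np.linalg.eigvals(A @ B)     # 5 eigenvalues
ba = np.linalg.eigvals(B @ A)     # 3 eigenvalues

# The nonzero eigenvalues agree; AB carries n - m = 2 extra zero eigenvalues.
nz_ab = np.sort_complex(ab[np.abs(ab) > 1e-8])
nz_ba = np.sort_complex(ba[np.abs(ba) > 1e-8])
print(np.allclose(nz_ab, nz_ba))  # True
print(np.sum(np.abs(ab) < 1e-8))
```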
1.7. THE SPECTRAL THEOREM
The spectral theorem for self-adjoint linear transformations on a finite dimensional real inner product space provides a basic theoretical tool not only for understanding self-adjoint transformations but also for establishing a variety of useful facts about general linear transformations. The form of
the spectral theorem given below is slightly weaker than that given in
Halmos (1958, see Section 79), but it suffices for most of our purposes. Applications of this result include a necessary and sufficient condition that a self-adjoint transformation be positive definite and a demonstration that positive definite transformations possess square roots. The singular value decomposition theorem, which follows from the spectral theorem, provides a useful decomposition result for linear transformations on one inner product space to another. This section ends with a description of the relationship between the singular value decomposition theorem and angles between two subspaces of an inner product space.
Let (V, (·, ·)) be a finite dimensional real inner product space. The spectral theorem follows from the two results below. If A ∈ ℒ(V, V) and M is a subspace of V, M is called invariant under A if A(M) = {Ax | x ∈ M} ⊆ M.
Proposition 1.40. Suppose A ∈ ℒ(V, V) is self-adjoint and let M be a subspace of V. If A(M) ⊆ M, then A(M⊥) ⊆ M⊥.

Proof. Suppose v ∈ A(M⊥). It must be shown that (v, x) = 0 for all x ∈ M. Since v ∈ A(M⊥), v = Av₁ for some v₁ ∈ M⊥. Therefore,

(v, x) = (Av₁, x) = (v₁, Ax) = 0

since A is self-adjoint and x ∈ M implies Ax ∈ M by assumption. □
Proposition 1.41. Suppose A ∈ ℒ(V, V) is self-adjoint and λ is an eigenvalue of A. Then there exists a v ∈ V, v ≠ 0, such that Av = λv.
Proof. Since A is self-adjoint, the eigenvalues of A are real. Let {v₁, ..., vₙ} be a basis for V and let [A] be the matrix of A in this basis. By Proposition 1.36, there exists a nonzero vector z ∈ ℂⁿ such that [A]z = λz. Write z = z₁ + iz₂ where z₁ ∈ Rⁿ is the real part of z and z₂ ∈ Rⁿ is the imaginary part of z. Since [A] is real and λ is real, we have [A]z₁ = λz₁ and [A]z₂ = λz₂. But z₁ and z₂ cannot both be zero as z ≠ 0. For definiteness, say z₁ ≠ 0, and let v ∈ V be the vector whose vector of coordinates in the basis {v₁, ..., vₙ} is z₁. Then v ≠ 0 and [A][v] = λ[v]. Therefore Av = λv. □
Theorem 1.2 (Spectral Theorem). If A ∈ ℒ(V, V) is self-adjoint, then there exist an orthonormal basis {x₁, ..., xₙ} for V and real numbers λ₁, ..., λₙ such that

A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ.

Further, λ₁, ..., λₙ are the eigenvalues of A and Axᵢ = λᵢxᵢ, i = 1, ..., n.
Proof. The proof of the first assertion is by induction on dimension. For n = 1, the result is obvious. Assume the result is true for integers 1, 2, ..., n − 1 and consider A ∈ ℒ(V, V), which is self-adjoint on the inner product space (V, (·, ·)), n = dim V. Let λ be an eigenvalue of A. By Proposition 1.41, there exists v ∈ V, v ≠ 0, such that Av = λv. Set xₙ = v/‖v‖ and λₙ = λ. Then Axₙ = λₙxₙ. With M = span{xₙ}, it is clear that A(M) ⊆ M, so A(M⊥) ⊆ M⊥ by Proposition 1.40. However, if we let A₁ be the restriction of A to the (n − 1)-dimensional inner product space (M⊥, (·, ·)), then A₁ is clearly self-adjoint. By the induction hypothesis there is an orthonormal basis {x₁, ..., xₙ₋₁} for M⊥ and real numbers λ₁, ..., λₙ₋₁ such that

A₁ = Σᵢ₌₁ⁿ⁻¹ λᵢ xᵢ □ xᵢ.

It is clear that {x₁, ..., xₙ} is an orthonormal basis for V and we claim that

A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ.

To see this, consider v₀ ∈ V and write v₀ = v₁ + v₂ with v₁ ∈ M and v₂ ∈ M⊥. Then

Av₀ = Av₁ + Av₂ = λₙv₁ + A₁v₂ = λₙv₁ + Σᵢ₌₁ⁿ⁻¹ λᵢ(xᵢ □ xᵢ)v₂.
However,

Σᵢ₌₁ⁿ λᵢ(xᵢ □ xᵢ)(v₁ + v₂) = λₙ(v₁, xₙ)xₙ + Σᵢ₌₁ⁿ⁻¹ λᵢ(xᵢ □ xᵢ)v₂

since v₁ ∈ M and v₂ ∈ M⊥. But (v₁, xₙ)xₙ = v₁ since v₁ ∈ span{xₙ}. Therefore A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ, which establishes the first assertion.

For the second assertion, if A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ where {x₁, ..., xₙ} is an orthonormal basis for (V, (·, ·)), then

Axⱼ = Σᵢ λᵢ(xᵢ □ xᵢ)xⱼ = Σᵢ λᵢ(xᵢ, xⱼ)xᵢ = λⱼxⱼ.

Thus the matrix of A, say [A], in this basis has diagonal elements λ₁, ..., λₙ and all other elements of [A] are zero. Therefore the characteristic polynomial of A is

p(λ) = det([A] − λI) = ∏ᵢ₌₁ⁿ (λᵢ − λ),

which has roots λ₁, ..., λₙ. The proof of the spectral theorem is complete. □
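The decomposition of Theorem 1.2 is what `np.linalg.eigh` computes for a symmetric matrix; a numpy sketch reconstructing A from eigenvalues and rank-one outer products:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                  # a self-adjoint (symmetric) matrix

lam, X = np.linalg.eigh(A)         # eigenvalues and an orthonormal eigenbasis

# Reconstruct A = sum_i lambda_i x_i x_i' (each term is the outer product x_i [] x_i).
recon = sum(lam[i] * np.outer(X[:, i], X[:, i]) for i in range(n))
print(np.allclose(A, recon))            # True
print(np.allclose(X.T @ X, np.eye(n)))  # True: the x_i are orthonormal
```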
When A = Σλᵢ xᵢ □ xᵢ, A is particularly easy to understand. Namely, A is λᵢ times the identity transformation when restricted to span{xᵢ}. Also, if x ∈ V, then x = Σ(xᵢ, x)xᵢ so Ax = Σλᵢ(xᵢ, x)xᵢ. In the case when A is an orthogonal projection onto the subspace M, we know that A = Σᵢ₌₁ᵏ xᵢ □ xᵢ where k = dim M and {x₁, ..., xₖ} is an orthonormal basis for M. Thus A has eigenvalues of zero and one, and one occurs with multiplicity k = dim M. Conversely, the spectral theorem implies that, if A is self-adjoint and has only zero and one as eigenvalues, then A is an orthogonal projection onto a subspace of dimension equal to the multiplicity of the eigenvalue one. We now begin to reap the benefits of the spectral theorem.
Proposition 1.42. If A ∈ ℒ(V, V) is self-adjoint, then A is positive definite iff all the eigenvalues of A are strictly positive. Also, A is positive semidefinite iff the eigenvalues of A are non-negative.

Proof. Write A in spectral form:

A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ

where {x₁, ..., xₙ} is an orthonormal basis for (V, (·, ·)). Then (x, Ax) = Σλᵢ(xᵢ, x)². If λᵢ > 0 for i = 1, ..., n, then x ≠ 0 implies that Σλᵢ(xᵢ, x)² > 0 and A is positive definite. Conversely, if A is positive definite, set x = xⱼ and we have 0 < (xⱼ, Axⱼ) = λⱼ. Thus all the eigenvalues of A are strictly positive. The other assertion is proved similarly. □
The representation of A in spectral form suggests a way to define various functions of A. If A = Σλᵢ xᵢ □ xᵢ, then

A² = (Σᵢ λᵢ xᵢ □ xᵢ)(Σⱼ λⱼ xⱼ □ xⱼ) = Σᵢ Σⱼ λᵢλⱼ(xᵢ □ xᵢ)(xⱼ □ xⱼ)
= Σᵢ Σⱼ λᵢλⱼ(xᵢ, xⱼ) xᵢ □ xⱼ = Σᵢ λᵢ² xᵢ □ xᵢ.

More generally, if k is a positive integer, a bit of calculation shows that

Aᵏ = Σᵢ λᵢᵏ xᵢ □ xᵢ,  k = 1, 2, ....

For k = 0, we adopt the convention that A⁰ = I since Σᵢ xᵢ □ xᵢ = I. Now if p is any polynomial on R, the above equation forces us to define p(A) by

p(A) = Σᵢ p(λᵢ) xᵢ □ xᵢ.

This suggests that, if f is any real-valued function that is defined at λ₁, ..., λₙ, we should define f(A) by

f(A) = Σᵢ f(λᵢ) xᵢ □ xᵢ.

Adopting this suggestive definition shows that if λ₁, ..., λₙ are the eigenvalues of A, then f(λ₁), ..., f(λₙ) are the eigenvalues of f(A). In particular, if λᵢ ≠ 0 for all i and f(t) = t⁻¹ for t ≠ 0, then it is clear that f(A) = A⁻¹.
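This recipe for f(A) is straightforward to implement for symmetric matrices; a numpy sketch, using f(t) = 1/t to recover the inverse as just noted (the helper name `apply_f` is ours, not the book's):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # self-adjoint with strictly positive eigenvalues

def apply_f(A, f):
    """f(A) = sum_i f(lambda_i) x_i x_i' for a symmetric matrix A."""
    lam, X = np.linalg.eigh(A)
    return (X * f(lam)) @ X.T      # scales column x_i by f(lambda_i), then recombines

# With f(t) = 1/t this is the inverse of A.
print(np.allclose(apply_f(A, lambda t: 1.0 / t), np.linalg.inv(A)))  # True
```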
Another useful choice for f is given in the following proposition.
Proposition 1.43. If A ∈ ℒ(V, V) is positive semidefinite, then there exists a B ∈ ℒ(V, V) that is positive semidefinite and satisfies B² = A.

Proof. Choose f(t) = t^{1/2} and let

B = f(A) = Σᵢ₌₁ⁿ λᵢ^{1/2} xᵢ □ xᵢ.

The square root is well defined since λᵢ ≥ 0 for i = 1, ..., n as A is positive semidefinite. Since B is self-adjoint and has non-negative eigenvalues, B is positive semidefinite. That B² = A is clear. □
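The same construction gives a concrete square root numerically; the `np.clip` guards against tiny negative eigenvalues produced by round-off on a positive semidefinite input:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T                        # positive semidefinite

lam, X = np.linalg.eigh(A)
B = (X * np.sqrt(np.clip(lam, 0, None))) @ X.T   # B = sum_i lambda_i^{1/2} x_i x_i'

print(np.allclose(B @ B, A))                     # True: B^2 = A
print(np.allclose(B, B.T))                       # True: B is self-adjoint
print(np.all(np.linalg.eigvalsh(B) >= -1e-10))   # True: B is positive semidefinite
```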
There is a technical problem with our definition of f(A) that is caused by the nonuniqueness of the representation

A = Σᵢ₌₁ⁿ λᵢ xᵢ □ xᵢ

for self-adjoint transformations. For example, if the first n₁ λᵢ's are equal and the last n − n₁ λᵢ's are equal, then

A = λ₁(Σ_{i=1}^{n₁} xᵢ □ xᵢ) + λₙ Σ_{i=n₁+1}^{n} xᵢ □ xᵢ.

However, Σ_{i=1}^{n₁} xᵢ □ xᵢ is the orthogonal projection onto M ≡ span{x₁, ..., x_{n₁}}. If {y₁, ..., yₙ} is any other orthonormal basis for (V, (·, ·)) such that span{x₁, ..., x_{n₁}} = span{y₁, ..., y_{n₁}}, it is clear that

A = λ₁ Σ_{i=1}^{n₁} yᵢ □ yᵢ + λₙ Σ_{i=n₁+1}^{n} yᵢ □ yᵢ = Σᵢ₌₁ⁿ λᵢ yᵢ □ yᵢ.

Obviously, λ₁, ..., λₙ are uniquely defined as the eigenvalues of A (counting multiplicities), but the orthonormal basis {x₁, ..., xₙ} providing the spectral form for A is not unique. It is therefore necessary either to verify that the definition of f(A) does not depend on the particular orthonormal basis in the representation for A or to provide an alternative representation for A. It is this latter alternative that we follow. The result below is also called the spectral theorem.
Theorem 1.2a (Spectral Theorem). Suppose A is a self-adjoint linear transformation on V to V where n = dim V. Let λ₁ > ⋯ > λᵣ be the distinct eigenvalues of A and let nᵢ be the multiplicity of λᵢ, i = 1, ..., r. Then there exist orthogonal projections P₁, ..., Pᵣ with PᵢPⱼ = 0 for i ≠ j, nᵢ = rank(Pᵢ), and Σᵢ Pᵢ = I such that

A = Σᵢ₌₁ʳ λᵢPᵢ.

Further, this decomposition is unique in the following sense. If μ₁ > ⋯ > μₖ and Q₁, ..., Qₖ are orthogonal projections such that QᵢQⱼ = 0 for i ≠ j, Σᵢ Qᵢ = I, and

A = Σᵢ₌₁ᵏ μᵢQᵢ,

then k = r, μᵢ = λᵢ, and Qᵢ = Pᵢ for i = 1, ..., k.

Proof. The first assertion follows immediately from the spectral representation given in Theorem 1.2. For a proof of the uniqueness assertion, see Halmos (1958, Section 79). □
Now, our definition of f(A) is

f(A) = Σ_{i=1}^r f(λ_i) P_i

when A = Σ_{i=1}^r λ_i P_i. Of course, it is assumed that f is defined at λ_1,..., λ_r. This is exactly the same definition as before, but the problem about the nonuniqueness of the representation of A has disappeared. One application of the uniqueness part of the above theorem is that the positive semidefinite square root given in Proposition 1.43 is unique. The proof of this is left to the reader (see Halmos, 1958, Section 82). Other functions of self-adjoint linear transformations come up later and we consider them as the need arises.

Another application of the spectral theorem solves an interesting extremal problem. To motivate this problem, suppose A is self-adjoint on (V, (·,·)) with eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_n.
Thus A = Σ λ_i x_i □ x_i where {x_1,..., x_n} is an orthonormal basis for V. For x ∈ V with ‖x‖ = 1, we ask how large (x, Ax) can be. To answer this, write (x, Ax) = Σ λ_i (x, (x_i □ x_i)x) = Σ λ_i (x, x_i)^2, and note that 0 ≤ (x, x_i)^2 and 1 = ‖x‖^2 = Σ (x, x_i)^2. Therefore, (x, Ax) ≤ λ_1 with equality for x = x_1. The conclusion is

sup_{x, ‖x‖=1} (x, Ax) = λ_1

where λ_1 is the largest eigenvalue of A. This result also shows that λ_1(A), the largest eigenvalue of the self-adjoint transformation A, is a convex function of A. In other words, if A_1 and A_2 are self-adjoint and α ∈ [0, 1], then

λ_1(αA_1 + (1 − α)A_2) ≤ αλ_1(A_1) + (1 − α)λ_1(A_2).

To prove this, first notice that for each x ∈ V, (x, Ax) is a linear, and hence convex, function of A. Since the supremum of a family of convex functions is a convex
function, it follows that

λ_1(A) = sup_{x, ‖x‖=1} (x, Ax)

is a convex function defined on the real linear space of self-adjoint linear transformations. An interesting generalization of this is the following.
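Both facts are easy to check numerically; the sketch below (not from the text) samples random unit vectors and random symmetric matrices as arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(1)

def lam1(A):
    """Largest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(A)[-1]      # eigvalsh returns ascending order

def sym(n):
    M = rng.standard_normal((n, n))
    return (M + M.T) / 2

A = sym(5)
# (x, Ax) over many random unit vectors never exceeds lambda_1.
xs = rng.standard_normal((10_000, 5))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
rayleigh = np.einsum('ij,jk,ik->i', xs, A, xs)   # x_i' A x_i for each row
assert rayleigh.max() <= lam1(A) + 1e-12

# Convexity: lam1(a A1 + (1-a) A2) <= a lam1(A1) + (1-a) lam1(A2).
A1, A2 = sym(5), sym(5)
for a in np.linspace(0.0, 1.0, 11):
    assert lam1(a * A1 + (1 - a) * A2) <= a * lam1(A1) + (1 - a) * lam1(A2) + 1e-10
```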
Proposition 1.44. Consider a self-adjoint transformation A defined on the n-dimensional inner product space (V, (·,·)) and let λ_1 ≥ λ_2 ≥ ··· ≥ λ_n be the ordered eigenvalues of A. For 1 ≤ k ≤ n, let B_k be the collection of all k-tuples {v_1,..., v_k} such that {v_1,..., v_k} is an orthonormal set in (V, (·,·)). Then

sup_{{v_1,...,v_k} ∈ B_k} Σ_{i=1}^k (v_i, Av_i) = Σ_{i=1}^k λ_i.
Proof. Recall that ⟨·,·⟩ is the inner product on £(V, V) induced by the inner product (·,·) on V, and (x, Ax) = ⟨x □ x, A⟩ for x ∈ V. Thus

Σ_{i=1}^k (v_i, Av_i) = ⟨Σ_{i=1}^k v_i □ v_i, A⟩.

Write A in spectral form, A = Σ_{i=1}^n λ_i x_i □ x_i. For {v_1,..., v_k} ∈ B_k, P_k = Σ_{i=1}^k v_i □ v_i is the orthogonal projection onto span{v_1,..., v_k}. Thus for {v_1,..., v_k} ∈ B_k,

⟨Σ_{i=1}^k v_i □ v_i, A⟩ = ⟨P_k, Σ_{i=1}^n λ_i x_i □ x_i⟩ = Σ_{i=1}^n λ_i ⟨P_k, x_i □ x_i⟩ = Σ_{i=1}^n λ_i (x_i, P_k x_i).

Since P_k is an orthogonal projection and ‖x_i‖ = 1, i = 1,..., n, we have 0 ≤ (x_i, P_k x_i) ≤ 1. Also,

Σ_{i=1}^n (x_i, P_k x_i) = ⟨P_k, Σ_{i=1}^n x_i □ x_i⟩ = ⟨P_k, I⟩

because Σ_{i=1}^n x_i □ x_i = I ∈ £(V, V). But P_k = Σ_{i=1}^k v_i □ v_i, so

⟨P_k, I⟩ = Σ_{i=1}^k ⟨v_i □ v_i, I⟩ = Σ_{i=1}^k (v_i, v_i) = k.
Therefore, the real numbers a_i = (x_i, P_k x_i), i = 1,..., n, satisfy 0 ≤ a_i ≤ 1 and Σ_{i=1}^n a_i = k. A moment's reflection shows that, for any numbers a_1,..., a_n satisfying these conditions, we have

Σ_{i=1}^n λ_i a_i ≤ Σ_{i=1}^k λ_i

since λ_1 ≥ ··· ≥ λ_n. Therefore,

⟨Σ_{i=1}^k v_i □ v_i, A⟩ ≤ Σ_{i=1}^k λ_i

for {v_1,..., v_k} ∈ B_k. However, setting v_i = x_i, i = 1,..., k, yields equality in the above inequality. ∎
For a self-adjoint A ∈ £(V, V), define tr_k A = Σ_{i=1}^k λ_i where λ_1 ≥ ··· ≥ λ_n are the ordered eigenvalues of A. The symbol tr_k A is read "trace sub-k of A." Since ⟨Σ_{i=1}^k v_i □ v_i, A⟩ is a linear function of A and tr_k A is the supremum over all {v_1,..., v_k} ∈ B_k, it follows that tr_k A is a convex function of A. Of course, when k = n, tr_k A is just the trace of A.
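As an illustrative sketch (not from the text), tr_k and the extremal property of Proposition 1.44 can be checked numerically; random orthonormal k-frames are obtained here from QR factorizations of random matrices.

```python
import numpy as np

def tr_k(A, k):
    """Sum of the k largest eigenvalues of a symmetric matrix A."""
    lam = np.linalg.eigvalsh(A)           # ascending order
    return lam[::-1][:k].sum()

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2

# Proposition 1.44: sum_i (v_i, A v_i) over any orthonormal k-frame
# never exceeds tr_k(A); note sum_i v_i' A v_i = tr(Q' A Q).
k = 3
for _ in range(200):
    Q, _ = np.linalg.qr(rng.standard_normal((6, k)))  # orthonormal columns
    assert np.trace(Q.T @ A @ Q) <= tr_k(A, k) + 1e-10

assert np.isclose(tr_k(A, 6), np.trace(A))  # k = n recovers the trace
```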
For completeness, a statement of the spectral theorem for n × n symmetric matrices is in order.

Proposition 1.45. Suppose A is an n × n real symmetric matrix. Then there exists an n × n orthogonal matrix Γ and an n × n diagonal matrix D such that A = ΓDΓ'. The columns of Γ are the eigenvectors of A and the diagonal elements of D, say λ_1,..., λ_n, are the eigenvalues of A.
Proof. This is nothing more than a disguised version of the spectral theorem. To see this, write

A = Σ_{i=1}^n λ_i x_i x_i'

where x_i ∈ R^n, λ_i ∈ R, and {x_1,..., x_n} is an orthonormal basis for R^n with the usual inner product (here x_i □ x_i is x_i x_i' since we have the usual inner product on R^n). Let Γ have columns x_1,..., x_n and let D have diagonal elements λ_1,..., λ_n. Then a straightforward computation shows that

Σ_{i=1}^n λ_i x_i x_i' = ΓDΓ'.

The remaining assertions follow immediately from the spectral theorem. ∎
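In matrix terms this is exactly what `numpy.linalg.eigh` computes; the sketch below (not from the text) verifies A = ΓDΓ' on an arbitrary random symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                       # real symmetric

lam, G = np.linalg.eigh(A)              # eigenvalues (ascending), eigenvectors
D = np.diag(lam)

assert np.allclose(G @ G.T, np.eye(5))  # Gamma is orthogonal
assert np.allclose(G @ D @ G.T, A)      # A = Gamma D Gamma'

# Equivalently, A = sum_i lambda_i x_i x_i' with x_i the columns of Gamma.
A_sum = sum(lam[i] * np.outer(G[:, i], G[:, i]) for i in range(5))
assert np.allclose(A_sum, A)
```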
Our final application of the spectral theorem in this chapter deals with a representation theorem for a linear transformation A ∈ £(V, W) where (V, (·,·)) and (W, [·,·]) are finite dimensional inner product spaces. In this context, eigenvalues and eigenvectors of A make no sense, but something can be salvaged by considering A'A ∈ £(V, V). First, A'A is non-negative definite and N(A'A) = N(A). Let k = rank(A) = rank(A'A) and let λ_1 ≥ ··· ≥ λ_k > 0 be the nonzero eigenvalues of A'A. There must be exactly k positive eigenvalues of A'A as rank(A) = k. The spectral theorem shows that

A'A = Σ_{i=1}^k λ_i x_i □ x_i

where {x_1,..., x_n} is an orthonormal basis for V, A'A x_i = λ_i x_i for i = 1,..., k, and A'A x_i = 0 for i = k + 1,..., n. Therefore, N(A) = N(A'A) = (span{x_1,..., x_k})⊥.
Proposition 1.46. In the notation above, let w_i = (1/√λ_i) A x_i for i = 1,..., k. Then {w_1,..., w_k} is an orthonormal basis for R(A) ⊆ W and A = Σ_{i=1}^k √λ_i w_i □ x_i.

Proof. Since dim R(A) = k, {w_1,..., w_k} is a basis for R(A) if {w_1,..., w_k} is an orthonormal set. But

[w_i, w_j] = (λ_i λ_j)^{-1/2} [A x_i, A x_j] = (λ_i λ_j)^{-1/2} (x_i, A'A x_j) = (λ_i λ_j)^{-1/2} λ_j (x_i, x_j) = δ_ij

and the first assertion holds. To show A = Σ_{i=1}^k √λ_i w_i □ x_i, we verify that the two linear transformations agree on the basis {x_1,..., x_n}. For 1 ≤ j ≤ k, A x_j = √λ_j w_j by definition and

(Σ_{i=1}^k √λ_i w_i □ x_i) x_j = Σ_{i=1}^k √λ_i (x_i, x_j) w_i = √λ_j w_j.

For k + 1 ≤ j ≤ n, A x_j = 0 since N(A) = (span{x_1,..., x_k})⊥. Also

(Σ_{i=1}^k √λ_i w_i □ x_i) x_j = Σ_{i=1}^k √λ_i (x_i, x_j) w_i = 0

as j > k. ∎
Some immediate consequences of the above representation are: (i) AA' = Σ_{i=1}^k λ_i w_i □ w_i; (ii) A' = Σ_{i=1}^k √λ_i x_i □ w_i and A' w_i = √λ_i x_i for i = 1,..., k. In summary, we have the following.

Theorem 1.3 (Singular Value Decomposition Theorem). Given A ∈ £(V, W) of rank k, there exist orthonormal vectors x_1,..., x_k in V and w_1,..., w_k in W and positive numbers μ_1,..., μ_k such that

A = Σ_{i=1}^k μ_i w_i □ x_i.

Also, R(A) = span{w_1,..., w_k}, N(A) = (span{x_1,..., x_k})⊥, A x_i = μ_i w_i for i = 1,..., k, A' = Σ_{i=1}^k μ_i x_i □ w_i, A'A = Σ_{i=1}^k μ_i^2 x_i □ x_i, and AA' = Σ_{i=1}^k μ_i^2 w_i □ w_i. The numbers μ_1^2,..., μ_k^2 are the positive eigenvalues of both AA' and A'A.
For matrices, this result takes the following form.
Proposition 1.47. If A is a real n × m matrix of rank k, then there exist matrices Γ: n × k, D: k × k, and Ψ: k × m that satisfy Γ'Γ = I_k, ΨΨ' = I_k, D is a diagonal matrix with positive diagonal elements, and

A = ΓDΨ.

Proof. Take V = R^m, W = R^n and apply Theorem 1.3 to get

A = Σ_{i=1}^k μ_i w_i x_i'

where x_1,..., x_k are orthonormal in R^m, w_1,..., w_k are orthonormal in R^n, and μ_i > 0, i = 1,..., k. Let Γ have columns w_1,..., w_k, let Ψ have rows x_1',..., x_k', and let D be diagonal with diagonal elements μ_1,..., μ_k. An easy calculation shows that

Σ_{i=1}^k μ_i w_i x_i' = ΓDΨ. ∎
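The decomposition A = ΓDΨ of Proposition 1.47 corresponds to the reduced singular value decomposition; the following sketch (not from the text) obtains it from `numpy.linalg.svd` on an arbitrary rank-3 example.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))  # 6 x 5, rank 3

U, s, Vt = np.linalg.svd(A)
k = int(np.sum(s > 1e-10))              # numerical rank
G   = U[:, :k]                          # Gamma: n x k, G'G = I_k
D   = np.diag(s[:k])                    # k x k, positive diagonal
Psi = Vt[:k, :]                         # Psi: k x m, Psi Psi' = I_k

assert np.allclose(G.T @ G, np.eye(k))
assert np.allclose(Psi @ Psi.T, np.eye(k))
assert np.allclose(G @ D @ Psi, A)      # A = Gamma D Psi
# The mu_i^2 = s_i^2 are the positive eigenvalues of A'A (and of AA').
assert np.allclose(np.sort(np.linalg.eigvalsh(A.T @ A))[-k:],
                   np.sort(s[:k] ** 2))
```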
In the case that A ∈ £(V, V) has rank k, Theorem 1.3 shows that there exist orthonormal sets {x_1,..., x_k} and {w_1,..., w_k} in V such that

A = Σ_{i=1}^k μ_i w_i □ x_i

where μ_i > 0, i = 1,..., k. Also, R(A) = span{w_1,..., w_k} and N(A) =
(span{x_1,..., x_k})⊥.

Now, consider two subspaces M_1 and M_2 of the inner product space (V, (·,·)) and let P_1 and P_2 be the orthogonal projections onto M_1 and M_2. In what follows, the geometrical relationship between the two subspaces (measured in terms of angles, which are defined below) is related to the singular value decomposition of the linear transformation P_2 P_1 ∈ £(V, V). It is clear that R(P_2 P_1) ⊆ M_2 and N(P_2 P_1) ⊇ M_1⊥. Let k = rank(P_2 P_1) so k ≤ dim(M_i), i = 1, 2. Theorem 1.3 implies that

P_2 P_1 = Σ_{i=1}^k μ_i w_i □ x_i

where μ_i > 0, i = 1,..., k, R(P_2 P_1) = span{w_1,..., w_k} ⊆ M_2, and (N(P_2 P_1))⊥ = span{x_1,..., x_k} ⊆ M_1. Also, {w_1,..., w_k} and {x_1,..., x_k} are orthonormal sets. Since P_2 P_1 x_j = μ_j w_j and (P_2 P_1)'P_2 P_1 = P_1 P_2 P_2 P_1 = P_1 P_2 P_1 = Σ_{l=1}^k μ_l^2 x_l □ x_l, we have

μ_j (x_i, w_j) = (x_i, P_2 P_1 x_j) = (P_1 x_i, P_2 P_1 x_j) = (x_i, P_1 P_2 P_1 x_j) = (x_i, Σ_{l=1}^k μ_l^2 (x_l □ x_l) x_j) = (x_i, x_j) μ_j^2.

Therefore, for i, j = 1,..., k,

(x_i, w_j) = δ_ij μ_j

since μ_j > 0. Furthermore, if x ∈ M_1 ∩ (span{x_1,..., x_k})⊥ and w ∈ M_2, then (x, w) = (P_1 x, P_2 w) = (P_2 P_1 x, w) = 0 since P_2 P_1 x = 0. Similarly, if w ∈ M_2 ∩ (span{w_1,..., w_k})⊥ and x ∈ M_1, then (x, w) = 0. The above discussion yields the following proposition.
Proposition 1.48. Suppose M_1 and M_2 are subspaces of (V, (·,·)) and let P_1 and P_2 be the orthogonal projections onto M_1 and M_2. If k = rank(P_2 P_1), then there exist orthonormal sets {x_1,..., x_k} ⊆ M_1 and {w_1,..., w_k} ⊆ M_2 and positive numbers μ_1 ≥ ··· ≥ μ_k such that:

(i) P_2 P_1 = Σ_{i=1}^k μ_i w_i □ x_i.
(ii) P_1 P_2 P_1 = Σ_{i=1}^k μ_i^2 x_i □ x_i.
(iii) P_2 P_1 P_2 = Σ_{i=1}^k μ_i^2 w_i □ w_i.
(iv) 0 < μ_j ≤ 1 and (x_i, w_j) = δ_ij μ_j for i, j = 1,..., k.
(v) If x ∈ M_1 ∩ (span{x_1,..., x_k})⊥ and w ∈ M_2, then (x, w) = 0. If w ∈ M_2 ∩ (span{w_1,..., w_k})⊥ and x ∈ M_1, then (x, w) = 0.
Proof. Assertions (i), (ii), (iii), and (v) have been verified, as has the relationship (x_i, w_j) = δ_ij μ_j. Since 0 < μ_j = (x_j, w_j), the Cauchy-Schwarz inequality yields (x_j, w_j) ≤ ‖x_j‖ ‖w_j‖ = 1. ∎

In Proposition 1.48, if k = rank(P_2 P_1) = 0, then M_1 and M_2 are orthogonal to each other and P_1 P_2 = P_2 P_1 = 0. The next result provides the framework in which to relate the numbers μ_1 ≥ ··· ≥ μ_k to angles.
Proposition 1.49. In the notation of Proposition 1.48, let M_11 = M_1, M_21 = M_2,

M_1i = (span{x_1,..., x_{i−1}})⊥ ∩ M_1

and

M_2i = (span{w_1,..., w_{i−1}})⊥ ∩ M_2

for i = 2,..., k + 1. Also, for i = 1,..., k, let

D_1i = {x | x ∈ M_1i, ‖x‖ = 1}

and

D_2i = {w | w ∈ M_2i, ‖w‖ = 1}.

Then

sup_{x ∈ D_1i} sup_{w ∈ D_2i} (x, w) = (x_i, w_i) = μ_i

for i = 1,..., k. Also, M_1(k+1) ⊥ M_2 and M_2(k+1) ⊥ M_1.
Proof. Since x_i ∈ D_1i and w_i ∈ D_2i, the iterated supremum is at least (x_i, w_i), and (x_i, w_i) = μ_i by (iv) of Proposition 1.48. Thus it suffices to show that for each x ∈ D_1i and w ∈ D_2i, we have the inequality (x, w) ≤ μ_i. However, for x ∈ D_1i and w ∈ D_2i,

(x, w) = (P_1 x, P_2 w) = (P_2 P_1 x, w) ≤ ‖P_2 P_1 x‖ ‖w‖ = ‖P_2 P_1 x‖

since ‖w‖ = 1 as w ∈ D_2i. Thus

(x, w) ≤ ‖P_2 P_1 x‖ = (P_2 P_1 x, P_2 P_1 x)^{1/2} = (P_1 P_2 P_1 x, x)^{1/2} = [Σ_{j=1}^k μ_j^2 ((x_j □ x_j)x, x)]^{1/2} = [Σ_{j=1}^k μ_j^2 (x_j, x)^2]^{1/2}.

Since x ∈ D_1i, (x, x_j) = 0 for j = 1,..., i − 1. Also, the numbers a_j
= (x_j, x)^2 satisfy 0 ≤ a_j ≤ 1 and Σ_j a_j ≤ 1 as ‖x‖ = 1. Therefore,

(x, w) ≤ [Σ_{j=1}^k μ_j^2 (x_j, x)^2]^{1/2} = [Σ_{j=i}^k μ_j^2 a_j]^{1/2} ≤ μ_i.

The last inequality follows from the fact that μ_i ≥ ··· ≥ μ_k > 0 and the conditions on the a_j's. Hence,

sup_{x ∈ D_1i} sup_{w ∈ D_2i} (x, w) = (x_i, w_i) = μ_i

and the first assertion holds. The second assertion is simply a restatement of (v) of Proposition 1.48. ∎
Definition 1.28. Let M_1 and M_2 be subspaces of (V, (·,·)). Given the numbers μ_1 ≥ ··· ≥ μ_k > 0, whose existence is guaranteed by Proposition 1.48, define θ_i ∈ [0, π/2) by

cos θ_i = μ_i, i = 1,..., k.

Let t = min(dim M_1, dim M_2) and set θ_i = π/2 for i = k + 1,..., t. The numbers θ_1 ≤ θ_2 ≤ ··· ≤ θ_t are called the ordered angles between M_1 and M_2.
The following discussion is intended to provide motivation, explanation, and a geometric interpretation of the above definition. Recall that if y_1 and y_2 are two vectors in (V, (·,·)) of length 1, then the cosine of the angle between y_1 and y_2 is defined by cos θ = (y_1, y_2) where 0 ≤ θ ≤ π. However, if we want to define the angle between the two lines span{y_1} and span{y_2}, then a choice must be made between two angles that sum to π. The convention adopted here is to choose the angle in [0, π/2]. Thus the cosine of the angle between span{y_1} and span{y_2} is just |(y_1, y_2)|. To show this agrees with the definition above, we have M_i = span{y_i} and P_i = y_i □ y_i is the orthogonal projection onto M_i, i = 1, 2. The rank of P_2 P_1 is either zero or one, and the rank is zero iff y_1 ⊥ y_2. If y_1 ⊥ y_2, then the angle between M_1 and M_2 is π/2, which agrees with Definition 1.28. When the rank of P_2 P_1 is one, P_1 P_2 P_1 = (y_1, y_2)^2 y_1 □ y_1, whose only nonzero eigenvalue is (y_1, y_2)^2. Thus μ_1^2 = (y_1, y_2)^2, so μ_1 = |(y_1, y_2)| = cos θ_1, and again we have agreement with Definition 1.28.
Now consider the case when M_1 = span{y_1}, ‖y_1‖ = 1, and M_2 is an arbitrary subspace of (V, (·,·)). Geometrically, it is clear that the angle between M_1 and M_2 is just the angle between M_1 and the orthogonal projection of M_1 onto M_2, say M_2* = span{P_2 y_1}, where P_2 is the orthogonal projection onto M_2. Thus the cosine of the angle between M_1 and M_2 is

cos θ = |(y_1, P_2 y_1 / ‖P_2 y_1‖)| = ‖P_2 y_1‖.

If P_2 y_1 = 0, then M_1 ⊥ M_2 and cos θ = 0 so θ = π/2, in agreement with Definition 1.28. When P_2 y_1 ≠ 0, then P_1 P_2 P_1 = (y_1, P_2 y_1) y_1 □ y_1, whose only nonzero eigenvalue is μ_1^2 = (y_1, P_2 y_1) = (P_2 y_1, P_2 y_1) = ‖P_2 y_1‖^2. Therefore, μ_1 = ‖P_2 y_1‖ and again we have agreement with Definition 1.28.
In the general case when dim(M_i) > 1 for i = 1, 2, it is not entirely clear how we should define the angles between M_1 and M_2. However, the following considerations should provide some justification for Definition 1.28. First, consider x ∈ M_1 and w ∈ M_2 with ‖x‖ = ‖w‖ = 1. The cosine of the angle between span{x} and span{w} is |(x, w)|. Thus the largest cosine of any angle (equivalently, the smallest angle in [0, π/2]) between a one-dimensional subspace of M_1 and a one-dimensional subspace of M_2 is

sup_{x ∈ D_11} sup_{w ∈ D_21} |(x, w)| = sup_{x ∈ D_11} sup_{w ∈ D_21} (x, w).

The sets D_11 and D_21 are defined in Proposition 1.49. By Proposition 1.49, this iterated supremum is μ_1 and is achieved for x = x_1 ∈ D_11 and w = w_1 ∈ D_21. Thus the cosine of the angle between span{x_1} and span{w_1} is μ_1.

Now, remove span{x_1} from M_1 to get M_12 = (span{x_1})⊥ ∩ M_1 and remove span{w_1} from M_2 to get M_22 = (span{w_1})⊥ ∩ M_2. The second largest cosine of any angle between M_1 and M_2 is defined to be the largest cosine of any angle between M_12 and M_22 and is given by

sup_{x ∈ D_12} sup_{w ∈ D_22} (x, w) = (x_2, w_2) = μ_2.

Next, span{x_2} is removed from M_12 and span{w_2} is removed from M_22, yielding M_13 and M_23. The third largest cosine of any angle between M_1 and M_2 is defined to be the largest cosine of any angle between M_13 and M_23, and so on. After k steps, we are left with M_1(k+1) and M_2(k+1), which are orthogonal to each other. Thus the remaining angles are π/2. The above is precisely the content of Definition 1.28, given the results of Propositions 1.48 and 1.49.
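Numerically, the cosines μ_1 ≥ ··· ≥ μ_k are the singular values of Q_1'Q_2 for orthonormal bases Q_1 of M_1 and Q_2 of M_2; the sketch below (not from the text, using arbitrary random subspaces) checks this against the eigenvalues of P_1 P_2 P_1 from Proposition 1.48.

```python
import numpy as np

def principal_cosines(A, B):
    """Cosines of the ordered angles between span(columns of A)
    and span(columns of B), via singular values of Q1' Q2."""
    Q1, _ = np.linalg.qr(A)             # orthonormal basis for M1
    Q2, _ = np.linalg.qr(B)             # orthonormal basis for M2
    return np.linalg.svd(Q1.T @ Q2, compute_uv=False)

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 2))         # M1 = span of 2 random vectors in R^5
B = rng.standard_normal((5, 3))         # M2 = span of 3 random vectors in R^5
cos = principal_cosines(A, B)           # mu_1 >= mu_2, each in [0, 1]
assert np.all(cos >= 0) and np.all(cos <= 1 + 1e-12)

# Proposition 1.48 (ii): the mu_i^2 are the nonzero eigenvalues of P1 P2 P1.
P1 = A @ np.linalg.inv(A.T @ A) @ A.T
P2 = B @ np.linalg.inv(B.T @ B) @ B.T
ev = np.linalg.eigvalsh(P1 @ P2 @ P1)
assert np.allclose(np.sort(ev)[-2:], np.sort(cos**2), atol=1e-10)
```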
The statistical interpretation of the angles between subspaces is given in a later chapter. In a statistical context, the cosines of these angles are called canonical correlation coefficients and are a measure of the affine dependence between random vectors.
PROBLEMS
All vector spaces are finite dimensional unless specified otherwise.
1. Let V_{n+1} be the set of all polynomials of degree at most n (in the real variable t) with real coefficients. With the usual definitions of addition and scalar multiplication, prove that V_{n+1} is an (n + 1)-dimensional real vector space.
2. For A ∈ £(V, W), suppose that M is any subspace of V such that M ⊕ N(A) = V.
(i) Show that R(A) = A(M) where A(M) = {w | w = Ax for some x ∈ M}.
(ii) If x_1,..., x_k is any linearly independent set in V such that span{x_1,..., x_k} ∩ N(A) = {0}, prove that Ax_1,..., Ax_k is linearly independent.
3. For A ∈ £(V, W), fix w_0 ∈ W and consider the linear equation Ax = w_0. If w_0 ∉ R(A), there is no solution to this equation. If w_0 ∈ R(A), let x_0 be any solution, so Ax_0 = w_0. Prove that N(A) + x_0 is the set of all solutions to Ax = w_0.
4. For the direct sum space V_1 ⊕ V_2, suppose A_ij ∈ £(V_j, V_i) and let

   T = ( A_11  A_12 )
       ( A_21  A_22 )

be defined by

   T{v_1, v_2} = {A_11 v_1 + A_12 v_2, A_21 v_1 + A_22 v_2}

for {v_1, v_2} ∈ V_1 ⊕ V_2.
(i) Prove that T is a linear transformation.
(ii) Conversely, prove that every T ∈ £(V_1 ⊕ V_2, V_1 ⊕ V_2) has such a representation.
(iii) If

   T = ( A_11  A_12 )   and   U = ( B_11  B_12 )
       ( A_21  A_22 )             ( B_21  B_22 ),

prove that the representation of TU is

   ( A_11 B_11 + A_12 B_21   A_11 B_12 + A_12 B_22 )
   ( A_21 B_11 + A_22 B_21   A_21 B_12 + A_22 B_22 ).
5. Let x_1,..., x_r, x_{r+1} be vectors in V with x_1,..., x_r linearly independent. For w_1,..., w_r, w_{r+1} in W, give a necessary and sufficient condition for the existence of an A ∈ £(V, W) that satisfies A x_i = w_i, i = 1,..., r + 1.
6. Suppose A ∈ £(V, V) satisfies A^2 = cA where c ≠ 0. Find a constant k so that B = kA is a projection.
7. Suppose A is an m × n matrix with columns a_1,..., a_n and B is an n × k matrix with rows b_1',..., b_n'. Show that AB = Σ_{i=1}^n a_i b_i'.
8. Let x_1,..., x_k be vectors in R^n, set M = span{x_1,..., x_k}, and let A be the n × k matrix with columns x_1,..., x_k so A ∈ £(R^k, R^n).
(i) Show M = R(A).
(ii) Show dim(M) = rank(A'A).
9. For linearly independent x_1,..., x_k in (V, (·,·)), let y_1,..., y_k be the vectors obtained by applying the Gram-Schmidt (G-S) process to x_1,..., x_k. Show that if z_i = A x_i, i = 1,..., k, where A ∈ O(V), then the vectors obtained by the G-S process from z_1,..., z_k are A y_1,..., A y_k. (In other words, the G-S process commutes with orthogonal transformations.)
10. In (V, (·,·)), let x_1,..., x_k be vectors with x_1 ≠ 0. Form y_1^1,..., y_k^1 by y_1^1 = x_1/‖x_1‖ and y_i^1 = x_i − (x_i, y_1^1) y_1^1, i = 2,..., k.
(i) Show span{x_1,..., x_r} = span{y_1^1,..., y_r^1} for r = 1, 2,..., k.
(ii) Show y_1^1 ⊥ span{y_2^1,..., y_r^1} so span{y_1^1,..., y_r^1} = span{y_1^1} ⊕ span{y_2^1,..., y_r^1} for r = 2,..., k.
(iii) Now, form y_2^2,..., y_k^2 from y_2^1,..., y_k^1 as the y_i^1's were formed from the x_i's (reordering if necessary to achieve y_2^1 ≠ 0). Show span{x_1,..., x_k} = span{y_1^1} ⊕ span{y_2^2} ⊕ span{y_3^2,..., y_k^2}.
(iv) Let m = dim(span{x_1,..., x_k}). Show that after applying the above procedure m times, we get an orthonormal basis y_1^1, y_2^2,..., y_m^m for span{x_1,..., x_k}.
(v) If x_1,..., x_k are linearly independent, show that span{x_1,..., x_r} = span{y_1^1,..., y_r^r} for r = 1,..., k.
11. Let x_1,..., x_m be a basis for (V, (·,·)) and w_1,..., w_n be a basis for (W, [·,·]). For A, B ∈ £(V, W), show that [A x_i, w_j] = [B x_i, w_j] for i = 1,..., m and j = 1,..., n implies that A = B.
12. For x_i ∈ (V, (·,·)) and y_i ∈ (W, [·,·]), i = 1, 2, suppose that x_1 □ y_1 = x_2 □ y_2 ≠ 0. Prove that x_1 = c x_2 for some scalar c ≠ 0 and that y_1 = c^{-1} y_2.
13. Given two inner products on V, say (·,·) and [·,·], show that there exist positive constants c_1 and c_2 such that c_1[x, x] ≤ (x, x) ≤ c_2[x, x] for x ∈ V. Using this, show that for any open ball in (V, (·,·)), say B = {x | (x, x)^{1/2} < a}, there exist open balls in (V, [·,·]), say B_i = {x | [x, x]^{1/2} < β_i}, i = 1, 2, such that B_1 ⊆ B ⊆ B_2.
14. In (V, (·,·)), prove that ‖x + y‖ ≤ ‖x‖ + ‖y‖. Using this, prove that h(x) = ‖x‖ is a convex function.
15. For positive integers I and J, consider the IJ-dimensional real vector space, V, of all real-valued functions defined on {1, 2,..., I} × {1, 2,..., J}. Denote the value of y ∈ V at (i, j) by y_ij. The inner product on V is taken to be (y, ỹ) = Σ_i Σ_j y_ij ỹ_ij. The symbol 1 ∈ V denotes the vector all of whose coordinates are one.
(i) Define A on V to V by (Ay)_ij = ȳ.. where ȳ.. = (IJ)^{-1} Σ_i Σ_j y_ij. Show that A is the orthogonal projection onto span{1}.
(ii) Define linear transformations B_1, B_2, and B_3 on V by

(B_1 y)_ij = ȳ_i. − ȳ..
(B_2 y)_ij = ȳ._j − ȳ..
(B_3 y)_ij = y_ij − ȳ_i. − ȳ._j + ȳ..

where

ȳ_i. = J^{-1} Σ_j y_ij   and   ȳ._j = I^{-1} Σ_i y_ij.

Show that B_1, B_2, and B_3 are orthogonal projections and the
following hold:

A B_k = 0, k = 1, 2, 3,
B_1 B_2 = B_1 B_3 = B_2 B_3 = 0,
(A + B_1 + B_2 + B_3) y = y, y ∈ V.

(iii) Show that

‖y‖^2 = ‖Ay‖^2 + ‖B_1 y‖^2 + ‖B_2 y‖^2 + ‖B_3 y‖^2.
16. For Γ ∈ O(V) and M a subspace of V, suppose that Γ(M) ⊆ M. Prove that Γ(M⊥) ⊆ M⊥.
17. Given a subspace M of (V, (·,·)), show the following are equivalent:
(i) |(x, y)| ≤ c‖x‖ for all x ∈ M.
(ii) ‖P_M y‖ ≤ c.
Here c is a fixed positive constant and P_M is the orthogonal projection onto M.
18. In (V, (·,·)), suppose A and B are positive semidefinite. For C, D ∈ £(V, V) prove that (tr ACBD')^2 ≤ (tr ACBC')(tr ADBD').
19. Show that C^n is a 2n-dimensional real vector space.
20. Let A be an n × n real matrix. Prove:
(i) If λ_0 is a real eigenvalue of A, then there exists a corresponding real eigenvector.
(ii) If λ_0 is an eigenvalue that is not real, then any corresponding eigenvector cannot be real or purely imaginary.
21. In an n-dimensional space (V, (·,·)), suppose P is a rank r orthogonal projection. For α, β ∈ R, let A = αP + β(I − P). Find the eigenvalues, the eigenvectors, and the characteristic polynomial of A. Show that A is positive definite iff α > 0 and β > 0. What is A^{-1} when it exists?
22. Suppose A and B are self-adjoint and A − B ≥ 0. Let λ_1 ≥ ··· ≥ λ_n and μ_1 ≥ ··· ≥ μ_n be the eigenvalues of A and B. Show that λ_i ≥ μ_i, i = 1,..., n.
23. If S, T ∈ £(V, V) and S > 0, T ≥ 0, prove that ⟨S, T⟩ = 0 implies T = 0.
24. For A ∈ (£(V, V), ⟨·,·⟩), show that ⟨A, I⟩ = tr A.
25. Suppose A and B in £(V, V) are self-adjoint and write A ≥ B to mean A − B ≥ 0.
(i) If A ≥ B, show that CAC' ≥ CBC' for all C ∈ £(V, V).
(ii) Show I ≥ A iff all the eigenvalues of A are less than or equal to one.
(iii) Assume A ≥ 0, B ≥ 0, and A ≥ B. Is A^{1/2} ≥ B^{1/2}? Is A^2 ≥ B^2?
26. If P is an orthogonal projection, show that tr P is the rank of P.
27. Let x_1,..., x_n be an orthonormal basis for (V, (·,·)) and consider the vector space (£(V, V), ⟨·,·⟩). Let M be the subspace of £(V, V) consisting of all self-adjoint linear transformations and let N be the subspace of all skew-symmetric linear transformations. Prove:
(i) {x_i □ x_j + x_j □ x_i | i ≤ j} is an orthogonal basis for M.
(ii) {x_i □ x_j − x_j □ x_i | i < j} is an orthogonal basis for N.
(iii) M is orthogonal to N and M ⊕ N = £(V, V).
(iv) The orthogonal projection onto M is A → (A + A')/2, A ∈ £(V, V).
28. Consider £_{n,n} with the usual inner product ⟨A, B⟩ = tr AB', and let S_n be the subspace of symmetric matrices. Then (S_n, ⟨·,·⟩) is an inner product space. Show dim S_n = n(n + 1)/2 and, for S, T ∈ S_n, ⟨S, T⟩ = Σ_i s_ii t_ii + 2 Σ Σ_{i<j} s_ij t_ij.
29. For A ∈ £(V, W), one definition of the norm of A is

|||A||| = sup_{‖v‖=1} ‖Av‖

where ‖·‖ denotes the given norms on V and W.
(i) Show that |||A||| is the square root of the largest eigenvalue of A'A.
(ii) Show that |||αA||| = |α| |||A|||, α ∈ R, and |||A + B||| ≤ |||A||| + |||B|||.

30. In the inner product spaces (V, (·,·)) and (W, [·,·]), consider A ∈ £(V, V) and B ∈ £(W, W), which are both self-adjoint. Write these in spectral form as

A = Σ_{i=1}^m λ_i x_i □ x_i,   B = Σ_{j=1}^n μ_j w_j □ w_j.

(Note: The symbol □ has a different meaning in these two equations since the definition of □ depends on the inner product.) Of course, {x_1,..., x_m} is an orthonormal basis for (V, (·,·)) and {w_1,..., w_n} is an orthonormal basis for (W, [·,·]). Also, {x_i □ w_j | i = 1,..., m, j = 1,..., n} is an orthonormal basis for (£(W, V), ⟨·,·⟩), and A ⊗ B is a linear transformation on £(W, V) to £(W, V).
(i) Show that (A ⊗ B)(x_i □ w_j) = λ_i μ_j (x_i □ w_j), so λ_i μ_j is an eigenvalue of A ⊗ B.
(ii) Show that A ⊗ B = Σ_i Σ_j λ_i μ_j (x_i □ w_j) □ (x_i □ w_j) and that this is a spectral decomposition for A ⊗ B. What are the eigenvalues and corresponding eigenvectors of A ⊗ B?
(iii) If A and B are positive definite (semidefinite), show that A ⊗ B is positive definite (semidefinite).
(iv) Show that tr A ⊗ B = (tr A)(tr B) and det A ⊗ B = (det A)^n (det B)^m.
31. Let x_1,..., x_p be linearly independent vectors in R^n, set M = span{x_1,..., x_p}, and let A: n × p have columns x_1,..., x_p. Thus R(A) = M and A'A is positive definite.
(i) Show that Ψ = A(A'A)^{-1/2} is a linear isometry whose columns form an orthonormal basis for M. Here, (A'A)^{-1/2} denotes the inverse of the positive definite square root of A'A.
(ii) Show that ΨΨ' = A(A'A)^{-1}A' is the orthogonal projection onto M.
32. Consider two subspaces, M_1 and M_2, of R^n with bases x_1,..., x_q and y_1,..., y_r. Let A (B) have columns x_1,..., x_q (y_1,..., y_r). Then P_1 = A(A'A)^{-1}A' and P_2 = B(B'B)^{-1}B' are the orthogonal projections onto M_1 and M_2, respectively. The cosines of the angles between M_1 and M_2 can be obtained by computing the nonzero eigenvalues of P_1 P_2 P_1. Show that these are the same as the nonzero eigenvalues of

(A'A)^{-1} A'B (B'B)^{-1} B'A: q × q

and of

(B'B)^{-1} B'A (A'A)^{-1} A'B: r × r.
33. In R^4, set x_1 = (1, 0, 0, 0), x_2 = (0, 1, 0, 0), y_1 = (1, 1, 1, 1), and y_2 = (1, −1, 1, −1). Find the cosines of the angles between M_1 = span{x_1, x_2} and M_2 = span{y_1, y_2}.
34. For two subspaces M_1 and M_2 of (V, (·,·)), argue that the angles between M_1 and M_2 are the same as the angles between Γ(M_1) and Γ(M_2) for any Γ ∈ O(V).
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
NOTES AND REFERENCES 69
35. This problem has to do with the vector space V of Example 1.9, and V may be infinite dimensional. The results in this problem are not used in the sequel. Write X_1 ≡ X_2 if X_1 = X_2 a.e. (P_0) for X_1 and X_2 in V. It is easy to verify that ≡ is an equivalence relation on V. Let M = {X | X ∈ V, X = 0 a.e. (P_0)} so X_1 ≡ X_2 iff X_1 − X_2 ∈ M. Let L^2 be the set of equivalence classes in V.
(i) Show that L^2 is a real vector space with the obvious definitions of addition and scalar multiplication.
Define (·,·) on L^2 by (y_1, y_2) = E X_1 X_2 where X_i is an element of the equivalence class y_i, i = 1, 2.
(ii) Show that (·,·) is well defined and is an inner product on L^2.
Now, let F_0 be a sub-σ-algebra of F. For y ∈ L^2, let Py denote the equivalence class of the conditional expectation given F_0 of any element in y.
(iii) Show that P is well defined and is a linear transformation on L^2 to L^2.
Let N be the set of equivalence classes of all F_0-measurable functions in V. Clearly, N is a subspace of L^2.
(iv) Show that P^2 = P, P is the identity on N, and R(P) = N. Also show that P is self-adjoint, that is, (y_1, P y_2) = (P y_1, y_2). Would you say that P is the orthogonal projection onto N?
NOTES AND REFERENCES
1. The first half of this chapter follows Halmos (1958) very closely. After
this, the material was selected primarily for its use in later chapters. The
material on outer products and Kronecker products follows the author's tastes more than anything else.
2. The detailed discussion of angles between subspaces resulted from unsuccessful attempts to find a source that meshed with the treatment of canonical correlations given in Chapter 10. A different development can be found in Dempster (1969, Chapter 5).
3. Besides Halmos (1958) and Hoffman and Kunze (1971), I have found the book by Noble and Daniel (1977) useful for standard material on
linear algebra.
4. Rao (1973, Chapter 1) gives many useful linear algebra facts not
discussed here.
CHAPTER 2
Random Vectors
The basic object of study in this book is the random vector and its induced distribution in an inner product space. Here, utilizing the results outlined in Chapter 1, we introduce random vectors, mean vectors, and covariances. Characteristic functions are discussed and used to give the well known factorization criterion for the independence of random vectors. Two special classes of distributions, the orthogonally invariant distributions and the weakly spherical distributions, are used for illustrative purposes. The vector spaces that occur in this chapter are all finite dimensional.
2.1. RANDOM VECTORS
Before a random vector can be defined, it is necessary to first introduce the Borel sets of a finite dimensional inner product space (V, (·,·)). Setting ‖x‖ = (x, x)^{1/2}, the open ball of radius r about x_0 is the set defined by S_r(x_0) = {x | ‖x − x_0‖ < r}.
Definition 2.1. The Borel σ-algebra of (V, (·,·)), denoted by B(V), is the smallest σ-algebra that contains all of the open balls.
Since any two inner products on V are related by a positive definite linear transformation, it follows that B(V) does not depend on the inner product on V; that is, if we start with two inner products on V and use these inner products to generate a Borel σ-algebra, the two σ-algebras are the same. Thus we simply call B(V) the Borel σ-algebra of V without mentioning the inner product.
A probability space is a triple (Ω, F, P_0) where Ω is a set, F is a σ-algebra of subsets of Ω, and P_0 is a probability measure defined on F.
Definition 2.2. A random vector X ∈ V is a function mapping Ω into V such that X^{-1}(B) ∈ F for each Borel set B ∈ B(V). Here, X^{-1}(B) is the inverse image of the set B.
Since the space on which a random vector is defined is usually not of interest here, the argument of a random vector X is ordinarily suppressed. Further, it is the induced distribution of X on V that most interests us. To define this, consider a random vector X defined on Ω to V where (Ω, F, P_0) is a probability space. For each Borel set B ∈ B(V), let Q(B) = P_0(X^{-1}(B)). Clearly, Q is a probability measure on B(V) and Q is called the induced distribution of X; that is, Q is induced by X and P_0. The following result shows that any probability measure Q on B(V) is the induced distribution of some random vector.
Proposition 2.1. Let Q be a probability measure on B(V) where V is a finite dimensional inner product space. Then there exist a probability space (Ω, F, P_0) and a random vector X on Ω to V such that Q is the induced distribution of X.

Proof. Take Ω = V, F = B(V), P_0 = Q, and let X(ω) = ω for ω ∈ V. Clearly, the induced distribution of X is Q. ∎
Henceforth, we write things like: "Let X be a random vector in V with distribution Q," to mean that X is a random vector and its induced distribution is Q. Alternatively, the notation £(X) = Q is also used; this is read: "The distributional law of X is Q."

A function f defined on V to W is called Borel measurable if the inverse image of each set B ∈ B(W) is in B(V). Of course, if X is a random vector in V, then f(X) is a random vector in W when f is Borel measurable. In particular, when f is continuous, f is Borel measurable. If W = R and f is Borel measurable on V to R, then f(X) is a real-valued random variable.

Definition 2.3. Suppose X is a random vector in V with distribution Q and f is a real-valued Borel measurable function defined on V. If ∫_V |f(x)| Q(dx) < +∞, then we say that f(X) has finite expectation and we write E f(X) for ∫_V f(x) Q(dx).
In the above definition and throughout this book, all integrals are Lebesgue integrals, and all functions are assumed Borel measurable.
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
72 RANDOM VECTORS
* Example 2.1. Take V to be the coordinate space Rⁿ with the usual inner product (·,·) and let dx denote standard Lebesgue measure on Rⁿ. If q is a non-negative function on Rⁿ such that ∫q(x) dx = 1, then q is called a density function. It is clear that the measure Q given by Q(B) = ∫_B q(x) dx is a probability measure on Rⁿ, so Q is the distribution of some random vector X on Rⁿ. If ε₁,...,εₙ is the standard basis for Rⁿ, then (εᵢ, X) = Xᵢ is the ith coordinate of X. Assume that Xᵢ has a finite expectation for i = 1,...,n. Then EXᵢ = ∫_{Rⁿ} (εᵢ, x)q(x) dx = μᵢ is called the mean value of Xᵢ and the vector μ ∈ Rⁿ with coordinates μ₁,...,μₙ is the mean vector of X. Notice that for any vector x ∈ Rⁿ, E(x, X) = E(Σxᵢεᵢ, X) = ΣxᵢE(εᵢ, X) = Σxᵢμᵢ = (x, μ). Thus the mean vector μ satisfies the equation E(x, X) = (x, μ) for all x ∈ Rⁿ and μ is clearly unique. It is exactly this property of μ that we use to define the mean vector of a random vector in an arbitrary inner product space V.
Suppose X is a random vector in an inner product space (V, (·,·)) and assume that for each x ∈ V, the random variable (x, X) has a finite expectation. Let f(x) = E(x, X), so f is a real-valued function defined on V. Also, f(α₁x₁ + α₂x₂) = E(α₁x₁ + α₂x₂, X) = E[α₁(x₁, X) + α₂(x₂, X)] = α₁E(x₁, X) + α₂E(x₂, X) = α₁f(x₁) + α₂f(x₂). Thus f is a linear function on V. Therefore, there exists a unique vector μ ∈ V such that f(x) = (x, μ) for all x ∈ V. Summarizing, there exists a unique vector μ ∈ V that satisfies E(x, X) = (x, μ) for all x ∈ V. The vector μ is called the mean vector of X and is denoted by EX. This notation leads to the suggestive equation E(x, X) = (x, EX), which we know is valid in the coordinate case.
Proposition 2.2. Suppose X ∈ (V, (·,·)) and assume X has a mean vector μ. Let (W, [·,·]) be an inner product space and consider A ∈ L(V, W) and w₀ ∈ W. Then the random vector Y = AX + w₀ has the mean vector Aμ + w₀; that is, EY = A(EX) + w₀.
Proof. The proof is a computation. For w ∈ W,

E[w, Y] = E[w, AX + w₀] = E[w, AX] + [w, w₀]
= E(A'w, X) + [w, w₀] = (A'w, μ) + [w, w₀]
= [w, Aμ] + [w, w₀] = [w, Aμ + w₀].

Thus Aμ + w₀ satisfies the defining equation for the mean vector of Y and by the uniqueness of mean vectors, EY = Aμ + w₀. □
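For a distribution with finite support, the defining property of the mean vector and the affine rule of Proposition 2.2 can be verified exactly, since all expectations reduce to weighted sums. Below is a minimal numerical sketch; the support points, probabilities, and the map A are illustrative choices, not from the text.

```python
import numpy as np

# Hypothetical discrete distribution on R^2: three support points
# with the given probabilities (illustrative numbers only).
values = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
probs = np.array([0.5, 0.3, 0.2])

# The mean vector mu is characterized by E(x, X) = (x, mu) for all x.
mu = probs @ values

x = np.array([2.0, -1.0])
assert np.isclose(probs @ (values @ x), x @ mu)   # E(x, X) = (x, mu)

# Proposition 2.2: Y = AX + w0 has mean vector A mu + w0.
A = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]])   # A in L(R^2, R^3)
w0 = np.array([1.0, 1.0, 1.0])
EY = probs @ (values @ A.T + w0)                  # direct expectation of Y
assert np.allclose(EY, A @ mu + w0)
```

The two assertions are exact (up to floating point): the first is the defining equation of μ, the second is the conclusion of Proposition 2.2.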
If X₁ and X₂ are both random vectors in (V, (·,·)) which have mean vectors, then it is easy to show that E(X₁ + X₂) = EX₁ + EX₂. The following proposition shows that the mean vector μ of a random vector does not depend on the inner product on V.
Proposition 2.3. If X is a random vector in (V, (·,·)) with mean vector μ satisfying E(x, X) = (x, μ) for all x ∈ V, then μ satisfies Ef(x, X) = f(x, μ) for every bilinear function f on V × V.

Proof. Every bilinear function f is given by f(x₁, x₂) = (x₁, Ax₂) for some A ∈ L(V, V). Thus Ef(x, X) = E(x, AX) = (x, Aμ) = f(x, μ), where the second equality follows from Proposition 2.2. □
When the bilinear function f is an inner product on V, the above result
establishes that the mean vector is inner product free. At times, a convenient
choice of an inner product can simplify the calculation of a mean vector.
The definition and basic properties of the covariance between two real-valued random variables were covered in Example 1.9. Before defining the covariance of a random vector, a review of covariance matrices for
coordinate random vectors in Rn is in order.
* Example 2.2. In the notation of Example 2.1, consider a random vector X in Rⁿ with coordinates Xᵢ = (εᵢ, X) where ε₁,...,εₙ is the standard basis for Rⁿ and (·,·) is the standard inner product. Assume that EXᵢ² < +∞, i = 1,...,n. Then cov(Xᵢ, Xⱼ) = σᵢⱼ exists for all i, j = 1,...,n. Let Σ be the n × n matrix with elements σᵢⱼ. Of course, σᵢᵢ is the variance of Xᵢ and σᵢⱼ is the covariance between Xᵢ and Xⱼ. The symmetric matrix Σ is called the covariance matrix of X. Consider vectors x, y ∈ Rⁿ with coordinates xᵢ and yᵢ, i = 1,...,n. Then

cov{(x, X), (y, X)} = cov{Σᵢ xᵢXᵢ, Σⱼ yⱼXⱼ}
= ΣᵢΣⱼ xᵢyⱼ cov(Xᵢ, Xⱼ) = ΣᵢΣⱼ xᵢyⱼσᵢⱼ
= (x, Σy).

Hence cov{(x, X), (y, X)} = (x, Σy). It is this property of Σ that is used to define the covariance of a random vector.

With the above example in mind, consider a random vector X in an inner product space (V, (·,·)) and assume that E(x, X)² < ∞ for all x ∈ V. Thus (x, X) has a finite variance and the covariance between (x, X) and (y, X) is well defined for each x, y ∈ V.
Proposition 2.4. For x, y ∈ V, define f(x, y) by

f(x, y) = cov{(x, X), (y, X)}.

Then f is a non-negative definite bilinear function on V × V.

Proof. Clearly, f(x, y) = f(y, x) and f(x, x) = var{(x, X)} ≥ 0, so it remains to show that f is bilinear. Since f is symmetric, it suffices to verify that f(α₁x₁ + α₂x₂, y) = α₁f(x₁, y) + α₂f(x₂, y). This verification goes as follows:

f(α₁x₁ + α₂x₂, y) = cov{(α₁x₁ + α₂x₂, X), (y, X)}
= cov{α₁(x₁, X) + α₂(x₂, X), (y, X)}
= α₁ cov{(x₁, X), (y, X)} + α₂ cov{(x₂, X), (y, X)}
= α₁f(x₁, y) + α₂f(x₂, y).
By Proposition 1.26, there exists a unique non-negative definite linear transformation Σ such that f(x, y) = (x, Σy). □
Definition 2.4. The unique non-negative definite linear transformation Σ on V to V that satisfies

cov{(x, X), (y, X)} = (x, Σy)

is called the covariance of X and is denoted by Cov(X).
Implicit in the above definition is the assumption that E(x, X)² < +∞ for all x ∈ V. Whenever we discuss covariances of random vectors, E(x, X)² is always assumed finite.
It should be emphasized that the covariance of a random vector in
(V, (·,·)) depends on the given inner product. The next result shows how
the covariance changes as a function of the inner product.
Proposition 2.5. Consider a random vector X in (V, (·,·)) and suppose Cov(X) = Σ. Let [·,·] be another inner product on V given by [x, y] = (x, Ay) where A is positive definite on (V, (·,·)). Then the covariance of X in the inner product space (V, [·,·]) is ΣA.
Proof. To verify that ΣA is the covariance for X in (V, [·,·]), we must show that cov{[x, X], [y, X]} = [x, ΣAy] for all x, y ∈ V. To do this, use the definition of [·,·] and compute:

cov{[x, X], [y, X]} = cov{(x, AX), (y, AX)} = cov{(Ax, X), (Ay, X)}
= (Ax, ΣAy) = (x, AΣAy) = [x, ΣAy]. □
Two immediate consequences of Proposition 2.5 are: (i) if Cov(X) exists in one inner product, then it exists in all inner products, and (ii) if Cov(X) = Σ in (V, (·,·)) and if Σ is positive definite, then the covariance of X in the inner product [x, y] = (x, Σ⁻¹y) is the identity linear transformation. The result below often simplifies a computation involving the derivation of a covariance.
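Consequence (ii) is the familiar whitening device. In matrix terms (a sketch with an arbitrary positive definite Σ, not from the text), the covariance computed in the new inner product [x, y] = (x, Ay) is ΣA, and the choice A = Σ⁻¹ yields the identity:

```python
import numpy as np

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative, positive definite
A = np.linalg.inv(Sigma)                     # new inner product [x, y] = (x, A y)

# Proposition 2.5: the covariance in (V, [.,.]) is Sigma A; here it is I.
assert np.allclose(Sigma @ A, np.eye(2))

# Defining equation: cov{[x,X],[y,X]} = (Ax, Sigma A y) = [x, (Sigma A) y],
# an exact matrix identity since A is self-adjoint.
x = np.array([1.0, 2.0]); y = np.array([3.0, -1.0])
assert np.isclose((A @ x) @ Sigma @ (A @ y), x @ A @ (Sigma @ A @ y))
```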
Proposition 2.6. Suppose Cov(X) = Σ in (V, (·,·)). If Σ₁ is a self-adjoint linear transformation on (V, (·,·)) to (V, (·,·)) that satisfies

(2.1) var{(x, X)} = (x, Σ₁x) for x ∈ V,

then Σ₁ = Σ.

Proof. Equation (2.1) implies that (x, Σ₁x) = (x, Σx), x ∈ V. Since Σ₁ and Σ are self-adjoint, Proposition 1.16 yields the conclusion Σ₁ = Σ. □
When Cov(X) = Σ is singular, then the random vector X takes values in the translate of a subspace of (V, (·,·)). To make this precise, let us consider the following.
Proposition 2.7. Let X be a random vector in (V, (·,·)) and suppose Cov(X) = Σ exists. With μ = EX and R(Σ) denoting the range of Σ,

P{X ∈ R(Σ) + μ} = 1.
Proof. The set R(Σ) + μ is the set of vectors of the form x + μ for x ∈ R(Σ); that is, R(Σ) + μ is the translate, by μ, of the subspace R(Σ). The statement P{X ∈ R(Σ) + μ} = 1 is equivalent to the statement P{X − μ ∈ R(Σ)} = 1. The random vector Y = X − μ has mean zero and, by Proposition 2.6, Cov(Y) = Cov(X) = Σ since var{(x, X − μ)} = var{(x, X)} for x ∈ V. Thus it must be shown that P{Y ∈ R(Σ)} = 1. If Σ is nonsingular, then R(Σ) = V and there is nothing to show. Thus assume that the null space of Σ, N(Σ), has dimension k > 0 and let {x₁,...,x_k} be an orthonormal basis for N(Σ). Since R(Σ) and N(Σ) are perpendicular and R(Σ) ⊕ N(Σ) = V, a vector x is not in R(Σ) iff for some index i = 1,...,k, (xᵢ, x) ≠ 0. Thus

P{Y ∉ R(Σ)} = P{(xᵢ, Y) ≠ 0 for some i = 1,...,k} ≤ Σᵢ₌₁ᵏ P{(xᵢ, Y) ≠ 0}.

But (xᵢ, Y) has mean zero and var{(xᵢ, Y)} = (xᵢ, Σxᵢ) = 0 since xᵢ ∈ N(Σ). Thus (xᵢ, Y) is zero with probability one, so P{(xᵢ, Y) ≠ 0} = 0. Therefore P{Y ∉ R(Σ)} = 0. □
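Proposition 2.7 can be seen in a small simulation. With a rank-one covariance Σ = vv' (an illustrative construction, not from the text), every draw of X − μ lies in R(Σ) = span{v}, so its inner product with any vector in the null space N(Σ) vanishes:

```python
import numpy as np

rng = np.random.default_rng(1)
v = np.array([1.0, 2.0, -1.0])
Sigma = np.outer(v, v)               # rank-one covariance; R(Sigma) = span{v}
mu = np.array([1.0, 0.0, 3.0])

# X = mu + Z v with Z standard normal has mean mu and covariance Sigma.
Z = rng.standard_normal(10_000)
X = mu + Z[:, None] * v

# Vectors orthogonal to v span the null space N(Sigma).
x1 = np.array([2.0, -1.0, 0.0])      # (x1, v) = 0
x2 = np.array([1.0, 0.0, 1.0])       # (x2, v) = 0
assert np.allclose((X - mu) @ x1, 0.0)   # (x_i, X - mu) = 0 with probability one
assert np.allclose((X - mu) @ x2, 0.0)
```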
Proposition 2.2 describes how the mean vector changes under linear transformations. The next result shows what happens to the covariance
under linear transformations.
Proposition 2.8. Suppose X is a random vector in (V, (·,·)) with Cov(X) = Σ. If A ∈ L(V, W) where (W, [·,·]) is an inner product space, then

Cov(AX + w₀) = AΣA'

for all w₀ ∈ W.

Proof. By Proposition 2.6, it suffices to show that for each w ∈ W, var[w, AX + w₀] = [w, AΣA'w]. However,

var[w, AX + w₀] = var{[w, AX] + [w, w₀]} = var[w, AX]
= var(A'w, X) = (A'w, ΣA'w) = [w, AΣA'w].

Thus Cov(AX + w₀) = AΣA'. □
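Proposition 2.8 has an exact finite-sample analogue: the empirical covariance of a data matrix transforms by A(·)A' under any affine map, because the empirical distribution is itself a probability distribution. A sketch (all matrices illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5_000, 3))                     # samples in R^3
A = np.array([[1.0, -1.0, 0.0], [2.0, 0.0, 1.0]])       # A in L(R^3, R^2)
w0 = np.array([5.0, -3.0])

S = np.cov(X, rowvar=False)                             # empirical Cov(X)
S_Y = np.cov(X @ A.T + w0, rowvar=False)                # empirical Cov(AX + w0)
assert np.allclose(S_Y, A @ S @ A.T)                    # exact; w0 drops out
```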
2.2. INDEPENDENCE OF RANDOM VECTORS
With the basic properties of mean vectors and covariances established, the
next topic of discussion is characteristic functions and independence of
random vectors. Let X be a random vector in (V, (·,·)) with distribution Q.
Definition 2.5. The complex valued function on V defined by

φ(v) = E exp[i(v, X)] = ∫_V exp[i(v, x)] Q(dx)

is the characteristic function of X.
In the above definition, e^{it} = cos t + i sin t where i = √−1 and t ∈ R. Since e^{it} is a bounded continuous function of t, characteristic functions are well defined for all distributions Q on (V, (·,·)). Forthcoming applications of characteristic functions include the derivation of distributions of certain functions of random vectors and a characterization of the independence of two or more random vectors.
One basic property of characteristic functions is their uniqueness; that is, if Q₁ and Q₂ are probability distributions on (V, (·,·)) with characteristic functions φ₁ and φ₂, and if φ₁(x) = φ₂(x) for all x ∈ V, then Q₁ = Q₂. A proof of this is based on the multidimensional Fourier inversion formula, which can be found in Cramér (1946). A consequence of this uniqueness is that, if X₁ and X₂ are random vectors in (V, (·,·)) such that L((x, X₁)) = L((x, X₂)) for all x ∈ V, then L(X₁) = L(X₂). This follows by observing that L((x, X₁)) = L((x, X₂)) for all x implies the characteristic functions of X₁ and X₂ are the same and hence their distributions are the same.
To define independence, consider a probability space (Ω, F, P₀) and let X ∈ (V, (·,·)) and Y ∈ (W, [·,·]) be two random vectors defined on Ω.
Definition 2.6. The random vectors X and Y are independent if for any Borel sets B₁ ∈ B(V) and B₂ ∈ B(W),

P₀{X⁻¹(B₁) ∩ Y⁻¹(B₂)} = P₀{X⁻¹(B₁)}P₀{Y⁻¹(B₂)}.
In order to describe what independence means in terms of the induced distributions of X ∈ (V, (·,·)) and Y ∈ (W, [·,·]), it is necessary to define what is meant by the joint induced distribution of X and Y. The natural vector space in which to have X and Y take values is the direct sum V ⊕ W defined in Chapter 1. For {vᵢ, wᵢ} ∈ V ⊕ W, i = 1, 2, define the inner product ⟨·,·⟩ by

⟨{v₁, w₁}, {v₂, w₂}⟩ = (v₁, v₂) + [w₁, w₂].

That ⟨·,·⟩ is an inner product on V ⊕ W is routine to check. Thus {X, Y} takes values in the inner product space V ⊕ W. However, it must be shown that {X, Y} is a Borel measurable function. Briefly, this argument goes as follows. The space V ⊕ W is a Cartesian product space; that is, V ⊕ W consists of all pairs {v, w} with v ∈ V and w ∈ W. Thus one way to get a σ-algebra on V ⊕ W is to form the product σ-algebra B(V) × B(W), which is the smallest σ-algebra containing all the product Borel sets B₁ × B₂ ⊆ V ⊕ W where B₁ ∈ B(V) and B₂ ∈ B(W). It is not hard to verify that inverse images, under {X, Y}, of sets in B(V) × B(W) are in the σ-algebra F. But the product σ-algebra B(V) × B(W) is just the σ-algebra B(V ⊕ W) defined earlier. Thus {X, Y} ∈ V ⊕ W is a random vector and hence has an
induced distribution Q defined on B(V ⊕ W). In addition, let Q₁ be the induced distribution of X on B(V) and let Q₂ be the induced distribution of Y on B(W). It is clear that Q₁(B₁) = Q(B₁ × W) for B₁ ∈ B(V) and Q₂(B₂) = Q(V × B₂) for B₂ ∈ B(W). Also, the characteristic function of {X, Y} ∈ V ⊕ W is

φ({v, w}) = E exp[i⟨{v, w}, {X, Y}⟩] = E exp(i(v, X) + i[w, Y])

and the marginal characteristic functions of X and Y are

φ₁(v) = E exp[i(v, X)]

and

φ₂(w) = E exp(i[w, Y]).
Proposition 2.9. Given random vectors X ∈ (V, (·,·)) and Y ∈ (W, [·,·]), the following are equivalent:

(i) X and Y are independent.
(ii) Q(B₁ × B₂) = Q₁(B₁)Q₂(B₂) for all B₁ ∈ B(V) and B₂ ∈ B(W).
(iii) φ({v, w}) = φ₁(v)φ₂(w) for all v ∈ V and w ∈ W.
Proof. By definition,

Q(B₁ × B₂) = P₀{{X, Y} ∈ B₁ × B₂} = P₀{X ∈ B₁, Y ∈ B₂}.

The equivalence of (i) and (ii) follows immediately from the above equation. To show (ii) implies (iii), first note that, if f₁ and f₂ are integrable complex valued functions on V and W, then when (ii) holds,

∫_{V⊕W} f₁(v)f₂(w) Q(dv, dw) = ∫∫ f₁(v)f₂(w) Q₁(dv)Q₂(dw)
= ∫ f₁(v) Q₁(dv) ∫ f₂(w) Q₂(dw)

by Fubini's Theorem (see Chung, 1968). Taking f₁(v) = exp[i(v₁, v)] for v₁, v ∈ V, and f₂(w) = exp(i[w₁, w]) for w₁, w ∈ W, we have

φ({v₁, w₁}) = ∫ exp(i(v₁, v) + i[w₁, w]) Q(dv, dw)
= ∫ exp[i(v₁, v)] Q₁(dv) ∫ exp(i[w₁, w]) Q₂(dw) = φ₁(v₁)φ₂(w₁).
Thus (ii) implies (iii). For (iii) implies (ii), note that the product measure Q₁ × Q₂ has characteristic function φ₁φ₂. The uniqueness of characteristic functions then implies that Q = Q₁ × Q₂. □
Of course, all of the discussion above extends to the case of more than two random vectors. For completeness, we briefly describe the situation.
Given a probability space (Ω, F, P₀) and random vectors Xⱼ ∈ (Vⱼ, (·,·)ⱼ), j = 1,...,k, let Qⱼ be the induced distribution of Xⱼ and let φⱼ be the characteristic function of Xⱼ. The random vectors X₁,...,X_k are independent if for all Bⱼ ∈ B(Vⱼ),

P₀{Xⱼ ∈ Bⱼ, j = 1,...,k} = Πⱼ₌₁ᵏ P₀{Xⱼ ∈ Bⱼ}.

To construct one random vector from X₁,...,X_k, consider the direct sum V₁ ⊕ ··· ⊕ V_k with the inner product ⟨·,·⟩ = Σⱼ₌₁ᵏ (·,·)ⱼ. In other words, if {v₁,...,v_k} and {w₁,...,w_k} are elements of V₁ ⊕ ··· ⊕ V_k, then the inner product between these vectors is Σⱼ₌₁ᵏ (vⱼ, wⱼ)ⱼ. An argument analogous to that given earlier shows that {X₁,...,X_k} is a random vector in V₁ ⊕ ··· ⊕ V_k and the Borel σ-algebra of V₁ ⊕ ··· ⊕ V_k is just the product σ-algebra B(V₁) × ··· × B(V_k). If Q denotes the induced distribution of {X₁,...,X_k}, then the independence of X₁,...,X_k is equivalent to the assertion that

Q(B₁ × ··· × B_k) = Πⱼ₌₁ᵏ Qⱼ(Bⱼ)

for all Bⱼ ∈ B(Vⱼ), j = 1,...,k, and this is equivalent to

E exp[i Σⱼ₌₁ᵏ (vⱼ, Xⱼ)ⱼ] = Πⱼ₌₁ᵏ φⱼ(vⱼ).

Of course, when X₁,...,X_k are independent and fⱼ is an integrable real valued function on Vⱼ, j = 1,...,k, then

E Πⱼ₌₁ᵏ fⱼ(Xⱼ) = Πⱼ₌₁ᵏ Efⱼ(Xⱼ).

This equality follows from the fact that

Q(B₁ × ··· × B_k) = Πⱼ₌₁ᵏ Qⱼ(Bⱼ)

and Fubini's Theorem.
* Example 2.3. Consider the coordinate space Rᵖ with the usual inner product and let Q₀ be a fixed distribution on Rᵖ. Suppose X₁,...,Xₙ are independent with each Xᵢ ∈ Rᵖ, i = 1,...,n, and L(Xᵢ) = Q₀. That is, there is a probability space (Ω, F, P₀), each Xᵢ is a random vector on Ω with values in Rᵖ, and for Borel sets,

P₀{Xᵢ ∈ Bᵢ, i = 1,...,n} = Πᵢ₌₁ⁿ Q₀(Bᵢ).

Thus {X₁,...,Xₙ} is a random vector in the direct sum Rᵖ ⊕ ··· ⊕ Rᵖ with n terms in the sum. However, there are a variety of ways to think about the above direct sum. One possibility is to form the coordinate random vector

Y = (X₁', X₂',...,Xₙ')' ∈ Rⁿᵖ

obtained by stacking X₁,...,Xₙ, and simply consider Y as a random vector in Rⁿᵖ with the usual inner product. A disadvantage of this representation is that the independence of X₁,...,Xₙ becomes slightly camouflaged by the notation. An alternative is to form the random matrix X ∈ L_{p,n} whose rows are X₁',...,Xₙ'. Thus X has rows Xᵢ', i = 1,...,n, which are independent and each has distribution Q₀. The inner product on L_{p,n} is just that inherited from the standard inner products on Rⁿ and Rᵖ. Therefore X is a random vector in the inner product space (L_{p,n}, ⟨·,·⟩). In the sequel, we ordinarily represent X₁,...,Xₙ by the random vector X ∈ L_{p,n}. The advantages of this representation are far from clear at this point, but the reader should be convinced by the end of this book that such a choice is not unreasonable. The derivation of the mean and covariance of X ∈ L_{p,n} given in the next section should provide some evidence that the above representation is useful.
2.3. SPECIAL COVARIANCE STRUCTURES
In this section, we derive the covariances of some special random vectors. The orthogonally invariant probability distributions on a vector space are shown to have covariances that are a constant times the identity transformation. In addition, the covariance of the random vector given in Example 2.3 is shown to be a Kronecker product. The final example provides an expression for the covariance of an outer product of a random vector with itself. Suppose (V, (·,·)) is an inner product space and recall that O(V) is the group of orthogonal transformations on V to V.

Definition 2.7. A random vector X in (V, (·,·)) with distribution Q has an orthogonally invariant distribution if L(X) = L(ΓX) for all Γ ∈ O(V), or equivalently if Q(B) = Q(ΓB) for all Borel sets B and Γ ∈ O(V).
Many properties of orthogonally invariant distributions follow from the following proposition.
Proposition 2.10. Let x₀ ∈ V with ||x₀|| = 1. If L(X) = L(ΓX) for Γ ∈ O(V), then for x ∈ V, L((x, X)) = L(||x||(x₀, X)).

Proof. The assertion is that the distribution of the real-valued random variable (x, X) is the same as the distribution of ||x||(x₀, X). Thus knowing the distribution of (x, X) for one particular nonzero x ∈ V gives us the distribution of (x, X) for all x ∈ V. If x = 0, the assertion of the proposition is trivial. For x ≠ 0, choose Γ ∈ O(V) such that Γx₀ = x/||x||. This is possible since x₀ and x/||x|| both have norm 1. Thus

L((x, X)) = L(||x||(Γx₀, X)) = L(||x||(x₀, Γ'X)) = L(||x||(x₀, X)),

where the last equality follows from the assumption that L(X) = L(ΓX) for all Γ ∈ O(V) and the fact that Γ ∈ O(V) implies Γ' ∈ O(V). □
Proposition 2.11. Let x₀ ∈ V with ||x₀|| = 1. Suppose the distribution of X is orthogonally invariant. Then:

(i) φ(x) = E exp[i(x, X)] = φ(||x||x₀).
(ii) If EX exists, then EX = 0.
(iii) If Cov(X) exists, then Cov(X) = σ²I where σ² = var{(x₀, X)}, and I is the identity linear transformation.
Proof. Assertion (i) follows from Proposition 2.10 and

E exp[i(x, X)] = E exp[i||x||(x₀, X)] = E exp[i(||x||x₀, X)] = φ(||x||x₀).

For (ii), let μ = EX. Since L(X) = L(ΓX), μ = EX = E(ΓX) = Γ(EX) = Γμ for all Γ ∈ O(V). The only vector μ that satisfies μ = Γμ for all Γ ∈ O(V) is μ = 0. To prove (iii), we must show that σ²I satisfies the defining equation for Cov(X). But by Proposition 2.10,

var{(x, X)} = var{||x||(x₀, X)} = ||x||² var{(x₀, X)} = σ²(x, x) = (x, σ²Ix),

so Cov(X) = σ²I by Proposition 2.6. □
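The standard normal distribution on Rⁿ is orthogonally invariant (this is verified in Example 2.4 below), so Proposition 2.11 can be checked by simulation. The sketch below verifies EX ≈ 0, Cov(X) ≈ I, and that the variance of (x, X) depends on x only through ||x||; sample sizes and tolerances are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200_000, 3))   # orthogonally invariant on R^3

# (ii) EX = 0 and (iii) Cov(X) = sigma^2 I with sigma^2 = var{(x0, X)} = 1.
assert np.allclose(X.mean(axis=0), 0.0, atol=0.02)
assert np.allclose(np.cov(X, rowvar=False), np.eye(3), atol=0.02)

# Proposition 2.10: (x, X) has the law of ||x||(x0, X), so its variance
# is ||x||^2 sigma^2 for every direction x.
x = np.array([1.0, 2.0, 2.0])           # ||x|| = 3
assert abs(np.var(X @ x) - 9.0) < 0.2
```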
Assertion (i) of Proposition 2.11 shows that the characteristic function φ of an orthogonally invariant distribution satisfies φ(Γx) = φ(x) for all x ∈ V and Γ ∈ O(V). Any function f defined on V and taking values in some set is called orthogonally invariant if f(x) = f(Γx) for all Γ ∈ O(V). A characterization of orthogonally invariant functions is given by the following proposition.
Proposition 2.12. A function f defined on (V, (·,·)) is orthogonally invariant iff f(x) = f(||x||x₀) where x₀ ∈ V, ||x₀|| = 1.

Proof. If f(x) = f(||x||x₀), then f(Γx) = f(||Γx||x₀) = f(||x||x₀) = f(x), so f is orthogonally invariant. Conversely, suppose f is orthogonally invariant and x₀ ∈ V with ||x₀|| = 1. For x = 0, f(0) = f(||x||x₀) since ||x|| = 0. If x ≠ 0, let Γ ∈ O(V) be such that Γx₀ = x/||x||. Then f(x) = f(Γ||x||x₀) = f(||x||x₀). □
If X has an orthogonally invariant distribution in (V, (·,·)) and h is a function on R to R, then

f(x) = Eh((x, X))

clearly satisfies f(Γx) = f(x) for Γ ∈ O(V). Thus f(x) = f(||x||x₀) = Eh(||x||(x₀, X)), so to calculate f(x), one only needs to calculate f(ax₀) for a ∈ [0, ∞). We have more to say about orthogonally invariant distributions in later chapters.
A random vector X ∈ (V, (·,·)) is called orthogonally invariant about x₀ if X − x₀ has an orthogonally invariant distribution. It is not difficult to show, using characteristic functions, that if X is orthogonally invariant about both x₀ and x₁, then x₀ = x₁. Further, if X is orthogonally invariant about x₀ and if EX exists, then E(X − x₀) = 0 by Proposition 2.11. Thus x₀ = EX when EX exists.
It has been shown that if X has an orthogonally invariant distribution and if Cov(X) exists, then Cov(X) = σ²I for some σ² ≥ 0. Of course there are distributions other than orthogonally invariant distributions for which the covariance is a constant times the identity. Such distributions arise in the chapter on linear models.
Definition 2.8. If X ∈ (V, (·,·)) and

Cov(X) = σ²I for some σ² ≥ 0,

X has a weakly spherical distribution.
The justification for the above definition is provided by Proposition 2.13.
Proposition 2.13. Suppose X is a random vector in (V, (·,·)) and Cov(X) exists. The following are equivalent:

(i) Cov(X) = σ²I for some σ² ≥ 0.
(ii) Cov(X) = Cov(ΓX) for all Γ ∈ O(V).
Proof. That (i) implies (ii) follows from Proposition 2.8. To show (ii) implies (i), let Σ = Cov(X). From (ii) and Proposition 2.8, the non-negative definite linear transformation Σ must satisfy Σ = ΓΣΓ' for all Γ ∈ O(V). Thus for all x ∈ V, ||x|| = 1,

(x, Σx) = (x, ΓΣΓ'x) = (Γ'x, ΣΓ'x).

But Γ'x can be any vector in V with length one since Γ' can be any element of O(V). Thus for all x, y with ||x|| = ||y|| = 1,

(x, Σx) = (y, Σy).

From the spectral theorem, write Σ = Σᵢ₌₁ⁿ λᵢ xᵢ□xᵢ and choose x = xⱼ and y = x_k. Then we have

λⱼ = (xⱼ, Σxⱼ) = (x_k, Σx_k) = λ_k

for all j, k. Setting σ² = λ₁,

Σ = Σᵢ₌₁ⁿ σ² xᵢ□xᵢ = σ² Σᵢ₌₁ⁿ xᵢ□xᵢ = σ²I.

That σ² ≥ 0 follows from the positive semidefiniteness of Σ. □
Orthogonally invariant distributions are sometimes called spherical distributions. The term weakly spherical results from weakening the assumption that the entire distribution is orthogonally invariant to the assumption that just the covariance structure is orthogonally invariant (condition (ii) of Proposition 2.13). A slight generalization of Proposition 2.13, given in its algebraic context, is needed for use later in this chapter.
Proposition 2.14. Suppose f is a bilinear function on V × V where (V, (·,·)) is an inner product space. If f[Γx₁, Γx₂] = f[x₁, x₂] for all x₁, x₂ ∈ V and Γ ∈ O(V), then f[x₁, x₂] = c(x₁, x₂) where c is some real constant. If A is a linear transformation on V to V that satisfies Γ'AΓ = A for all Γ ∈ O(V), then A = cI for some real c.

Proof. Every bilinear function on V × V has the form (x₁, Ax₂) for some linear transformation A on V to V. The assertion that f[Γx₁, Γx₂] = f[x₁, x₂] is clearly equivalent to the assertion that Γ'AΓ = A for all Γ ∈ O(V). Thus it suffices to verify the assertion concerning the linear transformation A. Suppose Γ'AΓ = A for all Γ ∈ O(V). Then for x₁, x₂ ∈ V,

(x₁, Ax₂) = (x₁, Γ'AΓx₂) = (Γx₁, AΓx₂).

By Proposition 1.20, there exists a Γ such that

Γ(x₁/||x₁||) = x₂/||x₂|| and Γ(x₂/||x₂||) = x₁/||x₁||

when x₁ and x₂ are not zero. Thus for x₁ and x₂ not zero,

(x₁, Ax₂) = (Γx₁, AΓx₂) = (x₂, Ax₁) = (Ax₁, x₂).

However, this relationship clearly holds if either x₁ or x₂ is zero. Thus for all x₁, x₂ ∈ V, (x₁, Ax₂) = (Ax₁, x₂), so A must be self-adjoint. Now, using the spectral theorem, we can argue as in the proof of Proposition 2.13 to conclude that A = cI for some real number c. □
* Example 2.4. Consider coordinate space Rⁿ with the usual inner product. Let f be a function on [0, ∞) to [0, ∞) so that

∫_{Rⁿ} f(||x||²) dx = 1.

Thus f(||x||²) is a density on Rⁿ. If the coordinate random vector X ∈ Rⁿ has f(||x||²) as its density, then for Γ ∈ Oₙ (the group of n × n orthogonal matrices), the density of ΓX is again f(||x||²). This follows since ||Γ'x|| = ||x|| and the Jacobian of the linear transformation determined by Γ is equal to one. Hence the distribution determined by the density is Oₙ invariant. One particular choice for f is f(u) = (2π)^{−n/2}e^{−u/2} and the density for X is then

f(||x||²) = (2π)^{−n/2} exp[−½Σᵢxᵢ²] = Πᵢ₌₁ⁿ (2π)^{−1/2} exp[−½xᵢ²].

Each of the factors in the above product is a density on R (corresponding to a normal distribution with mean zero and variance one). Therefore, the coordinates of X are independent and each has the same distribution. An example of a distribution on Rⁿ that is weakly spherical, but not spherical, is provided by the density (with respect to Lebesgue measure)

p(x) = 2^{−n} exp[−Σᵢ|xᵢ|]

where x ∈ Rⁿ, x' = (x₁, x₂,...,xₙ). More generally, if the random variables X₁,...,Xₙ are independent with the same distribution on R, and σ² = var(X₁), then the random vector X with coordinates X₁,...,Xₙ is easily shown to satisfy Cov(X) = σ²Iₙ where Iₙ is the n × n identity matrix.
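The Laplace-type density above gives a concrete weakly spherical but non-spherical case. In this simulation sketch, each coordinate is standard Laplace (density (1/2)e^{−|t|}, so variance 2), giving Cov(X) = 2Iₙ; but a rotation changes the fourth moment of a coordinate, so the full distribution is not orthogonally invariant. The orthogonal matrix G used here is an arbitrary illustrative choice, not from the text.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
X = rng.laplace(size=(300_000, n))   # iid coordinates with density (1/2)e^{-|t|}

# Weakly spherical: Cov(X) = sigma^2 I_n with sigma^2 = var(X_1) = 2.
assert np.allclose(np.cov(X, rowvar=False), 2.0 * np.eye(n), atol=0.05)

# Not spherical: for standard Laplace, E X_1^4 = 4! = 24, while the rotated
# coordinate (X_1 + X_2)/sqrt(2) has fourth moment 72/4 = 18.
G = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, -1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, -1.0]]) / np.sqrt(2.0)
assert np.allclose(G @ G.T, np.eye(n))              # G is orthogonal
assert abs(np.mean(X[:, 0] ** 4) - 24.0) < 1.5
assert abs(np.mean((X @ G.T)[:, 0] ** 4) - 18.0) < 1.5
```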
The next topic in this section concerns the covariance between two random vectors. Suppose Xᵢ ∈ (Vᵢ, (·,·)ᵢ) for i = 1, 2 where X₁ and X₂ are defined on the same probability space. Then the random vector {X₁, X₂} takes values in the direct sum V₁ ⊕ V₂. Let [·,·] denote the usual inner product on V₁ ⊕ V₂ inherited from (·,·)ᵢ, i = 1, 2. Assume that Σᵢᵢ = Cov(Xᵢ), i = 1, 2, both exist. Then, let

f(x₁, x₂) = cov{(x₁, X₁)₁, (x₂, X₂)₂}

and note that the Cauchy-Schwarz Inequality (Example 1.9) shows that

f(x₁, x₂)² ≤ (x₁, Σ₁₁x₁)₁(x₂, Σ₂₂x₂)₂.

Further, it is routine to check that f(·,·) is a bilinear function on V₁ × V₂ so there exists a linear transformation Σ₁₂ ∈ L(V₂, V₁) such that

f(x₁, x₂) = (x₁, Σ₁₂x₂)₁.
The next proposition relates Σ₁₁, Σ₁₂, and Σ₂₂ to the covariance of {X₁, X₂} in the vector space (V₁ ⊕ V₂, [·,·]).

Proposition 2.15. Let Σ = Cov{X₁, X₂}. Define a linear transformation A on V₁ ⊕ V₂ to V₁ ⊕ V₂ by

A{x₁, x₂} = {Σ₁₁x₁ + Σ₁₂x₂, Σ₁₂'x₁ + Σ₂₂x₂}

where Σ₁₂' is the adjoint of Σ₁₂. Then A = Σ.
Proof. It is routine to check that

[A{x₁, x₂}, {x₃, x₄}] = [{x₁, x₂}, A{x₃, x₄}]

so A is self-adjoint. To show A = Σ, it is sufficient to verify

[{x₁, x₂}, A{x₁, x₂}] = [{x₁, x₂}, Σ{x₁, x₂}]

by Proposition 1.16. However,

[{x₁, x₂}, Σ{x₁, x₂}] = var[{x₁, x₂}, {X₁, X₂}]
= var{(x₁, X₁)₁ + (x₂, X₂)₂}
= var(x₁, X₁)₁ + var(x₂, X₂)₂ + 2cov{(x₁, X₁)₁, (x₂, X₂)₂}
= (x₁, Σ₁₁x₁)₁ + (x₂, Σ₂₂x₂)₂ + 2(x₁, Σ₁₂x₂)₁
= (x₁, Σ₁₁x₁)₁ + (x₂, Σ₂₂x₂)₂ + (x₁, Σ₁₂x₂)₁ + (Σ₁₂'x₁, x₂)₂
= [{x₁, x₂}, {Σ₁₁x₁ + Σ₁₂x₂, Σ₁₂'x₁ + Σ₂₂x₂}]
= [{x₁, x₂}, A{x₁, x₂}]. □
It is customary to write the linear transformation A in partitioned form as

(Σ₁₁  Σ₁₂ )
(Σ₁₂' Σ₂₂){x₁, x₂} = {Σ₁₁x₁ + Σ₁₂x₂, Σ₁₂'x₁ + Σ₂₂x₂}.

With this notation,

Cov{X₁, X₂} = (Σ₁₁  Σ₁₂ )
              (Σ₁₂' Σ₂₂).
Definition 2.9. The random vectors X₁ and X₂ are uncorrelated if Σ₁₂ = 0.
In the above definition, it is assumed that Cov(Xᵢ) exists for i = 1, 2. It is clear that X₁ and X₂ are uncorrelated iff

cov{(x₁, X₁)₁, (x₂, X₂)₂} = 0 for all xᵢ ∈ Vᵢ, i = 1, 2.

Also, if X₁ and X₂ are uncorrelated in the two given inner products, then they are uncorrelated in all inner products on V₁ and V₂. This follows from the fact that any two inner products are related by a positive definite linear transformation. Given Xᵢ ∈ (Vᵢ, (·,·)ᵢ) for i = 1, 2, suppose

Cov{X₁, X₂} = (Σ₁₁ Σ₁₂)
              (Σ₂₁ Σ₂₂).

We want to show that there is a linear transformation B ∈ L(V₂, V₁) such that X₁ + BX₂ and X₂ are uncorrelated random vectors. However, before this can be established, some preliminary technical results are needed.
Consider an inner product space (V, (·,·)) and suppose A ∈ L(V, V) is self-adjoint of rank k. Then, by the spectral theorem, A = Σᵢ₌₁ᵏ λᵢ xᵢ□xᵢ where λᵢ ≠ 0, i = 1,...,k, and {x₁,...,x_k} is an orthonormal set that is a basis for R(A). The linear transformation

A⁻ = Σᵢ₌₁ᵏ (1/λᵢ) xᵢ□xᵢ

is called the generalized inverse of A. If A is nonsingular, then it is clear that A⁻ is the inverse of A. Also, A⁻ is self-adjoint and AA⁻ = A⁻A = Σᵢ₌₁ᵏ xᵢ□xᵢ, which is just the orthogonal projection onto R(A). A routine computation shows that A⁻AA⁻ = A⁻ and AA⁻A = A.
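The generalized inverse can be computed directly from an eigendecomposition: invert only the nonzero eigenvalues. The sketch below (with an illustrative rank-2 matrix) checks the identities stated above, including agreement with the Moore-Penrose pseudoinverse, which coincides with A⁻ when A is self-adjoint.

```python
import numpy as np

def generalized_inverse(A, tol=1e-10):
    """A^- = sum over nonzero eigenvalues of (1/lambda_i) x_i x_i'."""
    lam, U = np.linalg.eigh(A)        # spectral decomposition of self-adjoint A
    inv = np.zeros_like(lam)
    nz = np.abs(lam) > tol
    inv[nz] = 1.0 / lam[nz]
    return (U * inv) @ U.T

# Illustrative rank-2 self-adjoint A on R^3.
B = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]])
A = B @ B.T                           # non-negative definite, rank 2
Am = generalized_inverse(A)

assert np.allclose(A @ Am @ A, A)          # A A^- A = A
assert np.allclose(Am @ A @ Am, Am)        # A^- A A^- = A^-
assert np.allclose(A @ Am, Am @ A)         # both equal the projection onto R(A)
assert np.allclose(Am, np.linalg.pinv(A))  # matches the Moore-Penrose inverse
```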
In the notation established previously (see Proposition 2.15), suppose {X₁, X₂} ∈ V₁ ⊕ V₂ has a covariance

Σ = Cov{X₁, X₂} = (Σ₁₁ Σ₁₂)
                  (Σ₂₁ Σ₂₂).

Proposition 2.16. For the covariance above, N(Σ₂₂) ⊆ N(Σ₁₂) and Σ₁₂ = Σ₁₂Σ₂₂⁻Σ₂₂.
Proof. For x₂ ∈ N(Σ₂₂), it must be shown that Σ₁₂x₂ = 0. Consider x₁ ∈ V₁ and a ∈ R. Then Σ₂₂(ax₂) = 0 and since Σ is positive semidefinite,

0 ≤ [{x₁, ax₂}, Σ{x₁, ax₂}] = [{x₁, ax₂}, {Σ₁₁x₁ + aΣ₁₂x₂, Σ₂₁x₁ + aΣ₂₂x₂}]
= (x₁, Σ₁₁x₁)₁ + a(x₁, Σ₁₂x₂)₁ + a(x₂, Σ₂₁x₁)₂
= (x₁, Σ₁₁x₁)₁ + 2a(x₁, Σ₁₂x₂)₁.

As this inequality holds for all a ∈ R, for each x₁ ∈ V₁, (x₁, Σ₁₂x₂)₁ = 0. Hence Σ₁₂x₂ = 0 and the first claim is proved. To verify that Σ₁₂ = Σ₁₂Σ₂₂⁻Σ₂₂, it suffices to establish the identity Σ₁₂(I − Σ₂₂⁻Σ₂₂) = 0. However, I − Σ₂₂⁻Σ₂₂ is the orthogonal projection onto N(Σ₂₂). Since N(Σ₂₂) ⊆ N(Σ₁₂), it follows that Σ₁₂(I − Σ₂₂⁻Σ₂₂) = 0. □
We are now in a position to show that X₁ − Σ₁₂Σ₂₂⁻X₂ and X₂ are uncorrelated.
Proposition 2.17. Suppose {X₁, X₂} ∈ V₁ ⊕ V₂ has a covariance

Σ = Cov{X₁, X₂} = (Σ₁₁ Σ₁₂)
                  (Σ₂₁ Σ₂₂).

Then X₁ − Σ₁₂Σ₂₂⁻X₂ and X₂ are uncorrelated, and Cov(X₁ − Σ₁₂Σ₂₂⁻X₂) = Σ₁₁ − Σ₁₂Σ₂₂⁻Σ₂₁ where Σ₂₁ = Σ₁₂'.

Proof. For xᵢ ∈ Vᵢ, i = 1, 2, it must be verified that

cov{(x₁, X₁ − Σ₁₂Σ₂₂⁻X₂)₁, (x₂, X₂)₂} = 0.

This calculation goes as follows:

cov{(x₁, X₁ − Σ₁₂Σ₂₂⁻X₂)₁, (x₂, X₂)₂}
= cov{(x₁, X₁)₁, (x₂, X₂)₂} − cov{((Σ₁₂Σ₂₂⁻)'x₁, X₂)₂, (x₂, X₂)₂}
= (x₁, Σ₁₂x₂)₁ − ((Σ₁₂Σ₂₂⁻)'x₁, Σ₂₂x₂)₂
= (x₁, Σ₁₂x₂)₁ − (x₁, Σ₁₂Σ₂₂⁻Σ₂₂x₂)₁
= (x₁, (Σ₁₂ − Σ₁₂Σ₂₂⁻Σ₂₂)x₂)₁ = 0.
The last equality follows from Proposition 2.16, since Σ12 = Σ12Σ22⁻Σ22. To verify the second assertion, we need to establish the identity

    var(x1, X1 − Σ12Σ22⁻X2)1 = (x1, (Σ11 − Σ12Σ22⁻Σ21)x1)1.

But

    var(x1, X1 − Σ12Σ22⁻X2)1
      = var(x1, X1)1 + var(Σ22⁻Σ21x1, X2)2 − 2cov{(x1, X1)1, (Σ22⁻Σ21x1, X2)2}
      = (x1, Σ11x1)1 + (x1, Σ12Σ22⁻Σ22Σ22⁻Σ21x1)1 − 2(x1, Σ12Σ22⁻Σ21x1)1
      = (x1, (Σ11 − Σ12Σ22⁻Σ21)x1)1.

In the above, the identity Σ12Σ22⁻Σ22Σ22⁻Σ21 = Σ12Σ22⁻Σ21, a consequence of Σ12Σ22⁻Σ22 = Σ12, has been used.  □
We now return to the situation considered in Example 2.4. Consider independent coordinate random vectors X1,..., Xn with each Xi ∈ Rp, and suppose that E Xi = μ ∈ Rp and Cov(Xi) = Σ for i = 1,..., n. Form the random matrix X ∈ L_{p,n} with rows X1',..., Xn'. Our purpose is to describe the mean vector and covariance of X in terms of Σ and μ. The inner product ⟨·,·⟩ on L_{p,n} is that inherited from the standard inner products on the coordinate spaces Rp and Rn. Recall that, for matrices A, B ∈ L_{p,n},

    ⟨A, B⟩ = tr AB' = tr B'A = tr A'B = tr BA'.

Let e denote the vector in Rn whose coordinates are all equal to 1.
Proposition 2.18. In the above notation,

    (i) E X = eμ';
    (ii) Cov(X) = In ⊗ Σ.

Here In is the n × n identity matrix and ⊗ denotes the Kronecker product.
Proof. The matrix eμ' has each row equal to μ' and, since each row of X has mean μ', the first assertion is fairly obvious. To verify (i) formally, it must be shown that, for A ∈ L_{p,n},

    E⟨A, X⟩ = ⟨A, eμ'⟩.
Let a1',..., an', ai ∈ Rp, be the rows of A. Then

    E⟨A, X⟩ = E tr AX' = E Σi ai'Xi = Σi ai'E Xi = Σi ai'μ = tr Aμe' = ⟨A, eμ'⟩.

Thus (i) holds. To verify (ii) it suffices to establish the identity

    var⟨A, X⟩ = ⟨A, (In ⊗ Σ)A⟩

for A ∈ L_{p,n}. In the notation above,

    var⟨A, X⟩ = var(Σi ai'Xi) = Σi var(ai'Xi) + ΣΣ_{i≠j} cov(ai'Xi, aj'Xj)
              = Σi ai'Σai = tr AΣA' = ⟨A, (In ⊗ Σ)A⟩.

The third equality follows from var(ai'Xi) = ai'Σai and, for i ≠ j, ai'Xi and aj'Xj are uncorrelated.  □
The assumption of the independence of X1,..., Xn was not used to its full extent in the proof of Proposition 2.18. In fact, the above proof shows that, if X1,..., Xn are random vectors in Rp with E Xi = μ, i = 1,..., n, then E X = eμ'. Further, if X1,..., Xn in Rp are uncorrelated with Cov(Xi) = Σ, i = 1,..., n, then Cov(X) = In ⊗ Σ. One application of this formula for Cov(X) describes how Cov(X) transforms under Kronecker products. For example, if A ∈ L_{n,n} and B ∈ L_{p,p}, then (A ⊗ B)X = AXB' is a random vector in L_{p,n}. Proposition 2.8 shows that

    Cov((A ⊗ B)X) = (A ⊗ B)Cov(X)(A ⊗ B)'.

In particular, if Cov(X) = In ⊗ Σ, then

    Cov((A ⊗ B)X) = (A ⊗ B)(In ⊗ Σ)(A ⊗ B)' = (AA') ⊗ (BΣB').
Since A ⊗ B = (A ⊗ Ip)(In ⊗ B), the interpretation of the above covariance formula reduces to an interpretation for A ⊗ Ip and In ⊗ B. First, (In ⊗ B)X is a random matrix with rows Xi'B' = (BXi)', i = 1,..., n. If Cov(Xi) = Σ, then Cov(BXi) = BΣB'. Thus it is clear from Proposition 2.18 that Cov((In ⊗ B)X) = In ⊗ (BΣB'). Second, (A ⊗ Ip) applied to X is the same as applying the linear transformation A to each column of X. When Cov(X) = In ⊗ Σ, the rows of X are uncorrelated and, if A is an n × n orthogonal matrix, then

    Cov((A ⊗ Ip)X) = In ⊗ Σ = Cov(X).
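A hedged numerical illustration of these transformation rules (NumPy assumed; not part of the text): the Kronecker factorization of the transformed covariance, and the invariance of In ⊗ Σ under an orthogonal transformation of the rows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 3, 2
S = rng.standard_normal((p, p)); S = S @ S.T
A = rng.standard_normal((n, n))
B = rng.standard_normal((p, p))

# (A (x) B)(I_n (x) Sigma)(A (x) B)' = (AA') (x) (B Sigma B')
lhs = np.kron(A, B) @ np.kron(np.eye(n), S) @ np.kron(A, B).T
rhs = np.kron(A @ A.T, B @ S @ B.T)
assert np.allclose(lhs, rhs)

# For an orthogonal A (here from a QR decomposition) and B = I_p,
# Cov((A (x) I_p)X) = I_n (x) Sigma is unchanged.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
cov = np.kron(Q, np.eye(p)) @ np.kron(np.eye(n), S) @ np.kron(Q, np.eye(p)).T
assert np.allclose(cov, np.kron(np.eye(n), S))
```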
Thus the absence of correlation between the rows is preserved by an orthogonal transformation of the columns of X. A converse to the observation that Cov((A ⊗ Ip)X) = In ⊗ Σ for all A ∈ O(n) is valid for random linear transformations. To be more precise, we have the following proposition.
Proposition 2.19. Suppose (Vi, (·,·)i), i = 1, 2, are inner product spaces and X is a random vector in (L(V1, V2), ⟨·,·⟩). The following are equivalent:

    (i) Cov(X) = I2 ⊗ Σ;
    (ii) Cov((Γ ⊗ I1)X) = Cov(X) for all Γ ∈ O(V2).

Here, Ii is the identity linear transformation on Vi, i = 1, 2, and Σ is a non-negative definite linear transformation on V1 to V1.
Proof. Let Φ = Cov(X), so Φ is a positive semidefinite linear transformation on L(V1, V2) to L(V1, V2) and Φ is characterized by the equation

    cov{⟨A, X⟩, ⟨B, X⟩} = ⟨A, ΦB⟩

for all A, B ∈ L(V1, V2). If (i) holds, then we have

    Cov((Γ ⊗ I1)X) = (Γ ⊗ I1)Cov(X)(Γ ⊗ I1)'
                   = (Γ ⊗ I1)(I2 ⊗ Σ)(Γ' ⊗ I1) = (ΓI2Γ') ⊗ (I1ΣI1)
                   = I2 ⊗ Σ = Cov(X),
so (ii) holds. Now, assume (ii) holds. Since outer products span L(V1, V2), it is sufficient to show that there exists a positive semidefinite Σ on V1 to V1 such that, for x1, x2 ∈ V1 and y1, y2 ∈ V2,

    ⟨y1 □ x1, Φ(y2 □ x2)⟩ = ⟨y1 □ x1, (I2 ⊗ Σ)(y2 □ x2)⟩.

Define H by

    H(x1, x2, y1, y2) = cov{⟨y1 □ x1, X⟩, ⟨y2 □ x2, X⟩}

for x1, x2 ∈ V1 and y1, y2 ∈ V2. From assumption (ii), we know that Φ
satisfies Φ = (Γ ⊗ I1)Φ(Γ ⊗ I1)' for all Γ ∈ O(V2). Thus

    H(x1, x2, y1, y2) = ⟨y1 □ x1, Φ(y2 □ x2)⟩
      = ⟨y1 □ x1, (Γ ⊗ I1)Φ(Γ ⊗ I1)'(y2 □ x2)⟩
      = ⟨(Γ ⊗ I1)'(y1 □ x1), Φ(Γ ⊗ I1)'(y2 □ x2)⟩
      = ⟨(Γ'y1) □ x1, Φ((Γ'y2) □ x2)⟩ = H(x1, x2, Γ'y1, Γ'y2)

for all Γ ∈ O(V2). It is clear that H is a linear function of each of its four arguments when the other three are held fixed. Therefore, for x1 and x2 fixed, H is a bilinear function on V2 × V2 and this bilinear function satisfies the assumption of Proposition 2.14. Thus there is a constant, which depends on x1 and x2, say c[x1, x2], such that

    H(x1, x2, y1, y2) = c[x1, x2](y1, y2)2.

However, for y1 = y2 ≠ 0, H, as a function of x1 and x2, is bilinear and non-negative definite on V1 × V1. In other words, c[x1, x2] is a non-negative definite bilinear function on V1 × V1, so

    c[x1, x2] = (x1, Σx2)1

for some non-negative definite Σ. Thus

    H(x1, x2, y1, y2) = (x1, Σx2)1(y1, y2)2 = ⟨y1 □ x1, (I2 ⊗ Σ)(y2 □ x2)⟩,

so Φ = I2 ⊗ Σ.  □
The next topic of consideration in this section concerns the calculation of means and covariances for outer products of random vectors. These results are used throughout the sequel to simplify proofs and provide convenient formulas. Suppose Xi is a random vector in (Vi, (·,·)i) for i = 1, 2, and let μi = E Xi and Σii = Cov(Xi) for i = 1, 2. Thus {X1, X2} takes values in V1 ⊕ V2 and

    Cov{X1, X2} = | Σ11  Σ12 |
                  | Σ21  Σ22 |,

where Σ12 is characterized by

    cov{(x1, X1)1, (x2, X2)2} = (x1, Σ12x2)1
for xi ∈ Vi, i = 1, 2. Of course, Cov{X1, X2} is expressed relative to the natural inner product on V1 ⊕ V2 inherited from (V1, (·,·)1) and (V2, (·,·)2).
Proposition 2.20. For Xi ∈ (Vi, (·,·)i), i = 1, 2, as above,

    E X1 □ X2 = Σ12 + μ1 □ μ2.
Proof. The random vector X1 □ X2 takes values in the inner product space (L(V2, V1), ⟨·,·⟩). To verify the above formula, it must be shown that

    E⟨A, X1 □ X2⟩ = ⟨A, Σ12⟩ + ⟨A, μ1 □ μ2⟩

for A ∈ L(V2, V1). However, it is sufficient to verify this equation for A = x1 □ x2, since both sides of the equation are linear in A and every A is a linear combination of elements of L(V2, V1) of the form x1 □ x2, xi ∈ Vi, i = 1, 2. For x1 □ x2 ∈ L(V2, V1),

    E⟨x1 □ x2, X1 □ X2⟩ = E(x1, X1)1(x2, X2)2
      = cov{(x1, X1)1, (x2, X2)2} + E(x1, X1)1 E(x2, X2)2
      = (x1, Σ12x2)1 + (x1, μ1)1(x2, μ2)2
      = ⟨x1 □ x2, Σ12⟩ + ⟨x1 □ x2, μ1 □ μ2⟩.  □
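In coordinates, Proposition 2.20 is the familiar identity E X1X2' = Σ12 + μ1μ2'. The sketch below verifies it exactly for a small discrete joint distribution (NumPy assumed; the support points and probabilities are invented purely for illustration).

```python
import numpy as np

# A small discrete joint distribution for (X1, X2) in R^2 x R^2.
vals1 = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 1.0]])   # possible X1 values
vals2 = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])   # matched X2 values
prob = np.array([0.5, 0.3, 0.2])        # P(X1 = vals1[i], X2 = vals2[i])

mu1 = prob @ vals1
mu2 = prob @ vals2
# E X1 X2'  (the outer product X1 [] X2 in coordinates)
E_outer = sum(p * np.outer(x1, x2) for p, x1, x2 in zip(prob, vals1, vals2))
# Sigma_12 = E (X1 - mu1)(X2 - mu2)'
S12 = sum(p * np.outer(x1 - mu1, x2 - mu2) for p, x1, x2 in zip(prob, vals1, vals2))

# Proposition 2.20: E X1 [] X2 = Sigma_12 + mu1 [] mu2.
assert np.allclose(E_outer, S12 + np.outer(mu1, mu2))
```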
A couple of interesting applications of Proposition 2.20 are given in the following proposition.
Proposition 2.21. For X1, X2 in (V, (·,·)), let μi = E Xi and Σii = Cov(Xi) for i = 1, 2. Also, let Σ12 be the unique linear transformation satisfying

    cov{(x1, X1), (x2, X2)} = (x1, Σ12x2)

for all x1, x2 ∈ V. Then:

    (i) E X1 □ X1 = Σ11 + μ1 □ μ1;
    (ii) E(X1, X2) = ⟨I, Σ12⟩ + (μ1, μ2);
    (iii) E(X1, X1) = ⟨I, Σ11⟩ + (μ1, μ1).

Here I ∈ L(V, V) is the identity linear transformation and ⟨·,·⟩ is the inner product on L(V, V) inherited from (V, (·,·)).
Proof. For (i), take X1 = X2 and (V1, (·,·)1) = (V2, (·,·)2) = (V, (·,·)) in Proposition 2.20. To verify (ii), first note that

    E X1 □ X2 = Σ12 + μ1 □ μ2

by the previous proposition. Thus, for I ∈ L(V, V),

    E⟨I, X1 □ X2⟩ = ⟨I, Σ12⟩ + ⟨I, μ1 □ μ2⟩.

However, ⟨I, X1 □ X2⟩ = (X1, X2) and ⟨I, μ1 □ μ2⟩ = (μ1, μ2), so (ii) holds. Assertion (iii) follows from (ii) by taking X1 = X2.  □
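Assertion (ii) reduces in coordinates to E X1'X2 = tr Σ12 + μ1'μ2, which can be checked exactly on a finite distribution. The sketch below assumes NumPy and uses invented illustrative values.

```python
import numpy as np

# A discrete joint distribution for X1, X2 both in R^3.
vals1 = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]])
vals2 = np.array([[0.5, 1.0, 0.0], [2.0, 0.0, 1.0]])
prob = np.array([0.4, 0.6])

mu1, mu2 = prob @ vals1, prob @ vals2
S12 = sum(p * np.outer(x1 - mu1, x2 - mu2) for p, x1, x2 in zip(prob, vals1, vals2))

# (ii) of Proposition 2.21: E (X1, X2) = <I, Sigma_12> + (mu1, mu2),
# where <I, Sigma_12> = tr Sigma_12.
E_ip = sum(p * (x1 @ x2) for p, x1, x2 in zip(prob, vals1, vals2))
assert np.isclose(E_ip, np.trace(S12) + mu1 @ mu2)
```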
One application of the preceding result concerns the affine prediction of one random vector by another random vector. By an affine function on a vector space V to W, we mean a function f given by f(v) = Av + w0, where A ∈ L(V, W) and w0 is a fixed vector in W. The term linear transformation is reserved for those affine functions that map zero into zero. In the notation of Proposition 2.21, consider Xi ∈ (Vi, (·,·)i) for i = 1, 2, let μi = E Xi, i = 1, 2, and suppose

    Σ = Cov{X1, X2} = | Σ11  Σ12 |
                      | Σ21  Σ22 |

exists. An affine predictor of X2 based on X1 is any function of the form AX1 + x0, where A ∈ L(V1, V2) and x0 is a fixed vector in V2. If we assume that μ1, μ2, and Σ are known, then A and x0 are allowed to depend on these known quantities. The statistical interpretation is that we observe X1, but not X2, and X2 is to be predicted by AX1 + x0. One intuitively reasonable criterion for selecting A and x0 is to ask that the choice of A and x0 minimize

    E‖X2 − (AX1 + x0)‖².

Here, the expectation is over the joint distribution of X1 and X2, and ‖·‖ is the norm in the vector space (V2, (·,·)2). The quantity E‖X2 − (AX1 + x0)‖² is the average squared distance of X2 − (AX1 + x0) from 0. Since AX1 + x0 is supposed to predict X2, it is reasonable that A and x0 be chosen to minimize this average squared distance. A solution to this minimization problem is given in Proposition 2.22.
Proposition 2.22. For X1 and X2 as above,

    E‖X2 − (AX1 + x0)‖² ≥ ⟨I2, Σ22 − Σ21Σ11⁻Σ12⟩

with equality for A = Σ21Σ11⁻ and x0 = μ2 − Σ21Σ11⁻μ1.
Proof. The proof is a calculation. It essentially consists of completing the square and applying (ii) of Proposition 2.21. Let Yi = Xi − μi for i = 1, 2. Then

    E‖X2 − (AX1 + x0)‖² = E‖Y2 − AY1 + μ2 − Aμ1 − x0‖²
      = E‖Y2 − AY1‖² + ‖μ2 − Aμ1 − x0‖² + 2E(Y2 − AY1, μ2 − Aμ1 − x0)2
      = E‖Y2 − AY1‖² + ‖μ2 − Aμ1 − x0‖².

The last equality holds since E(Y2 − AY1) = 0. Thus, for each A ∈ L(V1, V2),

    E‖X2 − (AX1 + x0)‖² ≥ E‖Y2 − AY1‖²

with equality for x0 = μ2 − Aμ1. For notational convenience, let Σ21 = Σ12'. Then

    E‖Y2 − AY1‖² = E‖Y2 − Σ21Σ11⁻Y1 + (Σ21Σ11⁻ − A)Y1‖²
      = E‖Y2 − Σ21Σ11⁻Y1‖² + E‖(Σ21Σ11⁻ − A)Y1‖²
        + 2E(Y2 − Σ21Σ11⁻Y1, (Σ21Σ11⁻ − A)Y1)2
      = E‖Y2 − Σ21Σ11⁻Y1‖² + E‖(Σ21Σ11⁻ − A)Y1‖²
      ≥ E‖Y2 − Σ21Σ11⁻Y1‖².

The third equality holds since E(Y2 − Σ21Σ11⁻Y1) = 0 and Y2 − Σ21Σ11⁻Y1 is uncorrelated with Y1 (Proposition 2.17) and hence is uncorrelated with (Σ21Σ11⁻ − A)Y1. By (ii) of Proposition 2.21, we then see that E(Y2 − Σ21Σ11⁻Y1, (Σ21Σ11⁻ − A)Y1)2 = 0. Therefore, for each A ∈ L(V1, V2),

    E‖Y2 − AY1‖² ≥ E‖Y2 − Σ21Σ11⁻Y1‖²

with equality for A = Σ21Σ11⁻. However, Cov(Y2 − Σ21Σ11⁻Y1) = Σ22 − Σ21Σ11⁻Σ12 and E(Y2 − Σ21Σ11⁻Y1) = 0, so (iii) of Proposition 2.21 shows that

    E‖Y2 − Σ21Σ11⁻Y1‖² = ⟨I2, Σ22 − Σ21Σ11⁻Σ12⟩.

Therefore,

    E‖X2 − (AX1 + x0)‖² ≥ ⟨I2, Σ22 − Σ21Σ11⁻Σ12⟩

with equality for A = Σ21Σ11⁻ and x0 = μ2 − Σ21Σ11⁻μ1.  □
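For any Y1, Y2 with mean zero and the given covariances, the objective E‖Y2 − AY1‖² equals tr(Σ22 − AΣ12 − Σ21A' + AΣ11A'), which depends only on Σ. The sketch below (NumPy assumed; Σ11 is nonsingular so Σ11⁻ is the ordinary inverse) evaluates this objective and checks that A = Σ21Σ11⁻¹ attains the stated minimum.

```python
import numpy as np

rng = np.random.default_rng(2)
q = 2                                       # dim V1 = dim V2 = 2
L = rng.standard_normal((2 * q, 2 * q))
S = L @ L.T                                 # joint covariance of (X1, X2)
S11, S12 = S[:q, :q], S[:q, q:]
S21, S22 = S[q:, :q], S[q:, q:]

def mse(A):
    # E||Y2 - A Y1||^2 = tr(S22 - A S12 - S21 A' + A S11 A'), Yi = Xi - mu_i
    return np.trace(S22 - A @ S12 - S21 @ A.T + A @ S11 @ A.T)

A_star = S21 @ np.linalg.inv(S11)
# Proposition 2.22: the minimum value is <I2, S22 - S21 S11^- S12> ...
assert np.isclose(mse(A_star), np.trace(S22 - S21 @ np.linalg.inv(S11) @ S12))
# ... and perturbing A away from A_star never does better.
for _ in range(5):
    assert mse(A_star) <= mse(A_star + 0.1 * rng.standard_normal((q, q))) + 1e-12
```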
The last topic in this section concerns the covariance of X □ X when X is a random vector in (V, (·,·)). The random vector X □ X is an element of the vector space (L(V, V), ⟨·,·⟩). However, X □ X is a self-adjoint linear transformation, so X □ X is also a random vector in (Ms, ⟨·,·⟩), where Ms is the linear subspace of self-adjoint transformations in L(V, V). In what follows, we regard X □ X as a random vector in (Ms, ⟨·,·⟩). Thus the covariance of X □ X is a positive semidefinite linear transformation on (Ms, ⟨·,·⟩). In general, this covariance is quite complicated, and we make some simplifying assumptions concerning the distribution of X.
Proposition 2.23. Suppose X has an orthogonally invariant distribution in (V, (·,·)) with E‖X‖⁴ < +∞. Let v1 and v2 be fixed vectors in V with ‖vi‖ = 1, i = 1, 2, and (v1, v2) = 0. Set c1 = var{(v1, X)²} and c2 = cov{(v1, X)², (v2, X)²}. Then

    Cov(X □ X) = (c1 − c2)I ⊗ I + c2T1,

where T1 is the linear transformation on Ms given by T1(A) = ⟨I, A⟩I. In other words, for A, B ∈ Ms,

    cov{⟨A, X □ X⟩, ⟨B, X □ X⟩} = ⟨A, ((c1 − c2)I ⊗ I + c2T1)B⟩
                                = (c1 − c2)⟨A, B⟩ + c2⟨I, A⟩⟨I, B⟩.
Proof. Since (c1 − c2)I ⊗ I + c2T1 is self-adjoint on (Ms, ⟨·,·⟩), Proposition 2.6 shows that it suffices to verify the equation

    var⟨A, X □ X⟩ = (c1 − c2)⟨A, A⟩ + c2⟨I, A⟩²

for A ∈ Ms in order to prove that

    Cov(X □ X) = (c1 − c2)I ⊗ I + c2T1.

First note that, for x ∈ V,

    var⟨x □ x, X □ X⟩ = var{(x, X)²} = ‖x‖⁴var{(x/‖x‖, X)²} = ‖x‖⁴var{(v1, X)²}.

This last equality follows from Proposition 2.10 as the distribution of X is
orthogonally invariant. Also, for x1, x2 ∈ V with (x1, x2) = 0,

    cov{(x1, X)², (x2, X)²} = ‖x1‖²‖x2‖²cov{(x1/‖x1‖, X)², (x2/‖x2‖, X)²}
                            = ‖x1‖²‖x2‖²cov{(v1, X)², (v2, X)²}.

Again, the last equality follows since L(X) = L(ΓX) for Γ ∈ O(V), so

    cov{(x1/‖x1‖, X)², (x2/‖x2‖, X)²} = cov{(Γ'(x1/‖x1‖), X)², (Γ'(x2/‖x2‖), X)²},

and Γ can be chosen so that Γ'(xi/‖xi‖) = vi, i = 1, 2. For A ∈ Ms, apply the spectral theorem and write A = Σi ai xi □ xi, where x1,..., xn is an orthonormal basis for (V, (·,·)). Then

    var⟨A, X □ X⟩ = var(Σi ai⟨xi □ xi, X □ X⟩)
      = Σi ai²var⟨xi □ xi, X □ X⟩
        + ΣΣ_{i≠j} aiaj cov{⟨xi □ xi, X □ X⟩, ⟨xj □ xj, X □ X⟩}
      = Σi ai²var{(xi, X)²} + ΣΣ_{i≠j} aiaj cov{(xi, X)², (xj, X)²}
      = c1Σi ai² + c2ΣΣ_{i≠j} aiaj = (c1 − c2)Σi ai² + c2(Σi ai)²
      = (c1 − c2)⟨A, A⟩ + c2⟨I, A⟩².  □
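For a concrete orthogonally invariant distribution, take X uniform on the unit circle in R²; there c1 = 1/8 and c2 = −1/8 follow from trigonometric moments. The sketch below (NumPy assumed; expectations are taken over a fine uniform grid in θ, which averages low-degree trigonometric polynomials essentially exactly) checks the covariance formula for two particular A, B ∈ Ms.

```python
import numpy as np

# X uniform on the unit circle in R^2, an orthogonally invariant distribution.
theta = np.linspace(0.0, 2 * np.pi, 20001)[:-1]    # uniform grid, endpoint dropped
X = np.stack([np.cos(theta), np.sin(theta)])       # 2 x N

def E(f):                                          # expectation over the grid
    return f.mean()

c1 = E(X[0] ** 4) - E(X[0] ** 2) ** 2                        # var{(v1, X)^2}
c2 = E(X[0] ** 2 * X[1] ** 2) - E(X[0] ** 2) * E(X[1] ** 2)  # cov of squares
assert np.isclose(c1, 1 / 8) and np.isclose(c2, -1 / 8)

# Check cov{<A, X[]X>, <B, X[]X>} = (c1 - c2)<A, B> + c2 <I, A><I, B>.
A = np.array([[1.0, 0.5], [0.5, -2.0]])
B = np.array([[0.0, 1.0], [1.0, 3.0]])
fA = np.einsum('ij,in,jn->n', A, X, X)   # <A, X [] X> = X'AX along the grid
fB = np.einsum('ij,in,jn->n', B, X, X)
lhs = E(fA * fB) - E(fA) * E(fB)
rhs = (c1 - c2) * np.trace(A @ B.T) + c2 * np.trace(A) * np.trace(B)
assert np.isclose(lhs, rhs)
```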
When X has an orthogonally invariant normal distribution, the constant c2 = 0, so Cov(X □ X) = c1 I ⊗ I. The following result provides a slight generalization of Proposition 2.23.
Proposition 2.24. Let X, v1, and v2 be as in Proposition 2.23. For C ∈ L(V, V), let Σ = CC' and suppose Y is a random vector in (V, (·,·)) with
L(Y) = L(CX). Then

    Cov(Y □ Y) = (c1 − c2)Σ ⊗ Σ + c2T2,

where T2(A) = ⟨A, Σ⟩Σ for A ∈ Ms.
Proof. We apply Proposition 2.8 and the calculational rules for Kronecker products. Since (CX) □ (CX) = (C ⊗ C)(X □ X),

    Cov(Y □ Y) = Cov((CX) □ (CX)) = Cov((C ⊗ C)(X □ X))
      = (C ⊗ C)Cov(X □ X)(C ⊗ C)'
      = (C ⊗ C)((c1 − c2)I ⊗ I + c2T1)(C' ⊗ C')
      = (c1 − c2)(C ⊗ C)(I ⊗ I)(C' ⊗ C') + c2(C ⊗ C)T1(C' ⊗ C')
      = (c1 − c2)Σ ⊗ Σ + c2(C ⊗ C)T1(C' ⊗ C').

It remains to show that (C ⊗ C)T1(C' ⊗ C') = T2. For A ∈ Ms,

    (C ⊗ C)T1(C' ⊗ C')(A) = (C ⊗ C)(⟨I, (C' ⊗ C')A⟩I)
      = ⟨(C ⊗ C)I, A⟩(C ⊗ C)(I) = ⟨CC', A⟩CC'
      = ⟨Σ, A⟩Σ = T2(A).  □
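The operator identity (C ⊗ C)T1(C' ⊗ C') = T2 can also be checked as a matrix identity once transformations on L(V, V) are represented in a vec basis. In the sketch below (NumPy assumed; row-major vec, under which A ↦ BAD' has matrix kron(B, D)), T1 and T2 become rank-one matrices built from vec(I) and vec(Σ).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
C = rng.standard_normal((n, n))
Sigma = C @ C.T

# With row-major vec, T1(A) = <I, A> I has matrix vec(I)vec(I)',
# and T2(A) = <A, Sigma> Sigma has matrix vec(Sigma)vec(Sigma)'.
vI = np.eye(n).reshape(-1)
vS = Sigma.reshape(-1)
T1 = np.outer(vI, vI)
T2 = np.outer(vS, vS)

K = np.kron(C, C)                 # the matrix of A -> C A C'
assert np.allclose(K @ vI, vS)    # (C (x) C)I = CC' = Sigma
assert np.allclose(K @ T1 @ K.T, T2)   # (C (x) C) T1 (C' (x) C') = T2
```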
PROBLEMS
1. If x1,..., xn is a basis for (V, (·,·)) and if (xi, X) has finite expectation for i = 1,..., n, show that (x, X) has finite expectation for all x ∈ V. Also, show that if (xi, X)² has finite expectation for i = 1,..., n, then Cov(X) exists.
2. Verify the claim that if X1 (X2) with values in V1 (V2) are uncorrelated for one pair of inner products on V1 and V2, then they are uncorrelated no matter what the inner products are on V1 and V2.
3. Suppose Xi ∈ Vi, i = 1, 2, are uncorrelated. If fi is a linear function on Vi, i = 1, 2, show that

    (2.2) cov{f1(X1), f2(X2)} = 0.

Conversely, if (2.2) holds for all linear functions f1 and f2, then X1 and X2 are uncorrelated (assuming the relevant expectations exist).
4. For X ∈ Rn, partition X as X = {X1, X2} with X1 ∈ Rk and X2 ∈ Rn−k, and suppose X has an orthogonally invariant distribution. Show that X1 has an orthogonally invariant distribution on Rk. Argue that the conditional distribution of X1 given X2 has an orthogonally invariant distribution.
5. Suppose X1,..., Xk in (V, (·,·)) are pairwise uncorrelated. Prove that

    Cov(Σ1k Xi) = Σ1k Cov(Xi).
6. In Rk, let ε1,..., εk denote the standard basis vectors. Define a random vector U in Rk by specifying that U takes on the value εi with probability pi, where 0 < pi < 1 and Σ1k pi = 1. (U represents one of k mutually exclusive and exhaustive events that can occur.) Let p ∈ Rk have coordinates p1,..., pk. Show that E U = p and Cov(U) = Dp − pp', where Dp is a diagonal matrix with diagonal entries p1,..., pk. When 0 < pi < 1, show that Cov(U) has rank k − 1 and identify the null space of Cov(U). Now, let X1,..., Xn be i.i.d., each with the distribution of U. The random vector Y = Σ1n Xi has a multinomial distribution (prove this) with parameters k (the number of cells), the vector of probabilities p, and the number of trials n. Show that E Y = np and Cov(Y) = n(Dp − pp').
7. Fix a vector x in Rn and let π denote a permutation of 1, 2,..., n (there are n! such permutations). Define the permuted vector πx to be the vector whose ith coordinate is x(π⁻¹(i)), where x(j) denotes the jth coordinate of x. (This choice is justified in Chapter 7.) Let X be a random vector such that Pr{X = πx} = 1/n! for each possible permutation π. Find E X and Cov(X).
8. Consider a random vector X ∈ Rn and suppose that L(X) = L(DX) for each diagonal matrix D with diagonal elements dii = ±1, i = 1,..., n. If E‖X‖² < +∞, show that E X = 0 and Cov(X) is a diagonal matrix (the coordinates of X are uncorrelated).
9. Given X ∈ (V, (·,·)) with Cov(X) = Σ, let Ai be a linear transformation on (V, (·,·)) to (Wi, [·,·]i), i = 1, 2. Form Y = {A1X, A2X} with values in the direct sum W1 ⊕ W2. Show that

    Cov(Y) = | A1ΣA1'  A1ΣA2' |
             | A2ΣA1'  A2ΣA2' |

in W1 ⊕ W2 with its usual inner product.
10. For X in (V, (·,·)) with μ = E X and Σ = Cov(X), show that E(X, AX) = ⟨A, Σ⟩ + (μ, Aμ) for any A ∈ L(V, V).
11. In (L_{p,n}, ⟨·,·⟩), suppose the n × p random matrix X has the covariance In ⊗ Σ for some p × p positive semidefinite Σ. Show that the rows of X are uncorrelated. If μ = E X and A is an n × n matrix, show that E X'AX = (tr A)Σ + μ'Aμ.
12. The usual inner product on the space of p × p symmetric matrices, denoted by Sp, is ⟨·,·⟩ given by ⟨A, B⟩ = tr AB'. (This is the natural inner product inherited from (L_{p,p}, ⟨·,·⟩) by regarding Sp as a subspace of L_{p,p}.) Let S be a random matrix with values in Sp, and suppose that L(ΓSΓ') = L(S) for all Γ ∈ O_p. (For example, if X ∈ Rp has an orthogonally invariant distribution and S = XX', then L(ΓSΓ') = L(S).) Show that E S = cIp where c is a constant.
13. Given a random vector X in (L(V, W), ⟨·,·⟩), suppose that L(X) = L((Γ ⊗ ψ)X) for all Γ ∈ O(W) and ψ ∈ O(V).

    (i) If X has a covariance, show that E X = 0 and Cov(X) = c IW ⊗ IV, where c ≥ 0.
    (ii) If Y ∈ L(V, W) has a density (with respect to Lebesgue measure) given by f(y) = p(⟨y, y⟩), y ∈ L(V, W), show that L(Y) = L((Γ ⊗ ψ)Y) for Γ ∈ O(W) and ψ ∈ O(V).
14. Let X1,..., Xn be uncorrelated random vectors in Rp with Cov(Xi) = Σ, i = 1,..., n. Form the n × p random matrix X with rows X1',..., Xn' and values in (L_{p,n}, ⟨·,·⟩). Thus Cov(X) = In ⊗ Σ.

    (i) Form X in the coordinate space Rnp with the coordinate inner product, where the coordinates of X1,..., Xn are stacked in order:

        X = | X1 |
            | .. |
            | Xn |.

    In the space Rnp, show that Cov(X) is block diagonal with each of the n diagonal blocks equal to Σ, where each block is p × p.
    (ii) Now, form X in the space Rnp where

        X = | Z1 |
            | .. |
            | Zp |

    and Zi ∈ Rn has coordinates X1i,..., Xni for i = 1,..., p. Show that

        Cov(X) = | σ11 In  σ12 In  ...  σ1p In |
                 | σ21 In  σ22 In  ...  σ2p In |
                 | ...                         |
                 | σp1 In  σp2 In  ...  σpp In |,

    where each block is n × n and Σ = (σij).
15. The unit sphere in Rn is the set S = {x | x ∈ Rn, ‖x‖ = 1}. A random vector X with values in S has a uniform distribution on S if L(X) = L(ΓX) for all Γ ∈ O_n. (There is one and only one uniform distribution on S; this is discussed in detail in Chapters 6 and 7.)

    (i) Show that E X = 0 and Cov(X) = (1/n)In.
    (ii) Let X1 be the first coordinate of X and let X2 ∈ Rn−1 be the remaining n − 1 coordinates. What is the best affine predictor of X1 based on X2? How would you predict X1 on the basis of X2?
16. Show that the linear transformation T2 in Proposition 2.24 is Σ □ Σ, where □ denotes the outer product in the vector space (Ms, ⟨·,·⟩). Here, ⟨·,·⟩ is the natural inner product on L(V, V).
17. Suppose X ∈ R² has coordinates X1 and X2 that are independent with a standard normal distribution. Let S = XX' and denote the elements of S by s11, s22, and s12 = s21.

    (i) What is the covariance matrix of

        | s11 |
        | s12 |  ∈ R³?
        | s22 |

    (ii) Regard S as a random vector in (S2, ⟨·,·⟩) (see Problem 12). What is Cov(S) in the space (S2, ⟨·,·⟩)?
    (iii) How do you reconcile your answers to (i) and (ii)?
NOTES AND REFERENCES
1. In the first two sections of this chapter, we have simply translated well-known coordinate space results into their inner product space versions. The coordinate space results can be found in Billingsley (1979). The inner product space versions were used by Kruskal (1961) in his work on missing and extra values in analysis of variance problems.
2. In the third section, topics with a multivariate flavor emerge. The reader may find it helpful to formulate coordinate versions of each proposition. If nothing else, this exercise will soon explain my acquired preference for vector space, as opposed to coordinate, methods and notation.
3. Proposition 2.14 is a special case of Schur's lemma, a basic result in group representation theory. The book by Serre (1977) is an excellent place to begin a study of group representations.
CHAPTER 3
The Normal Distribution on a Vector Space
The univariate normal distribution occupies a central position in the statistical theory of analyzing random samples consisting of one-dimensional observations. This situation is even more pronounced in multivariate analysis due to the paucity of analytically tractable multivariate distributions, one notable exception being the multivariate normal distribution. Ordinarily, the nonsingular multivariate normal distribution is defined on Rn by specifying the density function of the distribution with respect to Lebesgue measure. For our purposes, this procedure poses some problems. First, it is desirable to have a definition that does not require the covariance to be nonsingular. In addition, we have not, as yet, constructed what will be called Lebesgue measure on a finite dimensional inner product space. The definition of the multivariate normal distribution we have chosen circumvents the above technical difficulties by specifying the distribution of each linear function of the random vector. Of course, this necessitates a proof that such normal distributions exist.

After defining the normal distribution on a finite dimensional vector space and establishing some basic properties of the normal distribution, we derive the distribution of a quadratic form in a normal random vector. Conditions for the independence of two quadratic forms are then presented, followed by a discussion of conditional distributions for normal random vectors. The chapter ends with a derivation of Lebesgue measure on a finite dimensional vector space and of the density function of a nonsingular normal distribution on a vector space.
3.1. THE NORMAL DISTRIBUTION
Recall that a random variable Z0 ∈ R has a normal distribution with mean zero and variance one if the density function of Z0 is

    p(z) = (2π)^(−1/2) exp[−z²/2], z ∈ R,
with respect to Lebesgue measure. We write L(Z0) = N(0, 1) when Z0 has density p. More generally, a random variable Z ∈ R has a normal distribution with mean μ ∈ R and variance σ² ≥ 0 if L(Z) = L(σZ0 + μ) where L(Z0) = N(0, 1). In this case, we write L(Z) = N(μ, σ²). When σ² = 0, the distribution N(μ, σ²) is to be interpreted as the distribution degenerate at μ. If L(Z) = N(μ, σ²), then the characteristic function of Z is easily shown to be

    φ(t) = exp[itμ − σ²t²/2], t ∈ R.
The phrase "Z has a normal distribution" means that for some μ and some σ² ≥ 0, L(Z) = N(μ, σ²). If Z1,..., Zk are independent with L(Zj) = N(μj, σj²), then L(Σj ajZj) = N(Σj ajμj, Σj aj²σj²). To see this, consider the characteristic function

    E exp[itΣj ajZj] = E Π_{j=1}^k exp[itajZj] = Π_{j=1}^k E exp[itajZj]
                     = Π_{j=1}^k exp[itajμj − t²aj²σj²/2]
                     = exp[it(Σj ajμj) − t²(Σj aj²σj²)/2].
Thus the characteristic function of Σj ajZj is that of a normal distribution with mean Σj ajμj and variance Σj aj²σj². In summary, linear combinations of independent normal random variables are normal. We are now in a position to define the normal distribution on a finite dimensional inner product space (V, (·,·)).
Definition 3.1. A random vector X ∈ V has a normal distribution if, for each x ∈ V, the random variable (x, X) has a normal distribution on R.
To show that a normal distribution exists on (V, (·,·)), let {x1,..., xn} be an orthonormal basis for (V, (·,·)). Also, let Z1,..., Zn be independent
N(0, 1) random variables. Then X = Σi Zixi is a random vector and (x, X) = Σi (x, xi)Zi, which is a linear combination of independent normals. Thus (x, X) has a normal distribution for each x ∈ V. Since E(x, X) = Σi (xi, x)E Zi = 0, the mean vector of X is 0 ∈ V. Also,

    var(x, X) = var(Σi (x, xi)Zi) = Σi (x, xi)²var(Zi) = Σi (x, xi)² = (x, x).

Therefore, Cov(X) = I ∈ L(V, V). The particular normal distribution we have constructed on (V, (·,·)) has mean zero and covariance equal to the identity linear transformation.
identity linear transformation. Now, we want to describe all the normal distributions on (V, (*, )). The
first result in this direction shows that linear transformations of normal random vectors are again normal random vectors.
Proposition 3.1. Suppose X has a normal distribution on (V, (·,·)) and let A ∈ L(V, W), w0 ∈ W. Then AX + w0 has a normal distribution on (W, [·,·]).

Proof. It must be shown that, for each w ∈ W, [w, AX + w0] has a normal distribution on R. But [w, AX + w0] = [w, AX] + [w, w0] = (A'w, X) + [w, w0]. By assumption, (A'w, X) is normal. Since [w, w0] is a constant, (A'w, X) + [w, w0] is normal.  □
If X has a normal distribution on (V, (·,·)) with mean zero and covariance I, consider A ∈ L(V, V) and μ ∈ V. Then AX + μ has a normal distribution on (V, (·,·)), and we know E(AX + μ) = A(E X) + μ = μ and Cov(AX + μ) = A Cov(X)A' = AA'. However, every positive semidefinite linear transformation Σ can be expressed as AA' (take A to be the positive semidefinite square root of Σ). Thus, given μ ∈ V and a positive semidefinite Σ, there is a random vector that has a normal distribution on V with mean vector μ and covariance Σ. If X has such a distribution, we write L(X) = N(μ, Σ). To show that all the normal distributions on V have been described, suppose X ∈ V has a normal distribution. Since (x, X) is normal on R, var(x, X) exists for each x ∈ V. Thus μ = E X and Σ = Cov(X) both exist, and L(X) = N(μ, Σ). Also, L((x, X)) = N((x, μ), (x, Σx)) for x ∈ V. Hence the characteristic function of (x, X) is

    φ(t) = E exp[it(x, X)] = exp[it(x, μ) − t²(x, Σx)/2].

Setting t = 1, we obtain the characteristic function of X:

    φ(x) = E exp[i(x, X)] = exp[i(x, μ) − (x, Σx)/2].

Summarizing this discussion yields the following.
Proposition 3.2. Given μ ∈ V and a positive semidefinite Σ ∈ L(V, V), there exists a random vector X ∈ V with distribution N(μ, Σ) and characteristic function

    φ(x) = exp[i(x, μ) − (x, Σx)/2].

Conversely, if X has a normal distribution on V, then with μ = E X and Σ = Cov(X), L(X) = N(μ, Σ) and the characteristic function of X is given by φ.
Consider random vectors Xi with values in (Vi, (·,·)i) for i = 1, 2. Then {X1, X2} is a random vector in the direct sum V1 ⊕ V2. The inner product on V1 ⊕ V2 is [·,·], where

    [{v1, v2}, {v3, v4}] = (v1, v3)1 + (v2, v4)2,

v1, v3 ∈ V1 and v2, v4 ∈ V2. If Cov(Xi) = Σii, i = 1, 2, exists, then E{X1, X2} = {μ1, μ2}, where μi = E Xi, i = 1, 2. Also,

    Cov{X1, X2} = | Σ11  Σ12 |
                  | Σ21  Σ22 |

as defined in Chapter 2, and Σ21 = Σ12'.
Proposition 3.3. If {X1, X2} has a normal distribution on V1 ⊕ V2, then X1 and X2 are independent iff Σ12 = 0.
Proof. If X1 and X2 are independent, then clearly Σ12 = 0. Conversely, if Σ12 = 0, the characteristic function of {X1, X2} is

    E exp(i[{v1, v2}, {X1, X2}])
      = exp(i[{v1, v2}, {μ1, μ2}] − ½[{v1, v2}, Σ{v1, v2}])
      = exp(i(v1, μ1)1 + i(v2, μ2)2 − ½(v1, Σ11v1)1 − ½(v2, Σ22v2)2)
      = exp(i(v1, μ1)1 − ½(v1, Σ11v1)1) × exp(i(v2, μ2)2 − ½(v2, Σ22v2)2)
since Σ12 = Σ21' = 0. However, for v1 ∈ V1, (v1, X1)1 = [{v1, 0}, {X1, X2}], which has a normal distribution for all v1 ∈ V1. Thus L(X1) = N(μ1, Σ11) on V1 and, similarly, L(X2) = N(μ2, Σ22) on V2. The characteristic function of {X1, X2} is just the product of the characteristic functions of X1 and X2. Thus independence follows and the proof is complete.  □
The result of Proposition 3.3 is often paraphrased as "for normal random
vectors, X1 and X2 are independent iff they are uncorrelated." A useful consequence of Proposition 3.3 is shown in Proposition 3.4.
Proposition 3.4. Suppose L(X) = N(μ, Σ) on (V, (·,·)), and consider A ∈ L(V, W1) and B ∈ L(V, W2), where (W1, [·,·]1) and (W2, [·,·]2) are inner product spaces. Then AX and BX are independent iff AΣB' = 0.
Proof. We apply the previous proposition to X1 = AX and X2 = BX. That {X1, X2} has a normal distribution on W1 ⊕ W2 follows from

    [w1, X1]1 + [w2, X2]2 = (A'w1, X) + (B'w2, X) = (A'w1 + B'w2, X)

and the normality of (x, X) for all x ∈ V. However,

    cov{[w1, X1]1, [w2, X2]2} = cov{(A'w1, X), (B'w2, X)}
                              = (A'w1, ΣB'w2) = [w1, AΣB'w2]1.

Thus X1 = AX and X2 = BX are uncorrelated iff AΣB' = 0. Since {X1, X2} has a normal distribution, the condition AΣB' = 0 is equivalent to the independence of X1 and X2.  □
One special case of Proposition 3.4 is worthy of mention. If L(X) = N(μ, I) on (V, (·,·)) and P is an orthogonal projection in L(V, V), then PX and (I − P)X are independent since P(I − P) = 0. Also, it should be mentioned that the result of Proposition 3.3 extends to the case of k random vectors; that is, if {X1, X2,..., Xk} has a normal distribution on the direct sum space V1 ⊕ V2 ⊕ ··· ⊕ Vk, then X1, X2,..., Xk are independent iff Xi and Xj are uncorrelated for all i ≠ j. The proof of this is essentially the same as that given for the case of k = 2 and is left to the reader.
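A small numerical illustration of the projection special case (NumPy assumed; not part of the text): with Σ = I the cross covariance PΣ(I − P)' = P(I − P) vanishes, while for a general Σ it need not.

```python
import numpy as np

# P the orthogonal projection onto span{v} in R^3.
v = np.array([1.0, 2.0, 2.0])
P = np.outer(v, v) / (v @ v)
I = np.eye(3)

# With Sigma = I: cross covariance of PX and (I - P)X is P(I - P) = 0,
# so for normal X the two are independent.
assert np.allclose(P @ (I - P), 0)

# For a general Sigma the product P Sigma (I - P)' need not vanish.
Sigma = np.diag([1.0, 2.0, 3.0])
assert not np.allclose(P @ Sigma @ (I - P).T, 0)
```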
A particularly useful result for the multivariate normal distribution is the
following.
Proposition 3.5. Suppose L(X) = N(μ, Σ) on the n-dimensional vector space (V, (·,·)). Write Σ = Σi λi xi □ xi in spectral form, and let Xi = (xi, X), i = 1,..., n. Then X1,..., Xn are independent random variables that have normal distributions on R with E Xi = (xi, μ) and var(Xi) = λi, i = 1,..., n. In particular, if Σ = I, then for any orthonormal basis {x1,..., xn} for V, the random variables Xi = (xi, X) are independent and normal with E Xi = (xi, μ) and var(Xi) = 1.
Proof. For any scalars a1, ..., an in R, ∑_{i=1}^n aiXi = ∑_{i=1}^n ai(xi, X) = (∑_{i=1}^n aixi, X), which has a normal distribution. Thus the random vector X ∈ Rⁿ with coordinates X1, ..., Xn has a normal distribution in the coordinate vector space Rⁿ. Thus X1, ..., Xn are independent iff they are uncorrelated. However,

cov{Xj, Xk} = cov{(xj, X), (xk, X)} = (xj, Σxk) = (xj, (∑_{i=1}^n λi xi□xi)xk) = λj δjk.

Thus independence follows. It is clear that each Xi is normal with 𝓔Xi = (xi, μ) and var(Xi) = λi, i = 1, ..., n. When Σ = I, then ∑_{i=1}^n xi□xi = I for any orthonormal basis x1, ..., xn. This completes the proof. □
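The covariance computation in the proof is a matrix identity that can be verified directly. Below is a minimal Python sketch (our own construction, not the book's): a covariance is built in spectral form and the matrix of (xj, Σxk) values is checked to be diagonal with entries λj.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
lam = rng.uniform(0.5, 2.0, size=n)               # eigenvalues lambda_i
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # columns x_i: an orthonormal basis
Sigma = (Q * lam) @ Q.T                           # Sigma = sum_i lambda_i x_i (x_i)'
# entry (j, k) of C is (x_j, Sigma x_k); it should equal lambda_j delta_jk,
# so the coordinates X_i = (x_i, X) are uncorrelated with variances lambda_i
C = Q.T @ Sigma @ Q
```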
The following is a technical discussion having to do with representations of the normal distribution that are useful when establishing properties of the normal distribution. It seems preferable to dispose of the issues here rather than repeat the same argument in a variety of contexts later. Suppose X ∈ (V, (·,·)) has a normal distribution, say ℒ(X) = N(μ, Σ), and let Q be the probability distribution of X on (V, (·,·)). If we are interested in the distribution of some function of X, say f(X) ∈ (W, [·,·]), then the underlying space on which X is defined is irrelevant since the distribution Q determines the distribution of f(X); that is, if B ∈ 𝔅(W), then

P{f(X) ∈ B} = P{X ∈ f⁻¹(B)} = Q(f⁻¹(B)).

Therefore, if Y is another random vector in (V, (·,·)) with ℒ(X) = ℒ(Y), then f(X) and f(Y) have the same distribution. At times, it is convenient to represent ℒ(X) by ℒ(CZ + μ) where ℒ(Z) = N(0, I) and CC' = Σ. Thus
ℒ(X) = ℒ(CZ + μ), so f(X) and f(CZ + μ) have the same distribution. A slightly more subtle point arises when we discuss the independence of two functions of X, say f1(X) and f2(X), taking values in (W1, [·,·]1) and (W2, [·,·]2). To show that the independence of f1(X) and f2(X) depends only on Q, consider Bi ∈ 𝔅(Wi) for i = 1, 2. Then independence is equivalent to

P{f1(X) ∈ B1, f2(X) ∈ B2} = P{f1(X) ∈ B1}P{f2(X) ∈ B2}.

But both of these probabilities can be calculated from Q:

P{f1(X) ∈ B1, f2(X) ∈ B2} = P{X ∈ f1⁻¹(B1) ∩ f2⁻¹(B2)} = Q(f1⁻¹(B1) ∩ f2⁻¹(B2))

and

P{fi(X) ∈ Bi} = Q(fi⁻¹(Bi)), i = 1, 2.

Again, if ℒ(Y) = ℒ(X), then f1(X) and f2(X) are independent iff f1(Y) and f2(Y) are independent. More generally, if we are trying to prove something about the random vector X, ℒ(X) = N(μ, Σ), and if what we are trying to prove depends only on the distribution Q of X, then we can represent X by any other random vector Y as long as ℒ(Y) = ℒ(X). In particular, we can take Y = CZ + μ where ℒ(Z) = N(0, I) and CC' = Σ. This representation of X is often used in what follows.
3.2. QUADRATIC FORMS
The problem in this section is to derive, or at least describe, the distribution of (X, AX) where X ∈ (V, (·,·)), A is self-adjoint in ℒ(V, V), and ℒ(X) = N(μ, Σ). First, consider the special case of Σ = I and, by the spectral theorem, write A = ∑_{i=1}^n λi xi□xi. Thus

(X, AX) = (X, (∑_{i=1}^n λi xi□xi)X) = ∑_{i=1}^n λi(xi, X)².

But Xi = (xi, X), i = 1, ..., n, are independent since Σ = I (Proposition 3.5) and ℒ(Xi) = N((xi, μ), 1). Thus our first task is to derive the distribution of Xi² when ℒ(Xi) = N((xi, μ), 1).
Recall that a random variable Z has a chi-square distribution with m degrees of freedom, written ℒ(Z) = χ²_m, if Z has a density on (0, ∞) given
by

p_m(z) = z^(m/2−1) exp[−z/2] / (Γ(m/2) 2^(m/2)),  z > 0.

Here m is a positive integer and Γ(·) is the gamma function. The characteristic function of a χ²_m random variable is easily shown to be

𝓔 exp[itZ] = (1 − 2it)^(−m/2),  t ∈ R¹.

Thus, if ℒ(Z1) = χ²_m, ℒ(Z2) = χ²_n, and Z1 and Z2 are independent, then

𝓔 exp[it(Z1 + Z2)] = 𝓔 exp[itZ1] 𝓔 exp[itZ2] = (1 − 2it)^(−m/2)(1 − 2it)^(−n/2) = (1 − 2it)^(−(m+n)/2).

Therefore, ℒ(Z1 + Z2) = χ²_{m+n}. This argument clearly extends to more than two summands. In particular, if ℒ(Z) = χ²_m, then, for independent random variables Z1, ..., Zm with ℒ(Zi) = χ²_1, ℒ(∑_{i=1}^m Zi) = ℒ(Z). It is not difficult to show that if ℒ(X) = N(0, 1) on R, then ℒ(X²) = χ²_1. However, if ℒ(X) = N(a, 1) on R, the distribution of X² is a bit harder to derive. To this end, we make the following definition.
Definition 3.2. Let p_m, m = 1, 2, ..., be the density of a χ²_m random variable and, for λ > 0, let

qj = exp[−λ/2] (λ/2)^j / j!,  j = 0, 1, ....

For λ = 0, set q0 = 1 and qj = 0 for j > 0. A random variable with density

h(z) = ∑_{j=0}^∞ qj p_{m+2j}(z),  z > 0,

is said to have a noncentral chi-square distribution with m degrees of freedom and noncentrality parameter λ. If Z has such a distribution, we write ℒ(Z) = χ²_m(λ).

When λ = 0, it is clear that ℒ(χ²_m(0)) = χ²_m. The weights qj, j = 0, 1, ..., are Poisson probabilities with parameter λ/2 (the reason for the 2 becomes clear in a bit). The characteristic function of a χ²_m(λ) random variable is
calculated as follows:

𝓔 exp[it χ²_m(λ)] = ∑_{j=0}^∞ qj ∫₀^∞ exp(itx) p_{m+2j}(x) dx
= ∑_{j=0}^∞ qj (1 − 2it)^(−(m/2+j))
= (1 − 2it)^(−m/2) ∑_{j=0}^∞ qj (1 − 2it)^(−j)
= (1 − 2it)^(−m/2) exp(−λ/2) ∑_{j=0}^∞ (1/j!)((λ/2)(1 − 2it)^(−1))^j
= (1 − 2it)^(−m/2) exp[−λ/2 + (λ/2)(1 − 2it)^(−1)]
= (1 − 2it)^(−m/2) exp[(λ/2)(2it/(1 − 2it))].

From this expression for the characteristic function, it follows that if ℒ(Zi) = χ²_{mi}(λi), i = 1, 2, with Z1 and Z2 independent, then ℒ(Z1 + Z2) = χ²_{m1+m2}(λ1 + λ2). This result clearly extends to the sum of k independent noncentral chi-square variables. The reason for introducing the noncentral chi-square distribution is provided in the next result.
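The Poisson-mixture definition of the noncentral chi-square density can be checked numerically. The Python sketch below (ours, not the book's; the truncation at 200 mixture terms is an arbitrary but ample choice) compares ∑ qj p_{m+2j} against SciPy's noncentral chi-square density.

```python
import numpy as np
from scipy import stats

m, lam = 3, 2.5
z = np.linspace(0.1, 20.0, 50)
# Poisson(lam/2)-weighted mixture of central chi-square densities p_{m+2j}
mix = sum(stats.poisson.pmf(j, lam / 2) * stats.chi2.pdf(z, m + 2 * j)
          for j in range(200))
ref = stats.ncx2.pdf(z, m, lam)   # SciPy's noncentral chi-square density
```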
Proposition 3.6. Suppose ℒ(X) = N(a, 1) on R. Then ℒ(X²) = χ²_1(a²).
Proof. The proof consists of calculating the characteristic function of X². A justification of the change of variable in the calculation below can be given using contour integration. The characteristic function of X² is

𝓔 exp(itX²) = (2π)^(−1/2) ∫ exp[itx² − ½(x − a)²] dx
= (2π)^(−1/2) ∫ exp[−½(1 − 2it)x² + ax − ½a²] dx
= (1 − 2it)^(−1/2) (2π)^(−1/2) ∫ exp[−½w² + a(1 − 2it)^(−1/2)w − ½a²] dw
= (1 − 2it)^(−1/2) (2π)^(−1/2) ∫ exp[−½(w − a(1 − 2it)^(−1/2))² + a²it/(1 − 2it)] dw
= (1 − 2it)^(−1/2) exp[a²(it/(1 − 2it))].

By the uniqueness of characteristic functions, ℒ(X²) = χ²_1(a²). □
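A quick sanity check on Proposition 3.6 (a sketch of ours using SciPy): if X ~ N(a, 1), then 𝓔X² = 1 + a² and var(X²) = 2 + 4a², which must agree with the mean and variance of χ²_1(a²).

```python
from scipy import stats

a = 1.7
# moments of chi^2_1(a^2) versus the direct moments of X^2 for X ~ N(a, 1)
mean_err = abs(stats.ncx2.mean(1, a**2) - (1 + a**2))
var_err = abs(stats.ncx2.var(1, a**2) - (2 + 4 * a**2))
```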
Proposition 3.7. Suppose the random vector X in (V, (·,·)) has a N(μ, I) distribution. If A ∈ ℒ(V, V) is an orthogonal projection of rank k, then ℒ((X, AX)) = χ²_k((μ, Aμ)).

Proof. Let x1, ..., xk be an orthonormal basis for the range of A. Thus A = ∑_{i=1}^k xi□xi and

(X, AX) = ∑_{i=1}^k (xi, X)².

But the random variables (xi, X)², i = 1, ..., k, are independent (Proposition 3.5) and, by Proposition 3.6, ℒ((xi, X)²) = χ²_1((xi, μ)²). From the additivity of independent noncentral chi-square variables,

ℒ(∑_{i=1}^k (xi, X)²) = χ²_k(∑_{i=1}^k (xi, μ)²).

Noting that (μ, Aμ) = ∑_{i=1}^k (xi, μ)², the proof is complete. □
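Proposition 3.7 can be illustrated by simulation. The Python sketch below (our construction; the projection and mean are arbitrary) draws X ~ N(μ, I), forms (X, AX) for a rank-k orthogonal projection A, and compares the empirical moments with those of χ²_k((μ, Aμ)).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 6, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q[:, :k] @ Q[:, :k].T                    # orthogonal projection of rank k
mu = rng.standard_normal(n)
X = mu[:, None] + rng.standard_normal((n, 100_000))   # draws of X ~ N(mu, I)
q = np.einsum('in,ij,jn->n', X, A, X)        # (X, AX) for each draw
nc = mu @ A @ mu                             # noncentrality (mu, A mu)
```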
When ℒ(X) = N(μ, Σ), the distribution of the quadratic form (X, AX), with A self-adjoint, is reasonably complicated, but there is something that can be said. Let B be the positive semidefinite square root of Σ and assume that μ ∈ ℛ(Σ). Thus μ ∈ ℛ(B) since ℛ(B) = ℛ(Σ). Therefore, for some vector τ ∈ V, μ = Bτ. Thus ℒ(X) = ℒ(BY) where ℒ(Y) = N(τ, I) and it suffices to describe the distribution of (BY, ABY) = (Y, BABY). Since A and B are self-adjoint, BAB is self-adjoint. Write BAB in spectral form:

BAB = ∑_{i=1}^n λi xi□xi

where x1, ..., xn is an orthonormal basis for (V, (·,·)). Then

(Y, BABY) = ∑_{i=1}^n λi(xi, Y)²
and the random variables (xi, Y), i = 1, ..., n, are independent with ℒ((xi, Y)²) = χ²_1((xi, τ)²). It follows that the quadratic form (Y, BABY) has the same distribution as a linear combination of independent noncentral chi-square random variables. Symbolically,

ℒ((Y, BABY)) = ℒ(∑_{i=1}^n λi χ²_1((xi, τ)²)).

In general not much more can be said about this distribution without some assumptions concerning the eigenvalues λ1, ..., λn. However, when BAB is an orthogonal projection of rank k, then Proposition 3.7 is applicable and

ℒ((Y, BABY)) = χ²_k((τ, BABτ)) = χ²_k((Bτ, ABτ)) = χ²_k((μ, Aμ)).
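One checkable consequence of the representation above is its mean: since 𝓔χ²_1((xi, τ)²) = 1 + (xi, τ)², the quadratic form has mean ∑ λi(1 + (xi, τ)²) = tr(AΣ) + (μ, Aμ). The Python sketch below (our own arbitrary Σ, A, and τ) verifies this identity deterministically.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
G = rng.standard_normal((n, n))
Sigma = G @ G.T / n                       # an arbitrary covariance
A0 = rng.standard_normal((n, n))
A = (A0 + A0.T) / 2                       # an arbitrary self-adjoint A
w, U = np.linalg.eigh(Sigma)
B = (U * np.sqrt(w)) @ U.T                # positive semidefinite square root of Sigma
tau = rng.standard_normal(n)
mu = B @ tau                              # mu lies in the range of Sigma
lam, xs = np.linalg.eigh(B @ A @ B)       # spectral form of BAB
# mean of sum_i lam_i * chi^2_1((x_i, tau)^2) equals tr(A Sigma) + (mu, A mu)
lhs = np.sum(lam * (1.0 + (xs.T @ tau) ** 2))
rhs = np.trace(A @ Sigma) + mu @ A @ mu
```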
In summary, we have the following.
Proposition 3.8. Suppose ℒ(X) = N(μ, Σ) where μ ∈ ℛ(Σ), and let B be the positive semidefinite square root of Σ. If A is self-adjoint and BAB is a rank k orthogonal projection, then

ℒ((X, AX)) = χ²_k((μ, Aμ)).
We can use a slightly different set of assumptions and reach the same
conclusion as Proposition 3.8, as follows.
Proposition 3.9. Suppose ℒ(X) = N(μ, Σ) and let B be the positive semidefinite square root of Σ. Write μ = μ1 + μ2 where μ1 ∈ ℛ(Σ) and μ2 ∈ 𝒩(Σ). If A is self-adjoint, Aμ2 = 0, and BAB is a rank k orthogonal projection, then

ℒ((X, AX)) = χ²_k((μ, Aμ)).

Proof. Since Aμ2 = 0, (X, AX) = (X − μ2, A(X − μ2)). Let Y = X − μ2 so ℒ(Y) = N(μ1, Σ) and ℒ((X, AX)) = ℒ((Y, AY)). Since μ1 ∈ ℛ(Σ), Proposition 3.8 shows that

ℒ((Y, AY)) = χ²_k((μ1, Aμ1)).

However, (μ, Aμ) = (μ1, Aμ1) as Aμ2 = 0. □
3.3. INDEPENDENCE OF QUADRATIC FORMS
Thus far, necessary and sufficient conditions for the independence of different linear transformations of a normal random vector have been given
and the distribution of a quadratic form in a normal random vector has been described. In this section, we give sufficient conditions for the independence of different quadratic forms in normal random vectors.
Suppose X ∈ (V, (·,·)) has an N(μ, Σ) distribution and consider two self-adjoint linear transformations, A1 and A2, on V to V. To discuss the independence of (X, A1X) and (X, A2X), it is convenient to first reduce the discussion to the case when μ = 0 and Σ = I. Let B be the positive semidefinite square root of Σ so, if ℒ(Y) = N(0, I), then ℒ(X) = ℒ(BY + μ). Thus it suffices to discuss the independence of (BY + μ, A1(BY + μ)) and (BY + μ, A2(BY + μ)) when ℒ(Y) = N(0, I). However,

(BY + μ, Ai(BY + μ)) = (Y, BAiBY) + 2(BAiμ, Y) + (μ, Aiμ)

for i = 1, 2. Let Ci = BAiB, i = 1, 2, and let xi = 2BAiμ. Then we want to know conditions under which (Y, C1Y) + (x1, Y) and (Y, C2Y) + (x2, Y) are independent when ℒ(Y) = N(0, I). Clearly, the constants (μ, Aiμ), i = 1, 2, do not affect the independence of the two quadratic forms. It is this problem, in reduced form, that is treated now. Before stating the principal result, the following technical proposition is needed.
Proposition 3.10. For self-adjoint linear transformations A1 and A2 on (V, (·,·)) to (V, (·,·)), the following are equivalent:

(i) A1A2 = 0.
(ii) ℛ(A1) ⊥ ℛ(A2).

Proof. If A1A2 = 0, then A1A2x = 0 for all x ∈ V so ℛ(A2) ⊆ 𝒩(A1). Since 𝒩(A1) ⊥ ℛ(A1), ℛ(A2) ⊥ ℛ(A1). Conversely, if ℛ(A1) ⊥ ℛ(A2), then ℛ(A2) ⊆ ℛ(A1)⊥ = 𝒩(A1), and this implies that A1A2x = 0 for all x ∈ V. Therefore, A1A2 = 0. □
Proposition 3.11. Let Y ∈ (V, (·,·)) have a N(0, I) distribution and suppose Zi = (Y, AiY) + (xi, Y) where Ai is self-adjoint and xi ∈ V, i = 1, 2. If A1A2 = 0, A1x2 = 0, A2x1 = 0, and (x1, x2) = 0, then Z1 and Z2 are independent random variables.

Proof. The idea of the proof is to show that Z1 and Z2 are functions of two different independent random vectors. To this end, let Pi be the orthogonal projection onto ℛ(Ai) for i = 1, 2. It is clear that PiAiPi = Ai for i = 1, 2. Thus Zi = (PiY, AiPiY) + (xi, Y) for i = 1, 2. The random vector {P1Y, (x1, Y)} takes values in the direct sum V ⊕ R and Z1 is a function of
this vector. Also, {P2Y, (x2, Y)} takes values in V ⊕ R and Z2 is a function of this vector. The remainder of the proof is devoted to showing that {P1Y, (x1, Y)} and {P2Y, (x2, Y)} are independent random vectors. This is done by verifying that the random vectors are jointly normal and that they are uncorrelated. Let [·,·] denote the induced inner product on the direct sum V ⊕ R. The inner product of the vector {{y1, a1}, {y2, a2}} in (V ⊕ R) ⊕ (V ⊕ R) with {{P1Y, (x1, Y)}, {P2Y, (x2, Y)}} is

(y1, P1Y) + a1(x1, Y) + (y2, P2Y) + a2(x2, Y) = (P1y1 + a1x1 + P2y2 + a2x2, Y),

which has a normal distribution since Y is normal. Thus {{P1Y, (x1, Y)}, {P2Y, (x2, Y)}} has a normal distribution. The independence of these two vectors follows from the calculation below, which shows the vectors are uncorrelated. For {y1, a1} ∈ V ⊕ R and {y2, a2} ∈ V ⊕ R,

cov([{y1, a1}, {P1Y, (x1, Y)}], [{y2, a2}, {P2Y, (x2, Y)}])
= cov{(y1, P1Y) + a1(x1, Y), (y2, P2Y) + a2(x2, Y)}
= cov{(P1y1, Y), (P2y2, Y)} + a1 cov{(x1, Y), (P2y2, Y)}
+ a2 cov{(P1y1, Y), (x2, Y)} + a1a2 cov{(x1, Y), (x2, Y)}
= (P1y1, P2y2) + a1(x1, P2y2) + a2(x2, P1y1) + a1a2(x1, x2)
= (y1, P1P2y2) + a1(P2x1, y2) + a2(P1x2, y1) + a1a2(x1, x2).

However, P1P2 = 0 since ℛ(A1) ⊥ ℛ(A2). Also, P2x1 = 0 as x1 ∈ 𝒩(A2) = ℛ(A2)⊥ and, similarly, P1x2 = 0. Further, (x1, x2) = 0 by assumption. Thus the above covariance is zero so Z1 and Z2 are independent. □
A useful consequence of Proposition 3.11 is Proposition 3.12.
Proposition 3.12. Suppose ℒ(X) = N(μ, Σ) on (V, (·,·)) and let Ci, i = 1, 2, be self-adjoint linear transformations. If C1ΣC2 = 0, then (X, C1X) and (X, C2X) are independent.
Proof. Let B denote the positive semidefinite square root of Σ, and suppose ℒ(Y) = N(0, I). It suffices to show that Z1 = (BY + μ, C1(BY +
μ)) is independent of Z2 = (BY + μ, C2(BY + μ)) since ℒ(X) = ℒ(BY + μ). But

Zi = (Y, BCiBY) + 2(BCiμ, Y) + (μ, Ciμ)

for i = 1, 2. Proposition 3.11 can now be applied with Ai = BCiB and xi = 2BCiμ for i = 1, 2. Since Σ = BB, A1A2 = BC1BBC2B = BC1ΣC2B = 0 as C1ΣC2 = 0 by assumption. Also, A1x2 = 2BC1BBC2μ = 2BC1ΣC2μ = 0. Similarly, A2x1 = 0 and (x1, x2) = 4(BC1μ, BC2μ) = 4(μ, C1ΣC2μ) = 0. Thus (Y, BC1BY) + 2(BC1μ, Y) and (Y, BC2BY) + 2(BC2μ, Y) are independent. Hence Z1 and Z2 are independent. □
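Proposition 3.12 can be illustrated by simulation. In the Python sketch below (our own setup, with Σ = I for simplicity), C1 and C2 project onto orthogonal subspaces, so C1ΣC2 = 0, and the two quadratic forms come out empirically uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
C1 = Q[:, :2] @ Q[:, :2].T                # projections onto orthogonal subspaces,
C2 = Q[:, 2:5] @ Q[:, 2:5].T              # so C1 Sigma C2 = 0 when Sigma = I
mu = rng.standard_normal(n)
X = mu[:, None] + rng.standard_normal((n, 100_000))   # draws of X ~ N(mu, I)
Z1 = np.einsum('in,ij,jn->n', X, C1, X)   # (X, C1 X)
Z2 = np.einsum('in,ij,jn->n', X, C2, X)   # (X, C2 X)
r = np.corrcoef(Z1, Z2)[0, 1]
```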
The results of this section are general enough to handle most situations that arise when dealing with quadratic forms. However, in some cases we need a sufficient condition for the independence of k quadratic forms. An examination of the proof of Proposition 3.11 shows that when ℒ(Y) = N(0, I), the quadratic forms Zi = (Y, AiY) + (xi, Y), i = 1, ..., k, are mutually independent if, for each i ≠ j, AiAj = 0, Aixj = 0, Ajxi = 0, and (xi, xj) = 0. The details of this verification are left to the reader.
3.4. CONDITIONAL DISTRIBUTIONS
The basic result of this section gives the conditional distribution of one normal random vector given another normal random vector. It is this result that underlies many of the important distributional and independence properties of the normal and related distributions that are established in later chapters.

Consider random vectors Xi ∈ (Vi, (·,·)i), i = 1, 2, and assume that the random vector {X1, X2} in the direct sum V1 ⊕ V2 has a normal distribution with mean vector {μ1, μ2} ∈ V1 ⊕ V2 and covariance given by

cov({X1, X2}) = ( Σ11  Σ12 )
                ( Σ21  Σ22 ).

Thus ℒ(Xi) = N(μi, Σii) on (Vi, (·,·)i) for i = 1, 2. The conditional distribution of X1 given X2 = x2 ∈ V2 is described in the next result.

Proposition 3.13. Let ℒ(X1|X2 = x2) denote the conditional distribution of X1 given X2 = x2. Then, under the above normality assumptions,

ℒ(X1|X2 = x2) = N(μ1 + Σ12Σ22⁻(x2 − μ2), Σ11 − Σ12Σ22⁻Σ21).

Here, Σ22⁻ denotes the generalized inverse of Σ22.
Proof. The proof consists of calculating the conditional characteristic function of X1 given X2 = x2. To do this, first note that X1 − Σ12Σ22⁻X2 and X2 are jointly normal on V1 ⊕ V2 and are uncorrelated by Proposition 2.17. Thus X1 − Σ12Σ22⁻X2 and X2 are independent. Therefore, for x ∈ V1,

φ(x) = 𝓔(exp[i(x, X1)1]|X2 = x2)
= 𝓔(exp[i(x, X1)1 − i(x, Σ12Σ22⁻X2)1 + i(x, Σ12Σ22⁻X2)1]|X2 = x2)
= exp[i(x, Σ12Σ22⁻x2)1] 𝓔(exp[i(x, X1 − Σ12Σ22⁻X2)1]|X2 = x2)
= exp[i(x, Σ12Σ22⁻x2)1] 𝓔 exp[i(x, X1 − Σ12Σ22⁻X2)1]

where the last equality follows from the independence of X2 and X1 − Σ12Σ22⁻X2. However, it is clear that

ℒ(X1 − Σ12Σ22⁻X2) = N(μ1 − Σ12Σ22⁻μ2, Σ11 − Σ12Σ22⁻Σ21)

as X1 − Σ12Σ22⁻X2 is normal on V1 and has the given mean vector and covariance (Proposition 2.17). Thus

φ(x) = exp[i(x, Σ12Σ22⁻x2)1] exp[i(x, μ1 − Σ12Σ22⁻μ2)1 − ½(x, (Σ11 − Σ12Σ22⁻Σ21)x)1]
= exp[i(x, μ1 + Σ12Σ22⁻(x2 − μ2))1 − ½(x, (Σ11 − Σ12Σ22⁻Σ21)x)1].

The uniqueness of characteristic functions yields the desired conclusion. □
For normal random vectors Xi ∈ (Vi, (·,·)i), i = 1, 2, Proposition 3.13 shows that the conditional mean of X1 given X2 = x2 is an affine function of x2 (affine means a linear transformation plus a constant vector, so zero does not necessarily get mapped into zero). In other words,

𝓔(X1|X2 = x2) = μ1 + Σ12Σ22⁻(x2 − μ2).

Further, the conditional covariance of X1 does not depend on the value of X2. Also, this conditional covariance is the same as the unconditional covariance of the normal random vector X1 − Σ12Σ22⁻X2. Of course, the specification of the conditional mean vector and covariance specifies the conditional distribution of X1 given X2 = x2 as this conditional distribution is normal.
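The two matrix facts used above can be checked deterministically. The Python sketch below (an illustration of ours, using a nonsingular Σ22 so the generalized inverse is the ordinary inverse) verifies that X1 − Σ12Σ22⁻¹X2 is uncorrelated with X2 and that the conditional covariance is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = 3, 2
G = rng.standard_normal((p + q, p + q))
S = G @ G.T                               # joint covariance of {X1, X2}
S11, S12 = S[:p, :p], S[:p, p:]
S21, S22 = S[p:, :p], S[p:, p:]
K = S12 @ np.linalg.inv(S22)              # Sigma_12 Sigma_22^{-1}
resid_cross = S12 - K @ S22               # cov(X1 - K X2, X2), should be 0
cond_cov = S11 - K @ S21                  # conditional covariance of X1 given X2
```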
* Example 3.1. Let W1, ..., Wn be independent coordinate random vectors in Rᵖ where Rᵖ has the usual inner product. Assume that ℒ(Wi) = N(μ, Σ) so μ ∈ Rᵖ is the coordinate mean vector of each Wi and Σ is the p × p covariance matrix of each Wi. Form the random matrix X ∈ ℒp,n with rows Wi', i = 1, ..., n. We know that

𝓔X = eμ'

and

cov(X) = In ⊗ Σ

where e ∈ Rⁿ is the vector of ones. To show X has a normal distribution on the inner product space (ℒp,n, ⟨·,·⟩), it must be verified that, for each A ∈ ℒp,n, ⟨A, X⟩ has a normal distribution. To do this, let the rows of A be a1', ..., an', ai ∈ Rᵖ. Then

⟨A, X⟩ = tr AX' = ∑_{i=1}^n ai'Wi.

However, ai'Wi has a normal distribution on R since ℒ(Wi) = N(μ, Σ) on Rᵖ. Also, since W1, ..., Wn are independent, a1'W1, ..., an'Wn are independent. Since a linear combination of independent normal random variables is normal, ⟨A, X⟩ has a normal distribution for each A ∈ ℒp,n. Thus

ℒ(X) = N(eμ', In ⊗ Σ)

on the inner product space (ℒp,n, ⟨·,·⟩). We now want to describe the conditional distribution of the first q columns of X given the last r columns of X where q + r = p. After some relabeling and a bit of manipulation, this conditional distribution follows from Proposition 3.13. Partition each Wi into Yi and Zi where Yi ∈ R^q consists of the first q coordinates of Wi and Zi ∈ R^r consists of the last r coordinates of Wi. Let X1 ∈ ℒq,n have rows Y1', ..., Yn' and let X2 ∈ ℒr,n have rows Z1', ..., Zn'. Also, partition μ into μ1 ∈ R^q and μ2 ∈ R^r so 𝓔Yi = μ1 and 𝓔Zi = μ2, i = 1, ..., n. Further, partition the covariance matrix Σ of each Wi so that

cov{Yi, Zi} = ( Σ11  Σ12 )
              ( Σ21  Σ22 )

where Σ21 = Σ12'. From the independence of W1, ..., Wn, it follows
that

ℒ(X1) = N(eμ1', In ⊗ Σ11),
ℒ(X2) = N(eμ2', In ⊗ Σ22),

and {X1, X2} has a normal distribution on ℒq,n ⊕ ℒr,n with mean vector {eμ1', eμ2'} and covariance

( In ⊗ Σ11   In ⊗ Σ12 )
( In ⊗ Σ21   In ⊗ Σ22 ).

Now, Proposition 3.13 is directly applicable to {X1, X2} where we make the parameter correspondence

μi ↔ eμi', i = 1, 2

and

Σij ↔ In ⊗ Σij.

Therefore, the conditional distribution of X1 given X2 = x2 ∈ ℒr,n is normal with mean vector

𝓔(X1|X2 = x2) = eμ1' + (In ⊗ Σ12)(In ⊗ Σ22)⁻(x2 − eμ2')

and covariance

cov(X1|X2 = x2) = In ⊗ Σ11 − (In ⊗ Σ12)(In ⊗ Σ22)⁻(In ⊗ Σ21).

However, it is not difficult to show that (In ⊗ Σ22)⁻ = In ⊗ Σ22⁻. Using the manipulation rules for Kronecker products, we have

𝓔(X1|X2 = x2) = eμ1' + (x2 − eμ2')Σ22⁻Σ21

and

cov(X1|X2 = x2) = In ⊗ (Σ11 − Σ12Σ22⁻Σ21).

This result is used in a variety of contexts in later chapters.
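The Kronecker-product manipulations invoked above are easy to confirm numerically. A small Python sketch (our own arbitrary Σ12 and nonsingular Σ22) checks the two identities used: (In ⊗ Σ22)⁻¹ = In ⊗ Σ22⁻¹ and (In ⊗ Σ12)(In ⊗ Σ22⁻¹) = In ⊗ (Σ12Σ22⁻¹).

```python
import numpy as np

rng = np.random.default_rng(6)
n, q, r = 4, 2, 3
G = rng.standard_normal((r, r))
S22 = G @ G.T                             # a nonsingular Sigma_22
S12 = rng.standard_normal((q, r))         # an arbitrary Sigma_12
I = np.eye(n)
S22inv = np.linalg.inv(S22)
lhs_inv = np.linalg.inv(np.kron(I, S22))  # (I_n ⊗ Σ22)^{-1}
rhs_inv = np.kron(I, S22inv)              # I_n ⊗ Σ22^{-1}
lhs_prod = np.kron(I, S12) @ np.kron(I, S22inv)
rhs_prod = np.kron(I, S12 @ S22inv)
```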
3.5. THE DENSITY OF THE NORMAL DISTRIBUTION
The problem considered here is how to define the density function of a nonsingular normal distribution on an inner product space (V, (·,·)). By nonsingular, we mean that the covariance of the distribution is nonsingular. To motivate the technical considerations given below, the density function of a nonsingular normal distribution is first given for the standard coordinate space Rⁿ with the usual inner product.

Consider a random vector X in Rⁿ with coordinates X1, ..., Xn and assume that X1, ..., Xn are independent with ℒ(Xi) = N(0, 1). The symbol dx denotes Lebesgue measure on Rⁿ. Since X1, ..., Xn are independent, the joint density of X1, ..., Xn in Rⁿ is just the product of the marginal densities; that is, X has a density with respect to dx given by

p(x) = ∏_{i=1}^n (2π)^(−1/2) exp[−½xi²] = (2π)^(−n/2) exp[−½ ∑_{i=1}^n xi²]

where x ∈ Rⁿ has coordinates x1, ..., xn. Thus

p(x) = (2π)^(−n/2) exp[−½ x'x]

and x'x is just the inner product of x with x in Rⁿ. To derive the density of an arbitrary nonsingular normal distribution in Rⁿ, let A be an n × n nonsingular matrix and set Y = AX + μ where μ ∈ Rⁿ. Since ℒ(X) = N(0, In), ℒ(Y) = N(μ, Σ) where Σ = AA' is positive definite. Thus X = A⁻¹(Y − μ) and the Jacobian of the nonsingular linear transformation on Rⁿ to Rⁿ sending x into A⁻¹(x − μ) is |det(A⁻¹)| where |·| denotes absolute value. Therefore, the density function of Y with respect to dy is

p1(y) = |det(A⁻¹)| p(A⁻¹(y − μ)) = (det Σ)^(−1/2) (2π)^(−n/2) exp[−½(y − μ)'A'⁻¹A⁻¹(y − μ)]
= (det Σ)^(−1/2) (2π)^(−n/2) exp[−½(y − μ)'Σ⁻¹(y − μ)].

Thus we have the density function with respect to dy of any nonsingular normal distribution on Rⁿ. Of course, this expression makes no sense when Σ is singular.
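The density just derived can be checked against a library implementation. The Python sketch below (our own arbitrary A, μ, and evaluation point) evaluates the formula for p1 directly and compares it with SciPy's multivariate normal density.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 3
A = rng.standard_normal((n, n))
Sigma = A @ A.T                           # Sigma = A A' is positive definite
mu = rng.standard_normal(n)
y = rng.standard_normal(n)                # an arbitrary evaluation point
d = y - mu
p1 = (np.linalg.det(Sigma) ** -0.5 * (2 * np.pi) ** (-n / 2)
      * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))
ref = stats.multivariate_normal.pdf(y, mean=mu, cov=Sigma)
```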
Now, suppose Y is a random vector in an n-dimensional vector space (V, (·,·)) and ℒ(Y) = N(μ, Σ) where Σ is positive definite. The expression

p2(y) = (2π)^(−n/2) (det Σ)^(−1/2) exp[−½(y − μ, Σ⁻¹(y − μ))]
for y ∈ V, certainly makes sense and it is tempting to call this the density function of Y ∈ (V, (·,·)). The problem is: What is the measure on (V, (·,·)) with respect to which p2 is a density? In other words, what is the analog of Lebesgue measure on (V, (·,·))? To answer the question, we now show that there is a natural measure on (V, (·,·)), which is constructed from Lebesgue measure on Rⁿ, and p2 is the density function of Y with respect to this measure.

The details of the construction of "Lebesgue measure" on an n-dimensional inner product space (V, (·,·)) follow. First, we review some basic topological notions for (V, (·,·)). Recall that Sr(x0) = {x | ‖x − x0‖ < r} is called the open ball of radius r with center x0. A set B ⊆ V is called open if, for each x0 ∈ B, there is an r > 0 such that Sr(x0) ⊆ B. Since all inner products on V are related by positive definite linear transformations, the definition of open does not depend on the given inner product. A set is closed iff its complement is open and a set is bounded iff it is contained in Sr(0) for some r > 0. Just as in Rⁿ, a set is compact iff it is closed and bounded (see Rudin, 1953, for the definition and characterization of compact sets in Rⁿ). As with openness, the definitions and characterizations of closedness, boundedness, and compactness do not depend on the particular inner product on V. Let l denote standard Lebesgue measure on Rⁿ. To move l over to the space V, let x1, ..., xn be a fixed orthonormal basis in (V, (·,·)) and define the linear transformation T on Rⁿ to V by

T(a) = ∑_{i=1}^n ai xi

where a ∈ Rⁿ has coordinates a1, ..., an. Clearly, T is one-to-one, onto, and maps open, closed, bounded, and compact sets of Rⁿ into open, closed, bounded, and compact sets of V. Also, T⁻¹ on V to Rⁿ maps x ∈ V into the vector with coordinates (xi, x), i = 1, ..., n. Now, define the measure ν0 on Borel sets B ∈ 𝔅(V) by

ν0(B) = l(T⁻¹(B)).

Notice that ν0(B + x) = l(T⁻¹(B + x)) = l(T⁻¹(B) + T⁻¹x) = l(T⁻¹(B)) = ν0(B) since Lebesgue measure is invariant under translations. Also, ν0(B) < +∞ if B is a compact set. This leads to the following definition.

Definition 3.3. A nonzero measure ν defined on the Borel sets 𝔅(V) of (V, (·,·)) is invariant if:

(i) ν(B + x) = ν(B) for x ∈ V and B ∈ 𝔅(V).
(ii) ν(B) < +∞ for all compact sets B.
The measure ν0 defined above is invariant and it is shown below that, if ν is any invariant measure on 𝔅(V), then ν = cν0 for some constant c > 0. Condition (ii) of Definition 3.3 relates the topology of V to the measure ν. The measure that counts the number of points in a set satisfies (i) but not (ii) of Definition 3.3, and this measure is not equal to a positive constant times ν0. Before characterizing the measure ν0, it is now shown that ν0 is a dominating measure for the density function of a nonsingular normal distribution on (V, (·,·)).
Proposition 3.14. Suppose ℒ(Y) = N(μ, Σ) on the inner product space (V, (·,·)) where Σ is nonsingular. The density function of Y with respect to the measure ν0 is given by

p(y) = (2π)^(−n/2) (det Σ)^(−1/2) exp[−½(y − μ, Σ⁻¹(y − μ))]

for y ∈ V.
Proof. It must be shown that, for each Borel set B,

P{Y ∈ B} = ∫ IB(y) p(y) ν0(dy)

where IB is the indicator function of the set B. From the definition of the measure ν0, it follows that (see Lehmann, 1959, p. 38)

∫ IB(y) p(y) ν0(dy) = ∫ IB(T(a)) p(T(a)) l(da).

Let X = T⁻¹(Y) ∈ Rⁿ so X is a random vector with coordinates (xi, Y), i = 1, ..., n. Thus X has a normal distribution in Rⁿ with mean vector T⁻¹(μ) and covariance matrix [Σ] where [Σ] is the matrix of Σ in the given orthonormal basis x1, ..., xn. Therefore,

P{Y ∈ B} = P{T⁻¹(Y) ∈ T⁻¹(B)} = P{X ∈ T⁻¹(B)}
= ∫ I_{T⁻¹(B)}(a) (2π)^(−n/2) (det[Σ])^(−1/2) exp[−½(a − T⁻¹(μ))'[Σ]⁻¹(a − T⁻¹(μ))] l(da)
= ∫ IB(T(a)) p(T(a)) l(da).
The last equality follows since I_{T⁻¹(B)}(a) = IB(T(a)) and

p(T(a)) = (2π)^(−n/2) (det Σ)^(−1/2) exp[−½(T(a) − μ, Σ⁻¹(T(a) − μ))]
= (2π)^(−n/2) (det[Σ])^(−1/2) exp[−½(a − T⁻¹(μ))'[Σ]⁻¹(a − T⁻¹(μ))].

Thus

P{Y ∈ B} = ∫ IB(T(a)) p(T(a)) l(da) = ∫ IB(y) p(y) ν0(dy). □
We now want to show that the measure ν0, constructed from Lebesgue measure on Rⁿ, is the unique translation invariant measure that satisfies

∫ p(y) ν0(dy) = 1.

Let 𝒦⁺ be the collection of all bounded non-negative Borel measurable functions f defined on V that satisfy the following: given f ∈ 𝒦⁺, there is a compact set B such that f(v) = 0 if v ∉ B. If ν is any invariant measure on V and f ∈ 𝒦⁺, then ∫ f(v) ν(dv) < +∞ since f is bounded and the ν-measure of every compact set is finite. It is clear that, if ν1 and ν2 are invariant measures such that

∫ f(v) ν1(dv) = ∫ f(v) ν2(dv) for all f ∈ 𝒦⁺,

then ν1 = ν2. From the definition of an invariant measure, we also have

∫ f(v + x) ν(dv) = ∫ f(v) ν(dv)
for all f ∈ 𝒦⁺ and x ∈ V. Furthermore, the definition of ν0 shows that

∫ f(x) ν0(dx) = ∫ f(T(a)) l(da) = ∫ f(T(−a)) l(da) = ∫ f(−T(a)) l(da) = ∫ f(−x) ν0(dx)

for all f ∈ 𝒦⁺. Here, we have used the linearity of T and the invariance of Lebesgue measure under multiplication of the argument of integration by minus one.
Proposition 3.15. If ν is an invariant measure on 𝔅(V), then there exists a positive constant c such that ν = cν0.

Proof. For f, g ∈ 𝒦⁺, we have

∫ f(x) ν(dx) ∫ g(y) ν0(dy) = ∫∫ f(x − y) g(y) ν(dx) ν0(dy)
= ∫∫ f(−(y − x)) g(y − x + x) ν0(dy) ν(dx)
= ∫∫ f(−w) g(w + x) ν0(dw) ν(dx)
= ∫∫ f(−w) g(w + x) ν(dx) ν0(dw)
= ∫ f(−w) ν0(dw) ∫ g(x) ν(dx)
= ∫ f(w) ν0(dw) ∫ g(x) ν(dx).

Therefore,

∫ f(x) ν(dx) ∫ g(y) ν0(dy) = ∫ f(w) ν0(dw) ∫ g(y) ν(dy)

for all f, g ∈ 𝒦⁺. Fix f ∈ 𝒦⁺ such that ∫ f(w) ν0(dw) = 1 and set c = ∫ f(x) ν(dx). Then

∫ g(y) ν(dy) = c ∫ g(y) ν0(dy)
for all g ∈ 𝒦⁺. The constant c cannot be zero as the measure ν is not zero. Thus c > 0 and ν = cν0. □
The measure ν0 is called the Lebesgue measure on V and is henceforth denoted by dv or dx, as is the Lebesgue measure on Rⁿ. It is possible to show that ν0 does not depend on the particular orthonormal basis used to define it by using a Jacobian argument in Rⁿ. However, the argument given above contains more information than this. In fact, some minor technical modifications of the proof of Proposition 3.15 yield the uniqueness (up to a positive constant) of invariant measures on locally compact topological groups. This topic is discussed in detail in Chapter 6.

An application of Proposition 3.14 to the situation treated in Example 3.1 follows.
* Example 3.2. For independent coordinate random vectors Wi ∈ Rᵖ, i = 1, ..., n, with ℒ(Wi) = N(μ, Σ), form the random matrix X ∈ ℒp,n with rows Wi', i = 1, ..., n. As shown in Example 3.1,

ℒ(X) = N(eμ', In ⊗ Σ)

on the inner product space (ℒp,n, ⟨·,·⟩), where e ∈ Rⁿ is the vector of ones. Let dX denote Lebesgue measure on the vector space ℒp,n. If Σ is nonsingular, then In ⊗ Σ is nonsingular and (In ⊗ Σ)⁻¹ = In ⊗ Σ⁻¹. Thus when Σ is nonsingular, the density of X with respect to dX is

(3.1) p(X) = (2π)^(−np/2) (det(In ⊗ Σ))^(−1/2) exp[−½⟨X − eμ', (In ⊗ Σ⁻¹)(X − eμ')⟩].

It is shown in Chapter 5 that det(In ⊗ Σ) = (det Σ)ⁿ. Since the inner product ⟨·,·⟩ is given by the trace, the density p can be written

p(X) = (2π)^(−np/2) (det Σ)^(−n/2) exp[−½ tr(X − eμ')'(X − eμ')Σ⁻¹].

However, this form of the density is somewhat less revealing, from a statistical point of view, than (3.1). In order to make this statement more precise and to motivate some future statistical considerations, we now think of μ ∈ Rᵖ and Σ as unknown parameters. Thus, we
can write (3.1) as

(3.2) p(X|μ, Σ) = (2π)^(−np/2) (det Σ)^(−n/2) exp[−½⟨X − eμ', (In ⊗ Σ⁻¹)(X − eμ')⟩]

where μ ranges over Rᵖ and Σ ranges over all p × p positive definite matrices. Thus we have a parametric family of densities for the distribution of the random vector X. As a first step in analyzing this parametric family, let

M = {x ∈ ℒp,n | x = eμ', μ ∈ Rᵖ}.

It is clear that M is a p-dimensional linear subspace of ℒp,n and M is simply the space of possible values for the mean vector of X. Let Pe = (1/n)ee' so Pe is the orthogonal projection onto span{e} ⊆ Rⁿ. Thus Pe ⊗ Ip is an orthogonal projection and it is easily verified that the range of Pe ⊗ Ip is M. Therefore, the orthogonal projection onto M is Pe ⊗ Ip. Let Qe = In − Pe so Qe ⊗ Ip is the orthogonal projection onto M⊥ and (Qe ⊗ Ip)(Pe ⊗ Ip) = 0. We now decompose X into the part of X in M and the part of X in M⊥; that is, write X = (Pe ⊗ Ip)X + (Qe ⊗ Ip)X. Substituting this into the exponential part of (3.2) and using the relation (Pe ⊗ Ip)(In ⊗ Σ⁻¹)(Qe ⊗ Ip) = 0, we have

⟨X − eμ', (In ⊗ Σ⁻¹)(X − eμ')⟩
= ⟨Pe(X − eμ'), (In ⊗ Σ⁻¹)Pe(X − eμ')⟩ + ⟨QeX, (In ⊗ Σ⁻¹)QeX⟩
= ⟨PeX − eμ', (In ⊗ Σ⁻¹)(PeX − eμ')⟩ + tr QeXΣ⁻¹(QeX)'
= ⟨PeX − eμ', (In ⊗ Σ⁻¹)(PeX − eμ')⟩ + tr X'QeXΣ⁻¹.

Thus the density p(X|μ, Σ) is a function of the pair (PeX, X'QeX), so PeX and X'QeX form a sufficient statistic for the parametric family (3.2). Proposition 3.4 shows that (Pe ⊗ Ip)X and (Qe ⊗ Ip)X are independent since (Pe ⊗ Ip)(In ⊗ Σ)(Qe ⊗ Ip) = (PeQe) ⊗ Σ = 0 as PeQe = 0. Therefore, PeX and X'QeX are independent since PeX = (Pe ⊗ Ip)X and X'QeX = ((Qe ⊗ Ip)X)'((Qe ⊗ Ip)X). To interpret the sufficient statistic in terms of the original random
vectors W_1, ..., W_n, first note that

P_eX = (1/n)ee'X = eW̄'

where W̄ = (1/n)Σ_i W_i is the sample mean. Also,

X'Q_eX = (Q_eX)'(Q_eX) = ((I_n - P_e)X)'((I_n - P_e)X)
       = (X - eW̄')'(X - eW̄') = Σ_{i=1}^n (W_i - W̄)(W_i - W̄)'.

The quantity (1/n)X'Q_eX is often called the sample covariance matrix. Since eW̄' and W̄ are one-to-one functions of each other, we have that the sample mean and sample covariance matrix form a sufficient statistic and they are independent. It is clear that

ℒ(W̄) = N(μ, (1/n)Σ).
The distribution of X'Q_eX, commonly called the Wishart distribution, is derived later. The procedure of decomposing X into the projection onto the mean space (the subspace M) and the projection onto the orthogonal complement of the mean space is fundamental in multivariate analysis, just as it is in univariate statistical analysis. In fact, this procedure is at the heart of the analysis of linear models, a topic to be considered in the next chapter.
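The two identities above, P_eX = eW̄' and X'Q_eX = Σ(W_i - W̄)(W_i - W̄)', are easy to check numerically. The following sketch is our illustration, not part of the text; it assumes numpy and uses random data standing in for W_1, ..., W_n:

```python
import numpy as np

# Sketch (not from the text): random data standing in for W_1, ..., W_n.
rng = np.random.default_rng(0)
n, p = 5, 3
X = rng.normal(size=(n, p))   # rows of X are W_1', ..., W_n'

e = np.ones((n, 1))
P_e = e @ e.T / n             # orthogonal projection onto span{e} in R^n
Q_e = np.eye(n) - P_e         # projection onto the orthogonal complement

W_bar = X.mean(axis=0)        # the sample mean W-bar

# P_e X = e W-bar', the projection of X onto the mean space M
assert np.allclose(P_e @ X, e @ W_bar.reshape(1, p))

# X' Q_e X = sum_i (W_i - W-bar)(W_i - W-bar)', n times the sample covariance
S = sum(np.outer(X[i] - W_bar, X[i] - W_bar) for i in range(n))
assert np.allclose(X.T @ Q_e @ X, S)
```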
PROBLEMS
1. Suppose X_1, ..., X_n are independent with values in (V, (·, ·)) and ℒ(X_i) = N(μ_i, A_i), i = 1, ..., n. Show that ℒ(Σ X_i) = N(Σ μ_i, Σ A_i).
2. Let X and Y be random vectors in R^n whose joint distribution is normal with mean zero and covariance

Cov(X, Y) = ( I_n    ρI_n )
            ( ρI_n   I_n  )

where ρ is a scalar. Show that |ρ| ≤ 1, and that the covariance is positive definite iff |ρ| < 1. Let Q(Y) = I_n - (Y'Y)^{-1}YY'. Prove that W = X'Q(Y)X has the distribution of (1 - ρ²)χ²_{n-1} (the constant 1 - ρ² times a chi-squared random variable with n - 1 degrees of freedom).
3. When X ∈ R^n and ℒ(X) = N(0, Σ) with Σ nonsingular, then ℒ(X) = ℒ(CZ) where ℒ(Z) = N(0, I) and CC' = Σ. Hence ℒ(C^{-1}X) = ℒ(Z), so C^{-1} transforms X into a vector of i.i.d. N(0, 1) random variables. There are many C^{-1}'s that do this. The problem at hand concerns the construction of one such C^{-1}. Given any p × p positive definite matrix A, p ≥ 2, partition A as

A = ( a_{11}  A_{12} )
    ( A_{21}  A_{22} )

where a_{11} ∈ R^1 and A_{21} = A'_{12} ∈ R^{p-1}. Define T_p(A) by

T_p(A) = ( a_{11}^{-1/2}        0       )
         ( -A_{21}a_{11}^{-1}   I_{p-1} )
(i) Partition Σ : n × n as A is partitioned and set X^{(1)} = T_n(Σ)X. Show that

Cov(X^{(1)}) = ( 1  0       )
               ( 0  Σ^{(1)} )

where Σ^{(1)} = Σ_{22} - Σ_{21}Σ_{12}/σ_{11}.

(ii) For k = 1, 2, ..., n - 2, define X^{(k+1)} by

X^{(k+1)} = ( I_k  0                 ) X^{(k)}.
            ( 0    T_{n-k}(Σ^{(k)}) )

Prove that

Cov(X^{(k+1)}) = ( I_{k+1}  0         )
                 ( 0        Σ^{(k+1)} )

for some positive definite Σ^{(k+1)}.

(iii) For k = 0, ..., n - 2, let

T^{(k)} = ( I_k  0                 )
          ( 0    T_{n-k}(Σ^{(k)}) )

where Σ^{(0)} = Σ (so T^{(0)} = T_n(Σ)). With T = T^{(n-2)} ··· T^{(0)}, show that X^{(n-1)} = TX and Cov(X^{(n-1)}) = I_n. Also, show that T is lower triangular and Σ^{-1} = T'T.
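A numerical sketch of the construction in Problem 3 follows (our illustration, not part of the text). One caveat: since T_p normalizes only the leading coordinate of each block, the sketch also applies the 1 × 1 case of T_p as a final scaling step so that the last variance equals one; with that convention, the claims of part (iii) can be checked directly:

```python
import numpy as np

def T_block(A):
    # The transformation T_p(A) from the problem; for a 1x1 block this
    # reduces to scaling by a11^{-1/2} (used here as a final normalization
    # step -- an assumption of this sketch).
    p = A.shape[0]
    T = np.eye(p)
    T[0, 0] = A[0, 0] ** -0.5
    if p > 1:
        T[1:, 0] = -A[1:, 0] / A[0, 0]
    return T

def build_T(Sigma):
    """Accumulate T = T^{(n-1)} ... T^{(0)} so that T Sigma T' = I."""
    n = Sigma.shape[0]
    T = np.eye(n)
    S = Sigma.copy()
    for k in range(n):
        Tk = np.eye(n)
        Tk[k:, k:] = T_block(S[k:, k:])
        S = Tk @ S @ Tk.T          # covariance of the transformed vector
        T = Tk @ T
    return T

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
Sigma = B @ B.T + 4 * np.eye(4)    # an arbitrary positive definite Sigma

T = build_T(Sigma)
assert np.allclose(T @ Sigma @ T.T, np.eye(4))      # Cov(TX) = I_n
assert np.allclose(T, np.tril(T))                   # T is lower triangular
assert np.allclose(T.T @ T, np.linalg.inv(Sigma))   # Sigma^{-1} = T'T
```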
4. Suppose X ∈ R² has coordinates X₁ and X₂ and has density

p(x) = (1/π) exp[-½(x₁² + x₂²)]  if x₁x₂ > 0,  and p(x) = 0 otherwise,

so p is zero in the second and fourth quadrants. Show that X₁ and X₂ are both normal but X is not normal.
5. Let X₁, ..., X_n be i.i.d. N(μ, σ²) random variables. Show that U = Σ(X_i - X̄)² and W = ΣX_i are independent. What is the distribution of U?
6. For X ∈ (V, (·, ·)) with ℒ(X) = N(0, I), suppose (X, AX) and (X, BX) are independent. If A and B are both positive semidefinite, prove that AB = 0. Hint: Show that tr AB = 0 by using cov{(X, AX), (X, BX)} = 0. Then use the positive semidefiniteness of A and B together with tr AB = 0 to conclude that AB = 0.
7. The method used to define the normal distribution on (V, (·, ·)) consisted of three steps: (i) first, an N(0, 1) distribution was defined on R¹; (ii) next, if ℒ(Z) = N(0, 1), then W is N(μ, σ²) if ℒ(W) = ℒ(σZ + μ); and (iii) X with values in (V, (·, ·)) is normal if (x, X) is normal on R¹ for each x ∈ V. It is natural to ask if this procedure can be used to define other types of distributions on (V, (·, ·)). Here is an attempt for the Cauchy distribution. For Z ∈ R¹, say Z is standard Cauchy (which we write as ℒ(Z) = C(0, 1)) if the density of Z is

f(z) = (1/π) · 1/(1 + z²),  z ∈ R¹.

Say W has a Cauchy distribution on R¹ if ℒ(W) = ℒ(σZ + μ) for some μ ∈ R¹ and σ > 0; in this case write ℒ(W) = C(μ, σ). Finally, say X ∈ (V, (·, ·)) is Cauchy if (x, X) is Cauchy on R¹ for each x ∈ V.

(i) Let W_1, ..., W_n be independent with ℒ(W_j) = C(μ_j, σ_j), j = 1, ..., n. Show that ℒ(Σ a_jW_j) = C(Σ a_jμ_j, Σ |a_j|σ_j). Hint: The characteristic function of a C(0, 1) distribution is exp[-|t|], t ∈ R¹.

(ii) Let Z₁, ..., Z_n be i.i.d. C(0, 1) and let x₁, ..., x_n be any basis for (V, (·, ·)). Show that X = Σ Z_jx_j has a Cauchy distribution on (V, (·, ·)).
8. Consider a density on R¹ given by

f(u) = ∫₀^∞ t^{-1}φ(u/t) G(dt)
where φ is the density of an N(0, 1) distribution and G is a distribution function with G(0) = 0. The distribution defined by f is called a scale mixture of normals.

(i) Let Z₀ be N(0, 1) and let R be independent of Z₀ with ℒ(R) = G. Show that U = RZ₀ has f as its density function.

If ℒ(Y) = ℒ(cU) for some c > 0, we say that Y has a type-f distribution.

(ii) In (V, (·, ·)), suppose ℒ(Z) = N(0, I) and form X = RZ where R and Z are independent and ℒ(R) = G. For each x ∈ V, show that (x, X) has a type-f distribution.

Remark. The distribution of X in (V, (·, ·)) provides a possible vector space generalization of a type-f distribution on R¹.
9. In the notation of Example 3.1, assume that μ = 0 so ℒ(X) = N(0, I_n ⊗ Σ) on (ℒ_{p,n}, ⟨·, ·⟩). Also,

ℒ(X₁|X₂ = x₂) = N(x₂Σ₂₂^{-1}Σ₂₁, I_n ⊗ Σ₁₁·₂)

where Σ₁₁·₂ = Σ₁₁ - Σ₁₂Σ₂₂^{-1}Σ₂₁. Show that the conditional distribution of X₂'X₁ given X₂ is the same as the conditional distribution of X₂'X₁ given X₂'X₂.
10. The map T of Section 3.5 was defined from R^n to (V, (·, ·)) by Ta = Σ a_ix_i where x₁, ..., x_n is an orthonormal basis for (V, (·, ·)). Also, we defined ν₀ by ν₀(B) = l(T^{-1}(B)) for B ∈ 𝔅(V), where l is Lebesgue measure on R^n. Consider another orthonormal basis y₁, ..., y_n for (V, (·, ·)) and define T₁ by T₁a = Σ a_iy_i, a ∈ R^n. Define ν₁ by ν₁(B) = l(T₁^{-1}(B)) for B ∈ 𝔅(V). Prove that ν₀ = ν₁.
11. The measure ν₀ in Problem 10 depends on the inner product (·, ·) on V. Suppose [·, ·] is another inner product given by [x, y] = (x, Ay) where A > 0. Let ν₁ be the measure constructed on (V, [·, ·]) in the same manner that ν₀ was constructed on (V, (·, ·)). Show that ν₁ = cν₀ where c = (det A)^{1/2}.
12. Consider the space S_p of p × p symmetric matrices with the inner product given by ⟨S₁, S₂⟩ = tr S₁S₂. Show that the density function of an N(0, I) distribution on (S_p, ⟨·, ·⟩) with respect to the measure ν₀ is

p(S) = (2π)^{-p(p+1)/4} exp[-½(Σ_i s_{ii}² + 2Σ_{i<j} s_{ij}²)]

where S = {s_{ij}}, i, j = 1, ..., p. Explain your answer (what is ν₀?).
13. Consider X₁, ..., X_n, i.i.d. N(μ, Σ) on R^p. Let X ∈ ℒ_{p,n} have rows X₁', ..., X_n' so ℒ(X) = N(eμ', I_n ⊗ Σ). Assume that Σ has the form

Σ = σ² ( 1  ρ  ···  ρ )
        ( ρ  1  ···  ρ )
        ( ⋮  ⋮  ⋱   ⋮ )
        ( ρ  ρ  ···  1 )

where σ² > 0 and -1/(p - 1) < ρ < 1 so that Σ is positive definite. Such a covariance matrix is said to have intraclass covariance structure.

(i) On R^p, let A = (1/p)e₁e₁' where e₁ ∈ R^p is the vector of ones. Show that a positive definite covariance matrix has intraclass covariance structure iff Σ = αA + β(I - A) for some positive scalars α and β. In this case Σ^{-1} = α^{-1}A + β^{-1}(I - A).

(ii) Using the notation and methods of Example 3.2, show that when (μ, σ², ρ) are unknown parameters, (X̄, tr AX'Q_eX, tr(I - A)X'Q_eX) is a sufficient statistic.
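Part (i) of Problem 13 can be checked numerically for particular values of (p, σ², ρ); the values below (p = 4, σ² = 2, ρ = 0.3) are our assumptions for illustration. The scalars are α = σ²(1 + (p - 1)ρ) and β = σ²(1 - ρ), the eigenvalues of Σ on the ranges of A and I - A:

```python
import numpy as np

# Assumed illustrative values; any -1/(p-1) < rho < 1 would do.
p, sigma2, rho = 4, 2.0, 0.3
e1 = np.ones(p)
A = np.outer(e1, e1) / p                   # orthogonal projection onto span{e1}
Sigma = sigma2 * ((1 - rho) * np.eye(p) + rho * np.outer(e1, e1))

# Sigma = alpha*A + beta*(I - A) with the eigenvalues alpha, beta below
alpha = sigma2 * (1 + (p - 1) * rho)
beta = sigma2 * (1 - rho)
assert np.allclose(Sigma, alpha * A + beta * (np.eye(p) - A))

# and, since A and I - A are orthogonal projections,
# Sigma^{-1} = alpha^{-1} A + beta^{-1} (I - A)
assert np.allclose(np.linalg.inv(Sigma), A / alpha + (np.eye(p) - A) / beta)
```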
NOTES AND REFERENCES
1. A coordinate treatment of the normal distribution, similar to the treatment given here, can be found in Muirhead (1982).
2. Examples 3.1 and 3.2 indicate some of the advantages of vector space
techniques over coordinate techniques. For comparison, the reader may find it instructive to formulate coordinate versions of these examples.
3. The converse of Proposition 3.11 is true. The only proof I know involves characteristic functions. For a discussion of this, see Srivastava and
Khatri (1979, p. 64).
CHAPTER 4
Linear Statistical Models
The purpose of this chapter is to develop a theory of linear unbiased estimation that is sufficiently general to be applicable to the linear models arising in multivariate analysis. Our starting point is the classical regression
model where the Gauss-Markov Theorem is formulated in vector space language. The approach taken here is to first isolate the essential aspects of a regression model and then use the vector space machinery developed thus
far to derive the Gauss-Markov estimator of a mean vector.
After presenting a useful necessary and sufficient condition for the equality of the Gauss-Markov and least-squares estimators of a mean
vector, we then discuss the existence of Gauss-Markov estimators for what might be called generalized linear models. This discussion leads to a version
of the Gauss-Markov Theorem that is directly applicable to the general linear model of multivariate analysis.
4.1. THE CLASSICAL LINEAR MODEL
The linear regression model arises from the following considerations. Suppose we observe a random variable Y_i ∈ R and associated with Y_i are known numbers z_{i1}, ..., z_{ik}, i = 1, ..., n. The numbers z_{i1}, ..., z_{ik} might be indicator variables denoting the presence or absence of a treatment, as in an analysis of variance situation, or they might be the numerical levels of some physical parameters that affect the observed value of Y_i. It is assumed that the mean value of Y_i is ℰY_i = Σ_j β_j z_{ij}, where the β_j are unknown parameters. It is also assumed that var(Y_i) = σ² > 0 and cov(Y_i, Y_j) = 0 if i ≠ j. Let Y ∈ R^n be the random vector with coordinates Y₁, ..., Y_n, let Z = {z_{ij}} be the n × k matrix of z_{ij}'s, and let β ∈ R^k be the vector with coordinates β₁, ..., β_k. In vector form, the assumptions we have made
concerning Y are that ℰY = Zβ and Cov(Y) = σ²I_n. In summary, we observe the vector Y whose mean is Zβ, where Z is a known n × k matrix, β ∈ R^k is a vector of unknown parameters, and Cov(Y) = σ²I_n where σ² is an unknown parameter. The two essential features of this parametric model are: (i) the mean vector of Y is an unknown element of a known subspace of R^n; namely, ℰY is an element of the range of the known linear transformation determined by Z that maps R^k to R^n; (ii) Cov(Y) = σ²I_n; that is, the distribution of Y is weakly spherical. For a discussion of the classical statistical problems related to the above model, the reader is referred to Scheffé (1959).
Now, consider a finite dimensional inner product space (V, (·, ·)). With the above regression model in mind, we define a weakly spherical linear model for a random vector with values in (V, (·, ·)).

Definition 4.1. Let M be a subspace of V and let ε₀ be a random vector in V with a distribution that satisfies ℰε₀ = 0 and Cov(ε₀) = I. For each μ ∈ M and σ > 0, let Q_{μ,σ} denote the distribution of μ + σε₀. The family {Q_{μ,σ} | μ ∈ M, σ > 0} is a weakly spherical linear model for Y ∈ V if the distribution of Y is in {Q_{μ,σ} | μ ∈ M, σ > 0}.
This definition is just a very formal statement of the assumption that the mean vector of Y is an element of the subspace M and the distribution of Y is weakly spherical, so Cov(Y) = σ²I for some σ² > 0. In an abuse of notation, we often write Y = μ + ε for μ ∈ M, where ε is a random vector with ℰε = 0 and Cov(ε) = σ²I. This is to indicate the assumption that we have a weakly spherical linear parametric model for the distribution of Y. The unobserved random vector ε is often called the error vector. The subspace M is called the regression subspace (or manifold) and the subspace M⊥ is called the error subspace. Further, the parameter μ ∈ M is assumed unknown, as is the parameter σ². It is clear that the regression model used to motivate Definition 4.1 is a weakly spherical linear model for the observed random vector, and the subspace M is just the range of Z.
Given a linear model Y = μ + ε, μ ∈ M, ℰε = 0, Cov(ε) = σ²I, we now want to discuss the problem of estimating μ. The classical Gauss-Markov approach to estimating μ is to first restrict attention to linear transformations of Y that are unbiased estimators and then, within this class of estimators, find the estimator with minimum expected norm-squared deviation from μ. To make all of this precise, we proceed as follows. By a linear estimator of μ, we mean an estimator of the form AY where A ∈ ℒ(V, V). (We could consider affine estimators AY + v₀, v₀ ∈ V, but the unbiasedness restriction would imply v₀ = 0.) A linear estimator AY of μ is unbiased iff, when μ ∈ M is the mean of Y, we have ℰ(AY) = μ. This is equivalent
to the condition that Aμ = μ for all μ ∈ M since ℰAY = AℰY = Aμ. Thus AY is an unbiased estimator of μ iff Aμ = μ for all μ ∈ M. Let

𝒜 = {A | A ∈ ℒ(V, V), Aμ = μ for μ ∈ M}.

The linear unbiased estimators of μ are those estimators of the form AY with A ∈ 𝒜. We now want to choose the one estimator (i.e., A ∈ 𝒜) that minimizes the expected norm-squared deviation of the estimator from μ. In other words, the problem is to find an element A ∈ 𝒜 that minimizes ℰ‖AY - μ‖². The justification for choosing such an A is that ‖AY - μ‖² is the squared distance between AY and μ, so ℰ‖AY - μ‖² is the average squared distance between AY and μ. Since we would like AY to be close to μ, such a criterion for choosing A ∈ 𝒜 seems reasonable. The first result in this chapter, the Gauss-Markov Theorem, shows that the orthogonal projection onto M, say P, is the unique element in 𝒜 that minimizes ℰ‖AY - μ‖².
Theorem 4.1 (Gauss-Markov Theorem). For each A ∈ 𝒜, μ ∈ M, and σ² > 0,

ℰ‖AY - μ‖² ≥ ℰ‖PY - μ‖²

where P is the orthogonal projection onto M. There is equality in this inequality iff A = P.
Proof. Write A = P + C so C = A - P. Since Aμ = μ for μ ∈ M, Cμ = 0 for μ ∈ M and this implies that CP = 0. Therefore, C(Y - μ) and P(Y - μ) are uncorrelated random vectors, so ℰ(C(Y - μ), P(Y - μ)) = 0 (see Proposition 2.21). Now,

ℰ‖AY - μ‖² = ℰ‖A(Y - μ)‖² = ℰ‖P(Y - μ) + C(Y - μ)‖²
           = ℰ‖P(Y - μ)‖² + ℰ‖C(Y - μ)‖²
           ≥ ℰ‖P(Y - μ)‖² = ℰ‖PY - μ‖².

The third equality results from the fact that the cross product term is zero. This establishes the desired inequality. It is clear that there is equality in this inequality iff ℰ‖C(Y - μ)‖² = 0. However, C(Y - μ) has mean zero and covariance σ²CC', so

ℰ‖C(Y - μ)‖² = σ²(I, CC')

by Proposition 2.21. Since σ² > 0, there is equality iff (I, CC') = 0. But (I, CC') = (C, C) and this is zero iff C = A - P = 0. □
The estimator PY of μ ∈ M is called the Gauss-Markov estimator of the mean vector, and the notation μ̂ = PY is used here. A moment's reflection shows that the validity of Theorem 4.1 has nothing to do with the parameter σ², be it known or unknown, as long as σ² > 0. The estimator μ̂ = PY is also called the least-squares estimator of μ for the following reason. Given the observation vector Y, we ask for the vector in M that is closest, in the given norm, to Y; that is, we want to minimize, over x ∈ M, the expression ‖Y - x‖². But Y = PY + QY where Q = I - P so, for x ∈ M,

‖Y - x‖² = ‖PY - x + QY‖² = ‖PY - x‖² + ‖QY‖².

The second equality is a consequence of Qx = 0 and QP = 0. Thus

‖Y - x‖² ≥ ‖QY‖²

with equality iff x = PY. In other words, the point in M that is closest to Y is μ̂ = PY. When the vector space V is R^n with the usual inner product, ‖Y - x‖² is just a sum of squares and μ̂ = PY ∈ M minimizes this sum of squares, hence the term least-squares estimator.
* Example 4.1. Consider the regression model used to motivate Definition 4.1. Here, Y ∈ R^n has mean vector Zβ where β ∈ R^k and Z is a known n × k matrix with k ≤ n. Also, it is assumed that Cov(Y) = σ²I_n, σ² > 0. Therefore, we have a weakly spherical linear model for Y, and μ = Zβ is the mean vector of Y. The regression manifold M is just the range of Z. To compute the Gauss-Markov estimator of μ, the orthogonal projection onto M, relative to the usual inner product on R^n, must be found. To find this projection explicitly in terms of Z, it is now assumed that the rank of Z is k. The claim is that P = Z(Z'Z)^{-1}Z' is the orthogonal projection onto M. Clearly, P² = P and P is self-adjoint, so P is the orthogonal projection onto its range. However, Z' maps R^n onto R^k since the rank of Z' is k. Thus (Z'Z)^{-1}Z' maps R^n onto R^k. Therefore, the range of Z(Z'Z)^{-1}Z' is Z(R^k), which is just M, so P is the orthogonal projection onto M. Hence μ̂ = Z(Z'Z)^{-1}Z'Y is the Gauss-Markov and least-squares estimator of μ. Since μ = Zβ, Z'μ = Z'Zβ and thus β = (Z'Z)^{-1}Z'μ. There is the obvious temptation to call

β̂ = (Z'Z)^{-1}Z'μ̂ = (Z'Z)^{-1}Z'Z(Z'Z)^{-1}Z'Y = (Z'Z)^{-1}Z'Y

the Gauss-Markov and least-squares estimator of the parameter β.
Certainly, calling β̂ the least-squares estimator of β is justified since

‖Y - Zγ‖² ≥ ‖Y - Zβ̂‖²

for all γ ∈ R^k, as Zβ̂ = μ̂ and Zγ ∈ M. Thus β̂ minimizes the sum of squares ‖Y - Zγ‖² as a function of γ. However, it is not clear why β̂ should be called the Gauss-Markov estimator of β. The discussion below rectifies this situation.
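As a concrete sketch of Example 4.1 (our illustration; the matrices below are random stand-ins, assumed to have full rank), one can verify that P = Z(Z'Z)^{-1}Z' is an orthogonal projection and that β̂ agrees with a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
Z = rng.normal(size=(n, k))      # assumed to have rank k
Y = rng.normal(size=n)           # an observation vector

P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T   # projection onto M = range(Z)
assert np.allclose(P @ P, P)           # P^2 = P
assert np.allclose(P, P.T)             # P is self-adjoint

mu_hat = P @ Y                                  # Gauss-Markov / least-squares
beta_hat = np.linalg.inv(Z.T @ Z) @ Z.T @ Y
assert np.allclose(Z @ beta_hat, mu_hat)        # Z beta-hat = mu-hat

# beta-hat agrees with a standard least-squares solver
assert np.allclose(beta_hat, np.linalg.lstsq(Z, Y, rcond=None)[0])
```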
Again, consider the linear model in (V, (·, ·)), Y = μ + ε, where μ ∈ M, ℰε = 0, and Cov(ε) = σ²I. As usual, M is a linear subspace of V and ε is a random vector in V. Let (W, [·, ·]) be an inner product space. Motivated by the considerations in Example 4.1, consider the problem of estimating Bμ, B ∈ ℒ(V, W), by a linear unbiased estimator AY where A ∈ ℒ(V, W). That AY is an unbiased estimator of Bμ for each μ ∈ M is clearly equivalent to Aμ = Bμ for μ ∈ M since ℰAY = Aμ. Let

𝒜₁ = {A | A ∈ ℒ(V, W), Aμ = Bμ for μ ∈ M},

so AY is an unbiased estimator of Bμ, μ ∈ M, iff A ∈ 𝒜₁. The following result, which is a generalization of Theorem 4.1, shows that Bμ̂ is the Gauss-Markov estimator for Bμ in the sense that, for all A ∈ 𝒜₁,

ℰ‖AY - Bμ‖₁² ≥ ℰ‖BPY - Bμ‖₁².

Here ‖·‖₁ is the norm on the space (W, [·, ·]).
Proposition 4.1. For each A ∈ 𝒜₁,

ℰ‖AY - Bμ‖₁² ≥ ℰ‖BPY - Bμ‖₁²

where P is the orthogonal projection onto M. There is equality in this inequality iff A = BP.
Proof. The proof is very similar to the proof of Theorem 4.1. Define C ∈ ℒ(V, W) by C = A - BP and note that Cμ = Aμ - BPμ = Bμ - Bμ = 0 since A ∈ 𝒜₁ and Pμ = μ for μ ∈ M. Thus CP = 0, and this implies that BP(Y - μ) and C(Y - μ) are uncorrelated random vectors. Since these random vectors have zero means,

ℰ[BP(Y - μ), C(Y - μ)] = 0.
For A ∈ 𝒜₁,

ℰ‖AY - Bμ‖₁² = ℰ‖BP(Y - μ) + C(Y - μ)‖₁²
             = ℰ‖BP(Y - μ)‖₁² + ℰ‖C(Y - μ)‖₁²
             ≥ ℰ‖BP(Y - μ)‖₁² = ℰ‖BPY - Bμ‖₁².

This establishes the desired inequality. There is equality in this inequality iff ℰ‖C(Y - μ)‖₁² = 0. The argument used in Theorem 4.1 applies here, so there is equality iff C = A - BP = 0. □
Proposition 4.1 makes precise the statement that the Gauss-Markov estimator of a linear transformation of μ is just that linear transformation applied to the Gauss-Markov estimator of μ. In other words, the Gauss-Markov estimator of Bμ is Bμ̂ where B ∈ ℒ(V, W). There is one particular case of this that is especially interesting. When W = R, the real line, a linear transformation from V to W is just a linear functional on V. By Proposition 1.10, every linear functional on V has the form (x₀, x) for some x₀ ∈ V. Thus the Gauss-Markov estimator of (x₀, μ) is just (x₀, μ̂) = (x₀, PY) = (Px₀, Y). Further, a linear estimator of (x₀, μ), say (z, Y), is an unbiased estimator of (x₀, μ) iff (z, μ) = (x₀, μ) for all μ ∈ M. For any such vector z, Proposition 4.1 shows that

var(z, Y) ≥ var(Px₀, Y).

Thus the minimum of var(z, Y), over the class of all z's such that (z, Y) is an unbiased estimator of (x₀, μ), is achieved uniquely for z = Px₀. In particular, if x₀ ∈ M, then z = x₀ achieves the minimum variance.
In the definition of a linear model, Y = μ + ε, no distributional assumptions concerning ε were made other than the first and second moment assumptions ℰε = 0 and Cov(ε) = σ²I. One of the attractive features of Proposition 4.1 is its validity under these relatively weak assumptions. However, very little can be said concerning the distribution of μ̂ = PY other than ℰμ̂ = μ and Cov(μ̂) = σ²P. In the following example, some of the implications of assuming that ε has a normal distribution are discussed.
* Example 4.2. Consider the situation treated in Example 4.1. A coordinate random vector Y ∈ R^n has mean vector μ = Zβ where Z is a known n × k matrix of rank k (k ≤ n) and β ∈ R^k is a vector of unknown parameters. It is also assumed that Cov(Y) = σ²I_n. The Gauss-Markov estimator of μ is μ̂ = Z(Z'Z)^{-1}Z'Y. Since
β = (Z'Z)^{-1}Z'μ, Proposition 4.1 shows that the Gauss-Markov estimator of β is β̂ = (Z'Z)^{-1}Z'μ̂ = (Z'Z)^{-1}Z'Y. Now, add the assumption that Y has a normal distribution; that is, ℒ(Y) = N(μ, σ²I_n) where μ ∈ M and M is the range of Z. For this particular parametric model, we want to find a minimal sufficient statistic and the maximum likelihood estimators of the unknown parameters. The density function of Y, with respect to Lebesgue measure, is

p(y|μ, σ²) = (2πσ²)^{-n/2} exp[-(1/2σ²)‖y - μ‖²]

where y ∈ R^n, μ ∈ M, and σ² > 0. Let P denote the orthogonal projection onto M, so Q = I - P is the orthogonal projection onto M⊥. Since ‖y - μ‖² = ‖Py - μ‖² + ‖Qy‖², the density of Y can be written

p(y|μ, σ²) = (2πσ²)^{-n/2} exp[-(1/2σ²)‖Py - μ‖² - (1/2σ²)‖Qy‖²].

This shows that the pair {Py, ‖Qy‖²} is a sufficient statistic, as the density is a function of the pair {Py, ‖Qy‖²}. The normality assumption implies that PY and QY are independent random vectors as they are uncorrelated (see Proposition 3.4). Thus PY and ‖QY‖² are independent. That the pair {Py, ‖Qy‖²} is minimal sufficient and complete follows from results about exponential families (see Lehmann, 1959, Chapter 2). To find the maximum likelihood estimators of μ ∈ M and σ², the density p(y|μ, σ²) must be maximized over all values of μ ∈ M and σ². For each fixed σ² > 0,

p(y|μ, σ²) = (2πσ²)^{-n/2} exp[-(1/2σ²)‖Py - μ‖²] exp[-(1/2σ²)‖Qy‖²]
           ≤ (2πσ²)^{-n/2} exp[-(1/2σ²)‖Qy‖²]

with equality iff μ = Py. Therefore, the Gauss-Markov estimator μ̂ = PY is the maximum likelihood estimator for μ. Of course, this also shows that β̂ = (Z'Z)^{-1}Z'Y is the maximum likelihood estimator of β. To find the maximum likelihood estimator of σ², it remains to maximize

p(y|Py, σ²) = (2πσ²)^{-n/2} exp[-(1/2σ²)‖Qy‖²].
An easy differentiation argument shows that p(y|Py, σ²) is maximized at σ² equal to ‖Qy‖²/n. Thus σ̂² = ‖QY‖²/n is the maximum likelihood estimator of σ². From our previous observation, μ̂ = PY and σ̂² are independent. Since ℒ(Y) = N(μ, σ²I),

ℒ(μ̂) = ℒ(PY) = N(μ, σ²P)

and

ℒ(β̂) = ℒ((Z'Z)^{-1}Z'Y) = N(β, σ²(Z'Z)^{-1}).

Also,

ℒ(QY) = N(0, σ²Q)

since Qμ = 0 and Q² = Q = Q'. Hence, from Proposition 3.7,

ℒ(‖QY‖²/σ²) = χ²_{n-k}

since Q is a rank n - k orthogonal projection. Therefore,

ℰσ̂² = ((n - k)/n)σ².

It is common practice to replace the estimator σ̂² by the unbiased estimator

σ̃² = ‖QY‖²/(n - k).

It is clear that σ̃² is distributed as the constant σ²/(n - k) times a χ²_{n-k} random variable.
The final result of this section shows that the unbiased estimator of σ², derived in the example above, is in fact unbiased without the normality assumption. Let Y = μ + ε be a random vector in V where μ ∈ M ⊆ V, ℰε = 0, and Cov(ε) = σ²I. Given this linear model for Y, let P be the orthogonal projection onto M and set Q = I - P.
Proposition 4.2. Let n = dim V, k = dim M, and assume that k < n. Then the estimator

σ̃² = ‖QY‖²/(n - k)

is an unbiased estimator of σ².
Proof. The random vector QY has mean zero and Cov(QY) = σ²Q. By Proposition 2.21,

ℰ‖QY‖² = (I, σ²Q) = σ²(I, Q) = σ²(n - k).

The last equality follows from the observation that, for any self-adjoint operator S, (I, S) is just the sum of the eigenvalues of S. Specializing this to the projection Q yields (I, Q) = n - k. □
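Proposition 4.2 can be illustrated by simulation. The sketch below is ours; it uses uniform errors precisely because no normality is assumed, and it checks both the trace identity (I, Q) = n - k and, by Monte Carlo, that ‖QY‖²/(n - k) averages to σ²:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 3
Z = rng.normal(size=(n, k))                  # assumed rank k; M = range(Z)
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
Q = np.eye(n) - P

# (I, Q) = tr Q = n - k for the rank n - k projection Q
assert np.isclose(np.trace(Q), n - k)

# Monte Carlo: uniform errors on (-1, 1) have mean 0 and variance 1/3,
# so Cov(eps) = sigma^2 I with sigma^2 = 1/3 -- no normality anywhere.
sigma2 = 1 / 3
reps = 20000
beta = np.array([1.0, -2.0, 0.5])            # arbitrary "true" coefficients
eps = rng.uniform(-1.0, 1.0, size=(reps, n))
Y = eps + Z @ beta                           # reps simulated observation vectors
s2 = np.einsum('ij,jk,ik->i', Y, Q, Y) / (n - k)   # ||QY||^2 / (n - k)
assert abs(s2.mean() - sigma2) < 0.02        # unbiasedness, up to MC error
```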
4.2. MORE ABOUT THE GAUSS-MARKOV THEOREM
The purpose of this section is to investigate to what extent Theorem 4.1 depends on the weak sphericity assumption. In this regard, Proposition 4.1 provides some information. If we take W = V and B = I, then Proposition 4.1 implies that

ℰ‖AY - μ‖₁² ≥ ℰ‖PY - μ‖₁²

where ‖·‖₁ is the norm obtained from an inner product [·, ·]. Thus the orthogonal projection P minimizes ℰ‖AY - μ‖₁² over A ∈ 𝒜 no matter what inner product is used to measure deviations of AY from μ. The key to the proof of Theorem 4.1 is the relationship

ℰ[P(Y - μ), (A - P)(Y - μ)] = 0.

This follows from the fact that the random vectors P(Y - μ) and (A - P)(Y - μ) are uncorrelated and

ℰP(Y - μ) = ℰ(A - P)(Y - μ) = 0 for A ∈ 𝒜.

This observation is central to the presentation below. The following alternative development of linear estimation theory provides the generality needed to apply the theory to multivariate linear models.
Consider a random vector Y with values in an inner product space (V, (·, ·)) and assume that the mean vector of Y, say μ = ℰY, lies in a known regression manifold M ⊆ V. For the moment, we suppose that Cov(Y) = Σ where Σ is fixed and known (Σ is not necessarily nonsingular). As in the previous section, a linear estimator of μ, say AY, is unbiased iff

A ∈ 𝒜 = {A | Aμ = μ, μ ∈ M}.

Given any inner product [·, ·] on V, the problem is to choose A ∈ 𝒜 to
minimize

Ψ(A) = ℰ‖AY - μ‖₁² = ℰ[AY - μ, AY - μ]

where the expectation is computed under the assumption that ℰY = μ and Cov(Y) = Σ. Because of Proposition 4.1, it is reasonable to expect that the minimum of Ψ(A) occurs at a point P₀ ∈ 𝒜 where P₀ is a projection onto M along some subspace N such that M ∩ N = {0} and M + N = V. Of course, N is the null space of P₀ and the pair M, N determines P₀. To find the appropriate subspace N, write Ψ(A) as

Ψ(A) = ℰ‖AY - μ‖₁²
     = ℰ‖P₀(Y - μ) + (A - P₀)(Y - μ)‖₁²
     = ℰ‖P₀(Y - μ)‖₁² + ℰ‖(A - P₀)(Y - μ)‖₁² + 2ℰ[P₀(Y - μ), (A - P₀)(Y - μ)].

When the third term in the final expression for Ψ(A) is zero, then P₀ minimizes Ψ(A). If P₀(Y - μ) and (A - P₀)(Y - μ) are uncorrelated, the third term will be zero (shown below), so the proper choice of P₀, and hence N, will be to make P₀(Y - μ) and (A - P₀)(Y - μ) uncorrelated. Setting C = A - P₀, it follows that 𝒩(C) ⊇ M. The absence of correlation between P₀(Y - μ) and C(Y - μ) is equivalent to the condition

P₀ΣC' = 0.

Here, C' is the adjoint of C relative to the initial inner product (·, ·) on V. Since 𝒩(C) ⊇ M, we have

ℛ(C') = (𝒩(C))⊥ ⊆ M⊥

and

ℛ(ΣC') ⊆ Σ(M⊥).

The symbol ⊥ refers to the inner product (·, ·). Therefore, if the null space of P₀, namely N, is chosen so that N ⊇ Σ(M⊥), then P₀ΣC' = 0 and P₀ minimizes Ψ(A). Now it remains to clean up the technical details of the above argument. Obviously, the subspace Σ(M⊥) is going to play a role in what follows. First, a couple of preliminary results.
Proposition 4.3. Suppose Σ = Cov(Y) in (V, (·, ·)) and M is a linear subspace of V. Then:

(i) Σ(M⊥) ∩ M = {0}.
(ii) The subspace Σ(M⊥) does not depend on the inner product on V.

Proof. To prove (i), recall that the null space of Σ is

{x | (x, Σx) = 0}

since Σ is positive semidefinite. If u ∈ Σ(M⊥) ∩ M, then u = Σu₁ for some u₁ ∈ M⊥. Since Σu₁ ∈ M, (u₁, Σu₁) = 0 so u = Σu₁ = 0. Thus (i) holds.

For (ii), let [·, ·] be any other inner product on V. Then

[x, y] = (x, A₀y)

for some positive definite linear transformation A₀. The covariance transformation of Y with respect to the inner product [·, ·] is ΣA₀ (see Proposition 2.5). Further, the orthogonal complement of M relative to the inner product [·, ·] is

{y | [x, y] = 0 for all x ∈ M} = {y | (x, A₀y) = 0 for all x ∈ M}
  = {A₀^{-1}u | (x, u) = 0 for all x ∈ M} = A₀^{-1}(M⊥).

Thus (ΣA₀)(A₀^{-1}(M⊥)) = Σ(M⊥). Therefore, the image of the orthogonal complement of M under the covariance transformation of Y is the same no matter what inner product is used on V. □
Proposition 4.4. Suppose X₁ and X₂ are random vectors with values in (V, (·, ·)). If X₁ and X₂ are uncorrelated and ℰX₂ = 0, then

ℰf[X₁, X₂] = 0

for every bilinear function f defined on V × V.

Proof. Since X₁ and X₂ are uncorrelated and X₂ has mean zero, for x₁, x₂ ∈ V we have

0 = cov{(x₁, X₁), (x₂, X₂)} = ℰ(x₁, X₁)(x₂, X₂) - ℰ(x₁, X₁)ℰ(x₂, X₂)
  = ℰ(x₁, X₁)(x₂, X₂).
However, every bilinear form f on (V, (·, ·)) is given by

f[u₁, u₂] = (u₁, Bu₂)

where B ∈ ℒ(V, V). Also, every B can be written as

B = Σ_{ij} b_{ij} y_i □ y_j

where y₁, ..., y_n is a basis for V. Therefore,

ℰf[X₁, X₂] = ℰ Σ_{ij} b_{ij}(X₁, (y_i □ y_j)X₂) = Σ_{ij} b_{ij} ℰ(y_i, X₁)(y_j, X₂) = 0. □
We are now in a position to generalize Theorem 4.1. To review the assumptions: Y is a random vector in (V, (·, ·)) with ℰY = μ ∈ M and Cov(Y) = Σ. Here, M is a known subspace of V and Σ is the covariance of Y relative to the given inner product (·, ·). Let [·, ·] be another inner product on V and set

Ψ(A) = ℰ‖AY - μ‖₁²

for A ∈ 𝒜, where ‖·‖₁ is the norm defined by [·, ·].
Theorem 4.2. Let N be any subspace of V that is complementary to M and contains the subspace Σ(M⊥). Here M⊥ is the orthogonal complement of M relative to (·, ·). Let P₀ be the projection onto M along N. Then

(4.1) Ψ(A) ≥ Ψ(P₀) for A ∈ 𝒜.

If Σ is nonsingular, define a new inner product (·, ·)₁ by

(x, y)₁ = (x, Σ^{-1}y).

Then P₀ is the unique element of 𝒜 that minimizes Ψ(A). Further, P₀ is the orthogonal projection, relative to the inner product (·, ·)₁, onto M.
Proof. The existence of a subspace N ⊇ Σ(M⊥) that is complementary to M is guaranteed by Proposition 4.3. Let C ∈ ℒ(V, V) be such that 𝒩(C) ⊇ M. Therefore,

ℛ(C') = (𝒩(C))⊥ ⊆ M⊥
so

ℛ(ΣC') ⊆ Σ(M⊥).

This implies that

P₀ΣC' = 0

since 𝒩(P₀) = N ⊇ Σ(M⊥). However, the condition P₀ΣC' = 0 is equivalent to the condition that P₀(Y - μ) and C(Y - μ) are uncorrelated.

With these preliminaries out of the way, consider A ∈ 𝒜 and let C = A - P₀ so 𝒩(C) ⊇ M. Thus

Ψ(A) = ℰ‖A(Y - μ)‖₁² = ℰ‖P₀(Y - μ) + C(Y - μ)‖₁²
     = ℰ‖P₀(Y - μ)‖₁² + ℰ‖C(Y - μ)‖₁² + 2ℰ[P₀(Y - μ), C(Y - μ)]
     = ℰ‖P₀(Y - μ)‖₁² + ℰ‖C(Y - μ)‖₁².

The last equality follows by applying Proposition 4.4 to P₀(Y - μ) and C(Y - μ). Therefore,

Ψ(A) = Ψ(P₀) + ℰ‖C(Y - μ)‖₁²

so P₀ minimizes Ψ over A ∈ 𝒜.

Now, assume that Σ is nonsingular. Then the subspace N is uniquely defined (N = Σ(M⊥)) since dim(Σ(M⊥)) = dim(M⊥) and M + Σ(M⊥) = V. Therefore, P₀ is uniquely defined as its range and null space have been specified. To show that P₀ uniquely minimizes Ψ, for A ∈ 𝒜 we have

Ψ(A) = Ψ(P₀) + ℰ‖C(Y - μ)‖₁²

where C = A - P₀. Thus Ψ(A) ≥ Ψ(P₀) with equality iff

ℰ‖C(Y - μ)‖₁² = 0.

This expectation is zero iff C(Y - μ) = 0 (a.e.), and this happens iff the covariance transformation of C(Y - μ) is zero in some (and hence every) inner product. But in the inner product (·, ·),

Cov(C(Y - μ)) = CΣC'

and this is zero iff C = 0 as Σ is nonsingular. Therefore, P₀ is the unique
minimizer of Ψ. For the last assertion, let N₁ be the orthogonal complement of M relative to the inner product (·,·)₁. Then,

N₁ = {y | (x, y)₁ = 0 for all x ∈ M} = {y | (x, Σ⁻¹y) = 0 for all x ∈ M}
   = {Σy | (x, y) = 0 for all x ∈ M} = Σ(M⊥).

Since 𝒩(P₀) = Σ(M⊥), it follows that P₀ is the orthogonal projection onto M relative to (·,·)₁. □
In all of the applications of Theorem 4.2 in this book, the covariance of Y is nonsingular. Thus the projection P₀ is unique and μ̂ = P₀Y is called the Gauss-Markov estimator of μ ∈ M. In the context of Theorem 4.2, if Cov(Y) = σ²Σ where Σ is known and nonsingular and σ² > 0 is unknown, then P₀Y is still the Gauss-Markov estimator for μ ∈ M since (σ²Σ)(M⊥) = Σ(M⊥) for each σ² > 0. That is, the presence of an unknown scale parameter σ² does not affect the projection P₀. Thus P₀ still minimizes Ψ for each fixed σ² > 0.
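The coordinate form of this projection is worth recording. When V = Rⁿ and M is the column space of an n × k matrix Z of rank k, the projection onto M along Σ(M⊥) is P₀ = Z(Z'Σ⁻¹Z)⁻¹Z'Σ⁻¹ (the generalized least-squares form; see Problem 9). The following NumPy sketch, with a randomly generated Z and Σ standing in for a concrete model, checks the defining properties and the scale invariance noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
Z = rng.standard_normal((n, k))          # M = column space of Z
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)          # known positive definite Sigma

def gm_projection(Z, Sigma):
    """Projection onto M = range(Z) along Sigma(M-perp):
    the coordinate form P0 = Z (Z' Sigma^{-1} Z)^{-1} Z' Sigma^{-1}."""
    Si = np.linalg.inv(Sigma)
    return Z @ np.linalg.inv(Z.T @ Si @ Z) @ Z.T @ Si

P0 = gm_projection(Z, Sigma)

# P0 is a (generally non-orthogonal) projection: idempotent, identity on M.
assert np.allclose(P0 @ P0, P0)
assert np.allclose(P0 @ Z, Z)

# The null space of P0 contains Sigma(M-perp): for v with Z'v = 0,
# P0 (Sigma v) = 0.
v = np.linalg.qr(Z, mode="complete")[0][:, k:] @ rng.standard_normal(n - k)
assert np.allclose(Z.T @ v, 0)
assert np.allclose(P0 @ (Sigma @ v), 0)

# An unknown scale sigma^2 does not change P0.
assert np.allclose(gm_projection(Z, 3.7 * Sigma), P0)
```

The last assertion is the point made in the text: replacing Σ by σ²Σ leaves P₀ unchanged.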
Consider a random vector Y taking values in (V, (·,·)) with EY = μ ∈ M and

Cov(Y) = σ²Σ, σ² > 0.

Here, Σ is assumed known and positive definite while σ² > 0 is unknown. Theorem 4.2 implies that the Gauss-Markov estimator of μ is μ̂ = P₀Y where P₀ is the projection onto M along Σ(M⊥). Recall that the least-squares estimator of μ is PY where P is the orthogonal projection onto M in the given inner product, that is, P is the projection onto M along M⊥.
Proposition 4.5. The Gauss-Markov and least-squares estimators of μ are the same iff Σ(M) ⊆ M.

Proof. Since P₀ and P are both projections onto M, P₀Y = PY iff P₀ and P have the same null space; that is, the Gauss-Markov and least-squares estimators are the same iff

Σ(M⊥) = M⊥.

Since Σ is nonsingular and self-adjoint, this condition is equivalent to the condition Σ(M) ⊆ M. □
The above result shows that if Σ(M) ⊆ M, we are free to compute either P or P₀ to find μ̂. The implications of this observation become clearer in the next section.
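Proposition 4.5 is easy to illustrate numerically. In the NumPy sketch below (an arbitrary design matrix, not an example from the text), covariances of the form Σ = I + cZZ' satisfy Σ(M) ⊆ M for M the column space of Z, so the Gauss-Markov and least-squares projections coincide, while a generic Σ breaks the agreement:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 7, 3
Z = rng.standard_normal((n, k))                  # M = range(Z)
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T             # orthogonal projection onto M

# A covariance with Sigma(M) contained in M: Sigma = I + 2 Z Z'.
Sigma = np.eye(n) + 2.0 * Z @ Z.T
Si = np.linalg.inv(Sigma)
P0 = Z @ np.linalg.inv(Z.T @ Si @ Z) @ Z.T @ Si  # Gauss-Markov projection

assert np.allclose(P0, P)                        # the two estimators agree

# A generic Sigma violates Sigma(M) ⊆ M and the projections differ.
B = rng.standard_normal((n, n))
Sigma2 = B @ B.T + np.eye(n)
Si2 = np.linalg.inv(Sigma2)
P1 = Z @ np.linalg.inv(Z.T @ Si2 @ Z) @ Z.T @ Si2
assert not np.allclose(P1, P)
```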
4.3. GENERALIZED LINEAR MODELS
First, consider the linear model introduced in Section 4.2. The random vector Y in (V, (·,·)) has a mean vector μ ∈ M where M is a subspace of V and Cov(Y) = σ²Σ. Here, Σ is a fixed positive definite linear transformation and σ² > 0. The essential features of this linear model are: (i) the mean vector of Y is assumed to be an element of a known subspace M, and (ii) the covariance of Y is an element of the set {σ²Σ | σ² > 0}. The assumption concerning the mean vector of Y is not especially restrictive since no special assumptions have been made about the subspace M. However, the covariance structure of Y is quite restricted. The set {σ²Σ | σ² > 0} is an open half line from 0 ∈ ℬ(V, V) through the point Σ ∈ ℬ(V, V), so the set of possible covariances for Y is a one-dimensional set. It is this assumption concerning the covariance of Y that we want to modify so that linear models become general enough to include certain models in multivariate analysis. In particular, we would like to discuss Example 3.2 within the framework of linear models.

Now, let M be a fixed subspace of (V, (·,·)) and let γ be an arbitrary set of positive definite linear transformations on V to V. We say that {M, γ} is the parameter set of a linear model for Y if EY = μ ∈ M and Cov(Y) ∈ γ. For a general parameter set {M, γ}, not much can be said about a linear model for Y. In order to restrict the class of parameter sets under consideration, we now turn to the question of the existence of Gauss-Markov estimators (to be defined below) for μ. As in Section 4.1, let

𝒜 = {A | A ∈ ℬ(V, V), Aμ = μ for μ ∈ M}.

Thus a linear transformation of Y is an unbiased estimator of μ ∈ M iff it has the form AY for A ∈ 𝒜. The following definition is motivated by Theorem 4.2.
Definition 4.2. Let {M, γ} be the parameter set of a linear model for Y. For A₀ ∈ 𝒜, A₀Y is a Gauss-Markov estimator of μ iff

E_Σ‖AY − μ‖² ≥ E_Σ‖A₀Y − μ‖²

for all A ∈ 𝒜 and Σ ∈ γ. The subscript Σ on the expectation means that the expectation is computed when Cov(Y) = Σ.
When γ = {σ²I | σ² > 0}, Theorem 4.1 establishes the existence and uniqueness of a Gauss-Markov estimator for μ. More generally, when γ = {σ²Σ | σ² > 0}, Theorem 4.2 shows that the Gauss-Markov estimator for μ is P₁Y where P₁ is the orthogonal projection onto M relative to the inner product (·,·)₁ given by

(x, y)₁ = (x, Σ⁻¹y), x, y ∈ V.

The problem of the existence of a Gauss-Markov estimator for general γ is taken up in the next paragraph.
Suppose that {M, γ} is the parameter set for a linear model for Y. Consider a fixed element Σ₁ ∈ γ, and let (·,·)₁ be the inner product on V defined by

(x, y)₁ = (x, Σ₁⁻¹y), x, y ∈ V.

As asserted in Theorem 4.2, the unique element in 𝒜 that minimizes E_Σ₁‖AY − μ‖² is P₁, the orthogonal projection onto M relative to (·,·)₁. Thus if a Gauss-Markov estimator A₀Y exists according to Definition 4.2, A₀ must be P₁. However, exactly the same argument applies for Σ₂ ∈ γ, so A₀ must be P₂, the orthogonal projection onto M relative to the inner product defined by Σ₂. These two projections are the same iff Σ₁(M⊥) = Σ₂(M⊥); see Theorem 4.2. Since Σ₁ and Σ₂ were arbitrary elements of γ, the conclusion is that a Gauss-Markov estimator can exist iff Σ₁(M⊥) = Σ₂(M⊥) for all Σ₁, Σ₂ ∈ γ. Summarizing this leads to the following.
Proposition 4.6. Suppose that {M, γ} is the parameter set of a linear model for Y in (V, (·,·)). Let Σ₁ be a fixed element of γ. A Gauss-Markov estimator of μ exists iff

Σ(M⊥) = Σ₁(M⊥) for all Σ ∈ γ.

When a Gauss-Markov estimator of μ exists, it is μ̂ = PY where P is the orthogonal projection onto M relative to any inner product [·,·] given by [x, y] = (x, Σ⁻¹y) for some Σ ∈ γ.

Proof. It has been argued that a Gauss-Markov estimator for μ can exist iff Σ₁(M⊥) = Σ₂(M⊥) for all Σ₁, Σ₂ ∈ γ. This is clearly equivalent to Σ(M⊥) = Σ₁(M⊥) for all Σ ∈ γ. The second assertion follows from the observation that when Σ(M⊥) = Σ₁(M⊥), all the projections onto M, relative to the inner products determined by elements of γ, are the same. That μ̂ = PY is a consequence of Theorem 4.2. □
An interesting special case of Proposition 4.6 occurs when I ∈ γ. In this case, choose Σ₁ = I, so a Gauss-Markov estimator exists iff Σ(M⊥) = M⊥ for all Σ ∈ γ. This is clearly equivalent to Σ(M) = M for all Σ ∈ γ, which is equivalent to the condition

Σ(M) ⊆ M for all Σ ∈ γ

since each Σ ∈ γ is nonsingular. It is this condition that is verified in the examples that follow.
* Example 4.3. As motivation for the discussion of the general multivariate linear model, we first consider the multivariate version of the k-sample situation. Suppose the Xᵢⱼ, j = 1,…,nᵢ and i = 1,…,k, are random vectors in Rᵖ. It is assumed that EXᵢⱼ = μᵢ, Cov(Xᵢⱼ) = Σ, and different random vectors are uncorrelated. Form the random matrix X whose first n₁ rows are X₁ⱼ', j = 1,…,n₁, whose next n₂ rows are X₂ⱼ', j = 1,…,n₂, and so on. Then X is a random vector in (ℒ_{p,n}, ⟨·,·⟩) where n = n₁ + ⋯ + nₖ. It was argued in the discussion following Proposition 2.18 that

Cov(X) = Iₙ ⊗ Σ

relative to the inner product ⟨·,·⟩ on ℒ_{p,n}. The mean of X, say μ = EX, is an n × p matrix whose first n₁ rows are all μ₁', whose next n₂ rows are all μ₂', and so on. Let B be the k × p matrix with rows μ₁',…,μₖ'. Thus the mean of X can be written μ = ZB where Z is an n × k matrix with the following structure: the first column of Z consists of n₁ ones followed by n − n₁ zeroes, the second column of Z consists of n₁ zeroes followed by n₂ ones followed by n − n₁ − n₂ zeroes, and so on. Define the linear subspace M of ℒ_{p,n} by

M = {μ | μ = ZB, B ∈ ℒ_{p,k}}

so M is the range of Z ⊗ I_p as a linear transformation from ℒ_{p,k} to ℒ_{p,n}. Further, set

γ = {Iₙ ⊗ Σ | Σ ∈ 𝒮_p, Σ positive definite}

and note that γ is a set of positive definite linear transformations on ℒ_{p,n} to ℒ_{p,n}. Therefore, EX ∈ M and Cov(X) ∈ γ, and {M, γ} is a parameter set for a linear model for X. Since Iₙ ⊗ I_p is the identity
linear transformation on ℒ_{p,n} and Iₙ ⊗ I_p ∈ γ, to show that a Gauss-Markov estimator for μ ∈ M exists, it is sufficient to verify that, if x ∈ M, then (Iₙ ⊗ Σ)x ∈ M. For x ∈ M, x = ZB for some B ∈ ℒ_{p,k}. Therefore,

(Iₙ ⊗ Σ)(ZB) = ZBΣ = (Z ⊗ I_p)(BΣ),

which is an element of M. Thus M is invariant under each element of γ so a Gauss-Markov estimator for μ exists. Since the identity is an element of γ, the Gauss-Markov estimator is just the orthogonal projection of X onto M relative to the given inner product ⟨·,·⟩. To find this projection, we argue as in Example 4.1. The regression subspace M is the range of Z ⊗ I_p and, clearly, Z has rank k. Let

P = (Z ⊗ I_p)[(Z ⊗ I_p)'(Z ⊗ I_p)]⁻¹(Z ⊗ I_p)'
  = (Z ⊗ I_p)[(Z'Z)⁻¹ ⊗ I_p](Z' ⊗ I_p)
  = Z(Z'Z)⁻¹Z' ⊗ I_p,

which is an orthogonal projection; see Proposition 1.28. To verify that P is the orthogonal projection onto M, it suffices to show that the range of P is M. For any x ∈ ℒ_{p,n},

Px = (Z(Z'Z)⁻¹Z' ⊗ I_p)x = (Z ⊗ I_p)[((Z'Z)⁻¹Z' ⊗ I_p)x],

which is an element of M since ((Z'Z)⁻¹Z' ⊗ I_p)x ∈ ℒ_{p,k}. However, if x ∈ M, then x = ZB and Px = P(ZB) = ZB; that is, P is the identity on M. Hence, the range of P is M and the Gauss-Markov estimator of μ is

μ̂ = PX = Z(Z'Z)⁻¹Z'X.

Since μ = ZB,

B = (Z'Z)⁻¹Z'μ = ((Z'Z)⁻¹Z' ⊗ I_p)μ

and, by Proposition 4.1,

B̂ = ((Z'Z)⁻¹Z' ⊗ I_p)X = (Z'Z)⁻¹Z'X

is the Gauss-Markov estimator of the matrix B. Further, E(B̂) = B
and

Cov(B̂) = Cov[((Z'Z)⁻¹Z' ⊗ I_p)X]
        = ((Z'Z)⁻¹Z' ⊗ I_p)(Iₙ ⊗ Σ)((Z'Z)⁻¹Z' ⊗ I_p)'
        = ((Z'Z)⁻¹Z'Z(Z'Z)⁻¹) ⊗ Σ = (Z'Z)⁻¹ ⊗ Σ.

For the particular matrix Z, Z'Z is a k × k diagonal matrix with diagonal entries n₁,…,nₖ so (Z'Z)⁻¹ is diagonal with diagonal elements n₁⁻¹,…,nₖ⁻¹. A bit of calculation shows that the matrix B̂ = (Z'Z)⁻¹Z'X has rows X̄₁',…,X̄ₖ' where

X̄ᵢ = nᵢ⁻¹ Σⱼ Xᵢⱼ

is the sample mean in the ith sample. Thus the Gauss-Markov estimator of the ith mean μᵢ is X̄ᵢ, i = 1,…,k.
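The closing computation of the example is easy to check numerically. The NumPy sketch below (with arbitrary group sizes n₁, n₂, n₃ chosen purely for illustration) builds the indicator matrix Z, verifies that Z'Z is diagonal with entries nᵢ, and confirms that the rows of B̂ = (Z'Z)⁻¹Z'X are the sample means:

```python
import numpy as np

rng = np.random.default_rng(2)
ns = [3, 2, 4]                       # group sizes n_1, n_2, n_3
k, p = len(ns), 2
n = sum(ns)

# Design matrix Z of zeros and ones indicating sample membership.
Z = np.zeros((n, k))
row = 0
for i, ni in enumerate(ns):
    Z[row:row + ni, i] = 1.0
    row += ni

X = rng.standard_normal((n, p))      # stacked observations, one row per X_ij

# Z'Z is diagonal with entries n_1, ..., n_k.
assert np.allclose(Z.T @ Z, np.diag(ns))

# Gauss-Markov estimator of B: its rows are the sample means.
B_hat = np.linalg.inv(Z.T @ Z) @ Z.T @ X
means = np.array([X[Z[:, i] == 1.0].mean(axis=0) for i in range(k)])
assert np.allclose(B_hat, means)
```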
It is fairly clear that the explicit form of the matrix Z in the previous
example did not play a role in proving that a Gauss-Markov estimator for
the mean vector exists. This observation leads quite naturally to what is
usually called the general linear model of multivariate analysis. After
introducing this model in the next example, we then discuss the implications of adding the assumption of normality.
* Example 4.4 (Multivariate General Linear Model). As in Example 4.3, consider a random matrix X in (ℒ_{p,n}, ⟨·,·⟩) and assume that (i) EX = ZB where Z is a known n × k matrix of rank k and B is a k × p matrix of parameters, and (ii) Cov(X) = Iₙ ⊗ Σ where Σ is a p × p positive definite matrix; that is, the rows of X are uncorrelated and each row of X has covariance matrix Σ. It is clear we have simply abstracted the essential features of the linear model in Example 4.3 into assumptions for the linear model of this example. The similarity between the current example and Example 4.1 should also be noted. Each component of the observation vector in Example 4.1 has become a vector, the parameter vector has become a matrix, and the rows of the observation matrix are still uncorrelated. Of course, the rows of the observation vector in Example 4.1 are just scalars. For the example at hand, it is clear that

M = {μ | μ = ZB, B ∈ ℒ_{p,k}}
is a subspace of ℒ_{p,n} and is the range of Z ⊗ I_p. Setting

γ = {Iₙ ⊗ Σ | Σ is a p × p positive definite matrix},

{M, γ} is the parameter set of a linear model for X. More specifically, the linear model for X is that EX = μ ∈ M and Cov(X) ∈ γ. Just as in Example 4.3, M is invariant under each element of γ so a Gauss-Markov estimator of μ = EX exists and is PX where

P = Z(Z'Z)⁻¹Z' ⊗ I_p

is the orthogonal projection onto M relative to ⟨·,·⟩. Mimicking the argument given in Example 4.3 yields

B̂ = (Z'Z)⁻¹Z'X = ((Z'Z)⁻¹Z' ⊗ I_p)X

and

Cov(B̂) = (Z'Z)⁻¹ ⊗ Σ.

In addition to the linear model assumptions for X, we now assume that ℒ(X) = N(ZB, Iₙ ⊗ Σ) so X has a normal distribution in (ℒ_{p,n}, ⟨·,·⟩). As in Example 4.2, a discussion of sufficient statistics and maximum likelihood estimators follows. The density function of X with respect to Lebesgue measure is

p(x | μ, Σ) = (2π)^{−np/2} |Σ|^{−n/2} exp[−½⟨x − μ, (Iₙ ⊗ Σ⁻¹)(x − μ)⟩],

as discussed in Chapter 3. Let P₀ = Z(Z'Z)⁻¹Z' and Q₀ = Iₙ − P₀, so P = P₀ ⊗ I_p is the orthogonal projection onto M and Q = Q₀ ⊗ I_p is the orthogonal projection onto M⊥. Note that both P and Q commute with Iₙ ⊗ Σ for any Σ. Since μ ∈ M, we have

⟨x − μ, (Iₙ ⊗ Σ⁻¹)(x − μ)⟩
  = ⟨P(x − μ) + Qx, (Iₙ ⊗ Σ⁻¹)(P(x − μ) + Qx)⟩
  = ⟨P(x − μ), (Iₙ ⊗ Σ⁻¹)P(x − μ)⟩ + ⟨Qx, (Iₙ ⊗ Σ⁻¹)Qx⟩

because ⟨Qx, (Iₙ ⊗ Σ⁻¹)P(x − μ)⟩ = ⟨x, Q(Iₙ ⊗ Σ⁻¹)P(x − μ)⟩
= 0 since Q(Iₙ ⊗ Σ⁻¹)P = QP(Iₙ ⊗ Σ⁻¹) = 0. However,

⟨Qx, (Iₙ ⊗ Σ⁻¹)Qx⟩ = ⟨x, Q(Iₙ ⊗ Σ⁻¹)Qx⟩ = ⟨x, Q(Iₙ ⊗ Σ⁻¹)x⟩
  = ⟨x, (Q₀ ⊗ Σ⁻¹)x⟩ = tr(xΣ⁻¹x'Q₀) = tr(x'Q₀xΣ⁻¹).

Thus

⟨x − μ, (Iₙ ⊗ Σ⁻¹)(x − μ)⟩ = ⟨Px − μ, (Iₙ ⊗ Σ⁻¹)(Px − μ)⟩ + tr(x'Q₀xΣ⁻¹).

Therefore, the density p(x | μ, Σ) is a function of the pair (Px, x'Q₀x) so the pair {Px, x'Q₀x} is sufficient. That this pair is minimal sufficient and complete for the parametric family {p(· | μ, Σ); μ ∈ M, Σ positive definite} follows from exponential family theory. Since P(Iₙ ⊗ Σ)Q = PQ(Iₙ ⊗ Σ) = 0, the random vectors PX and QX are independent. Also, X'Q₀X = (QX)'(QX) so the random vectors PX and X'Q₀X are independent. In other words, {PX, X'Q₀X} is a sufficient statistic and PX and X'Q₀X are independent. To derive the maximum likelihood estimator of μ ∈ M, fix Σ. Then

p(x | μ, Σ) = (2π)^{−np/2} |Σ|^{−n/2}
  × exp[−½⟨Px − μ, (Iₙ ⊗ Σ⁻¹)(Px − μ)⟩ − ½ tr(x'Q₀xΣ⁻¹)]
  ≤ (2π)^{−np/2} |Σ|^{−n/2} exp[−½ tr(x'Q₀xΣ⁻¹)]

with equality iff μ = Px. Thus the maximum likelihood estimator of μ is μ̂ = PX, which is also the Gauss-Markov and least-squares estimator of μ. It follows immediately that

B̂ = (Z'Z)⁻¹Z'X

is the maximum likelihood estimator of B, and

ℒ(B̂) = N(B, (Z'Z)⁻¹ ⊗ Σ).
To find the maximum likelihood estimator of Σ, the function

p(x | μ̂, Σ) = (2π)^{−np/2} |Σ|^{−n/2} exp[−½ tr(x'Q₀xΣ⁻¹)]

must be maximized over all p × p positive definite matrices Σ. When x'Q₀x is positive definite, this maximum occurs uniquely at

Σ̂ = (1/n) x'Q₀x,

so the maximum likelihood estimator of Σ is stochastically independent of μ̂. A proof that Σ̂ is the maximum likelihood estimator of Σ and a derivation of the distribution of Σ̂ are deferred until later.
The principal result of this chapter, Proposition 4.6, gives necessary and sufficient conditions on the parameter set {M, γ} of a linear model in order that the Gauss-Markov estimator of μ ∈ M exist. Many of the classical parametric models in multivariate analysis are in fact linear models with a parameter set {M, γ} such that there is a Gauss-Markov estimator for μ ∈ M. For such models, the additional assumption of normality implies that μ̂ is also the maximum likelihood estimator of μ, and the estimation of μ is relatively easy if we are satisfied with the maximum likelihood estimator. For the time being, let us agree that the problem of estimating μ has been solved in these models. However, very little has been said about the estimation of the covariance other than in Example 4.4. To be specific, assume ℒ(X) = N(μ, Σ) where μ ∈ M ⊆ (V, (·,·)) and {M, γ} is the parameter set of this linear model for X. Assume that I ∈ γ and μ̂ = PX is the Gauss-Markov estimator for μ, so Σ(M) = M for all Σ ∈ γ. Here, P is the orthogonal projection onto M in the given inner product on V. It follows immediately from Proposition 4.6 that μ̂ = PX is also the maximum likelihood estimator of μ ∈ M. Substituting μ̂ into the density of X yields

p(x | μ̂, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp[−½(Qx, Σ⁻¹Qx)]

where n = dim V and Q = I − P is the orthogonal projection onto M⊥. Thus to find the maximum likelihood estimator of Σ ∈ γ, we must compute

sup_{Σ ∈ γ} p(x | μ̂, Σ) = p(x | μ̂, Σ̂),

assuming that the supremum is attained at a point Σ̂ ∈ γ. Although many
examples of explicit sets γ are known where Σ̂ is not too difficult to find, general conditions on γ that yield an explicit Σ̂ are not available. This overview of the maximum likelihood estimation problem in linear models where Gauss-Markov estimators exist has been given to provide the reader with a general framework in which to view many of the estimation and testing problems to be discussed in later chapters.
PROBLEMS
1. Let Z be an n × k matrix (not necessarily of full rank) so Z defines a linear transformation from Rᵏ to Rⁿ. Let M be the range of Z and let z₁,…,zₖ be the columns of Z.
(i) Show that M = span{z₁,…,zₖ}.
(ii) Show that Z(Z'Z)⁻Z' is the orthogonal projection onto M, where (Z'Z)⁻ is the generalized inverse of Z'Z.
2. Suppose X₁,…,Xₙ are i.i.d. from a density p(x | β) = f(x − β) where f is a symmetric density on R¹ and ∫x²f(x) dx = 1. Here, β is an unknown translation parameter. Let X ∈ Rⁿ have coordinates X₁,…,Xₙ.
(i) Show that ℒ(X) = ℒ(βe + ε) where ε₁,…,εₙ are i.i.d. with density f. Show that EX = βe and Cov(X) = Iₙ.
(ii) Based on (i), find the Gauss-Markov estimator of β.
(iii) Let U be the vector of order statistics for X (U₁ ≤ U₂ ≤ ⋯ ≤ Uₙ) so ℒ(U) = ℒ(βe + v) where v is the vector of order statistics of ε. Show that E(U) = βe + a₀ where a₀ = Ev is a known vector (f is assumed known), and Cov(U) = Σ₀ = Cov(v) where Σ₀ is also known. Thus ℒ(U − a₀) = ℒ(βe + (v − a₀)) where E(v − a₀) = 0 and Cov(v − a₀) = Σ₀. Based on this linear model, find the Gauss-Markov estimator for β.
(iv) How do these two estimators of β compare?
3. Consider the linear model Y = μ + ε where μ ∈ M, Eε = 0, and Cov(ε) = σ²Iₙ. At times, a submodel of this model is of interest. In particular, assume μ ∈ ω where ω is a linear subspace of M.
(i) Let M − ω = {x | x ∈ M, x ⊥ ω}. Show that M − ω = M ∩ ω⊥.
(ii) Show that P_M − P_ω is the orthogonal projection onto M − ω and verify that ‖(P_M − P_ω)x‖² = ‖P_M x‖² − ‖P_ω x‖².
4. For this problem, we use the notation of Problem 1.15. Consider the subspaces of R^{IJ} given by

M₀ = {y | yᵢⱼ = yₖₗ for all i, j, k, l}
M₁ = {y | yᵢⱼ = yᵢₖ for all j, k; i = 1,…,I}
M₂ = {y | yᵢⱼ = yₖⱼ for all i, k; j = 1,…,J}.

(i) Show that ℛ(A) = M₀, ℛ(B₁) = M₁ − M₀, and ℛ(B₂) = M₂ − M₀.
Let M₃ be the range of B₃.
(ii) Show that R^{IJ} = M₀ ⊕ (M₁ − M₀) ⊕ (M₂ − M₀) ⊕ M₃.
(iii) Show that a vector μ is in M = M₀ ⊕ (M₁ − M₀) ⊕ (M₂ − M₀) iff μ can be written as μᵢⱼ = α + βᵢ + γⱼ, i = 1,…,I, j = 1,…,J, where α, βᵢ, and γⱼ are scalars that satisfy Σᵢβᵢ = Σⱼγⱼ = 0.
5. (The F-test.) Most of the classical hypothesis testing problems in regression analysis or ANOVA can be described as follows. A linear model Y = μ + ε, μ ∈ M, Eε = 0, and Cov(ε) = σ²I is given in (V, (·,·)). A subspace ω of M (ω ≠ M) is given and the problem is to test H₀: μ ∈ ω versus H₁: μ ∉ ω, μ ∈ M. Assume that ℒ(Y) = N(μ, σ²I) in (V, (·,·)).
(i) Show that the likelihood ratio test of H₀ versus H₁ rejects for large values of F = ‖P_{M−ω}Y‖²/‖Q_M Y‖² where Q_M = I − P_M.
(ii) Under H₀, show that F is distributed as the ratio of two independent chi-squared variables.
6. In the notation of Problem 4, consider Y ∈ R^{IJ} with EY = μ ∈ M (M is given in (iii) of Problem 4). Under the assumption of normality, use the results of Problem 5 to show that the F-test for testing H₀: β₁ = β₂ = ⋯ = β_I = 0 rejects for large values of

J Σᵢ (Ȳᵢ· − Ȳ··)² / Σᵢ Σⱼ (Yᵢⱼ − Ȳᵢ· − Ȳ·ⱼ + Ȳ··)².

Identify ω for this problem.
7. (The normal equations.) Suppose the elements of the regression subspace M ⊆ Rⁿ are given by μ = Xβ where X is n × k and β ∈ Rᵏ. Given an observation vector y, the problem is to find P_M y. The
equations (in β)

(4.2) X'y = X'Xβ, β ∈ Rᵏ

are often called the normal equations.
(i) Show that (4.2) always has a solution b ∈ Rᵏ.
(ii) If b is any solution to (4.2), show that Xb = P_M y.
8. For Y ∈ Rⁿ, assume μ = EY ∈ M and Cov(Y) ∈ γ where γ = {Σ | Σ = αP_e + βQ_e, α > 0, β > 0}. As usual, e is the vector of ones, P_e is the orthogonal projection onto span{e}, and Q_e = I − P_e.
(i) If e ∈ M or e ∈ M⊥, show that the Gauss-Markov and least-squares estimators for μ are the same for each α and β.
(ii) If e ∉ M and e ∉ M⊥, show that there are values of α and β so that the least-squares and Gauss-Markov estimators of μ differ.
(iii) If ℒ(Y) = N(μ, Σ) with Σ ∈ γ and M ⊆ (span{e})⊥ (M ≠ (span{e})⊥), find the maximum likelihood estimates for μ, α, and β. What happens when M = span{e}?
9. In the linear model Y = Xβ + ε on Rⁿ with X: n × k of full rank, Eε = 0, and Cov(ε) = σ²Σ (Σ is positive definite and known), show that μ̂ = X(X'Σ⁻¹X)⁻¹X'Σ⁻¹Y and β̂ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹Y.
10. (Invariance in the simple linear model.) In (V, (·,·)), suppose that {M, γ} is the parameter set for a linear model for Y where γ = {Σ | Σ = σ²I, σ² > 0}. Thus EY = μ ∈ M and Cov(Y) ∈ γ. This problem has to do with the invariance of this linear model under affine transformations:
(i) If Γ ∈ 𝒪(V) satisfies Γ(M) ⊆ M, show that Γ'(M) ⊆ M.
Let 𝒪_M(V) be those Γ ∈ 𝒪(V) that satisfy Γ(M) ⊆ M.
(ii) For x₀ ∈ M, c > 0, and Γ ∈ 𝒪_M(V), define the function (c, Γ, x₀) on V to V by (c, Γ, x₀)y = cΓy + x₀. Show that this function is one-to-one and onto and find the inverse of this function. Show that this function maps M onto M.
(iii) Let Ỹ = (c, Γ, x₀)Y. Show that EỸ ∈ M and Cov(Ỹ) ∈ γ. Thus {M, γ} is the parameter set for Ỹ and we say that the linear model for Y is invariant under the transformation (c, Γ, x₀). Since EY = μ, it follows that EỸ = (c, Γ, x₀)μ for μ ∈ M. If t(Y) (t maps V into M) is any point estimator for μ, then it seems plausible to use t(Ỹ) as a point estimator for (c, Γ, x₀)μ = cΓμ + x₀. Solving for μ, it then seems plausible to use c⁻¹Γ'(t(Ỹ) − x₀) as a point estimator for μ. Equating these estimators of μ leads to t(Y) = c⁻¹Γ'(t(cΓY +
x₀) − x₀) or

(4.3) t(cΓY + x₀) = cΓt(Y) + x₀.

An estimator that satisfies (4.3) for all c > 0, Γ ∈ 𝒪_M(V), and x₀ ∈ M is called equivariant.
(iv) Show that t₀(Y) = P_M Y is equivariant.
(v) Show that if t maps V into M and satisfies the equation t(ΓY + x₀) = Γt(Y) + x₀ for all Γ ∈ 𝒪_M(V) and x₀ ∈ M, then t(Y) = P_M Y.
11. Consider U ∈ Rⁿ and V ∈ Rⁿ and assume ℒ(U) = N(Z₁β₁, σ₁₁Iₙ) and ℒ(V) = N(Z₂β₂, σ₂₂Iₙ). Here, Zᵢ is n × k of rank k and βᵢ ∈ Rᵏ is an unknown vector of parameters, i = 1, 2. Also, σᵢᵢ > 0 is unknown, i = 1, 2. Now, let X = (U, V): n × 2 so μ = EX has first column Z₁β₁ and second column Z₂β₂.
(i) When U and V are independent, then Cov(X) = Iₙ ⊗ Δ where

Δ = ( σ₁₁   0
       0   σ₂₂ ).

In this case, show that the Gauss-Markov and least-squares estimates for μ are the same. Further, show that the Gauss-Markov estimates for β₁ and β₂ are the same as what we obtain by treating the two regression problems separately.
(ii) Now, suppose Cov(X) = Iₙ ⊗ Σ where

Σ = ( σ₁₁  σ₁₂
      σ₁₂  σ₂₂ )

is positive definite and unknown. For general Z₁ and Z₂, show that the regression subspace of X is not invariant under all Iₙ ⊗ Σ, so the Gauss-Markov and least-squares estimators are not the same in general. However, if Z₁ = Z₂, show that the results given in Example 4.4 apply directly.
(iii) If the column space of Z₁ equals the column space of Z₂, show that the Gauss-Markov and least-squares estimators of μ are the same for each Iₙ ⊗ Σ.
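Several of these problems lend themselves to numerical experiments, Problem 7 in particular. In the NumPy sketch below (with a deliberately rank-deficient X chosen for illustration), a least-squares solution b satisfies the normal equations (4.2), and Xb equals P_M y computed from the pseudoinverse, illustrating part (ii):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 4
X = rng.standard_normal((n, k))
X[:, 3] = X[:, 0] + X[:, 1]        # make X rank deficient (rank 3)
y = rng.standard_normal(n)

# A least-squares solution b solves the normal equations X'X b = X'y.
b = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(X.T @ X @ b, X.T @ y)

# Orthogonal projection onto M = range(X) via the pseudoinverse;
# any solution of the normal equations gives the same fitted vector.
PM = X @ np.linalg.pinv(X)
assert np.allclose(X @ b, PM @ y)
```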
NOTES AND REFERENCES
1. Scheffé (1959) contains a coordinate account of what might be called univariate linear model theory. The material in the first section here follows Kruskal (1961) most closely.
2. The result of Proposition 4.5 is due to Kruskal (1968).
3. Proposition 4.3 suggests that a theory of best linear unbiased estimation
can be developed in vector spaces without inner products (i.e., dual
spaces are not identified with the vector space via the inner product). For a version of such a theory, see Eaton (1978).
4. The arguments used in Section 4.3 were used in Eaton (1970) to help answer the following question. Given X ∈ ℒ_{p,n} with Cov(X) = Iₙ ⊗ Σ where Σ is unknown but positive definite, for what subspaces M does there exist a Gauss-Markov estimator for μ ∈ M? In other words, with γ as in Example 4.4, for what M's can the parameter set {M, γ} admit a Gauss-Markov estimator? The answer to this question is that M must have the form of the subspaces considered in Example 4.4. Further details and other examples can be found in Eaton (1970).
CHAPTER 5
Matrix Factorizations
and Jacobians
This chapter contains a collection of results concerning the factorization of matrices and the Jacobians of certain transformations on Euclidean spaces. The factorizations and Jacobians established here do have some intrinsic interest. Rather than interrupt the flow of later material to present these results, we have chosen to collect them together for easy reference. The reader is asked to mentally file the results and await their application in future chapters.
5.1. MATRIX FACTORIZATIONS
We begin by fixing some notation. As usual, Rⁿ denotes n-dimensional coordinate space and ℒ_{m,n} is the space of n × m real matrices. The linear space of n × n symmetric real matrices, a subspace of ℒ_{n,n}, is denoted by 𝒮ₙ. If S ∈ 𝒮ₙ, we write S > 0 to mean S is positive definite and S ≥ 0 to mean that S is positive semidefinite.

Recall that ℱ_{p,n} is the set of all n × p linear isometries of Rᵖ into Rⁿ, that is, Ψ ∈ ℱ_{p,n} iff Ψ'Ψ = I_p. Also, if T ∈ ℒ_{n,n}, then T = {tᵢⱼ} is lower triangular if tᵢⱼ = 0 for i < j. The set of all n × n lower triangular matrices with tᵢᵢ > 0, i = 1,…,n, is denoted by G_T. The dependence of G_T on the dimension n is usually clear from context. A matrix U ∈ ℒ_{n,n} is upper triangular if U' is lower triangular, and G_U denotes the set of all n × n upper triangular matrices with positive diagonal elements.
Our first result shows that G_T and G_U are closed under matrix multiplication and matrix inverse. In other words, G_T and G_U are groups of matrices with the group operation being matrix multiplication.
Proposition 5.1. If T = {tᵢⱼ} ∈ G_T, then T⁻¹ ∈ G_T and the ith diagonal element of T⁻¹ is 1/tᵢᵢ, i = 1,…,n. If T₁ and T₂ ∈ G_T, then T₁T₂ ∈ G_T.

Proof. To prove the first assertion, we proceed by induction on n. Assume the result is true for integers 1, 2,…, n − 1. When T is n × n, partition T as

T = ( T₁₁   0
      T₂₁  tₙₙ )

where T₁₁ is (n − 1) × (n − 1), T₂₁ is 1 × (n − 1), and tₙₙ is the (n, n) diagonal element of T. In order to be T⁻¹, the matrix

A = ( A₁₁   0
      A₂₁  aₙₙ )

must satisfy the equation TA = Iₙ. Thus

TA = ( T₁₁A₁₁                 0
       T₂₁A₁₁ + tₙₙA₂₁    tₙₙaₙₙ ) = Iₙ

so A₁₁ = T₁₁⁻¹, aₙₙ = 1/tₙₙ, and

A₂₁ = −(1/tₙₙ) T₂₁T₁₁⁻¹.

The induction hypothesis implies that T₁₁⁻¹ is lower triangular with diagonal elements 1/tᵢᵢ, i = 1,…, n − 1. Thus the first assertion holds. The second assertion follows easily from the definition of matrix multiplication. □
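Proposition 5.1 is easy to confirm numerically. The following NumPy sketch (with randomly generated elements of G_T, an illustrative choice rather than anything from the text) checks that the inverse of a lower triangular matrix with positive diagonal is again of that form, with reciprocal diagonal entries, and that G_T is closed under products:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4

def random_GT(n, rng):
    """A random element of G_T: lower triangular, positive diagonal."""
    T = np.tril(rng.standard_normal((n, n)))
    np.fill_diagonal(T, np.abs(T.diagonal()) + 1.0)
    return T

T = random_GT(n, rng)
Tinv = np.linalg.inv(T)

# T^{-1} is again lower triangular, with diagonal entries 1/t_ii.
assert np.allclose(np.triu(Tinv, 1), 0.0)
assert np.allclose(Tinv.diagonal(), 1.0 / T.diagonal())

# G_T is closed under multiplication.
T1, T2 = random_GT(n, rng), random_GT(n, rng)
prod = T1 @ T2
assert np.allclose(np.triu(prod, 1), 0.0)
assert np.all(prod.diagonal() > 0)
```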
Arguing in exactly the same way, G_U is closed under matrix inverse and matrix multiplication. The first factorization result in this chapter is next.
Proposition 5.2. Suppose A ∈ ℒ_{p,n} where p ≤ n and A has rank p. Then A = ΨU where Ψ ∈ ℱ_{p,n} and U ∈ G_U is p × p. Further, Ψ and U are unique.

Proof. The idea of the proof is to apply the Gram-Schmidt orthogonalization procedure to the columns of the matrix A. Let a₁,…,a_p be the
columns of A so aᵢ ∈ Rⁿ, i = 1,…,p. Since A is of rank p, the vectors a₁,…,a_p are linearly independent. Let {b₁,…,b_p} be the orthonormal set of vectors obtained by applying the Gram-Schmidt process to a₁,…,a_p in the order 1, 2,…, p. Thus the matrix Ψ with columns b₁,…,b_p is an element of ℱ_{p,n} as Ψ'Ψ = I_p. Since span{a₁,…,aᵢ} = span{b₁,…,bᵢ} for i = 1,…,p, bⱼ'aᵢ = 0 if j > i, and an examination of the Gram-Schmidt process shows that bᵢ'aᵢ > 0 for i = 1,…,p. Thus the matrix U = Ψ'A is an element of G_U, and

ΨU = ΨΨ'A.

But ΨΨ' is the orthogonal projection onto span{b₁,…,b_p} = span{a₁,…,a_p} so ΨΨ'A = A, as ΨΨ' is the identity transformation on its range. This establishes the first assertion. For the uniqueness of Ψ and U, assume that A = Ψ₁U₁ for Ψ₁ ∈ ℱ_{p,n} and U₁ ∈ G_U. Then Ψ₁U₁ = ΨU, which implies that Ψ'Ψ₁ = UU₁⁻¹. Since A is of rank p, U₁ must have rank p so ℛ(A) = ℛ(Ψ₁) = ℛ(Ψ). Therefore, ΨΨ'Ψ₁ = Ψ₁ since ΨΨ' is the orthogonal projection onto its range. Thus Ψ₁'ΨΨ'Ψ₁ = I_p; that is, Ψ'Ψ₁ is a p × p orthogonal matrix. Therefore, UU₁⁻¹ = Ψ'Ψ₁ is an orthogonal matrix and UU₁⁻¹ ∈ G_U. However, a bit of reflection shows that the only matrix that is both orthogonal and an element of G_U is I_p. Thus U = U₁ so Ψ = Ψ₁ as U has rank p. □
The main statistical application of Proposition 5.2 is the decomposition of the random matrix Y discussed in Example 2.3. This decomposition is used to give a derivation of the Wishart density function and, under certain assumptions on the distribution of Y = ΨU, it can be proved that Ψ and U are independent. The above decomposition also has some numerical applications. For example, the proof of Proposition 5.2 shows that if A = ΨU, then the orthogonal projection onto the range of A is ΨΨ' = A(A'A)⁻¹A'. Hence this projection can be computed without computing (A'A)⁻¹. Also, if p = n and A = ΨU, then A⁻¹ = U⁻¹Ψ'. Thus to compute A⁻¹, we need only compute U⁻¹, and this computation can be done iteratively, as the proof of Proposition 5.1 shows.
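In numerical practice the decomposition A = ΨU of Proposition 5.2 is a QR factorization. A short NumPy sketch (using np.linalg.qr together with a sign fix to force positive diagonal entries, since the library does not guarantee them):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 6, 3
A = rng.standard_normal((n, p))     # full column rank with probability 1

def psi_u(A):
    """A = Psi U with Psi'Psi = I_p and U upper triangular with positive
    diagonal. np.linalg.qr gives A = QR; flip column/row signs so that the
    diagonal of R becomes positive."""
    Q, R = np.linalg.qr(A)          # reduced QR factorization
    s = np.sign(R.diagonal())
    return Q * s, s[:, None] * R

Psi, U = psi_u(A)
assert np.allclose(Psi @ U, A)
assert np.allclose(Psi.T @ Psi, np.eye(p))     # Psi is a linear isometry
assert np.allclose(np.tril(U, -1), 0.0)        # U is upper triangular
assert np.all(U.diagonal() > 0)                # U is in G_U

# Psi Psi' = A (A'A)^{-1} A', the orthogonal projection onto range(A).
assert np.allclose(Psi @ Psi.T, A @ np.linalg.inv(A.T @ A) @ A.T)
```

The last assertion is the numerical remark made above: the projection onto the range of A is obtained from Ψ alone, without inverting A'A.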
Our next decomposition result establishes a one-to-one correspondence between positive definite matrices and elements of G_T. First, a property of positive definite matrices is needed.
Proposition 5.3. For S ∈ 𝒮_p and S > 0, partition S as

S = ( S₁₁  S₁₂
      S₂₁  S₂₂ )
where S₁₁ and S₂₂ are both square matrices. Then S₁₁, S₂₂, S₁₁ − S₁₂S₂₂⁻¹S₂₁, and S₂₂ − S₂₁S₁₁⁻¹S₁₂ are all positive definite.

Proof. For x ∈ Rᵖ, partition x into y and z to be conformable with the partition of S. Then, for x ≠ 0,

0 < x'Sx = y'S₁₁y + 2z'S₂₁y + z'S₂₂z.

For y ≠ 0 and z = 0, x ≠ 0 so y'S₁₁y > 0, which shows that S₁₁ > 0. Similarly, S₂₂ > 0. For y ≠ 0 and z = −S₂₂⁻¹S₂₁y,

0 < x'Sx = y'(S₁₁ − S₁₂S₂₂⁻¹S₂₁)y,

which shows that S₁₁ − S₁₂S₂₂⁻¹S₂₁ > 0. Similarly, S₂₂ − S₂₁S₁₁⁻¹S₁₂ > 0. □
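A numerical spot check of Proposition 5.3 (NumPy, with a random S > 0 and an arbitrary partition point standing in for a concrete matrix):

```python
import numpy as np

rng = np.random.default_rng(7)
p, q = 5, 2                       # partition p into blocks of size q, p - q
B = rng.standard_normal((p, p))
S = B @ B.T + np.eye(p)           # S > 0

S11, S12 = S[:q, :q], S[:q, q:]
S21, S22 = S[q:, :q], S[q:, q:]

def is_pd(M):
    """Positive definiteness via the smallest eigenvalue."""
    return np.min(np.linalg.eigvalsh(M)) > 0

# Both diagonal blocks and both Schur complements are positive definite.
assert is_pd(S11) and is_pd(S22)
assert is_pd(S11 - S12 @ np.linalg.inv(S22) @ S21)
assert is_pd(S22 - S21 @ np.linalg.inv(S11) @ S12)
```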
Proposition 5.4. If S > 0, then S = TT' for a unique element T ∈ G_T.
Proof. First, we establish the existence of T and then prove it is unique. The proof is by induction on dimension. If S ∈ S_p with S > 0, partition S as

S = ( S11  S12 )
    ( S21  S22 )

where S11 is (p - 1) x (p - 1) and S22 ∈ (0, ∞). By the induction hypothesis, S11 = T11 T11' for some T11 ∈ G_T. Consider the equation

( S11  S12 )   ( T11  0   ) ( T11  0   )'
( S21  S22 ) = ( T21  T22 ) ( T21  T22 )

which is to be solved for T21: 1 x (p - 1) and T22 ∈ (0, ∞). This leads to the two equations T21 T11' = S21 and T21 T21' + T22^2 = S22. Thus T21 = S21 (T11')^{-1}, so

S22 = T22^2 + S21 (T11')^{-1} T11^{-1} S12 = T22^2 + S21 (T11 T11')^{-1} S12 = T22^2 + S21 S11^{-1} S12.

Therefore, T22^2 = S22 - S21 S11^{-1} S12, which is positive by Proposition 5.3. Hence, T22 = (S22 - S21 S11^{-1} S12)^{1/2} is the solution for T22 > 0. This shows that S = TT' for some T ∈ G_T. For uniqueness, if S = TT' = T1 T1', then T1^{-1} TT' (T1^{-1})' = I_p so T1^{-1} T is an orthogonal matrix. But T1^{-1} T ∈ G_T and the only matrix that is both orthogonal and in G_T is I_p. Hence, T1^{-1} T = I_p and uniqueness follows. □
Let S_p^+ denote the set of p x p positive definite matrices. Proposition 5.4 shows that the function F: G_T → S_p^+ defined by F(T) = TT' is both one-to-one and onto. Of course, the existence of F^{-1}: S_p^+ → G_T is also part of the content of Proposition 5.4. For T1 ∈ G_T, the uniqueness part of Proposition 5.4 yields F^{-1}(T1 S T1') = T1 F^{-1}(S). This relationship is used later in this chapter. It is clear that the above result holds with G_T replaced by G_U. In other words, every S ∈ S_p^+ has a unique decomposition S = UU' for U ∈ G_U.
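The inductive construction in the proof of Proposition 5.4 is exactly the Cholesky algorithm, and can be sketched in a few lines of Python (an editorial illustration, not part of the text):

```python
import math

def cholesky_lower(S):
    """Return the unique lower triangular T with positive diagonal, S = TT'.

    Mirrors the proof of Proposition 5.4: each row of T is solved for in
    turn, with the diagonal entry taken as a positive square root.
    """
    p = len(S)
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(T[i][k] * T[j][k] for k in range(j))
            if i == j:
                # T_ii^2 = S_ii - sum of squares already accounted for
                T[i][j] = math.sqrt(S[i][i] - s)
            else:
                T[i][j] = (S[i][j] - s) / T[j][j]
    return T
```

For S = [[4, 2], [2, 3]] this produces T = [[2, 0], [1, sqrt(2)]], and TT' recovers S.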
Proposition 5.5. Suppose A ∈ L_{p,n} where p ≤ n and A has rank p. Then A = ΨS where Ψ ∈ F_{p,n} is a linear isometry and S is positive definite. Furthermore, Ψ and S are unique.
Proof. Since A has rank p, A'A has rank p and is positive definite. Let S be the positive definite square root of A'A, so A'A = SS. From Proposition 1.31, there exists a linear isometry Ψ ∈ F_{p,n} such that A = ΨS. To establish the uniqueness of Ψ and S, suppose that A = ΨS = Ψ1 S1 where Ψ, Ψ1 ∈ F_{p,n} and S and S1 are both positive definite. Then R(A) = R(Ψ) = R(Ψ1). As in the proof of Proposition 5.2, this implies that ΨΨ'Ψ1 = Ψ1 since ΨΨ' is the orthogonal projection onto R(Ψ) = R(Ψ1). Therefore, S S1^{-1} = Ψ'Ψ1 is a p x p orthogonal matrix, so the eigenvalues of S S1^{-1} are all on the unit circle in the complex plane. But the eigenvalues of S S1^{-1} are the same as the eigenvalues of S^{1/2} S1^{-1} S^{1/2} (see Proposition 1.39) where S^{1/2} is the positive definite square root of S. Since S^{1/2} S1^{-1} S^{1/2} is positive definite, its eigenvalues are all positive. Therefore, the eigenvalues of S^{1/2} S1^{-1} S^{1/2} must all be equal to one, as this is the only point of intersection of (0, ∞) with the unit circle in the complex plane. Since the only symmetric p x p matrix with all eigenvalues equal to one is the identity, S^{1/2} S1^{-1} S^{1/2} = I_p so S = S1. Since S is nonsingular, Ψ = Ψ1. □
The factorizations established thus far were concerned with writing one matrix as the product of two other matrices with special properties. The results below are concerned with factorizations for two or more matrices simultaneously. Statistical applications of these factorizations occur in later chapters.
Proposition 5.6. Suppose A is a p x p positive definite matrix and B is a p x p symmetric matrix. There exists a nonsingular p x p matrix C and a p x p diagonal matrix D such that A = CC' and B = CDC'. The diagonal elements of D are the eigenvalues of A^{-1}B.

Proof. Let A^{1/2} be the positive definite square root of A and A^{-1/2} = (A^{1/2})^{-1}. By the spectral theorem for matrices, there exists a p x p orthogonal matrix Γ such that Γ'A^{-1/2} B A^{-1/2} Γ = D is diagonal (see Proposition 1.45), and the eigenvalues of A^{-1/2} B A^{-1/2} are the diagonal elements of D. Let C = A^{1/2} Γ. Then CC' = A^{1/2} ΓΓ' A^{1/2} = A and CDC' = B. Since the eigenvalues of A^{-1/2} B A^{-1/2} are the same as the eigenvalues of A^{-1}B, the proof is complete. □
Proposition 5.7. Suppose S is a p x p positive definite matrix and partition S as

S = ( S11  S12 )
    ( S12' S22 )

where S11 is p1 x p1 and S22 is p2 x p2 with p1 ≤ p2. Then there exist nonsingular matrices Aii of dimension pi x pi, i = 1, 2, such that Aii Sii Aii' = I_{pi}, i = 1, 2, and A11 S12 A22' = (D 0) where D is a p1 x p1 diagonal matrix and 0 is a p1 x (p2 - p1) matrix of zeroes. The diagonal elements of D^2 are the eigenvalues of S11^{-1} S12 S22^{-1} S21 where S21 = S12', and these eigenvalues are all in the interval [0, 1].
Proof. Since S is positive definite, S11 and S22 are positive definite. Let S11^{1/2} and S22^{1/2} be the positive definite square roots of S11 and S22. Using Proposition 1.46, write the matrix S11^{-1/2} S12 S22^{-1/2} in the form

S11^{-1/2} S12 S22^{-1/2} = Γ D Ψ

where Γ is a p1 x p1 orthogonal matrix, D is a p1 x p1 diagonal matrix, and Ψ is a p1 x p2 linear isometry. The p1 rows of Ψ form an orthonormal set in R^{p2}, and p2 - p1 orthonormal vectors can be adjoined to Ψ to obtain a p2 x p2 orthogonal matrix Ψ1 whose first p1 rows are the rows of Ψ. It is clear that

D Ψ = (D 0) Ψ1

where 0 is a p1 x (p2 - p1) matrix of zeroes. Set A11 = Γ' S11^{-1/2} and A22 = Ψ1 S22^{-1/2} so Aii Sii Aii' = I_{pi} for i = 1, 2. Obviously, A11 S12 A22' = (D 0). Since S11^{-1/2} S12 S22^{-1/2} = Γ D Ψ,

S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} = Γ D^2 Γ'

so the eigenvalues of S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} are the diagonal elements of D^2. Since the eigenvalues of S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} are the same as the eigenvalues of S11^{-1} S12 S22^{-1} S21, it remains to show that these eigenvalues are in [0, 1]. By Proposition 5.3, S11 - S12 S22^{-1} S21 is positive definite so I_{p1} - S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} is positive definite. Thus for x ∈ R^{p1},

0 ≤ x' S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} x ≤ x'x,

which implies that (see Proposition 1.44) the eigenvalues of S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} are in the interval [0, 1]. □
It is shown later that the eigenvalues of S11^{-1} S12 S22^{-1} S21 are related to the angles between two subspaces of R^p. However, it is also shown that these eigenvalues have a direct statistical interpretation in terms of correlation coefficients, and this establishes the connection between canonical correlation coefficients and angles between subspaces. The final decomposition result in this section provides a useful tool for evaluating integrals over the space of p x p positive definite matrices.
Proposition 5.8. Let S_p^+ denote the space of p x p positive definite matrices. For S ∈ S_p^+, partition S as

S = ( S11  S12 )
    ( S21  S22 )

where Sii is pi x pi, i = 1, 2, S12 is p1 x p2, and S21 = S12'. The function f defined on S_p^+ to S_{p1}^+ x S_{p2}^+ x L_{p2,p1} by

f(S) = (S11 - S12 S22^{-1} S21, S22, S12 S22^{-1})

is a one-to-one onto function. The function h on S_{p1}^+ x S_{p2}^+ x L_{p2,p1} to S_p^+ given by

h(A11, A22, A12) = ( A11 + A12 A22 A12'   A12 A22 )
                   ( A22 A12'             A22     )

is the inverse of f.

Proof. It is routine to verify that f ∘ h is the identity function on S_{p1}^+ x S_{p2}^+ x L_{p2,p1} and h ∘ f is the identity function on S_p^+. This implies the assertions of the proposition. □
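In the scalar-block case p1 = p2 = 1, the maps f and h of Proposition 5.8 reduce to formulas on triples of real numbers: f sends (s11, s12, s22) to (s11 - s12^2/s22, s22, s12/s22), and h rebuilds the matrix entries. This makes the inverse relationship easy to verify directly (an editorial Python sketch; the starting entries are arbitrary):

```python
def f(s11, s12, s22):
    """f of Proposition 5.8 in the scalar case p1 = p2 = 1."""
    return (s11 - s12 * s12 / s22, s22, s12 / s22)

def h(a11, a22, a12):
    """Its inverse h: rebuilds the entries (s11, s12, s22)."""
    return (a11 + a12 * a22 * a12, a12 * a22, a22)
```

Starting from the positive definite matrix with entries s11 = 3, s12 = 1, s22 = 2, f gives (2.5, 2.0, 0.5) and h maps that triple straight back.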
5.2. JACOBIANS
Jacobians provide the basic technical tool for describing how multivariate integrals over open subsets of R^n transform under a change of variable. To describe the situation more precisely, let B0 and B1 be fixed open subsets of R^n and let g be a one-to-one onto mapping from B0 to B1. Recall that the differential of g, assuming it exists, is a function Dg defined on B0 that takes values in L_{n,n} and satisfies

lim_{δ→0} ||g(x + δ) - g(x) - Dg(x)δ|| / ||δ|| = 0

for each x ∈ B0. Here δ is a vector in R^n chosen small enough so that x + δ ∈ B0. Also, Dg(x)δ is the matrix Dg(x) applied to the vector δ, and ||·|| denotes the standard norm on R^n. Let g1,..., gn denote the coordinate functions of the vector-valued function g. It is well known that the matrix Dg(x) is given by

Dg(x) = ( ∂gi(x)/∂xj ), x ∈ B0.

In other words, the (i, j) element of the matrix Dg(x) is the partial derivative of gi with respect to xj evaluated at x ∈ B0. The Jacobian of g is defined by

Jg(x) = |det Dg(x)|, x ∈ B0
so the Jacobian is the absolute value of the determinant of Dg. A formal
statement of the change of variables theorem goes as follows. Consider any real valued Borel measurable function f defined on the open set B1 such that
|If (y)I dy < + oo
JB,
where dy means Lebesgue measure. Introduce the change of variables y = g(x), x E Bo in the integral JB, f(y) dy. Then the change of variables
theorem asserts that
(5.1) f (y) dy = f (g(x))Jg(x) dx. I1B
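In one dimension, (5.1) is the familiar substitution rule, and it can be checked numerically. The sketch below (an editorial illustration; the choices of g, f, and the intervals are arbitrary) takes g(x) = x^2 on B0 = (1, 2), so that B1 = (1, 4) and Jg(x) = |2x|, and compares both sides of (5.1) for f(y) = 1/y with a midpoint rule:

```python
import math

def midpoint(fn, a, b, n=100000):
    """Midpoint-rule approximation of the integral of fn over (a, b)."""
    h = (b - a) / n
    return sum(fn(a + (k + 0.5) * h) for k in range(n)) * h

# y = g(x) = x^2 maps B0 = (1, 2) one-to-one onto B1 = (1, 4); Jg(x) = |2x|.
lhs = midpoint(lambda y: 1.0 / y, 1.0, 4.0)                       # ∫_{B1} f(y) dy
rhs = midpoint(lambda x: (1.0 / (x * x)) * abs(2 * x), 1.0, 2.0)  # ∫_{B0} f(g(x)) Jg(x) dx
```

Both integrals equal log 4, and the two numerical values agree to high accuracy.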
An alternative way to express (5.1) is by the formal expression

(5.2)    d(g(x)) = Jg(x) dx, x ∈ B0.

To give a precise meaning to (5.2), proceed as follows. For each Borel measurable function h defined on B0 such that ∫_{B0} |h(x)| Jg(x) dx < +∞, define

I1(h) = ∫_{B0} h(x) Jg(x) dx,

and define

I2(h) = ∫_{B0} h(x) d(g(x)) = ∫_{g(B0)} h(g^{-1}(x)) dx.

Then (5.2) means that I1(h) = I2(h) for all h such that I1(|h|) < +∞. To show that (5.1) and the equality of I1 and I2 are equivalent, simply set f = h ∘ g^{-1} so f ∘ g = h. Then I1(h) = I2(h) iff

∫_{B0} f(g(x)) Jg(x) dx = ∫_{B1} f(x) dx
since B1 = g(B0).

One property of Jacobians that is often useful in simplifying computations is the following. Let B0, B1, and B2 be open subsets of R^n, suppose g1 is a one-to-one onto map from B0 to B1, and suppose Dg1 exists. Also, suppose g2 is a one-to-one onto map from B1 to B2 and assume that Dg2 exists. Then g2 ∘ g1 is a one-to-one onto map from B0 to B2 and it is not difficult to show that

D(g2 ∘ g1)(x) = Dg2(g1(x)) Dg1(x), x ∈ B0.

Of course, the right-hand side of this equality means the matrix product of Dg2(g1(x)) and Dg1(x). From this equality, it follows that

J(g2 ∘ g1)(x) = Jg2(g1(x)) Jg1(x), x ∈ B0.

In particular, if B2 = B0 and g2 = g1^{-1}, then g2 ∘ g1 = g1^{-1} ∘ g1 is the identity function on B0 so its Jacobian is one. Thus

1 = J(g2 ∘ g1)(x) = Jg2(g1(x)) Jg1(x), x ∈ B0
and

Jg1^{-1}(y) = 1 / Jg1(g1^{-1}(y)), y ∈ B1.
We now turn to the problem of explicitly computing some Jacobians that are needed later. The first few results present Jacobians for linear transformations.
Proposition 5.9. Let A be an n x n nonsingular matrix and define g on R^n to R^n by g(x) = Ax. Then Jg(x) = |det(A)| for x ∈ R^n.

Proof. We must compute the differential matrix of g. It is clear that the ith coordinate function of g is gi where

gi(x) = Σ_{k=1}^n a_ik x_k.

Here A = (a_ij) and x has coordinates x1,..., xn. Thus

∂gi(x)/∂xj = a_ij

so Dg(x) = (a_ij). Thus Jg(x) = |det(A)|. □
Proposition 5.10. Let A be an n x n nonsingular matrix and let B be a p x p nonsingular matrix. Define g on the np-dimensional coordinate space L_{p,n} to L_{p,n} by

g(X) = AXB' = (A ⊗ B)X.

Then Jg(X) = |det A|^p |det B|^n.

Proof. First note that A ⊗ B = (In ⊗ B)(A ⊗ Ip). Setting g1(X) = (A ⊗ Ip)X and g2(X) = (In ⊗ B)X, it is sufficient to verify that

Jg1(X) = |det A|^p

and

Jg2(X) = |det B|^n.
Let x1,..., xp be the columns of the n x p matrix X so xi ∈ R^n. Form the np-dimensional vector

[X] = ( x1 )
      ( x2 )
      ( ⋮  )
      ( xp )

Since (A ⊗ Ip)X has columns Ax1,..., Axp, the matrix of A ⊗ Ip as a linear transformation on [X] is the (np) x (np) block diagonal matrix

( A           )
(    A        )
(       ⋱     )
(          A  )

where the elements not indicated are zero. Clearly, the determinant of this matrix is (det A)^p since A occurs p times on the diagonal. Since the determinant of a linear transformation is independent of the matrix representation, we have that

det(A ⊗ Ip) = (det A)^p.

Applying Proposition 5.9, it follows that

Jg1(X) = |det A|^p.

Using the rows of X instead of the columns, we find that

det(In ⊗ B) = (det B)^n,

so

Jg2(X) = |det B|^n. □
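The determinant identity behind Proposition 5.10, det(A ⊗ B) = (det A)^p (det B)^n, can be verified numerically for small matrices. The Python sketch below (an editorial illustration; the 2 x 2 matrices are arbitrary, so here n = p = 2) forms the Kronecker product explicitly and compares determinants:

```python
def det(M):
    """Determinant by Laplace expansion (adequate for small matrices)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    return sum(((-1) ** j) * M[0][j] *
               det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def kron(A, B):
    """Kronecker product A ⊗ B for matrices given as lists of lists."""
    return [[A[i][j] * B[k][l]
             for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

A = [[1.0, 2.0], [3.0, 4.0]]   # det A = -2
B = [[2.0, 1.0], [0.0, 3.0]]   # det B = 6
lhs = det(kron(A, B))
rhs = (det(A) ** 2) * (det(B) ** 2)   # (det A)^p (det B)^n with n = p = 2
```

Both sides equal 144, matching (det A)^2 (det B)^2 = 4 · 36.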
Proposition 5.11. Let A be a p x p nonsingular matrix and define the function g on the linear space S_p of p x p real symmetric matrices by

g(S) = ASA' = (A ⊗ A)S.

Then Jg(S) = |det A|^{p+1}.
Proof. The result of the previous proposition shows that det(A ⊗ A) = (det A)^{2p} when A ⊗ A is regarded as a linear transformation on L_{p,p}. However, this result is not applicable to the current case since we are considering the restriction of A ⊗ A to the subspace S_p of L_{p,p}.

To establish the present result, write A = Γ1 D Γ2 where Γ1 and Γ2 are p x p orthogonal matrices and D is a diagonal matrix with positive diagonal elements (see Proposition 1.47). Then

ASA' = (A ⊗ A)S = (Γ1 ⊗ Γ1)(D ⊗ D)(Γ2 ⊗ Γ2)S

so the linear transformation A ⊗ A has been decomposed into the composition of three linear transformations, two of which are determined by orthogonal matrices.

We now claim that if Γ is a p x p orthogonal matrix and g1 is defined on S_p by

g1(S) = ΓSΓ' = (Γ ⊗ Γ)S,

then Jg1 = 1. To see this, let <·,·> be the natural inner product on L_{p,p} restricted to S_p, that is, let

<S1, S2> = tr S1 S2.

Then

<(Γ ⊗ Γ)S1, (Γ ⊗ Γ)S2> = tr ΓS1Γ'ΓS2Γ' = tr ΓS1S2Γ' = tr Γ'ΓS1S2 = tr S1S2 = <S1, S2>.

Therefore, Γ ⊗ Γ is an orthogonal transformation on the inner product space (S_p, <·,·>), so the determinant of this linear transformation on S_p is ±1. Thus g1 is a linear transformation that is also orthogonal so Jg1 = 1 and the claim is established.

The next claim is that if D is a p x p diagonal matrix with positive diagonal elements and g2 is defined on S_p by

g2(S) = DSD,

then Jg2 = (det D)^{p+1}. In the [p(p + 1)/2]-dimensional space S_p, let s_ij, 1 ≤ j ≤ i ≤ p, denote the coordinates of S. Then it is routine to show that the (i, j) coordinate function of g2 is g2,ij(S) = λi λj s_ij where λ1,..., λp are the diagonal elements of D. Thus the matrix of the linear transformation g2 is a [p(p + 1)/2] x [p(p + 1)/2] diagonal matrix with diagonal entries λi λj for 1 ≤ j ≤ i ≤ p. Hence the determinant of this matrix is the product of the λi λj for 1 ≤ j ≤ i ≤ p. A bit of calculation shows this determinant is (Π λi)^{p+1}, since each λk appears with total exponent k + (p - k + 1) = p + 1. Since det D = Π λi, the second claim is established.

To complete the proof, note that

g(S) = ASA' = (Γ1 ⊗ Γ1)(D ⊗ D)(Γ2 ⊗ Γ2)S = h1(h2(h3(S)))

where h1(S) = (Γ1 ⊗ Γ1)S, h2(S) = (D ⊗ D)S, and h3(S) = (Γ2 ⊗ Γ2)S. A direct argument shows that

J(h1 ∘ h2 ∘ h3)(S) = Jh1(h2(h3(S))) J(h2 ∘ h3)(S) = Jh1(h2(h3(S))) Jh2(h3(S)) Jh3(S).

But Jh1 = 1 = Jh3 and Jh2 = (det D)^{p+1}. Since A = Γ1 D Γ2, |det A| = det D, which entails Jg = |det A|^{p+1}. □
Proposition 5.12. Let M_p be the linear space of p x p skew-symmetric matrices and define g on M_p to M_p by

g(S) = ASA'

where A is a p x p nonsingular matrix. Then Jg(S) = |det A|^{p-1}.

Proof. The proof is similar to that of Proposition 5.11 and is left to the reader. □
Proposition 5.13. Let G_T be the set of p x p lower triangular matrices with positive diagonal elements and let A be a fixed element of G_T. The function g defined on G_T to G_T by

g(T) = AT, T ∈ G_T

has Jacobian given by Jg(T) = Π_{i=1}^p a_ii^i, where a_11,..., a_pp are the diagonal elements of A.

Proof. The set G_T is an open subset of [p(p + 1)/2]-dimensional coordinate space and g is a one-to-one onto function by Proposition 5.1. For T ∈ G_T, form the vector [T] with coordinates t_11, t_21, t_22, t_31,..., t_pp and write the coordinate functions of g in the same order. Then the matrix of partial derivatives is lower triangular with diagonal elements a_11, a_22, a_22, a_33,..., a_pp, where a_ii occurs i times on the diagonal. Thus the determinant of this matrix of partial derivatives is Π_{i=1}^p a_ii^i, so Jg(T) = Π_{i=1}^p a_ii^i. □
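Proposition 5.13 can be checked by forming the matrix of the linear map T → AT on the p(p + 1)/2 lower triangular coordinates and computing its determinant directly (an editorial Python sketch; the matrix A below is an arbitrary element of G_T):

```python
def det(M):
    """Determinant by Laplace expansion (adequate for small matrices)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    return sum(((-1) ** j) * M[0][j] *
               det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def jacobian_left_mult(A):
    """det of T -> A T as a linear map on p x p lower triangular matrices."""
    p = len(A)
    idx = [(i, j) for i in range(p) for j in range(i + 1)]  # t11, t21, t22, ...
    dim = len(idx)
    cols = []
    for (i, j) in idx:
        # apply the map to the basis element E with a 1 in position (i, j)
        E = [[0.0] * p for _ in range(p)]
        E[i][j] = 1.0
        AE = [[sum(A[r][t] * E[t][c] for t in range(p)) for c in range(p)]
              for r in range(p)]
        cols.append([AE[r][c] for (r, c) in idx])  # coordinates of the image
    M = [[cols[c][r] for c in range(dim)] for r in range(dim)]
    return det(M)

A = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [0.0, 1.0, 4.0]]
# Proposition 5.13 predicts 2^1 * 3^2 * 4^3 = 1152 for this A.
```

The computed determinant is 1152, agreeing with Π a_ii^i.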
Proposition 5.14. In the notation of Proposition 5.13, define g on G_T to G_T by

g(T) = TB, T ∈ G_T

where B is a fixed element of G_T. Then Jg(T) = Π_{i=1}^p b_ii^{p-i+1}, where b_11,..., b_pp are the diagonal elements of B.

Proof. The proof is similar to that of Proposition 5.13 and is omitted. □
Proposition 5.15. Let G_U be the set of all p x p upper triangular matrices with positive diagonal elements. For fixed elements A and B of G_U, define g by

g(U) = AUB, U ∈ G_U.

Then

Jg(U) = Π_{i=1}^p a_ii^{p-i+1} Π_{i=1}^p b_ii^i

where a_11,..., a_pp and b_11,..., b_pp are the diagonal elements of A and B.

Proof. The proof is similar to that given for lower triangular matrices and is left to the reader. □
Thus far, only Jacobians of linear transformations have been computed explicitly, and, of course, these Jacobians have been constant functions. In the next proposition, the Jacobian of the nonlinear transformation described in Proposition 5.8 is computed.
Proposition 5.16. Let p1 and p2 be positive integers and set p = p1 + p2. Using the notation of Proposition 5.8, define h on S_{p1}^+ x S_{p2}^+ x L_{p2,p1} to S_p^+ by

h(A11, A22, A12) = ( A11 + A12 A22 A12'   A12 A22 )
                   ( A22 A12'             A22     ).

Then Jh(A11, A22, A12) = (det A22)^{p1}.
Proof. For notational convenience, set S = h(A11, A22, A12) and partition S as

S = ( S11  S12 )
    ( S12' S22 )

where Sij is pi x pj, i, j = 1, 2. The partial derivatives of the elements of S, as functions of the elements of A11, A12, and A22, need to be computed. Since S11 = A11 + A12 A22 A12', the matrix of partial derivatives of the p1(p1 + 1)/2 elements of S11 with respect to the p1(p1 + 1)/2 elements of A11 is just the [p1(p1 + 1)/2]-dimensional identity matrix. Since S12 = A12 A22, the matrix of partial derivatives of the p1 p2 elements of S12 with respect to the elements of A11 is a zero matrix. Also, since S22 = A22, the partial derivatives of the elements of S22 with respect to the elements of A11 or A12 are all zero, and the matrix of partial derivatives of the p2(p2 + 1)/2 elements of S22 with respect to the p2(p2 + 1)/2 elements of A22 is the identity matrix. Thus the matrix of partial derivatives has the form

        A11  A12  A22
S11   (  I    *    *  )
S12   (  0    B    *  )
S22   (  0    0    I  )

so the determinant of this matrix is just the determinant of the p1p2 x p1p2 matrix B, which must be found. However, B is the matrix of partial derivatives of the elements of S12 with respect to the elements of A12 where S12 = A12 A22. Hence the determinant of B is just the Jacobian of the transformation g(A12) = A12 A22 with A22 fixed. This Jacobian is (det A22)^{p1} by Proposition 5.10. □
As an application of Proposition 5.16, a special integral over the space S_p^+ is now evaluated.

* Example 5.1. Let dS denote Lebesgue measure on the set S_p^+. The integral below arises in our discussion of the Wishart distribution. For a positive integer p and a real number r > p - 1, let

c(r, p) = ∫_{S_p^+} |S|^{(r-p-1)/2} exp[-(1/2) tr S] dS.

In this example, the constant c(r, p) is calculated. When p = 1, S_1^+ = (0, ∞) so for r > 0,

c(r, 1) = ∫_0^∞ s^{(r-2)/2} exp[-s/2] ds = 2^{r/2} Γ(r/2)

where Γ(r/2) is the gamma function evaluated at r/2. The first claim is that

c(r, p + 1) = (2π)^{p/2} c(r - 1, p) c(r, 1)

for r > p and p ≥ 1. To verify this claim, consider S ∈ S_{p+1}^+ and partition S as

S = ( S11  S12 )
    ( S12' S22 )

where S11 ∈ S_p^+, S22 ∈ (0, ∞), and S12 is p x 1. Introduce the change of variables

( S11  S12 )   ( A11 + A12 A22 A12'   A12 A22 )
( S12' S22 ) = ( A22 A12'             A22     )

where A11 ∈ S_p^+, A22 ∈ (0, ∞), and A12 ∈ R^p. By Proposition 5.16, the Jacobian of this transformation is A22^p. Since det S = det(S11 - S12 S22^{-1} S12') det S22 = (det A11) A22, we have

c(r, p + 1) = ∫_{S_{p+1}^+} |S|^{(r-p-2)/2} exp[-(1/2) tr S] dS
= ∫∫∫ (det A11)^{(r-p-2)/2} A22^{(r-p-2)/2} exp[-(1/2)(tr A11 + A22 A12'A12 + A22)] A22^p dA11 dA12 dA22.

Integrating with respect to A12 yields

∫_{R^p} exp[-(1/2) A22 A12'A12] dA12 = (2π)^{p/2} A22^{-p/2}.

Substituting this into the second integral expression for c(r, p + 1) and then integrating on A22 shows that

c(r, p + 1) = (2π)^{p/2} c(r, 1) ∫_{S_p^+} |A11|^{(r-p-2)/2} exp[-(1/2) tr A11] dA11
= (2π)^{p/2} c(r, 1) c(r - 1, p).
This establishes the first claim. Now it is an easy matter to solve for c(r, p). A bit of manipulation shows that with

c(r, p) = π^{p(p-1)/4} 2^{rp/2} Π_{i=1}^p Γ((r - i + 1)/2)

for p = 1, 2,... and r > p - 1, the equation

c(r, p + 1) = (2π)^{p/2} c(r, 1) c(r - 1, p)

is satisfied. Further,

c(r, 1) = 2^{r/2} Γ(r/2).

Uniqueness of the solution to the above equation is clear. In summary,

∫_{S_p^+} |S|^{(r-p-1)/2} exp[-(1/2) tr S] dS = π^{p(p-1)/4} 2^{rp/2} Π_{i=1}^p Γ((r - i + 1)/2)

and this is valid for p = 1, 2,... and r > p - 1. The restriction that r be greater than p - 1 is necessary so that Γ[(r - p + 1)/2] is well defined. It is not difficult to show that the above integral is +∞ if r ≤ p - 1. Now, set w(r, p) = 1/c(r, p) so

f(S) = w(r, p) |S|^{(r-p-1)/2} exp[-(1/2) tr S]

is a density function on S_p^+. When r is an integer with r ≥ p, f turns out to be the density of the Wishart distribution.
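The closed form for c(r, p) and the recursion it satisfies are easy to check numerically with the gamma function (an editorial Python sketch):

```python
import math

def c(r, p):
    """Normalizing constant of Example 5.1, valid for r > p - 1:
    c(r, p) = pi^{p(p-1)/4} 2^{rp/2} prod_{i=1}^p Gamma((r - i + 1)/2)."""
    val = math.pi ** (p * (p - 1) / 4.0) * 2.0 ** (r * p / 2.0)
    for i in range(1, p + 1):
        val *= math.gamma((r - i + 1) / 2.0)
    return val
```

For instance, c(2, 1) = 2^1 Γ(1) = 2, and the recursion c(r, p + 1) = (2π)^{p/2} c(r, 1) c(r - 1, p) holds to machine precision, e.g. with r = 5 and p = 2.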
Proposition 5.4 shows that there is a one-to-one correspondence between elements of S_p^+ and elements of G_T. More precisely, the function g defined on G_T by

g(T) = TT', T ∈ G_T

is one-to-one and onto S_p^+. It is clear that g has a differential since each coordinate function of g is a polynomial in the elements of T. One way to find the Jacobian of g is to simply compute the matrix of partial derivatives and then find its determinant. As motivation for some considerations in the next chapter, a different derivation of the Jacobian of g is given here. The first observation is as follows.
Proposition 5.17. Let dS denote Lebesgue measure on S_p^+ and consider the measure μ on S_p^+ given by μ(dS) = dS/|S|^{(p+1)/2}. For each Borel measurable function f on S_p^+ that is integrable with respect to μ, and for each nonsingular matrix A,

∫ f(S) μ(dS) = ∫ f(ASA') μ(dS).

Proof. Set B = ASA'. By Proposition 5.11, the Jacobian of this transformation on S_p^+ to S_p^+ is |det A|^{p+1}. Thus

∫ f(ASA') μ(dS) = ∫ f(ASA') dS/|S|^{(p+1)/2}
= ∫ f(ASA') |det A|^{p+1} dS/|ASA'|^{(p+1)/2}
= ∫ f(B) dB/|B|^{(p+1)/2} = ∫ f(S) μ(dS). □
The result of Proposition 5.17 is often paraphrased by saying that the measure μ is invariant under each of the transformations gA defined on S_p^+ by gA(S) = ASA'. The following calculation gives a heuristic proof of this result:

μ(d(gA(S))) = d(gA(S))/|ASA'|^{(p+1)/2} = JgA(S) dS/|ASA'|^{(p+1)/2}
= |det A|^{p+1} dS/(|AA'|^{(p+1)/2} |S|^{(p+1)/2}) = dS/|S|^{(p+1)/2} = μ(dS).
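In the 1 x 1 case, μ(ds) = ds/s on (0, ∞), and the invariance of Proposition 5.17 says that ∫ f(s) ds/s = ∫ f(a^2 s) ds/s for any a ≠ 0. A numerical check (an editorial illustration; the choices f(s) = s e^{-s} and a = 2 are arbitrary):

```python
import math

def midpoint(fn, a, b, n=200000):
    """Midpoint-rule approximation of the integral of fn over (a, b)."""
    h = (b - a) / n
    return sum(fn(a + (k + 0.5) * h) for k in range(n)) * h

f = lambda s: s * math.exp(-s)   # integrable against mu(ds) = ds / s
a = 2.0                          # the 1 x 1 "matrix" A; gA(s) = a * s * a = 4s

lhs = midpoint(lambda s: f(s) / s, 1e-8, 60.0)           # ∫ f(s) mu(ds)
rhs = midpoint(lambda s: f(a * s * a) / s, 1e-8, 60.0)   # ∫ f(gA(s)) mu(ds)
```

Both integrals evaluate to 1 (the first is ∫ e^{-s} ds, the second ∫ 4 e^{-4s} ds), illustrating the invariance.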
In fact, a similar calculation suggests that μ is the only invariant measure on S_p^+ (up to multiplication of μ by a positive constant). Consider a measure ν of the form ν(dS) = h(S) dS where h is a positive Borel measurable function and dS is Lebesgue measure. In order that ν be invariant, we must have

h(S) dS = ν(dS) = ν(d(gA(S))) = h(gA(S)) d(gA(S)) = h(gA(S)) |det A|^{p+1} dS

so h should satisfy the equation

h(S) = h(ASA') |AA'|^{(p+1)/2}
since gA(S) = ASA' and |det A|^{p+1} = |AA'|^{(p+1)/2}. Set S = I_p, B = AA', and c = h(I_p). Then

h(B) = c/|B|^{(p+1)/2}, B ∈ S_p^+

so

ν(dS) = c μ(dS)

where c is a positive constant. Making this argument rigorous is one of the topics treated in the next chapter. The calculation of the Jacobian of g on G_T to S_p^+ is next.
Proposition 5.18. For g(T) = TT', T ∈ G_T,

Jg(T) = 2^p Π_{i=1}^p t_ii^{p-i+1}

where t_11,..., t_pp are the diagonal elements of T.

Proof. The Jacobian Jg is the unique continuous function defined on G_T that satisfies the equation

∫_{S_p^+} f(S) dS/|S|^{(p+1)/2} = ∫_{G_T} f(g(T)) Jg(T) dT/|TT'|^{(p+1)/2}
for all Borel measurable functions f for which the integral over S + exists. But the left-hand side of this equation is invariant under the replacement of
f(S) byf(ASA') for any nonsingularp x p matrix. Thus the right-hand side
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
178 MATRIX FACTORIZATIONS AND JACOBIANS
must have the same property. In particular, for A E GT we have
f (TT+) I (T) dT f ( TT)) Jg(TT) dTT.
In this second integral, we make the change of variable T = A -'B for A e GT fixed and B e GT. By Proposition 5.12, the Jacobian of this transformation is l/flPai where al ,..., a are the diagonal elements of A.
Thus
f (TT') f(~d= f(BB') hg(A - B) dB
J t (Tr))/Jg( T) dT =dGB.B'(+'z All 'lHa T ~~~~Hai1
Since this must hold for all Borel measurable f and since Jg is a continuous
function, it follows that for all T e GT and A E GT
Jg(T) =Jg(A-'T) IAp
rl aiii
Setting A = T and noting that det T = Π_{i=1}^p t_ii, we have

Jg(T) = Jg(I_p) Π_{i=1}^p t_ii^{p-i+1}.

Thus Jg(T) is a constant k = Jg(I_p) times Π_{i=1}^p t_ii^{p-i+1}. Hence

∫_{S_p^+} f(S) dS/|S|^{(p+1)/2} = k ∫_{G_T} f(TT') Π_{i=1}^p t_ii^{p-i+1} dT/|TT'|^{(p+1)/2}.
To evaluate the constant k, pick

f(S) = |S|^{r/2} exp[-(1/2) tr S], r > p - 1.

But

∫_{S_p^+} |S|^{r/2} exp[-(1/2) tr S] dS/|S|^{(p+1)/2} = ∫_{S_p^+} |S|^{(r-p-1)/2} exp[-(1/2) tr S] dS = c(r, p)

where c(r, p) is defined in Example 5.1. However,

k ∫_{G_T} |TT'|^{(r-p-1)/2} exp[-(1/2) tr TT'] Π_{i=1}^p t_ii^{p-i+1} dT
= k ∫_{G_T} Π_{i=1}^p t_ii^{r-i} exp[-(1/2) Σ_{j≤i} t_ij^2] dT = k 2^{-p} c(r, p)

so k = 2^p. The evaluation of the last integral is carried out by noting that t_ii ranges from 0 to ∞ and t_ij for j < i ranges from -∞ to ∞. Thus the integral is a product of p(p + 1)/2 one-dimensional integrals, each of which is easy to evaluate. □
A by-product of this proof is that

h(T) = (2^p/c(r, p)) Π_{i=1}^p t_ii^{r-i} exp[-(1/2) Σ_{j≤i} t_ij^2]

is a density function on G_T. Since the density h factors into a product of densities, the elements of T, t_ij for j ≤ i, are independent. Clearly,

L(t_ij) = N(0, 1) for j < i

and

L(t_ii^2) = χ²_{n-i+1}

when r is the integer n ≥ p.
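The factorization of the density h is the basis of the Bartlett decomposition method for simulating Wishart matrices: sample T coordinate by coordinate and set S = TT'. A minimal Python sketch (an editorial illustration; the chi-squared variate is drawn as a sum of squared standard normals, which assumes an integer n, and is simple rather than efficient):

```python
import math
import random

def bartlett_T(n, p, rng=None):
    """Sample T in G_T with t_ij ~ N(0, 1) for j < i and
    t_ii^2 ~ chi-squared with n - i + 1 degrees of freedom (1-based i).
    Then S = TT' has the Wishart distribution with n degrees of freedom
    and identity scale."""
    rng = rng or random.Random(0)
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):                 # i is 0-based, so df = n - (i + 1) + 1
        df = n - i
        T[i][i] = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)))
        for j in range(i):
            T[i][j] = rng.gauss(0.0, 1.0)
    return T
```

The independence of the t_ij is what makes this sampler valid: each coordinate is drawn from its own marginal distribution.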
Proposition 5.19. Define g on G_U to S_p^+ by g(U) = UU'. Then Jg(U) is given by

Jg(U) = 2^p Π_{i=1}^p u_ii^i

where u_11,..., u_pp are the diagonal elements of U.

Proof. The proof is essentially the same as the proof of Proposition 5.18 and is left to the reader. □
The technique used to prove Proposition 5.18 is an important one. Given g on G_T to S_p^+, the idea of the proof was to write down the equation the Jacobian satisfies, namely,

∫_{S_p^+} f(S) dS/|S|^{(p+1)/2} = ∫_{G_T} f(g(T)) Jg(T) dT/|TT'|^{(p+1)/2}

for all integrable f. Since this equation must hold for all integrable f, Jg is uniquely defined (up to sets of Lebesgue measure zero) by this equation. It is clear that any property satisfied by the left-hand integral must also be satisfied by the right-hand integral, and this was used to characterize Jg. In particular, it was noted that the left-hand integral remained the same if f(S) was replaced by f(ASA') for any nonsingular A. For A ∈ G_T, this led to the equation

Jg(T) = Jg(A^{-1}T) |det A|^{p+1}/Π_{i=1}^p a_ii^i,

which determined Jg. It should be noted that only Jacobians of the linear transformations discussed in Propositions 5.11 and 5.13 were used to determine the Jacobian of the nonlinear transformation g. Arguments similar to this are used throughout Chapter 6 to derive invariant integrals (measures) on matrix groups and spaces that are acted upon by matrix groups.
PROBLEMS
1. Given A ∈ L_{p,n} with rank(A) = p, show that A = ΨT where Ψ ∈ F_{p,n} and T ∈ G_T. Prove that Ψ and T are unique.
2. Define the function F on S_p^+ to G_T as follows. For each S ∈ S_p^+, F(S) is the unique element in G_T such that S = F(S)(F(S))'. Show that F(TST') = TF(S) for T ∈ G_T and S ∈ S_p^+.
3. Given S ∈ S_p^+, show there exists a unique U ∈ G_U such that S = UU'.
4. For S ∈ S_p^+, partition S as

S = ( S11  S12 )
    ( S21  S22 )

where Sij is pi x pj, i, j = 1, 2. Assume for definiteness that p1 ≤ p2. Show that S can be written as

S = ( A1  0  ) ( I_{p1}   (D 0)  ) ( A1  0  )'
    ( 0   A2 ) ( (D 0)'   I_{p2} ) ( 0   A2 )

where Ai is pi x pi and nonsingular, D is p1 x p1 and diagonal with diagonal elements in [0, 1).
5. Let L⁰_{p,n} be the set of elements in L_{p,n} that have rank p. Define F on F_{p,n} x G_U to L⁰_{p,n} by F(Ψ, U) = ΨU.

(i) Show that F is one-to-one and onto, and describe the inverse of F.

(ii) For Γ ∈ O_n and T ∈ G_T, define Γ ⊗ T on L⁰_{p,n} to L⁰_{p,n} by (Γ ⊗ T)A = ΓAT'. Show that (Γ ⊗ T)F(Ψ, U) = F(ΓΨ, UT'). Also, show that F^{-1}((Γ ⊗ T)A) = (ΓΨ, UT') where F^{-1}(A) = (Ψ, U).
6. Let B0 and B1 be open sets in R^n and fix x0 ∈ B0. Suppose g maps B0 into B1 and g(x) = g(x0) + A(x - x0) + R(x - x0) where A is an n x n matrix and R(·) is a function that satisfies

lim_{u→0} ||R(u)||/||u|| = 0.

Prove that A = Dg(x0) so Jg(x0) = |det(A)|.
7. Let V be the linear coordinate space of all p x p lower triangular real matrices, so V is of dimension p(p + 1)/2. Let S_p be the linear coordinate space of all p x p real symmetric matrices, so S_p is also of dimension p(p + 1)/2.

(i) Show that G_T is an open subset of V.

(ii) Define g on G_T to S_p by g(T) = TT'. For fixed T0 ∈ G_T, show that g(T) = g(T0) + L(T - T0) + (T - T0)(T - T0)' where L is defined on V to S_p by L(x) = xT0' + T0x', x ∈ V. Also show that R(T - T0) = (T - T0)(T - T0)' satisfies

lim_{x→0} ||R(x)||/||x|| = 0.

(iii) Prove by induction that det L = 2^p Π_{i=1}^p t_ii^{p-i+1} where t_11,..., t_pp are the diagonal elements of T0.

(iv) Using (iii) and Problem 6, show that Jg(T) = 2^p Π_{i=1}^p t_ii^{p-i+1}. (This is just Proposition 5.18.)
8. When S is a positive definite matrix, partition S and S^{-1} as

S = ( S11  S12 )     S^{-1} = ( S^11  S^12 )
    ( S21  S22 ),             ( S^21  S^22 ).

Show that

S^11 = (S11 - S12 S22^{-1} S21)^{-1},
S^12 = -S^11 S12 S22^{-1},
S^22 = (S22 - S21 S11^{-1} S12)^{-1},
S^21 = -S^22 S21 S11^{-1},

and verify the identity

S22^{-1} S21 S^11 = S^22 S21 S11^{-1}.
9. In coordinate space R^p, partition x as x = (y; z), and for Σ > 0, partition Σ: p x p conformably as

Σ = ( Σ11  Σ12 )
    ( Σ21  Σ22 ).

Define the inner product (·,·) on R^p by (u, v) = u'Σ^{-1}v.

(i) Show that the matrix

P = ( I   -Σ12 Σ22^{-1} )
    ( 0    0            )

defines an orthogonal projection in the inner product (·,·). What is R(P)?

(ii) Show that the identity

x'Σ^{-1}x = (y - Σ12 Σ22^{-1} z)'(Σ11 - Σ12 Σ22^{-1} Σ21)^{-1}(y - Σ12 Σ22^{-1} z) + z'Σ22^{-1} z

is the same as the identity

||x||² = ||Px||² + ||(I - P)x||²

where (x, x) = ||x||² and x = (y; z).
(iii) For a random vector

X = ( Y )
    ( Z )

with L(X) = N(0, Σ), Σ > 0, use part (ii) to give a direct proof via densities that the conditional distribution of Y given Z = z is N(Σ12 Σ22^{-1} z, Σ11 - Σ12 Σ22^{-1} Σ21).
10. Verify the equation

∫_{G_T} Π_{i=1}^p t_ii^{r-i} exp[-(1/2) Σ_{j≤i} t_ij^2] dT = 2^{-p} c(r, p)

where c(r, p) is given in Example 5.1. Here, r is real and r > p - 1.
NOTES AND REFERENCES
1. Other matrix factorizations of interest in statistical problems can be found in Anderson (1958), Rao (1973), and Muirhead (1982). Many matrix factorizations can be viewed as results that give a maximal invariant under the action of a group, a topic discussed in detail in Chapter 7.
2. Only the most elementary facts concerning the transformation of measures under a change of variable have been given in the second section. The Jacobians of other transformations that occur naturally in statistical problems can be found in Deemer and Olkin (1951), Anderson (1958), James (1954), Farrell (1976), and Muirhead (1982). Some of these transformations involve functions defined on manifolds (rather than open subsets of R^n) and the corresponding Jacobian calculations require a knowledge of differential forms on manifolds. Otherwise, the manipulations just look like magic that somehow yields answers we do not know how to check. Unfortunately, the amount of mathematics behind these calculations is substantial. The mastery of this material is no mean feat. Farrell (1976) provides one treatment of the calculus of differential forms. James (1954) and Muirhead (1982) contain some background material and references.
3. I have found Lang (1969, Part Six, Global Analysis) to be a very readable introduction to differential forms and manifolds.
CHAPTER 6
Topological Groups
and Invariant Measures
The language of vector spaces has been used in the previous chapters to describe a variety of properties of random vectors and their distributions. Apart from the discussion in Chapter 4, not much has been said concerning the structure of parametric probability models for distributions of random vectors. Groups of transformations acting on spaces provide a very useful framework in which to generate and describe many parametric statistical models. Furthermore, the derivation of induced distributions of a variety of functions of random vectors is often simplified and clarified using the existence and uniqueness of invariant measures on locally compact topological groups. The ideas and techniques presented in this chapter permeate the remainder of this book.
Most of the groups occurring in multivariate analysis are groups of nonsingular linear transformations or related groups of affine transformations. Examples of matrix groups are given in Section 6.1 to illustrate the definition of a group. Also, examples of quotient spaces that arise naturally in multivariate analysis are discussed.
In Section 6.2, locally compact topological groups are defined. The existence and uniqueness theorem concerning invariant measures (integrals) on these groups is stated and the matrix groups introduced in Section 6.1
are used as examples. Continuous homomorphisms and their relation to
relatively invariant measures are described with matrix groups again serving as examples. Some of the material in this section and the next is modeled
after Nachbin (1965). Rather than repeat the proofs given in Nachbin
(1965), we have chosen to illustrate the theory with numerous examples. Section 6.3 is concerned with the existence and uniqueness of relatively
invariant measures on spaces that are acted on transitively by groups of
transformations. In fact, this situation is probably more relevant to statistical problems than that discussed in Section 6.2. Of course, the examples are
selected with statistical applications in mind.
6.1. GROUPS
We begin with the definition of a group and then give examples of matrix
groups.
Definition 6.1. A group (G, ∘) is a set G together with a binary operation ∘ such that the following properties hold for all elements in G:

(i) (g1 ∘ g2) ∘ g3 = g1 ∘ (g2 ∘ g3).
(ii) There is a unique element of G, denoted by e, such that g ∘ e = e ∘ g = g for all g ∈ G. The element e is the identity in G.
(iii) For each g ∈ G, there is a unique element in G, denoted by g^{-1}, such that g ∘ g^{-1} = g^{-1} ∘ g = e. The element g^{-1} is the inverse of g.
Henceforth, the binary operation is ordinarily deleted and we write g1g2 for g1 ∘ g2. Also, parentheses are usually not used in expressions involving more than two group elements as these expressions are unambiguously defined by (i). A group G is called commutative if g1g2 = g2g1 for all g1, g2 ∈ G. It is clear that a vector space V is a commutative group where the group operation is addition, the identity element is 0 ∈ V, and the inverse of x is -x.
* Example 6.1. If (V, (·, ·)) is a finite dimensional inner product space, it has been shown that the set of all orthogonal transformations O(V) is a group. The group operation is the composition of linear transformations, the identity element is the identity linear transformation, and if Γ ∈ O(V), the inverse of Γ is Γ'. When V is the coordinate space R^n, O(V) is denoted by O_n, which is just the group of n × n orthogonal matrices.
* Example 6.2. Consider the coordinate space R^p and let G_T^+ be the set of all p × p lower triangular matrices with positive diagonal elements. The group operation in G_T^+ is taken to be matrix multiplication. It has been verified in Chapter 5 that G_T^+ is a group, the identity in G_T^+ is the p × p identity matrix, and if T ∈ G_T^+, T^{-1} is
just the matrix inverse of T. Similarly, the set G_U^+ of p × p upper triangular matrices with positive diagonal elements is a group under the operation of matrix multiplication.
* Example 6.3. Let V be an n-dimensional vector space and let Gl(V) be the set of all nonsingular linear transformations of V onto V. The group operation in Gl(V) is defined to be composition of linear transformations. With this operation, it is easy to verify that Gl(V) is a group, the identity in Gl(V) is the identity linear transformation, and if g ∈ Gl(V), g^{-1} is the inverse linear transformation of g. The group Gl(V) is often called the general linear group of V. When V is the coordinate space R^n, Gl(V) is denoted by Gl_n. Clearly, Gl_n is just the set of n × n nonsingular matrices and the group operation is matrix multiplication.
It should be noted that O(V) is a subset of Gl(V) and the group operation in O(V) is that of Gl(V). Further, G_T^+ and G_U^+ are subsets of Gl_n with the inherited group operations. This observation leads to the definition of a subgroup.
Definition 6.2. If (G, ∘) is a group and H is a subset of G such that (H, ∘) is also a group, then (H, ∘) is a subgroup of (G, ∘).
In all of the above examples, each element of the group is also a
one-to-one function defined on a set. Further, the group operation is in fact
function composition. To isolate the essential features of this situation, we define the following.
Definition 6.3. Let (G, ∘) be a group and let 𝒳 be a set. The group (G, ∘) acts on the left of 𝒳 if to each pair (g, x) ∈ G × 𝒳, there corresponds a unique element of 𝒳, denoted by gx, such that

(i) g1(g2x) = (g1 ∘ g2)x.
(ii) ex = x.
The content of Definition 6.3 is that there is a function on G × 𝒳 to 𝒳 whose value at (g, x) is denoted by gx and, under this mapping, (g1, g2x) and (g1 ∘ g2, x) are sent into the same element. Furthermore, (e, x) is mapped to x. Thus each g ∈ G can be thought of as a one-to-one onto function from 𝒳 to 𝒳 and the group operation in G is function composition. To make this claim precise, for each g ∈ G, define t_g on 𝒳 to 𝒳 by

t_g(x) = gx.
Proposition 6.1. Suppose G acts on the left of 𝒳. Then each t_g is a one-to-one onto function from 𝒳 to 𝒳 and:

(i) t_{g1} t_{g2} = t_{g1 ∘ g2}.
(ii) t_g^{-1} = t_{g^{-1}}.

Proof. To show t_g is onto, consider x ∈ 𝒳. Then t_g(g^{-1}x) = g(g^{-1}x) = (g ∘ g^{-1})x = ex = x, where (i) and (ii) of Definition 6.3 have been used. Thus t_g is onto. If t_g(x1) = t_g(x2), then gx1 = gx2 so

x1 = ex1 = (g^{-1} ∘ g)x1 = g^{-1}(gx1) = g^{-1}(gx2) = (g^{-1} ∘ g)x2 = ex2 = x2.

Thus t_g is one-to-one. Assertion (i) follows immediately from (i) of Definition 6.3. Since t_e is the identity function on 𝒳 and (i) implies that

t_g t_{g^{-1}} = t_{g^{-1}} t_g = t_e,

we have t_g^{-1} = t_{g^{-1}}. □
Henceforth, we dispense with t_g and simply regard each g as a function on 𝒳 to 𝒳 where function composition is group composition and e is the identity function on 𝒳. All of the examples considered thus far are groups of functions on a vector space to itself and the group operation is defined to be function composition. In particular, Gl(V) is the set of all one-to-one onto linear transformations of V to V and the group operation is function composition. In the next example, the motivation for the definition of the group operation is provided by thinking of each group element as a function.
* Example 6.4. Let V be an n-dimensional vector space and consider the set Al(V) that is the collection of all pairs (A, x) with A ∈ Gl(V) and x ∈ V. Each pair (A, x) defines a one-to-one onto function from V to V by

(A, x)v = Av + x,  v ∈ V.

The composition of (A1, x1) and (A2, x2) is

(A1, x1)(A2, x2)v = (A1, x1)(A2v + x2) = A1A2v + A1x2 + x1 = (A1A2, A1x2 + x1)v.
Also, (I, 0) ∈ Al(V) is the identity function on V and the inverse of (A, x) is (A^{-1}, -A^{-1}x). It is now an easy matter to verify that Al(V) is a group where the group operation in Al(V) is

(A1, x1)(A2, x2) = (A1A2, A1x2 + x1).

This group Al(V) is called the affine group of V. When V is the coordinate space R^n, Al(V) is denoted by Al_n.
An interesting and useful subgroup of Al(V) is given in the next example.
* Example 6.5. Suppose V is a finite dimensional vector space and let M be a subspace of V. Let H be the collection of all pairs (A, x) where x ∈ M, A(M) ⊆ M, and (A, x) ∈ Al(V). The group operation in H is that inherited from Al(V). It is a routine calculation to show that H is a subgroup of Al(V). As a particular case, suppose that V is R^n and M is the m-dimensional subspace of R^n consisting of those vectors x ∈ R^n whose last n - m coordinates are zero. An n × n matrix A ∈ Gl_n satisfies A(M) ⊆ M iff

A = (A11  A12)
    ( 0   A22)

where A11 is m × m and nonsingular, A12 is m × (n - m), and A22 is (n - m) × (n - m) and nonsingular. Thus H consists of all pairs (A, x) where A ∈ Gl_n has the above form and x has its last n - m coordinates zero.
* Example 6.6. In this example, we consider two finite groups that arise naturally in statistical problems. Consider the space R^n and let P be an n × n matrix that permutes the coordinates of a vector x ∈ R^n. Thus in each row and in each column of P, there is a single element that is one and the remaining elements are zero. Conversely, any such matrix permutes the coordinates of vectors in R^n. The set 𝒫_n of all such matrices is called the group of permutation matrices. It is clear that 𝒫_n is a group under matrix multiplication and 𝒫_n has n! elements. Also, let 𝒟_n be the set of all n × n diagonal matrices whose diagonal elements are plus or minus one. Obviously, 𝒟_n is a group under matrix multiplication and 𝒟_n has 2^n elements. The group 𝒟_n is called the group of sign changes on R^n. A bit of reflection shows that both 𝒫_n and 𝒟_n are subgroups of O_n. Now, let
H be the set

H = {PD | P ∈ 𝒫_n, D ∈ 𝒟_n}.

The claim is that H is a group under matrix multiplication. To see this, first note that for P ∈ 𝒫_n and D ∈ 𝒟_n, P'DP is an element of 𝒟_n. Thus if P1D1 and P2D2 are in H, then

P1D1P2D2 = P1P2(P2'D1P2)D2 = P3D3 ∈ H

where P3 = P1P2 and D3 = P2'D1P2D2. Also,

(PD)^{-1} = DP' = P'(PDP') ∈ H.

Therefore H is a group and clearly has 2^n n! elements.
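For small n the claims of Example 6.6 can be verified by brute force (an editorial check, not in the text): with n = 3, the products PD form a set of exactly 2^n n! = 48 distinct orthogonal matrices closed under products and inverses.

```python
# Enumerate H = {PD} for n = 3 and confirm it is a subgroup of O_n of order 2^n n!.
import itertools
import math
import numpy as np

n = 3
perms = [np.eye(n)[list(p)] for p in itertools.permutations(range(n))]
signs = [np.diag(d) for d in itertools.product([1, -1], repeat=n)]
H = [P @ D for P in perms for D in signs]

key = lambda M: tuple(np.rint(M).astype(int).ravel())   # exact integer entries
Hset = {key(M) for M in H}
assert len(Hset) == 2**n * math.factorial(n)            # 48 elements for n = 3

for M in H:
    assert np.allclose(M @ M.T, np.eye(n))              # H is a subset of O_n
    assert key(np.linalg.inv(M)) in Hset                # closed under inverses
    for N in H:
        assert key(M @ N) in Hset                       # closed under products
print("H has", len(Hset), "elements and is a subgroup of O_n")
```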
Suppose that G is a group and H is a subgroup of G. The quotient space G/H, to be defined next, is often a useful representation of spaces that arise in later considerations. The subgroup H of G defines an equivalence relation ≡ in G by g1 ≡ g2 iff g2 ∈ g1H. That ≡ is an equivalence relation is easily verified using the assumption that H is a subgroup of G. Also, it is not difficult to show that g1 ≡ g2 iff the set g1H = {g1h | h ∈ H} is equal to the set g2H. Thus the set of points in G equivalent to g1 is the set g1H.
Definition 6.4. If H is a subgroup of G, the quotient space G/H is defined to be the set whose elements are gH for g ∈ G.
The quotient space G/H is obviously the set of equivalence classes (defined by H) of elements of G. Under certain conditions on H, the quotient space G/H is in fact a group under a natural definition of a group operation.
Definition 6.5. A subgroup H of G is called a normal subgroup if g^{-1}Hg = H for all g ∈ G.
When H is a normal subgroup of G, and g_iH ∈ G/H for i = 1, 2, then

g1Hg2H = {g1h1g2h2 | h1, h2 ∈ H} = g1g2(g2^{-1}Hg2)H = g1g2HH = g1g2H

since HH = H.
Proposition 6.2. When H is a normal subgroup of G, the quotient space G/H is a group under the operation

(g1H)(g2H) = g1g2H.

Proof. This is a routine calculation and is left to the reader. □
* Example 6.7. Let Al(V) be the affine group of the vector space V. Then

H = {(I, x) | x ∈ V}

is easily shown to be a subgroup of G, since (I, x1)(I, x2) = (I, x1 + x2). To show H is normal in Al(V), consider (A, x) ∈ Al(V) and (I, x0) ∈ H. Then

(A, x)^{-1}(I, x0)(A, x) = (A^{-1}, -A^{-1}x)(A, x + x0) = (I, A^{-1}x + A^{-1}x0 - A^{-1}x) = (I, A^{-1}x0),

which is an element of H. Thus g^{-1}Hg ⊆ H for all g ∈ Al(V). But if (I, x0) ∈ H and (A, x) ∈ Al(V), then

(A, x)^{-1}(I, Ax0)(A, x) = (I, x0)

so g^{-1}Hg = H for g ∈ Al(V). Therefore, H is normal in Al(V). To describe the group Al(V)/H, we characterize the equivalence relation defined by H. For (A_i, x_i) ∈ Al(V), i = 1, 2,

(A1, x1)^{-1}(A2, x2) = (A1^{-1}, -A1^{-1}x1)(A2, x2) = (A1^{-1}A2, A1^{-1}x2 - A1^{-1}x1)

is an element of H iff A1^{-1}A2 = I or A1 = A2. Thus (A1, x1) is equivalent to (A2, x2) iff A1 = A2. From each equivalence class, select the element (A, 0). Then it is clear that the quotient group Al(V)/H can be identified with the group

K = {(A, 0) | A ∈ Gl(V)}
where the group operation is

(A1, 0)(A2, 0) = (A1A2, 0).
Now, suppose the group G acts on the left of the set 𝒳. We say G acts transitively on 𝒳 if, for each x1 and x2 in 𝒳, there exists a g ∈ G such that gx1 = x2. When G acts transitively on 𝒳, we want to show that there is a natural one-to-one correspondence between 𝒳 and a certain quotient space. Fix an element x0 ∈ 𝒳 and let

H = {h | hx0 = x0, h ∈ G}.

The subgroup H of G is called the isotropy subgroup of x0. Now, define the function τ on G/H to 𝒳 by τ(gH) = gx0.
Proposition 6.3. The function τ is one-to-one and onto. Further,

τ(g1gH) = g1τ(gH).

Proof. The definition of τ clearly makes sense as ghx0 = gx0 for all h ∈ H. Also, τ is an onto function since G acts transitively on 𝒳. If τ(g1H) = τ(g2H), then g1x0 = g2x0 so g2^{-1}g1 ∈ H. Therefore, g1H = g2H so τ is one-to-one. The rest is obvious. □
If H is any subgroup of G, then the group G acts transitively on 𝒳 = G/H where the group action is

g1(gH) = g1gH.

Thus we have a complete description of the spaces 𝒳 that are acted on transitively by G. Namely, these spaces are simply relabelings of the quotient spaces G/H where H is a subgroup of G. Further, the action of g on 𝒳 corresponds to the action of G on the quotient space described in Proposition 6.3. A few examples illustrate these ideas.
* Example 6.8. Take the set 𝒳 to be F_{p,n}, the set of n × p real matrices Ψ that satisfy Ψ'Ψ = I_p, 1 ≤ p ≤ n. The group G = O_n of all n × n orthogonal matrices acts on F_{p,n} by matrix multiplication. That is, if Γ ∈ O_n and Ψ ∈ F_{p,n}, then ΓΨ is the matrix product of Γ and Ψ. To show that this group action is transitive, consider Ψ1 and Ψ2 in F_{p,n}. Then, the columns of Ψ1 form a set of p orthonormal
vectors in R^n as do the columns of Ψ2. By Proposition 1.30, there exists an n × n orthogonal matrix Γ that maps the columns of Ψ1 into the columns of Ψ2. Thus ΓΨ1 = Ψ2 so O_n is transitive on F_{p,n}. A convenient choice of x0 ∈ F_{p,n} to define the map τ is

x0 = (I_p)
     ( 0 )

where 0 is a block of (n - p) × p zeroes. It is not difficult to show that the subgroup H = {Γ | Γx0 = x0, Γ ∈ O_n} is

H = {Γ | Γ = (I_p   0 ), Γ22 ∈ O_{n-p}}.
             ( 0   Γ22)

The function τ is

τ(ΓH) = Γx0,

which is the n × p matrix consisting of the first p columns of Γ. This gives an obvious representation of F_{p,n}.
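The transitivity argument of Example 6.8 can be made constructive (an editorial sketch; the construction via orthonormal-basis completion is mine, while the text appeals to Proposition 1.30): extend each Ψ_i to a full orthonormal basis and map one basis onto the other.

```python
# Construct Gamma in O_n with Gamma Psi1 = Psi2 for two frames in F_{p,n}.
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 2

def stiefel(n, p):
    Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
    return Q                                     # an element of F_{p,n}

Psi1, Psi2 = stiefel(n, p), stiefel(n, p)

def complete(Psi):
    # columns of Psi followed by an orthonormal basis of its orthogonal complement
    Q, _ = np.linalg.qr(np.hstack([Psi, rng.normal(size=(n, n - p))]))
    return np.hstack([Psi, Q[:, p:]])

B1, B2 = complete(Psi1), complete(Psi2)
Gamma = B2 @ B1.T
assert np.allclose(Gamma @ Gamma.T, np.eye(n))   # Gamma is orthogonal
assert np.allclose(Gamma @ Psi1, Psi2)           # Gamma maps Psi1 to Psi2
print("found Gamma in O_n with Gamma Psi1 = Psi2")
```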
* Example 6.9. Let 𝒳 be the set of all p × p positive definite matrices and let G = Gl_p. The transitive group action is given by A(x) = AxA' where A is a p × p nonsingular matrix, x ∈ 𝒳, and A' is the transpose of A. Choose x0 ∈ 𝒳 to be I_p. Obviously, H = O_p and the map τ is given by

τ(AH) = A(x0) = AA'.

The reader should compare this example with the assertion of Proposition 1.31.
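A numerical aside on Example 6.9 (editorial, not from the text): the Cholesky factorization exhibits, for any positive definite x, one A ∈ Gl_p with A(I_p) = x, which is exactly the transitivity of the action.

```python
# Every positive definite x equals A A' for some A in Gl_p (Cholesky),
# and the isotropy subgroup of I_p is O_p since A I A' = I iff A A' = I.
import numpy as np

rng = np.random.default_rng(4)
p = 4
B = rng.normal(size=(p, p))
x = B @ B.T + p * np.eye(p)          # a point in the space of p.d. matrices

A = np.linalg.cholesky(x)            # lower triangular with positive diagonal
assert np.allclose(A @ A.T, x)       # A(I_p) = A I_p A' = x

Q, _ = np.linalg.qr(rng.normal(size=(p, p)))   # a random element of O_p
assert np.allclose(Q @ np.eye(p) @ Q.T, np.eye(p))  # Q fixes I_p
print("every p.d. x is A A' for some A in Gl_p")
```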
* Example 6.10. In this example, take 𝒳 to be the set of all n × p real matrices of rank p, p ≤ n. Consider the group G defined by

G = {g | g = Γ ⊗ T, Γ ∈ O_n, T ∈ G_T^+}

where G_T^+ is the group of all p × p lower triangular matrices with positive diagonal elements. Of course, ⊗ denotes the Kronecker product and group composition is

(Γ1 ⊗ T1)(Γ2 ⊗ T2) = (Γ1Γ2) ⊗ (T1T2).
The action of G on 𝒳 is

(Γ ⊗ T)X = ΓXT',  X ∈ 𝒳.

To show G acts transitively on 𝒳, consider X1, X2 ∈ 𝒳 and write X_i = Ψ_iU_i, where Ψ_i ∈ F_{p,n} and U_i ∈ G_U^+, i = 1, 2 (see Proposition 5.2). From Example 6.8, there is a Γ ∈ O_n such that ΓΨ1 = Ψ2. Let T' = U1^{-1}U2 so

ΓX1T' = ΓΨ1U1U1^{-1}U2 = Ψ2U2 = X2.

Choose X0 ∈ 𝒳 to be

X0 = (I_p)
     ( 0 )

as in Example 6.8. Then the equation (Γ ⊗ T)X0 = X0 implies that

I_p = X0'X0 = ((Γ ⊗ T)X0)'(Γ ⊗ T)X0 = TX0'Γ'ΓX0T' = TT'

so T = I_p by Proposition 5.4. Then the equation (Γ ⊗ I_p)X0 = X0 is exactly the equation occurring in Example 6.8 for elements of the subgroup H. Thus for this example,

H = {Γ ⊗ I_p | Γ = (I_p   0 ), Γ22 ∈ O_{n-p}}.
                   ( 0   Γ22)

Therefore,

τ((Γ ⊗ T)H) = (Γ ⊗ T)X0 = ΓX0T'

is the representation for elements of 𝒳. Obviously, ΓX0 ∈ F_{p,n} (it consists of the first p columns of Γ), and the representation of elements of 𝒳 via the map τ is precisely the representation established in Proposition 5.2. This representation of 𝒳 is used on a number of occasions.
6.2. INVARIANT MEASURES AND INTEGRALS
Before beginning a discussion of invariant integrals on locally compact topological groups, we first outline the basic results of integration theory on locally compact topological spaces. Consider a set 𝒳 and let 𝒯 be a Hausdorff topology for 𝒳.
Definition 6.6. The topological space (𝒳, 𝒯) is a locally compact space if for each x ∈ 𝒳, there exists a compact neighborhood of x.
Most of the groups introduced in the examples of the previous section are subsets of the space R^m, for some m, and when these groups are given the topology of R^m, they are locally compact spaces. The verification of this is not difficult and is left to the reader. If (𝒳, 𝒯) is a locally compact space, K(𝒳) denotes the set of all continuous real-valued functions that have compact support. Thus f ∈ K(𝒳) if f is continuous and there is a compact set K such that f(x) = 0 if x ∉ K. It is clear that K(𝒳) is a real vector space with addition and scalar multiplication being defined in the obvious way.
Definition 6.7. A real-valued function J defined on K(𝒳) is called an integral if:

(i) J(α1f1 + α2f2) = α1J(f1) + α2J(f2) for α1, α2 ∈ R and f1, f2 ∈ K(𝒳).
(ii) J(f) ≥ 0 if f ≥ 0, f ∈ K(𝒳).

An integral J is simply a linear function on K(𝒳) that has the additional property that J(f) is nonnegative when f ≥ 0. Let ℬ(𝒳) be the σ-algebra generated by the compact subsets of 𝒳. If μ is a measure on ℬ(𝒳) such that μ(K) < +∞ for each compact set K, it is clear that

J(f) = ∫ f(x)μ(dx)

defines an integral on K(𝒳). Such measures μ are called Radon measures. Conversely, given an integral J, there is a measure μ on ℬ(𝒳) such that μ(K) < +∞ for all compact sets K and

J(f) = ∫ f(x)μ(dx)

for f ∈ K(𝒳). For a proof of this result, see Segal and Kunze (1978,
Chapter 5). In the special case when (𝒳, 𝒯) is a σ-compact space (that is, 𝒳 = ∪_1^∞ K_i where each K_i is compact), the correspondence between integrals J and measures μ that satisfy μ(K) < +∞ for K compact is one-to-one (see Segal and Kunze, 1978). All of the examples considered here are σ-compact spaces and we freely identify integrals with Radon measures and vice versa.
Now, assume (𝒳, 𝒯) is a σ-compact space. If an integral J on K(𝒳) corresponds to a Radon measure μ on ℬ(𝒳), then J has a natural extension to the class of all ℬ(𝒳)-measurable and μ-integrable functions. Namely, J is extended by the equation

J(f) = ∫ f(x)μ(dx)

for all f for which the right-hand side is defined. Obviously, the extension of J is unique and is determined by the values of J on K(𝒳). In many of the examples in this chapter, we use J to denote both an integral on K(𝒳) and its extension. With this convention, J is defined for any ℬ(𝒳)-measurable function that is μ-integrable where μ corresponds to J.
Suppose G is a group and 𝒯 is a topology on G.

Definition 6.8. Given the topology 𝒯 on G, (G, 𝒯) is a topological group if the mapping (x, y) → xy^{-1} is continuous from G × G to G. If (G, 𝒯) is a topological group and (G, 𝒯) is a locally compact topological space, (G, 𝒯) is called a locally compact topological group.

In what follows, all groups under consideration are locally compact topological groups. Examples of such groups include the vector space R^n, the general linear group Gl_n, the affine group Al_n, and G_T^+. The verification that these groups are locally compact topological groups with the Euclidean space topology is left to the reader.

If (G, 𝒯) is a locally compact topological group, K(G) denotes the real vector space of all continuous functions on G that have compact support. For s ∈ G and f ∈ K(G), the left translate of f by s, denoted by sf, is defined by (sf)(x) = f(s^{-1}x), x ∈ G. Clearly, sf ∈ K(G) for all s ∈ G. Similarly, the right translate of f ∈ K(G), denoted by fs, is (fs)(x) = f(xs^{-1}) and fs ∈ K(G).

Definition 6.9. An integral J ≠ 0 on K(G) is left invariant if J(sf) = J(f) for all f ∈ K(G) and s ∈ G. An integral J ≠ 0 on K(G) is right invariant if J(fs) = J(f) for all f ∈ K(G) and s ∈ G.
The basic properties of left and right invariant integrals are summarized in the following two results.
Theorem 6.1. If G is a locally compact topological group, then there exist left and right invariant integrals on K(G). If J1 and J2 are left (right) invariant integrals on K(G), then J2 = cJ1 for some positive constant c.

Proof. See Nachbin (1965, Section 4, Chapter 2). □
Theorem 6.2. Suppose that

J(f) = ∫ f(x)μ(dx)

is a left invariant integral on K(G). Then there exists a unique continuous function Δ_r mapping G into (0, ∞) such that

∫ f(xs^{-1})μ(dx) = Δ_r(s) ∫ f(x)μ(dx)

for all s ∈ G and f ∈ K(G). The function Δ_r, called the right-hand modulus of G, also satisfies:

(i) Δ_r(st) = Δ_r(s)Δ_r(t), s, t ∈ G.
(ii) ∫ f(x^{-1})μ(dx) = ∫ f(x)Δ_r(x^{-1})μ(dx).

Further, the integral

J1(f) = ∫ f(x)Δ_r(x^{-1})μ(dx)

is right invariant.
Proof. See Nachbin (1965, Section 5, Chapter 2). □

The two results above establish the existence and uniqueness of right and left invariant integrals and show how to construct right invariant integrals from left invariant integrals via the right-hand modulus Δ_r. The right-hand modulus is a continuous homomorphism from G into (0, ∞); that is, Δ_r is continuous and satisfies Δ_r(st) = Δ_r(s)Δ_r(t) for s, t ∈ G. (The definition of a homomorphism from one group to another group is given shortly.)
Before presenting examples of invariant integrals, it is convenient to introduce relatively left (and right) invariant integrals. Proposition 6.4, given
below, provides a useful method for constructing invariant integrals from
relatively invariant integrals.
Definition 6.10. A nonzero integral J on K(G) given by

J(f) = ∫ f(x)m(dx),  f ∈ K(G),

is called relatively left invariant if there exists a function χ on G to (0, ∞) such that

∫ f(s^{-1}x)m(dx) = χ(s) ∫ f(x)m(dx)

for all s ∈ G and f ∈ K(G). The function χ is the multiplier for J.
It can be shown that any multiplier χ is continuous (see Nachbin, 1965). Further, if J is relatively left invariant with multiplier χ, then for s, t ∈ G and f ∈ K(G),

χ(st) ∫ f(x)m(dx) = ∫ f((st)^{-1}x)m(dx) = ∫ (tf)(s^{-1}x)m(dx)
                  = χ(s) ∫ (tf)(x)m(dx) = χ(s) ∫ f(t^{-1}x)m(dx)
                  = χ(s)χ(t) ∫ f(x)m(dx).

Thus χ(st) = χ(s)χ(t). Hence all multipliers are continuous and are homomorphisms from G into (0, ∞). For any such homomorphism χ, it is clear that χ(e) = 1 and χ(s^{-1}) = 1/χ(s). Also, χ(G) = {χ(s) | s ∈ G} is a subgroup of the group (0, ∞) with multiplication as the group operation.
Proposition 6.4. Let χ be a continuous homomorphism on G to (0, ∞).

(i) If J(f) = ∫ f(x)μ(dx) is left invariant on K(G), then

J1(f) = ∫ f(x)χ(x)μ(dx)

is a relatively left invariant integral on K(G) with multiplier χ.
(ii) If J1(f) = ∫ f(x)m(dx) is relatively left invariant with multiplier χ, then

J(f) = ∫ f(x)χ(x^{-1})m(dx)

is a left invariant integral.

Proof. The proof is a calculation. For (i),

J1(sf) = ∫ (sf)(x)χ(x)μ(dx) = ∫ f(s^{-1}x)χ(ss^{-1}x)μ(dx)
       = χ(s) ∫ f(s^{-1}x)χ(s^{-1}x)μ(dx) = χ(s) ∫ f(x)χ(x)μ(dx)
       = χ(s)J1(f).

Thus J1 is relatively left invariant with multiplier χ. For (ii),

J(sf) = ∫ f(s^{-1}x)χ(x^{-1})m(dx) = χ(s^{-1}) ∫ f(s^{-1}x)χ((s^{-1}x)^{-1})m(dx)
      = χ(s^{-1})χ(s) ∫ f(x)χ(x^{-1})m(dx)
      = ∫ f(x)χ(x^{-1})m(dx) = J(f),

where the second equality uses χ((s^{-1}x)^{-1}) = χ(x^{-1}s) = χ(s)χ(x^{-1}), and the third applies the relative left invariance of m to the function x → f(x)χ(x^{-1}). Thus J is a left invariant integral and the proof is complete. □
If J is a relatively left invariant integral with multiplier χ, say

J(f) = ∫ f(x)m(dx),

the measure m is also called relatively left invariant with multiplier χ. A nonzero integral J1 on K(G) is relatively right invariant with multiplier χ if J1(fs) = χ(s)J1(f). Using the results given above, if J1 is relatively right invariant with multiplier χ, then J1 is relatively left invariant with multiplier
χ/Δ_r where Δ_r is the right-hand modulus of G. Thus all relatively right and left invariant integrals can be constructed from a given relatively left (or right) invariant integral once all the continuous homomorphisms are known. Also, if a relatively left invariant measure m can be found and its multiplier χ calculated, then a left invariant measure is given by m/χ according to Proposition 6.4. This observation is used in the examples below.
* Example 6.11. Consider the group Gl_n of all nonsingular n × n matrices. Let ds denote Lebesgue measure on Gl_n. Since Gl_n = {s | det(s) ≠ 0}, Gl_n is a nonempty open subset of n^2-dimensional Euclidean space and hence has positive Lebesgue measure. For f ∈ K(Gl_n), let

J(f) = ∫ f(t) dt.

To find a left invariant measure on Gl_n, it is now shown that J(sf) = |det(s)|^n J(f), so J is relatively left invariant with multiplier χ(s) = |det(s)|^n. From Proposition 5.10, the Jacobian of the transformation g(t) = st, s ∈ Gl_n, is |det(s)|^n. Thus

J(sf) = ∫ f(s^{-1}t) dt = |det(s)|^n ∫ f(t) dt = |det(s)|^n J(f).

From Proposition 6.4, it follows that the measure

μ(dt) = dt / |det(t)|^n

is a left invariant measure on Gl_n. A similar Jacobian argument shows that μ is also right invariant, so the right-hand modulus of Gl_n is Δ_r ≡ 1. To construct all of the relatively invariant measures on Gl_n, it is necessary that the continuous homomorphisms χ be characterized. For each α ∈ R, let

χ_α(s) = |det(s)|^α,  s ∈ Gl_n.

Obviously, each χ_α is a continuous homomorphism. However, it can be shown (see the problems at the end of this chapter) that if χ is a continuous homomorphism of Gl_n into (0, ∞), then χ = χ_α for some α ∈ R. Hence every relatively invariant measure on Gl_n is given by

m(dt) = c χ_α(t) dt

where c is a positive constant and α ∈ R.
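The Jacobian claim underlying Example 6.11 can be checked directly (an editorial sketch, not from the text): with row-major vectorization, the linear map t → st on n × n matrices has matrix s ⊗ I_n, whose determinant is det(s)^n.

```python
# Verify that left translation t -> s t on Gl_n has Jacobian |det s|^n, so
# Lebesgue measure dt is relatively left invariant with multiplier |det s|^n
# and dt / |det t|^n is left invariant.
import numpy as np

rng = np.random.default_rng(6)
n = 3
s = rng.normal(size=(n, n))
t = rng.normal(size=(n, n))

# with row-major vec, vec(s t) = (s kron I_n) vec(t)
M = np.kron(s, np.eye(n))
assert np.allclose(M @ t.ravel(), (s @ t).ravel())

jac = abs(np.linalg.det(M))
assert np.allclose(jac, abs(np.linalg.det(s))**n)
print("Jacobian of t -> s t equals |det s|^n")
```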
A group G for which Δ_r ≡ 1 is called unimodular. Clearly, all commutative groups are unimodular as a left invariant integral is also right invariant. In the following example, we consider the group G_T^+, which is not unimodular, even though G_T^+ is a subgroup of the unimodular group Gl_n.
* Example 6.12. Let G_T^+ be the group of all n × n lower triangular matrices with positive diagonal elements. Thus G_T^+ is a nonempty open subset of [n(n + 1)/2]-dimensional Euclidean space so G_T^+ has positive Lebesgue measure. Let dt denote [n(n + 1)/2]-dimensional Lebesgue measure restricted to G_T^+. Consider the integral

J(f) = ∫ f(t) dt

defined on K(G_T^+). The Jacobian of the transformation g(t) = st, s ∈ G_T^+, is equal to

χ_0(s) = ∏_{i=1}^n s_ii^i

where s has diagonal elements s_11, ..., s_nn (see Proposition 5.13). Thus

J(sf) = ∫ f(s^{-1}t) dt = χ_0(s) ∫ f(t) dt = χ_0(s)J(f).

Hence J is relatively left invariant with multiplier χ_0 so the measure

μ(dt) = dt / χ_0(t) = dt / ∏_{i=1}^n t_ii^i

is left invariant. To compute the right-hand modulus Δ_r for G_T^+, let

J1(f) = ∫ f(t)μ(dt)

so J1 is left invariant. Then

J1(fs) = ∫ f(ts^{-1})μ(dt) = ∫ f(ts^{-1}) dt/χ_0(t) = χ_0(s^{-1}) ∫ f(ts^{-1}) dt/χ_0(ts^{-1}).
By Proposition 5.14, the Jacobian of the transformation g(t) = ts is

χ_1(s) = ∏_{i=1}^n s_ii^{n-i+1}.

Therefore,

J1(fs) = χ_0(s^{-1})χ_1(s) ∫ f(t) dt/χ_0(t) = [χ_1(s)/χ_0(s)] J1(f).

By Theorem 6.2,

Δ_r(s) = χ_1(s)/χ_0(s) = ∏_{i=1}^n s_ii^{n-2i+1}

is the right-hand modulus for G_T^+. Therefore, the measure

ν(dt) = μ(dt)/Δ_r(t) = dt/[χ_0(t)Δ_r(t)] = dt / ∏_{i=1}^n t_ii^{n-i+1}

is right invariant. As in the previous example, a description of the relatively left invariant measures is simply a matter of describing all the continuous homomorphisms on G_T^+. For each vector c ∈ R^n with coordinates c_1, ..., c_n, let

χ_c(t) = ∏_{i=1}^n t_ii^{c_i}

where t ∈ G_T^+ has diagonal elements t_11, ..., t_nn. It is easy to verify that χ_c is a continuous homomorphism on G_T^+. It is known that if χ is a continuous homomorphism on G_T^+, then χ is given by χ_c for some c ∈ R^n (see Problems 6.4 and 6.9). Thus every relatively left invariant measure on G_T^+ has the form

m(dt) = k χ_c(t) dt / χ_0(t)

for some positive constant k and some vector c ∈ R^n.
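The two Jacobians χ_0 and χ_1 of Example 6.12 (and hence Δ_r) can be checked numerically by building the matrices of the linear maps t → st and t → ts on the n(n + 1)/2 lower-triangular coordinates (an editorial sketch, not from the text):

```python
# On lower-triangular coordinates, t -> s t has Jacobian prod_i s_ii^i and
# t -> t s has Jacobian prod_i s_ii^{n-i+1} (exponents written 1-indexed).
import numpy as np

rng = np.random.default_rng(7)
n = 4
s = np.tril(rng.normal(size=(n, n)))
s[np.diag_indices(n)] = rng.uniform(0.5, 2.0, size=n)   # positive diagonal

idx = [(i, j) for i in range(n) for j in range(i + 1)]  # lower-triangular coords

def jacobian(translate):
    J = np.zeros((len(idx), len(idx)))
    for col, (i, j) in enumerate(idx):
        E = np.zeros((n, n)); E[i, j] = 1.0
        image = translate(E)                            # stays lower triangular
        J[:, col] = [image[a, b] for (a, b) in idx]
    return abs(np.linalg.det(J))

d = np.diag(s)
chi0 = np.prod([d[i]**(i + 1) for i in range(n)])       # prod s_ii^i
chi1 = np.prod([d[i]**(n - i) for i in range(n)])       # prod s_ii^{n-i+1}
assert np.allclose(jacobian(lambda E: s @ E), chi0)
assert np.allclose(jacobian(lambda E: E @ s), chi1)
# the right-hand modulus Delta_r(s) = chi1/chi0 = prod s_ii^{n-2i+1}
assert np.allclose(chi1 / chi0, np.prod([d[i]**(n - 2 * i - 1) for i in range(n)]))
print("left and right Jacobians on G_T^+ match chi_0 and chi_1")
```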
The following two examples deal with the affine group and a subgroup of Gl_n related to the group introduced in Example 6.5.
* Example 6.13. Consider the group Al_n of all affine transformations on R^n. An element of Al_n is a pair (s, x) where s ∈ Gl_n and x ∈ R^n. Recall that the group operation in Al_n is

(s1, x1)(s2, x2) = (s1s2, s1x2 + x1)

so

(s, x)^{-1} = (s^{-1}, -s^{-1}x).

Let ds dx denote Lebesgue measure restricted to Al_n. In order to construct a left invariant measure on Al_n, it is shown that the integral

J(f) = ∫ f(t, y) dt dy

is relatively left invariant with multiplier

χ_0(s, x) = |det(s)|^{n+1}.

For (s, x) ∈ Al_n,

J((s, x)f) = ∫ f((s, x)^{-1}(t, y)) dt dy = ∫ f((s^{-1}, -s^{-1}x)(t, y)) dt dy
           = ∫ f(s^{-1}t, s^{-1}y - s^{-1}x) dt dy
           = |det(s)| ∫ f(s^{-1}t, u) dt du.

The last equality follows from the change of variable u = s^{-1}y - s^{-1}x, which has Jacobian |det(s)|. As in Example 6.11,

∫ f(s^{-1}t, u) dt = |det(s)|^n ∫ f(t, u) dt

for each fixed u ∈ R^n. Thus

J((s, x)f) = |det(s)|^{n+1} ∫∫ f(t, u) dt du = |det(s)|^{n+1} J(f) = χ_0(s, x)J(f)
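The multiplier χ_0(s, x) = |det(s)|^{n+1} can also be seen as the Jacobian of left translation on Al_n, checked numerically here (an editorial sketch, not from the text): the map (t, y) → (s, x)(t, y) = (st, sy + x) is affine in (t, y) with linear part block diagonal.

```python
# The linear part of (t, y) -> (s t, s y + x) on the n^2 + n coordinates has
# determinant |det s|^n * |det s| = |det s|^{n+1}; the shift by x does not
# enter the Jacobian.
import numpy as np

rng = np.random.default_rng(8)
n = 3
s = rng.normal(size=(n, n))

# block-diagonal matrix of the linear part (row-major vec for the t block)
M = np.block([[np.kron(s, np.eye(n)), np.zeros((n * n, n))],
              [np.zeros((n, n * n)),  s]])
t, y = rng.normal(size=(n, n)), rng.normal(size=n)
assert np.allclose(M @ np.concatenate([t.ravel(), y]),
                   np.concatenate([(s @ t).ravel(), s @ y]))

jac = abs(np.linalg.det(M))
assert np.allclose(jac, abs(np.linalg.det(s))**(n + 1))
print("Jacobian of left translation on Al_n is |det s|^(n+1)")
```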
so J is relatively left invariant with multiplier χ_0. Hence the measure

μ(ds, du) = ds du / χ_0(s, u) = ds du / |det(s)|^{n+1}
is left invariant. To find the right-hand modulus of Al_n, let

J_1(f) = ∫∫ f(t, u) dt du / χ_0(t, u)
be a left invariant integral. Then, using an argument similar to that above, we have

J_1(f(s, x)) = ∫∫ f((t, u)(s, x)^{-1}) dt du / χ_0(t, u)
             = ∫∫ f((t, u)(s^{-1}, -s^{-1}x)) dt du / χ_0(t, u)
             = ∫∫ f(ts^{-1}, u - ts^{-1}x) dt du / |det(t)|^{n+1}
             = ∫∫ f(ts^{-1}, v) dt dv / |det(t)|^{n+1}
             = |det(s)|^n ∫∫ f(t, v) dt dv / |det(ts)|^{n+1}
             = |det(s)|^n |det(s)|^{-(n+1)} ∫∫ f(t, v) dt dv / |det(t)|^{n+1}
             = |det(s)|^{-1} J_1(f).
Thus Δ_r(s, x) = |det(s)|^{-1}, so a right invariant measure on Al_n is

ν(ds, du) = μ(ds, du) / Δ_r(s, u) = ds du / |det(s)|^n.
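In vec coordinates the left translation (t, y) → (st, sy + x) is linear with block diagonal Jacobian (I_n ⊗ s, s), while the right translation (t, y) → (ts, tx + y) is block triangular with diagonal blocks (s' ⊗ I_n, I_n); their determinants give the multiplier and, by taking a ratio, the modulus. A numerical sketch of this bookkeeping (setup and seed are illustrative; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
s = rng.normal(size=(n, n))              # a generic element of Gl_n (nonsingular a.s.)
dets = abs(np.linalg.det(s))

# Left translation (t, y) -> (st, sy + x): vec Jacobian is block-diag(I kron s, s).
J_left = np.zeros((n * n + n,) * 2)
J_left[:n * n, :n * n] = np.kron(np.eye(n), s)   # vec(st) = (I kron s) vec(t)
J_left[n * n:, n * n:] = s                       # y -> sy (the shift x drops out)
left_jac = abs(np.linalg.det(J_left))
assert np.isclose(left_jac, dets ** (n + 1))     # chi_0(s, x) = |det s|^{n+1}

# Right translation (t, y) -> (ts, tx + y): block triangular Jacobian, so its
# determinant is |det(s' kron I_n)| * |det(I_n)| = |det s|^n.
right_jac = abs(np.linalg.det(np.kron(s.T, np.eye(n))))
assert np.isclose(right_jac, dets ** n)

# The ratio recovers the right-hand modulus Delta_r(s, x) = |det s|^{-1}.
assert np.isclose(right_jac / left_jac, 1 / dets)
```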
Now, suppose that χ is a continuous homomorphism on Al_n. Since

(s, x) = (s, 0)(e, s^{-1}x) = (e, x)(s, 0)
where e is the n × n identity matrix, χ must satisfy the equation

χ(s, x) = χ(s, 0)χ(e, s^{-1}x) = χ(s, 0)χ(e, x).

Thus for all s ∈ Gl_n,

χ(e, x) = χ(e, s^{-1}x).

Letting s^{-1} converge to the zero matrix, the continuity of χ implies that

χ(e, x) = χ(e, 0) = 1

since (e, 0) is the identity in Al_n. Therefore,

χ(s, x) = χ(s, 0), s ∈ Gl_n.

However,

χ((s_1, 0)(s_2, 0)) = χ(s_1 s_2, 0) = χ(s_1, 0)χ(s_2, 0)

so s → χ(s, 0) is a continuous homomorphism on Gl_n. But every continuous homomorphism on Gl_n is given by s → |det(s)|^α for some real α. In summary, χ is a continuous homomorphism on Al_n iff

χ(s, x) = |det(s)|^α

for some real number α. Thus we have a complete description of all the relatively invariant integrals on Al_n.
Example 6.14. In this example, the group G consists of all n × n nonsingular matrices s that have the form

s = ( s_11  s_12 )
    (  0    s_22 ),   s_11 ∈ Gl_p, s_22 ∈ Gl_q,

where p + q = n. Let M be the subspace of R^n consisting of those vectors whose last q coordinates are zero. Then G is the subgroup of Gl_n consisting of those elements s that satisfy s(M) ⊆ M. Let ds_11 ds_12 ds_22 denote Lebesgue measure restricted to G when G is regarded as a subset of (p² + q² + pq)-dimensional Euclidean space. Since G is a nonempty open subset of this space, G has positive Lebesgue measure. As in previous examples, it is shown
that the integral

J(f) = ∫ f(t) dt_11 dt_12 dt_22

is relatively left invariant. For s ∈ G,

J(sf) = ∫ f(s^{-1}t) dt_11 dt_12 dt_22.

A bit of calculation shows that

( s_11  s_12 )^{-1}   ( s_11^{-1}  -s_11^{-1} s_12 s_22^{-1} )
(  0    s_22 )      = (    0              s_22^{-1}          )

and

s^{-1}t = ( s_11^{-1} t_11   s_11^{-1} t_12 - s_11^{-1} s_12 s_22^{-1} t_22 )
          (       0                        s_22^{-1} t_22                   ).

Let

u_11 = s_11^{-1} t_11,   u_22 = s_22^{-1} t_22,   u_12 = s_11^{-1} t_12 - s_11^{-1} s_12 s_22^{-1} t_22.

The Jacobian of this transformation is

χ_0(s) = |det(s_11)|^p |det(s_22)|^q |det(s_11)|^q = |det(s_11)|^n |det(s_22)|^q.

Therefore,

J(sf) = χ_0(s)J(f)

so the measure

μ_l(dt_11, dt_12, dt_22) = dt_11 dt_12 dt_22 / (|det(t_11)|^n |det(t_22)|^q)

is left invariant. Setting

J_1(f) = ∫ f(t) μ_l(dt_11, dt_12, dt_22),
a calculation similar to that above yields

J_1(fs) = Δ_r(s)J_1(f)

where

Δ_r(s) = |det(s_11)|^{-q} |det(s_22)|^{p}.

Thus Δ_r is the right-hand modulus of G and the measure

ν_r(dt_11, dt_12, dt_22) = μ_l(dt_11, dt_12, dt_22) / Δ_r(t) = dt_11 dt_12 dt_22 / (|det(t_11)|^p |det(t_22)|^n)

is right invariant. For α, β ∈ R, let

χ_{α,β}(s) = |det(s_11)|^α |det(s_22)|^β.

Clearly, χ_{α,β} is a continuous homomorphism of G into (0, ∞). Conversely, it is not too difficult to show that every continuous homomorphism of G into (0, ∞) is equal to χ_{α,β} for some α, β ∈ R. Again, this gives a complete description of all the relatively invariant integrals on G.
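As a check on χ_0, the left translation t → st is linear in the block coordinates (t_11, t_12, t_22), so its Jacobian matrix can be assembled from Kronecker products and its determinant compared with |det(s_11)|^n |det(s_22)|^q. A numerical sketch (block sizes and seed are illustrative; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 2, 3
n = p + q
s11 = rng.normal(size=(p, p))
s12 = rng.normal(size=(p, q))
s22 = rng.normal(size=(q, q))

# Left translation in block coordinates:
#   u11 = s11 t11,  u12 = s11 t12 + s12 t22,  u22 = s22 t22.
# Assemble the Jacobian on (vec t11, vec t12, vec t22); it is block triangular.
d = p * p + p * q + q * q
J = np.zeros((d, d))
J[:p * p, :p * p] = np.kron(np.eye(p), s11)                  # vec(s11 t11)
J[p * p:p * p + p * q, p * p:p * p + p * q] = np.kron(np.eye(q), s11)  # vec(s11 t12)
J[p * p:p * p + p * q, p * p + p * q:] = np.kron(np.eye(q), s12)       # vec(s12 t22)
J[p * p + p * q:, p * p + p * q:] = np.kron(np.eye(q), s22)            # vec(s22 t22)

chi0 = abs(np.linalg.det(s11)) ** n * abs(np.linalg.det(s22)) ** q
assert np.isclose(abs(np.linalg.det(J)), chi0)
```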
In the four examples above, the same argument was used to derive the left and right invariant measures, the modular function, and all of the relatively invariant measures. Namely, the group G had positive Lebesgue measure when regarded as a subset of an obvious Euclidean space. The integral on 𝒦(G) defined by Lebesgue measure was relatively left invariant with a multiplier that we calculated. Thus a left invariant measure on G was simply Lebesgue measure divided by the multiplier. From this, the right-hand modulus and a right invariant measure were easily derived. The characterization of the relatively invariant integrals amounted to finding all the solutions of the functional equation χ(st) = χ(s)χ(t) where χ is a continuous function on G to (0, ∞). Of course, the above technique can be applied to many other matrix groups, for example, the matrix group considered in Example 6.5. However, there are important matrix groups for which this argument is not available because the group has Lebesgue measure zero in the "natural" Euclidean space of which the group is a subset. For example, consider the group of n × n orthogonal matrices O_n. When regarded as a subset of n²-dimensional Euclidean space, O_n has Lebesgue measure zero. But, without a fairly complicated parameterization of O_n, it is not possible to regard O_n as a set of positive Lebesgue measure in some Euclidean space.
For this reason, we do not demonstrate directly the existence of an invariant measure on O_n in this chapter. In the following chapter, a probabilistic proof of the existence of an invariant measure on O_n is given.

The group O_n and other groups to be considered later are in fact compact topological groups. A basic property of such groups is given next.
Proposition 6.5. Suppose G is a locally compact topological group. Then G
is compact iff there exists a left invariant probability measure on G.
Proof. See Nachbin (1965, Section 5, Chapter 2). □
The following result shows that when G is compact, left invariant
measures are right invariant measures and all relatively invariant measures
are in fact invariant.
Proposition 6.6. If G is compact and χ is a continuous homomorphism on G to (0, ∞), then χ(s) = 1 for all s ∈ G.

Proof. Since χ is continuous and G is compact, χ(G) = {χ(s) | s ∈ G} is a compact subset of (0, ∞). Since χ is a homomorphism, χ(G) is a subgroup of (0, ∞). However, the only compact subgroup of (0, ∞) is {1}. Thus χ(s) = 1 for all s ∈ G. □
The nonexistence of nontrivial continuous homomorphisms on compact groups shows that all compact groups are unimodular. Further, all relatively invariant measures are invariant. Whenever G is compact, the invariant
measure on G is always taken to be a probability measure.
6.3. INVARIANT MEASURES ON QUOTIENT SPACES
In this section, we consider the existence and uniqueness of invariant integrals on spaces that are acted on transitively by a group. Throughout this section, 𝒳 is a locally compact Hausdorff space and 𝒦(𝒳) denotes the set of continuous functions on 𝒳 that have compact support. Also, G is a locally compact topological group that acts on the left of 𝒳.

Definition 6.11. The group G acts topologically on 𝒳 if the function from G × 𝒳 to 𝒳 given by (g, x) → gx is continuous. When G acts topologically on 𝒳, 𝒳 is a left homogeneous space if for each x ∈ 𝒳, the function π_x on G to 𝒳 defined by π_x(g) = gx is continuous, open, and onto 𝒳.
The assumption that each π_x is an onto function is just another way to say that G acts transitively on 𝒳. Also, it is not difficult to show that if, for one x ∈ 𝒳, π_x is continuous, open, and onto 𝒳, then for all x, π_x is continuous, open, and onto 𝒳. To describe the structure of left homogeneous spaces 𝒳, fix an element x_0 ∈ 𝒳 and let

H_0 = {g | gx_0 = x_0, g ∈ G}.

That H_0 is a closed subgroup of G is easily verified. Further, the function T considered in Proposition 6.3 is now one-to-one and onto, and T and T^{-1} are both continuous. Thus we have a one-to-one, onto, bicontinuous mapping between 𝒳 and the quotient space G/H_0 endowed with the quotient topology. Conversely, let H be a closed subgroup of G and take 𝒳 = G/H with the quotient topology. The group G acts on G/H in the obvious way (g(g_1 H) = gg_1 H) and it is easily verified that G/H is a left homogeneous space (see Nachbin 1965, Section 3, Chapter 3). Thus we have a complete description of the left homogeneous spaces (up to relabelings by T) as quotient spaces G/H where H is a closed subgroup of G.
In the notation above, let 𝒳 be a left homogeneous space.

Definition 6.12. A nonzero integral J on 𝒦(𝒳),

J(f) = ∫ f(x) m(dx), f ∈ 𝒦(𝒳),

is relatively invariant with multiplier χ if, for each s ∈ G,

∫ f(s^{-1}x) m(dx) = χ(s) ∫ f(x) m(dx)

for all f ∈ 𝒦(𝒳).

For f ∈ 𝒦(𝒳), the function sf given by (sf)(x) = f(s^{-1}x) is the left translate of f by s ∈ G. Thus an integral J on 𝒦(𝒳) is relatively invariant with multiplier χ if J(sf) = χ(s)J(f). For such an integral,

χ(st)J(f) = J((st)f) = J(s(tf)) = χ(s)J(tf) = χ(s)χ(t)J(f)

so χ(st) = χ(s)χ(t). Also, any multiplier χ is continuous, which implies that a multiplier is a continuous homomorphism of G into the multiplicative group (0, ∞).
* Example 6.15. Let 𝒳 be the set of all p × p positive definite matrices. The group G = Gl_p acts transitively on 𝒳 as shown in Example 6.9. That 𝒳 is a left homogeneous space is easily verified. For α ∈ R, define the measure m_α by

m_α(dx) = (det(x))^{α/2} dx / (det(x))^{(p+1)/2}

where dx is Lebesgue measure on 𝒳. Let J_α(f) = ∫ f(x) m_α(dx). For s ∈ Gl_p, s(x) = sxs' is the group action on 𝒳. Therefore,

J_α(sf) = ∫ f(s^{-1}(x)) m_α(dx)
        = ∫ f(s^{-1}x(s')^{-1}) (det(x))^{α/2} dx / (det(x))^{(p+1)/2}
        = |det(s)|^α ∫ f(y) (det(y))^{α/2} dy / (det(y))^{(p+1)/2}
        = |det(s)|^α J_α(f).

The second equality follows from the change of variable x = sys', which has Jacobian |det(s)|^{p+1} (see Proposition 5.11). Hence

J_α(sf) = |det(s)|^α J_α(f)

for all s ∈ Gl_p, f ∈ 𝒦(𝒳), and J_α is relatively invariant with multiplier χ_α(s) = |det(s)|^α. For this example, it has been shown that for every continuous homomorphism χ on G, there is a relatively invariant integral with multiplier χ. That this is not the case in general is demonstrated in future examples.
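The Jacobian |det(s)|^{p+1} of x → sxs' can be verified directly by writing the map in the p(p+1)/2 coordinates x_ij, i ≤ j, of a symmetric matrix and computing the determinant of the resulting linear map. A numerical sketch (helper names are ours; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
A = rng.normal(size=(p, p))      # a generic element of Gl_p

# Coordinates on symmetric matrices: the p(p+1)/2 entries s_ij with i <= j.
idx = [(i, j) for i in range(p) for j in range(i, p)]
m = len(idx)

def sym_from_coords(v):
    S = np.zeros((p, p))
    for k, (i, j) in enumerate(idx):
        S[i, j] = S[j, i] = v[k]
    return S

# Matrix of the linear map S -> A S A' in these coordinates.
M = np.zeros((m, m))
for k in range(m):
    e = np.zeros(m); e[k] = 1.0
    T = A @ sym_from_coords(e) @ A.T
    M[:, k] = [T[i, j] for (i, j) in idx]

assert np.isclose(abs(np.linalg.det(M)), abs(np.linalg.det(A)) ** (p + 1))
```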
The problem of the existence and uniqueness of relatively invariant integrals on left homogeneous spaces 𝒳 is completely solved in the following result due to Weil (see Nachbin, 1965, Section 4, Chapter 3). Recall that x_0 is a fixed element of 𝒳 and

H_0 = {g | gx_0 = x_0, g ∈ G}

is a closed subgroup of G. Let Δ_r denote the right-hand modulus of G and let Δ_r^0 denote the right-hand modulus of H_0.
Theorem 6.3. In the notation above:
(i) If J(f) = ∫ f(x) m(dx) is relatively invariant with multiplier χ, then Δ_r^0(h) = χ(h)Δ_r(h) for all h ∈ H_0.

(ii) If χ is a continuous homomorphism of G to (0, ∞) that satisfies Δ_r^0(h) = χ(h)Δ_r(h), h ∈ H_0, then a relatively invariant integral with multiplier χ exists.

(iii) If J_1 and J_2 are relatively invariant with the same multiplier, then there exists a constant c > 0 such that J_2 = cJ_1.
Before turning to applications of Theorem 6.3, a few general comments are in order. If the subgroup H_0 is compact, then Δ_r^0(h) = 1 for all h ∈ H_0. Since the restrictions of χ and of Δ_r to H_0 are both continuous homomorphisms on H_0, Δ_r(h) = χ(h) = 1 for all h ∈ H_0 as H_0 is compact. Thus when H_0 is compact, any continuous homomorphism χ is a multiplier for a relatively invariant integral, and the description of all the relatively invariant integrals reduces to finding all the continuous homomorphisms of G. Further, when G is compact, only an invariant integral on 𝒦(𝒳) can exist, as χ ≡ 1 is the only continuous homomorphism. When G and H_0 are not compact, the situation is a bit more complicated. Both Δ_r and Δ_r^0 must be calculated, and then the continuous homomorphisms χ on G to (0, ∞) that satisfy (ii) of Theorem 6.3 must be found. Only then do we have a description of the relatively invariant integrals on 𝒦(𝒳). Of course, the condition for the existence of an invariant integral (χ ≡ 1) is that Δ_r^0(h) = Δ_r(h) for all h ∈ H_0. If J is a relatively invariant integral (with multiplier χ) given by
J(f) = ∫ f(x) m(dx), f ∈ 𝒦(𝒳),

then the measure m is called relatively invariant with multiplier χ. In Example 6.15, it was shown that for each α ∈ R, the measure m_α is relatively invariant under Gl_p with multiplier χ_α. Theorem 6.3 implies that any relatively invariant measure on the space of p × p positive definite matrices is equal to a positive constant times an m_α for some α ∈ R. We now proceed with further examples.
* Example 6.16. Let 𝒳 = 𝔉_{p,n} and let G = O_n. It was shown in Example 6.8 that O_n acts transitively on 𝔉_{p,n}. The verification that
𝔉_{p,n} is a left homogeneous space is left to the reader. Since O_n is compact, Theorem 6.3 implies that there is a unique probability measure μ on 𝔉_{p,n} that is invariant under the action of O_n. Also, any relatively invariant measure on 𝔉_{p,n} will be equal to a positive constant times μ. The distribution μ is sometimes called the uniform distribution on 𝔉_{p,n}. When p = 1,

𝔉_{1,n} = {x | x ∈ R^n, ||x|| = 1},

which is the rim of the unit sphere in R^n. The uniform distribution on 𝔉_{1,n} is just surface Lebesgue measure normalized so that it is a probability measure. When p = n, then 𝔉_{n,n} = O_n and μ is the uniform distribution on the orthogonal group. A different argument, probabilistic in nature, is given in the next chapter; it also establishes the existence of the uniform distribution on 𝔉_{p,n}.
* Example 6.17. Take 𝒳 = R^p - {0} and let G = Gl_p. The action of Gl_p on 𝒳 is that of a matrix acting on a vector, and this action is obviously transitive. The verification that 𝒳 is a left homogeneous space is routine. Consider the integral

J(f) = ∫ f(x) dx, f ∈ 𝒦(𝒳),

where dx is Lebesgue measure on 𝒳. For s ∈ Gl_p, it is clear that J(sf) = |det(s)|J(f), so J is relatively invariant with multiplier χ_1(s) = |det(s)|. We now show that J is the only relatively invariant integral on 𝒦(𝒳). This is done by proving that χ_1 is the only possible multiplier for relatively invariant integrals on 𝒦(𝒳). A convenient choice of x_0 ∈ 𝒳 is x_0 = ε_1 where ε_1 = (1, 0, …, 0)'. Then

H_0 = {h | hε_1 = ε_1, h ∈ Gl_p}.

A bit of reflection shows that h ∈ H_0 iff

h = ( 1  h_12 )
    ( 0  h_22 )

where h_22 ∈ Gl_{p-1} and h_12 is 1 × (p - 1). A calculation similar to
that in Example 6.14 yields

μ_l(dh_12, dh_22) = dh_12 dh_22 / |det(h_22)|^{p-1}

as a left invariant measure on H_0. Then the integral

J_1(f) = ∫ f(h) μ_l(dh_12, dh_22)

is left invariant on 𝒦(H_0), and a standard Jacobian argument yields

J_1(fh) = Δ_r^0(h)J_1(f), f ∈ 𝒦(H_0),

where

Δ_r^0(h) = |det(h_22)|, h ∈ H_0.

Every continuous homomorphism on Gl_p has the form χ_α(s) = |det(s)|^α for some α ∈ R. Since Δ_r ≡ 1 for Gl_p, χ_α can be a multiplier for a relatively invariant integral iff

Δ_r^0(h) = χ_α(h), h ∈ H_0.

But Δ_r^0(h) = |det(h_22)| and, for h ∈ H_0, χ_α(h) = |det(h_22)|^α, so the only value of α for which χ_α can be a multiplier is α = 1. Further, the integral J is relatively invariant with multiplier χ_1. Thus Lebesgue measure on 𝒳 is the only (up to a positive constant) relatively invariant measure on 𝒳 under the action of Gl_p.
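The compatibility condition of Theorem 6.3 can be spot-checked: for h ∈ H_0 in the block form above, expanding det(h) along the first column gives det(h) = det(h_22), so χ_1(h) = Δ_r^0(h) automatically. A small numerical sketch (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 4
h = np.eye(p)
h[0, 1:] = rng.normal(size=p - 1)            # h_12, a 1 x (p-1) row
h[1:, 1:] = rng.normal(size=(p - 1, p - 1))  # h_22 in Gl_{p-1} (nonsingular a.s.)

e1 = np.eye(p)[:, 0]
assert np.allclose(h @ e1, e1)               # h fixes epsilon_1, so h is in H_0
# chi_1(h) = |det(h)| equals Delta_r^0(h) = |det(h_22)|
assert np.isclose(abs(np.linalg.det(h)), abs(np.linalg.det(h[1:, 1:])))
```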
Before turning to the next example, it is convenient to introduce the direct product of two groups. If G_1 and G_2 are groups, the direct product of G_1 and G_2, denoted by G = G_1 × G_2, is the group consisting of all pairs (g_1, g_2) with g_i ∈ G_i, i = 1, 2, and group operation

(g_1, g_2)(h_1, h_2) = (g_1 h_1, g_2 h_2).

If e_i is the identity in G_i, i = 1, 2, then (e_1, e_2) is the identity in G and (g_1, g_2)^{-1} = (g_1^{-1}, g_2^{-1}). When G_1 and G_2 are locally compact topological groups, G_1 × G_2 is a locally compact topological group when endowed with the product topology. The next two results describe all the continuous homomorphisms and relatively left invariant measures on G_1 × G_2 in terms
of continuous homomorphisms and relatively left invariant measures on G_1 and G_2.

Proposition 6.7. Suppose G_1 and G_2 are locally compact topological groups. Then χ is a continuous homomorphism on G_1 × G_2 iff χ((g_1, g_2)) = χ_1(g_1)χ_2(g_2), (g_1, g_2) ∈ G_1 × G_2, where χ_i is a continuous homomorphism on G_i, i = 1, 2.

Proof. If χ((g_1, g_2)) = χ_1(g_1)χ_2(g_2), clearly χ is a continuous homomorphism on G_1 × G_2. Conversely, since (g_1, g_2) = (g_1, e_2)(e_1, g_2), if χ is a continuous homomorphism on G_1 × G_2, then

χ((g_1, g_2)) = χ(g_1, e_2)χ(e_1, g_2).

Setting χ_1(g_1) = χ(g_1, e_2) and χ_2(g_2) = χ(e_1, g_2), the desired result follows. □

Proposition 6.8. Suppose χ is a continuous homomorphism on G_1 × G_2 with χ(g_1, g_2) = χ_1(g_1)χ_2(g_2) where χ_i is a continuous homomorphism on G_i, i = 1, 2. If m is a relatively left invariant measure with multiplier χ, then there exist relatively left invariant measures m_i on G_i with multipliers χ_i, i = 1, 2, and m is the product measure m_1 × m_2. Conversely, if m_i is a relatively left invariant measure on G_i with multiplier χ_i, i = 1, 2, then m_1 × m_2 is a relatively left invariant measure on G_1 × G_2 with multiplier χ, which satisfies χ(g_1, g_2) = χ_1(g_1)χ_2(g_2).

Proof. This result is a direct consequence of Fubini's theorem and the existence and uniqueness of relatively left invariant integrals. □
The following example illustrates many of the results presented in this chapter and has a number of applications in multivariate analysis. For example, one of the derivations of the Wishart distribution is quite easy given the results of this example.
* Example 6.18. As in Example 6.10, 𝒳 is the set of all n × p matrices with rank p and G is the direct product group O_n × G_T. The action of (Γ, T) ∈ O_n × G_T on 𝒳 is

(Γ, T)X = (Γ ⊗ T)X = ΓXT', X ∈ 𝒳.

Since 𝒳 = {X | X ∈ ℒ_{p,n}, det(X'X) > 0}, 𝒳 is a nonempty open
subset of ℒ_{p,n}. Let dX be Lebesgue measure on 𝒳 and define a measure on 𝒳 by

m(dX) = dX / (det(X'X))^{n/2}.

Using Proposition 5.10, it is an easy calculation to show that the integral

J(f) = ∫ f(X) m(dX)

is invariant, that is, J((Γ, T)f) = J(f) for (Γ, T) ∈ O_n × G_T and f ∈ 𝒦(𝒳). However, it takes a bit more work to characterize all the relatively invariant measures on 𝒳. First, it was shown in Example 6.10 that, if X_0 is

X_0 = ( I_p )
      (  0  ) ∈ 𝒳,

then H_0 = {(Γ, T) | (Γ, T)X_0 = X_0} is a closed subgroup of O_n × G_T and is compact. By Theorem 6.3, every continuous homomorphism on O_n × G_T is the multiplier for a relatively invariant integral. But every continuous homomorphism χ on O_n × G_T has the form χ(Γ, T) = χ_1(Γ)χ_2(T) where χ_1 and χ_2 are continuous homomorphisms on O_n and G_T. Since O_n is compact, χ_1 ≡ 1. From Example 6.12,

χ_2(T) = ∏_{i=1}^p (t_ii)^{c_i} = χ_c(T)

where c ∈ R^p has coordinates c_1, …, c_p. Now that all the possible multipliers have been described, we want to exhibit the relatively invariant integrals on 𝒦(𝒳). To this end, consider the space 𝒴 = 𝔉_{p,n} × G_U, so points in 𝒴 are (ψ, U) where ψ is an n × p linear isometry and U is a p × p upper triangular matrix in G_U. The group O_n × G_T acts transitively on 𝒴 under the group action

(Γ, T)(ψ, U) = (Γψ, UT').

Let μ_0 be the unique probability measure on 𝔉_{p,n} that is O_n-invariant and let ν_r be the particular right invariant measure on the group G_U
given by

ν_r(dU) = dU / ∏_{i=1}^p u_ii^i.
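That ν_r is right invariant comes down to the Jacobian of the right translation U → UV on G_U being ∏_j v_jj^j (column j of UV has j free coordinates), which exactly cancels the change in ∏ u_ii^i. A numerical sketch of that Jacobian (the helper rand_gu is ours; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
p = 3

def rand_gu(rng, p):
    # a random element of G_U: upper triangular with positive diagonal
    u = np.triu(rng.normal(size=(p, p)))
    np.fill_diagonal(u, rng.uniform(0.5, 2.0, size=p))
    return u

v = rand_gu(rng, p)

# Jacobian of U -> U V on the coordinates u_ij, i <= j.
idx = [(i, j) for i in range(p) for j in range(i, p)]
m = len(idx)
M = np.zeros((m, m))
for k, (i, j) in enumerate(idx):
    E = np.zeros((p, p)); E[i, j] = 1.0
    T = E @ v                       # image of the basis element e_{ij}
    M[:, k] = [T[a, b] for (a, b) in idx]

expected = np.prod(np.diag(v) ** np.arange(1, p + 1))   # prod_j v_jj^j
assert np.isclose(abs(np.linalg.det(M)), expected)
```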
Obviously, the integral

J_1(f) = ∫∫ f(ψ, U) μ_0(dψ) ν_r(dU)

is invariant under the action of O_n × G_T on 𝔉_{p,n} × G_U, f ∈ 𝒦(𝔉_{p,n} × G_U). Consider the integral

J_2(f) = ∫∫ f(ψ, U) χ_c(U') μ_0(dψ) ν_r(dU)

defined on 𝒦(𝔉_{p,n} × G_U) where χ_c is a continuous homomorphism on G_T. The claim is that J_2((Γ, T)f) = χ_c(T)J_2(f), so J_2 is relatively invariant with multiplier χ_c. To see this, compute as follows:

J_2((Γ, T)f) = ∫∫ f((Γ, T)^{-1}(ψ, U)) χ_c(U') μ_0(dψ) ν_r(dU)
            = ∫∫ f(Γ'ψ, U(T')^{-1}) χ_c(TT^{-1}U') μ_0(dψ) ν_r(dU)
            = χ_c(T) ∫∫ f(Γ'ψ, U(T')^{-1}) χ_c((U(T')^{-1})') μ_0(dψ) ν_r(dU)
            = χ_c(T)J_2(f).

The last equality follows from the invariance of μ_0 and ν_r. Thus all the relatively invariant integrals on 𝒦(𝔉_{p,n} × G_U) have been explicitly described. To do the same for 𝒦(𝒳), the basic idea is to move the integral J_2 over to 𝒦(𝒳). It was mentioned earlier that the map φ_0 on 𝔉_{p,n} × G_U to 𝒳 given by

φ_0(ψ, U) = ψU ∈ 𝒳

is one-to-one, onto, and satisfies

φ_0((Γ, T)(ψ, U)) = (Γ, T)φ_0(ψ, U)
for group elements (Γ, T). For f ∈ 𝒦(𝒳), consider the integral

J_3(f) = ∫∫ f(φ_0(ψ, U)) μ_0(dψ) ν_r(dU).

Then for (Γ, T) ∈ O_n × G_T,

J_3((Γ, T)f) = ∫∫ f((Γ, T)^{-1}φ_0(ψ, U)) μ_0(dψ) ν_r(dU)
            = ∫∫ f(φ_0((Γ, T)^{-1}(ψ, U))) μ_0(dψ) ν_r(dU)
            = ∫∫ f(φ_0(Γ'ψ, U(T')^{-1})) μ_0(dψ) ν_r(dU) = J_3(f)

since μ_0 and ν_r are invariant. Therefore, J_3 is an invariant integral on 𝒦(𝒳). Since J is also an invariant integral on 𝒦(𝒳), Theorem 6.3 shows that there is a positive constant k such that

J(f) = kJ_3(f), f ∈ 𝒦(𝒳).

More explicitly, we have the equation

∫ f(X) dX / (det(X'X))^{n/2} = k ∫∫ f(ψU) μ_0(dψ) ν_r(dU)

for all f ∈ 𝒦(𝒳). This equation is a formal way to state the very nontrivial fact that the measure m on 𝒳 gets transformed into the measure k(μ_0 × ν_r) on 𝔉_{p,n} × G_U under the mapping φ_0^{-1}. To evaluate the constant k, it is sufficient to find one particular function so that both sides of the above equality can be evaluated. Consider

f_0(X) = |X'X|^{n/2} (2π)^{-np/2} exp[-½ tr(X'X)].

Clearly,

∫ f_0(X) dX / (det(X'X))^{n/2} = (2π)^{-np/2} ∫ exp[-½ tr(X'X)] dX = 1
so

k = ∫∫ f_0(ψU) μ_0(dψ) ν_r(dU)
  = (2π)^{-np/2} ∫ |U'U|^{n/2} exp[-½ tr(U'U)] ν_r(dU)
  = (2π)^{-np/2} ∫ ∏_{i=1}^p u_ii^{n-i} exp[-½ tr(U'U)] dU
  = (2π)^{-np/2} 2^{-p} c(n, p).

The last equality follows from the result in Example 5.1, where c(n, p) is defined. Therefore,

(6.1)  ∫ f(X) dX / (det(X'X))^{n/2} = (2π)^{-np/2} 2^{-p} c(n, p) ∫∫ f(ψU) μ_0(dψ) ν_r(dU).
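The evaluation of k factors into one-dimensional integrals: p(p-1)/2 Gaussian integrals for the off-diagonal entries of U and p moment integrals ∫_0^∞ u^{n-i} e^{-u²/2} du for the diagonal. Assuming the closed form c(n, p) = 2^{np/2} π^{p(p-1)/4} ∏_{i=1}^p Γ((n-i+1)/2) for the constant of Example 5.1 (our reading, not restated here), the identity k = (2π)^{-np/2} 2^{-p} c(n, p) can be checked numerically:

```python
import math

def quad_0_inf(f, upper=40.0, steps=100000):
    # simple midpoint quadrature, adequate for these rapidly decaying integrands
    h = upper / steps
    return sum(f((j + 0.5) * h) for j in range(steps)) * h

n, p = 5, 3

# k = (2 pi)^{-np/2} * (2 pi)^{p(p-1)/4} * prod_i int_0^inf u^{n-i} e^{-u^2/2} du
k = (2 * math.pi) ** (-n * p / 2) * (2 * math.pi) ** (p * (p - 1) / 4)
for i in range(1, p + 1):
    k *= quad_0_inf(lambda u, m=n - i: u ** m * math.exp(-u * u / 2))

# assumed closed form for c(n, p)
c_np = 2 ** (n * p / 2) * math.pi ** (p * (p - 1) / 4)
for i in range(1, p + 1):
    c_np *= math.gamma((n - i + 1) / 2)

rhs = (2 * math.pi) ** (-n * p / 2) * 2 ** (-p) * c_np
assert abs(k - rhs) / rhs < 1e-5
```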
It is now an easy matter to derive all the relatively invariant integrals on 𝒦(𝒳). Let χ_c be a given continuous homomorphism on G_T. For each X ∈ 𝒳, let U(X) be the unique element in G_U such that X = ψU(X) for some ψ ∈ 𝔉_{p,n} (see Proposition 5.2). It is clear that U(ΓXT') = U(X)T' for Γ ∈ O_n and T ∈ G_T. We have shown that

J_2(f) = ∫∫ f(ψ, U) χ_c(U') μ_0(dψ) ν_r(dU)

is relatively invariant with multiplier χ_c on 𝒦(𝔉_{p,n} × G_U). For h ∈ 𝒦(𝒳), define an integral J_4 by

J_4(h) = ∫∫ h(ψU) χ_c(U') μ_0(dψ) ν_r(dU).

Clearly, J_4 is relatively invariant with multiplier χ_c since J_4(h) = J_2(h̃) where h̃(ψ, U) = h(ψU). Now, we move J_4 over to 𝒳 by (6.1). In (6.1), take f(X) = h(X)χ_c(U'(X)) so f(ψU) = h(ψU)χ_c(U'). Thus the integral

J_5(h) = ∫ h(X)χ_c(U'(X)) dX / (det(X'X))^{n/2}
is relatively invariant with multiplier χ_c. Of course, any relatively invariant integral with multiplier χ_c on 𝒦(𝒳) is equal to a positive constant times J_5.
6.4. TRANSFORMATIONS AND FACTORIZATIONS OF MEASURES
The results of Example 6.18 describe how an invariant measure on the set of n × p matrices is transformed into an invariant measure on 𝔉_{p,n} × G_U under a particular mapping. The first problem to be discussed in this section is an abstraction of this situation. The notion of a group homomorphism plays a role in what follows.

Definition 6.13. Let G and H be groups. A function η from G onto H is a homomorphism if:

(i) η(g_1 g_2) = η(g_1)η(g_2), g_1, g_2 ∈ G.
(ii) η(g^{-1}) = (η(g))^{-1}, g ∈ G.

When there is a homomorphism from G onto H, H is called a homomorphic image of G.

For notational convenience, a homomorphic image of G is often denoted by Ḡ and the value of the homomorphism at g is ḡ. In this case, the image of g_1 g_2 is ḡ_1 ḡ_2 and the image of g^{-1} is (ḡ)^{-1}. Also, if e is the identity in G, then ē is the identity in Ḡ.

Suppose 𝒳 and 𝒴 are locally compact spaces, and G and Ḡ are locally compact topological groups that act topologically on 𝒳 and 𝒴, respectively. It is assumed that Ḡ is a homomorphic image of G.

Definition 6.14. A measurable function φ from 𝒳 onto 𝒴 is called equivariant if φ(gx) = ḡφ(x) for all g ∈ G and x ∈ 𝒳.

Now, consider an integral

J(f) = ∫ f(x) μ(dx), f ∈ 𝒦(𝒳),

which is invariant under the action of G on 𝒳, that is,

J(gf) = ∫ f(g^{-1}x) μ(dx) = ∫ f(x) μ(dx) = J(f)
for g ∈ G and f ∈ 𝒦(𝒳). Given an equivariant function φ from 𝒳 onto 𝒴, there is a natural measure ν induced on 𝒴: if B is a measurable subset of 𝒴, ν(B) = μ(φ^{-1}(B)). The result below shows that, under a regularity condition on φ, the measure ν defines an invariant (under Ḡ) integral on 𝒦(𝒴).
Proposition 6.9. If φ is an equivariant function from 𝒳 onto 𝒴 that satisfies μ(φ^{-1}(K)) < +∞ for all compact sets K ⊆ 𝒴, then the integral

J_1(f) = ∫ f(y) ν(dy), f ∈ 𝒦(𝒴),

is invariant under Ḡ.

Proof. First note that J_1 is well defined and finite since μ(φ^{-1}(K)) < +∞ for all compact sets K ⊆ 𝒴. From the definition of the measure ν, it follows immediately that

J_1(f) = ∫ f(y) ν(dy) = ∫ f(φ(x)) μ(dx), f ∈ 𝒦(𝒴).

Using the equivariance of φ and the invariance of μ, we have

J_1(ḡf) = ∫ f(ḡ^{-1}y) ν(dy) = ∫ f(ḡ^{-1}φ(x)) μ(dx)
        = ∫ f(φ(g^{-1}x)) μ(dx) = ∫ f(φ(x)) μ(dx) = J_1(f)

so J_1 is invariant under Ḡ. □
Before presenting some applications of Proposition 6.9, a few remarks are in order. The groups G and Ḡ are not assumed to act transitively on 𝒳 and 𝒴, respectively. However, if Ḡ does act transitively on 𝒴 and if 𝒴 is a left homogeneous space, then the measure ν is uniquely determined up to a positive constant. Thus if we happen to know an invariant measure on 𝒴, the identity

∫ f(y) ν(dy) = ∫ f(φ(x)) μ(dx), f ∈ 𝒦(𝒴),

relates the G-invariant measure μ to the Ḡ-invariant measure ν. It was this
line of reasoning that led to (6.1) in Example 6.18. We now consider some
further examples.
* Example 6.19. As in Example 6.18, let 𝒳 be the set of all n × p matrices of rank p, and let 𝒴 be the space S_p^+ of p × p positive definite matrices. Consider the map φ on 𝒳 to S_p^+ defined by

φ(X) = X'X, X ∈ 𝒳.

The group O_n × Gl_p acts on 𝒳 by

(Γ, A)X = (Γ ⊗ A)X = ΓXA'

and the measure

μ(dX) = dX / |X'X|^{n/2}

is invariant under O_n × Gl_p. Further,

φ((Γ, A)X) = AX'XA' = Aφ(X)A',

and this defines an action of Gl_p on S_p^+. It is routine to check that the mapping

(Γ, A) → A

is a homomorphism of O_n × Gl_p onto Gl_p. Obviously,

φ((Γ, A)X) = Aφ(X)A'

since the action of Gl_p on S_p^+ is

A(S) = ASA'; S ∈ S_p^+, A ∈ Gl_p.

Since Gl_p acts transitively on S_p^+, the invariant measure

ν(dS) = dS / |S|^{(p+1)/2}

is unique up to a positive constant. The remaining assumption to verify in order to apply Proposition 6.9 is that φ^{-1}(K) has finite μ measure for compact sets K ⊆ S_p^+. To do this, we show that
φ^{-1}(K) is compact in 𝒳. Recall that the mapping h on 𝔉_{p,n} × S_p^+ onto 𝒳 given by

h(ψ, S) = ψS ∈ 𝒳

is one-to-one and is obviously continuous. Given the compact set K ⊆ S_p^+, let

K_1 = {S | S ∈ S_p^+, S² ∈ K}.

Then K_1 is compact, so 𝔉_{p,n} × K_1 is a compact subset of 𝔉_{p,n} × S_p^+. It is now routine to show that

φ^{-1}(K) = {X | X'X ∈ K} = h(𝔉_{p,n} × K_1),

which is compact since h is continuous and the continuous image of a compact set is compact. By Proposition 6.9, we conclude that the measure ν_0 = μ ∘ φ^{-1} is invariant under Gl_p and satisfies

∫ f(X'X) dX / |X'X|^{n/2} = ∫ f(S) ν_0(dS)

for all f ∈ 𝒦(S_p^+). Since ν_0 is invariant under Gl_p, ν_0 = cν, where c is a positive constant. Thus we have the identity

(6.2)  ∫ f(X'X) dX / |X'X|^{n/2} = c ∫ f(S) dS / |S|^{(p+1)/2}.

To find the constant c, it is sufficient to evaluate both sides of (6.2) for a particular function f_0. For f_0, take the function

f_0(S) = (√(2π))^{-np} |S|^{n/2} exp[-½ tr S],

so

f_0(X'X) = (√(2π))^{-np} |X'X|^{n/2} exp[-½ tr(X'X)].

Clearly, the left-hand side of (6.2) integrates to one and this yields the equation

c(√(2π))^{-np} ∫ |S|^{(n-p-1)/2} exp[-½ tr S] dS = 1.
The result of Example 5.1 gives

c(√(2π))^{-np} c(n, p) = 1

so

c = (√(2π))^{np} / c(n, p) = (√(2π))^{np} ω(n, p),

where ω(n, p) = 1/c(n, p). In conclusion, the identity

(6.3)  ∫ f(X'X) dX / |X'X|^{n/2} = (√(2π))^{np} ω(n, p) ∫ f(S) dS / |S|^{(p+1)/2}

has been established for all f ∈ 𝒦(S_p^+), and thus for all measurable f for which either side exists.
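For p = 1, identity (6.3) can be verified directly in polar coordinates, again assuming c(n, 1) = 2^{n/2} Γ(n/2) (our reading of Example 5.1): both sides reduce to (π^{n/2}/Γ(n/2)) ∫_0^∞ f(s)/s ds. A numerical sketch:

```python
import math

n = 5  # p = 1: the space is R^n - {0} and S = x'x is a positive scalar

def quad(f, lo, hi, steps=100000):
    # midpoint quadrature (never evaluates at the endpoints)
    h = (hi - lo) / steps
    return sum(f(lo + (j + 0.5) * h) for j in range(steps)) * h

f = lambda s: s ** 2 * math.exp(-s)
I = quad(lambda s: f(s) / s, 0.0, 60.0)

# LHS of (6.3) in polar coordinates: area(S^{n-1}) * (1/2) * int f(s)/s ds
area = 2 * math.pi ** (n / 2) / math.gamma(n / 2)
lhs = area * 0.5 * I

# RHS of (6.3): (sqrt(2 pi))^n * omega(n, 1) * int f(s)/s ds,
# with omega(n, 1) = 1/c(n, 1) and c(n, 1) = 2^{n/2} Gamma(n/2) assumed
c_n1 = 2 ** (n / 2) * math.gamma(n / 2)
rhs = (2 * math.pi) ** (n / 2) / c_n1 * I

assert math.isclose(lhs, rhs, rel_tol=1e-9)
```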
* Example 6.20. Again let 𝒳 be the set of n × p matrices of rank p, so the group O_n × G_T acts on 𝒳 by

(Γ, T)X = (Γ ⊗ T)X = ΓXT'.

Each element X ∈ 𝒳 has a unique representation X = ψU where ψ ∈ 𝔉_{p,n} and U ∈ G_U. Define φ on 𝒳 onto G_U by letting φ(X) be the unique element U ∈ G_U such that X = ψU for some ψ ∈ 𝔉_{p,n}. If φ(X) = U, then φ((Γ, T)X) = UT', since when X = ψU, (Γ, T)X = ΓψUT'; this shows that UT' is the unique element in G_U such that (Γ, T)X = (Γψ)UT' with Γψ ∈ 𝔉_{p,n}. The mapping (Γ, T) → T is clearly a homomorphism of O_n × G_T onto G_T, and the action of G_T on G_U is

T(U) = UT'; U ∈ G_U, T ∈ G_T.

Therefore, φ((Γ, T)X) = T(φ(X)), so φ is equivariant. The measure

μ(dX) = dX / |X'X|^{n/2}

is O_n × G_T invariant. To show that φ^{-1}(K) has finite μ measure when K ⊆ G_U is compact, note that h(ψ, U) = ψU is a continuous function on 𝔉_{p,n} × G_U onto 𝒳. It is easily verified that

φ^{-1}(K) = h(𝔉_{p,n} × K).
But 𝔉_{p,n} × K is compact, which shows that φ^{-1}(K) is compact since h is continuous. Thus μ(φ^{-1}(K)) < +∞. Proposition 6.9 shows that ν = μ ∘ φ^{-1} is a G_T-invariant measure on G_U and we have the identity

∫ f(φ(X)) dX / |X'X|^{n/2} = ∫ f(U) ν(dU)

for all f ∈ 𝒦(G_U). However, the measure

ν_1(dU) = dU / ∏_{i=1}^p u_ii^i

is a right invariant measure on G_U, and therefore ν_1 is invariant under the transitive action of G_T on G_U. The uniqueness of invariant measures implies that ν = cν_1 for some positive constant c and

∫ f(φ(X)) dX / |X'X|^{n/2} = c ∫ f(U) dU / ∏_{i=1}^p u_ii^i.

The constant c is evaluated by choosing f to be

f(U) = (√(2π))^{-np} |U'U|^{n/2} exp[-½ tr(U'U)].

Since (φ(X))'φ(X) = X'X,

f(φ(X)) = (√(2π))^{-np} |X'X|^{n/2} exp[-½ tr(X'X)]

and

∫ f(φ(X)) dX / |X'X|^{n/2} = 1.

Therefore,

1 = c(√(2π))^{-np} ∫ |U'U|^{n/2} exp[-½ tr(U'U)] dU / ∏_{i=1}^p u_ii^i
  = c(√(2π))^{-np} ∫ ∏_{i=1}^p u_ii^{n-i} exp[-½ tr(U'U)] dU
  = c(√(2π))^{-np} 2^{-p} c(n, p)
where c(n, p) is defined in Example 5.1. This yields the identity

∫ f(φ(X)) dX / |X'X|^{n/2} = 2^p (√(2π))^{np} ω(n, p) ∫ f(U) dU / ∏_{i=1}^p u_ii^i

for all f ∈ 𝒦(G_U). In particular, when f(U) = f_1(U'U), we have

(6.4)  ∫ f_1(X'X) dX / |X'X|^{n/2} = 2^p (√(2π))^{np} ω(n, p) ∫ f_1(U'U) dU / ∏_{i=1}^p u_ii^i

whenever either integral exists. Combining this with (6.3) yields the identity

(6.5)  ∫ f(S) dS / |S|^{(p+1)/2} = 2^p ∫ f(U'U) dU / ∏_{i=1}^p u_ii^i

for all measurable f for which either integral exists. Setting T = U' in (6.5) yields the assertion of Proposition 5.18.
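Identity (6.5) can be checked for f(S) = |S|^a exp[-tr S]: the left side is then the multivariate gamma integral π^{p(p-1)/4} ∏_{i=1}^p Γ(a - (i-1)/2) (a standard closed form, valid for a > (p-1)/2), while the right side factors into one-dimensional integrals over the entries of U. A numerical sketch:

```python
import math

p, a = 3, 4.0   # need a > (p - 1)/2 for convergence

def quad(f, lo, hi, steps=100000):
    # midpoint quadrature for the one-dimensional factors
    h = (hi - lo) / steps
    return sum(f(lo + (j + 0.5) * h) for j in range(steps)) * h

# LHS of (6.5) with f(S) = |S|^a exp(-tr S):
# int_{S>0} |S|^{a-(p+1)/2} exp(-tr S) dS = pi^{p(p-1)/4} prod_i Gamma(a-(i-1)/2)
lhs = math.pi ** (p * (p - 1) / 4)
for i in range(1, p + 1):
    lhs *= math.gamma(a - (i - 1) / 2)

# RHS of (6.5): 2^p int f(U'U) dU / prod u_ii^i; off-diagonal entries give
# Gaussian factors sqrt(pi), the diagonal entries give moment integrals.
rhs = 2 ** p * math.pi ** (p * (p - 1) / 4)
for i in range(1, p + 1):
    rhs *= quad(lambda u, m=2 * a - i: u ** m * math.exp(-u * u), 0.0, 30.0)

assert math.isclose(lhs, rhs, rel_tol=1e-4)
```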
The final topic in this chapter has to do with the factorization of a Radon measure on a product space. Suppose 𝒳 and 𝒴 are locally compact and σ-compact Hausdorff spaces and assume that G is a locally compact topological group that acts on 𝒳 in such a way that 𝒳 is a homogeneous space. It is also assumed that μ_1 is a G-invariant Radon measure on 𝒳, so that the integral

J_1(f_1) = ∫ f_1(x) μ_1(dx), f_1 ∈ 𝒦(𝒳),

is G-invariant, and is unique up to a positive constant.
Proposition 6.10. Assume the conditions above on %, '4, G, and J,. Define
G acting on the locally compact and a-compact space 9 x 64 by g(x, y) =
(gx, y). If m is a G-invariant Radon measure on 9X x 'J, then m = yI x v
for some Radon measure v on J.
Proof. By assumption, the integral

    J(f) = ∫∫ f(x, y) m(dx, dy),  f ∈ K(𝒳 × 𝒴),
satisfies

    J(gf) = ∫∫ f(g^{-1}x, y) m(dx, dy) = J(f).

For f₂ ∈ K(𝒴) and f₁ ∈ K(𝒳), the product f₁f₂, defined by (f₁f₂)(x, y) = f₁(x)f₂(y), is in K(𝒳 × 𝒴) and

    J(f₁f₂) = ∫∫ f₁(x)f₂(y) m(dx, dy).

Fix f₂ ∈ K(𝒴) such that f₂ ≥ 0 and let

    H(f₁) = ∫∫ f₁(x)f₂(y) m(dx, dy),  f₁ ∈ K(𝒳).

Since J(gf) = J(f), it follows that

    H(gf₁) = H(f₁) for g ∈ G and f₁ ∈ K(𝒳).

Therefore, H is a G-invariant integral on K(𝒳). Hence there exists a non-negative constant c(f₂) depending on f₂ such that

    H(f₁) = c(f₂)J₁(f₁)

and c(f₂) = 0 iff H(f₁) = 0 for all f₁ ∈ K(𝒳). For an arbitrary f₂ ∈ K(𝒴), write f₂ = f₂⁺ − f₂⁻ where f₂⁺ = max(f₂, 0) and f₂⁻ = max(−f₂, 0) are in K(𝒴). For such an f₂, it is easy to show

    J(f₁f₂) = c(f₂⁺)J₁(f₁) − c(f₂⁻)J₁(f₁) = (c(f₂⁺) − c(f₂⁻))J₁(f₁).

Thus defining c on K(𝒴) by c(f₂) = c(f₂⁺) − c(f₂⁻), it is easy to show that c is an integral on K(𝒴). Hence

    c(f₂) = ∫ f₂(y) ν(dy)

for some Radon measure ν. Therefore,

    ∫∫ f₁(x)f₂(y) m(dx, dy) = ∫∫ f₁(x)f₂(y) μ₁(dx) ν(dy).

A standard approximation argument now implies that m is the product measure μ₁ × ν. □
Proposition 6.10 provides one technique for establishing the stochastic independence of two random vectors. This technique is used in the next chapter. The one application of Proposition 6.10 given here concerns the space of positive definite matrices.
* Example 6.21. Let 𝒵 be the set of all p × p positive definite matrices that have distinct eigenvalues. That 𝒵 is an open subset of S_p^+ follows from the fact that the eigenvalues of S ∈ S_p^+ are continuous functions of the elements of the matrix S. Thus 𝒵 has nonzero Lebesgue measure in S_p^+. Also, let 𝒴 be the set of p × p diagonal matrices Y with diagonal elements y₁, ..., y_p that satisfy y₁ > y₂ > ... > y_p. Further, let 𝒳 be the quotient space O_p/D_p, where D_p is the group of sign changes introduced in Example 6.6. We now construct a natural one-to-one onto map from 𝒳 × 𝒴 to 𝒵.

For X ∈ 𝒳, X = ΓD_p for some Γ ∈ O_p. Define φ by

    φ(X, Y) = ΓYΓ',  X = ΓD_p, Y ∈ 𝒴.

To verify that φ is well defined, suppose that X = Γ₁D_p = Γ₂D_p. Then

    φ(X, Y) = Γ₁YΓ₁' = Γ₂(Γ₂'Γ₁)Y(Γ₂'Γ₁)'Γ₂' = Γ₂YΓ₂'

since Γ₂'Γ₁ ∈ D_p and every element D ∈ D_p satisfies DYD' = Y for all Y ∈ 𝒴. It is clear that φ(X, Y) has ordered eigenvalues y₁ > y₂ > ... > y_p > 0, the diagonal elements of Y. Clearly, the function φ is onto and continuous. To show φ is one-to-one, first note that, if Y is any element of 𝒴, then the equation

    ΓYΓ' = Y,  Γ ∈ O_p,

implies that Γ ∈ D_p (ΓYΓ' = Y implies that ΓY = YΓ, and equating the elements of these two matrices shows that Γ must be diagonal, so Γ ∈ D_p). If

    φ(X₁, Y₁) = φ(X₂, Y₂),

then Y₁ = Y₂ by the uniqueness of eigenvalues and the ordering of the diagonal elements of Y ∈ 𝒴. Thus

    Γ₁Y₁Γ₁' = Γ₂Y₁Γ₂'
when

    φ(X₁, Y₁) = φ(X₂, Y₁).

Therefore,

    (Γ₂'Γ₁)Y₁(Γ₂'Γ₁)' = Y₁,

which implies that Γ₂'Γ₁ ∈ D_p. Since Xᵢ = ΓᵢD_p for i = 1, 2, this shows that X₁ = X₂ and that φ is one-to-one. Therefore, φ has an inverse and the spectral theorem for matrices specifies just what φ^{-1} is. Namely, for Z ∈ 𝒵, let y₁ > ... > y_p > 0 be the ordered eigenvalues of Z and write Z as

    Z = ΓYΓ',  Γ ∈ O_p,

where Y ∈ 𝒴 has diagonal elements y₁ > ... > y_p > 0. The problem is that Γ ∈ O_p is not unique since

    ΓYΓ' = (ΓD)Y(ΓD)' for D ∈ D_p.

To obtain uniqueness, we have simply "quotiented out" the subgroup D_p in order that φ^{-1} be well defined. Now, let

    μ(dZ) = dZ

be Lebesgue measure on 𝒵, and consider ν = μ ∘ φ, the induced measure on 𝒳 × 𝒴. The problem is to obtain some information about the measure ν. Since φ is continuous, ν is a Radon measure on 𝒳 × 𝒴, and ν satisfies

    ∫ f(X, Y) ν(dX, dY) = ∫ f(φ^{-1}(Z)) dZ

for f ∈ K(𝒳 × 𝒴). The claim is that the measure ν is invariant under the action of O_p on 𝒳 × 𝒴 defined by

    Γ(X, Y) = (ΓX, Y).

To see this, we have

    ∫ f(Γ(X, Y)) ν(dX, dY) = ∫ f(Γφ^{-1}(Z)) dZ.
But a bit of reflection shows that Γφ^{-1}(Z) = φ^{-1}(ΓZΓ'). Since the Jacobian of the transformation Z → ΓZΓ' is equal to one, it follows that ν is O_p-invariant. By Proposition 6.10, the measure ν is a product measure ν₁ × ν₂ where ν₁ is an O_p-invariant measure on 𝒳. Since O_p is compact and 𝒳 is compact, the measure ν₁ is finite and we take ν₁(𝒳) = 1 as a normalization. Therefore,

    ∫ f(φ^{-1}(Z)) dZ = ∫∫ f(X, Y) ν₁(dX) ν₂(dY)

for all f ∈ K(𝒳 × 𝒴). Setting h = f ∘ φ^{-1} yields

    ∫ h(Z) dZ = ∫∫ h(φ(X, Y)) ν₁(dX) ν₂(dY)

for h ∈ K(𝒵). In particular, if h ∈ K(𝒵) satisfies h(Z) = h(ΓZΓ') for all Γ ∈ O_p and Z ∈ 𝒵, then h(φ(X, Y)) = h(Y) and we have the identity

    ∫ h(Z) dZ = ∫ h(Y) ν₂(dY).

It is quite difficult to give a rigorous derivation of the measure ν₂ without the theory of differential forms. In fact, it is not obvious that ν₂ is absolutely continuous with respect to Lebesgue measure on 𝒴. The subject of this example is considered again in later chapters.
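The inverse map φ^{-1} of this example is just the spectral decomposition with decreasingly ordered eigenvalues, the eigenvector matrix being determined only up to a sign change of each column. A minimal numerical sketch in Python with NumPy (the sign convention used below to pick a representative of the coset in O_p/D_p is our own choice, not the text's):

```python
import numpy as np

def spectral_rep(Z):
    """Return (Gamma, Y) with Z = Gamma @ Y @ Gamma.T, the eigenvalues of Y
    in decreasing order, and a sign convention that fixes one representative
    of the coset Gamma D_p (valid when the eigenvalues are distinct)."""
    vals, vecs = np.linalg.eigh(Z)        # eigh returns increasing eigenvalues
    order = np.argsort(vals)[::-1]        # reorder to decreasing
    Y = np.diag(vals[order])
    Gamma = vecs[:, order]
    # make the largest-magnitude entry of each eigenvector positive,
    # which resolves the D_p sign ambiguity
    for j in range(Gamma.shape[1]):
        i = np.argmax(np.abs(Gamma[:, j]))
        if Gamma[i, j] < 0:
            Gamma[:, j] = -Gamma[:, j]
    return Gamma, Y

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Z = A @ A.T + 1e-3 * np.eye(4)            # positive definite, distinct eigenvalues a.s.
Gamma, Y = spectral_rep(Z)
```

Flipping the sign of any eigenvector column leaves ΓYΓ' unchanged, which is exactly why the example works with the quotient O_p/D_p rather than O_p itself.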
PROBLEMS
1. Let M be a proper subspace of V and set

    G(M) = {g | g ∈ Gl(V), g(M) = M}

where g(M) = {x | x = gv for some v ∈ M}.

(i) Show that g(M) = M iff g(M) ⊆ M for g ∈ Gl(V), and show that G(M) is a group.

Now, assume V = R^p and, for x ∈ R^p, write x = (y, z) with y ∈ R^q and z ∈ R^r, q + r = p. Let M = {x | x = (y, 0), y ∈ R^q}.
(ii) For g ∈ Gl_p, partition g as

    g = [g₁₁ g₁₂; g₂₁ g₂₂],  g₁₁: q × q, g₂₂: r × r.

Show that g ∈ G(M) iff g₁₁ ∈ Gl_q, g₂₂ ∈ Gl_r, and g₂₁ = 0. For such g, show that

    g^{-1} = [g₁₁^{-1}  −g₁₁^{-1}g₁₂g₂₂^{-1}; 0  g₂₂^{-1}].

(iii) Verify that G₁ = {g ∈ G(M) | g₁₁ = I_q, g₁₂ = 0} and G₂ = {g ∈ G(M) | g₂₂ = I_r} are subgroups of G(M) and that G₂ is a normal subgroup of G(M).

(iv) Show that G₁ ∩ G₂ = {I} and show that each g can be written uniquely as g = hk with h ∈ G₁ and k ∈ G₂. Conclude that, if gᵢ = hᵢkᵢ, i = 1, 2, then g₁g₂ = h₃k₃, where h₃ = h₁h₂ and k₃ = h₂^{-1}k₁h₂k₂, is the unique representation of g₁g₂ with h₃ ∈ G₁ and k₃ ∈ G₂.
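A numerical check of parts (ii) and (iv) of Problem 1 (Python/NumPy sketch; the sizes q = 2, r = 3 are an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
q, r = 2, 3
p = q + r

def block_upper(g11, g12, g22):
    """Assemble an element of G(M): block upper triangular, g21 = 0."""
    g = np.zeros((p, p))
    g[:q, :q] = g11
    g[:q, q:] = g12
    g[q:, q:] = g22
    return g

# small perturbations of the identity are certainly invertible
g11 = np.eye(q) + 0.1 * rng.standard_normal((q, q))
g12 = rng.standard_normal((q, r))
g22 = np.eye(r) + 0.1 * rng.standard_normal((r, r))
g = block_upper(g11, g12, g22)

# closed-form inverse from part (ii)
g_inv = block_upper(np.linalg.inv(g11),
                    -np.linalg.inv(g11) @ g12 @ np.linalg.inv(g22),
                    np.linalg.inv(g22))

# unique factorization g = h k from part (iv):
# h in G1 (g11 = I_q, g12 = 0), k in G2 (g22 = I_r)
h = block_upper(np.eye(q), np.zeros((q, r)), g22)
k = block_upper(g11, g12, np.eye(r))
```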
2. Let G(M) be as in Problem 1. Does G(M) act transitively on V − {0}? Does G(M) act transitively on V ∩ M^c where M^c is the complement of the set M in V?
3. Show that O_n is a compact subset of R^m with m = n². Show that O_n is a topological group when O_n has the topology inherited from R^m. If χ is a continuous homomorphism from O_n to the multiplicative group (0, ∞), show that χ(Γ) = 1 for all Γ ∈ O_n.
4. Suppose χ is a continuous homomorphism from (0, ∞) to (0, ∞). Show that χ(x) = x^α for some real number α.
5. Show that O_n is a compact subgroup of Gl_n and show that G_U^+ (the group of n × n upper triangular matrices with positive diagonal elements) is a closed subgroup of Gl_n. Show that the uniqueness of the representation A = ΓU (A ∈ Gl_n, Γ ∈ O_n, U ∈ G_U^+) is equivalent to O_n ∩ G_U^+ = {I_n}. Show that neither O_n nor G_U^+ is a normal subgroup of Gl_n.
6. Let (V, (·, ·)) be an inner product space.
(i) For fixed v ∈ V, show that χ defined by χ(x) = exp[(v, x)] is a continuous homomorphism from V to (0, ∞). Here V is a group under addition.
(ii) If χ is a continuous homomorphism on V, show that x ↦ log χ(x) is a linear function on V. Conclude that χ(x) = exp[(v, x)] for some v ∈ V.
7. Suppose χ is a continuous homomorphism defined on Gl_n to (0, ∞). Using the steps outlined below, show that χ(A) = |det A|^α for some real α.
(i) First show that χ(Γ) = 1 for Γ ∈ O_n.
(ii) Write A = ΓDΔ with Γ, Δ ∈ O_n and D diagonal with positive diagonal elements λ₁, ..., λ_n. Show that χ(A) = χ(D).
(iii) Next, write D = ∏ Dᵢ(λᵢ) where Dᵢ(c) is diagonal with all diagonal elements equal to one except the ith diagonal element, which is c. Conclude that χ(D) = ∏ χ(Dᵢ(λᵢ)).
(iv) Show that Dᵢ(c) = PD₁(c)P' for some permutation matrix P ∈ O_n. Using this, show that χ(D) = χ(D₁(λ)) where λ = ∏ λᵢ.
(v) For λ ∈ (0, ∞), set ξ(λ) = χ(D₁(λ)) and show that ξ is a continuous homomorphism from (0, ∞) to (0, ∞), so ξ(λ) = λ^β for some real β. Now, complete the proof of χ(A) = |det A|^α.
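Step (ii) of Problem 7 can be illustrated numerically with the singular value decomposition, and the homomorphism property of χ(A) = |det A|^α checked directly (Python/NumPy sketch; the exponent α = 0.7 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Step (ii): A = Gamma D Delta with Gamma, Delta in O_n and
# D diagonal with positive diagonal elements (the singular values)
Gamma, lam, Delta = np.linalg.svd(A)      # A = Gamma @ diag(lam) @ Delta
D = np.diag(lam)

# chi(A) = |det A|^alpha is a continuous homomorphism into (0, inf)
alpha = 0.7
chi = lambda M: abs(np.linalg.det(M)) ** alpha
```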
8. Let 𝒳 be the set of all rank r orthogonal projections on Rⁿ to Rⁿ (1 ≤ r ≤ n − 1).
(i) Show that O_n acts transitively on 𝒳 via the action x → ΓxΓ', Γ ∈ O_n. For

    x₀ = [I_r 0; 0 0] ∈ 𝒳,

what is the isotropy subgroup? Show that the representation of x in this case is x = ψψ' where ψ: n × r consists of the first r columns of Γ ∈ O_n.
(ii) The group O_r acts on F_{r,n} by ψ → ψA', A ∈ O_r. This induces an equivalence relation on F_{r,n} (ψ₁ ≡ ψ₂ iff ψ₁ = ψ₂A' for some A ∈ O_r) and hence defines a quotient space. Show that the map [ψ] → ψψ' defines a one-to-one onto map from this quotient space to 𝒳. Here [ψ] is the equivalence class of ψ.
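Part (ii) of Problem 8 can be checked numerically: the projection ψψ' does not change when ψ is replaced by ψA' with A ∈ O_r, and the O_n action produces another rank r projection (Python/NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 5, 2

# psi in F_{r,n}: first r columns of a random orthogonal matrix
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
psi = Q[:, :r]
x = psi @ psi.T                        # a rank r orthogonal projection

# replacing psi by psi A' (A in O_r) leaves the projection unchanged
A, _ = np.linalg.qr(rng.standard_normal((r, r)))
psi2 = psi @ A.T
x2 = psi2 @ psi2.T

# the O_n action Gamma x Gamma' yields another element of the orbit
Gamma, _ = np.linalg.qr(rng.standard_normal((n, n)))
y = Gamma @ x @ Gamma.T
```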
9. Following the steps outlined below, show that every continuous homomorphism on G_T^+ to (0, ∞) has the form χ(T) = ∏_{i=1}^p (t_{ii})^{c_i} where T: p × p has diagonal elements t₁₁, ..., t_pp and c₁, ..., c_p are real numbers.
(i) Let

    G₁ = {T | T = [T₁₁ 0; 0 1], T₁₁: (p − 1) × (p − 1)}

and

    G₂ = {T | T = [I_{p−1} 0; T₂₁ t_pp]}.

Show that G₁ and G₂ are subgroups of G_T^+ and that G₂ is normal. Show that every T has a unique representation as T = hk with h ∈ G₁, k ∈ G₂.
(ii) An induction assumption yields χ(h) = ∏_{i=1}^{p−1} (t_{ii})^{c_i}. Also, for T = hk, χ(T) = χ(h)χ(k).
(iii) Show that χ(k) = (t_pp)^{c_p} for some real c_p.
10. Evaluate the integral ∫ |X'X|^γ exp[-½ tr X'X] dX where X ranges over all n × p matrices of rank p. In particular, for what values of γ is this integral finite?
11. In the notation of Problems 1 and 2, find all of the relatively invariant integrals on R^p ∩ M^c under the action of G(M).
12. In Rⁿ, let 𝒳 = {x | x ∈ Rⁿ, x ∉ span{e}} where e ∈ Rⁿ is the vector of ones. Also, let S_{n−1}(e) = {x | ‖x‖ = 1, x ∈ Rⁿ, x'e = 0} and let 𝒴 = R¹ × (0, ∞) × S_{n−1}(e). For x ∈ 𝒳, set x̄ = n^{-1}e'x and set s²(x) = Σᵢ(xᵢ − x̄)². Define a mapping T on 𝒳 to 𝒴 by T(x) = (x̄, s, (x − x̄e)/s).
(i) Show that T is one-to-one, onto, and find T^{-1}. Let O_n(e) = {Γ | Γ ∈ O_n, Γe = e} and consider a group G defined by G = {(a, b, Γ) | a ∈ (0, ∞), b ∈ R¹, Γ ∈ O_n(e)} with group composition given by (a₁, b₁, Γ₁)(a₂, b₂, Γ₂) = (a₁a₂, a₁b₂ + b₁, Γ₁Γ₂). Define G acting on 𝒳 and 𝒴 by (a, b, Γ)x = aΓx + be for x ∈ 𝒳, and (a, b, Γ)(u, v, w) = (au + b, av, Γw) for (u, v, w) ∈ 𝒴.
(ii) Show that T(gx) = gT(x), g ∈ G.
(iii) Show that the measure μ(dx) = dx/sⁿ(x) is an invariant measure on 𝒳.
(iv) Let γ(dw) be the unique O_n(e)-invariant probability measure on S_{n−1}(e). Show that the measure

    ν(d(u, v, w)) = du (dv/v²) γ(dw)

is an invariant measure on 𝒴.
(v) Prove that ∫ f(x) μ(dx) = k ∫ f(T^{-1}(y)) ν(dy) for all integrable f where k is a fixed constant. Find k.
(vi) Suppose a random vector X ∈ 𝒳 has a density (with respect to dx) given by

    f(x) = σ^{-n} f₀((x − δe)/σ),  x ∈ 𝒳,

where δ ∈ R¹ and σ > 0 are parameters. Find the joint density of X̄ and s.
13. Let 𝒳 = Rⁿ − {0} and consider X ∈ 𝒳 with an O_n-invariant distribution. Define φ on 𝒳 to (0, ∞) × F_{1,n} by φ(x) = (‖x‖, x/‖x‖). The group O_n acts on (0, ∞) × F_{1,n} by Γ(u, v) = (u, Γv). Show that φ(Γx) = Γφ(x) and use this to prove that:
(i) ‖X‖ and X/‖X‖ are independent.
(ii) X/‖X‖ has a uniform distribution on F_{1,n}.
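A Monte Carlo sketch of Problem 13 in Python/NumPy (the N(0, I₃) distribution is one convenient O_n-invariant choice, and the checks below are necessary conditions only, not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 3, 20000

# an O_n-invariant distribution on R^n - {0}: the standard normal
X = rng.standard_normal((N, n))
norms = np.linalg.norm(X, axis=1)
dirs = X / norms[:, None]              # samples of X/||X|| on the unit sphere

# necessary conditions for uniformity of the direction:
# each coordinate has mean 0 and mean square 1/n
mean_dir = dirs.mean(axis=0)
mean_sq = (dirs ** 2).mean(axis=0)

# necessary condition for independence: ||X|| is uncorrelated
# with every coordinate of X/||X||
corr = np.array([np.corrcoef(norms, dirs[:, j])[0, 1] for j in range(n)])
```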
14. Let 𝒳 = {x ∈ Rⁿ | xᵢ ≠ xⱼ for all i ≠ j} and let 𝒴 = {y ∈ Rⁿ | y₁ < y₂ < ... < y_n}. Also, let P_n be the group of n × n permutation matrices, so P_n ⊆ O_n, and P_n acts on 𝒳 by x → gx.
(i) Show that the map φ(g, y) = gy is one-to-one and onto from P_n × 𝒴 to 𝒳. Describe φ^{-1}.
(ii) Let X ∈ 𝒳 be a random vector such that L(X) = L(gX) for g ∈ P_n. Write φ^{-1}(X) = (P(X), Y(X)) where P(X) ∈ P_n and Y(X) ∈ 𝒴. Show that P(X) and Y(X) are independent and that P(X) has a uniform distribution on P_n.
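The map φ^{-1} of Problem 14 is computed by sorting: Y(x) is the ordered vector and P(x) is the permutation matrix that undoes the sort. A Python/NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
x = rng.standard_normal(n)            # coordinates are distinct with probability one

# phi^{-1}(x) = (P(x), Y(x)): Y(x) is the increasingly ordered vector,
# and P(x) is the permutation matrix with x = P(x) Y(x)
order = np.argsort(x)                 # x[order] is increasing
Y = x[order]
P = np.zeros((n, n))
P[order, np.arange(n)] = 1.0          # column j of P has its 1 in row order[j]
```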
NOTES AND REFERENCES
1. For an alternative to Nachbin's treatment of invariant integrals, see
Segal and Kunze (1978).
2. Proposition 6.10 is the Radon measure version of a result due to Farrell (see Farrell, 1976). The extension of Proposition 6.10 to relatively invariant integrals that are unique up to constant is immediate; the proof of Proposition 6.10 is valid.
3. For the form of the measure v2 in Example 6.21, see Deemer and Olkin
(1951), Farrell (1976), or Muirhead (1982).
CHAPTER 7
First Applications
of Invariance
We now begin to reap some of the benefits of the labor of Chapter 6. The
one unifying notion throughout this chapter is that of a group of transfor
mations acting on a space. Within this framework independence and
distributional properties of random vectors are discussed and a variety of structural problems are considered. In particular, invariant probability
models are introduced and the invariance of likelihood ratio tests and maximum likelihood estimators is established. Further, maximal invariant statistics are discussed in detail.
7.1. LEFT O_n-INVARIANT DISTRIBUTIONS ON n × p MATRICES
The main concern of this section is conditions under which the two matrices Ψ and U in the decomposition X = ΨU (see Example 6.20) are stochastically independent when X is a random n × p matrix. Before discussing this problem, a useful construction of the uniform distribution on F_{p,n} is presented. Throughout this section, 𝒳 denotes the space of n × p matrices of rank p, so n ≥ p. First, a technical result.
Proposition 7.1. Let X ∈ L_{p,n} have a normal distribution with mean zero and Cov(X) = I_n ⊗ I_p. Then P{X ∈ 𝒳} = 1 and the complement of 𝒳 in L_{p,n} has Lebesgue measure zero.

Proof. Let X₁, ..., X_p denote the p columns of X. Thus X₁, ..., X_p are independent random vectors in Rⁿ and L(Xᵢ) = N(0, I_n), i = 1, ..., p. It is shown that P{X ∈ 𝒳^c} = 0. To say that X ∈ 𝒳^c is to say that, for some
index i,

    Xᵢ ∈ span{Xⱼ | j ≠ i}.

Therefore,

    P{X ∈ 𝒳^c} = P{∪ᵢ [Xᵢ ∈ span{Xⱼ | j ≠ i}]} ≤ Σᵢ₌₁^p P{Xᵢ ∈ span{Xⱼ | j ≠ i}}.

However, Xᵢ is independent of the set of random vectors {Xⱼ | j ≠ i}, and the probability that Xᵢ falls in any fixed subspace M of dimension less than n is zero. Since p ≤ n, the subspace span{Xⱼ | j ≠ i} has dimension less than n. Thus, conditioning on Xⱼ for j ≠ i, we have

    P{Xᵢ ∈ span{Xⱼ | j ≠ i}} = E P{Xᵢ ∈ span{Xⱼ | j ≠ i} | Xⱼ, j ≠ i} = 0.

Hence P{X ∈ 𝒳^c} = 0. Since 𝒳^c has probability zero under the normal distribution on L_{p,n} and since the normal density function with respect to Lebesgue measure is strictly positive on L_{p,n}, it follows that 𝒳^c has Lebesgue measure zero. □
If X ∈ L_{p,n} is a random vector that has a density with respect to Lebesgue measure, the previous result shows that P{X ∈ 𝒳} = 1 since 𝒳^c has Lebesgue measure zero. In particular, if X ∈ L_{p,n} has a normal distribution with a nonsingular covariance, then P{X ∈ 𝒳} = 1, and we often restrict such normal distributions to 𝒳 in order to ensure that X has rank p. For many of the results below, it is assumed that X is a random vector in 𝒳, and in applications X is a random vector in L_{p,n} that has been restricted to 𝒳 after it has been verified that 𝒳^c has probability zero under the distribution of X.
Proposition 7.2. Suppose X ∈ 𝒳 has a normal distribution with L(X) = N(0, I_n ⊗ I_p). Let X₁, ..., X_p be the columns of X and let Ψ ∈ F_{p,n} be the random matrix whose p columns are obtained by applying the Gram-Schmidt orthogonalization procedure to X₁, ..., X_p. Then Ψ has the uniform distribution on F_{p,n}; that is, the distribution of Ψ is the unique probability measure on F_{p,n} that is invariant under the action of O_n on F_{p,n} (see Example 6.16).
Proof. Let Q be the probability distribution of Ψ on F_{p,n}. It must be verified that

    Q(ΓB) = Q(B),  Γ ∈ O_n,

for all Borel sets B of F_{p,n}. If Γ ∈ O_n, it is clear that L(ΓX) = L(X). Also, it is not difficult to verify that Ψ, which we now write as a function of X, say Ψ(X), satisfies

    Ψ(ΓX) = ΓΨ(X),  Γ ∈ O_n.

This follows by looking at the Gram-Schmidt procedure, which defined the columns of Ψ. Thus

    Q(B) = P{Ψ(X) ∈ B} = P{Ψ(ΓX) ∈ B} = P{ΓΨ(X) ∈ B} = P{Ψ(X) ∈ Γ'B} = Q(Γ'B)

for all Γ ∈ O_n. The second equality above follows from the observation that L(X) = L(ΓX). Hence Q is an O_n-invariant probability measure on F_{p,n}, and the uniqueness of such a measure shows that Q is what was called the uniform distribution on F_{p,n}. □
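Proposition 7.2 suggests a standard recipe for simulating the uniform distribution on F_{p,n}: apply Gram-Schmidt to an n × p matrix of independent N(0, 1) entries. A Python/NumPy sketch, including a numerical check of the equivariance Ψ(ΓX) = ΓΨ(X) used in the proof:

```python
import numpy as np

def gram_schmidt(X):
    """Column-wise Gram-Schmidt: returns Psi in F_{p,n} whose first j
    columns span the same subspace as the first j columns of X."""
    n, p = X.shape
    Psi = np.zeros((n, p))
    for j in range(p):
        v = X[:, j] - Psi[:, :j] @ (Psi[:, :j].T @ X[:, j])
        Psi[:, j] = v / np.linalg.norm(v)
    return Psi

rng = np.random.default_rng(6)
n, p = 6, 3
X = rng.standard_normal((n, p))       # L(X) = N(0, I_n ⊗ I_p)
Psi = gram_schmidt(X)                 # uniform on F_{p,n} by Proposition 7.2

# a random orthogonal matrix for the equivariance check
Gamma, _ = np.linalg.qr(rng.standard_normal((n, n)))
```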
Now, consider the two spaces 𝒳 and F_{p,n} × G_U^+. Let φ be the function on 𝒳 to F_{p,n} × G_U^+ that maps X into the unique pair (Ψ, U) such that X = ΨU. Obviously, φ^{-1}(Ψ, U) = ΨU ∈ 𝒳.
Definition 7.1. If X ∈ 𝒳 is a random vector with distribution P, then P is left invariant under O_n if L(X) = L(ΓX) for all Γ ∈ O_n.
The remainder of this section is devoted to a characterization of the O_n-left invariant distributions on 𝒳. It is shown that, if X ∈ 𝒳 has an O_n-left invariant distribution, then for φ(X) = (Ψ, U) ∈ F_{p,n} × G_U^+, Ψ and U are stochastically independent and Ψ has a uniform distribution on F_{p,n}. This assertion and its converse are given in the following proposition.

Proposition 7.3. Suppose X ∈ 𝒳 is a random vector with an O_n-left invariant distribution P and write (Ψ, U) = φ(X). Then Ψ and U are stochastically independent and Ψ has a uniform distribution on F_{p,n}. Conversely, if Ψ ∈ F_{p,n} and U ∈ G_U^+ are independent and if Ψ has a uniform distribution on F_{p,n}, then X = ΨU has an O_n-left invariant distribution on 𝒳.
Proof. The joint distribution Q of (Ψ, U) is determined by

    Q(B₁ × B₂) = P(φ^{-1}(B₁ × B₂))

where B₁ is a Borel subset of F_{p,n} and B₂ is a Borel subset of G_U^+. Also,

    ∫ f(Ψ, U) Q(dΨ, dU) = ∫ f(φ(X)) P(dX)

for any Borel measurable function f that is integrable. The group O_n acts on the left of F_{p,n} × G_U^+ by

    Γ(Ψ, U) = (ΓΨ, U)

and it is clear that

    φ(ΓX) = Γφ(X) for X ∈ 𝒳, Γ ∈ O_n.

We now show that Q is invariant under this group action and apply Proposition 6.10. For Γ ∈ O_n,

    ∫ f(Γ(Ψ, U)) Q(dΨ, dU) = ∫ f(Γφ(X)) P(dX) = ∫ f(φ(ΓX)) P(dX)
      = ∫ f(φ(X)) P(dX) = ∫ f(Ψ, U) Q(dΨ, dU).

Therefore, Q is O_n-invariant and, by Proposition 6.10, Q is a product measure Q₁ × Q₂ where Q₁ is taken to be the uniform distribution on F_{p,n}. That Q₂ is a probability measure is clear since Q is a probability measure. The first assertion has been established. For the converse, let Q₁ and Q₂ be the distributions of Ψ and U, so Q₁ is the uniform distribution on F_{p,n} and Q₁ × Q₂ is the joint distribution of (Ψ, U) in F_{p,n} × G_U^+. The distribution P of X = ΨU = φ^{-1}(Ψ, U) is determined by the equation

    ∫ f(X) P(dX) = ∫∫ f(φ^{-1}(Ψ, U)) Q₁(dΨ) Q₂(dU)

for all integrable f. To show P is O_n-left invariant, it must be verified that

    ∫ f(ΓX) P(dX) = ∫ f(X) P(dX)
for all integrable f and Γ ∈ O_n. But

    ∫ f(ΓX) P(dX) = ∫∫ f(Γφ^{-1}(Ψ, U)) Q₁(dΨ) Q₂(dU)
      = ∫∫ f(φ^{-1}(ΓΨ, U)) Q₁(dΨ) Q₂(dU)
      = ∫∫ f(φ^{-1}(Ψ, U)) Q₁(dΨ) Q₂(dU) = ∫ f(X) P(dX),

where the next to the last equality follows from the O_n-invariance of Q₁. Thus P is O_n-left invariant. □
When p = 1, Proposition 7.3 is interesting. In this case 𝒳 = Rⁿ − {0} and the O_n-left invariant distributions on 𝒳 are exactly the orthogonally invariant distributions on Rⁿ that have no probability at 0 ∈ Rⁿ. If X ∈ Rⁿ − {0} has an orthogonally invariant distribution, then Ψ = X/‖X‖ ∈ F_{1,n} is independent of U = ‖X‖, and Ψ has a uniform distribution on F_{1,n}.
There is an analogue of Proposition 7.3 for the decomposition of X ∈ 𝒳 into (Ψ, A) where Ψ ∈ F_{p,n} and A ∈ S_p^+ (see Proposition 5.5).
Proposition 7.4. Suppose X ∈ 𝒳 is a random vector with an O_n-left invariant distribution and write ψ(X) = (Ψ, A) where Ψ ∈ F_{p,n} and A ∈ S_p^+ are the unique matrices such that X = ΨA. Then Ψ and A are independent and Ψ has a uniform distribution on F_{p,n}. Conversely, if Ψ ∈ F_{p,n} and A ∈ S_p^+ are independent and if Ψ has a uniform distribution on F_{p,n}, then X = ΨA has an O_n-left invariant distribution on 𝒳.

Proof. The proof is essentially the same as that of Proposition 7.3 and is left to the reader. □
Thus far, it has been shown that if X ∈ 𝒳 has an O_n-left invariant distribution, then for X = ΨU, Ψ and U are independent and Ψ has a uniform distribution. However, nothing has been said about the distribution of U ∈ G_U^+. The next result gives the density function of U with respect to the right invariant measure

    ν_r(dU) = dU/∏_{i=1}^p u_{ii}^i

in the case that X has a density of a special form.
Proposition 7.5. Suppose X ∈ 𝒳 has a distribution P given by a density function

    f₀(X'X),  X ∈ 𝒳,

with respect to the measure

    μ(dX) = dX/|X'X|^{n/2}

on 𝒳. Then the density function of U (with respect to ν_r) in the representation X = ΨU is

    g₀(U) = 2^p (√2π)^{np} ω(n, p) f₀(U'U).

Proof. If X ∈ 𝒳, U(X) denotes the unique element of G_U^+ such that X = ΨU(X) for some Ψ ∈ F_{p,n}. To show g₀ is the density function of U, it is sufficient to verify that

    ∫ h(U(X)) f₀(X'X) μ(dX) = ∫ h(U) g₀(U) ν_r(dU)

for all integrable functions h. Since X'X = U'(X)U(X), the results of Example 6.20 show that

    ∫ h(U(X)) f₀(U'(X)U(X)) μ(dX) = c ∫ h(U) f₀(U'U) ν_r(dU),

where c = 2^p (√2π)^{np} ω(n, p). Since g₀(U) = cf₀(U'U), g₀ is the density of U. □
A similar argument gives the density of S = X'X.
Proposition 7.6. Suppose X ∈ 𝒳 has distribution P given by a density function

    f₀(X'X),  X ∈ 𝒳,

with respect to the measure μ. Then the density of S = X'X is

    g₀(S) = (√2π)^{np} ω(n, p) f₀(S)
with respect to the measure

    ν(dS) = dS/|S|^{(p+1)/2}.

Proof. With the notation S(X) = X'X, it is sufficient to verify that

    ∫ h(S(X)) f₀(X'X) μ(dX) = ∫ h(S) g₀(S) ν(dS)

for all integrable functions h. Combining the identities (6.4) and (6.5), we have

    ∫ h(S(X)) f₀(X'X) μ(dX) = c ∫ h(S) f₀(S) ν(dS)

where c = (√2π)^{np} ω(n, p). Since g₀ = cf₀, the proof is complete. □
When X ∈ 𝒳 has the density assumed in Propositions 7.5 and 7.6, it is clear that the distribution of X is O_n-left invariant. In this case, for X = ΨU, Ψ and U are independent, Ψ has a uniform distribution on F_{p,n}, and U has the density given in Proposition 7.5. Thus the joint distribution of Ψ and U has been completely described. Similar remarks apply to the situation treated in Proposition 7.6. The reader has probably noticed that the distribution of S = X'X was derived rather than the distribution of A in the representation X = ΨA for Ψ ∈ F_{p,n} and A ∈ S_p^+. Of course, S = A², so A is the unique positive definite square root of S. The reason for giving the distribution of S rather than that of A is quite simple: the distribution of A is substantially more complicated than that of S and harder to derive.

In the following example, we derive the distributions of U and S when X ∈ 𝒳 has a nonsingular O_n-left invariant normal distribution.
* Example 7.1. Suppose X ∈ 𝒳 has a normal distribution with a nonsingular covariance and also assume that L(X) = L(ΓX) for all Γ ∈ O_n. Thus EX = Γ EX for all Γ ∈ O_n, which implies that EX = 0. Also, Cov(X) must satisfy Cov((Γ ⊗ I_p)X) = Cov(X) since L(X) = L((Γ ⊗ I_p)X). From Proposition 2.19, this implies that

    Cov(X) = I_n ⊗ Σ

for some positive definite Σ, as Cov(X) is assumed to be nonsingular. In summary, if X has a normal distribution on 𝒳 that is O_n-left
invariant, then

    L(X) = N(0, I_n ⊗ Σ).

Conversely, if X is normal with mean zero and Cov(X) = I_n ⊗ Σ, then L(X) = L(ΓX) for all Γ ∈ O_n. Now that the O_n-left invariant normal distributions on 𝒳 have been described, we turn to the distributions of S = X'X and U described in Propositions 7.5 and 7.6. When L(X) = N(0, I_n ⊗ Σ), the density function of X with respect to the measure μ(dX) = dX/|X'X|^{n/2} is

    f₀(X'X) = (√2π)^{-np} |Σ|^{-n/2} |X'X|^{n/2} exp[-½ tr Σ^{-1}X'X].

Therefore, the density of S with respect to the measure

    ν(dS) = dS/|S|^{(p+1)/2},  S ∈ S_p^+,

is given by

    g₀(S) = ω(n, p) |Σ|^{-n/2} |S|^{n/2} exp[-½ tr Σ^{-1}S]

according to Proposition 7.6. This density is called the Wishart density with parameters Σ, p, and n. Here, p is the dimension of S and n is called the degrees of freedom. When S has such a density function, we write L(S) = W(Σ, p, n), which is read "the distribution of S is Wishart with parameters Σ, p, and n." A slightly more general definition of the Wishart distribution is given in the next chapter, where a thorough discussion of the Wishart distribution is presented. A direct application of Proposition 7.5 yields the density

    g₁(U) = 2^p ω(n, p) |Σ|^{-n/2} |U'U|^{n/2} exp[-½ tr Σ^{-1}U'U]

with respect to the measure

    ν_r(dU) = dU/∏_{i=1}^p u_{ii}^i

when X = ΨU, Ψ ∈ F_{p,n}, and U ∈ G_U^+. Here, the nonzero elements of U are u_{ij}, 1 ≤ i ≤ j ≤ p. When Σ = I_p, g₁ becomes

    g₁(U)ν_r(dU) = 2^p ω(n, p) ∏_{i=1}^p u_{ii}^n exp[-½ tr U'U] ν_r(dU)
      = 2^p ω(n, p) ∏_{i=1}^p u_{ii}^{n-i} exp[-½ Σ_{i≤j} u_{ij}²] dU.
In G_U^+, the diagonal elements of U range between 0 and ∞ and the elements above the diagonal range between −∞ and +∞. Writing the density above as

    g₁(U)ν_r(dU) = 2^p ω(n, p) ∏_{i=1}^p (u_{ii}^{n-i} exp[-½u_{ii}²] du_{ii}) × ∏_{i<j} (exp[-½u_{ij}²] du_{ij}),

we see that this density factors into a product of functions that are, when normalized by a constant, density functions. It is clear by inspection that

    L(u_{ij}) = N(0, 1) for i < j.

Further, a simple change of variable shows that

    L(u_{ii}²) = χ²_{n−i+1},  i = 1, ..., p.

Thus, when Σ = I_p, the nonzero elements of U are independent, the elements above the diagonal are all N(0, 1), and the square of the ith diagonal element has a chi-square distribution with n − i + 1 degrees of freedom. This result is sometimes useful for deriving the distribution of functions of S = U'U.
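The conclusions of this example (a Bartlett-type decomposition of a Wishart matrix with Σ = I_p) are easy to check by simulation: in S = U'U the upper triangular factor U has independent entries, N(0, 1) above the diagonal, with squared diagonals distributed χ²_{n−i+1}. A Python/NumPy Monte Carlo sketch (the sizes n = 8, p = 3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, N = 8, 3, 20000

diag_sq = np.zeros((N, p))
offdiag = np.zeros(N)
for t in range(N):
    X = rng.standard_normal((n, p))    # L(X) = N(0, I_n ⊗ I_p), i.e., Sigma = I_p
    S = X.T @ X                        # a W(I_p, p, n) matrix
    L = np.linalg.cholesky(S)          # S = L L', so U = L' is the upper factor
    U = L.T
    diag_sq[t] = np.diag(U) ** 2       # should be chi-square with n - i + 1 d.f.
    offdiag[t] = U[0, 1]               # an above-diagonal element, should be N(0, 1)

mean_diag_sq = diag_sq.mean(axis=0)    # expect approximately (8, 7, 6)
m_off, v_off = offdiag.mean(), offdiag.var()
```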
7.2. GROUPS ACTING ON SETS
Suppose 𝒳 is a set and G is a group that acts on the left of 𝒳 according to Definition 6.3. The group G defines a natural equivalence relation between elements of 𝒳: write x₁ ≡ x₂ if there exists a g ∈ G such that x₁ = gx₂. It is easy to check that ≡ is in fact an equivalence relation. Thus the group G partitions the set 𝒳 into disjoint equivalence classes, say

    𝒳 = ∪_{a∈A} 𝒳_a

where A is an index set and the equivalence classes 𝒳_a are disjoint. For each x ∈ 𝒳, the set {gx | g ∈ G} is the orbit of x under the action of G. From the definition of the equivalence relation, it is clear that, if x ∈ 𝒳_a, then 𝒳_a is just the orbit of x. Thus the decomposition of 𝒳 into equivalence classes is
simply a decomposition of 𝒳 into disjoint orbits, and two points are equivalent iff they are in the same orbit.
Definition 7.2. Suppose G acts on the left of 𝒳. A function f on 𝒳 to 𝒴 is invariant if f(x) = f(gx) for all x ∈ 𝒳 and g ∈ G. The function f is maximal invariant if f is invariant and f(x₁) = f(x₂) implies that x₁ = gx₂ for some g ∈ G.
Obviously, f is invariant iff f is constant on each orbit in 𝒳. Also, f is maximal invariant iff it is constant on each orbit and takes different values on different orbits.
Proposition 7.7. Suppose f maps 𝒳 onto 𝒴 and f is maximal invariant. Then a function h, mapping 𝒳 into 𝒵, is invariant iff h(x) = k(f(x)) for some function k mapping 𝒴 into 𝒵.

Proof. If h(x) = k(f(x)), then h is invariant as f is invariant. Conversely, suppose h is invariant. Given y ∈ 𝒴, the set {x | f(x) = y} is exactly one orbit in 𝒳 since f is maximal invariant. Let z ∈ 𝒵 be the value of h on this orbit (h is invariant), and define k(y) = z. Obviously, k is well defined and k(f(x)) = h(x). □
Proposition 7.7 is ordinarily paraphrased by saying that a function is invariant iff it is a function of a maximal invariant. Once a maximal invariant function has been constructed, all the invariant functions are known: they are the functions of the maximal invariant function. If the group G acts transitively on 𝒳, then there is just one orbit and the only invariant functions are the constants. We now turn to some examples.
* Example 7.2. Let 𝒳 = Rⁿ − {0} and let G = O_n act on 𝒳 as a group of matrices acts on a vector space. Given x ∈ 𝒳, it is clear that the orbit of x is {y | ‖y‖ = ‖x‖}. Let S_r = {x | ‖x‖ = r}, so

    𝒳 = ∪_{r>0} S_r

is the decomposition of 𝒳 into equivalence classes. The real number r > 0 indexes the orbits. That f(x) = ‖x‖ is a maximal invariant function follows from the invariance of f and the fact that f takes a different value on each orbit. Thus a function is invariant under the action of G on 𝒳 iff it is a function of ‖x‖. Now, consider the space S₁ × (0, ∞) and define the function φ on 𝒳 to S₁ × (0, ∞) by
φ(x) = (x/‖x‖, ‖x‖). Obviously, φ is one-to-one, onto, and φ^{-1}(u, r) = ru for (u, r) ∈ S₁ × (0, ∞). Further, the group action on 𝒳 corresponds to the group action on S₁ × (0, ∞) given by

    Γ(u, r) = (Γu, r),  Γ ∈ O_n.

In other words, φ(Γx) = Γφ(x), so φ is an equivariant function (see Definition 6.14). Since O_n acts transitively on S₁, a function h on S₁ × (0, ∞) is invariant iff h(u, r) does not depend on u. For this example, the space 𝒳 has been mapped onto S₁ × (0, ∞) by φ so that the group action on 𝒳 corresponds to a special group action on S₁ × (0, ∞): O_n acts transitively on S₁ and is the identity on (0, ∞). The whole point of introducing S₁ × (0, ∞) is that the function h₀(u, r) = r is obviously a maximal invariant function due to the special way in which O_n acts on S₁ × (0, ∞). To say it another way, the orbits in S₁ × (0, ∞) are S₁ × {r}, r > 0, so the product space structure provides a convenient way to index the orbits and hence to give a maximal invariant function. This type of product space structure occurs in many other examples.
The following example provides a useful generalization of the example above.
* Example 7.3. Suppose 𝒳 is the space of all n × p matrices of rank p, p ≤ n. Then O_n acts on the left of 𝒳 by matrix multiplication. The first claim is that f₀(X) = X'X is a maximal invariant function. That f₀ is invariant is clear, so assume that f₀(X₁) = f₀(X₂). Thus X₁'X₁ = X₂'X₂ and, by Proposition 1.31, there exists a Γ ∈ O_n such that ΓX₁ = X₂. This proves that f₀ is a maximal invariant. Now, the question is: where did f₀ come from? To answer this question, recall that each X ∈ 𝒳 has a unique representation as X = ΨA where Ψ ∈ F_{p,n} and A ∈ S_p^+. Let φ denote the map that sends X into the pair (Ψ, A) ∈ F_{p,n} × S_p^+ such that X = ΨA. The group O_n acts on F_{p,n} × S_p^+ by

    Γ(Ψ, A) = (ΓΨ, A)

and φ satisfies

    φ(ΓX) = Γφ(X).

It is clear that h₀(Ψ, A) = A is a maximal invariant function on F_{p,n} × S_p^+ under the action of O_n since O_n acts transitively on F_{p,n}.
Also, the orbits in ℱ_{p,n} × S_p^+ are ℱ_{p,n} × {A} for A ∈ S_p^+. It follows
immediately from the equivariance of φ that
φ^{-1}(ℱ_{p,n} × {A}) = {X | X = ΨA for some Ψ ∈ ℱ_{p,n}}
are the orbits in 𝒳 under the action of O_n. Thus we have a
convenient indexing of the orbits in 𝒳 given by A. A maximal invariant function on 𝒳 must be a one-to-one function of an orbit index, namely A ∈ S_p^+. However, f_0(X) = X'X = A² when
X ∈ {X | X = ΨA for some Ψ ∈ ℱ_{p,n}}.
Since A is the unique positive definite square root of A² = X'X, we have explicitly shown why f_0 is a one-to-one function of the orbit index A. A similar orbit indexing in 𝒳 can be given by elements U ∈ G_U^+ by representing each X ∈ 𝒳 as X = ΨU, with Ψ ∈ ℱ_{p,n} and
U ∈ G_U^+. The details of this are left to the reader.
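The decomposition X = ΨA and the invariance of f_0(X) = X'X can be illustrated numerically. The following sketch is an added illustration, not part of the original text; it uses NumPy, and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3
X1 = rng.standard_normal((n, p))          # rank p with probability one

# A random element of O_n, obtained from the QR factorization
# of a Gaussian matrix.
Gamma, _ = np.linalg.qr(rng.standard_normal((n, n)))
X2 = Gamma @ X1

# Invariance of f0(X) = X'X under the left action of O_n.
assert np.allclose(X2.T @ X2, X1.T @ X1)

# The orbit index: A = (X'X)^{1/2}, the positive definite square root.
w, V = np.linalg.eigh(X1.T @ X1)
A = V @ np.diag(np.sqrt(w)) @ V.T

# Psi = X A^{-1} has orthonormal columns, so X = Psi A with Psi in F_{p,n}.
Psi = X1 @ np.linalg.inv(A)
assert np.allclose(Psi.T @ Psi, np.eye(p))
assert np.allclose(Psi @ A, X1)
```

The square root A is computed from the spectral decomposition of X'X, which is exactly the orbit index described above.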
* Example 7.4. In this example, the set 𝒳 is (R^p − {0}) × S_p^+. The
group Gl_p acts on the left of 𝒳 in the following manner:
A(y, S) = (Ay, ASA')
for (y, S) ∈ 𝒳 and A ∈ Gl_p. A useful method for finding a maximal invariant function is to consider a point (y, S) ∈ 𝒳 and then
"reduce" (y, S) to a convenient representative in the orbit of (y, S). The orbit of (y, S) is {A(y, S) | A ∈ Gl_p}. To reduce a given
point (y, S), first choose A = ΓS^{-1/2} where Γ ∈ O_p and S^{-1/2} is the inverse of the positive definite square root of S.
Then
ASA' = ΓS^{-1/2}SS^{-1/2}Γ' = ΓΓ' = I_p
and
A(y, S) = (ΓS^{-1/2}y, I_p),
which is in the orbit of (y, S). Since S^{-1/2}y and ‖S^{-1/2}y‖ε_1 have
the same length (ε_1 = (1, 0, ..., 0)'), we can choose Γ ∈ O_p such that
ΓS^{-1/2}y = ‖S^{-1/2}y‖ε_1.
Therefore, for each (y, S) ∈ 𝒳, the point
(‖S^{-1/2}y‖ε_1, I_p)
is in the orbit of (y, S). Let
f_0(y, S) = y'S^{-1}y = ‖S^{-1/2}y‖².
The above reduction argument suggests, but does not prove, that f_0 is maximal invariant. However, the reduction argument does provide a method for checking that f_0 is maximal invariant. First, f_0 is
invariant. To show f_0 is maximal invariant, if f_0(y_1, S_1) = f_0(y_2, S_2),
we must show there exists an A ∈ Gl_p such that A(y_1, S_1) = (y_2, S_2).
From the reduction argument, there exists A_i ∈ Gl_p such that
A_i(y_i, S_i) = (‖S_i^{-1/2}y_i‖ε_1, I_p), i = 1, 2.
Since f_0(y_1, S_1) = f_0(y_2, S_2),
‖S_1^{-1/2}y_1‖ = ‖S_2^{-1/2}y_2‖
and this shows that
A_1(y_1, S_1) = A_2(y_2, S_2).
Setting A = A_2^{-1}A_1, we see that A(y_1, S_1) = (y_2, S_2), so f_0 is maximal invariant. As in the previous two examples, it is possible to represent 𝒳 as a product space where a maximal invariant is
obvious. Let
𝒴 = {(u, S) | u ∈ R^p, S ∈ S_p^+, u'S^{-1}u = 1}.
Then Gl_p acts on the left of 𝒴 by
A(u, S) = (Au, ASA').
The reduction argument used above shows that the action of Gl_p is transitive on 𝒴. Consider the map φ from 𝒳 to 𝒴 × (0, ∞) given by
φ(x, S) = (((x'S^{-1}x)^{-1/2}x, S), (x'S^{-1}x)^{1/2}).
The group action of Gl_p on 𝒴 × (0, ∞) is
A((u, S), r) = (A(u, S), r)
and a maximal invariant function is
f_1((u, S), r) = r
since Gl_p is transitive on 𝒴. Clearly, φ is a one-to-one onto function and satisfies
φ(A(x, S)) = Aφ(x, S).
Thus f_1(φ(x, S)) = (x'S^{-1}x)^{1/2}, and hence x'S^{-1}x, is maximal invariant.
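As a numerical illustration of this example (added here, not from the text; NumPy-based with illustrative names), one can check both the invariance of f_0(y, S) = y'S^{-1}y under the Gl_p action and the reduction of (y, S) to (S^{-1/2}y, I_p):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
y = rng.standard_normal(p)
B = rng.standard_normal((p, p))
S = B @ B.T + p * np.eye(p)                 # a point in S_p^+
A = rng.standard_normal((p, p))             # a generic A lies in Gl_p

def f0(y, S):
    # y'S^{-1}y, computed via a linear solve rather than an explicit inverse
    return y @ np.linalg.solve(S, y)

# Invariance under the Gl_p action (y, S) -> (Ay, ASA').
assert np.isclose(f0(A @ y, A @ S @ A.T), f0(y, S))

# The reduction: S^{-1/2} carries (y, S) to (S^{-1/2}y, I_p),
# and ||S^{-1/2}y||^2 equals f0(y, S).
w, V = np.linalg.eigh(S)
S_inv_half = V @ np.diag(w ** -0.5) @ V.T
assert np.allclose(S_inv_half @ S @ S_inv_half, np.eye(p))
assert np.isclose(np.sum((S_inv_half @ y) ** 2), f0(y, S))
```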
In the three examples above, the space 𝒳 has been represented as a product space 𝒴 × 𝒵 in such a way that the group action on 𝒳 corresponds
to a group action on 𝒴 × 𝒵, namely
g(y, z) = (gy, z),
and G acts transitively on 𝒴. Thus it is obvious that
f_1(y, z) = z
is maximal invariant for G acting on 𝒴 × 𝒵. However, the correspondence
φ, a one-to-one onto mapping, satisfies
φ(gx) = gφ(x) for g ∈ G, x ∈ 𝒳.
The conclusion is that f_1(φ(x)) is a maximal invariant function on 𝒳. A
direct proof in the present generality is easy. Since
f_1(φ(gx)) = f_1(gφ(x)) = f_1(φ(x)),
f_1(φ(x)) is invariant. If f_1(φ(x_1)) = f_1(φ(x_2)), then there is a g ∈ G such that gφ(x_1) = φ(x_2) since f_1 is maximal invariant on 𝒴 × 𝒵. But gφ(x_1) =
φ(gx_1) = φ(x_2), so gx_1 = x_2 as φ is one-to-one. Thus f_1(φ(x)) is maximal
invariant. In the next example, a maximal invariant function is easily found but the product space representation in the form just discussed is not
available.
* Example 7.5. The group O_p acts on S_p^+ by
Γ(S) = ΓSΓ', Γ ∈ O_p.
A maximal invariant function is easily found using a reduction argument similar to that given in Example 7.4. From the spectral
theorem for matrices, every S ∈ S_p^+ can be written in the form S = Γ_1DΓ_1' where Γ_1 ∈ O_p and D is a diagonal matrix whose diagonal elements are the ordered eigenvalues of S, say
λ_1(S) ≥ λ_2(S) ≥ ... ≥ λ_p(S).
Thus Γ_1'SΓ_1 = D, which shows that D is in the orbit of S. Let f_0 on
S_p^+ to R^p be defined by: f_0(S) is the vector of ordered eigenvalues
of S. Obviously, f_0 is O_p-invariant and, to show f_0 is maximal
invariant, suppose f_0(S_1) = f_0(S_2). Then S_1 and S_2 have the same
eigenvalues and we have
S_i = Γ_iDΓ_i', i = 1, 2,
where D is the diagonal matrix of eigenvalues of S_i, i = 1, 2. Thus
Γ_2Γ_1'S_1(Γ_2Γ_1')' = S_2, so f_0 is maximal invariant. To describe the
technical difficulty when we try to write S_p^+ as a product space, first consider the case p = 2. Then S_2^+ = 𝒳_1 ∪ 𝒳_2 where
𝒳_1 = {S | S ∈ S_2^+, λ_1(S) = λ_2(S)}
and
𝒳_2 = {S | S ∈ S_2^+, λ_1(S) > λ_2(S)}.
That O_2 acts on both 𝒳_1 and 𝒳_2 is clear. The function φ_1 defined
on 𝒳_1 by φ_1(S) = λ_1(S) ∈ (0, ∞) is maximal invariant and φ_1
establishes a one-to-one correspondence between 𝒳_1 and (0, ∞). For 𝒳_2, define φ_2 by
φ_2(S) = diag(λ_1(S), λ_2(S)).
So φ_2 is a maximal invariant function and takes values in the set 𝒟
of all 2 × 2 diagonal matrices with diagonal elements y_1 and y_2,
y_1 > y_2 > 0. Let 𝒟_2 be the subgroup of O_2 consisting of those diagonal matrices with ±1 for each diagonal element. The argument given in Example 6.21 shows that the mapping constructed there establishes a one-to-one onto correspondence between 𝒳_2 and (O_2/𝒟_2) × 𝒟, and O_2 acts on (O_2/𝒟_2) × 𝒟 by
Γ(z, y) = (Γz, y); (z, y) ∈ (O_2/𝒟_2) × 𝒟.
Further, this correspondence is equivariant: it intertwines the action of O_2 on 𝒳_2 with the action Γ(z, y) = (Γz, y) on (O_2/𝒟_2) × 𝒟.
Thus for p = 2, S_2^+ has been decomposed into 𝒳_1 and 𝒳_2, which
are both invariant under O_2. The action of O_2 on 𝒳_1 is trivial in that
Γx = x for all x ∈ 𝒳_1, and a maximal invariant function on 𝒳_1 is
the identity function. Also, 𝒳_2 was decomposed into a product space where O_2 acted transitively on the first component of the product space and trivially on the second component. From this decomposition, a maximal invariant function was obvious. Similar decompositions can be given for S_p^+ when p > 2, but the number of
component spaces increases. For example, when p = 3, let λ_1(S) ≥
λ_2(S) ≥ λ_3(S) denote the ordered eigenvalues of S ∈ S_3^+. The
relevant decomposition for S_3^+ is
S_3^+ = 𝒳_1 ∪ 𝒳_2 ∪ 𝒳_3 ∪ 𝒳_4
where
𝒳_1 = {S | λ_1(S) = λ_2(S) = λ_3(S)}
𝒳_2 = {S | λ_1(S) = λ_2(S) > λ_3(S)}
𝒳_3 = {S | λ_1(S) > λ_2(S) = λ_3(S)}
𝒳_4 = {S | λ_1(S) > λ_2(S) > λ_3(S)}.
Each of the four components is acted on by O_3 and can be written as
a product space with the structure described previously. The details
of this are left to the reader. In some situations, it is sufficient to
consider the subset 𝒮 of S_p^+ where
𝒮 = {S | λ_1(S) > λ_2(S) > ... > λ_p(S)}.
The argument given in Example 6.21 shows how to write 𝒮 as a
product space so that a maximal invariant function is obvious under
the action of O_p on 𝒮.
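The fact that the ordered eigenvalues form a maximal invariant under the O_p action of Example 7.5 is easy to check numerically. A small sketch (an illustration added here, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
B = rng.standard_normal((p, p))
S = B @ B.T + np.eye(p)                     # S in S_p^+

# A random Gamma in O_p.
Gamma, _ = np.linalg.qr(rng.standard_normal((p, p)))

# f0(S) = vector of ordered eigenvalues; invariant under S -> Gamma S Gamma'.
lam1 = np.sort(np.linalg.eigvalsh(S))[::-1]
lam2 = np.sort(np.linalg.eigvalsh(Gamma @ S @ Gamma.T))[::-1]
assert np.allclose(lam1, lam2)

# The reduction S = Gamma_1 D Gamma_1' with D the diagonal matrix of
# ordered eigenvalues, so D is in the orbit of S.
D = np.diag(lam1)
w, G1 = np.linalg.eigh(S)                   # eigh returns ascending order
G1 = G1[:, ::-1]                            # reorder so S = G1 D G1'
assert np.allclose(G1 @ D @ G1.T, S)
```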
Further examples of maximal invariants are given as the need arises. We
end this section with a brief discussion of equivariant functions. Recall (see Definition 6.14) that a function φ on 𝒳 onto 𝒴 is called equivariant if
φ(gx) = ḡφ(x), where G acts on 𝒳, Ḡ acts on 𝒴, and Ḡ is a homomorphic
image of G. If Ḡ = {ē} consists only of the identity, then equivariant functions are invariant under G. In this case, we have a complete description of all the equivariant functions, namely, a function is equivariant iff it is a function of a maximal invariant function on 𝒳. In the general case
when Ḡ is not the trivial group, a useful description of all the equivariant functions appears to be rather difficult. However, there is one special case
when the equivariant functions can be characterized. Assume that G acts transitively on 𝒳 and Ḡ acts transitively on 𝒴, where
Ḡ is a homomorphic image of G. Fix x_0 ∈ 𝒳 and let
H_0 = {g | gx_0 = x_0}.
The subgroup H_0 of G is called the isotropy subgroup of x_0. Also, fix y_0 ∈ 𝒴
and let
K_0 = {ḡ | ḡy_0 = y_0}
be the isotropy subgroup of y_0.
Proposition 7.8. In order that there exist an equivariant function φ on 𝒳 to 𝒴 such that φ(x_0) = y_0, it is necessary and sufficient that H̄_0 ⊆ K_0. Here
H̄_0 ⊆ Ḡ is the image of H_0 under the given homomorphism.
Proof. First, suppose that φ is equivariant and satisfies φ(x_0) = y_0. Then,
for g ∈ H_0,
φ(x_0) = φ(gx_0) = ḡφ(x_0) = ḡy_0,
so ḡy_0 = y_0, that is, ḡ ∈ K_0. Thus H̄_0 ⊆ K_0. Conversely, suppose that H̄_0 ⊆ K_0. For x ∈ 𝒳, the transitivity of G on 𝒳 implies that x = gx_0 for some g. Define φ on 𝒳 to
𝒴 by
φ(x) = ḡy_0 where x = gx_0.
It must be shown that φ is well defined and is equivariant. If x = g_1x_0 =
g_2x_0, then g_2^{-1}g_1 ∈ H_0 so ḡ_2^{-1}ḡ_1 ∈ K_0. Thus
φ(x) = ḡ_1y_0 = ḡ_2y_0
since
ḡ_2^{-1}ḡ_1y_0 = y_0.
Therefore φ is well defined and is onto 𝒴 since Ḡ acts transitively on 𝒴.
That φ is equivariant is easily checked. □
The proof of Proposition 7.8 shows that an equivariant function is determined by its value at one point when G acts transitively on 𝒳. More precisely, if φ_1 and φ_2 are equivariant functions on 𝒳 such that φ_1(x_0) =
φ_2(x_0) for some x_0 ∈ 𝒳, then φ_1(x) = φ_2(x) for all x. To see this, write
x = gx_0 so
φ_1(x) = φ_1(gx_0) = ḡφ_1(x_0) = ḡφ_2(x_0) = φ_2(gx_0) = φ_2(x).
Thus, to characterize all the equivariant functions, it is sufficient to determine the possible values of φ(x_0) for some fixed x_0 ∈ 𝒳. The following example illustrates these ideas.
* Example 7.6. Suppose 𝒳 = 𝒴 = S_p^+ and G = Ḡ = Gl_p, where the homomorphism is the identity. The action of Gl_p on S_p^+ is
A(S) = ASA'; A ∈ Gl_p, S ∈ S_p^+.
To characterize the equivariant functions, pick x_0 = I_p ∈ S_p^+. An equivariant function φ must satisfy
φ(I_p) = φ(ΓI_pΓ') = Γφ(I_p)Γ'
for all Γ ∈ O_p. By Proposition 2.13, a matrix φ(I_p) satisfies this
equation iff φ(I_p) = kI_p for some real constant k. Since φ(I_p) ∈ S_p^+, k > 0. Thus
φ(I_p) = kI_p, k > 0,
and for S ∈ S_p^+,
φ(S) = φ(S^{1/2}S^{1/2}) = S^{1/2}φ(I_p)S^{1/2} = kS.
Therefore, every equivariant function has the form φ(S) = kS for some k > 0.
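A numerical check of Example 7.6 (an added illustration, not from the text): φ(S) = kS is equivariant under the Gl_p action, while a function not of this form, such as the diagonal part of S, fails equivariance for a generic A:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
B = rng.standard_normal((p, p))
S = B @ B.T + np.eye(p)                     # S in S_p^+
A = rng.standard_normal((p, p))             # a generic A lies in Gl_p

k = 2.5                                     # any k > 0
phi = lambda S: k * S
# Equivariance: phi(ASA') = A phi(S) A'.
assert np.allclose(phi(A @ S @ A.T), A @ phi(S) @ A.T)

# A map that is not of the form kS, e.g. the diagonal part of S,
# is generically not equivariant.
psi = lambda S: np.diag(np.diag(S))
assert not np.allclose(psi(A @ S @ A.T), A @ psi(S) @ A.T)
```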
Further applications of the above ideas occur in the following sections after it is shown that, under certain conditions, maximum likelihood estimators are equivariant functions.
7.3. INVARIANT PROBABILITY MODELS
Invariant probability models provide the mathematical framework in which the connection between statistical problems and invariance can be studied.
Suppose (𝒳, ℬ) is a measurable space and G is a group of transformations
acting on 𝒳 such that each g ∈ G is a one-to-one onto measurable function
from 𝒳 to 𝒳. If P is a probability measure on (𝒳, ℬ) and g ∈ G, the
probability measure gP on (𝒳, ℬ) is defined by
(gP)(B) = P(g^{-1}B); g ∈ G, B ∈ ℬ.
It is easily verified that (g_1g_2)P = g_1(g_2P), so the group G acts on the space
of all probability measures defined on (𝒳, ℬ).
Definition 7.3. Let 𝒫 be a set of probability measures defined on (𝒳, ℬ). The set 𝒫 is invariant under G if for each P ∈ 𝒫, gP ∈ 𝒫 for all g ∈ G. Sets
of probability measures 𝒫 are called probability models, and when 𝒫 is
invariant under G, we speak of a G-invariant probability model.
If X ∈ 𝒳 is a random vector with ℒ(X) = P, then ℒ(gX) = gP for
g ∈ G since
Pr{gX ∈ B} = Pr{X ∈ g^{-1}B} = P(g^{-1}B) = (gP)(B).
Thus 𝒫 is invariant under G iff whenever ℒ(X) ∈ 𝒫, ℒ(gX) ∈ 𝒫 for all g ∈ G.
There are a variety of ways to construct invariant probability models from other invariant probability models. For example, if 𝒫_α, α ∈ A, are
G-invariant probability models, it is clear that
⋃_{α∈A} 𝒫_α and ⋂_{α∈A} 𝒫_α
are both G-invariant. Now, given (𝒳, ℬ) and a G-invariant probability model 𝒫, form the product space
𝒳^{(n)} = 𝒳 × 𝒳 × ... × 𝒳
and the product σ-algebra ℬ^{(n)} on 𝒳^{(n)}. For P ∈ 𝒫, define P^{(n)} on ℬ^{(n)} by first defining
P^{(n)}(B_1 × B_2 × ... × B_n) = ∏_{i=1}^n P(B_i)
where B_i ∈ ℬ. Once P^{(n)} is defined on sets of the form B_1 × ... × B_n, its
extension to ℬ^{(n)} is unique. Also, define G acting on 𝒳^{(n)} by
g(x_1, ..., x_n) = (gx_1, ..., gx_n)
for x = (x_1, ..., x_n) ∈ 𝒳^{(n)}.
Proposition 7.9. Let 𝒫^{(n)} = {P^{(n)} | P ∈ 𝒫}. Then 𝒫^{(n)} is a G-invariant probability model on (𝒳^{(n)}, ℬ^{(n)}) when 𝒫 is G-invariant.
Proof. It must be shown that gP^{(n)} ∈ 𝒫^{(n)} for g ∈ G and P^{(n)} ∈ 𝒫^{(n)}.
However, P^{(n)} is the product measure
P^{(n)} = P × P × ... × P,
and gP^{(n)} is determined by its values on sets of the form B_1 × ... × B_n. But
(gP^{(n)})(B_1 × ... × B_n) = P^{(n)}(g^{-1}B_1 × g^{-1}B_2 × ... × g^{-1}B_n)
= ∏_{i=1}^n P(g^{-1}B_i) = ∏_{i=1}^n (gP)(B_i),
where the first equality follows from the definition of the action of G on
𝒳^{(n)}. Thus gP^{(n)} is the product measure
gP^{(n)} = (gP) × (gP) × ... × (gP),
which is in 𝒫^{(n)} as gP ∈ 𝒫. □
For an application of Proposition 7.9, suppose X is a random vector with ℒ(X) ∈ 𝒫 where 𝒫 is a G-invariant probability model on 𝒳. If X_1, ..., X_n are independent and identically distributed with ℒ(X_i) ∈ 𝒫, then the random vector
Y = (X_1, ..., X_n) ∈ 𝒳^{(n)}
has distribution P^{(n)} ∈ 𝒫^{(n)} when ℒ(X_i) = P, i = 1, ..., n. Thus 𝒫^{(n)} is a G-invariant probability model for Y.
In most applications, probability models 𝒫 are described in the form
𝒫 = {P_θ | θ ∈ Θ} where θ is a parameter and Θ is the parameter space. When
discussing indexed families of probability measures, the term "parameter space" is used only in the case that the indexing is one-to-one, that is,
P_{θ_1} = P_{θ_2} implies that θ_1 = θ_2. Now, suppose 𝒫 = {P_θ | θ ∈ Θ} is G-invariant.
Then for each g ∈ G and θ ∈ Θ, gP_θ ∈ 𝒫, so gP_θ = P_{θ'} for some unique
θ' ∈ Θ. Define a function ḡ on Θ to Θ by
gP_θ = P_{ḡθ}, θ ∈ Θ.
In other words, ḡθ is the unique point in Θ that satisfies the above equation.
Proposition 7.10. Each ḡ is a one-to-one onto function from Θ to Θ. Let
Ḡ = {ḡ | g ∈ G}. Then Ḡ is a group under the group operation of function
composition and the mapping g → ḡ is a group homomorphism from G to
Ḡ, that is:
(i) (g_1g_2)‾ = ḡ_1ḡ_2.
(ii) (g^{-1})‾ = (ḡ)^{-1}.
Proof. To show that ḡ is one-to-one, suppose ḡθ_1 = ḡθ_2. Then
gP_{θ_1} = P_{ḡθ_1} = P_{ḡθ_2} = gP_{θ_2},
which implies that P_{θ_1} = P_{θ_2} so θ_1 = θ_2. The verification that ḡ is onto goes
as follows. If θ ∈ Θ, let θ' = (g^{-1})‾θ. Then
P_{ḡθ'} = gP_{θ'} = g(g^{-1}P_θ) = (gg^{-1})P_θ = P_θ,
so ḡθ' = θ. Equations (i) and (ii) follow by calculations similar to those above. This shows that Ḡ is the homomorphic image of G and Ḡ is a group.
□
An important special case of a G-invariant parametric model is the following. Suppose G acts on (𝒳, ℬ) and assume that ν is a σ-finite measure on (𝒳, ℬ) that is relatively invariant with multiplier χ, that is,
∫ f(g^{-1}x)ν(dx) = χ(g)∫ f(x)ν(dx), g ∈ G,
for all integrable functions f. Assume that 𝒫 = {P_θ | θ ∈ Θ} is a parametric
model and
P_θ(B) = ∫ I_B(x)p(x|θ)ν(dx)
for all measurable sets B. Thus p(·|θ) is a density for P_θ with respect to ν. If
𝒫 is G-invariant, then
gP_θ = P_{ḡθ} for g ∈ G, θ ∈ Θ.
Therefore,
(gP_θ)(B) = P_θ(g^{-1}B) = ∫ I_B(gx)p(x|θ)ν(dx)
= χ(g^{-1})∫ I_B(x)p(g^{-1}x|θ)ν(dx)
and
P_{ḡθ}(B) = ∫ I_B(x)p(x|ḡθ)ν(dx)
for all measurable sets B. Thus the density p must satisfy
χ(g^{-1})p(g^{-1}x|θ) = p(x|ḡθ) a.e. (ν)
or, equivalently,
p(x|θ) = p(gx|ḡθ)χ(g) a.e. (ν).
It should be noted that the null set where the above equality does not hold
may depend on both θ and g. However, in most applications, a version of
the density is available so the above equality is valid everywhere. This leads
to the following definition.
Definition 7.4. The family of densities {p(·|θ) | θ ∈ Θ} with respect to the
relatively invariant measure ν with multiplier χ is (G–Ḡ)-invariant if
p(x|θ) = p(gx|ḡθ)χ(g)
for all x, θ, and g.
It is clear that if a family of densities is (G–Ḡ)-invariant, where Ḡ is a
homomorphic image of G that acts on Θ, then the family of probability measures defined by these densities is a G-invariant probability model. A
few examples illustrate these notions.
* Example 7.7. Let 𝒳 = R^n and suppose f(‖x‖²) is a density with
respect to Lebesgue measure on R^n. For μ ∈ R^n and Σ ∈ S_n^+, set
p(x|μ, Σ) = |Σ|^{-1/2} f((x − μ)'Σ^{-1}(x − μ)).
For each μ and Σ, p(·|μ, Σ) is a density on R^n. The affine group Al_n acts on R^n by (A, b)x = Ax + b and Lebesgue measure is relatively
invariant with multiplier
χ(A, b) = |det(A)|
where (A, b) ∈ Al_n. Consider the parameter space R^n × S_n^+ and the
family of densities
{p(·|μ, Σ) | (μ, Σ) ∈ R^n × S_n^+}.
The group Al_n acts on the parameter space R^n × S_n^+ by
(A, b)(μ, Σ) = (Aμ + b, AΣA').
It is now verified that the family of densities above is (G–Ḡ)-invariant where G = Ḡ = Al_n. For (A, b) ∈ Al_n,
p((A, b)x|(A, b)(μ, Σ)) = p(Ax + b|(Aμ + b, AΣA'))
= |AΣA'|^{-1/2} f((Ax + b − Aμ − b)'(AΣA')^{-1}(Ax + b − Aμ − b))
= |det A|^{-1}|Σ|^{-1/2} f((x − μ)'Σ^{-1}(x − μ))
= χ^{-1}(A, b) p(x|μ, Σ).
Therefore, the parametric model determined by the family of densities is Al_n-invariant.
A useful method for generating a G-invariant probability model on a measurable space (𝒳, ℬ) is to consider a fixed probability measure P_0 on (𝒳, ℬ) and set
𝒫 = {gP_0 | g ∈ G}.
Obviously, 𝒫 is G-invariant. However, in many situations, the group G does
not serve as a parameter space for 𝒫 since g_1P_0 = g_2P_0 does not necessarily
imply that g_1 = g_2. For example, consider 𝒳 = R^n and let P_0 be given by
P_0(B) = ∫_{R^n} I_B(x)f(‖x‖²) dx
where f(‖x‖²) is the density on R^n of Example 7.7. Also, let G = Al_n. To obtain the density of gP_0, suppose X is a random vector with ℒ(X) = P_0. For g = (A, b) ∈ Al_n, (A, b)X = AX + b has a density given by
p(x|b, AA') = |det(AA')|^{-1/2} f((x − b)'(AA')^{-1}(x − b))
and this is the density of (A, b)P_0. Thus the parameter space for
𝒫 = {(A, b)P_0 | (A, b) ∈ Al_n}
is R^n × S_n^+. Of course, the reason that Al_n is not a parameter space for 𝒫 is
that
(Γ, 0)P_0 = P_0
for all n × n orthogonal matrices Γ. In other words, P_0 is an orthogonally
invariant probability on R^n. Some of the linear models introduced in Chapter 4 provide interesting
examples of parametric models that are generated by groups of transformations.
* Example 7.8. Consider an inner product space (V, [·,·]) and let P_0 be a probability measure on V so that if ℒ(X) = P_0, then ℰX = 0
and Cov(X) = I. Given a subspace M of V, form the group G
whose elements consist of pairs (a, x) with a > 0 and x ∈ M. The
group operation is
(a_1, x_1)(a_2, x_2) = (a_1a_2, a_1x_2 + x_1).
The probability model 𝒫 = {gP_0 | g ∈ G} consists of all the distributions of (a, x)X = aX + x where ℒ(X) = P_0. Clearly,
ℰ(aX + x) = x and Cov(aX + x) = a²I.
Therefore, if ℒ(Y) ∈ 𝒫, then ℰY ∈ M and Cov(Y) = a²I for some
a² > 0, so 𝒫 is a linear model for Y. For this particular example the
group G is a parameter space for 𝒫. This linear model is generated by G in the sense that 𝒫 is obtained by transforming a fixed
probability measure P_0 by elements of G.
An argument similar to that in Example 7.8 shows that the multivariate
linear model introduced in Example 4.4 is also generated by a group of
transformations.
* Example 7.9. Let ℒ_{p,n} be the linear space of real n × p matrices
with the usual inner product ⟨·,·⟩ on ℒ_{p,n}. Assume that P_0 is a probability measure on ℒ_{p,n} so that, if ℒ(X) = P_0, then ℰX = 0
and Cov(X) = I_n ⊗ I_p. To define a regression subspace M, let Z be
a fixed n × k real matrix and set
M = {y | y = ZB, B ∈ ℒ_{p,k}}.
Obviously, M is a subspace of ℒ_{p,n}. Consider the group G whose elements are pairs (A, y) with A ∈ Gl_p and y ∈ M. Then G acts on
ℒ_{p,n} by
(A, y)x = xA' + y = (I_n ⊗ A)x + y,
and the group operation is
(A_1, y_1)(A_2, y_2) = (A_1A_2, y_2A_1' + y_1).
The probability model 𝒫 = {gP_0 | g ∈ G} consists of the distributions of (A, y)X = (I_n ⊗ A)X + y where ℒ(X) = P_0. Since
ℰ((I_n ⊗ A)X + y) = y ∈ M
and
Cov((I_n ⊗ A)X + y) = I_n ⊗ AA',
if ℒ(Y) ∈ 𝒫, then ℰY ∈ M and Cov(Y) = I_n ⊗ Σ for some p × p
positive definite matrix Σ. Thus 𝒫 is a multivariate linear model as described in Example 4.4. If p > 1, the group G is not a parameter
space for 𝒫, but G does generate 𝒫.
Most of the probability models discussed in later chapters are examples of probability models generated by groups of transformations. Thus these
models are G-invariant and this invariance can be used in a variety of ways.
First, invariance can be used to give easy derivations of maximum likelihood estimators and to suggest test statistics in some situations. In addition, distributional and independence properties of certain statistics are often best explained in terms of invariance.
7.4. THE INVARIANCE OF LIKELIHOOD METHODS
In this section, it is shown that under certain conditions maximum likelihood estimators are equivariant functions and likelihood ratio tests are invariant functions. Throughout this section, G is a group of transformations that act measurably on (𝒳, ℬ) and ν is a σ-finite relatively invariant
measure on (𝒳, ℬ) with multiplier χ. Suppose that 𝒫 = {P_θ | θ ∈ Θ} is a
G-invariant parametric model such that each P_θ has a density p(·|θ), which
satisfies
p(x|θ) = p(gx|ḡθ)χ(g)
for all x ∈ 𝒳, θ ∈ Θ, and g ∈ G. The group Ḡ = {ḡ | g ∈ G} is the homomorphic image of G described in Proposition 7.10. In the present context, a
point estimator of θ, say t, mapping 𝒳 into Θ, is equivariant (see Definition
6.14) if
t(gx) = ḡt(x), g ∈ G, x ∈ 𝒳.
Proposition 7.11. Given the (G–Ḡ)-invariant family of densities
{p(·|θ) | θ ∈ Θ}, assume there exists a unique function θ̂ mapping 𝒳 into Θ
that satisfies
sup_{θ∈Θ} p(x|θ) = p(x|θ̂(x)).
Then θ̂ is an equivariant function, that is,
θ̂(gx) = ḡθ̂(x), x ∈ 𝒳, g ∈ G.
Proof. By assumption, θ̂(gx) is the unique point in Θ that satisfies
sup_{θ∈Θ} p(gx|θ) = p(gx|θ̂(gx)).
But
p(gx|θ) = χ(g^{-1})p(x|ḡ^{-1}θ)
so
sup_θ p(gx|θ) = χ(g^{-1}) sup_θ p(x|ḡ^{-1}θ) = χ(g^{-1}) sup_θ p(x|θ)
= χ(g^{-1})p(x|θ̂(x)) = p(gx|ḡθ̂(x)).
Thus
p(gx|θ̂(gx)) = p(gx|ḡθ̂(x))
and, by the uniqueness assumption,
θ̂(gx) = ḡθ̂(x). □
Of course, the estimator θ̂(x) whose existence and uniqueness is assumed in Proposition 7.11 is the maximum likelihood estimator of θ. That θ̂ is an equivariant function is useful information about the maximum likelihood estimator, but the above result does not indicate how to use invariance to find the maximum likelihood estimator. The next result rectifies this situation.
Proposition 7.12. Let {p(·|θ) | θ ∈ Θ} be a (G–Ḡ)-invariant family of
densities on (𝒳, ℬ). Fix a point x_0 ∈ 𝒳 and let 𝒪_{x_0} be the orbit of x_0. Assume that
sup_{θ∈Θ} p(x_0|θ) = p(x_0|θ_0)
and that θ_0 is unique. For x ∈ 𝒪_{x_0}, define θ̂(x) by
θ̂(x) = ḡ_xθ_0 where x = g_xx_0.
Then θ̂ is well defined on 𝒪_{x_0} and satisfies
(i) θ̂(gx) = ḡθ̂(x), x ∈ 𝒪_{x_0}.
(ii) sup_{θ∈Θ} p(x|θ) = p(x|θ̂(x)), x ∈ 𝒪_{x_0}.
Furthermore, θ̂ is unique.
Proof. The density p(·|θ) satisfies
p(x|θ) = p(gx|ḡθ)χ(g)
where χ is a multiplier on G. To show θ̂ is well defined on 𝒪_{x_0}, it must be
verified that if x = g_xx_0 = h_xx_0, then ḡ_xθ_0 = h̄_xθ_0. Set k = h_x^{-1}g_x, so kx_0 =
x_0 and we need to show that k̄θ_0 = θ_0. But, since kx_0 = x_0,
p(x_0|θ_0) = sup_θ p(kx_0|θ) = χ(k^{-1}) sup_θ p(x_0|k̄^{-1}θ)
= χ(k^{-1}) sup_θ p(x_0|θ) = χ(k^{-1})p(x_0|θ_0)
= p(kx_0|k̄θ_0)
= p(x_0|k̄θ_0).
By the uniqueness assumption, k̄θ_0 = θ_0, so θ̂ is well defined on 𝒪_{x_0}. To
establish (i), if x = g_xx_0, then gx = (gg_x)x_0 so
θ̂(gx) = (gg_x)‾θ_0 = ḡ(ḡ_xθ_0) = ḡθ̂(x).
For (ii), x = g_xx_0 so
sup_θ p(x|θ) = sup_θ p(g_xx_0|θ) = χ(g_x^{-1}) sup_θ p(x_0|ḡ_x^{-1}θ)
= χ(g_x^{-1}) sup_θ p(x_0|θ) = χ(g_x^{-1})p(x_0|θ_0)
= p(g_xx_0|ḡ_xθ_0) = p(x|θ̂(x)).
To establish the uniqueness of θ̂, fix x ∈ 𝒪_{x_0} and consider θ_1 ≠ ḡ_xθ_0. Then
p(x|θ_1) = p(g_xx_0|ḡ_xḡ_x^{-1}θ_1) = χ(g_x^{-1})p(x_0|ḡ_x^{-1}θ_1)
< χ(g_x^{-1})p(x_0|θ_0) = p(x|θ̂(x)).
The strict inequality follows from the uniqueness assumption concerning θ_0. □
In applications, Proposition 7.12 is used as follows. From each orbit in
the sample space 𝒳, we pick a convenient point x_0 and show that p(x_0|θ) is
uniquely maximized at θ_0. Then for other points x in this orbit, write x = g_xx_0 and set θ̂(x) = ḡ_xθ_0. The function θ̂ is then the maximum likelihood estimator of θ and is equivariant. In some situations, there is only one
orbit in 𝒳 so this method is relatively easy to apply.
* Example 7.10. Consider 𝒳 = Θ = S_p^+ and let
p(S|Σ) = ω(n, p)|Σ|^{-n/2}|S|^{n/2} exp[−(1/2) tr Σ^{-1}S]
for S ∈ S_p^+ and Σ ∈ S_p^+. The constant ω(n, p), n ≥ p, was defined
in Example 5.1. That p(·|Σ) is a density with respect to the measure
ν(dS) = dS/|S|^{(p+1)/2}
follows from Example 5.1. The group Gl_p acts on S_p^+ by
A(S) = ASA'
for A ∈ Gl_p and S ∈ S_p^+, and the measure ν is invariant. Also, it is clear that the density p(·|Σ) satisfies
p(ASA'|AΣA') = p(S|Σ).
To find the maximum likelihood estimator of Σ ∈ S_p^+, we apply the
technique described above. Consider the point I_p ∈ S_p^+ and note that the orbit of I_p under the action of Gl_p is S_p^+, so in this case there is only one orbit. Thus to apply Proposition 7.12, it must be verified that
sup_Σ p(I_p|Σ) = p(I_p|Σ_0)
where Σ_0 is unique. Taking the logarithm of p(I_p|Σ) and ignoring the constant term, maximizing p(I_p|Σ) over Σ is equivalent to computing
sup_Σ [n log|Σ^{-1}| − tr Σ^{-1}] = sup_{B∈S_p^+} [n log|B| − tr B]
= sup_{B∈S_p^+} ∑_{i=1}^p [n log λ_i − λ_i]
where λ_1, ..., λ_p are the eigenvalues of B = Σ^{-1} ∈ S_p^+. However, for λ > 0, n log λ − λ is a strictly concave function of λ and is
uniquely maximized at λ = n. Thus the function
∑_{i=1}^p [n log λ_i − λ_i]
is uniquely maximized at λ_1 = ... = λ_p = n, which means that
n log|B| − tr B
is uniquely maximized at B = nI_p. Therefore,
sup_Σ p(I_p|Σ) = p(I_p|(1/n)I_p)
and (1/n)I_p is the unique point in S_p^+ that achieves this supremum.
To find the maximum likelihood estimator of Σ, say Σ̂(S), write
S = AA' for A ∈ Gl_p. Then
Σ̂(S) = A((1/n)I_p)A' = (1/n)AA' = (1/n)S.
In summary,
Σ̂ = (1/n)S
is the unique maximum likelihood estimator of Σ and
sup_Σ p(S|Σ) = p(S|(1/n)S)
= ω(n, p)|(1/n)S|^{-n/2}|S|^{n/2} exp[−(1/2) tr((1/n)S)^{-1}S]
= ω(n, p)n^{np/2} exp[−np/2].
The results of this example are used later to derive the maximum
likelihood estimator of a covariance matrix in a variety of multivariate normal models.
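The conclusion of Example 7.10, that Σ̂ = (1/n)S uniquely maximizes p(S|Σ), can be checked numerically by comparing the log likelihood at Σ̂ with its value at random symmetric perturbations. An added illustration, not from the text:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 3, 10
B = rng.standard_normal((p, p))
S = B @ B.T + np.eye(p)                        # the observed S in S_p^+

def log_lik(Sigma):
    # log of |Sigma|^{-n/2} exp(-tr(Sigma^{-1} S)/2); terms free of
    # Sigma (including |S|^{n/2} and omega(n, p)) are dropped.
    sign, logdet = np.linalg.slogdet(Sigma)
    return -n / 2 * logdet - 0.5 * np.trace(np.linalg.solve(Sigma, S))

Sigma_hat = S / n
best = log_lik(Sigma_hat)
for _ in range(200):
    C = 0.1 * rng.standard_normal((p, p))
    Sigma = Sigma_hat + 0.5 * (C + C.T)        # symmetric perturbation
    if np.all(np.linalg.eigvalsh(Sigma) > 0):  # stay inside S_p^+
        assert log_lik(Sigma) < best           # Sigma_hat is the unique max
```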
We now turn to the invariance of likelihood ratio tests. First, invariant
testing problems need to be defined. Let 𝒫 = {P_θ | θ ∈ Θ} be a parametric
probability model on (𝒳, ℬ) and suppose that G acts measurably on 𝒳.
Let Θ_0 and Θ_1 be two disjoint subsets of Θ. On the basis of an observation
vector X ∈ 𝒳 with ℒ(X) ∈ 𝒫_0 ∪ 𝒫_1 where
𝒫_i = {P_θ | θ ∈ Θ_i}, i = 0, 1,
suppose it is desired to test the hypothesis
H_0: ℒ(X) ∈ 𝒫_0
against the alternative
H_1: ℒ(X) ∈ 𝒫_1.
Definition 7.5. The above hypothesis testing problem is invariant under G if 𝒫_0 and 𝒫_1 are both G-invariant probability models.
Now suppose that 𝒫_0 = {P_θ | θ ∈ Θ_0} and 𝒫_1 = {P_θ | θ ∈ Θ_1} are disjoint families of probability measures on (𝒳, ℬ) such that each P_θ has a density
p(·|θ) with respect to a σ-finite measure ν. Consider
Λ(x) = sup_{θ∈Θ_0} p(x|θ) / sup_{θ∈Θ_0∪Θ_1} p(x|θ).
For testing the null hypothesis that ℒ(X) ∈ 𝒫_0 versus the alternative that ℒ(X) ∈ 𝒫_1, the test that rejects the null hypothesis iff Λ(x) < k, where k is
chosen to control the level of the test, is commonly called the likelihood ratio test.
Proposition 7.13. Given the family of densities {p(·|θ) | θ ∈ Θ_0 ∪ Θ_1}, assume the testing problem for ℒ(X) ∈ 𝒫_0 versus ℒ(X) ∈ 𝒫_1 is invariant
under a group G and suppose that
p(x|θ) = p(gx|ḡθ)χ(g)
for some multiplier χ. Then the likelihood ratio
Λ(x) = sup_{θ∈Θ_0} p(x|θ) / sup_{θ∈Θ_0∪Θ_1} p(x|θ)
is an invariant function.
Proof. It must be shown that Λ(x) = Λ(gx) for x ∈ 𝒳 and g ∈ G. For
g ∈ G,
Λ(gx) = sup_{θ∈Θ_0} p(gx|θ) / sup_{θ∈Θ_0∪Θ_1} p(gx|θ)
= χ(g^{-1}) sup_{θ∈Θ_0} p(x|ḡ^{-1}θ) / [χ(g^{-1}) sup_{θ∈Θ_0∪Θ_1} p(x|ḡ^{-1}θ)]
= sup_{θ∈Θ_0} p(x|θ) / sup_{θ∈Θ_0∪Θ_1} p(x|θ) = Λ(x).
The next to the last equality follows from the positivity of χ and the invariance of Θ_0 and Θ_0 ∪ Θ_1. □
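Proposition 7.13 can be illustrated with a familiar special case: testing μ = 0 versus μ ≠ 0 for a N(μ, σ²) sample with σ² unknown, a problem invariant under the scale group x → cx, c ≠ 0. The closed-form likelihood ratio below and the check of its invariance are an added illustration, not from the text:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal(8) + 1.0            # a sample of size 8

def lam(x):
    # Likelihood ratio for H0: mu = 0 in a N(mu, sigma^2) sample.
    # Both sups are available in closed form: substituting the MLE of
    # sigma^2 leaves (2 pi sigma_hat^2)^{-n/2} e^{-n/2} in each case,
    # so the ratio reduces to (s1 / s0)^{n/2}.
    n = len(x)
    s0 = np.sum(x ** 2) / n                 # MLE of sigma^2 under H0
    s1 = np.sum((x - x.mean()) ** 2) / n    # unrestricted MLE of sigma^2
    return (s1 / s0) ** (n / 2)

# Invariance: lam(c x) = lam(x) for every c != 0, since both s0 and s1
# are multiplied by c^2 under x -> c x.
for c in (2.0, -0.5, 10.0):
    assert np.isclose(lam(c * x), lam(x))
```

Note that Λ here is a monotone function of the usual t statistic, which is itself invariant under scaling.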
For invariant testing problems, Proposition 7.13 shows that the test function determined by Λ, namely
φ_0(x) = 1 if Λ(x) < k, φ_0(x) = 0 if Λ(x) ≥ k,
is an invariant function. More generally, any test function φ is invariant if
φ(x) = φ(gx) for all x ∈ 𝒳 and g ∈ G. The whole point of the above
discussion is to show that, when attention is restricted to invariant tests for
invariant testing problems, the likelihood ratio test is never excluded from consideration. Furthermore, if a particular invariant test has been shown to
have an optimal property among invariant tests, then this test has been
compared to the likelihood ratio test. Illustrations of these comments are
given later in this section when we consider testing problems for the
multivariate normal distribution.
multivariate normal distribution. Comments similar to those above apply to equivariant estimators. Sup
pose (p(-1j0)1 Eje 8) is a (G - G)-invariant family of densities and satisfies
p(xIO) = p(gxlg0)X(g)
for some multiplier X. If the conditions of Proposition 7.12 hold, then an
equivariant maximum likelihood estimator exists. Thus if an equivariant
estimator t with some optimal property (relative to the class of all equiv
ariant estimators) has been found, then this property holds when t is
compared to the maximum likelihood estimator. The Pitman estimator, derived in the next example, is an illustration of this situation.
* Example 7.11. Let f be a density on R^p with respect to Lebesgue
measure and consider the translation family of densities {p(·|θ) | θ ∈ R^p} defined by
p(x|θ) = f(x − θ), x, θ ∈ R^p.
For this example, 𝒳 = Θ = G = R^p and the group action is
g(x) = x + g, x, g ∈ R^p.
It is clear that
p(gx|ḡθ) = p(x|θ),
so the family of densities is invariant and the multiplier is unity. It
is assumed that
∫ xf(x) dx = 0 and ∫ ‖x‖²f(x) dx < +∞.
Initially, assume we have one observation X with ℒ(X) ∈ {p(·|θ) | θ ∈ R^p}. The problem is to estimate the parameter θ. As a measure of how well an estimator t performs, consider
R(t, θ) = ℰ_θ‖t(X) − θ‖².
If t(X) is close to θ on the average, then R(t, θ) should be small.
We now want to show that, if t is an equivariant estimator of θ, then
R(t, θ) = R(t, 0)
and the equivariant estimator t_0(X) = X minimizes R(t, 0) over all equivariant estimators. If t is an equivariant estimator, then
t(x + g) = t(x) + g
so, with g = −x,
t(x) = x + t(0).
Therefore, every equivariant estimator has the form t(x) = x + c
where c E RP is a constant. Conversely, any such estimator t(x) =
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
266 FIRST APPLICATIONS OF INVARIANCE
x + c is equivariant. For t(x) = x + c,
R(t, 9) = &911t(X) - 9112 = lilx + c - 9112f(x - 9) dx
= flix + C12f (x) dx = R(t, O).
To minimize $R(t,0)$ over all equivariant $t$, the integral
$$\int\|x+c\|^2 f(x)\,dx$$
must be minimized by an appropriate choice of $c$. But
$$\mathcal E\|X+c\|^2 = \mathcal E\|X-\mathcal E(X)\|^2 + \|\mathcal E(X)+c\|^2,$$
so
$$c = -\mathcal E(X) = -\int x f(x)\,dx = 0$$
minimizes the above integral. Hence $t_0(X) = X$ minimizes $R(t,0)$ over all equivariant estimators. Now, we want to generalize this result to the case when $X_1,\ldots,X_n$ are independent and identically distributed with $\mathcal L(X_i)\in\{p(\cdot\mid\theta)\mid\theta\in R^p\}$, $i = 1,\ldots,n$. The argument is essentially the same as when $n = 1$. An estimator $t$ is equivariant if
$$t(x_1+g,\ldots,x_n+g) = t(x_1,\ldots,x_n) + g$$
so, setting $g = -x_1$,
$$t(x_1,\ldots,x_n) = x_1 + t(0,\,x_2-x_1,\ldots,x_n-x_1).$$
Conversely, if
$$t(x_1,\ldots,x_n) = x_1 + \psi(x_2-x_1,\ldots,x_n-x_1)$$
then $t$ is equivariant. Here, $\psi$ is some measurable function taking values in $R^p$. Thus a complete description of the equivariant estimators has been given. For such an estimator,
$$R(t,\theta) = \mathcal E_\theta\|t(X_1,\ldots,X_n)-\theta\|^2 = \mathcal E_\theta\|t(X_1-\theta,\ldots,X_n-\theta)\|^2 = \mathcal E_0\|t(X_1,\ldots,X_n)\|^2 = R(t,0).$$
To minimize $R(t,0)$, we need to choose the function $\psi$ to minimize
$$R(t,0) = \mathcal E_0\|X_1 + \psi(X_2-X_1,\ldots,X_n-X_1)\|^2.$$
Let $U_i = X_i - X_1$, $i = 2,\ldots,n$. Then
$$R(t,0) = \mathcal E_0\|X_1 + \psi(U_2,\ldots,U_n)\|^2 = \mathcal E\!\left(\mathcal E\!\left(\|X_1 + \psi(U_2,\ldots,U_n)\|^2\mid U_2,\ldots,U_n\right)\right).$$
However, conditional on $(U_2,\ldots,U_n) = U$,
$$\mathcal E\!\left(\|X_1+\psi(U)\|^2\mid U\right) = \mathcal E\!\left(\|X_1-\mathcal E(X_1\mid U)\|^2\mid U\right) + \|\mathcal E(X_1\mid U)+\psi(U)\|^2.$$
Thus it is clear that
$$\psi_0(U) = -\mathcal E(X_1\mid U)$$
minimizes $R(t,0)$. Hence the equivariant estimator
$$t_0(X_1,\ldots,X_n) = X_1 - \mathcal E_0(X_1\mid X_2-X_1,\ldots,X_n-X_1)$$
satisfies
$$R(t_0,\theta) = R(t_0,0) \le R(t,0) = R(t,\theta)$$
for all $\theta\in R^p$ and all equivariant estimators $t$. The estimator $t_0$ is commonly called the Pitman estimator.
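An aside on computation: after a change of variable, the conditional-expectation form of the Pitman estimator is equivalent to the ratio
$$t_0(x) = \frac{\int \theta \prod_{i=1}^n f(x_i-\theta)\,d\theta}{\int \prod_{i=1}^n f(x_i-\theta)\,d\theta}.$$
The following Python sketch (an editorial illustration, not part of the text; the function name, data, and grid settings are arbitrary choices) evaluates this ratio by grid integration for the univariate case $p = 1$ and checks that, for a standard normal $f$, the Pitman estimator coincides with the sample mean:

```python
import math

def pitman_location(x, f, pad=10.0, n_grid=8001):
    # Pitman (minimum risk equivariant) estimator of a location parameter:
    #   t0(x) = ( int theta * L(theta) dtheta ) / ( int L(theta) dtheta ),
    # where L(theta) = prod_i f(x_i - theta).  This ratio form follows from
    # t0(x) = x1 - E0(X1 | X2 - X1, ..., Xn - X1) by a change of variable.
    lo, hi = min(x) - pad, max(x) + pad
    step = (hi - lo) / (n_grid - 1)
    num = den = 0.0
    for k in range(n_grid):
        theta = lo + k * step
        L = 1.0
        for xi in x:
            L *= f(xi - theta)
        num += theta * L
        den += L
    return num / den  # the common factor `step` cancels in the ratio

# For a standard normal density f, the Pitman estimator is the sample mean.
phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
data = [0.3, -1.2, 2.5, 0.7]
est = pitman_location(data, phi)
```

The grid bounds assume $f$ places essentially all of its mass within `pad` of the data; for a heavy-tailed $f$, the grid would have to be widened.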
7.5. DISTRIBUTION THEORY AND INVARIANCE
When a family of distributions is invariant under a group of transformations, useful information can often be obtained about the distribution of invariant functions by using the invariance. For example, some of the results in Section 7.1 are generalized here.
The first result shows that the distribution of an invariant function depends invariantly on a parameter. Suppose $(\mathcal X,\mathcal B)$ is a measurable space acted on measurably by a group $G$. Consider an invariant probability model $\mathcal P = \{P_\theta\mid\theta\in\Theta\}$ and let $\bar G$ be the induced group of transformations on $\Theta$. Thus
$$\bar g P_\theta = P_{\bar g\theta},\qquad \theta\in\Theta,\ g\in G.$$
A measurable mapping $T$ on $(\mathcal X,\mathcal B)$ to $(\mathcal Y,\mathcal C)$ induces a family of distributions $\{Q_\theta\mid\theta\in\Theta\}$ on $(\mathcal Y,\mathcal C)$ given by
$$Q_\theta(C) = P_\theta(T^{-1}(C)),\qquad C\in\mathcal C,\ \theta\in\Theta.$$
Proposition 7.14. If $T$ is $G$-invariant, then $Q_\theta = Q_{\bar g\theta}$ for $\theta\in\Theta$ and $g\in G$.

Proof. For each $C\in\mathcal C$, it must be shown that
$$Q_\theta(C) = Q_{\bar g\theta}(C)$$
or, equivalently, that
$$P_\theta(T^{-1}(C)) = P_{\bar g\theta}(T^{-1}(C)).$$
But
$$P_{\bar g\theta}(T^{-1}(C)) = (gP_\theta)(T^{-1}(C)) = P_\theta(g^{-1}T^{-1}(C)) = P_\theta((Tg)^{-1}(C)).$$
Since $Tg = T$ as $T$ is invariant,
$$Q_{\bar g\theta}(C) = P_\theta((Tg)^{-1}(C)) = P_\theta(T^{-1}(C)) = Q_\theta(C). \qquad\square$$
An alternative formulation of Proposition 7.14 is useful. If $\mathcal L(X)\in\{P_\theta\mid\theta\in\Theta\}$ and if $T$ is $G$-invariant, then the induced distribution of $Y = T(X)$, which is $Q_\theta$, satisfies $Q_\theta = Q_{\bar g\theta}$. In other words, the distribution of an invariant function depends only on a maximal invariant parameter. By definition, a maximal invariant parameter is any function defined on $\Theta$ that is maximal invariant under the action of $\bar G$ on $\Theta$. Of course, $\Theta$ is usually not a parameter space for the family $\{Q_\theta\mid\theta\in\Theta\}$ as $Q_\theta = Q_{\bar g\theta}$, but any maximal $\bar G$-invariant function on $\Theta$ often serves as a parameter index for the distribution of $Y = T(X)$.
* Example 7.12. In this example, we establish a property of the distribution of the bivariate sample correlation coefficient. Consider a family of densities $p(\cdot\mid\mu,\Sigma)$ on $R^2$ given by
$$p(x\mid\mu,\Sigma) = |\Sigma|^{-1/2} f_0\!\left((x-\mu)'\Sigma^{-1}(x-\mu)\right)$$
where $\mu\in R^2$ and $\Sigma\in S_2^+$. Here, $f_0$ is a fixed nonnegative function for which $p(\cdot\mid\mu,\Sigma)$ is a density, and it is assumed that
$$\int \|x\|^2 f_0(\|x\|^2)\,dx < +\infty.$$
Since the distribution on $R^2$ determined by $f_0$ is orthogonally invariant, if $Z\in R^2$ has density $f_0(\|x\|^2)$, then
$$\mathcal E Z = 0 \quad\text{and}\quad \operatorname{Cov}(Z) = cI_2$$
for some $c>0$ (see Proposition 2.13). Also, $Z_1 = \Sigma^{1/2}Z + \mu$ has density $p(\cdot\mid\mu,\Sigma)$ when $Z$ has density $f_0(\|x\|^2)$. Thus
$$\mathcal E Z_1 = \mu \quad\text{and}\quad \operatorname{Cov}(Z_1) = c\Sigma.$$
The group $\mathrm{Al}_2$ acts on $R^2$ by
$$(A,b)x = Ax + b$$
and it is clear that the family of distributions, say $\mathcal P = \{P_{\mu,\Sigma}\mid(\mu,\Sigma)\in R^2\times S_2^+\}$, having the densities $p(\cdot\mid\mu,\Sigma)$, $\mu\in R^2$, $\Sigma\in S_2^+$, is invariant under this group action. Lebesgue measure on $R^2$ is relatively invariant with multiplier
$$\chi(A,b) = |\det(A)|$$
and
$$p(x\mid\mu,\Sigma) = p((A,b)x\mid A\mu+b,\, A\Sigma A')\,\chi(A,b).$$
Obviously, the group action on the parameter space is
$$(A,b)(\mu,\Sigma) = (A\mu+b,\, A\Sigma A')$$
and $(A,b)P_{\mu,\Sigma} = P_{A\mu+b,\,A\Sigma A'}$.
Now, let $X_1,\ldots,X_n$, $n\ge 3$, be a random sample with $\mathcal L(X_i)\in\mathcal P$ so the probability model for the random sample is $\mathrm{Al}_2$-invariant by Proposition 7.9. Consider $\bar X = \sum_{i=1}^n X_i/n$ and $S = \sum_{i=1}^n (X_i-\bar X)(X_i-\bar X)'$ so $\bar X$ is the sample mean and $S$ is the sample covariance matrix (not normalized). Obviously, $S = S(X_1,\ldots,X_n)$ is a function of $X_1,\ldots,X_n$ and
$$S(AX_1+b,\ldots,AX_n+b) = A\,S(X_1,\ldots,X_n)\,A'.$$
That is, $S$ is an equivariant function on $(R^2)^n$ to $S_2^+$ where the group action on $S_2^+$ is
$$(A,b)(S) = ASA'.$$
Writing $S\in S_2^+$ as
$$S = \begin{pmatrix} s_{11} & s_{12}\\ s_{21} & s_{22}\end{pmatrix},\qquad s_{12} = s_{21},$$
the sample correlation coefficient is
$$r = \frac{s_{12}}{\sqrt{s_{11}s_{22}}}.$$
Also, the population correlation coefficient is
$$\rho = \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}$$
when the distribution under consideration is $P_{\mu,\Sigma}$ and
$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12}\\ \sigma_{21} & \sigma_{22}\end{pmatrix}.$$
Now, given that the random sample is from $P_{\mu,\Sigma}$, the question is: how does the distribution of $r$ depend on $(\mu,\Sigma)$? To show that the distribution of $r$ depends only on $\rho$, we use an invariance argument. Let $G$ be the subgroup of $\mathrm{Al}_2$ defined by
$$G = \left\{(A,b)\,\middle|\,(A,b)\in\mathrm{Al}_2,\ A = \begin{pmatrix} a_1 & 0\\ 0 & a_2\end{pmatrix},\ a_i>0,\ i=1,2\right\}.$$
For $(A,b)\in G$, a bit of calculation shows that $r = r(X_1,\ldots,X_n) = r(AX_1+b,\ldots,AX_n+b)$ so $r$ is a $G$-invariant function of $X_1,\ldots,X_n$. By Proposition 7.14, the distribution of $r$, say $Q_{\mu,\Sigma}$, satisfies
$$Q_{\mu,\Sigma} = Q_{(A,b)(\mu,\Sigma)},\qquad (A,b)\in G.$$
Thus $Q_{\mu,\Sigma}$ depends on $(\mu,\Sigma)$ only through a maximal invariant function on the parameter space $R^2\times S_2^+$ under the action of $G$. Of course, the action of $G$ is
$$(A,b)(\mu,\Sigma) = (A\mu+b,\, A\Sigma A'),\qquad (A,b)\in G.$$
We now claim that
$$\rho = \rho(\mu,\Sigma) = \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}$$
is a maximal $G$-invariant function. To see this, consider $(\mu,\Sigma)\in R^2\times S_2^+$. By choosing
$$A = \begin{pmatrix} \sigma_{11}^{-1/2} & 0\\ 0 & \sigma_{22}^{-1/2}\end{pmatrix}$$
and $b = -A\mu$, we have $(A,b)\in G$ and
$$(A,b)(\mu,\Sigma) = \left(\begin{pmatrix}0\\0\end{pmatrix},\ \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\right)$$
so this point is in the orbit of $(\mu,\Sigma)$ and an orbit index is $\rho$. Thus $\rho$ is maximal invariant and the distribution of $r$ depends only on $(\mu,\Sigma)$ through the maximal invariant function $\rho$. Obviously, the distribution of $r$ also depends on the function $f_0$, but $f_0$ was considered fixed in this discussion.
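The "bit of calculation" showing that $r$ is $G$-invariant can be carried out numerically: rescaling each coordinate by a positive constant and translating leaves the sample correlation unchanged. The Python sketch below (an editorial illustration with arbitrary data; not part of the text) checks this exactly:

```python
import math

def corr(xs, ys):
    # sample correlation coefficient r = s12 / sqrt(s11 * s22)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s11 = sum((u - mx) ** 2 for u in xs)
    s22 = sum((v - my) ** 2 for v in ys)
    s12 = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
    return s12 / math.sqrt(s11 * s22)

# a sample of n = 5 points in R^2
xs = [1.0, 2.0, 4.0, 3.5, 0.5]
ys = [2.0, 1.5, 5.0, 4.0, 1.0]

# group element (A, b) with A = diag(a1, a2), a_i > 0
a1, a2, b1, b2 = 3.0, 0.25, -7.0, 11.0
r_before = corr(xs, ys)
r_after = corr([a1 * u + b1 for u in xs], [a2 * v + b2 for v in ys])
```

Under $(A,b)$, $s_{12}$ is scaled by $a_1 a_2$ while $s_{11}$ and $s_{22}$ are scaled by $a_1^2$ and $a_2^2$, so the ratio is unchanged up to floating-point error.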
Proposition 7.14 asserts that the distribution of an invariant function depends only on a maximal invariant parameter, but this result is not
especially useful if the exact distribution of an invariant function is desired.
The remainder of the section is concerned with using invariance arguments, when G is compact, to derive distributions of maximal invariants and to characterize the G-invariant distributions.
First, we consider the distribution of a maximal invariant function when a compact topological group $G$ acts measurably on a space $(\mathcal X,\mathcal B)$. Suppose that $\mu_0$ is a $\sigma$-finite $G$-invariant measure on $(\mathcal X,\mathcal B)$ and $f$ is a density with respect to $\mu_0$. Let $T$ be a measurable mapping from $(\mathcal X,\mathcal B)$ onto $(\mathcal Y,\mathcal C)$. Then $T$ induces a measure on $(\mathcal Y,\mathcal C)$, say $\nu_0$, given by
$$\nu_0(C) = \mu_0(T^{-1}(C))$$
and the equation
$$\int h(T(x))\,\mu_0(dx) = \int h(y)\,\nu_0(dy)$$
holds for all integrable functions $h$ on $(\mathcal Y,\mathcal C)$. Since the group $G$ is compact, there exists a unique probability measure, say $\delta$, that is left and right invariant.

Proposition 7.15. Suppose the mapping $T$ from $(\mathcal X,\mathcal B)$ onto $(\mathcal Y,\mathcal C)$ is maximal invariant under the action of $G$ on $\mathcal X$. If $X\in\mathcal X$ has density $f$ with respect to $\mu_0$, then the density of $Y = T(X)$ with respect to $\nu_0$ is given by
$$q(T(x)) = \int f(gx)\,\delta(dg).$$
Proof. First, the integral
$$\int f(gx)\,\delta(dg)$$
is a $G$-invariant function of $x$ and thus can be written as a function of the maximal invariant $T$. This defines the function $q$ on $\mathcal Y$. To show that $q$ is the density of $Y$, it suffices to show that
$$\mathcal E k(Y) = \int k(y)\,q(y)\,\nu_0(dy)$$
for all bounded measurable functions $k$. But
$$\mathcal E k(Y) = \mathcal E k(T(X)) = \int k(T(x))f(x)\,\mu_0(dx) = \int k(T(x))f(gx)\,\mu_0(dx).$$
The last equality holds since $\mu_0$ is $G$-invariant and $T$ is $G$-invariant. Since $\delta$ is a probability measure,
$$\mathcal E k(Y) = \iint k(T(x))f(gx)\,\mu_0(dx)\,\delta(dg).$$
Using Fubini's Theorem, the definition of $q$, and the relationship between $\mu_0$ and $\nu_0$, we have
$$\mathcal E k(Y) = \int k(T(x))\,q(T(x))\,\mu_0(dx) = \int k(y)\,q(y)\,\nu_0(dy). \qquad\square$$
In most situations, the compact group $G$ will be the orthogonal group or some subgroup of the orthogonal group. Concrete applications of Proposition 7.15 involve two separate steps. First, the function $q$ must be calculated by evaluating
$$\int f(gx)\,\delta(dg).$$
Also, given $\mu_0$ and the maximal invariant $T$, the measure $\nu_0$ must be found.
* Example 7.13. Take $\mathcal X = R^n$ and let $\mu_0$ be Lebesgue measure. The orthogonal group $\mathcal O_n$ acts on $R^n$ and a maximal invariant function is $T(x) = \|x\|^2$, so $\mathcal Y = [0,\infty)$. If a random vector $X\in R^n$ has a density $f$ with respect to Lebesgue measure, Proposition 7.15 tells us how to find the density of $Y = \|X\|^2$ with respect to the measure $\nu_0$. To find $\nu_0$, consider the particular density
$$f_0(x) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\|x\|^2\right].$$
Thus $\mathcal L(X) = N(0, I_n)$, so $\mathcal L(Y) = \chi_n^2$ and the density of $Y$ with respect to Lebesgue measure $dy$ on $[0,\infty)$ is
$$p_0(y) = \frac{y^{n/2-1}\exp[-\tfrac{1}{2}y]}{2^{n/2}\Gamma(n/2)}.$$
Therefore,
$$p_0(y)\,dy = q_0(y)\,\nu_0(dy)$$
where
$$q_0(T(x)) = \int f_0(\Gamma x)\,\delta(d\Gamma).$$
Since $f_0(\Gamma x) = f_0(x)$, the integration of $f_0$ over $\mathcal O_n$ is trivial and
$$q_0(y) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}y\right].$$
Solving for $\nu_0(dy)$, we have
$$\nu_0(dy) = \frac{\pi^{n/2}}{\Gamma(n/2)}\,y^{n/2-1}\,dy$$
since $\Gamma(\tfrac12) = \sqrt{\pi}$. Now that $\nu_0$ has been found, consider a general density $f$ on $R^n$. Then
$$q(T(x)) = \int f(\Gamma x)\,\delta(d\Gamma)$$
and $q(y)$ is the density of $Y = \|X\|^2$ with respect to $\nu_0$. When the density $f$ is given by
$$f(x) = h(\|x\|^2),$$
then it is clear that
$$q(y) = h(y),\qquad y\in[0,\infty),$$
so the distribution of $Y$ has been found in this case.

The noncentral chi-square distribution of $Y = \|X\|^2$ provides an interesting example where the integration over $\mathcal O_n$ is not trivial. Suppose $\mathcal L(X) = N(\mu, I_n)$ so
$$f(x) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\|x-\mu\|^2\right] = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\left(\|x\|^2 - 2x'\mu + \|\mu\|^2\right)\right].$$
Thus
$$q(T(x)) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\|\mu\|^2\right]\exp\left[-\tfrac{1}{2}\|x\|^2\right]\int \exp\left[(\Gamma x)'\mu\right]\delta(d\Gamma).$$
Since $x$ and $\|x\|\varepsilon_1$ have the same length, $x = \|x\|\Gamma_1\varepsilon_1$ for some $\Gamma_1\in\mathcal O_n$ where $\varepsilon_1$ is the first standard unit vector in $R^n$. Similarly, $\mu = \|\mu\|\Gamma_2\varepsilon_1$ for some $\Gamma_2\in\mathcal O_n$. Setting $\lambda = \|\mu\|^2$ and $y = \|x\|^2$,
$$q(y) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\lambda\right]\exp\left[-\tfrac{1}{2}y\right]\int \exp\left[\sqrt{\lambda y}\,(\Gamma_1\varepsilon_1)'\Gamma'\Gamma_2\varepsilon_1\right]\delta(d\Gamma).$$
Thus to evaluate $q$, we need to calculate
$$H(u) = \int \exp\left[u\,(\Gamma_1\varepsilon_1)'\Gamma'\Gamma_2\varepsilon_1\right]\delta(d\Gamma).$$
Since $\delta$ is left and right invariant,
$$H(u) = \int \exp\left[u\,\varepsilon_1'\Gamma\varepsilon_1\right]\delta(d\Gamma) = \int \exp\left[u\gamma_{11}\right]\delta(d\Gamma)$$
where $\gamma_{11}$ is the $(1,1)$ element of $\Gamma$. The representation of the uniform distribution on $\mathcal O_n$ given in Proposition 7.2 shows that when $\Gamma$ is uniform on $\mathcal O_n$, then
$$\mathcal L(\gamma_{11}) = \mathcal L\!\left(\frac{Z_1}{\|Z\|}\right) = \mathcal L(U_1)$$
where $\mathcal L(Z) = N(0, I_n)$ and $Z_1$ is the first coordinate of $Z$. Expanding the exponential in a power series, we have
$$H(u) = \sum_{j=0}^{\infty}\frac{u^j}{j!}\int \gamma_{11}^{\,j}\,\delta(d\Gamma) = \sum_{j=0}^{\infty}\frac{u^j}{j!}\,\mathcal E U_1^{\,j}.$$
Thus the moments of $U_1 = Z_1/\|Z\|$ need to be found. Obviously, $\mathcal L(U_1) = \mathcal L(-U_1)$, so all odd moments of $U_1$ are zero. Also, $U_1^2 = Z_1^2/(Z_1^2 + \sum_{i=2}^{n}Z_i^2)$, which has a beta distribution with parameters $\tfrac12$ and $(n-1)/2$. Therefore,
$$\mathcal E U_1^{2j} = \frac{\Gamma(n/2)\,\Gamma(j+\tfrac12)}{\Gamma(n/2+j)\,\Gamma(\tfrac12)}$$
so
$$H(u) = \sum_{j=0}^{\infty}\frac{u^{2j}}{(2j)!}\cdot\frac{\Gamma(n/2)\,\Gamma(j+\tfrac12)}{\Gamma(n/2+j)\,\Gamma(\tfrac12)}.$$
Hence
$$q(y) = (\sqrt{2\pi})^{-n}\exp\left[-\tfrac{1}{2}\lambda\right]\exp\left[-\tfrac{1}{2}y\right]\sum_{j=0}^{\infty}\frac{(\lambda y)^j}{(2j)!}\cdot\frac{\Gamma(n/2)\,\Gamma(j+\tfrac12)}{\Gamma(n/2+j)\,\Gamma(\tfrac12)}$$
is the density of $Y$ with respect to the measure $\nu_0$. A bit of algebra and some manipulation with the gamma function shows that
$$q(y)\,\nu_0(dy) = \left\{\sum_{j=0}^{\infty} e^{-\lambda/2}\,\frac{(\lambda/2)^j}{j!}\,h_{n+2j}(y)\right\}dy$$
where
$$h_m(y) = \frac{y^{m/2-1}\exp[-\tfrac{1}{2}y]}{2^{m/2}\Gamma(m/2)}$$
is the density of a $\chi_m^2$ distribution. This is the expression for the density of the noncentral chi-square distribution discussed in Chapter 3.
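The Poisson-mixture form of the noncentral chi-square density lends itself to direct computation. The Python sketch below (an editorial illustration, not part of the text; the truncation level and grid settings are arbitrary choices) evaluates $\sum_j e^{-\lambda/2}(\lambda/2)^j/j!\,h_{n+2j}(y)$ and checks numerically that it integrates to one and has mean $n+\lambda$:

```python
import math

def ncx2_pdf(y, n, lam, j_max=40):
    # Noncentral chi-square density as a Poisson(lam/2) mixture of central
    # chi-square densities with n + 2j degrees of freedom.
    total = 0.0
    for j in range(j_max):
        w = math.exp(-lam / 2.0) * (lam / 2.0) ** j / math.factorial(j)
        m = n + 2 * j
        h = y ** (m / 2.0 - 1.0) * math.exp(-y / 2.0) / (2.0 ** (m / 2.0) * math.gamma(m / 2.0))
        total += w * h
    return total

n, lam = 3, 2.5
step = 0.01
grid = [step * (k + 0.5) for k in range(6000)]   # midpoints of [0, 60]
vals = [ncx2_pdf(y, n, lam) for y in grid]
mass = sum(vals) * step                           # should be close to 1
mean = sum(y * v for y, v in zip(grid, vals)) * step   # should be close to n + lam
```

The truncation at `j_max` is adequate here because the Poisson weights with rate $\lambda/2$ decay superexponentially; a much larger $\lambda$ would require a larger truncation level.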
* Example 7.14. In this example, we derive the density function of the order statistic of a random vector $X\in R^n$. Suppose $X$ has a density $f$ with respect to Lebesgue measure and let $X_1,\ldots,X_n$ be the coordinates of $X$. Consider the space $\mathcal Y\subseteq R^n$ defined by
$$\mathcal Y = \{y\mid y\in R^n,\ y_1\le y_2\le\cdots\le y_n\}.$$
The order statistic of $X$ is the random vector $Y\in\mathcal Y$ consisting of the ordered values of the coordinates of $X$. More precisely, $Y_1$ is the smallest coordinate of $X$, $Y_2$ is the next smallest coordinate of $X$, and so on. Thus $Y = T(X)$ where $T$ maps each $x\in R^n$ into the ordered coordinates of $x$, say $T(x)\in\mathcal Y$. To derive the density function of $Y$, we show that $Y$ is a maximal invariant under a compact group operating on $R^n$ and then apply Proposition 7.15. Let $G$ be the group of all one-to-one onto functions from $\{1,2,\ldots,n\}$ to $\{1,2,\ldots,n\}$; that is, $G$ is the permutation group of $\{1,2,\ldots,n\}$. Of course, the group operation is function composition, the group inverse is function inverse, and $G$ has $n!$ elements. The group $G$ acts on the left of $R^n$ in the following way. For $x\in R^n$ and $\pi\in G$, define $\pi x\in R^n$ to have $i$th coordinate $x(\pi^{-1}(i))$. Thus the $i$th coordinate of $\pi x$ is the $\pi^{-1}(i)$ coordinate of $x$, so
$$(\pi x)(i) = x(\pi^{-1}(i)).$$
The reason for the inverse on $\pi$ in this definition is so that $G$ acts on the left of $R^n$; that is,
$$(\pi_1\pi_2)x = \pi_1(\pi_2 x).$$
It is routine to verify that the function $T$ on $\mathcal X$ to $\mathcal Y$ is a maximal invariant under the action of $G$ on $R^n$. Also, Lebesgue measure, say $l$, is invariant, so Proposition 7.15 is applicable as $G$ is a finite group and hence compact. Obviously, the density $q$ of $Y = T(X)$ satisfies
$$q(T(x)) = \frac{1}{n!}\sum_{\pi\in G} f(\pi x)$$
so
$$q(y) = \frac{1}{n!}\sum_{\pi\in G} f(\pi y)$$
for $y\in\mathcal Y$. To derive the measure $\nu_0$ on $\mathcal Y$, consider a measurable subset $C\subseteq\mathcal Y$. Then
$$T^{-1}(C) = \bigcup_{\pi\in G}(\pi C)$$
and
$$\nu_0(C) = l(T^{-1}(C)) = l\!\left(\bigcup_{\pi\in G}(\pi C)\right) = \sum_{\pi\in G} l(\pi C) = n!\,l(C).$$
The third equality follows since $(\pi_1 C)\cap(\pi_2 C)$ has Lebesgue measure zero for $\pi_1\ne\pi_2$, as the boundary of $\mathcal Y$ in $R^n$ has Lebesgue measure zero. Thus $\nu_0$ is just $n!$ times $l$ restricted to $\mathcal Y$. Therefore, the density of the order statistic $Y$, with respect to $\nu_0$ restricted to $\mathcal Y$, is
$$q(y) = \frac{1}{n!}\sum_{\pi\in G} f(\pi y).$$
When $f$ is invariant under permutations, as is the case when $X_1,\ldots,X_n$ are independent and identically distributed, we have
$$q(y) = f(y),\qquad y\in\mathcal Y.$$
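Both the left-action convention $(\pi x)(i) = x(\pi^{-1}(i))$ and the collapse of $q$ to $f$ for permutation-invariant $f$ can be verified by direct enumeration. The Python sketch below is an editorial illustration (the data and the iid Laplace density are arbitrary choices, not from the text):

```python
import itertools
import math

def act(pi, x):
    # (pi x)(i) = x(pi^{-1}(i)); pi is a tuple with pi[i] = pi(i), 0-indexed
    n = len(x)
    inv = [0] * n
    for i in range(n):
        inv[pi[i]] = i          # inv is the inverse permutation
    return tuple(x[inv[i]] for i in range(n))

def compose(p1, p2):
    # function composition: (p1 p2)(i) = p1(p2(i))
    return tuple(p1[p2[i]] for i in range(len(p1)))

x = (0.5, -1.0, 2.0)
p1, p2 = (1, 2, 0), (2, 0, 1)
left_action_ok = act(compose(p1, p2), x) == act(p1, act(p2, x))

# For a permutation-invariant density f (iid coordinates), the average
# q(y) = (1/n!) * sum over pi of f(pi y) collapses to f(y).
f = lambda v: math.prod(math.exp(-abs(t)) / 2.0 for t in v)  # iid Laplace
y = (-1.0, 0.2, 0.7)
q = sum(f(act(pi, y)) for pi in itertools.permutations(range(3))) / math.factorial(3)
```

The inverse in `act` is exactly what makes `act(compose(p1, p2), x)` agree with `act(p1, act(p2, x))`; with `x[pi[i]]` instead, the composition order would be reversed.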
The next example is an extension of Example 7.13 and is related to the results in Proposition 7.6.
* Example 7.15. Suppose $X$ is a random vector in $\mathcal L_{p,n}$, $n\ge p$, which has a density $f$ with respect to Lebesgue measure $dx$ on $\mathcal L_{p,n}$. Let $T$ map $\mathcal L_{p,n}$ onto the space of $p\times p$ positive semidefinite matrices, say $S_p^+$, by $T(x) = x'x$. The problem in this example is to derive the density of $S = T(X) = X'X$. The compact group $\mathcal O_n$ acts on $\mathcal L_{p,n}$ and a group element $\Gamma\in\mathcal O_n$ sends $x$ into $\Gamma x$. It follows immediately from Proposition 1.31 that $T$ is a maximal invariant function under the action of $\mathcal O_n$ on $\mathcal L_{p,n}$. Since $dx$ is invariant under $\mathcal O_n$, Proposition 7.15 shows that the density of $S$ is
$$q(T(x)) = \int f(\Gamma x)\,\delta(d\Gamma)$$
with respect to the measure $\nu_0$ on $S_p^+$ induced by $dx$ and $T$. To find the measure $\nu_0$, we argue as in Example 7.13. Consider the particular density
$$f_0(x) = (\sqrt{2\pi})^{-np}\exp\left[-\tfrac{1}{2}\operatorname{tr}(x'x)\right]$$
on $\mathcal L_{p,n}$ so $\mathcal L(X) = N(0, I_n\otimes I_p)$. For this $f_0$, the density of $S$ is
$$q_0(S) = q_0(T(x)) = \int f_0(\Gamma x)\,\delta(d\Gamma) = (\sqrt{2\pi})^{-np}\exp\left[-\tfrac{1}{2}\operatorname{tr}(S)\right]$$
with respect to $\nu_0$. However, by Proposition 7.6, the density of $S$ with respect to $dS/|S|^{(p+1)/2}$ is
$$q_1(S) = \omega(n,p)\,|S|^{n/2}\exp\left[-\tfrac{1}{2}\operatorname{tr}(S)\right].$$
Therefore,
$$q_1(S)\,\frac{dS}{|S|^{(p+1)/2}} = q_0(S)\,\nu_0(dS)$$
so
$$\omega(n,p)\,|S|^{(n-p-1)/2}\exp\left[-\tfrac{1}{2}\operatorname{tr}(S)\right]dS = (\sqrt{2\pi})^{-np}\exp\left[-\tfrac{1}{2}\operatorname{tr}(S)\right]\nu_0(dS),$$
which shows that
$$\nu_0(dS) = (\sqrt{2\pi})^{\,np}\,\omega(n,p)\,|S|^{(n-p-1)/2}\,dS.$$
In the above argument, we have ignored the set of Lebesgue measure zero where $x\in\mathcal L_{p,n}$ has rank less than $p$. The justification for this is left to the reader. Now that $\nu_0$ has been found, the density of $S$ for a general density $f$ is obtained by calculating
$$q(T(x)) = \int f(\Gamma x)\,\delta(d\Gamma).$$
When $f(x) = h(x'x)$, then $f(\Gamma x) = h(x'x) = h(T(x))$ and $q(S) = h(S)$. In this case, the integration over $\mathcal O_n$ is trivial. Another example where the integration over $\mathcal O_n$ is not trivial is given in the next chapter when we discuss the noncentral Wishart distribution.
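The $\mathcal O_n$-invariance of $T(x) = x'x$ that drives this example is elementary: $(\Gamma x)'(\Gamma x) = x'\Gamma'\Gamma x = x'x$. The Python sketch below (an editorial illustration; the matrix and the Givens rotation are arbitrary choices, not from the text) confirms this numerically for a small case:

```python
import math

def matmul(A, B):
    # plain triple-loop matrix product for lists of lists
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# x is a 4 x 2 matrix (n = 4, p = 2)
x = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.5]]

# Gamma: a Givens rotation in coordinates (0, 2) of R^4, an element of O_4
c, s = math.cos(0.7), math.sin(0.7)
G = [[c, 0.0, -s, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [s, 0.0, c, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

S_before = matmul(transpose(x), x)        # T(x) = x'x
Gx = matmul(G, x)
S_after = matmul(transpose(Gx), Gx)       # T(Gamma x) = x' Gamma' Gamma x
```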
As motivation for the next result of this section, consider the situation discussed in Proposition 7.3. This result gives a characterization of the $\mathcal O_n$-left invariant distributions by representing each of these distributions as a product measure where one measure is a fixed $\mathcal O_n$-invariant distribution and the other measure is arbitrary. The decomposition of the space $\mathcal X$ into the product space $\mathcal F_{p,n}\times G_U^+$ provided the framework in which to state this representation of $\mathcal O_n$-left invariant distributions. In some situations, this product space structure is not available (see Example 7.5), but a product measure representation for $\mathcal O_n$-invariant distributions can be obtained. It is established below that, under some mild regularity conditions, such a representation can be given for probability measures that are invariant under any compact topological group that acts on the sample space. We now turn to the technical details.

In what follows, $G$ is a compact topological group that acts measurably on a measure space $(\mathcal X,\mathcal B)$ and $P$ is a $G$-invariant probability measure on $(\mathcal X,\mathcal B)$. The unique invariant probability measure on $G$ is denoted by $\mu$ and the symbol $U\in G$ denotes a random variable with values in $G$ and distribution $\mu$. The $\sigma$-algebra for $G$ is the Borel $\sigma$-algebra of open sets, so $U$ is a measurable function defined on some probability space with induced distribution $\mu$. Since $G$ acts on $\mathcal X$, $\mathcal X$ can be written as a disjoint union of orbits, say
$$\mathcal X = \bigcup_{\alpha\in A}\mathcal X_\alpha$$
where $A$ is an index set for the orbits and $\mathcal X_\alpha\cap\mathcal X_{\alpha'} = \varnothing$ if $\alpha\ne\alpha'$. Let $x_\alpha$ be a fixed element of $\mathcal X_\alpha = \{gx_\alpha\mid g\in G\}$. Also, set
$$\mathcal Y = \{x_\alpha\mid\alpha\in A\}\subseteq\mathcal X$$
and assume that $\mathcal Y$ is a measurable subset of $\mathcal X$. The function $T$ defined on $\mathcal X$ to $\mathcal Y$ by
$$T(x) = x_\alpha \quad\text{iff}\quad x\in\mathcal X_\alpha$$
is obviously a maximal invariant function under the action of $G$ on $\mathcal X$. It is assumed that $T$ is a measurable function from $\mathcal X$ to $\mathcal Y$ where $\mathcal Y$ has the $\sigma$-algebra inherited from $\mathcal X$. A subset $B_1\subseteq\mathcal Y$ is measurable iff $B_1 = \mathcal Y\cap B$ for some $B\in\mathcal B$. If $X\in\mathcal X$ has distribution $P$, then the maximal invariant $Y = T(X)$ has the induced distribution $Q$ defined by
$$Q(B_1) = P(T^{-1}(B_1))$$
for measurable subsets $B_1\subseteq\mathcal Y$. What we would like to show is that $P$ is represented by the product measure $\mu\times Q$ on $G\times\mathcal Y$ in the following sense. If $Y\in\mathcal Y$ has the distribution $Q$ and is independent of $U\in G$, then the random variable $Z = UY\in\mathcal X$ has the distribution $P$. In other words, $\mathcal L(X) = \mathcal L(UY)$ where $U$ and $Y$ are independent. Here, $UY$ means the group element $U$ operating on the point $Y\in\mathcal X$. The intuitive argument that suggests this representation is the following. The distribution of $X$, conditional on $T(X) = x_\alpha$, should be $G$-invariant on $\mathcal X_\alpha$ as the distribution of $X$ is $G$-invariant. But $G$ acts transitively on $\mathcal X_\alpha$ and, since $G$ is compact, there should be a unique invariant probability distribution on $\mathcal X_\alpha$ that is induced by $\mu$ on $G$. In other words, conditional on $T(X) = x_\alpha$, $X$ should have the same distribution as $Ux_\alpha$ where $U$ is "uniform" on $G$. The next result makes all of this precise.
Proposition 7.16. Consider $\mathcal X$, $\mathcal Y$, and $G$ to be as above with their respective $\sigma$-algebras. Assume that the mapping $h$ on $G\times\mathcal Y$ to $\mathcal X$ given by $h(g,y) = gy$ is measurable.

(i) If $U\in G$ and $Y\in\mathcal Y$ are independent with $\mathcal L(U) = \mu$ and $\mathcal L(Y) = Q$, then the distribution of $X = UY$ is a $G$-invariant distribution on $\mathcal X$.

(ii) If $X\in\mathcal X$ has a $G$-invariant distribution, say $P$, let the maximal invariant $Y = T(X)$ have an induced distribution $Q$ on $\mathcal Y$. Let $U\in G$ have the distribution $\mu$ and be independent of $X$. Then $\mathcal L(X) = \mathcal L(UY)$.
Proof. For the proof of (i), it suffices to show that
$$\mathcal E f(X) = \mathcal E f(gX)$$
for all integrable functions $f$ and all $g\in G$. But
$$\mathcal E f(gX) = \mathcal E f(g(UY)) = \mathcal E f((gU)Y) = \mathcal E\,\mathcal E\left[f((gU)Y)\mid Y\right] = \mathcal E\,\mathcal E\left[f(UY)\mid Y\right] = \mathcal E f(UY) = \mathcal E f(X).$$
In the above calculation, we have used the assumption that $U$ and $Y$ are independent, so conditional on $Y$, $\mathcal L(U) = \mathcal L(gU)$ for $g\in G$. To prove (ii), it suffices to show that
$$\mathcal E f(X) = \mathcal E f(UY)$$
for all integrable $f$. Since the distribution of $X$ is $G$-invariant,
$$\mathcal E f(X) = \mathcal E f(gX),\qquad g\in G.$$
Therefore,
$$\mathcal E f(X) = \mathcal E_U\,\mathcal E_X f(UX),$$
as $U$ and $X$ are independent. Thus
$$\int f(x)\,P(dx) = \iint f(gx)\,P(dx)\,\mu(dg) = \iint f(gx)\,\mu(dg)\,P(dx).$$
However, for $x\in\mathcal X_\alpha$ there exists an element $k\in G$ such that $x = kx_\alpha$. Using the definition of $T$ and the right invariance of $\mu$, we have
$$\int f(gx)\,\mu(dg) = \int f(gkx_\alpha)\,\mu(dg) = \int f(gx_\alpha)\,\mu(dg) = \int f(gT(x))\,\mu(dg).$$
Hence
$$\int f(x)\,P(dx) = \iint f(gT(x))\,\mu(dg)\,P(dx) = \iint f(gy)\,\mu(dg)\,Q(dy)$$
where the second equality follows from the definition of the induced measure $Q$. In terms of the random variables,
$$\mathcal E f(X) = \mathcal E_{U,Y} f(UY)$$
where $U$ and $Y$ are independent, as $U$ and $X$ are independent. $\square$
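Proposition 7.16 can be illustrated with the simplest compact group: the two-element sign group acting on $R$ by $x\mapsto \pm x$. In the Python sketch below (an editorial illustration, not part of the text; the support and probabilities are arbitrary choices), $P$ is a sign-symmetric distribution on $\{\pm1,\pm2\}$, $T(x) = |x|$ is the maximal invariant, and exact enumeration with rational arithmetic confirms that $\mathcal L(UY) = P$:

```python
from fractions import Fraction

# A G-invariant distribution P on {-2, -1, 1, 2} for G = {+1, -1} (sign change)
P = {-2: Fraction(1, 10), -1: Fraction(4, 10),
     1: Fraction(4, 10), 2: Fraction(1, 10)}

# Orbit representatives Y = {1, 2}; maximal invariant T(x) = |x|
Q = {1: P[-1] + P[1], 2: P[-2] + P[2]}        # induced distribution of Y = T(X)
mu = {1: Fraction(1, 2), -1: Fraction(1, 2)}  # the uniform distribution on G

# Distribution of Z = U * Y with U and Y independent
Z = {}
for u, pu in mu.items():
    for y, qy in Q.items():
        Z[u * y] = Z.get(u * y, Fraction(0)) + pu * qy
```

Using `Fraction` keeps the enumeration exact, so the product-measure representation can be checked by strict equality rather than a numerical tolerance.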
The technical advantage of Proposition 7.16 over the method discussed in Section 7.1 is that the space $\mathcal X$ is not assumed to be in one-to-one correspondence with the product space $G\times\mathcal Y$. Obviously, the mapping $h$ on $G\times\mathcal Y$ to $\mathcal X$ is onto, but $h$ will ordinarily not be one-to-one.
* Example 7.16. In this example, take $\mathcal X = S_p$, the set of all $p\times p$ symmetric matrices. The group $G = \mathcal O_p$ acts on $S_p$ by
$$\Gamma(S) = \Gamma S\Gamma',\qquad S\in S_p,\ \Gamma\in\mathcal O_p.$$
For $S\in S_p$, let
$$T(S) = Y = \operatorname{diag}(y_1,\ldots,y_p)$$
where $y_1\ge\cdots\ge y_p$ are the ordered eigenvalues of $S$ and the off-diagonal elements of $Y$ are zero. Also, let $\mathcal Y = \{Y\mid Y = T(S),\ S\in S_p\}$. The spectral theorem shows that $T$ is a maximal invariant function under the action of $\mathcal O_p$ and the elements of $\mathcal Y$ index the orbits in $S_p$. The measurability assumptions of Proposition 7.16 are easily verified, so every $\mathcal O_p$-invariant distribution on $S_p$, say $P$, has the representation given by
$$\int_{S_p} f(S)\,P(dS) = \iint f(\Gamma Y\Gamma')\,Q(dY)\,\mu(d\Gamma)$$
where $\mu$ is the uniform distribution on $\mathcal O_p$ and $Q$ is the induced distribution of $Y$. In terms of random variables, if $\mathcal L(S) = P$ and $\mathcal L(\Gamma S\Gamma') = \mathcal L(S)$ for all $\Gamma\in\mathcal O_p$, then
$$\mathcal L(S) = \mathcal L(\Psi\,T(S)\,\Psi')$$
where $\Psi$ is uniform on $\mathcal O_p$ and is independent of the matrix of eigenvalues of $S$. As a particular case, consider the probability measure $P_0$ on $S_p^+\subseteq S_p$ with the Wishart density
$$p_0(S) = \omega(n,p)\,|S|^{(n-p-1)/2}\exp\left[-\tfrac{1}{2}\operatorname{tr} S\right]I(S)$$
where $n\ge p$ and $I(S) = 1$ if $S\in S_p^+$ and is zero otherwise. That $p_0$ is a density on $S_p$ with respect to Lebesgue measure $dS$ on $S_p$ follows from Example 5.1. Also, $p_0$ is $\mathcal O_p$-invariant since $dS$ is $\mathcal O_p$-invariant and $p_0(\Gamma S\Gamma') = p_0(S)$ for all $S\in S_p$ and $\Gamma\in\mathcal O_p$. Thus the above results are applicable to this particular Wishart distribution.
The final example of this section deals with the singular value decomposition of a random $n\times p$ matrix.
* Example 7.17. The compact group $\mathcal O_n\times\mathcal O_p$ acts on the space $\mathcal L_{p,n}$ by
$$(\Gamma,\Delta)x = \Gamma x\Delta';\qquad (\Gamma,\Delta)\in\mathcal O_n\times\mathcal O_p,\ x\in\mathcal L_{p,n}.$$
For definiteness, we take $p\le n$. Define $T$ on $\mathcal L_{p,n}$ by
$$T(x) = \begin{pmatrix}\operatorname{diag}(\lambda_1,\ldots,\lambda_p)\\ 0\end{pmatrix}$$
where $\lambda_1\ge\cdots\ge\lambda_p\ge 0$ and $\lambda_1^2,\ldots,\lambda_p^2$ are the ordered eigenvalues of $x'x$. Let $\mathcal Y\subseteq\mathcal L_{p,n}$ be the range of $T$ so $\mathcal Y$ is a closed subset of $\mathcal L_{p,n}$. It is clear that $T(\Gamma x\Delta') = T(x)$ for $\Gamma\in\mathcal O_n$ and $\Delta\in\mathcal O_p$, so $T$ is invariant. That $T$ is a maximal invariant follows easily from the singular value decomposition theorem. Thus the elements of $\mathcal Y$ index the orbits in $\mathcal L_{p,n}$ and every $x\in\mathcal L_{p,n}$ can be written as
$$x = \Gamma y\Delta' = (\Gamma,\Delta)y$$
for some $y\in\mathcal Y$ and $(\Gamma,\Delta)\in\mathcal O_n\times\mathcal O_p$. The measurability assumptions of Proposition 7.16 are easily checked. Thus if $P$ is an $(\mathcal O_n\times\mathcal O_p)$-invariant probability measure on $\mathcal L_{p,n}$ and $\mathcal L(X) = P$, then
$$\mathcal L(X) = \mathcal L(\Gamma Y\Delta')$$
where $(\Gamma,\Delta)$ has a uniform distribution on $\mathcal O_n\times\mathcal O_p$, $Y$ has the distribution $Q$ induced by $T$ and $P$, and $Y$ and $(\Gamma,\Delta)$ are independent. However, we can say a bit more. Since $\mathcal O_n\times\mathcal O_p$ is a product group, the unique invariant probability measure on $\mathcal O_n\times\mathcal O_p$ is the product measure $\mu_1\times\mu_2$ where $\mu_1$ ($\mu_2$) is the unique invariant probability measure on $\mathcal O_n$ ($\mathcal O_p$). Thus $\Gamma$ and $\Delta$ are independent and each is uniform in its respective group. In summary,
$$\mathcal L(X) = \mathcal L(\Gamma Y\Delta')$$
where $\Gamma$, $Y$, and $\Delta$ are mutually independent with the distributions given above. As a particular case, consider the density
$$f_0(x) = (\sqrt{2\pi})^{-np}\exp\left[-\tfrac{1}{2}\operatorname{tr}(x'x)\right]$$
with respect to Lebesgue measure on $\mathcal L_{p,n}$. Since $f_0(\Gamma x\Delta') = f_0(x)$ and Lebesgue measure is $(\mathcal O_n\times\mathcal O_p)$-invariant, the probability measure defined by $f_0$ is $(\mathcal O_n\times\mathcal O_p)$-invariant. Therefore, when $\mathcal L(X) = N(0, I_n\otimes I_p)$, $X$ has the same distribution as $\Gamma Y\Delta'$ where $\Gamma$ and $\Delta$ are uniform and $Y$ has the induced distribution $Q$ on $\mathcal Y$.
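The maximal invariance of $T$ rests on the fact that the singular values of $\Gamma x\Delta'$ equal those of $x$. For $p = 2$ the singular values can be computed from the eigenvalues of $x'x$ by the quadratic formula, and the following Python sketch (an editorial illustration; the matrices and rotation angles are arbitrary choices, not from the text) checks the invariance:

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

def singular_values_p2(x):
    # For p = 2: lambda_1 >= lambda_2 >= 0 with lambda_i^2 the eigenvalues
    # of the 2 x 2 matrix x'x, obtained from the quadratic formula.
    S = matmul(transpose(x), x)
    tr = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    d = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return (math.sqrt((tr + d) / 2.0), math.sqrt(max((tr - d) / 2.0, 0.0)))

x = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]   # n = 3, p = 2

cg, sg = math.cos(0.4), math.sin(0.4)
G = [[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]]   # Gamma in O_3
cd, sd = math.cos(-1.1), math.sin(-1.1)
D = [[cd, -sd], [sd, cd]]                               # Delta in O_2

y = matmul(matmul(G, x), transpose(D))   # (Gamma, Delta) x = Gamma x Delta'
sv_x = singular_values_p2(x)
sv_y = singular_values_p2(y)
```

The check works because $y'y = \Delta (x'x)\Delta'$ is similar to $x'x$ and so has the same eigenvalues.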
7.6. INDEPENDENCE AND INVARIANCE
Considerations that imply the stochastic independence of an invariant function and an equivariant function are the subject of this section. To motivate the abstract discussion to follow, we begin with the familiar random sample from a univariate normal distribution. Consider $X\in\mathcal X$ with $\mathcal L(X) = N(\mu e, \sigma^2 I_n)$ where $\mu\in R$, $\sigma^2 > 0$, and $e$ is the vector of ones in $R^n$. The set $\mathcal X$ is $R^n - \operatorname{span}\{e\}$ and the reason for choosing this as the sample space is to guarantee that $\sum_{i=1}^n (x_i-\bar x)^2 > 0$ for $x\in\mathcal X$. The coordinates of $X$, say $X_1,\ldots,X_n$, are independent and $\mathcal L(X_i) = N(\mu,\sigma^2)$ for $i = 1,\ldots,n$. When $\mu$ and $\sigma^2$ are unknown parameters, the statistic $t(X) = (s,\bar X)$, where
$$\bar X = \frac{1}{n}\sum_{i=1}^n X_i,\qquad s^2 = \sum_{i=1}^n (X_i-\bar X)^2,$$
is minimal sufficient and complete. The reason for using $s$ rather than $s^2$ in the definition of $t(X)$ is based on invariance considerations. The affine group $\mathrm{Al}_1$ acts on $\mathcal X$ by
$$(a,b)x = ax + be$$
for $(a,b)\in\mathrm{Al}_1$. Let $G$ be the subgroup of $\mathrm{Al}_1$ given by $G = \{(a,b)\mid (a,b)\in\mathrm{Al}_1,\ a>0\}$ so $G$ also acts on $\mathcal X$.
The probability model for $X\in\mathcal X$ is generated by $G$ in the sense that if $Z\in\mathcal X$ and $\mathcal L(Z) = N(0, I_n)$, then
$$\mathcal L((a,b)Z) = \mathcal L(aZ + be) = N(be, a^2 I_n).$$
Thus the set of distributions $\mathcal P = \{N(\mu e, \sigma^2 I_n)\mid \mu\in R,\ \sigma^2>0\}$ is obtained from an $N(0,I_n)$ distribution by a group operation. For this example, the group $G$ serves as a parameter space for $\mathcal P$. Further, the statistic $t$ takes its values in $G$ and satisfies
$$t((a,b)X) = (a,b)(s,\bar X),$$
that is, $t$ evaluated at $(a,b)X = aX + be$ is the same as the group element $(a,b)$ composed with the group element $(s,\bar X)$. Thus $t$ is an equivariant function defined on $\mathcal X$ to $G$, and $G$ acts on both $\mathcal X$ and $G$. Now, which functions of $X$, say $h(X)$, might be independent of $t(X)$? Intuitively, since $t(X)$ is sufficient, $t(X)$ "contains all the information in $X$ about the parameters." Thus if $h(X)$ has a distribution that does not depend on the parameter value (such an $h(X)$ will be called ancillary), there is some reason to believe that $h(X)$ and $t(X)$ might be independent. However, the group structure given above provides a method for constructing ancillary statistics. If $h$ is an invariant function of $X$, then the distribution of $h$ is an invariant function of the parameter $(\mu,\sigma^2)$. But the group $G$ acts transitively on the parameter space (i.e., $G$), so any invariant function will be ancillary. Also, $h$ is invariant iff $h$ is a function of a maximal invariant statistic. This suggests that a maximal invariant statistic will be independent of $t(X)$. Consider the statistic
$$Z(X) = (t(X))^{-1}X = \frac{X - \bar X e}{s},$$
where the inverse on $t(X)$ denotes the group inverse in $G$. The verification that $Z(X)$ is maximal invariant partially justifies choosing $t$ to have values in $G$. For $(a,b)\in G$,
$$Z((a,b)X) = (t((a,b)X))^{-1}(a,b)X = ((a,b)t(X))^{-1}(a,b)X = (t(X))^{-1}(a,b)^{-1}(a,b)X = (t(X))^{-1}X = Z(X),$$
so $Z$ is invariant. Also, if
$$(t(x))^{-1}x = Z(x) = Z(y) = (t(y))^{-1}y,$$
then
$$y = t(y)(t(x))^{-1}x,$$
so $x$ and $y$ are in the same orbit. Thus $Z$ is maximal invariant and is an ancillary statistic. That $Z(X)$ and $t(X)$ are stochastically independent for each value of $\mu$ and $\sigma^2$ follows from Basu's Theorem given in the Appendix. The whole purpose of this discussion was to show that sufficiency coupled with invariance suggested the independence of $Z(X)$ and $t(X)$. The role of the equivariance of $t$ is not completely clear, but it is essential in the more abstract treatment that follows.
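The invariance computation for $Z(X)$ can also be checked numerically: standardizing a sample by its mean and by $s$ removes the effect of any $(a,b)$ with $a>0$. A Python sketch (an editorial illustration with arbitrary data; not part of the text):

```python
import math

def t_stat(x):
    # t(x) = (s, xbar) with xbar the sample mean and s^2 = sum (x_i - xbar)^2
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((xi - xbar) ** 2 for xi in x))
    return s, xbar

def Z(x):
    # Z(x) = (t(x))^{-1} x = (x - xbar * e) / s, the maximal invariant
    s, xbar = t_stat(x)
    return [(xi - xbar) / s for xi in x]

x = [1.0, 2.5, -0.5, 4.0]
a, b = 3.0, -2.0                      # a group element (a, b) with a > 0
gx = [a * xi + b for xi in x]         # (a, b)x = ax + be
z1, z2 = Z(x), Z(gx)
```

Affine transformation scales both the deviations and $s$ by $a$ and shifts the mean by $b$, so the standardized vector is unchanged up to floating-point error.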
Let $P_0$ be a fixed probability on $(\mathcal X,\mathcal B)$ and suppose that $G$ is a group that acts measurably on $(\mathcal X,\mathcal B)$. Consider a measurable function $t$ on $(\mathcal X,\mathcal B)$ to $(\mathcal Y,\mathcal C_1)$ and assume that $\bar G$ is a homomorphic image of $G$ that acts transitively on $(\mathcal Y,\mathcal C_1)$ and that
$$t(gx) = \bar g\,t(x);\qquad x\in\mathcal X,\ g\in G.$$
Thus $t$ is an equivariant function. For technical reasons that become apparent later, it is assumed that $G$ is a locally compact and $\sigma$-compact topological group endowed with the Borel $\sigma$-algebra. Also, the mapping $(g,y)\to\bar g y$ from $G\times\mathcal Y$ to $\mathcal Y$ is assumed to be jointly measurable.
Now, let $h$ be a measurable function on $(\mathcal X,\mathcal B)$ to $(\mathcal Z,\mathcal C_2)$ that is $G$-invariant. If $X\in\mathcal X$ and $\mathcal L(X) = P_0$, we want to find conditions under which $Y = t(X)$ and $Z = h(X)$ are stochastically independent. The following informal argument, which is made precise later, suggests the conditions needed. To show that $Y$ and $Z$ are independent, it is sufficient to verify that, for all bounded measurable functions $f$ on $(\mathcal Z,\mathcal C_2)$,
$$H(y) = \mathcal E_{P_0}\left(f(h(X))\mid t(X) = y\right)$$
is constant for $y\in\mathcal Y$. That this condition is sufficient follows by integrating $H$ with respect to the induced distribution of $Y$, say $Q_0$. More precisely, if $k$ is a bounded function on $(\mathcal Y,\mathcal C_1)$ and $H(y) = H(y_0)$ for $y\in\mathcal Y$, then
$$\mathcal E_{P_0}\left[k(t(X))f(h(X))\right] = \int \mathcal E_{P_0}\left[k(t(X))f(h(X))\mid t(X)=y\right]Q_0(dy) = \int k(y)\,\mathcal E_{P_0}\left[f(h(X))\mid t(X)=y\right]Q_0(dy) = \int k(y)H(y)\,Q_0(dy) = H(y_0)\int k(y)\,Q_0(dy)$$
and this implies independence. The assumption that H is constant justifies the next to the last equality, while the last equality follows from

H(y₀) = ∫ H(y) Q₀(dy) = E_P₀[f(h(X))]

when H is constant. Thus under what conditions will H be constant? Since Ḡ acts transitively on 𝒴, if H is Ḡ-invariant, then H must be constant, and conversely. However,

H(ḡ⁻¹y) = E_P₀[f(h(X))|t(X) = ḡ⁻¹y] = E_P₀[f(h(X))|ḡt(X) = y]
  = E_P₀[f(h(X))|t(gX) = y] = E_P₀[f(h(gX))|t(gX) = y]
  = E_gP₀[f(h(X))|t(X) = y].
The equivariance of t and the invariance of h justify the third and fourth equalities, while the last equality is a consequence of ℒ(gX) = gP₀ when ℒ(X) = P₀. Now, if t(X) is a sufficient statistic for the family 𝒫 = {gP₀|g ∈ G}, then the last member of the above string of equalities is just H(y). Under this sufficiency assumption, H(y) = H(ḡ⁻¹y) so H is invariant and hence is a constant. The technical problem with this argument is caused by the nonuniqueness of conditional expectations. The conclusion that H(y) = H(ḡ⁻¹y) should really be H(y) = H(ḡ⁻¹y) except for y ∈ N_g where N_g is a set of Q₀ measure zero. Since this null set can depend on g, even the conclusion that H is a constant a.e. (Q₀) is not justified without some further work. Once these technical problems are overcome, we prove that, if t(X) is sufficient for {gP₀|g ∈ G}, then for each g ∈ G, h(X) and t(X) are stochastically independent when ℒ(X) = gP₀. The first gap to fill concerns almost invariant functions.

Definition 7.6. Let (𝒳₁, 𝓑₁) be a measurable space that is acted on measurably by a group G₁. If μ is a σ-finite measure on (𝒳₁, 𝓑₁) and f is a real-valued Borel measurable function, f is almost G₁-invariant if for each g ∈ G₁, the set N_g = {x|f(x) ≠ f(gx)} has μ measure zero.

The following result shows that under certain conditions, an almost G₁-invariant function is equal a.e. (μ) to a G₁-invariant function.
Proposition 7.17. Suppose that G₁ acts measurably on (𝒳₁, 𝓑₁) and that G₁ is a locally compact and σ-compact topological group with the Borel σ-algebra. Assume that the mapping (g, x) → gx from G₁ × 𝒳₁ to 𝒳₁ is measurable. If μ is a σ-finite measure on (𝒳₁, 𝓑₁) and f is a bounded almost G₁-invariant function, then there exists a measurable invariant function f₁ such that f = f₁ a.e. (μ).

Proof. This follows from Theorem 4, p. 227 of Lehmann (1959) and the proof is not repeated here. □
The next technical problem has to do with conditional expectations.
Proposition 7.18. In the notation introduced earlier, suppose (𝒳, 𝓑) and (𝒴, 𝓒₁) are measurable spaces acted on by groups G and Ḡ where Ḡ is a homomorphic image of G. Assume that T is an equivariant function from 𝒳 to 𝒴. Let P₀ be a probability measure on (𝒳, 𝓑) and let Q₀ be the induced distribution of T(X) when ℒ(X) = P₀. If f is a bounded G-invariant function on 𝒳, let

H(y) = E_P₀(f(X)|T(X) = y)

and

H₁(y) = E_gP₀(f(X)|T(X) = y).

Then H₁(ḡy) = H(y) a.e. (Q₀) for each fixed g ∈ G.

Proof. The conditional expectations are well defined since f is bounded. H(y) is the unique a.e. (Q₀) function that satisfies the equation

∫ k(y)H(y) Q₀(dy) = ∫ k(T(x))f(x) P₀(dx)

for all bounded measurable k. The probability measure gP₀ satisfies the equation

∫ f₁(x) (gP₀)(dx) = ∫ f₁(gx) P₀(dx)

for all bounded f₁. Since T is equivariant, this implies that if ℒ(X) = gP₀, then ℒ(T(X)) = ḡQ₀. Using this, the invariance of f, and the characterizing
property of conditional expectation, we have for all bounded k,

∫ H(y)k(y) Q₀(dy) = ∫ f(x)k(T(x)) P₀(dx)
  = ∫ f(gx)k(ḡ⁻¹T(gx)) P₀(dx)
  = ∫ f(x)k(ḡ⁻¹T(x)) (gP₀)(dx)
  = ∫ H₁(y)k(ḡ⁻¹y) (ḡQ₀)(dy)
  = ∫ H₁(ḡy)k(ḡ⁻¹ḡy) Q₀(dy)
  = ∫ H₁(ḡy)k(y) Q₀(dy).

Since the first and the last terms in this equality are equal for all bounded k, we have that H(y) = H₁(ḡy) a.e. (Q₀). □
With the technical problems out of the way, the main result of this section can be proved.
Proposition 7.19. Consider measurable spaces (𝒳, 𝓑) and (𝒴, 𝓒₁), which are acted on measurably by groups G and Ḡ where Ḡ is a homomorphic image of G. It is assumed that the conditions of Proposition 7.17 hold for the group Ḡ and the space (𝒴, 𝓒₁), and that Ḡ acts transitively on 𝒴. Let T on 𝒳 to 𝒴 be measurable and equivariant. Also let (𝒵, 𝓒₂) be a measurable space and let h be a G-invariant measurable function from 𝒳 to 𝒵. For a random variable X ∈ 𝒳 with ℒ(X) = P₀, set Y = T(X) and Z = h(X) and assume that T(X) is a sufficient statistic for the family {gP₀|g ∈ G} of distributions on (𝒳, 𝓑). Under these assumptions, Y and Z are independent when ℒ(X) = gP₀, g ∈ G.

Proof. First we prove that Y and Z are independent when ℒ(X) = P₀. Fix a bounded measurable function f on 𝒵 and let

H_g(y) = E_gP₀(f(h(X))|T(X) = y).

Since T(X) is a sufficient statistic, there is a measurable function H on 𝒴 such that

H_g(y) = H(y) for y ∉ N_g

where N_g is a set of ḡQ₀-measure zero. Thus (ḡQ₀)(N_g) = Q₀(ḡ⁻¹N_g) = 0.
Let e denote the identity in G. We now claim that H is a Q₀ almost Ḡ-invariant function. By Proposition 7.18, H_e(y) = H_g(ḡy) a.e. (Q₀). However, H(y) = H_e(y) a.e. Q₀ and H_g(ḡy) = H(ḡy) for ḡy ∉ N_g, where Q₀(ḡ⁻¹N_g) = 0. Thus H_g(ḡy) = H(ḡy) a.e. Q₀, and this implies that H(y) = H(ḡy) a.e. Q₀. Therefore, there exists a Ḡ-invariant measurable function, say H̄, such that H = H̄ a.e. Q₀. Since Ḡ is transitive on 𝒴, H̄ must be a constant, so H is a constant a.e. Q₀. Therefore,

H_e(y) = E_P₀(f(h(X))|T(X) = y)

is a constant a.e. Q₀ and, as noted earlier, this implies that Z = h(X) and Y = T(X) are independent when ℒ(X) = P₀. When ℒ(X) = g₁P₀, let P̃₀ = g₁P₀ and note that {gP̃₀|g ∈ G} = {gP₀|g ∈ G} so T(X) is sufficient for {gP̃₀|g ∈ G}. The argument given for P₀ now applies for P̃₀. Thus Z and Y are independent when ℒ(X) = g₁P₀. □
A few comments concerning this result are in order. Since G acts transitively on {gP₀|g ∈ G} and Z = h(X) is G-invariant, the distribution of Z is the same under each gP₀, g ∈ G. In other words, Z is an ancillary statistic. Basu's Theorem, given in the Appendix, asserts that a sufficient statistic, whose induced family of distributions is complete, is independent of an ancillary statistic. Although no assumptions concerning invariance are made in the statement of Basu's Theorem, most applications are to problems where invariance is used to show a statistic is ancillary. In Proposition 7.19, the completeness assumption of Basu's Theorem has been replaced by the invariance assumptions and, most particularly, by the assumption that the group Ḡ acts transitively on the space 𝒴.
* Example 7.18. The normal distribution example at the beginning of this section provided a situation where the sample mean and sample variance are independent of a scale and translation invariant statistic. We now consider a generalization of that situation. Let 𝒳 = Rⁿ − span{e} where e is the vector of ones in Rⁿ, and suppose that a random vector X ∈ 𝒳 has a density f(‖x‖²) with respect to Lebesgue measure dx on 𝒳. The group G in the example at the beginning of this section acts on 𝒳 by

(a, b)x = ax + be, (a, b) ∈ G.

Consider the statistic t(X) = (s, X̄) where

X̄ = n⁻¹ Σᵢ₌₁ⁿ Xᵢ and s² = Σᵢ₌₁ⁿ (Xᵢ − X̄)².
Then t takes values in G and satisfies

t((a, b)X) = (a, b)t(X)

for (a, b) ∈ G. It is shown that t(X) and the G-invariant statistic

Z(X) = (t(X))⁻¹X = s⁻¹(X − X̄e)

are independent. The verification that Z(X) is invariant goes as follows:

Z((a, b)X) = (t((a, b)X))⁻¹(a, b)X
  = ((a, b)t(X))⁻¹(a, b)X = (t(X))⁻¹X = Z(X).

To apply Proposition 7.19, let P₀ be the probability measure with density f(‖x‖²) on 𝒳 and let 𝒴 = Ḡ = G. Thus t(X) is equivariant and Z(X) is invariant. The sufficiency of t(X) for the parametric family {gP₀|g ∈ G} is established by using the factorization theorem. For (a, b) ∈ G, it is not difficult to show that (a, b)P₀ has a density k(x|a, b) with respect to dx given by

k(x|a, b) = a⁻ⁿ f(‖(x − be)/a‖²), x ∈ 𝒳.

Since

‖(x − be)/a‖² = a⁻²(Σᵢ₌₁ⁿ xᵢ² − 2b Σᵢ₌₁ⁿ xᵢ + nb²),

the density k(x|a, b) is a function of Σxᵢ² and Σxᵢ, so the pair (ΣXᵢ², ΣXᵢ) is a sufficient statistic for the family {gP₀|g ∈ G}. However, t(X) = (s, X̄) is a one-to-one function of (ΣXᵢ², ΣXᵢ) so t(X) is a sufficient statistic. The remaining assumptions of Proposition 7.19 are easily verified. Therefore, t(X) and Z(X) are independent under each of the measures (a, b)P₀ for (a, b) ∈ G.
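Since the argument above uses only the form f(‖x‖²) of the density, the independence holds for any spherical P₀, not just the normal. The following sketch (an assumption-laden illustration, not from the text: a multivariate t vector built as a normal scale mixture is spherical, and n and the degrees of freedom are arbitrary choices) checks this numerically.

```python
# For a spherical multivariate t sample, t(X) = (s, Xbar) and
# Z(X) = (X - Xbar e)/s should still be independent, so the sample
# correlations below should vanish up to Monte Carlo error.
import numpy as np

rng = np.random.default_rng(1)
n, df, reps = 8, 5, 200_000
G = rng.standard_normal((reps, n))
W = rng.chisquare(df, size=reps)
X = G / np.sqrt(W / df)[:, None]     # X has density f(||x||^2) (multivariate t)

xbar = X.mean(axis=1)
s = np.sqrt(((X - xbar[:, None]) ** 2).sum(axis=1))
Z1 = (X[:, 0] - xbar) / s

print(abs(np.corrcoef(Z1, xbar)[0, 1]) < 0.01)
print(abs(np.corrcoef(Z1, s)[0, 1]) < 0.01)
```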
Before proceeding with the next example, an extension of Proposition 7.1 is needed.
Proposition 7.20. Consider the space ℒ_{p,n}, n ≥ p, and let Q be an n × n rank k orthogonal projection. If k ≥ p, then the set

B = {x|x ∈ ℒ_{p,n}, rank(Qx) < p}

has Lebesgue measure zero.
Proof. Let X ∈ ℒ_{p,n} be a random vector with ℒ(X) = N(0, Iₙ ⊗ I_p) = P₀. It suffices to show that P₀(B) = 0 since P₀ and Lebesgue measure are mutually absolutely continuous. Also, write Q as

Q = Γ′DΓ, Γ ∈ Oₙ,

where

D = ( I_k  0 )
    (  0   0 ).

Since

rank(Γ′DΓx) = rank(DΓx)

and ℒ(ΓX) = ℒ(X), it suffices to show that

P₀(rank(DX) < p) = 0.

Now, partition X as

X = ( X₁ )
    ( X₂ ),  X₁: k × p,

so

DX = ( X₁ )
     (  0 ).

Thus rank(DX) = rank(X₁). Since k ≥ p and ℒ(X₁) = N(0, I_k ⊗ I_p), Proposition 7.1 implies that X₁ has rank p with probability one. Thus P₀(B) = 0 so B has Lebesgue measure zero. □
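A quick numerical illustration of Proposition 7.20 (a sketch, with n, p, k chosen arbitrarily subject to k ≥ p): for a rank k orthogonal projection Q and a Gaussian matrix X, the product QX comes out with full rank p in every trial.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 10, 3, 5                     # requires k >= p
V, _ = np.linalg.qr(rng.standard_normal((n, k)))
Q = V @ V.T                            # rank k orthogonal projection on R^n

ranks = [np.linalg.matrix_rank(Q @ rng.standard_normal((n, p)))
         for _ in range(100)]
print(all(r == p for r in ranks))
```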
* Example 7.19. This is a generalization of Example 7.18 and deals with the general multivariate linear model discussed in Chapter 4.
As in Example 4.4, let M be a linear subspace of ℒ_{p,n} defined by

M = {x|x ∈ ℒ_{p,n}, x = ZB, B ∈ ℒ_{p,k}}

where Z is a fixed n × k matrix of rank k. For reasons that are apparent in a moment, it is assumed that n − k ≥ p. The orthogonal projection onto M relative to the natural inner product ⟨·, ·⟩ on ℒ_{p,n} is P_M = P_Z ⊗ I_p where

P_Z = Z(Z′Z)⁻¹Z′

is a rank k orthogonal projection on Rⁿ. Also, Q_M = Q_Z ⊗ I_p is the orthogonal projection onto M⊥, where Q_Z = Iₙ − P_Z is a rank n − k orthogonal projection on Rⁿ. For this example, the sample space 𝒳 is

𝒳 = {x|x ∈ ℒ_{p,n}, rank(Q_Z x) = p}.

Since n − k ≥ p, Proposition 7.20 implies that the complement of 𝒳 has Lebesgue measure zero in ℒ_{p,n}. In this example, the group G has elements that are pairs (T, u) with T ∈ G_T⁺, where T is p × p, and u ∈ M. The group operation is

(T₁, u₁)(T₂, u₂) = (T₁T₂, u₁ + u₂T₁′)

and the action of G on 𝒳 is

(T, u)x = xT′ + u.

For this example, 𝒴 = Ḡ = G and t on 𝒳 to G is defined by

t(x) = (T(x), P_M x) ∈ G

where T(x) is the unique element in G_T⁺ such that x′Q_Z x = T(x)T′(x). The assumption that n − k ≥ p insures that x′Q_Z x has rank p. It is now routine to verify that

t((T, u)x) = (T, u)t(x)

for x ∈ 𝒳 and (T, u) ∈ G. Using this relationship, the function

h(x) = (t(x))⁻¹x
is easily shown to be a maximal invariant under the action of G on 𝒳. Now consider a random vector X ∈ 𝒳 with ℒ(X) = P₀ where P₀ has a density f(⟨x, x⟩) with respect to Lebesgue measure on 𝒳. We apply Proposition 7.19 to show that t(X) and h(X) are independent under gP₀ for each g ∈ G. Since 𝒴 = Ḡ = G, t is an equivariant function and Ḡ acts transitively on 𝒴. The measurability assumptions of Proposition 7.19 are easily checked. It remains to show that t(X) is a sufficient statistic for the family {gP₀|g ∈ G}. For g = (T, μ) ∈ G, gP₀ has a density given by

p(x|(T, μ)) = |TT′|^{-n/2} f(⟨(x − μ)(TT′)⁻¹, x − μ⟩).

Letting Σ = TT′ and arguing as in Example 4.4, it follows that

⟨(x − μ)Σ⁻¹, x − μ⟩ = ⟨(P_M x − μ)Σ⁻¹, P_M x − μ⟩ + tr(Σ⁻¹x′Q_Z x)

since μ ∈ M. Therefore, the density p(x|(T, μ)) is a function of the pair (x′Q_Z x, P_M x) so this pair is a sufficient statistic for the family {gP₀|g ∈ G}. However, T(x) is a one-to-one function of x′Q_Z x, so

t(x) = (T(x), P_M x)

is also a sufficient statistic. Thus Proposition 7.19 implies that t(X) and h(X) are stochastically independent under each probability measure gP₀ for g ∈ G. Of course, the choice of f that motivated this example is

f(w) = (√(2π))^{-np} exp[−w/2]

so that P₀ is the probability measure of a N(0, Iₙ ⊗ I_p) distribution on 𝒳.

One consequence of Proposition 7.19 is that the statistic h(X) is ancillary. But for the case at hand, we now describe the distribution of h(X) and show that its distribution does not even depend on the particular density f used to define P₀. Recall that

h(x) = (t(x))⁻¹x = (x − P_M x)(T′(x))⁻¹ = (Q_Z x)(T′(x))⁻¹

where T(x)T′(x) = x′Q_Z x and T(x) ∈ G_T⁺. Fix x ∈ 𝒳 and set
Ψ = (Q_Z x)(T′(x))⁻¹. First note that

Ψ′Ψ = (T(x))⁻¹x′Q_Z x(T′(x))⁻¹ = I_p

so Ψ is a linear isometry. Let N be the orthogonal complement in Rⁿ of the linear subspace spanned by the columns of the matrix Z. Clearly, dim(N) = n − k and the range of Ψ is contained in N since Q_Z is the orthogonal projection onto N. Therefore, Ψ is an element of the space

𝓕_p(N) = {Ψ|Ψ′Ψ = I_p, range(Ψ) ⊆ N}.

Further, the group

H = {Γ|Γ ∈ Oₙ, Γ(N) = N}

is compact and acts transitively on 𝓕_p(N) under the group action

Ψ → ΓΨ, Ψ ∈ 𝓕_p(N), Γ ∈ H.

Now, we return to the original problem of describing the distribution of W = h(X) when ℒ(X) = P₀. The above argument shows that W ∈ 𝓕_p(N). Since the compact group H acts transitively on 𝓕_p(N), there is a unique H-invariant probability measure ν on 𝓕_p(N). It will be shown that ℒ(W) = ν by proving ℒ(ΓW) = ℒ(W) for all Γ ∈ H. It is not difficult to verify that ΓQ_Z = Q_ZΓ for Γ ∈ H. Since ℒ(ΓX) = ℒ(X) and T(ΓX) = T(X), we have

ℒ(ΓW) = ℒ(Γh(X)) = ℒ(ΓQ_Z X(T′(X))⁻¹)
  = ℒ(Q_Z ΓX(T′(ΓX))⁻¹) = ℒ(Q_Z X(T′(X))⁻¹)
  = ℒ(h(X)) = ℒ(W).

Therefore, the distribution of W is H-invariant so ℒ(W) = ν.
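The algebra of Example 7.19 can be traced in a few lines of code. This is a sketch under assumed dimensions (n, k, p arbitrary with n − k ≥ p), taking T(x) to be the Cholesky factor of x′Q_Z x as one concrete choice of the element of G_T⁺ with x′Q_Z x = T(x)T′(x); it confirms that W = (Q_Z x)(T′(x))⁻¹ is a linear isometry whose range lies in N, the orthogonal complement of the column space of Z.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, p = 12, 3, 4                       # needs n - k >= p
Z = rng.standard_normal((n, k))
Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # P_Z = Z (Z'Z)^{-1} Z'
Qz = np.eye(n) - Pz                      # projection onto N = (col Z)^perp

x = rng.standard_normal((n, p))
T = np.linalg.cholesky(x.T @ Qz @ x)     # lower triangular, x'Q_Z x = T T'
W = Qz @ x @ np.linalg.inv(T.T)          # W = (Q_Z x)(T')^{-1}

print(np.allclose(W.T @ W, np.eye(p)))   # W'W = I_p: a linear isometry
print(np.allclose(Pz @ W, 0))            # range(W) is contained in N
```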
Further applications of Proposition 7.19 occur in the next three chapters. In particular, this result is used to derive the distribution of the determinant of a certain matrix that arises in testing problems for the general linear model.
PROBLEMS
1. Suppose the random n × p matrix X ∈ 𝒳 (𝒳 as in Section 7.1) has a density given by f(x) = k|x′x|^γ exp[−½ tr x′x] with respect to dx. The constant k depends on n, p, and γ (see Problem 6.10). Derive the density of S = X′X and the density of U in the representation X = ΨU with U ∈ G_T⁺ and Ψ ∈ 𝓕_{p,n}.
2. Suppose X ∈ 𝒳 has an Oₙ-left invariant distribution. Let P(X) = X(X′X)⁻¹X′ and S(X) = X′X. Prove that P(X) and S(X) are independent.
3. Let Q be an n × n non-negative definite matrix of rank r and set A = {x|x ∈ ℒ_{p,n}, x′Qx has rank p}. Show that, if r ≥ p, then Aᶜ has Lebesgue measure zero.
4. With 𝒳 as in Section 7.1, Oₙ × Gl_p acts on 𝒳 by x → ΓxA′ for Γ ∈ Oₙ and A ∈ Gl_p. Also, Oₙ × Gl_p acts on 𝒮_p⁺ by S → ASA′. Show that φ(x) = kx′x is equivariant for each constant k > 0. Are these the only equivariant functions?
5. The permutation group 𝒫ₙ acts on Rⁿ via matrix multiplication x → gx, g ∈ 𝒫ₙ. Let 𝒴 = {y|y ∈ Rⁿ, y₁ ≤ y₂ ≤ ⋯ ≤ yₙ}. Define f: Rⁿ → 𝒴 by f(x) is the vector of ordered values of the set {x₁,…, xₙ} with multiple values listed.
(i) Show f is a maximal invariant.
(ii) Set I₀(u) = 1 if u ≥ 0 and 0 if u < 0. Define Fₙ(t) = n⁻¹ Σᵢ₌₁ⁿ I₀(t − xᵢ) for t ∈ R¹. Show Fₙ is also a maximal invariant.
6. Let M be a proper subspace of the inner product space (V, (·, ·)). Let A₀ be defined by A₀x = −x for x ∈ M and A₀x = x for x ∈ M⊥.
(i) Verify that the set of pairs (B, y), with y ∈ M and B either A₀ or the identity I, forms a subgroup of the affine group Al(V). Let G be this group.
(ii) Show that G acts on M and on V.
(iii) Suppose t: V → M is equivariant (t(Bx + y) = Bt(x) + y for (B, y) ∈ G and x ∈ V). Prove that t(x) = P_M x.
7. Let M be a subspace of Rⁿ (M ≠ Rⁿ) so the complement of 𝒳 = Rⁿ − M has Lebesgue measure zero. Suppose X ∈ 𝒳 has a density given by

p(x|μ, σ) = σ⁻ⁿ f₀(‖x − μ‖²/σ²)

where μ ∈ M and σ > 0. Assume that ∫‖x‖²f₀(‖x‖²) dx < +∞. For a > 0, Γ ∈ O(M), and b ∈ M, the affine transformation (a, Γ, b)x = aΓx + b acts on 𝒳.
(i) Show that the probability model for X (μ ∈ M, σ > 0) is invariant under the above affine transformations. What is the induced group action on (μ, σ²)?
(ii) Show that the only equivariant estimator of μ is P_M X. Show that any equivariant estimator of σ² has the form k‖Q_M X‖² for some k > 0, where Q_M = I − P_M.
8. With 𝒳 as in Section 7.1, suppose f is a function defined on Gl_p to [0, ∞), which satisfies f(AB) = f(BA) and

∫ f(x′x) dx/|x′x|^{n/2} = 1.

(i) Show that f(x′xΣ⁻¹), Σ ∈ 𝒮_p⁺, is a density on 𝒳 with respect to dx/|x′x|^{n/2} and that under this density, the covariance (assuming it exists) is cIₙ ⊗ Σ where c > 0.
(ii) Show that the family of distributions of (i) indexed by Σ ∈ 𝒮_p⁺ is invariant under the group Oₙ × Gl_p acting on 𝒳 by (Γ, A)x = ΓxA′. Also, show that (Γ, A)Σ = AΣA′.
(iii) Show that the equivariant estimators of Σ all have the form kX′X, k > 0.
Now, assume that

sup_{C ∈ 𝒮_p⁺} f(C) = f(C₀)

where C₀ ∈ 𝒮_p⁺ is unique.
(iv) Show C₀ = αI_p for some α > 0.
(v) Find the maximum likelihood estimator of Σ (expressed in terms of X and α in (iv)).
9. In an inner product space (V, (·, ·)), suppose X has a distribution P₀.
(i) Show that ℒ(‖X‖) = ℒ(‖Y‖) whenever ℒ(Y) = gP₀, g ∈ O(V).
(ii) In the special case that ℒ(X) = ℒ(μ + Z) where μ is a fixed vector and Z has an O(V)-invariant distribution, how does the distribution of ‖X‖ depend on μ?
10. Under the assumptions of Problem 4.5, use an invariance argument to show that the distribution of F depends on (μ, σ²) only through the parameter (‖μ‖² − ‖P_ω μ‖²)/σ². What happens when μ ∈ ω?
11. Suppose X₁,…, Xₙ is a random sample from a distribution on Rᵖ (n > p) with density p(x|μ, Σ) = |Σ|^{-1/2} f((x − μ)′Σ⁻¹(x − μ)) where μ ∈ Rᵖ and Σ ∈ 𝒮_p⁺. The parameter θ = det(Σ) is sometimes called the population generalized variance. The sample generalized variance is V = det((1/n)S) where S = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Xᵢ − X̄)′. Show that the distribution of V depends on (μ, Σ) only through θ.
12. Assume the conditions under which Proposition 7.16 was proved. Given a probability Q on 𝒴, let Q̃ denote the extension of Q to 𝒳, that is, Q̃(B) = Q(B ∩ 𝒴) for B ∈ 𝓑. For g ∈ G, gQ̃ is defined in the usual way: (gQ̃)(B) = Q̃(g⁻¹B).
(i) Assume that P is a probability measure on 𝒳 and

(7.1) P = ∫_G gQ̃ μ(dg),

that is,

P(B) = ∫_G (gQ̃)(B) μ(dg); B ∈ 𝓑.

Show that P is G-invariant.
(ii) If P is G-invariant, show that (7.1) holds for some Q.
13. Under the assumptions used to prove Proposition 7.16, let 𝒫 be the set of all G-invariant distributions. Prove that T(X) is a sufficient statistic for the family 𝒫.
14. Suppose X ∈ Rⁿ has coordinates X₁,…, Xₙ that are i.i.d. N(μ, 1), μ ∈ R¹. Thus the parameter space for the distributions of X is the additive group G = R¹. The function t: Rⁿ → G given by t(x) = x̄ gives a complete sufficient statistic for the model for X. Also, G acts on Rⁿ by gx = x + ge where e ∈ Rⁿ is the vector of ones.
(i) Show that t(gx) = gt(x) and that Z(X) = (t(X))⁻¹X is an ancillary statistic. Here, (t(X))⁻¹ means the group inverse of t(X) ∈ G so (t(X))⁻¹X = X − X̄e. What is the distribution of Z(X)?
(ii) Suppose we want to find a minimum variance unbiased estimator (MVUE) of h(μ) = E_μ f(X₁) where f is a given function. The Rao-Blackwell Theorem asserts that the MVUE is E(f(X₁)|t(X) = t). Show that this conditional expectation is

∫_{−∞}^{∞} f(z + t) (√(2π) δ)⁻¹ exp[−z²/(2δ²)] dz
where δ² = var(X₁ − X̄) = (n − 1)/n. Evaluate this for f(x) = 1 if x ≤ u₀ and f(x) = 0 if x > u₀.
(iii) What is the MVUE of the parametric function (√(2π))⁻¹ exp[−½(x₀ − μ)²] where x₀ is a fixed number?
15. Using the notation, results, and assumptions of Example 7.18, find an unbiased estimator based on t(X) of the parametric function h(a, b) = ((a, b)P₀)(X₁ ≤ u₀) where u₀ is a fixed number and X₁ is the first coordinate of X. Express the answer in terms of the distribution of Z₁, the first coordinate of Z. What is this distribution? In the case that P₀ is the N(0, Iₙ) distribution, show this gives a MVUE for h(a, b).
16. This problem contains an abstraction of the technique developed in Problems 14 and 15. Under the conditions used to prove Proposition 7.19, assume the space (𝒴, 𝓒₁) is (G, 𝓑_G), where 𝓑_G is the Borel σ-algebra of G, and Ḡ = G. The equivariance assumption on T then becomes T(gx) = g ∘ T(x) since T(x) ∈ G. Of course, T(X) is assumed to be a sufficient statistic for {gP₀|g ∈ G}.
(i) Let Z(X) = (T(X))⁻¹X where (T(X))⁻¹ is the group inverse of T(X). Show that Z(X) is a maximal invariant and Z(X) is ancillary. Hence Proposition 7.19 applies.
(ii) Let Q₀ denote the distribution of Z when ℒ(X) is one of the distributions gP₀, g ∈ G. Show that a version of the conditional expectation E(f(X)|T(X) = g) is E_{Q₀} f(gZ) for any bounded measurable f.
(iii) Apply the above to the case when P₀ is N(0, Iₙ ⊗ I_p) on 𝒳 (as in Section 7.1) and take G = G_T⁺. The group action is x → xT′ for x ∈ 𝒳 and T ∈ G_T⁺. The map T is T(x) = T in the representation x = ΨT′ with Ψ ∈ 𝓕_{p,n} and T ∈ G_T⁺. What is Q₀?
(iv) When X ∈ 𝒳 is N(0, Iₙ ⊗ Σ) with Σ ∈ 𝒮_p⁺, use (iii) to find a
MVUE of the parametric function
h(Σ) = (√(2π))⁻ᵖ |Σ|^{-1/2} exp[−½ u₀′Σ⁻¹u₀]

where u₀ is a fixed vector in Rᵖ.
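As an illustration of the Rao-Blackwell computation in Problem 14(ii) (a sketch; μ, u₀, and n are arbitrary choices), evaluating the integral there for the indicator f(x) = 1{x ≤ u₀} gives the estimator Φ((u₀ − X̄)/δ) with δ² = (n − 1)/n, and its unbiasedness for h(μ) = P_μ(X₁ ≤ u₀) can be verified by simulation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu, u0, n, reps = 0.7, 1.5, 6, 400_000
delta = np.sqrt((n - 1) / n)                       # delta^2 = var(X_1 - Xbar)

xbar = rng.normal(mu, 1 / np.sqrt(n), size=reps)   # Xbar ~ N(mu, 1/n)
estimate = norm.cdf((u0 - xbar) / delta).mean()    # E[ Phi((u0 - Xbar)/delta) ]
target = norm.cdf(u0 - mu)                         # h(mu) = P_mu(X_1 <= u0)

print(abs(estimate - target) < 0.005)
```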
NOTES AND REFERENCES
1. For some material related to Proposition 7.3, see Dawid (1978). The extension of Proposition 7.3 to arbitrary compact groups (Proposition 7.16) is due to Farrell (1962). A related paper is Das Gupta (1979).
2. If G acts on 𝒳 and t is a function from 𝒳 onto 𝒴, it is natural to ask if we can define a group action on 𝒴 (using t and G) so that t becomes equivariant. The obvious thing to do is to pick y ∈ 𝒴, write y = t(x), and then define gy to be t(gx). In order that this definition make sense, it is necessary (and sufficient) that whenever t(x₁) = t(x₂), then t(gx₁) = t(gx₂) for all g ∈ G. When this condition holds, it is easy to show that G then acts on 𝒴 via the above definition and t is equivariant. For some further discussion, see Hall, Wijsman, and Ghosh (1965).
3. Some of the early work on invariance by Stein, and Hunt and Stein, first appeared in print in the work of other authors. For example, the
famous Hunt-Stein Theorem given in Lehmann (1959) was established in 1946 but was never published. This early work laid the foundation
for much of the material in this chapter. Other early invariance works
include Hotelling (1931), Pitman (1939), and Peisakoff (1950). The paper by Kiefer (1957) contains a generalization of the Hunt-Stein
Theorem. For some additional discussion on the development of invariance arguments, see Hall, Wijsman, and Ghosh (1965).
4. Proposition 7.15 is probably due to Stein, but I do not know a
reference.
5. Make the assumptions on 𝒳, 𝒴, and G that lead to Proposition 7.16, and note that 𝒴 is just a particular representation of the quotient space 𝒳/G. If ν is any σ-finite G-invariant measure on 𝒳, let δ be the measure on 𝒴 defined by

δ(C) = ν(t⁻¹(C)), C ∈ 𝓒₁.

Then (see Lehmann, 1959, p. 39),

∫ h(t(x)) ν(dx) = ∫ h(y) δ(dy)

for all measurable functions h. The proof of Proposition 7.16 shows that for any ν-integrable function f, the equation

(7.2) ∫ f(x) ν(dx) = ∫∫ f(gy) μ(dg) δ(dy)

holds. In an attempt to make sense of (7.2) when G is not compact, let
μᵣ denote a right invariant measure on G. For f ∈ 𝒦(𝒳), the continuous functions on 𝒳 with compact support, set

f̃(x) = ∫ f(gx) μᵣ(dg).

Assuming this integral is well defined (it may not be in certain examples, e.g., 𝒳 = Rⁿ − {0} and G = Glₙ), it follows that f̃(hx) = f̃(x) for h ∈ G. Thus f̃ is invariant and can be regarded as a function on 𝒴 = 𝒳/G. For any measure δ on 𝒴, write ∫ f̃ dδ to mean the integral of f̃, expressed as a function of y, with respect to the measure δ. In this case, the right-hand side of (7.2) becomes

J(f) = ∫(∫ f(gy) μᵣ(dg)) δ(dy) = ∫ f̃ dδ.

However, for h ∈ G, it is easy to show that

J(hf) = Δᵣ⁻¹(h)J(f)

so J is a relatively invariant integral. As usual, Δᵣ is the right-hand modulus of G. Thus the left-hand side of (7.2) must also be relatively invariant with multiplier Δᵣ⁻¹. The argument thus far shows that when μ in (7.2) is replaced by μᵣ (this choice looks correct so that the inside integral defines an invariant function), the resulting integral J is relatively invariant with multiplier Δᵣ⁻¹. Hence the only possible measures ν for which (7.2) can hold must be relatively invariant with multiplier Δᵣ⁻¹. However, given such a ν, further assumptions are needed in order that (7.2) hold for some δ (when G is not compact and μ is replaced by μᵣ). Some examples where (7.2) is valid for noncompact groups are given in Stein (1956), but the first systematic account of such a result is Wijsman (1966), who uses some Lie group theory. A different approach due to Schwarz is reported in Farrell (1976). The description here follows Andersson (1982) most closely.
6. Proposition 7.19 is a special case of a result in Hall, Wijsman, and
Ghosh (1965). Some version of this result was known to Stein but never
published by him. The development here is a modification of that which
I learned from Bondesson (1977).
CHAPTER 8
The Wishart Distribution
The Wishart distribution arises in a natural way as a matrix generalization of the chi-square distribution. If X₁,…, Xₙ are independent with ℒ(Xᵢ) = N(0, 1), then Σᵢ₌₁ⁿ Xᵢ² has a chi-square distribution with n degrees of freedom. When the Xᵢ are random vectors rather than real-valued random variables, say Xᵢ ∈ Rᵖ with ℒ(Xᵢ) = N(0, I_p), one possible way to generalize the above sum of squares is to form the p × p positive semidefinite matrix S = Σᵢ₌₁ⁿ XᵢXᵢ′. Essentially, this representation of S is used to define a Wishart distribution. As with the definition of the multivariate normal distribution, our definition of the Wishart distribution is not in terms of a density function and allows for Wishart distributions that are singular. In fact, most of the properties of the Wishart distribution are derived without reference to densities by exploiting the representation of the Wishart in terms of normal random vectors. For example, the distribution of a partitioned Wishart matrix is obtained by using properties of conditioned normal random vectors.

After formally defining the Wishart distribution, the characteristic function and convolution properties of the Wishart are derived. Certain generalized quadratic forms in normal random vectors are shown to have Wishart distributions and the basic decomposition of the Wishart into submatrices is given. The remainder of the chapter is concerned with the noncentral Wishart distribution in the rank one case and certain distributions that arise in connection with likelihood ratio tests.
8.1. BASIC PROPERTIES
The Wishart distribution, or more precisely, the family of Wishart distributions, is indexed by a p × p positive semidefinite symmetric matrix Σ, by a
dimension parameter p, and by a degrees of freedom parameter n. Formally, we have the following definition.
Definition 8.1. A random p × p symmetric matrix S has a Wishart distribution with parameters Σ, p, and n if there exist independent random vectors X₁,…, Xₙ in Rᵖ such that ℒ(Xᵢ) = N(0, Σ), i = 1,…, n, and

ℒ(S) = ℒ(Σᵢ₌₁ⁿ XᵢXᵢ′).

In this case, we write ℒ(S) = W(Σ, p, n).
In the above definition, p and n are positive integers and Σ is a p × p positive semidefinite matrix. When p = 1, it is clear that the Wishart distribution is just a chi-square distribution with n degrees of freedom and scale parameter Σ ≥ 0. When Σ = 0, then Xᵢ = 0 with probability one, so S = 0 with probability one. Since Σᵢ₌₁ⁿ XᵢXᵢ′ is positive semidefinite, the Wishart distribution has all of its mass on the set of positive semidefinite matrices. In an abuse of notation, we often write

S = Σᵢ₌₁ⁿ XᵢXᵢ′

when ℒ(S) = W(Σ, p, n). As distributional questions are the primary concern in this chapter, this abuse causes no technical problems. If X ∈ ℒ_{p,n} has rows X₁′,…, Xₙ′, it is clear that ℒ(X) = N(0, Iₙ ⊗ Σ) and X′X = Σᵢ₌₁ⁿ XᵢXᵢ′. Thus if ℒ(S) = W(Σ, p, n), then ℒ(S) = ℒ(X′X) where ℒ(X) = N(0, Iₙ ⊗ Σ) on ℒ_{p,n}. Also, the converse statement is clear. Some further properties of the Wishart distribution follow.
Proposition 8.1. If ℒ(S) = W(Σ, p, n) and A is an r × p matrix, then ℒ(ASA′) = W(AΣA′, r, n).

Proof. Since ℒ(S) = W(Σ, p, n),

ℒ(S) = ℒ(X′X)

where ℒ(X) = N(0, Iₙ ⊗ Σ) on ℒ_{p,n}. Thus ℒ(ASA′) = ℒ(AX′XA′) = ℒ[((Iₙ ⊗ A)X)′(Iₙ ⊗ A)X]. But Y = (Iₙ ⊗ A)X satisfies ℒ(Y) = N(0, Iₙ ⊗ (AΣA′)) on ℒ_{r,n} and ℒ(Y′Y) = ℒ(ASA′). The conclusion follows from the definition of the Wishart distribution. □
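A Monte Carlo sketch of Proposition 8.1 (Σ, A, n, and the replication count are arbitrary choices): realizing S as X′X with the rows of X i.i.d. N(0, Σ), the transformed matrix ASA′ should have mean nAΣA′, which is the mean of a W(AΣA′, r, n) distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 3, 7, 100_000
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])              # an arbitrary r x p matrix, r = 2
C = np.linalg.cholesky(Sigma)

X = rng.standard_normal((reps, n, p)) @ C.T   # each X[i] has rows N(0, Sigma)
S = np.einsum('rni,rnj->rij', X, X)           # S[i] = X[i]'X[i] ~ W(Sigma, p, n)
mean_ASA = A @ S.mean(axis=0) @ A.T

print(np.allclose(mean_ASA, n * A @ Sigma @ A.T, rtol=0.05, atol=0.3))
```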
One consequence of Proposition 8.1 is that, for fixed p and n, the family of distributions {W(Σ, p, n)|Σ ≥ 0} can be generated from the W(I_p, p, n) distribution and p × p matrices. Here, the notation Σ ≥ 0 (Σ > 0) means that Σ is positive semidefinite (positive definite). To see this, if ℒ(S) = W(I_p, p, n) and Σ = AA′, then

ℒ(ASA′) = W(AA′, p, n) = W(Σ, p, n).

In particular, the family {W(Σ, p, n)|Σ ≥ 0} is generated by the W(I_p, p, n) distribution and the group Gl_p acting on 𝒮_p by A(S) = ASA′. Many proofs are simplified by using the above representation of the Wishart distribution. The question of the nonsingularity of the Wishart distribution is a good example. If ℒ(S) = W(Σ, p, n), then S has a nonsingular Wishart distribution if S is positive definite with probability one.
Proposition 8.2. Suppose ℒ(S) = W(Σ, p, n). Then S has a nonsingular Wishart distribution iff n ≥ p and Σ > 0. If S has a nonsingular Wishart distribution, then S has a density with respect to the measure ν(dS) = dS/|S|^{(p+1)/2} given by

p(S|Σ) = ω(p, n)|Σ|^{-n/2}|S|^{n/2} exp[−½ tr Σ⁻¹S].

Here, ω(p, n) is the Wishart constant defined in Example 5.1.
Proof. Represent the W(Σ, p, n) distribution as ℒ(AS₁A′) where ℒ(S₁) = W(I_p, p, n) and AA′ = Σ. Obviously, the rank of A is the rank of Σ, and Σ > 0 iff the rank of Σ is p. If n < p, then by Proposition 7.1, if ℒ(Xᵢ) = N(0, I_p), i = 1,…, n, the rank of Σᵢ₌₁ⁿ XᵢXᵢ′ is n with probability one. Thus S₁ = Σᵢ₌₁ⁿ XᵢXᵢ′ has rank n, which is less than p, and S = AS₁A′ has rank less than p with probability one. Also, if the rank of Σ is r < p, then A has rank r so AS₁A′ has rank at most r no matter what n happens to be. Therefore, if n < p or if Σ is singular, then S is singular with probability one. Now, consider the case when n ≥ p and Σ is positive definite. Then S₁ = Σᵢ₌₁ⁿ XᵢXᵢ′ has rank p with probability one by Proposition 7.1, and A has rank p. Therefore, S = AS₁A′ has rank p with probability one.
When Σ > 0, the density of X ∈ ℒ_{p,n} is

f(X) = (√(2π))^{-np}|Σ|^{-n/2} exp[−½ tr Σ⁻¹X′X]

when ℒ(X) = N(0, Iₙ ⊗ Σ). When n ≥ p, it follows from Proposition 7.6 that the density of S with respect to ν(dS) is p(S|Σ). □
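The rank dichotomy in the proof can be seen directly (a sketch with Σ = I_p and arbitrary dimensions): S = X′X has rank min(n, p) with probability one, so S is nonsingular exactly when n ≥ p.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 4
results = {}
for n in (2, 4, 9):                     # n < p, n = p, n > p
    X = rng.standard_normal((n, p))     # rows N(0, I_p), so X'X ~ W(I_p, p, n)
    results[n] = np.linalg.matrix_rank(X.T @ X)
print(results)                          # rank is min(n, p) in each case
```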
Recall that the natural inner product on 𝒮_p, when 𝒮_p is regarded as a subspace of ℒ_{p,p}, is

⟨S₁, S₂⟩ = tr S₁S₂, Sᵢ ∈ 𝒮_p, i = 1, 2.

The mean vector, covariance, and characteristic function of a Wishart distribution on the inner product space (𝒮_p, ⟨·, ·⟩) are given next.
Proposition 8.3. Suppose E(S) = W(2, p, n) on (sp, ( , )). Then
(i) &S = nl.
(ii) Cov(S) = 2n X E2.
(iii) +(A)=&exp[i(A, S)] = lIp-2i2A[j/2.
Proof. To prove (i), write S = Σᵢ₌₁ⁿ XᵢXᵢ' where 𝓛(Xᵢ) = N(0, Σ) and X₁,…, Xₙ are independent. Since ℰXᵢXᵢ' = Σ, it is clear that ℰS = nΣ. For (ii), the independence of X₁,…, Xₙ implies that
Cov(S) = Cov(Σᵢ₌₁ⁿ XᵢXᵢ') = Σᵢ₌₁ⁿ Cov(XᵢXᵢ') = n Cov(X₁X₁') = n Cov(X₁ □ X₁)
where X₁ □ X₁ is the outer product of X₁ with itself relative to the standard inner product on R^p. Since 𝓛(X₁) = 𝓛(CZ) where 𝓛(Z) = N(0, I_p) and CC' = Σ, it follows from Proposition 2.24 that Cov(X₁ □ X₁) = 2(Σ ⊗ Σ). Thus (ii) holds. To establish (iii), first write C'AC = ΓDΓ' where A ∈ 𝒮_p, CC' = Σ, Γ ∈ 𝒪_p, and D is a diagonal matrix with diagonal entries λ₁,…, λ_p. Then
φ(A) = ℰ exp[i tr(AS)] = ℰ exp[i tr A(Σⱼ₌₁ⁿ XⱼXⱼ')]
= ℰ ∏ⱼ₌₁ⁿ exp[i tr AXⱼXⱼ'] = ∏ⱼ₌₁ⁿ ℰ exp[i tr AXⱼXⱼ']
= (ℰ exp[i tr AX₁X₁'])ⁿ = (ℰ exp[iX₁'AX₁])ⁿ ≡ (φ₁(A))ⁿ.
Again, 𝓛(X₁) = 𝓛(CZ) where 𝓛(Z) = N(0, I_p). Also, 𝓛(ΓZ) = 𝓛(Z) for
Γ ∈ 𝒪_p. Therefore,
φ₁(A) = ℰ exp[iX₁'AX₁] = ℰ exp[iZ'C'ACZ] = ℰ exp[iZ'DZ] = ℰ exp[i Σⱼ₌₁ᵖ λⱼZⱼ²]
where Z₁,…, Z_p are the coordinates of Z. Since Z₁,…, Z_p are independent with 𝓛(Zⱼ) = N(0, 1), Zⱼ² has a χ₁² distribution and we have
φ₁(A) = ∏ⱼ₌₁ᵖ ℰ exp[iλⱼZⱼ²] = ∏ⱼ₌₁ᵖ (1 − 2iλⱼ)^{-1/2}
= |I_p − 2iD|^{-1/2} = |I_p − 2iΓDΓ'|^{-1/2} = |I_p − 2iC'AC|^{-1/2}
= |I_p − 2iCC'A|^{-1/2} = |I_p − 2iΣA|^{-1/2}.
The next to the last equality is a consequence of Proposition 1.35. Raising φ₁(A) to the power n shows that (iii) holds. □
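A Monte Carlo check of (i) and (ii) of Proposition 8.3 (an illustrative sketch, not from the text; the matrices Σ, A, and B are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 2, 8, 200_000

Sigma = np.array([[2.0, 0.7],
                  [0.7, 1.0]])
C = np.linalg.cholesky(Sigma)

# Fixed symmetric A, B for checking cov(<A,S>, <B,S>) = 2n tr(A Sigma B Sigma).
A = np.array([[1.0, 0.3], [0.3, 0.0]])
B = np.array([[0.5, -0.2], [-0.2, 1.0]])

X = rng.standard_normal((reps, n, p)) @ C.T     # rows of each replicate ~ N(0, Sigma)
S = np.einsum('rni,rnj->rij', X, X)             # S = X'X, one p x p matrix per replicate

mean_err = np.abs(S.mean(axis=0) - n * Sigma).max()   # checks E S = n Sigma
a = np.einsum('ij,rij->r', A, S)                # <A, S> = tr(AS) for symmetric A
b = np.einsum('ij,rij->r', B, S)
cov_ratio = np.cov(a, b)[0, 1] / (2 * n * np.trace(A @ Sigma @ B @ Sigma))
print(float(mean_err), float(cov_ratio))
```

Both the empirical mean of S and the empirical covariance of the two linear functionals should agree with the proposition up to simulation error.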
Proposition 8.4. If 𝓛(Sᵢ) = W(Σ, p, nᵢ) for i = 1, 2 and if S₁ and S₂ are independent, then 𝓛(S₁ + S₂) = W(Σ, p, n₁ + n₂).
Proof. An application of (iii) of Proposition 8.3 yields this convolution result. Specifically,
φ(A) = ℰ exp[i⟨A, S₁ + S₂⟩] = ∏ⱼ₌₁² ℰ exp[i⟨A, Sⱼ⟩]
= ∏ⱼ₌₁² |I_p − 2iΣA|^{-nⱼ/2} = |I_p − 2iΣA|^{-(n₁+n₂)/2}.
The uniqueness of characteristic functions shows that 𝓛(S₁ + S₂) = W(Σ, p, n₁ + n₂). □
It should be emphasized that ⟨·, ·⟩ is not what we might call the standard inner product on 𝒮_p when 𝒮_p is regarded as a [p(p + 1)/2]-dimensional coordinate space. For example, if p = 2 and S, T ∈ 𝒮_p, then
⟨S, T⟩ = tr ST = s₁₁t₁₁ + s₂₂t₂₂ + 2s₁₂t₁₂
while the three-dimensional coordinate space inner product between S and
T would be s₁₁t₁₁ + s₂₂t₂₂ + s₁₂t₁₂. In this connection, equation (ii) of Proposition 8.3 means that
cov(⟨A, S⟩, ⟨B, S⟩) = 2n⟨A, (Σ ⊗ Σ)B⟩ = 2n⟨A, ΣBΣ⟩ = 2n tr(AΣBΣ);
that is, (ii) depends on the inner product ⟨·, ·⟩ on 𝒮_p and is not valid for other inner products. In Chapter 3, quadratic forms in normal random vectors were shown to have chi-square distributions under certain conditions. Similar results are available for generalized quadratic forms and the Wishart distribution. The following proposition is not the most general possible, but it suffices in most situations.
Proposition 8.5. Consider X ∈ ℒ_{p,n} where 𝓛(X) = N(μ, Ω ⊗ Σ). Let S = X'PX where P is n × n and positive semidefinite, and write P = A² with A positive semidefinite. If AΩA is a rank k orthogonal projection and if Pμ = 0, then
𝓛(S) = W(Σ, p, k).
Proof. With Y = AX, it is clear that S = Y'Y and
𝓛(Y) = N(Aμ, (AΩA) ⊗ Σ).
Since A and P = A² have the same null space and Pμ = 0, Aμ = 0, so
𝓛(Y) = N(0, (AΩA) ⊗ Σ).
By assumption, B = AΩA is a rank k orthogonal projection. Also, S = Y'Y = Y'BY + Y'(I − B)Y, and 𝓛((I − B)Y) = N(0, 0 ⊗ Σ), so Y'(I − B)Y is zero with probability one. Thus it remains to show that if 𝓛(Y) = N(0, B ⊗ Σ) where B is a rank k orthogonal projection, then S = Y'BY has a W(Σ, p, k) distribution. Without loss of generality (make an orthogonal transformation),
B = (I_k 0; 0 0) : n × n.
Partitioning Y into Y₁ : k × p and Y₂ : (n − k) × p, it follows that S = Y₁'Y₁
and
𝓛(Y₁) = N(0, I_k ⊗ Σ).
Thus 𝓛(S) = W(Σ, p, k). □
Example 8.1. We again return to the multivariate normal linear model introduced in Example 4.4. Consider X ∈ ℒ_{p,n} with
𝓛(X) = N(μ, I_n ⊗ Σ)
where μ is an element of the subspace M ⊆ ℒ_{p,n} defined by
M = {x | x ∈ ℒ_{p,n}, x = ZB, B ∈ ℒ_{p,k}}.
Here, Z is an n × k matrix of rank k and it is assumed that n − k ≥ p. With P_Z = Z(Z'Z)⁻¹Z', P_M = P_Z ⊗ I_p is the orthogonal projection onto M, and Q_M = Q_Z ⊗ I_p, Q_Z = I − P_Z, is the orthogonal projection onto M⊥. We know that
μ̂ = P_M X = (P_Z ⊗ I_p)X = P_Z X
is the maximum likelihood estimator of μ. As demonstrated in Example 4.4, the maximum likelihood estimator of Σ is found by maximizing
|Σ|^{-n/2} exp[-½ tr Σ⁻¹X'Q_Z X].
Since n − k ≥ p, X'Q_Z X has rank p with probability one. When X'Q_Z X has rank p, Example 7.10 shows that
Σ̂ = (1/n) X'Q_Z X
is the maximum likelihood estimator of Σ. The conditions of Proposition 8.5 are easily checked to verify that S = X'Q_Z X has a W(Σ, p, n − k) distribution. In summary, for the multivariate linear model, μ̂ = P_M X and Σ̂ = n⁻¹X'Q_Z X are the maximum likelihood estimators of μ and Σ. Further, μ̂ and Σ̂ are independent.
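The estimators in this example can be sketched numerically (the design Z and the parameter values below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, p = 12, 2, 3

# Hypothetical design Z (n x k, rank k) and parameters; all values are illustrative.
Z = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
B = np.array([[1.0, 0.0, -1.0],
              [0.5, 2.0, 0.3]])                   # k x p coefficient matrix
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])
C = np.linalg.cholesky(Sigma)

X = Z @ B + rng.standard_normal((n, p)) @ C.T     # L(X) = N(ZB, I_n ⊗ Sigma)

P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)           # projection onto the column space of Z
Q_Z = np.eye(n) - P_Z

mu_hat = P_Z @ X                                  # maximum likelihood estimator of mu = ZB
S = X.T @ Q_Z @ X                                 # residual cross-product, W(Sigma, p, n - k)
Sigma_hat = S / n                                 # maximum likelihood estimator of Sigma

# Since E S = (n - k) Sigma, Sigma_hat has expectation ((n - k)/n) Sigma.
print(mu_hat.shape, Sigma_hat.shape)
```

Because n − k = 10 ≥ p = 3 here, S is positive definite with probability one, as the example requires.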
8.2. PARTITIONING A WISHART MATRIX
The partitioning of the Wishart distribution considered here is motivated partly by the transformation described in Proposition 5.8. If 𝓛(S) = W(Σ, p, n) where n ≥ p, partition S as
S = (S₁₁ S₁₂; S₂₁ S₂₂)
where S₂₁ = S₁₂', and let
S₁₁.₂ = S₁₁ − S₁₂S₂₂⁻¹S₂₁.
Here, S_ij is pᵢ × pⱼ for i, j = 1, 2, so p₁ + p₂ = p. The primary result of this section describes the joint distribution of (S₁₁.₂, S₂₁, S₂₂) when Σ is nonsingular. This joint distribution is derived by representing the Wishart distribution in terms of the normal distribution. Since 𝓛(S) = W(Σ, p, n), S = X'X where 𝓛(X) = N(0, I_n ⊗ Σ). Discarding a set of Lebesgue measure zero, X is assumed to take values in 𝒳, the set of all n × p matrices of rank p. With
X = (X₁ X₂),  Xᵢ : n × pᵢ, i = 1, 2,
it is clear that
S_ij = Xᵢ'Xⱼ for i, j = 1, 2.
Thus
S₁₁.₂ = X₁'X₁ − X₁'X₂(X₂'X₂)⁻¹X₂'X₁ = X₁'QX₁
where
Q = I_n − X₂(X₂'X₂)⁻¹X₂' ≡ I_n − P
is an orthogonal projection of rank n − p₂ for each value of X₂ when X ∈ 𝒳. To obtain the desired result for the Wishart distribution, it is useful to first give the joint distribution of (QX₁, PX₁, X₂).
Proposition 8.6. The joint distribution of (QX₁, PX₁, X₂) can be described as follows. Conditional on X₂, QX₁ and PX₁ are independent with
𝓛(QX₁|X₂) = N(0, Q ⊗ Σ₁₁.₂)
and
𝓛(PX₁|X₂) = N(X₂Σ₂₂⁻¹Σ₂₁, P ⊗ Σ₁₁.₂).
Also,
𝓛(X₂) = N(0, I_n ⊗ Σ₂₂).
Proof. From Example 3.1, the conditional distribution of X₁ given X₂, say 𝓛(X₁|X₂), is
𝓛(X₁|X₂) = N(X₂Σ₂₂⁻¹Σ₂₁, I_n ⊗ Σ₁₁.₂).
Thus, conditional on X₂, the random vector
W = (QX₁; PX₁) : (2n) × p₁
is a linear transformation of X₁, so W has a normal distribution with mean vector
(QX₂Σ₂₂⁻¹Σ₂₁; PX₂Σ₂₂⁻¹Σ₂₁) = (0; X₂Σ₂₂⁻¹Σ₂₁),
since QX₂ = 0 and PX₂ = X₂. Also, using the calculational rules for partitioned linear transformations, the covariance of W is the block diagonal matrix
(Q ⊗ Σ₁₁.₂  0; 0  P ⊗ Σ₁₁.₂),
since QP = 0. The conditional independence and the conditional distributions of QX₁ and PX₁ follow immediately. That X₂ has the claimed marginal distribution is obvious. □
Proposition 8.7. Suppose 𝓛(S) = W(Σ, p, n) with n ≥ p and Σ > 0. Partition S into S_ij, i, j = 1, 2, where S_ij is pᵢ × pⱼ, p₁ + p₂ = p, and partition Σ similarly. With S₁₁.₂ = S₁₁ − S₁₂S₂₂⁻¹S₂₁, S₁₁.₂ and (S₂₁, S₂₂) are stochastically independent. Further,
𝓛(S₁₁.₂) = W(Σ₁₁.₂, p₁, n − p₂)
and, conditional on S₂₂,
𝓛(S₂₁|S₂₂) = N(S₂₂Σ₂₂⁻¹Σ₂₁, S₂₂ ⊗ Σ₁₁.₂).
The marginal distribution of S₂₂ is W(Σ₂₂, p₂, n).
Proof. In the notation of Proposition 8.6, consider X ∈ 𝒳 with 𝓛(X) = N(0, I_n ⊗ Σ) and S = X'X. Then S_ij = Xᵢ'Xⱼ for i, j = 1, 2 and S₁₁.₂ = X₁'QX₁. Since PX₂ = X₂ and S₂₁ = X₂'X₁, we see that S₂₁ = (PX₂)'X₁ = X₂'PX₁, and conditional on X₂,
𝓛(S₂₁|X₂) = N(X₂'X₂Σ₂₂⁻¹Σ₂₁, (X₂'X₂) ⊗ Σ₁₁.₂).
To show that S₁₁.₂ and (S₂₁, S₂₂) are independent, it suffices to show that
ℰf(S₁₁.₂)h(S₂₁, S₂₂) = ℰf(S₁₁.₂)ℰh(S₂₁, S₂₂)
for bounded measurable functions f and h with the appropriate domains of definition. Using Proposition 8.6, we argue as follows. For fixed X₂, QX₁ and PX₁ are independent, so S₁₁.₂ = X₁'QX₁ and S₂₁ = X₂'PX₁ are conditionally independent. Also,
𝓛(QX₁|X₂) = N(0, Q ⊗ Σ₁₁.₂)
and Q is a rank n − p₂ orthogonal projection. By Proposition 8.5,
𝓛(X₁'QX₁|X₂) = W(Σ₁₁.₂, p₁, n − p₂) for each X₂, so X₁'QX₁ and X₂ are independent. Conditioning on X₂, we have
ℰf(S₁₁.₂)h(S₂₁, S₂₂) = ℰ[ℰ(f(X₁'QX₁)h(X₂'PX₁, X₂'X₂)|X₂)]
= ℰ[ℰ(f(X₁'QX₁)|X₂)ℰ(h(X₂'PX₁, X₂'X₂)|X₂)]
= ℰ[ℰf(X₁'QX₁)ℰ(h(X₂'PX₁, X₂'X₂)|X₂)]
= ℰf(X₁'QX₁)ℰ[ℰ(h(X₂'PX₁, X₂'X₂)|X₂)]
= ℰf(X₁'QX₁)ℰh(X₂'PX₁, X₂'X₂)
= ℰf(S₁₁.₂)ℰh(S₂₁, S₂₂).
Therefore, S₁₁.₂ and (S₂₁, S₂₂) are stochastically independent. To describe
the joint distribution of S₂₁ and S₂₂, again condition on X₂. Then
𝓛(S₂₁|X₂) = N(X₂'X₂Σ₂₂⁻¹Σ₂₁, (X₂'X₂) ⊗ Σ₁₁.₂)
and this conditional distribution depends on X₂ only through S₂₂ = X₂'X₂. Thus
𝓛(S₂₁|S₂₂) = N(S₂₂Σ₂₂⁻¹Σ₂₁, S₂₂ ⊗ Σ₁₁.₂).
That S₂₂ has the claimed marginal distribution is obvious. □
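A simulation sketch of the degrees-of-freedom reduction in Proposition 8.7 (not from the text), using the fact that a W(Σ₁₁.₂, p₁, n − p₂) matrix has mean (n − p₂)Σ₁₁.₂; the Σ below is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
p1, p2, n, reps = 2, 1, 9, 50_000
p = p1 + p2

Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
C = np.linalg.cholesky(Sigma)
# Sigma_{11.2} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
S11_2 = Sigma[:p1, :p1] - Sigma[:p1, p1:] @ np.linalg.solve(Sigma[p1:, p1:], Sigma[p1:, :p1])

acc = np.zeros((p1, p1))
for _ in range(reps):
    X = rng.standard_normal((n, p)) @ C.T
    S = X.T @ X
    acc += S[:p1, :p1] - S[:p1, p1:] @ np.linalg.solve(S[p1:, p1:], S[p1:, :p1])

# If S_{11.2} ~ W(Sigma_{11.2}, p1, n - p2), its mean is (n - p2) Sigma_{11.2}.
ratio = (acc / reps) / ((n - p2) * S11_2)
print(np.round(ratio, 2))
```

The entrywise ratio should be close to one, reflecting the loss of p₂ degrees of freedom.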
By simply permuting the indices in Proposition 8.7, we obtain the following proposition.
Proposition 8.8. With the notation and assumptions of Proposition 8.7, let S₂₂.₁ = S₂₂ − S₂₁S₁₁⁻¹S₁₂. Then S₂₂.₁ and (S₁₁, S₁₂) are stochastically independent and
𝓛(S₂₂.₁) = W(Σ₂₂.₁, p₂, n − p₁).
Conditional on S₁₁,
𝓛(S₁₂|S₁₁) = N(S₁₁Σ₁₁⁻¹Σ₁₂, S₁₁ ⊗ Σ₂₂.₁)
and the marginal distribution of S₁₁ is W(Σ₁₁, p₁, n).
Proposition 8.7 is one of the most useful results for deriving distributions of functions of Wishart matrices. Applications occur in this and the remaining chapters. For example, the following assertion provides a simple proof of the distribution of Hotelling's T², discussed in the next chapter.
Proposition 8.9. Suppose S₀ has a nonsingular Wishart distribution, say W(Σ, p, n), and let A be an r × p matrix of rank r. Then
𝓛((AS₀⁻¹A')⁻¹) = W((AΣ⁻¹A')⁻¹, r, n − p + r).
Proof. First, an invariance argument shows that it is sufficient to consider the case when Σ = I_p. More precisely, write Σ = B² with B > 0 and let C = AB⁻¹. With S = B⁻¹S₀B⁻¹, 𝓛(S) = W(I, p, n) and the assertion is that
𝓛((CS⁻¹C')⁻¹) = W((CC')⁻¹, r, n − p + r).
Now, let Ψ = C'(CC')^{-1/2}, so the assertion becomes
𝓛((Ψ'S⁻¹Ψ)⁻¹) = W(I_r, r, n − p + r).
However, Ψ is p × r and satisfies Ψ'Ψ = I_r; that is, Ψ is a linear isometry. Since 𝓛(Γ'SΓ) = 𝓛(S) for all Γ ∈ 𝒪_p,
𝓛((Ψ'S⁻¹Ψ)⁻¹) = 𝓛((Ψ'Γ'S⁻¹ΓΨ)⁻¹).
Choose Γ so that
ΓΨ = (I_r; 0) : p × r.
For this choice of Γ, the matrix (Ψ'Γ'S⁻¹ΓΨ)⁻¹ is just the inverse of the r × r upper left corner of S⁻¹, and this matrix is
S₁₁ − S₁₂S₂₂⁻¹S₂₁ ≡ V
where V is r × r. By Proposition 8.7,
𝓛(V) = W(I_r, r, n − p + r)
since 𝓛(S) = W(I, p, n). This establishes the assertion of the proposition. □
When r = 1 in Proposition 8.9, the matrix A' is a nonzero vector, say A' = a ∈ R^p. In this case,
𝓛(a'Σ⁻¹a / a'S⁻¹a) = χ²_{n−p+1}
when 𝓛(S) = W(Σ, p, n).
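The chi-square representation for r = 1 can be checked by simulation (an illustrative sketch, not from the text; Σ and a are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 3, 7, 40_000

Sigma = np.array([[1.5, 0.4, 0.0],
                  [0.4, 1.0, 0.3],
                  [0.0, 0.3, 2.0]])
C = np.linalg.cholesky(Sigma)
a = np.array([1.0, -2.0, 0.5])
num = a @ np.linalg.solve(Sigma, a)                # a' Sigma^{-1} a

vals = np.empty(reps)
for i in range(reps):
    X = rng.standard_normal((n, p)) @ C.T
    S = X.T @ X
    vals[i] = num / (a @ np.linalg.solve(S, a))    # a'Sigma^{-1}a / a'S^{-1}a

# The ratio should be chi-square with n - p + 1 = 5 degrees of freedom:
# sample mean near 5 and sample variance near 10.
print(float(vals.mean()), float(vals.var()))
```

Note the loss of p − 1 degrees of freedom relative to the naive guess χ²_n.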
Another decomposition result for the Wishart distribution, which is sometimes useful, follows.
Lemma 8.10. Suppose S has a nonsingular Wishart distribution, say 𝓛(S) = W(Σ, p, n), and let S = TT' where T ∈ G_T⁺, the group of lower triangular p × p matrices with positive diagonal elements. Then the density of T with respect to the left invariant measure ν(dT) = dT/∏ᵢ tᵢᵢⁱ is
p(T|Σ) = 2^p ω(n, p)|Σ|^{-n/2}|TT'|^{n/2} exp[-½ tr Σ⁻¹TT'].
If S and T are partitioned as
S = (S₁₁ S₁₂; S₂₁ S₂₂),  T = (T₁₁ 0; T₂₁ T₂₂),
where S_ij is pᵢ × pⱼ and p₁ + p₂ = p, then S₁₁ = T₁₁T₁₁', S₁₂ = T₁₁T₂₁', and S₂₂.₁ = T₂₂T₂₂'. Further, the pair (T₁₁, T₂₁) is independent of T₂₂ and
𝓛(S₂₂.₁) = 𝓛(T₂₂T₂₂') = W(Σ₂₂.₁, p₂, n − p₁),
𝓛(T₂₁'|T₁₁) = N(T₁₁'Σ₁₁⁻¹Σ₁₂, I_{p₁} ⊗ Σ₂₂.₁).
Proof. The expression for the density of T is a consequence of Proposition 7.5, and a bit of algebra shows that S₁₁ = T₁₁T₁₁', S₁₂ = T₁₁T₂₁', and S₂₂.₁ = T₂₂T₂₂'. The independence of (T₁₁, T₂₁) and T₂₂ follows from Proposition 8.8 and the fact that the mapping between S and T is one-to-one and onto. Also,
𝓛(S₁₂|S₁₁) = N(S₁₁Σ₁₁⁻¹Σ₁₂, S₁₁ ⊗ Σ₂₂.₁).
Since S₁₁ and T₁₁ are one-to-one functions of each other and S₁₂ = T₁₁T₂₁',
𝓛(T₁₁T₂₁'|T₁₁) = N(T₁₁T₁₁'Σ₁₁⁻¹Σ₁₂, (T₁₁T₁₁') ⊗ Σ₂₂.₁).
Thus
𝓛(T₂₁'|T₁₁) = N(T₁₁'Σ₁₁⁻¹Σ₁₂, I_{p₁} ⊗ Σ₂₂.₁),
as T₂₁' = T₁₁⁻¹(T₁₁T₂₁') and T₁₁ is fixed. □
Proposition 8.11. Suppose S has a nonsingular Wishart distribution with 𝓛(S) = W(Σ, p, n), and assume that Σ is diagonal with diagonal elements σ₁₁,…, σ_pp. If S = TT' with T ∈ G_T⁺, then the random variables {t_ij | i ≥ j} are mutually independent and
𝓛(t_ij) = N(0, σᵢᵢ) for i > j
and
𝓛(tᵢᵢ²) = σᵢᵢ χ²_{n−i+1}.
Proof. First, partition S, Σ, and T as
S = (s₁₁ S₁₂; S₂₁ S₂₂),  Σ = (σ₁₁ 0; 0 Σ₂₂),  T = (t₁₁ 0; T₂₁ T₂₂),
where s₁₁ is 1 × 1. Since Σ₁₂ = 0, the conditional distribution of T₂₁ given t₁₁ does not depend on t₁₁, and Σ₂₂ has diagonal elements σ₂₂,…, σ_pp. It follows from Lemma 8.10 that t₁₁, T₂₁, and T₂₂ are mutually independent and
𝓛(T₂₁) = N(0, Σ₂₂).
The elements of T₂₁ are t₂₁, t₃₁,…, t_p1, and since Σ₂₂ is diagonal, these are independent with
𝓛(tᵢ₁) = N(0, σᵢᵢ),  i = 2,…, p.
Also,
𝓛(t₁₁²) = 𝓛(s₁₁) = σ₁₁χ²_n
and
𝓛(S₂₂.₁) = 𝓛(T₂₂T₂₂') = W(Σ₂₂, p − 1, n − 1).
The conclusion of the proposition follows by an induction argument on the dimension parameter p. □
When 𝓛(S) = W(Σ, p, n) is a nonsingular Wishart distribution, the random variable |S| is called the generalized variance. The distribution of |S| is easily derived using Proposition 8.11. First, write Σ = B² with B > 0 and let S₁ = B⁻¹SB⁻¹. Then 𝓛(S₁) = W(I, p, n) and |S| = |Σ||S₁|. Also, if TT' = S₁ with T ∈ G_T⁺, then 𝓛(tᵢᵢ²) = χ²_{n−i+1} for i = 1,…, p, and t₁₁,…, t_pp are mutually independent. Thus
𝓛(|S|) = 𝓛(|Σ||S₁|) = 𝓛(|Σ| ∏ᵢ₌₁ᵖ tᵢᵢ²) = 𝓛(|Σ| ∏ᵢ₌₁ᵖ χ²_{n−i+1}).
Therefore, the distribution of |S| is the same as that of the constant |Σ| times a product of p independent chi-square random variables with n − i + 1 degrees of freedom for i = 1,…, p.
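A quick numerical check of the generalized variance representation when Σ = I, in which case ℰ|S| = ∏ᵢ₌₁ᵖ (n − i + 1) (an illustrative sketch, not from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 3, 8, 60_000

dets = np.empty(reps)
for i in range(reps):
    X = rng.standard_normal((n, p))
    dets[i] = np.linalg.det(X.T @ X)       # |S| with S ~ W(I_p, p, n)

expected = 1.0
for i in range(1, p + 1):
    expected *= (n - i + 1)                # E chi2_{n-i+1} = n - i + 1

print(float(dets.mean() / expected))
```

For p = 3 and n = 8 the expected value is 8 · 7 · 6 = 336, and the sample mean of the determinants should be close to it.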
8.3. THE NONCENTRAL WISHART DISTRIBUTION
Just as the Wishart distribution is a matrix generalization of the chi-square distribution, the noncentral Wishart distribution is a matrix analog of the noncentral chi-square distribution. Also, the noncentral Wishart distribution arises in a natural way in the study of distributional properties of test statistics in multivariate analysis.
Definition 8.2. Let X ∈ ℒ_{p,n} have a normal distribution N(μ, I_n ⊗ Σ). A random matrix S ∈ 𝒮_p⁺ has a noncentral Wishart distribution with parameters Σ, p, n, and Δ = μ'μ if 𝓛(S) = 𝓛(X'X). In this case, we write
𝓛(S) = W(Σ, p, n; Δ).
In this definition, it is not obvious that the distribution of X'X depends on μ only through Δ = μ'μ. However, an invariance argument establishes this. The group 𝒪_n acts on ℒ_{p,n} by sending x into Γx for x ∈ ℒ_{p,n} and Γ ∈ 𝒪_n. A maximal invariant under this action is x'x. When 𝓛(X) = N(μ, I_n ⊗ Σ), 𝓛(ΓX) = N(Γμ, I_n ⊗ Σ), and we know the distribution of X'X depends only on a maximal invariant parameter. But the group action on the parameter space is (μ, Σ) → (Γμ, Σ) and a maximal invariant is obviously (μ'μ, Σ). Thus the distribution of X'X depends only on (μ'μ, Σ).
When Δ = 0, the noncentral Wishart distribution is just the W(Σ, p, n) distribution. Let X₁',…, Xₙ' be the rows of X in the above definition, so X₁,…, Xₙ are independent and 𝓛(Xᵢ) = N(μᵢ, Σ) where μ₁',…, μₙ' are the rows of μ. Obviously,
𝓛(XᵢXᵢ') = W(Σ, p, 1; Δᵢ)
where Δᵢ = μᵢμᵢ'. Thus Sᵢ = XᵢXᵢ', i = 1,…, n, are independent and it is clear that, if S = X'X, then
𝓛(S) = 𝓛(Σᵢ₌₁ⁿ Sᵢ).
In other words, the noncentral Wishart distribution with n degrees of freedom can be represented as the convolution of n noncentral Wishart distributions, each with one degree of freedom. This argument shows that, if 𝓛(Sᵢ) = W(Σ, p, nᵢ; Δᵢ) for i = 1, 2 and if S₁ and S₂ are independent, then 𝓛(S₁ + S₂) = W(Σ, p, n₁ + n₂; Δ₁ + Δ₂). Since
ℰXᵢXᵢ' = Σ + μᵢμᵢ',
it follows that
ℰS = nΣ + Δ
when 𝓛(S) = W(Σ, p, n; Δ). Also,
Cov(S) = Σᵢ₌₁ⁿ Cov(Sᵢ),
but an explicit expression for Cov(Sᵢ) is not needed here. As with the central Wishart distribution, it is not difficult to prove that, when 𝓛(S) = W(Σ, p, n; Δ), S is positive definite with probability one iff n ≥ p and Σ > 0. Further, it is clear that if 𝓛(S) = W(Σ, p, n; Δ) and A is an r × p matrix, then 𝓛(ASA') = W(AΣA', r, n; AΔA'). The next result provides an expression for the density function of S in a special case.
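Before turning to the density, the moment formula ℰS = nΣ + Δ can be checked by simulation (a sketch, not from the text; the mean matrix μ and Σ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 2, 6, 80_000

Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
C = np.linalg.cholesky(Sigma)
mu = np.zeros((n, p))                       # illustrative n x p mean matrix
mu[0] = [1.0, -1.0]
mu[1] = [0.5, 0.0]
mu[5] = [2.0, 1.0]
Delta = mu.T @ mu                           # noncentrality parameter mu'mu

acc = np.zeros((p, p))
for _ in range(reps):
    X = mu + rng.standard_normal((n, p)) @ C.T    # L(X) = N(mu, I_n ⊗ Sigma)
    acc += X.T @ X

ratio = (acc / reps) / (n * Sigma + Delta)
print(np.round(ratio, 2))
```

The entrywise ratio of the empirical mean of S = X'X to nΣ + Δ should be close to one.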
Proposition 8.12. Suppose 𝓛(S) = W(Σ, p, n; Δ) where n ≥ p and Σ > 0, and assume that Δ has rank one, say Δ = ηη' with η ∈ R^p. The density of S with respect to ν(dS) = dS/|S|^{(p+1)/2} is given by
p₁(S|Σ, Δ) = p(S|Σ) exp[-½ η'Σ⁻¹η] H((η'Σ⁻¹SΣ⁻¹η)^{1/2})
where p(S|Σ) is the density of a W(Σ, p, n) distribution given in Proposition 8.2 and the function H is defined in Example 7.13.
Proof. Consider X ∈ ℒ_{p,n} with 𝓛(X) = N(μ, I_n ⊗ Σ) where μ ∈ ℒ_{p,n} and μ'μ = Δ. Since S = X'X is a maximal invariant under the action of 𝒪_n on ℒ_{p,n}, the results of Example 7.15 show that the density of S with respect to the measure ν₀(dS) = (√2π)^{np}ω(n, p)|S|^{(n−p−1)/2} dS is
h(S) = ∫ f(ΓX)μ₀(dΓ),
where X is any point with X'X = S. Here, f is the density of X and μ₀ is the unique invariant probability measure on 𝒪_n. The density of X is
f(X) = (√2π)^{-np}|Σ|^{-n/2} exp[-½ tr(X − μ)Σ⁻¹(X − μ)'].
Substituting this into the expression for h(S) and doing a bit of algebra shows that the density p₁(S|Σ, Δ) with respect to ν is
p₁(S|Σ, Δ) = p(S|Σ) exp[-½ tr Σ⁻¹Δ] ∫ exp[tr ΓXΣ⁻¹μ'] μ₀(dΓ).
The problem is now to evaluate the integral over 𝒪_n. It is here that we use the assumption that Δ has rank one. Since Δ = μ'μ, μ must have rank one, so μ = ξη' where ξ ∈ Rⁿ, ‖ξ‖ = 1, and η ∈ R^p. Since ‖ξ‖ = 1, ξ = Γ₁ε₁ for some Γ₁ ∈ 𝒪_n, where ε₁ ∈ Rⁿ is the first standard unit vector. Setting u = (η'Σ⁻¹SΣ⁻¹η)^{1/2}, XΣ⁻¹η = uΓ₂ε₁ for some Γ₂ ∈ 𝒪_n, as uε₁ and XΣ⁻¹η have the same length. Therefore,
∫ exp[tr ΓXΣ⁻¹μ'] μ₀(dΓ) = ∫ exp[ξ'ΓXΣ⁻¹η] μ₀(dΓ)
= ∫ exp[uε₁'Γ₁'ΓΓ₂ε₁] μ₀(dΓ) = ∫ exp[uε₁'Γε₁] μ₀(dΓ)
= ∫ exp[uγ₁₁] μ₀(dΓ) = H(u).
The right and left invariance of μ₀ was used in the third to the last equality, and γ₁₁ is the (1, 1) element of Γ. The function H was evaluated in Example 7.13. Therefore, when Δ = ηη',
p₁(S|Σ, Δ) = p(S|Σ) exp[-½ η'Σ⁻¹η] H((η'Σ⁻¹SΣ⁻¹η)^{1/2}). □
The final result of this section is the analog of Proposition 8.5 for the noncentral Wishart distribution.
Proposition 8.13. Consider X ∈ ℒ_{p,n} where 𝓛(X) = N(μ, Ω ⊗ Σ) and let S = X'PX where P ≥ 0 is n × n. Write P = A² with A ≥ 0. If B = AΩA is a rank k orthogonal projection and if AΩPμ = Aμ, then
𝓛(S) = W(Σ, p, k; μ'Pμ).
Proof. The proof of this result is quite similar to that of Proposition 8.5 and is left to the reader. □
It should be noted that there is no analog of Proposition 8.7 for the noncentral Wishart distribution, at least as far as I know. Certainly, Proposition 8.7 is false as stated when S is noncentral Wishart.
8.4. DISTRIBUTIONS RELATED TO LIKELIHOOD RATIO TESTS
In the next two chapters, statistics that are ratios of determinants of Wishart matrices arise as test statistics related to likelihood ratio tests.
Since the techniques for deriving the distributions of these statistics are intimately connected with properties of the Wishart distribution, we have chosen to treat this topic here rather than interrupt the flow of the succeeding chapters with such considerations. Let X ∈ ℒ_{p,m} and S ∈ 𝒮_p⁺ be independent and suppose that 𝓛(X) = N(μ, I_m ⊗ Σ) and 𝓛(S) = W(Σ, p, n) where n ≥ p and Σ > 0. We are interested in deriving the distribution of the random variable
U = |S| / |S + X'X|
for some special values of the mean matrix μ of X. The argument below shows that the distribution of U depends on (μ, Σ) only through Σ^{-1/2}μ'μΣ^{-1/2}, where Σ^{1/2} is the positive definite square root of Σ. Let S₁ = Σ^{-1/2}SΣ^{-1/2} and Y = XΣ^{-1/2}. Then S₁ and Y are independent, 𝓛(S₁) = W(I, p, n), and 𝓛(Y) = N(μΣ^{-1/2}, I_m ⊗ I_p). Also,
U = |S|/|S + X'X| = |S₁|/|S₁ + Y'Y|.
However, the discussion of the previous section shows that Y'Y has a noncentral Wishart distribution, say 𝓛(Y'Y) = W(I, p, m; Δ) where Δ = Σ^{-1/2}μ'μΣ^{-1/2}. In the following discussion we take Σ = I_p and denote the distribution of U by
𝓛(U) = U(n, m, p; Δ)
where Δ = μ'μ. When μ = 0, the notation
𝓛(U) = U(n, m, p)
is used. In the case that p = 1,
U = s/(s + X'X)
where 𝓛(s) = χ²_n. Since 𝓛(X) = N(μ, I_m), 𝓛(X'X) = χ²_m(Δ) where Δ = μ'μ ≥ 0. Thus
U = 1/(1 + χ²_m(Δ)/χ²_n).
When χ²_m(Δ) and χ²_n are independent, the distribution of the ratio
F(m, n; Δ) = χ²_m(Δ)/χ²_n
is called a noncentral F distribution with parameters m, n, and Δ. When Δ = 0, the distribution F(m, n; 0) is denoted by F_{m,n} and is simply called an F distribution with (m, n) degrees of freedom. It should be noted that this usage is not standard, as the above ratio has not been normalized by the constant n/m. At times, the relationship between the F distribution and the beta distribution is useful. It is not difficult to show that, when χ²_n and χ²_m are independent, the random variable
V = χ²_n/(χ²_n + χ²_m)
has a beta distribution with parameters n/2 and m/2, and this is written as 𝓛(V) = 𝔅(n/2, m/2). In other words, V has a density on (0, 1) given by
f(v) = [Γ(α + β)/(Γ(α)Γ(β))] v^{α−1}(1 − v)^{β−1}
where α = n/2 and β = m/2. More generally, the distribution of the random variable
V(Δ) = χ²_n/(χ²_n + χ²_m(Δ))
is called a noncentral beta distribution, and the notation 𝓛(V(Δ)) = 𝔅(n/2, m/2; Δ) is used. In summary, when p = 1,
𝓛(U) = 𝔅(n/2, m/2; Δ)
where Δ = μ'μ ≥ 0.
Now, we consider the distribution of U when m = 1. In this case, 𝓛(X') = N(μ', I_p) where X' ∈ R^p, and
U = |S|/|S + X'X| = |I_p + S⁻¹X'X|⁻¹ = (1 + XS⁻¹X')⁻¹.
The last equality follows from Proposition 1.35.
Proposition 8.14. When m = 1,
𝓛(U) = 𝔅((n − p + 1)/2, p/2; δ)
where δ = μμ' ≥ 0.
Proof. It must be shown that
𝓛(XS⁻¹X') = F(p, n − p + 1; δ).
For X fixed, X ≠ 0, Proposition 8.9 shows that
𝓛(XX'/XS⁻¹X') = χ²_{n−p+1}
when 𝓛(S) = W(I, p, n). Since this distribution does not depend on X, we have that (XX')/(XS⁻¹X') and XX' are independent. Further,
𝓛(XX') = χ²_p(δ)
since 𝓛(X') = N(μ', I_p). Thus
𝓛(XS⁻¹X') = 𝓛(XX' / [XX'/XS⁻¹X']) = F(p, n − p + 1; δ). □
The next step in studying 𝓛(U) is the case when m > 1, p > 1, but μ = 0.
Proposition 8.15. Suppose X and S are independent where 𝓛(S) = W(I, p, n) and 𝓛(X) = N(0, I_m ⊗ I_p). Then
𝓛(U) = 𝓛(∏ᵢ₌₁ᵐ Uᵢ)
where U₁,…, U_m are independent and 𝓛(Uᵢ) = 𝔅((n − p + i)/2, p/2).
Proof. The proof is by induction on m and, when m = 1, we know
𝓛(U) = 𝔅((n − p + 1)/2, p/2).
Since X'X = Σᵢ₌₁ᵐ XᵢXᵢ' where X has rows X₁',…, X_m',
U = |S|/|S + X'X| = [|S|/|S + X₁X₁'|] × [|S + X₁X₁'|/|S + X₁X₁' + Σᵢ₌₂ᵐ XᵢXᵢ'|].
The first claim is that
U₁ = |S|/|S + X₁X₁'|
and
W = |S + X₁X₁'|/|S + X₁X₁' + Σᵢ₌₂ᵐ XᵢXᵢ'|
are independent random variables. Since X₁,…, X_m are independent and independent of S, to show U₁ and W are independent, it suffices to show that U₁ and S + X₁X₁' are independent. To do this, Proposition 7.19 is applicable. The group Gl_p acts on (S, X₁) by
A(S, X₁) = (ASA', AX₁)
and the induced group action on T = S + X₁X₁' sends T into ATA'. The induced group action is clearly transitive. Obviously, T is an equivariant function and U₁ is an invariant function under the group action on (S, X₁). That T is a sufficient statistic for the parametric family generated by Gl_p and the fixed joint distribution of (S, X₁) is easily checked via the factorization criterion. By Proposition 7.19, U₁ and S + X₁X₁' are independent. Therefore,
𝓛(U) = 𝓛(U₁W)
where U₁ and W are independent and
𝓛(U₁) = 𝔅((n − p + 1)/2, p/2).
However, 𝓛(S + X₁X₁') = W(I, p, n + 1) and the induction hypothesis applied to W yields
𝓛(W) = 𝓛(∏ᵢ₌₁^{m−1} Wᵢ)
where W₁,…, W_{m−1} are independent with
𝓛(Wᵢ) = 𝔅((n + 1 − p + i)/2, p/2).
Setting Uᵢ = Wᵢ₋₁ for i = 2,…, m, we have
𝓛(U) = 𝓛(∏ᵢ₌₁ᵐ Uᵢ)
where U₁,…, U_m are independent and
𝓛(Uᵢ) = 𝔅((n − p + i)/2, p/2). □
The above proof shows that the Uᵢ are given by
Uᵢ = |S + Σⱼ₌₁^{i−1} XⱼXⱼ'| / |S + Σⱼ₌₁^{i} XⱼXⱼ'|,  i = 1,…, m,
and that these random variables are independent. Since 𝓛(S + Σⱼ₌₁^{i−1} XⱼXⱼ') = W(I, p, n + i − 1), Proposition 8.14 yields
𝓛(Uᵢ) = 𝔅((n − p + i)/2, p/2).
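The product-of-betas representation gives, in particular, ℰU = ∏ᵢ₌₁ᵐ (n − p + i)/(n + i), which is easy to check by simulation (an illustrative sketch with small n, m, p; not from the text):

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, m, reps = 2, 6, 3, 60_000

us = np.empty(reps)
for i in range(reps):
    Y = rng.standard_normal((n, p))
    S = Y.T @ Y                                    # S ~ W(I_p, p, n)
    X = rng.standard_normal((m, p))                # mu = 0
    us[i] = np.linalg.det(S) / np.linalg.det(S + X.T @ X)

# E U = prod_{i=1}^m (n - p + i)/(n + i), using E Beta(a, b) = a/(a + b).
expected = 1.0
for i in range(1, m + 1):
    expected *= (n - p + i) / (n + i)

print(float(us.mean() / expected))
```

The ratio of the simulated mean of U to the product of beta means should be close to one.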
In the special case that Δ has rank one, the distribution of U can be derived by an argument similar to that in the proof of Proposition 8.15.
Proposition 8.16. Suppose X and S are independent where 𝓛(S) = W(I, p, n) and 𝓛(X) = N(μ, I_m ⊗ I_p). Assume that μ = ξη' with ξ ∈ R^m, ‖ξ‖ = 1, and η ∈ R^p. Then
𝓛(U) = 𝓛(∏ᵢ₌₁ᵐ Uᵢ)
where U₁,…, U_m are independent,
𝓛(Uᵢ) = 𝔅((n − p + i)/2, p/2),  i = 1,…, m − 1,
and
𝓛(U_m) = 𝔅((n − p + m)/2, p/2; η'η).
Proof. Let ε_m be the m-th standard unit vector in R^m. Then Γξ = ε_m for some Γ ∈ 𝒪_m since ‖ξ‖ = 1. Since
|S|/|S + X'X| = |S|/|S + X'Γ'ΓX|
and 𝓛(ΓX) = N(ε_mη', I_m ⊗ I_p), we can take ξ = ε_m without loss of generality. As in the proof of Proposition 8.15, X'X = Σᵢ₌₁ᵐ XᵢXᵢ' where X₁,…, X_m are independent. Obviously, 𝓛(Xᵢ) = N(0, I_p), i = 1,…, m − 1, and 𝓛(X_m) = N(η, I_p). Now, write U = ∏ᵢ₌₁ᵐ Uᵢ where
Uᵢ = |S + Σⱼ₌₁^{i−1} XⱼXⱼ'| / |S + Σⱼ₌₁^{i} XⱼXⱼ'|,  i = 1,…, m.
The argument given in the proof of Proposition 8.15 shows that
U₁ = |S|/|S + X₁X₁'|
and (S + X₁X₁', X₂,…, X_m) are independent. The assumption that X₁ has mean zero is essential here in order to verify the sufficiency condition necessary to apply Proposition 7.19. Since U₂,…, U_m are functions of (S + X₁X₁', X₂,…, X_m), U₁ is independent of (U₂,…, U_m). Now, we simply repeat this argument m − 1 times to conclude that U₁,…, U_m are independent, keeping in mind that X₁,…, X_{m−1} all have mean zero, but X_m need not have mean zero. As noted earlier,
𝓛(Uᵢ) = 𝔅((n − p + i)/2, p/2),  i = 1,…, m − 1.
By Proposition 8.14,
𝓛(U_m) = 𝓛(|S + Σⱼ₌₁^{m−1} XⱼXⱼ'| / |S + Σⱼ₌₁^{m} XⱼXⱼ'|) = 𝔅((n − p + m)/2, p/2; η'η). □
Now, we return to the case when μ = 0. In terms of the notation 𝓛(U) = U(n, m, p), Proposition 8.14 asserts that
U(n, 1, p) = 𝔅((n − p + 1)/2, p/2).
Further, Proposition 8.15 can be written
U(n, m, p) = ∏ᵢ₌₁ᵐ U(n + i − 1, 1, p)
where this equation means that the distribution U(n, m, p) can be represented as the distribution of the product of m independent random variables with distributions U(n + i − 1, 1, p) for i = 1,…, m. An alternative representation of U(n, m, p) in terms of p independent random variables when m ≥ p follows. If m ≥ p and
U = |S|/|S + X'X|
with 𝓛(S) = W(I, p, n) and 𝓛(X) = N(0, I_m ⊗ I_p), the matrix T = X'X has a nonsingular Wishart distribution, 𝓛(T) = W(I, p, m). The following technical result provides the basic step for decomposing U(n, m, p) into a product of p independent factors.
Proposition 8.17. Partition S into S_ij where S_ij is pᵢ × pⱼ, i, j = 1, 2, and p₁ + p₂ = p. Partition T similarly and let
Z = S₁₂'S₁₁⁻¹S₁₂ + T₁₂'T₁₁⁻¹T₁₂ − (S₁₂ + T₁₂)'(S₁₁ + T₁₁)⁻¹(S₁₂ + T₁₂).
Then the five random vectors S₁₁, T₁₁, S₂₂.₁, T₂₂.₁, and Z are mutually independent. Further,
𝓛(Z) = W(I, p₂, p₁).
Proof. Since S and T are independent by assumption, (S₁₁, S₁₂, S₂₂.₁) and (T₁₁, T₁₂, T₂₂.₁) are independent. Also, Proposition 8.8 shows that (S₁₁, S₁₂) and S₂₂.₁ are independent with
𝓛(S₂₂.₁) = W(I, p₂, n − p₁),
𝓛(S₁₂|S₁₁) = N(0, S₁₁ ⊗ I_{p₂}),
and
𝓛(S₁₁) = W(I, p₁, n).
Similar remarks hold for (T₁₁, T₁₂) and T₂₂.₁ with n replaced by m. Thus the
four random vectors (S₁₁, S₁₂), S₂₂.₁, (T₁₁, T₁₂), and T₂₂.₁ are mutually independent. Since Z is a function of (S₁₁, S₁₂) and (T₁₁, T₁₂), the proposition follows if we show that Z is independent of the vector (S₁₁, T₁₁). Conditional on (S₁₁, T₁₁),
𝓛((S₁₂; T₁₂)|(S₁₁, T₁₁)) = N(0, (S₁₁ 0; 0 T₁₁) ⊗ I_{p₂}).
Let A (respectively, B) be the positive definite square root of S₁₁ (respectively, T₁₁). With V = A⁻¹S₁₂ and W = B⁻¹T₁₂,
𝓛((V; W)|(S₁₁, T₁₁)) = N(0, I_{2p₁} ⊗ I_{p₂}).
Also,
Z = S₁₂'S₁₁⁻¹S₁₂ + T₁₂'T₁₁⁻¹T₁₂ − (S₁₂ + T₁₂)'(S₁₁ + T₁₁)⁻¹(S₁₂ + T₁₂)
= (V; W)'(V; W) − (V; W)'(A; B)(A² + B²)⁻¹(A; B)'(V; W)
= (V; W)'Q(V; W)
where
Q = I_{2p₁} − (A; B)(A² + B²)⁻¹(A; B)'.
However, Q is easily shown to be an orthogonal projection of rank p₁. By Proposition 8.5,
𝓛(Z|(S₁₁, T₁₁)) = W(I, p₂, p₁)
for each value of (S₁₁, T₁₁). Therefore, Z is independent of (S₁₁, T₁₁) and the proof is complete. □
Proposition 8.18. If m ≥ p, then
U(n, m, p) = ∏ᵢ₌₁ᵖ U(n − p + i, m, 1).
Proof. By definition,
U(n, m, p) = 𝓛(|S|/|S + T|)
where S and T are independent, 𝓛(T) = W(I, p, m), and 𝓛(S) = W(I, p, n)
with n ≥ p. In the notation of Proposition 8.17, partition S and T with p₁ = 1 and p₂ = p − 1. Then s₁₁, t₁₁, S₂₂.₁, T₂₂.₁, and
Z = S₁₂'s₁₁⁻¹S₁₂ + T₁₂'t₁₁⁻¹T₁₂ − (S₁₂ + T₁₂)'(s₁₁ + t₁₁)⁻¹(S₁₂ + T₁₂)
are mutually independent. However,
|S| = s₁₁|S₂₂.₁|
and
|S + T| = (s₁₁ + t₁₁)|(S + T)₂₂.₁| = (s₁₁ + t₁₁)|S₂₂.₁ + T₂₂.₁ + Z|.
Thus
|S|/|S + T| = [s₁₁/(s₁₁ + t₁₁)] × [|S₂₂.₁|/|S₂₂.₁ + T₂₂.₁ + Z|]
and the two factors on the right side of this equality are independent by Proposition 8.17. Obviously,
𝓛(s₁₁/(s₁₁ + t₁₁)) = U(n, m, 1).
Since 𝓛(T₂₂.₁) = W(I, p − 1, m − 1), 𝓛(Z) = W(I, p − 1, 1), and T₂₂.₁ and Z are independent, it follows that
𝓛(T₂₂.₁ + Z) = W(I, p − 1, m).
Therefore,
𝓛(|S₂₂.₁|/|S₂₂.₁ + T₂₂.₁ + Z|) = U(n − 1, m, p − 1),
which implies the relation
U(n, m, p) = U(n, m, 1)·U(n − 1, m, p − 1).
Now, an easy induction argument establishes
U(n, m, p) = ∏ᵢ₌₁ᵖ U(n − i + 1, m, 1),
which implies that
U(n, m, p) = ∏ᵢ₌₁ᵖ U(n − p + i, m, 1),
and this completes the proof. □
Combining Propositions 8.15 and 8.18 leads to the following.
Proposition 8.19. If m ≥ p, then
U(n, m, p) = U(n − p + m, p, m).
Proof. For arbitrary m, Proposition 8.15 yields
U(n, m, p) = ∏ᵢ₌₁ᵐ 𝔅((n − p + i)/2, p/2)
where this notation means that the distribution U(n, m, p) can be represented as the product of m independent beta random variables, the factors in the product having 𝔅((n − p + i)/2, p/2) distributions. Since
U(n − p + i, m, 1) = 𝔅((n − p + i)/2, m/2),
Proposition 8.18 implies that
U(n, m, p) = ∏ᵢ₌₁ᵖ U(n − p + i, m, 1) = ∏ᵢ₌₁ᵖ 𝔅((n − p + i)/2, m/2).
Applying Proposition 8.15 to U(n − p + m, p, m) yields
U(n − p + m, p, m) = ∏ᵢ₌₁ᵖ 𝔅((n − p + m − m + i)/2, m/2) = ∏ᵢ₌₁ᵖ 𝔅((n − p + i)/2, m/2),
which is the distribution U(n, m, p). □
In practice, the relationship U(n, m, p) = U(n − p + m, p, m) shows that it is sufficient to deal with the case m ≤ p when tabulating the distribution U(n, m, p). Rather accurate approximations to the percentage points of the distribution U(n, m, p) are available; these are discussed in detail in Anderson (1958, Chapter 8). This topic is not pursued further here.
PROBLEMS
1. Suppose S is W(Σ, 2, n), n ≥ 2, Σ > 0. Show that the density of r = s₁₂/√(s₁₁s₂₂) can be written as

p(r|ρ) = c(n)(1 − ρ²)^{n/2}(1 − r²)^{(n−3)/2}ψ(ρr)

where ρ = σ₁₂/√(σ₁₁σ₂₂), c(n) is a normalizing constant depending only on n, and ψ is defined as follows. Let X₁ and X₂ be independent chi-square random variables, each with n degrees of freedom. Then ψ(t) = E exp[t(X₁X₂)^{1/2}] for |t| < 1. Using this representation, prove that p(r|ρ) has a monotone likelihood ratio.
2. The gamma distribution with parameters α > 0 and λ > 0, denoted by G(α, λ), has the density

p(x|α, λ) = x^{α−1} exp[−x/λ]/(λ^α Γ(α)), x > 0,

with respect to Lebesgue measure on (0, ∞).
(i) Show that the characteristic function of this distribution is (1 − iλt)^{−α}.
(ii) Show that a G(n/2, 2) distribution is a χ²_n distribution.
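Part (i) can be checked numerically by integrating e^{itx} against the gamma density. A minimal sketch assuming only NumPy (the grid and parameter values are arbitrary):

```python
import math
import numpy as np

alpha, lam = 2.5, 1.5                 # shape alpha and scale lambda of G(alpha, lam)
x = np.linspace(1e-9, 150.0, 300001)  # fine grid; the density is negligible beyond
dens = x ** (alpha - 1) * np.exp(-x / lam) / (lam ** alpha * math.gamma(alpha))

def cf_numeric(t):
    # E exp(itX) approximated by trapezoidal integration of exp(itx) p(x) dx
    f = np.exp(1j * t * x) * dens
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x))

def cf_closed(t):
    # claimed closed form (1 - i lambda t)^{-alpha}
    return (1 - 1j * lam * t) ** (-alpha)

errs = [abs(cf_numeric(t) - cf_closed(t)) for t in (-1.0, -0.3, 0.5, 2.0)]
```

The numerical and closed-form values should agree to several decimal places at each test point.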
3. The above problem suggests that it is natural to view the gamma family as an extension of the chi-squared family by allowing nonintegral degrees of freedom. Since the W(Σ, p, n) distribution is a generalization of the chi-squared distribution, it is reasonable to ask if we can define a Wishart distribution for nonintegral degrees of freedom. One way to pose this question is to ask for what values of α

φ_α(A) = |I_p − 2iA|^{−α}, A ∈ S_p,

is a characteristic function. (We have taken Σ = I_p for convenience.)
(i) Using Proposition 8.3 and Problem 7.1, show that φ_α is a characteristic function for α = 1/2, 1, 3/2, …, (p − 1)/2 and all real α > (p − 1)/2. Give the density that corresponds to φ_α for α > (p − 1)/2. W(I_p, p, 2α) denotes such a distribution.
(ii) For any Σ > 0 and the values of α given in (i), show that φ_α(Σ^{1/2}AΣ^{1/2}), A ∈ S_p, is a characteristic function.
4. Let S be a random element of the inner product space (S_p, ⟨·,·⟩) where ⟨·,·⟩ is the usual trace inner product on S_p. Say that S has an O_p-invariant distribution if ℒ(S) = ℒ(ΓSΓ') for each Γ ∈ O_p. Assume S has an O_p-invariant distribution.
(i) Assuming ES exists, show that ES = cI_p where c = Es₁₁ and s_{ij} is the i, j element of S.
(ii) Let D ∈ S_p be diagonal with diagonal elements d₁, …, d_p. Show that var(⟨D, S⟩) = (γ − β)∑₁ᵖ dᵢ² + β(∑₁ᵖ dᵢ)² where γ = var(s₁₁) and β = cov(s₁₁, s₂₂).
(iii) For A ∈ S_p, show that var(⟨A, S⟩) = (γ − β)⟨A, A⟩ + β⟨I_p, A⟩². From this, conclude that Cov(S) = (γ − β)I_p ⊗ I_p + β I_p □ I_p.

5. Suppose S ∈ S_p⁺ has a density f with respect to Lebesgue measure dS restricted to S_p⁺. For each n ≥ p, show there exists a random matrix X ∈ ℒ_{p,n} that has a density with respect to Lebesgue measure on ℒ_{p,n} and ℒ(X'X) = ℒ(S).
6. Show that Proposition 8.4 holds for all n₁, n₂ equal to 1, 2, …, p − 1 or any real number greater than p − 1.
7. (The inverse Wishart distribution.) Say that a positive definite S ∈ S_p⁺ has an inverse Wishart distribution with parameters Λ, p, and ν if ℒ(S⁻¹) = W(Λ⁻¹, p, ν + p − 1). Here Λ ∈ S_p⁺ and ν is a positive integer. The notation ℒ(S) = IW(Λ, p, ν) signifies that ℒ(S⁻¹) = W(Λ⁻¹, p, ν + p − 1).
(i) If ℒ(S) = IW(Λ, p, ν) and A is r × p of rank r, show that ℒ(ASA') = IW(AΛA', r, ν).
(ii) If ℒ(S) = IW(I_p, p, ν) and Γ ∈ O_p, show that ℒ(ΓSΓ') = ℒ(S).
(iii) If ℒ(S) = IW(Λ, p, ν), show that ES = (ν − 2)⁻¹Λ. Show that Cov(S) has the form c₁Λ ⊗ Λ + c₂Λ □ Λ; what are c₁ and c₂?
(iv) Now, partition S into S₁₁: q × q, S₁₂: q × r, and S₂₂: r × r with S as in (iii). Show that ℒ(S₁₁) = IW(Λ₁₁, q, ν). Also show that ℒ(S₂₂·₁) = IW(Λ₂₂·₁, r, ν + q).
8. (The matric t distribution.) Suppose X is N(0, I_r ⊗ I_p) and S is W(I_p, p, m), m ≥ p. Let S^{−1/2} denote the inverse of the positive definite square root of S. When S and X are independent, the matrix T = XS^{−1/2} is said to have a matric t distribution, and this is denoted by ℒ(T) = T(m − p + 1, I_r, I_p).
(i) Show that the density of T with respect to Lebesgue measure on ℒ_{p,r} is given by

f(T) = [ω(m, p)/((√(2π))^{rp} ω(m + r, p))] |I_p + T'T|^{−(m+r)/2}.

Also, show that ℒ(T) = ℒ(ΓTΔ') for Γ ∈ O_r and Δ ∈ O_p. Using this, show ET = 0 and Cov(T) = c₁I_r ⊗ I_p when these exist. Here, c₁ is a constant equal to the variance of any element of T.
(ii) Suppose V is IW(I_p, p, ν) and that T given V is N(0, I_r ⊗ V). Show that the unconditional distribution of T is T(ν, I_r, I_p).
(iii) Using Problem 7 and (ii), show that if T is T(ν, I_r, I_p) and T₁₁ is the k × q upper left-hand corner of T, then T₁₁ is T(ν, I_k, I_q).
9. (Multivariate F distribution.) Suppose S₁ is W(I_p, p, m) (for m = 1, 2, …) and is independent of S₂, which is W(I_p, p, ν + p − 1) (for ν = 1, 2, …). The matrix F = S₂^{−1/2}S₁S₂^{−1/2} has a matric F distribution that is denoted by F(m, ν, I_p).
(i) If S is IW(I_p, p, ν) and V given S is W(S, p, m), show that the unconditional distribution of V is F(m, ν, I_p).
(ii) Suppose T is T(ν, I_r, I_p). Show that T'T is F(r, ν, I_p).
(iii) When r ≥ p, show that the F(r, ν, I_p) distribution has a density with respect to dF/|F|^{(p+1)/2} given by

p(F) = [ω(r, p)ω(ν + p − 1, p)/ω(r + ν + p − 1, p)] |F|^{r/2}/|I_p + F|^{(ν+p+r−1)/2}.

(iv) For r ≥ p, show that, if F is F(r, ν, I_p), then F⁻¹ is F(ν + p − 1, r − p + 1, I_p).
(v) If F is F(r, ν, I_p) and F₁₁ is the q × q upper left block of F, use (ii) to show that F₁₁ is F(r, ν, I_q).
(vi) Suppose X is N(0, I_r ⊗ I_p) with r ≤ p and S is W(I_p, p, m) with m ≥ p, X and S independent. Show that XS⁻¹X' is F(p, m − p + 1, I_r).
10. (Multivariate beta distribution.) Let S₁ and S₂ be independent and suppose ℒ(Sᵢ) = W(I_p, p, mᵢ), i = 1, 2, with m₁ + m₂ ≥ p. The random matrix B = (S₁ + S₂)^{−1/2}S₁(S₁ + S₂)^{−1/2} has a p-dimensional multivariate beta distribution with parameters m₁ and m₂. This is written ℒ(B) = B(m₁, m₂, I_p) (when p = 1, this is the univariate beta distribution with parameters m₁/2 and m₂/2).
(i) If B is B(m₁, m₂, I_p), show that ℒ(ΓBΓ') = ℒ(B) for all Γ ∈ O_p. Use Example 7.16 to conclude that ℒ(B) = ℒ(ΨDΨ') where Ψ ∈ O_p is uniform and is independent of the diagonal matrix D with elements λ₁ ≥ … ≥ λ_p ≥ 0. The distribution of D is determined by specifying the distribution of λ₁, …, λ_p, and this is the distribution of the ordered roots of (S₁ + S₂)^{−1/2}S₁(S₁ + S₂)^{−1/2}.
(ii) With S₁ and S₂ as in the definition of B, show that S₁^{1/2}(S₁ + S₂)⁻¹S₁^{1/2} is B(m₁, m₂, I_p).
(iii) Suppose F is F(m, ν, I_p). Use (i) and (ii) to show that (I + F)⁻¹ is B(ν + p − 1, m, I_p) and F(I + F)⁻¹ is B(m, ν + p − 1, I_p).
(iv) Suppose that X is N(0, I_r ⊗ I_p) and that it is independent of S, which is W(I_p, p, m). When r ≤ p and m ≥ p, show that X(S + X'X)⁻¹X' is B(p, r + m − p, I_r).
(v) If B is B(m₁, m₂, I_p) and m₁ ≥ p, show that det(B) is distributed as U(m₁, m₂, p) in the notation of Section 7.4.
NOTES AND REFERENCES
1. The Wishart distribution was first derived in Wishart (1928).
2. For some alternative discussions of the Wishart distribution, see Anderson (1958), Dempster (1969), Rao (1973), and Muirhead (1982).
3. The density function of the noncentral Wishart distribution in the general case is obtained by "evaluating"

(8.1) ∫_{O_n} exp[tr ΓxΣ⁻¹μ'] μ₀(dΓ)

(see the proof of Proposition 8.12); here μ₀ denotes the uniform probability distribution on O_n. The problem of evaluating

ψ(A) = ∫_{O_n} exp[tr ΓA] μ₀(dΓ)

for A ∈ ℒ_{n,n} has received much attention since the paper of James (1954). Anderson (1946) first gave the noncentral Wishart density when
the mean matrix μ has rank 1 or rank 2. Much of the theory surrounding the evaluation of ψ and series expansions for ψ can be found in Muirhead (1982).

4. Wilks (1932) first proved Proposition 8.15 by calculating all the moments of U and showing that these matched the moments of the product of independent beta random variables. Anderson (1958) also uses the moment method to find the distribution of U. This method was used by Box (1949) to provide asymptotic expansions for the distribution of U (see Anderson, 1958, Chapter 8).
CHAPTER 9

Inference for Means in Multivariate Linear Models
Essentially, this chapter consists of a number of examples of estimation and testing problems for means when an observation vector has a normal distribution. Invariance is used throughout to describe the structure of the models considered and to suggest possible testing procedures. Because of space limitations, maximum likelihood estimators are the only type of estimators discussed. Further, likelihood ratio tests are calculated for most of the examples considered.
Before turning to the concrete examples, it is useful to have a general model within which we can view the results of this chapter. Consider an n-dimensional inner product space (V, (·,·)) and suppose that X is a random vector in V. To describe the type of parametric models we consider for X, let f be a decreasing function on [0, ∞) to [0, ∞) such that f[(x, x)] is a density with respect to Lebesgue measure on (V, (·,·)). For convenience, it is assumed that f has been normalized so that, if Z ∈ V has density f, then Cov(Z) = I. Obviously, such a Z has mean zero. Now, let M be a subspace of V and let γ be a set of positive definite linear transformations on V to V such that I ∈ γ. The pair (M, γ) serves as the parameter space for a model for X. For μ ∈ M and Σ ∈ γ,

p(x|μ, Σ) = |Σ|^{−1/2} f[(x − μ, Σ⁻¹(x − μ))]

is a density on V. The family

{p(·|μ, Σ)|μ ∈ M, Σ ∈ γ}
determines a parametric model for X. It is clear that if p(·|μ, Σ) is the density of X, then EX = μ and Cov(X) = Σ. In particular, when

f(u) = (√(2π))^{−n} exp[−u/2], u ≥ 0,

then X has a normal distribution with mean μ ∈ M and covariance Σ ∈ γ. The parametric model for X is in fact a linear model for X with parameter set (M, γ). Now, assume that Σ(M) = M for all Σ ∈ γ. Since I ∈ γ, the least-squares and Gauss-Markov estimators of μ are equal to PX where P is the orthogonal projection onto M. Further, μ̂ = PX is also the maximum likelihood estimator of μ. To see this, first note that PΣ = ΣP for Σ ∈ γ since M is invariant under each Σ ∈ γ. With Q = I − P, we have

(x − μ, Σ⁻¹(x − μ)) = (P(x − μ) + Qx, Σ⁻¹(P(x − μ) + Qx))
= (Px − μ, Σ⁻¹(Px − μ)) + (Qx, Σ⁻¹Qx).

The last equality is a consequence of

(Qx, Σ⁻¹P(x − μ)) = (x, QΣ⁻¹P(x − μ)) = (x, QPΣ⁻¹(x − μ)) = 0

as QP = 0 and Σ⁻¹P = PΣ⁻¹. Therefore, for each Σ ∈ γ,

(x − μ, Σ⁻¹(x − μ)) ≥ (Qx, Σ⁻¹Qx)

with equality iff μ = Px. Since the function f was assumed to be decreasing, it follows that μ̂ = PX is a maximum likelihood estimator of μ, and μ̂ is unique if f is strictly decreasing. Thus, under the assumptions made so far, μ̂ = PX is the maximum likelihood estimator for μ. These assumptions hold for most of the examples considered in this chapter. To find the maximum likelihood estimator of Σ, it is necessary to compute

sup_{Σ∈γ} |Σ|^{−1/2} f[(Qx, Σ⁻¹Qx)]

and find the point Σ̂ ∈ γ where the supremum is achieved, assuming it exists. The solution to this problem depends crucially on the set γ, and this is what generates the infinite variety of possible models, even with the assumption that ΣM = M for Σ ∈ γ. The examples of this chapter are generated by simply choosing some γ's for which Σ̂ can be calculated explicitly.

We end this rather lengthy introduction with a few general comments about testing problems. In the notation of the previous paragraph, consider a parameter set (M, γ) with I ∈ γ and assume ΣM = M for Σ ∈ γ. Also, let
M₀ ⊆ M be a subspace of V and assume that ΣM₀ = M₀ for Σ ∈ γ. Consider the problem of testing the null hypothesis that μ ∈ M₀ versus the alternative that μ ∈ (M − M₀). Under the null hypothesis, the maximum likelihood estimator for μ is μ̂₀ = P₀X where P₀ is the orthogonal projection onto M₀. Thus the likelihood ratio test rejects the null hypothesis for small values of

Λ(x) = sup_{Σ∈γ} |Σ|^{−1/2} f[(Q₀x, Σ⁻¹Q₀x)] / sup_{Σ∈γ} |Σ|^{−1/2} f[(Qx, Σ⁻¹Qx)]

where Q₀ = I − P₀. Again, the set γ is the major determinant with regard to the distribution, invariance, and other properties of Λ(x). The examples in this chapter illustrate some of the properties of γ that lead to tractable solutions to the estimation problem for Σ and the testing problem described above.
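The quadratic-form decomposition underlying the maximum likelihood argument above is easy to verify numerically for a covariance of the form aP + bQ, which automatically satisfies ΣM = M. A minimal sketch, assuming only NumPy and an arbitrary subspace M = range(Z):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
Z = rng.standard_normal((n, k))
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T   # orthogonal projection onto M = range(Z)
Q = np.eye(n) - P

Sigma = 2.0 * P + 0.5 * Q              # Sigma M = M since Sigma is built from P, Q
Sigma_inv = np.linalg.inv(Sigma)

x = rng.standard_normal(n)
mu = Z @ rng.standard_normal(k)        # an arbitrary point of M

# (x - mu, Sigma^{-1}(x - mu)) = (Px - mu, Sigma^{-1}(Px - mu)) + (Qx, Sigma^{-1}Qx)
lhs = (x - mu) @ Sigma_inv @ (x - mu)
rhs = (P @ x - mu) @ Sigma_inv @ (P @ x - mu) + (Q @ x) @ Sigma_inv @ (Q @ x)
resid = (Q @ x) @ Sigma_inv @ (Q @ x)
```

The two sides agree to machine precision, and lhs is always at least resid, with equality when μ = Px.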
9.1. THE MANOVA MODEL
The multivariate general linear model introduced in Example 4.4, also known as the multivariate analysis of variance model (the MANOVA model), is the subject of this section. The vector space under consideration is ℒ_{p,n} with the usual inner product ⟨·,·⟩, and the subspace M of ℒ_{p,n} is

M = {x|x = Zβ, β ∈ ℒ_{p,k}}

where Z is a fixed n × k matrix of rank k. Consider an observation vector X ∈ ℒ_{p,n} and assume that

ℒ(X) = N(μ, I_n ⊗ Σ)

where μ ∈ M and Σ is an unknown p × p positive definite matrix. Thus the set of covariances for X is

γ = {I_n ⊗ Σ|Σ ∈ S_p⁺}

and (M, γ) is the parameter set of the linear model for X. It was verified in Example 4.4 that M is invariant under each element of γ. Also, the orthogonal projection onto M is P = P_z ⊗ I_p where

P_z = Z(Z'Z)⁻¹Z'.

Further, Q = I − P = Q_z ⊗ I_p is the orthogonal projection onto M⊥ where
Q_z = I_n − P_z. Thus

μ̂ = PX = (P_z ⊗ I_p)X = P_zX

is the maximum likelihood estimator of μ ∈ M and, from Example 7.10,

Σ̂ = (1/n)X'Q_zX

is the maximum likelihood estimator of Σ when n − k ≥ p, which we assume throughout this discussion. Thus for the MANOVA model, the maximum likelihood estimators have been derived. The reader should check that the MANOVA model is a special case of the linear model described at the beginning of this chapter. We now turn to a discussion of the classical MANOVA testing problem.
Let K be a fixed r × k matrix of rank r and consider the problem of testing

H₀: Kβ = 0 versus H₁: Kβ ≠ 0

where μ = Zβ is the mean of X. It is not obvious that this testing problem is of the general type described in the introduction to this chapter. However, before proceeding further, it is convenient to transform this problem into what is called the canonical form of the MANOVA testing problem. The essence of the argument below is that it suffices to take

Z = Z₀ ≡ ( I_k )
          (  0 ),   K = K₀ ≡ (I_r 0)

in the above problem. In other words, a transformation of the original problem results in a problem where Z = Z₀ and K = K₀. We now proceed with the details. The parametric model for X ∈ ℒ_{p,n} is

ℒ(X) = N(Zβ, I_n ⊗ Σ)

and the statistical problem is to test H₀: Kβ = 0 versus H₁: Kβ ≠ 0. Since Z has rank k, Z = ΨU for some linear isometry Ψ ∈ ℱ_{k,n} and some k × k matrix U ∈ G_U^+. The k columns of Ψ are the first k columns of some Γ ∈ O_n, so

Ψ = Γ( I_k ) = ΓZ₀.
      (  0 )

Setting X̃ = Γ'X, β̃ = Uβ, and K̃ = KU⁻¹, we have
ℒ(X̃) = N(Z₀β̃, I_n ⊗ Σ)

and the testing problem is H₀: K̃β̃ = 0 versus H₁: K̃β̃ ≠ 0. Applying the same argument to K̃' as we did to Z,

K̃ = U₁(I_r 0)Δ' = U₁K₀Δ'

for some Δ ∈ O_k and some r × r matrix U₁ ∈ G_U^+. Let

Γ₁ = ( Δ   0       )
     ( 0   I_{n−k} )

and set Y = Γ₁'X̃, B = Δ'β̃. Since

Γ₁'Z₀β̃ = ( Δ'β̃ ) = Z₀B,
          (  0  )

it follows that

ℒ(Y) = N(Z₀B, I_n ⊗ Σ)

and the testing problem is H₀: K₀B = 0 versus H₁: K₀B ≠ 0. Thus, after two transformations, the original problem has been transformed into a problem with Z = Z₀ and K = K₀. Since K₀ = (I_r 0), the null hypothesis is that the first r rows of B are zero. Partition B into

B = ( B₁ );  B₁: r × p, B₂: (k − r) × p
    ( B₂ )

and partition Y into

Y = ( Y₁ )
    ( Y₂ );  Y₁: r × p, Y₂: (k − r) × p, Y₃: (n − k) × p.
    ( Y₃ )

Because Cov(Y) = I_n ⊗ Σ, the matrices Y₁, Y₂, and Y₃ are mutually independent and it is clear that

ℒ(Y₁) = N(B₁, I_r ⊗ Σ)
ℒ(Y₂) = N(B₂, I_{k−r} ⊗ Σ)
ℒ(Y₃) = N(0, I_{n−k} ⊗ Σ).
Also, the testing problem is H₀: B₁ = 0 versus H₁: B₁ ≠ 0. It is this form of the problem that is called the canonical MANOVA testing problem. The only reason for transforming from the original problem to the canonical problem is that certain expressions become simpler and the invariance of the MANOVA testing problem is more easily articulated when the problem is expressed in canonical form.

We now proceed to analyze the canonical MANOVA testing problem. To simplify some later formulas, the notation is changed a bit. Let Y₁, Y₂, and Y₃ be independent random matrices that satisfy

ℒ(Y₁) = N(B₁, I_r ⊗ Σ)
ℒ(Y₂) = N(B₂, I_s ⊗ Σ)
ℒ(Y₃) = N(0, I_m ⊗ Σ)

so B₁ is r × p and B₂ is s × p. As usual, Σ is a p × p unknown positive definite matrix. To ensure the existence of a maximum likelihood estimator for Σ, it is assumed that m ≥ p and the sample space for Y₃ is taken to be the set of all m × p real matrices of rank p. A set of Lebesgue measure zero has been deleted from the natural sample space ℒ_{p,m} of Y₃. The testing problem is

H₀: B₁ = 0 versus H₁: B₁ ≠ 0.

Setting n = r + s + m and

Y = ( Y₁ )
    ( Y₂ ),
    ( Y₃ )

ℒ(Y) = N(μ, I_n ⊗ Σ) where μ is an element of the subspace

M = { μ | μ = ( B₁ )
              ( B₂ ), B₁ ∈ ℒ_{p,r}, B₂ ∈ ℒ_{p,s} }.
              (  0 )

In this notation, the null hypothesis is that μ ∈ M₀ ⊆ M where

M₀ = { μ | μ = (  0 )
               ( B₂ ); B₂ ∈ ℒ_{p,s} }.
               (  0 )
Since M and M₀ are both invariant under I_n ⊗ Σ for all Σ > 0, the testing problem under consideration is of the type described in general terms earlier, and

γ = {I_n ⊗ Σ|Σ > 0}.

When the model for Y is (M, γ), the density of Y is

p(Y|B₁, B₂, Σ) = (√(2π))^{−np}|Σ|^{−n/2}
× exp[−½ tr(Y₁ − B₁)Σ⁻¹(Y₁ − B₁)' − ½ tr(Y₂ − B₂)Σ⁻¹(Y₂ − B₂)' − ½ tr Y₃Σ⁻¹Y₃'].

In this case, the maximum likelihood estimators of B₁, B₂, and Σ are easily seen to be

B̂₁ = Y₁, B̂₂ = Y₂, Σ̂ = (1/n)Y₃'Y₃.

When the model for Y is (M₀, γ), the density of Y is p(Y|0, B₂, Σ) and the maximum likelihood estimators of B₂ and Σ are

B̂₂ = Y₂, Σ̂₀ = (1/n)(Y₃'Y₃ + Y₁'Y₁).

Therefore, the likelihood ratio test rejects for small values of

Λ(Y) = p(Y|0, B̂₂, Σ̂₀)/p(Y|B̂₁, B̂₂, Σ̂) = |Y₃'Y₃|^{n/2}/|Y₃'Y₃ + Y₁'Y₁|^{n/2}.

Summarizing this, we have the following result.

Proposition 9.1. For the canonical MANOVA testing problem, the likelihood ratio test rejects the null hypothesis for small values of the statistic

U = |Y₃'Y₃|/|Y₃'Y₃ + Y₁'Y₁|.

Under H₀, ℒ(U) = U(m, r, p) where the distribution U(m, r, p) is given in Proposition 8.15.
Proof. The first assertion is clear. Under H₀, ℒ(Y₁) = N(0, I_r ⊗ Σ) and ℒ(Y₃) = N(0, I_m ⊗ Σ). Therefore, ℒ(Y₁'Y₁) = W(Σ, p, r) and ℒ(Y₃'Y₃) = W(Σ, p, m). Since m ≥ p, Proposition 8.18 implies the result. □
Before attempting to interpret the likelihood ratio test, it is useful to see first what implications can be obtained from invariance considerations in the canonical MANOVA problem. In the notation of the previous paragraph, (M, γ) is the parameter set for the model for Y and, under the null hypothesis, (M₀, γ) is the parameter set for Y. In order that the testing problem be invariant under a group of transformations, both of the parameter sets (M, γ) and (M₀, γ) must be invariant. With this in mind, consider the group G defined by

G = {g|g = (Γ₁, Γ₂, Γ₃, ξ, A); Γ₁ ∈ O_r, Γ₂ ∈ O_s, Γ₃ ∈ O_m, ξ ∈ ℒ_{p,s}, A ∈ Gl_p}

where the group action on the sample space is given by

(Γ₁, Γ₂, Γ₃, ξ, A)(Y₁, Y₂, Y₃) = (Γ₁Y₁A', Γ₂Y₂A' + ξ, Γ₃Y₃A').

The group composition, defined so that the above action is a left action on the sample space, is

(Γ₁, Γ₂, Γ₃, ξ, A)(Δ₁, Δ₂, Δ₃, η, C) = (Γ₁Δ₁, Γ₂Δ₂, Γ₃Δ₃, Γ₂ηA' + ξ, AC).

Further, the induced group action on the parameter set (M, γ) is

(Γ₁, Γ₂, Γ₃, ξ, A)(B₁, B₂, Σ) = (Γ₁B₁A', Γ₂B₂A' + ξ, AΣA'),

where the point

μ = ( B₁ ) ∈ M,  I_n ⊗ Σ ∈ γ
    ( B₂ )
    (  0 )

has been represented simply by (B₁, B₂, Σ). Now it is routine to check that when Y has a normal distribution with EY ∈ M (EY ∈ M₀) and Cov(Y) ∈ γ, then E(gY) ∈ M (E(gY) ∈ M₀) and Cov(gY) ∈ γ for g ∈ G. Thus the
hypothesis testing problem is G-invariant and the likelihood ratio test is a G-invariant function of Y. To describe the invariant tests, a maximal invariant under the action of G on the sample space needs to be computed. The following result provides one form of a maximal invariant.

Proposition 9.2. Let t = min{r, p}, and define h(Y₁, Y₂, Y₃) to be the t-dimensional vector (λ₁, …, λ_t)' where λ₁ ≥ … ≥ λ_t are the t largest eigenvalues of Y₁'Y₁(Y₃'Y₃)⁻¹. Then h is a maximal invariant under the action of G on the sample space of Y.
Proof. Note that Y₁'Y₁(Y₃'Y₃)⁻¹ has at most t nonzero eigenvalues, and these t eigenvalues are nonnegative. First, consider the case when r ≤ p, so t = r. By Proposition 1.39, the nonzero eigenvalues of Y₁'Y₁(Y₃'Y₃)⁻¹ are the same as the nonzero eigenvalues of Y₁(Y₃'Y₃)⁻¹Y₁', and these eigenvalues are obviously invariant under the action of g on Y. To show that h is maximal invariant for this case, a reduction argument similar to that in Example 7.4 is used. Given

Y = ( Y₁ )
    ( Y₂ ),
    ( Y₃ )

we claim that there exists a g₀ ∈ G such that

g₀Y = ( (D 0) )
      (   0   )
      (  I_p  )
      (   0   )

where the first block (D 0) is r × p, the second block is the s × p zero matrix, the last two blocks together form the m × p matrix with I_p atop an (m − p) × p zero block, and D is r × r and diagonal with diagonal elements √λ₁, …, √λ_r. For g = (Γ₁, Γ₂, Γ₃, ξ, A),

gY = (Γ₁Y₁A', Γ₂Y₂A' + ξ, Γ₃Y₃A').

By Proposition 5.2, Y₃ = Ψ₃U₃ where Ψ₃ ∈ ℱ_{p,m} and U₃ ∈ G_U^+ is p × p. Choose A' = U₃⁻¹Δ where Δ ∈ O_p is, as yet, unspecified. Then

Γ₁Y₁A' = Γ₁Y₁U₃⁻¹Δ

and, by the singular value decomposition theorem for matrices, there exist
a Γ₁ ∈ O_r and a Δ ∈ O_p such that

Γ₁Y₁U₃⁻¹Δ = (D 0)

where D is an r × r diagonal matrix whose diagonal elements are the square roots of the eigenvalues of Y₁(U₃'U₃)⁻¹Y₁' = Y₁(Y₃'Y₃)⁻¹Y₁'. With this choice for Δ ∈ O_p, it is clear that Y₃A' = Y₃U₃⁻¹Δ = Ψ₃Δ ∈ ℱ_{p,m}, so there exists a Γ₃ ∈ O_m such that

Γ₃Y₃U₃⁻¹Δ = ( I_p )
            (  0  ).

Choosing Γ₂ = I_s, ξ = −Y₂A', and setting

g₀ = (Γ₁, I_s, Γ₃, −Y₂U₃⁻¹Δ, (U₃⁻¹Δ)'),

g₀Y has the representation claimed. To show h is maximal invariant, suppose h(Y₁, Y₂, Y₃) = h(Z₁, Z₂, Z₃). Let D be the r × r diagonal matrix, the squares of whose diagonal elements are the eigenvalues of Y₁(Y₃'Y₃)⁻¹Y₁' and of Z₁(Z₃'Z₃)⁻¹Z₁'. Then there exist g₀ and g₁ ∈ G such that

g₀Y = ( (D 0) )
      (   0   ) = g₁Z
      (  I_p  )
      (   0   )

so Y = g₀⁻¹g₁Z. Thus Y and Z are in the same orbit and h is a maximal invariant function.

When r > p, basically the same argument establishes that h is a maximal invariant. To show h is invariant, if g = (Γ₁, Γ₂, Γ₃, ξ, A), then the matrix Y₁'Y₁(Y₃'Y₃)⁻¹ gets transformed into AY₁'Y₁(Y₃'Y₃)⁻¹A⁻¹ when Y is transformed to gY. By Proposition 1.39, the eigenvalues of AY₁'Y₁(Y₃'Y₃)⁻¹A⁻¹ are the same as the eigenvalues of Y₁'Y₁(Y₃'Y₃)⁻¹, so h is invariant. To show h is maximal invariant, first show that, for each Y, there exists a g₀ ∈ G such that

g₀Y = (  D  )
      (  0  )
      (  0  )
      ( I_p )
      (  0  )

where the first two blocks form the r × p matrix with D atop an (r − p) × p zero block, the third block is the s × p zero matrix, the last two blocks form the m × p matrix with I_p atop an (m − p) × p zero block, and D is the p × p diagonal matrix of square roots of the eigenvalues of (Y₁'Y₁)(Y₃'Y₃)⁻¹. The argument for this is similar to that given previously and is left to the reader. Now, by mimicking the proof for the case r ≤ p, it follows that h is maximal invariant. □
Proposition 9.3. The distribution of the maximal invariant h(Y₁, Y₂, Y₃) depends on the parameters (B₁, B₂, Σ) only through the vector of the t largest eigenvalues of B₁'B₁Σ⁻¹.

Proof. Since h is a G-invariant function, the distribution of h depends on (B₁, B₂, Σ) only through a maximal invariant parameter under the induced action of G on the parameter space. This action, given earlier, is

(Γ₁, Γ₂, Γ₃, ξ, A)(B₁, B₂, Σ) = (Γ₁B₁A', Γ₂B₂A' + ξ, AΣA').

However, an argument similar to that used to prove Proposition 9.2 shows that the vector of the t largest eigenvalues of B₁'B₁Σ⁻¹ is maximal invariant in the parameter space. □
An alternative form of the maximal invariant is sometimes useful.
Proposition 9.4. Let t = min{r, p) and define h,(Y,, Y2, Y3) to be the
t-dimensional vector (Os,..., 0,)' where 01 < -.. < 0, are the t smallest
eigenvalues of Y3Y3(Y3Y3 + YjY1)-'. Then 0i = 1/(1 + Xi), i = 1 t, where X 's are defined in Proposition 9.2. Further, h,(Y1, Y2, Y3) is a
maximal invariant.
Proof For A E [0, oo), let 0 = 1/(1 + A). If X satisfies the equation
Y,Y,(Y3Y3)'- XIp= =,
then a bit of algebra shows that 0 satisfies the equation
Y3Y3(Y3Y3 + - oip = 0,
and conversely. Thus 0i = 1/(1 + AX), i = 1,..., t, are the t smallest eigen values of Y3Y3(Y3Y3 + YlY1 1. Since hI(Y,, Y2, Y3) is a one-to-one function
of h (Y,, Y2, Y3), it is clear that h, (Y,, Y2, Y3) is a maximal invariant. El
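The eigenvalue correspondence θᵢ = 1/(1 + λᵢ) is easy to confirm numerically; the following sketch uses arbitrary simulated matrices (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
r, p, m = 4, 3, 9                  # here t = min(r, p) = p
Y1 = rng.standard_normal((r, p))
Y3 = rng.standard_normal((m, p))

W1 = Y1.T @ Y1
W3 = Y3.T @ Y3
# lambda_1 >= ... >= lambda_p: eigenvalues of Y1'Y1 (Y3'Y3)^{-1}
lam = np.sort(np.linalg.eigvals(W1 @ np.linalg.inv(W3)).real)[::-1]
# theta_1 <= ... <= theta_p: eigenvalues of Y3'Y3 (Y3'Y3 + Y1'Y1)^{-1}
theta = np.sort(np.linalg.eigvals(W3 @ np.linalg.inv(W3 + W1)).real)
```

Sorting λ in decreasing order and θ in increasing order lines the two vectors up so that θᵢ = 1/(1 + λᵢ) entrywise.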
Since every G-invariant test is a function of a maximal invariant, the problem of choosing a reasonable invariant test boils down to studying tests based on a maximal invariant. When t = min{p, r} = 1, the following result shows that there is only one sensible choice for an invariant test.
Proposition 9.5. If t = 1 in the MANOVA problem, then the test that rejects for large values of λ₁ is uniformly most powerful within the class of G-invariant tests. Further, this test is equivalent to the likelihood ratio test.

Proof. First consider the case when p = 1. Then Y₁'Y₁(Y₃'Y₃)⁻¹ is a nonnegative scalar and

λ₁ = Y₁'Y₁/Y₃'Y₃.

Also, ℒ(Y₁) = N(B₁, σ²I_r) and ℒ(Y₃) = N(0, σ²I_m) where Σ has been set equal to σ² to conform to classical notation when p = 1. By Proposition 8.14,

ℒ(λ₁) = F(r, m; δ)

where δ = B₁'B₁/σ², and the null hypothesis is that δ = 0. Since the noncentral F distribution has a monotone likelihood ratio, it follows that the test that rejects for large values of λ₁ is uniformly most powerful for testing δ = 0 versus δ > 0. As every invariant test is a function of λ₁, the case for p = 1 follows.

Now, suppose r = 1. Then the only nonzero eigenvalue of Y₁'Y₁(Y₃'Y₃)⁻¹ is Y₁(Y₃'Y₃)⁻¹Y₁' by Proposition 1.39. Thus

λ₁ = Y₁(Y₃'Y₃)⁻¹Y₁'

and, by Proposition 8.14,

ℒ(λ₁) = F(p, m − p + 1; δ)

where δ = B₁Σ⁻¹B₁' ≥ 0. The problem is to test δ = 0 versus δ > 0. Again, the noncentral F distribution has a monotone likelihood ratio and the test that rejects for large values of λ₁ is uniformly most powerful among tests based on λ₁.

The likelihood ratio test rejects H₀ for small values of

Λ = |Y₃'Y₃|/|Y₃'Y₃ + Y₁'Y₁| = 1/|I_p + Y₁'Y₁(Y₃'Y₃)⁻¹|.

If p = 1, then Λ = (1 + λ₁)⁻¹ and rejecting for small values of Λ is equivalent to rejecting for large values of λ₁. When r = 1, then

|I_p + Y₁'Y₁(Y₃'Y₃)⁻¹| = 1 + Y₁(Y₃'Y₃)⁻¹Y₁' = 1 + λ₁

so again Λ = (1 + λ₁)⁻¹. □
When t > 1, the situation is not so simple. In terms of the eigenvalues λ₁, …, λ_t, the likelihood ratio criterion rejects H₀ for small values of

Λ = |Y₃'Y₃|/|Y₃'Y₃ + Y₁'Y₁| = 1/|I_p + Y₁'Y₁(Y₃'Y₃)⁻¹| = ∏_{i=1}^t 1/(1 + λᵢ).

However, there are no compelling reasons to believe that other tests based on λ₁, …, λ_t would not be reasonable. Before discussing possible alternatives to the likelihood ratio test, it is helpful to write the maximal invariant statistic in terms of the original variables that led to the canonical MANOVA problem. In the original MANOVA problem, we had an observation vector X ∈ ℒ_{p,n} such that

ℒ(X) = N(Zβ, I_n ⊗ Σ)

and the problem was to test Kβ = 0. We know that

β̂ = (Z'Z)⁻¹Z'X

and

Σ̂ = (1/n)X'Q_zX ≡ (1/n)S

are the maximum likelihood estimators of β and Σ.
Proposition 9.6. Let t = min{p, r}. Suppose the original MANOVA problem is reduced to a canonical MANOVA problem. Then a maximal invariant in the canonical problem, expressed in terms of the original variables, is the vector (λ₁, …, λ_t)', λ₁ ≥ … ≥ λ_t, of the t largest eigenvalues of

V ≡ [(Kβ̂)'(K(Z'Z)⁻¹K')⁻¹(Kβ̂)]S⁻¹.

Proof. The transformations that reduced the original problem to canonical form led to the three matrices Y₁, Y₂, and Y₃ where Y₁ is r × p, Y₂ is (k − r) × p, and Y₃ is (n − k) × p. Expressing Y₁ and Y₃ in terms of X, Z, and K, it is not too difficult (but certainly tedious) to show that

Y₁'Y₁(Y₃'Y₃)⁻¹ = V.

By Proposition 9.2, the vector (λ₁, …, λ_t)' of the t largest eigenvalues of Y₁'Y₁(Y₃'Y₃)⁻¹ is a maximal invariant. Thus the vector of the t largest eigenvalues of V is a maximal invariant. □
In terms of X, Z, and K, the likelihood ratio test rejects the null hypothesis if

Λ = |S|/|(Kβ̂)'(K(Z'Z)⁻¹K')⁻¹Kβ̂ + S|

is too small. Also, the distribution of Λ under H₀ is given in Proposition 9.1 as U(n − k, r, p). The distribution of Λ when Kβ ≠ 0 is quite complicated when t > 1, except in the case when β has rank one. In this case, the distribution of Λ is given in Proposition 8.16. We now turn to the question of possible alternatives to the likelihood
distribution of A is given in Proposition 8.16. We now turn to the question of possible alternatives to the likelihood
ratio test. For notational convenience, the canonical form of the MANOVA problem is treated. However, the reader can express statistics in terms of the original variables by applying Proposition 9.6. Since our interest is in invariant tests, consider Y, and Y3, which are independent, and satisfy
E(YI) = N(B1, In C Y)
e (Y3) = N(O, Im X ).
The random vector Y2 need not be considered as invariant tests do not
involve Y2. Intuitively, the null hypothesis Ho: B1 = 0 should be rejected, on the basis of an invariant test, if the nonzero eigenvalues X, > * * * > XA of Y,Y1(Y3Y3)' are too large in some sense. Since E(Y1) = N(B1, Ir ? 2),
=YBYl=BBI + r.
Also, it is not difficult to verify that (see the problems at the end of this chapter)
Y3'Y3 m-p-1
when m - p - 1 > 0. Since Y, and Y3 are independent,
'3-' =
'r1+ I
BI-. F9 Yl'Yl ( Y3 =m-p
- P m - I
I i
Therefore, the further B, is away from zero, the larger we expect the
This content downloaded from 91.229.229.49 on Sat, 14 Jun 2014 17:27:22 PMAll use subject to JSTOR Terms and Conditions
348 INFERENCE FOR MEANS IN MULTIVARIATE LINEAR MODELS
eigenvalues of B,B1 I- to be, and hence the larger we expect the eigen values of Y,YI(Y3Y3)- 1to be. In particular,
tr Yl Y1 (Y'Y3 )'= m-p-1+m - I + t B B I I I
and tr B'B l is just the sum of the eigenvalues of B'B I-'.
The test that rejects for large values of the statistic

∑_{i=1}^t λᵢ = tr Y₁'Y₁(Y₃'Y₃)⁻¹

is called the Lawley-Hotelling trace test and is one possible alternative to the likelihood ratio test. Also, the test that rejects for large values of

∑_{i=1}^t λᵢ/(1 + λᵢ) = tr Y₁'Y₁(Y₃'Y₃ + Y₁'Y₁)⁻¹

was introduced by Pillai as a competitor to the likelihood ratio test. A third competitor is based on the following considerations. The null hypothesis
Ho: B1 = 0 is equivalent to the intersection over u E Rr, liull = 1, of the
null hypotheses Hu: u'B, = 0. Combining Propositions 9.5 and 9.6, it
follows that the test that accepts HU iff
U'yl ( Y3Y3 ) Y1u < C
is a uniformly most powerful test within the class of tests that are invariant
under the group of transformations preserving Hu. Here, c is a constant.
Under H.,
e (u'Y(3'Y3) 'Y; U) = Fp,m-p+I
so it seems reasonable to require that c not depend on u. Since Ho is
equivalent to ∩{H_u : ||u|| = 1, u ∈ R^r}, H0 should be accepted iff all the H_u are accepted; that is, H0 should be accepted iff

sup_{||u||=1} u'Y1(Y3'Y3)⁻¹Y1'u ≤ c.
However, this supremum is just the largest eigenvalue of Y1(Y3'Y3)⁻¹Y1', which is λ1. Thus the proposed test is to accept H0 iff λ1 ≤ c or, equivalently,
to reject H0 for large values of λ1. This test is called Roy's maximum root test.
Unfortunately, very little is known about the comparative behavior of the tests described above. A few numerical studies have been done for small values of r, m, and p, but no single test stands out as dominating the others over a substantial portion of the set of alternatives. Since very accurate approximations are available for the null distribution of the likelihood ratio test, this test is easier to apply than the above competitors. Further, there is an interesting decomposition of the test statistic

Λ = |Y3'Y3| / |Y3'Y3 + Y1'Y1|,
which has some applications in practice. Let S = Y3'Y3 so L(S) = W(Σ, p, m), and let X1',..., Xr' denote the rows of Y1. Under H0: B1 = 0, X1,..., Xr are independent and L(Xi) = N(0, Σ). Further,
Λ = |S| / |S + ∑_{i=1}^r XiXi'| = ∏_{i=1}^r Λi

where

Λ1 = |S| / |S + X1X1'|

and

Λi = |S + ∑_{j=1}^{i−1} XjXj'| / |S + ∑_{j=1}^{i} XjXj'|,  i = 2,..., r.
Proposition 8.15 gives the distribution of Λi under H0 and shows that Λ1,..., Λr are independent under H0. Let β1',..., βr' denote the rows of B1 and consider the r testing problems given by the null hypotheses

H_{0i}: β1 = ··· = βi = 0

and the alternatives

H_{1i}: β1 = ··· = β_{i−1} = 0, βi ≠ 0,

for i = 1,..., r. Obviously, H0 = ∩_{i=1}^r H_{0i}, and the likelihood ratio test for
testing H_{0i} against H_{1i} rejects H_{0i} iff Λi is too small. Thus the likelihood ratio test for H0 can be thought of as one possible way of combining the r independent test statistics Λ1,..., Λr into an overall test of ∩_{i=1}^r H_{0i}.
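The identities relating these statistics to the eigenvalues λ1 ≥ ··· ≥ λt, and the telescoping product Λ = ∏Λi, are easy to check numerically. The following sketch is illustrative only: the matrices are simulated with numpy, and the dimensions r, p, m are arbitrary choices, not anything fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
r, p, m = 3, 4, 12                     # illustrative dimensions with m > p
Y1 = rng.standard_normal((r, p))       # stands in for Y1 (simulated under H0)
Y3 = rng.standard_normal((m, p))       # stands in for Y3

A = Y1.T @ Y1                          # Y1'Y1
S = Y3.T @ Y3                          # Y3'Y3, nonsingular since m > p
lam = np.sort(np.linalg.eigvals(A @ np.linalg.inv(S)).real)[::-1]

# The three competitors and the likelihood ratio statistic as functions of the
# eigenvalues lam alone.
lawley_hotelling = lam.sum()
pillai = np.sum(lam / (1.0 + lam))
roy = lam[0]
wilks = np.prod(1.0 / (1.0 + lam))

assert np.isclose(lawley_hotelling, np.trace(A @ np.linalg.inv(S)))
assert np.isclose(pillai, np.trace(A @ np.linalg.inv(S + A)))
assert np.isclose(wilks, np.linalg.det(S) / np.linalg.det(S + A))

# Telescoping decomposition Lambda = prod_i Lambda_i over the rows of Y1.
lams, T = [], S.copy()
for x in Y1:                           # rows X1', ..., Xr'
    T_next = T + np.outer(x, x)
    lams.append(np.linalg.det(T) / np.linalg.det(T_next))
    T = T_next
assert np.isclose(np.prod(lams), wilks)
```

Each Λi lies in (0, 1], and the product telescopes because the numerator of one factor cancels the denominator of the previous one.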
9.2. MANOVA PROBLEMS WITH BLOCK DIAGONAL COVARIANCE STRUCTURE
The parameter set of the MANOVA model considered in the previous section consisted of a subspace

M = {μ | μ = ZB, B ∈ L_{p,k}} ⊆ L_{p,n}

and a set of covariance matrices

γ = {I_n ⊗ Σ | Σ ∈ S_p⁺}.
It was assumed that the matrix Σ was completely unknown. In this section, we consider estimation and testing problems when certain things are known about Σ. For example, if Σ = σ²I_p with σ² unknown and positive, then we have the linear model discussed in Section 3.1. In this case, the linear model with parameter set (M, γ) is just a univariate linear model in the sense that I_n ⊗ Σ = σ²(I_n ⊗ I_p) and I_n ⊗ I_p is the identity linear transformation on the vector space L_{p,n}. This model is just the linear model of Section 9.1 when p = 1 and np plays the role of n. Of course, when Σ = σ²I_p, the subspace M need not have the structure above in order for Proposition 4.6 to hold.
In what follows, we consider another assumption concerning Σ and treat certain estimation and testing problems. For the models treated, it is shown that these models are actually "products" of the MANOVA models discussed in Section 9.1. Suppose Y ∈ L_{p,n} is a random vector with E(Y) ∈ M where

M = {μ | μ = ZB, B ∈ L_{p,k}}

and Z is a known n × k matrix of rank k. Write p = p1 + p2, pi ≥ 1, for i = 1, 2. The covariance of Y is assumed to be an element of
γ0 = { I_n ⊗ Σ | Σ = [ Σ11  0 ; 0  Σ22 ],  Σii ∈ S_{pi}⁺, i = 1, 2 }.
Thus the rows of Y, say Y1',..., Yn', are uncorrelated. Further, if Yi' is partitioned into Xi ∈ R^{p1} and Wi ∈ R^{p2}, Yi' = (Xi', Wi'), then Xi and Wi are also uncorrelated, since

Cov(Yi) = [ Σ11  0 ; 0  Σ22 ].
Thus the interpretation of the assumed structure of γ0 is that the rows of Y are uncorrelated and, within each row, the first p1 coordinates are uncorrelated with the last p2 coordinates. This suggests that we decompose Y into X ∈ L_{p1,n} and W ∈ L_{p2,n} where

Y = (X, W).

Obviously, the rows of X (of W) are X1,..., Xn (W1,..., Wn). Also, partition B ∈ L_{p,k} into B1 ∈ L_{p1,k} and B2 ∈ L_{p2,k}. It is clear that

E(X) ∈ M1 = {μ1 | μ1 = ZB1, B1 ∈ L_{p1,k}}

and

E(W) ∈ M2 = {μ2 | μ2 = ZB2, B2 ∈ L_{p2,k}}.
Further,

Cov(X) ∈ γ1 = {I_n ⊗ Σ11 | Σ11 ∈ S_{p1}⁺}

and

Cov(W) ∈ γ2 = {I_n ⊗ Σ22 | Σ22 ∈ S_{p2}⁺}.
Since X and W are uncorrelated, if Y has a normal distribution, then X and W are independent and normal, and we have a MANOVA model of Section 9.1 for both X and W (with parameter sets (M1, γ1) and (M2, γ2)). In summary, when Y has a normal distribution, Y can be partitioned into X and W, which are independent. Therefore, the density of Y is

f(Y|μ, Σ) = f1(X|μ1, Σ11) f2(W|μ2, Σ22)

where f, f1, and f2 are normal densities on the appropriate spaces. Since we have MANOVA models for both X and W, the maximum likelihood estimators of μ1, μ2, Σ11, and Σ22 follow from the results of the first section.
For testing the null hypothesis H0: KB = 0, K: r × k of rank r, a similar decomposition occurs. As B = (B1, B2), H0: KB = 0 is equivalent to the two null hypotheses H01: KB1 = 0 and H02: KB2 = 0.
Proposition 9.7. Assume that n − k > max(p1, p2). For testing H0: KB = 0, the likelihood ratio test rejects for small values of Λ = Λ1Λ2 where

Λ1 = |X'Q_Z X| / |X'Q_Z X + (KB̂1)'(K(Z'Z)⁻¹K')⁻¹KB̂1|

and

Λ2 = |W'Q_Z W| / |W'Q_Z W + (KB̂2)'(K(Z'Z)⁻¹K')⁻¹KB̂2|.

Here, Q_Z = I − P_Z where P_Z = Z(Z'Z)⁻¹Z', and

B̂1 = (Z'Z)⁻¹Z'X,  B̂2 = (Z'Z)⁻¹Z'W.
Proof. We need to calculate

Λ(Y) = sup_{(μ,Σ)∈H0} f(Y|μ, Σ) / sup_{(μ,Σ)∈Θ} f(Y|μ, Σ)

where Θ is the set of (μ, Σ) such that μ ∈ M and I_n ⊗ Σ ∈ γ0. As noted earlier,

f(Y|μ, Σ) = f1(X|μ1, Σ11) f2(W|μ2, Σ22).

Also, (μ, Σ) ∈ H0 iff (μ1, Σ11) ∈ H01 and (μ2, Σ22) ∈ H02. Further, (μ, Σ) ∈ Θ iff (μi, Σii) ∈ Θi, where Θi is the set of (μi, Σii) such that μi ∈ Mi and I_n ⊗ Σii ∈ γi, for i = 1, 2. From these remarks, it follows that

Λ(Y) = Ψ1(X) Ψ2(W)
where

Ψ1(X) = sup_{(μ1,Σ11)∈H01} f1(X|μ1, Σ11) / sup_{(μ1,Σ11)∈Θ1} f1(X|μ1, Σ11)

and

Ψ2(W) = sup_{(μ2,Σ22)∈H02} f2(W|μ2, Σ22) / sup_{(μ2,Σ22)∈Θ2} f2(W|μ2, Σ22).
However, Ψ1(X) is simply the likelihood ratio statistic for testing H01 in the MANOVA model for X. The results of Propositions 9.6 and 9.1 show that Ψ1(X) = (Λ1)^{n/2}. Similarly, Ψ2(W) = (Λ2)^{n/2}. Thus Λ(Y) = (Λ1Λ2)^{n/2}, so the test that rejects for small values of Λ = Λ1Λ2 is equivalent to the likelihood ratio test. □
Since X and W are independent, the statistics Λ1 and Λ2 are independent. The distribution of Λi when H0 is true is U(n − pi, r, pi) for i = 1, 2. Therefore, when H0 is true, Λ1Λ2 is distributed as a product of independent beta random variables, and the results in Anderson (1958) provide an approximation to the null distribution of Λ1Λ2.
We now turn to a discussion of the invariance aspects of testing H0: KB = 0 on the basis of the observation vector Y. The argument used to reduce the MANOVA model of Section 9.1 to canonical form is valid here, and this leads to a group of transformations G1 that preserves the testing problem H01 for the MANOVA model for X. Similarly, there is a group G2 that preserves the testing problem H02 for the MANOVA model for W. Since Y = (X, W), we can define the product group G1 × G2 acting on Y by

(g1, g2)Y = (g1X, g2W),

and the testing problem H0 is clearly invariant under this group action. A maximal invariant is derived as follows. Let ti = min(r, pi) for i = 1, 2, and, in the notation of Proposition 9.7, let
V1 = [(KB̂1)'(K(Z'Z)⁻¹K')⁻¹KB̂1](X'Q_Z X)⁻¹

and

V2 = [(KB̂2)'(K(Z'Z)⁻¹K')⁻¹KB̂2](W'Q_Z W)⁻¹.

Let η1 ≥ ··· ≥ η_{t1} be the t1 largest eigenvalues of V1 and θ1 ≥ ··· ≥ θ_{t2} be the t2 largest eigenvalues of V2.
Proposition 9.8. A maximal invariant under the action of G1 × G2 on Y is the (t1 + t2)-dimensional vector (η1,..., η_{t1}; θ1,..., θ_{t2}) = h(Y) = (h1(X); h2(W)). Here, h1(X) = (η1,..., η_{t1}) and h2(W) = (θ1,..., θ_{t2}).

Proof. By Proposition 9.6, h1(X) (h2(W)) is maximal invariant under the action of G1 (G2) on X (W). Thus h is G-invariant. If h(Y1) = h(Y2) where Y1 = (X1, W1) and Y2 = (X2, W2), then h1(X1) = h1(X2) and h2(W1) = h2(W2). Thus there exist g1 ∈ G1 (g2 ∈ G2) such that g1X1 = X2 (g2W1 = W2). Therefore,

(g1, g2)Y1 = (g1X1, g2W1) = (X2, W2) = Y2,

so h is maximal invariant. □
As a function of h(Y), the likelihood ratio test rejects H0 if

Λ = Λ1Λ2 = ∏_{i=1}^{t1} (1 + ηi)⁻¹ ∏_{i=1}^{t2} (1 + θi)⁻¹

is too small. Since t1 + t2 > 1, the maximal invariant h(Y) is always of dimension greater than one. Thus the situation described in Proposition 9.5 cannot arise in the present context. In no case will there exist a uniformly most powerful invariant test of H0: KB = 0, even if K has rank 1. This
completes our discussion of the present linear model.
It should be clear by now that the results described above can be easily extended to the case when Σ has the form

Σ = diag(Σ11, Σ22,..., Σss)

where the off-diagonal blocks of Σ are zero. Here Σ ∈ S_p⁺ and Σii ∈ S_{pi}⁺, with ∑_{i=1}^s pi = p. In this case, the set of covariances for Y ∈ L_{p,n} is the set γ0, which consists of all I_n ⊗ Σ where Σ has the above form and each Σii is
unknown. The mean space for Y is M as before. For this model, Y can be decomposed into s independent pieces, and we have a MANOVA model in L_{pi,n} for each piece. Also, the matrix B (E(Y) = ZB) decomposes into B1,..., Bs, Bi ∈ L_{pi,k}, and the null hypothesis H0: KB = 0 is equivalent to the intersection of the s null hypotheses H_{0i}: KBi = 0, i = 1,..., s. The likelihood ratio test of H0 is now based on a product of s independent statistics, say Λ = ∏_{i=1}^s Λi, where L(Λi) = U(n − pi, r, pi), and thus Λ is distributed as a product of independent beta random variables when H0 is true. Further, invariance considerations lead to an s-fold product group that preserves the testing problem, and a maximal invariant is of dimension t1 + ··· + ts where ti = min(r, pi), i = 1,..., s. The details of all this, which are mainly notational, are left to the reader.
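The product form of the likelihood ratio statistic reflects the fact that determinants of block diagonal matrices factor. The sketch below is a hypothetical illustration with s = 2 blocks; the per-block residual and hypothesis cross-product matrices are simulated with numpy rather than derived from a fitted model.

```python
import numpy as np

def block_diag(*blocks):
    """Assemble a block diagonal matrix from square blocks."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

rng = np.random.default_rng(1)
sizes = [2, 3]                         # p1, p2 (so p = 5); arbitrary choices
resid, hyp = [], []
for p_i in sizes:
    E = rng.standard_normal((10, p_i))
    H = rng.standard_normal((3, p_i))
    resid.append(E.T @ E)              # per-block residual cross products
    hyp.append(H.T @ H)                # per-block hypothesis cross products

# Per-block Wilks-type ratios Lambda_i ...
lams = [np.linalg.det(R) / np.linalg.det(R + G) for R, G in zip(resid, hyp)]

# ... multiply to the determinant ratio of the assembled block diagonal matrices.
R_full = block_diag(*resid)
H_full = block_diag(*hyp)
assert np.isclose(np.prod(lams),
                  np.linalg.det(R_full) / np.linalg.det(R_full + H_full))
```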
In this section, it has been shown that the linear model with a block diagonal covariance matrix can be decomposed into independent component models, each of which is a MANOVA model of the type treated in Section 9.1. This decomposition technique also appears in the next two sections in which we treat linear models with different types of covariance structure.
9.3. INTRACLASS COVARIANCE STRUCTURE
In some instances, it is natural to assume that the covariance matrix of a random vector possesses certain symmetry properties that are suggested by the sampling situation. For example, if p measurements are taken under the same experimental conditions, it may be reasonable to suppose that the order in which the observations are taken is immaterial. In other words, if X1,..., Xp denote the observations and X' = (X1,..., Xp) is the observation vector, then X and any permutation of X have the same distribution. Symbolically, this means that L(X) = L(gX) where g is a permutation matrix. If Σ = Cov(X) exists, this implies that Σ = gΣg' for g ∈ 𝒫_p, where 𝒫_p denotes the group of p × p permutation matrices. Our first task is to characterize those covariance matrices that are invariant under 𝒫_p, that is, those covariance matrices that satisfy Σ = gΣg' for all g ∈ 𝒫_p. Let e ∈ R^p be the vector of ones and set P_e = (1/p)ee' so that P_e is the orthogonal projection onto span{e}. Also, let Q_e = I_p − P_e.
Proposition 9.9. Let Σ be a positive definite p × p matrix. The following are equivalent:

(i) Σ = gΣg' for g ∈ 𝒫_p.
(ii) Σ = αP_e + βQ_e for α > 0 and β > 0.
(iii) Σ = σ²A(ρ) where σ² > 0, −1/(p − 1) < ρ < 1, and A(ρ) is a p × p matrix with elements a_ii = 1, i = 1,..., p, and a_ij = ρ for i ≠ j.
Proof. Since

A(ρ) = (1 − ρ)I_p + ρee' = (1 − ρ)I_p + ρpP_e = (1 − ρ)Q_e + (1 + (p − 1)ρ)P_e,

the equivalence of (ii) and (iii) follows by taking α = σ²(1 + (p − 1)ρ) and β = σ²(1 − ρ). Since ge = e for g ∈ 𝒫_p, gP_e = P_e g. Thus if (ii) holds, then

gΣg' = αgP_e g' + βgQ_e g' = αP_e + βQ_e = Σ
so (i) holds. To show (i) implies (ii), let X ∈ R^p be a random vector with Cov(X) = Σ. Then (i) implies that Cov(X) = Cov(gX) for g ∈ 𝒫_p. Therefore,

var(X_i) = var(X_j),  i, j = 1,..., p,

and

cov(X_i, X_j) = cov(X_{i'}, X_{j'});  i ≠ j, i' ≠ j'.

Let γ = var(X_1) and δ = cov(X_1, X_2). Then

Σ = δee' + (γ − δ)I_p = pδP_e + (γ − δ)(P_e + Q_e)
  = (γ + (p − 1)δ)P_e + (γ − δ)Q_e = αP_e + βQ_e

where α = γ + (p − 1)δ and β = γ − δ. The positivity of α and β follows from the assumption that Σ is positive definite. □
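Proposition 9.9 can be checked numerically. In the sketch below (numpy, with arbitrary illustrative values of p, σ², and ρ), the matrix σ²A(ρ) is assembled, the relations α = σ²(1 + (p − 1)ρ) and β = σ²(1 − ρ) are verified, and invariance under a permutation is confirmed:

```python
import numpy as np

p, sigma2, rho = 4, 2.0, 0.3           # illustrative values; -1/(p-1) < rho < 1
e = np.ones(p)
Pe = np.outer(e, e) / p                # orthogonal projection onto span{e}
Qe = np.eye(p) - Pe

A_rho = (1 - rho) * np.eye(p) + rho * np.outer(e, e)   # A(rho)
Sigma = sigma2 * A_rho

# (ii) <-> (iii): Sigma = alpha*Pe + beta*Qe with the stated alpha, beta.
alpha = sigma2 * (1 + (p - 1) * rho)
beta = sigma2 * (1 - rho)
assert np.allclose(Sigma, alpha * Pe + beta * Qe)

# (i): invariance under an arbitrary permutation matrix g.
g = np.eye(p)[[2, 0, 3, 1]]
assert np.allclose(g @ Sigma @ g.T, Sigma)
```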
A covariance matrix Σ that satisfies one of the conditions of Proposition 9.9 is called an intraclass covariance matrix and is said to have intraclass covariance structure. Now that intraclass covariance matrices have been described, suppose that X ∈ L_{p,n} has a normal distribution with μ = E(X) ∈ M and Cov(X) ∈ γ, where M is a linear subspace of L_{p,n} and

γ = {I_n ⊗ Σ | Σ ∈ S_p⁺, Σ = αP_e + βQ_e, α > 0, β > 0}.
The covariance structure assumed for X means that the rows of X are independent and each row of X has the same intraclass covariance structure. In terms of invariance, if Γ ⊗ g ∈ 𝒪_n ⊗ 𝒫_p and I_n ⊗ Σ ∈ γ, it is clear that

Cov((Γ ⊗ g)X) = Cov(X)

since

(Γ ⊗ g)(I_n ⊗ Σ)(Γ ⊗ g)' = (ΓI_nΓ') ⊗ (gΣg') = I_n ⊗ Σ.

Conversely, if T is a positive definite linear transformation on L_{p,n} that satisfies

(Γ ⊗ g)T(Γ ⊗ g)' = T for Γ ⊗ g ∈ 𝒪_n ⊗ 𝒫_p,

it is not difficult to show that T ∈ γ. The proof of this is left to the reader.
Since the identity linear transformation is an element of γ, in order that the least-squares estimator of μ ∈ M be the maximum likelihood estimator, it is sufficient that

(I_n ⊗ Σ)M ⊆ M for I_n ⊗ Σ ∈ γ.

Our next task is to describe a class of linear subspaces M that satisfy the above condition.
Proposition 9.10. Let C be an r × p real matrix of rank r with rows c1',..., cr'. If u1,..., ur is any basis for N = span{c1,..., cr} and U is the r × p matrix with rows u1',..., ur', then there exists an r × r nonsingular matrix A such that AU = C.

Proof. Since u1,..., ur is a basis for N,

ci = ∑_{k=1}^r a_{ik} u_k,  i = 1,..., r,

for some real numbers a_{ik}. Setting A = (a_{ik}), AU = C follows. As the basis {u1,..., ur} is mapped onto the basis {c1,..., cr} by the linear transformation defined by A, the matrix A is nonsingular. □
Given positive integers n and p, let k and r be positive integers that satisfy k < n and r < p. Define a subspace M ⊆ L_{p,n} by

M = {μ | μ = Z1BZ2; B ∈ L_{r,k}}

where Z1 is n × k of rank k and Z2 is r × p of rank r, and assume that e ∈ R^p is an element of the subspace spanned by the rows of Z2, say e ∈ N = span{z1,..., zr} where the rows of Z2 are z1',..., zr'. At this point, it is convenient to relabel things a bit. Let u1 = e/√p, u2,..., ur be an orthonormal basis for N and let U: r × p have rows u1',..., ur'. By Proposition 9.10, Z2 = AU for some r × r nonsingular matrix A, so (absorbing the nonsingular A into B)

M = {μ | μ = Z1BU, B ∈ L_{r,k}}.
Summarizing, X ∈ L_{p,n} is assumed to have a normal distribution with E(X) ∈ M and Cov(X) ∈ γ, where M and γ are given above. To decompose this model for X into the product of two simple univariate linear models, let Γ ∈ 𝒪_p have u1',..., ur' as its first r rows. With Y = (I_n ⊗ Γ)X,

E(Y) = (E(X))Γ' = Z1BUΓ'
and

Cov(Y) = (I_n ⊗ Γ)Cov(X)(I_n ⊗ Γ)'
       = (I_n ⊗ Γ)(I_n ⊗ (αP_e + βQ_e))(I_n ⊗ Γ)'
       = I_n ⊗ (αΓP_eΓ' + βΓQ_eΓ').

However,

UΓ' = (I_r 0),  ΓP_eΓ' = ε1ε1',  and  ΓQ_eΓ' = I_p − ε1ε1',

where ε1' = (1, 0,..., 0) ∈ R^p. Therefore, the matrix D = αΓP_eΓ' + βΓQ_eΓ' is
diagonal with diagonal elements d1,..., dp given by d1 = α and d2 = ··· = dp = β. Let Y1,..., Yp be the columns of Y and let b1,..., br be the columns of B. Then it is clear that Y1,..., Yp are independent,

L(Y1) = N(Z1b1, αI_n),

L(Yi) = N(Z1bi, βI_n),  i = 2,..., r,

and

L(Yi) = N(0, βI_n),  i = r + 1,..., p.
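The diagonalization that produces these independent columns is easy to exhibit. In the sketch below, Γ is any orthogonal matrix whose first row is e'/√p; the remaining rows are an arbitrary completion obtained from a QR factorization (an implementation choice, not part of the text):

```python
import numpy as np

p, alpha, beta = 5, 3.0, 1.0           # illustrative values
e = np.ones(p)
Pe = np.outer(e, e) / p
Qe = np.eye(p) - Pe
Sigma = alpha * Pe + beta * Qe         # intraclass covariance

# Build an orthogonal Gamma whose first row is e'/sqrt(p); the other rows come
# from a QR factorization and are one arbitrary completion among many.
M = np.column_stack([e / np.sqrt(p), np.eye(p)[:, :p - 1]])
Q, _ = np.linalg.qr(M)
Gamma = Q.T
Gamma[0] = e / np.sqrt(p)              # pin down the sign of the first row

D = Gamma @ Sigma @ Gamma.T
assert np.allclose(Gamma @ Gamma.T, np.eye(p))                  # Gamma orthogonal
assert np.allclose(D, np.diag(np.r_[alpha, beta * np.ones(p - 1)]))
```

The rotated covariance is diag(α, β,..., β), which is exactly why the first column of Y separates from the rest.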
To piece things back together, set m = n(p − 1) and let V ∈ R^m be given by V = (Y2', Y3',..., Yp')'. Then

L(V) = N(Z̃δ, βI_m)

where δ ∈ R^{(r−1)k}, δ' = (b2',..., br'), and Z̃ is the m × ((r − 1)k) matrix

Z̃ = [ Z1 0 ··· 0
       0 Z1 ··· 0
       ⋮        ⋮
       0 0 ··· Z1
       0 0 ···  0 ]

with r − 1 diagonal blocks Z1 followed by n(p − r) rows of zeros.
Thus X has been decomposed into the two independent random vectors Y1 and V, and the linear models for Y1 and V are given by the parameter sets (M1, γ1) and (M2, γ2) where

M1 = {μ1 | μ1 = Z1b1; b1 ∈ R^k},

γ1 = {αI_n | α > 0},

M2 = {μ2 | μ2 = Z̃δ; δ ∈ R^{(r−1)k}},

and

γ2 = {βI_m | β > 0}.

Both of these linear models are univariate in the sense that γ1 and γ2 consist of a constant times an identity matrix.
It is obvious that the general theory developed in Section 9.1 for the MANOVA model applies directly to the above two linear models individually. In particular, the maximum likelihood estimators of b1, α, δ, and β can simply be written down. Also, linear hypotheses about b1 or δ can be tested separately, and uniformly most powerful invariant tests will exist for such testing problems when the two linear models are treated separately. However, an interesting phenomenon occurs when we test a joint hypothesis about b1 and δ. For example, suppose the null hypothesis H0 is that b1 = 0 and δ = 0, and the alternative is that b1 ≠ 0 or δ ≠ 0. This null hypothesis is equivalent to the hypothesis that B = 0 in the original model for X. By simply writing down the densities of Y1 and V and substituting in the maximum likelihood estimators of the parameters, the likelihood ratio test for H0 rejects if

Λ = (||Y1 − Z1b̂1||² / ||Y1||²)^{n/2} (||V − Z̃δ̂||² / ||V||²)^{m/2}
is too small. Here, ||·|| denotes the standard norm on the coordinate Euclidean space under consideration. Let

W1 = ||Y1 − Z1b̂1||² / ||Y1||²

and

W2 = ||V − Z̃δ̂||² / ||V||²
so W1 and W2 are independent and each has a beta distribution. When p ≥ 3, then m = n(p − 1) > n, and it follows that Λ^{2/n} = W1W2^{m/n} is not in general distributed as a product of independent beta random variables. This is in contrast to the situation treated in Section 9.2 of this chapter.
We end this section with a brief description of what might be called multivariate intraclass covariance matrices. If X ∈ R^p and Cov(X) = Σ, then Σ is an intraclass covariance matrix iff Cov(gX) = Cov(X) for all g ∈ 𝒫_p. When the random vector X is replaced by the random matrix Y: p × q, then the expression gY = (g ⊗ I_q)Y still makes sense for g ∈ 𝒫_p, and it is natural to seek a characterization of Cov(Y) when Cov(Y) = Cov((g ⊗ I_q)Y) for all g ∈ 𝒫_p. For g ∈ 𝒫_p, the linear transformation g ⊗ I_q just permutes the rows of Y and, to characterize T = Cov(Y), we must describe how permutations of the rows of Y affect T. The condition that Cov(Y) = Cov((g ⊗ I_q)Y) is equivalent to the condition

T = (g ⊗ I_q)T(g ⊗ I_q)',  g ∈ 𝒫_p.
For A and B in S_q⁺, consider

T0 = P_e ⊗ A + Q_e ⊗ B.

Then T0 is a self-adjoint and positive definite linear transformation on L_{q,p} to L_{q,p}. It is readily verified that

T0 = (g ⊗ I_q)T0(g ⊗ I_q)',  g ∈ 𝒫_p.

That T0 is a possible generalization of an intraclass covariance matrix is fairly clear: the positive scalars α and β of Proposition 9.9 have become the positive definite matrices A and B. The following result shows that if T is (𝒫_p ⊗ I_q)-invariant, that is, if T satisfies T = (g ⊗ I_q)T(g ⊗ I_q)' for all g ∈ 𝒫_p, then T must be a T0 for some positive definite A and B.
Proposition 9.11. If T is positive definite and (𝒫_p ⊗ I_q)-invariant, then there exist q × q positive definite matrices A and B such that

T = P_e ⊗ A + Q_e ⊗ B.

Proof. The proof of this is left to the reader. □
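A quick numerical check of the invariance of T0 = P_e ⊗ A + Q_e ⊗ B, with numpy's kron standing in for ⊗ and arbitrary positive definite A and B (the sizes p and q are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 4, 2                            # illustrative sizes
e = np.ones(p)
Pe = np.outer(e, e) / p
Qe = np.eye(p) - Pe

def spd(k):
    """A random symmetric positive definite k x k matrix."""
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

A, B = spd(q), spd(q)
T0 = np.kron(Pe, A) + np.kron(Qe, B)   # Pe (x) A + Qe (x) B

g = np.eye(p)[[1, 3, 0, 2]]            # a permutation in P_p
G = np.kron(g, np.eye(q))              # g (x) I_q
assert np.allclose(T0, T0.T)           # self-adjoint
assert np.allclose(G @ T0 @ G.T, T0)   # (P_p (x) I_q)-invariance
```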
Unfortunately, space limitations prevent a detailed description of linear models that have covariances of the form I_n ⊗ T where T is given in Proposition 9.11. However, the analysis of these models proceeds along the lines of that given for intraclass covariance models and, as usual, these models can be decomposed into independent pieces, each of which is a MANOVA model.
9.4. SYMMETRY MODELS: AN EXAMPLE
The covariance structures studied thus far in this chapter are special cases of a class of covariance models called symmetry models. To describe these, let (V, (·, ·)) be an inner product space and let G be a compact subgroup of 𝒪(V). Define the class of positive definite transformations γ_G by

γ_G = {Σ | Σ ∈ L(V, V), Σ > 0, gΣg' = Σ for all g ∈ G}.

Thus γ_G is the set of positive definite covariances that are invariant under G in the sense that Σ = gΣg' for g ∈ G. To justify the term symmetry model for γ_G, first observe that the notion of "symmetry" is most often expressed in terms of a group acting on a set. Further, if X is a random vector in V with Cov(X) = Σ, then Cov(gX) = gΣg'. Thus the condition that Σ = gΣg' is precisely the condition that X and gX have the same covariance; hence the term symmetry model.
Most of the covariance sets considered in this book have been symmetry models for a particular choice of (V, (·, ·)) and G. For example, if G = 𝒪(V), then

γ_G = {Σ | Σ = σ²I, σ² > 0},

as Proposition 2.13 shows. Hence 𝒪(V) generates the weakly spherical symmetry model. The result of Proposition 2.19 establishes that when (V, (·, ·)) = (L_{p,n}, ⟨·, ·⟩) and

G = {g | g = Γ ⊗ I_p, Γ ∈ 𝒪_n},

then

γ_G = {Σ | Σ = I_n ⊗ A, A ∈ S_p⁺}.

Of course, this symmetry model has occurred throughout this book. Using techniques similar to those in Proposition 2.19, the covariance models considered in Section 9.2 are easily shown to be symmetry models for an appropriate group. Moreover, Propositions 9.9 and 9.11 describe sets of
covariances (the intraclass covariances and their multivariate extensions) in exactly the manner in which the set γ_G was defined. Thus symmetry models are not unfamiliar objects.
Now, given a closed group G ⊆ 𝒪(V), how can we explicitly describe the model γ_G? Unfortunately, there is no one method or approach that is appropriate for all groups G. For example, the results of Proposition 2.19 and Proposition 9.9 were proved by quite different means. However, there is a general structure theory known for the models γ_G (see Andersson, 1975), but we do not discuss that here. The general theory tells us what γ_G should look like, but it does not tell us how to derive the particular form of γ_G.
The remainder of this section is devoted to an example where the methods are a bit different from those encountered thus far. To motivate the considerations below, consider observations X1,..., Xp, which are taken at p equally spaced points on a circle and are numbered sequentially around the circle. For example, the observations might be temperatures at a fixed cross section on a cylindrical rod when a heat source is present at the center of the rod. Impurities in the rod and the interaction of adjacent measuring devices may make an exchangeability assumption concerning the joint distribution of X1,..., Xp unreasonable. However, it may be quite reasonable to assume that the covariance between Xj and Xk depends only on how far apart Xj and Xk are on the circle; that is, cov(Xj, X_{j+1}) does not depend on j, j = 1,..., p, where X_{p+1} = X1; cov(Xj, X_{j+2}) does not depend on j, j = 1,..., p, where X_{p+2} = X2; and so on. Assuming that cov(Xj, Xj) does not depend on j, these assumptions can be succinctly expressed as follows. Let X ∈ R^p have coordinates X1,..., Xp and let C be the p × p matrix with elements

c_{j,j+1} = 1, j = 1,..., p − 1;  c_{p,1} = 1;

and the remaining elements of C zero. A bit of reflection will convince the reader that the conditions assumed on the covariances are equivalent to the condition that Cov(CX) = Cov(X). The matrix C is called a cyclic permutation matrix since, if x ∈ R^p has coordinates x1,..., xp, then Cx has coordinates x2, x3,..., xp, x1. In the case that p = 5, a direct calculation
shows that
Σ = Cov(X) = Cov(CX) = CΣC'

iff Σ has the form

Σ = σ² [ 1   ρ1  ρ2  ρ2  ρ1
         ρ1  1   ρ1  ρ2  ρ2
         ρ2  ρ1  1   ρ1  ρ2
         ρ2  ρ2  ρ1  1   ρ1
         ρ1  ρ2  ρ2  ρ1  1  ]
where σ² > 0. The conditions on ρ1 and ρ2 so that Σ is positive definite are given later. Covariances that satisfy the condition Σ = CΣC' are called cyclic covariances. Some further motivation for the study of cyclic covariances can be found in Olkin and Press (1969).
To begin the formal treatment of cyclic covariances, first observe that C^p = I_p, so the group generated by C is

G0 = {I_p, C, C²,..., C^{p−1}}.
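The matrix C and the invariance condition are easy to exhibit numerically; the sketch below uses p = 5 and arbitrary values of σ², ρ1, ρ2 to check that C^p = I_p and that the displayed circulant Σ satisfies CΣC' = Σ:

```python
import numpy as np

p = 5
C = np.zeros((p, p))
for j in range(p - 1):
    C[j, j + 1] = 1.0                  # c_{j,j+1} = 1
C[p - 1, 0] = 1.0                      # c_{p,1} = 1

# C^p = I_p, so {I, C, ..., C^{p-1}} is a group of order p.
assert np.allclose(np.linalg.matrix_power(C, p), np.eye(p))

# A cyclic covariance: entry (j, k) depends only on (k - j) mod p.
sigma2, r1, r2 = 1.5, 0.4, 0.2         # illustrative values
first_row = np.array([1.0, r1, r2, r2, r1])
Sigma = sigma2 * np.array([np.roll(first_row, j) for j in range(p)])
assert np.allclose(Sigma, Sigma.T)
assert np.allclose(C @ Sigma @ C.T, Sigma)
```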
Since C generates G0, it is clear that CΣC' = Σ iff gΣg' = Σ for all g ∈ G0. In what follows, only the case of p = 2q + 1, q ≥ 1, is treated. When p is even, slightly different expressions are obtained, but the analyses are similar. Rather than characterize the covariance set γ_{G0} directly, it is useful and instructive to first describe the set

𝒞_{G0} = {B | BC = CB, B ∈ ℂ_p}.

Recall that ℂ^p is the complex vector space of p-dimensional coordinate complex vectors and ℂ_p is the set of all p × p complex matrices. Consider the complex number r = exp[2πi/p] and define complex column vectors w_k ∈ ℂ^p with jth coordinate given by

w_{jk} = p^{−1/2} exp[(2πi/p)(j − 1)(k − 1)],  j = 1,..., p,
for k = 1,..., p. A direct calculation shows that

w_k* w_l = δ_{kl},  k, l = 1,..., p,

so w1,..., wp is an orthonormal basis for ℂ^p. For future reference, note that

w1 = p^{−1/2}e,  w̄_k = w_{p−k+2}, k = 2,..., q + 1,

where p = 2q + 1, q ≥ 1. Here, the bar over w_k denotes complex conjugate, and e is the vector of ones in ℂ^p.
The basic relation

Cw_k = r^{k−1}w_k,  k = 1,..., p,

shows that

(9.1)  C = ∑_{k=1}^p r^{k−1} w_k w_k*.
As usual, * denotes conjugate transpose. Obviously, 1, r,..., r^{p−1} are eigenvalues of C with corresponding eigenvectors w1,..., wp. Let D0 ∈ ℂ_p be diagonal with d_{kk} = r^{k−1}, k = 1,..., p, and let U ∈ ℂ_p have columns w1,..., wp. The relation (9.1) can be written C = UD0U*. Since UU* = I_p, U is a unitary complex matrix.
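In coordinates, U is the discrete Fourier transform matrix, and the relation C = UD0U* can be verified directly (0-based indices in the code, versus 1-based in the text; p = 5 is an illustrative choice):

```python
import numpy as np

p = 5
idx = np.arange(p)
# Columns w_k (0-based): U[j, k] = p^{-1/2} exp[(2 pi i / p) j k].
U = np.exp(2j * np.pi * np.outer(idx, idx) / p) / np.sqrt(p)
assert np.allclose(U @ U.conj().T, np.eye(p))      # U is unitary

C = np.roll(np.eye(p), 1, axis=1)                  # cyclic permutation, (Cx)_j = x_{j+1}
r = np.exp(2j * np.pi / p)
D0 = np.diag(r ** idx)                             # d_kk = r^{k-1} in the text's indexing
assert np.allclose(U @ D0 @ U.conj().T, C)         # relation (9.1): C = U D0 U*
```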
Proposition 9.12. The set 𝒞_{G0} consists of those B ∈ ℂ_p that have the form

(9.2)  B = ∑_{k=1}^p β_k w_k w_k*

where β1,..., βp are arbitrary complex numbers.
Proof. If B has the form (9.2), the identity BC = CB follows easily from (9.1). Conversely, suppose BC = CB. Then

BUD0U* = UD0U*B

so

U*BUD0 = D0U*BU

since U*U = I_p. In other words, U*BU commutes with D0. But D0 is a diagonal matrix with distinct nonzero diagonal elements. This implies that U*BU must be diagonal, say D, with diagonal elements β1,..., βp. Thus U*BU = D so B = UDU*. Then B has the form (9.2). □
The next step is to identify those elements of 𝒞_{G0} that are real and symmetric. Consider B ∈ 𝒞_{G0} so

B = ∑_{k=1}^p β_k w_k w_k*.

Now, suppose that B is real and symmetric. Then the eigenvalues of B, namely β1,..., βp, are real. Since β1,..., βp are real and B is real, we have

∑_{k=1}^p β_k w_k w_k* = B = B̄ = ∑_{k=1}^p β_k w̄_k w̄_k*.

The relationship w̄_k = w_{p−k+2}, k = 2,..., q + 1, implies that β_k = β_{p−k+2},
k = 2,..., q + 1, so

(9.3)  B = β1 w1 w1* + ∑_{k=2}^{q+1} β_k (w_k w_k* + w̄_k w̄_k*).
But any B given by (9.3) is real, symmetric, and commutes with C, and conversely. We now show that (9.3) yields a spectral form for the real symmetric elements of 𝒞_{G0}. Write w_k = x_k + iy_k with x_k, y_k ∈ R^p, and define u_k ∈ R^p by

u_k = x_k + y_k,  k = 1,..., p.

The two identities

w_k* w_l = δ_{kl},  k, l = 1,..., p,

w̄_k = w_{p−k+2},  k = 2,..., p,

and the reality of w1 yield the identities

u_k'u_l = δ_{kl},  k, l = 1,..., p,

w_k w_k* + w̄_k w̄_k* = u_k u_k' + u_{p−k+2} u_{p−k+2}',  k = 2,..., p.

Thus u1,..., up is an orthonormal basis for R^p. Hence any B of the form (9.3) can be written

B = β1 u1 u1' + ∑_{k=2}^{q+1} β_k (u_k u_k' + u_{p−k+2} u_{p−k+2}')

and this is a spectral form for B. Such a B is positive definite iff β_k > 0 for k = 1,..., q + 1. This discussion yields the following.
Proposition 9.13. The symmetry model γ_{G0} consists of those covariances Σ that have the form

(9.4)  Σ = α1 u1 u1' + ∑_{k=2}^{q+1} α_k (u_k u_k' + u_{p−k+2} u_{p−k+2}')

where α_k > 0 for k = 1,..., q + 1.
Let Γ have rows u1',..., up'. Then Γ is a p × p symmetric orthogonal matrix with elements

γ_{jk} = p^{−1/2}( cos[(2π/p)(j − 1)(k − 1)] + sin[(2π/p)(j − 1)(k − 1)] )

for j, k = 1,..., p. Further, any Σ given by (9.4) will be diagonalized by Γ; that is, ΓΣΓ' is diagonal, say D, with diagonal elements

d_k = α_k, k = 1,..., q + 1;  d_{p−k+2} = α_k, k = 2,..., q + 1.

Since Γ simultaneously diagonalizes all the elements of γ_{G0}, Γ can sometimes be used to simplify the analysis of certain models with covariances in γ_{G0}. This is done in the following example.
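Both claims about Γ (symmetry, orthogonality, and simultaneous diagonalization of γ_{G0}) can be checked numerically. The sketch below uses p = 5 (so q = 2) and arbitrary positive values α1, α2, α3; indices are 0-based, so the book's u_{p−k+2} becomes row p − k:

```python
import numpy as np

p, q = 5, 2
Jm, Km = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
theta = 2 * np.pi * Jm * Km / p                      # (j-1)(k-1) with 0-based indices
Gamma = (np.cos(theta) + np.sin(theta)) / np.sqrt(p) # row k is u_k'

assert np.allclose(Gamma, Gamma.T)                   # symmetric
assert np.allclose(Gamma @ Gamma.T, np.eye(p))       # orthogonal

# A covariance of the form (9.4); the alphas are arbitrary positive values.
alphas = np.array([2.0, 1.0, 0.5])
u = Gamma
Sigma = alphas[0] * np.outer(u[0], u[0])
for k in range(1, q + 1):
    Sigma += alphas[k] * (np.outer(u[k], u[k]) + np.outer(u[p - k], u[p - k]))

C = np.roll(np.eye(p), 1, axis=1)                    # cyclic permutation
assert np.allclose(C @ Sigma @ C.T, Sigma)           # Sigma is cyclic
d = np.r_[alphas, alphas[:0:-1]]                     # (a1, a2, a3, a3, a2)
assert np.allclose(Gamma @ Sigma @ Gamma.T, np.diag(d))
```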
As an application of the foregoing analysis, suppose Y1,..., Yn are independent with Yj ∈ R^p, p = 2q + 1, and L(Yj) = N(μ, Σ), j = 1,..., n. It is assumed that Σ is a cyclic covariance, so Σ ∈ γ_{G0}. In what follows, we derive the likelihood ratio test for testing H0, the null hypothesis that the coordinates of μ are all equal, versus H1, the alternative that μ is completely unknown. As usual, form the matrix Y: n × p with rows Yj', j = 1,..., n, so

L(Y) = N(eμ', I_n ⊗ Σ)

where μ ∈ R^p, Σ ∈ γ_{G0}, and e is the vector of ones in R^n. Consider the new random vector Z = (I_n ⊗ Γ)Y where Γ is defined in the previous paragraph. Setting ν = Γμ, we have

L(Z) = N(eν', I_n ⊗ D)

where D = ΓΣΓ'. As noted earlier, D is diagonal with diagonal elements

d_k = α_k, k = 1,..., q + 1;  d_{p−k+2} = α_k, k = 2,..., q + 1.
Since Σ was assumed to be a completely unknown element of γ_{G0}, the diagonal elements of D are unknown parameters subject only to the restriction that α_j > 0, j = 1,..., q + 1. In terms of ν = Γμ, the null hypothesis is H0: ν2 = ··· = νp = 0. Because of the structure of D, it is convenient to relabel things once more. Denote the columns of Z by Z1,..., Zp and consider W1,..., W_{q+1} defined by

W1 = Z1;  W_j = (Z_j, Z_{p−j+2}),  j = 2,..., q + 1.

Thus W1 ∈ R^n and W_j ∈ L_{2,n} for j = 2,..., q + 1. Define vectors ξ_j ∈ R²
by

ξ_j' = (ν_j, ν_{p−j+2}),  j = 2,..., q + 1.

Now, it is clear that W1,..., W_{q+1} are independent and

L(W1) = N(ν1e, α1I_n);  L(W_j) = N(eξ_j', α_j(I_n ⊗ I_2)),  j = 2,..., q + 1.

Further, the null hypothesis is H0: ξ_j = 0, j = 2,..., q + 1, and the alternative is that ξ_j ≠ 0 for some j = 2,..., q + 1. With the model written in this
form, a derivation of the likelihood ratio test is routine. Let P_e = ee'/n and let ||·|| denote the usual norm on L_{2,n}. Then the likelihood ratio test rejects H0 for small values of

Λ = ∏_{j=2}^{q+1} ||W_j − P_eW_j||² / ||W_j||².
Of course, the likelihood ratio test of H_{0j}: ξ_j = 0 versus H_{1j}: ξ_j ≠ 0 rejects for small values of

Λ_j = ||W_j − P_eW_j||² / ||W_j||²,  j = 2,..., q + 1.
The random variables Λ2,..., Λ_{q+1} are independent and, under H_{0j},

L(Λ_j) = Beta(n − 1, 1).

Thus, under H0, Λ is distributed as a product of independent beta random variables, each with parameters n − 1 and 1.
We end this section with a discussion that leads to a new type of structured covariance, namely, the complex covariance structure that is discussed more fully in the next section. This covariance structure arises when we search for an analog of Proposition 9.11 for the cyclic group G0. To keep things simple, assume p = 3 (i.e., q = 1), so G0 has three elements and is a subgroup of the permutation group 𝒫_3, which has six elements. Since p = 3, Propositions 9.9 and 9.13 yield that γ_{𝒫_3} = γ_{G0}, and these symmetry models consist of those covariances of the form

Σ = αP_e + βQ_e,  α > 0, β > 0,

where P_e = (1/3)ee' and Q_e = I_3 − P_e.
Now, consider the two groups 𝒫3 ⊗ I_r and G0 ⊗ I_r acting on ℒ_{r,3} by

(g ⊗ I_r)(x) = gx,   g ∈ 𝒫3 or g ∈ G0, x ∈ ℒ_{r,3}.

Proposition 9.11 states that a covariance T on ℒ_{r,3} is 𝒫3 ⊗ I_r invariant iff

(9.5)   T = P_e ⊗ A + Q_e ⊗ B

for some r × r positive definite A and B. We now claim that for r > 1, there are covariances on ℒ_{r,3} that cannot be written in the form (9.5), but that are G0 ⊗ I_r invariant.

To establish the above claim, recall that the vectors u1, u2, and u3 defined earlier are an orthonormal basis for R^3 and

P_e = u1u1',   Q_e = u2u2' + u3u3'.

These vectors were defined from the vectors w_k = x_k + iy_k, k = 1, 2, 3, by u_k = x_k + y_k, k = 1, 2, 3. Define the matrix J by

J = i[w2w2* − w3w3*].

By Proposition 9.12, J commutes with C. Consider vectors v2 and v3 given by

v2 = (1/√2)(u2 + u3),   v3 = (1/√2)(u2 − u3)

so {v2, v3} is an orthonormal basis for span{u2, u3}. Since w3 = w̄2, we have u3 = x2 − y2, which implies that v2 = √2 x2 and v3 = √2 y2. This readily implies that

J = v2v3' − v3v2'
so J is skew-symmetric, nonzero, and Ju1 = 0. Now, consider the linear transformation T0 on ℒ_{r,3} to ℒ_{r,3} given by

T0 = P_e ⊗ A + Q_e ⊗ B + J ⊗ F

where A and B are r × r and positive definite and F is skew-symmetric. It is now a routine matter to show that (C ⊗ I_r)T0 = T0(C ⊗ I_r) since CP_e = P_eC, CQ_e = Q_eC, and JC = CJ. Thus T0 commutes with each element of G0 ⊗ I_r, and T0 is symmetric as both J and F are skew-symmetric. We now make two claims: first, that a nonzero F exists such that T0 is positive definite, and second, that such a T0 cannot be written in the form (9.5). Since P_e ⊗ A + Q_e ⊗ B is positive definite, it follows that for all skew-symmetric F's that are sufficiently small,

P_e ⊗ A + Q_e ⊗ B + J ⊗ F

is positive definite. Thus there exists a nonzero skew-symmetric F so that T0 is positive definite. To establish the second claim, we have the following.
Proposition 9.14. Suppose that

P_e ⊗ A1 + Q_e ⊗ B1 + J ⊗ F1 = P_e ⊗ A2 + Q_e ⊗ B2 + J ⊗ F2

where A_j and B_j, j = 1, 2, are symmetric and F_j, j = 1, 2, is skew-symmetric. This implies that A1 = A2, B1 = B2, and F1 = F2.
Proof. Recall that {u1, v2, v3} is an orthonormal basis for R^3. The relation Q_eu1 = Ju1 = 0 implies that for x ∈ R^r,

(P_e ⊗ A_j + Q_e ⊗ B_j + J ⊗ F_j)(u1□x) = u1□(A_jx)

for j = 1, 2, so u1□(A1x) = u1□(A2x). With ⟨·, ·⟩ denoting the natural inner product on ℒ_{r,3}, we have

x'A1x = ⟨u1□x, u1□(A1x)⟩ = ⟨u1□x, u1□(A2x)⟩ = x'A2x

for all x ∈ R^r. The symmetry of A1 and A2 yields A1 = A2. Since P_ev2 = 0, Q_ev2 = v2, and Jv2 = −v3, we have

(P_e ⊗ A1 + Q_e ⊗ B1 + J ⊗ F1)(v2□x) = v2□(B1x) − v3□(F1x) = v2□(B2x) − v3□(F2x)

for all x ∈ R^r. Thus

x'B1x = ⟨v2□x, v2□(B1x) − v3□(F1x)⟩ = x'B2x,

which implies that B1 = B2. Further,

−y'F1x = ⟨v3□y, v2□(B1x) − v3□(F1x)⟩ = −y'F2x

for all x, y ∈ R^r. Thus F1 = F2.  □
In summary, we have produced a covariance

T0 = P_e ⊗ A + Q_e ⊗ B + J ⊗ F

that is (G0 ⊗ I_r)-invariant but is not (𝒫3 ⊗ I_r)-invariant when r > 1. Of course, when r = 1, the two symmetry models 𝒴_{𝒫3} and 𝒴_{G0} are the same. At this point, it is instructive to write out the matrix of T0 in a special ordered basis for ℒ_{r,3}. Let ε1,..., εr be the standard basis for R^r so

{u1□ε1,..., u1□εr, v2□ε1,..., v2□εr, v3□ε1,..., v3□εr}

is an orthonormal basis for (ℒ_{r,3}, ⟨·, ·⟩). A straightforward calculation shows that the matrix of T0 in this basis is

[T0] = ( A   0   0 )
       ( 0   B   F )
       ( 0  −F   B ).

Since [T0] is symmetric and positive definite, the 2r × 2r matrix

(  B   F )
( −F   B )

has these properties also. In other words, for each positive definite B, there is a nonzero skew-symmetric F (in fact, there exist infinitely many such skew-symmetric F's) such that this 2r × 2r matrix is positive definite. This special type of structured covariance has not arisen heretofore. However, it arises again in a very natural way in the next section, where we discuss the complex normal distribution. It is not proved here, but the symmetry model of G0 ⊗ I_r when p = 3 consists of all covariances of the form

T0 = P_e ⊗ A + Q_e ⊗ B + J ⊗ F

where A and B are positive definite and F is skew-symmetric.
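The block form of [T0] is easy to exercise numerically. The sketch below is ours (NumPy assumed, with our own choices of A, B, and F): for a small skew-symmetric F, the matrix [T0] is symmetric and positive definite, as is its lower-right 2r × 2r block.

```python
import numpy as np

# Sketch: build [T0] in the ordered basis built from u1, v2, v3 and confirm
# it is symmetric positive definite for a small skew-symmetric F.
rng = np.random.default_rng(1)
r = 3
M = rng.normal(size=(r, r))
A = M @ M.T + np.eye(r)                  # r x r positive definite
B = M @ M.T + 2 * np.eye(r)              # r x r positive definite
K = rng.normal(size=(r, r))
F = 0.05 * (K - K.T)                     # small skew-symmetric perturbation
Z = np.zeros((r, r))

T0 = np.block([[A, Z, Z],
               [Z, B, F],
               [Z, -F, B]])

is_symmetric = np.allclose(T0, T0.T)     # holds because F' = -F
min_eig = np.linalg.eigvalsh(T0).min()   # > 0 when F is small enough
```

Making F larger eventually destroys positive definiteness, which is why the text requires F sufficiently small.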
9.5. COMPLEX COVARIANCE STRUCTURES
This section contains an introduction to complex covariance structures. One situation where this type of covariance structure arises was described at the end of the last section. To provide further motivation for the study of such
models, we begin this section with a brief discussion of the complex normal distribution. The complex normal distribution arises in a variety of contexts
and it seems appropriate to include the definition and the elementary properties of this distribution.
The notation introduced in Section 1.6 is used here. In particular, ℂ is the field of complex numbers, ℂ^n is the n-dimensional complex vector space of n-tuples (columns) of complex numbers, and C_n is the set of all n × n complex matrices. For x, y ∈ ℂ^n, the inner product between x and y is

(x, y) = Σ_{j=1}^n x̄_j y_j = x*y
where x* denotes the conjugate transpose of x. Each x ∈ ℂ^n has the unique representation x = u + iv with u, v ∈ R^n. Of course, u is the real part of x, v is the imaginary part of x, and i = √−1 is the imaginary unit. This representation of x defines a real vector space isomorphism between ℂ^n and R^{2n}. More precisely, for x ∈ ℂ^n, let

[x] = (u)
      (v) ∈ R^{2n}

where x = u + iv. Then [ax + by] = a[x] + b[y] for x, y ∈ ℂ^n, a, b ∈ R, and obviously, [·] is a one-to-one onto function. In particular, this shows that ℂ^n is a 2n-dimensional real vector space. If C ∈ C_n, then C = A + iB where A and B are n × n real matrices. Thus for x = u + iv ∈ ℂ^n,

Cx = (A + iB)(u + iv) = Au − Bv + i(Av + Bu)

so

[Cx] = (Au − Bv)   (A  −B)(u)
       (Bu + Av) = (B   A)(v).

This suggests that we let {C} be the (2n) × (2n) partitioned matrix given by

{C} = (A  −B)
      (B   A) : (2n) × (2n).

With this definition, [Cx] = {C}[x]. The whole point is that the matrix C ∈ C_n applied to x ∈ ℂ^n can be represented by applying the real matrix
{C} to the real vector [x] ∈ R^{2n}.

A complex matrix C ∈ C_n is called Hermitian if C = C*. Writing C = A + iB with A and B both real, C is Hermitian iff

A + iB = A' − iB',

which is equivalent to the two conditions

A = A',   B = −B'.

Thus C is Hermitian iff {C} is a symmetric real matrix. A Hermitian matrix C is positive definite if x*Cx > 0 for all x ∈ ℂ^n, x ≠ 0. However, for Hermitian C,

x*Cx = [x]'{C}[x]

so C is positive definite iff {C} is a positive definite real matrix. Of course, a Hermitian matrix C is positive semidefinite if x*Cx ≥ 0 for x ∈ ℂ^n, and C is
positive semidefinite iff {C} is positive semidefinite.

Now consider a random variable X with values in ℂ. Then X = U + iV where U and V are real random variables. It is clear that the mean value of X must be defined by

ℰX = ℰU + iℰV,

assuming ℰU and ℰV both exist. The variance of X, assuming it exists, is defined by

var(X) = ℰ[(X − ℰX)(X − ℰX)‾]

where the bar denotes complex conjugate. Since X is a complex random variable, the complex conjugate is necessary if we want the variance of X to be a nonnegative real number. In terms of U and V,

var(X) = var(U) + var(V).
It also follows that

var(aX + b) = aā var(X)

for a, b ∈ ℂ. For two random variables X1 and X2 in ℂ, define the covariance between X1 and X2 (in that order) to be

cov(X1, X2) = ℰ[(X1 − ℰX1)(X2 − ℰX2)‾],

assuming the expectations in question exist. With this definition it is clear that cov(X1, X1) = var(X1), cov(X2, X1) = cov(X1, X2)‾, and

cov(X1, X2 + X3) = cov(X1, X2) + cov(X1, X3).
Further,

cov(a1X1 + b1, a2X2 + b2) = a1ā2 cov(X1, X2)

for a1, a2, b1, b2 ∈ ℂ.
We now turn to the problem of defining a normal distribution on ℂ^n. Basically, the procedure is the same as defining a normal distribution on R^n. Step one is to define a normal distribution with mean zero and variance one on ℂ; then define an arbitrary normal distribution on ℂ by an affine transformation of the distribution defined in step one; and finally we say that Z ∈ ℂ^n has a complex normal distribution if (a, Z) = a*Z has a normal distribution in ℂ for each a ∈ ℂ^n. However, it is not entirely obvious how to carry out step one. Consider X ∈ ℂ and let ℂN(0, 1) denote the distribution, yet to be defined, called the complex normal distribution with mean zero and variance one. Writing X = U + iV, we have

[X] = (U)
      (V) ∈ R^2

so the distribution of X on ℂ determines the joint distribution of U and V on R^2 and, conversely, as [·] is one-to-one and onto. If ℒ(X) = ℂN(0, 1), then the following two conditions should hold:

(i) ℒ(aX) = ℂN(0, 1) for a ∈ ℂ with aā = 1.
(ii) [X] has a bivariate normal distribution on R^2.

When aā = 1 and X has mean zero and variance one, then aX has mean zero and variance one, so condition (i) simply says that a scalar multiple of a complex normal is again complex normal. Condition (ii) is the requirement that a normal distribution on ℂ be transformed into a normal distribution on R^2 under the real linear mapping [·]. It can now be shown that conditions (i) and (ii) uniquely define the distribution of [X] and hence provide us with the definition of a ℂN(0, 1) distribution. Since ℰX = 0, we have ℰ[X] = 0. Condition (i) implies that

ℒ([X]) = ℒ([aX]),   aā = 1.

However, writing a = α + iβ,

[aX] = (α  −β)
       (β   α) [X] = Γ[X]
where Γ is a 2 × 2 orthogonal matrix with determinant equal to one since α² + β² = 1. Therefore,

ℒ([X]) = ℒ(Γ[X])

for all such orthogonal matrices. Using this together with the fact that 1 = var(X) = var(U) + var(V) implies that

Cov([X]) = ½I_2.

Hence

ℒ([X]) = N(0, ½I_2)

so the real and imaginary parts of X are independent normals with mean zero and variance one half.
Definition 9.1. A random variable X = U + iV ∈ ℂ has a complex normal distribution with mean zero and variance one, written ℒ(X) = ℂN(0, 1), if

ℒ((U, V)') = N(0, ½I_2).

With this definition, it is clear that when ℒ(X) = ℂN(0, 1), the density of X on ℂ with respect to two-dimensional Lebesgue measure on ℂ is

p(x) = (1/π)exp[−x̄x],   x ∈ ℂ.
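As a quick numerical illustration (our sketch, not the text's; NumPy assumed), a ℂN(0, 1) draw is U + iV with U and V independent N(0, ½), and the density (1/π)exp(−x̄x) is exactly the product of the two N(0, ½) densities:

```python
import math
import numpy as np

# Sketch: sample CN(0,1) as U + iV with U, V independent N(0, 1/2) and
# check var(X) = E|X|^2 = 1, plus the density factorization.
rng = np.random.default_rng(2)
m = 200000
x = rng.normal(0, math.sqrt(0.5), m) + 1j * rng.normal(0, math.sqrt(0.5), m)
var_hat = np.mean(x * np.conj(x)).real       # should be close to 1

def cn_density(z):
    # p(z) = (1/pi) exp(-z zbar), the CN(0,1) density on the plane
    return math.exp(-abs(z) ** 2) / math.pi

def real_density(u, v):
    # product of two N(0, 1/2) densities: (1/sqrt(pi)) exp(-t^2) each
    return math.exp(-u * u - v * v) / math.pi

gap = abs(cn_density(0.3 + 0.4j) - real_density(0.3, 0.4))
```

The factorization is exact; only the variance check is Monte Carlo.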
Given μ ∈ ℂ and σ², σ > 0, a random variable X1 ∈ ℂ has a complex normal distribution with mean μ and variance σ² if ℒ(X1) = ℒ(σX + μ) where ℒ(X) = ℂN(0, 1). In such a case, we write ℒ(X1) = ℂN(μ, σ²). It is clear that X1 = U1 + iV1 has a ℂN(μ, σ²) distribution iff U1 and V1 are independent and normal with variance σ²/2 and means ℰU1 = μ1, ℰV1 = μ2, where μ = μ1 + iμ2. As in the real case, a basic result is the following.
Proposition 9.15. Suppose X1,..., Xm are independent random variables in ℂ with ℒ(X_j) = ℂN(μ_j, σ_j²), j = 1,..., m. Then

ℒ( Σ_{j=1}^m (a_jX_j + b_j) ) = ℂN( Σ_{j=1}^m (a_jμ_j + b_j), Σ_{j=1}^m a_jā_jσ_j² )

for a_j, b_j ∈ ℂ, j = 1,..., m.

Proof. This is proved by considering the real and imaginary parts of each X_j. The details are left to the reader.  □
Suppose Y is a random vector in ℂ^n with coordinates Y1,..., Yn and that var(Y_j) < +∞ for j = 1,..., n. Define a complex matrix H with elements h_jk given by

h_jk = cov(Y_j, Y_k).

Since h_jk = h̄_kj, H is a Hermitian matrix. For a, b ∈ ℂ^n, a bit of algebra shows that

cov(a*Y, b*Y) = a*Hb = (a, Hb).

As in the real case, H is the covariance matrix of Y and is denoted by Cov(Y) = H. Since a*Ha = var(a*Y) ≥ 0, H is positive semidefinite. If H = Cov(Y) and A ∈ C_n, it is readily verified that Cov(AY) = AHA*.
We now turn to the definition of a complex normal distribution on the n-dimensional complex vector space ℂ^n.

Definition 9.2. A random vector X ∈ ℂ^n has a complex normal distribution if, for each a ∈ ℂ^n, (a, X) = a*X has a complex normal distribution on ℂ.
If X ∈ ℂ^n has a complex normal distribution and if A ∈ C_n, it is clear that AX also has a complex normal distribution since (a, AX) = (A*a, X). In order to describe all the complex normal distributions on ℂ^n, we proceed as in the real case. Let X1,..., Xn be independent with ℒ(X_j) = ℂN(0, 1) on ℂ and let X ∈ ℂ^n have coordinates X1,..., Xn. Since

a*X = Σ_{j=1}^n ā_jX_j,

Proposition 9.15 shows that ℒ(a*X) = ℂN(0, Σ_{j=1}^n ā_ja_j). Thus X has a complex normal distribution. Further, ℰX = 0 and

cov(X_j, X_k) = δ_jk

so Cov(X) = I_n. For A ∈ C_n and μ ∈ ℂ^n, it follows that Y = AX + μ has a complex normal distribution and

ℰY = μ,   Cov(Y) = AA* ≡ H.
Since every nonnegative definite Hermitian matrix can be written as AA* for some A ∈ C_n, we have produced a complex normal distribution on ℂ^n with an arbitrary mean vector μ ∈ ℂ^n and an arbitrary nonnegative definite Hermitian covariance matrix. However, it still must be shown that, if X and X̃ in ℂ^n are complex normal with ℰX = ℰX̃ and Cov(X) = Cov(X̃), then ℒ(X) = ℒ(X̃). The proof of this assertion is left to the reader. Given this fact, it makes sense to speak of the complex normal distribution on ℂ^n with mean vector μ and covariance matrix H, as this specifies a unique probability distribution. If X has such a distribution, the notation

ℒ(X) = ℂN(μ, H)

is used. Writing X = U + iV, it is useful to describe the joint distribution of U and V when ℒ(X) = ℂN(μ, H) on ℂ^n. First, consider X = U + iV where ℒ(X) = ℂN(μ, I_n). Then the coordinates of X are independent and it follows that

ℒ([X]) = N([μ], ½I_{2n})
where μ = μ1 + iμ2 and [μ] = (μ1', μ2')'. For a general nonnegative definite Hermitian matrix H, write H = AA* with A ∈ C_n. Then

ℒ(X) = ℒ(AX0 + μ)

where ℒ(X0) = ℂN(0, I_n). Since

[X0] = (U0)
       (V0)

and

[AX0 + μ] = {A}[X0] + [μ] = (B  −C)(U0)   (μ1)
                            (C   B)(V0) + (μ2)

where A = B + iC, it follows that

ℒ([X]) = ℒ({A}[X0] + [μ]).

But H = Σ + iF where Σ is positive semidefinite, F is skew-symmetric, and the real matrix

{H} = (Σ  −F)
      (F   Σ)
is positive semidefinite. Since H = AA*, {H} = {A}{A}', and therefore,

ℒ([X]) = ℒ({A}[X0] + [μ]) = N([μ], ½{H}).

In summary, we have the following result.

Proposition 9.16. Suppose ℒ(X) = ℂN(μ, H) and write X = U + iV, μ = μ1 + iμ2, and H = Σ + iF. Then

ℒ([X]) = N([μ], ½{H})

where

{H} = (Σ  −F)
      (F   Σ).

Conversely, with U and V jointly distributed as above, set X = U + iV, μ = μ1 + iμ2, and H = Σ + iF. Then ℒ(X) = ℂN(μ, H).
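Proposition 9.16 is easy to exercise numerically. The sketch below is ours (NumPy assumed, with an arbitrary choice of H and μ): it draws (U, V) from N([μ], ½{H}) and recovers the complex covariance H = Σ + iF of X = U + iV empirically.

```python
import numpy as np

# Sketch of Proposition 9.16: sample the real pair (U, V) and recover the
# complex covariance H = Sigma + iF of X = U + iV.
rng = np.random.default_rng(3)
n, m = 2, 100000
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
H = A @ A.conj().T + np.eye(n)               # Hermitian positive definite
Sigma, F = H.real, H.imag                    # Sigma symmetric, F skew
half_H = 0.5 * np.block([[Sigma, -F], [F, Sigma]])   # (1/2){H}

mu = np.array([1.0 + 2.0j, -0.5 + 0.5j])
mean_real = np.concatenate([mu.real, mu.imag])       # [mu]

UV = rng.multivariate_normal(mean_real, half_H, size=m)
X = UV[:, :n] + 1j * UV[:, n:]
D = X - mu
H_hat = (D.T @ D.conj()) / m                 # empirical E (X - mu)(X - mu)*
max_err = np.abs(H_hat - H).max()
```

The empirical H_hat matches H up to Monte Carlo error, illustrating the one-to-one correspondence described next.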
The above proposition establishes a one-to-one correspondence between n-dimensional complex normal distributions, say ℂN(μ, H), and 2n-dimensional real normal distributions with a special covariance structure given by

½{H} = ½ (Σ  −F)
         (F   Σ)

where H = Σ + iF. Given a sample of independent complex normal random vectors, the above correspondence provides us with the option of either analyzing the sample in the complex domain or representing everything in the real domain and performing the analysis there. Of course, the advantage of the real domain analysis is that we have developed a large body of theory that can be applied to this problem. However, this advantage is a bit illusory. As it turns out, many results for the complex normal distribution are clumsy to prove and difficult to understand when expressed in the real domain. From the point of view of understanding, the proper approach is simply to develop a theory of the complex normal distribution that parallels the development already given for the real normal distribution. Because of space limitations, this theory is not given in detail. Rather, we provide a brief list of results for the complex normal with the hope that the reader can see the parallel development. The proofs of many of these results are minor modifications of the corresponding real results.

Consider X ∈ ℂ^p such that ℒ(X) = ℂN(μ, H) where H is nonsingular.
Then the density of X with respect to Lebesgue measure on ℂ^p is

f(x) = π^{−p}(det H)^{−1} exp[−(x − μ)*H^{−1}(x − μ)].

When H = I_p, then

ℒ(2X*X) = χ²_{2p}(2μ*μ).

With this result and the spectral theorem for Hermitian matrices (see Halmos, 1958, Section 79), the distribution of quadratic forms, say X*AX for A Hermitian, can be described in terms of linear combinations of independent noncentral chi-square random variables.
As in the real case, independence of jointly complex normal random vectors is equivalent to the absence of correlation. More precisely, if ℒ(X) = ℂN(μ, H) and if A: q × p and B: r × p are complex matrices, then AX and BX are independent iff AHB* = 0. In particular, if X is partitioned as

X = (X1)
    (X2),   X_j ∈ ℂ^{p_j}, j = 1, 2,

and H is partitioned similarly as

H = (H11  H12)
    (H21  H22)

where H_jk is p_j × p_k, then X1 and X2 are independent iff H12 = 0. When H22 is nonsingular, this implies that X1 − H12H22^{−1}X2 and X2 are independent. This result yields the conditional distribution of X1 given X2, namely,

ℒ(X1|X2) = ℂN(μ1 + H12H22^{−1}(X2 − μ2), H11·2)

where H11·2 = H11 − H12H22^{−1}H21 and μ_j = ℰX_j, j = 1, 2.
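The conditional-distribution formula can be checked by simulation. The sketch below is ours (NumPy assumed): the residual X1 − H12H22^{−1}X2 should be uncorrelated with X2 and have covariance equal to the Schur complement H11·2.

```python
import numpy as np

# Sketch: verify Cov(X1 - H12 H22^{-1} X2) ~ H11.2 and that this residual
# is uncorrelated with X2, for X complex normal with covariance H.
rng = np.random.default_rng(4)
p1, p2, m = 1, 2, 100000
p = p1 + p2
A = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))
H = A @ A.conj().T + np.eye(p)
L = np.linalg.cholesky(H)                       # H = L L*

Zstd = (rng.normal(size=(p, m)) + 1j * rng.normal(size=(p, m))) / np.sqrt(2)
X = L @ Zstd                                    # columns are CN(0, H) draws

H12, H22 = H[:p1, p1:], H[p1:, p1:]
H11_2 = H[:p1, :p1] - H12 @ np.linalg.solve(H22, H12.conj().T)

R = X[:p1] - H12 @ np.linalg.solve(H22, X[p1:]) # residual of X1 given X2
cov_R = (R @ R.conj().T) / m                    # should be near H11.2
cross = (R @ X[p1:].conj().T) / m               # should be near 0
```

Zero correlation here means full independence, by the complex normal analogue of the real-case result quoted above.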
The complex Wishart distribution arises in a natural way, just as the real Wishart distribution did.

Definition 9.3. A p × p random Hermitian matrix S has a complex Wishart distribution with parameters H, p, and n if

ℒ(S) = ℒ( Σ_{j=1}^n X_jX_j* )

where X1,..., Xn ∈ ℂ^p are independent with

ℒ(X_j) = ℂN(0, H).
In such a case, we write

ℒ(S) = ℂW(H, p, n).

In this definition, p is the dimension, n is the degrees of freedom, and H is a p × p nonnegative definite Hermitian matrix. It is clear that S is always nonnegative definite and, as in the real case, S is positive definite with probability one iff H is positive definite and n ≥ p. When p = 1 and H = 1, it is clear that

ℒ(2S) = χ²_{2n}.

Further, complex analogues of Propositions 8.8, 8.9, and 8.13 show that if ℒ(S) = ℂW(H, p, n) with n ≥ p and H positive definite, and if ℒ(X) = ℂN(0, H) with X and S independent, then

ℒ(X*S^{−1}X) = F_{2p,2(n−p+1)}.
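Definition 9.3 lends itself to a direct simulation check. The sketch below is ours (NumPy assumed): averaging draws of S = Σ_j X_jX_j* with ℒ(X_j) = ℂN(0, H) should recover the complex Wishart mean ℰS = nH.

```python
import numpy as np

# Sketch: the complex Wishart CW(H, p, n) has mean n H; estimate E S by
# averaging S = sum_j X_j X_j* over many replications.
rng = np.random.default_rng(5)
p, n, reps = 2, 8, 5000
A = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))
H = A @ A.conj().T + np.eye(p)
L = np.linalg.cholesky(H)

# draw n * reps iid CN(0, H) columns at once; each block of n columns
# contributes one Wishart draw S = sum_j X_j X_j*
Xall = L @ (rng.normal(size=(p, n * reps)) +
            1j * rng.normal(size=(p, n * reps))) / np.sqrt(2)
S_bar = (Xall @ Xall.conj().T) / reps        # average of reps CW(H, p, n) draws
max_rel_err = np.abs(S_bar - n * H).max() / np.abs(n * H).max()
```

S_bar is exactly Hermitian by construction, mirroring the fact that each S is a Hermitian nonnegative definite matrix.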
We now turn to a brief discussion of one special case of the complex MANOVA problem. Suppose X1,..., Xn ∈ ℂ^p are independent with

ℒ(X_j) = ℂN(μ, H)

and assume that H > 0, that is, H is positive definite. The joint density of X1,..., Xn with respect to 2np-dimensional Lebesgue measure is

p(X|μ, H) = ∏_{j=1}^n π^{−p}|H|^{−1} exp[−(X_j − μ)*H^{−1}(X_j − μ)]

          = π^{−np}|H|^{−n} exp[− Σ_{j=1}^n (X_j − μ)*H^{−1}(X_j − μ)]

          = π^{−np}|H|^{−n} exp[−n(X̄ − μ)*H^{−1}(X̄ − μ) − tr( Σ_{j=1}^n (X_j − X̄)(X_j − X̄)* )H^{−1}]

where X̄ = n^{−1} Σ_j X_j and tr denotes the trace. Here, X is the np-dimensional
vector in ℂ^{np} consisting of X1, X2,..., Xn. Setting

S = Σ_{j=1}^n (X_j − X̄)(X_j − X̄)*,

we have

p(X|μ, H) = π^{−np}|H|^{−n} exp[−n(X̄ − μ)*H^{−1}(X̄ − μ) − tr SH^{−1}].

It follows that (X̄, S) is a sufficient statistic for this parametric family and μ̂ = X̄ is the maximum likelihood estimator of μ. Thus

p(X|μ̂, H) = π^{−np}|H|^{−n} exp[−tr SH^{−1}].
A minor modification of the argument given in Example 7.10 shows that when S > 0, p(X|μ̂, H) is maximized uniquely, over all positive definite H, at Ĥ = n^{−1}S. When n ≥ p + 1, then S is positive definite with probability one, so in this case the maximum likelihood estimator of H is Ĥ = n^{−1}S. If μ = 0, then

p(X|0, H) = π^{−np}|H|^{−n} exp[− Σ_{j=1}^n X_j*H^{−1}X_j] = π^{−np}|H|^{−n} exp[−tr S̃H^{−1}]

where

S̃ = Σ_{j=1}^n X_jX_j* = S + nX̄X̄*.

Obviously, p(X|0, H) is maximized at H̃ = n^{−1}S̃. Thus the likelihood ratio test for testing μ = 0 versus μ ≠ 0 rejects for small values of

Λ = p(X|0, H̃)/p(X|μ̂, Ĥ) = |S|^n/|S̃|^n = |S|^n/|S + nX̄X̄*|^n.

As in the real case, X̄ and S are independent,

ℒ(S) = ℂW(H, p, n − 1)

and

ℒ(√n X̄) = ℂN(√n μ, H).
Setting Y = √n X̄,

Λ^{1/n} = |S|/|S + YY*| = 1/(1 + Y*S^{−1}Y)

so the likelihood ratio test rejects for large values of Y*S^{−1}Y ≡ T². Arguments paralleling those in the real case can be used to show that

ℒ(T²) = F(2p, 2(n − p), δ)

where δ = nμ*H^{−1}μ is the noncentrality parameter in the F distribution. Further, the monotone likelihood ratio property of the F distribution can be used to show that the likelihood ratio test is uniformly most powerful among tests that are invariant under the group of complex linear transformations that preserve the above testing problem.

In the preceding discussion, we have outlined one possible analysis of the one-sample problem for the complex normal distribution. A theory for the complex MANOVA problem similar to that given in Section 9.1 for the real MANOVA problem would require complex analogues of many results given in the first eight chapters of this book. Of course, it is possible to represent everything in terms of real random vectors. This representation consists of an n × 2p random matrix Y ∈ ℒ_{2p,n} where

ℒ(Y) = N(ZB, I_n ⊗ Ψ).

As usual, Z is n × r of rank r and B: r × 2p is a real matrix of unknown parameters. The distinguishing feature of the model is that the 2p × 2p covariance Ψ is assumed to have the form

Ψ = (Σ  −F)
    (F   Σ)

where Σ: p × p is positive definite and F: p × p is skew-symmetric. For reasons that should be obvious by now, Ψ's of the above form are said to have complex covariance structure. This model can now be analyzed using the results developed for the real normal linear model. However, as stated earlier, certain results are clumsy to prove and more difficult to understand when expressed in the real domain rather than the complex domain. Although not at all obvious, these models are not equivalent to a product of real MANOVA models of the type discussed in Section 9.1.
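The one-sample statistic discussed in this section can be illustrated numerically. The sketch below is ours (NumPy assumed, data simulated from ℂN(0, I) for convenience): the relation Λ^{1/n} = |S|/|S + YY*| = 1/(1 + Y*S^{−1}Y) is a rank-one determinant identity and holds exactly for any data set with S positive definite.

```python
import numpy as np

# Sketch: compute Xbar, S, Y = sqrt(n) Xbar and T^2 = Y* S^{-1} Y for
# simulated CN(0, I) data, and verify |S| / |S + Y Y*| = 1 / (1 + T^2).
rng = np.random.default_rng(6)
p, n = 3, 20
X = (rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p))) / np.sqrt(2)

Xbar = X.mean(axis=0)
D = X - Xbar
S = D.T @ D.conj()                       # sum_j (X_j - Xbar)(X_j - Xbar)*
Y = np.sqrt(n) * Xbar
T2 = (Y.conj() @ np.linalg.solve(S, Y)).real

lhs = (np.linalg.det(S) / np.linalg.det(S + np.outer(Y, Y.conj()))).real
rhs = 1.0 / (1.0 + T2)
```

Since n > p, S is positive definite with probability one and T² is a positive real number, so rejecting for small Λ is the same as rejecting for large T².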
9.6. ADDITIONAL EXAMPLES OF LINEAR MODELS
The examples of this section have been chosen to illustrate how conditioning can sometimes be helpful in finding maximum likelihood estimators and
also to further illustrate the use of invariance in analyzing linear models. The linear models considered now are not products of MANOVA models, and the regression subspaces are not invariant under the covariance transformations of the model. Thus finding the maximum likelihood estimator of the mean vector is not just a matter of computing the orthogonal projection onto the regression subspace. For the models below, we derive maximum likelihood estimators and likelihood ratio tests and then discuss the problem of finding a good invariant test.
The first model we consider consists of a variation on the one-sample problem. Suppose X1,..., Xn are independent with ℒ(X_i) = N(μ, Σ) where X_i ∈ R^p, i = 1,..., n. As usual, form the n × p matrix X whose rows are X_i', i = 1,..., n. Then

ℒ(X) = N(eμ', I_n ⊗ Σ)

where e ∈ R^n is the vector of ones. When μ and Σ are unknown, the linear model for X is a MANOVA model and the results in Section 9.1 apply directly. To transform this model to canonical form, let Γ be an n × n orthogonal matrix with first row e'/√n. Setting Y = ΓX and β = √n μ',

ℒ(Y) = N(ε1β, I_n ⊗ Σ)

where ε1 is the first unit vector in R^n and β ∈ ℒ_{p,1}. Partition Y as

Y = (Y1)
    (Y2)

where Y1 ∈ ℒ_{p,1}, Y2 ∈ ℒ_{p,m}, and m = n − 1. Then

ℒ(Y1) = N(β, Σ)

and

ℒ(Y2) = N(0, I_m ⊗ Σ).

For testing H0: β = 0, the results of Section 9.1 show that the test that rejects for large values of Y1(Y2'Y2)^{−1}Y1' (assuming m ≥ p) is equivalent to the likelihood ratio test, and this test is most powerful within the class of invariant tests.

We now turn to a testing problem to which the MANOVA results do not
apply. With the above discussion in mind, consider U ∈ ℒ_{p,1} and Z ∈ ℒ_{p,m}, where U and Z are independent with

ℒ(U) = N(β, Σ)

and

ℒ(Z) = N(0, I_m ⊗ Σ).

Here, β ∈ ℒ_{p,1} and Σ > 0 is a completely unknown p × p covariance matrix. Partition β into β1 and β2 where

β_i ∈ ℒ_{p_i,1}, i = 1, 2,   p1 + p2 = p.

Consider the problem of testing the null hypothesis H0: β1 = 0 versus H1: β1 ≠ 0 where β2 and Σ are unknown. Under H0, the regression subspace of the random matrix

(U)
(Z) ∈ ℒ_{p,m+1}

is

M0 = {x ∈ ℒ_{p,m+1} : the first row of x is (0, β2) for some β2 ∈ ℒ_{p2,1} and the remaining m rows of x are zero},

and the set of covariances is

γ = {I_{m+1} ⊗ Σ : Σ > 0}.

It is easy to verify that M0 is not invariant under all the elements of γ, so the maximum likelihood estimator of β2 under H0 cannot be found by least squares (ignoring Σ). To calculate the likelihood ratio test for H0 versus H1, it is convenient to partition U and Z as

U = (U1, U2),   U_i ∈ ℒ_{p_i,1}, i = 1, 2,
Z = (Z1, Z2),   Z_i ∈ ℒ_{p_i,m}, i = 1, 2,
and then condition on U1 and Z1. Since U and Z are independent, the joint distribution of U and Z is specified by the two conditional distributions, ℒ(U2|U1) and ℒ(Z2|Z1), together with the two marginal distributions, ℒ(U1) and ℒ(Z1). Our results for the normal distribution show that these distributions are

ℒ(U2|U1) = N(β2 + (U1 − β1)Σ11^{−1}Σ12, Σ22·1)

ℒ(U1) = N(β1, Σ11)

ℒ(Z2|Z1) = N(Z1Σ11^{−1}Σ12, I_m ⊗ Σ22·1)

ℒ(Z1) = N(0, I_m ⊗ Σ11)
where Σ is partitioned as

Σ = (Σ11  Σ12)
    (Σ21  Σ22)

with Σij being p_i × p_j, i, j = 1, 2. As usual, Σ22·1 = Σ22 − Σ21Σ11^{−1}Σ12. By Proposition 5.8, the reparameterization defined by Ψ11 = Σ11, Ψ12 = Σ11^{−1}Σ12, and Ψ22 = Σ22·1 is one-to-one and onto. To calculate the likelihood ratio test for H0 versus H1, we need to find the maximum likelihood estimators under H0 and H1.
Proposition 9.17. The likelihood ratio test of H0: β1 = 0 versus H1: β1 ≠ 0 rejects H0 if the statistic

Λ = U1S11^{−1}U1'

is too large. Here, S = Z'Z and

S = (S11  S12)
    (S21  S22)

where Sij is p_i × p_j.
Proof. Let f1(U1|β1, Ψ11) be the density of ℒ(U1), let f2(U2|U1, β1, β2, Ψ12, Ψ22) be the conditional density of ℒ(U2|U1), let f3(Z1|Ψ11) be the density of ℒ(Z1), and let f4(Z2|Z1, Ψ12, Ψ22) be the density of ℒ(Z2|Z1). Under H0, β1 = 0 and the unique value of β2 that maximizes f2(U2|U1, 0, β2, Ψ12, Ψ22) is

β̂2 = U2 − U1Ψ12

for Ψ12 fixed. It is clear that

f2(U2|U1, 0, β̂2, Ψ12, Ψ22) ∝ |Ψ22|^{−1/2}

where the symbol ∝ means "is proportional to." We now maximize with respect to Ψ12. With β2 = β̂2, Ψ12 occurs only in the density of Z2 given Z1. Since ℒ(Z2|Z1) = N(Z1Ψ12, I_m ⊗ Ψ22), it follows from our treatment of the MANOVA problem that

Ψ̂12 = (Z1'Z1)^{−1}Z1'Z2 = S11^{−1}S12
and

f4(Z2|Z1, Ψ̂12, Ψ22) ∝ |Ψ22|^{−m/2} exp[−½ tr S22·1Ψ22^{−1}].

Since β1 = 0, it is now clear that

Ψ̂11 = (1/(m+1))[Z1'Z1 + U1'U1] = (1/(m+1))[S11 + U1'U1]

and

Ψ̂22 = (1/(m+1))S22·1.

Substituting these values into the product of the four densities shows that the maximum under H0 is proportional to

Λ0 = |S22·1|^{−(m+1)/2}|S11 + U1'U1|^{−(m+1)/2}.
Under the alternative H1, we again maximize the likelihood function by first noting that

β̂2 = U2 − (U1 − β1)Ψ12

maximizes the density of U2 given U1. Also,

f2(U2|U1, β1, β̂2, Ψ12, Ψ22) ∝ |Ψ22|^{−1/2}.

With this choice of β2, β1 occurs only in the density of U1, so β̂1 = U1 maximizes the density of U1 and

f1(U1|β̂1, Ψ11) ∝ |Ψ11|^{−1/2}.

It now follows easily that the maximum likelihood estimators of Ψ12, Ψ11, and Ψ22 are

Ψ̂12 = S11^{−1}S12,

Ψ̂11 = (1/(m+1))Z1'Z1 = (1/(m+1))S11,

Ψ̂22 = (1/(m+1))S22·1.
Substituting these into the product of the four densities shows that the maximum under H1 is proportional to

Λ1 = |S22·1|^{−(m+1)/2}|S11|^{−(m+1)/2}.

Hence the likelihood ratio test will reject H0 for small values of

Λ0/Λ1 = |S11|^{(m+1)/2}/|S11 + U1'U1|^{(m+1)/2} = (1 + U1S11^{−1}U1')^{−(m+1)/2}.

Thus the likelihood ratio test rejects for large values of

Λ = U1S11^{−1}U1'

and the proof is complete.  □
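The last step of the proof rests on the rank-one determinant identity |S11 + U1'U1| = |S11|(1 + U1S11^{−1}U1'). The sketch below is ours (NumPy assumed, with arbitrary simulated data): it checks the identity, and hence that the likelihood ratio is a decreasing function of Λ.

```python
import numpy as np

# Sketch of Proposition 9.17's last step: with U a 1 x p row, Z m x p, and
# S = Z'Z, verify |S11 + U1'U1| = |S11| (1 + Lambda), so the likelihood
# ratio (1 + Lambda)^{-(m+1)/2} is decreasing in Lambda = U1 S11^{-1} U1'.
rng = np.random.default_rng(7)
p1, p2, m = 2, 3, 12
p = p1 + p2
U = rng.normal(size=(1, p))
Z = rng.normal(size=(m, p))

S11 = (Z.T @ Z)[:p1, :p1]
U1 = U[:, :p1]
lam = (U1 @ np.linalg.solve(S11, U1.T)).item()   # the test statistic Lambda

lhs = np.linalg.det(S11 + U1.T @ U1)
rhs = np.linalg.det(S11) * (1.0 + lam)
```

Because the identity is exact, rejecting for small Λ0/Λ1 and rejecting for large Λ define the same test.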
We now want to show that the test derived above is a uniformly most powerful invariant test under a suitable group of affine transformations. Recall that U and Z are independent and

ℒ(U) = N(β, Σ),   ℒ(Z) = N(0, I_m ⊗ Σ).

The problem is to test H0: β1 = 0 where β = (β1, β2) with β_i ∈ ℒ_{p_i,1}, i = 1, 2. Consider the group G with elements g = (Γ, A, (0, a)) where

Γ ∈ 𝒪_m,   (0, a) ∈ ℒ_{p,1},   a ∈ ℒ_{p2,1},

and

A = (A11    0 )
    (A21  A22)

where Aij is p_i × p_j and Aii is nonsingular for i = 1, 2. The action of g = (Γ, A, (0, a)) is

g (U) = (UA' + (0, a))
  (Z)   (    ΓZA'    ).

The group operation, defined so G acts on the left of the sample space, is

(Γ1, A1, (0, a1))(Γ2, A2, (0, a2)) = (Γ1Γ2, A1A2, (0, a2)A1' + (0, a1)).

It is routine to verify that the testing problem is invariant under G. Further,
it is clear that the induced action of G on the parameter space is

(Γ, A, (0, a))(β, Σ) = (βA' + (0, a), AΣA').

To characterize the invariant tests for the testing problem, a maximal invariant under the action of G on the sample space is needed.

Proposition 9.18. In the notation of Proposition 9.17, a maximal invariant is

Λ = U1S11^{−1}U1'.
Proof. As usual, the proof consists of showing that Λ = U1S11^{−1}U1' is an orbit index. Since m ≥ p, we deal with those Z's that have rank p, a set of probability one. The first claim is that for a given U ∈ ℒ_{p,1} and Z ∈ ℒ_{p,m} of rank p, there exists a g ∈ G such that

g (U) = ((√Λ ε1, 0))
  (Z)   (    Z0    )

where ε1 = (1, 0,..., 0) ∈ ℒ_{p1,1} and Z0 is a fixed element of 𝔉_{p,m}, the set of m × p linear isometries. Write Z = ΨV where Ψ ∈ 𝔉_{p,m} and V is a p × p upper triangular matrix, so S = Z'Z = V'V. Then consider

A = (ξ1   0)
    ( 0  ξ2) (V')^{−1}

where ξ_i ∈ 𝒪_{p_i}, i = 1, 2, and note that A is of the form

A = (A11    0 )
    (A21  A22)

since (V')^{−1} is lower triangular. The values of ξ_i, i = 1, 2, are specified in a moment. With this choice of A,

ZA' = ΨVV^{−1} (ξ1'   0 )     (ξ1'   0 )
               ( 0   ξ2') = Ψ ( 0   ξ2')
which is in 𝔉_{p,m} for any choice of ξ_i ∈ 𝒪_{p_i}, i = 1, 2. Hence there is a Γ ∈ 𝒪_m such that

ΓZA' = Z0.

Since V is upper triangular, write

V^{−1} = (V^{11}  V^{12})
         (  0    V^{22})

with V^{ij} being p_i × p_j. Then

UA' = (U1V^{11}ξ1', (U1V^{12} + U2V^{22})ξ2').

As S = V'V and V ∈ G_T^+, it follows that S11^{−1} = V^{11}(V^{11})', so the vector U1V^{11} has squared length Λ = U1V^{11}(V^{11})'U1' = U1S11^{−1}U1'. Thus there exists ξ1 ∈ 𝒪_{p1} such that

U1V^{11}ξ1' = √Λ ε1

where ε1 = (1, 0,..., 0) ∈ ℒ_{p1,1}. Now choose a ∈ ℒ_{p2,1} to be

a = −(U1V^{12} + U2V^{22})ξ2'

so

UA' + (0, a) = (√Λ ε1, 0).

The above choices for A, ξ_i, Γ, and a yield g = (Γ, A, (0, a)), which satisfies

g (U) = ((√Λ ε1, 0))
  (Z)   (    Z0    )

and this establishes the claim. To show that

Λ = U1S11^{−1}U1'

is maximal invariant, first notice that Λ is invariant. Further, if

(U^(1))     (U^(2))
(Z^(1)) and (Z^(2))
both yield the same value of Λ, then there exist g_i ∈ G such that

g_i (U^(i)) = ((√Λ ε1, 0))
    (Z^(i))   (    Z0    ),   i = 1, 2.

Therefore,

g2^{−1}g1 (U^(1)) = (U^(2))
          (Z^(1))   (Z^(2))

and Λ is maximal invariant.  □
To show that a uniformly most powerful G-invariant test exists, the distribution of Λ = U₁S₁₁⁻¹U₁′ is needed. However,

𝓛(U₁) = N(β₁, Σ₁₁),
𝓛(S₁₁) = W(Σ₁₁, p₁, m),

and U₁ and S₁₁ are independent. From Proposition 8.14, we see that

𝓛(Λ) = F(p₁, m - p₁ + 1; δ)

where δ = β₁Σ₁₁⁻¹β₁′, and the null hypothesis is H₀: δ = 0. Since the noncentral F distribution has a monotone likelihood ratio, the test that rejects for large values of Λ is uniformly most powerful within the class of tests based on Λ. Since all G-invariant tests are functions of Λ, we conclude that the likelihood ratio test is uniformly most powerful invariant. □
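The null distribution of Λ is easy to check numerically. The sketch below is an added illustration, not part of the text: it simulates Λ = U₁S₁₁⁻¹U₁′ under H₀ with Σ₁₁ = I (invariance makes this choice harmless) and compares the scaled statistic with a standard F law; the factor (m - p₁ + 1)/p₁ translates the F(p₁, m - p₁ + 1) convention used here into the usual F parameterization.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p1, m, reps = 3, 20, 20000

# Simulate Lambda = U1 S11^{-1} U1' under H0 (beta1 = 0, Sigma11 = I).
lam = np.empty(reps)
for i in range(reps):
    u1 = rng.standard_normal(p1)           # U1 ~ N(0, I)
    z = rng.standard_normal((m, p1))       # S11 = Z'Z ~ W(I, p1, m)
    lam[i] = u1 @ np.linalg.solve(z.T @ z, u1)

# ((m - p1 + 1)/p1) * Lambda should follow a standard F(p1, m - p1 + 1) law.
scaled = (m - p1 + 1) / p1 * lam
ks = stats.kstest(scaled, stats.f(p1, m - p1 + 1).cdf)
print(round(ks.statistic, 3))
```

With 20,000 replications the Kolmogorov–Smirnov statistic should be small and its p-value comfortably nonsignificant, consistent with the distributional claim above.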
The final problem to be considered in this chapter is a variation of the
problem just solved. Again, the testing problem of interest is H₀: β₁ = 0 versus H₁: β₁ ≠ 0, but it is assumed that the value of β₂ is known to be zero under both H₀ and H₁. Thus our model for U and Z is that U and Z are independent with

𝓛(U) = N((β₁, 0), Σ),
𝓛(Z) = N(0, I_m ⊗ Σ),

where U ∈ ℝ^p, β₁ ∈ ℝ^{p₁}, Z ∈ L_{p,m}, and m ≥ p. In what follows, the likelihood ratio test of H₀ versus H₁ is derived and an invariance argument shows that there is no uniformly most powerful invariant test under a natural group that leaves the problem invariant. As usual, we partition U
into U₁ and U₂, Uᵢ ∈ ℝ^{pᵢ}, and Z is partitioned into Z₁ ∈ L_{p₁,m} and Z₂ ∈ L_{p₂,m}:

U = (U₁, U₂), Z = (Z₁, Z₂).

Also,

S = Z′Z = ( Z₁′Z₁ Z₁′Z₂; Z₂′Z₁ Z₂′Z₂ ) = ( S₁₁ S₁₂; S₂₁ S₂₂ )

and S₁₁·₂ = S₁₁ - S₁₂S₂₂⁻¹S₂₁.
Proposition 9.19. The likelihood ratio test of H₀ versus H₁ rejects for large values of the statistic

Λ = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′ / (1 + U₂S₂₂⁻¹U₂′).
Proof. Under H₀,

𝓛(U) = N(0, Σ),
𝓛(Z) = N(0, I_m ⊗ Σ),

so the maximum likelihood estimator of Σ is

Σ̂ = (1/(m + 1))(Z′Z + U′U) = (1/(m + 1))(S + U′U).

The value of the maximized likelihood function is proportional to

Λ̂₀ = |Σ̂|^{-(m+1)/2}.
Under H₁, the situation is a bit more complicated and it is helpful to consider conditional distributions. Under H₁,

𝓛(U₁|U₂) = N(β₁ + U₂Σ₂₂⁻¹Σ₂₁, Σ₁₁·₂),
𝓛(U₂) = N(0, Σ₂₂),
𝓛(Z₁|Z₂) = N(Z₂Σ₂₂⁻¹Σ₂₁, I_m ⊗ Σ₁₁·₂),

and

𝓛(Z₂) = N(0, I_m ⊗ Σ₂₂).
The reparameterization defined by Ψ₁₁ = Σ₁₁·₂, Ψ₂₁ = Σ₂₂⁻¹Σ₂₁, and Ψ₂₂ = Σ₂₂ is one-to-one and onto. Let f₁(U₁|U₂, β₁, Ψ₂₁, Ψ₁₁), f₂(U₂|Ψ₂₂), f₃(Z₁|Z₂, Ψ₂₁, Ψ₁₁), and f₄(Z₂|Ψ₂₂) be the density functions, with respect to Lebesgue measure dU₁ dU₂ dZ₁ dZ₂, of the four distributions above. It is clear that

β̂₁ = U₁ - U₂Ψ₂₁

maximizes f₁(U₁|U₂, β₁, Ψ₂₁, Ψ₁₁) and that f₁(U₁|U₂, β̂₁, Ψ₂₁, Ψ₁₁) ∝ |Ψ₁₁|^{-1/2}. With β̂₁ substituted into f₁, the parameter Ψ₂₁ occurs only in the density f₃(Z₁|Z₂, Ψ₂₁, Ψ₁₁). Since

𝓛(Z₁|Z₂) = N(Z₂Ψ₂₁, I_m ⊗ Ψ₁₁),

our results for the MANOVA model show that

Ψ̂₂₁ = (Z₂′Z₂)⁻¹Z₂′Z₁ = S₂₂⁻¹S₂₁

maximizes f₃(Z₁|Z₂, Ψ₂₁, Ψ₁₁) for each value of Ψ₁₁. When Ψ̂₂₁ is substituted into f₃, an inspection of the resulting four density functions shows that the maximum likelihood estimators of Ψ₁₁ and Ψ₂₂ are

Ψ̂₁₁ = (1/(m + 1)) S₁₁·₂

and

Ψ̂₂₂ = (1/(m + 1))(Z₂′Z₂ + U₂′U₂) = (1/(m + 1))(S₂₂ + U₂′U₂).

Under H₁, this yields a maximized likelihood function proportional to

Λ̂₁ = |Ψ̂₁₁|^{-(m+1)/2} |Ψ̂₂₂|^{-(m+1)/2}.
Therefore the likelihood ratio test rejects H₀ for small values of

Λ₃ = Λ̂₀/Λ̂₁ = [ |S₁₁·₂||S₂₂ + U₂′U₂| / |S + U′U| ]^{(m+1)/2}.

However,

|S₂₂ + U₂′U₂| = |S₂₂|(1 + U₂S₂₂⁻¹U₂′)

and

|S| = |S₂₂||S₁₁·₂|.

Thus

[Λ₃]^{2/(m+1)} = |S|(1 + U₂S₂₂⁻¹U₂′)/|S + U′U| = (1 + U₂S₂₂⁻¹U₂′)/(1 + US⁻¹U′).

Now, the identity

US⁻¹U′ = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′ + U₂S₂₂⁻¹U₂′

follows from the problems in Chapter 5. Hence rejecting for small values of

[Λ₃]^{2/(m+1)} = 1/(1 + Λ),

where Λ is given in the statement of this proposition, is equivalent to rejecting for large values of Λ. □
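The determinant identities and the quadratic-form decomposition used in this proof are easy to confirm numerically. The check below is an added illustration with arbitrary simulated data; it verifies each algebraic step leading to [Λ₃]^{2/(m+1)} = 1/(1 + Λ).

```python
import numpy as np

rng = np.random.default_rng(1)
p1, p2, m = 2, 3, 12
p = p1 + p2

U = rng.standard_normal(p)            # U is 1 x p
Z = rng.standard_normal((m, p))       # Z is m x p, so S = Z'Z is p x p
S = Z.T @ Z
S11, S12 = S[:p1, :p1], S[:p1, p1:]
S21, S22 = S[p1:, :p1], S[p1:, p1:]
U1, U2 = U[:p1], U[p1:]
S112 = S11 - S12 @ np.linalg.solve(S22, S21)      # S_{11.2}

c = U2 @ np.linalg.solve(S22, U2)                 # U2 S22^{-1} U2'
r = U1 - U2 @ np.linalg.solve(S22, S21)
Q = r @ np.linalg.solve(S112, r)                  # regression-residual term
total = U @ np.linalg.solve(S, U)                 # U S^{-1} U'

det1 = np.isclose(np.linalg.det(S22 + np.outer(U2, U2)),
                  np.linalg.det(S22) * (1 + c))
det2 = np.isclose(np.linalg.det(S),
                  np.linalg.det(S22) * np.linalg.det(S112))
decomp = np.isclose(total, Q + c)                 # the Chapter 5 identity
ratio = np.isclose((1 + c) / (1 + total), 1 / (1 + Q / (1 + c)))
print(det1, det2, decomp, ratio)
```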
The above testing problem is now analyzed via invariance. The group G consists of elements g = (Γ, A) where Γ ∈ O_m and

A = ( A₁₁ A₁₂; 0 A₂₂ ), Aᵢᵢ ∈ Gl_{pᵢ}, i = 1, 2.

The group action is

(Γ, A)(U; Z) = (UA′; ΓZA′)

and group composition is

(Γ₁, A₁)(Γ₂, A₂) = (Γ₁Γ₂, A₁A₂).

The action of the group on the parameter space is

(Γ, A)(β₁, Σ) = (β₁A₁₁′, AΣA′).

It is clear that the testing problem is invariant under the group G.
Proposition 9.20. Under the action of G on the sample space, a maximal invariant is the pair (W₁, W₂) where

W₁ = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′ / (1 + U₂S₂₂⁻¹U₂′)

and

W₂ = U₂S₂₂⁻¹U₂′.

A maximal invariant in the parameter space is

δ = β₁Σ₁₁·₂⁻¹β₁′.
Proof. As usual, the method of proof is a reduction argument that provides a convenient index for the orbits in the sample space. Since m ≥ p, a set of measure zero can be deleted from the sample space so that Z has rank p on the complement of this set. Let

Z₀ = ( I_p, 0 )′ ∈ F_{p,m}

and set u₁ = (1, 0,…, 0) ∈ ℝ^p and u₂ = (0,…, 0, 1, 0,…, 0) ∈ ℝ^p, where the one occurs in the (p₁ + 1)st coordinate of u₂. Now, given U and Z, we claim that there exists a g = (Γ, A) ∈ G such that

g(U; Z) = (X₁u₁ + X₂u₂; Z₀)

where

X₁² = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′

and

X₂² = U₂S₂₂⁻¹U₂′.
To establish this claim, write Z = ΨT where Ψ ∈ F_{p,m} and T ∈ G_T is a p × p lower triangular matrix. A modification of the proof of Proposition 5.2 establishes this representation for Z. Consider

A = ξ(T′)⁻¹ = ( ξ₁ 0; 0 ξ₂ )(T′)⁻¹

where ξᵢ ∈ O_{pᵢ}, i = 1, 2, so ξ ∈ O_p and

ZA′ = ΨTA′ = Ψξ′ ∈ F_{p,m}.

Note that A has the required block triangular form since (T′)⁻¹ is upper triangular. Thus for any such ξ and Γ ∈ O_m, (Γ, A) ∈ G. Also, Γ can be chosen so that

ΓZA′ = Z₀.
Now,

UA′ = (U₁, U₂)T⁻¹ξ′ = (U₁, U₂)( T¹¹ 0; T²¹ T²² )( ξ₁ 0; 0 ξ₂ )′ = ( (U₁T¹¹ + U₂T²¹)ξ₁′, U₂T²²ξ₂′ )

where Tⁱʲ is pᵢ × pⱼ and

T⁻¹ = ( T¹¹ 0; T²¹ T²² ).

Since S = Z′Z = T′T, a bit of algebra shows that

(U₁T¹¹ + U₂T²¹)(U₁T¹¹ + U₂T²¹)′ = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′ = X₁²

and

(U₂T²²)(U₂T²²)′ = U₂S₂₂⁻¹U₂′ = X₂².
Let ε₁ = (1, 0,…, 0) ∈ ℝ^{p₁} and ε₂ = (1, 0,…, 0) ∈ ℝ^{p₂}. Since the vectors X₁ε₁ and U₁T¹¹ + U₂T²¹ have the same length, there exists ξ₁ ∈ O_{p₁} such that

(U₁T¹¹ + U₂T²¹)ξ₁′ = X₁ε₁.

For similar reasons, there exists a ξ₂ ∈ O_{p₂} such that

U₂T²²ξ₂′ = X₂ε₂.

With these choices for ξ₁ and ξ₂,

UA′ = ( (U₁T¹¹ + U₂T²¹)ξ₁′, U₂T²²ξ₂′ ) = X₁u₁ + X₂u₂.
Thus there is a g = (Γ, A) ∈ G such that

g(U; Z) = (X₁u₁ + X₂u₂; Z₀).

This establishes the claim. It is now routine to show that (X₁, X₂) = (X₁(U, Z), X₂(U, Z)) is an invariant function. To show that (X₁, X₂) is maximal invariant, suppose (U, Z) and (Ū, Z̄) yield the same (X₁, X₂) values. Then there exist g and ḡ in G such that

g(U; Z) = (X₁u₁ + X₂u₂; Z₀) = ḡ(Ū; Z̄)

so

ḡ⁻¹g(U; Z) = (Ū; Z̄).

This shows that (X₁, X₂) is maximal invariant. Since the pair (W₁, W₂) is a one-to-one function of (X₁, X₂), namely W₁ = X₁²/(1 + X₂²) and W₂ = X₂², it follows that (W₁, W₂) is maximal invariant. The proof that δ is a maximal invariant in the parameter space is similar and is left to the reader. □
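The invariance of (W₁, W₂) can also be seen computationally. The sketch below is illustrative code, not from the text: it applies a randomly chosen group element g = (Γ, A), with Γ orthogonal and A block upper triangular, and confirms that (W₁, W₂) is unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
p1, p2, m = 2, 3, 10
p = p1 + p2

def max_invariant(U, Z):
    # (W1, W2) of Proposition 9.20, computed from U (1 x p) and Z (m x p).
    S = Z.T @ Z
    S12, S21, S22 = S[:p1, p1:], S[p1:, :p1], S[p1:, p1:]
    S112 = S[:p1, :p1] - S12 @ np.linalg.solve(S22, S21)
    U1, U2 = U[:p1], U[p1:]
    W2 = U2 @ np.linalg.solve(S22, U2)
    r = U1 - U2 @ np.linalg.solve(S22, S21)
    W1 = (r @ np.linalg.solve(S112, r)) / (1 + W2)
    return W1, W2

U = rng.standard_normal(p)
Z = rng.standard_normal((m, p))

# A random group element: Gamma in O_m and A block upper triangular.
Gamma, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = rng.standard_normal((p, p))
A[p1:, :p1] = 0.0          # zero out the lower-left block

w = max_invariant(U, Z)
w_g = max_invariant(U @ A.T, Gamma @ Z @ A.T)
print(np.allclose(w, w_g))
```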
In order to suggest an invariant test for H₀: β₁ = 0 based on (W₁, W₂), the distribution of (W₁, W₂) is needed. Since

𝓛((U₁, U₂)) = N((β₁, 0), Σ)

and

𝓛(S) = W(Σ, p, m)

with S and U independent,

𝓛(W₂) = 𝓛(U₂S₂₂⁻¹U₂′) = F(p₂, m - p₂ + 1).

Therefore, W₂ is an ancillary statistic as its distribution does not depend on any parameters under H₀ or H₁. We now compute the conditional distribution of W₁ given W₂. Proposition 8.7 shows that

𝓛(S₁₁·₂) = W(Σ₁₁·₂, p₁, m - p₂),
𝓛(S₂₁|S₂₂) = N(S₂₂Σ₂₂⁻¹Σ₂₁, S₂₂ ⊗ Σ₁₁·₂),
and

𝓛(S₂₂) = W(Σ₂₂, p₂, m),

where S₁₁·₂ is independent of (S₂₁, S₂₂). Thus

𝓛(U₂S₂₂⁻¹S₂₁|S₂₂, U₂) = N(U₂Σ₂₂⁻¹Σ₂₁, (U₂S₂₂⁻¹U₂′)Σ₁₁·₂)

and, conditional on (S₂₂, U₂),

𝓛(U₁|S₂₂, U₂) = N(β₁ + U₂Σ₂₂⁻¹Σ₂₁, Σ₁₁·₂).

Further, U₁ and U₂S₂₂⁻¹S₂₁ are conditionally independent given (S₂₂, U₂). Therefore,

𝓛(U₁ - U₂S₂₂⁻¹S₂₁|S₂₂, U₂) = N(β₁, (1 + U₂S₂₂⁻¹U₂′)Σ₁₁·₂)

so

𝓛( (U₁ - U₂S₂₂⁻¹S₂₁)/(1 + U₂S₂₂⁻¹U₂′)^{1/2} | S₂₂, U₂ ) = N( β₁/(1 + U₂S₂₂⁻¹U₂′)^{1/2}, Σ₁₁·₂ ).
Since S₁₁·₂ is independent of all other variables under consideration, and since

W₁ = (U₁ - U₂S₂₂⁻¹S₂₁)S₁₁·₂⁻¹(U₁ - U₂S₂₂⁻¹S₂₁)′ / (1 + W₂),

it follows from Proposition 8.14 that

𝓛(W₁|S₂₂, U₂) = F(p₁, m - p + 1; δ/(1 + W₂))

where δ = β₁Σ₁₁·₂⁻¹β₁′. However, the conditional distribution of W₁ given (S₂₂, U₂) depends on (S₂₂, U₂) only through the function W₂ = U₂S₂₂⁻¹U₂′. Thus

𝓛(W₁|W₂) = F(p₁, m - p + 1; δ/(1 + W₂))

and

𝓛(W₂) = F(p₂, m - p₂ + 1).
Further, the null hypothesis is H₀: δ = 0 versus the alternative H₁: δ > 0. Under H₀, it is clear that W₁ and W₂ are independent. The likelihood ratio test rejects H₀ for large values of W₁ and ignores W₂. Of course, the level of this test is computed from a standard F-table, but the power of the test involves the marginal distribution of W₁ when δ > 0. This marginal distribution, obtained by averaging the conditional distribution 𝓛(W₁|W₂) with respect to the distribution of W₂, is rather complicated.
To show that a uniformly most powerful test of H₀ versus H₁ does not exist, consider a particular alternative δ = δ₀ > 0. Let f₁(w₁|w₂, δ) denote the conditional density function of W₁ given W₂ and let f₂(w₂) denote the density of W₂. For testing H₀: δ = 0 versus H₁: δ = δ₀, the Neyman–Pearson Lemma asserts that the most powerful test of level α rejects if

f₁(w₁|w₂, δ₀)/f₁(w₁|w₂, 0) > c(α)

where c(α) is chosen to make the test have level α. However, the rejection region for this test depends on the particular alternative δ₀, so a uniformly most powerful test cannot exist. Since W₂ is ancillary, we can argue that the test of H₀ should be carried out conditional on W₂; that is, the level and the power of tests should be compared only for the conditional distribution of W₁ given W₂. In this case, for w₂ fixed, the ratio

f₁(w₁|w₂, δ₀)/f₁(w₁|w₂, 0)

is an increasing function of w₁, so rejecting for large values of the ratio (w₂ fixed) is equivalent to rejecting for W₁ > k. If k is chosen to make the test have level α, this argument leads to the level α likelihood ratio test.
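The distributional claims above, the F law for W₁, the ancillarity of W₂, and their independence under H₀, can be checked by simulation. The sketch below is an added illustration; the factors (m - p + 1)/p₁ and (m - p₂ + 1)/p₂ convert the F convention of the text into the standard F distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p1, p2, m, reps = 2, 3, 12, 5000
p = p1 + p2

w1s, w2s = np.empty(reps), np.empty(reps)
for i in range(reps):
    U = rng.standard_normal(p)              # H0: beta1 = 0 (and beta2 = 0)
    Z = rng.standard_normal((m, p))
    S = Z.T @ Z
    S12, S21, S22 = S[:p1, p1:], S[p1:, :p1], S[p1:, p1:]
    S112 = S[:p1, :p1] - S12 @ np.linalg.solve(S22, S21)
    U1, U2 = U[:p1], U[p1:]
    w2s[i] = U2 @ np.linalg.solve(S22, U2)
    r = U1 - U2 @ np.linalg.solve(S22, S21)
    w1s[i] = (r @ np.linalg.solve(S112, r)) / (1 + w2s[i])

ks1 = stats.kstest((m - p + 1) / p1 * w1s, stats.f(p1, m - p + 1).cdf)
ks2 = stats.kstest((m - p2 + 1) / p2 * w2s, stats.f(p2, m - p2 + 1).cdf)
corr = np.corrcoef(w1s, w2s)[0, 1]
print(round(corr, 3))
```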
PROBLEMS
1. Consider independent random vectors Xᵢⱼ with 𝓛(Xᵢⱼ) = N(μᵢ, Σ) for j = 1,…, n and i = 1,…, k. For scalars a₁,…, a_k, consider testing H₀: a₁μ₁ + ⋯ + a_kμ_k = 0 versus H₁: a₁μ₁ + ⋯ + a_kμ_k ≠ 0. With τ² = n⁻¹(a₁² + ⋯ + a_k²), let bᵢ = τ⁻¹aᵢ, set X̄ᵢ = n⁻¹ΣⱼXᵢⱼ, and let Sᵢ = Σⱼ(Xᵢⱼ - X̄ᵢ)(Xᵢⱼ - X̄ᵢ)′. Write this problem in the canonical form of Section 9.1 and prove that the test that rejects for large values of Λ = (ΣᵢbᵢX̄ᵢ)′S⁻¹(ΣᵢbᵢX̄ᵢ) is UMP invariant. Here S = S₁ + ⋯ + S_k. What is the distribution of Λ under H₀?
2. Given Y ∈ L_{p,n} and X ∈ L_{k,n} of rank k, the least-squares estimate B̂ = (X′X)⁻¹X′Y of B can be characterized as the B that minimizes tr(Y - XB)′(Y - XB) over all k × p matrices.
(i) Show that for any k × p matrix B,

(Y - XB)′(Y - XB) = (Y - XB̂)′(Y - XB̂) + (X(B̂ - B))′(X(B̂ - B)).

(ii) A real-valued function φ defined for p × p nonnegative definite matrices is nondecreasing if φ(S₁) ≤ φ(S₁ + S₂) for any S₁ and S₂ (Sᵢ ≥ 0, i = 1, 2). Using (i), show that, if φ is nondecreasing, then φ((Y - XB)′(Y - XB)) is minimized by B = B̂.
(iii) For A that is p × p and nonnegative definite, show that φ(S) = tr AS is nondecreasing. Also, show that φ(S) = det(A + S) is nondecreasing.
(iv) Suppose φ(S) = φ(ΓSΓ′) for S ≥ 0 and Γ ∈ O_p so φ(S) can be written as φ(S) = ψ(λ(S)) where λ(S) is the vector of ordered characteristic roots of S. Show that, if ψ is nondecreasing in each argument, then φ is nondecreasing.
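A quick numerical companion to this problem (illustrative code with arbitrary simulated Y and X) verifies the decomposition in (i) and the resulting optimality of B̂ for the trace and determinant criteria of (iii):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, p = 15, 4, 3
X = rng.standard_normal((n, k))
Y = rng.standard_normal((n, p))
Bhat = np.linalg.solve(X.T @ X, X.T @ Y)       # least-squares estimate

def R(B):
    # Residual cross-product matrix (Y - XB)'(Y - XB).
    E = Y - X @ B
    return E.T @ E

B = rng.standard_normal((k, p))
# Part (i): exact matrix decomposition.
ok_i = np.allclose(R(B), R(Bhat) + (X @ (Bhat - B)).T @ (X @ (Bhat - B)))

# Parts (ii)-(iii): Bhat minimizes nondecreasing criteria such as tr(AS) and det(A+S).
A = np.eye(p)
ok_iii = all(
    np.trace(A @ R(Bhat)) <= np.trace(A @ R(rng.standard_normal((k, p)))) + 1e-9
    and np.linalg.det(A + R(Bhat)) <= np.linalg.det(A + R(rng.standard_normal((k, p)))) + 1e-9
    for _ in range(200)
)
print(ok_i, ok_iii)
```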
3. (The MANOVA model under non-normality.) Let E be a random n × p matrix that satisfies 𝓛(ΓEΔ′) = 𝓛(E) for all Γ ∈ O_n and Δ ∈ O_p. Assume that Cov(E) = I_n ⊗ I_p and consider a linear model for Y ∈ L_{p,n} generated by Y = ZB + EA′ where Z is a fixed n × k matrix of rank k, B is a k × p matrix of unknown parameters, and A is an element of Gl_p.
(i) Show that the distribution of Y depends on (B, A) only through (B, AA′).
(ii) Let M = {μ | μ = ZB, B ∈ L_{p,k}} and γ = {Σ | Σ is p × p, Σ > 0}. Show that {M, γ} serves as a parameter space for the linear model (the distribution of E is assumed fixed).
(iii) Consider the problem of testing H₀: RB = 0 where R is r × k of rank r. Show that the reduction to canonical form given in Section 9.1 can be used here to give a model of the form

(9.6)  ( Ỹ₁; Ỹ₂; Ỹ₃ ) = ( B̃₁; B̃₂; 0 ) + ẼA′

where Ỹ₁ is r × p, Ỹ₂ is (k - r) × p, Ỹ₃ is (n - k) × p, B̃₁ is r × p, B̃₂ is (k - r) × p, Ẽ is n × p, and A is as in the original model. Further, Ẽ and E have the same distribution and the null hypothesis is H₀: B̃₁ = 0.
(iv) Now, assume the form of the model in (9.6) and drop the tildes. Using the invariance argument given in Section 9.1, the testing problem is invariant and any invariant test is a function of the t largest eigenvalues of Y₁(Y₃′Y₃)⁻¹Y₁′, where t = min(r, p). Assume n - k ≥ p and partition E as Y is partitioned. Under H₀, show that

W ≡ Y₁(Y₃′Y₃)⁻¹Y₁′ = E₁(E₃′E₃)⁻¹E₁′.

(v) Using Proposition 7.3, show that W has the same distribution no matter what the distribution of E, as long as 𝓛(ΓE) = 𝓛(E) for all Γ ∈ O_n and E₃ has rank p with probability one. This distribution of W is the distribution obtained by assuming the elements of E are i.i.d. N(0, 1). In particular, any invariant test of H₀ has the same distribution under H₀ as when E is N(0, I_n ⊗ I_p).
4. When Y₁ is N(B₁, I_r ⊗ Σ) and Y₃ is N(0, I_m ⊗ Σ) with m ≥ p + 2, verify the claim that

𝔼 Y₁(Y₃′Y₃)⁻¹Y₁′ = (1/(m - p - 1))(B₁Σ⁻¹B₁′ + pI_r).
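A Monte Carlo check of this moment identity can be run as follows (illustrative code; the stated formula follows from 𝔼(Y₃′Y₃)⁻¹ = Σ⁻¹/(m - p - 1) together with 𝔼 Y₁Σ⁻¹Y₁′ = B₁Σ⁻¹B₁′ + pI_r):

```python
import numpy as np

rng = np.random.default_rng(6)
r, p, m, reps = 2, 3, 10, 40000

Sigma = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]])
C = np.linalg.cholesky(Sigma)
B1 = rng.standard_normal((r, p))           # a fixed, arbitrary mean matrix

acc = np.zeros((r, r))
for _ in range(reps):
    Y1 = B1 + rng.standard_normal((r, p)) @ C.T     # N(B1, I_r (x) Sigma)
    Y3 = rng.standard_normal((m, p)) @ C.T          # N(0, I_m (x) Sigma)
    acc += Y1 @ np.linalg.solve(Y3.T @ Y3, Y1.T)
est = acc / reps

claim = (B1 @ np.linalg.solve(Sigma, B1.T) + p * np.eye(r)) / (m - p - 1)
print(np.round(est - claim, 3))
```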
5. Consider a data matrix Y: n × 2 and assume 𝓛(Y) = N(ZB, I_n ⊗ Σ) where Z is n × 2 of rank two so B is 2 × 2. In some situations, it is reasonable to assume that σ₁₁ = σ₂₂, that is, the diagonal elements of Σ are the same. Under this assumption, use the results of Section 9.2 to derive the likelihood ratio test for H₀: b₁₁ = b₁₂, b₂₁ = b₂₂ versus H₁: b₁₁ ≠ b₁₂ or b₂₁ ≠ b₂₂. Is this test UMP invariant?
6. Consider a "two-way layout" situation with observations Yᵢⱼ, i = 1,…, m and j = 1,…, n. Assume Yᵢⱼ = μ + αᵢ + βⱼ + eᵢⱼ where μ, αᵢ, and βⱼ are constants that satisfy Σαᵢ = Σβⱼ = 0. The eᵢⱼ are random errors with mean zero (but not necessarily uncorrelated). Let Y be the m × n matrix of Yᵢⱼ's, u₁ be the vector of ones in ℝᵐ, u₂ be the vector of ones in ℝⁿ, α ∈ ℝᵐ be the vector with coordinates αᵢ, and β ∈ ℝⁿ be the vector with coordinates βⱼ. Let E be the matrix of eᵢⱼ's.
(i) Show the model is Y = μu₁u₂′ + αu₂′ + u₁β′ + E in the vector space L_{n,m}. Here, α ∈ ℝᵐ with α′u₁ = 0 and β ∈ ℝⁿ with β′u₂ = 0. Let

M₁ = {x | x ∈ L_{n,m}, x = μu₁u₂′, μ ∈ ℝ},
M₂ = {x | x = αu₂′, α ∈ ℝᵐ, α′u₁ = 0},
M₃ = {x | x = u₁β′, β ∈ ℝⁿ, β′u₂ = 0}.

Also, let ⟨·,·⟩ be the usual inner product on L_{n,m}.
(ii) Show M₁ ⊥ M₂, M₂ ⊥ M₃, and M₃ ⊥ M₁ in (L_{n,m}, ⟨·,·⟩).
Now, assume Cov(E) = I_m ⊗ A where A = γP + δQ with P = n⁻¹u₂u₂′, Q = I_n - P, and γ > 0 and δ > 0 unknown parameters.
(iii) Show the regression subspace M = M₁ ⊕ M₂ ⊕ M₃ is invariant under each I_m ⊗ A. Find the Gauss–Markov estimates for μ, α, and β.
(iv) Now, assume E is N(0, I_m ⊗ A). Use an invariance argument to show that for testing H₀: α = 0 versus H₁: α ≠ 0, the test that rejects for large values of W = ‖P_{M₂}Y‖²/‖Q_M Y‖² is a UMP invariant test. Here, Q_M = I - P_M. What is the distribution of W?
7. The regression subspace for the MANOVA model was described as M = {μ | μ = ZB, B ∈ L_{p,k}} ⊆ L_{p,n} where Z is n × k of rank k. The subspace of M associated with the null hypothesis H₀: RB = 0 (R is r × k of rank r) is ω = {μ | μ = ZB, B ∈ L_{p,k}, RB = 0}. We know that P_M = P_Z ⊗ I_p where P_Z = Z(Z′Z)⁻¹Z′ (P_M is the orthogonal projection onto M in (L_{p,n}, ⟨·,·⟩)). This problem gives one form for P_ω. Let

W = Z(Z′Z)⁻¹R′.

(i) Show that W has rank r.
Let P_W = W(W′W)⁻¹W′ so P_W ⊗ I_p is an orthogonal projection.
(ii) Show that 𝓡(P_W ⊗ I_p) ⊆ M - ω where M - ω = M ∩ ω^⊥. Also, show dim(𝓡(P_W ⊗ I_p)) = rp.
(iii) Show that dim ω = (k - r)p.
(iv) Now, show that P_W ⊗ I_p is the orthogonal projection onto M - ω, so P_Z ⊗ I_p - P_W ⊗ I_p is the orthogonal projection onto ω.
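The conclusion of part (iv) can be checked numerically without the ⊗I_p factor, since the Kronecker structure acts coordinatewise. The sketch below is an added illustration: it verifies that P_Z - P_W is an orthogonal projection that fixes every mean ZB with RB = 0, while P_W annihilates such means.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, r, p = 10, 4, 2, 3
Z = rng.standard_normal((n, k))
R = rng.standard_normal((r, k))

W = Z @ np.linalg.solve(Z.T @ Z, R.T)          # n x r, rank r
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
PW = W @ np.linalg.solve(W.T @ W, W.T)
Pomega = PZ - PW

# Pomega is an orthogonal projection (idempotent and symmetric) ...
ok1 = np.allclose(Pomega @ Pomega, Pomega) and np.allclose(Pomega, Pomega.T)

# ... and it fixes every mean ZB with RB = 0, while PW annihilates it.
# Build B with RB = 0 by projecting a random B onto the null space of R.
B = rng.standard_normal((k, p))
B = B - R.T @ np.linalg.solve(R @ R.T, R @ B)
mu = Z @ B
ok2 = np.allclose(Pomega @ mu, mu) and np.allclose(PW @ mu, np.zeros_like(mu))
print(ok1, ok2)
```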
8. Assume X₁,…, Xₙ are i.i.d. from a five-dimensional N(0, Σ) where Σ is a cyclic covariance matrix (Σ is written out explicitly at the beginning of Section 9.4). Find the maximum likelihood estimators of σ², ρ₁, and ρ₂.
9. Suppose X₁,…, Xₙ are i.i.d. N(0, Ω) of dimension 2p and assume Ω has the complex form

Ω = ( Σ  F; -F  Σ ).

Let S = ΣᵢXᵢXᵢ′ and partition S as Ω is partitioned. Show that Σ̂ = (2n)⁻¹(S₁₁ + S₂₂) and F̂ = (2n)⁻¹(S₁₂ - S₂₁) are the maximum likelihood estimates of Σ and F.
10. Let X₁,…, Xₙ be i.i.d. N(μ, Σ) p-dimensional random vectors where μ and Σ are unknown, Σ > 0. Suppose R is r × p of rank r and consider testing H₀: Rμ = 0 versus H₁: Rμ ≠ 0. Let X̄ = (1/n)ΣᵢXᵢ and S = Σᵢ(Xᵢ - X̄)(Xᵢ - X̄)′. Show that the test that rejects for large values of T = (RX̄)′(RSR′)⁻¹(RX̄) is equivalent to the likelihood ratio test. Also, show this test is UMP invariant under a suitable group of transformations. Apply this to the problem of testing μ₁ = ⋯ = μ_p where μ₁,…, μ_p are the coordinates of μ.
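For the equal-means application, a natural choice is the contrast matrix R with rows eᵢ′ - e_{i+1}′, i = 1,…, p - 1. The simulation below is illustrative: under H₀, the factor n(n - r)/r with r = p - 1 converts T into a variable with a standard F(r, n - r) distribution, which the code checks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p, reps = 30, 4, 4000
r = p - 1

# Successive-difference contrasts for H0: mu_1 = ... = mu_p, i.e. R mu = 0.
R = np.eye(r, p) - np.eye(r, p, k=1)

vals = np.empty(reps)
for i in range(reps):
    X = 2.0 + rng.standard_normal((n, p))      # equal means: H0 holds
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)
    v = R @ xbar
    T = v @ np.linalg.solve(R @ S @ R.T, v)
    vals[i] = n * (n - r) / r * T

ks = stats.kstest(vals, stats.f(r, n - r).cdf)
print(round(ks.statistic, 3))
```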
11. Consider a linear model of the form Y = ZB + E with Z: n × k of rank k, B: k × p unknown, and E a matrix of errors. Assume the first column of Z is the vector e of ones (the regression equation has the constant term in it). Assume Cov(E) = A(ρ) ⊗ Σ where A(ρ) has ones on the diagonal and ρ off the diagonal (-1/(n - 1) < ρ < 1).
(i) Show that the Gauss–Markov and least-squares estimates of B are the same.
(ii) When 𝓛(E) = N(0, A(ρ) ⊗ Σ) with Σ and ρ unknown, argue via invariance to construct tests for hypotheses of the form RB̃ = 0, where R is r × (k - 1) of rank r and B̃: (k - 1) × p consists of the last k - 1 rows of B.
NOTES AND REFERENCES
1. The material in Section 9.1 is fairly standard and can be found in many texts on multivariate analysis, although the treatment and emphasis may be different than here. The likelihood ratio test in the MANOVA setting was originally proposed by Wilks (1932). Various competitors to the likelihood ratio test were proposed in Lawley (1938), Hotelling (1947), Roy (1953), and Pillai (1955).
2. Arnold (1973) applied the theory of products of problems (which he had
developed in his Ph.D. dissertation at Stanford) to situations involving patterned covariance matrices. This notion appears in both this chapter and Chapter 10.
3. Given the covariance structure assumed in Section 9.2, the regression
subspaces considered there are not the most general for which the Gauss-Markov and least-squares estimators are the same. See Eaton
(1970) for a discussion.
4. Andersson (1975) provides a complete description of all symmetry models.
5. Cyclic covariance models were first studied systematically in Olkin and
Press (1969).
6. For early papers on the complex normal distribution, see Goodman
(1963) and Giri (1965a). Also, see Andersson (1975).
7. Some of the material in Section 9.6 comes from Giri (1964, 1965b).
8. In Proposition 9.5, when r = 1, the test statistic is commonly known as Hotelling's T² (see Hotelling (1931)).
CHAPTER 10
Canonical Correlation Coefficients
This final chapter is concerned with the interpretation of canonical correlation coefficients and their relationship to affine dependence and independence between two random vectors. After using an invariance argument to show that population canonical correlations are a natural measure of affine dependence, these population coefficients are interpreted as cosines of the angles between subspaces (as defined in Chapter 1). Next, the sample canonical correlations are defined and interpreted as cosines of angles. The distribution theory associated with the sample coefficients is discussed briefly.

When two random vectors have a joint normal distribution, independence between the vectors is equivalent to the population canonical correlations all being zero. The problem of testing for independence is treated in the fourth section of this chapter. The relationship between the MANOVA testing problem and testing for independence is discussed in the fifth and final section of the chapter.
10.1. POPULATION CANONICAL CORRELATION COEFFICIENTS
There are a variety of ways to introduce canonical correlation coefficients and three of these are considered in this section. We begin our discussion with the notion of affine dependence between two random vectors. Let X ∈ (V, (·,·)₁) and Y ∈ (W, (·,·)₂) be two random vectors defined on the same probability space so the random vector Z = {X, Y} takes values in the vector space V ⊕ W. It is assumed that Cov(X) = Σ₁₁ and Cov(Y) = Σ₂₂ both exist and are nonsingular. Therefore, Cov(Z) exists (see Proposition 2.15) and is given by

Σ = Cov(Z) = ( Σ₁₁  Σ₁₂; Σ₂₁  Σ₂₂ ).

Also, the mean vector of Z is

μ = 𝔼Z = {𝔼X, 𝔼Y} = {μ₁, μ₂}.
Definition 10.1. Two random vectors U and Ũ in (V, (·,·)) are affinely equivalent if Ũ = AU + a for some nonsingular linear transformation A and some vector a ∈ V.
It is clear that affine equivalence is an equivalence relation among random vectors defined on the same probability space and taking values in V.
We now consider measures of affine dependence between X and Y, which are functions of μ = {𝔼X, 𝔼Y} and Σ = Cov(Z) where Z = {X, Y}. Let m(μ, Σ) be some real-valued function of μ and Σ that is supposed to measure affine dependence. If instead of X we observe X̃, which is affinely equivalent to X, then the affine dependence between X̃ and Y should be the same as the affine dependence between X and Y. Similarly, if Ỹ is affinely equivalent to Y, then the affine dependence between X and Ỹ should be the same as the affine dependence between X and Y. These remarks imply that m(μ, Σ) should be invariant under affine transformations of both X and Y.
If (A, a) is an affine transformation on V, then (A, a)v = Av + a where A is nonsingular on V to V. Recall that the group of all affine transformations on V to V is denoted by Al(V) and the group operation is given by

(A₁, a₁)(A₂, a₂) = (A₁A₂, A₁a₂ + a₁).

Also, let Al(W) be the affine group for W. The product group Al(V) × Al(W) acts on the vector space V ⊕ W in the obvious way:

((A, a), (B, b))(v, w) = (Av + a, Bw + b).
The argument given above suggests that the affine dependence between X and Y should be the same as the affine dependence between (A, a)X and (B, b)Y for all (A, a) ∈ Al(V) and (B, b) ∈ Al(W). We now need to interpret this requirement as a condition on m(μ, Σ). The random vector

((A, a), (B, b)){X, Y} = {AX + a, BY + b}

has a mean vector given by

((A, a), (B, b)){μ₁, μ₂} = {Aμ₁ + a, Bμ₂ + b}

and a covariance given by

( AΣ₁₁A′  AΣ₁₂B′; BΣ₂₁A′  BΣ₂₂B′ ).

Therefore, the group Al(V) × Al(W) acts on the set

Θ = {(μ, Σ) | μ ∈ V ⊕ W, Σ ≥ 0, Σᵢᵢ > 0, i = 1, 2}.

For g = ((A, a), (B, b)) ∈ Al(V) × Al(W), the group action is given by

(μ, Σ) → (gμ, g(Σ))

where

gμ = {Aμ₁ + a, Bμ₂ + b}

and

g(Σ) = ( AΣ₁₁A′  AΣ₁₂B′; BΣ₂₁A′  BΣ₂₂B′ ).
Requiring the affine dependence between X and Y to be equal to the affine dependence between (A, a)X and (B, b)Y simply means that the function m defined on Θ must be invariant under the group action given above. Therefore, m must be a function of a maximal invariant function under the action of Al(V) × Al(W) on Θ. The following proposition gives one form of a maximal invariant.
Proposition 10.1. Let q = dim V, r = dim W, and let t = min{q, r}. Given

Σ = ( Σ₁₁  Σ₁₂; Σ₂₁  Σ₂₂ ),

which is positive definite on V ⊕ W, let λ₁ ≥ ⋯ ≥ λ_t ≥ 0 be the t largest eigenvalues of

Δ(Σ) = Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁.

Define a function h on Θ by

h(μ, Σ) = (λ₁, λ₂,…, λ_t)

where λ₁ ≥ ⋯ ≥ λ_t are defined in terms of Σ as above. Then h is a maximal invariant function under the action of G = Al(V) × Al(W) on Θ.
Proof. Let {v₁,…, v_t} and {w₁,…, w_t} be fixed orthonormal sets in V and W. For each Σ, define Q₁₂(Σ) by

Q₁₂(Σ) = λ₁^{1/2} v₁□w₁ + ⋯ + λ_t^{1/2} v_t□w_t

where λ₁ ≥ ⋯ ≥ λ_t are the t largest eigenvalues of Δ(Σ). Given (μ, Σ) ∈ Θ, we first claim that there exists a g ∈ G such that gμ = 0 and

g(Σ) = ( I_V  Q₁₂(Σ); Q₁₂(Σ)′  I_W ).

The proof of this claim follows. For g = ((A, a), (B, b)), we have

g(Σ) = ( AΣ₁₁A′  AΣ₁₂B′; BΣ₂₁A′  BΣ₂₂B′ ).

Choose A = ΓΣ₁₁^{-1/2} and B = ΔΣ₂₂^{-1/2} where Γ ∈ O(V), Δ ∈ O(W), and Σᵢᵢ^{-1/2} is the inverse of the positive definite square root of Σᵢᵢ, i = 1, 2. For each Γ and Δ,

AΣ₁₁A′ = ΓΣ₁₁^{-1/2}Σ₁₁Σ₁₁^{-1/2}Γ′ = I_V,
BΣ₂₂B′ = ΔΣ₂₂^{-1/2}Σ₂₂Σ₂₂^{-1/2}Δ′ = I_W,

and

AΣ₁₂B′ = ΓΣ₁₁^{-1/2}Σ₁₂Σ₂₂^{-1/2}Δ′.

Using the singular value decomposition, write

Λ₁₂ ≡ Σ₁₁^{-1/2}Σ₁₂Σ₂₂^{-1/2} = λ₁^{1/2} x₁□y₁ + ⋯ + λ_t^{1/2} x_t□y_t
where {x₁,…, x_t} and {y₁,…, y_t} are orthonormal sets in V and W, respectively. This representation follows by noting that the rank of Λ₁₂ is at most t and

Λ₁₂Λ₁₂′ = Σ₁₁^{-1/2}Σ₁₂Σ₂₂⁻¹Σ₂₁Σ₁₁^{-1/2}

has the same eigenvalues as Δ(Σ), which are λ₁ ≥ ⋯ ≥ λ_t ≥ 0. For A and B as above, it now follows that

AΣ₁₂B′ = λ₁^{1/2}(Γx₁)□(Δy₁) + ⋯ + λ_t^{1/2}(Γx_t)□(Δy_t).

Choose Γ so that Γxᵢ = vᵢ and choose Δ so that Δyᵢ = wᵢ. Then we have

AΣ₁₂B′ = Q₁₂(Σ)

so g(Σ) has the form claimed. With these choices for A and B, now choose a = -Aμ₁ and b = -Bμ₂. Then

gμ = g({μ₁, μ₂}) = ((A, a), (B, b)){μ₁, μ₂} = {Aμ₁ + a, Bμ₂ + b} = {0, 0} = 0.
The proof of the claim is now complete. To finish the proof of Proposition 10.1, first note that Proposition 1.39 implies that h is a G-invariant function. For the maximality of h, suppose that h(μ, Σ) = h(ν, Σ̄). Thus

Q₁₂(Σ) = Q₁₂(Σ̄),

which implies that there exist g and ḡ such that

gμ = 0, ḡν = 0,

and

g(Σ) = ( I_V  Q₁₂(Σ); Q₁₂(Σ)′  I_W ) = ḡ(Σ̄).

Therefore,

ḡ⁻¹g(μ, Σ) = (ν, Σ̄)

so h is maximal invariant. □
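In matrix coordinates the maximal invariant is easy to compute: the λᵢ are the eigenvalues of Δ(Σ) = Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁, and they are unchanged when X and Y are relabeled affinely. The sketch below is an added illustration computing ρᵢ = λᵢ^{1/2} and confirming the invariance.

```python
import numpy as np

rng = np.random.default_rng(9)
q, r = 3, 4
t = min(q, r)

# A random positive definite covariance on V (+) W, partitioned as in the text.
M = rng.standard_normal((q + r, q + r))
Sig = M @ M.T + (q + r) * np.eye(q + r)
S11, S12 = Sig[:q, :q], Sig[:q, q:]
S21, S22 = Sig[q:, :q], Sig[q:, q:]

def cancorrs(S11, S12, S21, S22):
    # Squared canonical correlations: eigenvalues of Delta = S11^{-1} S12 S22^{-1} S21.
    Delta = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S21))
    lam = np.sort(np.linalg.eigvals(Delta).real)[::-1][:t]
    return np.sqrt(np.clip(lam, 0, None))

rho = cancorrs(S11, S12, S21, S22)

# Invariance: replace X by AX + a and Y by BY + b (nonsingular A, B);
# the covariance blocks transform as in the text and the rho_i do not change.
A = rng.standard_normal((q, q)) + 3 * np.eye(q)
B = rng.standard_normal((r, r)) + 3 * np.eye(r)
rho_g = cancorrs(A @ S11 @ A.T, A @ S12 @ B.T, B @ S21 @ A.T, B @ S22 @ B.T)
print(np.allclose(rho, rho_g))
```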
The form of the singular value decomposition used in the proof of Proposition 10.1 is slightly different from that given in Theorem 1.3. For a linear transformation C of rank k defined on (V, (·,·)₁) to (W, (·,·)₂), Theorem 1.3 asserts that

C = μ₁ w₁□x₁ + ⋯ + μ_k w_k□x_k

where μᵢ > 0 and {x₁,…, x_k} and {w₁,…, w_k} are orthonormal sets in V and W. With q = dim V, r = dim W, and t = min{q, r}, obviously k ≤ t. When k < t, it is clear that the orthonormal sets above can be extended to {x₁,…, x_t} and {w₁,…, w_t}, which are still orthonormal sets in V and W. Also, setting μᵢ = 0 for i = k + 1,…, t, we have

C = μ₁ w₁□x₁ + ⋯ + μ_t w_t□x_t,

and μ₁² ≥ ⋯ ≥ μ_t² are the t largest eigenvalues of both CC′ and C′C. This form of the singular value decomposition is somewhat more convenient in this chapter since the rank of C is not explicitly mentioned. However, the rank of C is just the number of μᵢ that are strictly positive. The corresponding modification of Proposition 1.48 should now be clear.

Returning to our original problem of describing measures of affine dependence, say m(μ, Σ), Proposition 10.1 demonstrates that m is invariant under affine relabelings of X and Y iff m is a function of the t largest eigenvalues, λ₁,…, λ_t, of Δ(Σ). Since the rank of Δ(Σ) is at most t, the remaining eigenvalues of Δ(Σ), if there are any, must be zero. Before suggesting some particular measures m(μ, Σ), the canonical correlation coefficients are discussed.
Definition 10.2. In the notation of Proposition 10.1, let ρᵢ = λᵢ^{1/2}, i = 1,…, t. The numbers ρ₁ ≥ ρ₂ ≥ ⋯ ≥ ρ_t ≥ 0 are called the population canonical correlation coefficients.
Since ρᵢ is a one-to-one function of λᵢ, it follows that the vector (ρ₁,…, ρ_t) also determines a maximal invariant function under the action of G on Θ. In particular, any measure of affine dependence should be a function of the canonical correlation coefficients.

The canonical correlation coefficients have a natural interpretation as cosines of angles between subspaces in a vector space. Recall that Z = {X, Y} takes values in the vector space V ⊕ W where (V, (·,·)₁) and (W, (·,·)₂)
are inner product spaces. The covariance of Z, with respect to the natural inner product, say (·,·), on V ⊕ W, is

Σ = ( Σ₁₁  Σ₁₂; Σ₂₁  Σ₂₂ ).

In the discussion that follows, it is assumed that Σ is positive definite. Let (·,·)_Σ denote the inner product on V ⊕ W defined by

(z₁, z₂)_Σ = (z₁, Σz₂) = cov{(z₁, Z), (z₂, Z)}

for z₁, z₂ ∈ V ⊕ W. The vector space V can be thought of as a subspace of V ⊕ W; namely, just identify V with V ⊕ {0} ⊆ V ⊕ W. Similarly, W is a subspace of V ⊕ W. The next result interprets the canonical correlations as the cosines of angles between the subspaces V and W when the inner product on V ⊕ W is (·,·)_Σ.
Proposition 10.2. Given Σ, the canonical correlation coefficients ρ₁ ≥ ⋯ ≥ ρ_t are the cosines of the angles between V and W as subspaces of the inner product space (V ⊕ W, (·,·)_Σ).

Proof. Let P₁ and P₂ be the orthogonal projections (relative to (·,·)_Σ) onto V ⊕ {0} and {0} ⊕ W, respectively. In view of Proposition 1.48 and Definition 1.28, it suffices to show that the t largest eigenvalues of P₁P₂P₁ are λᵢ = ρᵢ², i = 1,…, t. We claim that

C₁ = ( I_V  Σ₁₁⁻¹Σ₁₂; 0  0 )

is the orthogonal projection onto V ⊕ {0}. For (v, w) ∈ V ⊕ W,

C₁(v, w) = (v + Σ₁₁⁻¹Σ₁₂w, 0)

so the range of C₁ is V ⊕ {0} and C₁ is the identity on V ⊕ {0}. That C₁² = C₁ is easily verified. Also, since

ΣC₁ = ( Σ₁₁  Σ₁₂; Σ₂₁  Σ₂₁Σ₁₁⁻¹Σ₁₂ )

is symmetric, the identity C₁′Σ = ΣC₁ holds. Here C₁′ is the adjoint of C₁ relative to the inner product (·,·); namely,

C₁′ = ( I_V  0; Σ₂₁Σ₁₁⁻¹  0 ).

This shows that C₁ is self-adjoint relative to the inner product (·,·)_Σ. Hence C₁ is the orthogonal projection onto V ⊕ {0} in (V ⊕ W, (·,·)_Σ). A similar argument yields

C₂ = ( 0  0; Σ₂₂⁻¹Σ₂₁  I_W )

as the orthogonal projection onto {0} ⊕ W in (V ⊕ W, (·,·)_Σ). Therefore Pᵢ = Cᵢ, i = 1, 2, and a bit of algebra shows that

P₁P₂P₁ = ( Δ(Σ)  C; 0  0 )

where Δ(Σ) = Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁ and

C = Δ(Σ)Σ₁₁⁻¹Σ₁₂.

Thus the characteristic polynomial of P₁P₂P₁ is given by

p(a) = det[P₁P₂P₁ - aI] = (-a)^r det[Δ(Σ) - aI_V]

where r = dim W. Since t = min{q, r} where q = dim V, it follows that the t largest eigenvalues of P₁P₂P₁ are the t largest eigenvalues of Δ(Σ). These are ρ₁² ≥ ⋯ ≥ ρ_t², so the proof is complete. □
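The projection argument can be verified directly in coordinates. The code below is illustrative: scipy's subspace_angles computes principal angles in the standard inner product, so the subspaces are first embedded with Σ^{1/2}; the check confirms that C₁ and C₂ are (·,·)_Σ-orthogonal projections and that the top eigenvalues of P₁P₂P₁ are the squared cosines of the angles between V and W.

```python
import numpy as np
from scipy.linalg import subspace_angles, sqrtm

rng = np.random.default_rng(11)
q, r = 2, 3
t = min(q, r)
M = rng.standard_normal((q + r, q + r))
Sig = M @ M.T + (q + r) * np.eye(q + r)
S11, S12, S21, S22 = Sig[:q, :q], Sig[:q, q:], Sig[q:, :q], Sig[q:, q:]

# C1 and C2 from the proof of Proposition 10.2, as block matrices on V (+) W.
C1 = np.zeros((q + r, q + r)); C1[:q, :q] = np.eye(q)
C1[:q, q:] = np.linalg.solve(S11, S12)
C2 = np.zeros((q + r, q + r)); C2[q:, q:] = np.eye(r)
C2[q:, :q] = np.linalg.solve(S22, S21)

# Idempotent, and self-adjoint for ( . , . )_Sigma, i.e. C' Sigma = Sigma C.
proj_ok = (np.allclose(C1 @ C1, C1) and np.allclose(C1.T @ Sig, Sig @ C1)
           and np.allclose(C2 @ C2, C2) and np.allclose(C2.T @ Sig, Sig @ C2))

# Eigenvalues of P1 P2 P1 agree with the eigenvalues of Delta(Sigma) and with
# the squared cosines of the principal angles between the embedded subspaces.
lam = np.sort(np.linalg.eigvals(C1 @ C2 @ C1).real)[::-1][:t]
Delta = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S21))
lam_delta = np.sort(np.linalg.eigvals(Delta).real)[::-1][:t]

root = sqrtm(Sig).real
E1 = np.vstack([np.eye(q), np.zeros((r, q))])
E2 = np.vstack([np.zeros((q, r)), np.eye(r)])
cos2 = np.sort(np.cos(subspace_angles(root @ E1, root @ E2)) ** 2)[::-1]
print(proj_ok, np.allclose(lam, lam_delta), np.allclose(lam, cos2))
```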
Another interpretation of the canonical correlation coefficients can be given using Proposition 1.49 and the discussion following Definition 1.28. Using the notation adopted in the proof of Proposition 10.2, write

P₂P₁ = ρ₁ ζ₁□η₁ + ⋯ + ρ_t ζ_t□η_t

where {η₁,…, η_t} is an orthonormal set in V ⊕ {0} and {ζ₁,…, ζ_t} is an orthonormal set in {0} ⊕ W. Here orthonormal refers to the inner product (·,·)_Σ on V ⊕ W, as does the symbol □ in the expression for P₂P₁; that is, for z₁, z₂ ∈ V ⊕ W,

(z₁□z₂)z = (z₂, z)_Σ z₁ = (z₂, Σz)z₁.
The existence of this representation for P₂P₁ follows from Proposition 1.48, as does the relationship

$$(\eta_i, \xi_j)_\Sigma = \delta_{ij}\rho_j$$

for i, j = 1,..., t. Define the sets D₁ᵢ and D₂ᵢ, i = 1,..., t, as in Proposition 1.49 (with M₁ = V ⊕ {0} and M₂ = {0} ⊕ W), so

$$\sup_{\eta \in D_{1i}}\ \sup_{\xi \in D_{2i}} (\eta, \xi)_\Sigma = (\eta_i, \xi_i)_\Sigma = \rho_i$$

for i = 1,..., t. To interpret ρᵢ, first consider the case i = 1. A vector η is in D₁₁ iff

$$\eta = \{v, 0\}, \quad v \in V$$

and

$$1 = (\eta, \eta)_\Sigma = (v, \Sigma_{11}v)_1 = \mathrm{var}(v, X)_1.$$

Similarly, ξ ∈ D₂₁ iff

$$\xi = \{0, w\}, \quad w \in W$$

and

$$1 = (\xi, \xi)_\Sigma = (w, \Sigma_{22}w)_2 = \mathrm{var}(w, Y)_2.$$

However, for η = {v, 0} ∈ D₁₁ and ξ = {0, w} ∈ D₂₁,

$$(\eta, \xi)_\Sigma = (v, \Sigma_{12}w)_1 = \mathrm{cov}\{(v, X)_1, (w, Y)_2\}.$$

This is just the ordinary correlation between (v, X)₁ and (w, Y)₂, as v and w have been normalized so that 1 = var(v, X)₁ = var(w, Y)₂. Since (η, ξ)_Σ ≤ ρ₁ for all η ∈ D₁₁ and ξ ∈ D₂₁, it follows that for every x ∈ V, x ≠ 0, and y ∈ W, y ≠ 0, the correlation between (x, X)₁ and (y, Y)₂ is no greater than ρ₁. Further, writing η₁ = {v₁, 0} and ξ₁ = {0, w₁}, we have

$$\rho_1 = (\eta_1, \xi_1)_\Sigma = (v_1, \Sigma_{12}w_1)_1 = \mathrm{cov}\{(v_1, X)_1, (w_1, Y)_2\},$$

which is the correlation between (v₁, X)₁ and (w₁, Y)₂. Therefore, ρ₁ is the
maximum correlation between (x, X)₁ and (y, Y)₂ over all nonzero x ∈ V and y ∈ W. Further, this maximum correlation is achieved by choosing x = v₁ and y = w₁.
The second largest canonical correlation coefficient, ρ₂, satisfies the equality

$$\sup_{\eta \in D_{12}}\ \sup_{\xi \in D_{22}} (\eta, \xi)_\Sigma = (\eta_2, \xi_2)_\Sigma = \rho_2.$$

A vector η is in D₁₂ iff

$$\eta = \{v, 0\},\quad v \in V, \qquad 1 = (\eta, \eta)_\Sigma = (v, \Sigma_{11}v)_1,$$

and

$$0 = (\eta, \eta_1)_\Sigma = (v, \Sigma_{11}v_1)_1.$$

Also, a vector ξ is in D₂₂ iff

$$\xi = \{0, w\},\quad w \in W, \qquad 1 = (\xi, \xi)_\Sigma = (w, \Sigma_{22}w)_2,$$

and

$$0 = (\xi, \xi_1)_\Sigma = (w, \Sigma_{22}w_1)_2.$$
These relationships provide the following interpretation of ρ₂. The maximum correlation between (x, X)₁ and (y, Y)₂ is ρ₁ and is

$$\rho_1 = \mathrm{cov}\{(v_1, X)_1, (w_1, Y)_2\}$$

since 1 = var(v₁, X)₁ = var(w₁, Y)₂. Suppose we now want to find the maximum correlation between (x, X)₁ and (y, Y)₂ subject to the condition

(i)  cov{(x, X)₁, (v₁, X)₁} = 0,  cov{(y, Y)₂, (w₁, Y)₂} = 0.

Clearly, (i) is equivalent to

(ii)  (x, Σ₁₁v₁)₁ = 0,  (y, Σ₂₂w₁)₂ = 0.
Since correlation is invariant under multiplication of the random variables by positive constants, to find the maximum correlation between (x, X)₁ and (y, Y)₂ subject to (ii), it suffices to maximize cov{(x, X)₁, (y, Y)₂} over those x's and y's that satisfy

(iii)  (x, Σ₁₁x)₁ = 1, (x, Σ₁₁v₁)₁ = 0;  (y, Σ₂₂y)₂ = 1, (y, Σ₂₂w₁)₂ = 0.
However, x ∈ V satisfies (iii) iff η = {x, 0} is in D₁₂, and y ∈ W satisfies (iii) iff ξ = {0, y} is in D₂₂. Further, for such x, y, η, and ξ,

$$\mathrm{cov}\{(x, X)_1, (y, Y)_2\} = (\eta, \xi)_\Sigma.$$

Thus maximizing this covariance subject to (iii) is the same as maximizing (η, ξ)_Σ for η ∈ D₁₂ and ξ ∈ D₂₂. Of course, this maximum is ρ₂ and is achieved at η₂ ∈ D₁₂ and ξ₂ ∈ D₂₂. Writing η₂ = {v₂, 0} and ξ₂ = {0, w₂}, it is clear that v₂ ∈ V and w₂ ∈ W satisfy (iii) and

$$\mathrm{cov}\{(v_2, X)_1, (w_2, Y)_2\} = \rho_2.$$

Furthermore, Proposition 1.48 shows that

$$0 = (\eta_1, \xi_2)_\Sigma = (\eta_2, \xi_1)_\Sigma,$$

which implies that

$$0 = \mathrm{cov}\{(v_1, X)_1, (w_2, Y)_2\} = \mathrm{cov}\{(v_2, X)_1, (w_1, Y)_2\}.$$

Therefore, the problem of maximizing the correlation between (x, X)₁ and (y, Y)₂ (subject to the condition that the correlation between (x, X)₁ and (v₁, X)₁ be zero and the correlation between (y, Y)₂ and (w₁, Y)₂ be zero) has been solved.
It should now be fairly clear how to interpret the remaining canonical correlation coefficients. The easiest way to describe the coefficients is by induction. The coefficient ρ₁ is the largest possible correlation between (x, X)₁ and (y, Y)₂ for nonzero vectors x ∈ V and y ∈ W. Further, there exist vectors v₁ ∈ V and w₁ ∈ W such that

$$\mathrm{cov}\{(v_1, X)_1, (w_1, Y)_2\} = \rho_1$$

and

$$1 = \mathrm{var}(v_1, X)_1 = \mathrm{var}(w_1, Y)_2.$$
These vectors came from η₁ and ξ₁ in the representation

$$P_2P_1 = \sum_{i=1}^{t} \rho_i\, \xi_i \square \eta_i$$

given earlier. Since ηᵢ ∈ V ⊕ {0}, we can write ηᵢ = {vᵢ, 0}, i = 1,..., t. Similarly, ξᵢ = {0, wᵢ}, i = 1,..., t. Using Proposition 1.48, it is easy to check that

$$\mathrm{cov}\{(v_j, X)_1, (w_k, Y)_2\} = \rho_j\delta_{jk}$$
$$\mathrm{cov}\{(v_j, X)_1, (v_k, X)_1\} = \delta_{jk}$$
$$\mathrm{cov}\{(w_j, Y)_2, (w_k, Y)_2\} = \delta_{jk}$$

for j, k = 1,..., t. Of course, these relationships are simply a restatement of the properties of ξ₁,..., ξₜ and η₁,..., ηₜ. For example,

$$\mathrm{cov}\{(v_j, X)_1, (w_k, Y)_2\} = (v_j, \Sigma_{12}w_k)_1 = (\eta_j, \xi_k)_\Sigma = \rho_j\delta_{jk}.$$

However, as argued in the case of ρ₂, we can say more. Given ρ₁,..., ρₜ and the vectors v₁,..., vᵢ₋₁ and w₁,..., wᵢ₋₁ obtained from η₁,..., ηᵢ₋₁ and ξ₁,..., ξᵢ₋₁, consider the problem of maximizing the correlation between (x, X)₁ and (y, Y)₂ subject to the conditions

cov{(x, X)₁, (vⱼ, X)₁} = 0, j = 1,..., i − 1
cov{(y, Y)₂, (wⱼ, Y)₂} = 0, j = 1,..., i − 1.

By simply unravelling the notation and using Proposition 1.49, this maximum correlation is ρᵢ and is achieved for x = vᵢ and y = wᵢ. This successive maximization of correlation is often a useful interpretation of the canonical correlation coefficients.

The vectors v₁,..., vₜ and w₁,..., wₜ lead to what are called the canonical variates. Recall that q = dim V, r = dim W, and t = min{q, r}. For definiteness, assume that q ≤ r so t = q. Thus {v₁,..., v_q} is a basis for V and satisfies

$$(v_j, \Sigma_{11}v_k)_1 = \delta_{jk}$$

for j, k = 1,..., q, so {v₁,..., v_q} is an orthonormal basis for V relative to
the inner product determined by Σ₁₁. Further, the linearly independent set {w₁,..., w_q} satisfies

$$(w_j, \Sigma_{22}w_k)_2 = \delta_{jk}$$

so {w₁,..., w_q} is an orthonormal set relative to the inner product determined by Σ₂₂. Now, extend this set to {w₁,..., w_r} so that it is an orthonormal basis for W in the Σ₂₂ inner product.
Definition 10.3. The real-valued random variables defined by

$$X_i = (v_i, X)_1, \quad i = 1,..., q$$

and

$$Y_i = (w_i, Y)_2, \quad i = 1,..., r$$

are called the canonical variates of X and Y, respectively.

Proposition 10.3. The canonical variates satisfy the relationships

(i) var Xⱼ = var Yₖ = 1,
(ii) cov{Xⱼ, Yₖ} = ρⱼδⱼₖ.

These relationships hold for j = 1,..., q and k = 1,..., r. Here, ρ₁,..., ρ_q are the canonical correlation coefficients.
Proof. This is just a restatement of part of what we have established above.
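One standard way to compute canonical vectors vᵢ and wᵢ in coordinates (a construction assumed here for illustration, not spelled out in the text) is through a singular value decomposition of the whitened cross-covariance Σ₁₁^{-1/2}Σ₁₂Σ₂₂^{-1/2}; the resulting columns satisfy exactly the relationships of Proposition 10.3. A sketch with illustrative numbers:

```python
import numpy as np

Sigma11 = np.array([[2.0, 0.4], [0.4, 1.0]])
Sigma22 = np.array([[1.0, 0.2], [0.2, 2.0]])
Sigma12 = np.array([[0.6, 0.1], [0.0, 0.3]])

def inv_sqrt(M):
    # Symmetric inverse square root via the spectral decomposition.
    w, P = np.linalg.eigh(M)
    return P @ np.diag(w ** -0.5) @ P.T

R11, R22 = inv_sqrt(Sigma11), inv_sqrt(Sigma22)
U, rho, Vt = np.linalg.svd(R11 @ Sigma12 @ R22)
V = R11 @ U       # columns v_i:  (v_j, Sigma11 v_k)_1 = delta_jk
W = R22 @ Vt.T    # columns w_i:  (w_j, Sigma22 w_k)_2 = delta_jk

# The relationships of Proposition 10.3 for the canonical variates:
assert np.allclose(V.T @ Sigma11 @ V, np.eye(2))     # var X_j = 1
assert np.allclose(W.T @ Sigma22 @ W, np.eye(2))     # var Y_k = 1
assert np.allclose(V.T @ Sigma12 @ W, np.diag(rho))  # cov{X_j, Y_k} = rho_j delta_jk
print(rho)
```

The singular values come out ordered ρ₁ ≥ ρ₂, matching the ordering convention for the canonical correlation coefficients.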
Let us briefly review what has been established thus far about the population canonical correlation coefficients ρ₁,..., ρₜ. These coefficients were defined in terms of a maximal invariant under a group action, and this group action arose quite naturally in an attempt to define measures of affine dependence. Using Proposition 1.48 and Definition 1.28, it was then shown that ρ₁,..., ρₜ are cosines of angles between subspaces with respect to an inner product defined by Σ. The statistical interpretation of the coefficients came from the detailed information given in Proposition 1.49, and this interpretation closely resembled the discussion following Definition 1.28.
Given X in (V, (·,·)₁) and Y in (W, (·,·)₂) with a nonsingular covariance

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
the existence of special bases {v₁,..., v_q} and {w₁,..., w_r} for V and W was established. In terms of the canonical variates

$$X_i = (v_i, X)_1, \quad Y_j = (w_j, Y)_2,$$

the properties of these bases can be written

$$1 = \mathrm{var}\, X_i = \mathrm{var}\, Y_j$$

and

$$\mathrm{cov}\{X_i, Y_j\} = \rho_i\delta_{ij}$$

for i = 1,..., q and j = 1,..., r. Here, the convention that ρᵢ = 0 for i > t = min{q, r} has been used, although ρᵢ is not defined for i > t. When q ≤ r, the covariance matrix of the variates X₁,..., X_q, Y₁,..., Y_r (in that order) is

$$\begin{pmatrix} I_q & (D\ \ 0) \\ (D\ \ 0)' & I_r \end{pmatrix}$$

where D is a q × q diagonal matrix with diagonal entries ρ₁ ≥ ··· ≥ ρ_q and 0 is a q × (r − q) block of zeroes. The reader should compare this matrix representation of Σ to the assertion of Proposition 5.7.
The final point of this section is to relate a prediction problem to that of suggesting a particular measure of affine dependence. Using the ideas developed in Chapter 4, a slight generalization of Proposition 2.22 is presented below. Again, consider X ∈ (V, (·,·)₁) and Y ∈ (W, (·,·)₂) with

$$\mathscr{E}X = \mu_1, \quad \mathscr{E}Y = \mu_2$$

and

$$\mathrm{Cov}\{X, Y\} = \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

It is assumed that Σ₁₁ and Σ₂₂ are both nonsingular. Consider the problem of predicting X by an affine function of Y, say CY + v₀ where C ∈ ℒ(W, V) and v₀ ∈ V. Let [·,·] be any inner product on V and let ‖·‖ be the norm defined by [·,·]. The following result shows how to choose C and v₀ to minimize

$$\mathscr{E}\|X - (CY + v_0)\|^2.$$
Of course, the inner product [·,·] on V is related to the inner product (·,·)₁
by

$$[v_1, v_2] = (v_1, A_0v_2)_1$$

for some positive definite A₀.
Proposition 10.4. For any C ∈ ℒ(W, V) and v₀ ∈ V, the inequality

$$\mathscr{E}\|X - (CY + v_0)\|^2 \geq \langle A_0,\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\rangle$$

holds. There is equality in this inequality iff

$$v_0 = \hat{v}_0 \equiv \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}\mu_2$$

and

$$C = \hat{C} \equiv \Sigma_{12}\Sigma_{22}^{-1}.$$

Here, ⟨·,·⟩ is the natural inner product on ℒ(V, V) inherited from (V, (·,·)₁).
Proof. First, write

$$X - (CY + v_0) = U_1 + U_2$$

where

$$U_1 = X - (\hat{C}Y + \hat{v}_0) = X - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(Y - \mu_2)$$

and

$$U_2 = (\hat{C} - C)Y + \hat{v}_0 - v_0.$$

Clearly, U₁ has mean zero. It follows from Proposition 2.17 that U₁ and U₂ are uncorrelated and

$$\mathrm{Cov}(U_1) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

Further, from Proposition 4.3 we have ℰ[U₁, U₂] = 0. Therefore,

$$\mathscr{E}\|X - (CY + v_0)\|^2 = \mathscr{E}\|U_1 + U_2\|^2 = \mathscr{E}\|U_1\|^2 + \mathscr{E}\|U_2\|^2$$
$$= \mathscr{E}(U_1, A_0U_1)_1 + \mathscr{E}\|U_2\|^2 = \mathscr{E}\langle A_0,\ U_1 \square U_1\rangle + \mathscr{E}\|U_2\|^2$$
$$= \langle A_0,\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\rangle + \mathscr{E}\|U_2\|^2,$$
where the last equality follows from the identity

$$\mathscr{E}\, U_1 \square U_1 = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

established in Proposition 2.21. Thus the desired inequality holds, and there is equality iff ℰ‖U₂‖² = 0. But ℰ‖U₂‖² is zero iff U₂ is zero with probability one. This holds iff v₀ = v̂₀ and C = Ĉ since Cov(Y) = Σ₂₂ is positive definite. This completes the proof. □
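In coordinates, the minimizing predictor of Proposition 10.4 does not depend on the inner product [·,·]: it is always v̂₀ + ĈY with Ĉ = Σ₁₂Σ₂₂⁻¹ and v̂₀ = μ₁ − Σ₁₂Σ₂₂⁻¹μ₂. A small sketch with illustrative, assumed numbers:

```python
import numpy as np

mu1 = np.array([1.0, -1.0])         # mean of X (q = 2)
mu2 = np.array([0.5, 2.0, 0.0])     # mean of Y (r = 3)
Sigma12 = np.array([[0.4, 0.0, 0.1],
                    [0.0, 0.3, 0.0]])
Sigma22 = np.array([[1.0, 0.2, 0.0],
                    [0.2, 1.5, 0.1],
                    [0.0, 0.1, 1.0]])

C_hat = Sigma12 @ np.linalg.inv(Sigma22)   # C = Sigma_12 Sigma_22^{-1}
v0_hat = mu1 - C_hat @ mu2                 # v0 = mu_1 - Sigma_12 Sigma_22^{-1} mu_2

y = np.array([0.7, 1.8, -0.2])             # a realized value of Y
x_pred = v0_hat + C_hat @ y
print(x_pred)
```

At y = μ₂ the prediction is exactly μ₁, as it must be for the best affine predictor passing through the means.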
Now, choose A₀ to be Σ₁₁⁻¹ in Proposition 10.4. Then the mean squared error due to predicting X by ĈY + v̂₀, measured relative to ‖·‖, is

$$\phi(\Sigma) \equiv \langle \Sigma_{11}^{-1},\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\rangle = \mathscr{E}\|X - (\hat{C}Y + \hat{v}_0)\|^2.$$

Here, ‖·‖ is obtained from the inner product defined by

$$[v_1, v_2] = (v_1, \Sigma_{11}^{-1}v_2)_1.$$
We now claim that φ is invariant under the group of transformations discussed in Proposition 10.1, and thus φ is a possible measure of affine dependence between X and Y. To see this, first recall that ⟨·,·⟩ is just the trace inner product for linear transformations. Using properties of the trace, we have

$$\phi(\Sigma) = \langle I,\ I - \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}\rangle = \mathrm{tr}(I - \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}) = \sum_{i=1}^{q}(1 - \lambda_i)$$

where λ₁ ≥ ··· ≥ λ_q ≥ 0 are the eigenvalues of Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁. However, at most t = min{q, r} of these eigenvalues are nonzero and, by definition, ρᵢ = λᵢ^{1/2}, i = 1,..., t, are the canonical correlation coefficients. Thus

$$\phi(\Sigma) = \sum_{i=1}^{t}(1 - \rho_i^2) + (q - t)$$

is a function of ρ₁,..., ρₜ and hence is an invariant measure of affine
dependence. Since the constant q − t is irrelevant, it is customary to use

$$\phi_1(\Sigma) = \sum_{i=1}^{t}(1 - \rho_i^2)$$

rather than φ(Σ) as a measure of affine dependence.
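Computationally, φ₁(Σ) requires nothing beyond the eigenvalues of A(Σ); equivalently φ(Σ) = q − tr A(Σ), so φ₁(Σ) = φ(Σ) − (q − t). A sketch with an illustrative, assumed Σ:

```python
import numpy as np

Sigma = np.array([[1.0, 0.0, 0.6, 0.2],
                  [0.0, 1.0, 0.1, 0.3],
                  [0.6, 0.1, 1.0, 0.0],
                  [0.2, 0.3, 0.0, 1.0]])
q = r = t = 2
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

A = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
lam = np.sort(np.linalg.eigvals(A).real)[::-1]   # lam_i = rho_i^2

phi1 = float(np.sum(1.0 - lam[:t]))              # phi_1 = sum (1 - rho_i^2)
phi = q - float(np.trace(A))                     # phi   = tr(I - A(Sigma))
assert abs(phi1 - (phi - (q - t))) < 1e-9
print(phi1)
```

Small values of φ₁ indicate strong affine dependence (all ρᵢ near one); the maximum value t is attained when X and Y are uncorrelated.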
10.2. SAMPLE CANONICAL CORRELATIONS
To introduce the sample canonical correlation coefficients, again consider inner product spaces (V, (·,·)₁) and (W, (·,·)₂) and let (V ⊕ W, (·,·)) be the direct sum space with the natural inner product (·,·). The observations consist of n random vectors Zᵢ = {Xᵢ, Yᵢ} ∈ V ⊕ W, i = 1,..., n. It is assumed that these random vectors are uncorrelated with each other and that ℒ(Zᵢ) = ℒ(Zⱼ) for all i, j. Although these assumptions are not essential in much of what follows, it is difficult to interpret canonical correlations without them. Given Z₁,..., Zₙ, define the random vector Z by specifying that Z takes on the value Zᵢ with probability 1/n. Obviously, the distribution of Z is discrete in V ⊕ W and places mass 1/n at Zᵢ for i = 1,..., n. Unless otherwise specified, when we speak of the distribution of Z, we mean the conditional distribution of Z given Z₁,..., Zₙ as described above. Since the distribution of Z is nothing but the sample probability measure of Z₁,..., Zₙ, we can think of Z as a sample approximation to a random vector whose distribution is ℒ(Z₁). Now, write Z = {X, Y} with X ∈ V and Y ∈ W, so X is Xᵢ with probability 1/n and Y is Yᵢ with probability 1/n. Given Z₁,..., Zₙ, the mean vector of Z is

$$\bar{Z} = \frac{1}{n}\sum_{i=1}^{n} Z_i$$

and the covariance of Z is

$$\mathrm{Cov}\, Z = S \equiv \frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z}) \square (Z_i - \bar{Z}).$$

This last assertion follows from Proposition 2.21 by noting that

$$\mathrm{Cov}\, Z = \mathscr{E}(Z - \bar{Z}) \square (Z - \bar{Z})$$

since the mean of Z is Z̄. When V = R^q and W = R^r are the standard
coordinate spaces with the usual inner products, then S is just the sample covariance matrix. Since S is a linear transformation on V ⊕ W to V ⊕ W, S can be written as

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}.$$

It is routine to show that

$$S_{11} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}) \square (X_i - \bar{X})$$
$$S_{12} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}) \square (Y_i - \bar{Y})$$
$$S_{22} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y}) \square (Y_i - \bar{Y})$$

and S₂₁ = S₁₂′. The reader should note that the symbol □ appearing in the expressions for S₁₁, S₁₂, and S₂₂ has a different meaning in each of the three expressions; namely, the outer product depends on the inner products on the spaces in question. Since it is clear which vectors are in which spaces, this multiple use of □ should cause no confusion.
Now, to define the sample canonical correlation coefficients, the results of Section 10.1 are applied to the random vector Z. For this reason, we assume that S = Cov Z is nonsingular. With q = dim V, r = dim W, and t = min{q, r}, the canonical correlation coefficients are the square roots of the t largest eigenvalues of

$$A(S) = S_{11}^{-1}S_{12}S_{22}^{-1}S_{21}.$$

In the sampling situation under discussion, these roots are denoted by r₁ ≥ ··· ≥ rₜ ≥ 0 and are called the sample canonical correlation coefficients. The justification for such nomenclature is that r₁²,..., rₜ² are the t largest eigenvalues of A(S), where S is the sample covariance based on Z₁,..., Zₙ. Of course, all of the discussion of the previous section applies directly to the situation at hand. In particular, the vector (r₁,..., rₜ) is a maximal invariant under the group action described in Proposition 10.1. Also, r₁,..., rₜ are the cosines of the angles between the subspaces V ⊕ {0} and {0} ⊕ W in the vector space V ⊕ W relative to the inner product determined by S.
Now, let {v₁,..., v_q} and {w₁,..., w_r} be the canonical bases for V and W. Then we have

$$\mathrm{cov}\{(v_i, X)_1, (w_j, Y)_2\} = r_i\delta_{ij}$$

for i = 1,..., q and j = 1,..., r. The convention that rᵢ = 0 for i > t is being used. To interpret what this means in terms of the sample Z₁,..., Zₙ, consider r₁. For nonzero x ∈ V and y ∈ W, the maximum correlation between (x, X)₁ and (y, Y)₂ is r₁ and is achieved for x = v₁ and y = w₁. However, given Z₁,..., Zₙ, we have

$$\mathrm{var}(x, X)_1 = \mathrm{var}(\{x, 0\}, Z) = (\{x, 0\}, S\{x, 0\}) = (x, S_{11}x)_1 = \frac{1}{n}\sum_{i=1}^{n}(x, X_i - \bar{X})_1^2$$

and, similarly,

$$\mathrm{var}(y, Y)_2 = \frac{1}{n}\sum_{i=1}^{n}(y, Y_i - \bar{Y})_2^2.$$

An analogous calculation shows that

$$\mathrm{cov}\{(x, X)_1, (y, Y)_2\} = \frac{1}{n}\sum_{i=1}^{n}(x, X_i - \bar{X})_1(y, Y_i - \bar{Y})_2.$$

Thus var(x, X)₁ is just the sample variance of the random variables (x, Xᵢ)₁, i = 1,..., n, and var(y, Y)₂ is the sample variance of (y, Yᵢ)₂, i = 1,..., n. Also, cov{(x, X)₁, (y, Y)₂} is the sample covariance of the random variables (x, Xᵢ)₁, (y, Yᵢ)₂, i = 1,..., n. Therefore, the correlation between (x, X)₁ and (y, Y)₂ is the ordinary sample correlation coefficient for the random variables (x, Xᵢ)₁, (y, Yᵢ)₂, i = 1,..., n. This observation implies that the maximum possible sample correlation coefficient for (x, Xᵢ)₁, (y, Yᵢ)₂, i = 1,..., n, is the largest sample canonical correlation coefficient, r₁, and this maximum is attained by choosing x = v₁ and y = w₁. The interpretation of r₂,..., rₜ should now be fairly obvious. Given i, 2 ≤ i ≤ t, and given r₁,..., rᵢ₋₁, v₁,..., vᵢ₋₁, and w₁,..., wᵢ₋₁, consider the problem of maximizing the correlation between (x, X)₁ and (y, Y)₂ subject to the conditions

cov{(x, X)₁, (vⱼ, X)₁} = 0, j = 1,..., i − 1
cov{(y, Y)₂, (wⱼ, Y)₂} = 0, j = 1,..., i − 1.
These conditions are easily shown to be equivalent to the conditions that the sample correlation for

(x, Xₖ)₁, (vⱼ, Xₖ)₁, k = 1,..., n

be zero for j = 1,..., i − 1, with a similar statement concerning the Y's. Further, the correlation between (x, X)₁ and (y, Y)₂ is the sample correlation for (x, Xₖ)₁, (y, Yₖ)₂, k = 1,..., n. The maximum sample correlation is rᵢ and is attained by choosing x = vᵢ and y = wᵢ. Thus the sample interpretation of r₁,..., rₜ is completely analogous to the population interpretation of the population canonical correlation coefficients.
For the remainder of this section, it is assumed that V = R^q and W = R^r are the standard coordinate spaces with the usual inner products, so V ⊕ W is just R^p where p = q + r. Thus our sample is Z₁,..., Zₙ with Zᵢ ∈ R^p, and we write

$$Z_i = \begin{pmatrix} X_i \\ Y_i \end{pmatrix}$$

with Xᵢ ∈ R^q and Yᵢ ∈ R^r, i = 1,..., n. The sample covariance matrix, assumed to be nonsingular, is

$$S = \frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(Z_i - \bar{Z})' = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}$$

where

$$S_{11} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})'$$
$$S_{22} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})(Y_i - \bar{Y})'$$
$$S_{12} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})'$$

and S₂₁ = S₁₂'. Now, form the random matrix Z̃: n × p whose rows are (Zᵢ − Z̄)' and partition Z̃ into U: n × q and V: n × r so that

$$\tilde{Z} = (U\ \ V).$$
The rows of U are (Xᵢ − X̄)' and the rows of V are (Yᵢ − Ȳ)', i = 1,..., n. Obviously, we have nS = Z̃'Z̃, nS₁₁ = U'U, nS₂₂ = V'V, and nS₁₂ = U'V. The sample canonical correlation coefficients r₁ ≥ ··· ≥ rₜ are the square roots of the t largest eigenvalues of

$$A(S) = S_{11}^{-1}S_{12}S_{22}^{-1}S_{21} = (U'U)^{-1}U'V(V'V)^{-1}V'U.$$

However, the t largest eigenvalues of A(S) are the same as the t largest eigenvalues of P_XP_Y where

$$P_X = U(U'U)^{-1}U'$$

and

$$P_Y = V(V'V)^{-1}V'.$$

Now, P_X is the orthogonal projection onto the q-dimensional subspace of Rⁿ, say M_X, spanned by the columns of U. Also, P_Y is the orthogonal projection onto the r-dimensional subspace of Rⁿ, say M_Y, spanned by the columns of V. It follows from Proposition 1.48 and Definition 1.28 that the sample canonical correlation coefficients r₁,..., rₜ are the cosines of the angles between the two subspaces M_X and M_Y contained in Rⁿ. Summarizing, we have the following proposition.
Proposition 10.5. Given random vectors

$$Z_i = \begin{pmatrix} X_i \\ Y_i \end{pmatrix} \in R^p, \quad i = 1,..., n,$$

where Xᵢ ∈ R^q and Yᵢ ∈ R^r, form the matrices U: n × q and V: n × r as above. Let M_X ⊆ Rⁿ be the subspace spanned by the columns of U and let M_Y ⊆ Rⁿ be the subspace spanned by the columns of V. Assume that the sample covariance matrix

$$S = \frac{1}{n}\tilde{Z}'\tilde{Z}$$

is nonsingular. Then the sample canonical correlation coefficients are the cosines of the angles between M_X and M_Y.
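Proposition 10.5 is easy to verify numerically: the squared sample canonical correlations can be computed either as eigenvalues of A(S) or as squared cosines of the angles between M_X and M_Y (singular values of Q_x'Q_y for orthonormal bases Q_x, Q_y of the two column spaces). A sketch with simulated, illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, r = 50, 2, 3
X = rng.standard_normal((n, q))
Y = rng.standard_normal((n, r)) + X @ rng.standard_normal((q, r))

U = X - X.mean(axis=0)    # rows (X_i - Xbar)'
V = Y - Y.mean(axis=0)    # rows (Y_i - Ybar)'
t = min(q, r)

# Route 1: square roots of the t largest eigenvalues of
# A(S) = (U'U)^{-1} U'V (V'V)^{-1} V'U.
A = np.linalg.inv(U.T @ U) @ (U.T @ V) @ np.linalg.inv(V.T @ V) @ (V.T @ U)
r_eig = np.sqrt(np.sort(np.linalg.eigvals(A).real)[::-1][:t])

# Route 2: cosines of the angles between M_X and M_Y.
Qx, _ = np.linalg.qr(U)
Qy, _ = np.linalg.qr(V)
r_ang = np.linalg.svd(Qx.T @ Qy, compute_uv=False)[:t]

assert np.allclose(np.sort(r_eig), np.sort(r_ang))
print(r_ang)   # r_1 >= r_2, the sample canonical correlations
```

The second route is the numerically preferable one in practice, since it avoids forming and inverting U'U and V'V explicitly.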
The sample coefficients r₁,..., rₜ have been shown to be the cosines of angles between subspaces in two different vector spaces. In the first case, the interpretation followed from the material developed in Section 10.1 of this chapter: namely, r₁,..., rₜ are the cosines of the angles between R^q ⊕ {0} ⊆ R^p and {0} ⊕ R^r ⊆ R^p when R^p has the inner product determined by the sample covariance matrix. In the second case, described in Proposition 10.5, r₁,..., rₜ are the cosines of the angles between M_X and M_Y in Rⁿ when Rⁿ has the standard inner product. The subspace M_X is spanned by the columns of U, where U has rows (Xᵢ − X̄)', i = 1,..., n. Thus the coordinates of the jth column of U are Xᵢⱼ − X̄ⱼ for i = 1,..., n, where Xᵢⱼ is the jth coordinate of Xᵢ ∈ R^q and X̄ⱼ is the jth coordinate of X̄. This is the reason for the subscript X on the subspace M_X. Of course, similar remarks apply to M_Y.
The vector (r₁,..., rₜ) can also be interpreted as a maximal invariant under a group action on the sample matrix. Given

$$Z_i = \begin{pmatrix} X_i \\ Y_i \end{pmatrix} \in R^p, \quad i = 1,..., n,$$

let X: n × q have rows Xᵢ', i = 1,..., n, and let Y: n × r have rows Yᵢ', i = 1,..., n. Then the data matrix of the whole sample is

$$Z = (X\ \ Y): n \times p,$$

which has rows Zᵢ', i = 1,..., n. Let e ∈ Rⁿ be the vector of all ones. It is assumed that Z ∈ 𝒵 ⊆ ℰ_{p,n}, where 𝒵 is the set of all n × p matrices such that the sample covariance mapping

$$s(Z) = \frac{1}{n}(Z - e\bar{Z}')'(Z - e\bar{Z}')$$

has rank p. Assuming that n ≥ p + 1, the complement of 𝒵 in ℰ_{p,n} has Lebesgue measure zero. To describe the group action on Z, let G be the set of elements g = (Γ, c, C) where

$$\Gamma \in \mathscr{O}(e) = \{\Gamma \mid \Gamma \in \mathscr{O}_n,\ \Gamma e = e\}, \quad c \in R^p,$$

and

$$C = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}, \quad A \in Gl_q,\ B \in Gl_r.$$

For g = (Γ, c, C), the value of g at Z is

$$gZ = \Gamma ZC' + ec'.$$
Since

$$s(gZ) = Cs(Z)C',$$

it follows that each g ∈ G is a one-to-one onto mapping of 𝒵 to 𝒵. The composition in G, defined so G acts on the left of 𝒵, is

$$(\Gamma_1, c_1, C_1)(\Gamma_2, c_2, C_2) = (\Gamma_1\Gamma_2,\ c_1 + C_1c_2,\ C_1C_2).$$
Proposition 10.6. Under the action of G on 𝒵, a maximal invariant is the vector of sample canonical correlation coefficients (r₁,..., rₜ) where t = min{q, r}.

Proof. Let 𝒮_p⁺ be the space of p × p positive definite matrices, so the sample covariance mapping s: 𝒵 → 𝒮_p⁺ is onto. Given S ∈ 𝒮_p⁺, partition S as

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}$$

where S₁₁ is q × q, S₂₂ is r × r, and S₁₂ is q × r. Define h on 𝒮_p⁺ by letting h(S) be the vector (λ₁,..., λₜ)' of the t largest eigenvalues of

$$A(S) = S_{11}^{-1}S_{12}S_{22}^{-1}S_{21}.$$

Since rᵢ = λᵢ^{1/2}, i = 1,..., t, the proposition will be proved if it is shown that

$$\varphi(Z) = h(s(Z))$$

is a maximal invariant function. This follows since h(s(Z)) = (λ₁,..., λₜ)', which is a one-to-one function of (r₁,..., rₜ). The proof that φ is maximal invariant proceeds as follows. Consider the two subgroups G₁ and G₂ of G defined by

$$G_1 = \{g \mid g = (\Gamma, c, I_p) \in G\}$$

and

$$G_2 = \{g \mid g = (I_n, 0, C) \in G\}.$$

Note that G₂ acts on the space 𝒮_p⁺ in the obvious way; namely, if g₂ = (Iₙ, 0, C), then

$$g_2(S) = CSC', \quad S \in \mathscr{S}_p^+.$$
Further, since

$$(\Gamma, c, C) = (\Gamma, c, I_p)(I_n, 0, C),$$

it follows that each g ∈ G can be written as g = g₁g₂ where gᵢ ∈ Gᵢ, i = 1, 2. Now, we make two claims:

(i) s: 𝒵 → 𝒮_p⁺ is a maximal invariant under the action of G₁ on 𝒵.
(ii) h: 𝒮_p⁺ → Rᵗ is a maximal invariant under the action of G₂ on 𝒮_p⁺.

Assuming (i) and (ii), we now show that φ(Z) = h(s(Z)) is maximal invariant. For g ∈ G, write g = g₁g₂ with gᵢ ∈ Gᵢ, i = 1, 2. Since

$$s(g_1Z) = s(Z), \quad g_1 \in G_1$$

and

$$s(g_2Z) = g_2(s(Z)), \quad g_2 \in G_2,$$

we have

$$\varphi(gZ) = h(s(g_1g_2Z)) = h(s(g_2Z)) = h(g_2s(Z)) = h(s(Z)) = \varphi(Z).$$

It follows that φ is invariant. To show that φ is maximal invariant, assume φ(Z₁) = φ(Z₂). A g ∈ G must be found so that gZ₁ = Z₂. Since h is maximal invariant under G₂ and

$$h(s(Z_1)) = h(s(Z_2)),$$

there is a g₂ ∈ G₂ such that

$$g_2(s(Z_1)) = s(Z_2).$$

However,

$$g_2(s(Z_1)) = s(g_2Z_1)$$

and s is maximal invariant under G₁, so there exists a g₁ ∈ G₁ such that

$$g_1(g_2Z_1) = Z_2.$$

This completes the proof that φ, and hence (r₁,..., rₜ), is a maximal invariant
assuming claims (i) and (ii). The proof that s: 𝒵 → 𝒮_p⁺ is a maximal invariant is an easy application of Proposition 1.20 and is left to the reader. That h: 𝒮_p⁺ → Rᵗ is maximal invariant follows from an argument similar to that given in the proof of Proposition 10.1. □
The group action on Z treated in Proposition 10.6 is suggested by the following considerations. Assuming that the observations Z₁,..., Zₙ in R^p are uncorrelated random vectors and ℒ(Zᵢ) = ℒ(Z₁) for i = 1,..., n, it follows that

$$\mathscr{E}Z = e\mu'$$

and

$$\mathrm{Cov}\, Z = I_n \otimes \Sigma$$

where μ = ℰZ₁ and Cov Z₁ = Σ. When Z is transformed by g = (Γ, c, C), we have

$$\mathscr{E}gZ = e(C\mu + c)'$$

and

$$\mathrm{Cov}\, gZ = I_n \otimes (C\Sigma C').$$

Thus the induced action of g on (μ, Σ) is exactly the group action considered in Proposition 10.1. The special structure of ℰZ and Cov Z is reflected by the fact that, for g = (Γ, 0, I_p), we have ℰgZ = ℰZ and Cov gZ = Cov Z.
10.3. SOME DISTRIBUTION THEORY
The distribution theory associated with the sample canonical correlation coefficients is, to say the least, rather complicated. Most of the results in this section are derived under the assumption of normality and the assumption that the population canonical correlations are zero. However, the distribution of the sample multiple correlation coefficient is given in the general case of a nonzero population multiple correlation coefficient.

Our first result is a generalization of Example 7.12. Let Z₁,..., Zₙ be a random sample of vectors in R^p and partition Zᵢ as

$$Z_i = \begin{pmatrix} X_i \\ Y_i \end{pmatrix}, \quad X_i \in R^q,\ Y_i \in R^r.$$
Assume that Z₁ has a density on R^p given by

$$p(z \mid \mu, \Sigma) = |\Sigma|^{-1/2} f((z - \mu)'\Sigma^{-1}(z - \mu))$$

where f has been normalized so that

$$\int zz'f(z'z)\,dz = I_p.$$

Thus when the density of Z₁ is p(·|μ, Σ), then

$$\mathscr{E}Z_1 = \mu, \quad \mathrm{Cov}\, Z_1 = \Sigma.$$

Assuming that n ≥ p + 1, the sample covariance matrix

$$S = \frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(Z_i - \bar{Z})' = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}$$

is positive definite with probability one. Here S₁₁ is q × q, S₂₂ is r × r, and S₁₂ is q × r. Partitioning Σ as S is partitioned, we have

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Thus the squared sample coefficients, r₁² ≥ ··· ≥ rₜ², are the t largest eigenvalues of S₁₁⁻¹S₁₂S₂₂⁻¹S₂₁, and the squared population coefficients, ρ₁² ≥ ··· ≥ ρₜ², are the t largest eigenvalues of Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁. In the present generality, an invariance argument is given to show that the joint distribution of (r₁,..., rₜ) depends on (μ, Σ) only through (ρ₁,..., ρₜ). Consider the group G whose elements are g = (C, c) where c ∈ R^p and

$$C = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}, \quad A \in Gl_q,\ B \in Gl_r.$$

The action of G on R^p is

$$(C, c)z = Cz + c$$

and group composition is

$$(C_1, c_1)(C_2, c_2) = (C_1C_2,\ C_1c_2 + c_1).$$
The group action on the sample is

$$g(z_1,..., z_n) = (gz_1,..., gz_n).$$

With the induced group action on (μ, Σ) given by

$$g(\mu, \Sigma) = (C\mu + c,\ C\Sigma C')$$

where g = (C, c), it is clear that the family of distributions of (Z₁,..., Zₙ) indexed by elements of

$$\Theta = \{(\mu, \Sigma) \mid \mu \in R^p,\ \Sigma \in \mathscr{S}_p^+\}$$

is a G-invariant family of probability measures.

Proposition 10.7. The joint distribution of (r₁,..., rₜ) depends on (μ, Σ) only through (ρ₁,..., ρₜ).

Proof. From Proposition 10.6, we know that (r₁,..., rₜ) is a G-invariant function of (Z₁,..., Zₙ). Thus the distribution of (r₁,..., rₜ) will depend on the parameter θ = (μ, Σ) only through a maximal invariant in the parameter space. However, Proposition 10.1 shows that (ρ₁,..., ρₜ) is a maximal invariant under the action of G on Θ. □
Before discussing the distribution of canonical correlation coefficients, even for t = 1, it is instructive to consider the bivariate correlation coefficient. Consider pairs of random variables (Xᵢ, Yᵢ), i = 1,..., n, and let X ∈ Rⁿ and Y ∈ Rⁿ have coordinates Xᵢ and Yᵢ, i = 1,..., n. With e ∈ Rⁿ being the vector of ones, P_e = ee'/n, and Q_e = I − P_e, the sample correlation coefficient is defined by

$$r = \left(\frac{Q_eY}{\|Q_eY\|}\right)'\frac{Q_eX}{\|Q_eX\|}.$$

The next result describes the distribution of r when (Xᵢ, Yᵢ), i = 1,..., n, is a random sample from a bivariate normal distribution.
Proposition 10.8. Suppose (Xᵢ, Yᵢ)' ∈ R², i = 1,..., n, are independent random vectors with

$$\mathcal{L}\{(X_i, Y_i)'\} = N(\mu, \Sigma)$$

where μ ∈ R² and

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}$$

is positive definite. Consider random variables (U₁, U₂, U₃) with:

(i) (U₁, U₂) independent of U₃.
(ii) ℒ(U₃) = χ²_{n−2}.
(iii) ℒ(U₂) = χ²_{n−1}.
(iv) ℒ(U₁ | U₂) = N(ρ(1 − ρ²)^{−1/2}U₂^{1/2}, 1),

where ρ = σ₁₂/(σ₁₁σ₂₂)^{1/2} is the correlation coefficient. Then we have

$$\mathcal{L}\!\left(\frac{r}{\sqrt{1 - r^2}}\right) = \mathcal{L}\!\left(\frac{U_1}{\sqrt{U_3}}\right).$$
Proof. The assumption of independence and normality implies that the matrix (X Y) ∈ ℰ_{2,n} has a distribution given by

$$\mathcal{L}((X\ Y)) = N(e\mu', I_n \otimes \Sigma).$$

It follows from Proposition 10.7 that we may assume, without loss of generality, that Σ has the form

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

When Σ has this form, the conditional distribution of X given Y is

$$\mathcal{L}(X \mid Y) = N((\mu_1 - \rho\mu_2)e + \rho Y,\ (1 - \rho^2)I_n)$$

so

$$\mathcal{L}(Q_eX \mid Y) = N(\rho Q_eY,\ (1 - \rho^2)Q_e).$$

Now, let v₁,..., vₙ be an orthonormal basis for Rⁿ with v₁ = e/√n and

$$v_2 = \frac{Q_eY}{\|Q_eY\|}.$$
Expressing Q_eX in this basis leads to

$$Q_eX = \sum_{i=2}^{n}(v_i'Q_eX)v_i$$

since Q_ee = 0. Setting

$$\xi_i = \frac{v_i'Q_eX}{\sqrt{1 - \rho^2}}, \quad i = 2,..., n,$$

it is easily seen that, conditional on Y, the variables ξ₂,..., ξₙ are independent with

$$\mathcal{L}(\xi_2 \mid Y) = N(\rho(1 - \rho^2)^{-1/2}\|Q_eY\|,\ 1)$$

and

$$\mathcal{L}(\xi_i \mid Y) = N(0, 1), \quad i = 3,..., n.$$

Since

$$\|Q_eX\|^2 = \sum_{i=2}^{n}(v_i'Q_eX)^2 = (1 - \rho^2)\sum_{i=2}^{n}\xi_i^2,$$

the identity

$$r^2 = \frac{\xi_2^2}{\xi_2^2 + \sum_{i=3}^{n}\xi_i^2}$$

holds. This leads to

$$\frac{r}{\sqrt{1 - r^2}} = \frac{\xi_2}{\left(\sum_{i=3}^{n}\xi_i^2\right)^{1/2}}.$$

Setting U₁ = ξ₂, U₂ = ‖Q_eY‖², and U₃ = Σᵢ₌₃ⁿ ξᵢ² yields the assertion of the proposition. □
The result of this proposition has a couple of interesting consequences. When ρ = 0, the statistic

$$W = \sqrt{n - 2}\,\frac{r}{\sqrt{1 - r^2}} = \sqrt{n - 2}\,\frac{U_1}{\sqrt{U_3}}$$

has a Student's t distribution with n − 2 degrees of freedom. In the general case, the distribution of W can be described by saying that, conditional on U₂, W has a noncentral t distribution with n − 2 degrees of freedom and noncentrality parameter

$$\delta = \frac{\rho}{\sqrt{1 - \rho^2}}\,U_2^{1/2}$$

where ℒ(U₂) = χ²_{n−1}. Let p_m(·|δ) denote the density function of a noncentral t distribution with m degrees of freedom and noncentrality parameter δ. The results in the Appendix show that p_m(·|δ) has a monotone likelihood ratio. It is clear that the density of W is

$$h(w \mid \rho) = \int_0^\infty p_m\!\left(w \mid \rho(1 - \rho^2)^{-1/2}u^{1/2}\right)f(u)\,du$$

where f is the density of U₂ and m = n − 2. From this representation and the results in the Appendix, it is not difficult to show that h(·|ρ) has a monotone likelihood ratio. The details of this are left to the reader.

In the case that the two random vectors X and Y in Rⁿ are independent, the conditions under which W has a t_{n−2} distribution can be considerably weakened.
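As a purely data-analytic matter, r and W are immediate to compute; the distributional statements above then calibrate W. A sketch with small, assumed toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
n = len(x)

# Q_e v = v - vbar removes the mean; r is the cosine of the angle
# between the two centered vectors.
qx = x - x.mean()
qy = y - y.mean()
r = (qy / np.linalg.norm(qy)) @ (qx / np.linalg.norm(qx))

# Under rho = 0 and normality, W has a t distribution with
# n - 2 degrees of freedom.
W = np.sqrt(n - 2) * r / np.sqrt(1 - r * r)
print(r, W)
```

Note that r computed this way agrees with the usual sample correlation coefficient, since centering followed by normalization is exactly the Q_e construction in the text.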
Proposition 10.9. Suppose X and Y in Rⁿ are independent and both ‖Q_eX‖ and ‖Q_eY‖ are positive with probability one. Also assume that, for some number μ₁ ∈ R, the distribution of X − μ₁e is orthogonally invariant. Under these assumptions, the distribution of

$$W = \sqrt{n - 2}\,\frac{r}{\sqrt{1 - r^2}},$$

where

$$r = \left(\frac{Q_eY}{\|Q_eY\|}\right)'\frac{Q_eX}{\|Q_eX\|},$$

is a t_{n−2} distribution.

Proof. The two random vectors Q_eX and Q_eY take values in the (n − 1)-dimensional subspace

$$M = \{x \mid x \in R^n,\ x'e = 0\}.$$
Fix Y so the vector

$$\frac{Q_eY}{\|Q_eY\|} \in M$$

has length one. Since the distribution of X − μ₁e is 𝒪ₙ-invariant, it follows that the distribution of Q_eX is invariant under the group

$$G = \{\Gamma \mid \Gamma \in \mathscr{O}_n,\ \Gamma e = e\},$$

which acts on M. Therefore, the distribution of Q_eX/‖Q_eX‖ is G-invariant on the set

$$\mathscr{X} = \{x \mid x \in M,\ \|x\| = 1\}.$$

But G is compact and acts transitively on 𝒳, so there is a unique G-invariant distribution for Q_eX/‖Q_eX‖ on 𝒳. From this uniqueness it follows that

$$\mathcal{L}\!\left(\frac{Q_eX}{\|Q_eX\|}\right) = \mathcal{L}\!\left(\frac{Q_eZ}{\|Q_eZ\|}\right)$$

where ℒ(Z) = N(0, Iₙ) on Rⁿ. Therefore, we have

$$\mathcal{L}(r) = \mathcal{L}\!\left(\left(\frac{Q_eY}{\|Q_eY\|}\right)'\frac{Q_eZ}{\|Q_eZ\|}\right)$$

and, for each Y, the claimed result follows from the argument given to prove Proposition 10.8. □
We now turn to the canonical correlation coefficients in the special case that t = 1. Consider random vectors X_i and Y_i with X_i ∈ R¹ and Y_i ∈ R^r, i = 1, …, n. Let X ∈ R^n have coordinates X₁, …, X_n and let Y ∈ ℒ_{r,n} have rows Y₁′, …, Y_n′. Assume that Q_e Y has rank r so

P = Q_e Y[(Q_e Y)′Q_e Y]^{−1}(Q_e Y)′

is the orthogonal projection onto the subspace spanned by the columns of Q_e Y. Since t = 1, the canonical correlation coefficient is the square root of the largest, and only nonzero, eigenvalue of

(Q_e X)(Q_e X)′P / ‖Q_e X‖²,
which is

r₁² = (Q_e X)′P(Q_e X) / ‖Q_e X‖² = ‖PQ_e X‖² / ‖Q_e X‖².

For the case at hand, r₁ is commonly called the multiple correlation coefficient. The distribution of r₁² is described next under the assumption of normality.
Proposition 10.10. Assume that the distribution of (X Y) ∈ ℒ_{r+1,n} is given by

ℒ((X Y)) = N(eμ′, I_n ⊗ Σ)

and partition Σ as

Σ = ( σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂ ),  (r + 1) × (r + 1),

where σ₁₁ > 0, Σ₁₂ is 1 × r, and Σ₂₂ is r × r. Consider random variables U₁, U₂, and U₃ whose joint distribution is specified by:

(i) (U₁, U₂) and U₃ are independent.
(ii) ℒ(U₃) = χ²_{n−r−1}.
(iii) ℒ(U₂) = χ²_{n−1}.
(iv) ℒ(U₁|U₂) = χ²_r(λ), where λ = ρ²(1 − ρ²)^{−1}U₂.

Here ρ = (Σ₁₂Σ₂₂^{−1}Σ₂₁/σ₁₁)^{1/2} is the population multiple correlation coefficient. Then we have

ℒ(r₁²/(1 − r₁²)) = ℒ(U₁/U₃).
Proof. Combining the results of Proposition 10.1 and Proposition 5.7, without loss of generality Σ can be assumed to have the form

Σ = ( 1  ρε₁′ ; ρε₁  I_r ),

where ε₁ ∈ R^r and ε₁ = (1, 0, …, 0)′. When Σ has this form, the conditional
distribution of X given Y is

ℒ(X|Y) = N((μ₁ − ρμ₂′ε₁)e + ρYε₁, (1 − ρ²)I_n),

where ℰX = μ₁e and ℰY = eμ₂′. Since Q_e e = 0, we have
ℒ(Q_e X|Y) = N(ρQ_e Yε₁, (1 − ρ²)Q_e).

The subspace spanned by the columns of Q_e Y is contained in the range of Q_e and this implies that Q_e P = PQ_e = P. So

‖Q_e X‖² = ‖(Q_e − P)Q_e X‖² + ‖PQ_e X‖².
Since

r₁² = ‖PQ_e X‖² / ‖Q_e X‖²,

it follows that

r₁² / (1 − r₁²) = ‖PQ_e X‖² / ‖(Q_e − P)Q_e X‖².
Given Y, the conditional covariance of Q_e X is (1 − ρ²)Q_e and, therefore, the identity PQ_e(Q_e − P) = 0 implies that PQ_e X and (Q_e − P)Q_e X are conditionally independent. It is clear that

ℒ((Q_e − P)Q_e X|Y) = N(0, (1 − ρ²)(Q_e − P)),

so we have

ℒ(‖(Q_e − P)Q_e X‖² | Y) = (1 − ρ²)χ²_{n−r−1}

since Q_e − P is an orthogonal projection of rank n − r − 1. Again, conditioning on Y,

ℒ(PQ_e X|Y) = N(ρQ_e Yε₁, (1 − ρ²)P)

since PQ_e = P and Q_e Yε₁ is in the range of P. It follows from Proposition 3.8 that

ℒ(‖PQ_e X‖² | Y) = (1 − ρ²)χ²_r(λ),
where the noncentrality parameter λ is given by

λ = ρ²(1 − ρ²)^{−1} ε₁′Y′Q_e Yε₁.
That U₂ = ε₁′Y′Q_e Yε₁ has a χ²_{n−1} distribution is clear. Defining U₁ and U₃ by

U₁ = (1 − ρ²)^{−1}‖PQ_e X‖²

and

U₃ = (1 − ρ²)^{−1}‖(Q_e − P)Q_e X‖²,

the identity

r₁²/(1 − r₁²) = U₁/U₃

holds. That U₃ is independent of (U₁, U₂) follows by conditioning on Y. Since

ℒ(U₁|Y) = χ²_r(λ)

where

λ = ρ²(1 − ρ²)^{−1} ε₁′Y′Q_e Yε₁ = ρ²(1 − ρ²)^{−1} U₂,

the conditional distribution of U₁ given Y is the same as the conditional distribution of U₁ given U₂. This completes the proof. □
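A quick Monte Carlo check of Proposition 10.10 (illustrative parameters only; numpy/scipy are assumed): simulate r₁²/(1 − r₁²) directly from normal data with Σ of the form used in the proof, and compare it with the (U₁, U₂, U₃) construction via a two-sample Kolmogorov–Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, r, rho, reps = 15, 3, 0.6, 4000
e = np.ones((n, 1))
Q = np.eye(n) - e @ e.T / n                     # Q_e

lhs = np.empty(reps)
for i in range(reps):
    # Sigma = (1, rho*eps1'; rho*eps1, I_r): x = rho*Y[:,0] + sqrt(1-rho^2)*z
    Y = rng.standard_normal((n, r))
    x = rho * Y[:, 0] + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
    QY, qx = Q @ Y, Q @ x
    P = QY @ np.linalg.solve(QY.T @ QY, QY.T)   # projection onto col(Q_e Y)
    r2 = qx @ P @ qx / (qx @ qx)                # squared multiple correlation
    lhs[i] = r2 / (1 - r2)

# Right-hand side: U1/U3 with (i)-(iv) of Proposition 10.10.
U2 = stats.chi2.rvs(n - 1, size=reps, random_state=rng)
U1 = stats.ncx2.rvs(r, rho ** 2 / (1 - rho ** 2) * U2, random_state=rng)
U3 = stats.chi2.rvs(n - r - 1, size=reps, random_state=rng)
print(stats.ks_2samp(lhs, U1 / U3).pvalue)
```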
When ρ = 0, Proposition 10.10 shows that

ℒ(r₁²/(1 − r₁²)) = ℒ(χ²_r / χ²_{n−r−1}) = F(r, n − r − 1),

which is the unnormalized F-distribution on (0, ∞). More generally,

ℒ(r₁²/(1 − r₁²)) = F(r, n − r − 1; λ),
where

λ = ρ²(1 − ρ²)^{−1} χ²_{n−1}

is random. This means that, conditioning on λ = δ,

ℒ(r₁²/(1 − r₁²) | δ) = F(r, n − r − 1; δ).

Let f(·|δ) denote the density function of an F(r, n − r − 1; δ) distribution, and let h(·) be the density of a χ²_{n−1} distribution. Then the density of r₁²/(1 − r₁²) is

k(w|ρ) = ∫₀^∞ f(w | ρ²(1 − ρ²)^{−1}u) h(u) du.
From this representation, it can be shown, using the results in the Appendix, that k(w|ρ) has a monotone likelihood ratio.

The final exact distributional result of this section concerns the function of the sample canonical correlations given by

W = ∏_{i=1}^t (1 − r_i²)

when the random sample (X_i′, Y_i′)′, i = 1, …, n, is from a normal distribution and the population coefficients are all zero. This statistic arises in testing for independence, which is discussed in detail in the next section. To be precise, it is assumed that the random sample

Z_i = ( X_i ; Y_i ) ∈ R^p,  i = 1, …, n,

satisfies

ℒ(Z_i) = N(μ, Σ).

As usual, X_i ∈ R^q, Y_i ∈ R^r, and the sample covariance matrix

S = Σ_{i=1}^n (Z_i − Z̄)(Z_i − Z̄)′
is partitioned as

S = ( S₁₁  S₁₂ ; S₂₁  S₂₂ ),

where S₁₁ is q × q and S₂₂ is r × r. Under the assumptions made thus far, S has a Wishart distribution, namely,

ℒ(S) = W(Σ, p, n − 1).

Partitioning Σ, we have

Σ = ( Σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂ ),

and the population canonical correlation coefficients, say ρ₁ ≥ ⋯ ≥ ρ_t, are all zero iff Σ₁₂ = 0.
Proposition 10.11. Assume n − 1 > p and let r₁ ≥ ⋯ ≥ r_t be the sample canonical correlations. When Σ₁₂ = 0, then

ℒ(∏_{i=1}^t (1 − r_i²)) = U(n − r − 1, r, q),

where the distribution U(n − r − 1, r, q) is described in Proposition 8.14.

Proof. Since r₁², …, r_t² are the t largest eigenvalues of

A(S) = S₁₁^{−1}S₁₂S₂₂^{−1}S₂₁

and the remaining q − t eigenvalues of A(S) are zero, it follows that

W = ∏_{i=1}^t (1 − r_i²) = |I_q − S₁₁^{−1}S₁₂S₂₂^{−1}S₂₁|.

Since W is a function of the sample canonical correlations and Σ₁₂ = 0, Proposition 10.1 implies that we can take

Σ = ( I_q  0 ; 0  I_r )

without loss of generality to find the distribution of W. Using properties of
determinants, we have

W = |S₁₁|^{−1}|S₁₁ − S₁₂S₂₂^{−1}S₂₁| = |S₁₁.₂| / |S₁₁.₂ + S₁₂S₂₂^{−1}S₂₁|.

Proposition 8.7 implies that

ℒ(S₁₁.₂) = W(I_q, q, n − r − 1),

ℒ(S₂₂^{−1/2}S₂₁ | S₂₂) = N(0, I_r ⊗ I_q),

and S₁₁.₂ and S₁₂S₂₂^{−1}S₂₁ are independent. Therefore,

ℒ(S₁₂S₂₂^{−1}S₂₁) = W(I_q, q, r),

and by definition, it follows that

ℒ(W) = U(n − r − 1, r, q). □
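Proposition 10.11 can be checked by simulation. The sketch below assumes, beyond numpy/scipy, the standard product-of-independent-betas form of U(n − r − 1, r, q), namely ∏_{i=1}^q Beta((n − r − i)/2, r/2); Proposition 8.14 itself is not reproduced in this section, so that representation is an assumption of the example, not a quotation from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, q, r, reps = 20, 2, 3, 4000
p = q + r

w = np.empty(reps)
for i in range(reps):
    Z = rng.standard_normal((n, p))        # Sigma = I_p, so Sigma_12 = 0
    Zc = Z - Z.mean(axis=0)
    S = Zc.T @ Zc
    S11, S22 = S[:q, :q], S[q:, q:]
    # W = prod(1 - r_i^2) = |S| / (|S11| |S22|)
    w[i] = np.linalg.det(S) / (np.linalg.det(S11) * np.linalg.det(S22))

# Assumed product-of-betas form of U(n-r-1, r, q).
betas = np.prod(
    stats.beta.rvs((n - r - np.arange(1, q + 1)) / 2, r / 2,
                   size=(reps, q), random_state=rng),
    axis=1)
print(stats.ks_2samp(w, betas).pvalue)
```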
Since

W = |S₂₂.₁| / |S₂₂.₁ + S₂₁S₁₁^{−1}S₁₂|,

the proof of Proposition 10.11 shows that ℒ(W) = U(n − q − 1, q, r), so

U(n − q − 1, q, r) = U(n − r − 1, r, q)

as long as n − 1 > q + r. Using the ideas in the proof of Proposition 8.15, the distribution of W can be derived when Σ₁₂ has rank one, that is, when one population canonical correlation is positive and the rest are zero. The details of this are left to the reader.

We close this section with a discussion that provides some qualitative information about the distribution of r₁ ≥ ⋯ ≥ r_t when the data matrices X ∈ ℒ_{q,n} and Y ∈ ℒ_{r,n} are independent. As usual, let P_X and P_Y denote the orthogonal projections onto the column spaces of Q_e X and Q_e Y. Then the squared sample canonical correlations are the t largest eigenvalues of P_Y P_X, say

φ(P_Y P_X) ∈ R^t.

It is assumed that Q_e X has rank q and Q_e Y has rank r. Since the distribution of φ(P_Y P_X) is of interest, it is reasonable to investigate the
distributional properties of the two random projections P_X and P_Y. Since X and Y are assumed to be independent, it suffices to focus our attention on P_X. First note that P_X is an orthogonal projection onto a q-dimensional subspace contained in

M = {x | x ∈ R^n, x′e = 0}.

Therefore, P_X is an element of

𝒫_{q,n}(e) = {P | P is an n × n rank q orthogonal projection, Pe = 0}.
Furthermore, the space 𝒫_{q,n}(e) is a compact subset of R^{n²} and is acted on by the compact group

O_n(e) = {Γ | Γ ∈ O_n, Γe = e},

with the group action given by P → ΓPΓ′. Since O_n(e) acts transitively on 𝒫_{q,n}(e), there is a unique O_n(e)-invariant probability distribution on 𝒫_{q,n}(e). This is called the uniform distribution on 𝒫_{q,n}(e).

Proposition 10.12. If ℒ(X) = ℒ(ΓX) for Γ ∈ O_n(e), then P_X has a uniform distribution on 𝒫_{q,n}(e).
Proof. It is readily verified that

P_{ΓX} = ΓP_XΓ′,  Γ ∈ O_n(e).

Therefore, if ℒ(ΓX) = ℒ(X), then

ℒ(P_X) = ℒ(ΓP_XΓ′),

which implies that the distribution ℒ(P_X) on 𝒫_{q,n}(e) is O_n(e)-invariant. The uniqueness of the uniform distribution on 𝒫_{q,n}(e) yields the result. □
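The key identity in this proof, P_{ΓX} = ΓP_XΓ′ for Γ ∈ O_n(e), is easy to verify numerically. The numpy sketch below constructs one convenient random Γ fixing e (by rotating only the orthogonal complement of e); the construction is illustrative, not unique.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(4)
n, q = 7, 3
e = np.ones((n, 1))
Q = np.eye(n) - e @ e.T / n              # Q_e

def proj(M):
    # Orthogonal projection onto the column space of M.
    return M @ np.linalg.solve(M.T @ M, M.T)

# Gamma in O_n(e): an orthonormal basis whose first vector spans e, with a
# random rotation acting on the remaining n-1 directions.
B, _ = np.linalg.qr(np.hstack([e, rng.standard_normal((n, n - 1))]))
R, _ = np.linalg.qr(rng.standard_normal((n - 1, n - 1)))
Gam = B @ block_diag(1.0, R) @ B.T

X = rng.standard_normal((n, q))
PX = proj(Q @ X)                          # P_X: projection onto col(Q_e X)
PGX = proj(Q @ (Gam @ X))                 # P_{Gamma X}

print(np.allclose(PGX, Gam @ PX @ Gam.T), np.allclose(Gam @ e, e))
```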
When ℒ(X) = N(eμ′, I_n ⊗ Σ), then

ℒ(X) = ℒ(ΓX)

for Γ ∈ O_n(e), so Proposition 10.12 applies to this case. For any two n × n positive semidefinite matrices B₁ and B₂, define the function φ(B₁B₂) to be the vector of the t largest eigenvalues of B₁B₂. In particular, φ(P_Y P_X) is the vector of squared sample canonical correlations.
Proposition 10.13. Assume X and Y are independent, ℒ(ΓX) = ℒ(X) for Γ ∈ O_n(e), Q_e X has rank q, and Q_e Y has rank r. Then

ℒ(φ(P_Y P_X)) = ℒ(φ(P₀P_X)),

where P₀ is any fixed rank r projection in 𝒫_{r,n}(e).
Proof. First note that

φ(P_Y ΓP_XΓ′) = φ(Γ′P_Y ΓP_X),

since the eigenvalues of P_Y ΓP_XΓ′ are the same as the eigenvalues of Γ′P_Y ΓP_X. From Proposition 10.12, we have

ℒ(P_X) = ℒ(ΓP_XΓ′),  Γ ∈ O_n(e).

Conditioning on Y, the independence of X and Y implies that

ℒ(φ(P_Y P_X)|Y) = ℒ(φ(P_Y ΓP_XΓ′)|Y) = ℒ(φ(Γ′P_Y ΓP_X)|Y)

for all Γ ∈ O_n(e). The group O_n(e) acts transitively on 𝒫_{r,n}(e), so for Y fixed, there exists a Γ ∈ O_n(e) such that Γ′P_Y Γ = P₀. Therefore, the equation

ℒ(φ(P_Y P_X)|Y) = ℒ(φ(P₀P_X)|Y) = ℒ(φ(P₀P_X))

holds for each Y since X and Y are independent. Averaging ℒ(φ(P_Y P_X)|Y) over Y yields ℒ(φ(P_Y P_X)), which must then equal ℒ(φ(P₀P_X)). This completes the proof. □
The preceding result shows that ℒ(φ(P_Y P_X)) does not depend on the distribution of Y as long as X and Y are independent and ℒ(X) = ℒ(ΓX) for Γ ∈ O_n(e). In this case, the distribution of φ(P_Y P_X) can be derived under the assumption that ℒ(X) = N(0, I_n ⊗ I_q) and ℒ(Y) = N(0, I_n ⊗ I_r). Suppose that q ≤ r so t = q. Then ℒ(φ(P_Y P_X)) is the distribution of λ₁ ≥ ⋯ ≥ λ_q, where λ_i = r_i², i = 1, …, q, are the eigenvalues of S₁₁^{−1}S₁₂S₂₂^{−1}S₂₁, and

S = ( S₁₁  S₁₂ ; S₂₁  S₂₂ )

is the sample covariance matrix. To find the distribution of r₁, …, r_q, it
would obviously suffice to find the distribution of γ_i = 1 − λ_i, i = 1, …, q, which are the eigenvalues of

I − S₁₁^{−1}S₁₂S₂₂^{−1}S₂₁ = (T₁ + T₂)^{−1}T₁,

where

T₁ = S₁₁ − S₁₂S₂₂^{−1}S₂₁;  T₂ = S₁₂S₂₂^{−1}S₂₁.

It was shown in the proof of Proposition 10.11 that T₁ and T₂ are independent and

ℒ(T₁) = W(I_q, q, n − r − 1)

and

ℒ(T₂) = W(I_q, q, r).

Since the matrix

B = (T₁ + T₂)^{−1/2}T₁(T₁ + T₂)^{−1/2}

has the same eigenvalues as (T₁ + T₂)^{−1}T₁, it suffices to find the distribution of the eigenvalues of B. It is not too difficult to show (see the Problems at the end of this chapter) that the density of B is

p(B) = [ω(n − r − 1, q)ω(r, q) / ω(n − 1, q)] |B|^{(n−r−q−2)/2} |I_q − B|^{(r−q−1)/2}

with respect to Lebesgue measure dB restricted to the set

𝔛 = {B | B ∈ 𝒮_q^+, I_q − B ∈ 𝒮_q^+}.

Here, ω(·, ·) is the Wishart constant defined in Example 5.1. Now, the ordered eigenvalues of B are a maximal invariant under the action of the group O_q on 𝔛 given by B → ΓBΓ′, Γ ∈ O_q. Let λ be the vector of ordered eigenvalues of B so λ ∈ R^q, 1 > λ₁ > ⋯ > λ_q > 0. Since p(ΓBΓ′) = p(B), Γ ∈ O_q, it follows from Proposition 7.15 that the density of λ is q(λ) = p(D_λ), where D_λ is a q × q diagonal matrix with diagonal entries λ₁, …, λ_q. Of course, q(·) is the density of λ with respect to the measure ν(dλ) induced by the maximal invariant mapping. More precisely, let

𝔉 = {λ | λ ∈ R^q, 1 > λ₁ > ⋯ > λ_q > 0}
and consider the mapping φ from 𝔛 to 𝔉 defined by φ(B) = λ, where λ is the vector of ordered eigenvalues of B. For any Borel set C ⊆ 𝔉, ν(C) is defined by

ν(C) = ∫_{φ^{−1}(C)} dB.

Since q(λ) has been calculated, the only step left to determine the distribution of λ is to find the measure ν. However, it is rather nontrivial to find ν and the details are not given here. We have included the above argument to show that the only step in obtaining ℒ(λ) that we have not solved is the calculation of ν. This completes our discussion of distributional problems associated with canonical correlations.

The measure ν above is just the restriction to 𝔉 of the measure ν₂ discussed in Example 6.1. For one derivation of ν₂, see Muirhead (1982, p. 104).
10.4. TESTING FOR INDEPENDENCE
In this section, we consider the problem of testing for independence based on a random sample from a normal distribution. Again, let Z₁, …, Z_n be independent random vectors in R^p and partition Z_i as

Z_i = ( X_i ; Y_i ),  X_i ∈ R^q, Y_i ∈ R^r.

It is assumed that ℒ(Z_i) = N(μ, Σ), so

Cov(Z_i) = Σ = ( Σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂ )

for i = 1, …, n. The problem is to test the null hypothesis H₀: Σ₁₂ = 0 against the alternative H₁: Σ₁₂ ≠ 0. As usual, let Z have rows Z_i′, i = 1, …, n, so ℒ(Z) = N(eμ′, I_n ⊗ Σ). Assuming n > p + 1, the set 𝒵 ⊆ ℒ_{p,n} where

S = (Z − eZ̄′)′(Z − eZ̄′) = ( S₁₁  S₁₂ ; S₂₁  S₂₂ )

has rank p is a set of probability one and 𝒵 is taken as the sample space for Z. The group G considered in Proposition 10.6 acts on 𝒵 and a maximal invariant is the vector of canonical correlation coefficients r₁, …, r_t, where t = min(q, r).
Proposition 10.14. The problem of testing H₀: Σ₁₂ = 0 versus H₁: Σ₁₂ ≠ 0 is invariant under G. Every G-invariant test is a function of the sample canonical correlation coefficients r₁, …, r_t. When t = 1, the test that rejects for large values of r₁ is a uniformly most powerful invariant test.

Proof. That the testing problem is G-invariant is easily checked. From Proposition 10.6, the function mapping Z into r₁, …, r_t is a maximal invariant, so every invariant test is a function of r₁, …, r_t. When t = 1, the test that rejects for large values of r₁ is equivalent to the test that rejects for large values of U = r₁²/(1 − r₁²). It was argued in the last section (see Proposition 10.10) that the density of U, say k(u|ρ), has a monotone likelihood ratio, where ρ is the only nonzero population canonical correlation coefficient. Since the null hypothesis is that ρ = 0 and since every invariant test is a function of U, it follows that the test that rejects for large values of U is a uniformly most powerful invariant test. □

When t = 1, the distribution of U is specified in Proposition 10.10, and this can be used to construct a test of level α for H₀. For example, if q = 1, then ℒ(U) = F(r, n − r − 1) and a constant c(α) can be found from standard tables of the normalized F-distribution such that, under H₀, P{U > c(α)} = α.
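In computational practice, the q = 1 case is the usual multiple correlation (overall F) test. The sketch below (assuming numpy/scipy; the function name and the simulated data are illustrative) converts U = r₁²/(1 − r₁²) to the normalized F scale and returns a p-value:

```python
import numpy as np
from scipy import stats

def multiple_correlation_test(x, Y):
    """Squared multiple correlation r1^2 and p-value of the invariant test
    that rejects for large r1 (q = 1); under H0, (n-r-1)/r * U has a
    normalized F(r, n-r-1) distribution, where U = r1^2/(1 - r1^2)."""
    n, r = Y.shape
    xc = x - x.mean()
    Yc = Y - Y.mean(axis=0)
    P = Yc @ np.linalg.solve(Yc.T @ Yc, Yc.T)    # projection onto col(Q_e Y)
    r2 = xc @ P @ xc / (xc @ xc)
    F = (n - r - 1) / r * r2 / (1 - r2)
    return r2, stats.f.sf(F, r, n - r - 1)

rng = np.random.default_rng(6)
Y = rng.standard_normal((50, 3))
x0 = rng.standard_normal(50)                     # independent: H0 true
x1 = Y[:, 0] + 0.1 * rng.standard_normal(50)     # dependent: H0 false
r2_0, p0 = multiple_correlation_test(x0, Y)
r2_1, p1 = multiple_correlation_test(x1, Y)
print(r2_1 > r2_0)
```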
In the case that t > 1, there is no obvious function of r₁, …, r_t that provides an optimum test of H₀ versus H₁. Intuitively, if some of the r_i's are "too big," there is reason to suspect that H₀ is not true. The likelihood ratio test provides one possible criterion for testing Σ₁₂ = 0.
Proposition 10.15. The likelihood ratio test of H₀ versus H₁ rejects if the statistic

W = ∏_{i=1}^t (1 − r_i²)

is too small. Under H₀, ℒ(W) = U(n − r − 1, r, q), which is the distribution described in Proposition 8.14.

Proof. The density function of Z is

p(Z|μ, Σ) = (2π)^{−np/2} |Σ|^{−n/2} exp[−½ tr Σ^{−1}(Z − eμ′)′(Z − eμ′)].

Under both H₀ and H₁, the maximum likelihood estimate of μ is μ̂ = Z̄. Under H₁, the maximum likelihood estimate of Σ is Σ̂ = (1/n)S. Partitioning S as Σ is partitioned, we have

S = ( S₁₁  S₁₂ ; S₂₁  S₂₂ ),

where S₁₁ is q × q, S₁₂ is q × r, and S₂₂ is r × r. Under H₀, Σ has the form

Σ = ( Σ₁₁  0 ; 0  Σ₂₂ ),

so

Σ^{−1} = ( Σ₁₁^{−1}  0 ; 0  Σ₂₂^{−1} ).
When Σ has this form,

sup_μ p(Z|μ, Σ) = (2π)^{−np/2} |Σ₁₁|^{−n/2} |Σ₂₂|^{−n/2} exp[−½ tr S₁₁Σ₁₁^{−1}] exp[−½ tr S₂₂Σ₂₂^{−1}].

From this it is clear that, under H₀, Σ̂₁₁ = (1/n)S₁₁ and Σ̂₂₂ = (1/n)S₂₂. Substituting these estimates into the densities under H₀ and H₁ leads to a likelihood ratio of
Λ(Z) = ( |S| / (|S₁₁||S₂₂|) )^{n/2}.

Rejecting H₀ for small values of Λ(Z) is equivalent to rejecting for small values of

W = (Λ(Z))^{2/n} = |S| / (|S₁₁||S₂₂|).
The identity |S| = |S₁₁||S₂₂ − S₂₁S₁₁^{−1}S₁₂| shows that

|S₂₂ − S₂₁S₁₁^{−1}S₁₂| = |S₂₂||I_r − S₂₂^{−1}S₂₁S₁₁^{−1}S₁₂| = |S₂₂| ∏_{i=1}^t (1 − r_i²),

where r₁², …, r_t² are the t largest eigenvalues of S₂₂^{−1}S₂₁S₁₁^{−1}S₁₂. Thus the
likelihood ratio test is equivalent to the test that rejects for small values of W. That ℒ(W) = U(n − r − 1, r, q) under H₀ follows from Proposition 10.11. □
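The two expressions for the likelihood ratio statistic in this proof, W = |S|/(|S₁₁||S₂₂|) and W = ∏(1 − r_i²), can be checked against each other numerically (numpy sketch; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, q, r = 30, 3, 4
Z = rng.standard_normal((n, q + r))
Zc = Z - Z.mean(axis=0)
S = Zc.T @ Zc
S11, S12, S22 = S[:q, :q], S[:q, q:], S[q:, q:]

# Determinant form: W = |S| / (|S11| |S22|).
W_det = np.linalg.det(S) / (np.linalg.det(S11) * np.linalg.det(S22))

# Canonical-correlation form: r_i^2 are the eigenvalues of
# S11^{-1} S12 S22^{-1} S21, and W = prod(1 - r_i^2).
M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)
r2 = np.sort(np.linalg.eigvals(M).real)[::-1][:min(q, r)]
W_cc = np.prod(1 - r2)

print(np.isclose(W_det, W_cc))
```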
The distribution of W under H₁ is quite complicated to describe except in the case that Σ₁₂ has rank one. As mentioned in the last section, when Σ₁₂ has rank one, the methods used in Proposition 8.15 yield a description of the distribution of W.

Rather than discuss possible alternatives to the likelihood ratio test, in the next section we show that the testing problem above is a special case of the MANOVA testing problem considered in Chapter 9. Thus the alternatives to the likelihood ratio test for the MANOVA problem are also alternatives to the likelihood ratio test for independence.
We now turn to a slight generalization of the problem of testing that Σ₁₂ = 0. Again suppose that Z ∈ ℒ_{p,n} satisfies ℒ(Z) = N(eμ′, I_n ⊗ Σ), where μ ∈ R^p and Σ are both unknown parameters and n > p + 1. Given an integer k ≥ 2, let p₁, …, p_k be positive integers such that Σ_{i=1}^k p_i = p. Partition Σ into blocks Σ_ij of dimension p_i × p_j for i, j = 1, …, k. We now discuss the likelihood ratio test for testing H₀: Σ_ij = 0 for all i, j with i ≠ j. For example, when k = p and each p_i = 1, then the null hypothesis is that Σ is diagonal with unknown diagonal elements. By mimicking the proof of Proposition 10.15, it is not difficult to show that the likelihood ratio test for testing H₀ versus the alternative that Σ is completely unknown rejects for small values of

Λ = |S| / ∏_{i=1}^k |S_ii|.

Here, S = (Z − eZ̄′)′(Z − eZ̄′) is partitioned into S_ij: p_i × p_j as Σ was partitioned. Further, for i = 1, …, k, define S_(ii) by

S_(ii) = ( S_ii  S_{i(i+1)}  ⋯  S_ik ; ⋮ ; S_ki  ⋯  S_kk ),

so S_(ii) is (p_i + ⋯ + p_k) × (p_i + ⋯ + p_k). Noting that S_(11) = S, we can write

Λ = |S| / ∏_{i=1}^k |S_ii| = ∏_{i=1}^{k−1} |S_(ii)| / (|S_ii||S_(i+1,i+1)|).
Define W_i, i = 1, …, k − 1, by

W_i = |S_(ii)| / (|S_ii||S_(i+1,i+1)|).

Under the null hypothesis, it follows from Proposition 10.11 that

ℒ(W_i) = U(n − 1 − Σ_{j=i+1}^k p_j, Σ_{j=i+1}^k p_j, p_i).

To derive the distribution of Λ under H₀, we now show that W₁, …, W_{k−1} are independent random variables under H₀. From this it follows that, under H₀,

ℒ(Λ) = ∏_{i=1}^{k−1} U(n − 1 − Σ_{j=i+1}^k p_j, Σ_{j=i+1}^k p_j, p_i),

so Λ is distributed as a product of independent beta random variables. The independence of W₁, …, W_{k−1} for a general k follows easily by induction once independence has been verified for k = 3.
For k = 3, we have

Λ = W₁W₂ = [|S| / (|S₁₁||S_(22)|)] · [|S_(22)| / (|S₂₂||S₃₃|)]

and, under H₀,

ℒ(S) = W(Σ, p, n − 1),

where Σ has the form

Σ = ( Σ₁₁  0  0 ; 0  Σ₂₂  0 ; 0  0  Σ₃₃ ).

To show that W₁ and W₂ are independent, Proposition 7.19 is applied. The sample space for S is 𝒮_p^+, the space of p × p positive definite matrices. Fix Σ of the above form and let P₀ denote the probability measure of S, so P₀ is the probability measure of a W(Σ, p, n − 1) distribution on 𝒮_p^+. Consider the group G whose elements are (A, B) where A ∈ Gl_{p₁} and B ∈ Gl_{p₂+p₃},
and the group composition is

(A₁, B₁)(A₂, B₂) = (A₁A₂, B₁B₂).

It is easy to show that the action

(A, B)[S] = ( A  0 ; 0  B ) S ( A  0 ; 0  B )′

defines a left action of G on 𝒮_p^+. If ℒ(S) = W(Σ, p, n − 1), then

ℒ((A, B)[S]) = W((A, B)[Σ], p, n − 1),

where

(A, B)[Σ] = ( A  0 ; 0  B )( Σ₁₁  0 ; 0  Σ_(22) )( A  0 ; 0  B )′ = ( AΣ₁₁A′  0 ; 0  BΣ_(22)B′ ).

This last equality follows from the special form of Σ. The first thing to notice is that

W₁ = W₁(S) = |S| / (|S₁₁||S_(22)|)

is invariant under the action of G on 𝒮_p^+. Also, because of the special form of Σ, the statistic

T(S) = (S₁₁, S_(22)) ∈ 𝒮_{p₁}^+ × 𝒮_{p₂+p₃}^+

is a sufficient statistic for the family of distributions {gP₀ | g ∈ G}. This follows from the factorization criterion applied to the family {gP₀ | g ∈ G}, which is the Wishart family

{ W(( Ψ₁₁  0 ; 0  Ψ₂₂ ), p, n − 1) | Ψ₁₁ ∈ 𝒮_{p₁}^+, Ψ₂₂ ∈ 𝒮_{p₂+p₃}^+ }.

However, G acts transitively on 𝒮_{p₁}^+ × 𝒮_{p₂+p₃}^+ in the obvious way:

(A, B)[S₁, S₂] = (AS₁A′, BS₂B′)

for [S₁, S₂] ∈ 𝒮_{p₁}^+ × 𝒮_{p₂+p₃}^+. Further, the sufficient statistic T(S) ∈ 𝒮_{p₁}^+ × 𝒮_{p₂+p₃}^+ satisfies

T((A, B)[S]) = (A, B)[T(S)],
so T(·) is an equivariant function. It now follows from Proposition 7.19 that the invariant statistic W₁(S) is independent of the sufficient statistic T(S). But

W₂(S) = |S_(22)| / (|S₂₂||S₃₃|)

is a function of S_(22) and so is a function of T(S) = [S₁₁, S_(22)]. Thus W₁ and W₂ are independent for each value of Σ in the null hypothesis. Summarizing, we have the following result.
Proposition 10.16. Assume k = 3 and Σ has the form specified under H₀. Then, under the action of the group G on both 𝒮_p^+ and 𝒮_{p₁}^+ × 𝒮_{p₂+p₃}^+, the invariant statistic

W₁(S) = |S| / (|S₁₁||S_(22)|)

and the equivariant statistic

T(S) = [S₁₁, S_(22)]

are independent. In particular, the statistic

W₂(S) = |S_(22)| / (|S₂₂||S₃₃|),

being a function of T(S), is independent of W₁.
The application and interpretation of the previous paragraph for general k should be fairly clear. The details are briefly outlined. Under the null hypothesis that Σ_ij = 0 for i, j = 1, …, k and i ≠ j, we want to describe the distribution of

Λ = ∏_{i=1}^{k−1} |S_(ii)| / (|S_ii||S_(i+1,i+1)|) = ∏_{i=1}^{k−1} W_i.

It was remarked earlier that each W_i is distributed as a product of independent beta random variables. To see that W₁, …, W_{k−1} are independent, Proposition 10.16 shows that

W₁ = |S| / (|S₁₁||S_(22)|)
and S_(22) are independent. Since (W₂, …, W_{k−1}) is a function of S_(22), W₁ and (W₂, …, W_{k−1}) are independent. Next, apply Proposition 10.16 to S_(22) to conclude that

W₂ = |S_(22)| / (|S₂₂||S_(33)|)

and S_(33) are independent. Since (W₃, …, W_{k−1}) is a function of S_(33), W₂ and (W₃, …, W_{k−1}) are independent. The conclusion that W₁, …, W_{k−1} are independent now follows easily. Thus the distribution of Λ under H₀ has been described.
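The telescoping identity Λ = ∏_{i=1}^{k−1} W_i underlying this decomposition is easy to verify numerically (numpy sketch with k = 3 blocks of illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(8)
n, dims = 40, [2, 3, 2]               # k = 3 blocks, p = 7
p = sum(dims)
Z = rng.standard_normal((n, p))
Zc = Z - Z.mean(axis=0)
S = Zc.T @ Zc
det = np.linalg.det

edges = np.cumsum([0] + dims)          # block boundaries: [0, 2, 5, 7]

def S_trail(i):                        # S_(ii): rows/cols from block i on
    a = edges[i]
    return S[a:, a:]

def S_block(i):                        # diagonal block S_ii
    a, b = edges[i], edges[i + 1]
    return S[a:b, a:b]

Lam = det(S) / np.prod([det(S_block(i)) for i in range(len(dims))])
W = [det(S_trail(i)) / (det(S_block(i)) * det(S_trail(i + 1)))
     for i in range(len(dims) - 1)]
print(np.isclose(Lam, np.prod(W)))
```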
To interpret the decomposition of Λ into the product ∏W_i, first consider the null hypothesis

H₀^(1): Σ₁ⱼ = 0 for j = 2, …, k.

An application of Proposition 10.15 shows that the likelihood ratio test of H₀^(1) versus the alternative that Σ is unknown rejects for small values of

W₁ = |S| / (|S₁₁||S_(22)|).

Assuming H₀^(1) to be true, consider testing

H₀^(2): Σ₂ⱼ = 0 for j = 3, …, k

versus

H₁^(2): Σ₂ⱼ ≠ 0 for some j = 3, …, k.

A minor variation of the proof of Proposition 10.15 yields a likelihood ratio test of H₀^(2) versus H₁^(2) (given H₀^(1)) that rejects for small values of

W₂ = |S_(22)| / (|S₂₂||S_(33)|).
Proceeding by induction, assume the null hypotheses H₀^(i), i = 1, …, m − 1, to be true and consider testing

H₀^(m): Σ_mj = 0, j = m + 1, …, k

versus

H₁^(m): Σ_mj ≠ 0 for some j = m + 1, …, k.
Given the null hypotheses H₀^(i), i = 1, …, m − 1, the likelihood ratio test of H₀^(m) versus H₁^(m) rejects for small values of

W_m = |S_(mm)| / (|S_mm||S_(m+1,m+1)|).

The overall likelihood ratio test is one possible way of combining the likelihood ratio tests of H₀^(m) versus H₁^(m), given that H₀^(i), i = 1, …, m − 1, is true.
10.5. MULTIVARIATE REGRESSION
The purpose of this section is to show that testing for independence can be viewed as a special case of the general MANOVA testing problem treated in Chapter 9. In fact, the results below extend those of the previous section by allowing a more general mean structure for the observations. In the notation of the previous section, consider a data matrix Z: n × p that is partitioned as Z = (X Y), where X is n × q and Y is n × r, so p = q + r. It is assumed that

ℒ(Z) = N(TB, I_n ⊗ Σ),

where T is an n × k known matrix of rank k and B is a k × p matrix of unknown parameters. As usual, Σ is a p × p positive definite matrix. This is precisely the linear model discussed in Section 9.1 and clearly includes the model of previous sections of this chapter as a special case.
To test that X and Y are independent, it is illuminating to first calculate the conditional distribution of Y given X. Partition the matrix B as B = (B₁ B₂), where B₁ is k × q and B₂ is k × r. In describing the conditional distribution of Y given X, say ℒ(Y|X), the notation

Σ₂₂.₁ = Σ₂₂ − Σ₂₁Σ₁₁^{−1}Σ₁₂

is used. Following Example 3.1, we have

ℒ(Y|X) = N(TB₂ + (X − TB₁)Σ₁₁^{−1}Σ₁₂, I_n ⊗ Σ₂₂.₁)
       = N(T(B₂ − B₁Σ₁₁^{−1}Σ₁₂) + XΣ₁₁^{−1}Σ₁₂, I_n ⊗ Σ₂₂.₁),

and the marginal distribution of X is ℒ(X) = N(TB₁, I_n ⊗ Σ₁₁).
Let W be the n × (q + k) matrix (X T) and let C be the (q + k) × r matrix of parameters

C = ( C₁ ; C₂ ) = ( Σ₁₁^{−1}Σ₁₂ ; B₂ − B₁Σ₁₁^{−1}Σ₁₂ ),

so

XΣ₁₁^{−1}Σ₁₂ + T(B₂ − B₁Σ₁₁^{−1}Σ₁₂) = (X T)( C₁ ; C₂ ) = WC.

In this notation, we have

ℒ(Y|X) = N(WC, I_n ⊗ Σ₂₂.₁)

and

ℒ(X) = N(TB₁, I_n ⊗ Σ₁₁).
Assuming n > p + k, the matrix W has rank q + k with probability one, so the conditional model for Y is of the MANOVA type. Further, testing H₀: Σ₁₂ = 0 versus H₁: Σ₁₂ ≠ 0 is equivalent to testing H̃₀: C₁ = 0 versus H̃₁: C₁ ≠ 0. In other words, based on the model for Z,

ℒ(Z) = N(TB, I_n ⊗ Σ),

the null hypothesis concerns the covariance matrix. But in terms of the conditional model, the null hypothesis concerns the matrix of regression parameters.
With the above discussion and models in mind, we now want to discuss various approaches to testing H₀ and H̃₀. In terms of the model

ℒ(Z) = N(TB, I_n ⊗ Σ)

and assuming H₁, the maximum likelihood estimators of B and Σ are

B̂ = (T′T)^{−1}T′Z,  Σ̂ = (1/n)S,

where

S = (Z − TB̂)′(Z − TB̂),
so

ℒ(S) = W(Σ, p, n − k).

Under H₀, the maximum likelihood estimator of B is still B̂ as above and, since

Σ = ( Σ₁₁  0 ; 0  Σ₂₂ ),

it is readily verified that

Σ̂_ii = (1/n)S_ii,  i = 1, 2,

where

S = ( S₁₁  S₁₂ ; S₂₁  S₂₂ ).
Substituting these estimators into the density of Z under H₀ and H₁ demonstrates that the likelihood ratio test rejects for small values of

Λ(Z) = |S| / (|S₁₁||S₂₂|).

Under H₀, the proof of Proposition 10.11 shows that the distribution of Λ(Z) is U(n − k − r, r, q) as described in Proposition 8.14. Of course, symmetry in r and q implies that U(n − k − r, r, q) = U(n − k − q, q, r).

An alternative derivation of this likelihood ratio test can be given using the conditional distribution of Y given X and the marginal distribution of X. This follows from two facts: (i) the density of Z is proportional to the conditional density of Y given X multiplied by the marginal density of X, and (ii) the relabeling of the parameters is one-to-one, namely, the mapping from (B, Σ) to (C, B₁, Σ₁₁, Σ₂₂.₁) is a one-to-one onto mapping of ℒ_{p,k} × 𝒮_p^+ to ℒ_{r,(q+k)} × ℒ_{q,k} × 𝒮_q^+ × 𝒮_r^+. We now turn to the likelihood ratio test of H̃₀ versus H̃₁ based on the conditional model

ℒ(Y|X) = N(WC, I_n ⊗ Σ₂₂.₁),

where X is treated as fixed. With X fixed, testing H̃₀ versus H̃₁ is a special
case of the MANOVA testing problem and the results in Chapter 9 are
directly applicable. To express H̃₀ in the MANOVA testing problem form, let K be the q × (q + k) matrix K = (I_q 0), so the null hypothesis H̃₀ is

H̃₀: KC = 0.

Recall that

Ĉ = (W′W)^{−1}W′Y

is the maximum likelihood estimator of C under H̃₁. Let P_W = W(W′W)^{−1}W′ denote the orthogonal projection onto the column space of W, let Q_W = I_n − P_W, and define V ∈ 𝒮_r^+ by

V = Y′Q_W Y = (Y − WĈ)′(Y − WĈ).
As shown in Section 9.1, based on the model

ℒ(Y|X) = N(WC, I_n ⊗ Σ₂₂.₁),

the likelihood ratio test of H̃₀: KC = 0 versus H̃₁: KC ≠ 0 rejects H̃₀ for small values of

Λ₁(Y) = |V| / |V + (KĈ)′(K(W′W)^{−1}K′)^{−1}(KĈ)|.

For each fixed X, Proposition 9.1 shows that under H̃₀, the distribution of Λ₁(Y) is U(n − q − k, q, r), which is the distribution (unconditional) of Λ(Z) under H₀. In fact, much more is true.
Proposition 10.17. In the notation above:

(i) V = S₂₂.₁.
(ii) (KĈ)′(K(W′W)^{−1}K′)^{−1}(KĈ) = S₂₁S₁₁^{−1}S₁₂.
(iii) Λ₁(Y) = Λ(Z).

Further, under H₀, the conditional (given X) and unconditional distributions of Λ₁(Y) and Λ(Z) are the same.
Proof. To establish (i), first write S as

S = (Z − TB̂)′(Z − TB̂) = Z′(I − P_T)Z,
where P_T = T(T′T)^{−1}T′ is the orthogonal projection onto the column space of T. Setting Q_T = I − P_T and writing Z = (X Y), we have

S = Z′Q_T Z = ( X′Q_T X  X′Q_T Y ; Y′Q_T X  Y′Q_T Y ) = ( S₁₁  S₁₂ ; S₂₁  S₂₂ ).

This yields the identity

S₂₂.₁ = Y′Q_T Y − Y′Q_T X(X′Q_T X)^{−1}X′Q_T Y = Y′(I − P_T)Y − Y′P₀Y,

where P₀ = Q_T X(X′Q_T X)^{−1}X′Q_T is the orthogonal projection onto the column space of Q_T X. However, a bit of reflection reveals that P₀ = P_W − P_T, so

S₂₂.₁ = Y′(I − P_T)Y − Y′(P_W − P_T)Y = Y′(I − P_W)Y = Y′Q_W Y = V.
This establishes assertion (i). For (ii), we have

S₂₁S₁₁^{−1}S₁₂ = Y′P₀Y

and

(KĈ)′(K(W′W)^{−1}K′)^{−1}(KĈ)
  = Y′W(W′W)^{−1}K′(K(W′W)^{−1}W′W(W′W)^{−1}K′)^{−1}K(W′W)^{−1}W′Y
  = Y′U(U′U)^{−1}U′Y = Y′P_U Y,

where U = W(W′W)^{−1}K′ and P_U is the orthogonal projection onto the column space of U. Thus it must be shown that P_U = P₀ or, equivalently, that the column space of U is the same as the column space of Q_T X. Since W = (X T), the relationship

W′U = W′W(W′W)^{−1}K′ = K′ = ( I_q ; 0 )

proves that the q columns of U are orthogonal to the k columns of T. Thus the columns of U span a q-dimensional subspace contained in the column space of W and orthogonal to the column space of T. But there is only one
subspace with these properties. Since the column space of Q_T X also has these properties, it follows that P_U = P₀, so (ii) holds. Relationship (iii) is a consequence of (i), (ii), and the identity

|S| / (|S₁₁||S₂₂|) = |S₂₂.₁| / |S₂₂.₁ + S₂₁S₁₁^{−1}S₁₂|.

The validity of the final assertion concerning the distributions of Λ₁(Y) and Λ(Z) was established earlier. □
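Assertions (i)-(iii) of Proposition 10.17 are finite-sample algebraic identities, so they can be verified directly on simulated data (numpy sketch; T, B, and the dimensions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
n, q, r, k = 30, 2, 3, 2
p = q + r
T = rng.standard_normal((n, k))
B = rng.standard_normal((k, p))
Z = T @ B + rng.standard_normal((n, p))
X, Y = Z[:, :q], Z[:, q:]
det, inv = np.linalg.det, np.linalg.inv

# Unconditional model quantities: S = Z'(I - P_T)Z and its blocks.
PT = T @ inv(T.T @ T) @ T.T
S = Z.T @ (np.eye(n) - PT) @ Z
S11, S12, S22 = S[:q, :q], S[:q, q:], S[q:, q:]
S22_1 = S22 - S12.T @ inv(S11) @ S12

# Conditional (MANOVA) model quantities: W = (X T), V = Y'Q_W Y.
W = np.hstack([X, T])
PW = W @ inv(W.T @ W) @ W.T
V = Y.T @ (np.eye(n) - PW) @ Y

K = np.hstack([np.eye(q), np.zeros((q, k))])
C_hat = inv(W.T @ W) @ W.T @ Y
KC = K @ C_hat
H = KC.T @ inv(K @ inv(W.T @ W) @ K.T) @ KC

Lam1 = det(V) / det(V + H)                 # conditional-model LRT statistic
LamZ = det(S) / (det(S11) * det(S22))      # unconditional LRT statistic
print(np.allclose(V, S22_1), np.isclose(Lam1, LamZ))
```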
The results of Proposition 10.17 establish the connection between testing for independence and the MANOVA testing problem. Further, under H̃₀, the conditional distribution of Λ₁(Y) is U(n − q − k, q, r) for each value of X, so the marginal distribution of X is irrelevant. In other words, as long as the conditional model for Y given X is valid, we can test H̃₀ using the likelihood ratio test, and under H̃₀, the distribution of the test statistic does not depend on the value of X. Of course, this implies that the conditional (given X) distribution of Λ(Z) is the same as the unconditional distribution of Λ(Z) under H₀. However, under H₁, the conditional and unconditional distributions of Λ(Z) are not the same.
PROBLEMS
1. Given positive integers t, q, and r with t < q, r, consider random vectors U ∈ R^t, V ∈ R^q, and W ∈ R^r where Cov(U) = I_t and U, V, and W are uncorrelated. For A: q × t and B: r × t, construct X = AU + V and Y = BU + W.

(i) With Λ₁₁ = Cov(V) and Λ₂₂ = Cov(W), show that

Cov(X) = AA′ + Λ₁₁,
Cov(Y) = BB′ + Λ₂₂,

and the cross covariance between X and Y is AB′. Conclude that the number of nonzero population canonical correlations between X and Y is at most t.

(ii) Conversely, given X ∈ R^q and Y ∈ R^r with t nonzero population canonical correlations, construct U, V, W, A, and B as above so that X = AU + V and Y = BU + W have the same joint covariance as X and Y.
2. Consider X ∈ R^q and Y ∈ R^r and assume that Cov(X) = Σ11 and Cov(Y) = Σ22 exist. Let Σ12 be the cross covariance of X with Y. Recall that 𝒫_n denotes the group of n × n permutation matrices.

(i) If gΣ12h' = Σ12 for all g ∈ 𝒫_q and h ∈ 𝒫_r, show that

Σ12 = δ e1 e2'

for some δ ∈ R^1, where e1 (e2) is the vector of ones in R^q (R^r).

(ii) Under the assumptions in (i), show that there is at most one nonzero canonical correlation and it is |δ| (e1' Σ11^{-1} e1)^{1/2} (e2' Σ22^{-1} e2)^{1/2}. What is a set of canonical coordinates?
3. Consider X ∈ R^p with Cov(X) = Σ > 0 (for simplicity, assume μ = 0). This problem has to do with the approximation of X by a lower dimensional random vector, say Y = BX where B is a t × p matrix of rank t.

(i) In the notation of Proposition 10.4, suppose A0: p × p is used to define the inner product [·, ·] on R^p and prediction error is measured by ||X − CY||² where || · || is defined by [·, ·] and C is p × t. Show that the minimum prediction error (B fixed) is

δ(B) = tr A0(Σ − ΣB'(BΣB')^{-1}BΣ)

and the minimum is achieved for C = Ĉ = ΣB'(BΣB')^{-1}.

(ii) Let A = Σ^{1/2} A0 Σ^{1/2} and write A in spectral form as A = Σ_1^p λ_i a_i a_i' where λ1 ≥ ··· ≥ λp > 0 and a_1, ..., a_p is an orthonormal basis for R^p. Show that δ(B) = tr A(I − Q(B)) where Q(B) = Σ^{1/2}B'(BΣB')^{-1}BΣ^{1/2} is a rank t orthogonal projection. Using this, show that δ(B) is minimized by choosing Q = Q̂ = Σ_1^t a_i a_i', and the minimum is λ_{t+1} + ··· + λ_p. What is a corresponding B̂ and X̂ = ĈB̂X that gives the minimum? Show that X̂ = ĈB̂X = Σ^{1/2}Q̂Σ^{-1/2}X.

(iii) In the special case that A0 = I_p, show that

X̂ = Σ_1^t (a_i'X) a_i

where a_1, ..., a_p are the eigenvectors of Σ with Σa_i = λ_i a_i and λ1 ≥ ··· ≥ λp. (The random variables a_i'X, i = 1, ..., p, are often called the principal components of X. It is easily verified that cov(a_i'X, a_j'X) = δ_ij λ_i.)
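Both facts in part (iii) can be verified at the population level. The sketch below is editorial (assuming Python with NumPy; the covariance matrix and seed are arbitrary): it checks that the principal components are uncorrelated with variances λ_i, and that the rank-t truncation error equals λ_{t+1} + ··· + λ_p.

```python
import numpy as np

rng = np.random.default_rng(1)
p, t = 5, 2
G = rng.normal(size=(p, p))
Sigma = G @ G.T + np.eye(p)          # an arbitrary positive definite covariance

lam, a = np.linalg.eigh(Sigma)       # eigh returns eigenvalues in ascending order
lam, a = lam[::-1], a[:, ::-1]       # reorder so lam[0] >= ... >= lam[p-1]

# cov(a_i'X, a_j'X) = a_i' Sigma a_j = delta_ij * lam_i:
C = a.T @ Sigma @ a
assert np.allclose(C, np.diag(lam))

# Prediction error of the rank-t approximation (case A0 = I):
# tr Sigma - tr(Q Sigma) = lam_{t+1} + ... + lam_p.
Q = a[:, :t] @ a[:, :t].T            # projection onto the top-t eigenvectors
err = np.trace(Sigma) - np.trace(Q @ Sigma)
print(err, lam[t:].sum())            # the two numbers agree
```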
4. In R^p, consider a translated subspace M + a0 where a0 ∈ R^p. Such a set in R^p is called a flat, and the dimension of the flat is the dimension of M.

(i) Given any flat M + a0, show that M + a0 = M + b0 for some unique b0 ∈ M^⊥.

Consider a flat M + a0, and define the orthogonal projection onto M + a0 by x → P(x − a0) + a0 where P is the orthogonal projection onto M. Given n points x1, ..., xn in R^p, consider the problem of finding the "closest" k-dimensional flat M + a0 to the n points. As a measure of distance of the n points from M + a0, we use Δ(M, a0) = Σ_1^n ||x_i − x̂_i||², where || · || is the usual Euclidean norm and x̂_i = P(x_i − a0) + a0 is the projection of x_i onto M + a0. The problem is to find M and a0 to minimize Δ(M, a0) over all k-dimensional subspaces M and all a0.

(ii) First, regard a0 as fixed, and set S(b) = Σ_1^n (x_i − b)(x_i − b)' for any b ∈ R^p. With Q = I − P, show that Δ(M, a0) = tr S(a0)Q = tr S(x̄)Q + n(a0 − x̄)'Q(a0 − x̄), where x̄ = n^{-1} Σ_1^n x_i.

(iii) Write S(x̄) = Σ_1^p λ_i v_i v_i' in spectral form, where λ1 ≥ ··· ≥ λp ≥ 0 and v_1, ..., v_p is an orthonormal basis for R^p. Using (ii), show that Δ(M, a0) ≥ Σ_{k+1}^p λ_i with equality for a0 = x̄ and for M = span{v_1, ..., v_k}.
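The minimizing flat of part (iii) can be checked directly. The sketch below is editorial (assuming Python with NumPy; data, seed, and dimensions are arbitrary): it projects the points onto the flat through x̄ spanned by the top-k eigenvectors of S(x̄) and compares the residual sum of squares with the sum of the p − k smallest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 30, 4, 2
xs = rng.normal(size=(n, p)) @ rng.normal(size=(p, p)) + 3.0

xbar = xs.mean(axis=0)
S = (xs - xbar).T @ (xs - xbar)       # S(xbar) = sum (x_i - xbar)(x_i - xbar)'
lam, v = np.linalg.eigh(S)            # eigenvalues in ascending order

# Project onto the flat xbar + span of the top-k eigenvectors:
P = v[:, -k:] @ v[:, -k:].T
resid = xs - ((xs - xbar) @ P + xbar)             # x_i - xhat_i
delta = (resid ** 2).sum()

print(delta, lam[:-k].sum())          # Delta equals the sum of the p-k smallest eigenvalues
```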
5. Consider a sample covariance matrix

S = ( S11  S12 )
    ( S21  S22 )

with Sii > 0 for i = 1, 2. With t = min{dim Sii, i = 1, 2}, show that the t sample canonical correlations are the t largest solutions (in λ) to the equation

|S12 S22^{-1} S21 − λ² S11| = 0,   λ ∈ [0, ∞).
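This determinantal characterization agrees with the singular-value computation of canonical correlations. A numerical sketch (editorial, assuming Python with NumPy; sample size, dimensions, and seed are arbitrary): the roots λ are the square roots of the eigenvalues of S11^{-1} S12 S22^{-1} S21, which match the singular values of S11^{-1/2} S12 S22^{-1/2}.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, r = 50, 3, 4
Z = rng.normal(size=(n, q + r))
Zc = Z - Z.mean(axis=0)
S = Zc.T @ Zc
S11, S12, S22 = S[:q, :q], S[:q, q:], S[q:, q:]
S21 = S12.T

# lam^2 solving |S12 S22^{-1} S21 - lam^2 S11| = 0 are the eigenvalues
# of S11^{-1} S12 S22^{-1} S21 (a generalized eigenvalue problem).
lam2 = np.linalg.eigvals(np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S21)))
roots = np.sort(np.sqrt(np.clip(np.real(lam2), 0.0, None)))[::-1]

# The same numbers as singular values of S11^{-1/2} S12 S22^{-1/2}:
def inv_sqrt(M):
    w, v = np.linalg.eigh(M)
    return v @ np.diag(w ** -0.5) @ v.T

svals = np.sort(np.linalg.svd(inv_sqrt(S11) @ S12 @ inv_sqrt(S22),
                              compute_uv=False))[::-1]
print(np.allclose(roots, svals))   # → True
```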
6. (The Eckart–Young Theorem, 1936.) Given a matrix A: n × p (say n ≥ p), let k < p. The problem is to find a matrix B: n × p of rank no greater than k that is "closest" to A in the usual trace inner product on ℒ_{p,n}. Let 𝔅_k be all the n × p matrices of rank no larger than k, so the problem is to find

inf_{B ∈ 𝔅_k} ||A − B||²

where ||M||² = tr MM' for M ∈ ℒ_{p,n}.

(i) Show that every B ∈ 𝔅_k can be written ψC where ψ is n × k, ψ'ψ = I_k, and C is k × p. Conversely, ψC ∈ 𝔅_k for each such ψ and C.

(ii) Using the results of Example 4.4, show that, for A and ψ fixed,

inf_{C ∈ ℒ_{p,k}} ||A − ψC||² = ||A − ψψ'A||².

(iii) With Q = I − ψψ', Q is a rank n − k orthogonal projection. Show that, for each B ∈ 𝔅_k,

||A − B||² ≥ inf_Q ||QA||² = inf_Q tr QAA' = Σ_{k+1}^p λ_i²

where λ1 ≥ ··· ≥ λp are the singular values of A. Here Q ranges over all rank n − k orthogonal projections.

(iv) Write A = Σ_1^p λ_i u_i v_i' as the singular value decomposition of A. Show that B̂ = Σ_1^k λ_i u_i v_i' achieves the infimum of part (iii).
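The theorem is easy to exercise numerically. An editorial sketch (assuming Python with NumPy; sizes and seed are arbitrary): the truncated SVD attains squared error λ_{k+1}² + ··· + λ_p², and any other rank-k candidate does at least as badly.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 8, 5, 2
A = rng.normal(size=(n, p))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Bhat = (U[:, :k] * s[:k]) @ Vt[:k, :]      # truncated SVD: rank k

err = ((A - Bhat) ** 2).sum()              # ||A - Bhat||^2 = tr (A-Bhat)(A-Bhat)'
print(err, (s[k:] ** 2).sum())             # equals lam_{k+1}^2 + ... + lam_p^2

# An arbitrary competing rank-k matrix is no better:
C = rng.normal(size=(n, p))
Uc, sc, Vct = np.linalg.svd(C, full_matrices=False)
B2 = (Uc[:, :k] * sc[:k]) @ Vct[:k, :]     # some other rank-k matrix
assert ((A - B2) ** 2).sum() >= err - 1e-9
```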
7. In the case of a random sample from a bivariate normal distribution N(μ, Σ), use Proposition 10.8 and Karlin's Lemma in the Appendix to show that the density of W = (n − 2)^{1/2} r(1 − r²)^{-1/2} (r is the sample correlation coefficient) has a monotone likelihood ratio in θ = ρ(1 − ρ²)^{-1/2}. Conclude that the density of r has a monotone likelihood ratio in ρ.
8. Let f_{p,q} denote the density function on (0, ∞) of an unnormalized F_{p,q} random variable. Under the assumptions of Proposition 10.10, show that the distribution of W = r²(1 − r²)^{-1} has a density given by

f(w|ρ) = Σ_{k=0}^∞ f_{r+2k, n−r−1}(w) h(k|ρ)

where

h(k|ρ) = (1 − ρ²)^{(n−1)/2} [Γ((n−1)/2 + k) / (k! Γ((n−1)/2))] (ρ²)^k,   k = 0, 1, ....

Note that h(·|ρ) is the probability mass function of a negative binomial distribution, so f(w|ρ) is a mixture of F distributions. Show that f(·|ρ) has a monotone likelihood ratio.
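That h(·|ρ) is a probability mass function can be confirmed by summing the series. An editorial sketch (assuming Python; the values of n and ρ are arbitrary, and the series is truncated at 500 terms):

```python
import math

def h(k, rho, n):
    # h(k|rho) = (1 - rho^2)^{(n-1)/2} * Gamma((n-1)/2 + k) / (k! Gamma((n-1)/2)) * rho^{2k}
    return ((1 - rho ** 2) ** ((n - 1) / 2)
            * math.exp(math.lgamma((n - 1) / 2 + k)
                       - math.lgamma((n - 1) / 2)
                       - math.lgamma(k + 1))
            * rho ** (2 * k))

n, rho = 10, 0.6
total = sum(h(k, rho, n) for k in range(500))
print(total)   # → 1.0 (up to truncation of the series)
```

The sum is the negative binomial series Σ_k Γ(a + k)/(k! Γ(a)) x^k = (1 − x)^{−a} with a = (n − 1)/2 and x = ρ², which exactly cancels the leading factor.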
9. (A generalization of Proposition 10.12.) Consider the space R^n and an integer k with 1 ≤ k < n. Fix an orthogonal projection P of rank k, and for s ≤ n − k, let 𝒴_s be the set of all n × n orthogonal projections R of rank s that satisfy RP = 0. Also, consider the group 𝒪(P) = {Γ | Γ ∈ 𝒪_n, ΓP = PΓ}.

(i) Show that the group 𝒪(P) acts transitively on 𝒴_s under the action R → ΓRΓ'.

(ii) Argue that there is a unique 𝒪(P)-invariant probability distribution on 𝒴_s.

(iii) Let Γ have a uniform distribution on 𝒪(P) and fix R0 ∈ 𝒴_s. Show that ΓR0Γ' has the unique 𝒪(P)-invariant distribution on 𝒴_s.
10. Suppose Z ∈ ℒ_{p,n} has an 𝒪_n-left invariant distribution and has rank p with probability one. Let Q be a rank n − k orthogonal projection with p + k ≤ n, and form W = QZ.

(i) Show that W has rank p with probability one.

(ii) Show that R = W(W'W)^{-1}W' has the uniform distribution on 𝒴_p (in the notation of Problem 9 above with P = I − Q and s = p).
11. After the proof of Proposition 10.13, it was argued that, when q ≤ r, to find the distribution of r1 ≥ ··· ≥ rq, it suffices to find the distribution of the eigenvalues of the matrix B = (T1 + T2)^{-1/2} T1 (T1 + T2)^{-1/2} where T1 and T2 are independent with ℒ(T1) = W(Iq, q, n − r − 1) and ℒ(T2) = W(Iq, q, r). It is assumed that q ≤ n − r − 1. Let f(·|m) denote the density function of the W(Iq, q, m) distribution (m ≥ q) with respect to Lebesgue measure dS on 𝒮_q. Thus

f(S|m) = ω(m, q) |S|^{(m−q−1)/2} exp[−(1/2) tr S] I(S)

where

I(S) = 1 if S > 0, and I(S) = 0 otherwise.

(i) With W1 = T1 and W2 = T1 + T2, show that the joint density of W1 and W2 with respect to dW1 dW2 is f(W1|n − r − 1) f(W2 − W1|r).

(ii) On the set where W1 > 0 and W2 > 0, define B = W2^{-1/2} W1 W2^{-1/2} and V = W2. Using Proposition 5.11, show that the Jacobian of this transformation is (det V)^{(q+1)/2}. Show that the joint density of B and V on the set where B > 0 and V > 0 is given by

f(V^{1/2}BV^{1/2}|n − r − 1) f(V^{1/2}(I − B)V^{1/2}|r) (det V)^{(q+1)/2}.

(iii) Now, integrate out V to show that the density of B on the set 0 < B < Iq is

[ω(n − r − 1, q) ω(r, q) / ω(n − 1, q)] |B|^{(n−r−q−2)/2} |Iq − B|^{(r−q−1)/2}.
12. Suppose the random orthogonal transformation Γ has a uniform distribution on 𝒪_n. Let A be the upper left-hand k × p block of Γ and assume p ≤ k. Under the additional assumption that p ≤ n − k, the following argument shows that A has a density with respect to Lebesgue measure on ℒ_{p,k}.

(i) Let ψ: n × p consist of the first p columns of Γ so A: k × p has rows that are the first k rows of ψ. Show that ψ has a uniform distribution on ℱ_{p,n}. Conclude that ψ has the same distribution as Z(Z'Z)^{-1/2} where Z: n × p is N(0, In ⊗ Ip).

(ii) Now partition Z as Z = (X; Y) where X is k × p (the first k rows) and Y is (n − k) × p. Show that Z'Z = X'X + Y'Y and that A has the same distribution as X(X'X + Y'Y)^{-1/2}.

(iii) Using (ii) and Problem 11, show that B = A'A has the density

p(B) = [ω(k, p) ω(n − k, p) / ω(n, p)] |B|^{(k−p−1)/2} |Ip − B|^{(n−k−p−1)/2}

with respect to Lebesgue measure on the set 0 < B < Ip.

(iv) Consider a random matrix L: k × p with a density with respect to Lebesgue measure given by

h(L) = c |Ip − L'L|^{(n−k−p−1)/2} I(L'L)

where, for B ∈ 𝒮_p,

I(B) = 1 if 0 < B < Ip, and I(B) = 0 otherwise,

and

c = ω(n − k, p) / ((√(2π))^{kp} ω(n, p)).

Show that B = L'L has the density p(B) given in part (iii) (use Proposition 7.6).

(v) Now, to conclude that A has h as its density, first prove the following proposition: Suppose 𝒳 is acted on measurably by a compact group G and τ: 𝒳 → 𝒴 is a maximal invariant. If P1 and P2 are both G-invariant measures on 𝒳 such that P1(τ^{-1}(C)) = P2(τ^{-1}(C)) for all measurable C ⊆ 𝒴, then P1 = P2.

(vi) Now, apply the proposition above with 𝒳 = ℒ_{p,k}, G = 𝒪_k, τ(x) = x'x, P1 the distribution of A, and P2 the distribution of L as given in (iv). This shows that A has density h.
13. Consider a random matrix Z: n × p with a density given by f(Z|B, Σ) = |Σ|^{-n/2} h(tr(Z − TB)Σ^{-1}(Z − TB)') where T: n × k of rank k is known, B: k × p is a matrix of unknown parameters, and Σ: p × p is positive definite and unknown. Assume that n ≥ p + k, that

sup_{C ∈ 𝒮_p^+} |C|^{n/2} h(tr C) < +∞,

and that h is a nonincreasing function defined on [0, ∞). Partition Z into X: n × q and Y: n × r, q + r = p, so Z = (X Y). Also, partition Σ into Σij, i, j = 1, 2, where Σ11 is q × q, Σ22 is r × r, and Σ12 is q × r.

(i) Show that the maximum likelihood estimator of B is B̂ = (T'T)^{-1}T'Z and f(Z|B̂, Σ) = |Σ|^{-n/2} h(tr SΣ^{-1}) where S = Z'QZ with Q = I − P and P = T(T'T)^{-1}T'.

(ii) Derive the likelihood ratio test of H0: Σ12 = 0 versus H1: Σ12 ≠ 0. Show that the test rejects for small values of

Λ(Z) = |S| / (|S11| |S22|).

(iii) For U: n × q and V: n × r, establish the identity

tr(U V)Σ^{-1}(U V)' = tr(V − UΣ11^{-1}Σ12)Σ22.1^{-1}(V − UΣ11^{-1}Σ12)' + tr UΣ11^{-1}U'.

Use this identity to derive the conditional distribution of Y given X in the above model. Using the notation of Section 10.5, show that the conditional density of Y given X is

f1(Y|C, B1, Σ22.1, X) = |Σ22.1|^{-n/2} h(tr(Y − WC)Σ22.1^{-1}(Y − WC)' + η) φ(η)

where η = tr(X − TB1)Σ11^{-1}(X − TB1)' and

(φ(η))^{-1} = ∫ h(tr uu' + η) du,

the integral being over ℒ_{r,n}.

(iv) The null hypothesis is now that C1 = 0. Show that, for each fixed η, the likelihood ratio test (with C and Σ22.1 as parameters) based on the conditional density rejects for large values of Λ(Z). Verify (i), (ii), and (iii) of Proposition 10.17.

(v) Now, assume that

sup_{η > 0} sup_{C ∈ 𝒮_r^+} |C|^{n/2} h(tr C + η) φ(η) < +∞.

Show that the likelihood ratio test for C1 = 0 (with C, Σ22.1, B1, and Σ11 as parameters) rejects for large values of Λ(Z).

(vi) Show that, under H0, the sample canonical correlations based on S11, S12, S22 (here S = Z'QZ) have the same distribution as when Z is N(TB, In ⊗ Σ). Conclude that under H0, Λ(Z) has the same distribution as when Z is N(TB, In ⊗ Σ).
NOTES AND REFERENCES
1. Canonical correlation analysis was first proposed in Hotelling (1935,
1936). There are as many approaches to canonical correlation analysis as there are books covering the subject. For a sample of these, see
Anderson (1958), Dempster (1969), Kshirsagar (1972), Rao (1973), Mardia, Kent, and Bibby (1979), Srivastava and Khatri (1979), and
Muirhead (1982).
2. See Eaton and Kariya (1981) for some material related to Proposition 10.13.
Appendix
We begin this appendix with a statement and proof of a result due to Basu (1955). Consider a measurable space (𝒳, ℬ1) and a probability model {Pθ | θ ∈ Θ} defined on (𝒳, ℬ1). Consider a statistic T defined on (𝒳, ℬ1) to (𝒴, ℬ2), and let ℬT = {T^{-1}(B) | B ∈ ℬ2}. Thus ℬT is a σ-algebra and ℬT ⊆ ℬ1. Conditional expectation given ℬT is denoted by ℰ(·|ℬT).

Definition A.1. The statistic T is a sufficient statistic for the family {Pθ | θ ∈ Θ} if, for each bounded ℬ1-measurable function f, there exists a ℬT-measurable function f̂ such that ℰ(f|ℬT) = f̂ a.e. Pθ for all θ ∈ Θ.

Note that the null set where the above equality does not hold is allowed to depend on both θ and f. The usual intuitive description of sufficiency is that the conditional distribution of X ∈ 𝒳 (ℒ(X) = Pθ for some θ ∈ Θ) given T(X) = t does not depend on θ. Indeed, if P(·|t) is such a version of the conditional distribution of X given T(X) = t, then f̂ defined by f̂(x) = h(T(x)) where

h(t) = ∫ f(x) P(dx|t)

serves as a version of ℰθ(f|ℬT) for each θ ∈ Θ.
Now, consider a statistic U defined on (𝒳, ℬ1) to (𝒵, ℬ2).

Definition A.2. The statistic U is called an ancillary statistic for the family {Pθ | θ ∈ Θ} if the distribution of U on (𝒵, ℬ2) does not depend on θ ∈ Θ; that is, if for all B ∈ ℬ2,

Pθ(U^{-1}(B)) = Pη(U^{-1}(B))

for all θ, η ∈ Θ.
In many instances, ancillary statistics are functions of maximal invariant statistics in a situation where a group acts transitively on the family of probabilities in question; see Section 7.5 for a discussion.
Finally, given a statistic T on (𝒳, ℬ1) to (𝒴, ℬ2) and the parametric family {Pθ | θ ∈ Θ}, let {Qθ | θ ∈ Θ} be the induced family of distributions of T on (𝒴, ℬ2); that is,

Qθ(B) = Pθ(T^{-1}(B)),   B ∈ ℬ2.
Definition A.3. The family {Qθ | θ ∈ Θ} is called boundedly complete if the only bounded solution to the equation

∫ h(y) Qθ(dy) = 0,   θ ∈ Θ,

is the function h = 0 a.e. Qθ for all θ ∈ Θ.

At times, a statistic T is called boundedly complete; this means that the induced family of distributions of T is boundedly complete according to the above definition. If the family {Qθ | θ ∈ Θ} is an exponential family on a Euclidean space and if Θ contains a nonempty open set, then {Qθ | θ ∈ Θ} is boundedly complete; see Lehmann (1959, page 132).
Theorem (Basu, 1955). If T is a boundedly complete sufficient statistic and if U is an ancillary statistic, then, for each θ, T(X) and U(X) are independent.
Proof. It suffices to show that, for bounded measurable functions h and k on 𝒴 and 𝒵, we have

(A.1) ℰθ h(T(X)) k(U(X)) = ℰθ h(T(X)) ℰθ k(U(X)).

Since U is ancillary, α = ℰθ k(U(X)) does not depend on θ, so ℰθ(k(U) − α) = 0 for all θ. Hence

ℰθ[ℰ((k(U) − α)|ℬT)] = 0 for all θ.

Since T is sufficient, there is a ℬT-measurable function, say f̂, such that ℰ((k(U) − α)|ℬT) = f̂ a.e. Pθ. But since f̂ is ℬT-measurable, we can write f̂(x) = ψ(T(x)) (see Lehmann, 1959, Lemma 1, page 37). Also, since k is bounded, f̂ can be taken to be bounded. Hence

ℰθ ψ(T) = 0 for all θ,

and ψ is bounded. The bounded completeness of T implies that ψ is 0 a.e. Qθ, where Qθ(B) = Pθ(T^{-1}(B)), B ∈ ℬ2. Thus h(T)ψ(T) = 0 a.e. Qθ, so

0 = ℰθ h(T)ψ(T) = ℰθ[h(T) ℰ((k(U) − α)|ℬT)]
  = ℰθ[ℰ(h(T)(k(U) − α)|ℬT)]
  = ℰθ h(T)(k(U) − α).

Thus (A.1) holds. □
This theorem can be used in many of the examples in the text where we have used Proposition 7.19.
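The classical normal-location example illustrates the theorem: for a sample from N(θ, 1), the mean x̄ is a boundedly complete sufficient statistic and the residuals x − x̄ (hence the sample standard deviation) are ancillary, so Basu's theorem gives independence. A Monte Carlo sketch (an editorial illustration, assuming Python with NumPy; seed and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 5, 200_000
X = rng.normal(theta, 1.0, size=(reps, n))

T = X.mean(axis=1)                  # boundedly complete sufficient statistic
U = X.std(axis=1, ddof=1)           # a function of the ancillary residuals X - Xbar

# Independence implies zero correlation and factoring moments:
print(abs(np.corrcoef(T, U)[0, 1]))                        # close to 0
print(abs(np.mean(T * U**2) - T.mean() * np.mean(U**2)))   # close to 0
```

This only samples two consequences of independence; the theorem of course asserts much more.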
The second topic in this Appendix concerns monotone likelihood ratio and its implications. Let 𝒳 and 𝒴 be subsets of the real line.

Definition A.4. A nonnegative function k defined on 𝒳 × 𝒴 is totally positive of order 2 (TP-2) if, for x1 < x2 and y1 < y2, we have

(A.2) k(x1, y1) k(x2, y2) ≥ k(x1, y2) k(x2, y1).
In the case that 𝒴 is a parameter space and k(·, y) is a density with respect to some fixed measure, it is customary to say that k has a monotone likelihood ratio when k is TP-2. This nomenclature arises from the observation that, when k is TP-2 and y1 < y2, the ratio k(·, y2)/k(·, y1) is nondecreasing in x, assuming that k(·, y1) does not vanish. Some obvious examples of TP-2 functions are: exp[xy]; x^y for x > 0; y^x for y > 0. If x = g(s) and y = h(t) where g and h are both increasing or both decreasing, then k1(s, t) = k(g(s), h(t)) is TP-2 whenever k is TP-2. Further, if ψ1(x) ≥ 0, ψ2(y) ≥ 0, and k is TP-2, then k1(x, y) = ψ1(x)ψ2(y)k(x, y) is also TP-2.

The following result, due to Karlin (1956), is of use in verifying that some of the more complicated densities that arise in statistics are TP-2. Here is the setting. Let 𝒳, 𝒴, and 𝒵 be Borel subsets of R^1 and let μ be a σ-finite measure on the Borel subsets of 𝒴.
Lemma (Karlin, 1956). Suppose g is TP-2 on 𝒳 × 𝒴 and h is TP-2 on 𝒴 × 𝒵. If

k(x, z) = ∫ g(x, y) h(y, z) μ(dy)

is finite for all x ∈ 𝒳 and z ∈ 𝒵, then k is TP-2.

Proof. For x1 < x2 and z1 < z2, the difference

Δ = k(x1, z1)k(x2, z2) − k(x1, z2)k(x2, z1)

can be written

Δ = ∫∫ g(x1, y1)g(x2, y2)[h(y1, z1)h(y2, z2) − h(y1, z2)h(y2, z1)] μ(dy1) μ(dy2).

Now, write Δ as the double integral over the set {y1 < y2} plus the double integral over the set {y1 > y2}. In the integral over the set {y1 > y2}, interchange y1 and y2 and then combine with the integral over {y1 < y2}. This yields

Δ = ∫∫_{{y1 < y2}} [g(x1, y1)g(x2, y2) − g(x1, y2)g(x2, y1)] [h(y1, z1)h(y2, z2) − h(y1, z2)h(y2, z1)] μ(dy1) μ(dy2).

On the set {y1 < y2}, both of the bracketed expressions are nonnegative as g and h are TP-2. Hence Δ ≥ 0, so k is TP-2. □
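The lemma applies to any σ-finite μ, in particular to counting measure on a finite grid, which makes it easy to exercise numerically. An editorial sketch (assuming Python with NumPy; the kernels and test points are arbitrary): with g(x, y) = exp(xy) and h(y, z) = exp(yz), both TP-2, the composition against counting measure is again TP-2.

```python
import numpy as np

# k(x, z) = sum_y g(x, y) h(y, z) with g = exp(xy), h = exp(yz),
# the "integral" being a sum over a finite grid of y values.
ys = np.linspace(0.0, 1.0, 101)

def k(x, z):
    return np.exp(x * ys + ys * z).sum()

for (x1, x2) in [(0.0, 1.0), (-1.0, 2.0)]:
    for (z1, z2) in [(0.5, 1.5), (-2.0, 0.0)]:
        det = k(x1, z1) * k(x2, z2) - k(x1, z2) * k(x2, z1)
        assert det >= 0.0      # the TP-2 "2 x 2 determinant" inequality (A.2)
print("TP-2 inequality holds on the test grid")
```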
Here are some examples.

Example A.1. With 𝒳 = (0, ∞), let

f(x, m) = x^{(m/2)−1} exp[−x/2] / (2^{m/2} Γ(m/2))

be the density of a chi-squared distribution with m degrees of freedom. Since x^{m/2}, x ∈ 𝒳 and m > 0, is TP-2, f(x, m) is TP-2. Recall that the density of a noncentral chi-squared distribution with p degrees of freedom and noncentrality parameter λ > 0 is given by

h(x, λ) = Σ_{j=0}^∞ [(λ/2)^j exp[−λ/2] / j!] f(x, p + 2j).

Observe that f(x, p + 2j) is TP-2 in x and j, and (λ/2)^j exp[−λ/2]/j! is TP-2 in j and λ. With 𝒴 = {0, 1, ...} and μ as counting measure, Karlin's Lemma implies that h(x, λ) is TP-2.
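The conclusion of Example A.1 can be spot-checked by evaluating the Poisson-mixture series directly. An editorial sketch (assuming Python; degrees of freedom, grid points, and truncation are arbitrary, and log-gamma is used to avoid overflow in the series):

```python
import math

def chi2_logpdf(x, m):
    # log of the central chi-squared density f(x, m)
    return ((m / 2 - 1) * math.log(x) - x / 2
            - (m / 2) * math.log(2) - math.lgamma(m / 2))

def ncx2_pdf(x, p, lam, terms=200):
    # h(x, lam) = sum_j (lam/2)^j exp(-lam/2) / j! * f(x, p + 2j)
    tot = 0.0
    for j in range(terms):
        logw = -lam / 2 + j * math.log(lam / 2) - math.lgamma(j + 1)
        tot += math.exp(logw + chi2_logpdf(x, p + 2 * j))
    return tot

p = 3
for (x1, x2) in [(1.0, 4.0), (2.0, 10.0)]:
    for (l1, l2) in [(0.5, 2.0), (1.0, 6.0)]:
        det = (ncx2_pdf(x1, p, l1) * ncx2_pdf(x2, p, l2)
               - ncx2_pdf(x1, p, l2) * ncx2_pdf(x2, p, l1))
        assert det >= 0.0          # TP-2 in (x, lam) on this grid
print("h(x, lam) is TP-2 on the test grid")
```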
Example A.2. Recall that, if χ²_m and χ²_n are independent random variables, then Y = χ²_m/χ²_n has a density given by

f(y|m, n) = [Γ((m + n)/2) / (Γ(m/2) Γ(n/2))] y^{(m/2)−1} / (1 + y)^{(m+n)/2}.

If the numerator is noncentral chi-squared, say χ²_p(λ), rather than central chi-squared, then Y = χ²_p(λ)/χ²_n has a density

h(y|λ) = Σ_{j=0}^∞ [(λ/2)^j exp[−λ/2] / j!] f(y|p + 2j, n).

Of course, Y has an unnormalized F(p, n; λ) distribution according to our usage in Section 8.4. Since f(y|p + 2j, n) is TP-2 in y and j, it follows as in Example A.1 that h is TP-2.
The next result yields the useful fact that the noncentral Student's t distribution is TP-2.

Proposition A.1. Suppose f > 0 defined on (0, ∞) satisfies:

(i) ∫_0^∞ e^{ux} f(x) dx < +∞ for u ∈ R^1;

(ii) f(x/θ) is TP-2 on (0, ∞) × (0, ∞).

For θ ∈ R^1 and t ∈ R^1, define k by k(t, θ) = ∫_0^∞ e^{θtx} f(x) dx. Then k is TP-2.

Proof. First consider t ∈ R^1 and θ > 0. Set v = θx in the integral defining k to obtain

k(t, θ) = θ^{-1} ∫_0^∞ e^{tv} f(v/θ) dv.

Now apply Karlin's Lemma to conclude that k is TP-2 on R^1 × (0, ∞). A similar argument shows that k is TP-2 on R^1 × (−∞, 0). Since k(t, 0) is a constant, it is now easy to show that k is TP-2 on R^1 × R^1. □
Example A.3. Suppose X is N(μ, 1) and Y is χ²_n. The random variable T = X/√Y, which is, up to a factor of √n, a noncentral Student's t random variable, has a density that depends on μ, the noncentrality parameter. The density of T (derived by writing down the joint density of X and Y, changing variables to T and W = √Y, and integrating out W) can be written

p(t|μ) = [2^{1−n/2} / (√(2π) Γ(n/2))] exp[−μ²/2] (1 + t²)^{−(n+1)/2} ∫_0^∞ exp[ψ(t)μx] exp[−x²/2] x^n dx

where

ψ(t) = t(1 + t²)^{−1/2}

is an increasing function of t. Consider the function

k(v, μ) = ∫_0^∞ exp[vμx] exp[−x²/2] x^n dx.

With f(x) = exp[−x²/2] x^n, Proposition A.1 shows k, and hence the density of T, is TP-2.
We conclude this appendix with a brief description of the role of TP-2 in one-sided testing problems. Consider a TP-2 density p(x|θ) for x ∈ 𝒳 ⊆ R^1 and θ ∈ Θ ⊆ R^1. Suppose we want to test the null hypothesis H0: θ ∈ (−∞, θ0] ∩ Θ versus H1: θ ∈ (θ0, ∞) ∩ Θ. The following basic result is due to Karlin and Rubin (1956).

Proposition A.2. Given any test φ0 of H0 versus H1, there exists a test φ of the form

φ(x) = 1 if x > x0;  φ(x) = γ if x = x0;  φ(x) = 0 if x < x0,

with 0 ≤ γ ≤ 1, such that ℰθφ ≤ ℰθφ0 for θ ≤ θ0 and ℰθφ ≥ ℰθφ0 for θ > θ0. For any such test φ, ℰθφ is nondecreasing in θ.
Comments on Selected Problems
CHAPTER 1
4. This problem gives the direct sum version of partitioned matrices. For (ii), identify V1 with vectors of the form (v1, 0) ∈ V1 ⊕ V2 and restrict T to these. This restriction is a map from V1 to V1 ⊕ V2, so T(v1, 0) = (z1(v1), z2(v1)) where z1(v1) ∈ V1 and z2(v1) ∈ V2. Show that z1 is a linear transformation on V1 to V1 and z2 is a linear transformation on V1 to V2. This gives A11 and A21. A similar argument gives A12 and A22. Part (iii) is a routine computation.
5. If x_{r+1} = Σ_1^r c_i x_i, then w_{r+1} = Σ_1^r c_i w_i.
8. If u ∈ R^k has coordinates u1, ..., uk, then Au = Σ_1^k u_i x_i and all such vectors are just span{x1, ..., xk}. For (ii), r(A) = r(A') so dim ℛ(A'A) = dim ℛ(AA').
10. The algorithm of projecting x2, ..., xk onto (span{x1})^⊥ is known as Björck's algorithm (Björck, 1967) and is an alternative method of doing Gram–Schmidt. Once you see that y2, ..., yk are perpendicular to y1, this problem is not hard.
11. The assumptions and linearity imply that [Ax, w] = [Bx, w] for all x ∈ V and w ∈ W. Thus [(A − B)x, w] = 0 for all w. Choose w = (A − B)x so (A − B)x = 0.
12. Choose z such that [y1, z] ≠ 0. Then [y1, z]x1 = [y2, z]x2, so set c = [y2, z]/[y1, z]. Thus cx2 □ y1 = x2 □ y2, so cy1 □ x2 = y2 □ x2. Hence c||x2||² y1 = ||x2||² y2, so y1 = c^{-1} y2.
13. This problem shows the topologies generated by inner products are all the same. We know [x, y] = (x, Ay) for some A > 0. Let c1 be the minimum eigenvalue of A, and let c2 be the maximum eigenvalue of A.
14. This is just the Cauchy-Schwarz Inequality.
15. The classical two-way ANOVA table is a consequence of this problem. That A, B1, B2, and B3 are orthogonal projections is a routine but useful calculation. Just keep the notation straight and verify that P² = P = P', which characterizes orthogonal projections.
16. To show that Γ(M^⊥) ⊆ M^⊥, verify that (u, Γv) = 0 for all u ∈ M when v ∈ M^⊥. Use the fact that Γ'Γ = I and u = Γu1 for some u1 ∈ M (since Γ(M) ⊆ M and Γ is nonsingular).
17. Use Cauchy–Schwarz and the fact that P_M x = x for x ∈ M.
18. This is Cauchy–Schwarz for the non-negative definite bilinear form [C, D] = tr ACBD'.
20. Use Proposition 1.36 and the assumption that A is real.
21. The representation αP + β(I − P) is a spectral type representation; see Theorem 1.2a. If M = ℛ(P), let x1, ..., xr, x_{r+1}, ..., xn be any orthonormal basis such that M = span{x1, ..., xr}. Then Ax_i = αx_i, i = 1, ..., r, and Ax_i = βx_i, i = r + 1, ..., n. The characteristic polynomial of A must be (α − λ)^r (β − λ)^{n−r}.
22. Since λ1 = sup_{||x||=1}(x, Ax), μ1 = sup_{||x||=1}(x, Bx), and (x, Ax) ≥ (x, Bx), obviously λ1 ≥ μ1. Now, argue by contradiction: let j be the smallest index such that λj < μj. Consider eigenvectors x1, ..., xn and y1, ..., yn with Ax_i = λ_i x_i and By_i = μ_i y_i, i = 1, ..., n. Let M = span{x_j, x_{j+1}, ..., x_n} and let N = span{y1, ..., y_j}. Since dim M = n − j + 1, dim(M ∩ N) ≥ 1. Using the identities λ_j = sup_{x ∈ M, ||x||=1}(x, Ax) and μ_j = inf_{x ∈ N, ||x||=1}(x, Bx), for any x ∈ M ∩ N with ||x|| = 1 we have (x, Ax) ≤ λ_j < μ_j ≤ (x, Bx), which is a contradiction.

23. Write S = Σ_1^n λ_i x_i □ x_i in spectral form where λ_i > 0, i = 1, ..., n. Then 0 = (S, T) = Σ_1^n λ_i (x_i, Tx_i), which implies (x_i, Tx_i) = 0 for i = 1, ..., n as T ≥ 0. This implies T = 0.
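The eigenvalue comparison of Problem 22 (A ≥ B implies λ_i(A) ≥ λ_i(B) for every i) is easy to test numerically. An editorial sketch (assuming Python with NumPy; the matrices and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
G = rng.normal(size=(n, n)); B = G @ G.T            # B >= 0
H = rng.normal(size=(n, n)); A = B + H @ H.T        # A = B + (PSD), so A >= B

lamA = np.sort(np.linalg.eigvalsh(A))[::-1]         # descending eigenvalues
lamB = np.sort(np.linalg.eigvalsh(B))[::-1]
print(np.all(lamA >= lamB - 1e-10))   # → True: lam_i(A) >= lam_i(B) for all i
```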
24. Since tr A and (A, I) are both linear in A, it suffices to show equality for A's of the form A = x □ y. But (x □ y, I) = (x, y). However, that tr x □ y = (x, y) is easily verified by choosing a coordinate system.
25. Parts (i) and (ii) are easy but (iii) is not. It is false that A² ≥ B², and a 2 × 2 matrix counterexample is not hard to construct. It is true that A^{1/2} ≥ B^{1/2}. To see this, let C = B^{1/2}A^{−1/2}, so by hypothesis, I ≥ C'C. Note that the eigenvalues of C are real and positive, being the same as those of B^{1/4}A^{−1/2}B^{1/4}, which is positive definite. If λ is any eigenvalue of C, there is a corresponding eigenvector, say x, such that ||x|| = 1 and Cx = λx. The relation I ≥ C'C implies λ² ≤ 1, so 0 < λ ≤ 1 as λ is positive. Thus all the eigenvalues of C are in (0, 1], so the same is true of A^{−1/4}B^{1/2}A^{−1/4}. Hence A^{−1/4}B^{1/2}A^{−1/4} ≤ I, so B^{1/2} ≤ A^{1/2}.
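A concrete instance of both claims in Problem 25 (an editorial illustration, assuming Python with NumPy; the particular 2 × 2 matrices are one standard counterexample, not taken from the text):

```python
import numpy as np

def psd_sqrt(M):
    # symmetric PSD square root via the spectral decomposition
    w, v = np.linalg.eigh(M)
    return v @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ v.T

A = np.array([[2.0, 1.0], [1.0, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 0.0]])

assert np.linalg.eigvalsh(A - B).min() >= -1e-12    # A >= B
print(np.linalg.eigvalsh(A @ A - B @ B).min())      # negative: A^2 >= B^2 fails
d = psd_sqrt(A) - psd_sqrt(B)
print(np.linalg.eigvalsh(d).min())                  # >= 0: A^{1/2} >= B^{1/2} holds
```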
26. Since P is an orthogonal projection, all its eigenvalues are zero or one and the multiplicity of one is the rank of P. But tr P is just the sum of the eigenvalues of P.
28. Since any A ∈ ℒ(V, V) can be written as (A + A')/2 + (A − A')/2, it follows that M + N = ℒ(V, V). If A ∈ M ∩ N, then A = A' = −A, so A = 0. Thus ℒ(V, V) is the direct sum of M and N, so dim M + dim N = n². A direct calculation shows that {x_i □ x_j + x_j □ x_i | i ≤ j} ∪ {x_i □ x_j − x_j □ x_i | i < j} is an orthogonal set of vectors, none of which is zero, and hence the set is linearly independent. Since the set has n² elements, it forms a basis for ℒ(V, V). Because x_i □ x_j + x_j □ x_i ∈ M and x_i □ x_j − x_j □ x_i ∈ N, dim M ≥ n(n + 1)/2 and dim N ≥ n(n − 1)/2. Assertions (i), (ii), and (iii) now follow. For (iv), just verify that the map A → (A + A')/2 is idempotent and self-adjoint.
29. Part (i) is a consequence of sup_{||v||=1} ||Av|| = sup_{||v||=1} [Av, Av]^{1/2} = sup_{||v||=1} (v, A'Av)^{1/2} and the spectral theorem. The triangle inequality follows from |||A + B||| = sup_{||v||=1} ||Av + Bv|| ≤ sup_{||v||=1} (||Av|| + ||Bv||) ≤ sup_{||v||=1} ||Av|| + sup_{||v||=1} ||Bv||.
30. This problem is easy, but it is worth some careful thought; it provides more evidence that A ⊗ B has been defined properly and ⟨·, ·⟩ is an appropriate inner product on ℒ(W, V). Assertion (i) is easy since (A ⊗ B)(x_i □ w_j) = (Ax_i) □ (Bw_j) = (λ_i x_i) □ (μ_j w_j) = λ_i μ_j x_i □ w_j. Obviously, x_i □ w_j is an eigenvector of the eigenvalue λ_i μ_j. Part (ii) follows since the two linear transformations agree on the basis {x_i □ w_j | i = 1, ..., m, j = 1, ..., n} for ℒ(W, V). For (iii), if the eigenvalues of A and B are positive, so are the eigenvalues of A ⊗ B. Since the trace of a self-adjoint linear transformation is the sum of the eigenvalues (this is true even without self-adjointness, but the proof requires a bit more than we have established here), we have tr A ⊗ B = Σ_{i,j} λ_i μ_j = (Σ_i λ_i)(Σ_j μ_j) = (tr A)(tr B). Since the determinant is the product of the eigenvalues, det(A ⊗ B) = Π_{i,j} (λ_i μ_j) = (Π_i λ_i)^n (Π_j μ_j)^m = (det A)^n (det B)^m.

31. Since ψ'ψ = I_p, ψ is a linear isometry and its columns form an orthonormal set. Since ℛ(ψ) ⊆ M and the two subspaces have the same dimension, (i) follows. (ii) is immediate.
32. If C is n x k and D is k X n, the set of nonzero eigenvalues of CD is
the same as the set of nonzero eigenvalues of DC.
33. Apply Problem 32.
34. Orthogonal transformations preserve angles.
35. This problem requires that you have a facility in dealing with conditional expectation. If you do, the problem requires a bit of calculation
but not much more. If you don't, proceed to Chapter 2.
CHAPTER 2
1. Write x = Σ c_i x_i so (x, X) = Σ c_i (x_i, X). Thus ℰ|(x, X)| ≤ Σ |c_i| ℰ|(x_i, X)| and ℰ|(x, X)| is finite by assumption. To show that Cov(X) exists, it suffices to verify that var(x, X) exists for each x ∈ V. But var(x, X) = var{Σ c_i (x_i, X)} = Σ_{i,j} cov{c_i(x_i, X), c_j(x_j, X)}. Then var{c_i(x_i, X)} = ℰ[c_i(x_i, X)]² − [ℰ c_i(x_i, X)]², which exists by assumption. The Cauchy–Schwarz Inequality shows that [cov{c_i(x_i, X), c_j(x_j, X)}]² ≤ var{c_i(x_i, X)} var{c_j(x_j, X)}. But var{c_i(x_i, X)} exists by the above argument.
2. All inner products on a finite dimensional vector space are related via
the positive definite quadratic forms. An easy calculation yields the result of this problem.
3. Let (·, ·)_i be an inner product on V_i, i = 1, 2. Since f_i is linear on V_i, f_i(x) = (x_i, x)_i for some x_i ∈ V_i, i = 1, 2. Thus if X1 and X2 are uncorrelated (the choice of inner product is irrelevant by Problem 2), (2.2) holds. Conversely, if (2.2) holds, then Cov((x1, X1)1, (x2, X2)2) = 0 for x_i ∈ V_i, i = 1, 2, since (x1, ·)1 and (x2, ·)2 are linear functions.
4. Let s = n − r, write X for the first r coordinates and X̃ for the last s, and consider Γ1 ∈ 𝒪_r and a Borel set B1 of R^r. Then

Pr{Γ1X ∈ B1} = Pr{Γ1X ∈ B1, X̃ ∈ R^s} = Pr{(X; X̃) ∈ B1 × R^s} = Pr{X ∈ B1}.

The second equality holds since the block-diagonal matrix

( Γ1  0 )
( 0   I_s )

is in 𝒪_n. Thus X has an 𝒪_r-invariant distribution. That X given X̃ has an 𝒪_r-invariant distribution is easy to prove when X has a density with respect to Lebesgue measure on R^n (the density has a version that satisfies f(x) = f(Γx) for x ∈ R^n, Γ ∈ 𝒪_n). The general case requires some fiddling with conditional expectations; this is left to the interested reader.
5. Let Λ_i = Cov(X_i), i = 1, ..., n. It suffices to show that var(x, Σ X_i) = Σ(x, Λ_i x). But (x, X_i), i = 1, ..., n, are uncorrelated, so var[Σ(x, X_i)] = Σ var(x, X_i) = Σ(x, Λ_i x).
6. ℰU = Σ p_i e_i = p. Let U have coordinates U1, ..., Uk. Then Cov(U) = ℰUU' − pp', and UU' is a k × k matrix with elements U_i U_j. For i ≠ j, U_i U_j = 0 and for i = j, U_i U_j = U_i. Since ℰU_i = p_i, ℰUU' = D_p. When 0 < p_i < 1, D_p has rank k and the rank of Cov(U) is the rank of I_k − D_p^{−1/2} pp' D_p^{−1/2}. Let u = D_p^{−1/2} p, so u ∈ R^k has length one. Thus I_k − uu' is a rank k − 1 orthogonal projection. The null space of Cov(U) is span{e} where e is the vector of ones in R^k. The rest is easy.
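The rank and null-space claims above are easy to verify numerically. An editorial sketch (assuming Python with NumPy; the probability vector is arbitrary):

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])          # cell probabilities, 0 < p_i < 1
k = len(p)
Cov = np.diag(p) - np.outer(p, p)      # Cov(U) = D_p - pp'

e = np.ones(k)
print(np.linalg.matrix_rank(Cov))      # → k - 1
print(np.abs(Cov @ e).max())           # ≈ 0: e spans the null space
```

Indeed Cov(U) e = p − p (p'e) = p − p = 0 exactly, since the p_i sum to one.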
7. The random variable X takes on n! values, namely the n! permutations of x, each with probability 1/n!. A direct calculation gives ℰX = x̄e where x̄ = n^{−1} Σ_1^n x_i. The distribution of X is permutation invariant, which implies that Cov(X) has the form σ²A where a_ii = 1 and a_ij = ρ for i ≠ j with −1/(n − 1) ≤ ρ ≤ 1. Since var(e'X) = 0, we see that ρ = −1/(n − 1). Thus σ² = var(X1) = n^{−1} Σ_1^n (x_i − x̄)², where X1 is the first coordinate of X.
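For a small x the n! permutations can be enumerated and the covariance computed exactly, confirming σ² and ρ = −1/(n − 1). An editorial sketch (assuming Python with NumPy; the data vector is arbitrary):

```python
import itertools
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
n = len(x)
perms = np.array(list(itertools.permutations(x)))   # all n! equally likely values

Cov = np.cov(perms, rowvar=False, bias=True)        # exact Cov(X) over the n! points
sigma2 = ((x - x.mean()) ** 2).mean()               # n^{-1} sum (x_i - xbar)^2

print(np.isclose(Cov[0, 0], sigma2))                       # → True
print(np.isclose(Cov[0, 1] / Cov[0, 0], -1 / (n - 1)))     # → True: rho = -1/(n-1)
```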
8. Setting D = −I, ℰX = −ℰX, so ℰX = 0. For i ≠ j, cov{X_i, X_j} = cov{−X_i, X_j} = −cov{X_i, X_j}, so X_i and X_j are uncorrelated. The first equality is obtained by choosing D with d_ii = −1 and d_jj = 1 in the relation ℒ(X) = ℒ(DX).
9. This is a direct calculation.

10. It suffices to verify the equality for A = x □ y, as both sides of the equality are linear in A. For A = x □ y, (A, Σ) = (x, Σy) and (μ, Aμ) = (μ, x)(μ, y), so the equality is obvious.
11. To say Cov(X) = I_n ⊗ Σ is to say that cov{tr AX', tr BX'} = tr AΣB'. To show rows 1 and 2 are uncorrelated, pick A = e1v' and B = e2u' where u, v ∈ R^p. Let X1 and X2 be the first two rows of X. Then tr AX' = v'X1, tr BX' = u'X2, and tr AΣB' = 0. The desired equality is established by first showing that it is valid for A = xy', x, y ∈ R^n, and using linearity. When A = xy', a useful equality is X'AX = Σ_i Σ_j x_i y_j X_i X_j', where the rows of X are X1, ..., Xn.
12. The equation ΓAΓ′ = A for all Γ ∈ O_p implies that A = cI_p for some c.
13. Cov((Γ ⊗ I)X) = Cov(X) implies Cov(X) = I ⊗ Σ for some Σ. Cov((I ⊗ ψ)X) = Cov(X) then implies ψΣψ′ = Σ, which necessitates Σ = cI for some c > 0. Part (ii) is immediate since Γ ⊗ ψ is an orthogonal transformation on (ℒ(V, W), ⟨·,·⟩).
COMMENTS ON SELECTED PROBLEMS
14. This problem is a nasty calculation intended to inspire an appreciation for the equation Cov(X) = I_n ⊗ Σ.
15. Since L(X) = L(−X), EX = 0. Also, L(X) = L(ΓX) implies Cov(X) = cI for some c ≥ 0. But ||X||² = 1 implies c = 1/n. The best affine predictor of X_1 given the remaining coordinates is 0. I would predict X_1 by saying that X_1 is (1 − X̃′X̃)^{1/2} with probability 1/2 and X_1 is −(1 − X̃′X̃)^{1/2} with probability 1/2, where X̃ = (X_2,…, X_n)′.
16. This is just the definition of ⊗.
17. For (i), just calculate. For (ii), Cov(S) = 2I_2 ⊗ I_2 by Proposition 2.23. The coordinate inner product on R³ is not the inner product ⟨·,·⟩ on S_2.
CHAPTER 3
2. Since var(X_1) = var(Y_1) = 1 and cov(X_1, Y_1) = ρ, |ρ| ≤ 1. Form Z = (X Y), an n × 2 matrix. Then Cov(Z) = I_n ⊗ A where

A = [ 1  ρ ]
    [ ρ  1 ].

When |ρ| < 1, A is positive definite, so I_n ⊗ A is positive definite. Conditioning on Y, L(X|Y) = N(ρY, (1 − ρ²)I_n), so L(Q(Y)X|Y) = N(0, (1 − ρ²)Q(Y)) as Q(Y)Y = 0 and Q(Y) is an orthogonal projection. Now apply Proposition 3.8 for Y fixed to get L(W) = (1 − ρ²)χ²_{n−1}.
3. Just do the calculations.
4. Since p(x) is zero in the second and fourth quadrants, X cannot be normal. Just find the marginal density of X_1 to show that X_1 is normal.
5. Write U in the form X′AX where A is symmetric. Then apply Propositions 3.8 and 3.11.
6. Note that Cov(X □ X) = 2I ⊗ I by Proposition 2.23. Since (X, AX) = ⟨X □ X, A⟩, and similarly for (X, BX), 0 = cov{(X, AX), (X, BX)} = cov{⟨X □ X, A⟩, ⟨X □ X, B⟩} = ⟨A, 2(I ⊗ I)B⟩ = 2 tr AB. Thus 0 = tr A^{1/2}BA^{1/2}, so A^{1/2}BA^{1/2} = 0, which shows A^{1/2}B^{1/2} = 0 and hence AB = 0.
7. Since E[exp(itW_j)] = exp[itμ_j − σ_j|t|], E[exp(itΣ_j a_jW_j)] = exp[itΣ_j a_jμ_j − (Σ_j |a_j|σ_j)|t|], so L(Σ_j a_jW_j) = C(Σ_j a_jμ_j, Σ_j |a_j|σ_j). Part (ii) is immediate from (i).
8. For (i), use the independence of R and Z_0 to compute as follows: P{U ≤ u} = P{Z_0 ≤ u/R} = ∫_0^∞ P{Z_0 ≤ u/t}G(dt) = ∫_0^∞ Φ(u/t)G(dt), where Φ is the distribution function of Z_0. Now, differentiate. Part (ii) is clear.
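Differentiating under the integral in Problem 8 gives the density of U as the scale mixture p(u) = ∫ t⁻¹φ(u/t)G(dt), with φ the standard normal density. A small numerical sketch (the two-point mixing distribution G below is an arbitrary illustrative choice) confirms that this mixture density integrates to one:

```python
import numpy as np

def normal_pdf(z):
    # standard normal density phi(z)
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

# Two-point mixing distribution G for the scale R: an arbitrary choice.
ts = np.array([0.5, 2.0])   # support points of G
ws = np.array([0.3, 0.7])   # masses G({t})

def mixture_pdf(u):
    # p(u) = sum_t G({t}) * (1/t) * phi(u/t), the derivative in u of
    # P{U <= u} = sum_t G({t}) * Phi(u/t)
    return sum(w * normal_pdf(u / t) / t for t, w in zip(ts, ws))

u = np.linspace(-20.0, 20.0, 200001)
dens = mixture_pdf(u)
total = np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(u))  # trapezoid rule
print(total)   # close to 1
```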
9. Let B_1 be the sub-σ-algebra induced by T_1(X) = X_2 and let B_2 be the sub-σ-algebra induced by T_2(X) = X_2′X_2. Since B_2 ⊆ B_1, for any bounded function f(X) we have E(f(X)|B_2) = E(E(f(X)|B_1)|B_2). But for f(X) = h(X_2′X_1), the conditional expectation given B_1 can be computed via the conditional distribution of X_2′X_1 given X_2, which is

(3.3) L(X_2′X_1|X_2) = N(X_2′X_2Σ_22^{-1}Σ_21, X_2′X_2 ⊗ Σ_{11·2}).

Hence E(h(X_2′X_1)|B_1) is B_2-measurable, so E(h(X_2′X_1)|B_2) = E(h(X_2′X_1)|B_1). This implies that the conditional distribution (3.3) serves as a version of the conditional distribution of X_2′X_1 given X_2′X_2.
10. Show that T^{-1}T_1: R^n → R^n is an orthogonal transformation, so l(C) = l((T^{-1}T_1)(C)). Setting B = T_1(C), we have P_0(B) = ν_1(B) for Borel B.
11. The measures ν_0 and ν_1 are equal up to a constant, so all that needs to be calculated is ν_0(C)/ν_1(C) for some set C with 0 < ν_1(C) < +∞. Do the calculation for C = {v | [v, v] ≤ 1}.
12. The inner product ⟨·,·⟩ on S_p is not the coordinate inner product. The "Lebesgue measure" on (S_p, ⟨·,·⟩) given by our construction is not l(dS) = Π_{i≤j} ds_ij, but is ν_0(dS) = (√2)^{p(p−1)/2} l(dS).
13. Any matrix M of the form

M = a [ 1  b  …  b ]
      [ b  1  …  b ]
      [ …        … ]
      [ b  …  b  1 ]   (p × p)

can be written as M = a[(p − 1)b + 1]A + a(1 − b)(I − A). This is a spectral decomposition for M, so M has eigenvalues a((p − 1)b + 1) and a(1 − b) (of multiplicity p − 1). Setting α = a[(p − 1)b + 1] and β = a(1 − b) solves (i). Clearly, M^{-1} = α^{-1}A + β^{-1}(I − A) whenever α and β are not zero. To do part (ii), use the parameterization (μ, α, β) given above (a = σ² and b = ρ). Then use the factorization criterion on the likelihood function.
CHAPTER 4
1. Part (i) is clear since Zβ = Σ_i β_i z_i for β ∈ R^k. For (ii), use the singular value decomposition to write Z = Σ_{i=1}^r λ_i x_i u_i′, where r is the rank of Z, {x_1,…, x_r} is an orthonormal set in R^n, {u_1,…, u_r} is an orthonormal set in R^k, M = span{x_1,…, x_r}, and N(Z) = (span{u_1,…, u_r})⊥.
Thus (Z′Z)⁻ = Σ_{i=1}^r λ_i^{-2}u_iu_i′, and a direct calculation shows that Z(Z′Z)⁻Z′ = Σ_{i=1}^r x_ix_i′, which is the orthogonal projection onto M.
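The identity Z(Z′Z)⁻Z′ = Σ x_ix_i′ (the orthogonal projection onto the column space M) is easy to verify numerically; the matrix Z below is an arbitrary full-rank example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
Z = rng.standard_normal((n, k))        # arbitrary full-rank design matrix

# Singular value decomposition Z = sum_i lambda_i x_i u_i'
X, lam, Ut = np.linalg.svd(Z, full_matrices=False)

P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T   # Z (Z'Z)^{-1} Z'
P_svd = X @ X.T                        # sum of x_i x_i' over the r = k left singular vectors

print(np.allclose(P, P_svd))           # the two expressions agree
print(np.allclose(P @ P, P))           # P is idempotent
print(np.allclose(P, P.T))             # and symmetric: an orthogonal projection
```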
2. Since L(X_i) = L(β + ε_i) where Eε_i = 0 and var(ε_i) = 1, it follows that L(X) = L(βe + ε) where Eε = 0 and Cov(ε) = I_n. A direct application of least-squares yields β̂ = X̄ for this linear model. For (iii), since the same β is added to each coordinate of ε, the vector of ordered X's has the same distribution as βe + ν, where ν is the vector of ordered ε's. Thus L(U) = L(βe + ν), so EU = βe + a_0 and Cov(U) = Cov(ν) = Σ. Hence L(U − a_0) = L(βe + (ν − a_0)). Based on this model, the Gauss-Markov estimator for β is β̃ = (e′Σ^{-1}e)^{-1}e′Σ^{-1}(U − a_0). Since β̂ = X̄ = (1/n)e′(U − a_0) (show e′a_0 = 0 using the symmetry of f), it follows from the Gauss-Markov Theorem that var(β̃) ≤ var(β̂).
3. That M − ω = M ∩ ω⊥ is clear since ω ⊆ M. The condition (P_M − P_ω)² = P_M − P_ω follows from observing that P_MP_ω = P_ωP_M = P_ω. Thus P_M − P_ω is an orthogonal projection onto its range. That R(P_M − P_ω) = M − ω is easily verified by writing x ∈ V as x = x_1 + x_2 + x_3 where x_1 ∈ ω, x_2 ∈ M − ω, and x_3 ∈ M⊥. Then (P_M − P_ω)(x_1 + x_2 + x_3) = x_1 + x_2 − x_1 = x_2. Writing P_M = (P_M − P_ω) + P_ω and noting that (P_M − P_ω)P_ω = 0 yields the final identity.
4. That R(A_0) = M_0 is clear. To show R(B_1) = M_1 − M_0, first consider the transformation C defined by (Cy)_ij = ȳ_i·, i = 1,…, I, j = 1,…, J. Then C² = C = C′, and clearly, R(C) ⊆ M_1. But if y ∈ M_1, then Cy = y, so C is the orthogonal projection onto M_1. From Problem 3 (with M = M_1 and ω = M_0), we see that C − A_0 is the orthogonal projection onto M_1 − M_0. But ((C − A_0)y)_ij = ȳ_i· − ȳ··, which is just (B_1y)_ij. Thus B_1 = C − A_0, so R(B_1) = M_1 − M_0. A similar argument shows R(B_2) = M_2 − M_0. For (ii), use the fact that A_0 + B_1 + B_2 + B_3 is the identity and the four orthogonal projections are perpendicular to each other. For (iii), first observe that M = M_1 + M_2 and M_1 ∩ M_2 = M_0. If μ has the assumed representation, let ν be the vector with ν_ij = α + β_i and let ξ be the vector with ξ_ij = γ_j. Then ν ∈ M_1 and ξ ∈ M_2, so μ = ν + ξ ∈ M_1 + M_2. Conversely, suppose μ ∈ M_0 ⊕ (M_1 − M_0) ⊕ (M_2 − M_0), say μ = δ + ν + ξ. Since δ ∈ M_0, δ_ij = δ·· for all i, j, so set α = δ··. Since ν ∈ M_1 − M_0, ν_ij − ν_ik = 0 for all j, k for each fixed i, and ν·· = 0. Take j = 1 and set β_i = ν_i1. Then ν_ij = β_i for j = 1,…, J and, since ν·· = 0, Σ_i β_i = 0. Similarly, setting γ_j = ξ_1j, ξ_ij = γ_j for all i, j, and since ξ·· = 0, Σ_j γ_j = 0. Thus μ_ij = α + β_i + γ_j where Σ_i β_i = Σ_j γ_j = 0.
5. With n = dim V, the density of Y is (up to constants) f(y|μ, σ²) = σ^{-n}exp[−(1/2σ²)||y − μ||²]. Using the results and notation of Problem 3, write V = ω ⊕ (M − ω) ⊕ M⊥, so (M − ω) ⊕ M⊥ = ω⊥. Under H_0, μ ∈ ω, so μ̂_0 = P_ω y is the maximum likelihood estimator of μ and

(4.4) f(y|μ̂_0, σ²) = σ^{-n}exp[−(1/2σ²)||Q_ω y||²],

where Q_ω = I − P_ω. Maximizing (4.4) over σ² yields σ̂_0² = n^{-1}||Q_ω y||². A similar analysis under H_1 shows that the maximum likelihood estimator of μ is μ̂ = P_M y and that σ̂² = n^{-1}||Q_M y||² is the maximum likelihood estimator of σ². Thus the likelihood ratio test rejects for small values of the ratio

Λ(y) = f(y|μ̂_0, σ̂_0²)/f(y|μ̂, σ̂²) = (σ̂²/σ̂_0²)^{n/2} = (||Q_M y||²/||Q_ω y||²)^{n/2}.

But Q_ω = Q_M + P_{M−ω} and Q_MP_{M−ω} = 0, so ||Q_ω y||² = ||Q_M y||² + ||P_{M−ω}y||². Rejecting for small values of Λ(y) is equivalent to rejecting for large values of (Λ(y))^{−2/n} − 1 = ||P_{M−ω}y||²/||Q_M y||². Under H_0, μ ∈ ω, so L(P_{M−ω}Y) = N(0, σ²P_{M−ω}) and L(Q_MY) = N(0, σ²Q_M). Since Q_MP_{M−ω} = 0, Q_MY and P_{M−ω}Y are independent, and L(||P_{M−ω}Y||²) = σ²χ²_r where r = dim M − dim ω. Also, L(||Q_MY||²) = σ²χ²_k where k = n − dim M.
6. We use the notation of Problems 4 and 5. In the parameterization described in (iii) of Problem 4, β_1 = β_2 = ⋯ = β_I iff μ ∈ M_2. Thus ω = M_2, so M − ω = M_1 − M_0. Since M⊥ is the range of B_3 (Problem 1.15), ||B_3y||² = ||Q_My||², and it is clear that ||B_3y||² = Σ_iΣ_j(y_ij − ȳ_i· − ȳ_·j + ȳ··)². Also, since M − ω = M_1 − M_0, P_{M−ω} = P_{M_1} − P_{M_0} and ||P_{M−ω}y||² = ||P_{M_1}y||² − ||P_{M_0}y||² = Σ_i J(ȳ_i· − ȳ··)².
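The decomposition used in Problems 4-6 can be sketched numerically for an I × J layout: the pieces corresponding to A_0y, B_1y, B_2y, B_3y sum back to y, are mutually orthogonal, and ||P_{M_1}y||² − ||P_{M_0}y||² equals ΣJ(ȳ_i· − ȳ··)². The layout below is an arbitrary example (the array names are mine, not the text's):

```python
import numpy as np

I, J = 3, 4
rng = np.random.default_rng(1)
y = rng.standard_normal((I, J))        # arbitrary two-way layout

ybar = y.mean()
yi = y.mean(axis=1, keepdims=True)     # row means  ybar_{i.}
yj = y.mean(axis=0, keepdims=True)     # column means ybar_{.j}

A0y = np.full((I, J), ybar)            # projection onto M0 (grand mean)
B1y = (yi - ybar) * np.ones((I, J))    # row effects: M1 - M0
B2y = (yj - ybar) * np.ones((I, J))    # column effects: M2 - M0
B3y = y - yi - yj + ybar               # residual: projection onto M-perp

ok_identity = np.allclose(A0y + B1y + B2y + B3y, y)   # A0 + B1 + B2 + B3 = I

parts = [A0y, B1y, B2y, B3y]
ok_orthogonal = all(np.isclose(np.sum(a * b), 0.0)
                    for i, a in enumerate(parts) for b in parts[i + 1:])

# ||P_{M1} y||^2 - ||P_{M0} y||^2 = sum_i J (ybar_i. - ybar..)^2
lhs = np.sum((yi * np.ones((I, J))) ** 2) - np.sum(A0y ** 2)
rhs = J * np.sum((yi.flatten() - ybar) ** 2)

print(ok_identity, ok_orthogonal, np.isclose(lhs, rhs))
```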
7. Since R(X′) = R(X′X) and X′y is in the range of X′, there exists a b ∈ R^k such that X′Xb = X′y. Now suppose that b is any solution. First note that P_MX = X since each column of X is in M. Since X′Xb = X′y, we have X′[Xb − P_My] = X′Xb − X′P_My = X′Xb − (P_MX)′y = X′Xb − X′y = 0. Thus the vector v = Xb − P_My is perpendicular to each column of X (X′v = 0), so v ∈ M⊥. But Xb ∈ M and, obviously, P_My ∈ M, so v ∈ M. Hence v = 0, so Xb = P_My.
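A sketch of Problem 7's conclusion: any solution b of the normal equations X′Xb = X′y gives the same fitted vector Xb = P_My, even when X is rank deficient. The design matrix below, with a duplicated column, is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
z = rng.standard_normal((n, 2))
X = np.hstack([z, z[:, :1]])           # rank-deficient: column 3 repeats column 1
y = rng.standard_normal(n)

# Two different solutions of the normal equations X'Xb = X'y:
b1 = np.linalg.pinv(X.T @ X) @ X.T @ y              # minimum-norm solution
b2 = b1 + np.array([1.0, 0.0, -1.0])                # another solution: X b2 = X b1

# Orthogonal projection onto M = column space of X (spanned by z alone).
Q, _ = np.linalg.qr(z)
P_M = Q @ Q.T

print(np.allclose(X.T @ X @ b1, X.T @ y))   # b1 solves the normal equations
print(np.allclose(X.T @ X @ b2, X.T @ y))   # so does b2
print(np.allclose(X @ b1, P_M @ y))         # both yield the fitted vector P_M y
print(np.allclose(X @ b2, X @ b1))
```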
8. Since I is of the form αP_e + βQ_e (take α = β = 1), Gauss-Markov and least-squares agree iff

(4.5) (αP_e + βQ_e)M ⊆ M for all α, β > 0.

But (4.5) is equivalent to the two conditions P_eM ⊆ M and Q_eM ⊆ M. If e ∈ M, then M = span{e} ⊕ M_1, where M_1 ⊆ (span{e})⊥. Thus P_eM = span{e} ⊆ M and Q_eM = M_1 ⊆ M, so Gauss-Markov equals least-squares. If e ∈ M⊥, then M ⊆ (span{e})⊥, so P_eM = {0} and Q_eM = M, so again Gauss-Markov equals least-squares. For (ii), if e ∉ M⊥ and e ∉ M, then one of the two conditions P_eM ⊆ M or Q_eM ⊆ M is violated, so least-squares and Gauss-Markov cannot agree for all α and β. For (iii), since M ⊆ (span{e})⊥ and M ≠ (span{e})⊥, we can write R^n = span{e} ⊕ M ⊕ M_1, where M_1 = (span{e})⊥ − M and M_1 ≠ {0}. Let P_1 be the orthogonal projection onto M_1. Then the exponent in the density for Y is (ignoring the factor −1/2)

(y − μ)′(α^{-1}P_e + β^{-1}Q_e)(y − μ) = (P_ey + P_1y + P_M(y − μ))′(α^{-1}P_e + β^{-1}Q_e)(P_ey + P_1y + P_M(y − μ)) = α^{-1}y′P_ey + β^{-1}y′P_1y + β^{-1}(y − μ)′P_M(y − μ),

where we have used the fact that Q_e = P_1 + P_M and P_1P_M = 0. Since det(αP_e + βQ_e) = αβ^{n−1}, the usual arguments yield μ̂ = P_My, α̂ = y′P_ey, and β̂ = (n − 1)^{-1}y′P_1y as maximum likelihood estimators. When M = span{e}, the maximum likelihood estimators for (α, μ) do not exist, other than the solution μ̂ = P_ey and α̂ = 0 (which is outside the parameter space). The whole point is that when e ∈ M, you must have replications to estimate α when the covariance structure is αP_e + βQ_e.
9. Define the inner product (·,·) on R^n by (x, y) = x′Σ^{-1}y. In the inner product space (R^n, (·,·)), EY = Xβ and Cov(Y) = σ²I. The transformation P defined by the matrix X(X′Σ^{-1}X)^{-1}X′Σ^{-1} satisfies P² = P and is self-adjoint in (R^n, (·,·)). Thus P is an orthogonal projection onto its range, which is easily shown to be the column space of X. The Gauss-Markov Theorem implies that μ̂ = PY as claimed. Since μ = Xβ, X′μ = X′Xβ, so β = (X′X)^{-1}X′μ. Hence β̂ = (X′X)^{-1}X′μ̂, which is just the expression given.
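The argument in Problem 9 can be illustrated numerically: P = X(X′Σ⁻¹X)⁻¹X′Σ⁻¹ is idempotent, self-adjoint in the inner product (x, y) = x′Σ⁻¹y, acts as the identity on the column space of X, and β̂ = (X′X)⁻¹X′μ̂ reproduces the generalized least-squares estimator. The Σ and X below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 7, 2
X = rng.standard_normal((n, k))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)        # arbitrary positive definite covariance
Si = np.linalg.inv(Sigma)
y = rng.standard_normal(n)

P = X @ np.linalg.inv(X.T @ Si @ X) @ X.T @ Si

print(np.allclose(P @ P, P))           # P^2 = P
# self-adjointness in (x, y) = x' Si y holds iff P' Si = Si P
print(np.allclose(P.T @ Si, Si @ P))
print(np.allclose(P @ X, X))           # P is the identity on the column space of X

mu_hat = P @ y
beta_via_mu = np.linalg.inv(X.T @ X) @ X.T @ mu_hat
beta_gls = np.linalg.inv(X.T @ Si @ X) @ X.T @ Si @ y
print(np.allclose(beta_via_mu, beta_gls))
```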
10. For (i), each Γ ∈ O(V) is nonsingular, so Γ(M) ⊆ M is equivalent to Γ(M) = M; hence Γ^{-1}(M) = M and Γ^{-1} = Γ′. Parts (ii) and (iii) are easy. To verify (iv), t_0(cΓy + x_0) = P_M(cΓy + x_0) = cP_MΓy + x_0 = cΓP_My + x_0 = cΓt_0(y) + x_0. The identity P_MΓ = ΓP_M for Γ ∈ O_M(V) was used to obtain the third equality. For (v), first set Γ = I and x_0 = −P_My to obtain

(4.6) t(y) = t(Q_My) + P_My.

Then to calculate t, we need only know t for vectors u ∈ M⊥, as Q_My ∈ M⊥. Fix u ∈ M⊥ and let z = t(u), so z ∈ M by assumption. Then there exists a Γ ∈ O_M(V) such that Γu = u and Γz = −z. For this Γ, we have z = t(u) = t(Γu) = Γt(u) = Γz = −z, so z = 0. Hence t(u) = 0 for all u ∈ M⊥ and the result follows.
11. Part (i) follows by showing directly that the regression subspace M is invariant under each I_n ⊗ A. For (ii), an element of M has the form μ = (Z_1β_1, Z_2β_2) ∈ ℒ_{2,n} for some β_1 ∈ R^k and β_2 ∈ R^k. To obtain an example where M is not invariant under all I_n ⊗ Σ, take k = 1, Z_1 = ε_1, and Z_2 = ε_2, so μ is

μ = [ β_1   0  ]
    [  0   β_2 ]
    [  0    0  ]   (n × 2, remaining rows zero).

That the set of such μ's is not invariant under all I_n ⊗ Σ is easily verified. When Z_1 = Z_2, then μ = Z_1B where B is k × 2 with ith column β_i, i = 1, 2. Thus Example 4.4 applies. For (iii), first observe that Z_1 and Z_2 have the same column space (when they are of full rank) iff Z_2 = Z_1C where C is k × k and nonsingular. Now apply part (ii) with β_2 replaced by Cβ_2, so M is the set of μ's of the form μ = Z_1B where B ∈ ℒ_{2,k}.
CHAPTER 5
1. Let a_1,…, a_p be the columns of A and apply Gram-Schmidt to these vectors in the order a_p, a_{p−1},…, a_1. Now argue as in Proposition 5.2.
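The decomposition underlying Proposition 5.2 (A = ΨU with Ψ′Ψ = I_p and U upper triangular with positive diagonal) can be sketched with a QR routine, flipping signs so the diagonal of U is positive; the matrix A below is an arbitrary full-rank choice:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 6, 3
A = rng.standard_normal((n, p))        # arbitrary full-rank n x p matrix

Q, R = np.linalg.qr(A)                 # A = QR with R upper triangular
signs = np.sign(np.diag(R))            # flip signs so diag(U) > 0
Psi = Q * signs                        # multiply column j of Q by sign_j
U = signs[:, None] * R                 # multiply row j of R by sign_j

print(np.allclose(A, Psi @ U))                 # A = Psi U
print(np.allclose(Psi.T @ Psi, np.eye(p)))     # Psi is a linear isometry
print(np.all(np.diag(U) > 0))                  # U has positive diagonal
print(np.allclose(U, np.triu(U)))              # U is upper triangular
```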
2. Follows easily from the uniqueness of F(S).
3. Just modify the proof of Proposition 5.4.
4. Apply Proposition 5.7.
5. That F is one-to-one and onto follows from Proposition 5.2. Given A ∈ 𝓔, F^{-1}(A) ∈ 𝓕_{p,n} × G_T⁺ is the pair (Ψ, U) where A = ΨU. For (ii), F(ΓΨ, UT′) = ΓΨUT′ = (Γ ⊗ T)(ΨU) = (Γ ⊗ T)(F(Ψ, U)). If F^{-1}(A) = (Ψ, U), then A = ΨU and Ψ and U are unique. Then (Γ ⊗ T)A = ΓAT′ = ΓΨUT′, and ΓΨ ∈ 𝓕_{p,n} and UT′ ∈ G_T⁺. Uniqueness implies that F^{-1}(ΓΨUT′) = (ΓΨ, UT′).
6. When Dg(x_0) exists, it is the unique n × n matrix that satisfies

(5.3) lim_{x→x_0} ||g(x) − g(x_0) − Dg(x_0)(x − x_0)|| / ||x − x_0|| = 0.

But by assumption, (5.3) is satisfied by A (for Dg(x_0)). By definition, Jg(x_0) = det(Dg(x_0)).
7. With t_ii denoting the ith diagonal element of T, the set {T | t_ii > 0} is open since the function T → t_ii is continuous on V to R¹. But G_T⁺ = ∩_i {T | t_ii > 0}, which is open. That g has the given representation is just a matter of doing a little algebra. To establish the fact that lim_{x→0}(||R(x)||/||x||) = 0, we are free to use any norm we want on V and S_p (all norms defined by inner products define the same topology). Using the trace inner product on V and S_p, ||R(x)||² = ||xx′||² = tr xx′xx′ and ||x||² = tr xx′, x ∈ V. But for S ≥ 0, tr S² ≤ (tr S)², so ||R(x)||/||x|| ≤ (tr xx′)^{1/2}, which converges to zero as x → 0. For (iii), write S = L(x), string the S coordinates out as a column vector in the order s_11, s_21, s_22, s_31, s_32, s_33,…, and string the x coordinates out in the same order. Then the matrix of L is lower triangular and its determinant is easily computed by induction. Part (iv) is immediate from Problem 6.
8. Just write out the equations SS^{-1} = I in terms of the blocks and solve.
9. That P² = P is easily checked. Also, some algebra and Problem 8 show that (Pu, v) = (u, Pv), so P is self-adjoint in the inner product (·,·). Thus P is an orthogonal projection on (R^p, (·,·)). Obviously,

R(P) = {x | x = (y, 0), y ∈ R^q}.

Since

Px = (y − Σ_12Σ_22^{-1}z, 0),

||Px||² = (Px, Px) = (y − Σ_12Σ_22^{-1}z)′(Σ_11 − Σ_12Σ_22^{-1}Σ_21)^{-1}(y − Σ_12Σ_22^{-1}z).

A similar calculation yields ||(I − P)x||² = z′Σ_22^{-1}z. For (iii), the exponent in the density of X is −(1/2)(x, x) = −(1/2)||Px||² − (1/2)||(I − P)x||². Marginally, Z is N(0, Σ_22), so the exponent in Z's density is −(1/2)||(I − P)x||². Thus dividing shows that the exponent in the conditional density of Y given Z is −(1/2)||Px||², which corresponds to a normal distribution with mean Σ_12Σ_22^{-1}z and covariance (Σ^{11})^{-1} = Σ_11 − Σ_12Σ_22^{-1}Σ_21.
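The identity (Σ¹¹)⁻¹ = Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁ used at the end of Problem 9 (the Schur complement relation behind Problem 8's block inverse) is easy to confirm; the partitioned Σ below is an arbitrary positive definite example:

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = 5, 2                            # x = (y, z) with y in R^q
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)        # arbitrary positive definite covariance

S11 = Sigma[:q, :q]
S12 = Sigma[:q, q:]
S21 = Sigma[q:, :q]
S22 = Sigma[q:, q:]

Prec = np.linalg.inv(Sigma)
Prec11 = Prec[:q, :q]                  # the (1,1) block Sigma^{11} of Sigma^{-1}

cond_cov = S11 - S12 @ np.linalg.inv(S22) @ S21   # conditional covariance of Y given Z

# (Sigma^{11})^{-1} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
print(np.allclose(np.linalg.inv(Prec11), cond_cov))
```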
10. On G_T⁺, for j < i, t_ij ranges from −∞ to +∞ and each integral contributes √(2π); there are p(p − 1)/2 of these. For j = i, t_ii ranges from 0 to ∞ and the change of variable u_ii = t_ii²/2 shows that the integral over t_ii is (√2)^{r−i−1}Γ((r − i + 1)/2). Hence the integral is equal to

(2π)^{p(p−1)/4} Π_{i=1}^p (√2)^{r−i−1} Γ((r − i + 1)/2),

which is just 2^{-p}c(r, p).
CHAPTER 6
1. Each g ∈ Gl(V) maps a linearly independent set into a linearly independent set. Thus g(M) ⊆ M implies g(M) = M, as g(M) and M have the same dimension. That G(M) is a group is clear. For (ii),

[ g_11  g_12 ] [ y ]
[ g_21  g_22 ] [ 0 ]  ∈ M  for all y ∈ R^q

iff g_21y = 0 for all y ∈ R^q, iff g_21 = 0. But

[ g_11  g_12 ]
[  0    g_22 ]

is nonsingular iff both g_11 and g_22 are nonsingular. That G_1 and G_2 are subgroups of G(M) is obvious. To show G_2 is normal, consider h ∈ G_2 and g ∈ G(M). Then ghg^{-1} has I_r as its (2, 2) element, so it is in G_2. For (iv), that G_1 ∩ G_2 = {I} is clear. Each g ∈ G can be written as

[ g_11  g_12 ]   [ g_11   0   ] [ I_q  g_11^{-1}g_12 ]
[  0    g_22 ] = [  0    g_22 ] [  0        I_r      ],

which has the form g = hk with h ∈ G_1 and k ∈ G_2. The representation is unique as G_1 ∩ G_2 = {I}. Also, g_1g_2 = h_1k_1h_2k_2 = (h_1h_2)(h_2^{-1}k_1h_2)k_2 = h_3k_3 by the uniqueness of the representation.
2. G(M) does not act transitively on V − {0} since the vector (y, 0)′, y ≠ 0, remains in M under the action of each g ∈ G. To show G(M) is transitive on V ∩ M^c, consider

x_i = (y_i, z_i)′, i = 1, 2,

with z_1 ≠ 0 and z_2 ≠ 0. It is easy to argue there is a g ∈ G(M) such that gx_1 = x_2 (since z_1 ≠ 0 and z_2 ≠ 0).
3. Each n × n matrix Γ ∈ O_n can be regarded as an n²-dimensional vector. A sequence {Γ_j} converges to a point x ∈ R^{n²} iff each element of Γ_j converges to the corresponding element of x. It is clear that the limit of a sequence of orthogonal matrices is another orthogonal matrix. To show O_n is a topological group, it must be shown that the map (Γ, ψ) → Γψ is continuous from O_n × O_n to O_n; this is routine. To show χ(Γ) = 1 for all Γ, first observe that H = {χ(Γ) | Γ ∈ O_n} is a subgroup of the multiplicative group (0, ∞) and H is compact as it is the continuous image of a compact set. Suppose r ∈ H and r ≠ 1. Then r^j ∈ H for j = 1, 2,… as H is a group, but {r^j} has no convergent subsequence in H; this contradicts the compactness of H. Hence r = 1.
4. Set x = e^u and ξ(u) = log χ(e^u), u ∈ R¹. Then ξ(u_1 + u_2) = ξ(u_1) + ξ(u_2), so ξ is a continuous homomorphism on R¹ to R¹. It must be shown that ξ(u) = νu for some fixed real ν. This follows from the solution to Problem 6 below in the special case that V = R¹.
5. This problem is easy, but the result is worth noting.
6. Part (i) is easy, and for part (ii) all that needs to be shown is that φ is linear. First observe that

(6.6) φ(v_1 + v_2) = φ(v_1) + φ(v_2),

so it remains to verify that φ(λv) = λφ(v) for λ ∈ R¹. (6.6) implies φ(0) = 0 and φ(nv) = nφ(v) for n = 1, 2,…. Also, φ(−v) = −φ(v) follows from (6.6). Setting w = nv and dividing by n, we have φ(w/n) = (1/n)φ(w) for n = 1, 2,…. Now φ((m/n)v) = mφ((1/n)v) = (m/n)φ(v) and, by continuity, φ(λv) = λφ(v) for λ ≥ 0. The rest is easy.
7. Not hard with the outline given.
8. By the spectral theorem, every rank r orthogonal projection can be written Γx_0Γ′ for some Γ ∈ O_n. Hence transitivity holds. The equation Γx_0Γ′ = x_0 holds for Γ ∈ O_n iff Γ has the form

Γ = [ Γ_11   0   ]
    [  0    Γ_22 ],

and this gives the isotropy subgroup of x_0. For Γ ∈ O_n, Γx_0Γ′ = (Γx_0)(Γx_0)′ and Γx_0 has the form (ψ 0) where ψ is n × r with columns the first r columns of Γ. Thus Γx_0Γ′ = ψψ′. Part (ii) follows by observing that ψ_1ψ_1′ = ψ_2ψ_2′ iff ψ_1 = ψ_2A for some A ∈ O_r.
9. The only difficulty here is (iii). The problem is to show that the only continuous homomorphisms χ on G_2 to (0, ∞) are χ(k) = u^α for some real α. Consider the subgroups G_3 and G_4 of G_2 given by

G_3 = {[ I_{p−1} 0; x′ 1 ] | x ∈ R^{p−1}},  G_4 = {[ I_{p−1} 0; 0 u ] | u ∈ (0, ∞)}.

The group G_3 is isomorphic to R^{p−1}, so the only homomorphisms are x → exp[Σ_i a_ix_i], and G_4 is isomorphic to (0, ∞), so the only homomorphisms are u → u^α for some real α. For k ∈ G_2, write

k = [ I_{p−1} 0; x′ u ] = [ I_{p−1} 0; x′ 1 ][ I_{p−1} 0; 0 u ],

so χ(k) = exp[Σ_i a_ix_i]u^α. Now use the condition χ(k_1k_2) = χ(k_1)χ(k_2) to conclude a_1 = a_2 = ⋯ = a_{p−1} = 0, so χ has the claimed form.
10. Use (6.4) to conclude that

I_γ = 2^p(√2π)^{np}ω(n, p)∫_{G_T⁺} Π_i t_ii^{2γ+n−i} exp[−(1/2)Σ_{i≤j} t_ij²] dT,

and then use Problem 5.10 to evaluate the integral over G_T⁺. You will find that, for 2γ + n > p − 1, the integral is finite and is I_γ = (√2π)^{np}ω(n, p)/ω(2γ + n, p). If 2γ + n ≤ p − 1, the integral diverges.
11. Examples 6.14 and 6.17 give Δ_r for G(M) and all the continuous homomorphisms for G(M). Pick x_0 ∈ R^p ∩ M^c to be

x_0 = (y_0, z_0)′,

where z_0′ = (1, 0,…, 0), z_0 ∈ R^r. Then H_0 consists of those g's with the first column of g_12 being 0 and the first column of g_22 being z_0. To apply Theorem 6.3, all that remains is to calculate the right-hand modulus of H_0, say Δ_0. This is routine given the calculations of Examples 6.14 and 6.17. You will find that the only possible multipliers are χ(g) = |g_11||g_33|, and Lebesgue measure on R^p ∩ M^c is the only (up to a positive constant) invariant measure.
12. Parts (i), (ii), (iii), and (iv) are routine. For (v), J_1(f) = ∫f(x)μ(dx) and J_2(f) = ∫f(T^{-1}(y))ν(dy) are both invariant integrals on 𝒦(𝒳). By Theorem 6.3, J_1 = kJ_2 for some constant k. To find k, take f(x) = (√2π)^{-n}s^n(x)exp[−(1/2)x′x], so J_1(f) = 1. Since s(T^{-1}(y)) = v for y = (u, v, w),

J_2(f) = (√2π)^{-n}∫ v^{n−2} exp[−(1/2)v² − (n/2)u²] du dv ν(dw) = Γ((n − 1)/2)/(2√n(√π)^{n−1}) = 1/k.

For (vi), the expected value of any function of x̄ and s(x), say q(x̄, s(x)), is

E q(x̄, s(x)) = ∫q(x̄, s(x))f(x)s^n(x)μ(dx)
= k∫q(u, v)f(T^{-1}(u, v, w))v^{n−2} du dv ν(dw)
= k∫q(u, v)v^{n−2}h(v² + n(u − μ)²) du dv.

Thus the joint density of x̄ and s(x) is

p(u, v) = kv^{n−2}h(v² + n(u − μ)²) (with respect to du dv).
13. We need to show that, with Y(X) = X/||X||, P{||X|| ∈ B, Y ∈ C} = P{||X|| ∈ B}P{Y ∈ C}. If P{||X|| ∈ B} = 0, the above is obvious. If not, set ν(C) = P{Y ∈ C, ||X|| ∈ B}/P{||X|| ∈ B}, so ν is a probability measure on the Borel sets of {y | ||y|| = 1} ⊆ R^n. But the relation Y(Γx) = ΓY(x) and the O_n-invariance of L(X) imply that ν is an O_n-invariant probability measure and hence is unique (for all Borel B); namely, ν is the uniform probability measure on {y | ||y|| = 1}.
14. Each x ∈ 𝒳 can be uniquely written as gy with g ∈ 𝒫_n and y ∈ 𝒴 (of course, y is the order statistic of x). Define 𝒫_n acting on 𝒫_n × 𝒴 by g(P, y) = (gP, y). Then φ^{-1}(gx) = gφ^{-1}(x). Since P(gx) = gP(x), the argument used in Problem 13 shows that P(X) and Y(X) are independent and P(X) is uniform on 𝒫_n.
CHAPTER 7
1. Apply Propositions 7.5 and 7.6.
2. Write X = ΨU as in Proposition 7.3, so Ψ and U are independent. Then P(X) = ΨΨ′ and S(X) = U′U, and the independence is obvious.
3. First, write Q in the form given in the problem, where M is n × n and nonsingular. Since M is nonsingular, it suffices to show that (M^{-1}(A))^c has measure zero. Write x = (x_1, x_2)′ where x_1 is r × p. It then suffices to show that B^c = {x | x ∈ ℒ_{p,n}, rank(x_1) = p}^c has measure zero. For this, use the argument given in Proposition 7.1.
4. That the φ's are the only equivariant functions follows as in Example 7.6.
5. Part (i) is obvious. For (ii), just observe that knowledge of F_n allows you to write down the order statistic, and conversely.
6. Parts (i) and (ii) are clear. For (iii), write x = Px + Qx. If t is equivariant, t(x + y) = t(x) + y for y ∈ M. This implies that t(Qx) = t(x) − Px (pick y = −Px). Thus t(x) = Px + t(Qx). Since Q = I − P, Qx ∈ M⊥, so BQx = Qx for any B with (B, y) ∈ G. Since t(Qx) ∈ M, pick B such that Bx = −x for x ∈ M. The equivariance of t then gives t(Qx) = t(BQx) = Bt(Qx) = −t(Qx), so t(Qx) = 0.
7. Part (i) is routine, as is the first part of (ii) (use Problem 6). An equivariant estimator of σ² must satisfy t(ax + b) = a²t(x). G acts transitively on the parameter space and on (0, ∞), so Proposition 7.8 and the argument given in Example 7.6 apply.
8. When X has density f(x′x), then Y = XΣ^{1/2} = (I_n ⊗ Σ^{1/2})X has density f(Σ^{-1/2}x′xΣ^{-1/2}), since dx/|x′x|^{n/2} is invariant under x → xA for A ∈ Gl_p. Also, when X has density f, then L((Γ ⊗ Δ)X) = L(X) for all Γ ∈ O_n and Δ ∈ O_p. This implies (see Proposition 2.19) that Cov(X) = cI_n ⊗ I_p for some c > 0. Hence Cov((I_n ⊗ Σ^{1/2})X) = cI_n ⊗ Σ. Part (ii) is clear and (iii) follows from Proposition 7.8 and Example 7.6. For (iv), the definition of C_0 and the assumption on f imply f(ΓC_0Γ′) = f(C_0Γ′Γ) = f(C_0) for each Γ ∈ O_p. The uniqueness of C_0 implies C_0 = aI_p for some a > 0. Thus the maximum likelihood estimator of Σ must be aX′X (see Proposition 7.12 and Example 7.10).
9. If L(X) = P_0, then E(||X||) is the same whenever L(X) ∈ {P | P = gP_0, g ∈ O(V)}, since x → ||x|| is a maximal invariant under the action of O(V) on V. For (ii), E(||X||) depends on μ through ||μ||.
10. Write V = ω ⊕ (M − ω) ⊕ M⊥. Remove a set of Lebesgue measure zero from V and show the F ratio is a maximal invariant under the group action x → aΓx + b, where a > 0, b ∈ ω, and Γ ∈ O(V) satisfies Γ(ω) ⊆ ω and Γ(M − ω) ⊆ M − ω. The group action on the parameter (μ, σ²) is μ → aΓμ + b and σ² → a²σ². A maximal invariant parameter is ||P_{M−ω}μ||²/σ², which is zero when μ ∈ ω.
11. The statistic V is invariant under x_i → Ax_i + b, i = 1,…, n, where b ∈ R^p, A ∈ Gl_p, and det A = 1. The model is invariant under this group action, where the induced group action on (μ, Σ) is μ → Aμ + b and Σ → AΣA′. A direct calculation shows θ = det(Σ) is a maximal invariant under the group action. Hence the distribution of V depends on (μ, Σ) only through θ.
12. For (i), if h ∈ G and B ∈ ℬ, (hP)(B) = P(h^{-1}B) = ∫_G(gQ)(h^{-1}B)μ(dg) = ∫_GQ(g^{-1}h^{-1}B)μ(dg) = ∫_GQ((hg)^{-1}B)μ(dg) = ∫_GQ(g^{-1}B)μ(dg) = P(B), so hP = P for h ∈ G and P is G-invariant. For (ii), let Q be the distribution described in Proposition 7.16 (ii), so if L(X) = P, then L(X) = L(UY) where U is uniform on G and is independent of Y. Thus for any bounded ℬ-measurable function f,

∫f(x)P(dx) = ∫∫f(gy)μ(dg)Q(dy) = ∫∫f(gx)μ(dg)Q(dx).

Set f = I_B and we have P(B) = ∫_GQ(g^{-1}B)μ(dg), so (7.1) holds.
13. For y ∈ 𝒴 and B ∈ ℬ, define R(B|y) by R(B|y) = ∫_G I_B(gy)μ(dg). For each y, R(·|y) is a probability measure on (𝒳, ℬ) and, for fixed B, R(B|·) is (𝒴, 𝒞)-measurable. For P ∈ 𝒫, (ii) of Proposition 7.16 shows that

(7.2) ∫h(x)P(dx) = ∫∫h(gy)μ(dg)Q(dy).

But by definition of R(·|·), ∫_G h(gy)μ(dg) = ∫h(x)R(dx|y), so (7.2) becomes

∫h(x)P(dx) = ∫∫h(x)R(dx|y)Q(dy).

This shows that R(·|y) serves as a version of the conditional distribution of X given T(X). Since R does not depend on P ∈ 𝒫, T(X) is sufficient.
14. For (i), that t(gx) = g ∘ t(x) is clear. Also, X − X̄e = Q_eX, which is N(0, Q_e), so it is ancillary. For (ii), E(f(X_1)|X̄ = t) = E(f(X_1 − X̄ + X̄)|X̄ = t) = E(f(ε_1′Z(X) + X̄)|X̄ = t), since Z(X) has coordinates X_i − X̄, i = 1,…, n. Since Z and X̄ are independent, this last conditional expectation (given X̄ = t) is just the integral over the distribution of Z with X̄ = t. But ε_1′Z(X) = X_1 − X̄ is N(0, δ²), so the claimed integral expression holds. When f(x) = 1 for x ≤ u_0 and 0 otherwise, the integral is just Φ((u_0 − t)/δ), where Φ is the normal cumulative distribution function.
15. Let B be the set (−∞, u_0], so I_B(X_1) is an unbiased estimator of h(a, b) when L(X) = (a, b)P_0. Thus ĥ(t(X)) = E(I_B(X_1)|t(X)) is an unbiased estimator of h(a, b) based on t(X). To compute ĥ, we have E(I_B(X_1)|t(X)) = P{X_1 ≤ u_0|t(X)} = P{(X_1 − X̄)/s ≤ (u_0 − X̄)/s | (s, X̄)}. But (X_1 − X̄)/s = Z_1 is the first coordinate of Z(X), so it is independent of (s, X̄). Thus ĥ(s, X̄) = P{Z_1 ≤ (u_0 − X̄)/s} = F((u_0 − X̄)/s), where F is the distribution function of the first coordinate of Z. To find F, first observe that Z takes values in 𝒮 = {x | x ∈ R^n, x′e = 0, ||x|| = 1} and the compact group O_n(e) acts transitively on 𝒮. Since Z(ΓX) = ΓZ(X) for Γ ∈ O_n(e), it follows that Z has a uniform distribution on 𝒮 (see the argument in Example 7.19). Let U be N(0, I_n), so Z has the same distribution as Q_eU/||Q_eU||, and Z_1 is distributed as ε_1′Q_eU/||Q_eU|| = (Q_eε_1)′U/||Q_eU||. Since ||Q_eε_1||² = (n − 1)/n and Q_eU is N(0, Q_e), it follows that L(Z_1) = L(((n − 1)/n)^{1/2}W_1), where W_1 = U_1/(Σ_{i=1}^{n−1}U_i²)^{1/2}. The rest is a routine computation.
16. Part (i) is obvious and (ii) follows from

(7.3) E(f(X)|T(X) = g) = E(f(T(X)(T(X))^{-1}X)|T(X) = g) = E(f(T(X)Z(X))|T(X) = g).

Since Z(X) and T(X) are independent and T(X) = g, the last member of (7.3) is just the expectation over Z of f(gZ). Part (iii) is just an application, and Q_0 is the uniform distribution on 𝓕_{p,n}. For (iv), let B be a fixed Borel set in R^p and consider the parametric function

h(Σ) = P_Σ{X_1 ∈ B} = ∫I_B(x)(√2π)^{-p}|Σ|^{-1/2}exp[−(1/2)x′Σ^{-1}x]dx,

where X_1 is the first row of X. Since T(X) is a complete sufficient statistic, the MVUE of h(Σ) is

(7.4) ĥ(T) = E(I_B(X_1)|T(X) = T) = P{T(T(X))^{-1}X_1 ∈ B | T(X) = T}.

But Z_1′ = ((T(X))^{-1}X_1)′ is the first row of Z(X), so it is independent of T(X). Hence ĥ(T) = P_1{Z_1 ∈ T^{-1}(B)}, where P_1 is the distribution of Z_1 when Z has a uniform distribution on 𝓕_{p,n}. Since Z_1 is the first p coordinates of a random vector that is uniform on {x | ||x|| = 1, x ∈ R^n}, it follows that Z_1 has a density φ(||u||²) for u ∈ R^p, where φ is given by

φ(v) = c(1 − v)^{(n−p−2)/2}, 0 ≤ v ≤ 1,
φ(v) = 0, otherwise,

where c = Γ(n/2)/(π^{p/2}Γ((n − p)/2)). Therefore

ĥ(T) = ∫_{R^p}I_B(Tu)φ(||u||²)du = (det T)^{-1}∫_{R^p}I_B(u)φ(||T^{-1}u||²)du.

Now let B shrink to the point u_0 to get that (det T)^{-1}φ(||T^{-1}u_0||²) is the MVUE for (√2π)^{-p}|Σ|^{-1/2}exp[−(1/2)u_0′Σ^{-1}u_0].
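The normalizing constant c = Γ(n/2)/(π^{p/2}Γ((n − p)/2)) in Problem 16 can be checked numerically: for p = 1 and n = 5 (arbitrary illustrative values, not from the text) the density c(1 − u²)^{(n−3)/2} on (−1, 1) should integrate to one:

```python
import math
import numpy as np

n, p = 5, 1                            # arbitrary illustrative dimensions with n > p
c = math.gamma(n / 2) / (math.pi ** (p / 2) * math.gamma((n - p) / 2))

u = np.linspace(-1.0, 1.0, 200001)
density = c * (1.0 - u ** 2) ** ((n - p - 2) / 2)

# trapezoid rule; for n = 5, p = 1 the density is c (1 - u^2) with c = 3/4
total = np.sum((density[1:] + density[:-1]) / 2 * np.diff(u))
print(total)   # ≈ 1.0
```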
CHAPTER 8
1. Make a change of variables to r, x_1 = s_11/σ_11, and x_2 = s_22/σ_22, and then integrate out x_1 and x_2. That p(r|ρ) has the claimed form follows by inspection. Karlin's Lemma (see Appendix) implies that p(r|ρ) has a monotone likelihood ratio.
3. For α = 1/2, 1,…, (p − 1)/2, let X_1,…, X_r be i.i.d. N(0, I_p) with r = 2α. Then S = Σ_i X_iX_i′ has φ_α as its characteristic function. For α > (p − 1)/2, the function p_α(s) = k(α)|s|^α exp[−(1/2)tr s] is a density with respect to ds/|s|^{(p+1)/2} on S_p⁺. The characteristic function of p_α is φ_α. To show that φ_α(ΣA) is a characteristic function, let S satisfy E exp(i⟨A, S⟩) = φ_α(A) = |I_p − 2iA|^{-α}. Then Σ^{1/2}SΣ^{1/2} has φ_α(ΣA) as its characteristic function.
4. L(S) = L(ΓSΓ′) implies that Λ = ES satisfies Λ = ΓΛΓ′ for all Γ ∈ O_p. This implies Λ = cI_p for some constant c. Obviously, c = Es_11. For (ii), var(tr DS) = var(Σ_i d_is_ii) = Σ_i d_i² var(s_ii) + Σ_{i≠j} d_id_j cov(s_ii, s_jj). Noting that L(S) = L(ΓSΓ′) for Γ ∈ O_p, and in particular for permutation matrices, it follows that γ = var(s_ii) does not depend on i and β = cov(s_ii, s_jj) does not depend on i and j (i ≠ j). Thus var⟨D, S⟩ = γΣ_i d_i² + βΣ_{i≠j} d_id_j = (γ − β)Σ_i d_i² + β(Σ_i d_i)². For (iii), write A ∈ S_p as ΓDΓ′, so var⟨A, S⟩ = var⟨ΓDΓ′, S⟩ = var⟨D, Γ′SΓ⟩ = var⟨D, S⟩ = (γ − β)Σ_i d_i² + β(Σ_i d_i)² = (γ − β)⟨A, A⟩ + β⟨I, A⟩². With T = (γ − β)I_p ⊗ I_p + βI_p □ I_p, it follows that var⟨A, S⟩ = ⟨A, TA⟩, and since T is self-adjoint, this implies that Cov(S) = T.
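Problem 4's identity var⟨A, S⟩ = (γ − β)⟨A, A⟩ + β⟨I, A⟩² can be checked against a case where second moments are known in closed form: for S ~ W(I_p, p, n), the standard fourth-moment formula cov(s_ij, s_kl) = n(δ_ikδ_jl + δ_ilδ_jk) (an outside fact, not derived in the text) gives γ = 2n and β = 0:

```python
import numpy as np

p, n = 4, 7                            # arbitrary dimension and degrees of freedom
d = np.eye(p)
# cov(s_ij, s_kl) = n (d_ik d_jl + d_il d_jk) for S ~ W(I_p, p, n)
cov = n * (np.einsum('ik,jl->ijkl', d, d) + np.einsum('il,jk->ijkl', d, d))
gamma, beta = 2.0 * n, 0.0             # gamma = var(s_ii), beta = cov(s_ii, s_jj)

rng = np.random.default_rng(6)
B = rng.standard_normal((p, p))
A = (B + B.T) / 2                      # arbitrary symmetric test matrix

var_AS = np.einsum('ij,ijkl,kl->', A, cov, A)       # var <A, S>
formula = (gamma - beta) * np.trace(A @ A) + beta * np.trace(A) ** 2
print(np.isclose(var_AS, formula))
```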
5. Use Proposition 7.6.
6. Immediate from Problem 3.
7. For (i), it suffices to show that L((ASA′)^{-1}) = W((AΛA′)^{-1}, r, ν + r − 1). Since L(S^{-1}) = W(Λ^{-1}, p, ν + p − 1), Proposition 8.9 implies the desired result. (ii) follows immediately from (i). For (iii), (i) implies that S̃ = Λ^{-1/2}SΛ^{-1/2} is IW(I_p, p, ν) and L(S̃) = L(ΓS̃Γ′) for all Γ ∈ O_p. Now apply Problem 4 to conclude that ES̃ = cI_p where c = Es̃_11. That c = (ν − 2)^{-1} is an easy application of (i). Hence (ν − 2)^{-1}I_p = ES̃ = Λ^{-1/2}(ES)Λ^{-1/2}, so ES = (ν − 2)^{-1}Λ. Also, Cov(S̃) = (γ − β)I_p ⊗ I_p + βI_p □ I_p as in Problem 4. Thus Cov(S) = (Λ^{1/2} ⊗ Λ^{1/2})(Cov(S̃))(Λ^{1/2} ⊗ Λ^{1/2}) = (γ − β)Λ ⊗ Λ + βΛ □ Λ. For (iv), that L(S_11) = IW(Λ_11, q, ν) follows by taking A = (I_q 0) in part (i). To show L((S_{22·1})^{-1}) = W((Λ_{22·1})^{-1}, r, ν + q + r − 1), use Proposition 8.8 on S^{-1}, which is W(Λ^{-1}, p, ν + p − 1).
8. For (i), let p_1(x)p_2(s) denote the joint density of X and S with respect to the measure dx ds/|s|^{(p+1)/2}. Setting T = XS^{-1/2} and V = S, the joint density of T and V is p_1(tv^{1/2})p_2(v)|v|^{r/2} with respect to dt dv/|v|^{(p+1)/2}; the Jacobian of x → tv^{1/2} is |v|^{r/2} (see Proposition 5.10). Now, integrate out v to get the claimed density. That ℒ(T) = ℒ(ΓTΔ′) is clear from the form of the density (also from (ii) below). Use Proposition 2.19 to show Cov(T) = cI_r ⊗ I_p. Part (ii) follows by integrating out v from the conditional density of T to obtain the marginal density of T as given in (i). For (iii), represent T as: T given V is N(0, I_r ⊗ V) where V is IW(I_p, p, ν). Thus T_1 given V is N(0, I_r ⊗ V_{11}) where V_{11} is the q × q upper left-hand corner of V. Since ℒ(V_{11}) = IW(I_q, q, ν), the claimed result follows from (ii).
9. With V = S_2^{-1/2}S_1S_2^{-1/2} and S = S_2^{-1}, the conditional distribution of V given S is W(S, p, m) and ℒ(S) = IW(I_p, p, ν). Since V is then unconditionally F(m, ν, I_p), (i) follows. For (ii), ℒ(T) = T(ν, I_r, I_p) means that ℒ(T) = ℒ(XS^{1/2}) where ℒ(X) = N(0, I_r ⊗ I_p) and ℒ(S) = IW(I_p, p, ν). Thus ℒ(T′T) = ℒ(S^{1/2}X′XS^{1/2}). Since ℒ(X′X) = W(I_p, p, r), (ii) follows by definition of F(r, ν, I_p). For (iii), write F = T′T where ℒ(T) = T(ν, I_r, I_p), which has the density given in (i) of Problem 8. Since r ≥ p, Proposition 7.6 is directly applicable to yield the density of F. To establish (iv), first note that ℒ(F) = ℒ(ΓFΓ′)
for all Γ ∈ O_p. Using Example 7.16, F has the same distribution as ΨDΨ′ where Ψ is uniform on O_p and is independent of the diagonal matrix D whose diagonal elements λ_1 ≥ ⋯ ≥ λ_p are distributed as the eigenvalues of F. Thus λ_1, ..., λ_p are distributed as the eigenvalues of S_2^{-1}S_1 where S_1 is W(I_p, p, r) and S_2^{-1} is IW(I_p, p, ν). Hence ℒ(F^{-1}) = ℒ(ΨD^{-1}Ψ′) = ℒ(ΨD̃Ψ′) where the diagonal elements of D̃, say λ̃_1 ≥ ⋯ ≥ λ̃_p, are the eigenvalues of S_1^{-1}S_2. Since S_2 is W(I_p, p, ν + p − 1), it follows that ΨD̃Ψ′ has the same distribution as an F(ν + p − 1, r − p + 1, I_p) matrix by just repeating the orthogonal invariance argument given above. (v) is established by writing F = T′T as in (ii) and partitioning T as T_1: r × q and T_2: r × (p − q) so
T′T = (T_1′T_1  T_1′T_2; T_2′T_1  T_2′T_2).
Since ℒ(T_1) = T(ν, I_r, I_q) and F_{11} = T_1′T_1, (ii) implies that ℒ(F_{11}) = F(r, ν, I_q). (vi) can be established by deriving the density of XS^{-1}X′ directly and using (iii), but an alternative argument is more instructive. First, apply Proposition 7.4 to X′ and write X = V^{1/2}Ψ′ where V = XX′ is W(I_r, r, p) and is independent of Ψ: p × r, which is uniform on F_{r,p}. Then XS^{-1}X′ = V^{1/2}W^{-1}V^{1/2} where W = (Ψ′S^{-1}Ψ)^{-1} and is independent of V. Proposition 8.1 implies that ℒ(W) = W(I_r, r, m − p + r). Thus ℒ(W^{-1}) = IW(I_r, r, m − p + 1). Now, use the orthogonal invariance of the distribution of XS^{-1}X′ to conclude that ℒ(XS^{-1}X′) = ℒ(ΓDΓ′) where Γ and D are independent, Γ is uniform on O_r, and the diagonal elements of D are distributed as the ordered eigenvalues of W^{-1}V. As in the proof of (iv), conclude that
ℒ(ΓDΓ′) = F(p, m − p + 1, I_r).
10. The function S → S^{1/2} on S_p^+ to S_p^+ satisfies (ΓSΓ′)^{1/2} = ΓS^{1/2}Γ′ for Γ ∈ O_p. With B(S_1, S_2) = (S_1 + S_2)^{-1/2}S_1(S_1 + S_2)^{-1/2}, it follows that B(ΓS_1Γ′, ΓS_2Γ′) = ΓB(S_1, S_2)Γ′. Since ℒ(ΓS_iΓ′) = ℒ(S_i), i = 1, 2, and S_1 and S_2 are independent, the above implies that ℒ(B) = ℒ(ΓBΓ′) for Γ ∈ O_p. The rest of (i) is clear from Example 7.16. For (ii), let B_1 = S_1^{1/2}(S_1 + S_2)^{-1}S_1^{1/2} so ℒ(B_1) = ℒ(ΓB_1Γ′) for Γ ∈ O_p. Thus ℒ(B_1) = ℒ(ΨDΨ′) where Ψ and D are independent and Ψ is uniform on O_p. Also, the diagonal elements of D, say λ_1 ≥ ⋯ ≥ λ_p ≥ 0, are distributed as the ordered eigenvalues of S_1(S_1 + S_2)^{-1} so B_1 is B(m_1, m_2, I_p). (iii) is easy using (i) and (ii) and the fact that F(I + F)^{-1} is symmetric. For (iv), let B = X(S + X′X)^{-1}X′ and observe that ℒ(B) = ℒ(ΓBΓ′), Γ ∈ O_r. Since m ≥ p, S^{-1} exists so B = XS^{-1/2}(I_p + S^{-1/2}X′XS^{-1/2})^{-1}S^{-1/2}X′. Hence T = XS^{-1/2} is T(m − p + 1, I_r, I_p). Thus ℒ(B) = ℒ(ΨDΨ′) where Ψ is uniform on O_r and
is independent of D. The diagonal elements of D, say λ_1, ..., λ_r, are the eigenvalues of T(I_p + T′T)^{-1}T′. These are the same as the eigenvalues of TT′(I_r + TT′)^{-1} (use the singular value decomposition for T). But ℒ(TT′) = ℒ(XS^{-1}X′) = F(p, m − p + 1, I_r) by Problem 9 (vi). Now use (iii) above and the orthogonal invariance of ℒ(B). (v) is trivial.
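The eigenvalue claim used in (iv), that T(I_p + T′T)^{-1}T′ and TT′(I_r + TT′)^{-1} have the same eigenvalues, can be checked numerically. A NumPy sketch with arbitrary illustrative dimensions (no distributional assumptions; in fact the push-through identity makes the two matrices equal):

```python
import numpy as np

rng = np.random.default_rng(1)
r, p = 3, 5
T = rng.standard_normal((r, p))

B1 = T @ np.linalg.inv(np.eye(p) + T.T @ T) @ T.T   # T (I_p + T'T)^{-1} T'
B2 = T @ T.T @ np.linalg.inv(np.eye(r) + T @ T.T)   # TT' (I_r + TT')^{-1}

# both are r x r; symmetrize before eigvalsh to absorb rounding
ev1 = np.sort(np.linalg.eigvalsh((B1 + B1.T) / 2))
ev2 = np.sort(np.linalg.eigvalsh((B2 + B2.T) / 2))
```

Since (I_r + TT′)^{-1}T = T(I_p + T′T)^{-1}, the two expressions coincide as matrices, not just in spectrum.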
CHAPTER 9
1. Let B have rows β_1′, ..., β_k′ and form X in the usual way (see Example 4.3) so μ = ZB with an appropriate Z: n × k. Let R: 1 × k have entries a_1, ..., a_k. Then RB = Σ_i a_iβ_i′ and H_0 holds iff RB = 0. Now apply the results in Section 9.1.
2. For (i), just do the algebra. For (ii), apply (i) with S_1 = (Y − XB̂)′(Y − XB̂) and S_2 = (X(B − B̂))′(X(B − B̂)), so ψ(S_1) ≤ ψ(S_1 + S_2) for every B. Since A > 0, tr A(S_1 + S_2) = tr AS_1 + tr AS_2 ≥ tr AS_1 since tr AS_2 ≥ 0 as S_2 ≥ 0. To show det(A + S) is nondecreasing in S ≥ 0, first note that A + S_1 ≤ A + S_1 + S_2 in the sense of positive definiteness as S_2 ≥ 0. Thus the ordered eigenvalues of A + S_1 + S_2, say λ_1, ..., λ_p, satisfy λ_i ≥ μ_i, i = 1, ..., p, where μ_1, ..., μ_p are the ordered eigenvalues of A + S_1. Thus det(A + S_1 + S_2) ≥ det(A + S_1). This same argument solves (iv).
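The determinant monotonicity in Problem 2 rests on the eigenvalue inequality λ_i ≥ μ_i; a small NumPy check (with an arbitrary A > 0 and arbitrary positive semidefinite S_1, S_2) illustrates both steps:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = np.eye(p)                                      # any A > 0
G1 = rng.standard_normal((p, p)); S1 = G1 @ G1.T   # S1 >= 0
G2 = rng.standard_normal((p, 2)); S2 = G2 @ G2.T   # S2 >= 0 (rank 2)

lam = np.sort(np.linalg.eigvalsh(A + S1 + S2))     # ordered eigenvalues lambda_i
mu = np.sort(np.linalg.eigvalsh(A + S1))           # ordered eigenvalues mu_i

eig_mono = bool(np.all(lam >= mu - 1e-9))          # lambda_i >= mu_i for each i
det_mono = np.linalg.det(A + S1 + S2) >= np.linalg.det(A + S1)
```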
3. Since ℒ(EΨ′A′) = ℒ(EA′) for Ψ ∈ O_p, the distribution of EA′ depends only on a maximal invariant under the action A → AΨ of O_p on Gl_p. This maximal invariant is AA′. (ii) is clear and (iii) follows since the reduction to canonical form is achieved via an orthogonal transformation Ỹ = ΓY where Γ ∈ O_n. Thus Ỹ = Γμ + ΓEA′. Γ is chosen so Γμ has the claimed form and H_0 is B_1 = 0. Setting Ẽ = ΓE, the model has the claimed form and ℒ(Ẽ) = ℒ(E) by assumption. The arguments given in Section 9.1 show that the testing problem is invariant and a maximal invariant is the vector of the t largest eigenvalues of Y_1(Y_3′Y_3)^{-1}Y_1′. Under H_0, Y_1 = E_1A′ and Y_3 = E_3A′, so Y_1(Y_3′Y_3)^{-1}Y_1′ = E_1(E_3′E_3)^{-1}E_1′ ≡ W. When ℒ(ΓE) = ℒ(E) for all Γ ∈ O_n, write E = ΨU according to Proposition 7.3 where Ψ and U are independent and Ψ is uniform on F_{p,n}. Partitioning Ψ as E is partitioned, E_i = Ψ_iU, i = 1, 2, 3, so W = Ψ_1U((Ψ_3U)′Ψ_3U)^{-1}U′Ψ_1′ = Ψ_1(Ψ_3′Ψ_3)^{-1}Ψ_1′. The rest is obvious as the distribution of W depends only on the distribution of Ψ.
4. Use the independence of Y_1 and Y_3 and the fact that E(Y_3′Y_3)^{-1} = (m − p − 1)^{-1}Σ^{-1}.
5. Let Γ ∈ O_2 be given by
and set Ỹ = YΓ. Then ℒ(Ỹ) = N(ZBΓ, I_n ⊗ Γ′ΣΓ). Now, let BΓ have columns β̃_1 and β̃_2. Then H_0 is that β̃_1 = 0. Also Γ′ΣΓ is diagonal with unknown diagonal elements. The results of Section 9.2 apply directly to yield the likelihood ratio test. A standard invariance argument shows the test is UMP invariant.
6. For (i), look at the i, j elements of the equation for Y. To show M_2 ⊥ M_3, compute as follows: ⟨αu_2′, u_1β′⟩ = tr αu_2′βu_1′ = (u_2′β)(u_1′α) = 0 from the side conditions on α and β. The remaining relations M_1 ⊥ M_2 and M_1 ⊥ M_3 are verified similarly. For (iii) consider
(I_m ⊗ A)(μu_1u_2′ + αu_2′ + u_1β′) = μu_1(Au_2)′ + α(Au_2)′ + u_1(Aβ)′ = γμu_1u_2′ + γαu_2′ + δu_1β′ ∈ M,
where the relations Pu_2 = u_2 and Qβ = β when u_2′β = 0 have been used. This shows that M is invariant under each I_m ⊗ A. It is now readily verified that μ̂ = Ȳ.., α̂_i = Ȳ_i. − Ȳ.., and β̂_j = Ȳ.j − Ȳ... For (iv), first note that the subspace ω = {x | x ∈ M, α = 0} defined by H_0 is invariant under each I_m ⊗ A. Obviously, ω = M_1 ⊕ M_3. Consider the group whose elements are g = (c, E, b) where c is a positive scalar, b ∈ M_1 ⊕ M_3, and E is an orthogonal transformation with invariant subspaces M_2, M_1 ⊕ M_3, and M^⊥. The testing problem is invariant under x → cEx + b and a maximal invariant is W (up to a set of measure zero). Since W has a noncentral F-distribution, the test that rejects for large values of W is UMP invariant.
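The orthogonality relations and closed-form estimators in Problem 6 can be illustrated numerically. The sketch below (NumPy, with an arbitrary data matrix; the estimators automatically satisfy the side conditions Σ_i α̂_i = Σ_j β̂_j = 0) checks that the three fitted pieces are mutually orthogonal:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 6
Y = rng.standard_normal((m, n))

mu_hat = Y.mean()
alpha_hat = Y.mean(axis=1) - mu_hat     # row effects; sum to 0
beta_hat = Y.mean(axis=0) - mu_hat      # column effects; sum to 0

u1, u2 = np.ones(m), np.ones(n)
M1 = mu_hat * np.outer(u1, u2)          # mu u1 u2'
M2 = np.outer(alpha_hat, u2)            # alpha u2'
M3 = np.outer(u1, beta_hat)             # u1 beta'

ip23 = float(np.sum(M2 * M3))           # <alpha u2', u1 beta'>
ip12 = float(np.sum(M1 * M2))
ip13 = float(np.sum(M1 * M3))
```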
7. (i) is clear. The column space of W is contained in the column space of Z and has dimension r. Let x_1, ..., x_r, x_{r+1}, ..., x_k, x_{k+1}, ..., x_n be an orthonormal basis for R^n such that span{x_1, ..., x_r} = column space of W and span{x_1, ..., x_k} = column space of Z. Also, let y_1, ..., y_p be any orthonormal basis for R^p. Then {x_i □ y_j | i = 1, ..., r, j = 1, ..., p} is a basis for ℛ(P_W ⊗ I_p), which has dimension rp. Obviously, ℛ(P_W ⊗ I_p) ⊆ M. Consider x ∈ ω so x = ZB with RB = 0. Thus (P_W ⊗ I_p)x = P_WZB = W(W′W)^{-1}W′ZB = W(W′W)^{-1}R(Z′Z)^{-1}(Z′Z)B = W(W′W)^{-1}RB = 0. Thus ℛ(P_W ⊗ I_p) ⊥ ω, which implies ℛ(P_W ⊗ I_p) ⊆ ω^⊥. Hence ℛ(P_W ⊗ I_p) ⊆ M ∩ ω^⊥. That dim ω = (k − r)p can be shown by a reduction to canonical form as was done in Section 9.1. Since ω ⊆ M, dim(M ∩ ω^⊥) = dim M − dim ω = rp, which entails ℛ(P_W ⊗ I_p) = M ∩ ω^⊥. Hence P_Z ⊗ I_p − P_W ⊗ I_p is the orthogonal projection onto ω.
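Problem 7's key computations, that P_W annihilates ω while P_Z − P_W is a projection fixing ω, can be checked numerically. A NumPy sketch (dimensions are illustrative, and B is constructed to satisfy RB = 0):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, r, p = 10, 5, 2, 3
Z = rng.standard_normal((n, k))
R = rng.standard_normal((r, k))

# B with RB = 0: remove the component of B in the row space of R
B0 = rng.standard_normal((k, p))
B = B0 - R.T @ np.linalg.solve(R @ R.T, R @ B0)

W = Z @ np.linalg.solve(Z.T @ Z, R.T)           # W = Z (Z'Z)^{-1} R'
PW = W @ np.linalg.solve(W.T @ W, W.T)          # projection onto col(W)
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)          # projection onto col(Z)

killed = np.allclose(PW @ Z @ B, 0.0)           # P_W Z B = 0 when RB = 0
proj = PZ - PW
idem = np.allclose(proj @ proj, proj)           # P_Z - P_W is a projection
fixes = np.allclose(proj @ Z @ B, Z @ B)        # and it leaves omega fixed
```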
8. Use the fact that Γ′ΣΓ is diagonal with diagonal entries a_1, a_2, a_3, a_3, a_2 (see Proposition 9.13 ff.) so the maximum likelihood estimators â_1, â_2,
and â_3 are easy to find: just transform the data by Γ. Let D̂ have diagonal entries â_1, â_2, â_3, â_3, â_2 so Σ̂ = ΓD̂Γ′ gives the maximum likelihood estimators of Σ, ρ_1, and ρ_2.
9. Do the problem in the complex domain first to show that if Z_1, ..., Z_n are i.i.d. CN(0, 2H), then Ĥ = (1/2n) Σ_j Z_jZ̄_j′. But if Z_j = U_j + iV_j and X_j is the real 2p-vector formed from U_j and V_j, then
Ĥ = (1/2n) Σ_j (U_j + iV_j)(U_j − iV_j)′ = (1/2n)[(S_{11} + S_{22}) + i(S_{21} − S_{12})].
This gives the desired result.
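The block identity used in Problem 9, Σ_j Z_jZ̄_j′ = (S_{11} + S_{22}) + i(S_{21} − S_{12}), can be verified numerically. A NumPy sketch with arbitrary real parts U and V (columns are the observations):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 3, 20
U = rng.standard_normal((p, n))
V = rng.standard_normal((p, n))
Z = U + 1j * V

H_hat = (Z @ Z.conj().T) / (2 * n)              # (1/2n) sum_j Z_j Zbar_j'

S11 = U @ U.T; S22 = V @ V.T                    # blocks of sum_j X_j X_j'
S12 = U @ V.T; S21 = V @ U.T
H_blocks = ((S11 + S22) + 1j * (S21 - S12)) / (2 * n)

same = np.allclose(H_hat, H_blocks)
```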
10. Write R = M(I_r 0)Γ where M is r × r of rank r and Γ ∈ O_k. With Θ = ΓB, the null hypothesis is (I_r 0)Θ = 0. Now, transform the data by Γ and proceed with the analysis as in the first testing problem considered in Section 9.6.
11. First write P_Z = P_1 + P_2 where P_1 is the orthogonal projection onto span{e} and P_2 is the orthogonal projection onto (column space of Z) ∩ (span{e})^⊥. Thus P_M = P_1 ⊗ I_p + P_2 ⊗ I_p. Also, write A(ρ) = γP_1 + δQ_1 where γ = 1 + (n − 1)ρ, δ = 1 − ρ, and Q_1 = I_n − P_1. The relations P_1P_2 = 0 = Q_1P_1 and P_2Q_1 = Q_1P_2 = P_2 show that M is invariant under A(ρ) ⊗ Σ for each value of ρ and Σ. Write ZB = eb_1′ + Σ_{j=2}^k z_jb_j′ so Q_1Y is N(Σ_{j=2}^k (Q_1z_j)b_j′, (Q_1A(ρ)Q_1) ⊗ Σ). Now, Q_1A(ρ)Q_1 = δQ_1 so Q_1Y is N(Σ_{j=2}^k (Q_1z_j)b_j′, δQ_1 ⊗ Σ). Also, P_1Y is N(eb_1′, γP_1 ⊗ Σ). Since hypotheses of the form RB = 0 involve only b_2, ..., b_k, an invariance argument shows that invariant tests of H_0 will not involve P_1Y, so just ignore P_1Y. But the model for Q_1Y is of the MANOVA type; change coordinates so
Now, the null hypothesis is of the type discussed in Section 9.1.
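The decomposition A(ρ) = γP_1 + δQ_1 and the reduction Q_1A(ρ)Q_1 = δQ_1 can be checked numerically, assuming (as the values γ = 1 + (n − 1)ρ, δ = 1 − ρ suggest) that A(ρ) is the intraclass matrix (1 − ρ)I_n + ρee′:

```python
import numpy as np

n = 5
rho = 0.3
e = np.ones((n, 1))
P1 = e @ e.T / n                # projection onto span{e}
Q1 = np.eye(n) - P1

A = (1 - rho) * np.eye(n) + rho * (e @ e.T)   # assumed intraclass form of A(rho)
gamma = 1 + (n - 1) * rho
delta = 1 - rho

decomp = np.allclose(A, gamma * P1 + delta * Q1)   # A(rho) = gamma P1 + delta Q1
reduced = np.allclose(Q1 @ A @ Q1, delta * Q1)     # Q1 A(rho) Q1 = delta Q1
```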
CHAPTER 10
1. Part (i) is clear since the number of nonzero canonical correlations is always the rank of Σ_{12} in the partitioned covariance of (X, Y). For (ii), write
Cov(X, Y) = (Σ_{11}  Σ_{12}; Σ_{21}  Σ_{22})
where Σ_{12} has rank t, and Σ_{11} > 0, Σ_{22} > 0. First, consider the case when q ≤ r, Σ_{11} = I_q, Σ_{22} = I_r, and
Σ_{12} = (D  0; 0  0),
where D > 0 is t × t and diagonal. Set
A = (D^{1/2}; 0): q × t,    B = (D^{1/2}; 0): r × t
so AB′ = Σ_{12}. Now, set Λ_{11} = I_q − AA′, Λ_{22} = I_r − BB′, and the problem is solved for this case. The general case is solved by using Proposition 5.7 to reduce the problem to the case above.
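The construction in Problem 1 can be checked numerically: with Σ_{11} = I_q, Σ_{22} = I_r, and Σ_{12} = AB′ as above, the nonzero canonical correlations should be exactly the diagonal of D (this reading of the construction is an assumption; q, r, t below are illustrative):

```python
import numpy as np

q, r, t = 4, 5, 2
d = np.array([0.9, 0.4])                 # intended nonzero canonical correlations
D = np.diag(d)

A = np.vstack([np.sqrt(D), np.zeros((q - t, t))])   # q x t, top block D^{1/2}
B = np.vstack([np.sqrt(D), np.zeros((r - t, t))])   # r x t, top block D^{1/2}
Sigma12 = A @ B.T                                    # has D as its top-left block

# with identity marginal covariances, the canonical correlations are the
# singular values of Sigma12
sv = np.linalg.svd(Sigma12, compute_uv=False)
cors = sv[:t]
```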
2. That Σ_{12} = δe_1e_2′ for some δ ∈ R^1 is clear, and hence Σ_{12} has rank one, so there is at most one nonzero canonical correlation. It is the square root of the largest eigenvalue of Σ_{11}^{-1}Σ_{12}Σ_{22}^{-1}Σ_{21} = δ^2 Σ_{11}^{-1}e_1e_2′Σ_{22}^{-1}e_2e_1′. The only (possibly) nonzero eigenvalue is δ^2(e_1′Σ_{11}^{-1}e_1)(e_2′Σ_{22}^{-1}e_2). To describe canonical coordinates, let
ξ_1 = Σ_{11}^{-1/2}e_1/‖Σ_{11}^{-1/2}e_1‖,    η_1 = Σ_{22}^{-1/2}e_2/‖Σ_{22}^{-1/2}e_2‖,
and then form orthonormal bases {ξ_1, ξ_2, ..., ξ_q} and {η_1, ..., η_r} for R^q and R^r. Now, set v_i = Σ_{11}^{-1/2}ξ_i and w_j = Σ_{22}^{-1/2}η_j for i = 1, ..., q, j = 1, ..., r. Then verify that X_i = v_i′X and Y_j = w_j′Y form a set of canonical coordinates for X and Y.
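The eigenvalue formula in Problem 2 can be checked numerically; in the NumPy sketch below, Σ_{11}, Σ_{22}, e_1, e_2, and δ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
q, r = 4, 3
delta = 0.2
e1 = rng.standard_normal(q)
e2 = rng.standard_normal(r)

G1 = rng.standard_normal((q, q)); Sigma11 = G1 @ G1.T + q * np.eye(q)   # > 0
G2 = rng.standard_normal((r, r)); Sigma22 = G2 @ G2.T + r * np.eye(r)   # > 0
Sigma12 = delta * np.outer(e1, e2)                                       # rank one

inv = np.linalg.inv
prod = inv(Sigma11) @ Sigma12 @ inv(Sigma22) @ Sigma12.T
top = np.max(np.linalg.eigvals(prod).real)          # its one nonzero eigenvalue

claimed = delta**2 * (e1 @ inv(Sigma11) @ e1) * (e2 @ inv(Sigma22) @ e2)
```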
3. Part (i) follows immediately from Proposition 10.4 and the form of the covariance for (X, Y). That δ(B) = tr Λ(I − Q(B)) is clear and the minimization of δ(B) follows from Proposition 1.44. To describe B̂, let Ψ: p × t have columns a_1, ..., a_t so Ψ′Ψ = I_t and Q = ΨΨ′. Then show directly that B̂ = Ψ′Σ^{1/2} is the minimizer, so that Q(B̂) = Q determines the best predictor. (iii) is an immediate application of (ii).
4. Part (i) is easy. For (ii), with u_i = x_i − a_0,
Δ(M, a_0) = Σ_{i=1}^n ‖x_i − (P(x_i − a_0) + a_0)‖^2 = Σ_{i=1}^n ‖u_i − Pu_i‖^2
= Σ_{i=1}^n ‖Qu_i‖^2 = Σ_{i=1}^n tr Qu_iu_i′ = tr Q(Σ_{i=1}^n u_iu_i′) = tr S(a_0)Q.
Since S(a_0) = S(x̄) + n(x̄ − a_0)(x̄ − a_0)′, (ii) follows. (iii) is an application of Proposition 1.44.
6. Part (i) follows from the singular value decomposition. For (ii), {x | x = ΨC, C: k × n} is a linear subspace and the orthogonal projection onto this subspace is (ΨΨ′) ⊗ I. Thus the closest point to A is ((ΨΨ′) ⊗ I)A = ΨΨ′A, and the C that achieves the minimum is C = Ψ′A. For B of rank at most k, write B = ΨC as in (i). Then
‖A − B‖^2 ≥ inf_Ψ inf_C ‖A − ΨC‖^2 = inf_Ψ ‖A − ΨΨ′A‖^2 = inf_Q ‖QA‖^2.
The last equality follows as each Ψ determines a Q = I − ΨΨ′ and conversely. Since ‖QA‖^2 = tr QA(QA)′ = tr QAA′Q = tr QAA′,
‖A − B‖^2 ≥ inf_Q tr QAA′.
Writing A = Σ_{i=1}^p λ_i^{1/2} u_iv_i′ (the singular value decomposition for A), AA′ = Σ_i λ_i u_iu_i′ is a spectral decomposition for AA′. Using Proposition 1.44, it follows easily that
inf_Q tr QAA′ = Σ_{i=k+1}^p λ_i.
That B̂ achieves the infimum is a routine calculation.
7. From Proposition 10.8, the density of W is
h(w|θ) = ∫ p_{n−2}(w|θu^{1/2}) f(u) du
where p_{n−2} is the density of a noncentral t-distribution and f is the density of a chi-squared distribution. For θ > 0, set v = θu^{1/2} so
h(w|θ) = (2/θ^2) ∫ p_{n−2}(w|v) f(v^2/θ^2) v dv.
Since p_{n−2}(w|v) has a monotone likelihood ratio in w and v and f(v^2/θ^2) has a monotone likelihood ratio in v and θ, Karlin's Lemma implies that h(w|θ) has a monotone likelihood ratio. For θ < 0, set v = −θu^{1/2}, change variables, and use Karlin's Lemma again. The last assertion is clear.
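Problem 6's conclusion is the Eckart–Young theorem: the squared error of the best rank-k approximation equals the sum of the p − k smallest λ_i, where the λ_i are the squared singular values of A. A NumPy sketch, with the truncated singular value decomposition playing the role of B̂:

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, k = 5, 8, 2
A = rng.standard_normal((p, n))

U, sv, Vt = np.linalg.svd(A, full_matrices=False)
B_hat = U[:, :k] @ np.diag(sv[:k]) @ Vt[:k, :]   # truncated SVD, rank k

err = float(np.sum((A - B_hat) ** 2))            # ||A - B_hat||^2
tail = float(np.sum(sv[k:] ** 2))                # sum of the p - k smallest lambda_i
```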
8. For U^2 fixed, the conditional distribution of W given U^2 can be described as the ratio of two independent random variables: the numerator has a χ^2_{r+2K} distribution (given K) where K is Poisson with parameter λ/2 and λ = ρ^2(1 − ρ^2)^{-1}U^2, and the denominator is a χ^2_{n−r−1}. Hence, given U^2, this ratio is F_{r+2K, n−r−1} with K described above, so the conditional density of W is
f_1(w|ρ, U^2) = Σ_k f_{r+2k, n−r−1}(w) ψ(k|λ/2)
where ψ(k|λ/2) is the Poisson probability function. Integrating out U^2 gives the unconditional density of W (at ρ). Thus it must be shown that E_{U^2} ψ(k|λ/2) = h(k|ρ); this is a calculation. That f(t|ρ) has a monotone likelihood ratio is a direct application of Karlin's Lemma.
9. Let M be the range of P. Each R ∈ ℛ_s can be represented as R = ΨΨ′ where Ψ is n × s, Ψ′Ψ = I_s, and PΨ = 0. In other words, R corresponds to orthonormal vectors ψ_1, ..., ψ_s (the columns of Ψ) and these vectors are in M^⊥ (of course, these vectors are not unique). But given any two such sets, say ψ_1, ..., ψ_s and δ_1, ..., δ_s, there is a Γ ∈ O(P) such that Γψ_i = δ_i, i = 1, ..., s. This shows O(P) is compact and acts transitively on ℛ_s, so there is a unique O(P)-invariant probability distribution on ℛ_s. For (iii), ΓR_0Γ′ has an O(P)-invariant distribution on ℛ_s; uniqueness does the rest.
10. For (i), use Proposition 7.3 to write Z = ΨU with probability one where Ψ and U are independent, Ψ is uniform on F_{p,n}, and U ∈ G_U^+. Thus with probability one, rank(QZ) = rank(QΨ). Let S > 0 be independent of Ψ with ℒ(S^2) = W(I_p, p, n) so S has rank p with probability one. Thus rank(QΨ) = rank(QΨS) with probability one. But ΨS is N(0, I_n ⊗ I_p), which implies that QΨS has rank p. Part (ii) is a direct application of Problem 9.
12. That Ψ is uniform follows from the uniformity of Γ on O_n. For (ii), ℒ(Ψ) = ℒ(Z(Z′Z)^{-1/2}) and Δ = (I_k 0)Ψ implies that ℒ(Δ) = ℒ(X(X′X + Y′Y)^{-1/2}). (iii) is immediate from Problem 11, and (iv) is an application of Proposition 7.6. For (v), it suffices to show that ∫f(x)P_1(dx) = ∫f(x)P_2(dx) for all bounded measurable f. The invariance of P_i implies that for i = 1, 2, ∫f(x)P_i(dx) = ∫f(gx)P_i(dx), g ∈ G. Let ν be the uniform probability measure on G and integrate the above to get ∫f(x)P_i(dx) = ∫(∫_G f(gx)ν(dg))P_i(dx). But the function x → ∫_G f(gx)ν(dg) is G-invariant and so can be written k(τ(x)), where τ is a maximal invariant. Since P_1(τ^{-1}(C)) = P_2(τ^{-1}(C)) for all measurable C, we have ∫k(τ(x))P_1(dx) = ∫k(τ(x))P_2(dx) for all bounded
measurable k. Putting things together, we have ∫f(x)P_1(dx) = ∫k(τ(x))P_1(dx) = ∫k(τ(x))P_2(dx) = ∫f(x)P_2(dx), so P_1 = P_2. Part (vi) is immediate from (v).
13. For (i), argue as in Example 4.4:
tr(Z − TB)Σ^{-1}(Z − TB)′
= tr(Z − TB̂ + T(B̂ − B))Σ^{-1}(Z − TB̂ + T(B̂ − B))′
= tr(QZ + T(B̂ − B))Σ^{-1}(QZ + T(B̂ − B))′
= tr(QZ)Σ^{-1}(QZ)′ + tr T(B̂ − B)Σ^{-1}(B̂ − B)′T′
≥ tr(QZ)Σ^{-1}(QZ)′ = tr Z′QZΣ^{-1}.
The third equality follows from the relation QT = 0 as in the normal case. Since h is nonincreasing, this shows that for each Σ > 0,
sup_B f(Z|B, Σ) = f(Z|B̂, Σ),
and it is obvious that f(Z|B̂, Σ) = |Σ|^{-n/2} h(tr SΣ^{-1}). For (ii), first note that S > 0 with probability one. Then, for S > 0,
sup_{H_0 ∪ H_1} f(Z|B, Σ) = sup_{Σ > 0} f(Z|B̂, Σ) = sup_{Σ > 0} |Σ|^{-n/2} h(tr SΣ^{-1}) = |S|^{-n/2} sup_{C > 0} |C|^{n/2} h(tr C).
Under H_0, we have
sup_{H_0} f(Z|B, Σ) = sup_{Σ_{ii} > 0, i = 1, 2} |Σ_{11}|^{-n/2}|Σ_{22}|^{-n/2} h(tr Σ_{11}^{-1}S_{11} + tr Σ_{22}^{-1}S_{22})
= |S_{11}|^{-n/2}|S_{22}|^{-n/2} sup_{C_{ii} > 0, i = 1, 2} |C_{11}|^{n/2}|C_{22}|^{n/2} h(tr C_{11} + tr C_{22}).
This latter sup is bounded above by
k_1 ≡ sup_{C > 0} |C|^{n/2} h(tr C),
which is finite by assumption. Hence the likelihood ratio test rejects for small values of k_1|S_{11}|^{-n/2}|S_{22}|^{-n/2}|S|^{n/2}, which is equivalent to rejecting for small values of Λ(Z). The identity of part (iii) follows from the equations relating the blocks of Σ to the blocks of Σ^{-1}. Partition B into B_1: k × q and B_2: k × r so EX = TB_1 and EY = TB_2. Apply the identity with U = X − TB_1 and V = Y − TB_2 to give
f(Z|B, Σ) = |Σ_{22·1}|^{-n/2}|Σ_{11}|^{-n/2}
× h[tr(Y − TB_2 − (X − TB_1)Σ_{11}^{-1}Σ_{12})Σ_{22·1}^{-1}(Y − TB_2 − (X − TB_1)Σ_{11}^{-1}Σ_{12})′
+ tr(X − TB_1)Σ_{11}^{-1}(X − TB_1)′].
Using the notation of Section 10.5, write
f(X, Y|B, Σ) = |Σ_{11}|^{-n/2}|Σ_{22·1}|^{-n/2}
× h[tr(Y − WC)Σ_{22·1}^{-1}(Y − WC)′ + tr(X − TB_1)Σ_{11}^{-1}(X − TB_1)′].
Hence the conditional density of Y given X is
f_1(Y|C, B_1, Σ_{22·1}, Σ_{11}, X) = |Σ_{22·1}|^{-n/2} h(tr(Y − WC)Σ_{22·1}^{-1}(Y − WC)′ + η)/p(η),
where η = tr(X − TB_1)Σ_{11}^{-1}(X − TB_1)′ and p(η) = ∫ h(tr uu′ + η) du. For (iv), argue as in (ii) and use the identities established in Proposition 10.17. Part (v) is easy, given the results of (iv): just note that the sup over Σ_{11} and B_1 is equal to the sup over η > 0. Part (vi) is interesting; Proposition 10.13 is not applicable. Fix X, B_1, and Σ_{11} and note that under H_0, the conditional density of Y is
f_2(Y|C_2, Σ_{22·1}, η) = |Σ_{22·1}|^{-n/2} h(tr(Y − TC_2)Σ_{22·1}^{-1}(Y − TC_2)′ + η)/p(η).
This shows that Y has the same distribution (conditionally) as Ỹ =
TC_2 + EΣ_{22·1}^{1/2}, where the n × r matrix E has density h(tr ee′ + η)/p(η). Note that ℒ(ΓEΔ) = ℒ(E) for all Γ ∈ O_n and Δ ∈ O_r. Let t = min(q, r) and, given any n × n matrix A with real eigenvalues, let λ(A) be the vector of the t largest eigenvalues of A. Thus the squares of the sample canonical correlations are the elements of the vector λ(R_YR_X) where R_Y = (QY)(Y′QY)^{-1}(QY)′ and R_X = (QX)(X′QX)^{-1}(QX)′, since
S = (X′QX  X′QY; Y′QX  Y′QY).
(You may want to look at the discussion preceding Proposition 10.5.) Now, we use Problem 9 and the notation there; P = I − Q. First, R_Y ∈ ℛ_r, R_X ∈ ℛ_q, and O(P) acts transitively on ℛ_r and ℛ_q. Under H_0 (and X fixed), ℒ(QY) = ℒ(QEΣ_{22·1}^{1/2}), which implies that ℒ(ΓR_YΓ′) = ℒ(R_Y), Γ ∈ O(P). Hence R_Y is uniform on ℛ_r for each X. Fix R_0 ∈ ℛ_q and choose Γ_0 ∈ O(P) so that Γ_0R_0Γ_0′ = R_X. Then, for each X,
ℒ(λ(R_YR_X)) = ℒ(λ(R_YΓ_0R_0Γ_0′)) = ℒ(λ(Γ_0′R_YΓ_0R_0)) = ℒ(λ(R_YR_0)).
This shows that for each X, λ(R_YR_X) has the same distribution as λ(R_YR_0) for R_0 fixed, where R_Y is uniform on ℛ_r. Since the distribution of λ(R_YR_0) does not depend on X and agrees with what we get in the normal case, the solution is complete.
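The representation of the squared sample canonical correlations as eigenvalues of the product R_YR_X of two projections can be checked numerically. The sketch below (NumPy, arbitrary data; Q is the projection complementary to the column space of T) compares them with the standard form, the eigenvalues of (Y′QY)^{-1}Y′QX(X′QX)^{-1}X′QY:

```python
import numpy as np

rng = np.random.default_rng(8)
n, q, r, k = 20, 3, 4, 2
T = rng.standard_normal((n, k))
Q = np.eye(n) - T @ np.linalg.solve(T.T @ T, T.T)   # Q = I - P_T

X = rng.standard_normal((n, q))
Y = rng.standard_normal((n, r))

def proj(M):
    # orthogonal projection onto the column space of M
    return M @ np.linalg.solve(M.T @ M, M.T)

RX = proj(Q @ X)
RY = proj(Q @ Y)

t = min(q, r)
ev_proj = np.sort(np.linalg.eigvals(RY @ RX).real)[::-1][:t]

# standard form of the squared canonical correlations
M = np.linalg.solve(Y.T @ Q @ Y, Y.T @ Q @ X) @ np.linalg.solve(X.T @ Q @ X, X.T @ Q @ Y)
ev_std = np.sort(np.linalg.eigvals(M).real)[::-1][:t]
```

The agreement follows from the cyclic invariance of nonzero eigenvalues and the idempotence of Q.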
BIBLIOGRAPHY
Anderson, T. W. (1946). The noncentral Wishart distribution and certain problems of multi
variate statistics. Ann. Math. Stat., 17, 409-431.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Anderson, T. W., S. Das Gupta, and G. H. P. Styan (1972). A Bibliography of Multivariate
Analysis. Oliver and Boyd, Edinburgh.
Andersson, S. A. (1975). Invariant normal models. Ann. Stat., 3, 132-154.
Andersson, S. A. (1982). Distributions of maximal invariants using quotient measures. Ann.
Stat., 10, 955-961.
Arnold, S. F. (1973). Applications of the theory of products of problems to certain patterned covariance matrices. Ann. Stat., 1, 682-699.
Arnold, S. F. (1979). A coordinate free approach to finding optimal procedures for repeated measures designs. Ann. Stat., 7, 812-822.
Arnold, S. F. (1981). The Theory of Linear Models and Multivariate Analysis. Wiley, New York.
Basu, D. (1955). On statistics independent of a complete sufficient statistic. Sankhya, 15, 377-380.
Billingsley, P. (1979). Probability and Measure. Wiley, New York.
Björck, A. (1967). Solving linear least squares problems by Gram-Schmidt orthogonalization.
BIT, 7, 1-21.
Blackwell, D. and M. Girshick (1954). Theory of Games and Statistical Decision Functions.
Wiley, New York.
Bondesson, L. (1977). A note on sufficiency and independence. Preprint, University of Lund,
Lund, Sweden.
Box, G. E. P. (1949). A general distribution theory for a class of likelihood criteria. Biometrika,
36, 317-346.
Chung, K. L. (1974). A Course in Probability Theory, second edition. Academic Press, New
York.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton, N.J.
Das Gupta, S. (1979). A note on ancillarity and independence via measure-preserving transfor
mations. Sankhya, 41, Series A, 117-123.
Dawid, A. P. (1978). Spherical matrix distributions and a multivariate model. J. Roy. Stat. Soc.
B, 39, 254-261.
Dawid, A. P. (1981). Some matrix-variate distribution theory: Notational considerations and a
Bayesian application. Biometrika, 68, 265-274.
Deemer, W. L. and I. Olkin (1951). The Jacobians of certain matrix transformations useful in
multivariate analysis. Biometrika, 38, 345-367.
Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Addison-Wesley, Read
ing, Mass.
Eaton, M. L. (1970). Gauss-Markov estimation for multivariate linear models: A coordinate
free approach. Ann. Math. Stat., 41, 528-538.
Eaton, M. L. (1972). Multivariate Statistical Analysis. Institute of Mathematical Statistics,
University of Copenhagen, Copenhagen, Denmark.
Eaton, M. L. (1978). A note on the Gauss-Markov Theorem. Ann. Inst. Stat. Math., 30, 181-184.
Eaton, M. L. (1981). On the projections of isotropic distributions. Ann. Stat., 9, 391-400.
Eaton, M. L. and T. Kariya (1981). On a general condition for null robustness. University of
Minnesota Technical Report No. 388, Minneapolis.
Eckart, C. and G. Young (1936). The approximation of one matrix by another of lower rank.
Psychometrika, 1, 211-218.
Farrell, R. H. (1962). Representations of invariant measures. Ill. J. Math., 6, 447-467.
Farrell, R. H. (1976). Techniques of Multivariate Calculation. Lecture Notes in Mathematics
#520. Springer-Verlag, Berlin.
Giri, N. (1964). On the likelihood ratio test of a normal multivariate testing problem. Ann.
Math. Stat., 35, 181-190.
Giri, N. (1965a). On the complex analogues of T² and R² tests. Ann. Math. Stat., 36, 664-670.
Giri, N. (1965b). On the likelihood ratio test of a multivariate testing problem, II. Ann. Math.
Stat., 36, 1061-1065.
Giri, N. (1975). Invariance and Minimax Statistical Tests. Hindustan Publishing Corporation,
Delhi, India.
Giri, N. C. (1977). Multivariate Statistical Inference. Academic Press, New York.
Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations.
Wiley, New York.
Goodman, N. R. (1963). Statistical analysis based on a certain multivariate complex Gaussian
distribution (An introduction). Ann. Math. Stat., 34, 152-177.
Hall, W. J., R. A. Wijsman, and J. K. Ghosh (1965). The relationship between sufficiency and
invariance with applications in sequential analysis. Ann. Math. Stat., 36, 575-614.
Halmos, P. R. (1950). Measure Theory. D. Van Nostrand Company, Princeton, N.J.
Halmos, P. R. (1958). Finite Dimensional Vector Spaces. Undergraduate Texts in Mathematics,
Springer-Verlag, New York.
Hoffman, K. and R. Kunze (1971). Linear Algebra, second edition. Prentice Hall, Englewood
Cliffs, N.J.
Hotelling, H. (1931). The generalization of Student's ratio. Ann. Math. Stat., 2, 360-378.
Hotelling, H. (1935). The most predictable criterion. J. Educ. Psych., 26, 139-142.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321-377.
Hotelling, H. (1947). Multivariate quality control, illustrated by the air testing of sample
bombsights, in Techniques of Statistical Analysis. McGraw-Hill, New York, pp. 111-184.
James, A. T. (1954). Normal multivariate analysis and the orthogonal group. Ann. Math. Stat.,
25, 40-75.
Kariya, T. (1978). The general MANOVA problem. Ann. Stat., 6, 200-214.
Karlin, S. (1956). Decision theory for Polya-type distributions. Case of two actions, I, in Proc.
Third Berkeley Symp. Math. Stat. Prob., Vol. 1. University of California Press, Berkeley,
pp. 115-129.
Karlin, S. and H. Rubin (1956). The theory of decision procedures for distributions with
monotone likelihood ratio. Ann. Math. Stat., 27, 272-299.
Kiefer, J. (1957). Invariance, minimax sequential estimation, and continuous time processes.
Ann. Math. Stat., 28, 573-601.
Kiefer, J. (1966). Multivariate optimality results. In Multivariate Analysis, edited by P. R.
Krishnaiah. Academic Press, New York.
Kruskal, W. (1961). The coordinate free approach to Gauss-Markov estimation and its
application to missing and extra observations, in Proc. Fourth Berkeley Symp. Math. Stat.
Prob., Vol. 1. University of California Press, Berkeley, pp. 435-461.
Kruskal, W. (1968). When are Gauss-Markov and least squares estimators identical? A
co-ordinate free approach. Ann. Math. Stat., 39, 70-75.
Kshirsagar, A. M. (1972). Multivariate Analysis. Marcel Dekker, New York.
Lang, S. (1969). Analysis II. Addison-Wesley, Reading, Massachusetts.
Lawley, D. N. (1938). A generalization of Fisher's z-test. Biometrika, 30, 180-187.
Lehmann, E. L. (1959). Testing Statistical Hypotheses. Wiley, New York.
Mallows, C. L. (1960). Latent vectors of random symmetric matrices. Biometrika, 48, 133-149.
Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis. Academic Press, New
York.
Muirhead, R. J. (1982). Aspects of Multivariate Statistical Theory. Wiley, New York.
Nachbin, L. (1965). The Haar Integral. D. Van Nostrand Company, Princeton, N.J.
Noble, B. and J. Daniel (1977). Applied Linear Algebra, second edition. Prentice Hall,
Englewood Cliffs, N.J.
Olkin, I. and S. J. Press (1969). Testing and estimation for a circular stationary model. Ann.
Math. Stat., 40, 1358-1373.
Olkin, I. and H. Rubin (1964). Multivariate beta distributions and independence properties of
the Wishart distribution. Ann. Math. Stat., 35, 261-269.
Olkin, I. and A. R. Sampson (1972). Jacobians of matrix transformations and induced
functional equations. Linear Algebra Appl., 5, 257-276.
Peisakoff, M. (1950). Transformation parameters. Thesis, Princeton University, Princeton, N.J.
Pillai, K. C. S. (1955). Some new test criteria in multivariate analysis. Ann. Math. Stat., 26, 117-121.
Pitman, E. J. G. (1939). The estimation of location and scale parameters of a continuous
population of any form. Biometrika, 30, 391-421.
Potthoff, R. F. and S. N. Roy (1964). A generalized multivariate analysis of variance model
useful especially for growth curve problems. Biometrika, 51, 313-326.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, second edition. Wiley, New
York.
Roy, S. N. (1953). On a heuristic method of test construction and its use in multivariate
analysis. Ann. Math. Stat., 24, 220-238.
Roy, S. N. (1957). Some Aspects of Multivariate Analysis. Wiley, New York.
Rudin, W. (1953). Principles of Mathematical Analysis. McGraw-Hill, New York.
Scheff?, H. (1959). The Analysis of Variance. Wiley, New York.
Segal, I. E. and Kunze, R. A. (1978). Integrals and Operators, second revised and enlarged edition. Springer-Verlag, New York.
Serre, J. P. (1977). Linear Representations of Finite Groups. Springer-Verlag, New York.
Srivastava, M. S. and C. G. Khatri (1979). An Introduction to Multivariate Statistics. North
Holland, Amsterdam.
Stein, C. (1956). Some problems in multivariate analysis, Part I. Stanford University Technical
Report No. 6, Stanford, Calif.
Wijsman, R. A. (1957). Random orthogonal transformations and their use in some classical
distribution problems in multivariate analysis. Ann. Math. Stat., 28, 415-423.
Wijsman, R. A. (1966). Cross-sections of orbits and their applications to densities of maximal
invariants, in Proc. Fifth Berkeley Symp. Math. Stat. Probl., Vol. 1. University of
California Press, Berkeley, pp. 389-400.
Wilks, S. S. (1932). Certain generalizations in the analysis of variance. Biometrika, 24, 471-494.
Wishart, J. (1928). The generalized product moment distribution in samples from a normal
multivariate population. Biometrika, 20, 32-52.
Index
Affine dependence:
invariance of, 405
measure of, 404, 418, 419
between random vectors, 403
Affinely equivalent, 404
Almost invariant function, 287
Ancillary, 285
Ancillary statistic, 465
Angles between subspaces:
definition, 61
geometric interpretation, 61
Action group, see Group
Beta distribution:
definition, 320
noncentral, 320
relation to F, 320
Beta random variables, products of, 236, 321, 323
Bilinear, 33
Bivariate correlation coefficient:
density has monotone likelihood ratio, 459
distribution of, 429, 432
Borel σ-algebra, 70
Borel measurable, 71
Canonical correlation coefficients:
as angles between subspaces, 408, 409
definition, 408
density of, 442
interpretation of sample, 421
as maximal correlations, 413
model interpretation, 456
and prediction, 418
population, 408
sample, as maximal invariant, 425
Canonical variates:
definition, 415
properties of, 415
Cauchy-Schwarz Inequality, 26
Characteristic function, 76
Characteristic polynomial, 44
Chi-square distribution:
definition, 109, 110
density, 110, 111
Compact group, invariant integral on, 207
Completeness:
bounded, 466
independence, sufficiency and, 466
Complex covariance structure:
discussion of, 381
example of, 370
Complex normal distribution:
definition, 374, 375
discussion of, 373
independence in, 378
relation to real normal distribution, 377
Complex random variables:
covariance of, 372
covariance matrix of, 375
mean of, 372
variance of, 372
Complex vector space, 39
Complex Wishart distribution, 378
Conditional distribution:
for normal variables, 116, 117
in normal random matrix, 118
Conjugate transpose, 39
Correlation coefficient, density of in normal sample, 329
Covariance: characterization of, 75
  of complex random variables, 372
  definition, 74
  of outer products, 96, 97
  partitioned, 86
  properties of, 74
  of random sample, 89
  between two random variables, 28
Covariance matrix, 73
Cyclic covariance: characterization of, 365
  definition, 362
  diagonalization of, 366
  multivariate, 368
Density, of maximal invariant, 272, 273-277
Density function, 72
Determinant: definition, 41
  properties of, 41
Determinant function: alternating, 40
  characterization of, 40
  definition, 39
  as n-linear function, 40
Direct product, 212
Distribution, induced, 71
Eigenvalue: and angles between subspaces, 61
  definition, 44
  of real linear transformations, 47
Eigenvector, 45
Equivariant, 218
Equivariant estimator, in simple linear model, 157
Equivariant function: description of, 249
  example of, 250
Error subspace, see Linear model
Estimation: Gauss-Markov Theorem, 134
  linear, 133
  of variance in linear model, 139
Expectation, 71
Factorization, see Matrix
F Distribution: definition, 320
  noncentral, 320
  relation to beta, 320
F test, in simple linear model, 155
Gauss-Markov estimator: definition, 135
  definition in general linear model, 146
  discussion of, 140-143
  equal to least squares, 145
  existence of, 147
  in k-sample problem, 148
  for linear functions, 136
  in MANOVA, 151
  in regression model, 135
  in weakly spherical linear model, 134
Generalized inverse, 87
Generalized variance: definition, 315
  example of, 298
Gram-Schmidt Orthogonalization, 15
Group: action, 186, 187
  affine, 187, 188
  commutative, 185
  definition, 185
  direct product, 212
  general linear, 186
  isotropy subgroup, 191
  lower triangular, 185, 186
  normal subgroup, 189, 190
  orthogonal, 185
  permutation matrices, 188, 189
  sign changes, 188, 189
  subgroup, 186
  topological, 195
  transitive, 191
  unimodular, 200
  upper triangular, 186
Hermitian matrix, 371
Homomorphism: definition, 218
  on lower triangular matrices, 230, 231
  on non-singular matrices, 230
Hotelling's T²: complex case of, 381
  as likelihood ratio statistic, 402
Hypothesis testing, invariance in, 263
Independence: in blocks, testing for, 446
  characterization of, 78
  completeness, sufficiency and, 466
  decomposition of test for, 449
  distribution of likelihood ratio test for, 447
  likelihood ratio test for, 444, 446
  MANOVA model and, 453, 454
  of normal variables, 106, 107
  of quadratic forms, 114
  of random vectors, 77
  regression model and, 451
  sample mean and sample covariance, 126, 127
  testing for, 443, 444
Inner product: definition, 14
  for linear transformations, 32
  norm defined by, 14
  standard, 15
Inner product space, 15
Integral: definition, 194
  left invariant, 195
  right invariant, 195
Intraclass covariance: characterization of, 355
  definition, 131, 356
  multivariate version of, 360
Invariance: in hypothesis testing, 263
  and independence, 289
  of likelihood ratio, 263
  in linear model, 156, 157, 256, 257
  in MANOVA model with block covariance, 353
  in MANOVA testing problem, 341
  of maximum likelihood estimators, 258
Invariance and independence, example of, 290-291, 292-295
Invariant densities: definition, 254
  example of, 255
Invariant distribution: example of, 282-283
  on n × p matrices, 235
  representation of, 280
Invariant function: definition, 242
  maximal invariant, 242
Invariant integral: on affine group, 202, 203
  on compact group, 207
  existence of, 196
  on homogeneous space, 208, 210
  on lower triangular matrices, 200
  on matrices (n × p of rank p), 213-218
  on m-frames, 210, 211
  and multiplier, 197
  on non-singular matrices, 199
  on positive definite matrices, 209, 210
  relatively left, 197
  relatively right, 197, 198
  uniqueness of, 196
  see also Integral
Invariant measure, on a vector space, 121-122
Invariant probability model, 251
Invariant subspace, 49
Inverse Wishart distribution: definition, 330
  properties of, 330
Isotropy subgroup, 191
Jacobian: definition, 166
  example of, 168, 169, 171, 172, 177
Kronecker product: definition, 34
  determinant of, 67
  properties of, 36, 68
  trace of, 67
Lawley-Hotelling trace test, 348
Least squares estimator: definition, 135
  equal to Gauss-Markov estimator, 145
  in k-sample problem, 148
  in MANOVA, 151
  in regression model, 135
Lebesgue measure, on a vector space, 121-125
Left homogeneous space: definition, 207
  relatively invariant integral on, 208-210
Left translate, 195
Likelihood ratio test: decomposition of in MANOVA, 349
  definition, 263
  in MANOVA model with block covariance, 351
  in MANOVA problem, 340
  in mean testing problem, 384, 390
Linear isometry: definition, 36
  properties of, 37
Linear model: error subspace, 133
  error vector, 133
  invariance in, 156, 157, 256, 257
  with normal errors, 137
  regression model, 132
  regression subspace, 133
  weakly spherical, 133
Linear transformation: adjoint of, 17, 29
  definition, 7
  eigenvalues, see Eigenvalues
  invertible, 9
  matrix of, 10
  non-negative definite, 18
  null space of, 9
  orthogonal, 18
  positive definite, 18
  range of, 9
  rank, 9
  rank one, 19
  self-adjoint, 18
  skew symmetric, 18
  transpose of, 17
  vector space of, 7
Locally compact, 194
MANOVA: definition, 150
  maximum likelihood estimator in, 151
  with normal errors, 151
MANOVA model: with block diagonal covariance, 350
  canonical form of, 339
  with cyclic covariance, 366
  description of, 336
  example of, 308
  and independence, 453, 454
  with intraclass covariance, 356
  maximum likelihood estimators in, 337
  under non-normality, 398
  with non-normal density, 462
  testing problem in, 337
MANOVA testing problem: canonical form of, 339
  complex case of, 379
  description of, 337
  with intraclass covariance structure, 359
  invariance in, 341
  likelihood ratio test in, 340, 347
  maximal invariant in, 342, 346
  maximal invariant parameter in, 344
  uniformly most powerful test in, 345
Matric t distribution: definition, 330
  properties of, 330
Matrix: definition, 10
  eigenvalue of, 44
  factorization, 160, 162, 163, 164
  lower triangular, 44, 159
  orthogonal, 25
  partitioned positive definite, 161, 162
  positive definite, 25
  product, 10
  skew symmetric, 25
  symmetric, 25
  upper triangular, 159
Maximal invariant: density of, 278-279
  example of, 242, 243, 246
  parameter, 268
  and product spaces, 246
  representing density of, 272
Maximum likelihood estimator: of covariance matrix, 261
  invariance of, 258
  in MANOVA model, 151
  in simple linear model, 138
Mean value, of random variable, 72
Mean vector: of coordinate random vector, 72
  definition, 72
  for outer products, 93
  properties of, 72
  of random sample, 89
M-frame, 38
Modulus, right hand, 196
Monotone likelihood ratio: for non-central chi-squared, 468, 469
  for non-central F, 469
  for non-central Student's t, 470
  and totally positive of order 2, 467
Multiple correlation coefficient:
  definition, 434
  distribution of, 434
Multiplier: on affine group, 204
  definition, 197
  and invariant integral, 197
  on lower triangular matrices, 201
  on non-singular matrices, 199
Multivariate beta distribution: definition, 331
  properties of, 331, 332
Multivariate F distribution: definition, 331
  properties of, 331
Maximal invariant, see Invariant function
Multivariate General Linear Model, see MANOVA
Non-central chi-squared distribution: definition, 110, 111
  for quadratic forms, 112
Noncentral Wishart distribution: covariance of, 317
  definition, 316
  density of, 317
  mean of, 317
  properties of, 316
  as quadratic form in normals, 318
Normal distribution: characteristic function of, 105, 106
  complex, see Complex normal distribution
  conditional distribution in, 116, 117
  covariance of, 105, 106
  definition, 104
  density of, 120-126
  density of normal matrix, 125
  existence of, 105, 106
  independence in, 106, 107
  mean of, 105, 106
  and non-central chi-square, 111
  and quadratic forms, 109
  relation to Wishart, 307
  representation of, 108
  for random matrix, 118
  scale mixture of, 129, 130
  sufficient statistic in, 126, 127, 131
  of symmetric matrices, 130
Normal equations, 155, 156
Orbit, 241
Order statistic, 276-277
Orthogonal: complement, 16
  decomposition, 17
  definition, 15
Orthogonal group, definition, 23
Orthogonally invariant distribution, 81
Orthogonally invariant function, 82
Orthogonal projection: characterization of, 21
  definition, 17
  in Gauss-Markov Theorem, 134
  random, 439, 440
Orthogonal transformation, characterization of, 22
Orthonormal: basis, 15
  set, 15
Outer product: definition, 19
  properties of, 19, 30
Parameter, maximal invariant, 268
Parameter set: definition, 146
  in linear models, 146
Parameter space, 252
Partitioning, a Wishart matrix, 310
Pillai trace test, 348
Pitman estimator, 264-267
Prediction: affine, 94
  and affine dependence, 416
Principal components: and closest flat, 457, 458
  definition, 457
  low rank approximation, 457
Probability model, invariant, 251
Projection: characterization of, 13
  definition, 12
Quadratic forms: independence of, 114, 115
  in normal variables, 109
Radon measure: definition, 194
  factorization of, 224
Random vector, 71
Regression: multivariate, 451
  and testing for independence, 451
Regression model, see Linear model
Regression subspace, see Linear model
Relatively invariant integral, see Invariant integral
Right translate, 195
Roy maximum root test, 348, 349
Sample correlation coefficient, as a maximal invariant, 268-271
Scale mixture of normals, 129, 130
Self-adjoint transformations, functions of, 52
Singular Value Decomposition Theorem, 58
Spectral Theorem: and positive definiteness, 51
  statement of, 50, 53
Spherical distributions, 84
Sufficiency: completeness, independence and, 466
  definition, 465
Symmetry model: definition, 361
  examples of, 361
Topological group, 195
Trace: of linear transformation, 47
  of matrix, 33
  sub-k, 56
Transitive group action, 191
Two-way layout, 155
Uncorrelated: characterization of, 98
  definition, 87
  random vectors, 88
Uniform distribution: on M-frames, 234
  on unit sphere, 101
Uniformly most powerful invariant test, in MANOVA problem, 345
Vector space: basis, 3
  complementary subspaces, 5
  definition, 2
  dimension, 4
  direct sum, 6
  dual space, 8
  finite dimensional, 3
  linearly dependent, 3
  linearly independent, 3
  linear manifold, 4
  subspace, 4
Weakly spherical: characterization of, 83
  definition, 83
  linear model, 133
Wishart constant, 175
Wishart density, 239, 240
Wishart distribution: characteristic function of, 305
  convolution of two, 306
  covariance of, 305
  definition, 303
  density of, 239, 240
  for nonintegral degrees of freedom, 329
  in MANOVA model, 308
  mean of, 305
  noncentral, see Noncentral Wishart distribution
  nonsingular, 304
  partitioned matrix and, 310
  of quadratic form, 307
  representation of, 303
  triangular decomposition of, 313, 314
Wishart matrix: distribution of partitioned, 311
  ratio of determinants of, 319