
CANONICAL VARIATE ANALYSIS: SOME PRACTICAL ASPECTS

by

Norman Albert Campbell

Thesis submitted for the degree of Doctor of Philosophy in the University of London and for the Diploma of Membership of the Imperial College


ABSTRACT

Techniques and guidelines are developed for more effective application of canonical variate analysis.

The influence function is used to develop criteria for the detection of atypical observations in discriminant analysis. For Mahalanobis D2, the influence function is a quadratic function of the discriminant score. The use of robust estimators of means and of covariances, in conjunction with probability plots of associated Mahalanobis distances, is shown to lead to enhanced detection of atypical observations.

Robust M-estimation for canonical variate analysis is developed, based on a functional relationship formulation. An alternative approach, based on M-estimation of the canonical variate scores, is also presented.

Graphically-oriented procedures for comparing within-groups covariance matrices are developed, using basic ideas from analysis of variance and regression. A multivariate comparison leading to graphical representation is also considered.

The role of shrunken estimation procedures in canonical variate analysis is examined. A marked improvement in the stability of the canonical vectors can be effected when directions of small between-groups variation coincide with directions of small within-groups variation.

A functional relationship model is used to develop methods for comparing canonical variate analyses for several independent sets of data. Criteria for examining the parallelism and coincidence of discriminant planes, and the dispersal of the means, are given.

The usual canonical variate analysis is generalized to the situation where the covariance matrices are not assumed to be equal. Three generalizations are developed, corresponding to different formulations of the usual approach.

Analyses of data from various fields are given throughout the thesis to illustrate the application of the approaches developed.


ACKNOWLEDGEMENTS

I would like to thank Professors D.R. Cox and M.J.R. Healy for their extensive contributions to an enjoyable and rewarding two years of study at Imperial College. Their advice and encouragement throughout the period, and their constructive comments on an earlier draft of the thesis and on papers arising therefrom, are greatly appreciated. I would also like to thank fellow students John Tomenson, Daryl Pregibon and Peter Rundell, who discussed various aspects of the work with me.

It is a pleasure to be able to acknowledge the collaboration of colleagues in zoology, botany and genetics. Problems arising during collaborative studies have led to the techniques proposed in this thesis. Bruce Phillips provided the initial stimulation with his data on geographic variation in whelks; my interest in multivariate analysis dates from this project. Stephen Hopper, Chris Green, Darrell Kitchener and John Dearn have spent many hours discussing the role of multivariate studies in biological problems, and improving my biological understanding of the areas of application. Collaboration with Cathy Campbell, Rod Mahon, Tony Watson and Lou Koch is also appreciated. William Atchley and Richard Reyment have encouraged me to develop improved techniques for analyzing multivariate data.

My thanks are also due to the Commonwealth Scientific and Industrial Research Organization, and in particular the Division of Mathematics and Statistics, for the CSIRO Divisional Postgraduate Studentship which made this period of study possible.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER 1  A general introduction
    1.1  Outline of the thesis
    1.2  Canonical variate analysis
    1.3  Canonical variate analysis - a functional relationship formulation
    1.4  Geometry of canonical variate analysis
    1.5  Computation of the canonical variate solution
    1.6  Adequacy of discriminant functions and subsets of variables

CHAPTER 2  Detection of atypical observations in discriminant analysis
    2.1  Influence function
    2.2  Influence function in discriminant analysis
        2.2.1  Influence function for Mahalanobis D2
        2.2.2  Influence function for discriminant means
        2.2.3  Influence function for the discriminant function coefficients
        2.2.4  Approximations and representation of the influence functions
    2.3  Probability plots to detect possible atypical values
    2.4  An example

CHAPTER 3  Robust procedures to examine variation within a group
    3.1  Introduction
    3.2  Robust estimation of multivariate location and scatter
    3.3  Robust principal components analysis
    3.4  Some practical examples
    3.5  Discussion

CHAPTER 4  Robust canonical variate analysis
    4.1  M-estimation of the canonical variate scores
    4.2  Robust M-estimation of the canonical vectors
    4.3  Some practical examples
    4.4  Discussion

CHAPTER 5  Graphical comparison of covariance matrices
    5.1  Introduction
    5.2  Graphical comparisons
        5.2.1  Individual-Average plot
        5.2.2  A multivariate comparison
        5.2.3  Orthogonalized variables
    5.3  Some examples
    5.4  Further practical aspects

CHAPTER 6  Shrunken estimators in canonical variate analysis
    6.1  Introduction
    6.2  Shrunken or ridge-type estimators in discriminant analysis
    6.3  Mean square error of shrunken estimators for discriminant analysis
    6.4  Shrunken estimators in canonical variate analysis
    6.5  Practical aspects
    6.6  Discussion

CHAPTER 7  Comparison of canonical variates
    7.1  Introduction
    7.2  Comparison of solutions
        7.2.1  Individual orientation and dispersal
        7.2.2  Common orientation, individual dispersal
        7.2.3  Common orientation and common dispersal
        7.2.4  Coincidence but individual dispersal
        7.2.5  Coincidence and common dispersal
        7.2.6  Common orientation, dispersal and position
        7.2.7  Likelihood ratio statistics
    7.3  An example
    7.4  Discussion of some practical aspects

CHAPTER 8  Canonical variate analysis with unequal covariance matrices
    8.1  Introduction
    8.2  Generalizations of the usual solution
        8.2.1  Weighted between-groups formulation
        8.2.2  Likelihood ratio formulation
        8.2.3  Functional relationship formulation
    8.3  Computation of the generalized solutions
    8.4  Performance of the generalizations when the covariance matrices are equal
    8.5  Comparison of solutions
    8.6  Practical application

REFERENCES

LIST OF TABLES

3.1  Extract of listing of Thais data
3.2  Stem-and-leaf plot for ratios of robust to usual variances for Thais data and for generated multivariate Gaussian data
3.3  Stem-and-leaf plot for ratios of robust to usual variances for scorpion data
4.1  Underlying means, standard deviations and correlations for generated data
4.2  Summary of robust M-estimation canonical variate analyses of generated data
4.3  Canonical roots and vectors for Dicathais data
4.4  Summary of non-unit weights from robust analyses of Dicathais data
4.5  Canonical roots and vectors for Thais data
4.6  Summary of non-unit weights from robust analyses of Thais data
5.1  Analysis of variance SSQ's from comparison of regressions calculations with row means as regressor variables
5.2  Variances and correlations for the grasshopper data
5.3  Fitted linear regressions and analysis of variance table for log variances for grasshopper data
5.4  Analysis of variance table for arctanh correlations for grasshopper data
5.5  Fitted linear regressions and analysis of variance table for log variances for Thais data
5.6  Group rankings for each row of I-A plot for arctanh correlations for Thais data
5.7  Fitted linear regressions and analysis of variance table for arctanh correlations for Thais data
5.8  Correlation coefficients for Thais data
6.1  Means, pooled standard deviations and correlations for Dicathais data
6.2  Eigenanalysis and canonical variate analyses for Dicathais data
6.3  Canonical roots and vectors for alternative shrunken estimator formulations for Dicathais data
7.1  Representation of comparisons of models of interest
7.2  Summary of main results for various models
7.3  Summary of grasshopper data
7.4  Canonical roots and vectors and determinants for various models for grasshopper data
8.1  Simulation results for comparison of generalized solutions with usual solution
8.2  Maximized log likelihoods for Afrobolivina data

LIST OF FIGURES

2.1  Plot of change in Mahalanobis D2 against discriminant score
2.2  Gamma probability plot of influence function values for D2
2.3  Gamma probability plot of influence function values for length of coefficient vector
3.1  Gaussian probability plots of cube root of Mahalanobis squared distances for group 3 of Thais data
3.2  Gaussian and gamma probability plots of Mahalanobis squared distances for group 8 of Thais data
4.1  Components of squared distance d2_km for robust canonical variate analysis
4.2  Canonical variate means for Dicathais data
4.3  Canonical variate means for Thais data
4.4  Gaussian and gamma probability plots of Mahalanobis squared distances for group 6 of Thais data
5.1  I-A, Q-Q and R-R plots for arctanh correlations for generated data
5.2  I-A, Q-Q and R-R plots for log variances for grasshopper data
5.3  M-S and C-V plots for log variances for grasshopper data
5.4  I-A plot for arctanh correlations for grasshopper data
5.5  C-V plot for arctanh correlations for grasshopper data
5.6  I-A and Q-Q plots for log variances for Thais data
5.7  M-S and C-V plots for log variances for Thais data
5.8  I-A plot for arctanh correlations for Thais data
5.9  M-S and C-V plots for arctanh correlations for Thais data
6.1  Plots of canonical variate coefficients and roots for Dicathais data
7.1  Representation of three groups for three sets for various models
7.2  Canonical variate means for grasshopper data
8.1  Plot of canonical variate means versus depth for borehole samples for Afrobolivina


CHAPTER ONE: A GENERAL INTRODUCTION

1.1 Outline of the Thesis

Canonical variate analysis, or multiple discriminant analysis, is a widely used multivariate technique, particularly in biology, geology and medicine. In biology and geology the emphasis is on description and summarization of group differences, while in medicine the emphasis is primarily on allocation or diagnosis.

The motivation for the study reported here arose from extensive consultation and collaboration with colleagues in CSIRO, the University of Western Australia and the W.A. Museum, on the application of multivariate techniques to biological and agronomic problems.

In the course of these case studies (see author references), it became obvious that despite the widespread applicability of canonical variate analysis, surprisingly little is available to guide the applied statistician in the use of the approach. What little guidance exists relates to the two-group discriminant function, and even here the emphasis is almost solely on allocation rates. The general aim of this study is to examine in detail various practical aspects of canonical variate analysis, and where necessary to develop techniques and provide guidelines for more effective application of the approach.

The emphasis in this study is on canonical variate analysis as a multivariate approach which provides a description and summary of multivariate differences between groups. Allocation is not considered.

A general outline of canonical variate analysis, with particular emphasis on the underlying geometry, is given in the remaining Sections of this Chapter. The remainder of the present Section provides an introduction to subsequent Chapters.


Procedures for detecting atypical observations are considered in Chapters Two, Three and Four. The provision of analyses little influenced by such observations is considered in Chapter Four.

A graphically-oriented approach for indicating atypical observations in discriminant analysis, based on the influence function, is developed in Chapter Two. In Chapter Three, the role of robust M-estimation of means and covariances is considered. The use of probability plots of Mahalanobis distances to indicate atypical observations is examined.

A functional relationship formulation for canonical variate analysis is used in Chapter Four to develop a robust M-estimation approach to canonical variate analysis. An alternative approach, based on robust M-estimation of the canonical variate scores, is also presented.

The basic assumptions in canonical variate analysis are that the vectors of observations on distinct individuals are independently and identically distributed and that the group covariance matrices are equal. The original development of the discriminant function, due to Fisher (1936), does not assume a specific distributional form. The optimum properties of the discriminant function for the Gaussian case were first presented by Welch (1939). The assumption of a multivariate Gaussian distribution leads to a formal derivation of the canonical vectors using a likelihood ratio approach, and this turns out to have some useful extensions. The assumption of an underlying multivariate Gaussian distribution can be examined using techniques described in Gnanadesikan (1977, Section 5.4.2). In particular, probability plots of Mahalanobis distances are very useful. A refinement, using robust estimates, is presented in Chapter Three.

The commonly used test procedure for examining the equality of covariance matrices, namely the likelihood-ratio test based on determinants, is known to be very sensitive to departure from Gaussian form. Moreover, no readily-interpretable information is provided as to how the matrices differ. In Chapter Five of this study, graphically-oriented procedures for comparing covariance matrices are developed, using basic ideas from analysis of variance and regression. These procedures are complemented by formal multivariate tests, together with further graphical description.

Chapter Six considers the stability of the canonical vectors. The use of shrunken estimators is developed. Their adoption is shown to lead to improved stability when certain of the directions describing the within-groups variation are associated with small between-groups variation.

A common problem in multivariate discrimination studies is the analysis and comparison of several sets of data, each set relating to the same physical or biological problem. Chapter Seven develops likelihood ratio criteria for comparing the canonical variate solutions for different sets of data. Criteria for examining the parallelism and concurrence of the discriminant planes, and the dispersal of the canonical variate means, are given.

Chapter Eight generalizes the usual canonical variate analysis to the situation where the covariance matrices are not assumed to be equal; three generalizations are developed, corresponding to different formulations of the usual canonical variate problem. With the assumption of equal covariance matrices, all formulations lead to the same eigenanalysis. However, each generalization leads to a slightly different solution, though two of them can be considered as special cases of the third. All three generalizations are computationally more complicated than the usual solution. The usual canonical variate solution is well understood both conceptually and theoretically, and there are undoubted advantages in using it if possible. Procedures are suggested for comparing the generalizations with the usual solution, to determine the effect of differences in covariance structure on the directions of maximum between-group variation.

1.2 Canonical Variate Analysis

Consider g groups of data, with v variables measured on each of n_k individuals for the kth group. Let x_km represent the vector of observations on the mth individual for the kth group (m = 1, ..., n_k; k = 1, ..., g). Define the sums of squares and products (SSQPR) matrix for the kth group as

    S_k = Σ_{m=1}^{n_k} (x_km − x̄_k)(x_km − x̄_k)^T    (1.1)

where

    x̄_k = n_k^{-1} Σ_{m=1}^{n_k} x_km    (1.2)

and write

    W = Σ_{k=1}^{g} S_k = S    (1.3)

for the within-groups SSQPR matrix on

    n_W = Σ_{k=1}^{g} (n_k − 1)    (1.4)

degrees of freedom (d.f.).

Define the between-groups SSQPR matrix as

    B = Σ_{k=1}^{g} n_k (x̄_k − x̄_T)(x̄_k − x̄_T)^T    (1.5)

where

    x̄_T = n_T^{-1} Σ_{k=1}^{g} n_k x̄_k    (1.6)

and

    n_T = Σ_{k=1}^{g} n_k .    (1.7)

Note that n_T will also be written as n, without the subscript.

The simplest formulation of canonical variate analysis is the distribution-free one of finding that linear combination of the original variables which maximizes the variation between groups, relative to the variation within groups. That is, find the canonical vector c_1 which maximizes the ratio c_1^T B c_1 / c_1^T W c_1; the vector is usually scaled so that c_1^T W c_1 = n_W. The maximized ratio gives the first canonical root f_1. The canonical vector c_1 and canonical root f_1 can be found by explicit use of a function maximization routine (and this is done in the generalization in Section 8.2.1). However, use of Lagrange multipliers leads directly to the eigenanalysis

    (B − fW)c = 0 .    (1.8)

Write

    C = (c_1, ..., c_h)

and

    F = diag(f_1, ..., f_h)

where

    h = min(v, g−1) .    (1.9)

Then the eigenanalysis in (1.8) leads to

    BC = WCF

with

    C^T W C = n_W I    (1.10)

and

    C^T B C = n_W F ;

the canonical variates are uncorrelated both within and between groups, and have unit variance within groups. The approach described in this paragraph will be referred to in Section 8.2.1 as the weighted between-groups formulation.
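As a numerical sketch of this weighted between-groups formulation (assuming NumPy is available; the function name and the synthetic data are illustrative, not part of the thesis), the eigenanalysis (B − fW)c = 0 can be reduced to a standard symmetric eigenproblem via the Cholesky factor of W:

```python
import numpy as np

def canonical_variates(groups):
    """Solve (B - f W) c = 0 and scale so that C^T W C = n_W I.

    groups : list of (n_k x v) data arrays, one per group.
    Returns the h = min(v, g-1) canonical roots f (descending) and the
    matrix C of canonical vectors as columns.
    """
    groups = [np.asarray(x, dtype=float) for x in groups]
    n_k = np.array([len(x) for x in groups])
    g, v = len(groups), groups[0].shape[1]
    n_T = n_k.sum()
    n_W = n_T - g                                   # sum of (n_k - 1), eq. (1.4)
    means = np.array([x.mean(axis=0) for x in groups])
    xbar_T = n_k @ means / n_T                      # weighted grand mean, eq. (1.6)
    # within- and between-groups SSQPR matrices, equations (1.3) and (1.5)
    W = sum((x - x.mean(axis=0)).T @ (x - x.mean(axis=0)) for x in groups)
    d = means - xbar_T
    B = d.T @ (d * n_k[:, None])
    # reduce B c = f W c to a symmetric problem using W = L L^T
    L = np.linalg.cholesky(W)
    Li = np.linalg.inv(L)
    f, U = np.linalg.eigh(Li @ B @ Li.T)
    order = np.argsort(f)[::-1][: min(v, g - 1)]
    C = (Li.T @ U[:, order]) * np.sqrt(n_W)         # gives C^T W C = n_W I
    return f[order], C
```

Each column of C then has unit within-groups variance in the sense c^T (W/n_W) c = 1, the scaling convention stated above.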

For two groups, Fisher's linear discriminant function results. Write d_x̄ = x̄_1 − x̄_2 and n_T = n_1 + n_2, and define the Mahalanobis squared distance as D^2 = n_W d_x̄^T W^{-1} d_x̄. Then c = D^{-1} n_W W^{-1} d_x̄ and f = n_W^{-1} n_1 n_2 n_T^{-1} D^2.
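These two-group identities are easy to confirm numerically (a sketch with synthetic data and an illustrative function name, assuming NumPy):

```python
import numpy as np

def fisher_two_group(x1, x2):
    """Return D^2, the discriminant vector c, and the canonical root f
    for two groups, using the conventions c^T W c = n_W and
    D^2 = n_W d^T W^{-1} d."""
    d = x1.mean(axis=0) - x2.mean(axis=0)
    W = sum((x - x.mean(axis=0)).T @ (x - x.mean(axis=0)) for x in (x1, x2))
    n1, n2 = len(x1), len(x2)
    n_T, n_W = n1 + n2, n1 + n2 - 2
    D2 = n_W * d @ np.linalg.solve(W, d)
    c = (n_W / np.sqrt(D2)) * np.linalg.solve(W, d)   # c = D^{-1} n_W W^{-1} d
    f = n1 * n2 * D2 / (n_T * n_W)
    return D2, c, f
```

The root f returned here agrees with the ratio c^T B c / c^T W c for the two-group between-groups matrix B = n_1 n_2 n_T^{-1} d d^T.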

The distribution-free approach given above follows the original derivation of the linear discriminant function by Fisher (1936) and the generalization to canonical vectors by Fisher (1938), Bartlett (1938) and Hotelling (1936). Rao (1948, 1952) generalized the linear discriminant function by finding the linear combinations c_i^T x which maximize the total Mahalanobis distance between all pairs of groups in the reduced number of dimensions. The sum of the squares of distances between the canonical variate means for either formulation in all v dimensions for any pair of groups in the analysis is equal to the corresponding Mahalanobis D^2 for the pair of groups. However, the first p canonical vectors as defined in (1.10) do not in general maximize the total D^2 over all pairs of groups in p dimensions. For this formulation, the unweighted between-groups matrix Σ_{k=1}^{g} (x̄_k − x̄_U)(x̄_k − x̄_U)^T, with x̄_U = g^{-1} Σ_{k=1}^{g} x̄_k, replaces B in (1.8) and (1.10). Rao (1952, Sections 9c.2 and 9d.1) and Gower (1966, p.589) discuss the two formulations.

Write

    T = B + W ;

then an equivalent formulation is to maximize the ratio c_1^T B c_1 / c_1^T T c_1, leading to the eigenanalysis

    (B − r^2 T)c = 0 .    (1.12)

The first root r_1^2 is the square of the first sample canonical correlation coefficient. The vector c_1 is scaled so that c_1^T T c_1 = n_W (1 − r_1^2)^{-1} = n_W (1 + f_1), so that once again c_1^T B c_1 = n_W r_1^2 (1 − r_1^2)^{-1} = n_W f_1 and c_1^T W c_1 = n_W.
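A quick numerical check that the eigenanalyses (1.8) and (1.12) give the same solution, with roots related by f = r^2/(1 − r^2) (a sketch; the matrices below are random stand-ins for W and B, assuming NumPy):

```python
import numpy as np

def sym_gen_roots(A, M):
    """Roots of the generalized problem A c = lambda M c, with M positive
    definite, via the Cholesky reduction M = L L^T."""
    Li = np.linalg.inv(np.linalg.cholesky(M))
    return np.sort(np.linalg.eigvalsh(Li @ A @ Li.T))[::-1]

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
W = A @ A.T + 4 * np.eye(4)      # stand-in within-groups SSQPR matrix
G = rng.standard_normal((4, 2))
B = G @ G.T                      # stand-in between-groups matrix, rank 2
f = sym_gen_roots(B, W)          # roots of (B - f W) c = 0
r2 = sym_gen_roots(B, B + W)     # roots of (B - r^2 T) c = 0, T = B + W
```

Since f = r^2/(1 − r^2) is monotone increasing, the two orderings of the roots coincide.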

Now assume that x_km ~ N_v(μ_k, Σ). The maximized likelihood when the μ_k are unrestricted is

    (2π)^{-nv/2} |n^{-1}W|^{-n/2} e^{-nv/2}    (1.13)

with ½v(v+1) + gv estimated parameters. The maximized likelihood for the hypothesis specifying equality of the μ_k is

    (2π)^{-nv/2} |n^{-1}(W + B)|^{-n/2} e^{-nv/2}    (1.14)

with ½v(v+1) + v estimated parameters. This leads to the well-known likelihood ratio statistic given by |W|/|W + B|, commonly referred to as Wilks' Λ. The statistic Λ may be written as

    Λ = |W|/|W + B| = |W|/|T| = |I + W^{-1}B|^{-1} = Π_{i=1}^{h} (1 + f_i)^{-1} = Π_{i=1}^{h} (1 − r_i^2) ;    (1.15)

−n log Λ is asymptotically distributed as χ^2 on v(g−1) d.f. An improved approximation due to Bartlett is given in Kshirsagar (1972, p.301). The non-centrality parameter for the χ^2 distribution is the trace of the population analogue of W^{-1}B. The matrix W^{-1}B is referred to here as the sample non-centrality matrix. As (1.10) shows, an eigenanalysis of this matrix gives the sample canonical roots and vectors. The approach described in this paragraph will be referred to in Section 8.2.2 as the likelihood ratio formulation.
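The identity (1.15) can be verified directly (a sketch; W and B below are random stand-ins for the SSQPR matrices, and NumPy is assumed):

```python
import numpy as np

def wilks_lambda(W, B):
    """Wilks' Lambda = |W| / |W + B|, computed via log-determinants
    for numerical stability."""
    return np.exp(np.linalg.slogdet(W)[1] - np.linalg.slogdet(W + B)[1])

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
W = A @ A.T + 5 * np.eye(5)       # stand-in within-groups SSQPR matrix
G = rng.standard_normal((5, 3))
B = G @ G.T                       # stand-in between-groups matrix
# canonical roots f_i of (B - f W) c = 0, via the Cholesky reduction of W
Li = np.linalg.inv(np.linalg.cholesky(W))
f = np.linalg.eigvalsh(Li @ B @ Li.T)
```

The determinant route and the product over the canonical roots then agree, as (1.15) states.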

Now assume that all g of the v×1 vectors of group means μ_k lie on a p-dimensional hyperplane (p < h), or, equivalently, that there are v − p linear functional relationships between the means. This is equivalent to specifying that

    μ_k = μ_0 + ΣΨζ_k    (1.16)

where Ψ is the v×p matrix of population canonical vectors. This approach, which is used extensively in Chapters Four, Seven and Eight, is outlined in Section 1.3. It again leads to the eigenanalysis (1.10); the estimator for Ψ is given by the first p columns of C. The maximized likelihood is found to be

    (2π)^{-nv/2} |n^{-1}W|^{-n/2} {Π_{i=p+1}^{h} (1 + f_i)}^{-n/2} e^{-nv/2}    (1.17)

with ½v(v+1) + v + vp − p^2 + p(g−1) estimated parameters. The approach outlined in this paragraph will be referred to in Section 8.2.3 as the functional relationship formulation.

The functional relationship model (1.16) and the associated maximized likelihood in (1.17) encompass the hypotheses resulting in the maximized likelihoods in (1.13) and (1.14). The hypothesis of no restriction on the means specifies that no reduction in dimensionality is possible, so that p = h, and (1.17) reduces to (1.13). The hypothesis of equality of the μ_k is equivalent to specifying that μ_k = μ_0 in (1.16), so that p = 0, and since

    {Π_{i=1}^{h} (1 + f_i)}^{-1} = |n^{-1}W| / |n^{-1}(W + B)| ,

(1.17) reduces to (1.14).

An explicit eigenanalysis exists for the estimator of Ψ in (1.16). However, if function maximization routines were to be used (and they are in the generalizations in Chapter Eight, since explicit solutions do not result there), the maximized likelihoods in (1.14) and (1.13) set bounds for the maximized likelihood corresponding to (1.16) as p varies from 1 to h.


1.3 Canonical Variate Analysis - A Functional Relationship Formulation

The descriptive appeal of canonical variate analysis lies in its ability to provide a graphical representation of the essential differences between the groups in a reduced number of dimensions. Group similarities and differences can be readily discerned from a scatter plot of group means using the important canonical variates for the coordinate system. Since the canonical variates are chosen to be uncorrelated within groups and are usually standardized to have unit standard deviation within groups, Euclidean distance is the appropriate metric for interpreting distances. The number of canonical vectors, p, required to describe the between-groups variation specifies the effective dimensionality of the space spanned by the group means. The specification that p canonical vectors are required is equivalent to the specification that the vectors of group means lie on a p-dimensional hyperplane, with p < h. An excellent discussion of this aspect is given in Kshirsagar (1972, pp. 354-360).

Consider again g independent v-variate N_v(μ_k, Σ) populations, and assume that all g of the v×1 vectors of population means lie on a p-dimensional hyperplane, where p is specified. This can be written as the following model

    μ_k = μ_0 + ΣΨζ_k    (1.18)

where Ψ is the v×p matrix of population canonical vectors. In (1.18), μ_0 is an unknown v×1 fixed vector; Σ is the unknown v×v population covariance matrix, assumed common for all populations; and the ζ_k are unknown p×1 vectors. The population canonical vectors Ψ are uncorrelated within groups, and are standardized to have unit standard deviation within groups, so that Ψ^T Σ Ψ = I_{p×p}. Writing ΣΨ = Ξ in (1.18), with Ξ^T Σ^{-1} Ξ = I, it follows that the columns ξ_i of Ξ are basis vectors for the canonical variate space, with the ζ_k specifying the coordinates for each mean.

Rao (1973, Section 8c.6) gives results for the model in (1.18) when Σ is known. Anderson (1951) derives the functional relationship solution via a canonical correlation or regression formulation, and gives (Anderson, 1951, Section 7) some results for the g-sample problem (which is equivalent to the formulation in (1.18)). A direct derivation is given here, using results for matrix differentiation given in Bibby and Toutenburg (1977, Appendix B).

Consider again the model

    μ_k = μ_0 + ΣΨζ_k ;

then the relevant part of the log likelihood is

    −n log|Σ| − tr Σ^{-1}S − Σ_{k=1}^{g} n_k (x̄_k − μ_0 − ΣΨζ_k)^T Σ^{-1} (x̄_k − μ_0 − ΣΨζ_k) .    (1.19)

Differentiation with respect to (w.r.t.) ζ_k gives

    ζ̂_k = (Ψ^T Σ Ψ)^{-1} Ψ^T (x̄_k − μ_0) .    (1.20)

Write

    P = ΣΨ (Ψ^T Σ Ψ)^{-1} Ψ^T ,    (1.21)

noting that P^2 = P and that (I − P)^T Σ^{-1} (I − P) = Σ^{-1}(I − P). Here P is a generalized projection operator with respect to the metric Σ.
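The stated properties of P can be checked numerically with random stand-ins for Σ and Ψ (a sketch, assuming NumPy; any positive definite Σ and full-rank v×p Ψ will do):

```python
import numpy as np

rng = np.random.default_rng(4)
v, p = 5, 2
A = rng.standard_normal((v, v))
Sigma = A @ A.T + v * np.eye(v)    # positive definite stand-in for Sigma
Psi = rng.standard_normal((v, p))  # full-rank stand-in for Psi
# generalized projection operator, equation (1.21)
P = Sigma @ Psi @ np.linalg.inv(Psi.T @ Sigma @ Psi) @ Psi.T
I = np.eye(v)
Si = np.linalg.inv(Sigma)
```

Idempotence P^2 = P and the metric identity (I − P)^T Σ^{-1} (I − P) = Σ^{-1}(I − P) then hold to machine precision.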

The log likelihood in (1.19) becomes

    −n log|Σ| − tr Σ^{-1}S − Σ_{k=1}^{g} n_k (x̄_k − μ_0)^T Σ^{-1} (I − P) (x̄_k − μ_0) ,

and differentiation w.r.t. μ_0 leads to

    (I − P) μ̂_0 = (I − P) x̄_T .    (1.22)

Then the log likelihood in (1.19) maximized w.r.t. μ_0 and ζ_k is

    −n log|Σ| − tr Σ^{-1}S − tr Σ^{-1}B + tr Σ^{-1}PB ;    (1.23)

here

    Σ^{-1}P = Ψ (Ψ^T Σ Ψ)^{-1} Ψ^T .

Using results in Bibby and Toutenburg (1977, Appendix B), differentiation of (1.23) w.r.t. Σ and w.r.t. Ψ gives

    −n Σ̂^{-1} + Σ̂^{-1}(S + B)Σ̂^{-1} − Ψ̂(Ψ̂^T Σ̂ Ψ̂)^{-1} Ψ̂^T B Ψ̂ (Ψ̂^T Σ̂ Ψ̂)^{-1} Ψ̂^T = 0    (1.24)

and

    −(Ψ̂^T Σ̂ Ψ̂)^{-1} Ψ̂^T B Ψ̂ (Ψ̂^T Σ̂ Ψ̂)^{-1} Ψ̂^T Σ̂ + (Ψ̂^T Σ̂ Ψ̂)^{-1} Ψ̂^T B = 0 .    (1.25)

Now introduce the usual conditions in canonical variate analysis, namely that the canonical vectors are uncorrelated within groups with unit variance, and are uncorrelated between groups, viz.

    Ψ̂^T Σ̂ Ψ̂ = I
and    (1.26)
    Ψ̂^T B Ψ̂ = n F_p

where F_p is a diagonal matrix.

Substitution of (1.26) into (1.24) leads to

    n Σ̂ = S + B − Σ̂ Ψ̂ n F_p Ψ̂^T Σ̂ ,    (1.27)

while substitution of (1.26) into (1.25) leads to

    B Ψ̂ = Σ̂ Ψ̂ n F_p .    (1.28)

Postmultiplication of (1.27) by Ψ̂, and substitution of (1.26) and (1.28), gives

    n Σ̂ Ψ̂ = S Ψ̂ ,    (1.29)

which gives the fundamental canonical variate equation

    B Ψ̂ = S Ψ̂ F_p ,    (1.30)

which is of the same form as (1.10). Premultiplication of (1.29) by Ψ̂^T and use of (1.26) gives

    Ψ̂^T Σ̂ Ψ̂ = Ψ̂^T n^{-1} S Ψ̂ = I .

Substitution of (1.29) into (1.27) gives

    n Σ̂ = S + B − S Ψ̂ F_p Ψ̂^T S n^{-1}
         = S + B − B Ψ̂ Ψ̂^T S n^{-1}    (1.31)
         = S + B − B Ψ̂ F_p^{-1} Ψ̂^T B n^{-1} .

Page 24: CANONICAL VARIATE ANALYSIS: SOME PRACTICAL ASPECTS by ... · 5.5 C-V plot for arctanh correlations for grasshopper data 137 5.6 I-A and Q-Q plots for log variances for Thais data

23

It now remains to specify the canonical vectors 4' which maximize

(1.23). From (1.27),

nE(I +'YF'Y E) = S + B P

and so

$$n^v|\hat\Sigma|\,|I + \hat\Gamma F_p\hat\Gamma^T\hat\Sigma| = |S+B|$$

or

$$n^v|\hat\Sigma|\,|I + F_p| = |S+B|\,.$$

Also,

$$\mathrm{tr}\,\hat\Sigma^{-1}(S+B) = n\,\mathrm{tr}(I + \hat\Gamma F_p\hat\Gamma^T\hat\Sigma) = n\,\mathrm{tr}(I + F_p)$$

and

$$\mathrm{tr}\,\hat\Sigma^{-1}PB = \mathrm{tr}\,\hat\Gamma\hat\Gamma^TB = n\,\mathrm{tr}\,F_p\,.$$

Hence the maximized log likelihood in (1.23) becomes

$$-n\log n^{-v}|S+B| + n\log|I+F_p| - nv\,.$$

Now partition C and F in (1.10) as

$$C = (C_p, C_q)\,, \qquad F = \begin{pmatrix} F_p & 0 \\ 0 & F_q \end{pmatrix}\,, \qquad (1.32)$$


where $C_p$ is $v \times p$ and $F_p$ is $p \times p$; then the log likelihood is maximized by choosing $\hat\Gamma = C_p$. That is, the first p vectors of C give the required canonical vectors under the functional relationship formulation.

It now remains to find expressions for the $\hat{c}_k$ and hence the $\hat{\mu}_k$. From (1.20) and (1.26),

$$\hat{c}_k = C_p^T(\bar{x}_k - \hat{\mu}_0)\,,$$

and so from (1.18) and (1.29),

$$\hat{\mu}_k = \hat{\mu}_0 + n^{-1}SC_pC_p^T(\bar{x}_k - \hat{\mu}_0)\,.$$

From (1.21), (1.29) and (1.22), this becomes

$$\hat{\mu}_k = \bar{x}_T + VC_pC_p^T(\bar{x}_k - \bar{x}_T)\,,$$

with $V = n^{-1}S$. The canonical variate means are given by

$$\hat\Gamma^T\hat{\mu}_k = C_p^T\bar{x}_k\,.$$

These are the usual canonical variate means found by substituting the

vectors of sample means in the first p canonical variate equations.
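The derivation above can be checked numerically. The following sketch (not part of the thesis computations; synthetic data, NumPy and SciPy assumed) solves the generalized eigenproblem (1.30), scales the canonical vectors so that the conditions (1.26) hold, and verifies that $\hat\Sigma$ built from (1.31) satisfies (1.26) and (1.29).

```python
import numpy as np
from scipy.linalg import eigh as geigh

rng = np.random.default_rng(0)
n, v, g, p = 60, 4, 3, 2
X = rng.normal(size=(n, v))
lab = rng.integers(0, g, size=n)                  # hypothetical group labels
xbar = X.mean(axis=0)
S = np.zeros((v, v)); B = np.zeros((v, v))
for k in range(g):
    Xk = X[lab == k]
    dk = Xk - Xk.mean(axis=0)
    S += dk.T @ dk                                # within-groups SSQPR
    d = Xk.mean(axis=0) - xbar
    B += len(Xk) * np.outer(d, d)                 # between-groups SSQPR
f, C = geigh(B, S)                                # B c = f S c, with C^T S C = I
f, C = f[::-1], C[:, ::-1]                        # roots in decreasing order
G = np.sqrt(n) * C[:, :p]                         # Gamma: Gamma^T (n^-1 S) Gamma = I  (1.26)
Fp = np.diag(f[:p])
# Sigma-hat from (1.31); (1.26) and (1.29) should then hold exactly
Sig = (S + B - B @ G @ np.linalg.inv(Fp) @ G.T @ B / n) / n
ok26 = np.allclose(G.T @ Sig @ G, np.eye(p))
ok29 = np.allclose(n * Sig @ G, S @ G)
```

Both checks reduce to algebraic identities once $\hat\Gamma^TB\hat\Gamma = nF_p$ and $\hat\Gamma^T(n^{-1}S)\hat\Gamma = I$, so the assertions hold to machine precision.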

1.4 Geometry of Canonical Variate Analysis

Geometrically, canonical variate analysis may be considered as a

two-stage rotation procedure, as illustrated in Rempe and Weber (1972).

The first stage involves orthogonal rotation of the original variables

to new uncorrelated variables. This may be accomplished in a number of

ways, one of the most common being to determine the principal components

of the original data, which corresponds to finding the principal axes


of the pooled within-groups covariance ellipsoid. The new

uncorrelated variables are then scaled by the square roots of the

corresponding eigenvalues to have unit variance, so that the resulting

variables are orthonormal. The second stage of the procedure involves

a principal components analysis of the SSQPR matrix of the group means

in the space of the orthonormal variables. Mahalanobis D2 is the

square of the usual Euclidean distance in the rotated scaled orthonormal

variable space. Note that it is in the rotated scaled space, in which

concentration ellipsoids have become concentration spheres, that the

canonical variate group means are most usefully plotted.

The geometrical approach for the eigenanalysis first-stage rotation

may be expressed algebraically as follows. Write W in terms of its

eigenvectors $U = (u_1,\ldots,u_v)$ and eigenvalues $E = \mathrm{diag}(e_1,\ldots,e_v)$, i.e. write

$$W = UEU^T\,. \qquad (1.33)$$

The matrix of scaled eigenvectors $UE^{-1/2} = (u_1/\sqrt{e_1},\ldots,u_v/\sqrt{e_v})$ provides the first-stage orthonormalization, to

$$z = E^{-1/2}U^Tx\,. \qquad (1.34)$$

The between-groups SSQPR matrix for these variables is given by

$$E^{-1/2}U^TBUE^{-1/2}\,. \qquad (1.35)$$

The second-stage principal components analysis is

$$(E^{-1/2}U^TBUE^{-1/2} - fI)a = 0 \qquad (1.36)$$

and gives the canonical roots $f_i$ and canonical vectors $a_i$ for the orthonormal


variables directly. Premultiplication by $UE^{-1/2}$ and comparison with (1.8) shows that the canonical vectors $c_i$ for the original variables x are found from the $a_i$ by

$$c_i = UE^{-1/2}a_i\,. \qquad (1.37)$$

Note that the canonical variate scores $c_i^Tx$ and $a_i^Tz$ are the same.
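The two-stage rotation (1.33)-(1.37) can be sketched directly; the code below (synthetic SSQPR matrices, NumPy assumed, not part of the thesis) confirms that the resulting vector satisfies the canonical variate equation $Bc = fWc$ and that the scores $c^Tx$ and $a^Tz$ agree.

```python
import numpy as np

rng = np.random.default_rng(1)
v = 3
A = rng.normal(size=(v, v)); W = A @ A.T + np.eye(v)   # within-groups SSQPR
H = rng.normal(size=(v, v)); B = H @ H.T               # between-groups SSQPR
e, U = np.linalg.eigh(W)                               # W = U E U^T        (1.33)
Es = np.diag(e ** -0.5)                                # E^{-1/2}
f, a = np.linalg.eigh(Es @ U.T @ B @ U @ Es)           # second stage       (1.36)
c = U @ Es @ a[:, -1]                                  # largest root       (1.37)
x = rng.normal(size=v)
z = Es @ U.T @ x                                       # orthonormal scores (1.34)
same_score = np.allclose(c @ x, a[:, -1] @ z)          # c^T x = a^T z
solves_cv = np.allclose(B @ c, f[-1] * W @ c)          # B c = f W c
```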

1.5 Computation of the Canonical Variate Solution

The usual numerical procedure is to determine the canonical roots f and canonical vectors a of $W^{-1/2}BW^{-1/2}$; the c then follow since $c = W^{-1/2}a$. This procedure implicitly involves orthonormalization of the original variables x, by transformation to new variables $z = W^{-1/2}x$ with identity within-groups matrix. The symmetric matrix $W^{-1/2}BW^{-1/2}$ is simply the between-groups SSQPR matrix for the orthonormalized variables.

The two-stage eigenanalysis or principal components analysis given

in the previous Section provides one approach.

Let $X^*$ be the $g \times v$ matrix of group means and $Z^* = X^*UE^{-1/2}$ be the matrix of group means for the orthonormal variables defined in (1.34). It is assumed that $X^*$ is mean-centred by columns, such that $1^TNX^* = 0$, where

$$N = \mathrm{diag}(n_1,\ldots,n_g)\,.$$

For an unweighted between-groups formulation, $1^TX^* = 0$.

Let

$$\tilde{X} = N^{1/2}X^*$$

and

$$\tilde{Z} = N^{1/2}Z^* = \tilde{X}UE^{-1/2}\,. \qquad (1.38)$$

For an unweighted between-groups formulation, $\tilde{X} = X^*$ and $\tilde{Z} = Z^*$.

Then

$$B = \tilde{X}^T\tilde{X} \qquad (1.39)$$

and the between-groups SSQPR matrix for the orthonormal variables is, from (1.35) and (1.38),

$$E^{-1/2}U^T\tilde{X}^T\tilde{X}UE^{-1/2} = \tilde{Z}^T\tilde{Z}\,.$$

The eigenanalysis given in (1.36) becomes

$$(\tilde{Z}^T\tilde{Z} - fI)a = 0\,. \qquad (1.40)$$

With

$$m = \tilde{Z}a\,, \qquad (1.41)$$

this may be written as

$$\tilde{Z}\tilde{Z}^Tm = fm\,, \qquad (1.42)$$

where

$$m^Tm = f\,.$$

From (1.41), (1.38) and (1.37),

$$m = \tilde{X}UE^{-1/2}a = \tilde{X}c \qquad (1.43)$$


and so $N^{-1/2}m = X^*c$, which is just the vector of canonical variate

means.

The eigenanalysis of $\tilde{Z}\tilde{Z}^T$ as in (1.42), rather than of the more usual $\tilde{Z}^T\tilde{Z}$ in (1.40), is often referred to as the Q-technique analysis

(Gower, 1966).

From (1.43),

$$\tilde{X}^Tm = \tilde{X}^T\tilde{X}c\,.$$

But from (1.39) and (1.10),

$$\tilde{X}^T\tilde{X}c = Bc = fWc\,.$$

So

$$\tilde{X}^Tm = fWc$$

and hence

$$c = f^{-1}W^{-1}\tilde{X}^Tm\,. \qquad (1.44)$$

From (1.42), (1.39) and (1.33),

$$\tilde{X}UE^{-1}U^T\tilde{X}^Tm = fm$$

or

$$\tilde{X}W^{-1}\tilde{X}^Tm = fm\,. \qquad (1.45)$$

The eigenanalysis based on (1.42) (or (1.45)) and (1.44) will be preferable when $g \ll v$.
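A small illustration of the Q-technique route (hypothetical group means and within-groups matrix; NumPy assumed, not part of the thesis): it checks (1.41)-(1.45), i.e. that $m = \tilde{Z}a$ is an eigenvector of $\tilde{Z}\tilde{Z}^T$ with $m^Tm = f$, and that $c = f^{-1}W^{-1}\tilde{X}^Tm$ recovers a canonical vector.

```python
import numpy as np

rng = np.random.default_rng(2)
g, v = 4, 3
nk = np.array([10, 15, 20, 25]); n = nk.sum()
means = rng.normal(size=(g, v))                        # hypothetical group means
xbarT = (nk[:, None] * means).sum(axis=0) / n          # weighted overall mean
Xs = means - xbarT                                     # X*: 1^T N X* = 0
Xt = np.sqrt(nk)[:, None] * Xs                         # X~ = N^{1/2} X*
B = Xt.T @ Xt                                          # (1.39)
A = rng.normal(size=(v, v)); W = A @ A.T + np.eye(v)   # within-groups SSQPR
e, U = np.linalg.eigh(W)
Es = np.diag(e ** -0.5)
Zt = Xt @ U @ Es                                       # (1.38)
fvals, avec = np.linalg.eigh(Zt.T @ Zt)                # (1.40)
f1, a1 = fvals[-1], avec[:, -1]
m = Zt @ a1                                            # (1.41)
ok_q = np.allclose(Zt @ Zt.T @ m, f1 * m)              # (1.42)
ok_len = np.isclose(m @ m, f1)                         # m^T m = f
c = (np.linalg.inv(W) @ Xt.T @ m) / f1                 # (1.44)
ok_c = np.allclose(B @ c, f1 * W @ c)                  # canonical variate equation
ok_45 = np.allclose(Xt @ np.linalg.inv(W) @ Xt.T @ m, f1 * m)  # (1.45)
```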


An alternative procedure (Ashton, Healy and Lipton, 1957; Gower,

1966) is to base the first-stage orthonormalization on a triangular

decomposition or Cholesky decomposition of W.

Let

$$W = U_T^TU_T\,,$$

where $U_T$ is a $v \times v$ upper triangular matrix. The orthonormal variables from the first stage are given by

$$z_T = U_T^{-T}x\,.$$

The orthonormalization is a successive procedure: z1 is xl apart from

scaling; z2 is a function of xl and x2 only, and represents the residual

component of x2 after regressing on xl (=z1); and finally zv represents

the residual component of xv after regressing on all the previous x's.

Moreover, the ith diagonal term of $U_T^{-1}$ is simply the inverse of the ith diagonal term of $U_T$; and the square of the latter is the residual or conditional SSQ of $x_i$ given $x_1,\ldots,x_{i-1}$.

Now $U_T$ can be written as

$$U_T = E_TU_U\,, \qquad (1.46)$$

where $E_T = \mathrm{diag}(u_{T11},\ldots,u_{Tvv})$, and $U_U$ is a unit triangular matrix. Write

$$\tilde{Z}_T = \tilde{X}U_U^{-1}E_T^{-1}\,. \qquad (1.47)$$

The eigenanalysis for the second stage is

$$(\tilde{Z}_T^T\tilde{Z}_T - fI)a_T = 0 \qquad (1.48)$$

and

$$c = U_U^{-1}E_T^{-1}a_T\,.$$

The eigenanalysis in (1.48) can be written, from (1.47), as

$$(E_T^{-1}U_U^{-T}\tilde{X}^T\tilde{X}U_U^{-1}E_T^{-1} - fI)a_T = 0\,. \qquad (1.49)$$
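The Cholesky-based first stage can likewise be sketched numerically (synthetic matrices; NumPy/SciPy assumed, not part of the thesis). The check below confirms that the transformed variables have identity within-groups matrix, that the split (1.46) yields a unit triangular $U_U$, and that (1.48) reproduces the usual canonical roots and vectors.

```python
import numpy as np
from scipy.linalg import eigh as geigh

rng = np.random.default_rng(3)
v = 3
A = rng.normal(size=(v, v)); W = A @ A.T + np.eye(v)   # within-groups SSQPR
Xt = rng.normal(size=(5, v))                           # X~ of (weighted) group means
B = Xt.T @ Xt                                          # between-groups SSQPR (1.39)
L = np.linalg.cholesky(W)                              # W = L L^T, so U_T = L^T
UT = L.T
UTinv = np.linalg.inv(UT)
# within-groups matrix of the transformed variables is the identity
ok_I = np.allclose(UTinv.T @ W @ UTinv, np.eye(v))
ET = np.diag(np.diag(UT)); UU = np.linalg.inv(ET) @ UT # U_T = E_T U_U  (1.46)
ok_unit = np.allclose(np.diag(UU), np.ones(v))         # U_U is unit triangular
ZT = Xt @ UTinv                                        # Z~_T = X~ U_T^{-1}  (1.47)
fT, aT = np.linalg.eigh(ZT.T @ ZT)                     # (1.48)
c = UTinv @ aT[:, -1]                                  # back-transformed vector
ok_c = np.allclose(B @ c, fT[-1] * W @ c)              # B c = f W c
f = geigh(B, W, eigvals_only=True)                     # roots from the usual route
ok_roots = np.allclose(np.sort(fT), np.sort(f))
```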


The procedures outlined above involve formation of SSQPR matrices.

The trend now is to use numerically more accurate procedures which

avoid this, by working directly on decompositions of the original data

matrix (Chambers, 1977, Chapter 5). The procedures discussed above can

be specified in terms of singular value (SVD) and QR decompositions

(see Chambers, 1977, Sections 5e and 5b). Consider the Q-technique

formulation: by analogy with principal components via singular value

decomposition (Chambers, 1977, p.125), it follows immediately that

$$\tilde{Z}_0 = YGA_0^T$$

is the appropriate singular value decomposition for the second stage of the canonical variate analysis. Here

$$A_0 = (a_{01},\ldots,a_{0m})\,, \qquad F = \mathrm{diag}(f_1,\ldots,f_m) = G^2\,,$$

and $\tilde{Z}_0$ is either $\tilde{Z}$ or $\tilde{Z}_T$, whence $a_0$ is either $a$ or $a_T$, while

$$M = (m_1,\ldots,m_m) = YG = YF^{1/2}\,.$$

This procedure will determine the second-stage analysis more

However, the first stage still involves formation of SSQPR matrices.


Hence it would be desirable to be able to determine the eigenvectors

or triangular matrix for the first-stage orthonormalization using

numerically accurate procedures. An obvious approach is to determine

Z and E (or ZT and ET) simultaneously - but this does not appear

possible.

One possible computational approach for the first stage is as

follows. Let X be the $n \times v$ matrix of observations and let $X_G$ be the $n \times v$ matrix such that the column sum for each group is zero; then $X_G^TX_G = W$. The first-stage simultaneous rotation can be effected by a singular value decomposition of $X_G$ as $Y_GG_GU^T$, which gives $E = G_G^2$ and so $Z = XUG_G^{-1}$. Since $X_G$ is mean-centred by groups, $Y_G$ is of little direct interest. Similarly, if $X_G = Y_TU_T$ is the QR decomposition, then $Z_T = XU_T^{-1}$. There is, of course, the actual size of the data

matrix to be considered. It is not uncommon for a canonical variate

problem to include more than 500 observations; the Dicathais example

considered in Chapters Four and Six includes more than 850 observations,

while Hopper and Campbell (1977) consider 670 observations and 30

variables. The size of the data matrix and the computational time may

preclude explicit consideration of the above approaches for the first-

stage computations.

If numerically accurate procedures are to be used for the first-

stage computations, an alternative approach to that given in the previous

paragraph is to consider the $n \times v$ matrix $X_0$ such that the overall column sum is zero. Then $X_0^TX_0 = T$. Here a singular value or QR decomposition of $X_0$ will lead directly to orthonormalized variables which, when averaged, will produce matrices which play the same role as $\tilde{Z}$ or $\tilde{Z}_T$ respectively. The resulting singular values of $\tilde{Z}$ or $\tilde{Z}_T$ will lead to the canonical correlation coefficients $r_i$, with $f_i = r_i^2(1 - r_i^2)^{-1}$.
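The stated relation $f_i = r_i^2(1 - r_i^2)^{-1}$ between the canonical roots and canonical correlations can be verified directly (synthetic W and B; SciPy's generalized symmetric eigensolver assumed):

```python
import numpy as np
from scipy.linalg import eigh as geigh

rng = np.random.default_rng(4)
v = 3
A = rng.normal(size=(v, v)); W = A @ A.T + np.eye(v)   # within-groups SSQPR
H = rng.normal(size=(v, v)); B = H @ H.T               # between-groups SSQPR
T = W + B                                              # total SSQPR
f = geigh(B, W, eigvals_only=True)                     # canonical roots f_i
r2 = geigh(B, T, eigvals_only=True)                    # squared canonical correlations r_i^2
ok = np.allclose(f, r2 / (1.0 - r2))                   # f_i = r_i^2 (1 - r_i^2)^{-1}
```

The identity follows because $Bc = fWc$ is equivalent to $Bc = \{f/(1+f)\}Tc$, so $r^2 = f/(1+f)$.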


1.6 Adequacy of Discriminant Functions and Subsets of Variables

This Section reviews results for examining in greater detail the nature of the discrimination provided by the canonical vectors. Let $x_v$ denote all v variables, let $x_p$ denote a subset of p variables, and let $x_q$ denote the remaining $q = v-p$ variables. Write the Wilks ratios in (1.15) corresponding to $x_v$ and to $x_p$ as $\Lambda_v$ and $\Lambda_p$. If $\Lambda_p$ is similar to $\Lambda_v$, then it is reasonable to conclude that the variables $x_q$ do not contain any additional information for discrimination when the variables $x_p$ are first included. The hypothesis of no additional information in the $x_q$ given the $x_p$ has also been termed the hypothesis of sufficiency of the $x_p$ by Rao (1970). Since the Wilks ratios are invariant under linear transformations, the subset $x_p$ may be equated with hypothetical discriminant vectors; in this context, Bartlett (1951) refers to the adequacy of the hypothetical vectors for discrimination.

One formulation for the hypothesis of no additional information, which is outlined below, is to consider the equality of the conditional means $E(x_q\,|\,x_p)$ for each group, assuming an underlying multivariate Gaussian distribution. The Wilks ratio for the hypothesis is $\Lambda_{q.p} = \Lambda_v/\Lambda_p$, as given in (1.53) below. The derivation can also proceed via a multivariate analysis of covariance formulation with $x_p$ as the covariates and $x_q$ as the variates (see Rao, 1952, Section 7d.4). Rao (1970, p.589) presents other formulations.

The hypothesis of no additional information can be extended to examine the adequacy of hypothetical discriminant vectors. Here p hypothetical variables $\Xi^Tx$ are equated with the $x_p$, and q hypothetical variables with the $x_q$. The approach can be extended to consider agreement in direction of the sample and hypothetical discriminant vectors, and coplanarity of the means. Consider initially a single hypothetical discriminant vector $\xi$. Then the Wilks ratio for the single


variable $\xi^Tx$ is $\Lambda_\xi = \xi^TW\xi/\xi^TT\xi$, and the adequacy of $\xi^Tx$ is examined by considering $\Lambda_v/\Lambda_\xi$. Note that the ratio of the between-to-total SSQ for the hypothetical variable is $\theta = 1 - \Lambda_\xi = \xi^TB\xi/\xi^TT\xi$, which is the square of the canonical correlation coefficient for the hypothetical vector. Wilks $\Lambda_v$ can be written as $(1 - r_1^2)\prod_{i=2}^{h}(1 - r_i^2)$ using (1.15), and so

$$\Lambda_v/\Lambda_\xi = \{(1 - r_1^2)/(1 - \theta)\}\prod_{i=2}^{h}(1 - r_i^2)\,.$$

The first term $(1 - r_1^2)/(1 - \theta)$ represents the ratio of within-to-total SSQ for the sample and hypothetical vectors and is a reflection of their agreement in direction. The second term $\prod_{i=2}^{h}(1 - r_i^2)$ is a measure of the lack of collinearity of the group means. The above formulation is essentially due to Bartlett (1951).

The extension to p hypothetical discriminant vectors $\Xi$ is immediate: with $1 - \theta = |\Xi^TW\Xi|/|\Xi^TT\Xi|$, the direction term is $\prod_{i=1}^{p}(1 - r_i^2)/(1 - \theta)$, while the coplanarity term is $\prod_{i=p+1}^{h}(1 - r_i^2)$. Radcliffe (1966) notes that tests based on the above factorization are approximate in the sense

that the exact distributions of the factors are not known, nor is it

claimed that the factors are independent or almost independent.

Bartlett (1951) and Williams (1961, 1967) have developed exact factori-

zations using regression arguments, though the so-called approximate

factorization has undoubted practical appeal.

The direction and coplanarity terms can be derived as ratios of

maximized likelihoods. The direction term corresponds to the ratio of

maximized likelihoods for the hypothesis that ETx is adequate for

discrimination versus the hypothesis that the means lie on a

p-dimensional hyperplane. The second, coplanarity, term corresponds

to the ratio of maximized likelihoods for the p-dimensional hyperplane


hypothesis versus the hypothesis of no restriction on the means.

Radcliffe (1967) has derived the likelihood ratios in the context of

a canonical correlation or reduced-rank regression formulation. A more

direct approach in the context of canonical variate analysis is to

specify various models or hypotheses for the means uk as in Section 1.3

and this is now outlined.

Partition $x_v$, $\mu_k$ and $\Sigma$ as follows. Write $x_v = (x_p^T, x_q^T)^T$, $\mu_k = (\mu_{pk}^T, \mu_{qk}^T)^T$, and

$$\Sigma = \begin{pmatrix} \Sigma_{pp} & \Sigma_{pq} \\ \Sigma_{qp} & \Sigma_{qq} \end{pmatrix}\,.$$

Similar partitions will be used for the various matrices and vectors introduced in Section 1.2.

The hypothesis that there is no information in $x_q$ for discrimination conditional on $x_p$ may be specified as the equality of the conditional means $\mu_{q.p,k}$, where $\mu_{q.p,k} = \mu_{qk} - \beta_{qp}\mu_{pk}$ and $\beta_{qp} = \Sigma_{qp}\Sigma_{pp}^{-1}$. Write $W = \sum_{k=1}^{g}S_k$ as in (1.3), and $W_{qq.p} = W_{qq} - W_{qp}W_{pp}^{-1}W_{pq}$. The maximized likelihood for no restriction on the conditional means is easily shown to be

$$(2\pi)^{-np/2}|n^{-1}W_{pp}|^{-n/2}e^{-np/2}\,(2\pi)^{-nq/2}|n^{-1}W_{qq.p}|^{-n/2}e^{-nq/2}\,.$$

Since

$$|W| = |W_{pp}|\,|W_{qq.p}|\,, \qquad (1.50)$$

this likelihood is equivalent to the usual likelihood given in (1.13).

Let $T_{pp}$, $T_{pq}$ and $T_{qq}$ be the partition of T in (1.11), and define $T_{qq.p}$ as for $W_{qq.p}$ above. The maximized likelihood for the hypothesis specifying equality of the conditional means $\mu_{q.p,k}$ but no restriction on the $\mu_{pk}$ is

$$(2\pi)^{-nv/2}|n^{-1}W_{pp}|^{-n/2}|n^{-1}T_{qq.p}|^{-n/2}e^{-nv/2}$$


with $\tfrac{1}{2}v(v+1) + gp + q$ estimated parameters. Using a determinantal identity similar to that in (1.50), this may be written as

$$(2\pi)^{-nv/2}|n^{-1}W_{pp}|^{-n/2}|n^{-1}T|^{-n/2}|n^{-1}T_{pp}|^{n/2}e^{-nv/2}\,. \qquad (1.51)$$

Note that the variables xq do not enter explicitly into any of the

maximized likelihoods in (1.13), (1.14) or (1.51).

If $x_p$ is equated with the hypothetical canonical variates $\Xi^Tx$, then

$$W_{pp} = \Xi^TW\Xi$$

and (1.52)

$$T_{pp} = \Xi^TT\Xi = \Xi^T(B+W)\Xi\,.$$

The hypothesis that the means lie on a p-dimensional hyperplane

is discussed briefly in Section 1.2 and outlined in more detail in

Section 1.3. The maximized likelihood is given in (1.17).

The likelihood ratio statistic for examining the adequacy of the p variables $x_p$ is given by comparing the maximized likelihood for the unrestricted hypothesis with that specifying equality of the conditional means. From (1.13) and (1.51), this is

$$\{(|W_{pp}|\,|T|)/(|W|\,|T_{pp}|)\}^{-n/2} = (\Lambda_v/\Lambda_p)^{n/2}\,, \qquad (1.53)$$

as noted in the second paragraph of this Section.

Now equate $x_p$ with $\Xi^Tx$. From (1.52), $\Lambda_p = |\Xi^TW\Xi|/|\Xi^T(B+W)\Xi|$, and the adequacy of the p hypothetical discriminant vectors $\Xi$ is again examined by considering $\Lambda_v/\Lambda_p$.


The ratio of maximized likelihoods for the hypothesis that $\Xi^Tx$ is adequate for discrimination versus the hypothesis that the means lie on a p-dimensional hyperplane is, from (1.51), (1.52) and (1.17), and the relationships in (1.15),

$$\Lambda_p^{-n/2}|T|^{-n/2}\Big/|W|^{-n/2}\Big\{\prod_{i=p+1}^{h}(1+f_i)\Big\}^{-n/2} = \Big\{\Lambda_v\prod_{i=p+1}^{h}(1+f_i)\big/\Lambda_p\Big\}^{n/2} = \Big\{\Lambda_p\prod_{i=1}^{p}(1+f_i)\Big\}^{-n/2}\,. \qquad (1.54)$$

This is the direction term referred to above, raised to the power n/2, since $\Lambda_p = 1 - \theta$ and $1 - r_i^2 = (1+f_i)^{-1}$.

The ratio of maximized likelihoods for the p-dimensional hyperplane hypothesis versus the hypothesis of no restriction on the means is, from (1.17) and (1.13), and the relationships in (1.15),

$$\Big\{\prod_{i=p+1}^{h}(1+f_i)\Big\}^{-n/2} = \Big\{\prod_{i=p+1}^{h}(1-r_i^2)\Big\}^{n/2}\,. \qquad (1.55)$$

This is the coplanarity term referred to above, raised to the power n/2. The product of the two ratios in (1.54) and (1.55) is $\{\prod_{i=1}^{h}(1-r_i^2)/\Lambda_p\}^{n/2}$, which, from (1.15), is simply $(\Lambda_v/\Lambda_p)^{n/2}$ as in (1.53).
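The factorization of $\Lambda_v/\Lambda_p$ into direction and coplanarity terms is an algebraic identity, which the following sketch checks on synthetic SSQPR matrices with a randomly chosen hypothetical $\Xi$ (NumPy/SciPy assumed, not part of the thesis; B is taken of full rank so that $h = v$):

```python
import numpy as np
from scipy.linalg import eigh as geigh

rng = np.random.default_rng(5)
v, p = 4, 2
A = rng.normal(size=(v, v)); W = A @ A.T + np.eye(v)   # within-groups SSQPR
H = rng.normal(size=(v, v)); B = H @ H.T               # full-rank between SSQPR
T = W + B
r2 = np.sort(geigh(B, T, eigvals_only=True))[::-1]     # r_1^2 >= ... >= r_v^2
Lv = np.linalg.det(W) / np.linalg.det(T)               # Wilks Lambda_v
Xi = rng.normal(size=(v, p))                           # hypothetical vectors
Lp = np.linalg.det(Xi.T @ W @ Xi) / np.linalg.det(Xi.T @ T @ Xi)
direction = np.prod(1 - r2[:p]) / Lp                   # direction term
coplanarity = np.prod(1 - r2[p:])                      # coplanarity term
ok_prod = np.isclose(direction * coplanarity, Lv / Lp) # their product is Lv/Lp
ok_wilks = np.isclose(Lv, np.prod(1 - r2))             # (1.15)
```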


CHAPTER TWO: DETECTION OF ATYPICAL OBSERVATIONS IN DISCRIMINANT ANALYSIS

In this Chapter, the influence function is used to develop criteria

for detecting atypical observations in discriminant analysis. Section

2.1 discusses the influence function. Section 2.2 develops the

influence function for various aspects of discriminant analysis:

Mahalanobis D2, the discriminant function group means, and functions

of the vector of coefficients. For Mahalanobis D2, the influence function

is a quadratic function of the deviation of the discriminant score for

the perturbed observation from the discriminant score for the mean of

the corresponding group. Chi-squared approximations to the distributions

of the influence functions of interest are also developed in Section 2.2,

and graphical representation is considered in Section 2.3. An example

is given in Section 2.4.

2.1 Influence Function

The applied statistician is often faced with the problem of how to

detect and then treat apparently atypical observations. The detection

of such observations in multivariate data is often more complex than

in the univariate case, since the effect of an atypical observation on

the means, variances and on the correlations between the variables may

need to be taken into account. Gnanadesikan and Kettenring (1972) and

Gnanadesikan (1977) discuss problems of detection of atypical observations

in multivariate data and point out some of the difficulties (see, in

particular, Gnanadesikan, 1977, Section 6.4.2).

One obvious and intuitively appealing approach is to carry through

an analysis with and without a suspected atypical observation, and to


compare the results so obtained. The influence function of Hampel

(1974) provides a useful tool to formalize this. The theoretical

influence function for a particular parameter 6, such as the mean,

is found by perturbing the distribution function F by adding a small

contribution from a unit mass at the point x, evaluating the parameter

at the perturbed distribution function, and subtracting the parameter

evaluated at the unperturbed distribution function. A formal definition

is given by Gnanadesikan (1977, p.272) (see also Devlin et al., 1975). Essentially, the influence function is the derivative of the parameter θ w.r.t. the distribution function F.

Distinction is made between the theoretical influence function,

described in the previous paragraph, and the sample influence function,

in which an actual observation is deleted. If $\hat\theta$ is an estimator based on n observations, and $\hat\theta_{-m}$ is an estimator, of the same form as $\hat\theta$, determined without the mth observation, Devlin et al. (1975) have suggested calling $I(x_m;\hat\theta) = (n-1)(\hat\theta - \hat\theta_{-m})$

the sample influence function. It is the sample influence function

which is of practical interest in the present context, though it is more

convenient to study the theoretical influence function, and extend the

results to the sample case.

When the parameter of interest involves more than one population,

as in multivariate between-group studies, the theoretical influence

function is then determined by perturbing only one of the distribution

functions in the above way (in the sample, by eliminating an observation

from only one of the groups); the parameter is evaluated for one of

the distribution functions perturbed and the others unchanged, and

the parameter evaluated for all the distribution functions unperturbed

is then subtracted.


This may be written more formally as follows. Consider a general parameter $\theta = T(F_1,\ldots,F_k,\ldots,F_g)$, expressed as a functional of the distribution functions $F_k$, $k = 1,\ldots,g$. The perturbed distribution function may be written

$$\tilde{F}_k = (1 - \varepsilon)F_k + \varepsilon\delta_x\,,$$

where $\delta_x$ is the distribution function which assigns unit probability to the point x. Write $\tilde\theta_k = T(F_1,\ldots,\tilde{F}_k,\ldots,F_g)$ for the parameter evaluated at the perturbed distribution function; the influence function at x is then given by

$$I_k(x;\theta) = \lim_{\varepsilon\to 0}(\tilde\theta_k - \theta)/\varepsilon\,.$$

The subscript k is not retained in the remainder of this Chapter, since

only the distribution function for the first population is perturbed.

In the evaluation of the theoretical influence function only terms of order ε need be retained; for the sample version, this is equivalent to assuming that terms of order $n^{-2}$ can be ignored.
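For the simplest case, the mean, the sample influence function has a closed form: $(n-1)(\bar{x} - \bar{x}_{-m}) = x_m - \bar{x}$, the empirical analogue of the theoretical influence function $x - \mu$. A minimal check (NumPy assumed; illustrative data, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=20)
n = len(x)
m = 7                                       # delete the m-th observation
theta = x.mean()                            # estimator on all n observations
theta_m = np.delete(x, m).mean()            # same estimator without observation m
sif = (n - 1) * (theta - theta_m)           # sample influence function
ok = np.isclose(sif, x[m] - theta)          # equals the empirical influence x_m - x-bar
```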

2.2 Influence Function in Discriminant Analysis

Consider the population linear discriminant function, given by $\delta^T\Sigma^{-1}x = \psi^Tx$, where $x \sim N(\mu_k,\Sigma)$ if x belongs to population k = 1,2, and $\delta = \mu_1 - \mu_2$. The parameters considered here are Mahalanobis $\Delta^2 = \delta^T\Sigma^{-1}\delta$; the discriminant means $\psi^T\mu_k$; and the vector of coefficients $\psi = \Sigma^{-1}\delta$. These all involve $\Sigma^{-1}$ and $\mu_k$. Hence to determine the influence functions, it is first necessary to consider the effect of the perturbation on $\mu_k$ and $\Sigma^{-1}$.


To do this, formally consider

$$\Sigma = w_1\Sigma_{F_1} + w_2\Sigma_{F_2}\,, \qquad \text{with } w_1 + w_2 = 1 \text{ and } w_k > 0\,, \qquad (2.1)$$

where

$$\Sigma_{F_k} = \int(x - \mu_k)(x - \mu_k)^T\,dF_k \qquad (2.2)$$

and

$$\mu_k = \int x\,dF_k\,. \qquad (2.3)$$

In the following derivation, it is assumed that $\Sigma_{F_1} = \Sigma_{F_2}$ (i.e. that the covariance matrices are equal). The adoption of general $w_1$ and $w_2$ in (2.1) is to cover the possibility of unequal sample sizes in the extension to the sample influence function.

Now perturb the first population, evaluating (2.2) and (2.3) at $\tilde{F}_1 = (1 - \varepsilon)F_1 + \varepsilon\delta_x$. Write $\to$ to indicate the parameter after perturbation; then

$$\mu_1 \to (1 - \varepsilon)\mu_1 + \varepsilon x = \mu_1 + \varepsilon(x - \mu_1) = \mu_1 + \varepsilon z\,,$$

where $z = x - \mu_1$, and so the expression for δ becomes

$$\delta \to \delta + \varepsilon z\,. \qquad (2.4)$$

Similarly

$$\Sigma_{F_1} \to (1 - \varepsilon)\Sigma_{F_1} + \varepsilon zz^T\,,$$

and so

$$\Sigma \to (1 - \varepsilon w_1)\Sigma + \varepsilon w_1zz^T\,,$$

giving

$$\Sigma^{-1} \to (1 - \varepsilon w_1)^{-1}\Big(\Sigma^{-1} - \frac{\varepsilon w_1\Sigma^{-1}zz^T\Sigma^{-1}}{1 - \varepsilon w_1 + \varepsilon w_1z^T\Sigma^{-1}z}\Big) = (1 + \varepsilon w_1)\Sigma^{-1} - \varepsilon w_1\Sigma^{-1}zz^T\Sigma^{-1}\,, \qquad (2.5)$$

to order ε.

2.2.1 Influence function for Mahalanobis $\Delta^2$

Mahalanobis $\Delta^2$ is defined as $(\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 - \mu_2)$. Evaluation of $\Delta^2$ at the perturbed distribution function $\tilde{F}_1$ and the unperturbed distribution function $F_2$, using (2.4) and (2.5), gives

$$\Delta^2 \to (\delta + \varepsilon z)^T\{(1 + \varepsilon w_1)\Sigma^{-1} - \varepsilon w_1\Sigma^{-1}zz^T\Sigma^{-1}\}(\delta + \varepsilon z)\,.$$

Again retain terms only up to order ε, and write

$$\phi = \delta^T\Sigma^{-1}z\,, \qquad (2.6)$$

which leads to

$$\Delta^2 \to (1 + \varepsilon w_1)\Delta^2 + 2\varepsilon\phi - \varepsilon w_1\phi^2\,.$$

Hence the influence function for $\Delta^2$ becomes

$$I(x;\Delta^2) = w_1\Delta^2 + 2\phi - w_1\phi^2\,.$$

In (2.6), $\psi = \Sigma^{-1}\delta$ is the vector of discriminant coefficients, so that $\phi$ is simply $\psi^T(x - \mu_1)$, which is the deviation of the discriminant score from the discriminant mean for the first population.


Note that $\phi$ is not standardized to unit variance within populations; the variance is $\Delta^2$. Write the standardized vector of coefficients as $\psi_{st} = \Delta^{-1}\psi$, and let $\phi_{st} = \Delta^{-1}\phi$. Since $\phi \sim N(0,\Delta^2)$, then

$$\phi_{st} \sim N(0,1)\,,$$

which corresponds to the usual form for the discriminant score.

The influence function for $\Delta^2$ in terms of $\phi_{st}$ is given by the following main result:

$$I(x;\Delta^2) = w_1\Delta^2 + 2\Delta\phi_{st} - w_1\Delta^2\phi_{st}^2\,. \qquad (2.7)$$
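Result (2.7) can be checked by applying the perturbations (2.4) and (2.5) numerically and differencing, as in the sketch below (synthetic parameters; NumPy assumed, not part of the thesis):

```python
import numpy as np

rng = np.random.default_rng(7)
v = 3
A = rng.normal(size=(v, v)); Sigma = A @ A.T + v * np.eye(v)
mu1, mu2 = rng.normal(size=v), rng.normal(size=v)
delta = mu1 - mu2
Sinv = np.linalg.inv(Sigma)
D2 = delta @ Sinv @ delta                        # Mahalanobis Delta^2
w1 = 0.4
x = mu1 + rng.normal(size=v)
z = x - mu1
phi = delta @ Sinv @ z                           # (2.6)
phi_st = phi / np.sqrt(D2)
I_an = w1 * D2 + 2 * np.sqrt(D2) * phi_st - w1 * D2 * phi_st**2   # (2.7)
eps = 1e-7
delta_p = delta + eps * z                                          # (2.4)
Sigma_p = (1 - eps * w1) * Sigma + eps * w1 * np.outer(z, z)
I_num = (delta_p @ np.linalg.inv(Sigma_p) @ delta_p - D2) / eps
ok = np.isclose(I_num, I_an, atol=1e-4)          # agreement to first order in eps
```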

2.2.2 Influence function for discriminant means

The discriminant mean for the kth population is $\psi^T\mu_k$. Now follow the approach outlined in Section 2.2.1, and define

$$\eta_k = z^T\Sigma^{-1}\mu_k\,,$$

which leads to

$$\psi^T\mu_1 \to (1 + \varepsilon w_1)\psi^T\mu_1 + \varepsilon\phi - \varepsilon w_1\phi\eta_1 + \varepsilon\eta_1$$

and hence

$$I(x;\psi^T\mu_1) = w_1\psi^T\mu_1 + \phi - w_1\phi\eta_1 + \eta_1\,.$$

Similarly,

$$I(x;\psi^T\mu_2) = w_1\psi^T\mu_2 - w_1\phi\eta_2 + \eta_2\,.$$


2.2.3 Influence function for the discriminant function coefficients

The vector of discriminant coefficients is $\psi = \Sigma^{-1}\delta$. From (2.4) and (2.5),

$$I(x;\psi) = w_1\psi - w_1\phi\Sigma^{-1}z + \Sigma^{-1}z = w_1\psi + (1 - w_1\phi)\Sigma^{-1}z\,. \qquad (2.8)$$

A simple scalar summary is the squared length of the vector. The influence function for the squared length is given by

$$I(x;\psi^T\psi) = 2w_1\psi^T\psi - 2w_1\phi\psi^T\Sigma^{-1}z + 2\psi^T\Sigma^{-1}z = 2w_1\psi^T\psi + 2(1 - w_1\phi)\psi^T\Sigma^{-1}z\,. \qquad (2.9)$$

By the results (2.4) and (2.5), or equivalently by (2.8) and (2.7), the influence function for the standardized coefficients is

$$I(x;\psi_{st}) = \tfrac{1}{2}(w_1 - 2\phi\Delta^{-2} + w_1\phi^2\Delta^{-2})\psi_{st} - \Delta^{-1}\Sigma^{-1}z\,(w_1\phi - 1)\,.$$
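As with (2.7), result (2.9) can be checked by perturbing and differencing numerically (synthetic parameters; NumPy assumed, not part of the thesis):

```python
import numpy as np

rng = np.random.default_rng(8)
v = 3
A = rng.normal(size=(v, v)); Sigma = A @ A.T + v * np.eye(v)
delta = rng.normal(size=v)
z = rng.normal(size=v)
w1 = 0.6
Sinv = np.linalg.inv(Sigma)
psi = Sinv @ delta                               # coefficient vector
phi = delta @ Sinv @ z                           # (2.6)
kappa = psi @ Sinv @ z                           # psi^T Sigma^{-1} z
I_an = 2 * w1 * (psi @ psi) + 2 * (1 - w1 * phi) * kappa   # (2.9)
eps = 1e-7
Sigma_p = (1 - eps * w1) * Sigma + eps * w1 * np.outer(z, z)
psi_p = np.linalg.inv(Sigma_p) @ (delta + eps * z)          # perturbed psi
I_num = (psi_p @ psi_p - psi @ psi) / eps
ok = np.isclose(I_num, I_an, atol=1e-4)
```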

2.2.4 Approximations and representation of the influence functions

The influence function is a useful tool for assessing the effect of

a point x on the parameter of interest. It is also possible to consider

the influence function as a random variable, since it is a mathematical

transformation of a random variable x and as such will have a probability

distribution. The distribution is considered here for the theoretical

influence function and the results are extended in Section 2.3 to the

sample influence function.


With x assumed to follow a multivariate Gaussian distribution, the distribution of the influence function for $\Delta^2$ is negatively skewed. Differentiation of (2.7) w.r.t. $\phi_{st}$ shows that the maximum value of $I(x;\Delta^2)$ occurs at $\phi_{st} = (w_1\Delta)^{-1}$ (or $\phi = w_1^{-1}$); hence $I_{\max}(x;\Delta^2) = w_1\Delta^2 + w_1^{-1}$. From initial experience with the approach, it seems to be more convenient to consider $I_{\max}(x;\Delta^2) - I(x;\Delta^2) = I_M(x;\Delta^2)$ say, where

$$I_M(x;\Delta^2) = w_1^{-1}(1 - 2w_1\Delta\phi_{st} + w_1^2\Delta^2\phi_{st}^2)\,, \qquad (2.10)$$

since its distribution is non-negative and positively skewed, and so can be more readily approximated.

From (2.10), $I_M(x;\Delta^2)$ can be written as

$$I_M(x;\Delta^2) = w_1\Delta^2(\phi_{st} - w_1^{-1}\Delta^{-1})^2\,.$$

That is, $I_M(x;\Delta^2)$ is distributed as $w_1\Delta^2$ times a non-central chi-squared variate with 1 d.f. and non-centrality parameter $(w_1^2\Delta^2)^{-1}$. The moments of $I_M(x;\Delta^2)$ are readily evaluated: $E(I_M) = w_1^{-1} + w_1\Delta^2$, and $E(I_M^2) = w_1^{-2} + 6\Delta^2 + 3w_1^2\Delta^4$.

Johnson and Kotz (1970, Section 28.8) suggest approximating the non-central chi-squared distribution by a gamma distribution; empirical evidence also supports the approximation. Equating the moments for $I_M(x;\Delta^2)$ with those for the $b\chi^2_\nu$ distribution gives $\nu = 1 + (2w_1^2\Delta^2 + w_1^4\Delta^4)^{-1}$ and $b = \nu^{-1}(w_1^{-1} + w_1\Delta^2)$.
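The moment matching can be verified arithmetically: with the stated ν and b, the mean $b\nu$ and variance $2b^2\nu$ of the $b\chi^2_\nu$ distribution reproduce the mean and variance of $I_M(x;\Delta^2)$ (illustrative values of $w_1$ and $\Delta^2$; NumPy assumed):

```python
import numpy as np

w1, D2 = 0.4, 9.0                            # illustrative w_1 and Delta^2
mean_IM = 1 / w1 + w1 * D2                   # E(I_M)
var_IM = (1 / w1**2 + 6 * D2 + 3 * w1**2 * D2**2) - mean_IM**2   # E(I_M^2) - E(I_M)^2
nu = 1 + 1 / (2 * w1**2 * D2 + w1**4 * D2**2)
b = (1 / w1 + w1 * D2) / nu
ok_mean = np.isclose(b * nu, mean_IM)        # gamma mean matches
ok_var = np.isclose(2 * b**2 * nu, var_IM)   # gamma variance matches
```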

The form of the influence function in (2.7) suggests plotting the

change in Mahalanobis A2 against the corresponding discriminant score.

Useful graphical representations and approximations also exist for the

other summary statistics in discriminant analysis. The influence function


for the squared length of the coefficient vector, given in (2.9), involves the discriminant score and $\psi^T\Sigma^{-1}z$. With $\kappa = \psi^T\Sigma^{-1}z$, a plot of the influence function for $\psi^T\psi$ against $\phi$ and $\kappa$ shows a quadratic relationship, with the maximum occurring at $\phi = w_1^{-1}$, $\kappa = 0$; its value is then $2w_1\psi^T\psi$. For graphical representation, a plot of the change in squared length of $\psi$ against some linear combination of $\phi$ and $\kappa$ will usually be adequate, since the two are generally highly correlated. The variances and covariance of $\phi$ and $\kappa$ are $s^Ts$, $s^T\Lambda^{-2}s$ and $s^T\Lambda^{-1}s$, where $\Sigma = \Gamma\Lambda\Gamma^T$ and $s = \Lambda^{-1/2}\Gamma^T\delta$. Either the regression of $\phi$ on $\kappa$ or the first eigenvector seems suitable for graphical representation.

As for the theoretical influence function $I(x;\Delta^2)$, the shape of the distribution for $I(x;\psi^T\psi)$ suggests considering instead $I_M(x;\psi^T\psi) = I_{\max}(x;\psi^T\psi) - I(x;\psi^T\psi)$, where

$$I_M(x;\psi^T\psi) = 2(w_1\Delta\phi_{st} - 1)\kappa\,, \qquad (2.11)$$

since its distribution is positively skewed; the second moment is more difficult to evaluate, though it can be found by following the approach given in Kshirsagar (1972, Chapter 6, Section 5).

2.3 Probability Plots to Detect Possible Atypical Values

In practical applications of discriminant analysis, interest usually

focuses on the degree of group separation, reflected in Mahalanobis $D^2$,

and on the relative orders of magnitude of the coefficients as indicators

of the important discriminating variables.

The sample analogues of the theoretical influence functions are now

considered. As noted in Section 2.1, results for the theoretical influence

function carry over to the sample case if, in the derivation, e is replaced


by $-1/(n-1)$ and terms of order $n^{-2}$ can be ignored. Rather than

consider sample analogues of I(x;Δ²) in (2.7) and I(x;ψᵀψ) in (2.9),
it is more instructive, as in Section 2.2.4, to consider the sample
influence functions corresponding to IM(x;Δ²) in (2.10) and to
IM(x;ψᵀψ) in (2.11). The sample influence functions are given by
replacing θst by csᵀ(xm − x̄1), where cs is the standardized vector of
sample discriminant coefficients, xm is the mth observation, and x̄k
is the vector of means for the kth group, k = 1,2. The weights wk
are given by nk(n1 + n2)⁻¹. Mahalanobis Δ² is replaced by D2, while
κ = δᵀΣ⁻²z becomes (x̄1 − x̄2)ᵀV⁻²(xm − x̄1), where V is the pooled
covariance matrix on n1 + n2 − 2 d.f.
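The leave-one-out quantities just described can be computed directly. The following Python sketch (my own code, not from the thesis; NumPy assumed, and the function names are mine) computes the two-group Mahalanobis D2 on the pooled covariance matrix with n1 + n2 − 2 d.f., and the values D2 − D2m obtained by deleting each observation of the first group in turn.

```python
import numpy as np

def mahalanobis_d2(x1, x2):
    """Two-group Mahalanobis D^2, with the pooled covariance matrix
    on n1 + n2 - 2 degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    return diff @ np.linalg.solve(pooled, diff)

def influence_on_d2(x1, x2):
    """Sample influence values D^2 - D^2_m, deleting each observation
    of group 1 in turn and recomputing D^2."""
    d2 = mahalanobis_d2(x1, x2)
    return np.array([d2 - mahalanobis_d2(np.delete(x1, m, axis=0), x2)
                     for m in range(len(x1))])
```

Plotting these values against the standardized discriminant scores csᵀ(xm − x̄1) reproduces the quadratic pattern discussed above.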

The results presented in Section 2.2.4 hold only asymptotically

for the sample influence functions. Before proceeding with detailed

applications in practical situations, the adequacy of the chi-squared

approximations was checked against data generated from bivariate

Gaussian distributions with unit variances and correlation 0.9. The

means were taken as (0,0) and (0,2); 50 observations were generated

for the first group. A brief summary of one such run is as follows.

The observed variances are 0.70 and 0.90, and the correlation 0.89.

Mahalanobis D2 is 20.67. The following results are for the sample values
D2 − D2m, rather than (n − 1) = 49 times this quantity. (Note that for
the sample influence function, (n − 1)(D2 − D2m), the n refers to the
number of observations associated with the perturbed group.) A Q-Q

gamma plot of (n − 1)⁻¹IM(x;D2), with the parameters estimated by

maximum likelihood using the smallest 45 order statistics (Wilk,

Gnanadesikan and Huyett, 1962), is similar to that derived from the χ²

approximation. The maximum likelihood estimate of the shape parameter

is 0.55; the χ² approximation gives 0.50. The corresponding estimates


for the scale parameter are 2.05 and 2.16 respectively. A Q - Q

plot of the sample influence function values for IM(x;cTc) against

the quantiles of a gamma distribution with parameters estimated by

maximum likelihood also provides a good approximation.
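The plotting construction used in this check can be sketched as follows (Python with NumPy and SciPy; the function name is mine, and as a simplification the gamma parameters are fitted by ordinary maximum likelihood to the retained smallest order statistics, rather than by the censored-sample method of Wilk, Gnanadesikan and Huyett).

```python
import numpy as np
from scipy import stats

def gamma_qq_points(values, keep_frac=0.9):
    """Return (gamma quantiles, ordered values, shape, scale) for a
    Q-Q gamma plot.  The shape and scale are estimated by maximum
    likelihood from the smallest order statistics only (here, as a
    simplification, treating that subsample as a complete sample)."""
    y = np.sort(np.asarray(values, dtype=float))
    k = max(2, int(keep_frac * len(y)))
    # ML fit to the smallest k order statistics, location fixed at zero
    shape, _, scale = stats.gamma.fit(y[:k], floc=0.0)
    probs = (np.arange(1, len(y) + 1) - 0.5) / len(y)
    return stats.gamma.ppf(probs, shape, scale=scale), y, shape, scale
```

A plot of the ordered values against the returned gamma quantiles should then lie close to the unit slope line through the origin when the approximation is adequate.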

The examination of the results for the simulated data, and

subsequent practical experience, suggests a gamma approximation for

IM(x;D2) and for IM(x;cTc) to be adequate. The approach is now applied

to a practical example.

2.4 An Example

The following examination of data on two species of the rock crab

Leptograpsus is presented to show how the influence function and related

summary statistics can be used to assess the relative influences of the

different observations on the statistics of interest. It is then

necessary for the statistician to examine the absolute influence of the

observation; even when a few points appear to have a large relative

influence, they still may not affect the statistics of interest to any

important extent. Chapters Three and Four discuss robust procedures

for accommodating atypical observations.

Campbell and Mahon (1974) examined morphological divergence between

the two species of rock crab, here referred to as the blue species and

the orange species. A canonical variate analysis showed the major

divergence to be that between the species, with sexual dimorphism

being considerably less marked. In further unpublished work, the sexes

have been combined and this is done here. Five characters were measured

on 100 individuals of each species. Bivariate scatter plots are presented

in Figure 4 of Campbell and Mahon (1974).

The observed variances and correlations are very similar for the two

species. A discriminant analysis shows no overlap of discriminant scores.


The observed Mahalanobis D2 is 27.8.

Figures 2.1(a) and 2.1(b) show plots of the change in D2 against

the deviation of the standardized discriminant score from that for the

species mean - the term discriminant score will be taken to imply this.

The quadratic nature of the plots is obvious, with the maximum

occurring around csᵀ(xm − x̄1) = 0.4 (= 2/D) in each case. Mahalanobis

D2 is decreased for observations with discriminant scores between

approximately -0.5 and 1.5 while D2 is actually increased when an

observation with a discriminant score outside this interval is deleted.

The greatest increase in D2 corresponds to the deletion of observations

whose scores are furthest from that for the mean of the other species.

Examination of the 100 discriminant scores for each species shows

some negative skewness. For example, the first and last deciles, the

quartiles and median for the two species are (interpolated from the

ecdf and rounded): orange -1.50, -1.0, 0.0, 0.75, 1.25; and blue

-1.25, -0.75, 0.0, 0.50, 1.00. The scores for the orange species are

more dispersed.

Application of the often used, but only asymptotically true,

argument that the discriminant scores can be treated as standard

Gaussian deviates would suggest that three orange specimens and three

blue specimens should be examined further. The maximum increase in D2

corresponds to a discriminant score of 3.0 for the blue and 2.4 for

the orange species.

Figures 2.2(a) and 2.2(b) show Q-Q gamma plots of IM(D2), with

parameters estimated by maximum likelihood from the smallest 95 order

statistics. The linearity of both plots is apparent; the slope of the

plot for the orange species is very close to unity, as is that for the

blue species if the atypical observation is ignored. The estimated

shape parameters are 0.65 and 0.70 respectively. The linearity of the


Figure 2.1 Plot of D2 − D2m against csᵀ(xm − x̄1) for

(a) orange species - score for blue mean is 5.27

(b) blue species - score for orange mean is -5.27

Figure 2.2 Gamma probability plot of IM(x;D2), with parameters

estimated by maximum likelihood for

(a) orange species - shape parameter = 0.65

(b) blue species - shape parameter = 0.71

Figure 2.3 Gamma probability plot of IM(x;cTc), with parameters

estimated by maximum likelihood for

(a) orange species

(b) blue species

The symbol *, • or # in these Figures and the Figures in the

following Chapters represents one individual; a number, on the

same figure, indicates that number of overprintings; and 9

indicates at least 9 overprintings


[Figure 2.1: plots of D2 − D2m against discriminant score for (a) the
orange species and (b) the blue species]


[Figure 2.2: gamma probability plots of IM(x;D2) against gamma quantiles
for (a) the orange species and (b) the blue species]


[Figure 2.3: gamma probability plots of IM(x;cTc) against gamma quantiles
for (a) the orange species and (b) the blue species]


plots contrasts with the lack of linearity of a Gaussian probability

plot of the discriminant scores; the latter is not surprising for

this data set in view of the skewness discussed above.

Figures 2.3(a) and 2.3(b) show Q-Q gamma plots of IM(cTc), with

parameters estimated by maximum likelihood. Again the main trend for

each species lies close to the unit slope line through the origin.

For the orange species, the observed distribution is slightly shorter-

tailed; slight curvature is also evident for the blue species, though

three atypical observations are indicated.

None of the summary graphs and analyses indicates any atypical

observations for the orange species. The histogram and ecdf of the

discriminant scores, the plot of D2 - D2m

against cs( fi - x1), and the

gamma plot for the influence function of cTc all indicate that three

blue observations warrant further examination. The gamma plot for the

influence function for D2 indicates one of the three as being obviously

atypical. Examination of Figure 4(a) in Campbell and Mahon (1974)

shows three observations with carapace width larger than expected on

the basis of that for front lip when compared with the remaining

observations.

The detailed examination of the rock crab example using the

influence function and related summary statistics has identified one,

and possibly three, observations for the blue species which warrant

further consideration. Moreover, observations from the orange species

with similar discriminant scores to two of the three blue observations

are shown to have minimal influence on D2. The rock crab data had

previously been screened by visual scanning of observations ordered by

largest variable measurement, by comparison of correlations and

variances across groups, and by examination of canonical variate scores

after a preliminary species/sex analysis. It was unlikely, therefore,


that observations having a substantial effect on the analysis would

have remained undetected. And this is so; the deletion of the one,

obvious possible atypical value increases D2 by only 0.9 (to 28.7),

so that the absolute influence of the value is in this case minimal.

What this study has shown, somewhat surprisingly, is the asymmetric

way in which observations influence D2. Moreover, inclusion of an

observation lying furthest from the mean of the other group decreases

rather than increases D2. In the latter case the increase in variances

and/or change in correlation must offset the increase in the separation

of the means. The linear and quadratic components for θst in (2.7)

reflect the linear and the quadratic nature of (2.4) and (2.5)

respectively.

The analysis of the Leptograpsus data illustrates that inspection

of discriminant scores per se may sometimes be misleading. Obvious

atypical scores in discriminant and canonical variate analysis are

often taken to be indicative of incorrect measurement or incorrect

allocation of specimens (for example with respect to sex). However,

the assumption of an approximate Gaussian distribution of the

discriminant scores to guide such decisions may not always be justified.


CHAPTER THREE: ROBUST PROCEDURES TO EXAMINE VARIATION WITHIN A GROUP

In this Chapter, the performance of robust procedures when applied

to multivariate data is examined. Robust M-estimation of means and

covariances is reviewed in Section 3.2, and the use of the robust

estimates in conjunction with probability plots of associated

Mahalanobis squared distances is considered. It is shown that

detection of atypical observations is enhanced when the robust

estimates are used, rather than the usual estimates. The weights

associated with the robust estimation can also themselves be used to

indicate atypical observations. The robust estimates are found to be

similar to the usual estimates for uncontaminated data. A procedure

for robust principal components analysis is given in Section 3.3.

Typical data sets are examined in Section 3.4, while some general

recommendations are given in Section 3.5.

3.1 Introduction

The increasing interest in robust procedures over recent years

has been motivated in part by the observation that actual data sets

contain occasional gross errors. This becomes important for largish

data sets, since careful quantitative inspection is difficult. Since

the performance of classical procedures is seriously influenced by

atypical values, robust methods which are little influenced by such

values provide an attractive alternative. The survey papers by Huber

(1972) and Hampel (1973) summarize the important methods and results

from the earlier years of univariate robust studies, while Hampel

(1977) includes a review of more recent results. An introductory paper

by Hogg (1977) gives the basic ideas.


The emphasis throughout this Chapter is on the provision of

estimates of means and of covariances for a single group which are

little influenced by atypical observations, and on the detection of

observations having undue influence on the estimates. The procedures

to be described are based on the assumption of symmetry of the under-

lying distribution; in fact, a multivariate Gaussian form is examined

in the probability plotting. It seems essential to make some

distributional assumption, otherwise the concept of an atypical

observation has little meaning (see Barnett and Lewis, 1978). It is

assumed in what follows that the data are consistent with an approximately

symmetric distribution. If preliminary analyses indicate that such an

assumption is not warranted, it is assumed that a suitable transformation

is applied to achieve approximate symmetry. Gnanadesikan (1977,

Chapter 5) suggests procedures to achieve this.

3.2 Robust Estimation of Multivariate Location and Scatter

Healy (1968) and Cox (1968) have suggested an extension of

probability plots of univariate data to the multivariate situation, by

plotting the Mahalanobis squared distance of each observation against

the order statistic for a chi-squared distribution with v d.f., where

v is the number of variables.

If x̄ represents the v×1 vector of sample means, and V the sample

covariance matrix, then the Mahalanobis squared distance of the mth

observation from the mean of the observations is defined by

    d2m = (xm − x̄)ᵀV⁻¹(xm − x̄).


Gnanadesikan (1977, p.172) discusses probability plots of dm in

detail; as with univariate Gaussian plotting, its particular appeal

is that it combines examination of the distributional assumption of

a multivariate Gaussian form with detection of atypical observations

and is especially suited to informal graphical description.
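A minimal sketch of this plotting device (Python with NumPy and SciPy; the function name is my own) pairs the ordered Mahalanobis squared distances with the corresponding chi-squared quantiles on v d.f.:

```python
import numpy as np
from scipy import stats

def chi2_qq_points(X):
    """Mahalanobis squared distances of each observation from the sample
    mean, paired with chi-squared (v d.f.) quantiles for a Q-Q plot."""
    n, v = X.shape
    z = X - X.mean(axis=0)
    V = np.cov(X, rowvar=False)
    d2 = np.einsum('ij,ij->i', z @ np.linalg.inv(V), z)
    probs = (np.arange(1, n + 1) - 0.5) / n
    return stats.chi2.ppf(probs, df=v), np.sort(d2)
```

For data consistent with a multivariate Gaussian form the plotted points lie close to a straight line; atypical observations appear at the upper end, away from the line.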

Observations which are grossly atypical in a single component

can often be detected using univariate techniques applied to each

variable. However for multivariate data, observations are often only

found to be atypical when the value for each variable is considered in

relation to the other variables. The extract of data listed in Table

3.1 in Section 3.4 shows that the atypical values are within the range

of the data when each variable is considered separately. However, the

underlined values fail to maintain the pattern of relationships between

the variables evident in the majority of the observations.

As noted in the previous Chapter, an atypical multivariate vector

of observations will influence both the means and covariances. The

tendency will usually be to deflate the correlations and possibly

inflate the variances, and so to inflate the size of the associated

concentration ellipsoid. This will in general have the effects of

decreasing the Mahalanobis distance for the atypical observation and

distorting the rest of the plot.

The Mahalanobis distances play a basic role in multivariate

M-estimation. From the applied viewpoint, M-estimators can be considered

as a simple modification of classical estimators; the contribution of an

observation to the statistic(s) of interest is given full unit weight if

it is a reasonable observation, otherwise its contribution is downweighted.

They can be considered as classical estimators after a weight function

has been applied to the data (Hampel, 1977). The weight function is

given by wm = w(tm)/tm. Here tm is a measure of deviation reflecting


the discrepancy of the mth observation from the robust average value,

relative to a robust measure of scatter, and w is a bounded influence

function (Hampel, 1974), linear over the range of values of tm

corresponding to reasonable data, but bounded outside this range.

For robust M-estimation of multivariate location and scatter, the

appropriate measure of deviation turns out to be the Mahalanobis

distance (see below). Hampel (1973) has suggested that the influence

and hence the weight of an extreme atypical observation should be zero,

so that w should redescend for values of tm sufficiently large.

M-estimators of multivariate location and scatter are discussed

by Maronna (1976), Hampel (1977) and Huber (1977a,b). The equations

used here to define robust M-estimators of mean and covariance are

given in (3.4) and (3.5) below.

An outline of their derivation is as follows. Consider an

elliptically symmetric density of the form |Σ|⁻¹/²h(δ), where
δm = {(xm − μ)ᵀΣ⁻¹(xm − μ)}¹/². Then the relevant part of the log
likelihood is

    −(n/2) log|Σ| + Σ log h(δm),

with summation over m = 1,...,n. Write

    u(δm) = −h(δm)⁻¹ h′(δm) δm⁻¹,

and let dm be defined analogously to δm, with μ and Σ replaced by their
maximum likelihood estimators.


Then differentiation w.r.t. μ gives

    μ̂ = Σ u(dm)xm / Σ u(dm).                                    (3.1)

Differentiation w.r.t. Σ gives

    Σ̂ = n⁻¹ Σ u(dm)(xm − μ̂)(xm − μ̂)ᵀ.

Huber (1977a, p.168; 1977b, p.42) has suggested the modified form

    Σ̂ = Σ u(dm)(xm − μ̂)(xm − μ̂)ᵀ / Σ v(dm)                      (3.2)

with arbitrary v(dm) as the most general form for an affinely invariant

M-estimator of covariance. This form is adopted for the definition of

Vc in (3.5) below, with the v(dm) related to the weights to give an

unbiased estimator when all observations have full weight.

The above derivation is for an elliptically symmetric density;

for the multivariate Gaussian density, u(dm) = 1 and so the usual

estimators result. The appropriate form for robust M-estimators results

by associating the elliptical density with a contaminated multivariate

Gaussian density. The robust estimators effectively give full weight to

observations assumed to come from the main body of the data, from the

uncontaminated distribution, but reduced weight or influence to

observations from the tails of the contaminating distribution. In

practice, this means downweighting the influence of observations with

unduly large Mahalanobis distances.

To define robust M-estimators of location and scatter, rewrite

(3.1) and (3.2) as


    μ̂ = Σ dm⁻¹w(dm)xm / Σ dm⁻¹w(dm),

where

    w(dm) = dm u(dm),

and

    Σ̂ = Σ dm⁻²φ(dm)(xm − μ̂)(xm − μ̂)ᵀ / Σ v(dm),

where

    φ(dm) = dm² u(dm).                                           (3.3)

For robust M-estimation, the influence of an observation must be
bounded. As defined, w reflects the linear influence of an observation
on the sample mean, and φ reflects the quadratic influence of an
observation on the variances and covariances. The bounded forms
chosen for w are the influence function proposed by Huber (1964) and a
re-descending form which gives the qualitative behaviour proposed by
Hampel (1973), as given in (3.6) and (3.7) below. If w(dm) is bounded,
then (3.3) suggests taking φ(dm) = dm w(dm), which is not bounded. A
more suitable approach is to bound φ(dm) directly; a simple choice is
to set φ(dm) = w²(dm).

The equations used here to define robust estimators of means and

covariances are as follows:

    x̄c = Σ wm xm / Σ wm                                         (3.4)

and

    Vc = Σ wm²(xm − x̄c)(xm − x̄c)ᵀ / (Σ wm² − 1),                 (3.5)

where

    wm = w(dm)/dm

and

    dm = {(xm − x̄c)ᵀVc⁻¹(xm − x̄c)}¹/².

The solution for x̄c and Vc is iterative.

The two forms of w used here are:

(i) the non-descending form proposed by Huber (1964)

    w(dm) = dm   if dm ≤ b1
          = b1   if dm > b1;                                     (3.6)

(ii) a re-descending form suggested by Hampel (pers. comm.)

    w(dm) = dm                          if dm ≤ b1
          = b1 exp{−½(dm − b1)²/b2²}    if dm > b1.              (3.7)

I have found that b2 in the range 2.5 to 1.0 gives good performance;

the smaller the value of b2, the faster is the rate at which w descends.


The constant b1 is taken here as

    b1 = √v + b0/√2 ,

where b0 is in the range 1.64-3.09; I have usually taken b0 = 2.0.
This form of b1 is derived by assuming that dm² is distributed
approximately as χ² on v d.f. Fisher's square root approximation then
gives dm approximately N(√v, ½) (strictly, E(dm) ≈ (v − ½)¹/²).

Then, on the assumption that most of the data points are reasonable

observations, b0 is here equated with a percentage point of the

standard Gaussian distribution.

The approach adopted here is to determine x and Vc and the

associated weights based on the non-descending w, possibly for a range

of values of b0. The initial estimates are the usual means and

covariance matrix. Then the redescending w is introduced, and up to

10 iterations are performed for each value of b2. Examination of

estimates from successive iterations with v = 7 shows that around six

iterations are needed to effect the qualitative changes reported in

Section 3.4; little change in distances and hence weights for the

redescending w seems to result after ten iterations.
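The iterative scheme of (3.4)-(3.7) can be sketched as follows (Python with NumPy; a simplified sketch of the estimators described above, not the original program — a fixed number of iterations is used, and the starting values are the usual estimates, as above).

```python
import numpy as np

def m_weights(d, b1, b2=None):
    """wm = w(dm)/dm for the Huber form (3.6) (b2=None) or the
    redescending form (3.7)."""
    wd = np.where(d <= b1, d,
                  b1 if b2 is None
                  else b1 * np.exp(-0.5 * (d - b1) ** 2 / b2 ** 2))
    return np.where(d > 0, wd / np.where(d > 0, d, 1.0), 1.0)

def robust_mean_cov(X, b0=2.0, b2=None, n_iter=10):
    """Iterate (3.4) and (3.5), starting from the usual estimates,
    with b1 = sqrt(v) + b0/sqrt(2)."""
    n, v = X.shape
    b1 = np.sqrt(v) + b0 / np.sqrt(2.0)
    xbar, Vc = X.mean(axis=0), np.cov(X, rowvar=False)
    w = np.ones(n)
    for _ in range(n_iter):
        z = X - xbar
        d = np.sqrt(np.einsum('ij,ij->i', z @ np.linalg.inv(Vc), z))
        w = m_weights(d, b1, b2)
        xbar = (w[:, None] * X).sum(axis=0) / w.sum()   # (3.4)
        z = X - xbar
        Vc = (w[:, None] ** 2 * z).T @ z / ((w ** 2).sum() - 1.0)  # (3.5)
    return xbar, Vc, w
```

The returned weights can be inspected directly, as in Section 3.4: reasonable observations receive full unit weight, while observations with unduly large Mahalanobis distances are downweighted.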

3.3 Robust Principal Components Analysis

A principal components analysis of the covariance matrix V (or

associated correlation matrix R) seeks a linear combination ym = uTxm

of the original variables xm such that the usual sample variance of the

ym is a maximum. The solution is given by an eigenanalysis of V,
viz. V = UEUᵀ:


the eigenvectors ui of U define the linear combinations while the

corresponding diagonal elements ei of the diagonal matrix of eigenvalues

E are the sample variances of the derived variables.

It is arguable that the direction ul should not be determined by

one or two atypical values. Consider an example in which the data

form a tight ellipse except for one observation (see, e.g., the figure

in Hinkley, 1978); this observation may well determine the direction

of the first eigenvector.

An obvious procedure for robustifying the analysis is to replace V

by the robust estimator Vc; this is the M-estimator solution to robust

principal components. This procedure weights an observation according

to its total distance dm from the robust estimate of location. But

this distance can be decomposed into components along each eigenvector;

and an observation may have a large component along one direction and

small components along the remaining directions and hence not be

adequately downweighted. It is therefore appealing to apply robust

M-estimation of mean and variance to each principal component. The

direction cosines will then be chosen to maximize the robust variance

of the resulting linear combination. The aim is to determine those

observations having undue influence on the resulting directions, and

to determine directions which are little influenced by atypical

observations.

The proposed procedure is as follows.

1. Take as an initial estimate of u1 the first eigenvector from an

eigenanalysis of V or Vc.

2. Form the principal component scores ym = u1ᵀxm.

3. Determine the M-estimators of mean and variance of ym, and the
associated weights wm. The median and {0.74 × (interquartile range)}²


of the ym can be used to provide initial robust estimates.

Here 0.74 = (2 × 0.675)⁻¹ and 0.675 is the 75% quantile for

the N(0,1) distribution. This choice of initial estimate of

variance ensures that the proportion of observations downweighted

is kept reasonably small.

3(a) After the first iteration, take the weights wm as the minimum of

the weights for the current and previous iterations; this is

necessary to prevent oscillation of the solution.

4. Calculate x̄ = Σ wm xm / Σ wm and

    Vw = Σ wm²(xm − x̄)(xm − x̄)ᵀ / (Σ wm² − 1).

5. Determine the first eigenvalue and eigenvector ui of Vw.

6. Repeat steps (2) to (5) until successive estimates of the eigenvalue

are sufficiently close.

To determine successive directions ui, 2 ≤ i ≤ v, project the data
onto the space orthogonal to that spanned by the previous eigenvectors
u1,...,ui−1, and repeat steps (2) to (5); take as the initial estimate

the second eigenvector from the last iteration for the previous
eigenvector. The proposed procedure for successive directions can be set

out as follows.

7. Form xim = (I − Ui−1Ui−1ᵀ)xm, where Ui−1 = (u1*,...,ui−1*).

8. Repeat steps (2) to (5) with xim replacing xm, and determine ui*.

The covariance matrix based on the xim will be singular, with

rank v − i + 1. However only the first eigenvalue and eigenvector

are required.


9. The principal component scores are given by ui*ᵀxim =
ui*ᵀ(I − Ui−1Ui−1ᵀ)xm, and hence ui = (I − Ui−1Ui−1ᵀ)ui*.

Steps (7), (8) and (9) are repeated until all v eigenvalues ei

and eigenvectors ui, together with the associated weights, are determined.

Alternatively the procedure may be terminated after some specified

proportion of variation is explained.
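A condensed sketch of steps (1) to (9) is given below (Python with NumPy; the function names are my own, the step 3(a) safeguard against oscillating weights is omitted, and the Huber form (3.6) is used for the univariate M-estimation).

```python
import numpy as np

def univariate_m(y, b0=2.0, n_iter=10):
    """M-estimates of mean and variance of the scores y (step 3), started
    from the median and {0.74 x interquartile range}^2."""
    b1 = 1.0 + b0 / np.sqrt(2.0)                 # b1 with v = 1 here
    q25, q50, q75 = np.percentile(y, [25, 50, 75])
    mu, sd = q50, max(0.74 * (q75 - q25), 1e-12)
    w = np.ones_like(y)
    for _ in range(n_iter):
        d = np.abs(y - mu) / sd
        wd = np.where(d <= b1, d, b1)            # Huber form (3.6)
        w = np.where(d > 0, wd / np.where(d > 0, d, 1.0), 1.0)
        mu = (w * y).sum() / w.sum()
        sd = max(np.sqrt((w**2 * (y - mu)**2).sum()
                         / ((w**2).sum() - 1.0)), 1e-12)
    return mu, sd, w

def robust_pca(X, tol=1e-8, max_iter=50):
    """Sequential robust principal components, steps (1) to (9)."""
    X = np.asarray(X, dtype=float)
    n, v = X.shape
    eigvals, eigvecs = [], []
    Xi = X
    for i in range(v):
        u = np.linalg.eigh(np.cov(Xi, rowvar=False))[1][:, -1]   # step 1
        e_prev = -np.inf
        for _ in range(max_iter):
            _, _, w = univariate_m(Xi @ u)                       # steps 2-3
            xbar = (w[:, None] * Xi).sum(0) / w.sum()            # step 4
            Z = Xi - xbar
            Vw = (w[:, None]**2 * Z).T @ Z / ((w**2).sum() - 1.0)
            e, U = np.linalg.eigh(Vw)                            # step 5
            u, e1 = U[:, -1], e[-1]
            if abs(e1 - e_prev) < tol * max(abs(e1), 1.0):       # step 6
                break
            e_prev = e1
        eigvals.append(e1)
        eigvecs.append(u)
        Ui = np.column_stack(eigvecs)
        Xi = X @ (np.eye(v) - Ui @ Ui.T)                         # step 7
    return np.array(eigvals), np.column_stack(eigvecs)
```

Each direction is found from the reweighted covariance matrix of the data projected onto the orthogonal complement of the directions already extracted, so the resulting eigenvectors remain mutually orthogonal.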

When the principal components analysis is based on the correlation

matrix, the data must be standardized to unit variance for each variable

to determine successive eigenvalues and eigenvectors. Two possibilities

exist: the data can be standardized by the robust estimates of standard

deviation determined from Vc before determining ui; or the robust

estimates of standard deviation from Vw can be used to standardize

the data following the calculation of ui.

Finally, a robust estimate of the covariance or correlation matrix

can be found from U*E*U*ᵀ, to provide an alternative robust estimate.

Both this approach and that described in the previous Section give a

positive definite correlation/covariance matrix. Robust estimation of

each entry separately does not always achieve this.

3.4 Some Practical Examples

The examples discussed here are used to illustrate results which

are typical of those obtained when the robust procedures discussed in

the previous Sections are applied to data arising from practical problems.

With experience, a recommended procedure has now evolved. Initially,

however, a fair amount of exploratory work, using five data sets typical

of discrimination studies in morphometric and medical problems, was

carried out. First, the usual estimates of means and covariance matrix


were calculated. Gamma probability plots of the associated dm² and
Gaussian probability plots of the (dm²)¹/³ were made. The latter

uses the well-known Wilson-Hilferty (1931) transformation of a gamma

variate to Gaussian form (see also Healy, 1968, p.159). Then robust

M-estimates were determined, with non-descending w and with b0 in

the range 1.64-3.09 (in the range of .95-.999 percentage points of

the N(0,1) distribution). The weights were noted and probability plots

of associated distances were made. Finally the redescending w was

introduced with b2 = 2.5(0.25)0.75. The values b2 = 2.25, 1.75 and

1.25 were found to be sufficiently representative, while a value of

b0 = 2.0 seems to indicate atypical observations and yet not be too

sensitive to random variation (see Section 3.5).

The first data set to be discussed is taken from a study of

geographic variation in the whelk Thais lamellosa (C. Campbell, 1978).

Data are available for twelve groups on the west coast of North

America. The group sizes are: 50, 72, 99, 76, 37, 36, 46, 46, 51, 34,

28 and 43. Measurements were made on twenty variables; canonical

analyses show that seven of these provide much of the between-groups

discrimination. Robust covariance estimation was applied to the twelve

groups based on the seven variables.

The probability plots with b0 = 2, b2 = 1.25 show that nine of the

groups have one or two atypical observations, while one group has seven

atypical observations. The associated weights are all less than 0.35;

18 of the 21 atypical observations have zero weight. Probability plots

of the usual distances indicate only ten of the atypical observations,

with a further four or five doubtful. The analyses are carried out

group by group; a between-groups analysis is discussed in Chapter Four.

Four of the 21 atypical observations have two atypical values in

the vector, giving 25 out of a total of 4326 (= 618 x 7 variables)


variable values which are atypical. The variables for each group show

high correlations, so that with the elementary precaution of listing

the observations in increasing order of the largest variable (here

overall length), atypical values are readily apparent once the

observation is indicated. When the atypical observations are compared

with those above and below them on the listing (see Table 3.1),

corrections of either 100 or 50 units would provide good agreement for

all but four values, while a further two are obviously the result of

interchanging the order of two numbers (e.g. 015 should almost certainly

read 105). The correction of 100 or 50 units is an obvious and acceptable

one since the dial on the calipers used to measure the whelks records to

50 units, and it is to be expected in more than 4000 measurements that

occasionally the linear scale will be misread.

Figure 3.1(a) shows a Gaussian probability plot of usual distances

for group 3 (n = 99) (i.e. a Q-Q probability plot of cube root of

squared Mahalanobis distances against Gaussian order statistics with

the distances calculated using the usual means and covariances); there

is some indication of one atypical observation (#79). Robust

M-estimation with b0 = 2.0, b2 = ∞ (non-descending w) gives a weight of

0.03 for this observation; a second observation (#97) has a weight of

0.11. With b2 = 1.25 (redescending w), both observations have a zero

weight. Figure 3.1(b) shows a Gaussian probability plot of robust

distances when b0 = 2.0, b2 = 1.25. Two observations, 79 and 97, are

clearly atypical. Both are atypical in the second variable, the first

being out by 100 units and the second by 50 units. Figure 3.1(c) shows

a Gaussian probability plot of robust distances with b0 = 2.0, b2 = 1.25

after the two observations have been corrected. The linearity of the

plot is obvious. None of the weights is now less than 0.35. The usual

estimate of standard deviation is 48.9; the robust estimate is 44.9


Table 3.1

Extract of listing of data for Group 2 (n = 72) for Thais data, and overall summary.

observation                  v1      v2      v4      v6
    27                      313     258     208     45
    28                      313     271     212     44
    29                      314     264     197     43
    30(a)                   315     265     265     44
    31                      316     279     224     40
    32                      317     265     200     41
    33(a)                   318     200     255     42
    34                      321     271     213     45

minimum value               208     173     150     34
maximum value               416     355     269     61
mean, original data         320.5   268.3   212.1   44.1
std devn                    42.44   36.77   26.67   5.43
mean, corrected data(b)     320.5   269.0   210.7   44.1
std devn                    42.44   35.92   25.39   5.43

(a) observations indicated as possibly atypical by robust M-estimation; underlined values are probably out by 50 units.

(b) observations indicated as possibly atypical are adjusted by 50 units.

Figure 3.1 - Q-Q plots of Mahalanobis squared distances for Thais data from group 3 (horizontal axis: Gaussian quantiles)

(a) Gaussian plot - cube root of usual squared distances

(b) Gaussian plot - cube root distances - redescending estimates (b0 = 2.0, b2 = 1.25)

Figure 3.1 (cont.) - Q-Q plots - Thais data - group 3 (horizontal axis: Gaussian quantiles)

(c) Gaussian plot - cube root of squared distances - redescending function - corrected data (b0 = 2.0, b2 = 1.25)


while the robust estimate after correcting the data is 45.8. The

correlations of v2 with the remaining variables are increased by 0.02

to 0.06.
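The construction of plots such as Figure 3.1(a) is straightforward; the following is a minimal sketch in Python (the simulated data and function names are mine, not the thesis's):

```python
import numpy as np
from scipy import stats

def mahalanobis_sq(X):
    """Squared Mahalanobis distances of the rows of X from the usual
    (non-robust) mean, using the usual covariance matrix."""
    R = X - X.mean(axis=0)
    Vinv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum('ij,jk,ik->i', R, Vinv, R)

def gaussian_qq_cube_root(d2):
    """Pairs (Gaussian quantile, ordered cube root of squared distance)
    for a Gaussian Q-Q probability plot."""
    n = len(d2)
    p = (np.arange(1, n + 1) - 0.5) / n        # plotting positions
    return stats.norm.ppf(p), np.sort(d2) ** (1.0 / 3.0)

# illustrative stand-in data: 99 observations on 3 variables
rng = np.random.default_rng(0)
X = rng.standard_normal((99, 3))
q, d13 = gaussian_qq_cube_root(mahalanobis_sq(X))
# near-linearity of (q, d13) supports multivariate Gaussian form;
# isolated points at the upper right suggest atypical observations
```

For genuinely Gaussian data the plotted points fall close to a straight line; an atypical observation stands off the line at the upper end.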

Figure 3.2(a) shows a Gaussian probability plot of usual distances

for group 8 (n = 46). The curvilinearity suggests a departure from

multivariate Gaussian form. Robust M-estimation with b0 = 2.0,

b2 = 1.25 gives two zero weights (the corresponding weights with

b2 = ∞ are 0.17 and 0.15 respectively). The Gaussian probability plot

of robust distances (b2 = 1.25) in Figure 3.2(b) indicates two atypical

observations (#24 and #28). The atypical values are out by 100

units for v2 for #24 and by 50 units for v3 for #28; the Q-Q plot of

corrected distances, in Figure 3.2(c), is now linear. Figure 3.2(d)

shows a gamma probability plot of usual distances for group 8 (i.e.

a Q-Q probability plot of squared Mahalanobis distances against gamma

quantiles with parameters (0.5, 3.5)). While the indication that there are two atypical values is clearer than in Figure 3.2(a), there is some

curvilinearity in the plot. A gamma probability plot of robust distances,

corresponding to Figure 3.2(b), gives the same striking indication of

two atypical observations. For this group, the standard deviations are

little affected by the robust estimation, with that for v2 reduced from

36.1 to 34.7. However the correlations are increased: for example,

r(1,2) increases from 0.863 to 0.985 and r(2,3) from 0.855 to 0.965.

Observation 24 also appears to be atypical for v4 by about 30 units; r(2,4) and r(3,4) are increased from 0.476 and 0.455 to 0.773 and 0.816

respectively.

The second data set is taken from an unpublished study of morpho-

metric divergence in male and female scorpions occurring in Australia.

Nine variables were measured on each specimen. The male data for one

species (n = 181) are discussed here.

Figure 3.2 - Q-Q plots of Mahalanobis squared distances for Thais data from group 8 (horizontal axis: Gaussian quantiles)

(a) Gaussian plot - cube root of usual squared distances

(b) Gaussian plot - cube root distances - redescending function (b0 = 2.0, b2 = 1.25)

Figure 3.2 (cont.) - Q-Q plots - Thais data - group 8

(c) Gaussian plot - cube root of squared distances - redescending function - corrected data (b0 = 2.0, b2 = 1.25)

(d) Gamma plot - usual squared distances (horizontal axis: Gamma (0.5, 3.5) quantiles)


Robust covariance estimation (b0 = 2.0, b2 = 1.25) indicates five

observations with weights less than 0.35; one of these (#139) has a

weight of 0.34. The associated Gaussian probability plot shows only

four atypical observations. A robust principal components analysis

indicates a further three observations with weight less than 0.10 for

at least one component and a further sixteen with weight less than 0.35.

The advantage of the combined approach is that different combinations

of variables are examined. For example, one of the observations (#162)

with a zero weight for robust covariance estimation (b0 = 2.0, b2 = 1.25)

has zero weight for principal components 5 and 6. Examination of the

eigenvectors shows variable 8 to have a high loading for these components.

The measurement of 3.5 was checked by my co-worker and found to be 6.0

(the robust estimate of standard deviation is 0.77). Another observation

(not yet rechecked) has a weight of 0.60 for robust covariance estimation

(b0 = 2.0, b2 = 1.25) but weights of 0.06, 0.21, 0.27 and 0.23 for

principal components 3-6. Of those observations checked and corrected,

the common error was a measurement out by 1 or 1.5 mm; the metal dial

calipers were graduated to 0.05 mm.

3.5 Discussion

The value of Q-Q plots of Mahalanobis distances to assess the

distributional properties of the data and to indicate possibly atypical

values is well-recognized in multivariate studies. Because of the

fundamental role that the distances play in robust M-estimation of

location and scatter, the combination of robust Mahalanobis distances

and Q-Q plots seems an obvious one. As the examples reported here show,

the combined approach enhances the detection of atypical values; in the

whelk example, a probability plot of the usual distances fails to


indicate an atypical observation, due largely to inflation of the

variances for some variables.

The weights wm associated with the dm also indicate atypical

observations. Extensive examination of a number of data sets

(v = 3, 4, 5, 7, 9) shows that a weight of less than 0.30 with

b0 = 2.0, b2 = 1.25 (corresponding approximately to a weight of less

than 0.60 with b2 = ∞) has always indicated an atypical observation;

this judgment is based on an examination of the corresponding Q-Q

plots and of the variable values for the observations. A weight of

more than 0.70 with b2 = ∞ is associated with a typical observation.

For v between 4 and 10, a weight of approximately 0.42 with b2 = 1.25

(or 0.66 with b2 = ∞) corresponds to a squared distance whose value coincides with the 0.1% point of the χ²_v distribution. Some flexibility

exists with the choice of b0; values in the range 1.65 to 2.35 give

similar qualitative results.

If the robust estimates are to be used in subsequent statistical

analyses, such as principal components and canonical variate analysis,

it is important that they differ little from the usual estimates when

applied to uncontaminated data. Now for multivariate Gaussian data,

φ(d_m) = d²_m, which is distributed as χ²_v, and hence E(φ) = v. For the non-descending w function defined in (3.6) (remember φ = w²), Huber

(1977a, p.183) gives the equation for the correction factor to standardize

the estimates so that they have the correct asymptotic values for the

Gaussian distribution. As his Table 1 shows, the robust estimate of

covariance is within 2% of the usual estimate for v between 4 and 10.

Table 3.2(a) gives a stem-and-leaf plot of the ratio of robust to usual

estimates of variance for the Thais data (v = 7, g = 10), while Table 3.3

gives results for the scorpion data (v = 9, g = 19). Table 3.2 also

gives the ratios for ten sets of generated multivariate Gaussian data;


each set consists of six groups with sample size and underlying means

and covariances corresponding to the first six Thais groups (hence

v = 7 and g = 60). The actual Thais data show good agreement with a

multivariate Gaussian form. It seems reasonable to conclude from

Tables 3.2 and 3.3 that the robust estimates of the variances are

generally within 2% of the usual estimates for well-behaved data.

For the generated data, three observations have low weight; and this is

reflected in the corresponding ratios of variances, all of which are

less than 0.97. For the actual data sets, a low ratio for a variable

is always associated with an atypical value for one (or more) of the

observations having low weight w_m.

A recommended approach is to determine the means, covariances, distances and associated weights for b0 = ∞; for b0 = 2.0, b2 = ∞;

and for b0 = 2.0, b2 = 1.25. Gaussian probability plots of cube root

of squared distances (with b0 = 2.0, b2 = 1.25) together with the

magnitude of non-unit weights will indicate atypical observations.

For more than six or seven variables, a robust principal components

analysis is also useful for identifying atypical observations.
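The weighting scheme underlying this recommendation can be sketched as follows. This is an illustrative reading of the Chapter Three estimators, assuming the weight function of (3.6)/(3.7) with cut-off d0 = sqrt(v) + b0/sqrt(2); the function names and the simple fixed-iteration loop are mine, not the thesis's:

```python
import numpy as np

def weight(d, v, b0=2.0, b2=1.25):
    """Weight as a function of Mahalanobis distance d: unity up to
    d0 = sqrt(v) + b0/sqrt(2), then a redescending Gaussian taper
    (b2 finite) or the non-descending form d0/d (b2 infinite)."""
    d0 = np.sqrt(v) + b0 / np.sqrt(2.0)
    d = np.maximum(np.asarray(d, dtype=float), 1e-12)
    if np.isfinite(b2):
        tail = d0 * np.exp(-0.5 * (d - d0) ** 2 / b2 ** 2)
    else:
        tail = d0
    return np.where(d <= d0, 1.0, tail / d)

def robust_mean_cov(X, b0=2.0, b2=1.25, iters=20):
    """Iteratively reweighted M-estimates of mean vector and covariance
    matrix: weights applied once to the mean, squared to the covariance."""
    n, v = X.shape
    mu, S = X.mean(axis=0), np.cov(X, rowvar=False)
    w = np.ones(n)
    for _ in range(iters):
        R = X - mu
        d = np.sqrt(np.einsum('ij,jk,ik->i', R, np.linalg.inv(S), R))
        w = weight(d, v, b0, b2)
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        R = X - mu
        S = (w[:, None] ** 2 * R).T @ R / (np.sum(w ** 2) - 1.0)
    return mu, S, w

# illustrative data with one gross error planted in the first row
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 4))
X[0] = 12.0
mu, S, w = robust_mean_cov(X)   # w[0] is driven towards zero
```

The planted error receives a near-zero weight while typical observations keep weights near unity, which is the behaviour exploited by the probability plots of robust distances.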

The Mahalanobis squared distances are usually plotted against the

quantiles of a gamma distribution with shape parameter v/2. The results

of this study show that the Wilson-Hilferty cube-root transformation

behaves well on the d²_m. There is good agreement between Q-Q plots of d²_m versus gamma (v/2, 2) quantiles and of d_m^{2/3} versus N(0,1) quantiles, though the cube root transformation tends to lessen the visual impact of the large distances

on the gamma plot. The general conclusion here reinforces the remarks

of Healy (1968, p.159) that the normal or Gaussian plot seems more

than adequate for detecting atypical observations and examining the

multivariate Gaussian assumption.
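The adequacy of the cube-root route can be checked directly: if d²_m is χ²_v, then (d²_m/v)^{1/3} is approximately Gaussian with mean 1 - 2/(9v) and variance 2/(9v) (the Wilson-Hilferty approximation). A quick numerical check, with parameter choices mine:

```python
import numpy as np
from scipy import stats

def wilson_hilferty_quantile(p, v):
    """Approximate chi-squared(v) quantile via the Wilson-Hilferty
    cube-root transformation: (X/v)**(1/3) ~ N(1 - 2/(9v), 2/(9v))."""
    z = stats.norm.ppf(p)
    return v * (1.0 - 2.0 / (9.0 * v) + z * np.sqrt(2.0 / (9.0 * v))) ** 3

v = 7                                      # e.g. the seven Thais variables
p = np.array([0.5, 0.9, 0.99, 0.999])
exact = stats.chi2.ppf(p, v)
approx = wilson_hilferty_quantile(p, v)
rel_err = np.abs(approx - exact) / exact   # below 1% even at the 0.1% point
```

The close agreement across the upper tail is why the Gaussian plot of cube-root distances behaves so similarly to the gamma plot of squared distances.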


Table 3.2

Stem-and-leaf plot for ratios of robust to usual variances for (a) Thais data (v = 7, g = 10) and (b) generated multivariate Gaussian data with the same underlying structure as the Thais data (v = 7, g = 60).

        (a) actual                          (b) generated
1.05    1
1.04    4
1.03    1, 1, 3, 4, 5, 5
1.02    0, 0, 1, 2, 2, 3, 5, 5, 5, 6, 8, 9, 9
1.01    0, 3, 4, 7, 8, 9                    1, 2, 2, 3, 3
1.00    3, 5, 8                             0(243), 1(19), 2(9), 3, 3, 4, 4, 5, 6, 6, 6, 7
0.99    3, 4, 5, 6, 6, 6, 8                 0(9), 1(6), 2, 3(10), 4(7), 5(5), 6(7), 7(7), 8(13), 9(19)
0.98    0, 0, 0, 2, 3, 3, 5, 6, 8           1, 1, 2, 2, 4, 5, 5, 6, 6, 6, 8, 8, 8, 9, 9, 9, 9, 9, 9
0.97    3                                   3, 4, 4, 7, 8, 8
0.96    3, 4, 4, 5, 7, 9                    1, 2, 2, 2, 7, 9

values below 0.96 (both sets): 0.957, 0.957, 0.954, 0.952, 0.947, 0.938, 0.937, 0.936, 0.936, 0.934, 0.934, 0.933, 0.929, 0.928, 0.927, 0.923, 0.922, 0.912, 0.912, 0.911, 0.910, 0.89, 0.89, 0.88, 0.88, 0.87, 0.87, 0.86, 0.85, 0.84, 0.82, 0.81, 0.78, 0.59


Table 3.3

Stem-and-leaf plot for ratios of robust to usual variances for scorpion data (v = 9, g = 19)

1.02    0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 4
1.01    3, 4, 6, 8, 9, 9, 9
1.00    0(83), 3(10), 7, 8
0.99    1, 1, 4, 6, 7
0.98    0, 0, 1, 2, 3, 3, 6
0.97    1, 1, 4, 7, 7
0.96    0, 1, 4, 5
0.95    5
0.94    1, 2, 3, 4, 9
0.93    0, 9
0.92    2, 4, 8, 9
0.91    4, 8
0.90    0, 2

values below 0.90: 0.89, 0.89, 0.89, 0.88, 0.88, 0.87, 0.86, 0.83, 0.82, 0.81, 0.80, 0.80, 0.80, 0.77, 0.73, 0.71, 0.68, 0.60, 0.58, 0.56


A robust principal components analysis provides a useful complement

to robust covariance estimation when there are more than a few variables.

The eigenvectors usually provide contrasts between subsets of variables

(after the first), and hence observations will have low weight only for

those principal components with large loadings for the variable(s) with

the atypical values. This is well-illustrated by the scorpion data:

an observation was not indicated as atypical by the Q-Q plots and non-unit weights,

whereas the robust principal components analysis indicated the observation

and gave some guide as to the variable involved.

The procedures discussed in this Chapter and in Chapter Four on

robust between-groups procedures can be readily used for routine

screening of data. With the increasing trend to direct entry of data

at or near collection time, automatic application of robust estimation

procedures will indicate possibly atypical observations; the measure-

ments can then be checked while the individuals or specimens are still

available. The computer costs involved are minimal when compared with

the costs of the experimental or survey work involved in collecting

the data and the time spent reanalyzing data following detection of

errors at a later stage of analysis.


CHAPTER FOUR: ROBUST CANONICAL VARIATE ANALYSIS

In this Chapter, robust procedures for canonical variate analysis

are developed. Robust M-estimation is applied to the scores for each

canonical variate in Section 4.1 to determine appropriate weights to

define robust estimates of the between- and within-groups SSQPR

matrices. Robust M-estimation for canonical variate analysis, based

on the functional relationship formulation outlined in Section 1.3,

is developed in Section 4.2. The weights are shown to depend on the

distance of an observation from the canonical variate mean for the

group. For uncontaminated data, the robust M-estimation procedure

performs similarly to the usual canonical variate analysis. Two

typical data sets are examined in Section 4.3. The ordinary canonical

vectors are little affected by the presence of atypical observations,

though the canonical roots are considerably influenced. Some general

conclusions are given in Section 4.4.

4.1 M-estimation of the Canonical Variate Scores

When different groups of data are available, as in multivariate

discrimination studies, the robust procedures described in Chapter

Three can be applied to each group separately. This will provide

robust estimates of means and of covariance matrices for canonical

variate analysis; possibly atypical observations are also identified.

To be more specific, let x̄_k and V_k denote robust M-estimates of the vector of means and of the covariance matrix for the kth group, determined as in (3.4) and (3.5) with the group subscript added, and let w_km denote the associated weights. The pooled within-groups SSQPR matrix W_c and the between-groups SSQPR matrix B_c are formed in


an obvious way, analogous to (4.2) and (4.1) below, using the weights w_km. The canonical roots and canonical vectors of W_c^{-1} B_c are then found, with the vectors scaled as in (iv) in Section 4.2 below.

With this approach, an observation is weighted according to its total distance d_km = {(x_km - x̄_k)^T V_k^{-1} (x_km - x̄_k)}^{1/2}. For similar

Vk, this is essentially similar to the overall Mahalanobis distance

based on Wc. But this latter distance can be partitioned into a

component along the canonical variate plane, and a component orthogonal

to it. As such, the approach may be relatively insensitive to

observations atypical for one linear combination, but typical for the

rest.

Now canonical variate analysis can be considered as a one way

analysis of variance for a linear combination ykm = cTxkm of the

original variables. The procedure for robust canonical variate analysis

proposed in this Section is to apply robust M-estimation to the

canonical variate scores ykm for each group. For robust regression,

and hence one-way analysis of variance, the appropriate measure of

deviation tkm on which to base the weight function is the residual

(Huber, 1977b). In the context of one-way analysis of variance for

canonical variate analysis, the residuals for a given c are

c^T x_km - c^T x̄_k. The influence function for Mahalanobis D² in Chapter Two

is a quadratic function of this deviation score.

The procedure is as follows:

(i) Take as an initial estimate of c either the usual canonical

vector or that resulting from the usual analysis based on M-estimates

k and Vk for each group.

(ii) Form the canonical variate scores ykm for each group, and

determine the weights wkm associated with M-estimation of the mean.

The variance is either set equal to 1 or is estimated simultaneously


(see below).

(iia) After the first iteration, take the weights wkm as the

minimum of the weights for the current and previous iterations.

(iii) Calculate

n*_k = Σ_{m=1}^{n_k} w_km ,

x̄_k = Σ_{m=1}^{n_k} w_km x_km / n*_k ,

and

V_k = Σ_{m=1}^{n_k} w²_km (x_km - x̄_k)(x_km - x̄_k)^T / (Σ_{m=1}^{n_k} w²_km - 1) .

Calculate

x̄_T = Σ_{k=1}^{g} n*_k x̄_k / Σ_{k=1}^{g} n*_k ,

and form

B_w = Σ_{k=1}^{g} n*_k (x̄_k - x̄_T)(x̄_k - x̄_T)^T      (4.1)

and

W_w = Σ_{k=1}^{g} (Σ_{m=1}^{n_k} w²_km - 1) V_k .      (4.2)

(iv) Determine the first canonical root f and canonical vector c from (B_w - f W_w)c = 0. The vector c is scaled such that

c^T W_w c = Σ_{k=1}^{g} (Σ_{m=1}^{n_k} w²_km - 1) .


with

y*_k = Σ_{ℓ=1}^{n_k} w_kℓ y_kℓ / Σ_{ℓ=1}^{n_k} w_kℓ

or

var*(y) = Σ_{ℓ=1}^{n_k} w²_kℓ (y_kℓ - y*_k)² / (Σ_{ℓ=1}^{n_k} w²_kℓ - 1) .

(v) Repeat steps (ii) to (iv) until successive estimates of the

canonical root are sufficiently close.

The procedure for robust M-estimation in (ii) follows that in

Section 3.2. The non-descending form of w in (3.6) is chosen initially:

w(t_km) = t_km     if t_km ≤ b_1

        = b_1      if t_km > b_1 ,

where

t²_km = (y_km - y*_k)² / var*(y_km) ,

y*_k = Σ_{ℓ=1}^{n_k} w_kℓ y_kℓ / Σ_{ℓ=1}^{n_k} w_kℓ ,

and

var*(y_km) = 1 .

Here

w_km = w(t_km)/t_km .

The constant b_1 is taken here as b_1 = 1 + b_0/2^{1/2}; I have usually taken b_0 = 2.25.


Once convergence of the y*_k is achieved, the redescending form of w in (3.7) is introduced for up to ten more iterations. This form is

w(t_km) = t_km                                     if t_km ≤ b_1

        = b_1 exp{-0.5 (t_km - b_1)²/b²_2}         if t_km > b_1 ;

from the results in Chapter Three and preliminary work here, I have taken b_2 to be 1.25.
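Steps (i) to (v) can be sketched as follows. This is an illustrative reading rather than the procedure's original implementation: the names are mine, var*(y_km) is taken as 1, and the redescending form is used from the outset instead of after a non-descending phase:

```python
import numpy as np
from scipy.linalg import eigh

def robust_cva_scores(groups, b0=2.25, b2=1.25, iters=25):
    """First canonical root/vector with M-estimation applied to the
    canonical variate scores of each group (a Section 4.1 sketch)."""
    b1 = 1.0 + b0 / np.sqrt(2.0)

    def weight_of_residual(t):
        t = np.maximum(np.abs(t), 1e-12)
        psi = np.where(t <= b1, t, b1 * np.exp(-0.5 * (t - b1) ** 2 / b2 ** 2))
        return psi / t

    wts = [np.ones(len(X)) for X in groups]
    f1, f_old = 0.0, np.inf
    for _ in range(iters):
        # (iii) weighted group means, between-groups B_w, pooled W_w
        nks = [w.sum() for w in wts]
        means = [(w[:, None] * X).sum(0) / w.sum() for X, w in zip(groups, wts)]
        xT = sum(nk * m for nk, m in zip(nks, means)) / sum(nks)
        Bw = sum(nk * np.outer(m - xT, m - xT) for nk, m in zip(nks, means))
        Ww = sum(((w[:, None] ** 2) * (X - m)).T @ (X - m)
                 for X, m, w in zip(groups, means, wts))
        # (iv) first canonical root and vector of (Bw - f Ww)c = 0
        fvals, C = eigh(Bw, Ww)
        f1, c = fvals[-1], C[:, -1]
        c = c * np.sqrt(sum(np.sum(w ** 2) - 1.0 for w in wts))  # c'Ww c scaling
        if abs(f1 - f_old) < 1e-6:
            break
        f_old = f1
        # (ii)/(iia) reweight scores; keep the minimum of old and new weights
        wts = [np.minimum(w, weight_of_residual(X @ c - (w * (X @ c)).sum() / w.sum()))
               for X, w in zip(groups, wts)]
    return f1, c, wts

# illustrative data: three groups separated along the first variable
rng = np.random.default_rng(4)
groups = [rng.standard_normal((50, 3)) + np.array([m, 0.0, 0.0])
          for m in (0.0, 3.0, 6.0)]
f1, c, wts = robust_cva_scores(groups)
```

For clean data the weights stay near unity and the result is essentially the usual canonical variate solution, in line with the uncontaminated-data behaviour reported below.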

The choice var*(y_km) = 1 for determining the weights w_km, for

a given c, effectively coincides with M-estimation for the one-way

analysis of variance, since c is scaled to have unit within-groups

variance. The choice of the robust variance for a group as var*(ykm)

will determine for each group those observations which have atypical

canonical variate scores relative to those for the rest of the group.

When the canonical variate variances are similar for all groups, the

two choices will give similar results. For somewhat different variances,

the effect of using the robust estimate compared with using 1 will be

to place more emphasis on atypical observations for groups with small

variance and less on those from groups with large variance.

To determine successive canonical vectors c_i, 1 < i ≤ h, project the data onto the space orthogonal to that spanned by the previous canonical vectors c_1, ..., c_{i-1}, and repeat steps (ii) to (iv). Take as the initial estimate the second canonical vector from the last iteration for the previous vector. The proposed procedure is as follows.

(vi) Form the generalized projection operator

P = W_w C_{i-1} (C^T_{i-1} W_w C_{i-1})^{-1} C^T_{i-1}

and calculate x^i_km = (I - P) x_km; here C_{i-1} = (c_1, ..., c_{i-1}). In practice, it is only necessary to form P_{i-1} = W_w c_{i-1} c^T_{i-1} / (c^T_{i-1} W_w c_{i-1}) and calculate x^i_km = (I - P_{i-1}) x^{i-1}_km, since W_w is usually similar for each vector.

(vii) Repeat steps (ii) to (iv), with x^i_km replacing x_km, and determine c_i.

The pooled W_w will now be singular, with rank v - i + 1. This is readily incorporated into the numerical procedures used. If the eigenvalue/vector decomposition in Section 1.4 is used as the first-stage orthonormalization, the smallest i - 1 eigenvalues will be zero.

(viii) The canonical variate scores for successive directions are given by

c^T_i x^i_km = c^T_i (I - P) x_km

and hence

c̃_i = (I - P)^T c_i .
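The deflation in step (vi) can be sketched numerically, assuming the projector takes the form P = W_w c c^T / (c^T W_w c); this operator is idempotent and the deflated scores along c vanish. All matrices below are random stand-ins, not data from the thesis:

```python
import numpy as np

rng = np.random.default_rng(3)
v = 4
A = rng.standard_normal((v, v))
Ww = A @ A.T + v * np.eye(v)        # stand-in positive definite pooled SSQPR
c = rng.standard_normal(v)          # stand-in for the first canonical vector

# generalized projection operator for a single vector (step (vi))
P = Ww @ np.outer(c, c) / (c @ Ww @ c)

X = rng.standard_normal((10, v))    # rows play the role of the x_km
X_deflated = X @ (np.eye(v) - P).T  # x^2_km = (I - P) x_km

scores = X_deflated @ c             # scores along c after deflation
```

Because the scores along c are exactly zero after deflation, repeating steps (ii) to (iv) on the deflated data yields a direction orthogonal (in the W_w metric) to the first.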

4.2 Robust M-estimation of the Canonical Vectors

The derivation outlined in the previous Section is essentially a

distribution-free approach, in the spirit of the original Fisher-Rao

derivations in Section 1.2. In this Section, robust M-estimators for

the canonical vectors will be developed, based on the functional

relationship formulation in Section 1.3.

To do this, consider, as in Section 3.2, an elliptically symmetric

density of the form |Σ|^{-1/2} h(d_km), where


d²_km = (x_km - μ_k)^T Σ^{-1} (x_km - μ_k) .      (4.3)

Assume, as in the model (1.18) in Section 1.3, that the μ_k are specified by

μ_k = μ_0 + Σ Ψ γ_k ,      (4.4)

where Ψ is the v×p matrix of population canonical vectors.

The relevant part of the log likelihood is

- (n/2) log |Σ| + Σ_{k=1}^{g} Σ_{m=1}^{n_k} log h(d_km) ,

with d_km given by (4.3) and the μ_k by (4.4).

Write

u(d_km) = - h(d_km)^{-1} h'(d_km) d^{-1}_km ,

and define

n*_k = Σ_{m=1}^{n_k} u(d̂_km) ,

x̄_k = n*_k^{-1} Σ_{m=1}^{n_k} u(d̂_km) x_km ,      (4.5)

n* = Σ_{k=1}^{g} n*_k ,

and

x̄_T = n*^{-1} Σ_{k=1}^{g} n*_k x̄_k ,


where d̂_km is defined analogously to d_km in (4.3) with μ_k and Σ replaced

by their maximum likelihood estimators.

Write

P = Σ Ψ (Ψ^T Σ Ψ)^{-1} Ψ^T

as in (1.21), and follow the same derivation as in Section 1.3 to obtain

γ̂_k = (Ψ^T Σ Ψ)^{-1} Ψ^T (x̄_k - μ_0)

and

(I - P) μ̂_0 = (I - P) x̄_T .

With

S* = Σ_{k=1}^{g} Σ_{m=1}^{n_k} u(d̂_km) (x_km - x̄_k)(x_km - x̄_k)^T      (4.6)

and

B* = Σ_{k=1}^{g} n*_k (x̄_k - x̄_T)(x̄_k - x̄_T)^T ,

differentiation of the log likelihood, maximized for γ_k and μ_0, w.r.t. Σ and w.r.t. Ψ leads to equations as in (1.24) and (1.25), with S and B replaced by S* and B*.

Introduction of conditions analogous to (1.26) and subsequent simplification leads to the fundamental canonical variate equation

B* Ψ = V* Ψ n* F*_p ,      (4.7)

where

V* = n*^{-1} S* ,

with the scaling

Ψ^T V* Ψ = I      (4.8)

and

Ψ^T B* Ψ = n* F*_p ,

with F* a diagonal matrix.

Also, as in Section 1.3,

n Σ̂ = S* + B* - n*^{-1} B* Ψ F*_p^{-1} Ψ^T B* .      (4.9)

The log likelihood is maximized by taking Ψ to be the first p eigenvectors of V*^{-1} B* in (4.7).

As before,

μ̂_k = x̄_T + V* Ψ Ψ^T (x̄_k - x̄_T) .      (4.10)

The above derivation is for an elliptically symmetric density;

for the multivariate Gaussian density, u(d_km) = 1 and so the usual

canonical variate solution in Section 1.3 results. As in Section 3.2,

the above derivation also leads to the appropriate form for a robust

M-estimator solution, by associating the elliptical density with a

contaminated multivariate Gaussian density. The robust estimators will

again give full weight to observations assumed to come from the main

body of the data, but reduced weight or influence otherwise.

To define robust M-estimators of the canonical vectors, rewrite the equation for x̄_k in (4.5) as

x̄_k = Σ_{m=1}^{n_k} d̂^{-1}_km w(d̂_km) x_km / Σ_{m=1}^{n_k} d̂^{-1}_km w(d̂_km) ,

where

w(d̂_km) = d̂_km u(d̂_km) ,

and rewrite the equation for S* in (4.6) as

S* = Σ_{k=1}^{g} Σ_{m=1}^{n_k} d̂^{-2}_km φ(d̂_km) (x_km - x̄_k)(x_km - x̄_k)^T ,

where

φ(d̂_km) = d̂²_km u(d̂_km) .

For robust M-estimation, w and φ must be bounded. As in Section 3.2, take φ(d̂_km) = w²(d̂_km), with w defined in (3.6) or (3.7).

Hence the equations used here to define robust estimators of

means and of covariances for the robust canonical variate solution

are as follows:

n*_k = Σ_{m=1}^{n_k} w_km ,

x̄_k = n*_k^{-1} Σ_{m=1}^{n_k} w_km x_km ,

n* = Σ_{k=1}^{g} n*_k ,

x̄_T = n*^{-1} Σ_{k=1}^{g} n*_k x̄_k ,

S* = Σ_{k=1}^{g} Σ_{m=1}^{n_k} w²_km (x_km - x̄_k)(x_km - x̄_k)^T

and

B* = Σ_{k=1}^{g} n*_k (x̄_k - x̄_T)(x̄_k - x̄_T)^T ,

where

w_km = w(d*_km)/d*_km

and

d*²_km = (x_km - μ̂_k)^T Σ̂^{-1} (x_km - μ̂_k) ,      (4.11)

with μ̂_k and Σ̂ given by (4.10) and (4.9), with Ψ given by (4.7), and the scaling for Ψ given by (4.8).

To ensure agreement with the usual unbiased estimator when all weights are unity, the definition of V* as n*^{-1} S* is replaced by

V* = S* / Σ_{k=1}^{g} (Σ_{m=1}^{n_k} w²_km - 1) ,

and a similar divisor is applied to the equation for Σ̂ in (4.9).

Consider now the distance d*_km in (4.11); the superscript is dropped for convenience. For two groups, B = n_1 n_2 n_T^{-1} d_x d_x^T, where n_T = n_1 + n_2 and d_x = x̄_1 - x̄_2. Then Ψ = V^{-1} d_x / (d_x^T V^{-1} d_x)^{1/2} = D^{-1} V^{-1} d_x and f = n*^{-1} n_1 n_2 n_T^{-1} D². Hence μ̂_k = x̄_k and Σ̂ = V, so that d²_km = (x_km - x̄_k)^T V^{-1} (x_km - x̄_k). This is the usual Mahalanobis distance for the pooled matrix V.

Now let C be the v×v matrix of vectors from the eigenanalysis of V^{-1}B and F the diagonal matrix of canonical roots. Partition C and F as in (1.32) in Section 1.3, and remember Ψ = C_p. From (4.8) and (4.9),

C^T Σ̂ C = ( I    0       )
           ( 0    I + F_q )

and so

Σ̂^{-1} = C_p C_p^T + C_q (I + F_q)^{-1} C_q^T .

Since C_p^T μ̂_k = C_p^T x̄_k and C_q^T μ̂_k = C_q^T x̄_T,

d²_km = (x_km - x̄_k)^T C_p C_p^T (x_km - x̄_k) + (x_km - x̄_T)^T C_q (I + F_q)^{-1} C_q^T (x_km - x̄_T) .

This is illustrated geometrically in Figure 4.1. The representation is in the space of the orthonormal variables after the first-stage analysis. Provided the canonical roots f_qi are small, this distance is essentially that from the observation to the projection of the group mean in the canonical variate plane.
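The two-part form of the squared distance can be verified numerically. The sketch below builds a small two-variable, three-group example (data and names mine), relying on scipy's generalized symmetric eigensolver, which returns C scaled so that C^T V C = I:

```python
import numpy as np
from scipy.linalg import eigh

# illustrative two-variable, three-group data
rng = np.random.default_rng(2)
mus = [np.array([0.0, 0.0]), np.array([2.0, 0.5]), np.array([4.0, 1.0])]
Xs = [rng.standard_normal((40, 2)) + mu for mu in mus]

means = [X.mean(axis=0) for X in Xs]
n = sum(len(X) for X in Xs)
xT = sum(len(X) * m for X, m in zip(Xs, means)) / n
V = sum((X - m).T @ (X - m) for X, m in zip(Xs, means)) / (n - 3)  # pooled
B = sum(len(X) * np.outer(m - xT, m - xT) for X, m in zip(Xs, means))

F, C = eigh(B, V)                  # eigenanalysis of V^{-1}B with C'VC = I
order = np.argsort(F)[::-1]        # canonical roots in decreasing order
F, C = F[order], C[:, order]
p = 1                              # retain one canonical variate
Cp, Cq, Fq = C[:, :p], C[:, p:], F[p:]

# squared distance: within-plane part plus the scaled orthogonal part
x, xbar_k = Xs[0][0], means[0]
u = Cp.T @ (x - xbar_k)
r = Cq.T @ (x - xT)
d2 = float(u @ u + r @ (r / (1.0 + Fq)))
```

When all v directions are retained the identity C C^T = V^{-1} recovers the usual pooled-matrix Mahalanobis distance, so the partition simply separates the retained canonical variates from the discarded ones.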

It remains to specify the constants b0 and b2 for the influence

functions w in (3.6) and (3.7). For the robust estimation of covariance

in Chapter Three, a value of b0 = 2.0 was adopted. Examination of

computer-generated data shows that the robust estimates of variance

are within 2% of the usual estimates for an underlying multivariate

Gaussian distribution. A similar approach is adopted here. Two group

configurations, representative of typical data sets to be examined in

Section 4.3, are used. The first is a four variable-five group data

set with group means and pooled covariance matrix taken from a study

of the whelk Dicathais by Phillips, Campbell and Wilson (1973). The

second is a seven variable-six group data set taken from a study of the

whelk Thais lamellosa by C. Campbell (1978). The means, standard

deviations and correlations for each set are given in Table 4.1.

For both the Dicathais and Thais data, five data sets with a

sample size of 50 for each group were generated from an underlying

multivariate Gaussian distribution. Two further sets were generated

corresponding to the Thais data with sample size 200 in each group.

The procedure used to generate the data is outlined below. The robust

M-estimation canonical variate analysis described above was carried out

Figure 4.1 - Representation of the components of the squared distance d²_km for three groups and two variables. The vectors c_p and c_q are arbitrarily centred at x̄_T. The horizontal component is the squared distance of x_km from x̄_k in the canonical variate plane. The vertical component is the squared distance of x_km above the canonical variate plane, scaled by the corresponding canonical root as (1 + f_q)^{-1/2}.


with b0 in the range 2.0 to 3.0, with the non-descending w in (3.6),
until successive estimates of the canonical roots were within 10^-3
of the previous estimates, and then with b2 = 1.25 for the redescending
w in (3.7) until convergence.

The results from these analyses with b0 = 2.25 are given in

Table 4.2. Only three of the twelve runs show an increase in the

sample canonical root of more than 1% when robust estimation is intro-

duced. The redescending function in (3.7) with b2 = 1.25 shows little

increase in the canonical roots over those calculated with the non-

descending function in (3.6). For the Dicathais-based data, only

three weights are less than 0.75, and none is less than 0.50 (out of

5x5x50 observations). For the Thais-based data with nk = 50, only three

weights are less than 0.75 (out of 5x6x50); one of these is less than

0.50 (0.35). For the Thais-based data with nk = 200, only four weights

are less than 0.75 (out of 2x6x200) and none is less than 0.50. When

b0 is taken as 2.0, the ratio of canonical roots is increased by less

than 1%, while the weights are reduced by approximately 0.11. On the

basis of these results for the generated data, it seems reasonable to

conclude that with b0 = 2.25 and b2 = 1.25, a weight of less than 0.30

will be associated with an atypical observation; probably observations

with weights between 0.30 and 0.50 warrant closer examination.
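The forms of w in (3.6) and (3.7) are given in Chapter Three and are not repeated here. Purely as an illustrative sketch, the following implements one standard pair of choices from the robust M-estimation literature: a non-descending Huber-type weight and a redescending variant with a Gaussian taper, using a cutoff d0 = sqrt(v) + b0/sqrt(2). The cutoff and the exact functional forms are assumptions for this sketch, not necessarily those of (3.6)-(3.7).

```python
import math

def weight(d, v, b0=2.25, b2=None):
    """Weight for an observation at Mahalanobis-type distance d in v dimensions.

    b2=None gives a non-descending (Huber-type) weight; a finite b2 gives a
    redescending weight. The cutoff d0 below is an assumption for this sketch.
    """
    d0 = math.sqrt(v) + b0 / math.sqrt(2.0)
    if d <= d0:
        return 1.0            # typical observations keep full weight
    if b2 is None:
        return d0 / d         # non-descending: weight decays like 1/d
    # redescending: an extra Gaussian taper drives the weight towards zero
    return (d0 / d) * math.exp(-0.5 * ((d - d0) / b2) ** 2)
```

With b0 = 2.25 and b2 = 1.25, a distance well beyond d0 receives a weight near zero under the redescending form but only a moderately reduced weight under the non-descending form, matching the qualitative behaviour described above.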

The multivariate Gaussian vectors were produced by orthonormal

rotation of vectors of independent Gaussian random numbers. The latter

were generated by the polar method of Marsaglia and Bray (1964) (see

Atkinson and Pearce, 1976, Section 5.1.6, for further details). If

z ~ N_v(0, I) and if V = U E U^T is an eigendecomposition of V, then
x = x̄ + U E^(1/2) z ~ N_v(x̄, V). According to timings given in Barr and

Slezak (1972), this is not the most efficient method; orthonormalization

based on the Choleski triangular decomposition given in Section 1.5 is


Table 4.1 Underlying means, standard deviations and correlations for the data generated in Section 4.2.

(a) Dicathais

      Grp 1   Grp 2   Grp 3   Grp 4   Grp 5     v1       v2      v3      v4
v1    39.36   33.39   44.02   33.34   55.94    9.391a   0.967   0.984   0.975
v2    16.10   11.99   14.91   13.34   25.00             4.077   0.916   0.911
v3    28.04   25.58   33.51   24.92   38.93                     6.664   0.987
v4    12.81   12.02   17.46   13.02   20.84                             3.374

(b) Thais

      Grp 1   Grp 2   Grp 3   Grp 4   Grp 5   Grp 6    v1      v2     v3     v4     v5     v6     v7
v1    346.3   320.6   410.3   549.3   263.8   341.9   88.d1   0.990  0.983  0.837  0.938  0.971  0.969
v2    276.4   269.3   344.8   428.7   191.5   254.5           66.1   0.991  0.848  0.945  0.977  0.975
v3    218.1   210.8   280.5   343.2   149.9   195.9                  52.2   0.845  0.952  0.976  0.967
v4     62.1    44.2    73.3    87.1    49.0    63.1                         13.6   0.803  0.853  0.845
v5     65.9    75.1    97.4   119.3    51.6    66.9                                16.8   0.935  0.923
v6    193.5   193.1   245.1   301.8   139.1   176.5                                       47.1   0.985
v7    158.2   163.8   200.2   247.4   117.6   150.1                                              39.2

a standard deviations on diagonal, correlations off diagonal


Table 4.2 Summary of robust M-estimation canonical variate analyses of generated data. The constant b0 for the non-descending function w is 2.25.

(a) Ratio of M-estimation canonical roots to usual canonical roots, for each of the runs for the first two roots (f1 and f2). The first line is for the non-descending w, and the second is for the re-descending w with b2 = 1.25.

(i) Dicathais-based data, nk = 50

f1: 1.014 1.019 1.000 1.004 1.006    f2: 1.007 1.007 1.008 1.003 1.005
    1.017 1.027 1.000 1.004 1.007        1.010 1.010 1.011 1.004 1.006

(ii) Thais-based data, nk = 50

f1: 1.002 1.007 1.002 1.001 1.008    f2: 1.005 1.002 1.003 1.001 1.000
    1.002 1.010 1.002 1.002 1.011        1.007 1.003 1.004 1.001 1.000

(iii) Thais-based data, nk = 200

f1: 1.001 1.003    f2: 1.002 1.001
    1.001 1.003        1.003 1.001

(b) Non-unit weights for each of the runs

(i) Dicathais-based data, nk = 50, v = 4, g = 5

Grp    run 1       run 2            run 3   run 4   run 5
       0.78,0.96   0.85,0.90,0.93   0.95    0.55
       0.96        0.98,0.99        0.93    0.61    0.81,0.84

(ii) Thais-based data, nk = 50, v = 7, g = 6

Grp 1: 0.35,0.96   0.80   0.85
Grp 2: 0.95
Grp 3: 0.90   0.93
Grp 4: 0.86   0.67
Grp 5: 0.86   0.96   0.87
Grp 6: 0.91   0.97   0.98   0.67,0.97

(iii) Thais-based data, nk = 200, v = 7, g = 6 (only two runs)

Grp 1: 0.82, 0.84, 0.89, 0.91, 0.93    0.99
Grp 2: 0.70
Grp 3: 0.79, 0.88, 0.99
Grp 4: 0.88, 0.89, 0.95, 0.97          0.74, 0.97
Grp 5: 0.66, 0.85, 0.94, 0.97
Grp 6: 0.71, 0.81


faster. However, I have found the relative magnitude of the smallest

eigenvalue to be a more sensitive indicator of near-singularity than

that of the smallest diagonal element of the triangular matrix. For

this reason, I have adopted the eigenrotation for occasional generation

of multivariate data, so that simple monitoring of the nature of the

within-groups variation is provided. For a large-scale study, the

more efficient routine should be implemented.
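The generation scheme just described can be sketched as follows (a minimal illustration, not the thesis code; the function names and the near-singularity tolerance are assumptions): draw z ~ N_v(0, I), form the eigendecomposition V = U E U^T, and set x = x̄ + U E^(1/2) z, with the Cholesky route shown for comparison as the faster alternative.

```python
import numpy as np

def mvn_via_eigen(mean, cov, n, rng):
    """Generate n draws from N_v(mean, cov) via eigenrotation."""
    evals, U = np.linalg.eigh(cov)       # V = U E U^T, evals ascending
    # relative magnitude of the smallest eigenvalue monitors near-singularity
    if evals[0] / evals[-1] < 1e-10:
        raise ValueError("covariance matrix is nearly singular")
    z = rng.standard_normal((n, len(mean)))          # z ~ N_v(0, I)
    return mean + (z * np.sqrt(evals)) @ U.T         # x = mean + U E^(1/2) z

def mvn_via_cholesky(mean, cov, n, rng):
    """Faster alternative: cov = L L^T, x = mean + L z."""
    L = np.linalg.cholesky(cov)
    z = rng.standard_normal((n, len(mean)))
    return mean + z @ L.T
```

Either routine reproduces the target covariance; the eigenrotation simply yields the eigenvalues as a by-product for monitoring the within-groups variation.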

4.3 Some Practical Examples

The first data set to be examined is the Dicathais study by Phillips,

Campbell and Wilson (1973). Fourteen groups were collected around the

coast of Australia and New Zealand. Four variables were measured on

each animal. Group sizes are 102, 101, 75, 69, 29, 48, 32, 83, 88, 44,

34, 33, 82 and 61. Table 4.3 summarizes the canonical roots and vectors

for the first two canonical variates for the usual analysis and for the

various robust procedures described in Sections 4.1 and 4.2. The robust

M-estimates of means and of covariances were calculated as described in

Chapter Three, with b0 = 2.0 and b2 = 1.25. Figure 4.2 shows a plot of

the group means for the first two canonical variates.

The relative positions of the main clusters of groups are little

changed by the introduction of the robust procedures, though there are

some variations in the large cluster with low canonical variate scores.

In particular, the positions of groups 5 and 6 relative to groups 1, 3,

4 and 10 and to groups 2 and 7 have been altered under robust

M-estimation of the canonical vectors.

The estimates of the canonical vectors are little changed for the

various approaches. However, there is a marked change in the canonical

roots. For the robust M-estimates, the increase is 29% and 15%. The


Table 4.3 Canonical roots and vectors for Dicathais data.

Coefficients for standardized variables are given in brackets.

                                        c-vectors                   c-root    c-vectors                   c-root

usual c.v. analysis                     -0.50  0.47 -0.35  1.62     2.13      0.48 -0.99 -0.31  0.43     1.68
                                       (-4.8) (2.0)(-2.4) (5.6)              (4.6)(-4.3)(-2.1) (1.5)

c.v. analysis based on M-estimates      -0.67  0.78 -0.23  1.50     2.35      0.47 -0.98 -0.51  0.86     1.87
of means and covariances               (-6.3) (3.2)(-1.5) (5.1)              (4.4)(-4.0)(-3.4) (2.9)

robust M-estimation of canonical        -0.62  0.86 -0.31  1.46     2.75      0.39 -0.90 -0.54  1.04     1.94
vectors, b0=2.25, b2=1.25              (-5.6) (3.3)(-2.0) (4.8)              (3.5)(-3.5)(-3.5) (3.4)

robust M-estimation applied to c.v.     -0.55  0.61 -0.41  1.72     2.58      0.50 -1.09 -0.42  0.72     2.22
scores, b0=2.25, b2=1.25,              (-5.2) (2.5)(-2.7) (5.8)              (4.7)(-4.5)(-2.8) (2.4)
variance = 1

robust M-estimation applied to c.v.     -0.58  0.58 -0.34  1.68     2.36      0.51 -1.08 -0.39  0.59     2.06
scores, b0=2.25, b2=1.25,              (-5.5) (2.5)(-2.3) (5.8)              (5.0)(-4.6)(-2.6) (2.0)
robust variance


[Scatter plots of the group means on the first two canonical variates, with separate symbols for the usual c.v.a., the c.v.a. based on M-estimates of means and covariances, robust M-estimation of the c.v. scores, and robust M-estimates of the canonical vectors with non-descending and re-descending w.]

Figure 4.2 - Canonical variate means for Dicathais data


use of the redescending w results in an increase of 15% and 5% over

the non-descending w, whereas the change was less than 1% for the

generated data.

Table 4.4 gives the non-unit weights less than 0.35 for the various

robust procedures adopted. Ten of the last 15 individuals for group 6

are listed. Inspection of the data shows that 18 individuals (31-48)

have measurements larger than those recorded for any of the remaining

groups. The variance for the first canonical variate for group 6 for

the usual analysis is twice that of the other variances. The robust

M-estimation of the canonical vectors has downweighted the influence

of these larger observations. Robust M-estimation of the means and

covariances indicates only one observation as being possibly atypical.

However, on closer inspection the Q-Q Gaussian plot of cube root of

Mahalanobis distances does show some slight curvilinearity, being

approximately linear for the first 30+ observations, and then curving

upwards, though the effect is not pronounced. Apart from group 6,

there is generally good agreement as to observations indicated as being

possibly atypical by robust M-estimation of canonical vectors and of

means and covariances. Moreover, of 19 observations with low weight

for robust M-estimation of the canonical vectors, 14 also have low

weight for either the first or the second canonical vector from robust

M-estimation of the scores.
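The Q-Q comparison used above can be sketched as follows: for an underlying v-variate Gaussian distribution the squared Mahalanobis distance is approximately chi-squared on v degrees of freedom, so by the Wilson-Hilferty argument its cube root is approximately Gaussian, and the ordered cube roots are plotted against Gaussian quantiles; curvature at the upper end flags possibly atypical observations. A minimal sketch (the function name and plotting positions are mine):

```python
import numpy as np
from statistics import NormalDist

def cuberoot_qq(X):
    """Return (Gaussian quantiles, ordered cube roots of squared
    Mahalanobis distances) for the rows of X, one group at a time."""
    xbar = X.mean(axis=0)
    Vinv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - xbar
    d2 = np.einsum("ij,jk,ik->i", diff, Vinv, diff)   # squared distances
    y = np.sort(np.cbrt(d2))
    n = len(y)
    p = (np.arange(1, n + 1) - 0.5) / n               # plotting positions
    q = np.array([NormalDist().inv_cdf(pi) for pi in p])
    return q, y   # an approximately straight plot supports the Gaussian form
```

An approximately linear plot of y against q, as for the first 30 or so observations mentioned above, is consistent with a Gaussian form; upward curvature at the right-hand end points to observations that are unusually distant.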

The second data set to be examined is the Thais study by

C. Campbell (1978). Brief details and group sizes are given in Section

3.4. The results from applying robust M-estimation of means and

covariances are given in Section 3.4. The corrections noted in that

Section, namely changes of 50 or 100 units or of the order of two

numbers, have been made to the data set considered here. As Table 4.6

shows, only three observations then have weights less than 0.35.


Table 4.4 - Summary of non-unit weights less than 0.35 from robust analyses of Dicathais data. b0 = 2.25, b2 = 1.25 for robust canonical variate analyses. b0 = 2.0, b2 = 1.25 for robust covariance estimation.

grp  obs   robust M-est   M-est c.v. scores   M-est c.v. scores   M-est mean,   c-variate group
           c-vecs         (variance = 1)      (robust variance)   covariance    variance
                          cv1    cv2          cv1    cv2

1 96 36a 10 2,34,15b 40 17,31c 1.3 2.7 102 03 55 02 2,74,19 67 17 49

2,82,18

3 16 32 17 90 03 2,22,30 45 1.3 1.3 75 04 01 00 2,23,18 00

4 65 32 05 2,56,26 06 39 60,22 1.3 1.6 68,27

6 34 28 5.5 2.0 38 04 05 65 39 07 15 41 30 42 07 32 44 07 37 35 45 00 02 01 04 37 46 00 47 12 48 00 00 33 79

7 31 00 03 2,25,08 11 2,25,26 00 1.2 2.4

8 2,6,15 83,08 2.0 2.0 2,1,30 2,4,31

10 1,2,20 1,2,17 1.3 0.6 2,42,05

11 34 07 03 02 00 27,00 1.5 1.0 31,00

12 15 02 42 2,33,27 85 00 2.7 3.1

13 26 00 79 79 00 80,08 1.8 1.4 82,01

a observation 96 for group 1 has a weight of 0.36 for robust M-estimation of the canonical vectors, weights of 1.00 and 0.10 for M-estimation of the canonical variate scores for the first two vectors when a variance of 1 is adopted and weights of 1.00 and 0.40 when a robust variance is adopted, and a weight of 1.00 for robust M-estimation of means and covariances.

b observation 34 has a weight of 0.15 for M-estimation of the canonical variate scores for the second (2) canonical vector, and unit weight for robust M-estimation of the canonical vectors.

c observation 17 has a weight of 0.31 for M-estimation of the means and covariance matrix, and unit weight for robust M-estimation of the canonical vectors.


Table 4.5 Canonical roots and vectors for Thais data. Coefficients for standardized variables are given in brackets.

canonical vector I                                                            c-root

usual c.v. analysis                       -0.62  0.35  0.40 -0.85  0.35  0.14  0.24   3.585
                                         (-5.5) (2.4) (2.1)(-1.2) (0.6) (0.7) (0.9)

c.v. analysis based on M-estimates        -0.62  0.36  0.40 -0.85  0.34  0.14  0.23   3.615
of means and covariances                 (-5.5) (2.4) (2.1)(-1.2) (0.6) (0.7) (0.9)

robust M-estimation of canonical          -0.69  0.42  0.42 -0.87  0.31  0.16  0.24   4.189
vectors, b0=2.25, b2=1.25                (-5.2) (2.4) (1.8)(-1.0) (0.4) (0.7) (0.8)

robust M-estimation applied to c.v.       -0.69  0.42  0.45 -0.90  0.37  0.17  0.19   4.382
scores, b0=2.25, b2=1.25, variance = 1   (-6.0) (2.7) (2.3)(-1.2) (0.6) (0.8) (0.7)

robust M-estimation applied to c.v.       -0.65  0.40  0.43 -0.91  0.36  0.12  0.23   4.037
scores, b0=2.25, b2=1.25,                (-5.8) (2.6) (2.3)(-1.2) (0.6) (0.6) (0.9)
robust variance

canonical vector II                                                           c-root

usual c.v. analysis                       -0.09 -0.20  0.19  0.74  0.36  0.16 -0.07   1.209
                                         (-0.8)(-1.3) (1.0) (1.0) (0.6) (0.8)(-0.3)

c.v. analysis based on M-estimates        -0.10 -0.19  0.19  0.78  0.40  0.13 -0.07   1.237
of means and covariances                 (-0.9)(-1.2) (1.0) (1.0) (0.7) (0.6)(-0.3)

robust M-estimation of canonical          -0.15 -0.27  0.37  0.89  0.08  0.29 -0.10   1.631
vectors, b0=2.25, b2=1.25                (-1.1)(-1.5) (1.6) (1.0) (0.1) (1.1)(-0.3)

robust M-estimation applied to c.v.       -0.15 -0.18  0.24  0.94  0.00  0.31 -0.04   2.101
scores, b0=2.25, b2=1.25, variance = 1   (-1.3)(-1.2) (1.3) (1.2) (0.0) (1.4)(-0.1)

robust M-estimation applied to c.v.       -0.12 -0.18  0.15  0.77  0.31  0.20 -0.02   1.277
scores, b0=2.25, b2=1.25,                (-1.0)(-1.2) (0.8) (1.0) (0.5) (0.9)(-0.1)
robust variance


[Scatter plots of the group means, labelled 1-12, on the first two canonical variates.]

Figure 4.3 - Canonical variate means for Thais data.

(a) usual canonical variate analysis

(b) robust M-estimation of the canonical vectors, with b0 = 2.25, b2 = 1.25.


Table 4.6 - Summary of non-unit weights less than 0.35 from robust analyses of Thais data. b0 = 2.25, b2 = 1.25 for robust canonical variate analyses. b0 = 2.00, b2 = 1.25 for robust covariance estimation. See Table 4.4 for explanation of entries.

grp  obs   robust M-est   M-est c.v. scores   M-est c.v. scores   M-est mean,
           c-vecs         (variance = 1)      (robust variance)   covariance
                          cv1    cv2          cv1    cv2

1 50 05

4 52 42 1,45,17 1,45,22 20 1,64,07 1,64,11

5 23 40 70 1,5,23 2,1,29

2,3,16 2,32,35

6 23 01 00 1,20,37 24 00 21 00 2,22,01 25 40 02 26 00 07 00 87 27 00 60 03 28 01 02 29 01 00 30 08 00 31 00 19 00 32 02 00 33 00 00 34 00 00 35 00 00 36 00 73 00

7 22 11 2,1,03 2,1,28 32 02 81 2,3,03 2,3,2w

2,4,32

8 43 19 08 1,34,08 22 1,34,24 45 07

10 1,28,24

11 1,28,73 1,28,05 28,00

12 5 24 00


[Q-Q plots for group 6: panel (a) against Gaussian quantiles, panel (b) against Gamma(2,3.5) quantiles.]

Figure 4.4 - Q-Q plots of Thais data from group 6.

(a) Gaussian plot of cube root of squared Mahalanobis distance

(b) Gamma plot of squared Mahalanobis distance


Table 4.5 summarizes the canonical roots and vectors for the usual

and various robust canonical variate analyses. Figure 4.3 shows a plot

of the canonical variate group means. The two plots are very similar,

except for the marked change in the position of group 6 for the second

canonical variate. The largest 14 individuals for this group have low

weights for the robust M-estimation and for the second canonical variate

for M-estimation of the scores with unit variance (Table 4.6). None of

these has low weight for the robust covariance estimation. Inspection

of the data shows that individuals 22-36 are larger than any collected

from the other groups. Q-Q plots of Mahalanobis distances in Figure 4.4

show marked lack of linearity.

There is generally good agreement between the weights from the

robust M-estimation of the canonical vectors and from M-estimation of

the scores with variance unity. The canonical vectors are little changed

by the various robust procedures, though there is again a marked change

in the roots.

For both examples, there is an obvious explanation for the changes

resulting from the use of robust procedures. And in each case, the data

from the group with the large animals do not agree with a multivariate

Gaussian form, unlike those for the remaining groups, so that initial

examination of the data would indicate the need for caution.

4.4 Discussion

The conclusion to be drawn from the analyses reported here and

those of other data sets is that the canonical vectors are little

influenced by a small number of atypical observations, and hence the

assessment of the relative importance of the variables is little affected.

However, when the influence of the atypical observations is downweighted,


the canonical roots are increased, often by as much as 15%. Unless a

particular group contains a reasonable proportion of observations

which are downweighted, as in the Thais example, the pattern of the

canonical variate means is generally similar for the usual and robust

analyses.

The robust M-estimation of the canonical vectors indicates values

which are atypical in relation to the summary provided by the estimated

group means and covariance matrix in the chosen number of dimensions.

Observations may be genuinely atypical in some way. However, it is

important to ensure that the representation provided by the canonical

variates in the reduced number of dimensions considered is adequate for

the group or groups with atypical observations before interpreting the

results. It may well be that a group mean lies a considerable distance
above or below the canonical variate plane, so that the observations are
distant rather than atypical in the sense of wrong measurements or
aberrant values for particular variables. The use of robust M-estimation of the vectors

and of the scores will be informative here. For the Thais data, the

larger observations for group 6 have low weight for the M-estimation of

the vectors and for the second vector resulting from M-estimation applied

to the scores with unit variance adopted. The agreement for the

Dicathais data is less pronounced.

The question arises as to whether the robust procedures should be

used in preference to the usual analysis, and if so, which one(s) should

be used. For an occasional atypical observation(s) in some of the

groups, my preference is for the robust M-estimates of the vectors and

the vectors from M-estimation of the scores adopting unit variance. For

the examples considered, the choice is more complicated. The Dicathais

data show reasonable agreement with an underlying multivariate Gaussian

distribution, even for group 6. And the third canonical root is small,


indicating an adequate representation. It may well be that the animals

from that region grow to a larger size, and that techniques which allow

for different covariance matrices, such as those discussed in Chapter

Eight, are more appropriate for the analysis. The Thais data show

reasonable agreement with a multivariate Gaussian distribution, apart

from group 6 for which the Q-Q plots are somewhat distorted. The data

for group 6 would warrant closer examination. The larger animals do not

form part of the same population as the smaller animals, and the robust

approaches indicate that the larger animals are atypical when compared

with the remaining data. It seems reasonable to conclude that for

occasional atypical observations, the robust procedures provide a simple

and effective means of accommodating such observations in the analysis.

In addition, indication of subsets of atypical observations may lead to

further insight into the data.

Ahmed and Lachenbruch (1977) and Randles, Broffitt, Ramberg and

Hogg (1978) have considered the use of robust procedures for discriminant

analysis. The interest in both cases is in allocation rates. Ahmed and

Lachenbruch (1977) use an iterative trimming suggestion of Gnanadesikan

and Kettenring (1972), with either 5% or 10% trimming of observations

with the largest Mahalanobis distances, to estimate means and covariances,

and then calculate the discriminant function. They also use 15% or 25%

trimming of the discriminant scores as an alternative procedure, with

the discriminant vector recalculated from the remaining observations.

They show improved performance over the usual discriminant function

when contamination is present, with similar performance of the three

approaches for Gaussian data. To me, trimming is less appealing than

M-estimation. The latter reduces the influence of an observation only

if the observation is atypical, whereas trimming always reduces the

influence of some observations. Randles, Broffitt, Ramberg and Hogg


(1978) use M-estimation of means and covariances with b1 = 2.0,
b2 = ∞, and then calculate the discriminant function. As an alternative,

they suggest estimating the discriminant vector by maximising a function

of the between-to-within groups ratio of a linear combination of the

variables. To do this, they suggest choosing the function so that the

influence of observations whose scores are a "great distance from a

robust measure of the middle" is reduced. This suggestion has a basic

weakness: consider two groups which are well separated. Then all

observations will be a great distance from the cutoff point, and so

under their suggestion all observations will be markedly downweighted.


CHAPTER FIVE: GRAPHICAL COMPARISON OF COVARIANCE MATRICES

This Chapter develops procedures for comparing within-group

covariance matrices. The procedures are based on separate analyses

of the variances and of the correlations. The variances and correlations

are represented as two-way tables, with the columns representing groups.

Section 5.2.1 develops graphical procedures based on comparisons of

linear regressions, by considering a multiplicative columns-regression

model for the interaction of groups × variances and of groups ×
correlations. A multivariate comparison is considered in Section 5.2.2,

and this leads to the use of canonical variate analysis to display the

differences in covariance structure. Section 5.2.3 presents procedures

based on orthonormalization of the original variables. For equal

covariance matrices, correlations between suitably orthonormalized

variables should be zero, and variances should be unity. Section 5.3

applies the procedures to two sets of data, while Section 5.4 gives

further discussion of the various approaches.

5.1 Introduction

A fundamental assumption in canonical variate analysis is that of

equality of covariance matrices. The commonly-used procedure for

comparing covariance matrices for several groups is based on the

likelihood ratio criterion; this is the procedure presented in virtually

all recent multivariate texts (an exception is Gnanadesikan, 1977).

This criterion is known to be very sensitive to non-normality (Layard,

1972, 1974). A further drawback in applied studies is that no readily-

interpretable information is provided as to how the covariance matrices

differ.


There are at least two distinct reasons for studying differences

in covariance structure. One is to detect and identify differences in

covariance structure per se when these are of direct interest in the

particular field of application. A second reason is to be able to

relate the differences to the subsequent effect, if any, on the

ordination resulting from a canonical variate analysis.

5.2 Graphical Comparisons

Let V_k be the sample covariance matrix for the kth group,
k = 1,...,g, and write V_k in terms of its variances s_ki, i = 1,...,v,
and its correlations r_kj, j = 1,...,v(v-1)/2. The ordering of the j
is taken here to correspond with the order of entry in the upper
triangle of the correlation matrix, viz. (1,2),(1,3),...,(v-1,v).

The variances and correlations for each group provide one natural

summary of the scatter of the variables and of the linear relationships

between them, summarizing size and orientation aspects of the data.

They also provide a basic description in the sample space geometry of

multivariate analysis - the variance is represented by the squared

length of a vector and the correlation by the cosine of the angle

between two vectors. Moreover, the variances and the correlations can

be transformed so that the distributions of the transformed statistics

have second moments approximately independent of the corresponding

population parameter; the distributions more closely approximate to the

Gaussian form. For the variance, either the logarithmic transformation

(Bartlett and Kendall, 1946) or the cube root transformation (Wilson

and Hilferty, 1931) can be used. For the correlation coefficient, the

arctanh transformation of Fisher (1921) is used.
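These variance-stabilizing transformations can be written down directly. As a point of reference (standard results, not derived in this Section): for a sample variance on n-1 degrees of freedom, log s² has variance approximately 2/(n-1), and Fisher's z = arctanh r has variance approximately 1/(n-3), in each case roughly independent of the underlying parameter.

```python
import math

def log_variance(s2):
    """Bartlett and Kendall (1946): logarithm of the sample variance."""
    return math.log(s2)

def cuberoot_variance(s2):
    """Wilson and Hilferty (1931): cube root of the sample variance."""
    return s2 ** (1.0 / 3.0)

def fisher_z(r):
    """Fisher (1921): arctanh of the sample correlation,
    z = 0.5 * log((1 + r) / (1 - r))."""
    return math.atanh(r)
```

Applying one of the first two functions to each group variance and the third to each group correlation yields the transformed elements compared in the remainder of the Chapter.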


The comparisons of covariance matrices described in this Chapter

are based on separate comparisons of the variances and of the

correlations. For convenience, either the transformed variances or

the transformed correlations for the g groups will be denoted by t_ik,
with i = 1,...,q, where q = v or q = v(v-1)/2, and k = 1,...,g, and will

be referred to as elements.

The basis of the procedure to be described in detail in the next

Section is to consider the q elements for the g groups as two qxg

matrices. Each matrix is analogous to a two-way data matrix in

analysis of variance; the interest here is in the nature of the

interaction, if any, between the groups and elements.

Alternatively, each column can be written as a multivariate vector

t_k, k = 1,...,g. In a multivariate context, for the groups to have
similar covariance matrices, they must have similar profile vectors
t_k (see Morrison, 1976, Section 4.6, for a discussion of the latter).

A multivariate comparison is discussed in Section 5.2.3.

5.2.1 Individual-average plot

Consider now a qxg elements x groups two-way table. If there are

no differences in the covariance matrices across groups, this means

that the elements within each row will be essentially the same and that

the pattern of changes of the elements is similar from group to group.

In the context of analysis of variance, this is equivalent to specifying

that there is no group effect and that there is no interaction of

elements with groups.

Since the elements are themselves summary statistics, they have

associated variances and covariances. If these are assumed known (see

discussion of this below), then a weighted analysis of variance can be


formed, and the hypothesis of no interaction (and of no main effect)

in the two-way table can be examined by the usual F-test. However

this gives little insight into the nature of the interaction. To gain

more insight, a graphical approach is proposed which, when combined

with a formal statistical analysis, also provides the SSQs for

interaction and for groups in the analysis of variance table.

The basis of the procedure is the result that the hypothesis of

no interaction in the analysis of a two-way table can be expressed as

that of a linear regression, with slope unity, of the set of elements

for a particular group on the row means. The hypothesis of no group

effect corresponds to that of the same zero intercept for each group.

To see this, consider the familiar rows x columns model for the

analysis of variance, viz.

E(t_ik) = u + r_i + c_k + T_ik .     (5.1)

The expected row means are given by

E(t_i.) = u + r_i ,                  (5.2)

under the usual constraints that r_. = c_. = T_i. = T_.k = 0, where the
dot subscript denotes the average over that subscript. From (5.1) and (5.2),
if the interaction T_ik is null and the column (here group) effect c_k
is null, then E(t_ik) = E(t_i.). Equivalently, there is a linear
relationship between E(t_ik) and E(t_i.) with slope unity and intercept
zero.

Since a single linear regression with unit slope and zero intercept

is of ultimate interest, the proposed graphical procedure is to plot

the individual element values t_1k,...,t_qk against the row means


t_1.,...,t_q. for all g groups and see if there is reasonable agreement

with the null model of a single common linear relationship. The

formal statistical analysis consists of the fitting of linear

regressions for each group, and the comparison of the fitted slopes

and intercepts. The plot is here called the I-A (individual-average)

plot. The idea was proposed by Yates and Cochran (1938) in the

context of an examination of the interaction of varieties of barley

and the places at which they were grown.

The alternative specification of linear regressions with unspecified

intercept and slope is adopted because it usually provides a simple

description of the resulting I-A plot. The summary statistics of slope,

intercept, % variation explained, and residual mean square complement

the graphical interpretation. A scatter plot of mean level against

fitted slope will be referred to subsequently as an M-S plot. Moreover,

the calculations for the comparison of the fitted regressions for

equality of slope and thence of intercept give the complete analysis of

variance table.

Consider the alternative specification in more detail, namely a

linear relationship between E(t_ik - t_.k) = r_i + T_ik and E(t_i. - t_..) = r_i.

That is, consider the special structure

r_i + T_ik = S_k r_i .    (5.3)

Summation over k gives

g r_i = r_i Σ_{k=1}^{g} S_k ,

or

S̄ = 1 ,    (5.4)

where S̄ denotes the average of the S_k.


From (5.3),

T_ik = (S_k - 1) r_i ,

and so the model (5.1) then becomes

E(t_ik) = u + r_i + c_k + (S_k - 1) r_i
        = u + c_k + S_k r_i ,    (5.5)

with the S_k constrained as in (5.4). This specifies a linear relationship between E(t_ik) and E(t_i.), with slope S_k and intercept E(t_.k) - S_k E(t_..). When S_k = S for all k, (5.4) shows that S_k = S̄ = 1, and so, from (5.3), T_ik = 0. That is, within this setup, a common slope of unity corresponds to a null interaction effect. The intercept then becomes E(t_.k) - E(t_..) = c_k; a zero intercept corresponds to a null column (i.e. group) effect. The specification adopted in this

paragraph is that of a multiplicative model for the interaction in the

two-factor linear model (see, e.g., Williams, 1952; Mandel, 1961).

The parameters in the model (5.5) can be estimated by maximum

likelihood under the assumption of Gaussian errors. This leads to an

eigenanalysis, the solution of which gives the estimates of the S_k and

of the r_i (Williams, 1952; Mandel, 1971).

An alternative is to use a conditional regression approach (to

use the terminology of Mandel, 1961); this approach follows directly

from the Yates-Cochran (1938) formulation. First, the parameters in

(5.1) are estimated with T_ik = 0, under the assumption of independent,

identically distributed Gaussian errors. This gives the usual estimates


r̂_i = t_i. - t_.. and ĉ_k = t_.k - t_.. . Then the estimates of the slopes S_k are found by regressing the t_ik on the t_i. for each column k = 1,...,g.
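The conditional regression step above can be sketched as follows; this is a minimal numpy illustration (the function name is mine, not the thesis's) that regresses each column's elements t_ik on the row means t_i., the same regressor for every column.

```python
import numpy as np

def ia_regressions(t):
    """Conditional-regression fit for a q x g two-way table t:
    per-column OLS of the elements t_ik on the row means t_i. ."""
    t = np.asarray(t, dtype=float)
    row_means = t.mean(axis=1)                    # t_i. , identical regressor for every column
    x = row_means - row_means.mean()              # centred regressor
    slopes = x @ (t - t.mean(axis=0)) / (x @ x)   # S_k, one per column (group)
    intercepts = t.mean(axis=0) - slopes * row_means.mean()
    return slopes, intercepts
```

For a purely additive table (no interaction) the fitted slopes are all unity and the intercepts are the centred column effects, as the null model in the text requires.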

The conditional regression approach is adopted here, largely

because of the computational benefits which result in the subsequent

comparison-of-regressions table. Since the regressor variable is simply

the row mean and hence is the same for all regressions, the latter

table contains the SSQs for the complete analysis of variance. Moreover

the partition of the interaction SSQ into components due to the

comparison of slopes and to deviations from regression leads to standard

statistical tests. As Tukey (1949), Scheffe (1959, Section 4.8) and

Mandel (1961, Section 4) have shown, the SSQs follow χ² distributions,

with g - 1 and (q-2)(g-1) d.f. respectively. The procedure for

comparison of regressions is well-known (see, e.g., Sprent, 1969,

Section 7.4): first fit individual regressions for each column; then

fit a common within-columns regression; and finally bulk the data over

all columns and fit a single overall linear regression. The differences

in residual SSQs for the three stages give the SSQ due to common slope,

and conditionally on common slope the SSQ due to common position. To

see how the simplifications for the comparison of regressions result,

write y_ik for the observed t_ik and x_ik for the regressor variable, so

that x_ik = t_i. for all k. Then

Σ_{k=1}^{g} Σ_{i=1}^{q} (y_ik - ȳ_.k)(x_ik - x̄) = Σ_{k=1}^{g} Σ_{i=1}^{q} (t_ik - t_.k)(t_i. - t_..) = g Σ_{i=1}^{q} (t_i. - t_..)²

and

Σ_{k=1}^{g} Σ_{i=1}^{q} (x_ik - x̄)² = Σ_{k=1}^{g} Σ_{i=1}^{q} (t_i. - t_..)² = g Σ_{i=1}^{q} (t_i. - t_..)² .


Hence the SSQ due to common regression is g Σ_{i=1}^{q} (t_i. - t_..)² = row SSQ,

while the estimate of common slope is unity. Similar calculations

for the remaining terms in the comparison of regressions table lead

to Table 5.1.
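The identity just derived can be checked numerically; the following numpy sketch (function name is mine) bulks a q × g table over all columns, regresses t_ik on x_ik = t_i., and recovers a unit slope with regression SSQ equal to the row SSQ.

```python
import numpy as np

def overall_regression(t):
    """Single overall regression of the t_ik on x_ik = t_i. for a q x g table.
    Returns (slope, regression SSQ); by the identities in the text the slope
    is unity and the regression SSQ equals g * sum_i (t_i. - t_..)^2."""
    t = np.asarray(t, dtype=float)
    q, g = t.shape
    x = np.repeat(t.mean(axis=1), g)   # regressor t_i. for every cell (i, k)
    y = t.ravel()                      # observed t_ik, row-major to match x
    xc = x - x.mean()
    slope = (xc @ y) / (xc @ xc)
    reg_ssq = slope**2 * (xc @ xc)
    return slope, reg_ssq
```

The identity holds for any table whose row means are not all equal, which is why the comparison-of-regressions calculations reproduce the analysis of variance SSQs.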

The results outlined above assume independent and identically

distributed Gaussian errors for each element. In the two-way tables

for comparison of covariance matrices, this does not hold, since the

elements within a column (group) are in general correlated, while the

columns will have different variances if the group sizes differ.

Hence weighted regressions are desirable. Consider first the effect of

different group sizes. Specifically, var(tanh⁻¹ r_kj) = (n_k - 3)⁻¹, var(log s_kj) = 2(n_k - 1)⁻¹, and var(s_kj^{1/3}) = 2{9(n_k - 1)}⁻¹. Hence when the group sizes differ, the weights will differ from column to column.

As a consequence, while the orthogonality of the row and the column

effects will still hold, the interaction or rows × columns effect will

no longer be orthogonal to the row effect. However, an added advantage

of the conditional regression approach is that since the interaction

SSQ is conditional on the usual row and column effects, it gives the

SSQ usually calculated when non-orthogonality exists.
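The group-size weighting above can be sketched as inverse-variance weights from the asymptotic variances just quoted; this is an illustrative helper (name and interface are mine, not the thesis's).

```python
import numpy as np

def column_weights(n_k, element="arctanh_r"):
    """Per-column weights (inverse asymptotic variances) for the weighted
    regressions when group sizes n_k differ."""
    n_k = np.asarray(n_k, dtype=float)
    if element == "arctanh_r":
        var = 1.0 / (n_k - 3.0)      # var(tanh^-1 r_kj) = (n_k - 3)^-1
    elif element == "log_var":
        var = 2.0 / (n_k - 1.0)      # var(log s_kj) = 2(n_k - 1)^-1
    else:
        raise ValueError(f"unknown element type: {element}")
    return 1.0 / var                 # inverse-variance weights
```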

A more fundamental problem is that although the variances of the

elements may be regarded as known, their asymptotic covariances (and

hence the weighting in the regressions) depend on the unknown

population correlations ρ_ij, as summarized below. The

approach adopted here is based on the common observation that in many

discrimination studies, the pooled covariances matrix is calculated

with a reasonable number of degrees of freedom and should therefore

provide good estimates of the ρ_ij.

Some protection against atypical elements is desirable, however,

since they may unduly influence the average across groups; to provide


Table 5.1  Relationship between analysis of variance SSQs for a two-way table and comparison-of-regressions calculations with row means as regressor variable. Degrees of freedom for χ² are given in parentheses.

                                  Total SSQ                    Regression SSQ      Deviation SSQ

(a) individual regressions        row + interaction            row + regression    deviation
                                  (g(q-1))                     (q+g-2)             ((q-2)(g-1))

(b) common slope regressions      row + interaction            row                 interaction
                                  (g(q-1))                     (q-1)               ((q-1)(g-1))

difference of (a) and (b)                                      regression          regression
                                                               (g-1)               (g-1)

(c) overall (single) regression   row + column + interaction   row                 column + interaction
                                  (qg-1)                       (q-1)               (q(g-1))

difference of (b) and (c)                                      column              column
                                                               (g-1)               (g-1)


this, a robust average of the elements for each row is used. Specifically, the midmeans of the arctanh r_km for each m are calculated, and the robust means are back-transformed to provide estimates of the ρ_ij. Remember that the order m = 1,...,q corresponds to (i,j) in the sequence (1,2),(1,3),...,(v-1,v).
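The robust average can be sketched as follows; this simple midmean (mean of the middle half of the ordered values) discards floor(n/4) values at each end rather than interpolating at the quartiles, and the function names are mine.

```python
import numpy as np

def midmean(x):
    """Midmean: mean of the middle half of the ordered values
    (simple variant: drop floor(n/4) values from each end)."""
    x = np.sort(np.asarray(x, dtype=float))
    k = len(x) // 4
    return x[k:len(x) - k].mean()

def robust_rho(r):
    """Midmean of the arctanh r_km across groups, back-transformed
    to the correlation scale to estimate rho_ij."""
    return np.tanh(midmean(np.arctanh(np.asarray(r, dtype=float))))
```

The trimming is what gives protection against an atypical group's element unduly influencing the average across groups.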

An alternative to the calculations proposed above is to determine

the pooled estimates of the pij from those groups which appear to be

similar from the I-A plots; the pooling is over all the elements in

similar groups. There is then the choice of whether to average the

arctanh values and backtransform, or to pool the individual correlations

directly, or to pool the individual covariance matrices in the usual

way and then calculate the correlations. With a computer it is

straightforward to use the various alternatives and compare the results.

In practical applications to date, the graphical descriptions provided

by the alternative estimates have been similar to those provided by

the midmean estimates. This is discussed further in Section 5.3.

In the practical application of the approach, the assumption is

made that the population covariance matrix P for the elements is known.

Asymptotic variances and correlations of the transformed variances and

transformed correlations can be derived from results given in Elston

(1975). Let s_ii denote the sample variance for the ith variable, r_ij the sample correlation coefficient, and let σ_ii and ρ_ij denote the corresponding population parameters. Asymptotically, var(s_ii) = 2n⁻¹σ_ii² and cov(s_ii, s_jj) = 2n⁻¹σ_ii σ_jj ρ_ij². Now use the second-order result for the variance of a function of s_ii. Since ∂ log s_ii/∂s_ii evaluated at σ_ii gives σ_ii⁻¹, var(log s_ii) = 2n⁻¹ for all i = 1,...,v, and cov(log s_ii, log s_jj) = 2n⁻¹ρ_ij². The result for var(log s_ii) agrees with that of Bartlett and Kendall (1946) if n-1 replaces n. Hence the asymptotic covariance matrix for the log s_ii is of the simple form


2(n-1)⁻¹P, where P is a v × v matrix with unit diagonal terms and off-diagonal elements ρ_ij². For the correlation coefficient, asymptotically var(r_ij) = n⁻¹(1 - ρ_ij²)² (Elston, 1975, p.136), while

cov(r_ij, r_ik) = n⁻¹{ρ_jk(1 - ρ_ij² - ρ_ik²) - ½ρ_ij ρ_ik(1 - ρ_ij² - ρ_jk² - ρ_ik²)}

and

cov(r_ij, r_km) = n⁻¹{½ρ_ij ρ_km(ρ_ik² + ρ_im² + ρ_jk² + ρ_jm²) - (ρ_ij ρ_ik ρ_im + ρ_ij ρ_jk ρ_jm + ρ_ik ρ_jk ρ_km + ρ_im ρ_jm ρ_km) + ρ_ik ρ_jm + ρ_im ρ_jk} .

Since ∂ tanh⁻¹ r_ij / ∂r_ij evaluated at ρ_ij gives (1 - ρ_ij²)⁻¹, it follows that var(tanh⁻¹ r_ij) = n⁻¹ for all i = 1,...,v-1; j = i+1,...,v, which agrees with Fisher (1921) if n-3 replaces n. The asymptotic covariance matrix for the tanh⁻¹ r_ij is of the simple form (n-3)⁻¹P, where P is a v(v-1)/2 × v(v-1)/2 matrix with unit diagonal terms and off-diagonal terms given by n cov(r_ij, r_ik){(1 - ρ_ij²)(1 - ρ_ik²)}⁻¹ or n cov(r_ij, r_km){(1 - ρ_ij²)(1 - ρ_km²)}⁻¹. Note that the asymptotic covariance matrices depend only on the unknown ρ_ij. The back-transformed robust midmeans are substituted for the ρ_ij in the practical applications in Section 5.3, unless otherwise indicated.
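The covariance for correlations sharing one index, and the corresponding off-diagonal term of P on the arctanh scale, can be sketched as follows (function names are mine; the formulas are those quoted above).

```python
import numpy as np

def acov_r_shared(p_ij, p_ik, p_jk, n):
    """Asymptotic cov(r_ij, r_ik) for correlations sharing index i,
    sample size n."""
    return (p_jk * (1 - p_ij**2 - p_ik**2)
            - 0.5 * p_ij * p_ik * (1 - p_ij**2 - p_jk**2 - p_ik**2)) / n

def atanh_corr_shared(p_ij, p_ik, p_jk, n):
    """Off-diagonal term of P for the tanh^-1 r_ij:
    n cov(r_ij, r_ik) / {(1 - p_ij^2)(1 - p_ik^2)}."""
    return (n * acov_r_shared(p_ij, p_ik, p_jk, n)
            / ((1 - p_ij**2) * (1 - p_ik**2)))
```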

The calculations for the comparison of regressions with known

covariance matrix follow directly from the usual theory (see, e.g.,

Sprent, 1969, Section 4.1). In matrix notation, with t_k = (t_1k,...,t_qk)ᵀ and t_k ~ N_q(u_k, σ_kk P), where σ_kk ∝ n_k⁻¹ as in the preceding paragraph, the calculations proceed by replacing t_k by σ_kk^{-1/2} P^{-1/2} t_k, t_. = (t_1.,...,t_q.)ᵀ by σ_kk^{-1/2} P^{-1/2} t_., and the unit vector 1_q by σ_kk^{-1/2} P^{-1/2} 1_q. For example, the slope S_k is estimated by

Ŝ_k = {t_kᵀ P⁻¹ t_. - (t_kᵀ P⁻¹ 1_q)(1_qᵀ P⁻¹ t_.)(1_qᵀ P⁻¹ 1_q)⁻¹} / {t_.ᵀ P⁻¹ t_. - (t_.ᵀ P⁻¹ 1_q)²(1_qᵀ P⁻¹ 1_q)⁻¹} ;

in this example, the σ_kk cancel.
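The weighted slope estimator can be sketched as a generalized least-squares computation (numpy; the function name is mine, not the thesis's); the σ_kk scale factors cancel, as noted above.

```python
import numpy as np

def gls_slope(t_k, t_dot, P):
    """GLS slope of t_k on t_. when the errors have covariance
    proportional to the known matrix P."""
    t_k, t_dot = np.asarray(t_k, float), np.asarray(t_dot, float)
    Pinv = np.linalg.inv(P)
    one = np.ones(len(t_k))
    d = one @ Pinv @ one
    num = t_k @ Pinv @ t_dot - (t_k @ Pinv @ one) * (one @ Pinv @ t_dot) / d
    den = t_dot @ Pinv @ t_dot - (t_dot @ Pinv @ one) ** 2 / d
    return num / den
```

With P = I this reduces to the ordinary least-squares slope, and for an exact linear relation the fitted slope is unchanged by the weighting.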

Since the regression calculations, and hence the analysis of

variance calculations, are weighted with weights assumed known, the


various SSQs are compared with the appropriate χ² distributions.

As in the usual analysis of variance, various residual plots can

be made. To emphasize those groups or elements in a group which differ

from the rest, take the residuals as the departures of the individual

elements from the robust midmeans. Under the null hypothesis of no

group differences, these departures will have zero mean and common

standard deviation if the group sizes are equal. An obvious summary

is provided by a Q-Q Gaussian plot. Two views of the same plot reveal

the differences of interest: the first could use different symbols,

such as letters of the alphabet, for each group; and the second could

use different symbols, such as the numbers 1 to v or v(v-1)/2, for each

element. I have found teletype and line printer plots to be adequate.

The elements within a group are not uncorrelated, though empirical

evidence and some results in Tukey (1976, Section 5.3) suggest that

the linearity of the plot should not be affected. Another plot which is

sometimes useful, referred to here as an R-R plot, is that of the

departures from the robust midmean against the departures from the

usual mean; ideally, the plot should be linear with unit slope and

zero intercept. Atypical groups or elements are indicated by departures

from unit slope and/or by the clustering of elements for a particular

group(s). Again the same plot viewed with group symbols and with

element symbols shows the differences of interest.

To illustrate the simplicity of the I-A plot, and the subsequent

residual plots, consider the following computer generated example. All

populations are assumed to have unit variances. Two groups are generated

from a four-variate multivariate Gaussian population with correlations

0.975, 0.950, 0.925, 0.900, 0.850, and 0.875, while for a third group,

four of the six population correlations are reduced, viz. 0.975, 0.600,

0.575, 0.500, 0.450, and 0.875. The generation of the covariance


matrices, via the Bartlett decomposition (see, e.g., Newman and Odell,

1971, Section 5.2), assumes a sample size of 50. Figure 5.1(a) shows

the I-A plot for the arctanh-transformed correlation coefficients.

Since greater resolution on the teletype exists for the abscissa,

the individual elements are represented along that axis. The evidence

for departure from equality of correlation structure is obvious, and

needs no formal analysis. It is also clear that the two groups

generated from the same population do indeed behave similarly.

Figures 5.1(b) and 5.1(c) show Q-Q and R-R plots for the same example.

The nature of the differences between the groups is again obvious.

5.2.2 A multivariate comparison

The approach proposed in Section 5.2.1 uses the univariate concept

of regression. As noted in the introduction to this Section, it is also

possible to consider the elements for each group as a profile represented

by a multivariate vector t_k.

Consider the column vector t_k ~ N_q(u_k, σ_kk P), with σ_kk and P assumed known; σ_kk depends only on the group size. Then the relevant part of the log-likelihood for all g groups is

-½ tr{ P⁻¹ Σ_{k=1}^{g} σ_kk⁻¹ (t_k - u_k)(t_k - u_k)ᵀ } .    (5.6)

Under the hypothesis of equality of the u_k, differentiation of (5.6) w.r.t. the common u gives

ū = Σ_{k=1}^{g} σ_kk⁻¹ t_k / Σ_{k=1}^{g} σ_kk⁻¹ ,

and the maximized term is -½ tr P⁻¹B, where


[Teletype scatter plot not reproduced; abscissa: individual arctanh correlations.]

Figure 5.1(a) - I-A plot - arctanh correlations - generated data - two groups (A & B) are generated from the same population.

[Teletype scatter plot not reproduced; abscissa: Gaussian quantiles.]

Figure 5.1(b) - Q-Q plot - arctanh correlations - generated data.


[Teletype scatter plot not reproduced; abscissa: usual residuals t_ik - t_i. .]

Figure 5.1(c) - R-R plot - arctanh correlations - generated data - for the Q-Q and R-R plots, a number indicates the number of overprintings.


B = Σ_{k=1}^{g} σ_kk⁻¹ (t_k - ū)(t_k - ū)ᵀ .

Hence the likelihood ratio statistic is Λ = exp(-½ tr P⁻¹B), and so -2 log Λ = tr P⁻¹B. The statistic tr P⁻¹B is distributed as χ² on q(g-1) d.f. under the null hypothesis, and as a non-central χ² with non-centrality parameter tr P⁻¹ Σ_{k=1}^{g} σ_kk⁻¹ u_k u_kᵀ under the alternative hypothesis (see, e.g., Chakravarti, 1966, Section 3.1).

The form of the test statistic suggests that a conventional

canonical variate analysis, in which the vector of column elements t_k

is treated like a vector of group means, may be used to examine

differences in covariance structure. Specifically, the roots f_i and

vectors c_i of (B - fP)c = 0 are determined; there are min(g-1, q) non-zero

roots, the sum of which equals the trace statistic above. The plot of

canonical variate means, referred to subsequently as a C-V plot,

indicates those groups which have similar patterns of elements, and

those groups which differ. The value of the plots of canonical variate

means will depend on whether most of the information is in the first

few canonical roots (termed concentrated structure by Olson, 1974,

following Schatzoff, 1966) or whether it is spread over most of the

roots (diffuse structure).
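The eigenanalysis can be sketched as follows (numpy; the function name is mine): the roots f_i of (B - fP)c = 0 are the eigenvalues of P⁻¹B, and their sum is the trace statistic.

```python
import numpy as np

def canonical_roots(B, P):
    """Roots f_i of (B - f P)c = 0, i.e. eigenvalues of P^-1 B,
    returned in descending order; their sum is tr(P^-1 B)."""
    vals = np.linalg.eigvals(np.linalg.solve(P, B))
    return np.sort(vals.real)[::-1]
```

Whether the structure is concentrated or diffuse can then be judged from how much of the sum is carried by the first one or two roots.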

The statistic tr P⁻¹B is just the sum of the column SSQ and interaction SSQ in the analysis of variance discussed in Section 5.2.1. This is easily seen by considering the case when P = I and σ_kk = 1. Then

tr P⁻¹B = tr Σ_{k=1}^{g} (t_k - t̄)(t_k - t̄)ᵀ = Σ_{i=1}^{q} Σ_{k=1}^{g} (t_ik - t_i.)²

= Σ_{i=1}^{q} Σ_{k=1}^{g} {(t_.k - t_..)² + (t_ik - t_i. - t_.k + t_..)²} ,


as required. The result is also obvious when the vectors t_k are

considered as profiles (refer again to Morrison, 1976, Section 4.6).

A profile is simply a graphical representation obtained by plotting

the components of the vector against the variable number (from 1 to q).

Equality of mean vectors implies that the profiles are similar in

shape or are parallel, and that they also have the same overall mean

value; the former implies lack of interaction, and the latter lack of

group effect.

Layard (1972) has demonstrated the non-robustness of the likelihood

ratio test for the equality of two covariance matrices. Of particular

relevance to this study is that one of the tests he proposes, his

standard error test, is very similar to the multivariate comparison

outlined above. The standard error test is a simultaneous test of

equality of log variances and arctanh correlations for two groups;

a consistent estimate of the asymptotic covariance matrix is obtained

by substituting sample quantities for population moments in the

asymptotic covariance matrix of the sample second moments (see Layard,

1972, p.125). A sampling experiment with two variables presented in

Layard (1974, p.464) shows that empirical significance levels for a

nominal 5% level for the standard error test are between 5.8%, for

an underlying standard Gaussian distribution, and 8.5%, for a double

exponential; the levels for the usual likelihood ratio test are 4.5%

for the Gaussian, 21.8% for the double exponential and 39.2% for a 10%

contaminated Gaussian with a scale factor of 3. The results are based

on 1000 replications of samples of size 25. The power of the standard

error test is comparable with that of the likelihood ratio test for

Gaussian samples. Layard's results suggest that the multivariate trace

statistic, and the analysis of variance table, will provide useful


guidelines for indicating the statistical significance of the differences

in variance and/or correlation structure.

5.2.3 Orthogonalized variables

If the covariance matrices are similar, then the corresponding

concentration ellipsoids will be similar in orientation and in size.

After suitable rotation and standardization, the concentration ellipsoids

will then become concentration spheres. This is, of course, the first-stage rotation in canonical variate analysis (see Section 1.4). The

correlations of the orthonormalized variables within-groups should be

zero, and the variances should be unity. The rotation and scaling is

based on a pooled within-groups covariance or correlation matrix.

The previous paragraph suggests two further procedures for examining

equality of covariance structure, based on the equality of the correlations between orthonormalized variables, and on the equality of the

variances of the orthonormalized variables. However, the elements of

the pooled covariance or correlation matrix are influenced by an atypical

group(s) or element(s) of a group, with the result that all the correlations and the variances between the orthonormalized variables will tend

to differ from zero and from unity respectively. A simple but effective

solution is to use the robust estimate of the correlation matrix described

in Section 5.2.1, with the variables standardized by the back-transform

of the midmean of the log variances.

The orthonormalizing transformation based on the eigenanalysis of

the robust correlation matrix is applied to the covariance matrices for

each of the groups after the latter have been standardized by the robust

variances. Specifically, let R_A be the robust correlation matrix, and let V_A be the diagonal matrix of robust variances, with V_k the covariance matrix for the kth group. Let R_A = U_A E_A U_Aᵀ be the eigenanalysis of R_A.


Then the standardized covariance matrix is given by V_A^{-1/2} V_k V_A^{-1/2}, and that for the orthonormalized variables for the kth group is given by

E_A^{-1/2} U_Aᵀ V_A^{-1/2} V_k V_A^{-1/2} U_A E_A^{-1/2} = V_k⁰ ,

say.
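The transformation can be sketched directly from this expression (numpy; the function name is mine, not the thesis's).

```python
import numpy as np

def orthonormalized_cov(V_k, R_A, v_A):
    """V_k^0 = E_A^{-1/2} U_A^T V_A^{-1/2} V_k V_A^{-1/2} U_A E_A^{-1/2},
    where v_A holds the robust variances and R_A = U_A E_A U_A^T."""
    D = np.diag(1.0 / np.sqrt(np.asarray(v_A, dtype=float)))  # V_A^{-1/2}
    E, U = np.linalg.eigh(R_A)                                # eigenanalysis of R_A
    S = np.diag(1.0 / np.sqrt(E)) @ U.T                       # E_A^{-1/2} U_A^T
    return S @ D @ np.asarray(V_k, dtype=float) @ D @ S.T
```

A useful check on the algebra: if a group's covariance matrix agrees exactly with the robust variances and correlations, its V_k⁰ is the identity matrix, matching the null behaviour described below.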

All variances, for all groups, for the orthonormal variables should

be approximately unity under the null hypothesis, and the variances will

be approximately χ² distributed. Moreover, from the results on variances

and correlations of variances and of correlations in Section 5.2.1, the

variances within a group will tend to be uncorrelated after transformation. Hence, if the group sizes are similar, the equality of the variances

can be examined by a gamma probability plot.

All correlations, for all groups, for the orthonormal variables

should be approximately zero under the null hypothesis, and again will

tend to be uncorrelated. If the group sizes are similar, a Gaussian

probability plot of the arctanh transformed values should have slope

(n-3)^{-1/2} (see, e.g., Hills, 1969).

The examination of the orthonormal variables has an added advantage:

if a canonical variate analysis is carried out on these variables, the

within-groups directions that do not contribute to the between-groups

discrimination can be identified and eliminated, and this will often

lead to a marked improvement in the stability of the canonical variate

coefficients, as discussed in Chapter Six. Moreover, in the context

of discrimination, it is then only necessary to base the comparison of

the covariance matrices on the remaining orthonormalized variables,

and determine whether the correlations between these variables are

approximately zero and the variances approximately one. Plots of group

means and associated concentration ellipses for pairs of important

orthonormal variables can also be made; examples of this are given in

Campbell (1979) and in Campbell and Atchley (1979, paper in preparation).


The approach described in this Subsection will be of particular

value when the first-stage principal component analysis provides a

useful biological or physical interpretation. Campbell and Atchley

(1979) relate the principal components to differences in the patterns

of correlation between species of grasshoppers.

5.3 Some Examples

The first data set to be examined is from a study of variation in

and between species of grasshopper in the Snowy Mountains, New South

Wales (Campbell and Dearn, 1979). In all, twenty-five groups were

collected along three altitudinal transects. Each population contains

one of three species. The species are denoted as P, C and U. Thirteen

groups were collected along the first transect (altitude 980 m - 2140 m),

ten groups along the second transect (1040 m - 1540 m) and two groups

from a third transect. Sample sizes are given in Table 5.2.

Of twelve variables measured, three contain much of the information

for discrimination between the species. To illustrate the approaches

described in Section 5.2, differences in the variance and correlation

structure for the twenty-five groups for the three variables are

considered here. The variances and correlations for each group are

given in Table 5.2. The approaches are presented for illustrative

purposes. I am not proposing that such a detailed analysis is warranted

when there are only three variables; careful inspection of the log

variances and arctanh correlations will often then suffice, though the

I-A plot may also be a useful aid.

Figure 5.2(a) shows an I-A plot of average log variance against

individual log variance. The regressions to be fitted are of individual

values against average values; however the reverse is plotted because

of the greater resolution on the abscissa on teletype plots. The order


Table 5.2 Variances (x1000) and correlations for the grasshopper data

Grp^a   n    v1      v2      v3       r(1,2)  r(1,3)  r(2,3)

1 P 20 3.225 4.216 11.251 0.670 0.799 0.549

2 C 20 1.646 5.352 16.645 0.721 0.742 0.539

3 C 20 0.889 6.873 9.900 0.664 0.503 0.827

4 C 39 2.648 6.253 11.371 0.698 0.669 0.715

5 C 20 1.604 3.150 11.183 0.465 0.423 0.602

6 C 20 1.973 5.197 17.197 0.786 0.757 0.761

7 C 17 1.047 3.563 5.988 0.502 0.634 0.518

8 C 19 1.743 6.792 19.493 0.412 0.401 0.852

9 C 20 2.171 5.546 11.321 0.558 0.756 0.583

10 C 13 2.073 6.677 14.476 0.616 0.903 0.642

11 U 18 1.156 3.603 8.356 0.419 0.614 0.367

12 U 15 4.441 8.681 31.078 0.879 0.820 0.812

13 U 26 1.572 4.961 16.496 0.620 0.609 0.661

14 P 10 2.023 8.093 9.854 0.684 0.916 0.720

15 C 16 2.660 5.783 9.273 0.553 0.685 0.474

16 C 20 5.947 7.973 24.020 0.790 0.907 0.891

17 C 20 3.289 4.592 19.199 0.691 0.586 0.624

18 C 19 3.120 4.447 14.809 0.627 0.797 0.802

19 C 20 1.625 5.722 12.729 0.644 0.685 0.762

20 C 16 2.733 2.710 13.319 0.563 0.804 0.765

21 C 17 3.293 6.894 22.224 0.634 0.839 0.757

22 C 12 1.988 4.209 18.950 0.459 0.624 0.663

23 U 19 1.858 3.813 12.558 0.521 0.506 0.578

24 C 19 1.676 4.439 15.898 0.447 0.325 0.726

25 U 16 2.126 5.233 21.056 0.406 0.682 0.470

^a species identification (P, C or U) is also given. Groups 1-13 are

from transect I, groups 14-23 are from transect II and groups 24 and

25 are from transect III.


[Teletype scatter plot not reproduced; abscissa: individual log variances.]

Figure 5.2(a) I-A plot - log variances - grasshopper data - order of groups for each row: 7 11 15 14 3 5 1 9 4 23 19 20 10 18 24 13 2 6 22 17 8 25 21 16 12; 20 5 7 11 23 22 1 24 18 17 13 6 25 2 9 19 15 4 10 8 3 21 16 14 12; 3 7 11 13 5 19 2 24 8 23 6 22 14 10 25 9 4 15 20 18 1 17 21 12 16.

[Teletype scatter plot not reproduced; abscissa: Gaussian quantiles.]

Figure 5.2(b) Q-Q plot - log variances - grasshopper data


[Teletype scatter plot not reproduced; abscissa: usual residuals t_ik - t_i. .]

Figure 5.2(c) R-R plot - log variances - grasshopper data


of the groups for each row is given in the Figure legend to distinguish

groups whose symbols are overprinted in Figure 5.2(a). The general

visual impression is of a band of linear trends oriented at around 45°;

this is indicative of a lack of groups × elements interaction. Figures

5.2(b) and 5.2(c) show Q-Q and R-R plots with the letters A-Y representing

the 25 groups; linearity of the plots is again evident. Each of the

plots shows groups 12(L) and 16(P), and 7(G) and 11(K), at the extremes

of the plot.

Table 5.3 summarizes the fitted linear regressions and formal

analysis of variance table. The intercepts for a common slope of unity

are also given. All but three of the fitted regressions account for

more than 90% of the variation; and for only two of these three are

the deviations significant at the nominal 5% level. The analysis of

variance table shows that none of the SSQs of interest reaches even

the 90% point of the appropriate χ² distributions. In particular,

the pooled residual SSQ is not significant, nor is there a significant

difference in slopes. An M-S plot of mean against slope in Figure

5.3(a) provides graphical support for the analysis and for the

subjective conclusions from the I-A plot. Again groups 12 and 16, and

group 7, are on the edges of the plot. Groups 12(L) and 16(P) have

large variances, while group 7 has small variances.

The fitted regressions and analysis of variance table were also

calculated with the weighting matrix P based on alternative derivations

of the ρ_ij, viz. the usual pooled within-groups SSQPR matrix; the

pooled within-groups SSQPR matrix for the 12 groups 4, 5, 9-11, 14,

17-21, 23, with these groups chosen subjectively from the I-A, M-S and

C-V plots; and the backtransform of the average of the arctanh correlations for the 12 groups. All fitted slopes and intercepts for these

three alternatives are within 0.02 of those for the robust midmean


Table 5.3 Summary of fitted linear regressions and analysis of variance

table for the log variances for the grasshopper data

(a) fitted linear regressions

Sk r2 Deviation SSQ Grp Total SSQ t -Skt t.k-t.,

1 17.0 -1.65 0.01 0.68 0.94 1.04

2 50.6 1.13 -0.03 1.21 1.00 0.09

3 58.4 0.81 -0.30 1.20 0.85 8.68

4 39.8 -1.19 0.09 0.76 0.98 0.65

5 37.8 -0.10 -0.34 1.04 0.98 0.57

6 45.0 0.83 0.03 1.15 1.00 0.02

7 24.3 -1.19 -0.63 0.89 0.94 1.45

8 51.8 1.51 0.12 1.26 0.99 0.52

9 25.6 -0.75 -0.03 0.86 0.99 0.28

10 22.4 0.15 0.09 1.01 0.98 0.44

11 32.8 -0.32 -0.49 1.03 0.99 0.41

12 28.0 0.93 0.67 1.04 0.98 0.44

13 69.0 1.21 -0.07 1.24 1.00 0.03

14 12.1 -1.06 0.04 0.79 0.83 2.03

15 11.6 -1.86 0.00 0.65 0.97 0.30

16 21.3 -0.56 0.67 0.76 0.94 1.34

17 34.6 0.05 0.20 0.97 0.93 2.50

18 24.8 -0.68 0.09 0.85 0.94 1.37

19 39.7 0.28 -0.09 1.06 0.98 0.90

20 26.1 -0.72 -0.16 0.89 0.82 4.58

21 30.2 0.51 0.39 1.02 0.99 0.21

22 29.7 1.14 -0.01 1.21 0.98 0.56

23 34.2 -0.05 -0.19 1.02 0.99 0.30

24 46.3 0.95 -0.10 1.19 1.00 0.06

25 40.7 1.33 0.12 1.22 0.99 0.24

(b) analysis of variance table

Source of variation d.f. SSQa SSQb

row (elements) 2 797 782

column (groups) 24 28.1 28.3

row x column 48 56.9 56.4

slopes 24 27.9 26.5

deviations 24 29.0 29.9

C + RxC 72 84.9 84.7

a correlations derived from robust midmeans

b correlations derived from usual pooled within-groups SSQPR matrix


calculations. The (C + RxC) SSQs are within 2% of the C + RxC SSQ

for the robust midmean. Table 5.3(b) gives the analysis of variance

table when the pij are derived from the usual pooled within-groups

SSQPR matrix.

Now consider the multivariate approach in Section 5.2.2. With

only three variables, there are only three canonical roots. Their

values for the analysis of the log variances are 42.7, 24.8 and 17.4,

with the sum, the trace statistic, 84.9, agreeing with that for

columns SSQ + rowxcolumns SSQ in Table 5.3. Figure 5.3(b) shows a

C-V plot for the three canonical variates. There is in general

excellent agreement with Figure 5.3(a), with the obvious exception of

group 3 which has a significant residual SSQ and relatively low r2.

Figure 5.4 shows an I-A plot of average arctanh correlation

against individual arctanh correlations. The reasonably smooth linear

trends evident in Figure 5.2(a) no longer obtain, and this is reflected

in the fitted regressions. Only ten of the groups exhibit significant

row variation within a column at even the 20% level, and only four at

the 10% level; only for four of the ten do the fitted regressions

account for more than 50% of the total variation. The analysis of

variance in Table 5.4 shows no significant interaction or column

effects. The significant deviation from regression, or residual, SSQ

when the interaction is partitioned further reflects the poor description

afforded by the linear regressions.

The fitted regressions and analysis of variance table were also

calculated based on the alternative derivations of the pij discussed

above. The SSQs for (C + RxC) are again within 2% of the value for the

robust midmean calculations (see, for example, Table 5.4). There is

close agreement between the M-S plots from the various calculations.

The three canonical roots from the canonical variate analysis of


the arctanh correlations are 40.9, 26.5 and 16.0, with sum 83.4

(c.f. Table 5.3). The C-V plot in Figure 5.5 shows no obviously

different groups, though population 16 is again on the edge of the plot.

Table 5.4 Analysis of variance table for the arctanh

correlations for the grasshopper data

Source of variation d.f. SSQa SSQb

row (elements) 2 9.7 9.2

column (groups) 24 24.7 24.7

row x column 48 58.7 58.5

slopes 24 20.4 19.4

deviations 24 38.3 39.1

C + RxC 72 83.4 83.2

a correlations derived from robust midmean

b correlations derived from within-groups SSQPR matrix pooled over groups 4, 5, 9-11, 14, 17-21, and 23.

The grasshopper example illustrates the simplicity of the

interpretation based on I-A, Q-Q and R-R plots and associated regression

statistics when linear regressions adequately describe the individual-average relationship, and the use of the multivariate C-V plot to

complement the univariate regression approach. No marked differences

in covariance structure exist, though group 16 may be worth closer

examination.

A second example, from a study of geographic variation in the

whelk Thais lamellosa along the west coast of North America

(C. Campbell, 1978), does exhibit differences in covariance structure.

The comparison of the covariance matrices for nine of the groups for

five variables is discussed here. Group 6 (see Chapter Four) is excluded.


Figure 5.3(a) M-S plot - log variances - grasshopper data

Figure 5.3(b) C-V plot - log variances - grasshopper data - third canonical variate is indicated by vertical lines.


Figure 5.4 I-A plot - arctanh correlations - grasshopper data

Figure 5.5 C-V plot - arctanh correlations - grasshopper data - third canonical variate is indicated by vertical lines.


Group sizes are: 50, 99, 76, 37, 46, 50, 33, 28, 43. It is not the

intention to provide a biological interpretation of the data, but

rather to provide a basis for making such an interpretation. The

analyses are based on robust M-estimates of covariances, and hence

correlations (see Chapter Three).

Figure 5.6(a) shows an I-A plot for the log variances; the group

rankings are given in the Figure legend. A series of straight lines

would provide a reasonable description. Figure 5.6(b) shows a Q-Q

plot. Three groups of residuals are evident, and these correspond to

groups with visually similar positions in Figure 5.6(a), viz. H, I;

F, A, B, G; and C, E, D.

Table 5.5 summarizes the fitted linear regressions and formal

analysis of variance table for the analysis of the log variances.

Linearity of the fitted regressions holds for all but groups 2(B)

and 4(D) where departure from linearity is significant at the 1% and

0.1% levels respectively, when the deviation SSQ is compared with the

χ₃² distribution. Only for group 4 is less than 99% of the variation

explained by the linear regression. An M-S plot of mean versus slope

is given in Figure 5.7(a). It is clear from Figures 5.6 and 5.7(a)

that groups 8(H) and 9(I), and groups 3(C) and 5(E) and possibly 4(D),

have similar variance structure. The groups 1(A), 2(B), 6(F) and

7(G) have similar mean level but differ in slope.

There are five canonical vectors with non-null roots. The roots

for the analysis of the log variances are 181, 84, 20, 11 and 3. The

sum, the trace statistic, is 299 (c.f. the SSQ due to C + RxC in

Table 5.5), to be compared with χ₄₀². A C-V plot for the first three

canonical vectors is shown in Figure 5.7(b). Again groups 3 and 5,

and groups 8 and 9, are similar.


Figure 5.6(a) - I-A plot - log variances - Thais data - order of groups for each row given in the figure legend.

Figure 5.6(b) - Q-Q plot - log variances - Thais data.


Table 5.5 Summary of fitted linear regressions and analysis of

variance table for the log variances for the Thais data

Grp n Total SSQ t.k-Skt.. t.k-t.. Sk r2 Deviation SSQ

1 50 805 0.16 -0.21 0.95 0.99 4.6

2 99 2478 -1.50 -0.19 1.17 0.99 14.6

3 76 1316 0.77 0.65 0.98 1.00 1.9

4 37 454 2.31 0.98 0.82 0.96 16.9

5 46 706 1.32 0.78 0.92 0.99 5.4

6 50 566 1.29 -0.29 0.78 0.99 7.3

7 33 699 -0.50 0.07 1.07 0.99 4.2

8 28 588 -1.81 -1.12 1.09 0.99 3.3

9 43 830 -1.43 -1.13 1.04 0.99 5.8

Source of variation d.f. SSQ

row (elements) 4 8243

column (groups) 8 99

row x column 32 200

slopes 8 136

deviations 24 64

C + RxC 40 299


Figure 5.7(a) - M-S plot - log variances - Thais data.

Figure 5.7(b) - C-V plot - log variances - Thais data.


The formal analysis of variance and comparison of regressions can

be partitioned for comparisons of particular groups. Using the

conventional 5%, 1% and 0.1% levels as guidelines, specific comparisons

indicate that groups 8 and 9 do not differ significantly, nor do groups

3 and 5, though the inclusion of group 4 leads to a highly significant

RxC SSQ. Comparisons within groups 1, 2, 6 and 7 (highly significant

overall) show that groups 1 and 2 differ significantly, while the

overall comparison (C + RxC SSQ) for groups 1(A) and 6(F) is not

significant, though there is a significant difference (p < 0.01) in

slopes. The level of significance is, however, relatively small when

compared with that between groups 1 and 2 for example, or within groups

3, 4 and 5.

When the individual versus average trends are reasonably linear

and roughly parallel, the differences in mean or, equivalently, position

of the common slope lines provide a first summary of group differences.

Differences in slope within groups of roughly equal position are

indicative of differences in relative magnitudes of the variances for

the groups, and hence of possible differences in the orientation of the

associated concentration ellipsoids.

Figure 5.8 shows an I-A plot for the arctanh correlation coefficients;

the group rankings are given in Table 5.6, together with the row means

and identity of the elements. Examination of the order and clustering

of the groups in each row of Figure 5.8 and Table 5.6 shows that groups

6(F), 9(I) and 5(E) tend, for different elements, to have correlations

somewhat lower than those for the remaining groups, while group 4(D)

sometimes has higher correlations. Groups 5, 6 and 9 rank as the

bottom three groups for virtually all rows, while groups 1 and 4 rank

in the top three for nearly all rows.


Table 5.7 summarizes the fitted regressions and formal analysis of

variance table. The fitted regressions provide a reasonable description

for only about half the groups, both when evaluated by r2 and by the

significance of the deviation SSQ. Figure 5.9(a) shows an M-S plot of

column mean against fitted slope. The groups 6, 9, 5 and 4 are at the

edges of the plot. Group 1(A) has a low r2 and so its position must be

interpreted with care.

There are eight canonical vectors with non-null roots; the canonical

roots for the arctanh correlations are 60, 56, 38, 26, 14, 10, 3, 1

with trace statistic of 208 on 80 d.f. A C-V plot for the first four

canonical vectors is shown in Figure 5.9(b).

Comparison of Figures 5.9(a) and 5.9(b) shows a number of differences.

These can in part be explained by the poor description afforded by the

linear regressions, e.g. groups 5, 6, 8, 9. However, groups 2, 3 and 4

are well fitted by linear regressions, and yet their relative

similarities differ in the two representations.

The clustering of the rows themselves in Figure 5.8 suggests that

further examination of the patterns of correlation within each matrix

may provide further insight. To show how this might proceed, Table

5.8 lists the correlation coefficients for each group. The order is

as in Figure 5.8 and Table 5.5. The first three rows in Figure 5.8

refer to correlations between v1-v3, while the fourth row refers to

correlations between v4 and v5. The bottom rows refer to correlations

of v4 and of v5 with v1-v3. The second part of Table 5.8 provides a

subjective pooling of the correlations. Correlations are pooled if

they are within 0.02 of each other, otherwise the individual values

are given; the value 0.02 is arbitrarily chosen, though it does

accommodate most correlations. From the smoothed summary, some clear

patterns emerge. Groups 1, 2, 3, 4 and 7 have correlations >0.97


Figure 5.8 - I-A plot - arctanh correlations - Thais data.


Table 5.6 Order of the groups for each row of the I-A plot for

arctanh correlations for Thais data. Row means and

identity of the element are also given.

element row mean group rankings

1,2 2.73 9 5 6 8 2 3 1 7 4

2,3 2.56 6 9 5 7 1 8 2 3 4

1,3 2.31 9 6 5 7 8 2 1 3 4

4,5 2.29 9 6 8 5 2 3 1 4 7

2,5 2.08 5 9 6 3 8 2 7 1 4

1,5 2.07 5 9 6 2 3 7 1 4 8

2,4 2.06 9 5 6 8 3 2 4 7 1

3,4 2.03 5 9 8 6 3 1 4 2 7

1,4 2.01 9 5 6 2 8 3 4 7 1

3,5 1.96 5 9 6 8 3 2 7 1 4


Table 5.7 Summary of fitted linear regressions and analysis of

variance table for the arctanh correlations for the

Thais data

Grp Total SSQ t.k-Skt.. t.k-t.. Sk r2 Deviation SSQ

1 29 1.41 0.13 0.47 0.45 16.2

2 132 0.14 0.08 0.97 0.88 15.4

3 116 0.02 0.15 1.05 0.89 13.0

4 86 -0.49 0.40 1.37 0.94 4.8

5 110 -0.99 -0.31 1.28 0.82 20.2

6 73 -0.12 -0.27 0.94 0.74 19.4

7 55 0.25 0.22 0.98 0.69 17.1

8 46 0.35 -0.03 0.84 0.49 23.0

9 73 -0.81 -0.46 1.14 0.92 6.1

Source of variation d.f. SSQ

row (elements) 9 555

column (groups) 8 43

row x column 72 165

slopes 8 30

deviations 64 135

C + RxC 80 208


Figure 5.9(a) - M-S plot - arctanh correlations - Thais data.

Figure 5.9(b) - C-V plot - arctanh correlations - Thais data - third and fourth canonical variates are indicated by vertical and horizontal lines.


Table 5.8 Summary of correlation coefficients (x100) for Thais data,

(a)

and smoothed summary

original correlations

for various combinations of elements

4,5 2,5 1,5 2,4 3,4 1,4 3,5 1,2 2,3 1,3

1 99 99 99 99 99 98 99 98 99 98

2 99 99 98 98 98 97 98 98 96 97

3 99 99 99 98 98 97 98 98 98 97

4 100 100 99 99 98 98 98 98 98 98

5 98 99 98 97 94 91 94 91 94 87

6 98 97 94 97 95 96 95 95 94 94

7 100 99 98 99 98 98 98 98 98 98

8 99 99 98 97 95 99 95 95 97 96

9 98 97 94 95 91 91 91 92 88 92

(b) smoothed summary for various combinations

1,2 2,3 1,3 4,5 4v1-3 a 5v1-3 (4,5)v1-3 b v1,2,3 c

4 100 100 99 98 98 98 98 100

1 99 99 99 99 99 98 99 99

7 100 99 98 99 98 98 98 99

3 99 99 98 98 98 97 98 99

2 99 99 99 98 98 97 98 99

8 99 99 98 97 95 95,96,99 96 99

5 98 99 98 97 91,94,94 87,91,94 93d 98

6 98 97 94 97 95 95 95 97d

9 98 97 94 95 88,91,92 91 91d 97d

a average of r(1,4), r(2,4), r(3,4)

b average of r(1,4), r(2,4), r(3,4), r(1,5), r(2,5), r(3,5)

c average of r(1,2), r(1,3), r(2,3)

d subjective average, since range greater than 0.02.


for all pairs of variables, including those of v4 and v5 with v1-v3,

in contrast with groups 5, 6 and 9 which have lower correlations of

v4 and v5 with v1-v3. Group 8 is somewhat intermediate. Both

Figures 5.9(a) and 5.9(b) show the distinction between the two subsets.

v1-v3 are length measurements while v4 and v5 are width measurements.

There are some differences in the patterns of correlations within

groups 1, 2, 3, 4 and 7, though it is doubtful whether these are of any

practical significance. It is the overall similarity of the correlations

which leads to the different graphical summaries, depending on those

features of the patterns which are emphasized by the particular linear

combinations of the elements chosen. A change in correlation from 0.98

to 0.94 or to 0.92 will have considerably more effect on the orientation

of the corresponding concentration ellipsoid than will a change from

0.99 to 0.98.
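The closing remark can be illustrated numerically. A minimal sketch, for a hypothetical pair of variables with unequal variances (the values 1.0 and 1.2 are invented for illustration):

```python
import numpy as np

def major_axis_angle(r, v1=1.0, v2=1.2):
    """Angle (radians) of the major axis of a two-variable concentration
    ellipse with variances v1, v2 (hypothetical values) and correlation r."""
    c = r * np.sqrt(v1 * v2)
    w, V = np.linalg.eigh(np.array([[v1, c], [c, v2]]))
    major = V[:, np.argmax(w)]
    if major[0] < 0:                 # fix the arbitrary eigenvector sign
        major = -major
    return np.arctan2(major[1], major[0])

shift_small = abs(major_axis_angle(0.98) - major_axis_angle(0.99))
shift_large = abs(major_axis_angle(0.94) - major_axis_angle(0.98))
# the 0.98 -> 0.94 change rotates the major axis several times as far
# as the 0.99 -> 0.98 change
```

With exactly equal variances the orientation would not change at all; the unequal variances here make the rotation visible.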

5.4 Further Practical Aspects

The aim of the approaches presented in this Chapter is to provide

procedures for comparing covariance matrices which will indicate the

nature of any differences which exist. The I-A plots and accompanying

residual plots indicate atypical groups and/or elements. For more than

five groups and five to ten variables, the overall plots may need to be

combined with plots of, say, five groups at a time, still plotted against

the overall row means. This will offset overprinting of symbols in the

overall plot and hence provide better group identification.

The separate treatment of the variances and of the correlations

leads to ready identification of particular structure. For example,

proportional changes in variance from group to group will be reflected

in parallel straight lines in the I-A plot with different positions.

Uniform correlation matrices or a common variance for all variables


will be indicated by a null row effect.
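The remark about proportional variance changes can be checked with a small sketch (the base variances and group scalings below are invented):

```python
import numpy as np

# Hypothetical illustration: base variances for three variables, with
# each group's variances a constant multiple c[k] of the base pattern.
base = np.array([1.0, 2.0, 4.0])
c = np.array([0.5, 1.0, 2.0, 4.0])
t = np.log(np.outer(c, base))          # (groups x elements) log variances
tbar = t.mean(axis=0)                  # element averages over groups

slopes, intercepts = [], []
for tk in t:
    s, a = np.polyfit(tbar, tk, 1)
    slopes.append(s)
    intercepts.append(a)
# proportional variances give parallel unit-slope lines in the I-A plot,
# displaced by the intercepts log c[k] - mean(log c)
```

Every fitted slope is exactly one, so the groups appear as parallel lines differing only in position.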

The formal analysis of variance and multivariate comparison should

be used to complement the graphical displays and fitted regressions,

supported by commonsense interpretation of the results. When most of

the elements are similar, as in the Thais correlation example, so that

a small row effect results, the fitted regressions may be misleading.

The advantage of the I-A plot for the Thais example is that it draws

attention to the correlations for groups 5(E), 6(F) and 9(I), which

are somewhat lower than the corresponding correlations for the remaining

groups. Care must be taken when interpreting results from fitted

regressions, since the latter may be effectively determined by only

one or two observations, as will occur when all but one or two of the

variances or correlations in each group are essentially the same.

Another possibility is that the slope of the fitted line, and the

corresponding residual SSQ, may be unduly influenced by an atypical

element(s); a robust fitting procedure based on M-estimators could be

used to provide smoothed trends.
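One way to realize the suggested robust fitting is iteratively reweighted least squares with Huber weights; this is a sketch, not the thesis's own procedure, and the tuning constant and iteration count are conventional, arbitrary choices:

```python
import numpy as np

def huber_line(x, y, k=1.345, iters=50):
    """Straight-line fit by iteratively reweighted least squares with
    Huber weights, downweighting atypical elements."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]           # OLS start
    for _ in range(iters):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale
        if s == 0.0:
            break
        u = np.abs(r) / s
        w = np.where(u <= k, 1.0, k / u)                  # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta                                           # (intercept, slope)

x = np.arange(10.0)
y = 2.0 * x
y[-1] += 50.0                        # one atypical element
ols_slope = np.polyfit(x, y, 1)[0]
rob_slope = huber_line(x, y)[1]
```

The single atypical element drags the least-squares slope well away from two, while the Huber fit stays close to the trend of the remaining points.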


CHAPTER SIX: SHRUNKEN ESTIMATORS IN CANONICAL VARIATE ANALYSIS

This Chapter examines the role of shrunken estimation procedures

in discriminant and canonical variate analysis, and delineates

situations where they are likely to be effective. Shrunken estimators

are considered in Section 6.2 for the discriminant function and a

simple hypothetical example is discussed to illustrate the ideas.

Some asymptotic results for the mean square error (MSE) of the

coefficients in the two-group discriminant analysis are given in

Section 6.3. It is shown that for the g-inverse estimator, the MSE

of the corresponding coefficients will be less than that for the usual

solution provided the contribution to Mahalanobis D2 along the smallest

eigenvalue/vector combination is sufficiently small. Section 6.4

introduces shrunken estimators for canonical variate analysis, by

increasing the variance or standard deviation of the orthogonalized

variables from the first-stage calculations before scaling. When

the between-groups SSQ for an orthonormal variable is small and the

corresponding variance (and particularly eigenvalue) is also small,

shrinking will lead to improved stability of the canonical vectors.

Section 6.5 discusses a practical application. Some practical guidelines

are given in Section 6.6, together with some recommendations for variable

selection.


6.1 Introduction

In many applications of canonical variate analysis, the relative

magnitudes of the coefficients for the variables standardized to unit

variance by the pooled within-groups standard deviation are useful

indicators of the important variables for discrimination. If the

relative magnitudes of the standardized coefficients are to be used

in this way, stability of the coefficients is important. Stability

here refers to the sampling variation of the coefficients over repeated

samples.

Discriminant analysis, the two-group canonical variate analysis,

can be considered as a regression problem with a dummy y-variate. In

regression, the presence of high correlation between a pair of

regressor variables leads to instability in the corresponding regression

coefficients, reflected in large standard errors. Generalized ridge or

shrunken estimation procedures produce more stable estimates; Alldredge

and Gilb (1976) give a comprehensive bibliography.

In discriminant analysis, it is easy to show that high correlation

within groups when combined with between-groups correlation of the

opposite sign leads to greater group separation and a more powerful test

than when the within-groups correlation is low. However, if the

instability inherent in regression analysis with highly correlated

regressor variables carries over to discriminant analysis and thence

to canonical variate analysis, interpretation of the importance of

variables based on the relative magnitudes of the standardized coefficients

will be misleading.


6.2 Shrunken or Ridge-Type Estimators in Discriminant Analysis

If W is the pooled within-groups SSQPR matrix on n_W = n_1 + n_2 - 2

d.f., and d_x = x̄_1 - x̄_2, then the sample discriminant function is c^T x,

where

c = n_W W^{-1} d_x .

For consistency of notation in the following derivations, the usual

sample discriminant vector will be denoted by c_U. There is some

in Section 6.4 of this Chapter, in that the canonical vector c in (1.10)

is standardized to have unit variance within groups. I have chosen to

retain the same notation in this Section since the essential meaning

of the vectors is unchanged.

Write

n_W W^{-1} = U E^{-1} U^T = Σ_{i=1}^{v} u_i u_i^T / e_i ,

where the columns of U are the eigenvectors of n_W^{-1} W, and the diagonal

elements of the diagonal matrix E are the eigenvalues. Then

c_U = U E^{-1} U^T d_x = Σ_{i=1}^{v} u_i u_i^T d_x / e_i = U E^{-1/2} a_U ,

where

a_U = E^{-1/2} U^T d_x = E^{-1/2} d_y = d_z .

Here y = U^T x denotes the orthogonalized variables and z = E^{-1/2} U^T x

the orthonormalized variables derived from x; a_U is the vector of

discriminant function coefficients for the orthonormalized variables,


and d_y and d_z are the vectors of mean differences for the new variables.

Then a_i = e_i^{-1/2} d_yi = d_zi.

The Mahalanobis squared distance is

D² = d_x^T n_W W^{-1} d_x = Σ_{i=1}^{v} d_yi²/e_i = Σ_{i=1}^{v} d_zi² .
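These relations between the original, orthogonalized and orthonormalized forms can be verified numerically; W and d_x below are arbitrary made-up values, not data from the thesis:

```python
import numpy as np

n_w = 48
W = np.array([[4.0, 2.4, 0.8],
              [2.4, 3.0, 1.0],
              [0.8, 1.0, 2.0]])          # pooled within-groups SSQPR matrix
d_x = np.array([1.0, 0.5, -0.2])         # vector of mean differences

e, U = np.linalg.eigh(W / n_w)           # n_W^{-1} W = U E U^T
c = n_w * np.linalg.solve(W, d_x)        # c = n_W W^{-1} d_x
d_y = U.T @ d_x                          # orthogonalized mean differences
d_z = d_y / np.sqrt(e)                   # orthonormalized (the a_i)

# c = U E^{-1/2} a_U, and D^2 = sum of d_yi^2/e_i = sum of d_zi^2
assert np.allclose(c, U @ (d_z / np.sqrt(e)))
assert np.allclose(d_x @ c, np.sum(d_z ** 2))
```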

When the means are virtually coincident along one of the principal

components, d_yi = ȳ_1i - ȳ_2i ≈ 0. The coefficient a_i involves d_yi/e_i^{1/2}.

When the eigenvalue is also small, ai will be given by the ratio of

two small quantities and will be subject to wide fluctuations from

sample to sample. However the contribution to D2 is the square of this

quantity and so will tend to be small, even when the corresponding

eigenvalue is also small. Of course there is no reason in principle

why the largest d_yi should not occur with the smallest e_i.

The geometry of discriminant and canonical variate analysis suggests

that a solution to the problem of small d_yi and small e_i is to shrink

a_i towards the origin by increasing the scaling factor e_i for the first-stage orthonormalization. One possibility is to consider the class of

shrunken estimators defined by

a^GR = (E+H)^{-1/2} d_y  or  a_i^GR = d_yi/(e_i + h_i)^{1/2}   (6.1)

with H = diag(h1,...,hv). These are generalized ridge estimators (see,

e.g., Goldstein and Smith, 1974, in the context of regression analyses).

In terms of the original variables, (6.1) gives the estimator

c^GR = (n_W^{-1} W + U H U^T)^{-1} d_x = U(E+H)^{-1/2} a^GR .   (6.2)

Generalized-inverse (or principal component) estimators are

obtained by setting the smallest v-p coefficients ai to zero, which is

Page 156: CANONICAL VARIATE ANALYSIS: SOME PRACTICAL ASPECTS by ... · 5.5 C-V plot for arctanh correlations for grasshopper data 137 5.6 I-A and Q-Q plots for log variances for Thais data

155

equivalent to setting h_i = 0 for i ≤ p and h_i = ∞ for i > p. Hence

c^GI = U_p E_p^{-1} d_yp = U_p E_p^{-1/2} a_p, where the subscript p denotes the

appropriate v×p, p×p and p×1 partitions of U, E, d_y and a_U respectively.

By contrast, when h_i = h for all i, the ordinary ridge estimator

c^R = (n_W^{-1} W + hI)^{-1} d_x results.
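A sketch of the generalized ridge estimator and its ordinary-ridge special case; W, d_x and the shrinkage constants are made-up illustrative values:

```python
import numpy as np

def shrunken_discriminant(W, d_x, n_w, h):
    """Generalized ridge discriminant vector: shrink the orthonormalized
    coefficients d_yi/(e_i + h_i)^{1/2} and map back to the original
    variables, as in (6.2).  h is the vector (h_1, ..., h_v)."""
    e, U = np.linalg.eigh(W / n_w)       # eigenvalues/vectors of n_W^{-1} W
    d_y = U.T @ d_x
    a_gr = d_y / np.sqrt(e + h)          # shrunken coefficients, (6.1)
    return U @ (a_gr / np.sqrt(e + h))   # = U (E+H)^{-1} U^T d_x

n_w = 48
W = np.array([[4.0, 2.4, 0.8],
              [2.4, 3.0, 1.0],
              [0.8, 1.0, 2.0]])
d_x = np.array([1.0, 0.5, -0.2])

c_usual = shrunken_discriminant(W, d_x, n_w, np.zeros(3))      # h = 0
c_ridge = shrunken_discriminant(W, d_x, n_w, np.full(3, 0.1))  # h_i = h
```

Setting h = 0 recovers the usual vector n_W W^{-1} d_x, and a constant h reproduces the ordinary ridge estimator.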

It is instructive to consider a simple hypothetical example which

illustrates the problem of instability. Assume for convenience that

there are two variables, and that they have unit variance and correlation

r within groups. Then the eigenvalues are 1 + r and 1 - r, with

eigenvectors 2^{-1/2}(1,1)^T and 2^{-1/2}(1,-1)^T. With differences d_x1 and d_x2 in

the original variables, the differences along the principal components

are (d_x1 + d_x2)/√2 and (d_x1 - d_x2)/√2.

The discriminant coefficients for the orthonormal variables are

a_1^GR = (d_x1 + d_x2)/{√2 (1 + r + h_1)^{1/2}}  and  a_2^GR = (d_x1 - d_x2)/{√2 (1 - r + h_2)^{1/2}} ,

and hence

c_1^GR = (d_x1 + d_x2)/{2(1 + r + h_1)} + (d_x1 - d_x2)/{2(1 - r + h_2)}  and

c_2^GR = (d_x1 + d_x2)/{2(1 + r + h_1)} - (d_x1 - d_x2)/{2(1 - r + h_2)} .

Consider the situation in which d_x1 ≈ d_x2, and r is high. Assume

that d_x1 is slightly greater than d_x2. For h_2 = 0, |a_2^GR| will be

inflated by small 1 - r, while small increases in h_2, which may be

large compared with 1 - r, will cause |a_2^GR| to tend rapidly to zero.

In this situation c_2^GR will be negative for h_2 = 0, and will become

positive as h_2 increases sufficiently. If d_x1 differs appreciably from d_x2,

then the corresponding

contribution to D², (d_x1 - d_x2)²/{2(1-r)}, will become important

and shrinkage will be accompanied by a marked decrease in discrimination.


It seems natural to require that if the contribution of

d_y2 = (d_x1 - d_x2)/√2 to D² is small, then the estimate a_2 should also

be small in magnitude. This would be achieved here by the generalized

ridge estimator a_2^GR, with h_2 chosen such that a_2^GR is not sensitive to

small variations in h_2.
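The behaviour of a_2^GR and c_2^GR in this hypothetical example can be traced directly; the values of r and the mean differences are invented for illustration:

```python
import numpy as np

r = 0.98
d1, d2 = 1.00, 0.96                  # nearly equal mean differences

def a2_c2(h2, h1=0.0):
    # coefficients for the two-variable example with unit variances
    a2 = (d1 - d2) / (np.sqrt(2.0) * np.sqrt(1.0 - r + h2))
    c2 = (d1 + d2) / (2.0 * (1.0 + r + h1)) \
         - (d1 - d2) / (2.0 * (1.0 - r + h2))
    return a2, c2

a2_raw, c2_raw = a2_c2(0.0)      # |a2| inflated by small 1 - r; c2 < 0
a2_shr, c2_shr = a2_c2(0.5)      # modest h2 shrinks a2; c2 becomes positive
```

A modest h_2 reduces |a_2^GR| by a factor of about five here, and the sign of c_2^GR changes from negative to positive, as described above.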

Now d_x1 will be approximately equal to d_x2 if the major axes of

the concentration ellipses for the two groups are virtually collinear.

If this situation is combined with high within-groups correlation, then

the overall correlation of the combined data, ignoring group distinctions,

will also be high.

The effect of high correlation on the stability of the coefficients

from a discriminant analysis, and its relation to regression analysis,

can now be explained. When discriminant analysis is viewed as a

regression problem with dummy response variable, the total matrix for

the combined data, namely W + t d_x d_x^T with t^{-1} = n_1^{-1} + n_2^{-1}, determines the

discriminant coefficients; the correlations between the dummy response

and observed (regressor) variables are essentially the d_xi. Problems

of instability arise in regression when the correlation r_{x_i x_j} is large, so that

r_{x_i y} ≈ r_{x_j y}. Moreover, one of the eigenvectors will contrast x_i and x_j; there will be a relatively large positive component corresponding

to xi, a relatively large negative component for x., and small

components for the remaining variables. When the regression analysis

is carried out on the principal components of the original variables,

the sum of squares corresponding to this contrast-type eigenvector

will be very small. The corresponding terms in the dummy variable

analysis are the correlation of x_i and x_j in W + t d_x d_x^T and the

similarity of the values of d_xi and d_xj. As noted in the previous

paragraph, only when d_xi ≈ d_xj will the correlation between x_i and x_j

in the combined matrix be high.
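The proportionality between the dummy-variable regression coefficients and the discriminant coefficients can be checked on simulated two-group data (not thesis data):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 40, 60
X1 = rng.normal(size=(n1, 4))
X2 = rng.normal(size=(n2, 4)) + np.array([1.0, 0.5, 0.0, -0.5])
X = np.vstack([X1, X2])
y = np.r_[np.zeros(n1), np.ones(n2)]          # dummy response

# within-groups SSQPR matrix and the usual discriminant direction
W = ((X1 - X1.mean(0)).T @ (X1 - X1.mean(0))
     + (X2 - X2.mean(0)).T @ (X2 - X2.mean(0)))
c = np.linalg.solve(W, X1.mean(0) - X2.mean(0))

# regression of the centred dummy variable on the centred x's
b = np.linalg.lstsq(X - X.mean(0), y - y.mean(), rcond=None)[0]

cosine = c @ b / (np.linalg.norm(c) * np.linalg.norm(b))
# |cosine| = 1: the two coefficient vectors are proportional
```

The identity holds exactly in any sample, since the total matrix differs from W only by a rank-one term in d_x.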


the columns

eigenvalues

of E, are

of r, are the eigenvectors of E, with corresponding

X.. Then the 1 asymptotic distributions, for distinct roots

157

6.3 Mean Square Error of Shrunken Estimators for Discriminant Analysis

In this Section, the asymptotic mean square error of the shrunken

estimators is derived, and is compared with that for the usual

estimator.

Assume now that x \sim N_v(\mu_k, \Sigma) if x belongs to the kth population, k = 1,2, and let \delta = \mu_1 - \mu_2. The population vector of discriminant coefficients is given by c^* = \Sigma^{-1}\delta.

From (6.1) and (6.2) the generalized ridge estimator may be written

    c_{GR} = \sum_{i=1}^{v} u_i u_i^T d_{\bar{x}} / (e_i + h_i) = \sum_{i=1}^{v} u_i d_{\bar{y}i} / (e_i + h_i) .   (6.3)

The expectation and mean square error (MSE) of c_{GR} involve the moments of e_i and u_i. Exact results do not appear to be available. Anderson (1963) has derived asymptotic results, showing that the e_i and u_i are asymptotically independent. Write \Sigma = \Gamma \Lambda \Gamma^T, where the columns \gamma_i of \Gamma are the eigenvectors of \Sigma, with corresponding eigenvalues \lambda_i. Then the asymptotic distributions, for distinct roots, are

    e_i \sim N(\lambda_i, 2\lambda_i^2 / n_w)   (6.4a)

and

    u_i \sim N_v(\gamma_i, \Gamma_i \Omega_i \Gamma_i^T / n_w) ,   (6.4b)

where the subscript i for \Gamma denotes that the ith vector of \Gamma is deleted, so that \Gamma_i is v x (v-1), while the subscript i for \Omega denotes that the ith diagonal element is deleted; \Omega_i is (v-1) x (v-1). The jth diagonal term of \Omega_i is \lambda_j \lambda_i / (\lambda_j - \lambda_i)^2.
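As a concrete sketch, the estimator in (6.3) can be computed directly from the eigenanalysis of W/n_w. The function name and the two-variable matrices below are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def generalized_ridge_discriminant(W, nw, d, h):
    """Generalized ridge discriminant coefficients, equation (6.3):
    c_GR = sum_i u_i u_i^T d / (e_i + h_i),
    where (e_i, u_i) is the eigenanalysis of W/nw and d is the
    difference between the two group mean vectors.
    h gives the shrinkage constants, aligned with the eigenvalues
    in decreasing order; h_i = np.inf drops component i entirely."""
    e, U = np.linalg.eigh(W / nw)
    order = np.argsort(e)[::-1]          # largest eigenvalue first
    e, U = e[order], U[:, order]
    dy = U.T @ d                         # d_ybar_i = u_i^T d
    w = np.where(np.isinf(h), 0.0, dy / (e + h))
    return U @ w

# Hypothetical two-variable example: with all h_i = 0, (6.3) reduces to
# the usual estimator (W/nw)^{-1} d.
S = np.array([[1.0, 0.9],
              [0.9, 1.0]])
d = np.array([1.0, 0.8])
c_usual = generalized_ridge_discriminant(S * 50, 50, d, np.zeros(2))
print(np.allclose(c_usual, np.linalg.solve(S, d)))   # True
```

Setting h = [0, np.inf] instead gives the generalized inverse estimator of Section 6.2, which retains only the leading principal component.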


From (6.4) the following can be established:

    E(u_i u_i^T) = \gamma_i \gamma_i^T + \Gamma_i \Omega_i \Gamma_i^T / n_w ;   (6.5a)

    E\{\lambda_i/(e_i+h_i)\} \approx \frac{\lambda_i}{\lambda_i+h_i}\Big\{1 + \frac{2\lambda_i^2}{n_w(\lambda_i+h_i)^2}\Big\} = \frac{\lambda_i}{\lambda_i+h_i}\{1 + g(n_w, h_i)\} , say ;   (6.5b)

and

    E\{\lambda_i^2/(e_i+h_i)^2\} \approx \frac{\lambda_i^2}{(\lambda_i+h_i)^2}\Big\{1 + \frac{6\lambda_i^2}{n_w(\lambda_i+h_i)^2}\Big\} .   (6.5c)

The independence of W and d_{\bar{x}} implies the independence of (e_i, u_i) and d_{\bar{x}}. Hence (6.5) with (6.3) gives

    E(c_{GR}) \approx \sum_{i=1}^{v} \frac{\gamma_i \gamma_i^T \delta}{\lambda_i+h_i} + \sum_{i=1}^{v} \frac{\gamma_i \gamma_i^T \delta}{\lambda_i+h_i} g(n_w, h_i) + \sum_{i=1}^{v} \{1 + g(n_w, h_i)\} \frac{\Gamma_i \Omega_i \Gamma_i^T \delta}{n_w(\lambda_i+h_i)}

and for large n_w,

    E(c_{GR}) \to \sum_{i=1}^{v} \frac{\gamma_i \gamma_i^T \delta}{\lambda_i+h_i} = (\Sigma + \Gamma H \Gamma^T)^{-1} \delta = c^*_{GR} .

The asymptotic MSE (aMSE) of c_{GR} can be evaluated using (6.5). After some algebra, and ignoring terms involving 1/n_w^2 and smaller,


    aMSE(c_{GR}) = \frac{1}{n_w} \sum_{i=1}^{v} \frac{(\delta^T\gamma_i)^2}{\lambda_i(\lambda_i+h_i)^3} (2\lambda_i^3 - 4h_i\lambda_i^2 + n_w h_i^2 \lambda_i + n_w h_i^3)

        + \frac{1}{n_w} \sum_{i=1}^{v} \sum_{j \neq i} \frac{\lambda_j - 2\lambda_i}{(\lambda_j - \lambda_i)^2} \frac{\lambda_i (\delta^T\gamma_j)^2}{(\lambda_i+h_i)^2} .   (6.6)

With h_i = 0 for all i,

    aMSE(c_U) = \frac{2}{n_w} \sum_{i=1}^{v} \frac{(\delta^T\gamma_i)^2}{\lambda_i} + \frac{1}{n_w} \sum_{i=1}^{v} \sum_{j \neq i} \frac{\lambda_j - 2\lambda_i}{\lambda_i(\lambda_j - \lambda_i)^2} (\delta^T\gamma_j)^2 .

The generalized inverse result follows by setting h_i = 0 for i \le p and h_i = \infty for i > p, to give

    aMSE(c_{GI}) = \frac{2}{n_w} \sum_{i=1}^{p} \frac{(\delta^T\gamma_i)^2}{\lambda_i} + \sum_{i=p+1}^{v} \frac{(\delta^T\gamma_i)^2}{\lambda_i^2} + \frac{1}{n_w} \sum_{i=1}^{p} \sum_{j \neq i} \frac{\lambda_j - 2\lambda_i}{\lambda_i(\lambda_j - \lambda_i)^2} (\delta^T\gamma_j)^2 .   (6.7)

Comparison of (6.6) and (6.7) shows that aMSE(c_{GI}) < aMSE(c_U) provided

    \sum_{i=p+1}^{v} \frac{(\delta^T\gamma_i)^2}{\lambda_i^2} \le \frac{1}{n_w - 2} \sum_{i=p+1}^{v} \sum_{j \neq i} \frac{\lambda_j - 2\lambda_i}{(\lambda_j - \lambda_i)^2} \frac{(\delta^T\gamma_j)^2}{\lambda_i} .

In the interesting case in practice of the single small eigenvalue,

with p = v-1 and \lambda_v \ll \lambda_i for all i < v, the condition becomes


    \frac{(\delta^T\gamma_v)^2}{\lambda_v} \le \frac{1}{n_w - 2} \sum_{i=1}^{v-1} \frac{(\delta^T\gamma_i)^2}{\lambda_i} .   (6.8)

Denote (\delta^T\gamma_i)^2 / \lambda_i, the contribution to Mahalanobis \Delta^2 along the ith eigenvector, by \Delta(i). Then the condition (6.8) may be written as

    \Delta(v) \le \Delta^2 / (n_w - 1) .   (6.9)

The requirement (6.8) or (6.9) is intuitively reasonable and

accords with practice, namely that when the contribution to the overall

discrimination (measured by \Delta^2) is small, and the eigenvalue \lambda_v is also

small, then the MSE of the g-inverse or principal component solution

will be reduced by eliminating the corresponding principal component.
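Conditions (6.8) and (6.9) are algebraically the same requirement; a small numerical check, with hypothetical contributions \Delta(i) and within-groups d.f. (not the thesis data), makes this explicit.

```python
import numpy as np

# Hypothetical contributions Delta(i) = (delta' gamma_i)^2 / lambda_i to the
# Mahalanobis distance, with Delta(v) the term for the smallest eigenvalue.
Delta = np.array([2.0, 1.2, 0.5, 0.004])
nw = 100                                       # within-groups degrees of freedom

cond_68 = Delta[-1] <= Delta[:-1].sum() / (nw - 2)   # condition (6.8)
cond_69 = Delta[-1] <= Delta.sum() / (nw - 1)        # condition (6.9)
print(cond_68, cond_69)   # True True: the two forms always agree
```

The equivalence is immediate: \Delta(v) \le (\Delta^2 - \Delta(v))/(n_w - 2) rearranges to \Delta(v)(n_w - 1) \le \Delta^2.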

Unfortunately, a simple result cannot be established for the comparison of aMSE(c_{GR}) with aMSE(c_U). Even a simplification such as h_i = 0 for all i < v, h_v = (b-1)\lambda_v, fails to lead to a useful comparison. One situation in which some insight can be gained is the case v = 2 with \lambda_2 \ll \lambda_1 and h_1 = h_2 = b\lambda_2. Under the assumption that terms of order \lambda_2/\lambda_1 can be ignored, it can be shown that aMSE(c_{GR}) < aMSE(c_U) provided

    \frac{(\delta^T\gamma_1)^2}{\lambda_1} \ge \frac{(n_w-2)b^2 + (n_w-6)b - 10}{(b+1)(b+2)} (\delta^T\gamma_2)^2 .   (6.10)

Consider the situation in the previous Section, with two variables

with equal unit variance and correlation \rho. Let the difference between the means for the first variable be \delta (= \delta_1) and that for the second be t\delta (= \delta_2). The condition t = 1 implies that the means lie along the major axis of either ellipsoid, while t = -1 implies that the means lie along the minor axis. For this situation, \lambda_1 = 1+\rho, \lambda_2 = 1-\rho, (\delta^T\gamma_1)^2 = \delta^2(1+t)^2/2 and (\delta^T\gamma_2)^2 = \delta^2(1-t)^2/2. The condition (6.10) becomes


    \frac{b^2 + 3b + 2}{(n_w-2)b^2 + (n_w-6)b - 10} \ge (1+\rho) \frac{(1-t)^2}{(1+t)^2} .

As t \to +1, the condition is readily satisfied, and this is just the situation where \delta_1 = \delta_2 and little or no information for discrimination lies in the direction of smaller within-groups variation. However, when t \to -1 the condition will not be satisfied; here all the information for discrimination lies in the direction of smaller within-groups variation.

6.4 Shrunken Estimators in Canonical Variate Analysis

The argument given in Section 6.2 to illustrate the instability of

the discriminant coefficients is readily extended to more than two groups.

When the sum of squares between the means along a particular eigenvector

of the within-groups dispersion matrix is small, and the corresponding

eigenvalue is also small, instability of the coefficients will again result.

To see this, consider a hypothetical situation with two highly

correlated variables, with the remaining v-2 variables having lower

correlations. Then u_v will be approximately of the form 2^{-1/2}(*, ..., *, 1, -1)^T, where the * represent small numbers, while u_1 \approx v^{-1/2}\mathbf{1}. The remaining u_i will have small components for the (v-1)th and vth variables. The form of the eigenvectors can be readily verified empirically. If the between-groups SSQ e_v^{-1} u_v^T B u_v for the vth orthonormal variable is also small, then the corresponding component a_v of the canonical vector(s) a will also be small. Since c = U E^{-1/2} a from (1.37), it follows that for the hypothetical situation,


    c_{v-1} \approx (2e_v)^{-1/2} a_v + (v e_1)^{-1/2} a_1   and   c_v \approx -(2e_v)^{-1/2} a_v + (v e_1)^{-1/2} a_1 .

But a_v / \sqrt{e_v} involves the ratio of two small numbers and may be fortuitously large if e_v is small enough, in which case the term will dominate both c_{v-1} and c_v, giving coefficients of similar magnitude but opposite sign.
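The contrast form of the smallest eigenvector can, as noted above, be verified empirically; a quick simulation with hypothetical data (not the thesis examples), where two of four variables are highly correlated:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4))
X[:, 3] = 0.98 * X[:, 2] + 0.2 * rng.standard_normal(500)  # x4 nearly equal to x3

R = np.corrcoef(X, rowvar=False)      # within-groups correlation matrix analogue
e, U = np.linalg.eigh(R)              # ascending: column 0 is the smallest eigenvector
u_small = U[:, 0] * np.sign(U[2, 0])  # fix the sign for readability
print(np.round(u_small, 2))           # close to 2^{-1/2}(*, *, 1, -1)
```

The two correlated variables dominate the smallest eigenvector with near-equal loadings of opposite sign, while the uncorrelated variables contribute only small components.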

The introduction of shrunken estimators in Section 6.2 proceeds by increasing the eigenvalues e_i by positive shrinkage constants h_i before the first-stage orthonormalization. The procedure for canonical variate analysis is to find the canonical roots f_{GR} and canonical vectors a_{GR} of

    (E + EH)^{-1/2} U^T B U (E + EH)^{-1/2} ,   (6.11)

where H = diag(h_1, ..., h_v); here the shrinkage constants are multiples of the eigenvalues. Then

    c_{GR} = U E^{-1/2} (I + H)^{-1/2} a_{GR}

gives the shrunken estimators of the canonical vectors for the original variables.

Consider now the formulation of canonical variate analysis in Section 1.5. Write

    Z_H = X U (E + EH)^{-1/2} = Z (I + H)^{-1/2} .   (6.12)

The symmetric matrix in (6.11) above becomes Z_H^T Z_H. The eigenanalysis to give f_{GR} and a_{GR}, and hence c_{GR}, is

    (Z_H^T Z_H - f_{GR} I) a_{GR} = 0

or

    Z_H^T Z_H a_{GR} = f_{GR} a_{GR} .   (6.13)
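The two-stage computation in (6.11)-(6.13) can be sketched in a few lines of NumPy. The function name and the W and B matrices below are hypothetical; with h = 0 the canonical roots reduce to the eigenvalues of W^{-1}B, which serves as a check.

```python
import numpy as np

def shrunken_cva(W, B, h):
    """Generalized ridge canonical variate analysis, after (6.11):
    eigenanalysis of (E+EH)^{-1/2} U'BU (E+EH)^{-1/2}, followed by
    c_GR = U E^{-1/2} (I+H)^{-1/2} a_GR.
    h is aligned with the ascending eigenvalues returned by eigh,
    so h[0] shrinks the smallest eigenvalue/vector combination."""
    e, U = np.linalg.eigh(W)                     # first-stage eigenanalysis
    scale = 1.0 / np.sqrt(e * (1.0 + h))         # diagonal of (E+EH)^{-1/2}
    M = scale[:, None] * (U.T @ B @ U) * scale[None, :]
    f, a = np.linalg.eigh(M)
    f, a = f[::-1], a[:, ::-1]                   # largest canonical root first
    c = U @ (scale[:, None] * a)                 # canonical vectors (columns)
    return f, c

# Hypothetical within- and between-groups SSQ matrices.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
W = X.T @ X
m = np.array([1.0, 0.5, -0.2])
B = np.outer(m, m)

f0, c0 = shrunken_cva(W, B, np.zeros(3))
roots = np.sort(np.linalg.eigvals(np.linalg.solve(W, B)).real)
print(np.allclose(np.sort(f0), roots))           # True
```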


The first-stage simultaneous rotation produces new variables

    z_H = (I + H)^{-1/2} E^{-1/2} U^T x = (I + H)^{-1/2} z ,

which corresponds to a principal component analysis of W_H = U(E + EH)U^T. The relationship (6.13) gives, as in Section 1.5,

    Z_H Z_H^T m_{GR} = f_{GR} m_{GR}

with

    m_{GR} = Z_H a_{GR} ,

and so the Q-technique and singular value decomposition calculations carry over directly.

In the usual canonical variate solution, the two-stage computations

give the same canonical vectors cU, irrespective of the first-stage

orthonormalization z = W-1/2x used. The invariance of canonical variate

analysis under different first-stage orthonormalizations is one of its

attractive features. However, this invariance will no longer result

when shrunken estimators are introduced. There is even some ambiguity

over the way in which the shrunken estimators are introduced above.

Explicit consideration of the alternative first-stage rotations suggests

that shrunken estimators could equally well be introduced as the solution

to the eigenanalysis of \tilde{Z}_H^T \tilde{Z}_H, with

    \tilde{Z}_H = X U (E^{1/2} + H_2 E^{1/2})^{-1} = Z (I + H_2)^{-1} .   (6.14)

Here the shrinkage constants h_{2i} are multiples of the standard deviations of the orthogonalized variables U^T x.

Shrunken estimators can also be introduced for the triangular

decomposition or successive orthonormalization of Section 1.5. When

two variables are highly correlated within groups, the diagonal term


u_{Tii} of U_T corresponding to the second of the variables will be small. From (1.46), the corresponding diagonal term of E_T will then also be small. Instability of some of the coefficients will again result if the corresponding diagonal term of Z_T^T Z_T is also small, where Z_T is defined in (1.47). For then a_{Ti} will be small, and the calculation for some components of c will again involve the ratio of two small numbers, this time a_{Ti} / u_{Tii}.

The proposed solution is to increase the u_{Tii}, so that

    Z_{TH} = X U_T^{-1} (E_T + H_T E_T)^{-1} = Z_T (I + H_T)^{-1} .   (6.15)

The eigenanalysis becomes

    (I + H_T)^{-1} Z_T^T Z_T (I + H_T)^{-1} a_{TGR} = f_{TGR} a_{TGR} ,

and hence

    c_{TGR} = U_T^{-1} (I + H_T)^{-1} a_{TGR} .

Again, by using the form (6.15), the calculations can also be

set out in Q-technique and singular value decomposition form.

Unfortunately, the lack of invariance goes even further. Canonical

variate analysis can be carried out with the data in standardized

form, where the standardization is based on the pooled within-groups

standard deviations. Then B and W in (1.5) and (1.3) are written in

standardized form, and the resulting canonical vectors are those for

the standardized variables, as referred to in the opening paragraph

of Section 6.1. Hence the shrunken estimates can be determined using eigenanalyses based on any of (6.12), (6.14) or (6.15), applied to either the original or the standardized variables; these choices give different first-stage analyses, and hence different shrunken estimates.


As shown in Section 1.2, the maximization of the between- to

total SSQ for a linear combination of the variables also leads to the

usual canonical variate solution. Shrunken estimators can be defined

in a similar way to that above, since directions of near-singularity

within groups and corresponding small differences in these directions

between groups will also be reflected in the smallest eigenvalue/

vector combinations for the total SSQPR matrix. Again, the shrunken

estimators can be based on a simultaneous or a successive first-stage

rotation, and on the original or on standardized data.

6.5 Practical Aspects

To illustrate the ideas presented in the previous Sections, data

examined by Phillips, Campbell and Wilson (1973) in a study of

geographic variation in the whelk Dicathais around the coast of

Australia and New Zealand are re-analyzed. Four variables describing

the size and shape of the shell were measured, namely overall length

(L), length of spire (LS), length of aperture (LA) and width of

aperture (WA). Means, pooled standard deviations and correlations

are given in Table 6.1.

The presentation for most of this Section uses the generalized-

ridge formulation in (6.11) and (6.12). The analysis is based on

standardized variables, so that the first-stage eigenanalysis is on

the within-groups correlation matrix. The last part of this Section

briefly examines alternative formulations.

Table 6.2 lists the eigenvalues and eigenvectors for the correlation

matrix given in Table 6.1. As might be expected from the high

correlations, there are two very small eigenvalues. The smallest

accounts for less than 0.08% of the within-groups variation. The

eigenvector corresponding to the smallest eigenvalue, hereafter referred


Table 6.1 Means, pooled standard deviations and correlations for the Dicathais data.

group     1      2      3      4      5      6      7      8      9
L      39.36a  33.39  35.54  33.86  27.43  51.73  37.47  40.11  38.43
LS     16.10   11.99  14.06  13.07  10.14  20.73  13.79  13.16  12.71
LA     28.04   25.58  25.81  25.10  20.42  37.21  28.55  31.94  30.40
WA     12.81   12.02  11.76  11.60   9.64  17.97  13.39  16.08  14.90

group    10     11     12     13     14   |    L      LS     LA     WA
L      33.17  32.39  44.02  33.34  55.94  | 9.728b  0.967  0.983  0.975
LS     12.36  13.29  14.91  13.34  25.00  |         4.312  0.913  0.912
LA     24.67  23.12  33.51  24.92  38.93  |                6.817  0.986
WA     11.21  11.76  17.46  13.02  20.84  |                       3.476

a values in columns 1-14 are the group means for the four variables
b diagonal elements are the pooled standard deviations; off-diagonal elements are the corresponding correlation coefficients.

Table 6.2 Eigenanalysis of within-groups correlation matrix, and summary of canonical variate analyses for Dicathais data

eigenvector        L      LS      LA      WA    e-value
  I              0.50    0.49    0.50    0.50    3.869
  II             0.08    0.79   -0.42   -0.43    0.112
  III           -0.33    0.15   -0.56    0.75    0.016
  IV             0.79   -0.33   -0.51    0.03    0.003

                      canonical vector I         c-root    canonical vector II        c-root
aU (PCI-PCIV)       -0.32a  -0.08  -0.93  0.17    2.13     0.09  -0.93  -0.02  -0.35   1.68
cU (h_i = 0)        -4.82    2.02  -2.41  5.64    2.13     4.65  -4.28  -2.12   1.51   1.68
cGI (h_4 = \infty)b -2.42    0.66  -3.78  5.91    2.09     0.17  -2.54   1.93   0.18   1.48

three-variable analyses (one variable deleted; coefficients in variable order):
                    -7.96    3.70   4.79          2.03     0.52  -2.65   2.05          1.60
                    -0.79   -4.75   5.83          1.98    -2.36   2.84   0.81          1.45
                    -1.85   -3.70   5.86          2.06     5.17  -5.57   0.69          0.90

between-group SSQ for each orthonormal variable: 0.549, 1.491, 1.872, 0.381; sum = tr(W^{-1}B) = 4.293

a standardized canonical vectors for orthonormal variables
b h_1 = h_2 = h_3 = 0; h_4 = 0 gives the usual canonical vector c_U

[Figure 6.1 - Plots of the canonical variate coefficients and canonical roots as the smallest eigenvector/value contribution is shrunk progressively to zero, by increasing h_4. The coefficients for L, LS, LA and WA are plotted for each of the two canonical variates against the shrinkage constant (= h_4 e_4), from 0.01 to 0.05, with the canonical roots (near 2.13 and 1.68) shown above the curves.]
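The shrinkage paths of Figure 6.1 can be reproduced in outline: shrink only the smallest eigenvalue/vector contribution and watch the coefficients move while the canonical roots barely change. A sketch on hypothetical data, not the Dicathais measurements; the between-groups structure is deliberately chosen to have almost no separation along the near-singular direction.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 4))
A[:, 3] = A[:, 2] + 0.02 * rng.standard_normal(200)  # one near-singular direction
W = A.T @ A
m = np.array([0.8, 0.3, 0.5, 0.5])                   # little separation along it
B = np.outer(m, m)

e, U = np.linalg.eigh(W)                             # e[0] is the smallest eigenvalue
roots = []
for h_small in [0.0, 0.5, 1.0, 5.0, 1e6]:
    h = np.array([h_small, 0.0, 0.0, 0.0])           # shrink only the smallest component
    scale = 1.0 / np.sqrt(e * (1.0 + h))
    M = scale[:, None] * (U.T @ B @ U) * scale[None, :]
    f, a = np.linalg.eigh(M)
    c = U @ (scale * a[:, -1])                       # first canonical vector
    roots.append(f[-1])
    print(f"h = {h_small:>9}: c = {np.round(c, 3)}, root = {f[-1]:.4f}")
```

Typically the coefficients of the two collinear variables change markedly and then stabilise as h grows, while the first canonical root is nearly constant, mirroring the behaviour described for Figure 6.1.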


to as the smallest eigenvector, contrasts L with LS and LA.

Table 6.2 also gives the between-groups SSQ e_i^{-1} u_i^T B u_i corresponding

to each eigenvalue/vector combination, together with the coefficients

aU for the orthonormal variables, the canonical roots fU and the

canonical vectors cU for the standardized original variables. The

smallest eigenvalue/vector contains 9% of the between-groups variation.

Examination of a1 and a2 indicates that the third principal component

dominates the first canonical variate, while the second principal

component dominates the second canonical variate. The smallest

principal component makes a greater contribution to the second canonical

variate than it does to the first.

Figure 6.1 and Table 6.2 show the coefficients for the first two

canonical variates as a14 and a24 are shrunk towards zero. The changes

in the coefficients for L, LS and LA are evident, even for a small

amount of shrinking. These changes in magnitude, and indeed sign,

hold for a wide range of values of h4, with virtually no change in the

first canonical root, and only minimal change in the second canonical

root.

The changes in the coefficients can be predicted from the results

given in the early discussion of the example: the smallest eigenvalue

is very small; the corresponding eigenvector is dominated by L in

relation to LS and LA; the corresponding between-group SSQ is relatively

small; and the greater (but still small) contribution made by the

smallest principal component to the second canonical variate is

reflected in the more marked changes in the coefficients for the

second canonical vector under shrinking.

The generalized inverse coefficients c_1^{GI} and c_2^{GI} for the canonical variates (h_4 = \infty) provide a stable basis for interpretation. The first

canonical variate reflects differences in the shape and size of the


aperture, while the second reflects differences in the relative length

of spire. This interpretation is supported by an examination of the

second and third eigenvectors: the latter contrasts LA with WA in

particular, while the former contrasts LS with LA and WA.

The lack of stability of the canonical variate coefficients for L,

LS and LA suggests that one or some of these variables may be redundant

for discrimination. None of the variables has a small standardized

coefficient so that none of them is an obvious candidate for deletion.

However, the canonical roots are little affected by shrinking, while

the coefficients for L and LA change markedly. Moreover, after

shrinking, L enters less noticeably into either canonical variate.

This suggests that an analysis based only on L, LS and WA or on LS,

LA and WA is worth examining. The canonical vectors and canonical

roots are given in Table 6.2. Note that the sum of the coefficients

for L, LS and LA is virtually the same. The interpretation of the

canonical variates is unchanged from that based on the generalized

inverse estimates.

The contribution of the eliminated variable can be assessed

formally by a multivariate analysis of covariance (see, e.g. Kshirsagar,

1972, Chapter 8). The procedure is simple: carry out a canonical

variate analysis based on all v variables, and carry out an analysis

based on the p retained variables (so here v - p = 1), determining Wilks \Lambda for each analysis. Wilks \Lambda is defined in (1.15). The ratio \Lambda_v / \Lambda_p gives the Wilks \Lambda_{v.p}, which is the basis of the statistic for assessing the importance of the v - p variables after first including the p retained variables (see Section 1.6). In general,

    -\{n_g + n_w - p - \tfrac{1}{2}(v - p + n_g + 1)\} \log \Lambda_{v.p} \sim \chi^2_{(v-p)n_g} ,


where n_g is the between-groups d.f. When v - p = 1, as in the example,

    \{(n_w - p)/n_g\} f_{v.p} \sim F_{n_g, n_w - p} ,   where f_{v.p} = (1 - \Lambda_{v.p}) / \Lambda_{v.p} .

Exclusion of LA (respectively L) gives A3 = 0.093 (0.102),

while A4 = 0.078, so that A4.3 = 0.838 (0.765) and f4.3 = 0.193

(0.308). Hence, with nw = 866 and ng = g-1 = 13, 12.81 (20.43) is

to be compared with an F(13,863) distribution (or, approximately, 151.9 (230.6) with a \chi^2_{13} distribution). Clearly, the result in both

cases is highly significant, suggesting that the omitted variable is

of value statistically for discrimination, in addition to the

discrimination contained in the other three.
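The additional-information calculation is easily scripted; a sketch using the quoted figures (\Lambda_4 = 0.078, \Lambda_3 = 0.093 when LA is excluded, n_w = 866, n_g = 13). The function name is illustrative, and the result differs slightly from the quoted 12.81 because the published \Lambda values are rounded.

```python
def additional_information_F(lam_v, lam_p, p, ng, nw):
    """F statistic for the v - p = 1 omitted variable:
    Lambda_{v.p} = Lambda_v / Lambda_p,
    f = (1 - Lambda_{v.p}) / Lambda_{v.p},
    and {(nw - p)/ng} f is referred to F(ng, nw - p)."""
    lam_vp = lam_v / lam_p
    f = (1.0 - lam_vp) / lam_vp
    return lam_vp, (nw - p) / ng * f

lam_vp, F = additional_information_F(0.078, 0.093, p=3, ng=13, nw=866)
print(round(lam_vp, 3), round(F, 2))   # 0.839 12.77
```

The same call with \Lambda_3 = 0.102 (L excluded) gives an F statistic near the quoted 20.43.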

It is tempting to apply the same argument to the use of the

g-inverse estimator, which eliminates the fourth principal component

rather than one of the original variables. The formal calculations

give \Lambda_3 = 0.097, so that \Lambda_{4.3} = 0.804 and f_{4.3} = 0.243. In this case, 16.12 would be compared with the F(13,863) distribution.

It has been my experience that the conclusions reached by the

formal approach outlined in the previous three paragraphs are often

misleading. With the large number of degrees of freedom within groups,

a ratio of Wilks lambdas as high as 0.97 will be adjudged significant

at the 5% level. Moreover canonical variates which contain virtually

no practical information for discrimination, even though statistically

significant, influence this ratio. For example, the significance of

the last two canonical roots for the whelk data can be assessed using

the Bartlett (Kshirsagar, 1972, equation 8.7.3) or Lawley (Kshirsagar,

1972, equation 8.7.4) chi-squared approximations; the value is

approximately 370, to be compared with the \chi^2_{22} distribution. In this

example, and many others analyzed by me, a canonical root as small as

0.36, the value for the third root, contains no practical information,

and yet its statistical significance is marked. As in many areas of


applied statistics, distinction must be made between the practical

significance and statistical significance of a result. This is

particularly true of multivariate discrimination problems, where

statistically significant differences between the means may be

associated with considerable overlap of individual canonical variate

scores; it is the latter which usually determines the practical value

of a canonical variate.

A more realistic guide to the information lost by excluding certain

variables or principal components is given by the ratio of canonical

roots corresponding to canonical variates judged to be of practical

significance. In this example, the first two canonical variates

summarize the variation between the groups; the g-inverse solution

retains 94% of this information while that based on L, LA and WA

(LS, LA and WA) retains 95% (90%) of the information. As some further guide, the ratio of \prod_{i=1}^{2}(1 + f_i) for the g-inverse solution to that for f_U, and the similar ratios based on three and on four variables, can be calculated; their values are 0.912 and 0.940 (0.871) respectively.
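These retention figures follow directly from the first two canonical roots in Table 6.2; a quick check (using the tabulated roots, so the last decimals differ slightly from the thesis figures, which come from unrounded roots):

```python
# First two canonical roots from Table 6.2.
roots = {
    "usual":         (2.13, 1.68),
    "g-inverse":     (2.09, 1.48),
    "three-var (a)": (2.03, 1.60),
    "three-var (b)": (1.98, 1.45),
}
fU = roots["usual"]
for name, f in roots.items():
    share = sum(f) / sum(fU)             # proportion of summed roots retained
    ratio = (1 + f[0]) * (1 + f[1]) / ((1 + fU[0]) * (1 + fU[1]))
    print(f"{name:14s} sum ratio {share:.2f}   prod(1+f) ratio {ratio:.3f}")
```

The sum ratios reproduce the 94%, 95% and 90% retention figures, and the \prod(1+f) ratios the 0.912, 0.940 and 0.871.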

In this example, there is little to choose between the canonical

variate solution based on the generalized inverse solution, and that

based on the three variables L, LA and WA. The interpretation of the

nature of the group differences is similar from both.

Now consider the alternative shrunken estimator formulations in

(6.14) and (6.15) in Section 6.4, for either the original or standardized

data, using either the B/W or B/T formulation. Table 6.3 gives the

canonical roots and vectors for a selection of the analyses. For the

eigenanalysis-standardized data combination, the B/W and B/T

formulations give very similar results. For the eigenanalysis-B/W

combination using the within-groups covariance matrix, the diagonal

terms of Z^T Z are 0.51, 1.36, 1.99 and 0.44. The smallest eigenvector

Table 6.3 Canonical roots and vectors for a selection of analyses for alternative shrunken estimator formulations for Dicathais data

                                          canonical vector I               canonical vector II
                                          L     LS    LA    WA   c-root    L     LS    LA    WA   c-root
usual (h_{2i} = 0)                      -4.8   2.0  -2.4   5.6   2.13     4.6  -4.3  -2.1   1.5   1.68
correlation matrix; eigenanalysis; B/W
  h_{24} = \infty                       -2.4   0.7  -3.8   5.9   2.09     0.2  -2.5   1.9   0.2   1.48
  h_{24} = 1                            -3.3   1.1  -3.3   5.9   2.09     2.7  -3.6   0.1   0.5   1.53
  h_{24} = 5                            -2.7   0.8  -3.6   5.9   2.09     1.0  -2.9   1.4   0.2   1.48
correlation matrix; eigenanalysis; B/T
  h_{24} = \infty                       -2.8   0.9  -3.6   5.9   2.10     0.3  -2.6   1.8   0.3   1.49
covariance matrix; eigenanalysis; B/W
  h_{24} = \infty                       -4.9   2.1  -2.4   5.6   2.13    -1.5  -1.8   2.4   0.8   1.37
  h_{24} = 1                            -4.9   2.1  -2.4   5.6   2.13     1.5  -3.1   0.1   1.2   1.45
correlation matrix; triangular; B/W
  h_{T3} = \infty                       -1.2   0.4  -4.8   5.9   2.02     3.5  -3.9  -0.6   0.7   1.66
  h_{T3} = 1                            -2.8   1.1  -3.7   5.9   2.04     4.2  -4.1  -1.3   1.0   1.67


is (0.56, -0.55, -0.58, 0.21)^T. The coefficients a_{14} and a_{24} are -0.005 and -0.44. The first is smaller and the second is larger than the corresponding coefficients in Table 6.2. For the g-inverse solution, the first canonical root is unchanged, while the second is decreased more than in the analysis using the correlation matrix. The change in sign for c_{23} is again evident. The diagonal terms of Z_T^T Z_T for the triangular orthonormalization are 0.50, 1.63, 0.26 and 1.90. The coefficients a_{13} and a_{23} are 0.24 and -0.10. The third row of U_T^{-T} is (0.90, -0.15, 0.11, 0.00). Setting h_{T3} = \infty results in less change in the second canonical vector than does setting h_{24} = \infty. The change in the first canonical root is more marked than for the eigenanalysis orthonormalization. In this example, the third row of U_T^{-T} places much less emphasis on the second and third variables, and none on the fourth. Shrinkage based on the triangular orthonormalization effects little change in the coefficients for L, LS and LA for the second vector.

6.6 Discussion

When the group separation along a particular eigenvector(s) is

small and the corresponding eigenvalue(s) is also small, a marked

improvement in the stability of the canonical variate coefficients

can be effected by shrinking the a_i corresponding to the eigenvector(s)

towards zero. Instability will be largely confined to those variables

which exhibit highest loadings on this eigenvector(s). In many of the

examples considered to date, the smallest eigenvector reflects a

contrast involving two or at the most three variables, and the instability

if it exists is confined to the corresponding coefficients; in fact the

sum of the coefficients is usually virtually stable.


The practical question is to decide on the nature and degree of

shrinking to be adopted. Out of 16 data sets examined (8 published,

8 unpublished), roughly one third exhibited no instability which could

be overcome while maintaining group separation. In each of these

cases, much of the discrimination was associated with the smallest

eigenvector/value combination. When there is little discrimination

associated with the smallest eigenvector(s)/value(s), it will often be

satisfactory to use a generalized inverse approach. Shrinking the

effect of a component to zero gives results which differ little from

partial shrinking. Moreover, since the instability is usually

associated with only one or two of the principal components, there is

no advantage in drastic shrinking along the other directions - the aim

here is to improve stability and at the same time maintain group

separation.

Specific choice of the shrinkage constants remains an open

question, as indeed it does in ridge regression. The analysis of the

whelk data and similar analyses of other examples indicate that precise

determination of the constants is not necessary when the aim is to

summarize and interpret the nature of group differences. In practice,

a range of values of hi can be used, beginning with hi between 0.5 and

1, and increasing hi until stable estimates consistent with maintaining

group separation are achieved.

The ideas presented here have implications for variable selection.

A set of variables with unstable coefficients often indicates that

some of the variables are redundant and can be safely eliminated.

Variables with small standardized coefficients can also be eliminated.

The variables, amongst those remaining, with the largest standardized

coefficients will then usually be the more important variables for

discrimination. Clearly, when variables are being eliminated, care

must be taken to ensure that discrimination is little affected.


Precise guidelines are difficult to set down (see the discussion in

Section 6.5); much depends on the degree of separation contained in

the remaining variables.

From the viewpoint of effective data analysis, selection of

variables based on examination of the relative magnitudes and stability

of the standardized coefficients may be preferable to a stepwise

procedure. The former focusses attention on how correlations between

the variables, and their relation to differences between the means,

affect the relative importance of the variables. In the Dicathais

example, either L or LA can be eliminated with little loss of

discrimination - and there is marked instability in the corresponding

coefficients. Both variables are involved in the smallest eigenvector,

and so the statistician must make the conscious choice, perhaps bringing

in taxonomic considerations, as to which variable is the more useful

to retain.

Practical considerations have a strong bearing on the form in

which shrunken estimators are introduced. In my experience, it is easier

to interpret the nature of the eigenvectors in the simultaneous rotation

than it is to interpret the rows of the transpose of the inverse of

the unit triangular matrix for the successive orthonormalization.

The eigenanalysis is straightforward: the smallest eigenvalue/vector

combinations are the ones which will reflect directions of near-

singularity within groups if they exist. And the variables with high

absolute loadings for those eigenvectors are the ones which together

result in the near-singularities. The diagonal terms of the triangular

matrix do not seem to reflect near-singularities to the same degree.

Moreover, all variables appear in all eigenvectors, whereas only the

first j variables appear in the jth linear combination formed by the

Choleski procedure. I have also found the eigenanalysis to be a more


sensitive indicator of near-singularity. This and ease of interpretation

lead to a preference for the eigenrotation.

There appears to be little difference between shrunken estimates

based on any of (6.11) to (6.15) and those based on an equivalent

formulation corresponding to (1.12) for the between-to-total approach.

This is intuitively reasonable, since shrunken estimators will be

effective when there is little between-groups variation for a direction

corresponding to little within-groups variation. But this suggests

that directions of near-singularity for the total matrix will then

correspond with those for the within-groups matrix. High within-groups

correlation and high overall correlation of the same sign implies that

the group means are virtually coincident for some of the minor

directions of within-group variation (see also Section 6.2); but for

the overall correlations to also be high, the latter must be similar

to minor directions of total or overall variation.

Experience with the use of shrunken estimators has mainly been

gained using (6.11). The formulation for the Choleski decomposition

and the unification of the approaches in Section 1.5 was in response

to questions of lack of invariance and of uniqueness of the eigenvector

decomposition. The choice of the shrinkage constants as multiples of

the standard deviations of the orthogonal variables from the first-stage

eigenanalysis, rather than as multiples of the variances, makes little

difference to the analysis. A more important point for the eigenvector

rotation is the choice of original or standardized data. My own

preference is for the latter, since an eigenanalysis of the correlation

matrix is more readily interpretable. However, the obvious recommenda-

tion is to carry out the analysis on both forms of the data.

In summary, the recommended approach is to shrink markedly those

components corresponding to a small eigenvalue and small contribution

to tr(W⁻¹B). Since the sum of the coefficients tends to be stable,

deletion of one or some of the variables with unstable coefficients

may suggest itself as the next step in the analysis. It should be

pointed out that in many cases there will be no advantage in shrinking;

this occurs when much of the between-group variation coincides with

the directions of the smallest eigenvectors. Whereas in regression

the presence of high correlation will almost certainly indicate

instability of the coefficients of the variables with highest loadings

in the smallest eigenvector(s), within-groups correlations as high as

0.98 may be associated with marked stability in discriminant analysis.

With high positive within-group correlation and negative between-groups

correlation, marked shrinking will never be necessary. However, with

high positive within- and between-groups correlations, marked shrinking

will nearly always be advantageous.
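The summary recipe above can be made concrete. The following is a minimal numerical sketch (not the thesis's own program; NumPy assumed, names hypothetical) of a generalized-ridge style of shrinkage: the within-groups matrix is eigenanalysed, shrinkage constants are added to the eigenvalues of the components to be damped, and the canonical variate eigenproblem is then solved against the shrunken matrix.

```python
import numpy as np

def shrunken_cva(W, B, shrink):
    """Sketch of shrunken canonical variate estimation (hypothetical
    implementation): add shrinkage constants to the within-groups
    eigenvalues before solving the eigenproblem B c = f W* c."""
    evals, U = np.linalg.eigh(W)            # first-stage eigenanalysis of W
    Wstar = (U * (evals + shrink)) @ U.T    # W* = U (E + K) U'
    # form W*^(-1/2) and reduce to an ordinary symmetric eigenproblem
    e2, U2 = np.linalg.eigh(Wstar)
    Whalf_inv = (U2 / np.sqrt(e2)) @ U2.T
    roots, A = np.linalg.eigh(Whalf_inv @ B @ Whalf_inv)
    order = np.argsort(roots)[::-1]         # largest canonical roots first
    return roots[order], (Whalf_inv @ A)[:, order]
```

With shrink set to zero this reduces to the usual canonical variate eigenanalysis; marked shrinking corresponds to large constants on the components with small within-groups eigenvalues.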

CHAPTER SEVEN: COMPARISON OF CANONICAL VARIATES

In this Chapter, the functional relationship formulation for

canonical variate analysis outlined in Section 1.3 is used to develop

methods for comparing canonical variate analyses for several independent

sets of data. Section 7.2 develops likelihood ratio criteria for

examining common orientation of discriminant planes; coincidence of

discriminant planes; common orientation and common dispersal of means;

coincidence and common dispersal, and overall coincidence or common

orientation and common position. An example is given in Section 7.3.

Section 7.4 discusses some practical aspects.

7.1 Introduction

A common problem in multivariate discrimination studies is the

analysis and comparison of different sets of data, where each set

relates to the same physical or biological problem. For example, in

medical studies, data on patients in various disease categories are

often available from a number of regions; the data may also be

available for different socioeconomic classes, races and so on.

The general problem considered is that of a discrimination study,

with s sets of data, each set relating to the same g groups, with the

same v variables in each group. A commonly adopted approach is to

carry out separate canonical variate analyses for each set, and also a

combined analysis based on all gs groups. Visual comparisons of the

resulting canonical vectors and of plots of canonical variate means

are then made. The problem can also be considered in the context of

multivariate analysis of variance. The total variation can be

partitioned into effects for sets, for groups, and for sets x groups.

The interaction and appropriate main effect can also be partitioned

into the group effect for each set, and, formally, into the set effect

for each group. An examination of the contribution of the sets x

groups effect, relative to the variation within-groups, will give

some indication of whether the variation between the groups is similar

for all sets. However, as discussed further in Section 7.4, by

analogy with the univariate analysis of variance and partitioning

of polynomial trends, this can lead to an insensitive comparison.

It is desirable to be able to make more detailed comparisons of the

between-group discrimination for each set. While separate canonical

variate analyses and a combined analysis go some way to achieving this,

they still require subjective comparisons of the results.

This Chapter presents a more formal approach by formulating

models in terms of structure of the group means. The simple represen-

tation in Figure 7.1 together with Table 7.1 give the sequence of

models considered here. It is assumed that the number, p, of canonical

variates of interest is specified, and that the corresponding canonical

roots are well-separated, so that the canonical vectors for each set

are well-defined. This is, in my experience, a reasonable assumption.

The term dispersal in Figure 7.1 and Table 7.1 refers to the relative

positions of the (projected) means for each set. The sequence

1(c) → 1(d') is not specified in Table 7.1; it will only be of interest

when either factor can be designated as sets or as groups. It can be

approached by interchanging the designation of sets and groups, and

carrying out parallel analyses for each designation.

Table 7.1 Representation of comparisons of models of interest.
1(·) refers to Figure 7.1, while 1(·') denotes the same
Figure with coincident vectors.

individual orientation and dispersal - 1(a)
common orientation, individual dispersal - 1(b)
coincidence, individual dispersal - 1(b')
common orientation, common dispersal - 1(c)
coincidence, common dispersal - 1(c')
coincidence, common dispersal and position, or overall coincidence - 1(d')

Figure 7.1 - Representation of three groups for three sets.

The axes represent variables. The symbols • , •

and ♦ represent sets. The means for each set lie on

a discriminant vector: (a) different orientation for

each vector; (b) common orientation but different

dispersal; (b') coincidence but different dispersal;

(c) common orientation and dispersal but different

positions; (c') coincidence and common dispersal;

(d') with means collapsed along dotted lines -

overall coincidence or common orientation, dispersal

and position.

7.2 Comparison of Solutions

Consider s sets of data, with g groups in each set. Assume that
an observed v×1 vector x_ktm is distributed as N_v(μ_kt, Σ), where
m = 1,...,n_kt; k = 1,...,g; and t = 1,...,s.

The general model considered in this Chapter is the generalization
of that considered in Section 1.3, viz.

μ_kt = μ_0t + ΣΨ_t ξ_kt ,

with Ψ_t the v×p matrix of population canonical vectors for the tth set.

Estimation under this model, and under those to be discussed

subsequently, can all be reduced to a generalized eigenanalysis of

the sort obtained when there is only one set (s = 1). The approach

adopted is to reduce the relevant part of the log likelihood to a

form analogous to (1.23) for the single-set case in Section 1.3, whence

estimates of the canonical vectors and covariance matrix follow by

analogy with (1.30), (1.26) and (1.31). Since these results for the

single-set case are used repeatedly in the remainder of this Section,

they will be reviewed very briefly here. The relevant part of the
log likelihood maximized with respect to ξ_k and μ_0 is given by

−n log|Σ| − tr Σ⁻¹S − tr Σ⁻¹B + tr Σ⁻¹PB .  (7.1)

Maximization w.r.t. Σ and Ψ gives Ψ̂ = C_p, where the C_p are the first p
columns of C, and C satisfies

BC = SCF ;  (7.2)

here F is the diagonal matrix of canonical roots. The canonical
vectors and roots satisfy

Cᵀ B C = nF
and  (7.3)
Cᵀ S C = nI .

Finally,

n Σ̂ = S + B − n⁻¹ B C_p F_p⁻¹ C_pᵀ B .  (7.4)
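As a concrete illustration, the single-set results (7.2)-(7.4) reduce to a generalized symmetric eigenproblem; a hedged sketch, not the thesis's own program (SciPy assumed, names hypothetical):

```python
import numpy as np
from scipy.linalg import eigh

def single_set_cva(S, B, n, p):
    """Solve BC = SCF of (7.2) with the scalings C'SC = nI and C'BC = nF
    of (7.3), and return the ML covariance estimate of (7.4)."""
    F, V = eigh(B, S)                  # generalized symmetric eigenproblem
    order = np.argsort(F)[::-1]        # largest canonical roots first
    F, V = F[order], V[:, order]
    C = V * np.sqrt(n)                 # eigh scales V'SV = I, so C'SC = nI
    Cp, Fp = C[:, :p], F[:p]
    n_Sigma = S + B - (B @ Cp) @ np.diag(1.0 / Fp) @ (Cp.T @ B) / n
    return C, F, n_Sigma / n
```

The first p columns of the returned C play the role of Ψ̂ throughout this Section.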

For consistency of notation, the matrices and vectors required
will now be defined for general t = 1,...,s. Write

x̄_kt = n_kt⁻¹ Σ_{m=1}^{n_kt} x_ktm ,

n_·t = Σ_{k=1}^{g} n_kt ,

n_k· = Σ_{t=1}^{s} n_kt ,

x̄_·t = n_·t⁻¹ Σ_{k=1}^{g} n_kt x̄_kt ,

x̄_k· = n_k·⁻¹ Σ_{t=1}^{s} n_kt x̄_kt ,

n_T = Σ_{t=1}^{s} Σ_{k=1}^{g} n_kt ,  (7.5)

and

x̄_T = n_T⁻¹ Σ_{t=1}^{s} Σ_{k=1}^{g} n_kt x̄_kt .

Let S_t and B_t be the usual within-groups and between-groups
SSQPR matrices for the tth set, viz.

S_t = Σ_{k=1}^{g} Σ_{m=1}^{n_kt} (x_ktm − x̄_kt)(x_ktm − x̄_kt)ᵀ
and  (7.6)
B_t = Σ_{k=1}^{g} n_kt (x̄_kt − x̄_·t)(x̄_kt − x̄_·t)ᵀ ,

and write

S_T = Σ_{t=1}^{s} S_t .
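The quantities in (7.5) and (7.6) are straightforward to compute; a sketch (NumPy assumed, names hypothetical):

```python
import numpy as np

def ssqpr_matrices(sets):
    """Within- and between-groups SSQPR matrices S_t, B_t of (7.6) for each
    set, and the pooled S_T. `sets` is a list with one entry per set t, each
    a list of (n_kt x v) data arrays, one per group k."""
    S_list, B_list = [], []
    for groups in sets:
        nkts = np.array([len(x) for x in groups])
        means = np.array([x.mean(axis=0) for x in groups])
        xbar_t = (nkts[:, None] * means).sum(axis=0) / nkts.sum()  # x-bar_.t
        S_t = sum((x - m).T @ (x - m) for x, m in zip(groups, means))
        B_t = sum(n * np.outer(m - xbar_t, m - xbar_t)
                  for n, m in zip(nkts, means))
        S_list.append(S_t)
        B_list.append(B_t)
    return S_list, B_list, sum(S_list)                             # S_T
```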

The derivation of the main results for the models of interest

now follows the sequence given in Table 7.1.

7.2.1 Individual orientation and dispersal

Consider the model

μ_kt = μ_0t + ΣΨ_t ξ_kt ,  (7.7)

specifying different canonical vectors for each set. Then the relevant
part of the log likelihood may be written as

−n_T log|Σ| − tr Σ⁻¹S_T − Σ_{t=1}^{s} Σ_{k=1}^{g} n_kt (x̄_kt − μ_0t − ΣΨ_t ξ_kt)ᵀ Σ⁻¹ (x̄_kt − μ_0t − ΣΨ_t ξ_kt) .

With

Q_t = (Ψ_tᵀ Σ Ψ_t)⁻¹ Ψ_tᵀ

and

P_t = ΣΨ_t Q_t ,

proceed as in the single-set case to obtain

ξ̂_kt = Q_t (x̄_kt − μ_0t)

and

(I − P_t) μ̂_0t = (I − P_t) x̄_·t .

With

B_T = Σ_{t=1}^{s} B_t ,

the log likelihood maximized with respect to μ_0t and ξ_kt is

−n_T log|Σ| − tr Σ⁻¹S_T − tr Σ⁻¹B_T + tr Σ⁻¹ Σ_{t=1}^{s} P_t B_t .

Following the same steps as for the single-set case leads to

n_T Σ̂ = S_T + B_T − n_T⁻¹ Σ_{t=1}^{s} B_t Ψ̂_t F_tp⁻¹ Ψ̂_tᵀ B_t ,  (7.8)

where the diagonal matrix F_tp satisfies

Ψ̂_tᵀ B_t Ψ̂_t = n_T F_tp ,  (7.9)

while

B_t Ψ̂_t = n_T Σ̂ Ψ̂_t F_tp .  (7.10)

Now consider a particular value of t, say t = f. Then from (7.8),

n_T Σ̂ Ψ̂_f = S_T Ψ̂_f + B_f Ψ̂_f − n_T⁻¹ B_f Ψ̂_f F_fp⁻¹ Ψ̂_fᵀ B_f Ψ̂_f
           + Σ_{t≠f} B_t (I − Ψ̂_t F_tp⁻¹ Ψ̂_tᵀ B_t n_T⁻¹) Ψ̂_f .  (7.11)

From (7.9), the second and third terms on the r.h.s. of (7.11) cancel.
Write

H_f = Σ_{t=1, t≠f}^{s} B_t (I − Ψ̂_t F_tp⁻¹ Ψ̂_tᵀ B_t n_T⁻¹) ;

then from the simplified form of (7.11) and from (7.10),

B_f Ψ̂_f = (S_T + H_f) Ψ̂_f F_fp  (7.12)

and the vectors Ψ̂_f are scaled so that

Ψ̂_fᵀ (S_T + H_f) Ψ̂_f = n_T I .

The solution is iterative; this is discussed in Section 7.4.

7.2.2 Common orientation, individual dispersal

Common orientation or parallelism of the discriminant planes is
specified by the model

μ_kt = μ_0t + ΣΨ ξ_kt .  (7.13)

Write

Q = (Ψᵀ Σ Ψ)⁻¹ Ψᵀ

and

P = ΣΨ Q .

Then proceeding as in the single-set case and as in Section 7.2.1
gives

ξ̂_kt = Q (x̄_kt − μ_0t)

and

(I − P) μ̂_0t = (I − P) x̄_·t .

The relevant part of the log likelihood maximized with respect to μ_0t
and ξ_kt is

−n_T log|Σ| − tr Σ⁻¹S_T − tr Σ⁻¹B_T + tr Σ⁻¹PB_T .

But this is of the same form as (7.1) for the single-set case,
with S_T, B_T and n_T replacing S, B and n. This gives Ψ̂_T as the solution of

B_T Ψ̂_T = S_T Ψ̂_T F_Tp  (7.14)

with

Ψ̂_Tᵀ B_T Ψ̂_T = n_T F_Tp

and

Ψ̂_Tᵀ S_T Ψ̂_T = n_T I ,

while, from (7.4),

n_T Σ̂ = S_T + B_T − n_T⁻¹ B_T Ψ̂_T F_Tp⁻¹ Ψ̂_Tᵀ B_T .  (7.15)

The vectors Ψ̂_T are again the first p columns of the matrix of eigen-
vectors C_T which satisfy (7.14).

The estimated means are given by

μ̂_kt = x̄_·t + V_T Ψ̂_T Ψ̂_Tᵀ (x̄_kt − x̄_·t) ,

with V_T = n_T⁻¹ S_T, and the canonical variate means are given by Ψ̂_Tᵀ x̄_kt.

The model (7.13) specifies common orientation of the discriminant
planes. However, it does not specify common dispersal of the projected
means for the canonical vectors across sets and hence does not
necessarily specify equal canonical roots.
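Computationally, the common-orientation fit is therefore the single-set algorithm applied to the pooled matrices; a sketch (SciPy assumed, names hypothetical):

```python
import numpy as np
from scipy.linalg import eigh

def common_orientation_fit(S_list, B_list, n_T, p):
    """Fit model (7.13): pool S_T = sum S_t and B_T = sum B_t, then solve
    B_T C = S_T C F as in (7.14), with C'S_T C = n_T I."""
    S_T, B_T = sum(S_list), sum(B_list)
    F, V = eigh(B_T, S_T)               # generalized symmetric eigenproblem
    order = np.argsort(F)[::-1]
    F, V = F[order], V[:, order]
    C = V * np.sqrt(n_T)
    return C[:, :p], F[:p]              # Psi_T-hat and the canonical roots
```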

7.2.3 Common orientation and common dispersal

Following the first sequence of models, viz. 1(b) to 1(c) in
Figure 7.1, the model specifying common orientation and dispersal,
and hence common canonical roots, is

μ_kt = μ_0t + ΣΨ ξ_k .  (7.16)

Write

μ̄_0(k) = n_k·⁻¹ Σ_{t=1}^{s} n_kt μ_0t .

Differentiation of the log likelihood w.r.t. ξ_k gives

ξ̂_k = Q (x̄_k· − μ̄_0(k))

and so the log likelihood maximized with respect to ξ_k may be written as

−n_T log|Σ| − tr Σ⁻¹S_T − Σ_{k=1}^{g} Σ_{t=1}^{s} n_kt (x̄_kt − μ_0t)ᵀ Σ⁻¹ (x̄_kt − μ_0t)
   + Σ_{k=1}^{g} n_k· (x̄_k· − μ̄_0(k))ᵀ Σ⁻¹P (x̄_k· − μ̄_0(k)) .

Differentiation w.r.t. μ_0t leads to

Σ_{k=1}^{g} { n_kt Σ⁻¹ (μ_0t − x̄_kt) − n_k·⁻¹ n_kt Σ⁻¹P ( Σ_{f=1}^{s} n_kf μ_0f − n_k· x̄_k· ) } = 0 ,

and so the expression for μ̂_0t becomes

μ̂_0t = x̄_·t − n_·t⁻¹ P Σ_{k=1}^{g} n_kt (x̄_k· − μ̄_0(k)) .

Write

B_DG = Σ_{k=1}^{g} Σ_{t=1}^{s} n_kt (x̄_kt − μ̂_0t)(x̄_kt − μ̂_0t)ᵀ

and

B_D = Σ_{k=1}^{g} n_k· (x̄_k· − μ̄_0(k))(x̄_k· − μ̄_0(k))ᵀ .

Then the relevant part of the log likelihood becomes

−n_T log|Σ| − tr Σ⁻¹S_T − tr Σ⁻¹B_DG + tr Σ⁻¹PB_D
   = −n_T log|Σ| − tr Σ⁻¹S_T − tr Σ⁻¹(B_DG − B_D) − tr Σ⁻¹B_D + tr Σ⁻¹PB_D .

But this is of the same form as (7.1) for the single-set case, with
S_T + B_DG − B_D replacing S and B_D replacing B. From (7.2), (7.3) and
(7.4), this gives the canonical vectors Ψ̂_D and canonical roots F_Dp as
the solution of

B_D Ψ̂_D = (S_T + B_DG − B_D) Ψ̂_D F_Dp  (7.17)

with

Ψ̂_Dᵀ (S_T + B_DG − B_D) Ψ̂_D = n_T I

and

Ψ̂_Dᵀ B_D Ψ̂_D = n_T F_Dp ,

while

n_T Σ̂ = S_T + B_DG − n_T⁻¹ B_D Ψ̂_D F_Dp⁻¹ Ψ̂_Dᵀ B_D .  (7.18)

The solution is iterative in that B_DG and B_D depend on μ̂_0t, which
depends on Ψ̂; this is discussed in Section 7.4.

7.2.4 Coincidence but individual dispersal

The model (7.13), and Figure 7.1(b), specifies parallel discriminant
planes with different dispersal and different position in the direction
orthogonal to the plane. The requirement of coincidence of the
discriminant planes (Figure 7.1(b')) implies that μ_0t = μ_0 + ΣΨκ_t,
which gives

μ_kt = μ_0 + ΣΨκ_t + ΣΨ ξ_kt ,
or  (7.19)
μ_kt = μ_0 + ΣΨ ξ′_kt , with ξ′_kt = κ_t + ξ_kt .

Proceeding as in the previous part of this Section gives

ξ̂′_kt = Q (x̄_kt − μ_0)

and

(I − P) μ̂_0 = (I − P) x̄_T .

Write

B_GT = Σ_{k=1}^{g} Σ_{t=1}^{s} n_kt (x̄_kt − x̄_T)(x̄_kt − x̄_T)ᵀ .

Then it is easy to show that the solution is once again as for the
single-set case, with S_T replacing S and B_GT replacing B. This gives
the canonical vectors Ψ̂_G and roots F_Gp as the solution of

B_GT Ψ̂_G = S_T Ψ̂_G F_Gp  (7.20)

with the usual constraints analogous to (7.3), while

n_T Σ̂ = S_T + B_GT − n_T⁻¹ B_GT Ψ̂_G F_Gp⁻¹ Ψ̂_Gᵀ B_GT .  (7.21)

7.2.5 Coincidence and common dispersal

From the first form of (7.19), the requirement of coincidence and
common dispersal is specified by the model

μ_kt = μ_0 + ΣΨκ_t + ΣΨ ξ_k .  (7.22)

Write

κ̄_(k) = n_k·⁻¹ Σ_{t=1}^{s} n_kt κ_t .

Differentiation of the log likelihood w.r.t. ξ_k gives

ξ̂_k = Q (x̄_k· − μ_0 − ΣΨκ̄_(k))

and so the log likelihood maximized with respect to ξ_k may be written as

−n_T log|Σ| − tr Σ⁻¹S_T − Σ_{k=1}^{g} Σ_{t=1}^{s} n_kt (x̄_kt − μ_0 − ΣΨκ_t)ᵀ Σ⁻¹ (x̄_kt − μ_0 − ΣΨκ_t)
   + Σ_{k=1}^{g} n_k· (x̄_k· − μ_0 − ΣΨκ̄_(k))ᵀ Σ⁻¹P (x̄_k· − μ_0 − ΣΨκ̄_(k)) .

Differentiation w.r.t. κ_t leads to

ΣΨκ̂_t = P x̄_·t − n_·t⁻¹ P Σ_{k=1}^{g} n_kt x̄_k· + n_·t⁻¹ Σ_{k=1}^{g} n_kt ΣΨκ̄_(k) .

Define

κ̄_· = n_T⁻¹ Σ_{t=1}^{s} n_·t κ_t ;

then the maximum likelihood estimator for μ_0 satisfies

(I − P) μ̂_0 = (I − P)(x̄_T − ΣΨκ̄_·) .

Write

B_CG = Σ_{k=1}^{g} Σ_{t=1}^{s} n_kt (x̄_kt − ΣΨκ̂_t)(x̄_kt − ΣΨκ̂_t)ᵀ ,

B_CO = Σ_{k=1}^{g} n_k· (x̄_k· − ΣΨκ̄_(k))(x̄_k· − ΣΨκ̄_(k))ᵀ

and

B_CT = n_T (x̄_T − ΣΨκ̄_·)(x̄_T − ΣΨκ̄_·)ᵀ .

Then the relevant part of the log likelihood maximized with
respect to ξ_k, κ_t and μ_0 may be written as

−n_T log|Σ| − tr Σ⁻¹S_T − tr Σ⁻¹B_CG + tr Σ⁻¹PB_CO + tr Σ⁻¹B_CT − tr Σ⁻¹PB_CT .

But this is of the same form as (7.1), with S_T + B_CG − B_CO
replacing S, and B_CO − B_CT replacing B. From (7.2), (7.3) and (7.4)
this gives the canonical vectors Ψ̂_C and canonical roots F_Cp as the
solution of

(B_CO − B_CT) Ψ̂_C = (S_T + B_CG − B_CO) Ψ̂_C F_Cp ,  (7.23)

with

Ψ̂_Cᵀ (S_T + B_CG − B_CO) Ψ̂_C = n_T I

and

Ψ̂_Cᵀ (B_CO − B_CT) Ψ̂_C = n_T F_Cp ,

while

n_T Σ̂ = S_T + B_CG − B_CT − n_T⁻¹ (B_CO − B_CT) Ψ̂_C F_Cp⁻¹ Ψ̂_Cᵀ (B_CO − B_CT) .  (7.24)

The solution is iterative in that B_CG, B_CO and B_CT depend on
Σ̂, Ψ̂ and κ̂_t.

7.2.6 Common orientation, dispersal and position

The model for common position in addition to coincidence and
common dispersal, or overall coincidence, is specified by

μ_kt = μ_0 + ΣΨ ξ_k .  (7.25)

Note that while this model follows as a direct simplification

of (7.22) (or Figure 7.1(c')) in Table 7.1, comparison of (7.25)

(or Figure 7.1(d')) with (7.16) and (7.19) suggests that it could

also follow directly from either of these models. However comparison

of the corresponding Figures 7.1(c) and 7.1(b') with Figure 7.1(d')

suggests that the sequence as specified is more logical.

The maximum likelihood estimators of ξ_k and μ_0 are easily shown
to satisfy

ξ̂_k = Q (x̄_k· − μ_0)

and

(I − P) μ̂_0 = (I − P) x̄_T .

Write

B_G = Σ_{k=1}^{g} Σ_{t=1}^{s} n_kt (x̄_kt − x̄_k·)(x̄_kt − x̄_k·)ᵀ

and

B_O = Σ_{k=1}^{g} n_k· (x̄_k· − x̄_T)(x̄_k· − x̄_T)ᵀ .

Then the relevant part of the log likelihood maximized with respect to
μ_0 and ξ_k may be written as

−n_T log|Σ| − tr Σ⁻¹S_T − tr Σ⁻¹B_G − tr Σ⁻¹B_O + tr Σ⁻¹PB_O .

But this is again of the same form as (7.1) for the single-set
case, with S_T + B_G replacing S, and B_O replacing B. Note that
S_T + B_G may be written as

S_T + B_G = Σ_{k=1}^{g} { Σ_{t=1}^{s} Σ_{m=1}^{n_kt} (x_ktm − x̄_kt)(x_ktm − x̄_kt)ᵀ + Σ_{t=1}^{s} n_kt (x̄_kt − x̄_k·)(x̄_kt − x̄_k·)ᵀ }

           = Σ_{k=1}^{g} Σ_{t=1}^{s} Σ_{m=1}^{n_kt} (x_ktm − x̄_k·)(x_ktm − x̄_k·)ᵀ ,

and this is simply the sum of the within-groups SSQPR matrices for
each group for the data bulked over all sets. From (7.2), (7.3) and
(7.4), the required canonical vectors Ψ̂_O and canonical roots F_Op are
given by

B_O Ψ̂_O = (S_T + B_G) Ψ̂_O F_Op  (7.26)

with

Ψ̂_Oᵀ (S_T + B_G) Ψ̂_O = n_T I

and

Ψ̂_Oᵀ B_O Ψ̂_O = n_T F_Op ,

while

n_T Σ̂ = S_T + B_G + B_O − n_T⁻¹ B_O Ψ̂_O F_Op⁻¹ Ψ̂_Oᵀ B_O .  (7.27)

This is an intuitively acceptable solution since the data for each group
are bulked over all sets, and a single canonical variate analysis is
then carried out to examine the variation between the g (larger bulked)
groups. In particular, with V_O = n_T⁻¹ (S_T + B_G),

μ̂_kt = x̄_T + V_O Ψ̂_O Ψ̂_Oᵀ (x̄_k· − x̄_T)

and Ψ̂_Oᵀ μ̂_kt = Ψ̂_Oᵀ x̄_k·, which are simply the
canonical variate means for the g larger groups.
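This bulked-group reading translates directly into a computation; a sketch of the overall-coincidence fit (SciPy assumed, names hypothetical):

```python
import numpy as np
from scipy.linalg import eigh

def overall_coincidence_fit(sets, p):
    """Model (7.25): bulk each group's data over all sets, then run one
    canonical variate analysis on the g larger groups."""
    g = len(sets[0])
    bulked = [np.vstack([groups[k] for groups in sets]) for k in range(g)]
    n_T = sum(len(x) for x in bulked)
    means = [x.mean(axis=0) for x in bulked]
    xbar_T = np.vstack(bulked).mean(axis=0)
    # within-groups SSQPR for the bulked groups: this is S_T + B_G
    W = sum((x - m).T @ (x - m) for x, m in zip(bulked, means))
    B_O = sum(len(x) * np.outer(m - xbar_T, m - xbar_T)
              for x, m in zip(bulked, means))
    F, V = eigh(B_O, W)
    order = np.argsort(F)[::-1]
    C = V[:, order] * np.sqrt(n_T)
    return F[order][:p], C[:, :p]
```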

7.2.7 Likelihood ratio statistics

The likelihood ratio statistics for comparison of the various
models reduce to the ratio of the determinants of the estimates of
Σ under the models. The determinant can usually be factorized: for
the single-set case, the form for Σ̂ given in (7.4) may be rewritten
as

n Σ̂ = S + n⁻¹ S C_q F_q C_qᵀ S = S + n⁻¹ S C_q C_qᵀ B ,

where the C_q are the last v−p columns of C and some of the diagonal
elements of F_q will be zero if g−1 < v. Hence

|n Σ̂| = |S| |I + n⁻¹ C_q C_qᵀ B| = |S| |I + n⁻¹ C_qᵀ B C_q| = |S| |I + F_q| .

However, it is straightforward when programming the comparisons to
calculate Σ̂ and evaluate the determinant directly. The d.f. for the
various comparisons are given, as usual, by the difference in the
number of estimated parameters for the models. Table 7.2 summarizes
the equation numbers for the main results for each model (including
those for the Σ̂) and the corresponding degrees of freedom. With
Table 7.1 (and Figure 7.1), it provides a ready reference for the
comparisons of the models.
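The direct route is easily programmed; a sketch of the likelihood ratio comparison (SciPy assumed, names hypothetical):

```python
import numpy as np
from scipy.stats import chi2

def lr_test(Sigma_restricted, Sigma_general, n_T, df):
    """Likelihood ratio comparison of two nested models of Section 7.2:
    the statistic is n_T log(|Sigma_restricted| / |Sigma_general|),
    referred to a chi-squared distribution on `df` degrees of freedom."""
    stat = n_T * (np.linalg.slogdet(Sigma_restricted)[1]
                  - np.linalg.slogdet(Sigma_general)[1])
    return stat, chi2.sf(stat, df)
```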

7.3 An Example

The data set to be discussed is taken from a study of morphological

divergence and altitudinal variation in three species of grasshopper

(see also Section 5.3). Groups were sampled at a number of sites along

two independent altitudinal transects in the Snowy Mountains, New South

Wales. Details are given in Campbell and Dearn (1979). The first

canonical variate for the data for the two transects effects complete

separation of the three species; information for discrimination between

the species is restricted to this variate. The question then arises as

Table 7.2 Summary of models, with equation numbers for the model and for the estimates of Ψ̂ and of Σ̂,
and the degrees of freedom for the model. Refer to Table 7.1 and Figure 7.1 for cross-
reference of model details.

model                                    text eqn   eqn for Ψ̂   eqn for Σ̂   degrees of freedom*

1(a)   μ_kt = μ_0t + ΣΨ_t ξ_kt             7.7        7.12        7.8       sv + s(vp − p²) + sp(g−1)
1(b)   μ_kt = μ_0t + ΣΨ ξ_kt               7.13       7.14        7.15      sv + vp − p² + sp(g−1)
1(b')  μ_kt = μ_0 + ΣΨ ξ′_kt               7.19       7.20        7.21      v + vp − p² + sp(g−1)
1(c)   μ_kt = μ_0t + ΣΨ ξ_k                7.16       7.17        7.18      sv + vp − p² + p(g−1)
1(c')  μ_kt = μ_0 + ΣΨκ_t + ΣΨ ξ_k         7.22       7.23        7.24      v + vp − p² + p(s−1) + p(g−1)
1(d')  μ_kt = μ_0 + ΣΨ ξ_k                 7.25       7.26        7.27      v + vp − p² + p(g−1)

*there are also v(v+1)/2 d.f. for estimation of Σ common to all models

to whether the canonical variate effecting species separation is

the same for both transects. Three variables - eye width, eye depth,

and width of head - contain much of the information for discrimination.

Data for fourteen of the groups considered by Campbell and Dearn (1979)

are reexamined. Table 7.3 gives a summary.

Figure 7.2 shows a plot of the group means for the first canonical

variate for all fourteen groups, and for the analysis of each transect

separately. The overall canonical variate analysis and the separate

canonical variate analyses show the same general pattern of group

dispersal. Separation of the group means along transect II appears to

be greater than that along transect I, largely as a result of the greater

separation of the Praxibulus and K. usitatus groups. However, as

Table 7.4 shows, the canonical roots for the two transects are very

similar (2.16 vs 2.08); the slightly greater separation of the

K. cognatus means for the second transect offsets the separation of the

other two species. The slight tendency for K. cognatus groups along

transect II to have larger canonical variate means than groups along

transect I is a reflection of the slightly larger size of animals from

the second transect. This is shown in the SIZE column in Table 7.3 for

the means for the first principal component derived from the pooled

correlation matrix.

Table 7.4 lists the canonical root and canonical vector for the

various models outlined in Tables 7.1 and 7.2. The determinant of Σ̂

is also given. The similarity of the canonical roots and vectors and of
|Σ̂| for the individual orientation and the common orientation models is
evident. The additional specification of common dispersal results in a
marked increase in |Σ̂|, from 0.83 to 0.98, and a corresponding change in
the canonical root. The specification of coincidence but individual
dispersal has relatively less effect on these statistics, with |Σ̂|

Table 7.3 Summary of data for Praxibulus and Kosciuscola species

set                                        group(a)  altitude    n    EW    ED    HW   SIZE(b)

t=1 (Praxibulus, K. cognatus,               1-5        980      20   1.57  2.22  3.17  33.34
     K. usitatus)                           1-16      1010      20   1.51  2.18  3.24  32.88
                                            1-6       1180      20   1.53  2.17  3.31  33.28
                                            1-27      1230      39   1.53  2.18  3.26  33.15
                                            1-26      1390      12   1.54  2.21  3.33  33.61
                                            1-50      1520      13   1.55  2.16  3.34  33.44
                                            1-50      1520      18   1.59  2.11  3.67  34.54

t=2 (Praxibulus, K. cognatus,               2-35      1040      10   1.59  2.25  3.17  33.65
     K. usitatus)                           2-32      1010      16   1.61  2.24  3.52  34.88
                                            2-22      1180      20   1.55  2.14  3.44  33.58
                                            2-23      1240      19   1.55  2.22  3.46  34.12
                                            2-41      1380      20   1.53  2.20  3.37  33.54
                                            2-57      1480      17   1.53  2.18  3.35  33.40
                                            2-56      1520      12   1.61  2.18  3.88  35.71

pooled correlation matrix                   EW   1.0   0.63  0.70
                                            ED         1.0   0.69
                                            HW               1.0

pooled within-groups standard deviations    0.049  0.074  0.121

(a) the first number refers to the transect (here set), the second to a
group code used by Campbell and Dearn (1979).

(b) SIZE is the mean for each group of the first principal component
(the eigenanalysis is based on the correlation matrix derived from
the pooled covariance matrix).

[Figure 7.2: three one-dimensional plots of group means along the first canonical variate (scale −1 to 9), with population labels as in Table 7.3.]

Figure 7.2 Canonical variate means for populations of Praxibulus (•), Kosciuscola cognatus (0) and K. usitatus (♦) for a canonical variate analysis for (a) all populations for transects I and II combined; (b) for transect I data; and (c) for transect II data.

Population numbers are given for cross-reference with Table 7.3.

increasing from 0.83 to 0.87.

The ratio of determinants for the hypothesis of common orientation,
viz. 1(b) vs 1(a) in Figure 7.1, is 1.007 (= 0.8278 : 0.8220). The
total number of observations is 256 (see Table 7.3), so that
256 log 1.007 = 1.81 is to be compared with a χ² distribution on 2 d.f.;
taking the d.f. of 242 as the multiplier gives 1.7. The corresponding
probability is around 0.50. The specification of common dispersal
given common orientation, viz. 1(c) vs 1(b), gives a ratio of determinants
of 1.18 and hence a value of 42 (40 using the d.f.) to be compared with a
χ² distribution on 6 d.f.; the result is highly significant. The
specification of coincidence but individual dispersal, viz. 1(b') vs 1(b),
gives a ratio of determinants of 1.05 and a value of 13.0 (12.3 using the
d.f.) to be compared with a χ² distribution on 3 d.f.; the associated
probability is between 0.01 and 0.001.
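The arithmetic of these comparisons can be checked from the determinants in Table 7.4 (a sketch, SciPy assumed; small discrepancies from the quoted values reflect the rounding of the tabulated determinants):

```python
import numpy as np
from scipy.stats import chi2

# |Sigma-hat| x 10^7 as tabulated in Table 7.4
det = {"1(a)": 0.8220, "1(b)": 0.8278, "1(b')": 0.8711, "1(c)": 0.9761}
n_T = 256

def stat(restricted, general):          # n_T log of the determinant ratio
    return n_T * np.log(det[restricted] / det[general])

print(round(stat("1(b)", "1(a)"), 1))   # 1.8 -- chi-squared on 2 d.f.
print(round(stat("1(c)", "1(b)")))      # 42  -- chi-squared on 6 d.f.
print(round(stat("1(b')", "1(b)")))     # 13  -- chi-squared on 3 d.f.
p = chi2.sf(stat("1(b')", "1(b)"), 3)   # between 0.01 and 0.001
```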

The significant departure from the model for common dispersal could

be anticipated from Figure 7.2, despite the similarity of the canonical

roots for the two transects. There are obvious differences in the

nature of the separation. For common dispersal to hold, the differences

between corresponding groups from set to set must be essentially the

same.

The formal comparison of canonical variates for the two transects

suggests that the nature of the discrimination - viz. discrimination in

terms of eye shape and size relative to head shape and size - is

effectively the same for both transects. Examination of a plot of group

means and associated concentration ellipses for the three species shows

that animals from transect II have a slightly larger eye depth, relative

to head width. This results in slight separation of the (parallel)

discriminant vectors and leads to the significant result for the model

specifying coincidence.

Table 7.4 Canonical root and vector and |Σ̂| for the various models given in Tables 7.1 and 7.2.
Canonical root and vector for the overall canonical variate analysis (cva) are also given.
The variables are denoted EW, ED and HW.

                                                    |Σ̂| × 10⁷   c-root    EW      ED      HW

cva, transects I and II                                          4.59     7.00   12.66  -12.41

1(a)  individual c-vectors(a)    transect I           0.8220     2.16     5.69   13.31  -11.85
                                 transect II                     2.08     7.11   12.22  -12.11

1(b)  common orientation - individual dispersal       0.8278     4.23     6.41   12.81  -11.99

1(b') coincidence and individual dispersal            0.8711     4.61     6.81   12.33  -12.08

1(c)  common orientation - common dispersal(b)        0.9761     3.44     7.45   11.70  -11.16

1(c') coincidence - common dispersal(c)               1.0319     3.44     7.47   11.67  -11.16

1(d') overall coincidence                             1.2892     2.79     6.39   11.70   -9.77

convergence after: (a) 3 iterations; (b) 3 iterations; (c) 5 iterations.

7.4 Discussion of Some Practical Aspects

As shown in Section 7.2, the calculations for all models reduce

to application of the algorithm for the single-set case, though for

three models the solution given is iterative. The single-set algorithm

is just the usual canonical variate algorithm, as discussed in Section

1.5. Hence the procedures discussed in previous Chapters, such as the

use of robust estimates of means and of covariances rather than the

usual estimates, the adoption of shrunken estimator procedures, and

indeed adoption of a full robust M-estimator approach, could all be

implemented.

The iterative solution for the model specifying individual

orientation results from the assumption of a common covariance matrix for

all sets. Initial estimates of Ψ̂_t and F_tp are given by the individual

canonical vectors and roots for each set. Where parallelism obtains,

convergence typically seems to take place in three to five iterations.

However, for one data set examined, where the nature. of the canonical

vectors for the two sets considered differed appreciably, some thirty

iterations were required to achieve successive estimates of the canonical

roots within 10-s. For ithe'model specifying common orientation and

dispersal, an explicit canonical variate solution follows given estimates

of the Pot. When common orientation holds, convergence is typically in

two-three iterations, as in the example in Section 7.3.

The comparisons developed in Section 7.2 provide a detailed

examination of the nature of the interaction, by concentrating the

examination on the subspace containing much of the information for

discrimination. For the univariate two-factor analysis of variance,

the total variation is partitioned into main effects for A and for B

and the interaction AxB. When one factor, say B, is qualitative, single


d.f. polynomial trends are often fitted. If these trends adequately

describe the variation, then parallelism of responses implies lack of

interaction. Moreover, by isolating single or few d.f. effects, the

resulting comparisons are usually more sensitive.

In this Chapter, a separate canonical variate plane is determined

for each set. If the planes do not differ in orientation (i.e. if they

are parallel), analogy with the above would suggest examining the

position of the planes, since parallelism above implies lack of

interaction. However, a fundamental difference is that the canonical

vectors define linear combinations of the variables, rather than of

powers of a single variable as for the polynomial trends. Both axes

in Figure 7.1 represent variables, rather than one axis representing a

response variable and the other levels of the quantitative factor as

in the univariate case.

It is assumed in this Chapter that the p canonical vectors adequately

describe the variation between the group means, so that the model (7.7)

is a reasonable one. With this assumption, the significance of the

set x groups interaction for multivariate data is related to the

parallelism of the discriminant planes. But parallelism does not in

itself imply lack of sets x groups interaction. Consideration of the

interaction term u_kt - u_k. - u_.t + u_.. and/or Figure 7.1 shows that the

added condition of common dispersal of means in the discriminant plane

is also required for the interaction to be null (compare Figures 7.1(b)

and 7.1(c)). This latter condition implies equality of the corresponding

canonical roots. Given common orientation and common dispersal, the

added condition of overall coincidence of the planes (viz. 1(d')) results

in a null set effect, as in analysis of covariance and in the partitioning

of interaction for the univariate case. Since the interaction and main

effect are considered in the space of the canonical vectors of interest,


a more sensitive examination of the nature of the sets x groups effect

should in general result.


CHAPTER EIGHT: CANONICAL VARIATE ANALYSIS WITH UNEQUAL COVARIANCE

MATRICES

In this Chapter, canonical variate analysis is extended for use

when the covariance matrices are not assumed to be equal. Linear

combinations of variates are derived, in Section 8.2.1, by generalizing

the weighted between-groups SSQ approach, in Section 8.2.2 by

generalizing the likelihood ratio test and the associated non-centrality

matrix, and in Section 8.2.3 by generalizing the functional relationship

formulation. Function minimization routines must be used for the solution

to two of the generalizations. Computational aspects are discussed in

Section 8.3. In Section 8.4, the usual solution and the first two

generalizations are compared via generated data for a few typical

configurations of means in a situation in which the covariance matrices

are in fact equal. The MSEs of the canonical variate coefficients and

of the group means for the generalizations are approximately three times

those for the usual solution, owing to corresponding increases in the variances.

Section 8.5 outlines some possible approaches for comparing the

generalized solutions with the usual solution. There was not time to

consider this in detail. An example is discussed in Section 8.6.

8.1 Introduction

Ideally, bivariate scatter plots of pairs of canonical variates

should exhibit approximately uncorrelated clusters for each group, with

unit standard deviation within each group. Unfortunately, this does not

always occur; there can be marked differences both in the scatter and

in the correlations of the scores. The idea of forming linear combinations

of the variables is widely accepted in practice. Because of this and the


attendant simplicity of the representations, generalizations of the

usual solution which lead to a representation by linear combinations

of the original variables are considered.

There are some obvious heuristic approaches which can be adopted

to examine the effect of within-group heterogeneity on the description

of between-group differences by a small number of linear functions.

One approach is to compare the analyses using different estimates of the

within-groups SSQPR matrix. The latter could be the robust estimate

used in Chapter Five; the matrix calculated by pooling over all but one

of the groups; the covariance matrix for each group in turn, provided

sample sizes are large enough; or the matrix calculated by pooling subsets

of covariance matrices, weighting in various ways if sample sizes are

unequal. Another approach is to base the calculation of Mahalanobis D2

for each pair of groups on the overall pooled covariance matrix, or on

the covariance matrix calculated by pooling only the matrices for the

two groups involved. The ordinations from a principal coordinates

analysis of the matrices of D2 values can be compared (Campbell and

Mahon, 1974). The recent results of Constantine and Gower (1978) on

the analysis of asymmetric matrices suggest determining D2 twice for

each pair of groups, using each of the covariance matrices in turn as

the within-groups matrix, and examining the degree of asymmetry in the

representations. With the use of alternative estimates of the within-

groups metric, or the determination of D2 twice for each pair of groups,

a large number of ordinations can be produced; and these must then be

compared. Gower (1971, 1975) suggests comparing ordinations by

minimizing the distance between the group means in the simplified

representations (see also Sibson, 1978). To date, detailed guidelines

are not available for interpreting the resulting measure.
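The asymmetric-distance suggestion can be sketched numerically; the three-group data below are invented for illustration, and D2 is computed for each ordered pair of groups with the second group's covariance matrix as the within-groups metric, so that the degree of asymmetry of the resulting matrix can be examined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 groups, 4 variables (sizes and separations illustrative only).
groups = [rng.standard_normal((30, 4)) + m for m in (0.0, 1.0, 2.0)]
means = [grp.mean(axis=0) for grp in groups]
covs = [np.cov(grp, rowvar=False) for grp in groups]
g = len(groups)

def mahalanobis2(d, V):
    """Squared Mahalanobis distance for mean difference d under metric V."""
    return float(d @ np.linalg.solve(V, d))

# D2 computed twice per pair, once with each group's covariance matrix as
# the within-groups metric (cf. Constantine and Gower, 1978).
D2 = np.zeros((g, g))
for i in range(g):
    for j in range(g):
        if i != j:
            D2[i, j] = mahalanobis2(means[i] - means[j], covs[j])

asymmetry = 0.5 * (D2 - D2.T)   # skew-symmetric part of the D2 matrix
```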

Ideally, procedures should consider the covariance structure for


each group in relation to the position of the mean. It may be that in

certain situations, differences in covariance structure have little or

no effect on major directions of between-group variation. Possible

examples include when a group with different covariance structure is

widely separated from the remaining groups; and when only one or a few

variables result in the differences, with the variables contributing

little to the discrimination. The solutions proposed in Section 8.2

generalize the usual canonical variate solution by associating the

covariance matrix for each group with the corresponding mean in the

various formulations in Section 1.2.

The differences in covariance structure are assumed to reflect

real biological or physical differences in the underlying variability of

the populations. This is in contrast with the situation where appropriate

transformation of the data achieves reasonable equality of covariance

structure, as will occur for example when there is systematic change

of variances with means. Of course, the effect of a transformation on

any distributional assumptions must also be considered.

8.2 Generalizations of the Usual Solution

8.2.1 Weighted between-groups formulation

Let x̄_k represent the mean for the kth group, and V_k the covariance

matrix, where V_k = (n_k - 1)^{-1} S_k and S_k is defined in (1.1).

Then for any linear combination c^T x, the mean and variance for the

kth group are c^T x̄_k and c^T V_k c.

Define a weighted mean x̄_I by

    c^T x̄_I = { Σ_{k=1}^g n_k (c^T V_k c)^{-1} c^T x̄_k } / { Σ_{k=1}^g n_k (c^T V_k c)^{-1} } ;    (8.1)


here the weights are the inverses of the sample variances for the

linear combination.

Then, by analogy with the usual one-way analysis of variance with

known weights, an appropriate weighted between-groups SSQ to consider is

    Σ_{k=1}^g n_k (c^T V_k c)^{-1} (c^T x̄_k - c^T x̄_I)^2    (8.2)

      = Σ_{k=1}^g n_k (c^T V_k c)^{-1} c^T B_k c ,

where

    B_k = (x̄_k - x̄_I)(x̄_k - x̄_I)^T .

The SSQ in (8.2) could be defined with general weights wk replacing the

nk, and in particular with wk = constant. There is no direct

generalization of the Rao extension to canonical variates (see Section

1.2).

Maximization of (8.2) w.r.t. c will lead to coefficients c1 and a

maximized ratio f1; these will be termed the canonical vector and

canonical root for the generalization of the weighted between-groups

formulation. When each V_k is replaced by the pooled covariance matrix

V_P = n^{-1} W (the latter being defined in (1.3) and (1.4)), the usual

canonical variate formulation in (1.8) results.

In the usual canonical variate analysis, the assumption of common

covariance matrix leads to the pooled SSQPR matrix W as the appropriate

scaling metric for successive canonical vectors. However, for the

generalization discussed here, there is no obvious associated scaling

metric for successive vectors. The approach adopted here is to introduce

some average covariance matrix VA and to choose successive vectors ci

Page 212: CANONICAL VARIATE ANALYSIS: SOME PRACTICAL ASPECTS by ... · 5.5 C-V plot for arctanh correlations for grasshopper data 137 5.6 I-A and Q-Q plots for log variances for Thais data

211

to maximize (8.2) subject to the constraints (c_i)^T V_A c_j = 0, j < i.

Some possible choices of VA are the usual pooled within-groups

covariance matrix VP; the inverse of the sum of the inverses of the

individual SSQPR matrices Sk (or covariance matrices Vk); and some

form of weighted average of the individual covariance matrices. Another

possibility is to backtransform the robust midmeans of the log variances

and of the arctanh correlations, and recombine to form a covariance

matrix VR. The latter is used in the example in Section 8.6.

Write vc_k = c^T V_k c and cv_k = V_k c; then the derivatives of the

weighted between-groups criterion w.r.t. the vector of coefficients c

are given by

    2 Σ_{k=1}^g n_k vc_k^{-1} B_k c - 2 Σ_{k=1}^g n_k vc_k^{-2} (c^T B_k c) cv_k .    (8.3)
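The criterion (8.2) and the gradient (8.3) can be sketched numerically; in the sketch below the sample sizes, group means and covariance matrices are arbitrary illustrative values, and the overall mean x̄_I is held fixed in the differentiation, so that the gradient corresponds term by term to (8.3).

```python
import numpy as np

rng = np.random.default_rng(1)
v, g = 3, 4
ns = np.array([50, 40, 30, 60])
means = [rng.standard_normal(v) for _ in range(g)]
Vs = [(lambda A: A @ A.T / v + np.eye(v))(rng.standard_normal((v, v)))
      for _ in range(g)]                       # positive definite V_k

def weighted_mean(c):
    """x_I of (8.1): weights n_k over the sample variance of c'x in group k."""
    w = np.array([n / (c @ Vk @ c) for n, Vk in zip(ns, Vs)])
    return (w @ np.array(means)) / w.sum()

def criterion(c, x0):
    """Weighted between-groups SSQ (8.2), with the overall mean x0 held fixed."""
    return sum(n / (c @ Vk @ c) * (c @ (m - x0)) ** 2
               for n, Vk, m in zip(ns, Vs, means))

def grad(c, x0):
    """Gradient (8.3) of the criterion with respect to c, for fixed x0."""
    out = np.zeros(v)
    for n, Vk, m in zip(ns, Vs, means):
        vck = c @ Vk @ c
        Bc = np.outer(m - x0, m - x0) @ c      # B_k c
        out += 2 * n / vck * Bc - 2 * n / vck ** 2 * (c @ Bc) * (Vk @ c)
    return out
```

Maximizing criterion(c, weighted_mean(c)) over c, for example by a quasi-Newton routine, would then give the first generalized canonical vector.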

8.2.2 Likelihood ratio formulation

Consider g independent N_v(μ_k, Σ_k) populations, and let x_km be a

vector of observations from a sample of size n_k from the kth population,

with m = 1,...,n_k.

The relevant part of the log likelihood may be written as

    - Σ_{k=1}^g n_k log|Σ_k| - tr Σ_{k=1}^g Σ_k^{-1} S_k - Σ_{k=1}^g n_k (x̄_k - μ_k)^T Σ_k^{-1} (x̄_k - μ_k) ,    (8.4)

with S_k and x̄_k defined in (1.1) and (1.2). Then μ̂_k = x̄_k. Differentiation

of (8.4) w.r.t. Σ_k gives

    Σ̂_k = n_k^{-1} S_k = V_k .    (8.5)


Hence the maximized likelihood becomes

    (2π)^{-nv/2} Π_{k=1}^g |V_k|^{-n_k/2} e^{-nv/2} ,    (8.6)

with ½gv(v+1) + gv estimated parameters.

Now consider the hypothesis of equality of mean vectors. Replace

the μ_k in (8.4) by μ and differentiate w.r.t. μ to give

    μ̂ = ( Σ_{k=1}^g n_k Σ_k^{-1} )^{-1} Σ_{k=1}^g n_k Σ_k^{-1} x̄_k .    (8.7)

Write

    B̂_k = (x̄_k - μ̂)(x̄_k - μ̂)^T .    (8.8)

Then the maximum likelihood estimator of Σ_k is given by

    n_k Σ̂_k = S_k + n_k B̂_k ,   or   Σ̂_k = V_k + B̂_k .    (8.9)

Hence the maximized likelihood becomes

    (2π)^{-nv/2} Π_{k=1}^g |V_k + B̂_k|^{-n_k/2} e^{-nv/2} ,    (8.10)

with ½gv(v+1) + v estimated parameters.

The determinant in (8.10) may be written as

    |V_k + B̂_k| = |V_k| |I + V_k^{-1} B̂_k|

                = |V_k| {1 + (x̄_k - μ̂)^T V_k^{-1} (x̄_k - μ̂)} .    (8.11)
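The iteration implicit in (8.7)-(8.9) can be sketched as a fixed-point scheme; the sample means and covariance matrices below are invented for illustration, and the iteration is started from Σ̂_k = V_k.

```python
import numpy as np

rng = np.random.default_rng(2)
v, g = 3, 3
ns = np.array([40, 50, 60])
xbars = [rng.standard_normal(v) for _ in range(g)]
Vs = [(lambda A: A @ A.T / v + np.eye(v))(rng.standard_normal((v, v)))
      for _ in range(g)]                          # illustrative V_k

# Alternate (8.7), which gives mu-hat as a Sigma_k-weighted mean, with
# (8.8)-(8.9), which make each Sigma_k-hat = V_k + B_k-hat depend on mu-hat.
Sigmas = [Vk.copy() for Vk in Vs]                 # start from Sigma_k = V_k
mu = np.mean(xbars, axis=0)
for _ in range(200):
    Zs = [n * np.linalg.inv(S) for n, S in zip(ns, Sigmas)]
    mu_new = np.linalg.solve(sum(Zs), sum(Z @ x for Z, x in zip(Zs, xbars)))
    Sigmas = [Vk + np.outer(x - mu_new, x - mu_new)
              for Vk, x in zip(Vs, xbars)]        # (8.8)-(8.9)
    if np.allclose(mu_new, mu, atol=1e-10):
        mu = mu_new
        break
    mu = mu_new
```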


The ratio of maximized likelihoods is, from (8.6), (8.10) and

(8.11),

    Π_{k=1}^g {1 + (x̄_k - μ̂)^T V_k^{-1} (x̄_k - μ̂)}^{-n_k/2} .

An alternative test statistic, the generalization of Hotelling's

trace statistic, is tr(M), where

    M = Σ_{k=1}^g n_k V_k^{-1} B̂_k .    (8.12)

The distribution of tr(M) is discussed further below.

The matrix M can be used in a practical way, as the analogue of the

sample non-centrality matrix in canonical variate analysis. An eigen-

analysis of the square non-symmetric matrix M will produce eigenvectors

and eigenvalues, and these will be termed the canonical vectors and

canonical roots for the generalization of the likelihood ratio

formulation.

The non-centrality matrix M involves B̂_k. As (8.7), (8.8) and (8.9)

show, the maximum likelihood solution is iterative. James (1954) and

Chakravarti (1966) replace Ek in (8.7) by Vk to give xW in (8.13) below.

Chakravarti (1966) examines the level and power of the statistic tr(M).

This statistic differs from the usual trace statistic in that the Vk

replace Vp, and the calculations for the between-groups components

involve a weighted overall mean. Chakravarti (1966) shows that the

important modification is the replacing of the pooled sample covariance

matrix by the individual group covariance matrices. The choice of the

weighted mean is less important.

Consider now the non-centrality matrix M for a linear combination

c^T x. Then, as in Section 8.2.1, the mean and variance for the kth group

are c^T x̄_k and c^T V_k c respectively. Let the overall mean be c^T x̄_0,

with the choice of x̄_0 discussed below. Write B_k = (x̄_k - x̄_0)(x̄_k - x̄_0)^T.

Then the scalar quantity corresponding to M in (8.12) for the linear

combination c^T x is Σ_{k=1}^g n_k (c^T V_k c)^{-1} c^T B_k c. If x̄_0 is set equal to the

weighted between-groups mean x̄_I in (8.1), the weighted between-groups

quantity in (8.2) results. Examination of generated data in Section

8.4 shows that the weighted between-groups and likelihood ratio

generalizations give very similar performance for the first vector when

the covariance matrices are assumed to be equal.

Each of the generalizations discussed in this Section has one or

more drawbacks in its practical implementation. For the non-centrality

matrix generalization, some of the eigenvalues may be complex (see

next paragraph). From experience gained with the approach to date, this

seems to occur only for the smaller roots and when the group separation

in the corresponding directions is minimal. In the usual solution,

successive vectors are chosen to be uncorrelated within groups. Here,

successive vectors are sometimes highly correlated within groups.

The roots and vectors of M need not be real. In general, M cannot

be written as the product of two square symmetric matrices. However,

for two groups, the canonical root is always real. To see this, write

    x̄_W = ( Σ_{k=1}^2 n_k V_k^{-1} )^{-1} Σ_{k=1}^2 n_k V_k^{-1} x̄_k    (8.13)

and

    Z_k = n_k V_k^{-1} ,    Z_S = Σ_{k=1}^2 Z_k .    (8.14)

When g = 2, x̄_1 - x̄_W = Z_S^{-1} Z_2 d_x, with d_x = x̄_1 - x̄_2, and x̄_2 - x̄_W = -Z_S^{-1} Z_1 d_x.

Since Z_S^{-1} = Z_1^{-1} (Z_1^{-1} + Z_2^{-1})^{-1} Z_2^{-1} = Z_2^{-1} (Z_1^{-1} + Z_2^{-1})^{-1} Z_1^{-1}, it follows that

    M = Z_1 B̂_1 + Z_2 B̂_2 = (Z_1^{-1} + Z_2^{-1})^{-1} d_x d_x^T

      = (n_1^{-1} V_1 + n_2^{-1} V_2)^{-1} d_x d_x^T .

The eigenvalue of M is given by d_x^T (n_1^{-1} V_1 + n_2^{-1} V_2)^{-1} d_x.
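The two-group argument can be checked numerically; in the sketch below (arbitrary illustrative means and covariance matrices), M is formed from (8.12) with x̄_W as in (8.13), and its trace and largest eigenvalue are compared with d_x^T (n_1^{-1} V_1 + n_2^{-1} V_2)^{-1} d_x.

```python
import numpy as np

rng = np.random.default_rng(3)
v = 4
n1, n2 = 30, 45
x1, x2 = rng.standard_normal(v), rng.standard_normal(v)
V1 = (lambda A: A @ A.T / v + np.eye(v))(rng.standard_normal((v, v)))
V2 = (lambda A: A @ A.T / v + np.eye(v))(rng.standard_normal((v, v)))

Z1, Z2 = n1 * np.linalg.inv(V1), n2 * np.linalg.inv(V2)
xW = np.linalg.solve(Z1 + Z2, Z1 @ x1 + Z2 @ x2)       # (8.13)
B1 = np.outer(x1 - xW, x1 - xW)
B2 = np.outer(x2 - xW, x2 - xW)
M = Z1 @ B1 + Z2 @ B2                                   # (8.12), g = 2

dx = x1 - x2
lam = float(dx @ np.linalg.solve(V1 / n1 + V2 / n2, dx))  # claimed eigenvalue
eigs = np.linalg.eigvals(M)
```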

Note that n_k^{-1} V_k is the sample covariance matrix of the vector of

means x̄_k; James (1954) refers to the above as the multivariate analogue

of the Behrens-Fisher problem. James (1954, Section 7) develops improved

χ² approximations for tr(M). With x̄_W as in (8.13) and Z_k and Z_S as in

(8.14), write

    B̂_k = (x̄_k - x̄_W)(x̄_k - x̄_W)^T .    (8.15)

Then

    tr(M) = tr Σ_{k=1}^g n_k V_k^{-1} B̂_k = Σ_{k=1}^g (x̄_k - x̄_W)^T Z_k (x̄_k - x̄_W) .

The improved χ² approximation is of the form χ²_{v(g-1)} (h_0 + h_1 χ²_{v(g-1)}),

where

    h_0 = 1 + {2v(g-1)}^{-1} Σ_{k=1}^g (n_k - 1)^{-1} {tr(I - Z_S^{-1} Z_k)}²

and

    h_1 = [2v(g-1){2v(g-1)+2}]^{-1} [ Σ_{k=1}^g (n_k - 1)^{-1} tr{(I - Z_S^{-1} Z_k)²}

          + 2 Σ_{k=1}^g (n_k - 1)^{-1} {tr(I - Z_S^{-1} Z_k)}² ] .


For two groups, this becomes

    h_0 = 1 + (2v)^{-1} Σ_{k=1}^2 (n_k - 1)^{-1} {tr(V_S^{-1} n_k^{-1} V_k)}²

and

    h_1 = {2v(2v+2)}^{-1} Σ_{k=1}^2 (n_k - 1)^{-1} [ tr{(V_S^{-1} n_k^{-1} V_k)²} + {tr(V_S^{-1} n_k^{-1} V_k)}² ] ,

where

    V_S = Σ_{k=1}^2 n_k^{-1} V_k .
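The two-group quantities can be sketched as follows; the data are illustrative, and the traces tr(V_S^{-1} n_k^{-1} V_k) are obtained with linear solves rather than explicit inverses.

```python
import numpy as np

rng = np.random.default_rng(4)
v = 3
n = np.array([25, 35])
xb = [rng.standard_normal(v), rng.standard_normal(v) + 1.0]
Vk = [(lambda A: A @ A.T / v + np.eye(v))(rng.standard_normal((v, v)))
      for _ in range(2)]

Vs = Vk[0] / n[0] + Vk[1] / n[1]             # V_S = sum n_k^{-1} V_k
dx = xb[0] - xb[1]
trM = float(dx @ np.linalg.solve(Vs, dx))    # tr(M) for g = 2

# h0 and h1 as transcribed above for the two-group case.
A = [np.linalg.solve(Vs, Vk[k] / n[k]) for k in range(2)]
t = [np.trace(A[k]) for k in range(2)]
t2 = [np.trace(A[k] @ A[k]) for k in range(2)]
h0 = 1 + sum(t[k] ** 2 / (2 * v * (n[k] - 1)) for k in range(2))
h1 = sum((t2[k] + t[k] ** 2) / (n[k] - 1)
         for k in range(2)) / (2 * v * (2 * v + 2))

# Reject equality of the two means at level alpha when tr(M) exceeds
# chi2_v(alpha) * (h0 + h1 * chi2_v(alpha)), the improved approximation.
```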

8.2.3 Functional relationship formulation

Consider again g independent v-variate N(μ_k, Σ_k) populations.

Assume that the v x 1 vectors of population means μ_k are specified by

the model

    μ_k = μ_0 + Σ_k Ψ γ_k ,    (8.16)

where Ψ is again the v x p matrix of population canonical vectors. The

model (8.16) associates each population mean with its own covariance

matrix, through the postulated multivariate Gaussian form and through

the direct association of each group with its own basis vectors. This

formulation of the model is not as intuitively acceptable as that in

(1.13). A more obvious generalization of the model in (1.13) would be

to replace Σ_k in (8.16) by some fixed scaling metric Σ_AVE, with the

vectors chosen so that Ψ^T Σ_AVE Ψ = I. However, the resulting maximum

likelihood estimators are considerably more complicated than those given

below. Because of its relative algebraic tractability, the formulation

in (8.16) is considered here. The derivation parallels that given in

Section 1.3.

The relevant part of the log likelihood is

    - Σ_{k=1}^g n_k log|Σ_k| - Σ_{k=1}^g tr Σ_k^{-1} S_k - Σ_{k=1}^g n_k (x̄_k - μ_0 - Σ_k Ψ γ_k)^T Σ_k^{-1} (x̄_k - μ_0 - Σ_k Ψ γ_k) .

Differentiation w.r.t. γ_k gives

    γ̂_k = (Ψ^T Σ_k Ψ)^{-1} Ψ^T (x̄_k - μ_0) .

Write

    P_k = Σ_k Ψ (Ψ^T Σ_k Ψ)^{-1} Ψ^T .    (8.17)

Since (I - P_k)^T Σ_k^{-1} (I - P_k) = Σ_k^{-1} (I - P_k), the log likelihood may be

written as

    - Σ_{k=1}^g n_k log|Σ_k| - Σ_{k=1}^g tr Σ_k^{-1} S_k - Σ_{k=1}^g n_k (x̄_k - μ_0)^T Σ_k^{-1} (I - P_k)(x̄_k - μ_0) .

Differentiation w.r.t. μ_0 gives

    { Σ_{k=1}^g n_k Σ_k^{-1} (I - P_k) } μ̂_0 = Σ_{k=1}^g n_k Σ_k^{-1} (I - P_k) x̄_k .    (8.18)

Write

    B̂_k = (x̄_k - μ̂_0)(x̄_k - μ̂_0)^T .    (8.19)

Then the log likelihood may be written as

    - Σ_{k=1}^g n_k log|Σ_k| - tr Σ_{k=1}^g Σ_k^{-1} S_k - tr Σ_{k=1}^g n_k Σ_k^{-1} B̂_k + tr Σ_{k=1}^g n_k Σ_k^{-1} P_k B̂_k .    (8.20)


Differentiation w.r.t. Σ_k gives

    n_k Σ̂_k = S_k + n_k B̂_k - Σ̂_k Ψ (Ψ^T Σ̂_k Ψ)^{-1} Ψ^T n_k B̂_k Ψ (Ψ^T Σ̂_k Ψ)^{-1} Ψ^T Σ̂_k .    (8.21)

Pre- and post-multiplication of (8.21) by Ψ^T and by Ψ gives

    Ψ^T Σ̂_k Ψ = Ψ^T n_k^{-1} S_k Ψ = Ψ^T V_k Ψ ,    (8.22)

while post-multiplication of (8.21) by Ψ and substitution of (8.22) gives

    Σ̂_k Ψ = (V_k + B̂_k) Ψ {Ψ^T (V_k + B̂_k) Ψ}^{-1} Ψ^T V_k Ψ .    (8.23)

Write

    T_k = V_k + B̂_k

and substitute (8.22), (8.23) and its transpose in (8.21) to obtain

    Σ̂_k = T_k - T_k Ψ (Ψ^T T_k Ψ)^{-1} Ψ^T B̂_k Ψ (Ψ^T T_k Ψ)^{-1} Ψ^T T_k .    (8.24)

The determinant of Σ̂_k in (8.24) can be written as

    |Σ̂_k| = |T_k| |Ψ^T V_k Ψ| / |Ψ^T T_k Ψ| .

Unfortunately, Σ̂_k itself must be calculated, since it occurs in (8.18).

Hence the computational savings implicit in the expression for |Σ̂_k|

cannot be realized.
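A determinant-lemma calculation gives |Σ̂_k| = |T_k| |Ψ^T V_k Ψ| / |Ψ^T T_k Ψ| for Σ̂_k in (8.24); the sketch below, with an arbitrary illustrative V_k, a rank-one B̂_k and an arbitrary v x p Ψ, checks this identity and the constraint (8.22) numerically.

```python
import numpy as np

rng = np.random.default_rng(5)
v, p = 5, 2
A = rng.standard_normal((v, v))
Vk = A @ A.T / v + np.eye(v)                 # V_k, positive definite
b = rng.standard_normal(v)
Bk = np.outer(b, b)                          # rank-one B_k-hat
Psi = rng.standard_normal((v, p))            # arbitrary v x p Psi

Tk = Vk + Bk
G = Psi.T @ Tk @ Psi                         # Psi' T_k Psi
Sigma_hat = Tk - Tk @ Psi @ np.linalg.solve(G, Psi.T @ Bk @ Psi) \
                @ np.linalg.solve(G, Psi.T @ Tk)         # (8.24)

lhs = np.linalg.det(Sigma_hat)
rhs = (np.linalg.det(Tk) * np.linalg.det(Psi.T @ Vk @ Psi)
       / np.linalg.det(G))
```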

To obtain the maximum likelihood estimator of Ψ, it is necessary to

introduce some conditions or constraints on Ψ. A natural one in the

context of canonical variate analysis is to choose some suitable metric


V_R and require that Ψ^T V_R Ψ = I. In fact, as discussed below, the

choice V_R = I will suffice for the actual optimization.

So differentiate (8.20) with respect to Ψ, subject to the

constraint that Ψ^T Ψ = I, and let M be a symmetric p x p matrix of Lagrange

multipliers; the resulting likelihood equation is

    Σ_{k=1}^g n_k B̂_k Ψ (Ψ^T Σ̂_k Ψ)^{-1} - Σ_{k=1}^g n_k Σ̂_k Ψ (Ψ^T Σ̂_k Ψ)^{-1} Ψ^T B̂_k Ψ (Ψ^T Σ̂_k Ψ)^{-1} + Ψ M = 0 .    (8.25)

Premultiplication by Ψ^T gives M = 0, so that, on using (8.22) and

(8.23), the likelihood equation becomes

    Σ_{k=1}^g (Ψ^T V_k Ψ)^{-1} Ψ^T n_k B̂_k Ψ (Ψ^T T_k Ψ)^{-1} Ψ^T T_k - Σ_{k=1}^g (Ψ^T V_k Ψ)^{-1} Ψ^T n_k B̂_k = 0 ,    (8.26)

but no further effective simplification appears possible.

Consider again the orthogonality constraint Ψ^T Ψ = I introduced

above. If instead the required condition is Ψ_0^T V_R Ψ_0 = I, then write

Ψ̂^T V_R Ψ̂ = Q_R, say, with eigenanalysis Q_R = U_Q E_Q U_Q^T, so that

E_Q^{-1/2} U_Q^T Ψ̂^T V_R Ψ̂ U_Q E_Q^{-1/2} = I and hence Ψ̂_0 = Ψ̂ U_Q E_Q^{-1/2}.

The numerical maximization of the log likelihood in (8.20) reduces

to choosing Ψ to minimize Σ_{k=1}^g n_k log|Σ̂_k|, with Σ̂_k given by (8.24). The

derivative of the log likelihood is given by (8.41). The maximized

likelihood is invariant under orthogonal (but not orthonormal) rotation

of the original variables (see Section 8.3). This results in a considerable

advantage computationally, since the likelihood can then be maximized

for each vector ψ_i conditionally on the previous ψ_1, ..., ψ_{i-1}. Further

discussion is given in Section 8.3.


If the Σ_k are assumed known, the functional relationship

formulation reduces to choosing Ψ to maximize the last term of (8.20).

From (8.17), this can be written as

    tr Σ_{k=1}^g n_k (Ψ^T Σ_k Ψ)^{-1} Ψ^T (x̄_k - μ_0)(x̄_k - μ_0)^T Ψ .

When p = 1, this becomes Σ_{k=1}^g n_k (ψ^T Σ_k ψ)^{-1} ψ^T (x̄_k - μ_0)(x̄_k - μ_0)^T ψ.

If the sample V_k replace the Σ_k, and x̄_I replaces μ_0, the weighted

between-groups quantity in (8.2) results.

If the Σ_k are assumed known and equal to V_P = n^{-1} W, (8.25) reduces

to the usual canonical variate solution in (1.10).

8.3 Computation of the Generalized Solutions

The likelihood ratio/non-centrality matrix generalization requires

the eigenvalues and vectors of an unsymmetric matrix. I have used the

NAG routine F02AGF for the eigenanalysis.

The weighted between-groups and functional relationship generalizations

require explicit use of function minimization/maximization routines.

I have experienced considerable difficulty in developing an effective

overall computing procedure, and this has hampered adequate evaluation

of the various generalizations. The current program has an option for

using either the Simplex procedure described by Nelder and Mead (1965),

as implemented in the NAG routine E04CCF, or Powell's hybrid steepest

descent/quasi-Newton method as implemented in the NAG routine E04DCF.

The Simplex procedure is generally considered to be inefficient when

compared with gradient methods when more than a few variables are

involved. It does, however, have the advantage of being relatively

insensitive to poor initial estimates.

The functional relationship formulation encompasses the situations

where no restrictions are placed on the mean vectors and where all



mean vectors are assumed equal. The maximized likelihood for the functional

relationship with intermediate p will lie between those for the usual

unrestricted and restricted hypotheses in (8.6) and (8.10).

The function maximization can be carried out on the original

variables, perhaps standardized with respect to some average covariance

matrix VA to unit variance, or on orthogonal or on orthonormal

transformations of the original or standardized variables. The effect

of the various possible transformations on the maximized likelihood

can be found by considering the effect on |Σ_k|. Let V_A = S_A R_A S_A be

the decomposition of V_A in terms of the diagonal matrix of standard

deviations S_A and the correlation matrix R_A, and let V_A = U_A E_A U_A^T and

R_A = U_R E_R U_R^T be eigenanalyses of V_A and of R_A. It is straightforward to

establish the following table of determinants for the possible

transformations; each entry gives the determinant in terms of the

original determinant |Σ_k| and the determinants of the matrix of standard

deviations and/or of eigenvalues.

                    original            orthogonal          orthonormal

    original        |Σ_k|               |Σ_k|               |E_A|^{-1} |Σ_k|
                                                                                  (8.27)
    standardized    |S_A|^{-2} |Σ_k|    |S_A|^{-2} |Σ_k|    |S_A|^{-2} |E_R|^{-1} |Σ_k| = |E_A|^{-1} |Σ_k|

The bottom right-hand entry follows from |E_A| = |V_A| = |S_A|² |R_A| = |S_A|² |E_R|.

Hence for an orthonormal transformation of the original variables, the

maximized likelihoods are given by |E_A|^{n/2} times the maximized likeli-

hoods in (8.6) and (8.10).
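The relation |E_A| = |V_A| = |S_A|² |E_R| used for the bottom right-hand entry can be verified directly; V_A below is an arbitrary illustrative positive definite matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
v = 4
A = rng.standard_normal((v, v))
VA = A @ A.T / v + np.eye(v)                 # an "average" covariance matrix

SA = np.diag(np.sqrt(np.diag(VA)))           # diagonal standard deviations S_A
RA = np.linalg.solve(SA, np.linalg.solve(SA, VA).T).T   # R_A = S_A^{-1} V_A S_A^{-1}
EA = np.linalg.eigvalsh(VA)                  # eigenvalues of V_A
ER = np.linalg.eigvalsh(RA)                  # eigenvalues of R_A

det_EA = np.prod(EA)
det_SA = np.linalg.det(SA)
det_ER = np.prod(ER)
```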

Successive vectors for the weighted between-groups and functional

relationship generalizations are to be chosen to satisfy c_i^T V_A c_i = 1

and c_i^T V_A c_j = 0, j = 1, ..., i-1. In the functional relationship

derivation, ψ̂_i = c_i. Alternatively, consider variables orthonormal

with respect to V_A. Write a = E_A^{-1/2} U_A^T c, as in Section 1.4. Then

a_i^T a_i = 1 and a_i^T a_j = 0.

Initially, the constraints were incorporated by substitution for

one of the coefficients, with the variables orthonormalized. Specifically,

for the first coefficient vector, all coefficients are divided by the

largest component, and the maximization proceeds over the remaining

v-1 coefficients. At each iteration, a_1^T a_1 = 1 is used to solve for the

excluded component. For the second coefficient vector, the coefficients

are divided by the largest component ignoring the already excluded

component, and the maximization proceeds over the remaining v-2 coefficients.

At each iteration, a_2^T a_2 = 1 and a_2^T a_1 = 0 are used to solve for the two

excluded components. This procedure leads to an unconstrained

optimization, though the way in which the constraints are accommodated

seems to lead to relatively poor performance of the routines.

Very limited empirical evidence suggests that a more effective

procedure results if the orthogonality constraint c_i^T V_A c_j = 0 is

accommodated by explicit projection of the data orthogonal to the

previous c_j. Assume that the first vector, c_1, has been found and that

c_1^T V_A c_1 = 1. The residual projection operator R_1 = I - V_A c_1 c_1^T projects

the observations x onto the space orthogonal (with respect to V_A) to

c_1. But V_A2 = R_1 V_A R_1^T will now be of rank v-1. So form the eigenanalysis

V_A2 = U_A2 E_A2 U_A2^T and set Ṽ_A2 = U_{A2,v-1}^T V_A2 U_{A2,v-1}, where U_{A2,v-1} denotes

the first v-1 columns of U_A2. The (v-1) x (v-1) matrix Ṽ_A2 will be of

full rank. Now carry out the maximization to determine c_2^P, the

coefficients for the variables U_{A2,v-1}^T R_1 x, and scale c_2^P so that

(c_2^P)^T Ṽ_A2 c_2^P = 1. Then the required c_2 is given by c_2 = R_1^T U_{A2,v-1} c_2^P.

Note that c_2^T V_A c_1 = (c_2^P)^T U_{A2,v-1}^T R_1 V_A c_1 = 0 and c_2^T V_A c_2 =

(c_2^P)^T U_{A2,v-1}^T R_1 V_A R_1^T U_{A2,v-1} c_2^P = (c_2^P)^T Ṽ_A2 c_2^P = 1, as required.

In general, with P_i = R_1^T U_{A2,v-1} ... R_{i-1}^T U_{Ai,v-i+1}, then c_i = P_i c_i^P;

note that P_1 = I.

When the orthonormal form is used, V_A = I and R_i = R_i^T = I - a_i a_i^T.
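The projection construction for the second vector can be sketched directly; V_A and the candidate coefficient vectors below are arbitrary illustrative values, and the maximization step is replaced by an arbitrary c_2^P, since only the constraints are being checked.

```python
import numpy as np

rng = np.random.default_rng(7)
v = 4
A = rng.standard_normal((v, v))
VA = A @ A.T / v + np.eye(v)                 # illustrative metric V_A

# First vector: any c1, scaled so that c1' V_A c1 = 1.
c1 = rng.standard_normal(v)
c1 /= np.sqrt(c1 @ VA @ c1)

R1 = np.eye(v) - np.outer(VA @ c1, c1)       # residual projector R_1
VA2 = R1 @ VA @ R1.T                         # rank v-1
w, U = np.linalg.eigh(VA2)
U1 = U[:, np.argsort(w)[::-1][: v - 1]]      # first v-1 eigenvector columns
VA2t = U1.T @ VA2 @ U1                       # full-rank (v-1)x(v-1) metric

# Arbitrary candidate c2^P in the reduced space, scaled to unit length in VA2t.
c2p = rng.standard_normal(v - 1)
c2p /= np.sqrt(c2p @ VA2t @ c2p)
c2 = R1.T @ U1 @ c2p                         # map back: c2 = R_1' U_{A2,v-1} c2^P
```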

The length constraint for each vector is accommodated in the actual

maximization routine by using polar coordinates θ. The constraint

c^T c = 1 is adopted for the maximization, and the vector is then rescaled so

that c^T V_A c = 1.

The original means and covariance matrices can be retained for the

actual calculation of the log likelihood at each iteration. Specifically,

for the ith vector and for any iteration, calculate c_i^P from θ_i, and then

c_i = P_i c_i^P. For the gradient calculations, the chain rule can be used

to form the derivatives for the polar coordinates from those for the

original variables. Specifically, (∂L/∂θ_i)^T is given by

(∂L/∂c_i)^T P_i (∂c_i^P/∂θ_i).

The alternative computational approaches are currently being

evaluated.

8.4 Performance of the Generalizations when the Covariance Matrices

are Equal

The performance of the weighted between-groups and likelihood ratio

generalizations and the usual solution is studied for some general

situations using computer-generated data. The functional relationship

generalization was developed after this aspect was completed; because

of the computing difficulties discussed in Section 8.3 and the similarity

of the results for the two generalizations, I have not carried out the

calculations for the third generalization. The situations examined are:

(i) one group differs from the rest, either in one or all variables;

(ii) two directions are of interest, the first two groups differing

from the rest, each on only one or two variables; (iii) a simple

bivariate configuration with groups symmetric about the 1 direction;


and (iv) configurations corresponding to two actual data sets. The

population covariance matrix is taken as the identity matrix; for the

actual data sets, the orthonormal variable configurations are used.

The simulations are blocked, in that each of the three solutions is

calculated for the same generated data set. The independent Gaussian

observations are generated by the polar method of Marsaglia and Bray

(1964). The generation of the covariance matrices is via the Bartlett

decomposition (Newman and Odell, 1971, Section 5.2). The vectors of

means are produced by dividing the vectors of generated observations

by the assumed sample size and adding the vectors of population means.

Usually, 100 sets of data are generated. I have used the MSE of the

coefficients and of the group means for each canonical variate to

compare the solutions. The vectors are scaled to have unit length.
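One arm of such a simulation (the usual solution only) can be sketched as follows; the sketch uses fewer runs than the 100 of the text, generates the sample means directly as N(μ_k, I/n_k), and forms the within-groups SSQPR from Gaussian samples rather than via the Bartlett decomposition.

```python
import numpy as np

rng = np.random.default_rng(8)
v, g, nk, runs = 3, 4, 50, 20          # fewer runs than the 100 in the text
mus = [np.array([4.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0]),
       np.zeros(3), np.zeros(3)]
psi1 = np.array([0.978, -0.208, 0.0])  # population first canonical vector

mse = 0.0
for _ in range(runs):
    # Sample means mu_k + N(0, I/n_k); within-groups SSQPR from Gaussian data.
    xbars = [mu + rng.standard_normal(v) / np.sqrt(nk) for mu in mus]
    W = sum(np.cov(rng.standard_normal((nk, v)), rowvar=False) * (nk - 1)
            for _ in range(g))
    Vp = W / (g * (nk - 1))                       # pooled covariance matrix
    xg = np.mean(xbars, axis=0)
    B = sum(nk * np.outer(x - xg, x - xg) for x in xbars)
    evals, evecs = np.linalg.eig(np.linalg.solve(Vp, B))
    c = np.real(evecs[:, np.argmax(np.real(evals))])
    c /= np.linalg.norm(c)                        # scale to unit length
    if c @ psi1 < 0:
        c = -c                                    # resolve sign indeterminacy
    mse += np.sum((c - psi1) ** 2)                # componentwise MSE, summed
```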

Table 8.1 gives the results for a two direction configuration,

with the overall mean for the likelihood ratio solution calculated from

(8.13). The usual solution performs somewhat better than the

generalizations, both for componentwise MSE and overall MSE. The MSE,

means and SSQ's for the two generalizations are similar. The lower

MSE for the usual solution is due to the lower variance of the coefficients

and means. The bottom part of Table 8.1(a) and 8.1(b) gives the results

when the unweighted mean xT in (1.6) and the full maximum likelihood

iterative estimate in (8.7)-(8.9) are used. The full maximum likelihood

solution seems to perform badly. It may be that the component B_k, which enters into the estimation of the Σ̂_k and hence of ψ̂, and which itself depends on μ̂, is more sensitive to random fluctuations than when the Σ̂_k are replaced by the sample V_k. The canonical roots also have lowest

variance for the usual solution and highest for the full maximum

likelihood solution. The approximately threefold change in MSE is again

evident when the sample size is reduced to 15 for each group. A seven


Table 8.1 Summary of simulations to compare the usual canonical

variate solution with the weighted between-groups and

likelihood ratio generalizations. Results are for 100

runs of a three variate, four population configuration,

with μ_1 = (4,0,0)^T, μ_2 = (0,2,0)^T, μ_3 = μ_4 = 0, and Σ_k = Σ = I. Sample size is 50 for each group.

(a) For this configuration ψ_1 = (0.978, -0.208, 0)^T and ψ_2 = (0.208, 0.978, 0)^T, with canonical roots 621 and 128.8. The MSE is calculated as Σ_{m=1}^{100} (c*_mi − ψ_i)², where c*_mi represents the ith component of the mth sample vector c*, and * denotes either U, LR or WBG. The overall mean for the likelihood ratio solution is calculated using (8.13).

The bottom part gives the results when the unweighted

mean in (1.6), denoted by superscript U, and the full

likelihood ratio iterative solution in (8.7)-(8.10),

denoted by M, are used.

(b) The population means for the canonical variates are ψ_1^T μ_k : 3.913, -0.415, 0, 0 and ψ_2^T μ_k : 0.830, 1.956, 0, 0. The MSE is calculated as Σ_{m=1}^{100} {(c*^T x̄_k)_m − ψ^T μ_k}².


8.1(a) Comparison of vectors and roots

CVI                 1        2        3       Sum
  MSE      USL    0.0084   0.200    0.108    0.316
           LR     0.0274   0.561    0.284    0.872
           WBG    0.0271   0.559    0.261    0.847
  Mean     USL    0.978   -0.202   -0.009
           LR     0.975   -0.204   -0.010
           WBG    0.975   -0.204   -0.013
  SSQ      USL    0.0085   0.198    0.101
  (x100)   LR     0.0263   0.565    0.276
           WBG    0.0262   0.564    0.246
  canonical roots, mean (SSQ): USL 618.2 (1265); LR 622.0 (2751); WBG 621.1 (2676)

CVII                1        2        3       Sum
  MSE      USL    0.011    0.0006   0.162    0.174
           LR     0.070    0.006    0.518    0.594
           WBG    0.293    0.014    0.262    0.569
  Mean     USL    0.205    0.978    0.001
           LR     0.207    0.975    0.000
           WBG    0.205    0.976   -0.005
  SSQ      USL    0.011    0.0006   0.164
  (x100)   LR     0.071    0.0052   0.523
           WBG    0.295    0.0137   0.262
  canonical roots, mean (SSQ): USL 128.1 (40.2); LR 130.2 (124); WBG 130.6 (67)

CVI                 1        2        3       Sum
  MSE      LRU    0.027    0.567    0.280    0.874
           LRM    0.024    2.642    0.444    3.110
  Mean     LRU    0.974   -0.205   -0.011
           LRM    0.990   -0.076   -0.013
  SSQ      LRU    0.026    0.572    0.270
  (x100)   LRM    0.010    0.920    0.432
  canonical roots, mean (SSQ): LRU 625.4 (2894); LRM 764.8 (7215)


8.1(b) Comparison of canonical variate means

CVI               group 1  group 2  group 3  group 4    Sum
  MSE     USL     0.182    0.820    0.033    0.038     1.072
          LR      0.487    2.233    0.033    0.038     2.791
          WBG     0.510    2.227    0.034    0.038     2.808
  Mean    USL     3.912   -0.403   -0.0027  -0.0021
          LR      3.898   -0.408   -0.0025  -0.0021
          WBG     3.899   -0.407   -0.0026  -0.0022
  SSQ     USL     0.183    0.813    0.033    0.038
  (x100)  LR      0.469    2.251    0.033    0.038
          WBG     0.495    2.243    0.033    0.038

CVII
  MSE     USL     0.144    0.047    0.039    0.033     0.264
          LR      1.103    0.070    0.039    0.034     1.246
          WBG     4.752    0.090    0.039    0.033     4.914
  Mean    USL     0.824    1.953    0.0005   0.0018
          LR      0.832    1.950    0.0003   0.0010
          WBG     0.823    1.951    0.0005   0.0018
  SSQ     USL     0.142    0.047    0.040    0.033
  (x100)  LR      1.114    0.067    0.039    0.034
          WBG     4.794    0.083    0.039    0.033

CVI
  MSE     LRU     0.489    2.258    0.033    0.038     2.819
          LRM     0.439    10.56    0.033    0.037     11.07
  Mean    LRU     3.897   -0.410   -0.0025  -0.0021
          LRM     3.961   -0.152   -0.0025  -0.0020
  SSQ     LRU     0.469    2.278    0.033    0.038
  (x100)  LRM     0.206    3.668    0.033    0.038


variable run with changes in two variables, with μ_2 = (0,0,1.8,1.8,0,0,0)^T, μ_3 = 0, μ_4 = (0,0,0,0,0,0.4,0.4)^T, and

sample sizes of 50 again shows an approximately threefold change in

MSE. For the first vector, the MSE of the coefficients is 1.251 for

the usual solution, 2.828 for the likelihood ratio solution using x̄_W,

and 2.875 for the weighted between-groups solution. The differences

in MSE are again due to differences in the variances. For the group

means, the MSE are 4.23, 9.61 and 10.87 respectively. The means of the

canonical roots are 484.0, 495.3 and 491.6 respectively, for a population

value of 481.4; corresponding SSQs are 459, 1260 and 1140.

The results outlined above are paralleled for the other configurations

examined. The usual solution is preferable to the generalizations when

the covariance matrices do not differ, in that the generalizations show

greater variation for the canonical variate coefficients, roots and

group means.

8.5 Comparison of Solutions

An obvious question to ask is whether the descriptions given by

the generalized canonical variate solutions differ from the description

given by the usual canonical variate solution.

One approach is to consider the usual canonical vectors as

hypothetical vectors, and ask whether they provide discrimination which

is as good as that provided by a generalized solution. For the usual

situation, different formulations lead to the same statistic Λ_v/Λ_p, as

outlined in Section 1.6. When the covariance matrices are not assumed

to be equal, the various formulations lead to different and often more

complicated solutions.

The maximized likelihood for the functional relationship formulation
can be compared with the value obtained when C^U is substituted for Ψ in


(8.24) (and hence (8.17), (8.18) and (8.19)). The relative magnitudes
of the changes in the likelihood using either Ψ̂ or C^U can be examined.

A more formal comparison follows by comparing 2 log (ratio of likelihoods)

with the χ² distribution with vp − p² d.f.; equivalently, set an asymptotic
confidence region for Ψ of size α and determine whether C^U is within

the region (see Cox and Hinkley, 1974, p.343, for further discussion).
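A sketch of this formal comparison (the function name and argument layout are assumptions; SciPy supplies the chi-squared quantile):

```python
from scipy.stats import chi2

def compare_hypothetical_vectors(loglik_mle, loglik_hyp, v, p, alpha=0.05):
    # 2 log(ratio of likelihoods), referred to the chi-squared
    # distribution with v*p - p**2 degrees of freedom
    stat = 2.0 * (loglik_mle - loglik_hyp)
    df = v * p - p ** 2
    return stat, df, stat > chi2.ppf(1.0 - alpha, df)
```

For example, with v = 5 and p = 1, the Table 8.2 values -3165 (for the functional relationship solution) and -3194 (for the usual vectors) give 2(-3165 + 3194) = 58 on 4 d.f., well beyond the 5% point of 9.49.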

In Section 1.6, the adequacy of hypothetical vectors is examined

by considering the equality of conditional means. The approach is

outlined here when the covariance matrices are not assumed to be equal.

The derivation follows that considered in Section 1.6, with an observation

x_km ~ N_v(μ_k, Σ_k). Partition

    Σ_k = ( Σ_pp,k   Σ_pq,k )
          ( Σ_qp,k   Σ_qq,k ) ,

with a similar partition for S_k in (1.1). The maximized likelihood

for no restriction on the conditional means is easily shown to be

    (2π)^{-np/2} ∏_{k=1}^{g} |n_k^{-1} S_pp,k|^{-n_k/2} e^{-np/2} · (2π)^{-nq/2} ∏_{k=1}^{g} |n_k^{-1} S_qq·p,k|^{-n_k/2} e^{-nq/2} .     (8.28)

Using a determinantal identity similar to (1.50), this reduces to (8.6).
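The determinantal identity invoked here is presumably the Schur-complement factorization |S| = |S_pp| |S_qq·p| (an assumption about (1.50), which lies outside this chapter); a quick numerical check, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

# |S| = |S_pp| * |S_qq - S_qp S_pp^{-1} S_pq| for a partitioned
# positive definite matrix S
A = rng.standard_normal((50, 5))
S = A.T @ A                      # positive definite almost surely
p = 2
Spp, Spq = S[:p, :p], S[:p, p:]
Sqp, Sqq = S[p:, :p], S[p:, p:]
Sqq_p = Sqq - Sqp @ np.linalg.inv(Spp) @ Spq
lhs = np.linalg.det(S)
rhs = np.linalg.det(Spp) * np.linalg.det(Sqq_p)
assert np.isclose(lhs, rhs)
```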

For the hypothesis specifying equality of the conditional means,

the unconditional part of the likelihood is unchanged, giving

Σ̂_pp,k = n_k^{-1} S_pp,k and hence the first part of (8.28). The conditional

part of the log likelihood may be written


    −½ Σ_{k=1}^{g} n_k log |Σ_qq·p,k|
    −½ Σ_{k=1}^{g} Σ_{m=1}^{n_k} (x_qkm − β_qp,k x_pkm − μ_q·p)^T Σ_qq·p,k^{-1} (x_qkm − β_qp,k x_pkm − μ_q·p) ,

with

    β_qp,k = Σ_qp,k Σ_pp,k^{-1}   and   Σ_qq·p,k = Σ_qq,k − Σ_qp,k Σ_pp,k^{-1} Σ_pq,k .

Differentiation w.r.t. μ_q·p gives

    μ̂_q·p = ( Σ_{k=1}^{g} n_k Σ̂_qq·p,k^{-1} )^{-1} Σ_{k=1}^{g} n_k Σ̂_qq·p,k^{-1} (x̄_qk − β̂_qp,k x̄_pk) ,     (8.29)

while differentiation w.r.t. β_qp,k gives

    β̂_qp,k (S_pp,k + n_k x̄_pk x̄_pk^T) = S_qp,k + n_k (x̄_qk − μ̂_q·p) x̄_pk^T .     (8.30)

Since the solution for μ̂_q·p involves β̂_qp,k, an explicit solution

is not possible.

Differentiation w.r.t. Σ_qq·p,k gives

    n_k Σ̂_qq·p,k = Σ_{m=1}^{n_k} (x_qkm − β̂_qp,k x_pkm − μ̂_q·p)(x_qkm − β̂_qp,k x_pkm − μ̂_q·p)^T ;     (8.31)

this may be rewritten in terms of the sample means and covariance matrices.

Again the solution is iterative.
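The cycle through (8.29), (8.30) and (8.31) can be sketched as follows (a sketch assuming per-group data arrays; the starting values, convergence rule and function names are my own choices, not taken from the thesis):

```python
import numpy as np

def equal_conditional_means(data_p, data_q, tol=1e-8, max_iter=200):
    # Iterative ML estimation under equality of the conditional means.
    # data_p and data_q are lists of per-group arrays of shape (n_k, p)
    # and (n_k, q).
    g = len(data_p)
    n = [x.shape[0] for x in data_p]
    xp = [x.mean(axis=0) for x in data_p]
    xq = [y.mean(axis=0) for y in data_q]
    # corrected sums of squares and products, as in S_k of (1.1)
    Spp = [(x - m).T @ (x - m) for x, m in zip(data_p, xp)]
    Sqp = [(y - my).T @ (x - mx)
           for y, my, x, mx in zip(data_q, xq, data_p, xp)]
    Sqq = [(y - m).T @ (y - m) for y, m in zip(data_q, xq)]
    # starting values: within-group regressions and an unweighted mean
    beta = [Sqp[k] @ np.linalg.inv(Spp[k]) for k in range(g)]
    Sig = [(Sqq[k] - beta[k] @ Sqp[k].T) / n[k] for k in range(g)]
    mu = sum(n[k] * (xq[k] - beta[k] @ xp[k]) for k in range(g)) / sum(n)
    for _ in range(max_iter):
        mu_old = mu.copy()
        for k in range(g):
            # (8.30): regression coefficients given the common mean
            lhs = Spp[k] + n[k] * np.outer(xp[k], xp[k])
            rhs = Sqp[k] + n[k] * np.outer(xq[k] - mu, xp[k])
            beta[k] = rhs @ np.linalg.inv(lhs)
            # (8.31): conditional covariance, rewritten in terms of
            # the sample means and covariance matrices
            r = xq[k] - beta[k] @ xp[k] - mu
            Sig[k] = (Sqq[k] - beta[k] @ Sqp[k].T - Sqp[k] @ beta[k].T
                      + beta[k] @ Spp[k] @ beta[k].T
                      + n[k] * np.outer(r, r)) / n[k]
        # (8.29): weighted combination of the group conditional means
        W = [n[k] * np.linalg.inv(Sig[k]) for k in range(g)]
        mu = np.linalg.solve(sum(W),
                             sum(W[k] @ (xq[k] - beta[k] @ xp[k])
                                 for k in range(g)))
        if np.max(np.abs(mu - mu_old)) < tol:
            break
    return mu, beta, Sig
```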

Because of the iterative solution in (8.29), (8.30) and (8.31), the

maximum likelihood estimator for Σ_qq·p,k in (8.31) does not reduce to the analogue of the equal covariance matrix case, namely T_qq·p,k. Hence

the equivalent determinantal identity to (1.50) cannot be used to

simplify the maximized likelihood. Note that the generalized result


requires explicit definition of the conditional variates x_q as well as the covariates x_p. The maximized likelihood is

    (2π)^{-nv/2} ∏_{k=1}^{g} |n_k^{-1} S_pp,k|^{-n_k/2} |Σ̂_qq·p,k|^{-n_k/2} e^{-nv/2} ,     (8.32)

with ½ gv(v+1) + gp + q estimated parameters.
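The parameter count can be verified by recombining the per-group pieces of the partitioned parameters (a sketch of the arithmetic, using the partition sizes p + q = v):

```latex
% per group: \Sigma_{pp,k}, \beta_{qp,k} and \Sigma_{qq\cdot p,k} carry
% \tfrac{1}{2}p(p+1) + pq + \tfrac{1}{2}q(q+1) = \tfrac{1}{2}v(v+1) parameters
g \cdot \tfrac{1}{2}v(v+1)
\;+\; \underbrace{gp}_{\mu_{p,k},\ k=1,\dots,g}
\;+\; \underbrace{q}_{\mu_{q\cdot p}\ \text{common}}
\;=\; \tfrac{1}{2}gv(v+1) + gp + q .
```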

The intuitively acceptable notions of dimension and collinearity

given in Section 1.6 do not obtain here. Because of time limitations,

I have not examined this equality of conditional means approach.

8.6 Practical Application

Campbell and Reyment (1978) have applied the shrunken estimation

techniques described in Chapter Six to data on the foraminifer

Afrobolivina afra from 46 borehole samples taken at approximately equal

depth intervals. Five of the nine variables measured contain much of

the discrimination. Group sizes are: 24, 12, 30, 28, 62, 31, 16, 16,

16, 15, 16, 16, 16, 12, 15, 12, 11, 26, 13, 26, 12, 39, 13, 18, 30, 17,

21, 42, 31, 28, 34, 13, 17, 14, 14, 22, 12, 14, 16, 13, 14, 15, 29, 31,

15 and 60. The total sample size is 997. The canonical roots for the

usual analysis based on five variables are 2.29, 0.64, 0.25, 0.08 and

0.05. The first canonical variate reflects depth changes down the

borehole, as shown by the solid line in Figure 8.1. The group variances

for the first canonical variate range from 0.2 to 3.7; the vector is

scaled so that the average variance is unity. The techniques discussed

in Chapter Five indicate highly significant differences in both variance

and correlation structure.
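The scaling of the canonical vector to unit average variance can be sketched as (the function name is an assumption):

```python
import numpy as np

def scale_to_unit_average_variance(c, group_covs):
    # rescale the canonical vector c so that the average over groups of
    # the variances c^T V_k c is unity
    c = np.asarray(c, dtype=float)
    avg_var = np.mean([c @ V @ c for V in group_covs])
    return c / np.sqrt(avg_var)
```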

Figure 8.1 shows a plot of the group means for the first canonical

variate against depth for the usual solution and two of the generalizations.

There is very good agreement between the three profiles.


Table 8.2 gives the maximized likelihoods for the unrestricted

and equal means hypotheses, corresponding to (8.6) and (8.10), and

for the functional relationship generalization, with p = 1 and p = 2,

resulting from (8.24). The difference in the maximized likelihoods

for p = 0 and p = v is 777. Of this, the first vector (p = 1)

explains some 80%. The values of the log likelihood when the usual

canonical vectors C^U and likelihood ratio generalization vectors C^LR
are substituted for Ψ in (8.24) are also given. Because of the

iterative solution in (8.18), (8.19) and (8.24), even this calculation

may be relatively time consuming on the computer. For example, to

calculate the log likelihood with p = 1 and c_1^U replacing ψ_1 takes 15 iterations for the value to fall below 3199 and 35 iterations to fall below 3195. The value after the first iteration, with μ̂ calculated using (8.13), is 3326.

The similarity of the various solutions is evident in the magnitudes

of the log likelihoods. The value of -3098 for Ψ̂ for p = 2 may be

slightly large, due to convergence difficulties; the values obtained

range from -3098 to -3125 using different initial estimates and

computing routines.

For this data set, the differences in covariance structure have

little effect on the ordination of the groups along the first canonical

variate. The first canonical root for the likelihood ratio-non-centrality

matrix generalization, given by (c_1^LR)^T M c_1^LR, is 5.18, compared with the

value of 2.29 for the usual solution. This difference reflects the

effect of the few groups with large variances on the pooled covariance

matrix for the usual solution. When the group variances for the canonical

variate(s) are similar, the two values will also be similar.

There are some practical problems to clear up before the approaches

described in this Chapter will be suitable for general use. Extensive


Table 8.2 Maximized log likelihoods, excluding the factor
−½ nv(1 + log 2π), for the functional relationship generalization

in Section 8.2.3 and for the unrestricted (p = v) and usual null

(p = 0) hypotheses in Section 8.2.2. The log likelihoods when the

usual and likelihood ratio canonical vectors replace those for the

generalized functional relationship solution are also given.

    μ_k = μ_0                       (p = 0 in (8.16))   -3786
    μ_k = μ_0 + Σ_k ψ_1 γ_1k        (p = 1)             -3165
    μ_k = μ_0 + Σ_k Ψ γ_k           (p = 2)             -3098
    μ_k unrestricted                (p = v)             -3009

    μ_k = μ_0 + Σ_k c_1^U γ_1k      (p = 1)             -3194
    μ_k = μ_0 + Σ_k C^U γ_k         (p = 2)             -3105
    μ_k = μ_0 + Σ_k c_1^LR γ_1k     (p = 1)             -3174
    μ_k = μ_0 + Σ_k C^LR γ_k        (p = 2)             -3098


[Figure 8.1 here: first canonical variate (horizontal axis, 0 to 9) plotted against borehole sample number (vertical axis, 0 to 45), with separate profiles for the usual solution, the functional relationship generalization and the likelihood ratio generalization.]

Figure 8.1 Plot of first canonical variate means for each borehole sample versus depth for the usual solution and two of the generalizations.


application of the functional relationship generalization to a number

of data sets, combined with detailed analysis of specific configurations,

is needed to develop general guidelines for the use of the approach.

Evaluation of proposals for comparing the solutions is also needed.

Time limitations and computing difficulties have precluded these aspects

of the study. Another problem requiring solution is the extension of

the use of shrunken estimators to the various generalizations. Here,

small c^T B_k c and small c^T V_k c for all groups would seem to be the natural

analogue. A further aspect needing clarification is the role of the

roots and particularly vectors from the likelihood ratio-non-centrality

matrix generalization. For one data set examined, the first two roots

were similar in magnitude, and the second vector duplicated the first.

The third root and vector paralleled the information given by the second

root and vector for the usual solution.

Despite the unresolved problems, the generalized approaches, and

particularly the functional relationship generalization, appear to be

useful for evaluating the effect of unequal covariance matrices on the

description provided by the canonical variates. This descriptive aim

should be clearly distinguished from that of allocation. Whereas

differences in orientation and size of the associated concentration

ellipsoids may have little effect on overall conclusions about group

differences and similarities, they may have a marked effect on the correct

allocation of individuals to the groups. This failure to distinguish

between allocation and description is a potential source of confusion

in the canonical variate and particularly discriminant analysis

literature. Geisser (1977) gives an excellent discussion of the two

(often quite distinct) aims (see his pp. 302-309).


REFERENCES

AHMED, S.W. and LACHENBRUCH, P.A. (1977). Discriminant analysis when

scale contamination is present in the initial sample. In

Classification and Clustering (J. Van Ryzin, ed), pp. 331-353.

New York: Academic Press.

ALLDREDGE, J.R. and GILB, N.S. (1976). Ridge regression: an annotated

bibliography. Int. Stat. Rev., 44, 355-360.

ANDERSON, T.W. (1951). Estimating linear restrictions on regression

coefficients for multivariate normal distributions. Ann. Math.

Statist., 22, 327-351.

ANDERSON, T.W. (1963). Asymptotic theory for principal component

analysis. Ann. Math. Statist., 34, 122-148.

ASHTON, E.H., HEALY, M.J.R. and LIPTON, S. (1957). The descriptive

use of discriminant functions in physical anthropology.

Proc. Roy. Soc. Lond. Ser. B, 146, 552-572.

ATKINSON, A.C. and PEARCE, M.C. (1976). The computer generation of

beta, gamma and normal random variables (with Discussion).

J.R. Statist. Soc. A, 139, 431-460.

BARNETT, V. and LEWIS, T. (1978). Outliers in Statistical Data.

New York: Wiley.

BARR, D.R. and SLEZAK, N.L. (1972). A comparison of multivariate

normal generators. Comm. ACM, 15, 1048-1049.

BARTLETT, M.S. (1938). Further aspects of the theory of multiple

regression. Proc. Camb. Phil. Soc., 34, 33-40.

BARTLETT, M.S. (1951). The goodness of fit of a single hypothetical

discriminant function in the case of several groups. Ann. Eugen.,

16, 199-214.


BARTLETT, M.S. and KENDALL, D.G. (1946). The statistical analysis of

variance heterogeneity and the logarithmic transformation.

J. R. Statist. Soc. B, 8, 128-138.

BIBBY, J. and TOUTENBURG, H. (1977). Prediction and Improved Estimation

in Linear Models. New York: Wiley.

CAMPBELL, C.A. (1978). The frilled dogwinkle: ecological genetics of

a morphologically variable snail, Thais lamellosa. Ph.D. Thesis,

University of California, Davis.

CAMPBELL, N.A. (1976). A multivariate approach to variation in microfilariae: Examination of the species Wuchereria lewisi and demes

of the species W. bancrofti. Aust. J. Zool., 24, 105-114.

CAMPBELL, N.A. (1978). Multivariate analysis in biological anthropology:

some further considerations. J. Hum. Evol., 7, 197-203.

CAMPBELL, N.A. (1979). Some practical aspects of canonical variate

analysis. BIAS, 6 (to appear).

CAMPBELL, N.A. and ATCHLEY, W.R. (1979). The geometry of multivariate

analysis and its relation to the analysis of morphometric shape.

In preparation for Syst. Zool.

CAMPBELL, N.A. and DEARN, J.M. (1979). Altitudinal variation in, and

morphological divergence between, three related species of grasshopper.

Aust. J. Zool., 27 (to appear).

CAMPBELL, N.A. and MAHON, R.J. (1974). A multivariate study of variation

in two species of rock crab of the genus Leptograpsus. Aust. J.

Zool., 22, 417-425.

CAMPBELL, N.A. and REYMENT, R.A. (1978). Discriminant analysis of a

Cretaceous foraminifer using shrunken estimators. Math. Geol.,

10, 347-359.

CHAKRAVARTI, S. (1966). A note on multivariate analysis of variance

test when dispersion matrices are different and unknown.


Calcutta Statist. Assoc. Bull., 15, 75-86.

CHAMBERS, J.M. (1977). Computational Methods for Data Analysis.

New York: Wiley.

CONSTANTINE, A.G. and GOWER, J.C. (1978). Graphical representation of

asymmetric matrices. Appl. Statist., 27, 297-304.

COX, D.R. (1968). Notes on some aspects of regression analysis.

J. R. Statist. Soc. A, 131, 265-279.

COX, D.R. and HINKLEY, D.V. (1974). Theoretical Statistics. London:

Chapman and Hall.

DEVLIN, Susan J., GNANADESIKAN, R. and KETTENRING, J.R. (1975). Robust

estimation and outlier detection with correlation coefficients.

Biometrika, 62, 531-545.

ELSTON, R.C. (1975). On the correlation between correlations.

Biometrika, 62, 133-140.

FISHER, R.A. (1921). On the probable error of a coefficient of

correlation deduced from a small sample. Metron, 1, 1-32.

FISHER, R.A. (1936). The use of multiple measurements in taxonomic

problems. Ann. Eugen., 7, 179-188.

FISHER, R.A. (1938). The statistical utilization of multiple

measurements. Ann. Eugen., 8, 376-386.

GEISSER, S. (1977). Discrimination, allocatory and separatory, linear

aspects. In CIassification and Clustering (J. Van Ryzin, ed.),

pp. 301-330. New York: Academic Press.

GNANADESIKAN, R. (1977). Methods for Statistical Data Analysis of

Multivariate Observations. New York: Wiley.

GNANADESIKAN, R. and KETTENRING, J.R. (1972). Robust estimates,

residuals, and outlier detection with multiresponse data.

Biometrics, 28, 81-124.


GOLDSTEIN, M. and SMITH, A.F.M. (1974). Ridge type estimators

for regression analysis. J.R. Statist. Soc. B, 36, 284-291.

GOWER, J.C. (1966). A Q-technique for the calculation of canonical

variates. Biometrika, 53, 588-590.

GOWER, J.C. (1971). Statistical methods of comparing different

multivariate analyses of the same data. In Mathematics in the

Archaeological and Historical Sciences (F.R. Hodson, D.G. Kendall

and P. Tautu, eds), pp. 138-149. Edinburgh: University Press.

GOWER, J.C. (1975). Generalized procrustes analysis. Psychometrika,

40, 33-51.

HAMPEL, F.R. (1973). Robust estimation: a condensed partial survey.

Z. Wahr. verw. Geb., 27, 87-104.

HAMPEL, F.R. (1974). The influence curve and its role in robust

estimation. J. Amer. Statist. Assoc., 69, 383-393.

HAMPEL, F.R. (1977). Modern trends in the theory of robustness.

Res. Rep. no. 13, Fachgruppe für Stat., Eidgenössische Technische

Hochschule, Zurich.

HEALY, M.J.R. (1968). Multivariate normal plotting. Appl. Statist.,

17, 157-161.

HILLS, M. (1969). On looking at large correlation matrices. Biometrika,

56, 249-253.

HINKLEY, D.V. (1978). Improving the jackknife with special reference

to correlation estimation. Biometrika, 65, 13-21.

HOGG, R.V. (1977). An introduction to robust procedures. Comm. Statist.-

Theor. Meth., A6, 789-794.

HOPPER, S.D. and CAMPBELL, N.A. (1977). A multivariate morphometric

study of taxonomic relationships in kangaroo paws (Anigozanthos

Labill. and Macropidia Drumm. ex Harv.: Haemodoraceae). Aust. J.

Bot., 25, 523-544.


HOTELLING, H. (1936). Relations between two sets of variates.

Biometrika, 28, 321-377.

HUBER, P.J. (1964). Robust estimation of a location parameter.

Ann. Math. Statist., 35, 73-101.

HUBER, P.J. (1972). Robust statistics: A review. Ann. Math. Statist.,

43, 1041-1067.

HUBER, P.J. (1977a). Robust covariances. In Statistical Decision Theory

and Related Topics II (Shanti S. Gupta and David S. Moore, eds),

pp. 165-191. New York: Academic Press.

HUBER, P.J. (1977b). Robust Statistical Procedures. Philadelphia:

SIAM.

JAMES, G.S. (1954). Tests of linear hypotheses in univariate and

multivariate analysis when the ratios of the population variances

are unknown. Biometrika, 41, 19-43.

JOHNSON, N.L. and KOTZ, S. (1970). Continuous Univariate Distributions - 2.

Boston, Mass.: Houghton Mifflin.

KSHIRSAGAR, A.M. (1972). Multivariate Analysis. New York: Marcel

Dekker.

LAYARD, M.W.J. (1972). Large sample tests for the equality of two

covariance matrices. Ann. Math. Statist., 43, 123-141.

LAYARD, M.W.J. (1974). A Monte Carlo comparison of tests for equality

of covariance matrices. Biometrika, 61, 461-465.

MANDEL, J. (1961). Non-additivity in two-way analysis of variance.

J. Amer. Statist. Ass., 56, 878-888.

MANDEL, J. (1971). A new analysis of variance model for non-additive

data. Technometrics, 13, 1-18.

MARONNA, R.A. (1976). Robust M-estimators of multivariate location and

scatter. Ann. Statist., 4, 51-67.

MARSAGLIA, G. and BRAY, T.A. (1964). A convenient method for generating

normal variables. SIAM Rev., 6, 260-264.


MORRISON, D.F. (1976). Multivariate Statistical Methods, second

edition. New York: McGraw-Hill.

NELDER, J.A. and MEAD, R. (1965). A simplex method for function

minimization. Computer J., 7, 308-313.

NEWMAN, T.G. and ODELL, P.L. (1971). The Generation of Random Variates.

London: Griffin.

OLSON, C.L. (1974). Comparative robustness of six tests in multivariate

analysis of variance. J. Amer. Statist. Assoc., 69, 894-908.

PHILLIPS, B.F., CAMPBELL, N.A. and WILSON, B.R. (1973). A multivariate

study of geographic variation in the whelk Dicathais. J. Exp.

Mar. Biol. Ecol., 11, 29-63.

RADCLIFFE, J. (1966). Factorizations of the residual likelihood

criterion in discriminant analysis. Proc. Camb. Phil. Soc., 62,

743-752.

RADCLIFFE, J. (1967). A note on an approximate factorization in

discriminant analysis. Biometrika, 54, 665-668.

RANDLES, R.H., BROFFITT, J.D., RAMBERG, J.S. and HOGG, R.V. (1978).

Generalized linear and quadratic discriminant functions using

robust estimates. J. Amer. Statist. Ass., 73, 564-568.

RAO, C.R. (1948). The utilization of multiple measurements in problems

of biological classification. J. R. Statist. Soc. B, 10, 159-193.

RAO, C.R. (1952). Advanced Statistical Methods in Biometric Research.

New York: Wiley

RAO, C.R. (1970). Inference on discriminant function coefficients.

In Essays on Probability and Statistics (R.C. Bose, et al., eds),

pp. 587-602. Chapel Hill: University of North Carolina and

Statistical Publishing Society.

RAO, C.R. (1973). Linear Statistical Inference and its Applications,

second edition. New York: Wiley.


REMPE, U. and WEBER, E.E. (1972). An illustration of the principal

ideas of MANOVA. Biometrics, 28, 235-238.

ROY, S.N., GNANADESIKAN, R. and SRIVASTAVA, J.N. (1971). Analysis and

Design of Certain Quantitative Multiresponse Experiments. Oxford:

Pergamon.

SCHATZOFF, M. (1966). Sensitivity comparisons among tests of the general

linear hypothesis. J. Amer. Statist. Assoc., 61, 415-435.

SCHEFFE, H. (1959). The Analysis of Variance. New York: Wiley.

SIBSON, R. (1978). Studies in the robustness of multidimensional

scaling. J.R. Statist. Soc. B, 40, 234-238.

SPRENT, P. (1969). Models in Regression and Related Topics. London: Methuen.

TUKEY, J.W. (1949). One degree of freedom for non-additivity.

Biometrics, 5, 232-242.

TUKEY, P.A. (1976). Statistical models with covariance constraints.

Ph.D. Thesis, University of London.

WELCH, B.L. (1939). Note on discriminant functions. Biometrika,

31, 218-220.

WILK, M.B., GNANADESIKAN, R. and HUYETT, Miss M.J. (1962). Estimation

of the parameters of the gamma distribution using order statistics.

Biometrika, 49, 525-545.

WILLIAMS, E.J. (1952). The interpretation of interactions in factorial

experiments. Biometrika, 39, 65-81.

WILLIAMS, E.J. (1961). Tests for discriminant functions. J. Austral.

Math. Soc., 2, 243-252.

WILLIAMS, E.J. (1967). The analysis of association among many variates

(with Discussion). J.R. Statist. Soc. B, 29, 199-242.

WILSON, E.B. and HILFERTY, M.M. (1931). The distribution of chi-square.

Proc. Nat. Acad. Sci., Washington, 17, 684-688.

YATES, F. and COCHRAN, W.G. (1938). The analysis of groups of experiments.

J. Agric. Sci., 28, 556-580.