Running head: MULTIVARIATE TESTS IN INDEPENDENT GROUPS DESIGNS

Multivariate Tests of Means in Independent Groups Designs

Effects of Covariance Heterogeneity and Non-Normality

Lisa M. Lix and H. J. Keselman

University of Manitoba

Author Contact Information

Lisa Lix

Department of Community Health Sciences, Faculty of Medicine

408-727 McDermot Avenue

University of Manitoba

Winnipeg, MB R3E 3P5

e-mail: lisa_lix@cpe.umanitoba.ca

Biographical Information

Lisa Lix is Assistant Professor, Department of Community Health Sciences, University of Manitoba, Winnipeg, Canada. Her research interests are in the areas of longitudinal data analysis, multivariate methods, and robust estimation and testing. Her current publications are found in British Journal of Mathematical and Statistical Psychology, Multivariate Behavioral Research, Psychophysiology, and Journal of Community Health and Epidemiology.

Harvey Keselman is Professor of Psychology, University of Manitoba, Winnipeg, Canada. His areas of interest include the analysis of repeated measurements, multiple comparison procedures, and robust estimation and testing. His publications have appeared in journals such as British Journal of Mathematical and Statistical Psychology, Educational and Psychological Measurement, Journal of Educational and Behavioral Statistics, Psychological Methods, Psychometrika, Psychophysiology, and Statistics in Medicine.

Abstract

Health evaluation research often employs multivariate designs in which data on several outcome variables are obtained for independent groups of subjects. This article examines statistical procedures for testing hypotheses of multivariate mean equality in two-group designs. The conventional test for multivariate means, Hotelling's T², rests on certain assumptions about the distribution of the data and the population variances and covariances. When these assumptions are violated, which is often the case in applied health research, T² may result in invalid conclusions about the null hypothesis. We describe procedures that are robust, or insensitive, to assumption violations. A numeric example illustrates the statistical concepts that are presented, and a computer program to implement these robust solutions is introduced.

Health evaluation research frequently involves the collection of multiple outcome measurements on two or more groups of subjects. For example, Harasym et al. (1996) obtained scores on multiple personal traits using the Myers-Briggs inventory for nursing students with three different styles of learning. Knapp and Miller (1983) discuss the measurement of multiple dimensions of healthcare quality as part of a system-wide evaluation program.

In the simplest case, where there are only two groups, such as a case and a control group, Hotelling's (1931) T² is the traditional method for testing that the means for the set of outcome variables are equivalent across groups (i.e., the hypothesis of multivariate mean equality). The T² statistic is the analogue of Student's two-group t statistic for testing equality of group means for a single outcome variable. While Hotelling's T² is the most common choice in the multivariate context, most applied researchers are unaware that it rests on a set of derivational assumptions that are not likely to be satisfied in evaluation research. Specifically, this test assumes that the outcome measurements follow a multivariate normal distribution and exhibit a common covariance structure. Multivariate normality is the assumption that all outcome variables and all linear combinations of the variables are normally distributed. The assumption of a common covariance matrix is that the populations will exhibit the same variances and covariances for all of the outcome variables. For example, with two groups and three dependent variables, the assumption of covariance homogeneity is that the three variances and three covariances are equivalent for the two populations from which the data were sampled.

The T² test is not robust to assumption violations, meaning that it is sensitive to changes in those factors which are extraneous to the hypothesis of interest (Ito, 1980). In fact, this test may become seriously biased when assumptions are not satisfied, resulting in spurious decisions about the null hypothesis. Moreover, the assumptions of normality and covariance homogeneity are not likely to be satisfied in practice. Outliers or extreme observations are often a significant concern in evaluation research (see, e.g., Sharmer, 2001). Furthermore, subjects who are exposed to a particular healthcare treatment or intervention may exhibit greater variability on the outcome measures than subjects who are not exposed to it (see, e.g., Grissom, 2000; Hill & Dixon, 1982; Hoover, 2002). Consequently, researchers who rely on Hotelling's (1931) T² procedure to test hypotheses about equality of multivariate group means may unwittingly be filling the literature with non-replicable results or, at other times, may fail to detect intervention effects when they are present. This should be of significant concern to health evaluation researchers because the results of statistical tests are routinely used to make decisions about the effectiveness of clinical interventions and to plan healthcare program content and delivery. In this era of evidence-based decision making, it is important to ensure that the statistical procedures that are applied to evaluation data will produce valid results.

Within the last 50 years, a number of statistical procedures that are robust to violations of the assumption of covariance homogeneity have been proposed in the literature. However, these procedures are largely unknown to applied researchers and therefore are not likely to be adopted in practice. Moreover, all of these procedures are sensitive to departures from multivariate normality. Recent research shows that it is possible to obtain a test that is robust to the combined effects of covariance heterogeneity and non-normality. This involves substituting robust measures of location and scale for the usual mean and covariances in tests that are insensitive to covariance heterogeneity. These robust measures are less affected by the presence of outlying scores or skewed distributions than traditional measures.

The purpose of this paper is to introduce health evaluation researchers to both the concepts and the applications of robust test procedures for multivariate data. This paper begins with an introduction to the statistical notation that will be helpful in understanding the concepts. This is followed by a discussion of procedures that can be used to test the hypothesis of multivariate mean equality when statistical assumptions are and are not satisfied. We will then show how to obtain a test that is robust to the combined effects of covariance heterogeneity and multivariate non-normality. Throughout this presentation, a numeric example will help to illustrate the concepts and computations. Finally, we demonstrate a computer program that can be used to implement the statistical tests described in this paper. Robust procedures are largely inaccessible to applied researchers because they have not yet been incorporated into extant statistical software packages. The program that we introduce will be beneficial to evaluation researchers who want to test hypotheses of mean equality in multivariate designs but are concerned about whether their data may violate the assumptions which underlie conventional methods of analysis.

Statistical Notation

Consider the case of a single outcome variable. Let Yij represent the measurement on that outcome variable for the ith subject in the jth group (i = 1, …, nj; j = 1, 2). Under a normal theory model, it is assumed that Yij follows a normal distribution with mean μj and variance σj². When there are only two groups of subjects, the null hypothesis is

$$H_0\colon \mu_1 = \mu_2. \tag{1}$$

In other words, one wishes to test whether the population means for the outcome variable are equivalent.

To generalize to the multivariate context, assume that we have measurements for each subject on p outcome variables. In other words, instead of just a single value, we now have a set of p values for each subject. Using matrix notation, Yij represents the vector (i.e., row) of p outcome measurements for the ith subject in the jth group, that is, Yij = [Yij1 … Yijp]. For example, Yij may represent the values for a series of measures of physical function or attitudes towards a healthcare intervention. It is assumed that Yij follows a multivariate normal distribution with mean μj and variance-covariance matrix Σj (i.e., Yij ~ N[μj, Σj]). The vector μj contains the mean scores on each outcome variable, that is, μj = [μj1 μj2 … μjp]. The variance-covariance matrix Σj is a p x p matrix with the variances for each outcome variable on the diagonal and the covariances for all pairs of outcome variables on the off-diagonal:

$$\boldsymbol{\Sigma}_j = \begin{bmatrix} \sigma_{j1}^2 & \cdots & \sigma_{j1p} \\ \vdots & \ddots & \vdots \\ \sigma_{jp1} & \cdots & \sigma_{jp}^2 \end{bmatrix}.$$

The null hypothesis

$$H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 \tag{2}$$

is used to test whether the means for the set of p outcome variables are equal across the two groups.

As Knapp and Miller (1983) observe, adopting a test of multivariate (i.e., joint) equivalence is preferable to adopting multiple tests of univariate equivalence, particularly when the outcome variables are correlated. Type I errors, which are erroneous conclusions about true null hypotheses, may occur when multiple univariate tests are performed. If the outcome variables are independent, the probability that at least one erroneous decision will be made on the set of p outcome variables is 1 − (1 − α)^p, which is approximately equal to αp when α is small, where α is the nominal level of significance. For example, with α = .05 and p = 3, the probability of making at least one erroneous decision is ≈ .14.
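As a worked check of the figure just quoted, under the stated assumption that the p outcome variables are independent, the arithmetic for α = .05 and p = 3 is:

```latex
1 - (1 - \alpha)^p = 1 - (1 - .05)^3 = 1 - .857375 = .142625 \approx .14,
\qquad
\alpha p = 3(.05) = .15 .
```

The approximation αp slightly overstates the exact probability, and the gap widens as α or p increases.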

To illustrate the multivariate concepts that have been presented to this point, we will use the example data set of Table 1. These data are for two groups of subjects and two outcome variables. Let nj represent the sample size for the jth group. The example data are for an unbalanced design (i.e., unequal group sizes), where n1 = 6 and n2 = 8. The vector of scores for the first subject of group 1 is Y11 = [1 51], the vector for the second subject of group 1 is Y21 = [28 48], and so on.

Let Ȳj and Sj represent the sample mean vector and sample covariance matrix for the jth group. Table 2 contains these summary statistics for the example data set. The mean scores for the first outcome variable are 13.0 and 19.4 for groups 1 and 2, respectively. For the second outcome variable, the corresponding means are 49.7 and 47.8. The variances for groups 1 and 2 on the first outcome measure are 74.8 and 2.3, respectively. The larger variance for the first group is primarily due to the presence of two extreme values of 1 and 28 for the first and second subjects, respectively. The corresponding variances on the second outcome variable are 3.9 and 3.6. For group 1, the covariance for the two outcome variables is −7.0. The population correlation for two variables q and q′ (i.e., ρqq′; q, q′ = 1, …, p) can be obtained from the covariance and the variances,

$$\rho_{qq'} = \frac{\sigma_{qq'}}{\sqrt{\sigma_q^2\,\sigma_{q'}^2}},$$

where σqq′ is the covariance and σq² is the variance for the qth outcome variable. The sample correlation coefficient, rqq′, is used to estimate the population correlation coefficient. In the example data set of Table 1, a moderate negative correlation of rqq′ = −.4 exists for the two variables for group 1.
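The group 1 summary statistics quoted above can be reproduced with a short Python sketch. It is offered only as an illustration and is not the program described later in the paper; it assumes that NumPy is available, and the function name summary_statistics is hypothetical. The scores are the six group 1 observations from the data listing given with the computer program.

```python
import numpy as np

# Group 1 scores from Table 1 (rows = subjects, columns = outcome variables).
group1 = np.array(
    [[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]], dtype=float
)

def summary_statistics(y):
    """Return the mean vector, covariance matrix, and correlation matrix."""
    mean = y.mean(axis=0)
    cov = np.cov(y, rowvar=False)      # unbiased estimator, divisor n - 1
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)      # r_qq' = s_qq' / sqrt(s_q^2 * s_q'^2)
    return mean, cov, corr

mean1, cov1, corr1 = summary_statistics(group1)
print(np.round(mean1, 1))   # means of the two outcome variables
print(np.round(cov1, 1))    # variances on the diagonal, covariance off it
print(np.round(corr1, 2))   # sample correlation matrix
```

The off-diagonal element of the correlation matrix is the covariance divided by the square root of the product of the two variances, which is the sample version of the expression for ρqq′.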

Tests for Mean Equality when Assumptions are Satisfied

Student's t statistic is the conventional procedure for testing the null hypothesis of equality of population means in a univariate design (i.e., equation 1). The test statistic, which assumes equality of population variances, is

$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \tag{3}$$

where Ȳj represents the mean for the jth group and s², the variance that is pooled for the two groups, is

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \tag{4}$$

where sj² is the variance for the jth group.

The multivariate Hotelling's (1931) T² statistic is formed from equation 3 by replacing means with mean vectors and the pooled variance with the pooled covariance matrix,

$$T_1 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}\right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}, \tag{5}$$

where T is the transpose operator, which is used to convert the row vector (Ȳ1 − Ȳ2) to a column vector, −1 denotes the inverse of a matrix, and S is the pooled sample covariance matrix,

$$\mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2}. \tag{6}$$

This test statistic is easily obtained from standard software packages such as SAS (SAS Institute, 1999b). Statistical significance of this T₁ statistic is evaluated using the T² distribution. The test statistic can also be converted to an F statistic,

$$F_T = \frac{N - p - 1}{p(N - 2)}\,T_1, \tag{7}$$

where N = n1 + n2. Statistical significance is then assessed by comparing the FT statistic to its critical value, F[p, N − p − 1], that is, a critical value from the F distribution with p and N − p − 1 degrees of freedom (df).
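The computation of the pooled-covariance statistic and its F conversion can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' program: NumPy is assumed to be available, the function name hotelling_t1 is hypothetical, and the two small data sets are invented for the example.

```python
import numpy as np

def hotelling_t1(y1, y2):
    """Return the pooled-covariance statistic, its F conversion, and the F df."""
    # Rows are subjects; a plain list of scores is treated as p = 1.
    y1 = np.asarray(y1, dtype=float).reshape(len(y1), -1)
    y2 = np.asarray(y2, dtype=float).reshape(len(y2), -1)
    n1, p = y1.shape
    n2 = y2.shape[0]
    diff = y1.mean(axis=0) - y2.mean(axis=0)
    s1 = np.atleast_2d(np.cov(y1, rowvar=False))
    s2 = np.atleast_2d(np.cov(y2, rowvar=False))
    # Pooled covariance matrix, weighting each group by n_j - 1.
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    # Solve a linear system rather than forming the inverse explicitly.
    t1 = float(diff @ np.linalg.solve((1 / n1 + 1 / n2) * pooled, diff))
    n_total = n1 + n2
    f_t = (n_total - p - 1) / (p * (n_total - 2)) * t1
    return t1, f_t, (p, n_total - p - 1)

# Hypothetical scores for two small groups on p = 2 outcome variables.
group_a = [[4, 10], [6, 12], [5, 9], [7, 13], [8, 11]]
group_b = [[6, 14], [9, 13], [8, 15], [10, 16], [7, 12], [9, 17]]
t1, f_t, df = hotelling_t1(group_a, group_b)
print(round(t1, 3), round(f_t, 3), df)
```

With a single outcome variable (p = 1) the statistic reduces to the square of the pooled-variance t statistic, and the F conversion leaves it unchanged, which offers a simple check on any implementation.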

When the data are sampled from populations that follow a normal distribution but have unequal covariance matrices (i.e., Σ1 ≠ Σ2), Hotelling's (1931) T² will generally maintain the rate of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to α if the design is balanced (i.e., n1 = n2; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins & Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will result, depending on the nature of the relationship between the covariance matrices and group sizes. A liberal result is one in which the actual Type I error rate will exceed α. Liberal results are problematic; researchers will be filling the literature with false positives (i.e., saying there are treatment effects when none are present). A conservative result, on the other hand, is one in which the Type I error rate will be less than α. Conservative results are also a cause for concern because they may result in test procedures that have low statistical power to detect true differences in population means (i.e., real effects will be undetected).

If the group with the largest sample size also exhibits the smallest element values of Σj, which is known as a negative pairing condition, the test will be liberal. For example, Hopkins and Clay (1963) showed that when group sizes were 10 and 20 and the ratio of the largest to the smallest standard deviations of the groups was 1.6, the true rate of Type I errors for α = .05 was .11. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate was .21, more than four times the nominal level of significance. For positive pairings of group sizes and covariance matrices, such that the group with the largest sample size also exhibits the largest element values of Σj, the T² procedure tends to produce a conservative test. In fact, the error rate may be substantially below the nominal level of significance. For example, Hopkins and Clay observed Type I error rates of .02 and .01, respectively, for α = .05 for positive pairings for the two standard deviation values noted previously. These liberal and conservative results for normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986) for both moderate and large degrees of covariance heterogeneity.
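The liberal and conservative tendencies described above can be explored with a small Monte Carlo sketch in Python. It is an illustration under assumptions of our own choosing and not a replication of any cited study: the design uses p = 2 uncorrelated normal outcome variables, group sizes of 10 and 20, and a standard deviation ratio of 3.2; NumPy is assumed to be available; and the function names are hypothetical. Because p = 2, the F critical value has a closed form, since an F variable with 2 and ν2 df satisfies P(F > x) = (1 + 2x/ν2)^(−ν2/2). No rejection rates are quoted here because they depend on the seed and the number of replications.

```python
import numpy as np

def hotelling_f(y1, y2):
    """F conversion of the pooled-covariance Hotelling statistic."""
    n1, p = y1.shape
    n2 = y2.shape[0]
    diff = y1.mean(axis=0) - y2.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(y1, rowvar=False)
              + (n2 - 1) * np.cov(y2, rowvar=False)) / (n1 + n2 - 2)
    t1 = diff @ np.linalg.solve((1 / n1 + 1 / n2) * pooled, diff)
    n_total = n1 + n2
    return (n_total - p - 1) / (p * (n_total - 2)) * t1

def rejection_rate(n1, sd1, n2, sd2, alpha=0.05, reps=2000, seed=1):
    """Empirical Type I error rate when both population mean vectors are zero."""
    rng = np.random.default_rng(seed)
    nu2 = n1 + n2 - 3                                 # N - p - 1 with p = 2
    crit = (nu2 / 2) * (alpha ** (-2 / nu2) - 1)      # closed form for 2 numerator df
    rejections = 0
    for _ in range(reps):
        y1 = rng.normal(0.0, sd1, size=(n1, 2))
        y2 = rng.normal(0.0, sd2, size=(n2, 2))
        if hotelling_f(y1, y2) > crit:
            rejections += 1
    return rejections / reps

# Negative pairing: the larger group has the smaller variances.
negative = rejection_rate(n1=10, sd1=3.2, n2=20, sd2=1.0)
# Positive pairing: the larger group has the larger variances.
positive = rejection_rate(n1=10, sd1=1.0, n2=20, sd2=3.2)
print(f"negative pairing: {negative:.3f}, positive pairing: {positive:.3f}")
```

Under these assumptions, the negative pairing should yield an empirical rate above the nominal .05 and the positive pairing a rate below it, in line with the pattern reported in the studies cited above.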

When the assumption of multivariate normality is violated, the performance of Hotelling's (1931) T² test depends on both the degree of departure from a multivariate normal distribution and the nature of the research design. The earliest research on Hotelling's T² when the data are non-normal suggested that tests of the null hypothesis for two groups were relatively insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true when the data are only moderately non-normal. However, Everitt (1979) showed that this test procedure can become quite conservative when the distribution is skewed or when outliers are present in the tails of the distribution, particularly when the design is unbalanced (see also Zwick, 1986).

Tests for Mean Equality when Assumptions are not Satisfied

Both parametric and nonparametric alternatives to the T² test have been proposed in the literature. Applied researchers often regard nonparametric procedures as appealing alternatives because they rely on rank scores, which are typically perceived as being easy to conceptualize and interpret. However, these procedures test hypotheses about equality of distributions rather than equality of means. They are therefore sensitive to covariance heterogeneity, because distributions with unequal variances will necessarily result in rejection of the null hypothesis. Zwick (1986) showed that nonparametric alternatives to Hotelling's (1931) T² could control the Type I error rate when the data were sampled from non-normal distributions and covariances were equal. Not surprisingly, when covariances were unequal, these procedures produced biased results, particularly when group sizes were unequal.

Covariance heterogeneity. There are several parametric alternatives to Hotelling's (1931) T² test. These include the Brown-Forsythe (Brown & Forsythe, 1974 [BF]), James (1954) first and second order (J1 & J2), Johansen (1980 [J]), Kim (1992 [K]), Nel and Van der Merwe (1986 [NV]), and Yao (1965 [Y]) procedures. The BF, J1, J2, and J procedures have also been generalized to multivariate designs containing more than two groups of subjects.

The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic,

$$T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\left(\frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2}\right)^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}. \tag{8}$$

The J, NV, and Y procedures approximate the distribution of T₂ differently because the df for these three procedures are computed using different formulas. The J1 and J2 procedures each use a different critical value to assess statistical significance. However, they both rely on large-sample theory regarding the distribution of the test statistic in equation 8. What this means is that when sample sizes are sufficiently large, the T₂ statistic approximately follows a chi-squared (χ²) distribution. For both procedures, this test statistic is referred to an "adjusted" χ² critical value. If the test statistic exceeds that critical value, the null hypothesis of equation 2 is rejected. The critical value for the J1 procedure is slightly smaller than the one for the J2 procedure. As a result, the J1 procedure generally produces larger Type I error rates than the J2 procedure and therefore is not often recommended (de la Ray & Nel, 1993). While the J2 procedure may offer better Type I error control, it is computationally complex. The critical value for J2 is described in the Appendix, along with the F-statistic conversions and df computations for the J, NV, and Y procedures.
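The statistic shared by these procedures, in which each group covariance matrix is divided by its own sample size rather than pooled, can be sketched in Python as follows. Only the statistic is computed; the adjusted critical values and df approximations are given in the Appendix and are not reproduced here. NumPy is assumed to be available, and the function name separate_covariance_t2 and the data are hypothetical.

```python
import numpy as np

def separate_covariance_t2(y1, y2):
    """Return the two-group statistic that does not pool the covariance matrices."""
    # Rows are subjects; a plain list of scores is treated as p = 1.
    y1 = np.asarray(y1, dtype=float).reshape(len(y1), -1)
    y2 = np.asarray(y2, dtype=float).reshape(len(y2), -1)
    n1, n2 = y1.shape[0], y2.shape[0]
    diff = y1.mean(axis=0) - y2.mean(axis=0)
    s1 = np.atleast_2d(np.cov(y1, rowvar=False))
    s2 = np.atleast_2d(np.cov(y2, rowvar=False))
    # Each group's covariance matrix is divided by its own sample size.
    return float(diff @ np.linalg.solve(s1 / n1 + s2 / n2, diff))

# Hypothetical scores for two small groups on p = 2 outcome variables.
group_a = [[4, 10], [6, 12], [5, 9], [7, 13], [8, 11]]
group_b = [[6, 14], [9, 13], [8, 15], [10, 16], [7, 12], [9, 17]]
t2 = separate_covariance_t2(group_a, group_b)
print(round(t2, 3))
```

In a balanced design (n1 = n2), the sum S1/n1 + S2/n2 equals the pooled covariance matrix multiplied by (1/n1 + 1/n2), so this statistic coincides with the pooled-covariance statistic; the two diverge only when group sizes differ.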

The K procedure is based on an F statistic. It is more complex than preceding test procedures because eigenvalues and eigenvectors of the group covariance matrices must be computed.¹ For completeness, the formula used to compute the K test statistic and its df are found in the Appendix.

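The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly from the one presented in equation 8,

$$T_{BF} = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\left[\left(1 - \frac{n_1}{N}\right)\mathbf{S}_1 + \left(1 - \frac{n_2}{N}\right)\mathbf{S}_2\right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}. \tag{9}$$

As one can see, the test statistic in equation 9 weights the group covariance matrices in a different way than the test statistic in equation 8. Again, for completeness, the numeric solutions for the BF F statistic and df are found in the Appendix.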

Among the BF, J2, J, K, NV, and Y tests, there appears to be no one best choice in all data-analytic situations when the data are normally distributed, although a comprehensive comparison of all of these procedures has not yet been conducted. Factors such as the degree of covariance heterogeneity, total sample size, the degree of imbalance of the group sizes, and the relationship between the group sizes and covariance matrices will determine which procedure will afford the best Type I error control and maximum statistical power to detect group differences. Christensen and Rencher (1997) noted, in their extensive comparison among the J2, J, K, NV, and Y procedures, that the J and Y procedures could occasionally result in inflated Type I error rates for negative pairings of group sizes and covariance matrices. These liberal tendencies were exacerbated as the number of outcome variables increased. The authors recommended the K procedure overall, observing that it offered the greatest statistical power among those procedures that never produced inflated Type I error rates. However, the authors report a number of situations of covariance heterogeneity in which the K procedure could become quite conservative. Type I error rates as low as .02 for α = .05 were reported when p = 10, n1 = 30, and n2 = 20.

For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that the J1, J2, J, and Y procedures could not control Type I error rates when the underlying population distributions were highly skewed. For a lognormal distribution, which has skewness of 6.18,² they observed many instances in which empirical Type I error rates of all of these procedures were more than four times the nominal level of significance. Wilcox (1995) found that the J test produced excessive Type I errors when sample sizes were small (i.e., n1 = 12 and n2 = 18) and the data were generated from non-normal distributions; the K procedure became conservative when the skewness was 6.18. For larger group sizes (i.e., n1 = 24 and n2 = 36), the J procedure provided acceptable control of Type I errors when the data were only moderately non-normal, but for the maximum skewness considered, it also became conservative. Fouladi and Yockey (2002) found that the degree of departure from a multivariate normal distribution was a less important predictor of Type I error performance than sample size. Across the range of conditions which they examined, the Y test produced the greatest average Type I error rates and the NV procedure the smallest. Error rates were only slightly influenced by the degree of skewness or kurtosis of the data; however, these authors looked at only very modest departures from a normal distribution; the maximum degree of skewness considered was .75.

Non-normality. For univariate designs, a test procedure that is robust to the biasing effects of non-normality may be obtained by adopting estimators of location and scale that are insensitive to the presence of extreme scores and/or a skewed distribution (Keselman, Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that have been proposed in the literature; among these, the trimmed mean has received a great deal of attention because of its good theoretical properties, ease of computation, and ease of interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the most extreme scores in the distribution. Hence, one removes the effects of the most extreme scores, which have the tendency to "shift" the mean in their direction.

One should recognize at the outset that while robust estimators are insensitive to departures from a normal distribution, they test a different null hypothesis than least-squares estimators. The null hypothesis is about equality of trimmed population means. In other words, one is testing a hypothesis that focuses on the bulk of the population rather than the entire population. Thus, if one subscribes to the position that inferences pertaining to robust parameters are more valid than inferences pertaining to the usual least-squares parameters, then procedures based on robust estimators should be adopted.

To illustrate the computation of the trimmed mean, let Y(1)j ≤ Y(2)j ≤ … ≤ Y(nj)j represent the ordered observations for the jth group on a single outcome variable. In other words, one begins by ordering the observations for each group from smallest to largest. Then, let gj = [γnj], where γ represents the proportion of observations that are to be trimmed in each tail of the distribution and [x] is the greatest integer ≤ x. The effective sample size for the jth group is defined as hj = nj − 2gj. The sample trimmed mean,

$$\bar{Y}_{tj} = \frac{1}{h_j}\sum_{i=g_j+1}^{n_j-g_j} Y_{(i)j}, \tag{10}$$

is computed by censoring the gj smallest and the gj largest observations. The most extreme scores for each group of subjects are trimmed independently of the extreme scores for all other groups. A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent trimming is generally recommended (Wilcox, 1995a).
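The steps just described (ordering the scores, computing the number trimmed per tail and the effective sample size, and averaging what remains) can be sketched in Python. The function name trimmed_mean is hypothetical and the sketch uses only the standard library; the scores are the group 1 values on the first outcome variable from the example data.

```python
import math

def trimmed_mean(scores, trim=0.20):
    """Return the trimmed mean, g (scores cut per tail), and h (effective n)."""
    ordered = sorted(scores)
    n = len(ordered)
    # g = [trim * n]; the tiny constant guards against floating-point error.
    g = int(math.floor(trim * n + 1e-9))
    h = n - 2 * g
    kept = ordered[g:n - g]        # drop the g smallest and g largest scores
    return sum(kept) / h, g, h

group1_variable1 = [1, 28, 12, 13, 13, 11]
mean_t, g, h = trimmed_mean(group1_variable1)
print(mean_t, g, h)   # 12.25 1 4
```

With six scores and 20 percent trimming, one score is removed from each tail, so the extreme values of 1 and 28 no longer influence the estimate.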

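The Winsorized variance is the theoretically correct measure of scale that corresponds to the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first computed,

$$\bar{Y}_{wj} = \frac{1}{n_j}\sum_{i=1}^{n_j} X_{ij}, \tag{11}$$

where

$$X_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j}, \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j}, \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j}. \end{cases}$$

The Winsorized mean is obtained by replacing the gj smallest values with the next most extreme value and the gj largest values with the next most extreme value. The Winsorized variance for the jth group on a single outcome variable, swj², is

$$s_{wj}^2 = \frac{1}{n_j - 1}\sum_{i=1}^{n_j}\left(X_{ij} - \bar{Y}_{wj}\right)^2, \tag{12}$$

and the standard error of the trimmed mean is √[(nj − 1)swj² / (hj(hj − 1))]. The Winsorized covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is

$$s_{wjqq'} = \frac{1}{n_j - 1}\sum_{i=1}^{n_j}\left(X_{ijq} - \bar{Y}_{wjq}\right)\left(X_{ijq'} - \bar{Y}_{wjq'}\right), \tag{13}$$

and the Winsorized covariance matrix for the jth group is

$$\mathbf{S}_{wj} = \begin{bmatrix} s_{wj1}^2 & \cdots & s_{wj1p} \\ \vdots & \ddots & \vdots \\ s_{wjp1} & \cdots & s_{wjp}^2 \end{bmatrix}.$$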

To illustrate, we return to the data set of Table 1. For the first outcome variable for the first group, the ordered observations are

1, 11, 12, 13, 13, 28

and with 20% trimming, g1 = [6 x .20] = 1. The scores of 1 and 28 are removed and the mean of the remaining scores is computed, which produces Ȳt1 = 12.2. Table 2 contains the vectors of trimmed means for the two groups.

To Winsorize the data set for the first group on the first outcome variable, the largest and smallest values in the set of ordered observations are replaced by the next most extreme scores, producing the following set of ordered observations:

11, 11, 12, 13, 13, 13

The Winsorized mean, Ȳw1, is 12.2. While in this example the Winsorized mean has the same value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result. Table 2 contains the Winsorized covariance matrices for the two groups.
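The Winsorizing step for a single outcome variable can be sketched in Python. The function names are hypothetical and only the standard library is used; the Winsorized variance uses the n − 1 divisor and the standard error of the trimmed mean follows the expression given earlier in this section. The scores are again the group 1 values on the first outcome variable.

```python
import math

def winsorize(scores, trim=0.20):
    """Replace the g smallest and g largest scores by the next most extreme ones."""
    ordered = sorted(scores)
    n = len(ordered)
    g = int(math.floor(trim * n + 1e-9))   # guard against floating-point error
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(x, low), high) for x in scores], g

def winsorized_summary(scores, trim=0.20):
    """Return the Winsorized mean, Winsorized variance, and SE of the trimmed mean."""
    w, g = winsorize(scores, trim)
    n = len(w)
    h = n - 2 * g
    mean_w = sum(w) / n
    var_w = sum((x - mean_w) ** 2 for x in w) / (n - 1)
    se_trimmed = math.sqrt((n - 1) * var_w / (h * (h - 1)))
    return mean_w, var_w, se_trimmed

group1_variable1 = [1, 28, 12, 13, 13, 11]
mean_w, var_w, se_t = winsorized_summary(group1_variable1)
print(round(mean_w, 3), round(var_w, 3), round(se_t, 3))   # 12.167 0.967 0.635
```

The Winsorized variance of about 0.97 is far smaller than the least-squares variance of 74.8 for the same scores, which shows how strongly the two extreme values inflate the usual estimate.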

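A test which is robust to the biasing effects of both multivariate non-normality and covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test procedures and substituting the trimmed means and the Winsorized covariance matrix for the least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust estimators, the T₁ statistic of equation 5 becomes

$$T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})\left[\left(\frac{1}{h_1} + \frac{1}{h_2}\right)\mathbf{S}_w\right]^{-1}(\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^{\mathsf{T}}, \tag{14}$$

where

$$\mathbf{S}_w = \frac{(n_1 - 1)\mathbf{S}_{w1} + (n_2 - 1)\mathbf{S}_{w2}}{h_1 + h_2 - 2}. \tag{15}$$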
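The substitution of robust estimators into the pooled statistic can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the SAS/IML program described in the next section: NumPy is assumed to be available, all function names are hypothetical, each outcome variable is trimmed and Winsorized separately, and the pooled Winsorized covariance matrix is assumed to weight each group by n − 1 and to be divided by h1 + h2 − 2. The second data set is invented for the example.

```python
import numpy as np

def trim_count(n, trim):
    """Number of scores trimmed from each tail, g = [trim * n]."""
    return int(np.floor(trim * n + 1e-9))

def trimmed_mean_vector(y, trim):
    n = y.shape[0]
    g = trim_count(n, trim)
    y_sorted = np.sort(y, axis=0)            # order each variable separately
    return y_sorted[g:n - g].mean(axis=0)

def winsorized_cov(y, trim):
    n = y.shape[0]
    g = trim_count(n, trim)
    y_sorted = np.sort(y, axis=0)
    low, high = y_sorted[g], y_sorted[n - g - 1]
    # Pull extreme scores in to the next most extreme value, per variable.
    return np.atleast_2d(np.cov(np.clip(y, low, high), rowvar=False))

def robust_t1(y1, y2, trim=0.20):
    """Pooled statistic with trimmed means and Winsorized covariances."""
    y1 = np.asarray(y1, dtype=float).reshape(len(y1), -1)
    y2 = np.asarray(y2, dtype=float).reshape(len(y2), -1)
    n1, n2 = y1.shape[0], y2.shape[0]
    h1 = n1 - 2 * trim_count(n1, trim)
    h2 = n2 - 2 * trim_count(n2, trim)
    diff = trimmed_mean_vector(y1, trim) - trimmed_mean_vector(y2, trim)
    s_w = ((n1 - 1) * winsorized_cov(y1, trim)
           + (n2 - 1) * winsorized_cov(y2, trim)) / (h1 + h2 - 2)
    return float(diff @ np.linalg.solve((1 / h1 + 1 / h2) * s_w, diff))

# Group 1 scores from Table 1, followed by a hypothetical comparison group.
group1 = np.array(
    [[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]], dtype=float
)
other = np.array(
    [[6, 44], [9, 43], [8, 45], [10, 46], [7, 42], [9, 47]], dtype=float
)
print(trimmed_mean_vector(group1, 0.20))
print(np.round(winsorized_cov(group1, 0.20), 3))
print(round(robust_t1(group1, other), 3))
```

Setting trim=0 turns off trimming and Winsorizing, in which case the function returns the ordinary pooled-covariance statistic; this provides a convenient consistency check.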

Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized covariances were substituted for the usual estimators when the data followed a multivariate non-normal distribution. The Type I error performance of the J procedure with robust estimators was similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24 and n2 = 36). More importantly, however, there was a dramatic improvement in power when the test procedures with robust estimators were compared to their least-squares counterparts; this was observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The differences in power were as great as 60 percentage points, which represents a substantial difference in the ability to detect outcome effects.

Computer Program to Obtain Numeric Solutions

Appendix B contains a module of programming code that will produce numeric results using least-squares and robust estimators with the test procedures enumerated previously, that is, the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to run this program. This program can be used with either the PC or UNIX versions of SAS; it was generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website, http://home.cc.umanitoba.ca/~lixlm.

In order to run the program, the data set, group sizes, proportion of trimming, and nominal level of significance, α, must be input. It is assumed that the data set is complete, so that there are no missing values for any of the subjects on the outcome variables. The program generates as output the summary statistics for each group (i.e., means and covariance matrices). For each test procedure, the relevant T and/or F statistics are produced, along with the numerator (ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value. These results can be produced for both least-squares estimators and robust estimators with separate calls to the program.

To produce results for the example data of Table 1 with least-squares estimators, the following data input lines are required:

Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;

The first line is used to specify the data set, Y. Notice that a comma separates the series of measurements for each subject and braces enclose the data set. The next line of code specifies the group sizes. Again, braces enclose the element values. No comma is required to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a symmetric trimming approach is automatically assumed in the program; trimming proportions for the right and left tails are not specified. The RUN T2MULT code invokes the program and generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that these lines of code follow the FINISH statement that concludes the program module.

Table 3 contains the output produced by the SAS/IML program for each test statistic for the example data set. For comparative purposes, the program produces the results for Hotelling's (1931) T². We do not recommend, however, that the results for this procedure be reported. The output for least-squares estimators is provided first. A second invocation of the program with PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the program will output a T statistic and/or an F statistic, along with the df and p-value or critical value. This information is used to either reject or fail to reject the null hypothesis.

As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures fail to reject the null hypothesis of equality of multivariate means. One would conclude that there is no difference between the two groups on the multivariate means. However, when robust estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality of multivariate trimmed means, leading to the conclusion that the two groups do differ on the multivariate trimmed means. These results demonstrate the influence that a small number of extreme observations can have on tests of mean equality in multivariate designs.

Conclusions and Recommendations

Although Hunter and Schmidt (1995) argue against the use of tests of statistical significance, their observation that "methods of data analysis used in research have a major effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances in data-analytic techniques for multivariate data are unknown to the majority of applied health researchers. Traditional procedures for testing multivariate hypotheses of mean equality make specific assumptions concerning the data distribution and the group variances and covariances. Valid tests of hypotheses of healthcare intervention effects are obtained only when the assumptions underlying tests of statistical significance are satisfied. If these assumptions are not satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be

In this article, we have reviewed the shortcomings of Hotelling's (1931) T² test and described a number of procedures that are insensitive to violations of the assumption of equality of population covariance matrices for multivariate data. Substituting robust estimators for the usual least-squares estimators will result in test procedures that are insensitive to both covariance heterogeneity and multivariate non-normality. Robust estimators are measures of location and scale that are less influenced by the presence of extreme scores in the tails of a distribution. Robust estimators based on the concepts of trimming and Winsorizing result in the most extreme scores either being removed or replaced by less extreme scores. To facilitate the adoption of the robust test procedures by applied researchers, we have presented a computer program that can be used to obtain robust solutions for multivariate two-group data.
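The trimming and Winsorizing operations just described can be illustrated with a short Python sketch. It is not the SAS/IML program referred to in the text. The function names, the symmetric trimming rule g = floor(ptrim × n), and the choice of the Group 1 scores from Table 1 as input are assumptions made only for this illustration.

```python
import math

def trimmed_mean(scores, ptrim=0.20):
    """Mean of the scores left after removing the g smallest and g largest."""
    xs = sorted(scores)
    g = math.floor(ptrim * len(xs))        # number trimmed per tail (assumed rule)
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)

def winsorize(scores, ptrim=0.20):
    """Replace the g most extreme scores in each tail by the nearest retained score."""
    xs = sorted(scores)
    g = math.floor(ptrim * len(xs))
    low, high = xs[g], xs[len(xs) - g - 1]
    # Clamp each score; the original subject order is preserved.
    return [min(max(s, low), high) for s in scores]

def covariance(x, y):
    """Sample covariance with divisor n - 1; covariance(x, x) is the variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Group 1 scores on the two outcome variables (Table 1)
y1 = [1, 28, 12, 13, 13, 11]
y2 = [51, 48, 49, 51, 52, 47]

ls_means = (sum(y1) / len(y1), sum(y2) / len(y2))
trimmed_means = (trimmed_mean(y1), trimmed_mean(y2))
w1, w2 = winsorize(y1), winsorize(y2)
winsorized_cov = [[covariance(w1, w1), covariance(w1, w2)],
                  [covariance(w2, w1), covariance(w2, w2)]]

print("least-squares means:", ls_means)
print("20% trimmed means:  ", trimmed_means)
print("Winsorized covariance matrix:", winsorized_cov)
```

In this sketch each variable is Winsorized separately while the pairing of scores within a subject is retained, so the two extreme scores on the first variable (1 and 28) are pulled in to 11 and 13 before the covariance matrix is computed.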

The choice among the Brown-Forsythe (1974), James (1954) second-order, Johansen (1980), Kim (1992), Nel and van der Merwe (1986), and Yao (1965) procedures with robust estimators will depend on the characteristics of the data, such as the number of dependent variables, the nature of the relationship between group sizes and covariance matrices, and the degree of inequality of population covariance matrices. Current knowledge suggests that the Kim (1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in liberal or conservative tests under many data-analytic conditions and provides good statistical power to detect between-group differences on multiple outcome variables. Further research is needed, however, to provide more specific recommendations regarding the performance of these six procedures when robust estimators are adopted.

Finally, we would like to note that the majority of the procedures that have been described in this paper can be generalized to the case of more than two independent groups (see, e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt robust test procedures for a variety of multivariate data-analytic situations.

References

Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational Statistics, 16, 125-139.

Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 385-389.

Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1251-1273.

Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant analysis. Educational and Psychological Measurement, 56, 382-402.

de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions of several solutions to the multivariate Behrens-Fisher problem. South African Statistical Journal, 27, 129-148.

Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and two-sample T² tests. Journal of the American Statistical Association, 74, 48-51.

Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on means under conditions of heterogeneous correlation structure and varied multivariate distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.

Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology, 68, 155-165.

Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.

Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between Myers-Briggs psychological traits and use of course objectives in anatomy and physiology. Evaluation & the Health Professions, 19, 243-252.

Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data. Biometrics, 38, 377-396.

Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the American Statistical Association, 62, 124-136.

Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching subgroup effects. Statistics in Medicine, 30, 1351-1364.

Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the

Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2, 360-378.

Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah (Ed.), Handbook of Statistics, Vol. 1 (pp. 199-236). North-Holland: New York.

Ito, K., & Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.

James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the ratios of population variances are unknown. Biometrika, 41, 19-43.

Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of squares in a weighted linear regression. Biometrika, 67, 85-92.

Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses revisited: An update based on trimmed means. Psychometrika, 63, 145-163.

Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika, 79, 171-176.

Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health care. Evaluation & the Health Professions, 6, 465-482.

Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58, 409-

Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139-1145.

Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.

SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Author: Cary, NC.

SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Author: Cary, NC.

Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research knowledge. Evaluation & the Health Professions, 18, 408-427.

Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.

Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral Research, 36, 1-27.

Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect size. Review of Educational Research, 65, 51-77.

Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher problem via trimmed means. The Statistician, 44, 213-225.

Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher problem. Biometrika, 52, 139-147.

Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61, 165-170.

Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate Behavioral Research, 21, 169-186.

Appendix

Numeric Formulas for Alternatives to Hotelling's (1931) T² Test

Brown and Forsythe (1974)

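The numeric formulas presented here are based on the work of Brown and Forsythe, with the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo, & Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then

$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF}, \tag{A1}$$

where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and

$$f_2 = \frac{\mathrm{tr}(\mathbf{G}_1^2) + [\mathrm{tr}(\mathbf{G}_1)]^2}{\sum_{j=1}^{2} \frac{1}{n_j - 1}\left\{\mathrm{tr}[(\bar{w}_j\mathbf{S}_j)^2] + [\mathrm{tr}(\bar{w}_j\mathbf{S}_j)]^2\right\}}. \tag{A2}$$

In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1\mathbf{S}_1 + \bar{w}_2\mathbf{S}_2$. The test statistic $F_{BF}$ is compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where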

SSSSGG

wtrwtrwtrwtrtrtr

trtr (A3)

and $\mathbf{G}_2 = w_1\mathbf{S}_1 + w_2\mathbf{S}_2$.

James (1954) Second Order

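The test statistic $T_2$ of equation 8 is compared to the critical value $\chi^2_p(A + \chi^2_p B) + q$, where $\chi^2_p$ is the $1 - \alpha$ percentile point of the $\chi^2$ distribution with $p$ df,

$$A = 1 + \frac{1}{2p}\sum_{j=1}^{2}\frac{[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j)]^2}{n_j - 1}, \tag{A4}$$

$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and

$$B = \frac{1}{p(p + 2)}\sum_{j=1}^{2}\left\{\frac{\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j\mathbf{A}^{-1}\mathbf{A}_j)}{n_j - 1} + \frac{[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j)]^2}{2(n_j - 1)}\right\}. \tag{A5}$$

The constant $q$ is based on a lengthy formula, which has not been reproduced here; it can be found in equation 6.7 of James (1954).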

Johansen (1980)

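Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and

$$C = \frac{1}{2}\sum_{j=1}^{2}\frac{\mathrm{tr}[(\mathbf{A}^{-1}\mathbf{A}_j)^2] + [\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j)]^2}{n_j - 1}. \tag{A6}$$

The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.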
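The Johansen (1980) computations can be traced numerically with the least-squares estimators for the Table 1 data. The Python sketch below is illustrative only: the helper names are hypothetical, the matrices are handled as plain 2 × 2 nested lists, and the correction term is taken to be 6C/(p + 2), the form in Johansen's approximation. Under these assumptions the sketch gives values that agree, after rounding, with the Johansen row of Table 3 for least-squares estimators.

```python
def mean_and_cov(rows):
    """Mean vector and 2 x 2 sample covariance matrix of (y1, y2) pairs."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    s = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
          for b in range(2)] for a in range(2)]
    return m, s

def inverse(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(a):
    return a[0][0] + a[1][1]

# Table 1 data as (Y_i1, Y_i2) pairs
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2
n = [len(group1), len(group2)]
(m1, s1), (m2, s2) = mean_and_cov(group1), mean_and_cov(group2)

# A_j = S_j / n_j and A = A_1 + A_2
A_j = [[[v / n[0] for v in row] for row in s1],
       [[v / n[1] for v in row] for row in s2]]
A = [[A_j[0][i][k] + A_j[1][i][k] for k in range(2)] for i in range(2)]
A_inv = inverse(A)

# Heteroscedastic statistic T_2 = d' A^{-1} d, with d the mean difference
diff = [m1[k] - m2[k] for k in range(2)]
T2 = sum(diff[i] * A_inv[i][k] * diff[k] for i in range(2) for k in range(2))

# C of equation A6
C = 0.0
for j in range(2):
    M = matmul(A_inv, A_j[j])
    C += (trace(matmul(M, M)) + trace(M) ** 2) / (n[j] - 1)
C *= 0.5

c2 = p + 2 * C - 6 * C / (p + 2)   # assumed form of the correction term
F_J = T2 / c2
nu_J = p * (p + 2) / (3 * C)
print(round(T2, 2), round(F_J, 2), round(nu_J, 2))
```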

Kim (1992)

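The K procedure is based on the test statistic

$$F_K = \frac{\nu_K}{c\,m\,f_1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2), \tag{A7}$$

where $\mathbf{V} = r^2\mathbf{A}_1 + \mathbf{A}_2 + 2r\,\mathbf{A}_1^{1/2}(\mathbf{A}_1^{-1/2}\mathbf{A}_2\mathbf{A}_1^{-1/2})^{1/2}\mathbf{A}_1^{1/2}$,

$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8}$$

$$m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9}$$

$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the $l$th eigenvalue of $\mathbf{A}_1^{-1}\mathbf{A}_2$, $r = |\mathbf{A}_1^{-1}\mathbf{A}_2|^{1/(2p)}$, and $|\cdot|$ is the determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$, where $\nu_K = f_1 - p + 1$,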

nf (A10)

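and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.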
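The constants c and m of the Kim (1992) procedure depend only on the eigenvalues d_l, so they can be checked without forming V. The Python sketch below does this for the least-squares estimators of the Table 1 data. It rests on several assumptions that should be read as such: d_l is taken as an eigenvalue of A_1^{-1}A_2 with A_j = S_j/n_j, r is the 2p-th root of the determinant of that matrix, c is the ratio of the sum of squared h_l to the sum of h_l, and m is the squared sum of h_l divided by the sum of squared h_l. The helper names are hypothetical. With these definitions m rounds to the numerator df of 1.5 shown for K in Table 3.

```python
import math

def cov_matrix(rows):
    """2 x 2 sample covariance matrix of (y1, y2) pairs."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(2)] for a in range(2)]

def det(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inverse(a):
    d = det(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2

A1 = [[v / len(group1) for v in row] for row in cov_matrix(group1)]
A2 = [[v / len(group2) for v in row] for row in cov_matrix(group2)]

# Eigenvalues of the 2 x 2 matrix A_1^{-1} A_2 from its trace and determinant;
# they are real and positive because A_1 and A_2 are positive definite.
M = matmul(inverse(A1), A2)
t, d = M[0][0] + M[1][1], det(M)
root = math.sqrt(t * t - 4 * d)
eigenvalues = [(t - root) / 2, (t + root) / 2]

r = d ** (1 / (2 * p))                       # r = |A_1^{-1} A_2|^{1/(2p)}
h = [(dl + 1) / (math.sqrt(dl) + r) ** 2 for dl in eigenvalues]
c = sum(x * x for x in h) / sum(h)
m = sum(h) ** 2 / sum(x * x for x in h)
print(round(c, 2), round(m, 2))
```

Because m is a ratio of a squared sum to a sum of squares, it always lies between 1 and p; it equals p only when all h_l are equal.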

Nel and van der Merwe (1986)

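$$F_{NV} = \frac{\nu_N}{p f_2}\, T_2, \tag{A11}$$

where $\nu_N = f_2 - p + 1$ and

$$f_2 = \frac{\mathrm{tr}(\mathbf{A}^2) + [\mathrm{tr}(\mathbf{A})]^2}{\sum_{j=1}^{2}\frac{1}{n_j - 1}\left\{\mathrm{tr}(\mathbf{A}_j^2) + [\mathrm{tr}(\mathbf{A}_j)]^2\right\}}. \tag{A12}$$

The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.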
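As a numerical check on the Nel and van der Merwe (1986) procedure, the following Python sketch evaluates the approximate df and the F statistic for the least-squares estimators of the Table 1 data. The helper names are hypothetical, and f_2 is taken in the usual Nel and van der Merwe form, with tr(A²) + (tr A)² in the numerator and the corresponding terms for each A_j, divided by n_j − 1, in the denominator. Under that assumption the results round to the NV entries of Table 3.

```python
def mean_and_cov(rows):
    """Mean vector and 2 x 2 sample covariance matrix of (y1, y2) pairs."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    s = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
          for b in range(2)] for a in range(2)]
    return m, s

def inverse(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def trace_sq_terms(a):
    """tr(A^2) + (tr A)^2 for a symmetric 2 x 2 matrix."""
    tr_a2 = a[0][0] ** 2 + 2 * a[0][1] * a[1][0] + a[1][1] ** 2
    return tr_a2 + (a[0][0] + a[1][1]) ** 2

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2
n = [len(group1), len(group2)]
(m1, s1), (m2, s2) = mean_and_cov(group1), mean_and_cov(group2)

A_j = [[[v / n[0] for v in row] for row in s1],
       [[v / n[1] for v in row] for row in s2]]
A = [[A_j[0][i][k] + A_j[1][i][k] for k in range(2)] for i in range(2)]
A_inv = inverse(A)

diff = [m1[k] - m2[k] for k in range(2)]
T2 = sum(diff[i] * A_inv[i][k] * diff[k] for i in range(2) for k in range(2))

f2 = trace_sq_terms(A) / sum(trace_sq_terms(A_j[j]) / (n[j] - 1)
                             for j in range(2))
nu_N = f2 - p + 1
F_NV = nu_N / (p * f2) * T2
print(round(T2, 2), round(nu_N, 2), round(F_NV, 2))
```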

Yao (1965)

The statistic

$$F_Y = \frac{\nu_K}{p f_1}\, T_2 \tag{A13}$$

is referred to the critical value $F[p, \nu_K]$, where $f_1$ is given by equation A10 and $\nu_K$ again equals $f_1 - p + 1$.
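The Yao (1965) statistic can likewise be evaluated for the least-squares estimators of the Table 1 data. The Python sketch below assumes the approximate df formula commonly attributed to Yao, 1/f_1 = Σ_j (b_j/T_2)²/(n_j − 1), with b_j = d'A^{-1}A_jA^{-1}d and d the vector of mean differences. That formula, and the helper names, are assumptions of the sketch rather than a transcription of the program. The results round to the Y entries of Table 3.

```python
def mean_and_cov(rows):
    """Mean vector and 2 x 2 sample covariance matrix of (y1, y2) pairs."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    s = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
          for b in range(2)] for a in range(2)]
    return m, s

def inverse(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def quad_form(v, a):
    """v' A v for a 2-vector v and a 2 x 2 matrix A."""
    return sum(v[i] * a[i][k] * v[k] for i in range(2) for k in range(2))

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2
n = [len(group1), len(group2)]
(m1, s1), (m2, s2) = mean_and_cov(group1), mean_and_cov(group2)

A_j = [[[v / n[0] for v in row] for row in s1],
       [[v / n[1] for v in row] for row in s2]]
A = [[A_j[0][i][k] + A_j[1][i][k] for k in range(2)] for i in range(2)]
A_inv = inverse(A)

diff = [m1[k] - m2[k] for k in range(2)]
T2 = quad_form(diff, A_inv)

# u = A^{-1} d, so that b_j = u' A_j u and b_1 + b_2 = T_2
u = [sum(A_inv[i][k] * diff[k] for k in range(2)) for i in range(2)]
b = [quad_form(u, A_j[j]) for j in range(2)]

f1 = 1.0 / sum((b[j] / T2) ** 2 / (n[j] - 1) for j in range(2))
nu_K = f1 - p + 1
F_Y = nu_K / (p * f1) * T2
print(round(f1, 2), round(nu_K, 2), round(F_Y, 2))
```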

Footnotes

1. The sum of the eigenvalues of a matrix is called the trace of the matrix.

2. The skewness for the normal distribution is zero.

Table 1. Multivariate Example Data Set

Group   Subject   Yi1   Yi2
1       1           1    51
1       2          28    48
1       3          12    49
1       4          13    51
1       5          13    52
1       6          11    47
2       1          19    46
2       2          18    48
2       3          18    50
2       4          21    50
2       5          19    45
2       6          20    46
2       7          22    48
2       8          18    49

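Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators

$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$, $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$

$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$, $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$

Robust Estimators

$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$, $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$

$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$, $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$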

Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure   Test Statistic               df                   p-value/Critical Value (CV)   Decision re Null Hypothesis

Least-Squares Estimators
T²          T_1 = 6.1; F_T = 2.8         ν1 = 2; ν2 = 11      p = .106                      Fail to Reject
BF          T_BF = 9.1; F_BF = 3.7       ν1 = 4; ν2 = 4.4
J2          T_2 = 5.0                    ν1 = 2               CV = 14.2                     Fail to Reject
J           T_2 = 5.0; F_J = 2.3         ν1 = 2; ν2 = 6.9
K           F_K = 2.5                    ν1 = 1.5; ν2 = 6.1
NV          T_2 = 5.0; F_NV = 2.0        ν1 = 2; ν2 = 4.4
Y           T_2 = 5.0; F_Y = 2.1         ν1 = 2; ν2 = 6.1

Robust Estimators
T²          T_1 = 59.0; F_T = 25.8       ν1 = 2; ν2 = 7       p = .001                      Reject
BF          T_BF = 131.2; F_BF = 56.2    ν1 = 5; ν2 = 6.0     p = .001                      Reject
J2          T_2 = 65.2                   ν1 = 2               CV = 13.3                     Reject
J           T_2 = 65.2; F_J = 29.5       ν1 = 2; ν2 = 6.3     p = .001                      Reject
K           F_K = 28.1                   ν1 = 2.0; ν2 = 6.6   p = .001                      Reject
NV          T_2 = 65.2; F_NV = 27.9      ν1 = 2; ν2 = 6.0     p = .001                      Reject
Y           T_2 = 65.2; F_Y = 28.3       ν1 = 2; ν2 = 6.6     p = .001                      Reject

Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao (1965).