Post on 14-Aug-2020
Multivariate Tests of Means
1
Running head: MULTIVARIATE TESTS IN INDEPENDENT GROUPS DESIGNS
Multivariate Tests of Means in Independent Groups Designs:
Effects of Covariance Heterogeneity and Non-Normality
Lisa M. Lix and H. J. Keselman
University of Manitoba
Author Contact Information
Lisa Lix
Department of Community Health Sciences, Faculty of Medicine
408-727 McDermot Avenue
University of Manitoba
Winnipeg, MB R3E 3P5
e-mail: lisa_lix@cpe.umanitoba.ca
Biographical Information
Lisa Lix is Assistant Professor, Department of Community Health Sciences, University of
Manitoba, Winnipeg, Canada. Her research interests are in the areas of longitudinal data analysis,
multivariate methods, and robust estimation and testing. Her current publications are found in
British Journal of Mathematical and Statistical Psychology, Multivariate Behavioral Research,
Psychophysiology, and Journal of Community Health and Epidemiology.
Harvey Keselman is Professor of Psychology, University of Manitoba, Winnipeg, Canada. His
areas of interest include the analysis of repeated measurements, multiple comparison procedures,
and robust estimation and testing. His publications have appeared in journals such as British
Journal of Mathematical and Statistical Psychology, Educational and Psychological
Measurement, Journal of Educational and Behavioral Statistics, Psychological Methods,
Psychometrika, Psychophysiology, and Statistics in Medicine.
Abstract
Health evaluation research often employs multivariate designs in which data on several outcome
variables are obtained for independent groups of subjects. This article examines statistical
procedures for testing hypotheses of multivariate mean equality in two-group designs. The
conventional test for multivariate means, Hotelling's T², rests on certain assumptions about the
distribution of the data and the population variances and covariances. When these assumptions
are violated, which is often the case in applied health research, T² will result in invalid
conclusions about the null hypothesis. We describe procedures that are robust, or insensitive, to
assumption violations. A numeric example illustrates the statistical concepts that are presented,
and a computer program to implement these robust solutions is introduced.
Multivariate Tests of Means in Independent Groups Designs:
Effects of Covariance Heterogeneity and Non-Normality
Health evaluation research frequently involves the collection of multiple outcome
measurements on two or more groups of subjects. For example, Harasym et al. (1996) obtained
scores on multiple personal traits using the Myers-Briggs inventory for nursing students with
three different styles of learning. Knapp and Miller (1983) discuss the measurement of multiple
dimensions of healthcare quality as part of a system-wide evaluation program.
In the simplest case, where there are only two groups, such as a case and a control group,
Hotelling's (1931) T² is the traditional method for testing that the means for the set of outcome
variables are equivalent across groups (i.e., the hypothesis of multivariate mean equality). The T²
statistic is the analogue of Student's two-group t statistic for testing equality of group means for
a single outcome variable. While Hotelling's T² is the most common choice in the multivariate
context, most applied researchers are unaware that it rests on a set of derivational assumptions
that are not likely to be satisfied in evaluation research. Specifically, this test assumes that the
outcome measurements follow a multivariate normal distribution and exhibit a common
covariance structure. Multivariate normality is the assumption that all outcome variables and all
combinations of the variables are normally distributed. The assumption of a common covariance
matrix is that the populations will exhibit the same variances and covariances for all of the
outcome variables. For example, with two groups and three dependent variables, the assumption
of covariance homogeneity is that the three variances and three covariances are equivalent for the
two populations from which the data were sampled.
The T² test is not robust to assumption violations, meaning that it is sensitive to changes
in those factors which are extraneous to the hypothesis of interest (Ito, 1980). In fact, this test
may become seriously biased when assumptions are not satisfied, resulting in spurious decisions
about the null hypothesis. Moreover, the assumptions of normality and covariance homogeneity
are not likely to be satisfied in practice. Outliers, or extreme observations, are often a significant
concern in evaluation research (see, e.g., Sharmer, 2001). Furthermore, subjects who are exposed
to a particular healthcare treatment or intervention may exhibit greater variability on the outcome
measures than subjects who are not exposed to it (see, e.g., Grissom, 2000; Hill & Dixon, 1982;
Hoover, 2002). Consequently, researchers who rely on Hotelling's (1931) T² procedure to test
hypotheses about equality of multivariate group means may unwittingly be filling the literature
with non-replicable results or, at other times, may fail to detect intervention effects when they are
present. This should be of significant concern to health evaluation researchers because the results
of statistical tests are routinely used to make decisions about the effectiveness of clinical
interventions and to plan healthcare program content and delivery. In this era of evidence-based
decision making, it is important to ensure that the statistical procedures that are applied to
evaluation data will produce valid results.
Within the last 50 years, a number of statistical procedures that are robust to violations of
the assumption of covariance homogeneity have been proposed in the literature. However, these
procedures are largely unknown to applied researchers and therefore are not likely to be adopted
in practice. Moreover, all of these procedures are sensitive to departures from multivariate
normality. Recent research shows that it is possible to obtain a test that is robust to the combined
effects of covariance heterogeneity and non-normality. This involves substituting robust
measures of location and scale for the usual mean and covariances in tests that are insensitive to
covariance heterogeneity. These robust measures are less affected by the presence of outlying
scores or skewed distributions than traditional measures.
The purpose of this paper is to introduce health evaluation researchers to both the
concepts and the applications of robust test procedures for multivariate data. This paper begins
with an introduction to the statistical notation that will be helpful in understanding the concepts.
This is followed by a discussion of procedures that can be used to test the hypothesis of
multivariate mean equality when statistical assumptions are and are not satisfied. We will then
show how to obtain a test that is robust to the combined effects of covariance heterogeneity and
multivariate non-normality. Throughout this presentation, a numeric example will help to
illustrate the concepts and computations. Finally, we demonstrate a computer program that can
be used to implement the statistical tests described in this paper. Robust procedures are largely
inaccessible to applied researchers because they have not yet been incorporated into extant
statistical software packages. The program that we introduce will be beneficial to evaluation
researchers who want to test hypotheses of mean equality in multivariate designs but are
concerned about whether their data may violate the assumptions which underlie conventional
methods of analysis.
Statistical Notation
Consider the case of a single outcome variable. Let Yij represent the measurement on that
outcome variable for the ith subject in the jth group (i = 1, …, nj; j = 1, 2). Under a normal theory
model, it is assumed that Yij follows a normal distribution with mean μj and variance σj². When
there are only two groups of subjects, the null hypothesis is
$$H_0\colon \mu_1 = \mu_2 \tag{1}$$
In other words, one wishes to test whether the population means for the outcome variable are
equivalent.
To generalize to the multivariate context, assume that we have measurements for each
subject on p outcome variables. In other words, instead of just a single value, we now have a set
of p values for each subject. Using matrix notation, Yij represents the vector (i.e., row) of p
outcome measurements for the ith subject in the jth group, that is,
Yij = [Yij1 … Yijp]. For example, Yij may represent the values for a series of measures of physical
function or attitudes towards a healthcare intervention. It is assumed that Yij follows a
multivariate normal distribution with mean μj and variance-covariance matrix Σj (i.e., Yij ~ N[μj,
Σj]). The vector μj contains the mean scores on each outcome variable, that is,
μj = [μj1 μj2 … μjp]. The variance-covariance matrix Σj is a p x p matrix with the variances for
each outcome variable on the diagonal and the covariances for all pairs of outcome variables on
the off diagonal:
$$\boldsymbol{\Sigma}_j = \begin{bmatrix} \sigma_{j1}^2 & \sigma_{j12} & \cdots & \sigma_{j1p} \\ \vdots & & \ddots & \vdots \\ \sigma_{jp1} & & \cdots & \sigma_{jp}^2 \end{bmatrix}$$
The null hypothesis
$$H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 \tag{2}$$
is used to test whether the means for the set of p outcome variables are equal across the two
groups.
As Knapp and Miller (1983) observe, adopting a test of multivariate (i.e., joint)
equivalence is preferable to adopting multiple tests of univariate equivalence, particularly when
the outcome variables are correlated. Type I errors, which are erroneous conclusions about true
null hypotheses, may occur when multiple univariate tests are performed. If the outcome
variables are independent, the probability that at least one erroneous decision will be made on the
set of p outcome variables is 1 – (1 – α)^p, which is approximately equal to αp when α is small,
where α is the nominal level of significance. For example, with α = .05 and p = 3, the probability
of making at least one erroneous decision is ≈ .14.
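The arithmetic in this example can be checked with a short Python sketch. The function name familywise_error_rate is ours and purely illustrative; the calculation assumes, as the text does, that the p univariate tests are independent.

```python
def familywise_error_rate(alpha: float, p: int) -> float:
    """Probability of at least one Type I error across p independent tests."""
    return 1.0 - (1.0 - alpha) ** p


if __name__ == "__main__":
    alpha, p = 0.05, 3
    exact = familywise_error_rate(alpha, p)
    approx = alpha * p  # small-alpha approximation mentioned in the text
    print(f"exact = {exact:.4f}, approximation = {approx:.4f}")
```

The exact rate and the αp approximation drift apart as α or p grows, which is why the approximation is stated only for small α.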
To illustrate the multivariate concepts that have been presented to this point, we will use
the example data set of Table 1. These data are for two groups of subjects and two outcome
variables. Let nj represent the sample size for the jth group. The example data are for an
unbalanced design (i.e., unequal group sizes), where n1 = 6 and n2 = 8. The vector of scores for
the first subject of group 1 is Y11 = [1 51], the vector for the second subject of group 1 is
Y21 = [28 48], and so on.
Let Ȳj and Sj represent the sample mean vector and sample covariance matrix for the jth
group. Table 2 contains these summary statistics for the example data set. The mean scores for
the first outcome variable are 13.0 and 19.4 for groups 1 and 2, respectively. For the second
outcome variable, the corresponding means are 49.7 and 47.8. The variances for groups 1 and 2
on the first outcome measure are 74.8 and 2.3, respectively. The larger variance for the first
group is primarily due to the presence of two extreme values of 1 and 28 for the first and second
subjects, respectively. The corresponding variances on the second outcome variable are 3.9 and
3.6. For group 1, the covariance for the two outcome variables is –7.0. The population
correlation for two variables q and q′ (i.e., ρqq′; q, q′ = 1, …, p) can be obtained from the
covariance and the variances:
$$\rho_{qq'} = \frac{\sigma_{qq'}}{\sqrt{\sigma_q^2\,\sigma_{q'}^2}}$$
where σqq′ is the covariance and σq² is the variance for the qth outcome variable. The sample
correlation coefficient rqq′ is used to estimate the population correlation coefficient. In the
example data set of Table 1, a moderate negative correlation of rqq′ = –.4 exists for the two
variables for group 1.
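These summary statistics can be reproduced with a small Python sketch. The scores are the Table 1 values as they appear in the program input lines given later in the paper; the helper names (mean_vector, covariance_matrix, correlation) are ours and are not part of the authors' program.

```python
from math import sqrt

# Table 1 scores as (outcome 1, outcome 2) pairs, one pair per subject.
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]


def mean_vector(rows):
    """Sample mean of each outcome variable."""
    n, p = len(rows), len(rows[0])
    return [sum(r[q] for r in rows) / n for q in range(p)]


def covariance_matrix(rows):
    """Unbiased sample covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def correlation(cov, q, q2):
    """Correlation of variables q and q2 from a covariance matrix."""
    return cov[q][q2] / sqrt(cov[q][q] * cov[q2][q2])


if __name__ == "__main__":
    for label, rows in (("group 1", GROUP1), ("group 2", GROUP2)):
        cov = covariance_matrix(rows)
        print(label, mean_vector(rows), cov, correlation(cov, 0, 1))
```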
Tests for Mean Equality when Assumptions are Satisfied
Student's t statistic is the conventional procedure for testing the null hypothesis of
equality of population means in a univariate design (i.e., equation 1). The test statistic, which
assumes equality of population variances, is
$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \tag{3}$$
where Ȳj represents the mean for the jth group and s², the variance that is pooled for the two
groups, is
$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \tag{4}$$
where sj² is the variance for the jth group.
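The multivariate Hotelling's (1931) T² statistic is formed from equation 3 by replacing
means with mean vectors and the pooled variance with the pooled covariance matrix:
$$T_1 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}\right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}} \tag{5}$$
where T is the transpose operator, which is used to convert the row vector (Ȳ1 – Ȳ2) to a
column vector, –1 denotes the inverse of a matrix, and S is the pooled sample covariance matrix:
$$\mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2} \tag{6}$$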
This test statistic is easily obtained from standard software packages such as SAS (SAS Institute,
1999b). Statistical significance of this T1 statistic is evaluated using the T² distribution. The test
statistic can also be converted to an F statistic:
$$F_T = \frac{N - p - 1}{p(N - 2)}\,T_1 \tag{7}$$
where N = n1 + n2. Statistical significance is then assessed by comparing the FT statistic to its
critical value, F[p, N – p – 1], that is, a critical value from the F distribution with p and N – p – 1
degrees of freedom (df).
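Equations 5 through 7 can be traced numerically with the following Python sketch, applied to the Table 1 scores as listed in the program input lines later in the paper. It is a minimal illustration, not the authors' SAS/IML module: the function names are ours, and the matrix inverse is written out by hand, so the code assumes exactly p = 2 outcome variables.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]


def mean_vector(rows):
    """Sample means of the two outcome variables."""
    n = len(rows)
    return [sum(r[q] for r in rows) / n for q in range(2)]


def covariance_matrix(rows):
    """Unbiased 2 x 2 sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(2)] for a in range(2)]


def quadratic_form_inverse(d, m):
    """Return d * inverse(m) * d^T for a symmetric 2 x 2 matrix m."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return (d[0] ** 2 * m[1][1] - 2 * d[0] * d[1] * m[0][1]
            + d[1] ** 2 * m[0][0]) / det


def hotelling_t1(g1, g2):
    """T1 of equation 5, F_T of equation 7, and the F degrees of freedom."""
    n1, n2, p = len(g1), len(g2), 2
    s1, s2 = covariance_matrix(g1), covariance_matrix(g2)
    scale = 1 / n1 + 1 / n2
    # (1/n1 + 1/n2) times the pooled covariance matrix of equation 6
    m = [[scale * ((n1 - 1) * s1[a][b] + (n2 - 1) * s2[a][b]) / (n1 + n2 - 2)
          for b in range(2)] for a in range(2)]
    d = [x - y for x, y in zip(mean_vector(g1), mean_vector(g2))]
    t1 = quadratic_form_inverse(d, m)
    n_total = n1 + n2
    f_t = (n_total - p - 1) / (p * (n_total - 2)) * t1
    return t1, f_t, (p, n_total - p - 1)


if __name__ == "__main__":
    print(hotelling_t1(GROUP1, GROUP2))
```

The F statistic would then be compared with a tabled critical value for the returned degrees of freedom; the sketch deliberately stops short of computing a p-value.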
When the data are sampled from populations that follow a normal distribution but have
unequal covariance matrices (i.e., Σ1 ≠ Σ2), Hotelling's (1931) T² will generally maintain the rate
of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to α if the design is
balanced (i.e., n1 = n2; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins &
Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will
result, depending on the nature of the relationship between the covariance matrices and group
sizes. A liberal result is one in which the actual Type I error rate will exceed α. Liberal results
are problematic; researchers will be filling the literature with false positives (i.e., saying there are
treatment effects when none are present). A conservative result, on the other hand, is one in
which the Type I error rate will be less than α. Conservative results are also a cause for concern
because they may result in test procedures that have low statistical power to detect true
differences in population means (i.e., real effects will be undetected).
If the group with the largest sample size also exhibits the smallest element values of Σj,
which is known as a negative pairing condition, the error will be liberal. For example, Hopkins
and Clay (1963) showed that when group sizes were 10 and 20 and the ratio of the largest to the
smallest standard deviations of the groups was 1.6, the true rate of Type I errors for α = .05 was
.11. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate
was .21, more than four times the nominal level of significance. For positive pairings of group
sizes and covariance matrices, such that the group with the largest sample size also exhibits the
largest element values of Σj, the T² procedure tends to produce a conservative test. In fact, the
error rate may be substantially below the nominal level of significance. For example, Hopkins
and Clay observed Type I error rates of .02 and .01, respectively, for α = .05 for positive pairings
for the two standard deviation values noted previously. These liberal and conservative results for
normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian
et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986)
for both moderate and large degrees of covariance heterogeneity.
When the assumption of multivariate normality is violated, the performance of
Hotelling's (1931) T² test depends on both the degree of departure from a multivariate normal
distribution and the nature of the research design. The earliest research on Hotelling's T² when
the data are non-normal suggested that tests of the null hypothesis for two groups were relatively
insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true
when the data are only moderately non-normal. However, Everitt (1979) showed that this test
procedure can become quite conservative when the distribution is skewed or when outliers are
present in the tails of the distribution, particularly when the design is unbalanced (see also
Zwick, 1986).
Tests for Mean Equality when Assumptions are not Satisfied
Both parametric and nonparametric alternatives to the T² test have been proposed in the
literature. Applied researchers often regard nonparametric procedures as appealing alternatives
because they rely on rank scores, which are typically perceived as being easy to conceptualize
and interpret. However, these procedures test hypotheses about equality of distributions rather
than equality of means. They are therefore sensitive to covariance heterogeneity because
distributions with unequal variances will necessarily result in rejection of the null hypothesis.
Zwick (1986) showed that nonparametric alternatives to Hotelling's (1931) T² could control the
Type I error rate when the data were sampled from non-normal distributions and covariances
were equal. Not surprisingly, when covariances were unequal, these procedures produced biased
results, particularly when group sizes were unequal.
Covariance heterogeneity. There are several parametric alternatives to Hotelling's (1931)
T² test. These include the Brown-Forsythe (Brown & Forsythe, 1974; [BF]), James (1954) first
and second order (J1 & J2), Johansen (1980; [J]), Kim (1992; [K]), Nel and Van der Merwe
(1986; [NV]), and Yao (1965; [Y]) procedures. The BF, J1, J2, and J procedures have also been
generalized to multivariate designs containing more than two groups of subjects.
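The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic:
$$T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\left(\frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2}\right)^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}} \tag{8}$$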
The J, NV, and Y procedures approximate the distribution of T2 differently because the df for
these three procedures are computed using different formulas. The J1 and J2 procedures each use
a different critical value to assess statistical significance. However, they both rely on large-sample
theory regarding the distribution of the test statistic in equation 8. What this means is that
when sample sizes are sufficiently large, the T2 statistic approximately follows a chi-squared (χ²)
distribution. For both procedures, this test statistic is referred to an “adjusted” χ² critical value. If
the test statistic exceeds that critical value, the null hypothesis of equation 2 is rejected. The
critical value for the J1 procedure is slightly smaller than the one for the J2 procedure. As a
result, the J1 procedure generally produces larger Type I error rates than the J2 procedure and
therefore is not often recommended (de la Ray & Nel, 1993). While the J2 procedure may offer
better Type I error control, it is computationally complex. The critical value for J2 is described in
the Appendix, along with the F-statistic conversions and df computations for the J, NV, and Y
procedures.
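The statistic of equation 8 can be sketched in Python as follows, using the Table 1 scores as listed in the program input lines later in the paper. The function names are ours, the code assumes p = 2 so that the inverse can be written out by hand, and it computes only the T2 statistic; the df and critical values that distinguish the J1, J2, J, NV, and Y procedures are in the Appendix and are not reproduced here.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]


def mean_vector(rows):
    """Sample means of the two outcome variables."""
    n = len(rows)
    return [sum(r[q] for r in rows) / n for q in range(2)]


def covariance_matrix(rows):
    """Unbiased 2 x 2 sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(2)] for a in range(2)]


def quadratic_form_inverse(d, m):
    """Return d * inverse(m) * d^T for a symmetric 2 x 2 matrix m."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return (d[0] ** 2 * m[1][1] - 2 * d[0] * d[1] * m[0][1]
            + d[1] ** 2 * m[0][0]) / det


def heteroscedastic_t2(g1, g2):
    """T2 of equation 8: each group keeps its own covariance matrix."""
    n1, n2 = len(g1), len(g2)
    s1, s2 = covariance_matrix(g1), covariance_matrix(g2)
    m = [[s1[a][b] / n1 + s2[a][b] / n2 for b in range(2)] for a in range(2)]
    d = [x - y for x, y in zip(mean_vector(g1), mean_vector(g2))]
    return quadratic_form_inverse(d, m)


if __name__ == "__main__":
    print(heteroscedastic_t2(GROUP1, GROUP2))
```

Because each covariance matrix is divided by its own group size, no pooling across groups takes place, which is the property that makes this statistic insensitive to covariance heterogeneity.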
The K procedure is based on an F statistic. It is more complex than preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed.¹ For completeness, the formula used to compute the K test statistic and its df are
found in the Appendix.
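The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly
from the one presented in equation 8:
$$T_{\mathrm{BF}} = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\left[\left(1 - \frac{n_1}{N}\right)\mathbf{S}_1 + \left(1 - \frac{n_2}{N}\right)\mathbf{S}_2\right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}} \tag{9}$$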
As one can see, the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8. Again, for completeness, the numeric solutions
for the BF F statistic and df are found in the Appendix.
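The different weighting can be seen in a short Python sketch of the equation 9 statistic, again using the Table 1 scores from the program input lines later in the paper. The function names are ours, the code assumes p = 2, and it stops at the T_BF statistic; the F conversion and df from the Appendix are not reproduced.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]


def mean_vector(rows):
    """Sample means of the two outcome variables."""
    n = len(rows)
    return [sum(r[q] for r in rows) / n for q in range(2)]


def covariance_matrix(rows):
    """Unbiased 2 x 2 sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(2)] for a in range(2)]


def brown_forsythe_weights(n1, n2):
    """Weights (1 - n_j / N) applied to S_1 and S_2 in equation 9."""
    n_total = n1 + n2
    return 1 - n1 / n_total, 1 - n2 / n_total


def brown_forsythe_t(g1, g2):
    """T_BF of equation 9 for two outcome variables."""
    w1, w2 = brown_forsythe_weights(len(g1), len(g2))
    s1, s2 = covariance_matrix(g1), covariance_matrix(g2)
    m = [[w1 * s1[a][b] + w2 * s2[a][b] for b in range(2)] for a in range(2)]
    d = [x - y for x, y in zip(mean_vector(g1), mean_vector(g2))]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("weighted covariance matrix is singular")
    # d * inverse(m) * d^T written out for a symmetric 2 x 2 matrix
    return (d[0] ** 2 * m[1][1] - 2 * d[0] * d[1] * m[0][1]
            + d[1] ** 2 * m[0][0]) / det


if __name__ == "__main__":
    print(brown_forsythe_weights(6, 8))
    print(brown_forsythe_t(GROUP1, GROUP2))
```

With n1 = 6 and n2 = 8 the smaller group's covariance matrix receives the larger weight (8/14 versus 6/14), as it does under equation 8 (1/6 versus 1/8), but the two sets of weights are not proportional to one another, so the statistics are on different scales and are referred to different F conversions.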
Among the BF, J2, J, K, NV, and Y tests, there appears to be no one best choice in all
data-analytic situations when the data are normally distributed, although a comprehensive
comparison of all of these procedures has not yet been conducted. Factors such as the degree of
covariance heterogeneity, total sample size, the degree of imbalance of the group sizes, and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences. Christensen and Rencher (1997) noted, in their extensive comparison among the J2,
J, K, NV, and Y procedures, that the J and Y procedures could occasionally result in inflated
Type I error rates for negative pairings of group sizes and covariance matrices. These liberal
tendencies were exacerbated as the number of outcome variables increased. The authors
recommended the K procedure overall, observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates. However, the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative. Type I error rates as low as .02 for α = .05 were reported when p =
10, n1 = 30, and n2 = 20.
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that
the J1, J2, J, and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed. For a lognormal distribution, which has skewness
of 6.18,² they observed many instances in which empirical Type I error rates of all of these
procedures were more than four times the nominal level of significance. Wilcox (1995) found
that the J test produced excessive Type I errors when sample sizes were small (i.e., n1 = 12 and
n2 = 18) and the data were generated from non-normal distributions; the K procedure became
conservative when the skewness was 6.18. For larger group sizes (i.e., n1 = 24 and n2 = 36), the J
procedure provided acceptable control of Type I errors when the data were only moderately
non-normal, but for the maximum skewness considered, it also became conservative. Fouladi and
Yockey (2002) found that the degree of departure from a multivariate normal distribution was a
less important predictor of Type I error performance than sample size. Across the range of
conditions which they examined, the Y test produced the greatest average Type I error rates and
the NV procedure the smallest. Error rates were only slightly influenced by the degree of
skewness or kurtosis of the data; however, these authors looked at only very modest departures
from a normal distribution; the maximum degree of skewness considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that
have been proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to “shift” the mean in their direction.
One should recognize at the outset that while robust estimators are insensitive to
departures from a normal distribution, they test a different null hypothesis than least-squares
estimators. The null hypothesis is about equality of trimmed population means. In other words,
one is testing a hypothesis that focuses on the bulk of the population rather than the entire
population. Thus, if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters, then procedures
based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let Y(1)j ≤ Y(2)j ≤ … ≤ Y(nj)j represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then let gj = [γnj],
where γ represents the proportion of observations that are to be trimmed in each tail of the
distribution and [x] is the greatest integer ≤ x. The effective sample size for the jth group is
defined as hj = nj – 2gj. The sample trimmed mean
$$\bar{Y}_{\mathrm{t}j} = \frac{1}{h_j}\sum_{i=g_j+1}^{n_j-g_j} Y_{(i)j} \tag{10}$$
is computed by censoring the gj smallest and the gj largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
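Equation 10 translates directly into a few lines of Python. The sketch below is illustrative only; the function name trimmed_mean and the argument name gamma (the trimming proportion per tail) are ours. It is applied to the six group 1 scores on the first outcome variable, taken from the program input lines given later in the paper.

```python
from math import floor


def trimmed_mean(scores, gamma=0.20):
    """Trimmed mean of equation 10 with proportion gamma cut from each tail."""
    n = len(scores)
    g = floor(gamma * n)        # g_j = [gamma * n_j], greatest integer <= gamma * n_j
    h = n - 2 * g               # effective sample size h_j
    if h <= 0:
        raise ValueError("trimming proportion removes every observation")
    ordered = sorted(scores)    # Y_(1)j <= ... <= Y_(nj)j
    kept = ordered[g:n - g]     # drop the g smallest and g largest scores
    return sum(kept) / h


if __name__ == "__main__":
    group1_outcome1 = [1, 28, 12, 13, 13, 11]
    print(trimmed_mean(group1_outcome1, 0.20))  # scores 1 and 28 are removed
    print(trimmed_mean(group1_outcome1, 0.0))   # gamma = 0 gives the usual mean
```

Because gamma * n is computed in floating point, a product that should be a whole number can occasionally fall just below it; a production implementation might guard against this, which the sketch does not attempt.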
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed:
$$\bar{Y}_{\mathrm{w}j} = \frac{1}{n_j}\sum_{i=1}^{n_j} Z_{ij} \tag{11}$$
where
$$Z_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j} \end{cases}$$
The Winsorized mean is obtained by replacing the gj smallest values with the next most extreme
value and the gj largest values with the next most extreme value. The Winsorized variance for
the jth group on a single outcome variable, s²wj, is
$$s_{\mathrm{w}j}^2 = \frac{\sum_{i=1}^{n_j}\left(Z_{ij} - \bar{Y}_{\mathrm{w}j}\right)^2}{n_j - 1} \tag{12}$$
and the standard error of the trimmed mean is
$$\sqrt{\frac{(n_j - 1)\,s_{\mathrm{w}j}^2}{h_j\,(h_j - 1)}}$$
The Winsorized covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is
$$s_{\mathrm{w}jqq'} = \frac{\sum_{i=1}^{n_j}\left(Z_{ijq} - \bar{Y}_{\mathrm{w}jq}\right)\left(Z_{ijq'} - \bar{Y}_{\mathrm{w}jq'}\right)}{n_j - 1} \tag{13}$$
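and the Winsorized covariance matrix for the jth group is
$$\mathbf{S}_{\mathrm{w}j} = \begin{bmatrix} s_{\mathrm{w}j1}^2 & s_{\mathrm{w}j12} & \cdots & s_{\mathrm{w}j1p} \\ \vdots & & \ddots & \vdots \\ s_{\mathrm{w}jp1} & & \cdots & s_{\mathrm{w}jp}^2 \end{bmatrix}$$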
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are
1  11  12  13  13  28
and with 20% trimming, g1 = [6 × .20] = 1. The scores of 1 and 28 are removed and the mean of
the remaining scores is computed, which produces Ȳt1 = 12.2. Table 2 contains the vectors of
trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:
11  11  12  13  13  13
The Winsorized mean, Ȳw1, is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean to one decimal place, these two estimators will not, as a rule, produce
an equivalent result. Table 2 contains the Winsorized covariance matrices for the two groups.
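The Winsorizing steps of equations 11 and 12, together with the standard error of the trimmed mean, can be sketched in Python as follows. The function names are ours and the code is an illustration rather than the authors' program. The Winsorized scores are returned in the original subject order, which is what the covariance of equation 13 would require in order to keep each subject's scores paired.

```python
from math import floor, sqrt


def winsorize(scores, gamma=0.20):
    """Z_ij of equation 11: pull the g lowest and g highest scores inward."""
    n = len(scores)
    g = floor(gamma * n)
    ordered = sorted(scores)
    low, high = ordered[g], ordered[n - g - 1]   # Y_(g+1)j and Y_(n-g)j
    return [min(max(y, low), high) for y in scores]


def winsorized_mean(scores, gamma=0.20):
    """Winsorized mean of equation 11."""
    z = winsorize(scores, gamma)
    return sum(z) / len(z)


def winsorized_variance(scores, gamma=0.20):
    """Winsorized variance of equation 12 (divisor n_j - 1)."""
    z = winsorize(scores, gamma)
    m = sum(z) / len(z)
    return sum((v - m) ** 2 for v in z) / (len(z) - 1)


def trimmed_mean_se(scores, gamma=0.20):
    """Standard error of the trimmed mean: sqrt((n-1) s_w^2 / (h (h-1)))."""
    n = len(scores)
    h = n - 2 * floor(gamma * n)
    if h < 2:
        raise ValueError("effective sample size must be at least 2")
    return sqrt((n - 1) * winsorized_variance(scores, gamma) / (h * (h - 1)))


if __name__ == "__main__":
    group1_outcome1 = [1, 28, 12, 13, 13, 11]
    print(sorted(winsorize(group1_outcome1)))
    print(winsorized_mean(group1_outcome1))
    print(winsorized_variance(group1_outcome1))
    print(trimmed_mean_se(group1_outcome1))
```

For these six scores the unrounded Winsorized mean is 73/6, whereas the unrounded trimmed mean is 49/4, so the two estimators agree only after rounding to one decimal place.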
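A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators, the T1 statistic of equation 5 becomes
$$T_{\mathrm{t}1} = (\bar{\mathbf{Y}}_{\mathrm{t}1} - \bar{\mathbf{Y}}_{\mathrm{t}2})\left[\left(\frac{1}{h_1} + \frac{1}{h_2}\right)\mathbf{S}_{\mathrm{w}}\right]^{-1}(\bar{\mathbf{Y}}_{\mathrm{t}1} - \bar{\mathbf{Y}}_{\mathrm{t}2})^{\mathrm{T}} \tag{14}$$
where
$$\mathbf{S}_{\mathrm{w}} = \frac{n_1 - 1}{h_1 - 1}\,\mathbf{S}_{\mathrm{w}1} + \frac{n_2 - 1}{h_2 - 1}\,\mathbf{S}_{\mathrm{w}2} \tag{15}$$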
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators when the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators was
similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24 and n2
= 36). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
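The effect of substituting robust estimators can be sketched in Python for the statistic of equation 8. This is an illustration under a stated assumption, not the authors' program: each Sj/nj in equation 8 is replaced by (nj – 1)Swj/[hj(hj – 1)], the matrix analogue of the squared standard error of the trimmed mean given after equation 12. The function names are ours, the code assumes p = 2, the scores are the Table 1 values from the program input lines later in the paper, and no df or critical values are computed.

```python
from math import floor

GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]


def column(rows, q):
    """Scores of all subjects on outcome variable q."""
    return [r[q] for r in rows]


def trimmed_mean(scores, gamma):
    """Mean after dropping the g smallest and g largest scores."""
    n = len(scores)
    g = floor(gamma * n)
    return sum(sorted(scores)[g:n - g]) / (n - 2 * g)


def winsorize(scores, gamma):
    """Winsorized scores, kept in the original subject order."""
    n = len(scores)
    g = floor(gamma * n)
    ordered = sorted(scores)
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(y, low), high) for y in scores]


def winsorized_covariance(rows, gamma):
    """2 x 2 Winsorized covariance matrix; each variable Winsorized separately."""
    n = len(rows)
    z = [winsorize(column(rows, q), gamma) for q in range(2)]
    m = [sum(zq) / n for zq in z]
    return [[sum((z[a][i] - m[a]) * (z[b][i] - m[b]) for i in range(n)) / (n - 1)
             for b in range(2)] for a in range(2)]


def robust_t2(g1, g2, gamma=0.20):
    """Equation 8 with trimmed means and scaled Winsorized covariances (assumed form)."""
    d = [trimmed_mean(column(g1, q), gamma) - trimmed_mean(column(g2, q), gamma)
         for q in range(2)]
    m = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (g1, g2):
        n = len(rows)
        h = n - 2 * floor(gamma * n)
        if h < 2:
            raise ValueError("effective sample size must be at least 2")
        sw = winsorized_covariance(rows, gamma)
        for a in range(2):
            for b in range(2):
                m[a][b] += (n - 1) * sw[a][b] / (h * (h - 1))
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return (d[0] ** 2 * m[1][1] - 2 * d[0] * d[1] * m[0][1]
            + d[1] ** 2 * m[0][0]) / det


if __name__ == "__main__":
    print(robust_t2(GROUP1, GROUP2, 0.0))    # gamma = 0 reduces to equation 8
    print(robust_t2(GROUP1, GROUP2, 0.20))   # 20% trimming in each tail
```

With gamma = 0 the function reduces to the least-squares statistic of equation 8, so the two printed values show how strongly the scores of 1 and 28 in group 1 influence the result.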
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm.
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:
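Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48,
18 49};   /* one row per subject: outcome 1, outcome 2 */
NX={6 8};   /* group sizes n1 and n2 */
PTRIM=0;   /* proportion trimmed in each tail */
ALPHA=.05;   /* nominal level of significance */
RUN T2MULT;
QUIT;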
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified. The RUN T2MULT code invokes the program and
generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that
these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) T². We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Hunter and Schmidt (1995) argue against the use of tests of statistical
significance, their observation that “methods of data analysis used in research have a major
effect on research progress” (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling's (1931) T² test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and Van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T² tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of Statistics, Vol. 1 (pp. 199-236). North-Holland: New York.
Ito, K., & Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Author: Cary, NC.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Author: Cary, NC.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size. Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate
Behavioral Research, 21, 169-186.
Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) T² Test
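Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then
$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF}, \tag{A1}$$
where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and
$$f_2 = \frac{\mathrm{tr}^2(\mathbf{G}_1) + \mathrm{tr}(\mathbf{G}_1^2)}{\dfrac{1}{n_1 - 1}\left\{\mathrm{tr}^2(\bar{w}_1\mathbf{S}_1) + \mathrm{tr}\left[(\bar{w}_1\mathbf{S}_1)^2\right]\right\} + \dfrac{1}{n_2 - 1}\left\{\mathrm{tr}^2(\bar{w}_2\mathbf{S}_2) + \mathrm{tr}\left[(\bar{w}_2\mathbf{S}_2)^2\right]\right\}}. \tag{A2}$$
In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1\mathbf{S}_1 + \bar{w}_2\mathbf{S}_2$. The test statistic $F_{BF}$ is
compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where
$$\nu_{BF1} = \frac{\mathrm{tr}^2(\mathbf{G}_1) + \mathrm{tr}(\mathbf{G}_1^2)}{\mathrm{tr}^2(\mathbf{G}_2) + \mathrm{tr}(\mathbf{G}_2^2) + \bar{w}_1\mathrm{tr}^2(\mathbf{S}_1) + \bar{w}_1\mathrm{tr}(\mathbf{S}_1^2) + \bar{w}_2\mathrm{tr}^2(\mathbf{S}_2) + \bar{w}_2\mathrm{tr}(\mathbf{S}_2^2)} \tag{A3}$$
and $\mathbf{G}_2 = w_1\mathbf{S}_1 + w_2\mathbf{S}_2$.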
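James (1954) Second Order
The test statistic $T_2$ of equation 8 is compared to the critical value $\chi^2_p(A + \chi^2_p B) + q$,
where $\chi^2_p$ is the 1 – α percentile point of the χ² distribution with p df,
$$A = 1 + \frac{1}{2p}\left[\frac{1}{n_1 - 1}\mathrm{tr}^2(\mathbf{A}^{-1}\mathbf{A}_1) + \frac{1}{n_2 - 1}\mathrm{tr}^2(\mathbf{A}^{-1}\mathbf{A}_2)\right], \tag{A4}$$
$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and
$$B = \frac{1}{p(p + 2)}\left\{\frac{1}{n_1 - 1}\left[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \frac{1}{2}\mathrm{tr}^2(\mathbf{A}^{-1}\mathbf{A}_1)\right] + \frac{1}{n_2 - 1}\left[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \frac{1}{2}\mathrm{tr}^2(\mathbf{A}^{-1}\mathbf{A}_2)\right]\right\}. \tag{A5}$$
The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).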
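Johansen (1980)
Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and
$$C = \frac{1}{2}\sum_{j=1}^{2}\frac{1}{n_j - 1}\left\{\mathrm{tr}\left[(\mathbf{A}^{-1}\mathbf{A}_j)^2\right] + \mathrm{tr}^2(\mathbf{A}^{-1}\mathbf{A}_j)\right\}. \tag{A6}$$
The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.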
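As an illustration of the Johansen computations, the following Python sketch evaluates C, c2, FJ and νJ for the least-squares estimators of the example data in Table 1. It is a sketch under stated assumptions rather than the SAS/IML program: all helper names are hypothetical, the helpers are hard-coded for p = 2, and the divisor p + 2 in c2 is assumed because it is the general form of Johansen's (1980) correction and reproduces the FJ = 2.3 reported in Table 3.

```python
# Example data from Table 1: (Y1, Y2) for each subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
P = 2  # number of outcome variables (helpers below assume p = 2)

def mean_vec(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(P)]

def cov_mat(rows):
    m = mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (len(rows) - 1)
             for b in range(P)] for a in range(P)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(P)) for j in range(P)]
            for i in range(P)]

def tr(m):
    return m[0][0] + m[1][1]

groups = [g1, g2]
a_j = [[[v / len(g) for v in row] for row in cov_mat(g)] for g in groups]  # A_j = S_j / n_j
a = [[a_j[0][r][c] + a_j[1][r][c] for c in range(P)] for r in range(P)]    # A = A_1 + A_2
a_inv = inv2(a)
d = [x - y for x, y in zip(mean_vec(g1), mean_vec(g2))]
t2 = sum(d[r] * a_inv[r][c] * d[c] for r in range(P) for c in range(P))    # equation 8

big_c = 0.0
for g, aj in zip(groups, a_j):
    m = matmul(a_inv, aj)
    big_c += (tr(matmul(m, m)) + tr(m) ** 2) / (len(g) - 1)
big_c /= 2                                   # equation A6
c2 = P + 2 * big_c - 6 * big_c / (P + 2)     # assumed divisor p + 2
f_j = t2 / c2
nu_j = P * (P + 2) / (3 * big_c)
print(round(t2, 1), round(f_j, 1), round(nu_j, 1))   # 5.0 2.3 6.9
```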
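Kim (1992)
The K procedure is based on the test statistic
$$F_K = \frac{\nu_K\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}}\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)}{c\,m\,f_1}, \tag{A7}$$
where $\mathbf{V} = r^2\mathbf{A}_1 + \mathbf{A}_2 + 2r\,\mathbf{A}_2^{1/2}\left(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2}\right)^{1/2}\mathbf{A}_2^{1/2}$,
$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8}$$
$$m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9}$$
$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1\mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1\mathbf{A}_2^{-1}|^{1/(2p)}$, and | | is the
determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$, where $\nu_K = f_1 - p + 1$,
$$\frac{1}{f_1} = \sum_{j=1}^{2}\frac{1}{n_j - 1}\left(\frac{b_j}{T_2}\right)^2, \tag{A10}$$
and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}}\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.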
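Nel and van der Merwe (1986)
Let
$$F_{NV} = \frac{\nu_N}{p f_2}\, T_2, \tag{A11}$$
where $\nu_N = f_2 - p + 1$ and
$$f_2 = \frac{\mathrm{tr}(\mathbf{A}^2) + \mathrm{tr}^2(\mathbf{A})}{\sum_{j=1}^{2}\dfrac{1}{n_j - 1}\left[\mathrm{tr}(\mathbf{A}_j^2) + \mathrm{tr}^2(\mathbf{A}_j)\right]}. \tag{A12}$$
The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.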
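The Nel and van der Merwe df and F conversion can be checked against Table 3 with a short Python sketch. The helper names are hypothetical and hard-coded for p = 2, least-squares estimators are used, and Aj = Sj/nj with A = A1 + A2 as defined for the James procedure.

```python
# Example data from Table 1: (Y1, Y2) for each subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
P = 2  # number of outcome variables (helpers below assume p = 2)

def mean_vec(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(P)]

def cov_mat(rows):
    m = mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (len(rows) - 1)
             for b in range(P)] for a in range(P)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def trace_terms(m):
    # tr(m^2) + [tr(m)]^2 for a 2 x 2 matrix.
    tr_sq = m[0][0] ** 2 + 2 * m[0][1] * m[1][0] + m[1][1] ** 2
    return tr_sq + (m[0][0] + m[1][1]) ** 2

groups = [g1, g2]
a_j = [[[v / len(g) for v in row] for row in cov_mat(g)] for g in groups]  # A_j = S_j / n_j
a = [[a_j[0][r][c] + a_j[1][r][c] for c in range(P)] for r in range(P)]    # A = A_1 + A_2
a_inv = inv2(a)
d = [x - y for x, y in zip(mean_vec(g1), mean_vec(g2))]
t2 = sum(d[r] * a_inv[r][c] * d[c] for r in range(P) for c in range(P))    # equation 8

f2 = trace_terms(a) / sum(trace_terms(aj) / (len(g) - 1)
                          for g, aj in zip(groups, a_j))                   # equation A12
nu_n = f2 - P + 1
f_nv = nu_n / (P * f2) * t2                                                # equation A11
print(round(nu_n, 1), round(f_nv, 1))   # 4.4 2.0
```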
Yao (1965)
The statistic $F_Y$ is referred to the critical value $F[p, \nu_K]$, where
$$F_Y = \frac{\nu_K}{p f_1}\, T_2, \tag{A13}$$
$f_1$ is given by equation A10, and $\nu_K$ again equals $f_1 - p + 1$.
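A minimal Python sketch of the Yao computations for the least-squares estimators follows. One assumption should be noted: the quantities bj are formed here with A = A1 + A2 in place of V, as in Yao's (1965) original formulation; with that choice the sketch gives the df and FY shown for the Y procedure in Table 3. Helper names are hypothetical and hard-coded for p = 2.

```python
# Example data from Table 1: (Y1, Y2) for each subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
P = 2  # number of outcome variables (helpers below assume p = 2)

def mean_vec(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(P)]

def cov_mat(rows):
    m = mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (len(rows) - 1)
             for b in range(P)] for a in range(P)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def quad(v, m):
    # Quadratic form v' m v.
    return sum(v[a] * m[a][b] * v[b] for a in range(P) for b in range(P))

groups = [g1, g2]
a_j = [[[v / len(g) for v in row] for row in cov_mat(g)] for g in groups]  # A_j = S_j / n_j
a = [[a_j[0][r][c] + a_j[1][r][c] for c in range(P)] for r in range(P)]    # A = A_1 + A_2
a_inv = inv2(a)
d = [x - y for x, y in zip(mean_vec(g1), mean_vec(g2))]
u = [sum(a_inv[r][c] * d[c] for c in range(P)) for r in range(P)]          # A^{-1} d
t2 = sum(d[r] * u[r] for r in range(P))                                    # equation 8

# Assumption: b_j = d' A^{-1} A_j A^{-1} d (A used in place of V).
inv_f1 = sum((quad(u, aj) / t2) ** 2 / (len(g) - 1)
             for g, aj in zip(groups, a_j))
f1 = 1 / inv_f1
nu_k = f1 - P + 1
f_y = nu_k / (P * f1) * t2                                                 # equation A13
print(round(nu_k, 1), round(f_y, 1))   # 6.1 2.1
```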
Footnotes
¹ The sum of the eigenvalues of a matrix is called the trace of a matrix.
² The skewness for the normal distribution is zero.
Table 1 Multivariate Example Data Set
Group Subject Yi1 Yi2
1 1 1 51
1 2 28 48
1 3 12 49
1 4 13 51
1 5 13 52
1 6 11 47
2 1 19 46
2 2 18 48
2 3 18 50
2 4 21 50
2 5 19 45
2 6 20 46
2 7 22 48
2 8 18 49
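Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$, $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$, $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$

Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$, $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$, $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$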
Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure   Test Statistic             df                    p-value/Critical value (CV)   Decision re: Null Hypothesis

Least-Squares Estimators
T²          T1 = 6.1;  FT = 2.8        ν1 = 2;   ν2 = 11     p = .106                      Fail to Reject
BF          TBF = 9.1; FBF = 3.7       ν1 = 4;   ν2 = 4.4    p = .116                      Fail to Reject
J2          T2 = 5.0                   ν1 = 2                CV = 14.2                     Fail to Reject
J           T2 = 5.0;  FJ = 2.3        ν1 = 2;   ν2 = 6.9    p = .175                      Fail to Reject
K           FK = 2.5                   ν1 = 1.5; ν2 = 6.1    p = .164                      Fail to Reject
NV          T2 = 5.0;  FNV = 2.0       ν1 = 2;   ν2 = 4.4    p = .237                      Fail to Reject
Y           T2 = 5.0;  FY = 2.1        ν1 = 2;   ν2 = 6.1    p = .198                      Fail to Reject

Robust Estimators
T²          T1 = 59.0;   FT = 25.8     ν1 = 2;   ν2 = 7      p = .001                      Reject
BF          TBF = 131.2; FBF = 56.2    ν1 = 5;   ν2 = 6.0    p = .001                      Reject
J2          T2 = 65.2                  ν1 = 2                CV = 13.3                     Reject
J           T2 = 65.2;   FJ = 29.5     ν1 = 2;   ν2 = 6.3    p = .001                      Reject
K           FK = 28.1                  ν1 = 2.0; ν2 = 6.6    p = .001                      Reject
NV          T2 = 65.2;   FNV = 27.9    ν1 = 2;   ν2 = 6.0    p = .001                      Reject
Y           T2 = 65.2;   FY = 28.3     ν1 = 2;   ν2 = 6.6    p = .001                      Reject

Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
Multivariate Tests of Means in Independent Groups Designs
Effects of Covariance Heterogeneity and Non-Normality
Health evaluation research frequently involves the collection of multiple outcome
measurements on two or more groups of subjects. For example, Harasym et al. (1996) obtained
scores on multiple personal traits using the Myers-Briggs inventory for nursing students with
three different styles of learning. Knapp and Miller (1983) discuss the measurement of multiple
dimensions of healthcare quality as part of a system-wide evaluation program.
In the simplest case, where there are only two groups, such as a case and a control group,
Hotelling's (1931) T² is the traditional method for testing that the means for the set of outcome
variables are equivalent across groups (i.e., the hypothesis of multivariate mean equality). The T²
statistic is the analogue of Student's two-group t statistic for testing equality of group means for
a single outcome variable. While Hotelling's T² is the most common choice in the multivariate
context, most applied researchers are unaware that it rests on a set of derivational assumptions
that are not likely to be satisfied in evaluation research. Specifically, this test assumes that the
outcome measurements follow a multivariate normal distribution and exhibit a common
covariance structure. Multivariate normality is the assumption that all outcome variables and all
combinations of the variables are normally distributed. The assumption of a common covariance
matrix is that the populations will exhibit the same variances and covariances for all of the
outcome variables. For example, with two groups and three dependent variables, the assumption
of covariance homogeneity is that the three variances and three covariances are equivalent for the
two populations from which the data were sampled.
The T² test is not robust to assumption violations, meaning that it is sensitive to changes
in those factors which are extraneous to the hypothesis of interest (Ito, 1980). In fact, this test
may become seriously biased when assumptions are not satisfied, resulting in spurious decisions
about the null hypothesis. Moreover, the assumptions of normality and covariance homogeneity
are not likely to be satisfied in practice. Outliers or extreme observations are often a significant
concern in evaluation research (see, e.g., Sharmer, 2001). Furthermore, subjects who are exposed
to a particular healthcare treatment or intervention may exhibit greater variability on the outcome
measures than subjects who are not exposed to it (see, e.g., Grissom, 2000; Hill & Dixon, 1982;
Hoover, 2002). Consequently, researchers who rely on Hotelling's (1931) T² procedure to test
hypotheses about equality of multivariate group means may unwittingly be filling the literature
with non-replicable results or, at other times, may fail to detect intervention effects when they are
present. This should be of significant concern to health evaluation researchers because the results
of statistical tests are routinely used to make decisions about the effectiveness of clinical
interventions and to plan healthcare program content and delivery. In this era of evidence-based
decision making, it is important to ensure that the statistical procedures that are applied to
evaluation data will produce valid results.
Within the last 50 years, a number of statistical procedures that are robust to violations of
the assumption of covariance homogeneity have been proposed in the literature. However, these
procedures are largely unknown to applied researchers and therefore are not likely to be adopted
in practice. Moreover, all of these procedures are sensitive to departures from multivariate
normality. Recent research shows that it is possible to obtain a test that is robust to the combined
effects of covariance heterogeneity and non-normality. This involves substituting robust
measures of location and scale for the usual mean and covariances in tests that are insensitive to
covariance heterogeneity. These robust measures are less affected by the presence of outlying
scores or skewed distributions than traditional measures.
The purpose of this paper is to introduce health evaluation researchers to both the
concepts and the applications of robust test procedures for multivariate data. This paper begins
with an introduction to the statistical notation that will be helpful in understanding the concepts.
This is followed by a discussion of procedures that can be used to test the hypothesis of
multivariate mean equality when statistical assumptions are and are not satisfied. We will then
show how to obtain a test that is robust to the combined effects of covariance heterogeneity and
multivariate non-normality. Throughout this presentation, a numeric example will help to
illustrate the concepts and computations. Finally, we demonstrate a computer program that can
be used to implement the statistical tests described in this paper. Robust procedures are largely
inaccessible to applied researchers because they have not yet been incorporated into extant
statistical software packages. The program that we introduce will be beneficial to evaluation
researchers who want to test hypotheses of mean equality in multivariate designs but are
concerned about whether their data may violate the assumptions which underlie conventional
methods of analysis.
Statistical Notation
Consider the case of a single outcome variable. Let Yij represent the measurement on that
outcome variable for the ith subject in the jth group (i = 1, …, nj; j = 1, 2). Under a normal theory
model, it is assumed that Yij follows a normal distribution with mean μj and variance σj². When
there are only two groups of subjects, the null hypothesis is
$$H_0: \mu_1 = \mu_2. \tag{1}$$
In other words, one wishes to test whether the population means for the outcome variable are
equivalent.
To generalize to the multivariate context, assume that we have measurements for each
subject on p outcome variables. In other words, instead of just a single value, we now have a set
of p values for each subject. Using matrix notation, Yij represents the vector (i.e., row) of p
outcome measurements for the ith subject in the jth group, that is,
Yij = [Yij1 … Yijp]. For example, Yij may represent the values for a series of measures of physical
function or attitudes towards a healthcare intervention. It is assumed that Yij follows a
multivariate normal distribution with mean μj and variance-covariance matrix Σj (i.e., Yij ~ N[μj,
Σj]). The vector μj contains the mean scores on each outcome variable, that is,
μj = [μj1 μj2 … μjp]. The variance-covariance matrix Σj is a p x p matrix, with the variances for
each outcome variable on the diagonal and the covariances for all pairs of outcome variables on
the off diagonal:
$$\boldsymbol{\Sigma}_j = \begin{bmatrix} \sigma_{j1}^2 & \sigma_{j12} & \cdots & \sigma_{j1p} \\ \vdots & & \ddots & \vdots \\ \sigma_{jp1} & \cdots & & \sigma_{jp}^2 \end{bmatrix}.$$
The null hypothesis
$$H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 \tag{2}$$
is used to test whether the means for the set of p outcome variables are equal across the two
groups.
As Knapp and Miller (1983) observe, adopting a test of multivariate (i.e., joint)
equivalence is preferable to adopting multiple tests of univariate equivalence, particularly when
the outcome variables are correlated. Type I errors, which are erroneous conclusions about true
null hypotheses, may occur when multiple univariate tests are performed. If the outcome
variables are independent, the probability that at least one erroneous decision will be made on the
set of p outcome variables is 1 – (1 – α)^p, which is approximately equal to αp when α is small,
where α is the nominal level of significance. For example, with α = .05 and p = 3, the probability
of making at least one erroneous decision is ≈ .14.
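The familywise error calculation can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name familywise_error is hypothetical, and independence of the p outcome variables is assumed, as in the text.

```python
def familywise_error(alpha, p):
    """Probability of at least one Type I error across p independent tests."""
    return 1 - (1 - alpha) ** p

rate = familywise_error(0.05, 3)
print(round(rate, 4))        # 0.1426
print(round(0.05 * 3, 4))    # 0.15, the small-alpha approximation
```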
To illustrate the multivariate concepts that have been presented to this point, we will use
the example data set of Table 1. These data are for two groups of subjects and two outcome
variables. Let nj represent the sample size for the jth group. The example data are for an
unbalanced design (i.e., unequal group sizes), where n1 = 6 and n2 = 8. The vector of scores for
the first subject of group 1 is Y11 = [1 51], the vector for the second subject of group 1 is
Y21 = [28 48], and so on.
Let Ȳj and Sj represent the sample mean vector and sample covariance matrix for the jth
group. Table 2 contains these summary statistics for the example data set. The mean scores for
the first outcome variable are 13.0 and 19.4 for groups 1 and 2, respectively. For the second
outcome variable, the corresponding means are 49.7 and 47.8. The variances for groups 1 and 2
on the first outcome measure are 74.8 and 2.3, respectively. The larger variance for the first
group is primarily due to the presence of two extreme values of 1 and 28 for the first and second
subjects, respectively. The corresponding variances on the second outcome variable are 3.9 and
3.6. For group 1, the covariance for the two outcome variables is –7.0. The population
correlation for two variables q and q′ (i.e., ρqq′; q, q′ = 1, …, p) can be obtained from the
covariance and the variances,
$$\rho_{qq'} = \frac{\sigma_{qq'}}{\sqrt{\sigma_q^2\,\sigma_{q'}^2}},$$
where σqq′ is the covariance and σq² is the variance for the qth outcome variable. The sample
correlation coefficient, rqq′, is used to estimate the population correlation coefficient. In the
example data set of Table 1, a moderate negative correlation of rqq′ = –.4 exists for the two
variables for group 1.
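The group 1 summary statistics quoted above can be recomputed from Table 1 with a minimal Python sketch. The helper names are hypothetical, and the usual n – 1 divisor is assumed for the variances and covariance.

```python
import math

# Group 1 of the example data in Table 1: (Y1, Y2) for each subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]

def mean_vec(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(2)]

def cov_mat(rows):
    # Sample covariance matrix with divisor n - 1.
    m = mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (len(rows) - 1)
             for b in range(2)] for a in range(2)]

m1 = mean_vec(g1)
s1 = cov_mat(g1)
r12 = s1[0][1] / math.sqrt(s1[0][0] * s1[1][1])   # sample correlation

print([round(x, 1) for x in m1])                                    # [13.0, 49.7]
print(round(s1[0][0], 1), round(s1[1][1], 1), round(s1[0][1], 1))   # 74.8 3.9 -7.0
print(round(r12, 2))                                                # -0.41
```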
Tests for Mean Equality when Assumptions are Satisfied
Student's t statistic is the conventional procedure for testing the null hypothesis of
equality of population means in a univariate design (i.e., equation 1). The test statistic, which
assumes equality of population variances, is
$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \tag{3}$$
where Ȳj represents the mean for the jth group and s², the variance that is pooled for the two
groups, is
$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \tag{4}$$
where sj² is the variance for the jth group.
The multivariate Hotelling's (1931) T² statistic is formed from equation 3 by replacing
means with mean vectors and the pooled variance with the pooled covariance matrix,
$$T_1 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}}\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}\right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2), \tag{5}$$
where T is the transpose operator, which is used to convert the row vector (Ȳ1 – Ȳ2) to a
column vector, –1 denotes the inverse of a matrix, and S is the pooled sample covariance matrix,
$$\mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2}. \tag{6}$$
This test statistic is easily obtained from standard software packages such as SAS (SAS Institute,
1999b). Statistical significance of this T1 statistic is evaluated using the T² distribution. The test
statistic can also be converted to an F statistic,
$$F_T = \frac{N - p - 1}{p(N - 2)}\, T_1, \tag{7}$$
where N = n1 + n2. Statistical significance is then assessed by comparing the FT statistic to its
critical value F[p, N – p – 1], that is, a critical value from the F distribution with p and N – p – 1
degrees of freedom (df).
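As a check on equations 5 to 7, the following minimal Python sketch recomputes T1 and FT for the example data of Table 1. The helper names (mean_vec, cov_mat, inv2, quad) are illustrative assumptions and are hard-coded for p = 2 outcome variables; the sketch is not the SAS/IML program described in this paper.

```python
# Example data from Table 1: (Y1, Y2) for each subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
P = 2  # number of outcome variables (helpers below assume p = 2)

def mean_vec(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(P)]

def cov_mat(rows):
    m = mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (len(rows) - 1)
             for b in range(P)] for a in range(P)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def quad(v, m):
    # Quadratic form v' m v.
    return sum(v[a] * m[a][b] * v[b] for a in range(P) for b in range(P))

n1, n2 = len(g1), len(g2)
N = n1 + n2
s1, s2 = cov_mat(g1), cov_mat(g2)
# Pooled covariance matrix (equation 6).
pooled = [[((n1 - 1) * s1[a][b] + (n2 - 1) * s2[a][b]) / (N - 2)
           for b in range(P)] for a in range(P)]
d = [x - y for x, y in zip(mean_vec(g1), mean_vec(g2))]
scale = 1 / n1 + 1 / n2
t1 = quad(d, inv2([[scale * v for v in row] for row in pooled]))   # equation 5
f_t = (N - P - 1) / (P * (N - 2)) * t1                             # equation 7
print(round(t1, 1), round(f_t, 1))   # 6.1 2.8
```

The two values agree with the T1 and FT entries reported for Hotelling's procedure with least-squares estimators in Table 3.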
When the data are sampled from populations that follow a normal distribution but have
unequal covariance matrices (i.e., Σ1 ≠ Σ2), Hotelling's (1931) T² will generally maintain the rate
of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to α if the design is
balanced (i.e., n1 = n2; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins &
Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will
result, depending on the nature of the relationship between the covariance matrices and group
sizes. A liberal result is one in which the actual Type I error rate will exceed α. Liberal results
are problematic; researchers will be filling the literature with false positives (i.e., saying there are
treatment effects when none are present). A conservative result, on the other hand, is one in
which the Type I error rate will be less than α. Conservative results are also a cause for concern
because they may result in test procedures that have low statistical power to detect true
differences in population means (i.e., real effects will be undetected).
If the group with the largest sample size also exhibits the smallest element values of Σj,
which is known as a negative pairing condition, the error will be liberal. For example, Hopkins
and Clay (1963) showed that when group sizes were 10 and 20 and the ratio of the largest to the
smallest standard deviations of the groups was 1.6, the true rate of Type I errors for α = .05 was
.11. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate
was .21, more than four times the nominal level of significance. For positive pairings of group
sizes and covariance matrices, such that the group with the largest sample size also exhibits the
largest element values of Σj, the T² procedure tends to produce a conservative test. In fact, the
error rate may be substantially below the nominal level of significance. For example, Hopkins
and Clay observed Type I error rates of .02 and .01, respectively, for α = .05 for positive pairings
for the two standard deviation values noted previously. These liberal and conservative results for
normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian
et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986)
for both moderate and large degrees of covariance heterogeneity.
When the assumption of multivariate normality is violated, the performance of
Hotelling's (1931) T² test depends on both the degree of departure from a multivariate normal
distribution and the nature of the research design. The earliest research on Hotelling's T² when
the data are non-normal suggested that tests of the null hypothesis for two groups were relatively
insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true
when the data are only moderately non-normal. However, Everitt (1979) showed that this test
procedure can become quite conservative when the distribution is skewed or when outliers are
present in the tails of the distribution, particularly when the design is unbalanced (see also
Zwick, 1986).
Tests for Mean Equality when Assumptions are not Satisfied
Both parametric and nonparametric alternatives to the T² test have been proposed in the
literature. Applied researchers often regard nonparametric procedures as appealing alternatives
because they rely on rank scores, which are typically perceived as being easy to conceptualize
and interpret. However, these procedures test hypotheses about equality of distributions rather
than equality of means. They are therefore sensitive to covariance heterogeneity because
distributions with unequal variances will necessarily result in rejection of the null hypothesis.
Zwick (1986) showed that nonparametric alternatives to Hotelling's (1931) T² could control the
Type I error rate when the data were sampled from non-normal distributions and covariances
were equal. Not surprisingly, when covariances were unequal, these procedures produced biased
results, particularly when group sizes were unequal.
Covariance heterogeneity. There are several parametric alternatives to Hotelling's (1931)
T² test. These include the Brown-Forsythe (Brown & Forsythe, 1974 [BF]), James (1954) first
and second order (J1 & J2), Johansen (1980 [J]), Kim (1992 [K]), Nel and Van der Merwe
(1986 [NV]), and Yao (1965 [Y]) procedures. The BF, J1, J2, and J procedures have also been
generalized to multivariate designs containing more than two groups of subjects.
The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic,
$$T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}}\left(\frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2}\right)^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2). \tag{8}$$
The J, NV, and Y procedures approximate the distribution of T2 differently because the df for
these procedures are computed using different formulas. The J1 and J2 procedures each use
a different critical value to assess statistical significance. However, they both rely on
large-sample theory regarding the distribution of the test statistic in equation 8. What this means is that
when sample sizes are sufficiently large, the T2 statistic approximately follows a chi-squared (χ²)
distribution. For both procedures, this test statistic is referred to an “adjusted” χ² critical value. If
the test statistic exceeds that critical value, the null hypothesis of equation 2 is rejected. The
critical value for the J1 procedure is slightly smaller than the one for the J2 procedure. As a
result, the J1 procedure generally produces larger Type I error rates than the J2 procedure, and
therefore is not often recommended (de la Rey & Nel, 1993). While the J2 procedure may offer
better Type I error control, it is computationally complex. The critical value for J2 is described in
the Appendix, along with the F-statistic conversions and df computations for the J, NV, and Y
procedures.
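The statistic in equation 8 differs from Hotelling's statistic only in how the group covariance matrices are combined. A minimal Python sketch for the example data of Table 1 follows; the helper names are hypothetical and hard-coded for p = 2, and least-squares estimators are used.

```python
# Example data from Table 1: (Y1, Y2) for each subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
P = 2  # number of outcome variables (helpers below assume p = 2)

def mean_vec(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(P)]

def cov_mat(rows):
    m = mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (len(rows) - 1)
             for b in range(P)] for a in range(P)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

s1, s2 = cov_mat(g1), cov_mat(g2)
# Each covariance matrix is divided by its own group size; no pooling.
a = [[s1[r][c] / len(g1) + s2[r][c] / len(g2) for c in range(P)] for r in range(P)]
a_inv = inv2(a)
d = [x - y for x, y in zip(mean_vec(g1), mean_vec(g2))]
t2 = sum(d[r] * a_inv[r][c] * d[c] for r in range(P) for c in range(P))   # equation 8
print(round(t2, 1))   # 5.0
```

The result matches the T2 value shown in Table 3 for the J2, J, NV, and Y procedures with least-squares estimators.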
The K procedure is based on an F statistic. It is more complex than preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed.¹ For completeness, the formula used to compute the K test statistic and its df are
found in the Appendix.
The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly
from the one presented in equation 8,
$$T_{BF} = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}}\left[\left(1 - \frac{n_1}{N}\right)\mathbf{S}_1 + \left(1 - \frac{n_2}{N}\right)\mathbf{S}_2\right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2). \tag{9}$$
As one can see, the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8. Again, for completeness, the numeric solutions
for the BF F statistic and df are found in the Appendix.
Among the BF, J2, J, K, NV, and Y tests, there appears to be no one best choice in all
data-analytic situations when the data are normally distributed, although a comprehensive
comparison of all of these procedures has not yet been conducted. Factors such as the degree of
covariance heterogeneity, total sample size, the degree of imbalance of the group sizes, and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences. Christensen and Rencher (1997) noted, in their extensive comparison among the J2,
J, K, NV, and Y procedures, that the J and Y procedures could occasionally result in inflated
Type I error rates for negative pairings of group sizes and covariance matrices. These liberal
tendencies were exacerbated as the number of outcome variables increased. The authors
recommended the K procedure overall, observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates. However, the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative. Type I error rates as low as .02 for α = .05 were reported when p =
10, n1 = 30, and n2 = 20.
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that
the J1, J2, J, and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed. For a lognormal distribution, which has skewness
of 6.18 (see Footnote 2), they observed many instances in which empirical Type I error rates of
all of these procedures were more than four times the nominal level of significance. Wilcox
(1995) found that the J test produced excessive Type I errors when sample sizes were small
(i.e., n1 = 12 and n2 = 18) and the data were generated from non-normal distributions; the K
procedure became conservative when the skewness was 6.18. For larger group sizes (i.e., n1 = 24
and n2 = 36), the J procedure provided acceptable control of Type I errors when the data were
only moderately non-normal, but for the maximum skewness considered it also became
conservative. Fouladi and Yockey (2002) found that the degree of departure from a multivariate
normal distribution was a less important predictor of Type I error performance than sample
size. Across the range of conditions which they examined, the Y test produced the greatest
average Type I error rates and the NV procedure the smallest. Error rates were only slightly
influenced by the degree of skewness or kurtosis of the data; however, these authors looked at
only very modest departures from a normal distribution, and the maximum degree of skewness
considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that
have been proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to "shift" the mean in their direction.
One should recognize at the outset that while robust estimators are insensitive to
departures from a normal distribution, they test a different null hypothesis than least-squares
estimators. The null hypothesis is about equality of trimmed population means. In other words,
one is testing a hypothesis that focuses on the bulk of the population rather than the entire
population. Thus, if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters, then procedures
based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \cdots \le Y_{(n_j)j}$ represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then let $g_j = [\gamma n_j]$,
where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the
distribution and [x] is the greatest integer ≤ x. The effective sample size for the jth group is
defined as hj = nj − 2gj. The sample trimmed mean
$$\bar{Y}_{tj} = \frac{1}{h_j} \sum_{i = g_j + 1}^{n_j - g_j} Y_{(i)j} \qquad (10)$$
is computed by censoring the gj smallest and the gj largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
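The trimmed-mean computation just described can be sketched in a few lines. This is a minimal illustration, not the authors' SAS/IML code; the function name and the small tolerance added before taking the integer part (a guard against floating-point products such as 0.29 * 100 falling just below an integer) are assumptions of this sketch.

```python
import math


def trimmed_mean(scores, proportion=0.20):
    """Mean of the scores left after removing g = [proportion * n] values from each tail."""
    n = len(scores)
    g = math.floor(proportion * n + 1e-9)  # number trimmed per tail
    ordered = sorted(scores)
    kept = ordered[g:n - g]                # the h = n - 2g central observations
    return sum(kept) / len(kept)


# First outcome variable for group 1 of the example data (Table 1)
group1_y1 = [1, 28, 12, 13, 13, 11]
print(trimmed_mean(group1_y1))        # 12.25, reported as 12.2 in the text
print(trimmed_mean(group1_y1, 0.0))   # 13.0, the ordinary mean
```

With six observations and 20 percent trimming, one score is removed from each tail, so the extreme values 1 and 28 no longer influence the estimate.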
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed:
$$\bar{Y}_{wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Z_{ij} \qquad (11)$$
where
$$Z_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j} \end{cases}$$
The Winsorized mean is obtained by replacing the gj smallest values with the next most extreme
value and the gj largest values with the next most extreme value. The Winsorized variance for
the jth group on a single outcome variable, $s_{wj}^2$, is
$$s_{wj}^2 = \frac{\sum_{i=1}^{n_j} (Z_{ij} - \bar{Y}_{wj})^2}{n_j - 1} \qquad (12)$$
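and the standard error of the trimmed mean is $\sqrt{(n_j - 1) s_{wj}^2 / [h_j (h_j - 1)]}$. The Winsorized
covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is
$$s_{wjqq'} = \frac{\sum_{i=1}^{n_j} (Z_{ijq} - \bar{Y}_{wjq})(Z_{ijq'} - \bar{Y}_{wjq'})}{n_j - 1} \qquad (13)$$
and the Winsorized covariance matrix for the jth group is
$$\mathbf{S}_{wj} = \begin{bmatrix} s_{wj1}^2 & s_{wj12} & \cdots & s_{wj1p} \\ \vdots & & \ddots & \vdots \\ s_{wjp1} & \cdots & & s_{wjp}^2 \end{bmatrix}$$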
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are
1 11 12 13 13 28
and with 20% trimming, g1 = [6 × .20] = 1. The scores of 1 and 28 are removed and the mean of
the remaining scores is computed, which produces $\bar{Y}_{t1} = 12.2$. Table 2 contains the vectors of
trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:
11 11 12 13 13 13
The Winsorized mean $\bar{Y}_{w1}$ is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result.
Table 2 contains the Winsorized covariance matrices for the two groups.
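The Winsorizing steps can be sketched as follows for the first group of the example data. The function names are illustrative only. Each variable is Winsorized separately while the pairing of scores within subjects is kept, which is the reading of equations 11 to 13 assumed here.

```python
import math


def winsorize(scores, proportion=0.20):
    """Pull the g smallest and g largest scores in to the next most extreme value."""
    n = len(scores)
    g = math.floor(proportion * n + 1e-9)
    ordered = sorted(scores)
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(y, low), high) for y in scores]


def winsorized_covariance(x, y, proportion=0.20):
    """Winsorized covariance of x and y; with y = x this is the Winsorized variance."""
    zx, zy = winsorize(x, proportion), winsorize(y, proportion)
    n = len(zx)
    mean_x, mean_y = sum(zx) / n, sum(zy) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(zx, zy)) / (n - 1)


# Group 1 of the example data (Table 1)
y1 = [1, 28, 12, 13, 13, 11]
y2 = [51, 48, 49, 51, 52, 47]

z1 = winsorize(y1)
print(sorted(z1))                                  # [11, 11, 12, 13, 13, 13]
print(round(sum(z1) / len(z1), 1))                 # 12.2
print(round(winsorized_covariance(y1, y1), 1))     # 1.0
print(round(winsorized_covariance(y1, y2), 1))     # 0.3
print(round(winsorized_covariance(y2, y2), 1))     # 2.3
```

The three rounded values are the elements of the Winsorized covariance matrix for group 1.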
A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators the T1 statistic of equation 5 becomes
$$T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^{T} \left[ \left( \frac{1}{h_1} + \frac{1}{h_2} \right) \mathbf{S}_w \right]^{-1} (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2}) \qquad (14)$$
where
$$\mathbf{S}_w = \frac{(n_1 - 1)\mathbf{S}_{w1} + (n_2 - 1)\mathbf{S}_{w2}}{h_1 + h_2 - 2} \qquad (15)$$
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators when the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators
was similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24
and n2 = 36). More importantly, however, there was a dramatic improvement in power when the
test procedures with robust estimators were compared to their least-squares counterparts; this
was observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions.
The differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:
    Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
    NX={6 8};
    PTRIM=0;
    ALPHA=.05;
    RUN T2MULT;
    QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified. The RUN T2MULT code invokes the program and
generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that
these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) T². We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article we have reviewed the shortcomings of Hotelling's (1931) T² test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics - Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T² tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics - Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of Statistics, Vol. 1 (pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the $T_0^2$ test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics - Simulation and Computation, 26,
1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics - Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate
Behavioral Research, 21, 169-186.
Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) T² Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let wj = nj/N and $\bar{w}_j$ = 1 − wj. Then
$$F_{BF} = \frac{\nu_{BF2}}{p f_2} T_{BF} \qquad (A1)$$
where νBF2 = f2 − p + 1, TBF is given in equation 9, and
$$f_2 = \frac{tr(\mathbf{G}_1^2) + [tr(\mathbf{G}_1)]^2}{\dfrac{1}{n_1 - 1}\left\{ tr[(\bar{w}_1 \mathbf{S}_1)^2] + [tr(\bar{w}_1 \mathbf{S}_1)]^2 \right\} + \dfrac{1}{n_2 - 1}\left\{ tr[(\bar{w}_2 \mathbf{S}_2)^2] + [tr(\bar{w}_2 \mathbf{S}_2)]^2 \right\}} \qquad (A2)$$
In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1 \mathbf{S}_1 + \bar{w}_2 \mathbf{S}_2$. The test statistic FBF is
compared to the critical value F[νBF1, νBF2], where
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and G2 = w1S1 + w2S2
James (1954) Second Order
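The test statistic T2 of equation 8 is compared to the critical value $\chi_p^2 (A + \chi_p^2 B) + q$,
where $\chi_p^2$ is the 1 − α percentile point of the χ² distribution with p df,
$$A = 1 + \frac{1}{2p} \left\{ \frac{1}{n_1 - 1} \left[ tr(\mathbf{A}^{-1}\mathbf{A}_1) \right]^2 + \frac{1}{n_2 - 1} \left[ tr(\mathbf{A}^{-1}\mathbf{A}_2) \right]^2 \right\} \qquad (A4)$$
$\mathbf{A}_j = \mathbf{S}_j / n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and
$$B = \frac{1}{p(p + 2)} \left\{ \frac{1}{n_1 - 1} \left[ tr(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \frac{1}{2} \left[ tr(\mathbf{A}^{-1}\mathbf{A}_1) \right]^2 \right] + \frac{1}{n_2 - 1} \left[ tr(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \frac{1}{2} \left[ tr(\mathbf{A}^{-1}\mathbf{A}_2) \right]^2 \right] \right\} \qquad (A5)$$
The constant q is based on a lengthy formula which has not been reproduced here; it can be
found in equation 6.7 of James (1954).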
Johansen (1980)
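Let FJ = T2/c2, where c2 = p + 2C − 6C/(p + 2) and
$$C = \frac{1}{2} \sum_{j=1}^{2} \frac{1}{n_j - 1} \left\{ tr\left[ (\mathbf{A}^{-1}\mathbf{A}_j)^2 \right] + \left[ tr(\mathbf{A}^{-1}\mathbf{A}_j) \right]^2 \right\} \qquad (A6)$$
The test statistic FJ is compared to the critical value F[p, νJ], where νJ = p(p + 2)/(3C).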
Kim (1992)
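The K procedure is based on the test statistic
$$F_K = \frac{\nu_K (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T} \mathbf{V}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)}{c\, m\, f_1} \qquad (A7)$$
where $\mathbf{V} = \mathbf{A}_1 + r^2 \mathbf{A}_2 + 2r \mathbf{A}_2^{1/2} (\mathbf{A}_2^{-1/2} \mathbf{A}_1 \mathbf{A}_2^{-1/2})^{1/2} \mathbf{A}_2^{1/2}$,
$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l} \qquad (A8)$$
$$m = \frac{\left( \sum_{l=1}^{p} h_l \right)^2}{\sum_{l=1}^{p} h_l^2} \qquad (A9)$$
$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1 \mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1 \mathbf{A}_2^{-1}|^{1/(2p)}$, and | | is the
determinant. The test statistic FK is compared to the critical value F[m, νK], where νK = f1 − p + 1,
$$\frac{1}{f_1} = \sum_{j=1}^{2} \frac{1}{n_j - 1} \left( \frac{b_j}{T_2} \right)^2 \qquad (A10)$$
and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T} \mathbf{V}^{-1} \mathbf{A}_j \mathbf{V}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.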
Nel and van der Merwe (1986)
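Let
$$F_{NV} = \frac{\nu_N}{p f_2} T_2 \qquad (A11)$$
where νN = f2 − p + 1 and
$$f_2 = \frac{tr(\mathbf{A}^2) + [tr(\mathbf{A})]^2}{\sum_{j=1}^{2} \dfrac{1}{n_j - 1} \left\{ tr(\mathbf{A}_j^2) + [tr(\mathbf{A}_j)]^2 \right\}} \qquad (A12)$$
The FNV statistic is compared to the critical value F[p, νN].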
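The NV computations can be checked against Table 3 with a short sketch for the two-variable example data. It assumes that T2 of equation 8 is the quadratic form in the inverse of A = S1/n1 + S2/n2 and that f2 has the trace form of equation A12; both are readings of the formulas rather than a transcription of the authors' SAS/IML code, and the helper names are illustrative.

```python
def covariance_matrix(rows):
    """Unbiased 2 x 2 sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    means = [sum(r[k] for r in rows) / n for k in range(2)]
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        dev = [r[0] - means[0], r[1] - means[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += dev[a] * dev[b] / (n - 1)
    return s, means


def trace_terms(m):
    """tr(M^2) + [tr(M)]^2 for a 2 x 2 matrix M."""
    trace_of_square = m[0][0] ** 2 + 2 * m[0][1] * m[1][0] + m[1][1] ** 2
    return trace_of_square + (m[0][0] + m[1][1]) ** 2


def nel_van_der_merwe(group1, group2):
    """Return (T2, nu_N, F_NV) for two groups measured on p = 2 outcome variables."""
    p = 2
    n1, n2 = len(group1), len(group2)
    s1, mean1 = covariance_matrix(group1)
    s2, mean2 = covariance_matrix(group2)
    a1 = [[v / n1 for v in row] for row in s1]           # A1 = S1 / n1
    a2 = [[v / n2 for v in row] for row in s2]           # A2 = S2 / n2
    a = [[a1[i][j] + a2[i][j] for j in range(2)] for i in range(2)]
    d = [mean1[k] - mean2[k] for k in range(2)]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    t2 = (a[1][1] * d[0] ** 2 - 2 * a[0][1] * d[0] * d[1] + a[0][0] * d[1] ** 2) / det
    f2 = trace_terms(a) / (trace_terms(a1) / (n1 - 1) + trace_terms(a2) / (n2 - 1))
    nu = f2 - p + 1
    return t2, nu, nu / (p * f2) * t2


GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

t2, nu_n, f_nv = nel_van_der_merwe(GROUP1, GROUP2)
print(round(t2, 1), round(nu_n, 1), round(f_nv, 1))
```

For the example data the three values round to the entries reported for the NV procedure with least-squares estimators in Table 3 (T2 = 5.0, ν2 = 4.4, FNV = 2.0).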
Yao (1965)
The statistic
$$F_Y = \frac{\nu_K}{p f_1} T_2 \qquad (A13)$$
is referred to the critical value F[p, νK], where f1 is given by equation A10 and νK again equals
f1 − p + 1.
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of a matrix.
2. The skewness for the normal distribution is zero.
Table 1 Multivariate Example Data Set
Group Subject Yi1 Yi2
1 1 1 51
1 2 28 48
1 3 12 49
1 4 13 51
1 5 13 52
1 6 11 47
2 1 19 46
2 2 18 48
2 3 18 50
2 4 21 50
2 5 19 45
2 6 20 46
2 7 22 48
2 8 18 49
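Table 2. Summary Statistics for Least-Squares and Robust Estimators
Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$     $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$     $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$
Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$     $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$     $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$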
Table 3. Hypothesis Test Results for Multivariate Example Data Set
Procedure   Test Statistic               df                    p-value/Critical value (CV)   Decision re Null Hypothesis
Least-Squares Estimators
T²          T1 = 6.1;  FT = 2.8          ν1 = 2;   ν2 = 11     p = .106                      Fail to Reject
BF          TBF = 9.1; FBF = 3.7         ν1 = 4;   ν2 = 4.4    p = .116                      Fail to Reject
J2          T2 = 5.0                     ν1 = 2                CV = 14.2                     Fail to Reject
J           T2 = 5.0;  FJ = 2.3          ν1 = 2;   ν2 = 6.9    p = .175                      Fail to Reject
K           FK = 2.5                     ν1 = 1.5; ν2 = 6.1    p = .164                      Fail to Reject
NV          T2 = 5.0;  FNV = 2.0         ν1 = 2;   ν2 = 4.4    p = .237                      Fail to Reject
Y           T2 = 5.0;  FY = 2.1          ν1 = 2;   ν2 = 6.1    p = .198                      Fail to Reject
Robust Estimators
T²          T1 = 59.0;  FT = 25.8        ν1 = 2;   ν2 = 7      p = .001                      Reject
BF          TBF = 131.2; FBF = 56.2      ν1 = 5;   ν2 = 6.0    p = .001                      Reject
J2          T2 = 65.2                    ν1 = 2                CV = 13.3                     Reject
J           T2 = 65.2;  FJ = 29.5        ν1 = 2;   ν2 = 6.3    p = .001                      Reject
K           FK = 28.1                    ν1 = 2.0; ν2 = 6.6    p = .001                      Reject
NV          T2 = 65.2;  FNV = 27.9       ν1 = 2;   ν2 = 6.0    p = .001                      Reject
Y           T2 = 65.2;  FY = 28.3        ν1 = 2;   ν2 = 6.6    p = .001                      Reject
Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
Multivariate Tests of Means in Independent Groups Designs
Effects of Covariance Heterogeneity and Non-Normality
Health evaluation research frequently involves the collection of multiple outcome
measurements on two or more groups of subjects. For example, Harasym et al. (1996) obtained
scores on multiple personal traits using the Myers-Briggs inventory for nursing students with
three different styles of learning. Knapp and Miller (1983) discuss the measurement of multiple
dimensions of healthcare quality as part of a system-wide evaluation program.
In the simplest case, where there are only two groups, such as a case and a control group,
Hotelling's (1931) T² is the traditional method for testing that the means for the set of outcome
variables are equivalent across groups (i.e., the hypothesis of multivariate mean equality). The T²
statistic is the analogue of Student's two-group t statistic for testing equality of group means for
a single outcome variable. While Hotelling's T² is the most common choice in the multivariate
context, most applied researchers are unaware that it rests on a set of derivational assumptions
that are not likely to be satisfied in evaluation research. Specifically, this test assumes that the
outcome measurements follow a multivariate normal distribution and exhibit a common
covariance structure. Multivariate normality is the assumption that all outcome variables and all
combinations of the variables are normally distributed. The assumption of a common covariance
matrix is that the populations will exhibit the same variances and covariances for all of the
outcome variables. For example, with two groups and three dependent variables, the assumption
of covariance homogeneity is that the three variances and three covariances are equivalent for the
two populations from which the data were sampled.
The T² test is not robust to assumption violations, meaning that it is sensitive to changes
in those factors which are extraneous to the hypothesis of interest (Ito, 1980). In fact, this test
may become seriously biased when assumptions are not satisfied, resulting in spurious decisions
about the null hypothesis. Moreover, the assumptions of normality and covariance homogeneity
are not likely to be satisfied in practice. Outliers or extreme observations are often a significant
concern in evaluation research (see, e.g., Sharmer, 2001). Furthermore, subjects who are exposed
to a particular healthcare treatment or intervention may exhibit greater variability on the outcome
measures than subjects who are not exposed to it (see, e.g., Grissom, 2000; Hill & Dixon, 1982;
Hoover, 2002). Consequently, researchers who rely on Hotelling's (1931) T² procedure to test
hypotheses about equality of multivariate group means may unwittingly be filling the literature
with non-replicable results or, at other times, may fail to detect intervention effects when they are
present. This should be of significant concern to health evaluation researchers because the results
of statistical tests are routinely used to make decisions about the effectiveness of clinical
interventions and to plan healthcare program content and delivery. In this era of evidence-based
decision making, it is important to ensure that the statistical procedures that are applied to
evaluation data will produce valid results.
Within the last 50 years, a number of statistical procedures that are robust to violations of
the assumption of covariance homogeneity have been proposed in the literature. However, these
procedures are largely unknown to applied researchers and therefore are not likely to be adopted
in practice. Moreover, all of these procedures are sensitive to departures from multivariate
normality. Recent research shows that it is possible to obtain a test that is robust to the combined
effects of covariance heterogeneity and non-normality. This involves substituting robust
measures of location and scale for the usual mean and covariances in tests that are insensitive to
covariance heterogeneity. These robust measures are less affected by the presence of outlying
scores or skewed distributions than traditional measures.
The purpose of this paper is to introduce health evaluation researchers to both the
concepts and the applications of robust test procedures for multivariate data. This paper begins
with an introduction to the statistical notation that will be helpful in understanding the concepts.
This is followed by a discussion of procedures that can be used to test the hypothesis of
multivariate mean equality when statistical assumptions are and are not satisfied. We will then
show how to obtain a test that is robust to the combined effects of covariance heterogeneity and
multivariate non-normality. Throughout this presentation, a numeric example will help to
illustrate the concepts and computations. Finally, we demonstrate a computer program that can
be used to implement the statistical tests described in this paper. Robust procedures are largely
inaccessible to applied researchers because they have not yet been incorporated into extant
statistical software packages. The program that we introduce will be beneficial to evaluation
researchers who want to test hypotheses of mean equality in multivariate designs but are
concerned about whether their data may violate the assumptions which underlie conventional
methods of analysis.
Statistical Notation
Consider the case of a single outcome variable. Let Yij represent the measurement on that
outcome variable for the ith subject in the jth group (i = 1, …, nj; j = 1, 2). Under a normal theory
model, it is assumed that Yij follows a normal distribution with mean μj and variance $\sigma_j^2$. When
there are only two groups of subjects, the null hypothesis is
$$H_0: \mu_1 = \mu_2 \qquad (1)$$
In other words, one wishes to test whether the population means for the outcome variable are
equivalent.
To generalize to the multivariate context, assume that we have measurements for each
subject on p outcome variables. In other words, instead of just a single value, we now have a set
of p values for each subject. Using matrix notation, Yij represents the vector (i.e., row) of p
outcome measurements for the ith subject in the jth group, that is,
Yij = [Yij1 … Yijp]. For example, Yij may represent the values for a series of measures of physical
function or attitudes towards a healthcare intervention. It is assumed that Yij follows a
multivariate normal distribution with mean μj and variance-covariance matrix Σj (i.e., Yij ~ N[μj,
Σj]). The vector μj contains the mean scores on each outcome variable, that is,
μj = [μj1 μj2 … μjp]. The variance-covariance matrix Σj is a p x p matrix with the variances for
each outcome variable on the diagonal and the covariances for all pairs of outcome variables on
the off diagonal:
$$\boldsymbol{\Sigma}_j = \begin{bmatrix} \sigma_{j1}^2 & \sigma_{j12} & \cdots & \sigma_{j1p} \\ \vdots & & \ddots & \vdots \\ \sigma_{jp1} & \cdots & & \sigma_{jp}^2 \end{bmatrix}$$
The null hypothesis
$$H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 \qquad (2)$$
is used to test whether the means for the set of p outcome variables are equal across the two
groups.
As Knapp and Miller (1983) observe, adopting a test of multivariate (i.e., joint)
equivalence is preferable to adopting multiple tests of univariate equivalence, particularly when
the outcome variables are correlated. Type I errors, which are erroneous conclusions about true
null hypotheses, may occur when multiple univariate tests are performed. If the outcome
variables are independent, the probability that at least one erroneous decision will be made on the
set of p outcome variables is $1 - (1 - \alpha)^p$, which is approximately equal to αp when α is small,
where α is the nominal level of significance. For example, with α = .05 and p = 3, the probability
of making at least one erroneous decision is ≈ .14.
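The arithmetic behind this example can be reproduced directly. The function name below is illustrative, and the calculation assumes independent outcome variables, as in the text.

```python
def familywise_error_rate(alpha, p):
    """Probability of at least one Type I error across p independent tests at level alpha."""
    return 1 - (1 - alpha) ** p


rate = familywise_error_rate(0.05, 3)
print(round(rate, 2))      # 0.14
print(round(0.05 * 3, 2))  # 0.15, the alpha * p approximation
```

The exact value, .1426, is slightly below the αp approximation of .15, and the gap widens as α or p increases.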
To illustrate the multivariate concepts that have been presented to this point we will use
the example data set of Table 1 These data are for two groups of subjects and two outcome
variables Let nj represent the sample size for the jth group The example data are for an
unbalanced design (ie unequal group sizes) where n1 = 6 and n2 = 8 The vector of scores for
the first subject of group 1 is Y11 = [1 51] the vector for the second subject of group 1 is
Y21 = [28 48] and so on
Let Ȳj and Sj represent the sample mean vector and sample covariance matrix for the jth
group. Table 2 contains these summary statistics for the example data set. The mean scores for
the first outcome variable are 13.0 and 19.4 for groups 1 and 2, respectively. For the second
outcome variable, the corresponding means are 49.7 and 47.8. The variances for groups 1 and 2
on the first outcome measure are 74.8 and 2.3, respectively. The larger variance for the first
group is primarily due to the presence of two extreme values of 1 and 28 for the first and second
subjects, respectively. The corresponding variances on the second outcome variable are 3.9 and
3.6. For group 1, the covariance for the two outcome variables is –7.0. The population
correlation for two variables q and q′ (i.e., ρqq′; q, q′ = 1, …, p) can be obtained from the
covariance and the variances:
\[
\rho_{qq'} = \frac{\sigma_{qq'}}{\sqrt{\sigma_q^2\,\sigma_{q'}^2}}
\]
where σqq′ is the covariance and σ²q is the variance for the qth outcome variable. The sample
correlation coefficient, rqq′, is used to estimate the population correlation coefficient. In the
example data set of Table 1, a moderate negative correlation of rqq′ = –.4 exists for the two
variables for group 1.
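The summary statistics quoted for group 1 can be reproduced from the Table 1 scores. The following is a minimal sketch using only the Python standard library; the helper names (`mean_vector`, `covariance_matrix`, `correlation`) are illustrative assumptions and are not taken from the SAS/IML program described later in the article.

```python
from math import sqrt

# Group 1 rows of Table 1: (Y1, Y2) for each of the six subjects.
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]


def mean_vector(rows):
    """Least-squares mean of each outcome variable."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def covariance_matrix(rows):
    """Sample covariance matrix with divisor n - 1."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def correlation(S, q, q2):
    """Correlation of variables q and q2 from a covariance matrix S."""
    return S[q][q2] / sqrt(S[q][q] * S[q2][q2])


ybar1 = mean_vector(group1)
S1 = covariance_matrix(group1)
r12 = correlation(S1, 0, 1)
print([round(v, 1) for v in ybar1], [[round(v, 1) for v in row] for row in S1], round(r12, 1))
```

Rounded to one decimal place, the results agree with the means, variances, covariance, and correlation reported in the text for group 1.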
Tests for Mean Equality when Assumptions are Satisfied
Student’s t statistic is the conventional procedure for testing the null hypothesis of
equality of population means in a univariate design (i.e., equation 1). The test statistic, which
assumes equality of population variances, is
\[
t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \tag{3}
\]
where Ȳj represents the mean for the jth group and s², the variance that is pooled for the two
groups, is
\[
s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \tag{4}
\]
where s²j is the variance for the jth group.
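The multivariate Hotelling’s (1931) T² statistic is formed from equation 3 by replacing
means with mean vectors and the pooled variance with the pooled covariance matrix:
\[
T_1 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}
\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}\right]^{-1}
(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \tag{5}
\]
where T is the transpose operator, which is used to convert the row vector (Ȳ1 – Ȳ2) to a
column vector, ⁻¹ denotes the inverse of a matrix, and S is the pooled sample covariance matrix
\[
\mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2} \tag{6}
\]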
This test statistic is easily obtained from standard software packages such as SAS (SAS Institute,
1999b). Statistical significance of this T1 statistic is evaluated using the T² distribution. The test
statistic can also be converted to an F statistic,
\[
F_T = \frac{N - p - 1}{p(N - 2)}\,T_1 \tag{7}
\]
where N = n1 + n2. Statistical significance is then assessed by comparing the FT statistic to its
critical value F[p, N – p – 1], that is, a critical value from the F distribution with p and N – p – 1
degrees of freedom (df).
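Equations 5 through 7 can be traced numerically with the Table 1 data. The sketch below is an illustration in plain Python, not the SAS/IML module the article describes; all function names are hypothetical, and the matrix helpers are written for the two-variable case only.

```python
# Table 1 data: (Y1, Y2) per subject.
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]


def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]


def covariance_matrix(rows):
    n, m = len(rows), mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(2)] for a in range(2)]


def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]


def quad_form(d, M):
    """Return d' M d for a 2-vector d and 2 x 2 matrix M."""
    return sum(d[a] * M[a][b] * d[b] for a in range(2) for b in range(2))


n1, n2, p = len(group1), len(group2), 2
N = n1 + n2
m1, m2 = mean_vector(group1), mean_vector(group2)
S1, S2 = covariance_matrix(group1), covariance_matrix(group2)

# Equation 6: pooled covariance matrix.
S = [[((n1 - 1) * S1[a][b] + (n2 - 1) * S2[a][b]) / (N - 2) for b in range(2)]
     for a in range(2)]

# Equation 5: Hotelling's statistic.
d = [m1[k] - m2[k] for k in range(2)]
scale = 1 / n1 + 1 / n2
T1 = quad_form(d, inverse_2x2([[scale * S[a][b] for b in range(2)] for a in range(2)]))

# Equation 7: conversion to an F statistic with p and N - p - 1 df.
F_T = (N - p - 1) / (p * (N - 2)) * T1
print(round(T1, 1), round(F_T, 1), (p, N - p - 1))
```

Rounded to one decimal place, T1 and FT should agree with the least-squares entries for Hotelling’s T² in Table 3.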
When the data are sampled from populations that follow a normal distribution but have
unequal covariance matrices (i.e., Σ1 ≠ Σ2), Hotelling’s (1931) T² will generally maintain the rate
of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to α if the design is
balanced (i.e., n1 = n2; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins &
Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will
result, depending on the nature of the relationship between the covariance matrices and group
sizes. A liberal result is one in which the actual Type I error rate will exceed α. Liberal results
are problematic; researchers will be filling the literature with false positives (i.e., saying there are
treatment effects when none are present). A conservative result, on the other hand, is one in
which the Type I error rate will be less than α. Conservative results are also a cause for concern
because they may result in test procedures that have low statistical power to detect true
differences in population means (i.e., real effects will be undetected).
If the group with the largest sample size also exhibits the smallest element values of Σj,
which is known as a negative pairing condition, the test will be liberal. For example, Hopkins
and Clay (1963) showed that when group sizes were 10 and 20 and the ratio of the largest to the
smallest standard deviations of the groups was 1.6, the true rate of Type I errors for α = .05 was
.11. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate
was .21, more than four times the nominal level of significance. For positive pairings of group
sizes and covariance matrices, such that the group with the largest sample size also exhibits the
largest element values of Σj, the T² procedure tends to produce a conservative test. In fact, the
error rate may be substantially below the nominal level of significance. For example, Hopkins
and Clay observed Type I error rates of .02 and .01, respectively, for α = .05 for positive pairings
for the two standard deviation values noted previously. These liberal and conservative results for
normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian
et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986)
for both moderate and large degrees of covariance heterogeneity.
When the assumption of multivariate normality is violated, the performance of
Hotelling’s (1931) T² test depends on both the degree of departure from a multivariate normal
distribution and the nature of the research design. The earliest research on Hotelling’s T² when
the data are non-normal suggested that tests of the null hypothesis for two groups were relatively
insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true
when the data are only moderately non-normal. However, Everitt (1979) showed that this test
procedure can become quite conservative when the distribution is skewed or when outliers are
present in the tails of the distribution, particularly when the design is unbalanced (see also
Zwick, 1986).
Tests for Mean Equality when Assumptions are not Satisfied
Both parametric and nonparametric alternatives to the T² test have been proposed in the
literature. Applied researchers often regard nonparametric procedures as appealing alternatives
because they rely on rank scores, which are typically perceived as being easy to conceptualize
and interpret. However, these procedures test hypotheses about equality of distributions rather
than equality of means. They are therefore sensitive to covariance heterogeneity, because
distributions with unequal variances will necessarily result in rejection of the null hypothesis.
Zwick (1986) showed that nonparametric alternatives to Hotelling’s (1931) T² could control the
Type I error rate when the data were sampled from non-normal distributions and covariances
were equal. Not surprisingly, when covariances were unequal, these procedures produced biased
results, particularly when group sizes were unequal.
Covariance heterogeneity. There are several parametric alternatives to Hotelling’s (1931)
T² test. These include the Brown-Forsythe (Brown & Forsythe, 1974 [BF]), James (1954) first
and second order (J1 & J2), Johansen (1980 [J]), Kim (1992 [K]), Nel and Van der Merwe
(1986 [NV]), and Yao (1965 [Y]) procedures. The BF, J1, J2, and J procedures have also been
generalized to multivariate designs containing more than two groups of subjects.
The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic,
\[
T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}
\left(\frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2}\right)^{-1}
(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \tag{8}
\]
The J, NV, and Y procedures approximate the distribution of T2 differently because the df for
these three procedures are computed using different formulas. The J1 and J2 procedures each use
a different critical value to assess statistical significance. However, they both rely on
large-sample theory regarding the distribution of the test statistic in equation 8. What this means is that
when sample sizes are sufficiently large, the T2 statistic approximately follows a chi-squared (χ²)
distribution. For both procedures, this test statistic is referred to an “adjusted” χ² critical value. If
the test statistic exceeds that critical value, the null hypothesis of equation 2 is rejected. The
critical value for the J1 procedure is slightly smaller than the one for the J2 procedure. As a
result, the J1 procedure generally produces larger Type I error rates than the J2 procedure and
therefore is not often recommended (de la Rey & Nel, 1993). While the J2 procedure may offer
better Type I error control, it is computationally complex. The critical value for J2 is described in
the Appendix, along with the F-statistic conversions and df computations for the J, NV, and Y
procedures.
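The statistic in equation 8 differs from Hotelling’s statistic only in how the group covariance matrices are combined: each Sj is divided by its own nj rather than being pooled. A minimal Python sketch for the Table 1 data follows; the helper names are hypothetical and the matrix routines handle the two-variable case only.

```python
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]


def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]


def covariance_matrix(rows):
    n, m = len(rows), mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(2)] for a in range(2)]


def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]


def quad_form(d, M):
    return sum(d[a] * M[a][b] * d[b] for a in range(2) for b in range(2))


n1, n2 = len(group1), len(group2)
m1, m2 = mean_vector(group1), mean_vector(group2)
S1, S2 = covariance_matrix(group1), covariance_matrix(group2)

# S1/n1 + S2/n2: no pooling, so no equal-covariance assumption is made.
A = [[S1[a][b] / n1 + S2[a][b] / n2 for b in range(2)] for a in range(2)]
d = [m1[k] - m2[k] for k in range(2)]
T2 = quad_form(d, inverse_2x2(A))
print(round(T2, 1))
```

Rounded to one decimal place, the result should match the least-squares T2 value listed for the J2, J, NV, and Y rows of Table 3.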
The K procedure is based on an F statistic. It is more complex than the preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed.¹ For completeness, the formulas used to compute the K test statistic and its df are
found in the Appendix.
The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly
from the one presented in equation 8,
\[
T_{BF} = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}
\left[\left(1 - \frac{n_1}{N}\right)\mathbf{S}_1 + \left(1 - \frac{n_2}{N}\right)\mathbf{S}_2\right]^{-1}
(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \tag{9}
\]
As one can see, the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8. Again, for completeness, the numeric solutions
for the BF F statistic and df are found in the Appendix.
Among the BF, J2, J, K, NV, and Y tests, there appears to be no one best choice in all
data-analytic situations when the data are normally distributed, although a comprehensive
comparison of all of these procedures has not yet been conducted. Factors such as the degree of
covariance heterogeneity, total sample size, the degree of imbalance of the group sizes, and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences. Christensen and Rencher (1997) noted, in their extensive comparison among the J2,
J, K, NV, and Y procedures, that the J and Y procedures could occasionally result in inflated
Type I error rates for negative pairings of group sizes and covariance matrices. These liberal
tendencies were exacerbated as the number of outcome variables increased. The authors
recommended the K procedure overall, observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates. However, the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative. Type I error rates as low as .02 for α = .05 were reported when p =
10, n1 = 30, and n2 = 20.
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that
the J1, J2, J, and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed. For a lognormal distribution, which has skewness
of 6.18,² they observed many instances in which empirical Type I error rates of all of these
procedures were more than four times the nominal level of significance. Wilcox (1995b) found
that the J test produced excessive Type I errors when sample sizes were small (i.e., n1 = 12 and
n2 = 18) and the data were generated from non-normal distributions; the K procedure became
conservative when the skewness was 6.18. For larger group sizes (i.e., n1 = 24 and n2 = 36), the J
procedure provided acceptable control of Type I errors when the data were only moderately
non-normal, but for the maximum skewness considered it also became conservative. Fouladi and
Yockey (2002) found that the degree of departure from a multivariate normal distribution was a
less important predictor of Type I error performance than sample size. Across the range of
conditions which they examined, the Y test produced the greatest average Type I error rates and
the NV procedure the smallest. Error rates were only slightly influenced by the degree of
skewness or kurtosis of the data; however, these authors looked at only very modest departures
from a normal distribution; the maximum degree of skewness considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that
have been proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to “shift” the mean in their direction.
One should recognize at the outset that while robust estimators are insensitive to
departures from a normal distribution, they test a different null hypothesis than least-squares
estimators. The null hypothesis is about equality of trimmed population means. In other words,
one is testing a hypothesis that focuses on the bulk of the population, rather than the entire
population. Thus, if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters, then procedures
based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let Y(1)j ≤ Y(2)j ≤ … ≤ Y(nj)j represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then, let gj = [γnj],
where γ represents the proportion of observations that are to be trimmed in each tail of the
distribution and [x] is the greatest integer ≤ x. The effective sample size for the jth group is
defined as hj = nj – 2gj. The sample trimmed mean
\[
\bar{Y}_{tj} = \frac{1}{h_j}\sum_{i = g_j + 1}^{n_j - g_j} Y_{(i)j} \tag{10}
\]
is computed by censoring the gj smallest and the gj largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
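Equation 10 translates directly into a few lines of code. The sketch below is illustrative; `trimmed_mean` is a hypothetical helper name, γ is passed as `gamma`, and the greatest-integer operation is taken with `math.floor`.

```python
from math import floor


def trimmed_mean(values, gamma):
    """Mean of the observations left after trimming g = [gamma * n] from each tail."""
    n = len(values)
    g = floor(gamma * n)          # number censored in each tail
    ordered = sorted(values)      # Y(1) <= Y(2) <= ... <= Y(n)
    kept = ordered[g:n - g]       # the h = n - 2g central observations
    return sum(kept) / len(kept)


# First outcome variable of Table 1, groups 1 and 2, with 20 percent trimming.
y1_group1 = [1, 28, 12, 13, 13, 11]
y1_group2 = [19, 18, 18, 21, 19, 20, 22, 18]
print(trimmed_mean(y1_group1, 0.20), trimmed_mean(y1_group2, 0.20))
```

With γ = 0, the function reduces to the usual least-squares mean, because no observations are censored.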
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed,
\[
\bar{Y}_{wj} = \frac{1}{n_j}\sum_{i=1}^{n_j} Z_{ij} \tag{11}
\]
where
\[
Z_{ij} =
\begin{cases}
Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\
Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\
Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j}
\end{cases}
\]
The Winsorized mean is obtained by replacing the gj smallest values with the next most extreme
value and the gj largest values with the next most extreme value. The Winsorized variance for
the jth group on a single outcome variable, s²wj, is
\[
s_{wj}^2 = \frac{\sum_{i=1}^{n_j}\left(Z_{ij} - \bar{Y}_{wj}\right)^2}{n_j - 1} \tag{12}
\]
and the standard error of the trimmed mean is \(\sqrt{(n_j - 1)s_{wj}^2 / [h_j(h_j - 1)]}\). The Winsorized
covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is
\[
s_{wjqq'} = \frac{\sum_{i=1}^{n_j}\left(Z_{ijq} - \bar{Y}_{wjq}\right)\left(Z_{ijq'} - \bar{Y}_{wjq'}\right)}{n_j - 1} \tag{13}
\]
and the Winsorized covariance matrix for the jth group is
\[
\mathbf{S}_{wj} =
\begin{bmatrix}
s_{wj1}^2 & s_{wj12} & \cdots & s_{wj1p} \\
\vdots & & \ddots & \vdots \\
s_{wjp1} & \cdots & \cdots & s_{wjp}^2
\end{bmatrix}
\]
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are
1 11 12 13 13 28
and with 20% trimming, g1 = [6 × .20] = 1. The scores of 1 and 28 are removed, and the mean of
the remaining scores is computed, which produces Ȳt1 = 12.2. Table 2 contains the vectors of
trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:
11 11 12 13 13 13
The Winsorized mean Ȳw1 is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result.
Table 2 contains the Winsorized covariance matrices for the two groups.
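The Winsorizing steps in equations 11 through 13 can be sketched as follows for group 1 of Table 1. The helper names are hypothetical, and each outcome variable is Winsorized separately while keeping the scores in subject order, which is an assumption about how the cross-products in equation 13 are paired.

```python
from math import floor

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]


def winsorize(values, gamma):
    """Replace the g smallest and g largest scores by the next most extreme score."""
    n = len(values)
    g = floor(gamma * n)
    ordered = sorted(values)
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(v, low), high) for v in values]


def winsorized_moments(rows, gamma):
    """Winsorized means (equation 11) and covariance matrix (equations 12 and 13)."""
    n, p = len(rows), len(rows[0])
    cols = [winsorize([r[k] for r in rows], gamma) for k in range(p)]
    means = [sum(c) / n for c in cols]
    S = [[sum((cols[a][i] - means[a]) * (cols[b][i] - means[b]) for i in range(n)) / (n - 1)
          for b in range(p)] for a in range(p)]
    return means, S


w_means, S_w1 = winsorized_moments(group1, 0.20)
print(sorted(winsorize([r[0] for r in group1], 0.20)))
print([round(m, 1) for m in w_means], [[round(v, 1) for v in row] for row in S_w1])
```

The first printed line reproduces the Winsorized ordered observations shown above, and the rounded Winsorized mean for the first outcome variable agrees with the value of 12.2 given in the text.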
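A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators, the T1 statistic of equation 5 becomes
\[
T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^{\mathsf{T}}
\left[\left(\frac{1}{h_1} + \frac{1}{h_2}\right)\mathbf{S}_w\right]^{-1}
(\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2}) \tag{14}
\]
where
\[
\mathbf{S}_w = \frac{n_1 - 1}{h_1 - 1}\,\mathbf{S}_{w1} + \frac{n_2 - 1}{h_2 - 1}\,\mathbf{S}_{w2} \tag{15}
\]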
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators when the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators was
similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24 and n2
= 36). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
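As a rough numerical illustration of substituting robust estimators into the statistic of equation 8, the sketch below uses trimmed means for location and, for each group, the matrix (nj – 1)Swj / [hj(hj – 1)] in place of Sj/nj. That scaling is an assumption made here by analogy with the standard error of the trimmed mean given earlier; it is not stated as a matrix formula in the text. All helper names are hypothetical.

```python
from math import floor

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
GAMMA = 0.20  # proportion trimmed in each tail


def robust_summary(rows, gamma):
    """Trimmed mean vector, Winsorized covariance matrix, n, and effective size h."""
    n = len(rows)
    g = floor(gamma * n)
    h = n - 2 * g
    t_means, w_cols = [], []
    for k in range(2):
        ordered = sorted(r[k] for r in rows)
        t_means.append(sum(ordered[g:n - g]) / h)
        low, high = ordered[g], ordered[n - g - 1]
        w_cols.append([min(max(r[k], low), high) for r in rows])
    w_means = [sum(c) / n for c in w_cols]
    S_w = [[sum((w_cols[a][i] - w_means[a]) * (w_cols[b][i] - w_means[b])
                for i in range(n)) / (n - 1) for b in range(2)] for a in range(2)]
    return t_means, S_w, n, h


def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]


def quad_form(d, M):
    return sum(d[a] * M[a][b] * d[b] for a in range(2) for b in range(2))


t1, Sw1, n1, h1 = robust_summary(group1, GAMMA)
t2, Sw2, n2, h2 = robust_summary(group2, GAMMA)

# Assumed robust analogue of S1/n1 + S2/n2 (see the lead-in above).
A = [[(n1 - 1) * Sw1[a][b] / (h1 * (h1 - 1)) + (n2 - 1) * Sw2[a][b] / (h2 * (h2 - 1))
      for b in range(2)] for a in range(2)]
d = [t1[k] - t2[k] for k in range(2)]
T2_robust = quad_form(d, inverse_2x2(A))
print(round(T2_robust, 1))
```

Under this assumption, the rounded statistic agrees with the T2 value reported for the robust estimators in Table 3, and it is far larger than its least-squares counterpart because the two extreme scores in group 1 no longer inflate the variance.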
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix’s website,
http://home.cc.umanitoba.ca/~lixlm
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:
Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified. The RUN T2MULT code invokes the program and
generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that
these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling’s
(1931) T². We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that “methods of data analysis used in research have a major
effect on research progress” (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling’s (1931) T² test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and Van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao’s, James’, and Johansen’s
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling’s one- and
two-sample T² tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 86, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling’s T². Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 21, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student’s ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of statistics (Vol. 1, pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user’s guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user’s guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling’s T². Multivariate
Behavioral Research, 21, 169-186.
Appendix
Numeric Formulas for Alternatives to Hotelling’s (1931) T² Test
Brown and Forsythe (1974)
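The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let wj = nj/N and w̄j = 1 – wj. Then
\[
F_{BF} = \frac{\nu_{BF2}}{p f_2}\,T_{BF} \tag{A1}
\]
where νBF2 = f2 – p + 1, TBF is given in equation 9, and
\[
f_2 = \frac{\operatorname{tr}(\mathbf{G}_1^2) + (\operatorname{tr}\mathbf{G}_1)^2}
{\dfrac{1}{n_1 - 1}\left\{\operatorname{tr}\!\left[(\bar{w}_1\mathbf{S}_1)^2\right] + \left[\operatorname{tr}(\bar{w}_1\mathbf{S}_1)\right]^2\right\}
+ \dfrac{1}{n_2 - 1}\left\{\operatorname{tr}\!\left[(\bar{w}_2\mathbf{S}_2)^2\right] + \left[\operatorname{tr}(\bar{w}_2\mathbf{S}_2)\right]^2\right\}} \tag{A2}
\]
In equation A2, tr denotes the trace of a matrix and G1 = w̄1S1 + w̄2S2. The test statistic FBF is
compared to the critical value F[νBF1, νBF2], where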
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and G2 = w1S1 + w2S2
James (1954) Second Order
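The test statistic T2 of equation 8 is compared to the critical value χ²p(A + χ²pB) + q,
where χ²p is the 1 – α percentile point of the χ² distribution with p df,
\[
A = 1 + \frac{1}{2p}\left\{\frac{1}{n_1 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1)\right]^2
+ \frac{1}{n_2 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2)\right]^2\right\} \tag{A4}
\]
Aj = Sj/nj, the matrix A = A1 + A2, and
\[
B = \frac{1}{p(p + 2)}\left\{
\frac{1}{n_1 - 1}\left(\operatorname{tr}\!\left[(\mathbf{A}^{-1}\mathbf{A}_1)^2\right] + \frac{1}{2}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1)\right]^2\right)
+ \frac{1}{n_2 - 1}\left(\operatorname{tr}\!\left[(\mathbf{A}^{-1}\mathbf{A}_2)^2\right] + \frac{1}{2}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2)\right]^2\right)
\right\} \tag{A5}
\]
The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).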
Johansen (1980)
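Let FJ = T2/c2, where c2 = p + 2C – 6C/(p + 2) and
\[
C = \frac{1}{2}\sum_{j=1}^{2}\frac{1}{n_j - 1}
\left\{\operatorname{tr}\!\left[(\mathbf{A}^{-1}\mathbf{A}_j)^2\right] + \left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2\right\} \tag{A6}
\]
The test statistic FJ is compared to the critical value F[p, νJ], where νJ = p(p + 2)/(3C).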
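The Johansen computations can be traced for the Table 1 data with the sketch below. It is an illustration in plain Python with hypothetical helper names. It takes the divisor in c2 to be p + 2, the form in which the Johansen (1980) correction is usually stated; that choice is an assumption of the sketch, and with it the rounded results agree with the least-squares J entries in Table 3.

```python
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2


def scaled_covariance(rows):
    """Return (mean vector, A_j = S_j / n_j, n_j) for one group."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    S = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
          for b in range(2)] for a in range(2)]
    return m, [[S[a][b] / n for b in range(2)] for a in range(2)], n


def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]


def matmul(X, Y):
    return [[sum(X[a][k] * Y[k][b] for k in range(2)) for b in range(2)] for a in range(2)]


def trace(M):
    return M[0][0] + M[1][1]


m1, A1, n1 = scaled_covariance(group1)
m2, A2, n2 = scaled_covariance(group2)
A = [[A1[a][b] + A2[a][b] for b in range(2)] for a in range(2)]
A_inv = inverse_2x2(A)

# Equation 8.
d = [m1[k] - m2[k] for k in range(2)]
T2 = sum(d[a] * A_inv[a][b] * d[b] for a in range(2) for b in range(2))

# Equation A6.
C = 0.0
for A_j, n_j in ((A1, n1), (A2, n2)):
    M = matmul(A_inv, A_j)
    C += (trace(matmul(M, M)) + trace(M) ** 2) / (n_j - 1)
C *= 0.5

c2 = p + 2 * C - 6 * C / (p + 2)   # assumed divisor p + 2 (see lead-in)
F_J = T2 / c2
nu_J = p * (p + 2) / (3 * C)
print(round(F_J, 1), round(nu_J, 1))
```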
Kim (1992)
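The K procedure is based on the test statistic
\[
F_K = \frac{\nu_K\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)}{c\,m\,f_1} \tag{A7}
\]
where
\[
\mathbf{V} = \mathbf{A}_1 + r^2\mathbf{A}_2
+ 2r\,\mathbf{A}_2^{1/2}\left(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2}\right)^{1/2}\mathbf{A}_2^{1/2}
\]
\[
c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l} \tag{A8}
\]
\[
m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2} \tag{A9}
\]
hl = (dl + 1)/(dl^(1/2) + r)², where dl is the lth eigenvalue of A1A2⁻¹, r = |A1A2⁻¹|^(1/(2p)), and | | is the
determinant. The test statistic FK is compared to the critical value F[m, νK], where νK = f1 – p + 1,
\[
\frac{1}{f_1} = \sum_{j=1}^{2}\frac{1}{n_j - 1}\left(\frac{b_j}{T_2}\right)^2 \tag{A10}
\]
and bj = (Ȳ1 – Ȳ2)ᵀV⁻¹AjV⁻¹(Ȳ1 – Ȳ2).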
Nel and van der Merwe (1986)
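Let
\[
F_{NV} = \frac{\nu_N}{p f_2}\,T_2 \tag{A11}
\]
where νN = f2 – p + 1 and
\[
f_2 = \frac{\operatorname{tr}(\mathbf{A}^2) + (\operatorname{tr}\mathbf{A})^2}
{\sum_{j=1}^{2}\dfrac{1}{n_j - 1}\left\{\operatorname{tr}(\mathbf{A}_j^2) + (\operatorname{tr}\mathbf{A}_j)^2\right\}} \tag{A12}
\]
The FNV statistic is compared to the critical value F[p, νN].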
Yao (1965)
The statistic FY is referred to the critical value F[p, νK],
\[
F_Y = \frac{\nu_K}{p f_1}\,T_2 \tag{A13}
\]
where f1 is given by equation A10 and νK again equals f1 – p + 1.
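The df approximations for the NV and Y procedures can be sketched numerically for the Table 1 data. The code below is illustrative only, with hypothetical helper names. For the Y procedure it takes bj = (Ȳ1 – Ȳ2)ᵀA⁻¹AjA⁻¹(Ȳ1 – Ȳ2) with A = A1 + A2, which is the form usually attributed to Yao (1965); using A in place of V in bj is an assumption of this sketch. Under these assumptions the rounded results agree with the least-squares NV and Y rows of Table 3.

```python
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2


def scaled_covariance(rows):
    """Return (mean vector, A_j = S_j / n_j, n_j) for one group."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    S = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
          for b in range(2)] for a in range(2)]
    return m, [[S[a][b] / n for b in range(2)] for a in range(2)], n


def trace_terms(M):
    """Return tr(M^2) + (tr M)^2 for a symmetric 2 x 2 matrix M."""
    tr_sq = M[0][0] ** 2 + 2 * M[0][1] * M[1][0] + M[1][1] ** 2
    return tr_sq + (M[0][0] + M[1][1]) ** 2


def quad_form(v, M):
    return sum(v[a] * M[a][b] * v[b] for a in range(2) for b in range(2))


m1, A1, n1 = scaled_covariance(group1)
m2, A2, n2 = scaled_covariance(group2)
A = [[A1[a][b] + A2[a][b] for b in range(2)] for a in range(2)]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

d = [m1[k] - m2[k] for k in range(2)]
T2 = quad_form(d, A_inv)                      # equation 8

# Nel and van der Merwe: equations A11 and A12.
f2 = trace_terms(A) / (trace_terms(A1) / (n1 - 1) + trace_terms(A2) / (n2 - 1))
nu_N = f2 - p + 1
F_NV = nu_N / (p * f2) * T2

# Yao: equations A10 and A13, with b_j built from A (assumption, see lead-in).
u = [sum(A_inv[a][b] * d[b] for b in range(2)) for a in range(2)]   # A^{-1} d
b1, b2 = quad_form(u, A1), quad_form(u, A2)
f1 = 1 / ((b1 / T2) ** 2 / (n1 - 1) + (b2 / T2) ** 2 / (n2 - 1))
nu_K = f1 - p + 1
F_Y = nu_K / (p * f1) * T2
print(round(F_NV, 1), round(nu_N, 1), round(F_Y, 1), round(nu_K, 1))
```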
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of a matrix.
2. The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set
Group   Subject   Yi1   Yi2
1       1           1    51
1       2          28    48
1       3          12    49
1       4          13    51
1       5          13    52
1       6          11    47
2       1          19    46
2       2          18    48
2       3          18    50
2       4          21    50
2       5          19    45
2       6          20    46
2       7          22    48
2       8          18    49
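Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$, $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.8]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$, $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$

Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$, $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$, $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$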
Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure   Test Statistic               df                    p-value/Critical value (CV)   Decision re Null Hypothesis

Least-Squares Estimators
T^2         T1 = 6.1, FT = 2.8           ν1 = 2, ν2 = 11       p = .106                      Fail to Reject
BF          TBF = 9.1, FBF = 3.7         ν1 = 4, ν2 = 4.4      p = .116                      Fail to Reject
J2          T2 = 5.0                     ν1 = 2                CV = 14.2                     Fail to Reject
J           T2 = 5.0, FJ = 2.3           ν1 = 2, ν2 = 6.9      p = .175                      Fail to Reject
K           FK = 2.5                     ν1 = 1.5, ν2 = 6.1    p = .164                      Fail to Reject
NV          T2 = 5.0, FNV = 2.0          ν1 = 2, ν2 = 4.4      p = .237                      Fail to Reject
Y           T2 = 5.0, FY = 2.1           ν1 = 2, ν2 = 6.1      p = .198                      Fail to Reject

Robust Estimators
T^2         T1 = 59.0, FT = 25.8         ν1 = 2, ν2 = 7        p = .001                      Reject
BF          TBF = 131.2, FBF = 56.2      ν1 = 5, ν2 = 6.0      p = .001                      Reject
J2          T2 = 65.2                    ν1 = 2                CV = 13.3                     Reject
J           T2 = 65.2, FJ = 29.5         ν1 = 2, ν2 = 6.3      p = .001                      Reject
K           FK = 28.1                    ν1 = 2.0, ν2 = 6.6    p = .001                      Reject
NV          T2 = 65.2, FNV = 27.9        ν1 = 2, ν2 = 6.0      p = .001                      Reject
Y           T2 = 65.2, FY = 28.3         ν1 = 2, ν2 = 6.6      p = .001                      Reject

Note. T^2 = Hotelling's (1931) T^2; BF = Brown & Forsythe (1974); J2 = James (1954) second order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao (1965). T1 and T2 denote the statistics of equations 5 and 8, respectively.
may become seriously biased when assumptions are not satisfied, resulting in spurious decisions about the null hypothesis. Moreover, the assumptions of normality and covariance homogeneity are not likely to be satisfied in practice. Outliers or extreme observations are often a significant concern in evaluation research (see, e.g., Sharmer, 2001). Furthermore, subjects who are exposed to a particular healthcare treatment or intervention may exhibit greater variability on the outcome measures than subjects who are not exposed to it (see, e.g., Grissom, 2000; Hill & Dixon, 1982; Hoover, 2002). Consequently, researchers who rely on Hotelling's (1931) $T^2$ procedure to test hypotheses about equality of multivariate group means may unwittingly be filling the literature with non-replicable results or, at other times, may fail to detect intervention effects when they are present. This should be of significant concern to health evaluation researchers because the results of statistical tests are routinely used to make decisions about the effectiveness of clinical interventions and to plan healthcare program content and delivery. In this era of evidence-based decision making, it is important to ensure that the statistical procedures that are applied to evaluation data will produce valid results.
Within the last 50 years, a number of statistical procedures that are robust to violations of the assumption of covariance homogeneity have been proposed in the literature. However, these procedures are largely unknown to applied researchers and therefore are not likely to be adopted in practice. Moreover, all of these procedures are sensitive to departures from multivariate normality. Recent research shows that it is possible to obtain a test that is robust to the combined effects of covariance heterogeneity and non-normality. This involves substituting robust measures of location and scale for the usual mean and covariances in tests that are insensitive to covariance heterogeneity. These robust measures are less affected by the presence of outlying scores or skewed distributions than traditional measures.
The purpose of this paper is to introduce health evaluation researchers to both the concepts and the applications of robust test procedures for multivariate data. This paper begins with an introduction to the statistical notation that will be helpful in understanding the concepts. This is followed by a discussion of procedures that can be used to test the hypothesis of multivariate mean equality when statistical assumptions are and are not satisfied. We will then show how to obtain a test that is robust to the combined effects of covariance heterogeneity and multivariate non-normality. Throughout this presentation, a numeric example will help to illustrate the concepts and computations. Finally, we demonstrate a computer program that can be used to implement the statistical tests described in this paper. Robust procedures are largely inaccessible to applied researchers because they have not yet been incorporated into extant statistical software packages. The program that we introduce will be beneficial to evaluation researchers who want to test hypotheses of mean equality in multivariate designs but are concerned that their data may violate the assumptions which underlie conventional methods of analysis.
Statistical Notation
Consider the case of a single outcome variable. Let $Y_{ij}$ represent the measurement on that outcome variable for the $i$th subject in the $j$th group ($i = 1, \ldots, n_j$; $j = 1, 2$). Under a normal theory model, it is assumed that $Y_{ij}$ follows a normal distribution with mean $\mu_j$ and variance $\sigma_j^2$. When there are only two groups of subjects, the null hypothesis is
\[
H_0\colon \mu_1 = \mu_2 \tag{1}
\]
In other words, one wishes to test whether the population means for the outcome variable are equivalent.
To generalize to the multivariate context, assume that we have measurements for each subject on $p$ outcome variables. In other words, instead of just a single value, we now have a set of $p$ values for each subject. Using matrix notation, $\mathbf{Y}_{ij}$ represents the vector (i.e., row) of $p$ outcome measurements for the $i$th subject in the $j$th group, that is, $\mathbf{Y}_{ij} = [Y_{ij1} \ldots Y_{ijp}]$. For example, $\mathbf{Y}_{ij}$ may represent the values for a series of measures of physical function or attitudes towards a healthcare intervention. It is assumed that $\mathbf{Y}_{ij}$ follows a multivariate normal distribution with mean $\boldsymbol{\mu}_j$ and variance-covariance matrix $\boldsymbol{\Sigma}_j$ (i.e., $\mathbf{Y}_{ij} \sim N[\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j]$). The vector $\boldsymbol{\mu}_j$ contains the mean scores on each outcome variable, that is, $\boldsymbol{\mu}_j = [\mu_{j1} \; \mu_{j2} \ldots \mu_{jp}]$. The variance-covariance matrix $\boldsymbol{\Sigma}_j$ is a $p \times p$ matrix with the variances for each outcome variable on the diagonal and the covariances for all pairs of outcome variables on the off diagonal:
\[
\boldsymbol{\Sigma}_j = \begin{bmatrix} \sigma_{j1}^2 & \sigma_{j12} & \cdots & \sigma_{j1p} \\ \vdots & & \ddots & \vdots \\ \sigma_{jp1} & \cdots & & \sigma_{jp}^2 \end{bmatrix}
\]
The null hypothesis
\[
H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 \tag{2}
\]
is used to test whether the means for the set of $p$ outcome variables are equal across the two groups.
As Knapp and Miller (1983) observe, adopting a test of multivariate (i.e., joint) equivalence is preferable to adopting multiple tests of univariate equivalence, particularly when the outcome variables are correlated. Type I errors, which are erroneous conclusions about true null hypotheses, may occur when multiple univariate tests are performed. If the outcome variables are independent, the probability that at least one erroneous decision will be made on the set of $p$ outcome variables is $1 - (1 - \alpha)^p$, which is approximately equal to $\alpha p$ when $\alpha$ is small, where $\alpha$ is the nominal level of significance. For example, with $\alpha = .05$ and $p = 3$, the probability of making at least one erroneous decision is $\approx .14$.
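The familywise error rate quoted above is easy to verify. The following short Python sketch is illustrative only (the function name `familywise_error` is not from the paper); it evaluates both the exact expression and the small-$\alpha$ approximation.

```python
def familywise_error(alpha: float, p: int) -> float:
    """Probability of at least one Type I error across p independent tests."""
    return 1.0 - (1.0 - alpha) ** p

alpha, p = 0.05, 3
exact = familywise_error(alpha, p)   # 1 - (1 - alpha)^p
approx = alpha * p                   # first-order approximation for small alpha
print(f"exact = {exact:.3f}, approximation = {approx:.2f}")
```

The exact value rounds to .14, while the approximation $\alpha p$ gives .15; the two diverge further as $p$ or $\alpha$ grows.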
To illustrate the multivariate concepts that have been presented to this point, we will use the example data set of Table 1. These data are for two groups of subjects and two outcome variables. Let $n_j$ represent the sample size for the $j$th group. The example data are for an unbalanced design (i.e., unequal group sizes), where $n_1 = 6$ and $n_2 = 8$. The vector of scores for the first subject of group 1 is $\mathbf{Y}_{11} = [1 \;\; 51]$, the vector for the second subject of group 1 is $\mathbf{Y}_{21} = [28 \;\; 48]$, and so on.
Let $\bar{\mathbf{Y}}_j$ and $\mathbf{S}_j$ represent the sample mean vector and sample covariance matrix for the $j$th group. Table 2 contains these summary statistics for the example data set. The mean scores for the first outcome variable are 13.0 and 19.4 for groups 1 and 2, respectively. For the second outcome variable, the corresponding means are 49.7 and 47.8. The variances for groups 1 and 2 on the first outcome measure are 74.8 and 2.3, respectively. The larger variance for the first group is primarily due to the presence of two extreme values of 1 and 28 for the first and second subjects, respectively. The corresponding variances on the second outcome variable are 3.9 and 3.6. For group 1, the covariance for the two outcome variables is $-7.0$. The population correlation for two variables $q$ and $q'$ (i.e., $\rho_{qq'}$; $q, q' = 1, \ldots, p$) can be obtained from the covariance and the variances,
\[
\rho_{qq'} = \frac{\sigma_{qq'}}{\sqrt{\sigma_q^2\,\sigma_{q'}^2}}
\]
where $\sigma_{qq'}$ is the covariance and $\sigma_q^2$ is the variance for the $q$th outcome variable. The sample correlation coefficient, $r_{qq'}$, is used to estimate the population correlation coefficient. In the example data set of Table 1, a moderate negative correlation of $r_{qq'} = -.4$ exists for the two variables for group 1.
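The summary statistics for the example can be reproduced directly from the Table 1 data. The following is a minimal Python sketch using only the standard library; the function names are illustrative and are not taken from the authors' program.

```python
from math import sqrt

# Table 1 data: one (Y1, Y2) pair per subject
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_vector(rows):
    """Sample mean of each outcome variable."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance_matrix(rows):
    """Unbiased sample covariance matrix (divisor n - 1)."""
    m, n = mean_vector(rows), len(rows)
    p = len(m)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def correlation(S, q, r):
    """Correlation of variables q and r from a covariance matrix S."""
    return S[q][r] / sqrt(S[q][q] * S[r][r])

mean1, mean2 = mean_vector(group1), mean_vector(group2)
S1, S2 = covariance_matrix(group1), covariance_matrix(group2)
r12 = correlation(S1, 0, 1)

print("means:", [round(v, 2) for v in mean1], [round(v, 2) for v in mean2])
print("S1:", [[round(v, 2) for v in row] for row in S1])
print("S2:", [[round(v, 2) for v in row] for row in S2])
print("group 1 correlation:", round(r12, 2))
```

The unrounded second mean for group 2 is 47.75, so it appears as 47.7 or 47.8 depending on the rounding rule applied.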
Tests for Mean Equality when Assumptions are Satisfied
Student's $t$ statistic is the conventional procedure for testing the null hypothesis of equality of population means in a univariate design (i.e., equation 1). The test statistic, which assumes equality of population variances, is
\[
t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \tag{3}
\]
where $\bar{Y}_j$ represents the mean for the $j$th group and $s^2$, the variance that is pooled for the two groups, is
\[
s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \tag{4}
\]
where $s_j^2$ is the variance for the $j$th group.
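The multivariate Hotelling's (1931) $T^2$ statistic is formed from equation 3 by replacing means with mean vectors and the pooled variance with the pooled covariance matrix,
\[
T_1 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}} \left[\mathbf{S}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right]^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \tag{5}
\]
where $^{\mathrm{T}}$ is the transpose operator, which is used to convert the row vector $(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$ to a column vector, $^{-1}$ denotes the inverse of a matrix, and $\mathbf{S}$ is the pooled sample covariance matrix
\[
\mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2} \tag{6}
\]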
This test statistic is easily obtained from standard software packages such as SAS (SAS Institute, 1999b). Statistical significance of this $T_1$ statistic is evaluated using the $T^2$ distribution. The test statistic can also be converted to an $F$ statistic,
\[
F_T = \frac{N - p - 1}{p(N - 2)}\,T_1 \tag{7}
\]
where $N = n_1 + n_2$. Statistical significance is then assessed by comparing the $F_T$ statistic to its critical value, $F[p, N - p - 1]$, that is, a critical value from the $F$ distribution with $p$ and $N - p - 1$ degrees of freedom (df).
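Equations 5 through 7 can be traced for the Table 1 data with a short Python sketch. This is an illustration written with the standard library only, not the authors' SAS/IML program; `solve` is a hypothetical helper that evaluates the quadratic form without forming an explicit inverse. The results agree with the least-squares entries of Table 3 ($T_1$ of about 6.1 and $F_T$ of about 2.8).

```python
# Table 1 data: one (Y1, Y2) pair per subject
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance_matrix(rows):
    m, n = mean_vector(rows), len(rows)
    p = len(m)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    a = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (a[i][n] - sum(a[i][c] * x[c] for c in range(i + 1, n))) / a[i][i]
    return x

n1, n2 = len(group1), len(group2)
N, p = n1 + n2, 2
S1, S2 = covariance_matrix(group1), covariance_matrix(group2)

# Equation 6: pooled covariance matrix
S = [[((n1 - 1) * S1[a][b] + (n2 - 1) * S2[a][b]) / (N - 2)
      for b in range(p)] for a in range(p)]

# Equation 5: d' [S (1/n1 + 1/n2)]^{-1} d
d = [x - y for x, y in zip(mean_vector(group1), mean_vector(group2))]
scaled = [[S[a][b] * (1 / n1 + 1 / n2) for b in range(p)] for a in range(p)]
T1 = sum(di * xi for di, xi in zip(d, solve(scaled, d)))

# Equation 7: F conversion with p and N - p - 1 df
F_T = (N - p - 1) / (p * (N - 2)) * T1
print(f"T1 = {T1:.1f}, F_T = {F_T:.1f}, df = ({p}, {N - p - 1})")
```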
When the data are sampled from populations that follow a normal distribution but have unequal covariance matrices (i.e., $\boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2$), Hotelling's (1931) $T^2$ will generally maintain the rate of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to $\alpha$ if the design is balanced (i.e., $n_1 = n_2$; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins & Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will result, depending on the nature of the relationship between the covariance matrices and group sizes. A liberal result is one in which the actual Type I error rate will exceed $\alpha$. Liberal results are problematic; researchers will be filling the literature with false positives (i.e., saying there are treatment effects when none are present). A conservative result, on the other hand, is one in which the Type I error rate will be less than $\alpha$. Conservative results are also a cause for concern because they may result in test procedures that have low statistical power to detect true differences in population means (i.e., real effects will be undetected).
If the group with the largest sample size also exhibits the smallest element values of $\boldsymbol{\Sigma}_j$, which is known as a negative pairing condition, the test will be liberal. For example, Hopkins and Clay (1963) showed that when group sizes were 10 and 20 and the ratio of the largest to the smallest standard deviations of the groups was 1.6, the true rate of Type I errors for $\alpha = .05$ was .11. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate was .21, more than four times the nominal level of significance. For positive pairings of group sizes and covariance matrices, such that the group with the largest sample size also exhibits the largest element values of $\boldsymbol{\Sigma}_j$, the $T^2$ procedure tends to produce a conservative test. In fact, the error rate may be substantially below the nominal level of significance. For example, Hopkins and Clay observed Type I error rates of .02 and .01, respectively, for $\alpha = .05$ for positive pairings for the two standard deviation ratios noted previously. These liberal and conservative results for normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986) for both moderate and large degrees of covariance heterogeneity.
When the assumption of multivariate normality is violated, the performance of Hotelling's (1931) $T^2$ test depends on both the degree of departure from a multivariate normal distribution and the nature of the research design. The earliest research on Hotelling's $T^2$ when the data are non-normal suggested that tests of the null hypothesis for two groups were relatively insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true when the data are only moderately non-normal. However, Everitt (1979) showed that this test procedure can become quite conservative when the distribution is skewed or when outliers are present in the tails of the distribution, particularly when the design is unbalanced (see also Zwick, 1986).
Tests for Mean Equality when Assumptions are not Satisfied
Both parametric and nonparametric alternatives to the $T^2$ test have been proposed in the literature. Applied researchers often regard nonparametric procedures as appealing alternatives because they rely on rank scores, which are typically perceived as being easy to conceptualize and interpret. However, these procedures test hypotheses about equality of distributions rather than equality of means. They are therefore sensitive to covariance heterogeneity because distributions with unequal variances will necessarily result in rejection of the null hypothesis. Zwick (1986) showed that nonparametric alternatives to Hotelling's (1931) $T^2$ could control the Type I error rate when the data were sampled from non-normal distributions and covariances were equal. Not surprisingly, when covariances were unequal, these procedures produced biased results, particularly when group sizes were unequal.
Covariance heterogeneity. There are several parametric alternatives to Hotelling's (1931) $T^2$ test. These include the Brown-Forsythe (Brown & Forsythe, 1974 [BF]), James (1954) first and second order (J1 & J2), Johansen (1980 [J]), Kim (1992 [K]), Nel and Van der Merwe (1986 [NV]), and Yao (1965 [Y]) procedures. The BF, J1, J2, and J procedures have also been generalized to multivariate designs containing more than two groups of subjects.
The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic,
\[
T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}} \left(\frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2}\right)^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \tag{8}
\]
The J, NV, and Y procedures approximate the distribution of $T_2$ differently because the df for these three procedures are computed using different formulas. The J1 and J2 procedures each use a different critical value to assess statistical significance. However, they both rely on large-sample theory regarding the distribution of the test statistic in equation 8. What this means is that when sample sizes are sufficiently large, the $T_2$ statistic approximately follows a chi-squared ($\chi^2$) distribution. For both procedures, this test statistic is referred to an "adjusted" $\chi^2$ critical value. If the test statistic exceeds that critical value, the null hypothesis of equation 2 is rejected. The critical value for the J1 procedure is slightly smaller than the one for the J2 procedure. As a result, the J1 procedure generally produces larger Type I error rates than the J2 procedure and therefore is not often recommended (de la Rey & Nel, 1993). While the J2 procedure may offer better Type I error control, it is computationally complex. The critical value for J2 is described in the Appendix, along with the $F$-statistic conversions and df computations for the J, NV, and Y procedures.
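To make the contrast with the pooled statistic concrete, the statistic of equation 8 can be evaluated for the Table 1 data. The following is a minimal Python sketch (standard library only; helper names are illustrative, and this is not the authors' SAS/IML code). Each covariance matrix is divided by its own group size before the two are summed, and the result is close to the value of 5.0 shown for $T_2$ in Table 3.

```python
# Table 1 data: one (Y1, Y2) pair per subject
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance_matrix(rows):
    m, n = mean_vector(rows), len(rows)
    p = len(m)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    a = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (a[i][n] - sum(a[i][c] * x[c] for c in range(i + 1, n))) / a[i][i]
    return x

def unpooled_t2(g1, g2):
    """Equation 8: d' (S1/n1 + S2/n2)^{-1} d, with no pooling of covariances."""
    n1, n2 = len(g1), len(g2)
    S1, S2 = covariance_matrix(g1), covariance_matrix(g2)
    p = len(S1)
    A = [[S1[a][b] / n1 + S2[a][b] / n2 for b in range(p)] for a in range(p)]
    d = [x - y for x, y in zip(mean_vector(g1), mean_vector(g2))]
    return sum(di * xi for di, xi in zip(d, solve(A, d)))

T2 = unpooled_t2(group1, group2)
print(f"T2 = {T2:.2f}")
```

For these data the unpooled statistic is smaller than the pooled $T_1$ because the smaller group has by far the larger variance on the first outcome, so its covariance matrix receives more weight in equation 8 than in equation 6.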
The K procedure is based on an $F$ statistic. It is more complex than preceding test procedures because eigenvalues and eigenvectors of the group covariance matrices must be computed (see Footnote 1). For completeness, the formula used to compute the K test statistic and its df are found in the Appendix.
The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly from the one presented in equation 8,
\[
T_{BF} = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathrm{T}} \left[\left(1 - \frac{n_1}{N}\right)\mathbf{S}_1 + \left(1 - \frac{n_2}{N}\right)\mathbf{S}_2\right]^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \tag{9}
\]
As one can see, the test statistic in equation 9 weights the group covariance matrices in a different way than the test statistic in equation 8. Again, for completeness, the numeric solutions for the BF $F$ statistic and df are found in the Appendix.
Among the BF, J2, J, K, NV, and Y tests, there appears to be no one best choice in all data-analytic situations when the data are normally distributed, although a comprehensive comparison of all of these procedures has not yet been conducted. Factors such as the degree of covariance heterogeneity, total sample size, the degree of imbalance of the group sizes, and the relationship between the group sizes and covariance matrices will determine which procedure will afford the best Type I error control and maximum statistical power to detect group differences. Christensen and Rencher (1997) noted, in their extensive comparison among the J2, J, K, NV, and Y procedures, that the J and Y procedures could occasionally result in inflated Type I error rates for negative pairings of group sizes and covariance matrices. These liberal tendencies were exacerbated as the number of outcome variables increased. The authors recommended the K procedure overall, observing that it offered the greatest statistical power among those procedures that never produced inflated Type I error rates. However, the authors report a number of situations of covariance heterogeneity in which the K procedure could become quite conservative. Type I error rates as low as .02 for $\alpha = .05$ were reported when $p = 10$, $n_1 = 30$, and $n_2 = 20$.
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that the J1, J2, J, and Y procedures could not control Type I error rates when the underlying population distributions were highly skewed. For a lognormal distribution, which has skewness of 6.18 (see Footnote 2), they observed many instances in which empirical Type I error rates of all of these procedures were more than four times the nominal level of significance. Wilcox (1995) found that the J test produced excessive Type I errors when sample sizes were small (i.e., $n_1 = 12$ and $n_2 = 18$) and the data were generated from non-normal distributions; the K procedure became conservative when the skewness was 6.18. For larger group sizes (i.e., $n_1 = 24$ and $n_2 = 36$), the J procedure provided acceptable control of Type I errors when the data were only moderately non-normal, but for the maximum skewness considered, it also became conservative. Fouladi and Yockey (2002) found that the degree of departure from a multivariate normal distribution was a less important predictor of Type I error performance than sample size. Across the range of conditions which they examined, the Y test produced the greatest average Type I error rates and the NV procedure the smallest. Error rates were only slightly influenced by the degree of skewness or kurtosis of the data; however, these authors looked at only very modest departures from a normal distribution. The maximum degree of skewness considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing effects of non-normality may be obtained by adopting estimators of location and scale that are insensitive to the presence of extreme scores and/or a skewed distribution (Keselman, Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that have been proposed in the literature; among these, the trimmed mean has received a great deal of attention because of its good theoretical properties, ease of computation, and ease of interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the most extreme scores in the distribution. Hence, one removes the effects of the most extreme scores, which have the tendency to "shift" the mean in their direction.
One should recognize at the outset that while robust estimators are insensitive to departures from a normal distribution, they test a different null hypothesis than least-squares estimators. The null hypothesis is about equality of trimmed population means. In other words, one is testing a hypothesis that focuses on the bulk of the population rather than the entire population. Thus, if one subscribes to the position that inferences pertaining to robust parameters are more valid than inferences pertaining to the usual least-squares parameters, then procedures based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \cdots \le Y_{(n_j)j}$ represent the ordered observations for the $j$th group on a single outcome variable. In other words, one begins by ordering the observations for each group from smallest to largest. Then, let $g_j = [\gamma n_j]$, where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the distribution and $[x]$ is the greatest integer $\le x$. The effective sample size for the $j$th group is defined as $h_j = n_j - 2g_j$. The sample trimmed mean,
\[
\bar{Y}_{tj} = \frac{1}{h_j} \sum_{i = g_j + 1}^{n_j - g_j} Y_{(i)j} \tag{10}
\]
is computed by censoring the $g_j$ smallest and the $g_j$ largest observations. The most extreme scores for each group of subjects are trimmed independently of the extreme scores for all other groups. A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent trimming is generally recommended (Wilcox, 1995a).
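Equation 10 translates into only a few lines of code. The following is a minimal Python sketch, offered as an illustration rather than as the authors' implementation; the function name `trimmed_mean` and the small tolerance added before taking the integer part (to guard against floating-point products such as 0.2 × 6 landing just below an integer) are choices made here.

```python
from math import floor

def trimmed_mean(values, gamma):
    """Return (trimmed mean, g, h) for trimming proportion gamma per tail."""
    n = len(values)
    g = floor(gamma * n + 1e-9)      # g = [gamma * n], greatest integer <= gamma * n
    h = n - 2 * g                    # effective sample size
    ordered = sorted(values)
    kept = ordered[g:n - g]          # drop the g smallest and g largest scores
    return sum(kept) / h, g, h

# First outcome variable, group 1 of Table 1, with 20 percent trimming
y_t, g, h = trimmed_mean([1, 28, 12, 13, 13, 11], 0.20)
print(y_t, g, h)
```

With $\gamma = 0$ the function returns the ordinary mean, since no observations are removed.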
The Winsorized variance is the theoretically correct measure of scale that corresponds to the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first computed,
\[
\bar{Y}_{wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Z_{ij} \tag{11}
\]
where
\[
Z_{ij} = \begin{cases} Y_{(g_j + 1)j} & \text{if } Y_{ij} \le Y_{(g_j + 1)j} \\ Y_{ij} & \text{if } Y_{(g_j + 1)j} < Y_{ij} < Y_{(n_j - g_j)j} \\ Y_{(n_j - g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j - g_j)j} \end{cases}
\]
The Winsorized mean is obtained by replacing the $g_j$ smallest values with the next most extreme value and the $g_j$ largest values with the next most extreme value. The Winsorized variance for the $j$th group on a single outcome variable, $s_{wj}^2$, is
\[
s_{wj}^2 = \frac{\sum_{i=1}^{n_j} \left(Z_{ij} - \bar{Y}_{wj}\right)^2}{n_j - 1} \tag{12}
\]
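and the standard error of the trimmed mean is $\sqrt{(n_j - 1)s_{wj}^2 / [h_j(h_j - 1)]}$. The Winsorized covariance for the outcome variables $q$ and $q'$ ($q, q' = 1, \ldots, p$) is
\[
s_{wjqq'} = \frac{\sum_{i=1}^{n_j} \left(Z_{ijq} - \bar{Y}_{wjq}\right)\left(Z_{ijq'} - \bar{Y}_{wjq'}\right)}{n_j - 1} \tag{13}
\]
and the Winsorized covariance matrix for the $j$th group is
\[
\mathbf{S}_{wj} = \begin{bmatrix} s_{wj1}^2 & s_{wj12} & \cdots & s_{wj1p} \\ \vdots & & \ddots & \vdots \\ s_{wjp1} & \cdots & & s_{wjp}^2 \end{bmatrix}
\]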
To illustrate, we return to the data set of Table 1. For the first outcome variable for the first group, the ordered observations are
1 11 12 13 13 28
and with 20% trimming, $g_1 = [6 \times .20] = 1$. The scores of 1 and 28 are removed and the mean of the remaining scores is computed, which produces $\bar{Y}_{t1} = 12.2$. Table 2 contains the vectors of trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and smallest values in the set of ordered observations are replaced by the next most extreme scores, producing the following set of ordered observations:
11 11 12 13 13 13
The Winsorized mean, $\bar{Y}_{w1}$, is 12.2. While in this example the Winsorized mean has the same value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result. Table 2 contains the Winsorized covariance matrices for the two groups.
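The Winsorizing step and the resulting covariance matrix can be sketched as follows. This Python illustration uses only the standard library and hypothetical function names; it Winsorizes each outcome variable separately within a group and uses the divisor $n_j - 1$, as in equations 12 and 13. For group 1 it returns values that round to the Winsorized covariance matrix in Table 2.

```python
from math import floor

def winsorize(values, gamma):
    """Replace the g smallest and g largest scores by the next most extreme score."""
    n = len(values)
    g = floor(gamma * n + 1e-9)
    ordered = sorted(values)
    lo, hi = ordered[g], ordered[n - g - 1]   # Y_(g+1) and Y_(n-g)
    return [min(max(v, lo), hi) for v in values]

def winsorized_cov(rows, gamma):
    """Winsorized covariance matrix: each variable Winsorized, divisor n - 1."""
    n = len(rows)
    cols = [winsorize(list(col), gamma) for col in zip(*rows)]
    means = [sum(c) / n for c in cols]
    p = len(cols)
    return [[sum((cols[a][i] - means[a]) * (cols[b][i] - means[b])
                 for i in range(n)) / (n - 1)
             for b in range(p)] for a in range(p)]

# Group 1 of Table 1
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]

z = winsorize([r[0] for r in group1], 0.20)
w_mean = sum(z) / len(z)
Sw1 = winsorized_cov(group1, 0.20)

print(sorted(z), round(w_mean, 2))
print([[round(v, 2) for v in row] for row in Sw1])
```

Because the original subject order is preserved, the Winsorized scores on different outcome variables stay paired by subject, which is what the covariance in equation 13 requires.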
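A test which is robust to the biasing effects of both multivariate non-normality and covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test procedures and substituting the trimmed means and the Winsorized covariance matrix for the least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust estimators, the $T_1$ statistic of equation 5 becomes
\[
T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^{\mathrm{T}} \left[\mathbf{S}_w\left(\frac{1}{h_1} + \frac{1}{h_2}\right)\right]^{-1} (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2}) \tag{14}
\]
where
\[
\mathbf{S}_w = \frac{(n_1 - 1)\mathbf{S}_{w1} + (n_2 - 1)\mathbf{S}_{w2}}{(h_1 - 1) + (h_2 - 1)} \tag{15}
\]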
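Equations 14 and 15 can be evaluated for the Table 1 data with 20 percent trimming. The following Python sketch is illustrative only and is not the authors' SAS/IML program. The $F$ conversion at the end replaces $N$ in equation 7 by $h_1 + h_2$; that substitution is an assumption made here, chosen because it yields the denominator df of 7 shown for the robust $T^2$ row of Table 3. The output agrees with that row ($T_1$ of about 59.0 and $F_T$ of about 25.8).

```python
from math import floor

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def trim_count(n, gamma):
    return floor(gamma * n + 1e-9)

def trimmed_means(rows, gamma):
    n = len(rows)
    g = trim_count(n, gamma)
    return [sum(sorted(col)[g:n - g]) / (n - 2 * g) for col in zip(*rows)]

def winsorized_cov(rows, gamma):
    n = len(rows)
    g = trim_count(n, gamma)
    cols = []
    for col in zip(*rows):
        s = sorted(col)
        lo, hi = s[g], s[n - g - 1]
        cols.append([min(max(v, lo), hi) for v in col])
    means = [sum(c) / n for c in cols]
    p = len(cols)
    return [[sum((cols[a][i] - means[a]) * (cols[b][i] - means[b])
                 for i in range(n)) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    a = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (a[i][n] - sum(a[i][c] * x[c] for c in range(i + 1, n))) / a[i][i]
    return x

gamma, p = 0.20, 2
n1, n2 = len(group1), len(group2)
h1 = n1 - 2 * trim_count(n1, gamma)
h2 = n2 - 2 * trim_count(n2, gamma)
Sw1, Sw2 = winsorized_cov(group1, gamma), winsorized_cov(group2, gamma)

# Equation 15: pooled Winsorized covariance matrix
Sw = [[((n1 - 1) * Sw1[a][b] + (n2 - 1) * Sw2[a][b]) / ((h1 - 1) + (h2 - 1))
       for b in range(p)] for a in range(p)]

# Equation 14: robust analogue of T1
d = [x - y for x, y in zip(trimmed_means(group1, gamma),
                           trimmed_means(group2, gamma))]
scaled = [[Sw[a][b] * (1 / h1 + 1 / h2) for b in range(p)] for a in range(p)]
T_t1 = sum(di * xi for di, xi in zip(d, solve(scaled, d)))

# Assumed F conversion: equation 7 with N replaced by h1 + h2
H = h1 + h2
F_t = (H - p - 1) / (p * (H - 2)) * T_t1
print(f"T_t1 = {T_t1:.1f}, F = {F_t:.1f}, df = ({p}, {H - p - 1})")
```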
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized covariances were substituted for the usual estimators when the data followed a multivariate non-normal distribution. The Type I error performance of the J procedure with robust estimators was similar to that of the K procedure when sample sizes were sufficiently large (i.e., $n_1 = 24$ and $n_2 = 36$). More importantly, however, there was a dramatic improvement in power when the test procedures with robust estimators were compared to their least-squares counterparts; this was observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The differences in power were as great as 60 percentage points, which represents a substantial difference in the ability to detect outcome effects.
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results using least-squares and robust estimators with the test procedures enumerated previously, that is, the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to run this program. This program can be used with either the PC or UNIX versions of SAS; it was generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website, http://home.cc.umanitoba.ca/~lixlm.
In order to run the program, the data set, group sizes, proportion of trimming, and nominal level of significance, $\alpha$, must be input. It is assumed that the data set is complete, so that there are no missing values for any of the subjects on the outcome variables. The program generates as output the summary statistics for each group (i.e., means and covariance matrices). For each test procedure, the relevant $T$ and/or $F$ statistics are produced, along with the numerator ($\nu_1$) and denominator ($\nu_2$) df for the $F$ statistic and either a p-value or critical value. These results can be produced for both least-squares estimators and robust estimators with separate calls to the program.
To produce results for the example data of Table 1 with least-squares estimators, the following data input lines are required:
Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of measurements for each subject and braces enclose the data set. The next line of code specifies the group sizes. Again, braces enclose the element values. No comma is required to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a symmetric trimming approach is automatically assumed in the program; trimming proportions for the right and left tails are not specified. The RUN T2MULT code invokes the program and generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for the example data set. For comparative purposes, the program produces the results for Hotelling's (1931) $T^2$. We do not recommend, however, that the results for this procedure be reported. The output for least-squares estimators is provided first. A second invocation of the program, with PTRIM=.20, is required to produce the results for robust estimators. As noted previously, the program will output a $T$ statistic and/or an $F$ statistic, along with the df and p-value or critical value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures fail to reject the null hypothesis of equality of multivariate means. One would conclude that there is no difference between the two groups on the multivariate means. However, when robust estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality of multivariate trimmed means, leading to the conclusion that the two groups do differ on the multivariate means. These results demonstrate the influence that a small number of extreme observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Hunter and Schmidt (1995) argue against the use of tests of statistical significance, their observation that "methods of data analysis used in research have a major effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances in data-analytic techniques for multivariate data are unknown to the majority of applied health researchers. Traditional procedures for testing multivariate hypotheses of mean equality make specific assumptions concerning the data distribution and the group variances and covariances. Valid tests of hypotheses of healthcare intervention effects are obtained only when the assumptions underlying tests of statistical significance are satisfied. If these assumptions are not satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be made.
In this article, we have reviewed the shortcomings of Hotelling's (1931) $T^2$ test and described a number of procedures that are insensitive to the assumption of equality of population covariance matrices for multivariate data. Substituting robust estimators for the usual least-squares estimators will result in test procedures that are insensitive to both covariance heterogeneity and multivariate non-normality. Robust estimators are measures of location and scale less influenced by the presence of extreme scores in the tails of a distribution. Robust estimators based on the concepts of trimming and Winsorizing result in the most extreme scores either being removed or replaced by less extreme scores. To facilitate the adoption of the robust test procedures by applied researchers, we have presented a computer program that can be used to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen (1980), Kim (1992), Nel and Van der Merwe (1986), and Yao (1965) procedures with robust estimators will depend on the characteristics of the data, such as the number of dependent variables, the nature of the relationship between group sizes and covariance matrices, and the degree of inequality of population covariance matrices. Current knowledge suggests that the Kim (1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in liberal or conservative tests under many data-analytic conditions and provides good statistical power to detect between-group differences on multiple outcome variables. Further research is needed, however, to provide more specific recommendations regarding the performance of these six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been described in this paper can be generalized to the case of more than two independent groups (see, e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics - Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T² tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics - Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 86, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 21, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of statistics, Vol. 1 (pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics - Simulation and Computation, 26,
1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics - Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size. Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate
Behavioral Research, 21, 169-186.
Appendix
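Numeric Formulas for Alternatives to Hotelling's (1931) T² Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then
$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF} \tag{A1}$$
where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and
$$f_2 = \frac{[\operatorname{tr}(\mathbf{G}_1)]^2 + \operatorname{tr}(\mathbf{G}_1^2)}{\dfrac{1}{n_1 - 1}\left\{[\operatorname{tr}(\bar{w}_1\mathbf{S}_1)]^2 + \operatorname{tr}[(\bar{w}_1\mathbf{S}_1)^2]\right\} + \dfrac{1}{n_2 - 1}\left\{[\operatorname{tr}(\bar{w}_2\mathbf{S}_2)]^2 + \operatorname{tr}[(\bar{w}_2\mathbf{S}_2)^2]\right\}} \tag{A2}$$
In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1\mathbf{S}_1 + \bar{w}_2\mathbf{S}_2$. The test statistic $F_{BF}$ is
compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where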
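[Equation A3, which gives $\nu_{BF1}$ as a ratio of trace terms in $\mathbf{G}_1$, $\mathbf{G}_2$, $\mathbf{S}_1$, and $\mathbf{S}_2$, is not
legible in this transcript; see Mehrotra (1997) and Vallejo, Fidalgo, & Fernandez (2001).] (A3)
and $\mathbf{G}_2 = w_1\mathbf{S}_1 + w_2\mathbf{S}_2$.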
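James (1954) Second Order
The test statistic $T_2$ of equation 8 is compared to the critical value $\chi^2_p(A + \chi^2_p B) + q$,
where $\chi^2_p$ is the $1 - \alpha$ percentile point of the $\chi^2$ distribution with p df,
$$A = 1 + \frac{1}{2p}\left\{\frac{1}{n_1 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1)\right]^2 + \frac{1}{n_2 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2)\right]^2\right\} \tag{A4}$$
$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and
$$B = \frac{1}{p(p + 2)}\left\{\frac{1}{n_1 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \frac{1}{2}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1)\right]^2\right] + \frac{1}{n_2 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \frac{1}{2}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2)\right]^2\right]\right\} \tag{A5}$$
The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).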
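Johansen (1980)
Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 1)$ and
$$C = \frac{1}{2}\sum_{j=1}^{2}\frac{1}{n_j - 1}\left\{\operatorname{tr}\left[(\mathbf{A}^{-1}\mathbf{A}_j)^2\right] + \left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2\right\} \tag{A6}$$
The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.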
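As a rough check on equation A6, the following Python sketch computes C and the denominator df for the least-squares estimators of the Table 1 data. It is an illustration only, not the authors' SAS/IML module; the helper names (`cov2`, `inv2`, `matmul2`, `trace2`) are hypothetical, and the code assumes p = 2 so that the matrix inverse can be written out by hand. Under these assumptions the df comes out near the value of 6.9 reported for the J procedure in Table 3.

```python
# Table 1 data: one (Y1, Y2) pair per subject.
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def cov2(rows):
    """Unbiased 2 x 2 sample covariance matrix."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(2)] for a in range(2)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matmul2(x, y):
    return [[sum(x[a][k] * y[k][b] for k in range(2)) for b in range(2)] for a in range(2)]

def trace2(m):
    return m[0][0] + m[1][1]

p = 2
groups = (GROUP1, GROUP2)
# A_j = S_j / n_j and A = A_1 + A_2
a_j = [[[v / len(g) for v in row] for row in cov2(g)] for g in groups]
a = [[a_j[0][r][c] + a_j[1][r][c] for c in range(2)] for r in range(2)]
a_inv = inv2(a)

c_const = 0.0
for aj, g in zip(a_j, groups):
    m = matmul2(a_inv, aj)
    c_const += (trace2(matmul2(m, m)) + trace2(m) ** 2) / (len(g) - 1)
c_const *= 0.5                        # equation A6
nu_j = p * (p + 2) / (3 * c_const)    # denominator df for F_J
print(round(c_const, 3), round(nu_j, 1))
```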
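Kim (1992)
The K procedure is based on the test statistic
$$F_K = \frac{\nu_K\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\,\mathbf{V}^{-1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}}{c\,m\,f_1} \tag{A7}$$
where
$$\mathbf{V} = r^2\mathbf{A}_1 + \mathbf{A}_2 + 2r\,\mathbf{A}_2^{1/2}\left(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2}\right)^{1/2}\mathbf{A}_2^{1/2},$$
$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l} \tag{A8}$$
$$m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2} \tag{A9}$$
$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1^{-1}\mathbf{A}_2$, $r = |\mathbf{A}_1^{-1}\mathbf{A}_2|^{1/(2p)}$,
and | | is the
determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$, where $\nu_K = f_1 - p + 1$,
$$\frac{1}{f_1} = \sum_{j=1}^{2}\frac{1}{n_j - 1}\left(\frac{b_j}{T_2}\right)^2 \tag{A10}$$
and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\,\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}$.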
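The constants c and m of equations A8 and A9 depend only on the eigenvalues of $\mathbf{A}_1^{-1}\mathbf{A}_2$. The Python sketch below works them out for the least-squares estimators of the Table 1 data. It is a hedged illustration rather than the authors' program: the helper names are hypothetical, and the eigenvalues are obtained from the trace and determinant, a shortcut that holds only for the two-variable case assumed here. The resulting m is close to the numerator df of 1.5 listed for the K procedure in Table 3.

```python
from math import sqrt

# Table 1 data: one (Y1, Y2) pair per subject.
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def scaled_cov2(rows):
    """A_j = S_j / n_j for a two-variable group."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1) / n
             for b in range(2)] for a in range(2)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matmul2(x, y):
    return [[sum(x[a][k] * y[k][b] for k in range(2)) for b in range(2)] for a in range(2)]

p = 2
a1, a2 = scaled_cov2(GROUP1), scaled_cov2(GROUP2)
prod = matmul2(inv2(a1), a2)                      # A_1^{-1} A_2
tr = prod[0][0] + prod[1][1]
det = prod[0][0] * prod[1][1] - prod[0][1] * prod[1][0]
root = sqrt(tr * tr - 4 * det)
eigenvalues = [(tr + root) / 2, (tr - root) / 2]  # d_l, valid for a 2 x 2 matrix
r = det ** (1 / (2 * p))                          # r = |A_1^{-1} A_2|^{1/(2p)}
h = [(d + 1) / (sqrt(d) + r) ** 2 for d in eigenvalues]
c = sum(x * x for x in h) / sum(h)                # equation A8
m = sum(h) ** 2 / sum(x * x for x in h)           # equation A9
print(round(c, 2), round(m, 2))
```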
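Nel and van der Merwe (1986)
Let
$$F_{NV} = \frac{\nu_N}{p f_2}\, T_2 \tag{A11}$$
where $\nu_N = f_2 - p + 1$ and
$$f_2 = \frac{\operatorname{tr}(\mathbf{A}^2) + [\operatorname{tr}(\mathbf{A})]^2}{\sum_{j=1}^{2}\dfrac{1}{n_j - 1}\left\{\operatorname{tr}(\mathbf{A}_j^2) + [\operatorname{tr}(\mathbf{A}_j)]^2\right\}} \tag{A12}$$
The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.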
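Equations A11 and A12 can be traced numerically with a short Python sketch. The code below is an illustration under stated assumptions, not the authors' SAS/IML module: it uses the least-squares estimators of the Table 1 data, hard-codes p = 2, and introduces hypothetical helper names. The values it produces agree, after rounding, with the NV row of Table 3 ($T_2$ = 5.0, $F_{NV}$ = 2.0, denominator df = 4.4).

```python
# Table 1 data: one (Y1, Y2) pair per subject.
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean2(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(2)]

def scaled_cov2(rows):
    """A_j = S_j / n_j for a two-variable group."""
    n, m = len(rows), mean2(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1) / n
             for b in range(2)] for a in range(2)]

def trace_terms(m):
    """tr(M^2) + [tr(M)]^2 for a 2 x 2 matrix M."""
    tr_of_square = m[0][0] ** 2 + 2 * m[0][1] * m[1][0] + m[1][1] ** 2
    return tr_of_square + (m[0][0] + m[1][1]) ** 2

p = 2
groups = (GROUP1, GROUP2)
a_j = [scaled_cov2(g) for g in groups]
a = [[a_j[0][r][c] + a_j[1][r][c] for c in range(2)] for r in range(2)]

# Equation A12
f2 = trace_terms(a) / sum(trace_terms(aj) / (len(g) - 1) for aj, g in zip(a_j, groups))

# T_2 of equation 8: d A^{-1} d^T, written out for the symmetric 2 x 2 case
m1, m2 = mean2(GROUP1), mean2(GROUP2)
d = [m1[k] - m2[k] for k in range(2)]
det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
t2 = (a[1][1] * d[0] ** 2 - 2 * a[0][1] * d[0] * d[1] + a[0][0] * d[1] ** 2) / det

nu_n = f2 - p + 1
f_nv = nu_n * t2 / (p * f2)           # equation A11
print(round(t2, 2), round(f2, 2), round(nu_n, 1), round(f_nv, 2))
```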
Yao (1965)
The statistic $F_Y$ is referred to the critical value $F[p, \nu_K]$, where
$$F_Y = \frac{\nu_K}{p f_1}\, T_2 \tag{A13}$$
$f_1$ is given by equation A10, and $\nu_K$ again equals $f_1 - p + 1$.
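The Y procedure can be sketched in Python as follows. This is an illustrative calculation, not the authors' program. It assumes p = 2, uses the least-squares estimators of the Table 1 data, and, as an explicit assumption, evaluates the $b_j$ terms with $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$ in the role of the weighting matrix, which is the usual form of Yao's (1965) approximate df. With that choice the results match the Y row of Table 3 after rounding ($F_Y$ = 2.1, denominator df = 6.1).

```python
# Table 1 data: one (Y1, Y2) pair per subject.
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean2(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(2)]

def scaled_cov2(rows):
    """A_j = S_j / n_j for a two-variable group."""
    n, m = len(rows), mean2(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1) / n
             for b in range(2)] for a in range(2)]

def quad(m, v):
    """Quadratic form v M v^T."""
    return sum(v[r] * m[r][c] * v[c] for r in range(2) for c in range(2))

p = 2
groups = (GROUP1, GROUP2)
a_j = [scaled_cov2(g) for g in groups]
a = [[a_j[0][r][c] + a_j[1][r][c] for c in range(2)] for r in range(2)]
det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
a_inv = [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

m1, m2 = mean2(GROUP1), mean2(GROUP2)
d = [m1[k] - m2[k] for k in range(2)]
u = [a_inv[r][0] * d[0] + a_inv[r][1] * d[1] for r in range(2)]   # A^{-1} d^T
t2 = d[0] * u[0] + d[1] * u[1]                                    # equation 8

# Assumed form of b_j: d A^{-1} A_j A^{-1} d^T
b = [quad(aj, u) for aj in a_j]
f1 = 1 / sum((bj / t2) ** 2 / (len(g) - 1) for bj, g in zip(b, groups))
nu_k = f1 - p + 1
f_y = nu_k * t2 / (p * f1)            # equation A13
print(round(f1, 2), round(nu_k, 1), round(f_y, 2))
```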
Footnotes
¹The sum of the eigenvalues of a matrix is called the trace of a matrix.
²The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set

Group   Subject   Yi1   Yi2
1       1           1    51
1       2          28    48
1       3          12    49
1       4          13    51
1       5          13    52
1       6          11    47
2       1          19    46
2       2          18    48
2       3          18    50
2       4          21    50
2       5          19    45
2       6          20    46
2       7          22    48
2       8          18    49
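Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$   $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$   $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$

Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$   $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$   $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$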
Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure   Test Statistic              df                    p-value/Critical value (CV)   Decision re: Null Hypothesis

Least-Squares Estimators
T²          T1 = 6.1, FT = 2.8          ν1 = 2, ν2 = 11       p = .106                      Fail to Reject
BF          TBF = 9.1, FBF = 3.7        ν1 = 4, ν2 = 4.4      p = .116                      Fail to Reject
J2          T2 = 5.0                    ν1 = 2                CV = 14.2                     Fail to Reject
J           T2 = 5.0, FJ = 2.3          ν1 = 2, ν2 = 6.9      p = .175                      Fail to Reject
K           FK = 2.5                    ν1 = 1.5, ν2 = 6.1    p = .164                      Fail to Reject
NV          T2 = 5.0, FNV = 2.0         ν1 = 2, ν2 = 4.4      p = .237                      Fail to Reject
Y           T2 = 5.0, FY = 2.1          ν1 = 2, ν2 = 6.1      p = .198                      Fail to Reject

Robust Estimators
T²          T1 = 59.0, FT = 25.8        ν1 = 2, ν2 = 7        p = .001                      Reject
BF          TBF = 131.2, FBF = 56.2     ν1 = 5, ν2 = 6.0      p = .001                      Reject
J2          T2 = 65.2                   ν1 = 2                CV = 13.3                     Reject
J           T2 = 65.2, FJ = 29.5        ν1 = 2, ν2 = 6.3      p = .001                      Reject
K           FK = 28.1                   ν1 = 2.0, ν2 = 6.6    p = .001                      Reject
NV          T2 = 65.2, FNV = 27.9       ν1 = 2, ν2 = 6.0      p = .001                      Reject
Y           T2 = 65.2, FY = 28.3        ν1 = 2, ν2 = 6.6      p = .001                      Reject

Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965). T2 denotes the statistic of equation 8.
The purpose of this paper is to introduce health evaluation researchers to both the
concepts and the applications of robust test procedures for multivariate data. This paper begins
with an introduction to the statistical notation that will be helpful in understanding the concepts.
This is followed by a discussion of procedures that can be used to test the hypothesis of
multivariate mean equality when statistical assumptions are and are not satisfied. We will then
show how to obtain a test that is robust to the combined effects of covariance heterogeneity and
multivariate non-normality. Throughout this presentation, a numeric example will help to
illustrate the concepts and computations. Finally, we demonstrate a computer program that can
be used to implement the statistical tests described in this paper. Robust procedures are largely
inaccessible to applied researchers because they have not yet been incorporated into extant
statistical software packages. The program that we introduce will be beneficial to evaluation
researchers who want to test hypotheses of mean equality in multivariate designs but are
concerned about whether their data may violate the assumptions which underlie conventional
methods of analysis.
Statistical Notation
Consider the case of a single outcome variable. Let $Y_{ij}$ represent the measurement on that
outcome variable for the ith subject in the jth group ($i = 1, \ldots, n_j$; $j = 1, 2$). Under a normal theory
model, it is assumed that $Y_{ij}$ follows a normal distribution with mean $\mu_j$ and variance $\sigma_j^2$. When
there are only two groups of subjects, the null hypothesis is
$$H_0\colon \mu_1 = \mu_2 \tag{1}$$
In other words, one wishes to test whether the population means for the outcome variable are
equivalent.
To generalize to the multivariate context, assume that we have measurements for each
subject on p outcome variables. In other words, instead of just a single value, we now have a set
of p values for each subject. Using matrix notation, $\mathbf{Y}_{ij}$ represents the vector (i.e., row) of p
outcome measurements for the ith subject in the jth group, that is,
$\mathbf{Y}_{ij} = [Y_{ij1} \ldots Y_{ijp}]$. For example, $\mathbf{Y}_{ij}$ may represent the values for a series of measures of physical
function or attitudes towards a healthcare intervention. It is assumed that $\mathbf{Y}_{ij}$ follows a
multivariate normal distribution with mean $\boldsymbol{\mu}_j$ and variance-covariance matrix $\boldsymbol{\Sigma}_j$ (i.e.,
$\mathbf{Y}_{ij} \sim N[\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j]$). The vector $\boldsymbol{\mu}_j$ contains the mean scores on each outcome variable, that is,
$\boldsymbol{\mu}_j = [\mu_{j1} \; \mu_{j2} \ldots \mu_{jp}]$. The variance-covariance matrix $\boldsymbol{\Sigma}_j$ is a p x p matrix, with the variances for
each outcome variable on the diagonal and the covariances for all pairs of outcome variables on
the off diagonal:
$$\boldsymbol{\Sigma}_j = \begin{bmatrix} \sigma_{j1}^2 & \sigma_{j12} & \cdots & \sigma_{j1p} \\ \vdots & & \ddots & \vdots \\ \sigma_{jp1} & \cdots & & \sigma_{jp}^2 \end{bmatrix}$$
The null hypothesis
$$H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 \tag{2}$$
is used to test whether the means for the set of p outcome variables are equal across the two
groups.
As Knapp and Miller (1983) observe, adopting a test of multivariate (i.e., joint)
equivalence is preferable to adopting multiple tests of univariate equivalence, particularly when
the outcome variables are correlated. Type I errors, which are erroneous conclusions about true
null hypotheses, may occur when multiple univariate tests are performed. If the outcome
variables are independent, the probability that at least one erroneous decision will be made on the
set of p outcome variables is $1 - (1 - \alpha)^p$, which is approximately equal to $\alpha p$ when $\alpha$ is small,
where $\alpha$ is the nominal level of significance. For example, with $\alpha = .05$ and p = 3, the probability
of making at least one erroneous decision is approximately .14.
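The familywise error calculation above is easy to reproduce. The short Python sketch below is offered only as an illustration (the function name `familywise_error` is hypothetical); it compares the exact probability with the $\alpha p$ approximation for the example in the text.

```python
def familywise_error(alpha: float, p: int) -> float:
    """Probability of at least one Type I error across p independent tests."""
    return 1.0 - (1.0 - alpha) ** p

alpha, p = 0.05, 3
exact = familywise_error(alpha, p)
approx = alpha * p   # first-order approximation, adequate when alpha is small
print(f"exact = {exact:.4f}, approximation = {approx:.2f}")
# prints: exact = 0.1426, approximation = 0.15
```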
To illustrate the multivariate concepts that have been presented to this point, we will use
the example data set of Table 1. These data are for two groups of subjects and two outcome
variables. Let $n_j$ represent the sample size for the jth group. The example data are for an
unbalanced design (i.e., unequal group sizes), where $n_1 = 6$ and $n_2 = 8$. The vector of scores for
the first subject of group 1 is $\mathbf{Y}_{11} = [1 \;\; 51]$, the vector for the second subject of group 1 is
$\mathbf{Y}_{21} = [28 \;\; 48]$, and so on.
Let $\bar{\mathbf{Y}}_j$ and $\mathbf{S}_j$ represent the sample mean vector and sample covariance matrix for the jth
group. Table 2 contains these summary statistics for the example data set. The mean scores for
the first outcome variable are 13.0 and 19.4 for groups 1 and 2, respectively. For the second
outcome variable, the corresponding means are 49.7 and 47.8. The variances for groups 1 and 2
on the first outcome measure are 74.8 and 2.3, respectively. The larger variance for the first
group is primarily due to the presence of two extreme values of 1 and 28 for the first and second
subjects, respectively. The corresponding variances on the second outcome variable are 3.9 and
3.6. For group 1, the covariance for the two outcome variables is -7.0. The population
correlation for two variables q and q' (i.e., $\rho_{qq'}$; q, q' = 1, ..., p) can be obtained from the
covariance and the variances:
$$\rho_{qq'} = \frac{\sigma_{qq'}}{\sqrt{\sigma_q^2\,\sigma_{q'}^2}}$$
where $\sigma_{qq'}$ is the covariance and $\sigma_q^2$ is the variance for the qth outcome variable. The sample
correlation coefficient $r_{qq'}$ is used to estimate the population correlation coefficient. In the
example data set of Table 1, a moderate negative correlation of $r_{qq'} = -.4$ exists for the two
variables for group 1.
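The group 1 summary statistics can be recomputed directly from Table 1. The following Python sketch is a minimal illustration using only the standard library; the function names are hypothetical and are not part of the authors' program.

```python
from math import sqrt

# Group 1 scores from Table 1, one (Y1, Y2) pair per subject.
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]

def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def covariance_matrix(rows):
    """Unbiased sample covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

ybar1 = mean_vector(group1)
s1 = covariance_matrix(group1)
# Sample correlation: covariance divided by the root of the product of variances.
r12 = s1[0][1] / sqrt(s1[0][0] * s1[1][1])
print([round(v, 1) for v in ybar1])
print([[round(v, 1) for v in row] for row in s1])
print(round(r12, 2))
```

The printed mean vector, covariance matrix, and correlation should agree with the group 1 entries quoted in the text (13.0 and 49.7; 74.8, -7.0, and 3.9; roughly -.4).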
Tests for Mean Equality when Assumptions are Satisfied
Student's t statistic is the conventional procedure for testing the null hypothesis of
equality of population means in a univariate design (i.e., equation 1). The test statistic, which
assumes equality of population variances, is
$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \tag{3}$$
where $\bar{Y}_j$ represents the mean for the jth group and $s^2$, the variance that is pooled for the two
groups, is
$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \tag{4}$$
where $s_j^2$ is the variance for the jth group.
The multivariate Hotelling's (1931) $T^2$ statistic is formed from equation 3 by replacing
means with mean vectors and the pooled variance with the pooled covariance matrix:
$$T_1 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}\right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}} \tag{5}$$
where T is the transpose operator, which is used to convert the row vector $(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$ to a
column vector, -1 denotes the inverse of a matrix, and $\mathbf{S}$ is the pooled sample covariance matrix:
$$\mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2} \tag{6}$$
This test statistic is easily obtained from standard software packages such as SAS (SAS Institute,
1999b). Statistical significance of this $T_1$ statistic is evaluated using the $T^2$ distribution. The test
statistic can also be converted to an F statistic,
$$F_T = \frac{N - p - 1}{p(N - 2)}\, T_1 \tag{7}$$
where $N = n_1 + n_2$. Statistical significance is then assessed by comparing the $F_T$ statistic to its
critical value $F[p, N - p - 1]$, that is, a critical value from the F distribution with p and N - p - 1
degrees of freedom (df).
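Equations 5 through 7 can be worked through for the example data with a few lines of Python. The sketch below is illustrative only, not the authors' SAS program; it assumes p = 2 so that the matrix inverse can be written out explicitly, and the helper names are hypothetical. The results round to the values reported for Hotelling's $T^2$ with least-squares estimators in Table 3 ($T_1$ = 6.1, $F_T$ = 2.8).

```python
# Table 1 data: one (Y1, Y2) pair per subject.
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_vector(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(2)]

def covariance_matrix(rows):
    n, m = len(rows), mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(2)] for a in range(2)]

def quadratic_form_inverse(d, m):
    """Return d M^{-1} d^T for a 2-vector d and a 2 x 2 matrix M."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    inv = [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]
    return sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))

n1, n2, p = len(group1), len(group2), 2
N = n1 + n2
s1, s2 = covariance_matrix(group1), covariance_matrix(group2)
# Equation 6: pooled covariance matrix.
pooled = [[((n1 - 1) * s1[a][b] + (n2 - 1) * s2[a][b]) / (N - 2) for b in range(2)]
          for a in range(2)]
m1, m2 = mean_vector(group1), mean_vector(group2)
diff = [m1[k] - m2[k] for k in range(2)]
scale = 1 / n1 + 1 / n2
# Equation 5: Hotelling's statistic.
t1 = quadratic_form_inverse(diff, [[scale * v for v in row] for row in pooled])
# Equation 7: conversion to an F statistic with p and N - p - 1 df.
f_t = (N - p - 1) / (p * (N - 2)) * t1
print(round(t1, 1), round(f_t, 1))
```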
When the data are sampled from populations that follow a normal distribution but have
unequal covariance matrices (i.e., $\boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2$), Hotelling's (1931) $T^2$ will generally maintain the rate
of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to $\alpha$ if the design is
balanced (i.e., $n_1 = n_2$; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins &
Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will
result, depending on the nature of the relationship between the covariance matrices and group
sizes. A liberal result is one in which the actual Type I error rate will exceed $\alpha$. Liberal results
are problematic: researchers will be filling the literature with false positives (i.e., saying there are
treatment effects when none are present). A conservative result, on the other hand, is one in
which the Type I error rate will be less than $\alpha$. Conservative results are also a cause for concern
because they may result in test procedures that have low statistical power to detect true
differences in population means (i.e., real effects will be undetected).
If the group with the largest sample size also exhibits the smallest element values of $\boldsymbol{\Sigma}_j$,
which is known as a negative pairing condition, the test will be liberal. For example, Hopkins
and Clay (1963) showed that when group sizes were 10 and 20 and the ratio of the largest to the
smallest standard deviations of the groups was 1.6, the true rate of Type I errors for $\alpha = .05$ was
.11. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate
was .21, more than four times the nominal level of significance. For positive pairings of group
sizes and covariance matrices, such that the group with the largest sample size also exhibits the
largest element values of $\boldsymbol{\Sigma}_j$, the $T^2$ procedure tends to produce a conservative test. In fact, the
error rate may be substantially below the nominal level of significance. For example, Hopkins
and Clay observed Type I error rates of .02 and .01, respectively, for $\alpha = .05$ for positive pairings
for the two standard deviation values noted previously. These liberal and conservative results for
normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian
et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986)
for both moderate and large degrees of covariance heterogeneity.
When the assumption of multivariate normality is violated, the performance of
Hotelling's (1931) $T^2$ test depends on both the degree of departure from a multivariate normal
distribution and the nature of the research design. The earliest research on Hotelling's $T^2$ when
the data are non-normal suggested that tests of the null hypothesis for two groups were relatively
insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true
when the data are only moderately non-normal. However, Everitt (1979) showed that this test
procedure can become quite conservative when the distribution is skewed or when outliers are
present in the tails of the distribution, particularly when the design is unbalanced (see also
Zwick, 1986).
Tests for Mean Equality when Assumptions are not Satisfied
Both parametric and nonparametric alternatives to the $T^2$ test have been proposed in the
literature. Applied researchers often regard nonparametric procedures as appealing alternatives
because they rely on rank scores, which are typically perceived as being easy to conceptualize
and interpret. However, these procedures test hypotheses about equality of distributions rather
than equality of means. They are therefore sensitive to covariance heterogeneity because
distributions with unequal variances will necessarily result in rejection of the null hypothesis.
Zwick (1986) showed that nonparametric alternatives to Hotelling's (1931) $T^2$ could control the
Type I error rate when the data were sampled from non-normal distributions and covariances
were equal. Not surprisingly, when covariances were unequal, these procedures produced biased
results, particularly when group sizes were unequal.
Covariance heterogeneity. There are several parametric alternatives to Hotelling's (1931)
$T^2$ test. These include the Brown-Forsythe (Brown & Forsythe, 1974 [BF]), James (1954) first
and second order (J1 & J2), Johansen (1980 [J]), Kim (1992 [K]), Nel and Van der Merwe
(1986 [NV]), and Yao (1965 [Y]) procedures. The BF, J1, J2, and J procedures have also been
generalized to multivariate designs containing more than two groups of subjects.
The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic,
$$T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\left(\frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2}\right)^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}} \tag{8}$$
The J, NV, and Y procedures approximate the distribution of $T_2$ differently because the df for
these three procedures are computed using different formulas. The J1 and J2 procedures each use
a different critical value to assess statistical significance. However, they both rely on
large-sample theory regarding the distribution of the test statistic in equation 8. What this means is that
when sample sizes are sufficiently large, the $T_2$ statistic approximately follows a chi-squared ($\chi^2$)
distribution. For both procedures, this test statistic is referred to an "adjusted" $\chi^2$ critical value. If
the test statistic exceeds that critical value, the null hypothesis of equation 2 is rejected. The
critical value for the J1 procedure is slightly smaller than the one for the J2 procedure. As a
result, the J1 procedure generally produces larger Type I error rates than the J2 procedure and
therefore is not often recommended (de la Rey & Nel, 1993). While the J2 procedure may offer
better Type I error control, it is computationally complex. The critical value for J2 is described in
the Appendix, along with the F-statistic conversions and df computations for the J, NV, and Y
procedures.
The K procedure is based on an F statistic. It is more complex than preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed.¹ For completeness, the formula used to compute the K test statistic and its df are
found in the Appendix.
The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly
from the one presented in equation 8:
$$T_{BF} = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\left[\left(1 - \frac{n_1}{N}\right)\mathbf{S}_1 + \left(1 - \frac{n_2}{N}\right)\mathbf{S}_2\right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}} \tag{9}$$
As one can see, the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8. Again, for completeness, the numeric solutions
for the BF F statistic and df are found in the Appendix.
Among the BF, J2, J, K, NV, and Y tests, there appears to be no one best choice in all
data-analytic situations when the data are normally distributed, although a comprehensive
comparison of all of these procedures has not yet been conducted. Factors such as the degree of
covariance heterogeneity, total sample size, the degree of imbalance of the group sizes, and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences. Christensen and Rencher (1997) noted, in their extensive comparison among the J2,
J, K, NV, and Y procedures, that the J and Y procedures could occasionally result in inflated
Type I error rates for negative pairings of group sizes and covariance matrices. These liberal
tendencies were exacerbated as the number of outcome variables increased. The authors
recommended the K procedure overall, observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates. However, the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative. Type I error rates as low as .02 for $\alpha = .05$ were reported when p =
10, $n_1 = 30$, and $n_2 = 20$.
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that
the J1, J2, J, and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed. For a lognormal distribution, which has skewness
of 6.18,² they observed many instances in which empirical Type I error rates of all of these
procedures were more than four times the nominal level of significance. Wilcox (1995b) found
that the J test produced excessive Type I errors when sample sizes were small (i.e., $n_1 = 12$ and
$n_2 = 18$) and the data were generated from non-normal distributions; the K procedure became
conservative when the skewness was 6.18. For larger group sizes (i.e., $n_1 = 24$ and $n_2 = 36$), the J
procedure provided acceptable control of Type I errors when the data were only moderately
non-normal, but for the maximum skewness considered, it also became conservative. Fouladi and
Yockey (2002) found that the degree of departure from a multivariate normal distribution was a
less important predictor of Type I error performance than sample size. Across the range of
conditions which they examined, the Y test produced the greatest average Type I error rates and
the NV procedure the smallest. Error rates were only slightly influenced by the degree of
skewness or kurtosis of the data; however, these authors looked at only very modest departures
from a normal distribution: the maximum degree of skewness considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that
have been proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to "shift" the mean in their direction.
One should recognize at the outset that while robust estimators are insensitive to
departures from a normal distribution, they test a different null hypothesis than least-squares
estimators. The null hypothesis is about equality of trimmed population means. In other words,
one is testing a hypothesis that focuses on the bulk of the population rather than the entire
population. Thus, if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters, then procedures
based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \cdots \le Y_{(n_j)j}$ represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then let $g_j = [\gamma n_j]$,
where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the
distribution and [x] is the greatest integer $\le x$. The effective sample size for the jth group is
defined as $h_j = n_j - 2g_j$. The sample trimmed mean
$$\bar{Y}_{tj} = \frac{1}{h_j}\sum_{i = g_j + 1}^{n_j - g_j} Y_{(i)j} \tag{10}$$
is computed by censoring the $g_j$ smallest and the $g_j$ largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
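Equation 10 translates almost line for line into code. The Python sketch below is a minimal illustration (the function name and the `gamma` argument are hypothetical labels for the trimming proportion); it is applied to the first outcome variable for group 1 of Table 1.

```python
from math import floor

def trimmed_mean(values, gamma=0.20):
    """Mean after dropping the g smallest and g largest scores, g = floor(gamma * n)."""
    ordered = sorted(values)
    n = len(ordered)
    g = floor(gamma * n)          # number of scores trimmed from each tail
    h = n - 2 * g                 # effective sample size
    return sum(ordered[g:n - g]) / h

group1_y1 = [1, 28, 12, 13, 13, 11]   # first outcome variable, group 1 (Table 1)
print(trimmed_mean(group1_y1))        # mean of 11, 12, 13, 13 -> 12.25
```

With 20 percent trimming and six observations, one score is removed from each tail, so the extreme values of 1 and 28 no longer influence the estimate.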
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed:
$$\bar{Y}_{wj} = \frac{1}{n_j}\sum_{i=1}^{n_j} Z_{ij} \tag{11}$$
where
$$Z_{ij} = \begin{cases} Y_{(g_j + 1)j} & \text{if } Y_{ij} \le Y_{(g_j + 1)j} \\ Y_{ij} & \text{if } Y_{(g_j + 1)j} < Y_{ij} < Y_{(n_j - g_j)j} \\ Y_{(n_j - g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j - g_j)j} \end{cases}$$
The Winsorized mean is obtained by replacing the $g_j$ smallest values with the next most extreme
value and the $g_j$ largest values with the next most extreme value. The Winsorized variance for
the jth group on a single outcome variable, $s_{wj}^2$, is
$$s_{wj}^2 = \frac{\sum_{i=1}^{n_j}\left(Z_{ij} - \bar{Y}_{wj}\right)^2}{n_j - 1} \tag{12}$$
and the standard error of the trimmed mean is $\sqrt{(n_j - 1)s_{wj}^2/[h_j(h_j - 1)]}$. The Winsorized
covariance for the outcome variables q and q' (q, q' = 1, ..., p) is
$$s_{wjqq'} = \frac{\sum_{i=1}^{n_j}\left(Z_{ijq} - \bar{Y}_{wjq}\right)\left(Z_{ijq'} - \bar{Y}_{wjq'}\right)}{n_j - 1} \tag{13}$$
and the Winsorized covariance matrix for the jth group is
$$\mathbf{S}_{wj} = \begin{bmatrix} s_{wj1}^2 & s_{wj12} & \cdots & s_{wj1p} \\ \vdots & & \ddots & \vdots \\ s_{wjp1} & \cdots & & s_{wjp}^2 \end{bmatrix}$$
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are
1 11 12 13 13 28
and with 20% trimming, $g_1 = [6 \times .20] = 1$. The scores of 1 and 28 are removed and the mean of
the remaining scores is computed, which produces $\bar{Y}_{t1} = 12.2$. Table 2 contains the vectors of
trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:
11 11 12 13 13 13
The Winsorized mean, $\bar{Y}_{w1}$, is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result.
Table 2 contains the Winsorized covariance matrices for the two groups.
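The Winsorizing step and the Winsorized covariance matrix of equations 11 through 13 can be sketched in Python as follows. This is an illustration, not the authors' program: the function names are hypothetical, each outcome variable is Winsorized separately while the subjects keep their original order, and the example uses group 1 of Table 1 with 20 percent trimming.

```python
from math import floor

def winsorize(values, gamma=0.20):
    """Replace the g lowest scores by the (g+1)th lowest and the g highest by the (g+1)th highest."""
    n = len(values)
    g = floor(gamma * n)
    ordered = sorted(values)
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(v, low), high) for v in values]   # original subject order is kept

def winsorized_covariance(rows, gamma=0.20):
    """Winsorized covariance matrix with divisor n - 1 (equations 12 and 13)."""
    n, p = len(rows), len(rows[0])
    cols = [winsorize([r[k] for r in rows], gamma) for k in range(p)]
    means = [sum(c) / n for c in cols]
    return [[sum((cols[a][i] - means[a]) * (cols[b][i] - means[b]) for i in range(n)) / (n - 1)
             for b in range(p)] for a in range(p)]

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
z = winsorize([r[0] for r in group1])
print(sorted(z))                       # [11, 11, 12, 13, 13, 13]
sw1 = winsorized_covariance(group1)
print([[round(v, 1) for v in row] for row in sw1])   # [[1.0, 0.3], [0.3, 2.3]]
```

The rounded matrix agrees with the group 1 Winsorized covariance matrix in Table 2.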
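A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators, the $T_1$ statistic of equation 5 becomes
$$T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})\left[\left(\frac{1}{h_1} + \frac{1}{h_2}\right)\mathbf{S}_w\right]^{-1}(\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^{\mathsf{T}} \tag{14}$$
where
$$\mathbf{S}_w = \frac{n_1 - 1}{h_1 - 1}\,\mathbf{S}_{w1} + \frac{n_2 - 1}{h_2 - 1}\,\mathbf{S}_{w2} \tag{15}$$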
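The same substitution can be applied to the statistic of equation 8. The Python sketch below is a hedged illustration, not the authors' SAS/IML module. It assumes that the robust counterpart of $\mathbf{S}_j/n_j$ is $(n_j - 1)\mathbf{S}_{wj}/[h_j(h_j - 1)]$, the matrix analogue of the squared standard error of the trimmed mean given earlier; this form is an assumption, although for the Table 1 data with 20 percent trimming it reproduces the value $T_2$ = 65.2 listed under the robust estimators in Table 3. Helper names are hypothetical and p = 2 is hard-coded.

```python
from math import floor

# Table 1 data: one (Y1, Y2) pair per subject.
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
GAMMA = 0.20   # proportion trimmed from each tail

def column(rows, k):
    return [r[k] for r in rows]

def trimmed_mean(values, g):
    ordered = sorted(values)
    return sum(ordered[g:len(ordered) - g]) / (len(ordered) - 2 * g)

def winsorize(values, g):
    ordered = sorted(values)
    low, high = ordered[g], ordered[len(ordered) - g - 1]
    return [min(max(v, low), high) for v in values]

def robust_summary(rows):
    """Trimmed mean vector and assumed covariance matrix of the trimmed means for one group."""
    n = len(rows)
    g = floor(GAMMA * n)
    h = n - 2 * g
    means = [trimmed_mean(column(rows, k), g) for k in range(2)]
    z = [winsorize(column(rows, k), g) for k in range(2)]
    zbar = [sum(col) / n for col in z]
    s_w = [[sum((z[a][i] - zbar[a]) * (z[b][i] - zbar[b]) for i in range(n)) / (n - 1)
            for b in range(2)] for a in range(2)]
    # Assumed robust analogue of S_j / n_j: (n_j - 1) S_wj / [h_j (h_j - 1)]
    a_j = [[(n - 1) * v / (h * (h - 1)) for v in row] for row in s_w]
    return means, a_j

m1, a1 = robust_summary(GROUP1)
m2, a2 = robust_summary(GROUP2)
a = [[a1[r][c] + a2[r][c] for c in range(2)] for r in range(2)]
det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
d = [m1[k] - m2[k] for k in range(2)]
# d A^{-1} d^T written out for the symmetric 2 x 2 case
t2_robust = (a[1][1] * d[0] ** 2 - 2 * a[0][1] * d[0] * d[1] + a[0][0] * d[1] ** 2) / det
print(round(t2_robust, 1))
```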
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators when the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators was
similar to that of the K procedure when sample sizes were sufficiently large (i.e., $n_1 = 24$ and $n_2
= 36$). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm.
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, $\alpha$, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
($\nu_1$) and denominator ($\nu_2$) df for the F statistic and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators the
following data input lines are required
Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized in each tail. To produce the recommended 20% trimming, set PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; separate trimming proportions
for the right and left tails are not specified. The RUN T2MULT code invokes the program and
generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that
these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) T². We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling's (1931) T² test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale that are less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and Van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T² tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of Statistics, Vol. 1 (pp. 199-236). North-Holland: New York.
Ito, K., & Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Author: Cary, NC.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Author: Cary, NC.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate
Behavioral Research, 21, 169-186.
Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) T² Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let wj = nj/N and w̄j = 1 – wj. Then
\[ F_{BF} = \frac{\nu_{BF2}}{p f_2} T_{BF} , \tag{A1} \]
where νBF2 = f2 – p + 1, TBF is given in equation 9, and
\[ f_2 = \frac{\mathrm{tr}(\mathbf{G}_1^2) + (\mathrm{tr}\,\mathbf{G}_1)^2}{\dfrac{1}{n_1 - 1}\left\{ \mathrm{tr}[(\bar{w}_1\mathbf{S}_1)^2] + [\mathrm{tr}(\bar{w}_1\mathbf{S}_1)]^2 \right\} + \dfrac{1}{n_2 - 1}\left\{ \mathrm{tr}[(\bar{w}_2\mathbf{S}_2)^2] + [\mathrm{tr}(\bar{w}_2\mathbf{S}_2)]^2 \right\}} . \tag{A2} \]
In equation A2, tr denotes the trace of a matrix and G1 = w̄1S1 + w̄2S2. The test statistic FBF is
compared to the critical value F[νBF1, νBF2], where
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and G2 = w1S1 + w2S2
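James (1954) Second Order
The test statistic T2 of equation 8 is compared to the critical value χ²p(A + χ²p B) + q,
where χ²p is the 1 – α percentile point of the χ² distribution with p df,
\[ A = 1 + \frac{1}{2p} \left[ \frac{1}{n_1 - 1} \left( \mathrm{tr}\,\mathbf{A}^{-1}\mathbf{A}_1 \right)^2 + \frac{1}{n_2 - 1} \left( \mathrm{tr}\,\mathbf{A}^{-1}\mathbf{A}_2 \right)^2 \right] , \tag{A4} \]
Aj = Sj/nj, A = A1 + A2, and
\[ B = \frac{1}{p(p + 2)} \left[ \frac{1}{n_1 - 1} \left\{ \mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \frac{1}{2} \left( \mathrm{tr}\,\mathbf{A}^{-1}\mathbf{A}_1 \right)^2 \right\} + \frac{1}{n_2 - 1} \left\{ \mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \frac{1}{2} \left( \mathrm{tr}\,\mathbf{A}^{-1}\mathbf{A}_2 \right)^2 \right\} \right] . \tag{A5} \]
The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).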
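Johansen (1980)
Let FJ = T2/c2, where c2 = p + 2C – 6C/(p + 2) and
\[ C = \frac{1}{2} \sum_{j=1}^{2} \frac{1}{n_j - 1} \left\{ \mathrm{tr}[(\mathbf{A}^{-1}\mathbf{A}_j)^2] + [\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j)]^2 \right\} . \tag{A6} \]
The test statistic FJ is compared to the critical value F[p, νJ], where νJ = p(p + 2)/(3C).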
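The Johansen computation can be traced with a short Python sketch for the least-squares estimators of the Table 1 data. This is an illustration for p = 2 only, with hand-written 2 x 2 matrix helpers whose names are hypothetical; it is not the SAS/IML module. It assumes the correction term in c2 is 6C/(p + 2), the form that reproduces the Table 3 entries (FJ of about 2.3 with a denominator df of about 6.9).

```python
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_cov(rows):
    """Mean vector and 2 x 2 least-squares covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in (0, 1)]
    def c(a, b):
        return sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
    return m, [[c(0, 0), c(0, 1)], [c(1, 0), c(1, 1)]]

def scale(m, k):
    return [[v * k for v in row] for row in m]

def add(a, b):
    return [[a[i][j] + b[i][j] for j in (0, 1)] for i in (0, 1)]

def inverse(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in (0, 1)) for j in (0, 1)] for i in (0, 1)]

def trace(m):
    return m[0][0] + m[1][1]

p = 2
sizes = [len(group1), len(group2)]
(m1, s1), (m2, s2) = mean_and_cov(group1), mean_and_cov(group2)
a_j = [scale(s1, 1 / sizes[0]), scale(s2, 1 / sizes[1])]   # A_j = S_j / n_j
a_inv = inverse(add(a_j[0], a_j[1]))                       # inverse of A = A_1 + A_2

# T2 of equation 8: quadratic form in the mean difference
d = [m1[0] - m2[0], m1[1] - m2[1]]
t2 = sum(d[i] * a_inv[i][j] * d[j] for i in (0, 1) for j in (0, 1))

# Equation A6
c_const = 0.0
for aj, nj in zip(a_j, sizes):
    r = matmul(a_inv, aj)
    c_const += (trace(matmul(r, r)) + trace(r) ** 2) / (nj - 1)
c_const *= 0.5

c2 = p + 2 * c_const - 6 * c_const / (p + 2)   # assumed form of the correction
f_j = t2 / c2
nu_j = p * (p + 2) / (3 * c_const)
print(round(t2, 1), round(f_j, 1), round(nu_j, 1))
```

The resulting FJ would then be referred to an F distribution with p and νJ df; obtaining the p-value itself needs an F distribution function, which is outside this sketch.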
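Kim (1992)
The K procedure is based on the test statistic
\[ F_K = \frac{\nu_K}{c\, m\, f_1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \mathbf{V}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T} , \tag{A7} \]
where
\[ \mathbf{V} = r^2 \mathbf{A}_1 + \mathbf{A}_2 + 2r\, \mathbf{A}_2^{1/2} \left( \mathbf{A}_2^{-1/2} \mathbf{A}_1 \mathbf{A}_2^{-1/2} \right)^{1/2} \mathbf{A}_2^{1/2} , \]
\[ c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l} , \tag{A8} \]
\[ m = \frac{\left( \sum_{l=1}^{p} h_l \right)^2}{\sum_{l=1}^{p} h_l^2} , \tag{A9} \]
hl = (dl + 1)/(dl^(1/2) + r)², where dl is the lth eigenvalue of A1A2⁻¹, r = |A1A2⁻¹|^(1/(2p)), and | | is the
determinant. The test statistic FK is compared to the critical value F[m, νK], where νK = f1 – p + 1,
\[ \frac{1}{f_1} = \sum_{j=1}^{2} \frac{1}{n_j - 1} \left( \frac{b_j}{T_2} \right)^2 , \tag{A10} \]
and
\[ b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \mathbf{V}^{-1} \mathbf{A}_j \mathbf{V}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T} . \]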
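Nel and van der Merwe (1986)
Let
\[ F_{NV} = \frac{\nu_N}{p f_2} T_2 , \tag{A11} \]
where νN = f2 – p + 1 and
\[ f_2 = \frac{\mathrm{tr}(\mathbf{A}^2) + (\mathrm{tr}\,\mathbf{A})^2}{\sum_{j=1}^{2} \dfrac{1}{n_j - 1} \left\{ \mathrm{tr}(\mathbf{A}_j^2) + (\mathrm{tr}\,\mathbf{A}_j)^2 \right\}} . \tag{A12} \]
The FNV statistic is compared to the critical value F[p, νN].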
Yao (1965)
The statistic FY is referred to the critical value F[p, νK], where
\[ F_Y = \frac{\nu_K}{p f_1} T_2 , \tag{A13} \]
f1 is given by equation A10, and νK again equals f1 – p + 1.
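To make the df approximations concrete, the Python sketch below evaluates equations A11 to A13 for the least-squares estimators of the Table 1 data. It is an illustration for p = 2 with hypothetical helper names, not the authors' SAS/IML code. For the Yao statistic it assumes that V in equation A10 is taken to be A = A1 + A2, which gives a denominator df of about 6.1, as in Table 3.

```python
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_cov(rows):
    """Mean vector and 2 x 2 least-squares covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in (0, 1)]
    def c(a, b):
        return sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
    return m, [[c(0, 0), c(0, 1)], [c(1, 0), c(1, 1)]]

def inverse(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in (0, 1)) for j in (0, 1)] for i in (0, 1)]

def trace(m):
    return m[0][0] + m[1][1]

def quad(u, m):
    """u * m * u' for a length-2 vector u."""
    return sum(u[i] * m[i][j] * u[j] for i in (0, 1) for j in (0, 1))

p = 2
sizes = [len(group1), len(group2)]
(m1, s1), (m2, s2) = mean_and_cov(group1), mean_and_cov(group2)
a_j = [[[v / n for v in row] for row in s] for s, n in ((s1, sizes[0]), (s2, sizes[1]))]
a = [[a_j[0][i][j] + a_j[1][i][j] for j in (0, 1)] for i in (0, 1)]
a_inv = inverse(a)
d = [m1[0] - m2[0], m1[1] - m2[1]]
t2 = quad(d, a_inv)                                   # equation 8

# Nel and van der Merwe: equations A12 and A11
f2_num = trace(matmul(a, a)) + trace(a) ** 2
f2_den = sum((trace(matmul(aj, aj)) + trace(aj) ** 2) / (n - 1)
             for aj, n in zip(a_j, sizes))
f2 = f2_num / f2_den
nu_n = f2 - p + 1
f_nv = nu_n / (p * f2) * t2

# Yao: equation A10 with V = A (an assumption of this sketch), then A13
u = [sum(a_inv[i][k] * d[k] for k in (0, 1)) for i in (0, 1)]   # inverse(A) * d'
inv_f1 = sum((quad(u, aj) / t2) ** 2 / (n - 1) for aj, n in zip(a_j, sizes))
f1 = 1 / inv_f1
nu_k = f1 - p + 1
f_y = nu_k / (p * f1) * t2
print(round(f_nv, 1), round(nu_n, 1), round(f_y, 1), round(nu_k, 1))
```

Both statistics rescale the same T2; they differ only in how the denominator df is approximated, which is why their F values in Table 3 are close but not identical.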
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of the matrix.
2. The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set
Group   Subject   Yi1   Yi2
1       1           1    51
1       2          28    48
1       3          12    49
1       4          13    51
1       5          13    52
1       6          11    47
2       1          19    46
2       2          18    48
2       3          18    50
2       4          21    50
2       5          19    45
2       6          20    46
2       7          22    48
2       8          18    49
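Table 2. Summary Statistics for Least-Squares and Robust Estimators
Least-Squares Estimators
\[ \bar{\mathbf{Y}}_1 = [\,13.0 \;\; 49.7\,] \qquad \bar{\mathbf{Y}}_2 = [\,19.4 \;\; 47.7\,] \]
\[ \mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix} \qquad \mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix} \]
Robust Estimators
\[ \bar{\mathbf{Y}}_{t1} = [\,12.2 \;\; 49.8\,] \qquad \bar{\mathbf{Y}}_{t2} = [\,19.2 \;\; 47.8\,] \]
\[ \mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix} \qquad \mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix} \]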
Table 3. Hypothesis Test Results for Multivariate Example Data Set
Procedure   Test Statistic              df                     p-value/Critical value (CV)   Decision re: Null Hypothesis
Least-Squares Estimators
T²          T1 = 6.1, FT = 2.8          ν1 = 2, ν2 = 11        p = .106                      Fail to Reject
BF          TBF = 9.1, FBF = 3.7        ν1 = 4, ν2 = 4.4       p = .116                      Fail to Reject
J2          T2 = 5.0                    ν1 = 2                 CV = 14.2                     Fail to Reject
J           T2 = 5.0, FJ = 2.3          ν1 = 2, ν2 = 6.9       p = .175                      Fail to Reject
K           FK = 2.5                    ν1 = 1.5, ν2 = 6.1     p = .164                      Fail to Reject
NV          T2 = 5.0, FNV = 2.0         ν1 = 2, ν2 = 4.4       p = .237                      Fail to Reject
Y           T2 = 5.0, FY = 2.1          ν1 = 2, ν2 = 6.1       p = .198                      Fail to Reject
Robust Estimators
T²          T1 = 59.0, FT = 25.8        ν1 = 2, ν2 = 7         p = .001                      Reject
BF          TBF = 131.2, FBF = 56.2     ν1 = 5, ν2 = 6.0       p = .001                      Reject
J2          T2 = 65.2                   ν1 = 2                 CV = 13.3                     Reject
J           T2 = 65.2, FJ = 29.5        ν1 = 2, ν2 = 6.3       p = .001                      Reject
K           FK = 28.1                   ν1 = 2.0, ν2 = 6.6     p = .001                      Reject
NV          T2 = 65.2, FNV = 27.9       ν1 = 2, ν2 = 6.0       p = .001                      Reject
Y           T2 = 65.2, FY = 28.3        ν1 = 2, ν2 = 6.6       p = .001                      Reject
Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
To generalize to the multivariate context, assume that we have measurements for each
subject on p outcome variables. In other words, instead of just a single value, we now have a set
of p values for each subject. Using matrix notation, Yij represents the vector (i.e., row) of p
outcome measurements for the ith subject in the jth group, that is,
Yij = [Yij1 … Yijp]. For example, Yij may represent the values for a series of measures of physical
function or attitudes towards a healthcare intervention. It is assumed that Yij follows a
multivariate normal distribution with mean μj and variance-covariance matrix Σj (i.e., Yij ~ N[μj,
Σj]). The vector μj contains the mean scores on each outcome variable, that is,
μj = [μj1 μj2 … μjp]. The variance-covariance matrix Σj is a p x p matrix with the variances for
each outcome variable on the diagonal and the covariances for all pairs of outcome variables on
the off diagonal,
\[ \boldsymbol{\Sigma}_j = \begin{bmatrix} \sigma_{j1}^2 & \sigma_{j12} & \cdots & \sigma_{j1p} \\ \vdots & & \ddots & \vdots \\ \sigma_{j1p} & \cdots & & \sigma_{jp}^2 \end{bmatrix} . \]
The null hypothesis
\[ H_0 : \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 \tag{2} \]
is used to test whether the means for the set of p outcome variables are equal across the two
groups.
As Knapp and Miller (1983) observe, adopting a test of multivariate (i.e., joint)
equivalence is preferable to adopting multiple tests of univariate equivalence, particularly when
the outcome variables are correlated. Type I errors, which are erroneous conclusions about true
null hypotheses, may occur when multiple univariate tests are performed. If the outcome
variables are independent, the probability that at least one erroneous decision will be made on the
set of p outcome variables is 1 – (1 – α)^p, which is approximately equal to αp when α is small,
where α is the nominal level of significance. For example, with α = .05 and p = 3, the probability
of making at least one erroneous decision is ≈ .14.
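The arithmetic behind this example is easy to reproduce. The short Python sketch below (the function name is illustrative) evaluates the exact expression and the αp approximation for α = .05 and p = 3, under the stated assumption that the p univariate tests are independent.

```python
def familywise_error(alpha, p):
    """Probability of at least one Type I error across p independent tests."""
    return 1.0 - (1.0 - alpha) ** p

exact = familywise_error(0.05, 3)   # 1 - 0.95 ** 3
approx = 0.05 * 3                   # the alpha * p approximation
print(round(exact, 4), round(approx, 4))
```

The gap between the two values widens as p or α grows, so the approximation is best reserved for small α and a handful of outcome variables.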
To illustrate the multivariate concepts that have been presented to this point, we will use
the example data set of Table 1. These data are for two groups of subjects and two outcome
variables. Let nj represent the sample size for the jth group. The example data are for an
unbalanced design (i.e., unequal group sizes), where n1 = 6 and n2 = 8. The vector of scores for
the first subject of group 1 is Y11 = [1 51], the vector for the second subject of group 1 is
Y21 = [28 48], and so on.
Let Ȳj and Sj represent the sample mean vector and sample covariance matrix for the jth
group. Table 2 contains these summary statistics for the example data set. The mean scores for
the first outcome variable are 13.0 and 19.4 for groups 1 and 2, respectively. For the second
outcome variable, the corresponding means are 49.7 and 47.8. The variances for groups 1 and 2
on the first outcome measure are 74.8 and 2.3, respectively. The larger variance for the first
group is primarily due to the presence of two extreme values of 1 and 28 for the first and second
subjects, respectively. The corresponding variances on the second outcome variable are 3.9 and
3.6. For group 1, the covariance for the two outcome variables is –7.0. The population
correlation for two variables, q and q΄ (i.e., ρqq΄; q, q΄ = 1, …, p), can be obtained from the
covariance and the variances,
\[ \rho_{qq'} = \frac{\sigma_{qq'}}{\sqrt{\sigma_q^2 \, \sigma_{q'}^2}} , \]
where σqq΄ is the covariance and σ²q is the variance for the qth outcome variable. The sample
correlation coefficient, rqq΄, is used to estimate the population correlation coefficient. In the
example data set of Table 1, a moderate negative correlation of rqq΄ = –.4 exists for the two
variables for group 1.
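The group 1 summary statistics quoted above can be reproduced with a few lines of Python. The sketch below is illustrative only (the helper names are not from the article) and uses the usual n – 1 divisor for the sample variances and covariance.

```python
import math

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance with divisor n - 1; covariance(x, x) is the variance."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

y1 = [row[0] for row in group1]   # first outcome variable
y2 = [row[1] for row in group1]   # second outcome variable

mean_vector = (mean(y1), mean(y2))
var1, var2 = covariance(y1, y1), covariance(y2, y2)
cov12 = covariance(y1, y2)
r12 = cov12 / math.sqrt(var1 * var2)   # sample correlation from covariance and variances
print([round(v, 1) for v in mean_vector], round(var1, 1), round(var2, 1),
      round(cov12, 1), round(r12, 2))
```

Dropping the first two subjects (the scores of 1 and 28) and rerunning the same functions shows how much of the 74.8 variance is attributable to those two observations.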
Tests for Mean Equality when Assumptions are Satisfied
Student's t statistic is the conventional procedure for testing the null hypothesis of
equality of population means in a univariate design (i.e., equation 1). The test statistic, which
assumes equality of population variances, is
\[ t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} , \tag{3} \]
where Ȳj represents the mean for the jth group and s², the variance that is pooled for the two
groups, is
\[ s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} , \tag{4} \]
where s²j is the variance for the jth group.
The multivariate Hotelling's (1931) T² statistic is formed from equation 3 by replacing
means with mean vectors and the pooled variance with the pooled covariance matrix,
\[ T_1 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \left[ \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \mathbf{S} \right]^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T} , \tag{5} \]
where T is the transpose operator, which is used to convert the row vector (Ȳ1 – Ȳ2) to a
column vector, ⁻¹ denotes the inverse of a matrix, and S is the pooled sample covariance matrix,
\[ \mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2} . \tag{6} \]
This test statistic is easily obtained from standard software packages such as SAS (SAS Institute,
1999b). Statistical significance of this T1 statistic is evaluated using the T² distribution. The test
statistic can also be converted to an F statistic,
\[ F_T = \frac{N - p - 1}{p(N - 2)} T_1 , \tag{7} \]
where N = n1 + n2. Statistical significance is then assessed by comparing the FT statistic to its
critical value, F[p, N – p – 1], that is, a critical value from the F distribution with p and N – p – 1
degrees of freedom (df).
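Equations 5 to 7 can be illustrated with the example data of Table 1. The following Python sketch is a minimal illustration for p = 2 outcome variables, with hypothetical helper names; it pools the two covariance matrices, forms the quadratic form in the mean difference, and converts the result to an F statistic. For these data it gives values of about 6.1 and 2.8 on 2 and 11 df.

```python
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_cov(rows):
    """Mean vector and 2 x 2 sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in (0, 1)]
    def c(a, b):
        return sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
    return m, [[c(0, 0), c(0, 1)], [c(1, 0), c(1, 1)]]

def hotelling_t2(g1, g2):
    """Return (T1, F_T, df1, df2) following equations 5, 6, and 7 for p = 2."""
    p = 2
    n1, n2 = len(g1), len(g2)
    (m1, s1), (m2, s2) = mean_and_cov(g1), mean_and_cov(g2)
    # Equation 6: pooled covariance matrix
    pooled = [[((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / (n1 + n2 - 2)
               for j in (0, 1)] for i in (0, 1)]
    # Matrix inside the brackets of equation 5
    m = [[(1 / n1 + 1 / n2) * v for v in row] for row in pooled]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    # d * inverse(m) * d' written out for a symmetric 2 x 2 matrix
    t1 = (m[1][1] * d[0] ** 2 - 2 * m[0][1] * d[0] * d[1] + m[0][0] * d[1] ** 2) / det
    n_total = n1 + n2
    f_t = (n_total - p - 1) / (p * (n_total - 2)) * t1   # equation 7
    return t1, f_t, p, n_total - p - 1

t1, f_t, df1, df2 = hotelling_t2(group1, group2)
print(round(t1, 1), round(f_t, 1), df1, df2)
```

The closed-form 2 x 2 inverse keeps the sketch dependency-free; for more than two outcome variables a general matrix inverse would be needed.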
When the data are sampled from populations that follow a normal distribution but have
unequal covariance matrices (i.e., Σ1 ≠ Σ2), Hotelling's (1931) T² will generally maintain the rate
of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to α if the design is
balanced (i.e., n1 = n2; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins &
Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will
result, depending on the nature of the relationship between the covariance matrices and group
sizes. A liberal result is one in which the actual Type I error rate will exceed α. Liberal results
are problematic; researchers will be filling the literature with false positives (i.e., saying there are
treatment effects when none are present). A conservative result, on the other hand, is one in
which the Type I error rate will be less than α. Conservative results are also a cause for concern
because they may result in test procedures that have low statistical power to detect true
differences in population means (i.e., real effects will be undetected).
If the group with the largest sample size also exhibits the smallest element values of Σj,
which is known as a negative pairing condition, the test will be liberal. For example, Hopkins
and Clay (1963) showed that when group sizes were 10 and 20 and the ratio of the largest to the
smallest standard deviations of the groups was 1.6, the true rate of Type I errors for α = .05 was
.11. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate
was .21, more than four times the nominal level of significance. For positive pairings of group
sizes and covariance matrices, such that the group with the largest sample size also exhibits the
largest element values of Σj, the T² procedure tends to produce a conservative test. In fact, the
error rate may be substantially below the nominal level of significance. For example, Hopkins
and Clay observed Type I error rates of .02 and .01, respectively, for α = .05 for positive pairings
for the two standard deviation ratios noted previously. These liberal and conservative results for
normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian
et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986)
for both moderate and large degrees of covariance heterogeneity.
When the assumption of multivariate normality is violated, the performance of
Hotelling's (1931) T² test depends on both the degree of departure from a multivariate normal
distribution and the nature of the research design. The earliest research on Hotelling's T² when
the data are non-normal suggested that tests of the null hypothesis for two groups were relatively
insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true
when the data are only moderately non-normal. However, Everitt (1979) showed that this test
procedure can become quite conservative when the distribution is skewed or when outliers are
present in the tails of the distribution, particularly when the design is unbalanced (see also
Zwick, 1986).
Tests for Mean Equality when Assumptions are not Satisfied
Both parametric and nonparametric alternatives to the T² test have been proposed in the
literature. Applied researchers often regard nonparametric procedures as appealing alternatives
because they rely on rank scores, which are typically perceived as being easy to conceptualize
and interpret. However, these procedures test hypotheses about equality of distributions rather
than equality of means. They are therefore sensitive to covariance heterogeneity, because
distributions with unequal variances will necessarily result in rejection of the null hypothesis.
Zwick (1986) showed that nonparametric alternatives to Hotelling's (1931) T² could control the
Type I error rate when the data were sampled from non-normal distributions and covariances
were equal. Not surprisingly, when covariances were unequal these procedures produced biased
results, particularly when group sizes were unequal.
Covariance heterogeneity. There are several parametric alternatives to Hotelling's (1931)
T² test. These include the Brown-Forsythe (Brown & Forsythe, 1974 [BF]), James (1954) first
and second order (J1 & J2), Johansen (1980 [J]), Kim (1992 [K]), Nel and Van der Merwe
(1986 [NV]), and Yao (1965 [Y]) procedures. The BF, J1, J2, and J procedures have also been
generalized to multivariate designs containing more than two groups of subjects.
The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic,
\[ T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \left( \frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2} \right)^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T} . \tag{8} \]
The J, NV, and Y procedures approximate the distribution of T2 differently because the df for
these three procedures are computed using different formulas. The J1 and J2 procedures each use
a different critical value to assess statistical significance. However, they both rely on
large-sample theory regarding the distribution of the test statistic in equation 8. What this means is that
when sample sizes are sufficiently large, the T2 statistic approximately follows a chi-squared (χ²)
distribution. For both procedures, this test statistic is referred to an "adjusted" χ² critical value. If
the test statistic exceeds that critical value, the null hypothesis of equation 2 is rejected. The
critical value for the J1 procedure is slightly smaller than the one for the J2 procedure. As a
result, the J1 procedure generally produces larger Type I error rates than the J2 procedure and
therefore is not often recommended (de la Rey & Nel, 1993). While the J2 procedure may offer
better Type I error control, it is computationally complex. The critical value for J2 is described in
the Appendix, along with the F-statistic conversions and df computations for the J, NV, and Y
procedures.
The K procedure is based on an F statistic. It is more complex than the preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed.¹ For completeness, the formula used to compute the K test statistic and its df are
found in the Appendix.
The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly
from the one presented in equation 8,
\[ T_{BF} = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \left[ \left( 1 - \frac{n_1}{N} \right) \mathbf{S}_1 + \left( 1 - \frac{n_2}{N} \right) \mathbf{S}_2 \right]^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T} . \tag{9} \]
As one can see, the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8. Again, for completeness, the numeric solutions
for the BF F statistic and df are found in the Appendix.
Among the BF, J2, J, K, NV, and Y tests, there appears to be no one best choice in all
data-analytic situations when the data are normally distributed, although a comprehensive
comparison of all of these procedures has not yet been conducted. Factors such as the degree of
covariance heterogeneity, total sample size, the degree of imbalance of the group sizes, and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences. Christensen and Rencher (1997) noted, in their extensive comparison among the J2,
J, K, NV, and Y procedures, that the J and Y procedures could occasionally result in inflated
Type I error rates for negative pairings of group sizes and covariance matrices. These liberal
tendencies were exacerbated as the number of outcome variables increased. The authors
recommended the K procedure overall, observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates. However, the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative. Type I error rates as low as .02 for α = .05 were reported when p =
10, n1 = 30, and n2 = 20.
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that
the J1, J2, J, and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed. For a lognormal distribution, which has skewness
of 6.18,² they observed many instances in which empirical Type I error rates of all of these
procedures were more than four times the nominal level of significance. Wilcox (1995b) found
that the J test produced excessive Type I errors when sample sizes were small (i.e., n1 = 12 and
n2 = 18) and the data were generated from non-normal distributions; the K procedure became
conservative when the skewness was 6.18. For larger group sizes (i.e., n1 = 24 and n2 = 36), the J
procedure provided acceptable control of Type I errors when the data were only moderately
non-normal, but for the maximum skewness considered it also became conservative. Fouladi and
Yockey (2002) found that the degree of departure from a multivariate normal distribution was a
less important predictor of Type I error performance than sample size. Across the range of
conditions which they examined, the Y test produced the greatest average Type I error rates and
the NV procedure the smallest. Error rates were only slightly influenced by the degree of
skewness or kurtosis of the data; however, these authors looked at only very modest departures
from a normal distribution; the maximum degree of skewness considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that
have been proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to "shift" the mean in their direction.
One should recognize at the outset that while robust estimators are insensitive to
departures from a normal distribution, they test a different null hypothesis than least-squares
estimators. The null hypothesis is about equality of trimmed population means. In other words,
one is testing a hypothesis that focuses on the bulk of the population, rather than the entire
population. Thus, if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters, then procedures
based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \dots \le Y_{(n_j)j}$ represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then let $g_j = [\gamma n_j]$,
where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the
distribution and $[x]$ is the greatest integer $\le x$. The effective sample size for the jth group is
defined as $h_j = n_j - 2g_j$. The sample trimmed mean

$$\bar{Y}_{tj} = \frac{1}{h_j} \sum_{i=g_j+1}^{n_j-g_j} Y_{(i)j} \tag{10}$$

is computed by censoring the $g_j$ smallest and the $g_j$ largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
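The trimmed-mean computation can be sketched in a few lines of Python. This is an illustration only and is not part of the SAS/IML module described in this article; the function name `trimmed_mean` and the small tolerance added before taking the integer part (to guard against floating-point products such as .29 × 100 falling just below 29) are our own assumptions.

```python
import math

def trimmed_mean(scores, gamma=0.20):
    """Mean of the scores left after censoring g = [gamma * n] values in each tail."""
    n = len(scores)
    # [x] is the greatest integer <= x; the epsilon guards against float round-off.
    g = math.floor(gamma * n + 1e-9)
    h = n - 2 * g  # effective sample size
    if h <= 0:
        raise ValueError("the trimming proportion removes every observation")
    ordered = sorted(scores)
    kept = ordered[g:n - g]  # drop the g smallest and the g largest scores
    return sum(kept) / h

# First outcome variable, group 1 of the Table 1 example data
group1_y1 = [1, 28, 12, 13, 13, 11]
print(trimmed_mean(group1_y1, 0.20))
print(trimmed_mean(group1_y1, 0.0))   # gamma = 0 gives the usual least-squares mean
```

With 20% trimming the scores 1 and 28 are censored and the mean of 11, 12, 13, and 13 is returned; with no trimming the function reduces to the ordinary mean.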
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed:

$$\bar{Y}_{wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Z_{ij} \tag{11}$$

where

$$Z_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j} \end{cases}$$

The Winsorized mean is obtained by replacing the $g_j$ smallest values with the next most extreme
value and the $g_j$ largest values with the next most extreme value. The Winsorized variance for
the jth group on a single outcome variable, $s^2_{wj}$, is

$$s^2_{wj} = \frac{\sum_{i=1}^{n_j} (Z_{ij} - \bar{Y}_{wj})^2}{n_j - 1} \tag{12}$$

and the standard error of the trimmed mean is $\sqrt{(n_j - 1)\, s^2_{wj} / [h_j (h_j - 1)]}$. The Winsorized
covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is

$$s_{wjqq'} = \frac{\sum_{i=1}^{n_j} (Z_{ijq} - \bar{Y}_{wjq})(Z_{ijq'} - \bar{Y}_{wjq'})}{n_j - 1} \tag{13}$$

and the Winsorized covariance matrix for the jth group is

$$\mathbf{S}_{wj} = \begin{bmatrix} s^2_{wj1} & s_{wj12} & \cdots & s_{wj1p} \\ \vdots & & \ddots & \vdots \\ s_{wjp1} & \cdots & & s^2_{wjp} \end{bmatrix}$$
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are

1   11   12   13   13   28

and with 20% trimming, $g_1 = [6 \times .20] = 1$. The scores of 1 and 28 are removed and the mean of
the remaining scores is computed, which produces $\bar{Y}_{t1} = 12.2$. Table 2 contains the vectors of
trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:

11   11   12   13   13   13

The Winsorized mean $\bar{Y}_{w1}$ is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result.
Table 2 contains the Winsorized covariance matrices for the two groups.
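The Winsorizing steps above can be sketched as follows. This Python fragment is an illustration only, not the SAS/IML module; the function names are hypothetical, and each outcome variable is assumed to be Winsorized separately while the pairing of scores within subjects is kept, which is our reading of equation 13.

```python
import math

def winsorize(scores, gamma=0.20):
    """Replace the g smallest and g largest scores by the next most extreme values."""
    n = len(scores)
    g = math.floor(gamma * n + 1e-9)
    ordered = sorted(scores)
    low, high = ordered[g], ordered[n - g - 1]   # Y_(g+1) and Y_(n-g)
    return [min(max(y, low), high) for y in scores]

def winsorized_cov(x, y, gamma=0.20):
    """Winsorized covariance (equation 13); with x == y it is the Winsorized variance."""
    zx, zy = winsorize(x, gamma), winsorize(y, gamma)
    n = len(zx)
    mean_x, mean_y = sum(zx) / n, sum(zy) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(zx, zy)) / (n - 1)

# Group 1 of the Table 1 example data
y1 = [1, 28, 12, 13, 13, 11]
y2 = [51, 48, 49, 51, 52, 47]

z1 = winsorize(y1)
print(sorted(z1), sum(z1) / len(z1))            # Winsorized scores and Winsorized mean
print(winsorized_cov(y1, y1), winsorized_cov(y1, y2), winsorized_cov(y2, y2))
```

For group 1 the three printed elements round to the 1.0, 0.3, and 2.3 shown for the Winsorized covariance matrix in Table 2.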
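A test that is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators, the $T_1$ statistic of equation 5 becomes

$$T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2}) \left[ \left( \frac{1}{h_1} + \frac{1}{h_2} \right) \mathbf{S}_w \right]^{-1} (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^T \tag{14}$$

where

$$\mathbf{S}_w = \frac{(n_1 - 1)\mathbf{S}_{w1} + (n_2 - 1)\mathbf{S}_{w2}}{(h_1 - 1) + (h_2 - 1)} \tag{15}$$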
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators and the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators was
similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24 and n2
= 36). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
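To make the substitution of robust estimators concrete, the following Python sketch computes two robust statistics for the Table 1 data with 20% trimming. It is an illustration only, not the SAS/IML module. Two assumptions are made and should be checked against the program: the pooled Winsorized matrix of equation 15 divides by $h_1 + h_2 - 2$, and the robust analogue of $\mathbf{S}_j/n_j$ in the separate-covariance statistic is taken to be $(n_j - 1)\mathbf{S}_{wj}/[h_j(h_j - 1)]$, following the standard error of the trimmed mean. The helper names are hypothetical and the code handles two outcome variables only.

```python
import math

GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def trim_count(n, gamma):
    return math.floor(gamma * n + 1e-9)

def trimmed_mean(scores, gamma):
    g = trim_count(len(scores), gamma)
    kept = sorted(scores)[g:len(scores) - g]
    return sum(kept) / len(kept)

def winsorize(scores, gamma):
    g = trim_count(len(scores), gamma)
    ordered = sorted(scores)
    low, high = ordered[g], ordered[len(scores) - g - 1]
    return [min(max(y, low), high) for y in scores]

def cross_products(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def robust_summary(group, gamma):
    y1 = [row[0] for row in group]
    y2 = [row[1] for row in group]
    n = len(group)
    h = n - 2 * trim_count(n, gamma)
    means = (trimmed_mean(y1, gamma), trimmed_mean(y2, gamma))
    z1, z2 = winsorize(y1, gamma), winsorize(y2, gamma)
    # Winsorized sums of squares and cross-products, i.e. (n - 1) * S_w
    sscp = (cross_products(z1, z1), cross_products(z1, z2), cross_products(z2, z2))
    return h, means, sscp

def quadratic_form(d, m):
    """d M^{-1} d' for a symmetric 2 x 2 matrix stored as (m11, m12, m22)."""
    m11, m12, m22 = m
    det = m11 * m22 - m12 * m12
    return (m22 * d[0] ** 2 - 2 * m12 * d[0] * d[1] + m11 * d[1] ** 2) / det

def robust_statistics(group1, group2, gamma=0.20):
    h1, t1, w1 = robust_summary(group1, gamma)
    h2, t2, w2 = robust_summary(group2, gamma)
    d = (t1[0] - t2[0], t1[1] - t2[1])
    # Equations 14 and 15: (1/h1 + 1/h2) times the pooled Winsorized matrix
    scale = (1 / h1 + 1 / h2) / (h1 + h2 - 2)
    pooled = tuple(scale * (a + b) for a, b in zip(w1, w2))
    # Separate-covariance form (assumed scaling, see the lead-in)
    separate = tuple(a / (h1 * (h1 - 1)) + b / (h2 * (h2 - 1)) for a, b in zip(w1, w2))
    return quadratic_form(d, pooled), quadratic_form(d, separate)

t_pooled, t_separate = robust_statistics(GROUP1, GROUP2)
print(round(t_pooled, 1), round(t_separate, 1))
```

Under these assumptions the two statistics come out near 59.0 and 65.2, the robust $T_1$ and $T_2$ values listed in Table 3.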
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:

Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified. The RUN T2MULT code invokes the program and
generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that
these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) $T^2$. We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate trimmed means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling's (1931) $T^2$ test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and Van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample $T^2$ tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample $T^2$ procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's $T^2$. Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate $T^2$ and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of Statistics, Vol. 1 (pp. 199-236). North-Holland, New York.
Ito, K., & Schull, W. J. (1964). On the robustness of the $T_0^2$ test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Author: Cary, NC.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Author: Cary, NC.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's $T^2$. Multivariate
Behavioral Research, 21, 169-186.
Appendix
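Numeric Formulas for Alternatives to Hotelling's (1931) $T^2$ Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then

$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF} \tag{A1}$$

where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and

$$f_2 = \frac{tr(\mathbf{G}_1^2) + (tr\,\mathbf{G}_1)^2}{\dfrac{1}{n_1 - 1}\left\{ tr[(\bar{w}_1 \mathbf{S}_1)^2] + [tr(\bar{w}_1 \mathbf{S}_1)]^2 \right\} + \dfrac{1}{n_2 - 1}\left\{ tr[(\bar{w}_2 \mathbf{S}_2)^2] + [tr(\bar{w}_2 \mathbf{S}_2)]^2 \right\}} \tag{A2}$$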
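In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1 \mathbf{S}_1 + \bar{w}_2 \mathbf{S}_2$. The test statistic $F_{BF}$ is
compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where

$$\nu_{BF1} = \frac{tr(\mathbf{G}_1^2) + (tr\,\mathbf{G}_1)^2}{tr(\mathbf{G}_2^2) + (tr\,\mathbf{G}_2)^2 + tr[(\bar{w}_1 \mathbf{S}_1)^2] + [tr(\bar{w}_1 \mathbf{S}_1)]^2 + tr[(\bar{w}_2 \mathbf{S}_2)^2] + [tr(\bar{w}_2 \mathbf{S}_2)]^2} \tag{A3}$$

and $\mathbf{G}_2 = w_1 \mathbf{S}_1 + w_2 \mathbf{S}_2$.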
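James (1954) Second Order
The test statistic $T_2$ of equation 8 is compared to the critical value $\chi^2_p (A + \chi^2_p B) + q$,
where $\chi^2_p$ is the 1 − α percentile point of the $\chi^2$ distribution with p df,

$$A = 1 + \frac{1}{2p} \left[ \frac{(tr\, \mathbf{A}^{-1}\mathbf{A}_1)^2}{n_1 - 1} + \frac{(tr\, \mathbf{A}^{-1}\mathbf{A}_2)^2}{n_2 - 1} \right] \tag{A4}$$

$\mathbf{A}_j = \mathbf{S}_j / n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and

$$B = \frac{1}{p(p + 2)} \left[ \frac{tr(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \frac{1}{2}(tr\, \mathbf{A}^{-1}\mathbf{A}_1)^2}{n_1 - 1} + \frac{tr(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \frac{1}{2}(tr\, \mathbf{A}^{-1}\mathbf{A}_2)^2}{n_2 - 1} \right] \tag{A5}$$

The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).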
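Johansen (1980)
Let $F_J = T_2 / c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and

$$C = \frac{1}{2} \sum_{j=1}^{2} \frac{1}{n_j - 1} \left\{ tr[(\mathbf{A}^{-1}\mathbf{A}_j)^2] + (tr\, \mathbf{A}^{-1}\mathbf{A}_j)^2 \right\} \tag{A6}$$

The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.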
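As a check on the J formulas, the following Python sketch evaluates C, $c_2$, $\nu_J$, and $F_J$ for the least-squares estimators of the Table 1 data. It is an illustration only, not the SAS/IML module; the helper names are hypothetical, the matrix routines handle two outcome variables only, and the divisor p + 2 in $c_2$ is taken from Johansen (1980).

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_scaled_cov(group):
    """Return n_j, the mean vector, and A_j = S_j / n_j for two outcome variables."""
    n = len(group)
    m = [sum(row[k] for row in group) / n for k in range(2)]
    a = [[sum((row[i] - m[i]) * (row[j] - m[j]) for row in group) / (n - 1) / n
          for j in range(2)] for i in range(2)]
    return n, m, a

def inverse(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def product(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def trace(m):
    return m[0][0] + m[1][1]

def johansen(group1, group2):
    p = 2
    n1, m1, a1 = mean_and_scaled_cov(group1)
    n2, m2, a2 = mean_and_scaled_cov(group2)
    a_inv = inverse([[a1[i][j] + a2[i][j] for j in range(2)] for i in range(2)])
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    t2 = sum(d[i] * a_inv[i][j] * d[j] for i in range(2) for j in range(2))  # equation 8
    c_sum = 0.0
    for n_j, a_j in ((n1, a1), (n2, a2)):
        m = product(a_inv, a_j)
        c_sum += (trace(product(m, m)) + trace(m) ** 2) / (n_j - 1)
    big_c = 0.5 * c_sum                                   # equation A6
    c2 = p + 2 * big_c - 6 * big_c / (p + 2)
    nu_j = p * (p + 2) / (3 * big_c)
    return t2 / c2, nu_j

f_j, nu_j = johansen(GROUP1, GROUP2)
print(round(f_j, 2), round(nu_j, 2))
```

The results, roughly $F_J$ = 2.27 on 2 and 6.9 df, agree to rounding with the least-squares J row of Table 3.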
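Kim (1992)
The K procedure is based on the test statistic

$$F_K = \frac{\nu_K\, (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\, \mathbf{V}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T}{c\, m\, f_1} \tag{A7}$$

where $\mathbf{V} = r^2 \mathbf{A}_2 + \mathbf{A}_1 + 2r\, \mathbf{A}_2^{1/2} (\mathbf{A}_2^{-1/2} \mathbf{A}_1 \mathbf{A}_2^{-1/2})^{1/2} \mathbf{A}_2^{1/2}$,

$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l} \tag{A8}$$

$$m = \frac{\left( \sum_{l=1}^{p} h_l \right)^2}{\sum_{l=1}^{p} h_l^2} \tag{A9}$$

$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1 \mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1 \mathbf{A}_2^{-1}|^{1/(2p)}$,
and | | is the determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$,
where $\nu_K = f_1 - p + 1$,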
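
$$\frac{1}{f_1} = \sum_{j=1}^{2} \frac{1}{n_j - 1} \left( \frac{b_j}{T_2} \right)^2 \tag{A10}$$

and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)\, \mathbf{V}^{-1} \mathbf{A}_j \mathbf{V}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T$.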
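The constants r, $h_l$, c, and m of the K procedure depend only on the eigenvalues of $\mathbf{A}_1 \mathbf{A}_2^{-1}$, so they are easy to check by hand. The Python sketch below is an illustration only, not the SAS/IML module; it is restricted to p = 2, where both eigenvalues follow from the trace and determinant of the 2 × 2 product, and all helper names are hypothetical. It does not compute $\mathbf{V}$, $f_1$, or $F_K$.

```python
import math

GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def scaled_covariance(group):
    """Return A_j = S_j / n_j as (a11, a12, a22) for two outcome variables."""
    n = len(group)
    m = [sum(row[k] for row in group) / n for k in range(2)]

    def s(i, j):
        return sum((row[i] - m[i]) * (row[j] - m[j]) for row in group) / (n - 1)

    return (s(0, 0) / n, s(0, 1) / n, s(1, 1) / n)

def kim_constants(group1, group2):
    p = 2  # this sketch handles two outcome variables only
    a11, a12, a22 = scaled_covariance(group1)
    b11, b12, b22 = scaled_covariance(group2)
    det1 = a11 * a22 - a12 ** 2
    det2 = b11 * b22 - b12 ** 2
    # Trace and determinant of A1 * inverse(A2) give both eigenvalues when p = 2
    tr = (a11 * b22 - 2 * a12 * b12 + a22 * b11) / det2
    det = det1 / det2
    root = math.sqrt(tr ** 2 - 4 * det)
    eigenvalues = [(tr + root) / 2, (tr - root) / 2]
    r = det ** (1 / (2 * p))
    h = [(d + 1) / (math.sqrt(d) + r) ** 2 for d in eigenvalues]
    c = sum(x ** 2 for x in h) / sum(h)          # equation A8
    m = sum(h) ** 2 / sum(x ** 2 for x in h)     # equation A9
    return r, eigenvalues, c, m

r, eigenvalues, c, m = kim_constants(GROUP1, GROUP2)
print(round(r, 3), [round(d, 3) for d in eigenvalues], round(c, 3), round(m, 3))
```

For the least-squares estimators of the Table 1 data, m is about 1.5, which is the numerator df reported for the K procedure in Table 3.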
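Nel and van der Merwe (1986)
Let

$$F_{NV} = \frac{\nu_N}{p f_2}\, T_2 \tag{A11}$$

where $\nu_N = f_2 - p + 1$ and

$$f_2 = \frac{tr(\mathbf{A}^2) + (tr\, \mathbf{A})^2}{\sum_{j=1}^{2} \dfrac{1}{n_j - 1} \left[ tr(\mathbf{A}_j^2) + (tr\, \mathbf{A}_j)^2 \right]} \tag{A12}$$

The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.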
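Equations A11 and A12 can be evaluated directly for the least-squares estimators of the Table 1 data. The Python sketch below is an illustration only, not the SAS/IML module; the helper names are hypothetical and the code handles two outcome variables only.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_scaled_cov(group):
    """Return n_j, the mean vector, and A_j = S_j / n_j for two outcome variables."""
    n = len(group)
    m = [sum(row[k] for row in group) / n for k in range(2)]
    a = [[sum((row[i] - m[i]) * (row[j] - m[j]) for row in group) / (n - 1) / n
          for j in range(2)] for i in range(2)]
    return n, m, a

def trace_terms(m):
    """tr(M^2) + (tr M)^2 for a symmetric 2 x 2 matrix M."""
    tr_sq = m[0][0] ** 2 + 2 * m[0][1] * m[1][0] + m[1][1] ** 2
    return tr_sq + (m[0][0] + m[1][1]) ** 2

def nel_van_der_merwe(group1, group2):
    p = 2
    n1, m1, a1 = mean_and_scaled_cov(group1)
    n2, m2, a2 = mean_and_scaled_cov(group2)
    a = [[a1[i][j] + a2[i][j] for j in range(2)] for i in range(2)]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    # T2 of equation 8, written out for a 2 x 2 matrix
    t2 = (a[1][1] * d[0] ** 2 - 2 * a[0][1] * d[0] * d[1] + a[0][0] * d[1] ** 2) / det
    f2 = trace_terms(a) / (trace_terms(a1) / (n1 - 1) + trace_terms(a2) / (n2 - 1))  # A12
    nu_n = f2 - p + 1
    return nu_n / (p * f2) * t2, nu_n                                                # A11

f_nv, nu_n = nel_van_der_merwe(GROUP1, GROUP2)
print(round(f_nv, 2), round(nu_n, 2))
```

The output, approximately $F_{NV}$ = 2.03 with $\nu_N$ = 4.41, matches the least-squares NV row of Table 3 after rounding.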
Yao (1965)
The statistic $F_Y$ is referred to the critical value $F[p, \nu_K]$,

$$F_Y = \frac{\nu_K}{p f_1}\, T_2 \tag{A13}$$

where $f_1$ is given by equation A10 and $\nu_K$ again equals $f_1 - p + 1$.
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of a matrix.
2. The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set

Group   Subject   Yi1   Yi2
1       1           1    51
1       2          28    48
1       3          12    49
1       4          13    51
1       5          13    52
1       6          11    47
2       1          19    46
2       2          18    48
2       3          18    50
2       4          21    50
2       5          19    45
2       6          20    46
2       7          22    48
2       8          18    49
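Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$        $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$        $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$

Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$        $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$        $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$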
Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure   Test Statistic             df                   p-value/Critical value (CV)   Decision re: Null Hypothesis

Least-Squares Estimators
T²          T1 = 6.1, FT = 2.8         ν1 = 2, ν2 = 11      p = .106                      Fail to Reject
BF          TBF = 9.1, FBF = 3.7       ν1 = 4, ν2 = 4.4     p = .116                      Fail to Reject
J2          T2 = 5.0                   ν1 = 2               CV = 14.2                     Fail to Reject
J           T2 = 5.0, FJ = 2.3         ν1 = 2, ν2 = 6.9     p = .175                      Fail to Reject
K           FK = 2.5                   ν1 = 1.5, ν2 = 6.1   p = .164                      Fail to Reject
NV          T2 = 5.0, FNV = 2.0        ν1 = 2, ν2 = 4.4     p = .237                      Fail to Reject
Y           T2 = 5.0, FY = 2.1         ν1 = 2, ν2 = 6.1     p = .198                      Fail to Reject

Robust Estimators
T²          T1 = 59.0, FT = 25.8       ν1 = 2, ν2 = 7       p = .001                      Reject
BF          TBF = 131.2, FBF = 56.2    ν1 = 5, ν2 = 6.0     p = .001                      Reject
J2          T2 = 65.2                  ν1 = 2               CV = 13.3                     Reject
J           T2 = 65.2, FJ = 29.5       ν1 = 2, ν2 = 6.3     p = .001                      Reject
K           FK = 28.1                  ν1 = 2.0, ν2 = 6.6   p = .001                      Reject
NV          T2 = 65.2, FNV = 27.9      ν1 = 2, ν2 = 6.0     p = .001                      Reject
Y           T2 = 65.2, FY = 28.3       ν1 = 2, ν2 = 6.6     p = .001                      Reject

Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
where α is the nominal level of significance. For example, with α = .05 and p = 3, the probability
of making at least one erroneous decision is ≈ .14.
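The figure of .14 can be verified with a one-line calculation. The Python sketch below assumes that the probability in question is $1 - (1 - \alpha)^p$, the usual expression for the chance of at least one Type I error across p independent tests each conducted at level α; the function name is hypothetical.

```python
def familywise_error(alpha, p):
    """Probability of at least one Type I error in p independent tests at level alpha."""
    return 1 - (1 - alpha) ** p

print(round(familywise_error(0.05, 3), 2))
```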
To illustrate the multivariate concepts that have been presented to this point, we will use
the example data set of Table 1. These data are for two groups of subjects and two outcome
variables. Let $n_j$ represent the sample size for the jth group. The example data are for an
unbalanced design (i.e., unequal group sizes), where n1 = 6 and n2 = 8. The vector of scores for
the first subject of group 1 is Y11 = [1 51], the vector for the second subject of group 1 is
Y21 = [28 48], and so on.
Let $\bar{\mathbf{Y}}_j$ and $\mathbf{S}_j$ represent the sample mean vector and sample covariance matrix for the jth
group. Table 2 contains these summary statistics for the example data set. The mean scores for
the first outcome variable are 13.0 and 19.4 for groups 1 and 2, respectively. For the second
outcome variable, the corresponding means are 49.7 and 47.8. The variances for groups 1 and 2
on the first outcome measure are 74.8 and 2.3, respectively. The larger variance for the first
group is primarily due to the presence of two extreme values of 1 and 28 for the first and second
subjects, respectively. The corresponding variances on the second outcome variable are 3.9 and
3.6. For group 1, the covariance for the two outcome variables is −7.0. The population
correlation for two variables q and q′ (i.e., $\rho_{qq'}$; q, q′ = 1, …, p) can be obtained from the
covariance and the variances,

$$\rho_{qq'} = \frac{\sigma_{qq'}}{\sqrt{\sigma_q^2\, \sigma_{q'}^2}}$$

where $\sigma_{qq'}$ is the covariance and $\sigma_q^2$ is the variance for the qth outcome variable. The sample
correlation coefficient $r_{qq'}$ is used to estimate the population correlation coefficient. In the
example data set of Table 1, a moderate negative correlation of $r_{qq'}$ = −.4 exists for the two
variables for group 1.
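The group 1 summary statistics quoted above can be reproduced with a short Python sketch. It is an illustration only; the helper names are hypothetical and the code is written for two outcome variables.

```python
import math

GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]

def mean_vector(group):
    n = len(group)
    return [sum(row[k] for row in group) / n for k in range(2)]

def covariance_matrix(group):
    """Least-squares covariance matrix with divisor n - 1."""
    n = len(group)
    m = mean_vector(group)
    return [[sum((row[i] - m[i]) * (row[j] - m[j]) for row in group) / (n - 1)
             for j in range(2)] for i in range(2)]

def correlation(cov):
    """Sample correlation from the covariance and the two variances."""
    return cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])

means = mean_vector(GROUP1)
cov = covariance_matrix(GROUP1)
print([round(v, 1) for v in means])
print([[round(v, 1) for v in row] for row in cov])
print(round(correlation(cov), 2))
```

The printed correlation is about −.41, which the text reports as −.4.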
Tests for Mean Equality when Assumptions are Satisfied
Student's t statistic is the conventional procedure for testing the null hypothesis of
equality of population means in a univariate design (i.e., equation 1). The test statistic, which
assumes equality of population variances, is

$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} \tag{3}$$

where $\bar{Y}_j$ represents the mean for the jth group and $s^2$, the variance that is pooled for the two
groups, is

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \tag{4}$$

where $s_j^2$ is the variance for the jth group.
The multivariate Hotelling's (1931) $T^2$ statistic is formed from equation 3 by replacing
means with mean vectors and the pooled variance with the pooled covariance matrix,

$$T_1 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \left[ \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \mathbf{S} \right]^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T \tag{5}$$

where T is the transpose operator, which is used to convert the row vector $(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$ to a
column vector, $^{-1}$ denotes the inverse of a matrix, and $\mathbf{S}$ is the pooled sample covariance matrix,

$$\mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2} \tag{6}$$

This test statistic is easily obtained from standard software packages such as SAS (SAS Institute,
1999b). Statistical significance of this $T_1$ statistic is evaluated using the $T^2$ distribution. The test
statistic can also be converted to an F statistic,

$$F_T = \frac{N - p - 1}{p(N - 2)}\, T_1 \tag{7}$$

where N = n1 + n2. Statistical significance is then assessed by comparing the $F_T$ statistic to its
critical value, F[p, N − p − 1], that is, a critical value from the F distribution with p and N − p − 1
degrees of freedom (df).
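Equations 5 through 7 can be traced numerically with the Table 1 data. The Python sketch below is an illustration only, not the SAS/IML module; the helper names are hypothetical and the 2 × 2 inverse is written out explicitly, so the code applies to two outcome variables only.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_sscp(group):
    """Mean vector and sums of squares and cross-products, (n - 1) * S."""
    n = len(group)
    m = [sum(row[k] for row in group) / n for k in range(2)]
    sscp = [[sum((row[i] - m[i]) * (row[j] - m[j]) for row in group)
             for j in range(2)] for i in range(2)]
    return n, m, sscp

def hotelling(group1, group2):
    p = 2
    n1, m1, w1 = mean_and_sscp(group1)
    n2, m2, w2 = mean_and_sscp(group2)
    big_n = n1 + n2
    # (1/n1 + 1/n2) * S, with S the pooled covariance matrix of equation 6
    scale = (1 / n1 + 1 / n2) / (big_n - 2)
    v = [[scale * (w1[i][j] + w2[i][j]) for j in range(2)] for i in range(2)]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    t1 = (v[1][1] * d[0] ** 2 - 2 * v[0][1] * d[0] * d[1] + v[0][0] * d[1] ** 2) / det  # eq. 5
    f_t = (big_n - p - 1) / (p * (big_n - 2)) * t1                                      # eq. 7
    return t1, f_t, (p, big_n - p - 1)

t1, f_t, df = hotelling(GROUP1, GROUP2)
print(round(t1, 2), round(f_t, 2), df)
```

The values, about $T_1$ = 6.06 and $F_T$ = 2.78 on 2 and 11 df, correspond to the first row of Table 3.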
When the data are sampled from populations that follow a normal distribution but have
unequal covariance matrices (i.e., $\Sigma_1 \ne \Sigma_2$), Hotelling's (1931) $T^2$ will generally maintain the rate
of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to α if the design is
balanced (i.e., n1 = n2; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins &
Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will
result, depending on the nature of the relationship between the covariance matrices and group
sizes. A liberal result is one in which the actual Type I error rate will exceed α. Liberal results
are problematic; researchers will be filling the literature with false positives (i.e., saying there are
treatment effects when none are present). A conservative result, on the other hand, is one in
which the Type I error rate will be less than α. Conservative results are also a cause for concern
because they may result in test procedures that have low statistical power to detect true
differences in population means (i.e., real effects will be undetected).
If the group with the largest sample size also exhibits the smallest element values of $\Sigma_j$,
which is known as a negative pairing condition, the test will be liberal. For example, Hopkins
and Clay (1963) showed that when group sizes were 10 and 20 and the ratio of the largest to the
smallest standard deviations of the groups was 1.6, the true rate of Type I errors for α = .05 was
.11. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate
was .21, more than four times the nominal level of significance. For positive pairings of group
sizes and covariance matrices, such that the group with the largest sample size also exhibits the
largest element values of $\Sigma_j$, the $T^2$ procedure tends to produce a conservative test. In fact, the
error rate may be substantially below the nominal level of significance. For example, Hopkins
and Clay observed Type I error rates of .02 and .01, respectively, for α = .05 for positive pairings
for the two standard deviation values noted previously. These liberal and conservative results for
normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian
et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986)
for both moderate and large degrees of covariance heterogeneity.
When the assumption of multivariate normality is violated, the performance of
Hotelling's (1931) $T^2$ test depends on both the degree of departure from a multivariate normal
distribution and the nature of the research design. The earliest research on Hotelling's $T^2$ when
the data are non-normal suggested that tests of the null hypothesis for two groups were relatively
insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true
when the data are only moderately non-normal. However, Everitt (1979) showed that this test
procedure can become quite conservative when the distribution is skewed or when outliers are
present in the tails of the distribution, particularly when the design is unbalanced (see also
Zwick, 1986).
Tests for Mean Equality when Assumptions are not Satisfied
Both parametric and nonparametric alternatives to the $T^2$ test have been proposed in the
literature. Applied researchers often regard nonparametric procedures as appealing alternatives
because they rely on rank scores, which are typically perceived as being easy to conceptualize
and interpret. However, these procedures test hypotheses about equality of distributions rather
than equality of means. They are therefore sensitive to covariance heterogeneity, because
distributions with unequal variances will necessarily result in rejection of the null hypothesis.
Zwick (1986) showed that nonparametric alternatives to Hotelling's (1931) $T^2$ could control the
Type I error rate when the data were sampled from non-normal distributions and covariances
were equal. Not surprisingly, when covariances were unequal these procedures produced biased
results, particularly when group sizes were unequal.
Covariance heterogeneity. There are several parametric alternatives to Hotelling's (1931)
$T^2$ test. These include the Brown-Forsythe (Brown & Forsythe, 1974 [BF]), James (1954) first
and second order (J1 & J2), Johansen (1980 [J]), Kim (1992 [K]), Nel and Van der Merwe
(1986 [NV]), and Yao (1965 [Y]) procedures. The BF, J1, J2, and J procedures have also been
generalized to multivariate designs containing more than two groups of subjects.
The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic,

$$T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \left( \frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2} \right)^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T \tag{8}$$

The J, NV, and Y procedures approximate the distribution of $T_2$ differently because the df for
these three procedures are computed using different formulas. The J1 and J2 procedures each use
a different critical value to assess statistical significance. However, they both rely on
large-sample theory regarding the distribution of the test statistic in equation 8. What this means is that
when sample sizes are sufficiently large, the $T_2$ statistic approximately follows a chi-squared ($\chi^2$)
distribution. For both procedures, this test statistic is referred to an "adjusted" $\chi^2$ critical value. If
the test statistic exceeds that critical value, the null hypothesis of equation 2 is rejected. The
critical value for the J1 procedure is slightly smaller than the one for the J2 procedure. As a
result, the J1 procedure generally produces larger Type I error rates than the J2 procedure and
therefore is not often recommended (de la Rey & Nel, 1993). While the J2 procedure may offer
better Type I error control, it is computationally complex. The critical value for J2 is described in
the Appendix, along with the F-statistic conversions and df computations for the J, NV, and Y
procedures.
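The statistic of equation 8 is straightforward to compute for the Table 1 data. The Python sketch below is an illustration only, not the SAS/IML module; the helper names are hypothetical and the code handles two outcome variables.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_cov(group):
    n = len(group)
    m = [sum(row[k] for row in group) / n for k in range(2)]
    s = [[sum((row[i] - m[i]) * (row[j] - m[j]) for row in group) / (n - 1)
          for j in range(2)] for i in range(2)]
    return n, m, s

def separate_covariance_t2(group1, group2):
    """T2 of equation 8: the group covariance matrices are not pooled."""
    n1, m1, s1 = mean_and_cov(group1)
    n2, m2, s2 = mean_and_cov(group2)
    a = [[s1[i][j] / n1 + s2[i][j] / n2 for j in range(2)] for i in range(2)]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return (a[1][1] * d[0] ** 2 - 2 * a[0][1] * d[0] * d[1] + a[0][0] * d[1] ** 2) / det

t2 = separate_covariance_t2(GROUP1, GROUP2)
print(round(t2, 2))
```

The result, about 4.97, is the $T_2$ = 5.0 that Table 3 lists for the J2, J, NV, and Y procedures with least-squares estimators.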
The K procedure is based on an F statistic. It is more complex than preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed.¹ For completeness, the formula used to compute the K test statistic and its df are
found in the Appendix.
The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly
from the one presented in equation 8,

$$T_{BF} = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \left[ \left( 1 - \frac{n_1}{N} \right) \mathbf{S}_1 + \left( 1 - \frac{n_2}{N} \right) \mathbf{S}_2 \right]^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T \tag{9}$$

As one can see, the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8. Again, for completeness, the numeric solutions
for the BF F statistic and df are found in the Appendix.
Among the BF, J2, J, K, NV, and Y tests, there appears to be no one best choice in all
data-analytic situations when the data are normally distributed, although a comprehensive
comparison of all of these procedures has not yet been conducted. Factors such as the degree of
covariance heterogeneity, total sample size, the degree of imbalance of the group sizes, and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences. Christensen and Rencher (1997) noted, in their extensive comparison among the J2,
J, K, NV, and Y procedures, that the J and Y procedures could occasionally result in inflated
Type I error rates for negative pairings of group sizes and covariance matrices. These liberal
tendencies were exacerbated as the number of outcome variables increased. The authors
recommended the K procedure overall, observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates. However, the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative. Type I error rates as low as .02 for α = .05 were reported when p =
10, n1 = 30, and n2 = 20.
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that
the J1, J2, J, and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed. For a lognormal distribution, which has skewness
of 6.18,² they observed many instances in which empirical Type I error rates of all of these
procedures were more than four times the nominal level of significance. Wilcox (1995b) found
that the J test produced excessive Type I errors when sample sizes were small (i.e., n1 = 12 and
n2 = 18) and the data were generated from non-normal distributions; the K procedure became
conservative when the skewness was 6.18. For larger group sizes (i.e., n1 = 24 and n2 = 36), the J
procedure provided acceptable control of Type I errors when the data were only moderately
non-normal, but for the maximum skewness considered it also became conservative. Fouladi and
Yockey (2002) found that the degree of departure from a multivariate normal distribution was a
less important predictor of Type I error performance than sample size. Across the range of
conditions which they examined, the Y test produced the greatest average Type I error rates and
the NV procedure the smallest. Error rates were only slightly influenced by the degree of
skewness or kurtosis of the data; however, these authors looked at only very modest departures
from a normal distribution; the maximum degree of skewness considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). A number of robust estimators have been
proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to "shift" the mean in their direction.
One should recognize at the outset that, while robust estimators are insensitive to
departures from a normal distribution, they test a different null hypothesis than least-squares
estimators. The null hypothesis is about equality of trimmed population means. In other words,
one is testing a hypothesis that focuses on the bulk of the population rather than the entire
population. Thus, if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters, then procedures
based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \cdots \le Y_{(n_j)j}$ represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then let $g_j = [\gamma n_j]$,
where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the
distribution and [x] is the greatest integer ≤ x. The effective sample size for the jth group is
defined as $h_j = n_j - 2g_j$. The sample trimmed mean

\[ \bar{Y}_{tj} = \frac{1}{h_j} \sum_{i=g_j+1}^{n_j-g_j} Y_{(i)j} \tag{10} \]

is computed by censoring the $g_j$ smallest and the $g_j$ largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
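The trimmed mean of equation 10 can be sketched in a few lines of Python. This is only an illustration, not the authors' SAS/IML module; the function name `trimmed_mean` and the argument name `gamma` are chosen here for convenience, and the data are the group 1 scores on the first outcome variable from Table 1.

```python
import math

def trimmed_mean(values, gamma=0.20):
    """Symmetric trimmed mean: censor the g smallest and g largest scores."""
    n = len(values)
    g = math.floor(gamma * n)           # g_j = [gamma * n_j]
    kept = sorted(values)[g:n - g]      # h_j = n_j - 2 * g_j scores remain
    return sum(kept) / len(kept)

group1_y1 = [1, 28, 12, 13, 13, 11]     # Table 1, group 1, first outcome variable
print(trimmed_mean(group1_y1))          # 12.25
```

With 20 percent trimming and six scores, one score is removed from each tail, so the extreme values 1 and 28 do not contribute to the estimate.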
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed:

\[ \bar{Y}_{wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Z_{ij} \tag{11} \]

where

\[ Z_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j} \end{cases} \]

The Winsorized mean is obtained by replacing the $g_j$ smallest values with the next most extreme
value and the $g_j$ largest values with the next most extreme value. The Winsorized variance for
the jth group on a single outcome variable, $s^2_{wj}$, is

\[ s^2_{wj} = \frac{\sum_{i=1}^{n_j} (Z_{ij} - \bar{Y}_{wj})^2}{n_j - 1} \tag{12} \]

and the standard error of the trimmed mean is $\sqrt{(n_j - 1)s^2_{wj} / [h_j(h_j - 1)]}$. The Winsorized
covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is

\[ s_{wjqq'} = \frac{\sum_{i=1}^{n_j} (Z_{ijq} - \bar{Y}_{wjq})(Z_{ijq'} - \bar{Y}_{wjq'})}{n_j - 1} \tag{13} \]

and the Winsorized covariance matrix for the jth group is

\[ \mathbf{S}_{wj} = \begin{bmatrix} s^2_{wj1} & \cdots & s_{wj1p} \\ \vdots & \ddots & \vdots \\ s_{wjp1} & \cdots & s^2_{wjp} \end{bmatrix} \]
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are

1 11 12 13 13 28

and with 20% trimming, g1 = [6 × .20] = 1. The scores of 1 and 28 are removed and the mean of
the remaining scores is computed, which produces $\bar{Y}_{t1}$ = 12.2. Table 2 contains the vectors of
trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:

11 11 12 13 13 13

The Winsorized mean, $\bar{Y}_{w1}$, is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean to one decimal place, these two estimators will not, as a rule, produce
an equivalent result. Table 2 contains the Winsorized covariance matrices for the two groups.
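The Winsorizing step and the Winsorized covariance matrix of equations 11 to 13 can be sketched as follows. The helper names `winsorize` and `winsorized_cov` are hypothetical, and each outcome variable is Winsorized separately within a group, which is how the definitions above read; treat this as an illustration under that assumption rather than as the authors' program.

```python
import math

def winsorize(values, gamma=0.20):
    """Pull the g smallest and g largest scores in to the next most extreme score."""
    n = len(values)
    g = math.floor(gamma * n)
    ordered = sorted(values)
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(v, low), high) for v in values]   # subject order is preserved

def winsorized_cov(x, y, gamma=0.20):
    """Winsorized covariance (equation 13); with x == y it is the variance (equation 12)."""
    zx, zy = winsorize(x, gamma), winsorize(y, gamma)
    n = len(zx)
    mx, my = sum(zx) / n, sum(zy) / n
    return sum((a - mx) * (b - my) for a, b in zip(zx, zy)) / (n - 1)

# Table 1, group 1: first and second outcome variables
y1 = [1, 28, 12, 13, 13, 11]
y2 = [51, 48, 49, 51, 52, 47]
s_w1 = [[winsorized_cov(y1, y1), winsorized_cov(y1, y2)],
        [winsorized_cov(y2, y1), winsorized_cov(y2, y2)]]
for row in s_w1:
    print([round(v, 1) for v in row])   # [1.0, 0.3] then [0.3, 2.3]
```

Keeping the Winsorized scores in subject order matters for the covariance, because the cross-products pair the two outcome values observed on the same subject.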
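A test that is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators, the T1 statistic of equation 5 becomes

\[ T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^{T} \left[ \left( \frac{1}{h_1} + \frac{1}{h_2} \right) \mathbf{S}_w \right]^{-1} (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2}) \tag{14} \]

where

\[ \mathbf{S}_w = \frac{(n_1 - 1)\mathbf{S}_{w1} + (n_2 - 1)\mathbf{S}_{w2}}{(h_1 - 1) + (h_2 - 1)} \tag{15} \]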
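Equations 14 and 15 can be checked numerically for the Table 1 data with a short Python sketch. Several things here are assumptions made for illustration: the function names are hypothetical, the code handles only p = 2 outcome variables (so the 2 × 2 inverse is written out by hand), and the F conversion applies equation 7 with h1 + h2 in place of N.

```python
import math

def trim_count(n, gamma):
    return math.floor(gamma * n)                     # g_j = [gamma * n_j]

def trimmed_mean(values, gamma):
    g = trim_count(len(values), gamma)
    kept = sorted(values)[g:len(values) - g]
    return sum(kept) / len(kept)

def winsorize(values, gamma):
    g = trim_count(len(values), gamma)
    ordered = sorted(values)
    low, high = ordered[g], ordered[len(values) - g - 1]
    return [min(max(v, low), high) for v in values]

def winsorized_cov(x, y, gamma):
    zx, zy = winsorize(x, gamma), winsorize(y, gamma)
    mx, my = sum(zx) / len(zx), sum(zy) / len(zy)
    return sum((a - mx) * (b - my) for a, b in zip(zx, zy)) / (len(zx) - 1)

def robust_t1(group1, group2, gamma=0.20):
    """Equations 14 and 15 for p = 2; each group is a list of (y1, y2) pairs."""
    summaries = []
    for rows in (group1, group2):
        y1, y2 = zip(*rows)
        n = len(rows)
        h = n - 2 * trim_count(n, gamma)
        means = (trimmed_mean(y1, gamma), trimmed_mean(y2, gamma))
        cov = (winsorized_cov(y1, y1, gamma),        # s11
               winsorized_cov(y1, y2, gamma),        # s12
               winsorized_cov(y2, y2, gamma))        # s22
        summaries.append((n, h, means, cov))
    (n1, h1, m1, c1), (n2, h2, m2, c2) = summaries
    # (1/h1 + 1/h2) * S_w, with S_w pooled as in equation 15
    scale = (1 / h1 + 1 / h2) / (h1 + h2 - 2)
    a, b, c = (scale * ((n1 - 1) * u + (n2 - 1) * v) for u, v in zip(c1, c2))
    d1, d2 = m1[0] - m2[0], m1[1] - m2[1]
    # quadratic form d' M^{-1} d for the symmetric matrix M = [[a, b], [b, c]]
    t1 = (c * d1 * d1 - 2 * b * d1 * d2 + a * d2 * d2) / (a * c - b * b)
    p, h = 2, h1 + h2
    f_t = (h - p - 1) / (p * (h - 2)) * t1           # assumed robust analogue of equation 7
    return t1, f_t

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
t1, f_t = robust_t1(group1, group2)
print(round(t1, 1), round(f_t, 1))                   # 59.0 25.8
```

The two printed values agree with the robust T1 and FT entries reported for this data set in Table 3.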
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators and the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators
was similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24
and n2 = 36). More importantly, however, there was a dramatic improvement in power when the
test procedures with robust estimators were compared to their least-squares counterparts; this
was observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions.
The differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm.
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic, and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:

Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;

The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified separately. The RUN T2MULT code invokes the
program and generates output. Observe that each line of code ends with a semicolon. Also, it is
necessary that these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) T2. We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling's (1931) T2 test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale that are less influenced by the presence of extreme scores in the tails of a distribution.
Robust estimators based on the concepts of trimming and Winsorizing result in the most extreme
scores either being removed or replaced by less extreme scores. To facilitate the adoption of the
robust test procedures by applied researchers, we have presented a computer program that can be
used to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and Van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b), because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T2 tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T2 procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 86, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T2. Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 21, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T2 and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of statistics, Vol. 1 (pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the $T_0^2$ test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26,
1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T2. Multivariate
Behavioral Research, 21, 169-186.
Appendix
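Numeric Formulas for Alternatives to Hotelling's (1931) T2 Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then

\[ F_{BF} = \frac{\nu_{BF2}}{p f_2} T_{BF} \tag{A1} \]

where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and

\[ f_2 = \frac{\mathrm{tr}(\mathbf{G}_1^2) + [\mathrm{tr}(\mathbf{G}_1)]^2}{\frac{1}{n_1 - 1}\left\{ \mathrm{tr}[(\bar{w}_1\mathbf{S}_1)^2] + [\mathrm{tr}(\bar{w}_1\mathbf{S}_1)]^2 \right\} + \frac{1}{n_2 - 1}\left\{ \mathrm{tr}[(\bar{w}_2\mathbf{S}_2)^2] + [\mathrm{tr}(\bar{w}_2\mathbf{S}_2)]^2 \right\}} \tag{A2} \]

In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1\mathbf{S}_1 + \bar{w}_2\mathbf{S}_2$. The test statistic $F_{BF}$ is
compared to the critical value F[νBF1, νBF2], where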
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and $\mathbf{G}_2 = w_1\mathbf{S}_1 + w_2\mathbf{S}_2$.
James (1954) Second Order
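The test statistic T2 of equation 8 is compared to the critical value $\chi^2_p (A + \chi^2_p B) + q$,
where $\chi^2_p$ is the 1 − α percentile point of the χ2 distribution with p df,

\[ A = 1 + \frac{1}{2p}\left\{ \frac{1}{n_1 - 1}[\mathrm{tr}(\bar{\mathbf{A}}^{-1}\mathbf{A}_1)]^2 + \frac{1}{n_2 - 1}[\mathrm{tr}(\bar{\mathbf{A}}^{-1}\mathbf{A}_2)]^2 \right\} \tag{A4} \]

$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\bar{\mathbf{A}} = \mathbf{A}_1 + \mathbf{A}_2$, and

\[ B = \frac{1}{p(p + 2)}\left\{ \frac{1}{n_1 - 1}\left[ \mathrm{tr}(\bar{\mathbf{A}}^{-1}\mathbf{A}_1\bar{\mathbf{A}}^{-1}\mathbf{A}_1) + \frac{1}{2}[\mathrm{tr}(\bar{\mathbf{A}}^{-1}\mathbf{A}_1)]^2 \right] + \frac{1}{n_2 - 1}\left[ \mathrm{tr}(\bar{\mathbf{A}}^{-1}\mathbf{A}_2\bar{\mathbf{A}}^{-1}\mathbf{A}_2) + \frac{1}{2}[\mathrm{tr}(\bar{\mathbf{A}}^{-1}\mathbf{A}_2)]^2 \right] \right\} \tag{A5} \]

The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).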
Johansen (1980)
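Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 1)$ and

\[ C = \frac{1}{2}\sum_{j=1}^{2} \frac{1}{n_j - 1}\left\{ \mathrm{tr}[(\bar{\mathbf{A}}^{-1}\mathbf{A}_j)^2] + [\mathrm{tr}(\bar{\mathbf{A}}^{-1}\mathbf{A}_j)]^2 \right\} \tag{A6} \]

The test statistic $F_J$ is compared to the critical value F[p, νJ], where $\nu_J = p(p + 2)/(3C)$.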
Kim (1992)
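The K procedure is based on the test statistic

\[ F_K = \frac{\nu_K}{c\,m\,f_1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \tag{A7} \]

where $\mathbf{V} = r^2\mathbf{A}_2 + \mathbf{A}_1 + 2r\,\mathbf{A}_2^{1/2}(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2})^{1/2}\mathbf{A}_2^{1/2}$,

\[ c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l} \tag{A8} \]

\[ m = \frac{\left( \sum_{l=1}^{p} h_l \right)^2}{\sum_{l=1}^{p} h_l^2} \tag{A9} \]

$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1\mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1\mathbf{A}_2^{-1}|^{1/(2p)}$,
and | | is the determinant. The test statistic $F_K$ is compared to the critical value F[m, νK],
where $\nu_K = f_1 - p + 1$,

\[ \frac{1}{f_1} = \sum_{j=1}^{2} \frac{1}{n_j - 1}\left( \frac{b_j}{T_2} \right)^2 \tag{A10} \]

and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.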
Nel and van der Merwe (1986)
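Let

\[ F_{NV} = \frac{\nu_N}{p f_2} T_2 \tag{A11} \]

where $\nu_N = f_2 - p + 1$ and

\[ f_2 = \frac{\mathrm{tr}(\bar{\mathbf{A}}^2) + [\mathrm{tr}(\bar{\mathbf{A}})]^2}{\sum_{j=1}^{2} \frac{1}{n_j - 1}\left\{ \mathrm{tr}(\mathbf{A}_j^2) + [\mathrm{tr}(\mathbf{A}_j)]^2 \right\}} \tag{A12} \]

The $F_{NV}$ statistic is compared to the critical value F[p, νN].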
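As a check on equations 8, A11, and A12, the Nel and Van der Merwe computations can be sketched in Python for the least-squares estimators and the Table 1 data. The function names are hypothetical and the code is written for p = 2 outcome variables only, so that each symmetric 2 × 2 matrix can be stored as the triple (a, b, c) for [[a, b], [b, c]].

```python
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def trace_terms(m):
    """tr(M^2) + [tr(M)]^2 for the symmetric 2 x 2 matrix [[a, b], [b, c]]."""
    a, b, c = m
    return (a * a + 2 * b * b + c * c) + (a + c) ** 2

def nv_test(group1, group2):
    """T2 (equation 8) with the NV F conversion (A11) and f2 (A12), p = 2."""
    p = 2
    parts = []
    for rows in (group1, group2):
        y1, y2 = zip(*rows)
        n = len(rows)
        a_j = (cov(y1, y1) / n, cov(y1, y2) / n, cov(y2, y2) / n)   # A_j = S_j / n_j
        parts.append((n, (mean(y1), mean(y2)), a_j))
    (n1, m1, a1), (n2, m2, a2) = parts
    a, b, c = (u + v for u, v in zip(a1, a2))                        # A-bar = A_1 + A_2
    d1, d2 = m1[0] - m2[0], m1[1] - m2[1]
    t2 = (c * d1 * d1 - 2 * b * d1 * d2 + a * d2 * d2) / (a * c - b * b)
    f2 = trace_terms((a, b, c)) / (trace_terms(a1) / (n1 - 1) + trace_terms(a2) / (n2 - 1))
    nu_n = f2 - p + 1
    f_nv = nu_n / (p * f2) * t2
    return t2, f2, nu_n, f_nv

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
t2, f2, nu_n, f_nv = nv_test(group1, group2)
print(round(t2, 2), round(nu_n, 2), round(f_nv, 2))   # approximately 4.97, 4.41, 2.03
```

To one decimal place these values correspond to the T2, ν2, and FNV entries for the NV procedure with least-squares estimators in Table 3.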
Yao (1965)
The statistic $F_Y$ is referred to the critical value F[p, νK], where

\[ F_Y = \frac{\nu_K}{p f_1} T_2 \tag{A13} \]

$f_1$ is given by equation A10, and $\nu_K$ again equals $f_1 - p + 1$.
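The Yao computation can be sketched in the same style. One assumption should be stated clearly: in this sketch the matrix V appearing in $b_j$ (equation A10) is taken to be $\mathbf{A}_1 + \mathbf{A}_2$, so that $b_1 + b_2 = T_2$; this is the usual form of Yao's approximation, but it is a reading adopted here for illustration, and the function names are hypothetical. The code handles p = 2 only.

```python
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def yao_test(group1, group2):
    """f1 (A10) and F_Y (A13) for p = 2, taking V = A_1 + A_2 (an assumption)."""
    p = 2
    parts = []
    for rows in (group1, group2):
        y1, y2 = zip(*rows)
        n = len(rows)
        parts.append((n, (mean(y1), mean(y2)),
                      (cov(y1, y1) / n, cov(y1, y2) / n, cov(y2, y2) / n)))
    (n1, m1, a1), (n2, m2, a2) = parts
    a, b, c = (u + v for u, v in zip(a1, a2))          # V = [[a, b], [b, c]]
    det = a * c - b * b
    d1, d2 = m1[0] - m2[0], m1[1] - m2[1]
    u1 = (c * d1 - b * d2) / det                        # u = V^{-1} d
    u2 = (a * d2 - b * d1) / det
    t2 = d1 * u1 + d2 * u2                              # equation 8
    inv_f1 = 0.0
    for n, (s11, s12, s22) in ((n1, a1), (n2, a2)):
        b_j = s11 * u1 * u1 + 2 * s12 * u1 * u2 + s22 * u2 * u2   # u' A_j u
        inv_f1 += (b_j / t2) ** 2 / (n - 1)
    f1 = 1 / inv_f1
    nu_k = f1 - p + 1
    f_y = nu_k / (p * f1) * t2
    return f1, nu_k, f_y

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
f1, nu_k, f_y = yao_test(group1, group2)
print(round(f1, 2), round(nu_k, 2), round(f_y, 2))      # approximately 7.10, 6.10, 2.14
```

Under that assumption, the denominator df and F statistic round to the values listed for the Y procedure with least-squares estimators in Table 3.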
Footnotes
¹ The sum of the eigenvalues of a matrix is called the trace of the matrix.
² The skewness for the normal distribution is zero.
Table 1 Multivariate Example Data Set
Group Subject Yi1 Yi2
1 1 1 51
1 2 28 48
1 3 12 49
1 4 13 51
1 5 13 52
1 6 11 47
2 1 19 46
2 2 18 48
2 3 18 50
2 4 21 50
2 5 19 45
2 6 20 46
2 7 22 48
2 8 18 49
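Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$    $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$    $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$

Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$    $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$    $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$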
Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure   Test Statistic               df                     p-value/Critical value (CV)   Decision re Null Hypothesis

Least-Squares Estimators
T2          T1 = 6.1, FT = 2.8           ν1 = 2, ν2 = 11        p = .106                      Fail to Reject
BF          TBF = 9.1, FBF = 3.7         ν1 = 4, ν2 = 4.4       p = .116                      Fail to Reject
J2          T2 = 5.0                     ν1 = 2                 CV = 14.2                     Fail to Reject
J           T2 = 5.0, FJ = 2.3           ν1 = 2, ν2 = 6.9       p = .175                      Fail to Reject
K           FK = 2.5                     ν1 = 1.5, ν2 = 6.1     p = .164                      Fail to Reject
NV          T2 = 5.0, FNV = 2.0          ν1 = 2, ν2 = 4.4       p = .237                      Fail to Reject
Y           T2 = 5.0, FY = 2.1           ν1 = 2, ν2 = 6.1       p = .198                      Fail to Reject

Robust Estimators
T2          T1 = 59.0, FT = 25.8         ν1 = 2, ν2 = 7         p = .001                      Reject
BF          TBF = 131.2, FBF = 56.2      ν1 = 5, ν2 = 6.0       p = .001                      Reject
J2          T2 = 65.2                    ν1 = 2                 CV = 13.3                     Reject
J           T2 = 65.2, FJ = 29.5         ν1 = 2, ν2 = 6.3       p = .001                      Reject
K           FK = 28.1                    ν1 = 2.0, ν2 = 6.6     p = .001                      Reject
NV          T2 = 65.2, FNV = 27.9        ν1 = 2, ν2 = 6.0       p = .001                      Reject
Y           T2 = 65.2, FY = 28.3         ν1 = 2, ν2 = 6.6       p = .001                      Reject

Note. T2 = Hotelling's (1931) T2; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
example data set of Table 1, a moderate negative correlation of $r_{qq'}$ = −.4 exists for the two
variables for group 1.
Tests for Mean Equality when Assumptions are Satisfied
Student's t statistic is the conventional procedure for testing the null hypothesis of
equality of population means in a univariate design (i.e., equation 1). The test statistic, which
assumes equality of population variances, is

\[ t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s^2\left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \tag{3} \]

where $\bar{Y}_j$ represents the mean for the jth group and $s^2$, the variance that is pooled for the two
groups, is

\[ s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \tag{4} \]

where $s_j^2$ is the variance for the jth group.
The multivariate Hotelling's (1931) T2 statistic is formed from equation 3 by replacing
means with mean vectors and the pooled variance with the pooled covariance matrix:

\[ T_1 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\left[ \left( \frac{1}{n_1} + \frac{1}{n_2} \right)\mathbf{S} \right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \tag{5} \]

where T is the transpose operator, which is used to convert the row vector $(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$ to a
column vector, $^{-1}$ denotes the inverse of a matrix, and S is the pooled sample covariance matrix,

\[ \mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2} \tag{6} \]

This test statistic is easily obtained from standard software packages such as SAS (SAS Institute
Inc., 1999b). Statistical significance of this T1 statistic is evaluated using the T2 distribution.
The test statistic can also be converted to an F statistic,

\[ F_T = \frac{N - p - 1}{p(N - 2)}\,T_1 \tag{7} \]

where N = n1 + n2. Statistical significance is then assessed by comparing the FT statistic to its
critical value, F[p, N − p − 1], that is, a critical value from the F distribution with p and N − p − 1
degrees of freedom (df).
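Equations 5 to 7 can be illustrated with a small Python sketch applied to the Table 1 data. The function names are hypothetical, and the code is restricted to p = 2 outcome variables so that the matrix inverse can be written out explicitly; it is intended as an illustration rather than a general implementation.

```python
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def hotelling_t1(group1, group2):
    """Hotelling's T1 (equations 5 and 6) and F_T (equation 7) for p = 2."""
    p = 2
    parts = []
    for rows in (group1, group2):
        y1, y2 = zip(*rows)
        parts.append((len(rows), (mean(y1), mean(y2)),
                      (cov(y1, y1), cov(y1, y2), cov(y2, y2))))   # (s11, s12, s22)
    (n1, m1, s1), (n2, m2, s2) = parts
    # (1/n1 + 1/n2) * S, with S the pooled covariance matrix of equation 6
    scale = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    a, b, c = (scale * ((n1 - 1) * u + (n2 - 1) * v) for u, v in zip(s1, s2))
    d1, d2 = m1[0] - m2[0], m1[1] - m2[1]
    # quadratic form d' M^{-1} d for the symmetric matrix M = [[a, b], [b, c]]
    t1 = (c * d1 * d1 - 2 * b * d1 * d2 + a * d2 * d2) / (a * c - b * b)
    n_total = n1 + n2
    f_t = (n_total - p - 1) / (p * (n_total - 2)) * t1
    return t1, f_t, (p, n_total - p - 1)

group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
t1, f_t, df = hotelling_t1(group1, group2)
print(round(t1, 2), round(f_t, 2), df)    # approximately 6.06 and 2.78, with df (2, 11)
```

For these data the statistic is small relative to its F[2, 11] reference distribution, which is the least-squares result summarized for Hotelling's T2 in Table 3.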
When the data are sampled from populations that follow a normal distribution but have
unequal covariance matrices (i.e., $\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$), Hotelling's (1931) T2 will generally maintain the rate
of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to α if the design is
balanced (i.e., n1 = n2; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins &
Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will
result, depending on the nature of the relationship between the covariance matrices and group
sizes. A liberal result is one in which the actual Type I error rate will exceed α. Liberal results
are problematic; researchers will be filling the literature with false positives (i.e., saying there are
treatment effects when none are present). A conservative result, on the other hand, is one in
which the Type I error rate will be less than α. Conservative results are also a cause for concern
because they may result in test procedures that have low statistical power to detect true
differences in population means (i.e., real effects will be undetected).
If the group with the largest sample size also exhibits the smallest element values of $\boldsymbol{\Sigma}_j$,
which is known as a negative pairing condition, the error will be liberal. For example, Hopkins
and Clay (1963) showed that when group sizes were 10 and 20 and the ratio of the largest to the
smallest standard deviations of the groups was 1.6, the true rate of Type I errors for α = .05 was
.11. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate
was .21, more than four times the nominal level of significance. For positive pairings of group
sizes and covariance matrices, such that the group with the largest sample size also exhibits the
largest element values of $\boldsymbol{\Sigma}_j$, the T2 procedure tends to produce a conservative test. In fact, the
error rate may be substantially below the nominal level of significance. For example, Hopkins
and Clay observed Type I error rates of .02 and .01, respectively, for α = .05 for positive pairings
for the two standard deviation values noted previously. These liberal and conservative results for
normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian
et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986)
for both moderate and large degrees of covariance heterogeneity.
When the assumption of multivariate normality is violated, the performance of
Hotelling's (1931) T2 test depends on both the degree of departure from a multivariate normal
distribution and the nature of the research design. The earliest research on Hotelling's T2 when
the data are non-normal suggested that tests of the null hypothesis for two groups were relatively
insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true
when the data are only moderately non-normal. However, Everitt (1979) showed that this test
procedure can become quite conservative when the distribution is skewed or when outliers are
present in the tails of the distribution, particularly when the design is unbalanced (see also
Zwick, 1986).
Tests for Mean Equality when Assumptions are not Satisfied
Both parametric and nonparametric alternatives to the T2 test have been proposed in the
literature. Applied researchers often regard nonparametric procedures as appealing alternatives
because they rely on rank scores, which are typically perceived as being easy to conceptualize
and interpret. However, these procedures test hypotheses about equality of distributions rather
than equality of means. They are therefore sensitive to covariance heterogeneity, because
distributions with unequal variances will necessarily result in rejection of the null hypothesis.
Zwick (1986) showed that nonparametric alternatives to Hotelling's (1931) T2 could control the
Type I error rate when the data were sampled from non-normal distributions and covariances
were equal. Not surprisingly, when covariances were unequal these procedures produced biased
results, particularly when group sizes were unequal.
Covariance heterogeneity. There are several parametric alternatives to Hotelling's (1931)
T2 test. These include the Brown-Forsythe (Brown & Forsythe, 1974 [BF]), James (1954) first
and second order (J1 & J2), Johansen (1980 [J]), Kim (1992 [K]), Nel and Van der Merwe
(1986 [NV]), and Yao (1965 [Y]) procedures. The BF, J1, J2, and J procedures have also been
generalized to multivariate designs containing more than two groups of subjects.
The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic,

\[ T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\left( \frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2} \right)^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \tag{8} \]
The J, NV, and Y procedures approximate the distribution of T2 differently, because the df for
these three procedures are computed using different formulas. The J1 and J2 procedures each use
a different critical value to assess statistical significance. However, they both rely on
large-sample theory regarding the distribution of the test statistic in equation 8. What this means
is that when sample sizes are sufficiently large, the T2 statistic approximately follows a
chi-squared (χ2) distribution. For both procedures, this test statistic is referred to an "adjusted"
χ2 critical value. If the test statistic exceeds that critical value, the null hypothesis of equation 2
is rejected. The critical value for the J1 procedure is slightly smaller than the one for the J2
procedure. As a result, the J1 procedure generally produces larger Type I error rates than the J2
procedure and therefore is not often recommended (de la Rey & Nel, 1993). While the J2
procedure may offer better Type I error control, it is computationally complex. The critical value
for J2 is described in the Appendix, along with the F-statistic conversions and df computations
for the J, NV, and Y procedures.
The K procedure is based on an F statistic. It is more complex than the preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed.¹ For completeness, the formula used to compute the K test statistic and its df are
found in the Appendix.
The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly
from the one presented in equation 8:

\[ T_{BF} = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\left[ \left( 1 - \frac{n_1}{N} \right)\mathbf{S}_1 + \left( 1 - \frac{n_2}{N} \right)\mathbf{S}_2 \right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2) \tag{9} \]

As one can see, the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8. Again, for completeness, the numeric solutions
for the BF F statistic and df are found in the Appendix.
Among the BF J2 J K NV and Y tests there appears to be no one best choice in all
data-analytic situations when the data are normally distributed although a comprehensive
comparison of all of these procedures has not yet been conducted Factors such as the degree of
covariance heterogeneity total sample size the degree of imbalance of the group sizes and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences Christensen and Rencher (1997) noted in their extensive comparison among the J2
J K NV and Y procedures that the J and Y procedures could occasionally result in inflated
Multivariate Tests of Means
13
Type I error rates for negative pairings of group sizes and covariance matrices These liberal
tendencies were exacerbated as the number of outcome variables increased The authors
recommended the K procedure overall observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates However the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative Type I error rates as low as 02 for α = 05 were reported when p =
10 n1 = 30 and n2 = 20
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that
the J1, J2, J, and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed. For a lognormal distribution, which has skewness
of 6.18 (see Footnote 2), they observed many instances in which empirical Type I error rates of
all of these procedures were more than four times the nominal level of significance. Wilcox
(1995b) found that the J test produced excessive Type I errors when sample sizes were small
(i.e., n1 = 12 and n2 = 18) and the data were generated from non-normal distributions; the K
procedure became conservative when the skewness was 6.18. For larger group sizes (i.e., n1 = 24
and n2 = 36), the J procedure provided acceptable control of Type I errors when the data were
only moderately non-normal, but for the maximum skewness considered it also became
conservative. Fouladi and Yockey (2002) found that the degree of departure from a multivariate
normal distribution was a less important predictor of Type I error performance than sample size.
Across the range of conditions which they examined, the Y test produced the greatest average
Type I error rates and the NV procedure the smallest. Error rates were only slightly influenced by
the degree of skewness or kurtosis of the data; however, these authors looked at only very modest
departures from a normal distribution: the maximum degree of skewness considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that
have been proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to “shift” the mean in their direction.
One should recognize at the outset that while robust estimators are insensitive to
departures from a normal distribution, they test a different null hypothesis than least-squares
estimators. The null hypothesis is about equality of trimmed population means. In other words,
one is testing a hypothesis that focuses on the bulk of the population, rather than the entire
population. Thus, if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters, then procedures
based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \dots \le Y_{(n_j)j}$ represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then, let $g_j = [\gamma n_j]$,
where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the
distribution and [x] is the greatest integer ≤ x. The effective sample size for the jth group is
defined as $h_j = n_j - 2g_j$. The sample trimmed mean
$$\bar{Y}_{tj} = \frac{1}{h_j} \sum_{i=g_j+1}^{n_j-g_j} Y_{(i)j} \tag{10}$$
is computed by censoring the $g_j$ smallest and the $g_j$ largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed:
$$\bar{Y}_{wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Z_{ij} \tag{11}$$
where
$$Z_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j} \end{cases}$$
The Winsorized mean is obtained by replacing the $g_j$ smallest values with the smallest retained
value, $Y_{(g_j+1)j}$, and the $g_j$ largest values with the largest retained value, $Y_{(n_j-g_j)j}$. The
Winsorized variance for the jth group on a single outcome variable, $s^2_{wj}$, is
$$s^2_{wj} = \frac{\sum_{i=1}^{n_j} (Z_{ij} - \bar{Y}_{wj})^2}{n_j - 1} \tag{12}$$
and the standard error of the trimmed mean is $\sqrt{(n_j - 1) s^2_{wj} / [h_j (h_j - 1)]}$. The Winsorized
covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is
$$s_{wjqq'} = \frac{\sum_{i=1}^{n_j} (Z_{ijq} - \bar{Y}_{wjq})(Z_{ijq'} - \bar{Y}_{wjq'})}{n_j - 1} \tag{13}$$
and the Winsorized covariance matrix for the jth group is
$$S_{wj} = \begin{bmatrix} s^2_{wj1} & s_{wj12} & \cdots & s_{wj1p} \\ \vdots & \vdots & & \vdots \\ s_{wjp1} & s_{wjp2} & \cdots & s^2_{wjp} \end{bmatrix}$$
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are
1 11 12 13 13 28
and with 20% trimming, g1 = [6 × .20] = [1.2] = 1. The scores of 1 and 28 are removed and the
mean of the remaining scores is computed, which produces $\bar{Y}_{t1}$ = 12.2. Table 2 contains the
vectors of trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:
11 11 12 13 13 13
The Winsorized mean, $\bar{Y}_{w1}$, is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result.
Table 2 contains the Winsorized covariance matrices for the two groups.
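The trimming and Winsorizing steps in equations 10 to 12 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' SAS/IML module: the function names (trim_count, trimmed_mean, winsorize, winsorized_variance, trimmed_mean_se) are hypothetical, and the data are the six group 1 scores on the first outcome variable in Table 1.

```python
from math import floor, sqrt


def trim_count(n, gamma):
    """Number of scores trimmed from each tail: g = [gamma * n]."""
    return floor(gamma * n)


def trimmed_mean(values, gamma=0.20):
    """Mean of the scores left after dropping the g smallest and g largest."""
    y = sorted(values)
    g = trim_count(len(y), gamma)
    kept = y[g:len(y) - g]
    return sum(kept) / len(kept)


def winsorize(values, gamma=0.20):
    """Replace the g smallest and g largest scores by the nearest retained score."""
    y = sorted(values)
    g = trim_count(len(y), gamma)
    low, high = y[g], y[len(y) - g - 1]
    return [min(max(v, low), high) for v in values]


def winsorized_variance(values, gamma=0.20):
    """Usual variance formula (divisor n - 1) applied to the Winsorized scores."""
    z = winsorize(values, gamma)
    n = len(z)
    mean_w = sum(z) / n
    return sum((v - mean_w) ** 2 for v in z) / (n - 1)


def trimmed_mean_se(values, gamma=0.20):
    """Standard error of the trimmed mean: sqrt((n - 1) s_w^2 / (h (h - 1)))."""
    n = len(values)
    h = n - 2 * trim_count(n, gamma)
    return sqrt((n - 1) * winsorized_variance(values, gamma) / (h * (h - 1)))


# Group 1, first outcome variable (Table 1), in subject order.
group1_y1 = [1, 28, 12, 13, 13, 11]

print(trimmed_mean(group1_y1))   # 12.25
print(winsorize(group1_y1))      # [11, 13, 12, 13, 13, 11]
print(winsorized_variance(group1_y1))
print(trimmed_mean_se(group1_y1))
```

Setting gamma to 0 leaves the data untouched, so the same functions return the ordinary mean and variance.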
A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators, the T1 statistic of equation 5 becomes
$$T_{t1} = (\bar{Y}_{t1} - \bar{Y}_{t2})^T \left[ \left( \frac{1}{h_1} + \frac{1}{h_2} \right) S_w \right]^{-1} (\bar{Y}_{t1} - \bar{Y}_{t2}) \tag{14}$$
where
$$S_w = \frac{(n_1 - 1) S_{w1} + (n_2 - 1) S_{w2}}{h_1 + h_2 - 2} \tag{15}$$
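As a rough check on the robust T1 statistic, the sketch below applies it to the Table 1 data with 20% trimming. It assumes that the pooled Winsorized covariance matrix weights each group's Winsorized matrix by n_j – 1 and divides by h1 + h2 – 2, and it handles only two outcome variables. The function names are hypothetical, and this is not the authors' SAS/IML code.

```python
from math import floor


def winsorize(values, gamma):
    """Replace the g smallest and g largest scores by the nearest retained score."""
    y = sorted(values)
    g = floor(gamma * len(y))
    low, high = y[g], y[len(y) - g - 1]
    return [min(max(v, low), high) for v in values]


def trimmed_mean(values, gamma):
    """Mean of the scores left after dropping g scores from each tail."""
    y = sorted(values)
    g = floor(gamma * len(y))
    kept = y[g:len(y) - g]
    return sum(kept) / len(kept)


def winsorized_cov(rows, gamma):
    """2 x 2 Winsorized covariance matrix; each variable is Winsorized separately."""
    n = len(rows)
    z = [winsorize(col, gamma) for col in zip(*rows)]
    m = [sum(col) / n for col in z]
    return [[sum((z[a][i] - m[a]) * (z[b][i] - m[b]) for i in range(n)) / (n - 1)
             for b in range(2)] for a in range(2)]


def robust_t1(group1, group2, gamma=0.20):
    """Pooled-covariance T1 statistic with trimmed means and Winsorized covariances."""
    n1, n2 = len(group1), len(group2)
    h1 = n1 - 2 * floor(gamma * n1)   # effective sample sizes
    h2 = n2 - 2 * floor(gamma * n2)
    sw1 = winsorized_cov(group1, gamma)
    sw2 = winsorized_cov(group2, gamma)
    # Assumed pooling: (1/h1 + 1/h2) * [(n1-1) Sw1 + (n2-1) Sw2] / (h1 + h2 - 2)
    scale = (1 / h1 + 1 / h2) / (h1 + h2 - 2)
    m = [[scale * ((n1 - 1) * sw1[a][b] + (n2 - 1) * sw2[a][b]) for b in range(2)]
         for a in range(2)]
    # Difference between the vectors of trimmed means.
    d = [trimmed_mean(c1, gamma) - trimmed_mean(c2, gamma)
         for c1, c2 in zip(zip(*group1), zip(*group2))]
    # Quadratic form d' M^{-1} d written out for a 2 x 2 matrix.
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return (m[1][1] * d[0] ** 2 - (m[0][1] + m[1][0]) * d[0] * d[1]
            + m[0][0] * d[1] ** 2) / det


group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

print(round(robust_t1(group1, group2), 1))   # 59.0
```

With gamma set to 0 the same function reduces to the usual pooled-covariance statistic, because h_j then equals n_j and the Winsorized quantities equal their least-squares counterparts.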
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators and the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators
was similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24 and
n2 = 36). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix’s website,
http://home.cc.umanitoba.ca/~lixlm.
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:
    Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48,
       18 49};
    NX={6 8};
    PTRIM=0;
    ALPHA=.05;
    RUN T2MULT;
    QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, set PTRIM=.20. Note that
a symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified separately. The RUN T2MULT code invokes the
program and generates output. Observe that each line of code ends with a semi-colon. Also, it is
necessary that these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling’s
(1931) T2. We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program, with
PTRIM=.20, is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that “methods of data analysis used in research have a major
effect on research progress” (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling’s (1931) T2 test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and Van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao’s, James’, and Johansen’s
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling’s one- and
two-sample T2 tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T2 procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling’s T2. Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T2 and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student’s ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of statistics, Vol. 1 (pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the $T_0^2$ test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26,
1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user’s guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user’s guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling’s T2. Multivariate
Behavioral Research, 21, 169-186.
Appendix
Numeric Formulas for Alternatives to Hotelling’s (1931) T2 Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then
$$F_{BF} = \frac{\nu_{BF2}}{p f_2} T_{BF} \tag{A1}$$
where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and
$$f_2 = \frac{tr(G_1^2) + [tr(G_1)]^2}{\dfrac{1}{n_1 - 1}\left\{ tr[(\bar{w}_1 S_1)^2] + [tr(\bar{w}_1 S_1)]^2 \right\} + \dfrac{1}{n_2 - 1}\left\{ tr[(\bar{w}_2 S_2)^2] + [tr(\bar{w}_2 S_2)]^2 \right\}} \tag{A2}$$
In equation A2, tr denotes the trace of a matrix and $G_1 = \bar{w}_1 S_1 + \bar{w}_2 S_2$. The test statistic
$F_{BF}$ is compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
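and $G_2 = w_1 S_1 + w_2 S_2$.
James (1954) Second Order
The test statistic T2 of equation 8 is compared to the critical value $\chi^2_p (A + \chi^2_p B) + q$,
where $\chi^2_p$ is the 1 – α percentile point of the χ2 distribution with p df,
$$A = 1 + \frac{1}{2p} \left\{ \frac{[tr(A^{-1} A_1)]^2}{n_1 - 1} + \frac{[tr(A^{-1} A_2)]^2}{n_2 - 1} \right\} \tag{A4}$$
$A_j = S_j/n_j$, $A = A_1 + A_2$ (the matrix A inside the traces, as distinct from the scalar A on
the left-hand side of equation A4), and
$$B = \frac{1}{p(p + 2)} \left\{ \frac{tr(A^{-1} A_1 A^{-1} A_1) + \frac{1}{2}[tr(A^{-1} A_1)]^2}{n_1 - 1} + \frac{tr(A^{-1} A_2 A^{-1} A_2) + \frac{1}{2}[tr(A^{-1} A_2)]^2}{n_2 - 1} \right\} \tag{A5}$$
The constant q is based on a lengthy formula which has not been reproduced here; it can be
found in equation 6.7 of James (1954).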
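Johansen (1980)
Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and
$$C = \frac{1}{2} \sum_{j=1}^{2} \frac{1}{n_j - 1} \left\{ tr[(A^{-1} A_j)^2] + [tr(A^{-1} A_j)]^2 \right\} \tag{A6}$$
The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.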
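The quantity C and the denominator df for the J procedure can be computed directly from the group covariance matrices. The sketch below does so for the least-squares case with the Table 1 data. It assumes that C is half the sum over groups of {tr[(A⁻¹A_j)²] + [tr(A⁻¹A_j)]²}/(n_j – 1), with A_j = S_j/n_j and A = A_1 + A_2, and it is restricted to two outcome variables. All function names are hypothetical.

```python
def covariance_matrix(rows):
    """Least-squares covariance matrix (divisor n - 1) for two outcome variables."""
    n = len(rows)
    m = [sum(col) / n for col in zip(*rows)]
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(2)] for a in range(2)]


def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(x, y):
    """Product of two 2 x 2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def trace(m):
    """Sum of the diagonal elements."""
    return m[0][0] + m[1][1]


def johansen_c_and_df(group1, group2):
    """Return (C, nu_J) for the two-group, two-variable case."""
    p = 2
    groups = (group1, group2)
    # A_j = S_j / n_j for each group, and A = A_1 + A_2.
    a_j = [[[s / len(g) for s in row] for row in covariance_matrix(g)]
           for g in groups]
    a = [[a_j[0][r][c] + a_j[1][r][c] for c in range(p)] for r in range(p)]
    a_inv = inverse_2x2(a)
    c_value = 0.0
    for g, aj in zip(groups, a_j):
        m = matmul_2x2(a_inv, aj)
        c_value += (trace(matmul_2x2(m, m)) + trace(m) ** 2) / (len(g) - 1)
    c_value /= 2
    return c_value, p * (p + 2) / (3 * c_value)


group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

c_value, nu_j = johansen_c_and_df(group1, group2)
print(round(c_value, 3), round(nu_j, 1))   # 0.387 6.9
```

The denominator df printed here agrees with the value reported for the J procedure under least-squares estimators in Table 3.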
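Kim (1992)
The K procedure is based on the test statistic
$$F_K = \frac{\nu_K}{c\, m f_1} (\bar{Y}_1 - \bar{Y}_2)^T V^{-1} (\bar{Y}_1 - \bar{Y}_2) \tag{A7}$$
where $V = A_1 + r^2 A_2 + 2r A_2^{1/2} (A_2^{-1/2} A_1 A_2^{-1/2})^{1/2} A_2^{1/2}$,
$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l} \tag{A8}$$
$$m = \frac{\left( \sum_{l=1}^{p} h_l \right)^2}{\sum_{l=1}^{p} h_l^2} \tag{A9}$$
$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $A_1 A_2^{-1}$, $r = |A_1 A_2^{-1}|^{1/(2p)}$,
and | | is the determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$,
where $\nu_K = f_1 - p + 1$,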
$$\frac{1}{f_1} = \sum_{j=1}^{2} \frac{1}{n_j - 1} \left( \frac{b_j}{T_2} \right)^2 \tag{A10}$$
and $b_j = (\bar{Y}_1 - \bar{Y}_2)^T V^{-1} A_j V^{-1} (\bar{Y}_1 - \bar{Y}_2)$.
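Nel and van der Merwe (1986)
Let
$$F_{NV} = \frac{\nu_N}{p f_2} T_2 \tag{A11}$$
where $\nu_N = f_2 - p + 1$ and
$$f_2 = \frac{tr(A^2) + [tr(A)]^2}{\sum_{j=1}^{2} \dfrac{1}{n_j - 1} \left\{ tr(A_j^2) + [tr(A_j)]^2 \right\}} \tag{A12}$$
The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.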
Yao (1965)
The statistic $F_Y$ is referred to the critical value $F[p, \nu_K]$,
$$F_Y = \frac{\nu_K}{p f_1} T_2 \tag{A13}$$
where $f_1$ is given by equation A10 and $\nu_K$ again equals $f_1 - p + 1$.
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of a matrix.
2. The skewness for the normal distribution is zero.
Table 1 Multivariate Example Data Set
Group Subject Yi1 Yi2
1 1 1 51
1 2 28 48
1 3 12 49
1 4 13 51
1 5 13 52
1 6 11 47
2 1 19 46
2 2 18 48
2 3 18 50
2 4 21 50
2 5 19 45
2 6 20 46
2 7 22 48
2 8 18 49
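Table 2. Summary Statistics for Least-Squares and Robust Estimators
Least-Squares Estimators
$$\bar{Y}_1 = \begin{bmatrix} 13.0 \\ 49.7 \end{bmatrix} \quad \bar{Y}_2 = \begin{bmatrix} 19.4 \\ 47.7 \end{bmatrix} \quad S_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix} \quad S_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$$
Robust Estimators
$$\bar{Y}_{t1} = \begin{bmatrix} 12.2 \\ 49.8 \end{bmatrix} \quad \bar{Y}_{t2} = \begin{bmatrix} 19.2 \\ 47.8 \end{bmatrix} \quad S_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix} \quad S_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$$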
Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure  Test Statistic              df                    p-value/Critical Value (CV)  Decision re Null Hypothesis

Least-Squares Estimators
T2         T1 = 6.1, FT = 2.8          ν1 = 2, ν2 = 11       p = .106                     Fail to Reject
BF         TBF = 9.1, FBF = 3.7        ν1 = 4, ν2 = 4.4      p = .116                     Fail to Reject
J2         T2 = 5.0                    ν1 = 2                CV = 14.2                    Fail to Reject
J          T2 = 5.0, FJ = 2.3          ν1 = 2, ν2 = 6.9      p = .175                     Fail to Reject
K          FK = 2.5                    ν1 = 1.5, ν2 = 6.1    p = .164                     Fail to Reject
NV         T2 = 5.0, FNV = 2.0         ν1 = 2, ν2 = 4.4      p = .237                     Fail to Reject
Y          T2 = 5.0, FY = 2.1          ν1 = 2, ν2 = 6.1      p = .198                     Fail to Reject

Robust Estimators
T2         T1 = 59.0, FT = 25.8        ν1 = 2, ν2 = 7        p = .001                     Reject
BF         TBF = 131.2, FBF = 56.2     ν1 = 5, ν2 = 6.0      p = .001                     Reject
J2         T2 = 65.2                   ν1 = 2                CV = 13.3                    Reject
J          T2 = 65.2, FJ = 29.5        ν1 = 2, ν2 = 6.3      p = .001                     Reject
K          FK = 28.1                   ν1 = 2.0, ν2 = 6.6    p = .001                     Reject
NV         T2 = 65.2, FNV = 27.9       ν1 = 2, ν2 = 6.0      p = .001                     Reject
Y          T2 = 65.2, FY = 28.3        ν1 = 2, ν2 = 6.6      p = .001                     Reject

Note. T2 = Hotelling’s (1931) T2; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
This test statistic is easily obtained from standard software packages such as SAS (SAS Institute,
1999b). Statistical significance of this T1 statistic is evaluated using the T2 distribution. The test
statistic can also be converted to an F statistic,
$$F_T = \frac{N - p - 1}{p(N - 2)} T_1 \tag{7}$$
where N = n1 + n2. Statistical significance is then assessed by comparing the FT statistic to its
critical value, F[p, N – p – 1], that is, a critical value from the F distribution with p and N – p – 1
degrees of freedom (df).
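The conversion in equation 7 is simple arithmetic, as the short Python sketch below shows. The function name hotelling_f is hypothetical; the inputs are the example data set's group sizes (6 and 8), its two outcome variables, and a T1 value of 6.1, the least-squares result for that data set.

```python
def hotelling_f(t1, n1, n2, p):
    """Convert Hotelling's T1 to an F statistic and return it with its df."""
    n_total = n1 + n2
    f_stat = (n_total - p - 1) / (p * (n_total - 2)) * t1
    return f_stat, p, n_total - p - 1


f_stat, df1, df2 = hotelling_f(6.1, 6, 8, 2)
print(round(f_stat, 1), df1, df2)   # 2.8 2 11
```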
When the data are sampled from populations that follow a normal distribution but have
unequal covariance matrices (i.e., Σ1 ≠ Σ2), Hotelling’s (1931) T2 will generally maintain the rate
of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to α if the design is
balanced (i.e., n1 = n2; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins &
Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will
result, depending on the nature of the relationship between the covariance matrices and group
sizes. A liberal result is one in which the actual Type I error rate will exceed α. Liberal results
are problematic; researchers will be filling the literature with false positives (i.e., saying there are
treatment effects when none are present). A conservative result, on the other hand, is one in
which the Type I error rate will be less than α. Conservative results are also a cause for concern
because they may result in test procedures that have low statistical power to detect true
differences in population means (i.e., real effects will be undetected).
If the group with the largest sample size also exhibits the smallest element values of Σj,
which is known as a negative pairing condition, the test will be liberal. For example, Hopkins
and Clay (1963) showed that when group sizes were 10 and 20 and the ratio of the largest to the
smallest standard deviations of the groups was 1.6, the true rate of Type I errors for α = .05 was
.11. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate
was .21, more than four times the nominal level of significance. For positive pairings of group
sizes and covariance matrices, such that the group with the largest sample size also exhibits the
largest element values of Σj, the T2 procedure tends to produce a conservative test. In fact, the
error rate may be substantially below the nominal level of significance. For example, Hopkins
and Clay observed Type I error rates of .02 and .01, respectively, for α = .05, for positive pairings
for the two standard deviation values noted previously. These liberal and conservative results for
normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian
et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986)
for both moderate and large degrees of covariance heterogeneity.
When the assumption of multivariate normality is violated, the performance of
Hotelling’s (1931) T2 test depends on both the degree of departure from a multivariate normal
distribution and the nature of the research design. The earliest research on Hotelling’s T2 when
the data are non-normal suggested that tests of the null hypothesis for two groups were relatively
insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true
when the data are only moderately non-normal. However, Everitt (1979) showed that this test
procedure can become quite conservative when the distribution is skewed or when outliers are
present in the tails of the distribution, particularly when the design is unbalanced (see also
Zwick, 1986).
Tests for Mean Equality when Assumptions are not Satisfied
Both parametric and nonparametric alternatives to the T2 test have been proposed in the
literature. Applied researchers often regard nonparametric procedures as appealing alternatives
because they rely on rank scores, which are typically perceived as being easy to conceptualize
and interpret. However, these procedures test hypotheses about equality of distributions rather
than equality of means. They are therefore sensitive to covariance heterogeneity, because
distributions with unequal variances will necessarily result in rejection of the null hypothesis.
Zwick (1986) showed that nonparametric alternatives to Hotelling’s (1931) T2 could control the
Type I error rate when the data were sampled from non-normal distributions and covariances
were equal. Not surprisingly, when covariances were unequal, these procedures produced biased
results, particularly when group sizes were unequal.
Covariance heterogeneity. There are several parametric alternatives to Hotelling’s (1931)
T2 test. These include the Brown-Forsythe (Brown & Forsythe, 1974 [BF]), James (1954) first
and second order (J1 & J2), Johansen (1980 [J]), Kim (1992 [K]), Nel and Van der Merwe
(1986 [NV]), and Yao (1965 [Y]) procedures. The BF, J1, J2, and J procedures have also been
generalized to multivariate designs containing more than two groups of subjects.
The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic,
$$T_2 = (\bar{Y}_1 - \bar{Y}_2)^T \left( \frac{S_1}{n_1} + \frac{S_2}{n_2} \right)^{-1} (\bar{Y}_1 - \bar{Y}_2) \tag{8}$$
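To make equation 8 concrete, the following Python sketch evaluates the statistic for the two-group example data of Table 1, using least-squares means and covariance matrices. It is an illustrative sketch only: the function names are hypothetical, and the matrix inverse is written out for the two-variable case rather than for general p.

```python
def mean_vector(rows):
    """Vector of sample means, one entry per outcome variable."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Least-squares covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def t2_statistic(group1, group2):
    """d' (S1/n1 + S2/n2)^{-1} d, where d is the difference in mean vectors."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = covariance_matrix(group1), covariance_matrix(group2)
    a = [[s1[i][j] / n1 + s2[i][j] / n2 for j in range(2)] for i in range(2)]
    a_inv = inverse_2x2(a)
    m1, m2 = mean_vector(group1), mean_vector(group2)
    d = [m1[i] - m2[i] for i in range(2)]
    return sum(d[i] * a_inv[i][j] * d[j] for i in range(2) for j in range(2))


# (Yi1, Yi2) pairs for each subject, taken from Table 1.
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

print(round(t2_statistic(group1, group2), 2))   # 4.97
```

Because each group's covariance matrix is divided by its own sample size, no pooling across groups is involved; this is what distinguishes equation 8 from Hotelling's statistic.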
The J, NV, and Y procedures approximate the distribution of T2 differently because the df for
these three procedures are computed using different formulas. The J1 and J2 procedures each use
a different critical value to assess statistical significance. However, they both rely on
large-sample theory regarding the distribution of the test statistic in equation 8. What this means
is that when sample sizes are sufficiently large, the T2 statistic approximately follows a
chi-squared (χ2) distribution. For both procedures, this test statistic is referred to an “adjusted” χ2
critical value. If the test statistic exceeds that critical value, the null hypothesis of equation 2 is
rejected. The critical value for the J1 procedure is slightly smaller than the one for the J2
procedure. As a result, the J1 procedure generally produces larger Type I error rates than the J2
procedure, and therefore is not often recommended (de la Rey & Nel, 1993). While the J2
procedure may offer better Type I error control, it is computationally complex. The critical value
for J2 is described in the Appendix, along with the F-statistic conversions and df computations
for the J, NV, and Y procedures.
The K procedure is based on an F statistic. It is more complex than preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed (see Footnote 1). For completeness, the formula used to compute the K test statistic
and its df are found in the Appendix.
The BF procedure (see also Mehrotra 1997) relies on a test statistic that differs slightly
from the one presented in equation 8
11 21
1
22
11T
21BF YYSSYYN
n
N
nT (9)
As one can see the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8 Again for completeness the numeric solutions
for the BF F statistic and df are found in the Appendix
Among the BF J2 J K NV and Y tests there appears to be no one best choice in all
data-analytic situations when the data are normally distributed although a comprehensive
comparison of all of these procedures has not yet been conducted Factors such as the degree of
covariance heterogeneity total sample size the degree of imbalance of the group sizes and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences Christensen and Rencher (1997) noted in their extensive comparison among the J2
J K NV and Y procedures that the J and Y procedures could occasionally result in inflated
Multivariate Tests of Means
13
Type I error rates for negative pairings of group sizes and covariance matrices These liberal
tendencies were exacerbated as the number of outcome variables increased The authors
recommended the K procedure overall observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates However the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative Type I error rates as low as 02 for α = 05 were reported when p =
10 n1 = 30 and n2 = 20
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that
the J1, J2, J, and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed. For a lognormal distribution, which has skewness
of 6.18 (see Footnote 2), they observed many instances in which empirical Type I error rates of
all of these procedures were more than four times the nominal level of significance. Wilcox
(1995b) found that the J test produced excessive Type I errors when sample sizes were small
(i.e., n1 = 12 and n2 = 18) and the data were generated from non-normal distributions; the K
procedure became conservative when the skewness was 6.18. For larger group sizes (i.e., n1 = 24
and n2 = 36), the J procedure provided acceptable control of Type I errors when the data were
only moderately non-normal, but for the maximum skewness considered it also became
conservative. Fouladi and Yockey (2002) found that the degree of departure from a multivariate
normal distribution was a less important predictor of Type I error performance than sample size.
Across the range of conditions which they examined, the Y test produced the greatest average
Type I error rates and the NV procedure the smallest. Error rates were only slightly influenced by
the degree of skewness or kurtosis of the data; however, these authors looked at only very modest
departures from a normal distribution: the maximum degree of skewness considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that
have been proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to "shift" the mean in their direction.
One should recognize at the outset that, while robust estimators are insensitive to
departures from a normal distribution, procedures based on them test a different null hypothesis
than procedures based on least-squares estimators. The null hypothesis is about equality of
trimmed population means. In other words, one is testing a hypothesis that focuses on the bulk of
the population, rather than the entire population. Thus, if one subscribes to the position that
inferences pertaining to robust parameters are more valid than inferences pertaining to the usual
least-squares parameters, then procedures based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \dots \le Y_{(n_j)j}$ represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then, let $g_j = [\gamma n_j]$,
where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the
distribution and [x] is the greatest integer ≤ x. The effective sample size for the jth group is
defined as $h_j = n_j - 2g_j$. The sample trimmed mean,
$$\bar{Y}_{tj} = \frac{1}{h_j} \sum_{i=g_j+1}^{n_j-g_j} Y_{(i)j}, \tag{10}$$
is computed by censoring the gj smallest and the gj largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
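The computation in equation 10 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' SAS/IML program; the function name `trimmed_mean` and the small tolerance added before taking the integer part (to guard against floating-point error in γ × nj) are assumptions introduced here.

```python
import math

def trimmed_mean(values, gamma=0.20):
    """Symmetric trimmed mean (equation 10): drop the g smallest and g largest scores."""
    n = len(values)
    # g_j = [gamma * n_j]; the tiny epsilon is an assumption that guards against
    # products such as 0.29 * 100 evaluating to 28.999999999999996.
    g = math.floor(gamma * n + 1e-9)
    h = n - 2 * g  # effective sample size h_j
    if h <= 0:
        raise ValueError("trimming proportion leaves no observations")
    ordered = sorted(values)
    return sum(ordered[g:n - g]) / h

# Group 1, first outcome variable, from the example data set of Table 1
group1_y1 = [1, 28, 12, 13, 13, 11]
print(trimmed_mean(group1_y1))       # 12.25
print(trimmed_mean(group1_y1, 0.0))  # 13.0, the ordinary mean
```

With γ = 0 no scores are removed, so the function returns the usual least-squares mean.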
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed:
$$\bar{Y}_{wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Z_{ij}, \tag{11}$$
where
$$Z_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j}. \end{cases}$$
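The Winsorized mean is obtained by replacing the gj smallest values with the smallest value that
is retained and the gj largest values with the largest value that is retained, and then averaging
all nj values. The Winsorized variance for the jth group on a single outcome variable, $s^2_{wj}$, is
$$s^2_{wj} = \frac{\sum_{i=1}^{n_j} (Z_{ij} - \bar{Y}_{wj})^2}{n_j - 1}, \tag{12}$$
and the standard error of the trimmed mean is $\sqrt{(n_j - 1)\, s^2_{wj} / [h_j (h_j - 1)]}$. The Winsorized
covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is
$$s_{wjqq'} = \frac{\sum_{i=1}^{n_j} (Z_{ijq} - \bar{Y}_{wjq})(Z_{ijq'} - \bar{Y}_{wjq'})}{n_j - 1}, \tag{13}$$
and the Winsorized covariance matrix for the jth group is
$$\mathbf{S}_{wj} = \begin{bmatrix} s^2_{wj1} & s_{wj12} & \cdots & s_{wj1p} \\ \vdots & & \ddots & \vdots \\ s_{wjp1} & \cdots & & s^2_{wjp} \end{bmatrix}.$$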
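Equations 11 to 13 can be illustrated with a short Python sketch. It assumes, as the definition of Zij implies, that each outcome variable is Winsorized separately within a group before the variances and covariances are computed with the divisor nj − 1. The names `winsorize` and `winsorized_cov` are hypothetical and are not taken from the authors' program.

```python
import math

def winsorize(values, gamma=0.20):
    """Replace the g smallest and g largest scores with the nearest retained score."""
    n = len(values)
    g = math.floor(gamma * n + 1e-9)            # g_j = [gamma * n_j]
    ordered = sorted(values)
    low, high = ordered[g], ordered[n - g - 1]  # Y_(g+1) and Y_(n-g)
    return [min(max(v, low), high) for v in values]

def winsorized_cov(rows, gamma=0.20):
    """Winsorized covariance matrix for one group (rows = subjects, columns = outcomes)."""
    n, p = len(rows), len(rows[0])
    cols = [winsorize([r[q] for r in rows], gamma) for q in range(p)]
    means = [sum(c) / n for c in cols]          # Winsorized means (equation 11)
    return [[sum((cols[a][i] - means[a]) * (cols[b][i] - means[b]) for i in range(n)) / (n - 1)
             for b in range(p)] for a in range(p)]

# Group 1 of the example data set in Table 1: (Yi1, Yi2) for six subjects
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
S_W1 = winsorized_cov(GROUP1)
print([[round(v, 1) for v in row] for row in S_W1])  # [[1.0, 0.3], [0.3, 2.3]]
```

Setting `gamma=0.0` leaves the scores unchanged, in which case the function returns the usual least-squares covariance matrix.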
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are
1, 11, 12, 13, 13, 28
and, with 20% trimming, g1 = [6 × .20] = 1. The scores of 1 and 28 are removed and the mean of
the remaining scores is computed, which produces $\bar{Y}_{t1} = 12.2$. Table 2 contains the vectors of
trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:
11, 11, 12, 13, 13, 13
The Winsorized mean, $\bar{Y}_{w1}$, is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean to one decimal place, these two estimators will not, as a rule, produce
an equivalent result. Table 2 contains the Winsorized covariance matrices for the two groups.
A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators, the T1 statistic of equation 5 becomes
$$T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^T \left[ \left( \frac{1}{h_1} + \frac{1}{h_2} \right) \mathbf{S}_w \right]^{-1} (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2}), \tag{14}$$
where
$$\mathbf{S}_w = \frac{(n_1 - 1)\,\mathbf{S}_{w1} + (n_2 - 1)\,\mathbf{S}_{w2}}{h_1 + h_2 - 2}. \tag{15}$$
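As a check on equations 14 and 15, the following Python sketch computes the robust T1 statistic for the example data of Table 1. It is a sketch under stated assumptions, not the authors' SAS/IML program: the pooled Winsorized covariance matrix is taken to be [(n1 − 1)Sw1 + (n2 − 1)Sw2]/(h1 + h2 − 2), the form that reproduces the value of 59.0 reported in Table 3, and the closed-form inverse limits the code to p = 2 outcome variables. All function names are hypothetical.

```python
import math

GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def n_trim(n, gamma):
    """g_j = [gamma * n_j], with a small guard against floating-point error."""
    return math.floor(gamma * n + 1e-9)

def trimmed_means(rows, gamma):
    n, g = len(rows), n_trim(len(rows), gamma)
    return [sum(sorted(col)[g:n - g]) / (n - 2 * g) for col in zip(*rows)]

def winsorized_sscp(rows, gamma):
    """Sums of squares and cross-products of the Winsorized scores, (n_j - 1) * S_wj."""
    n, g = len(rows), n_trim(len(rows), gamma)
    cols = []
    for col in zip(*rows):
        s = sorted(col)
        low, high = s[g], s[n - g - 1]
        cols.append([min(max(v, low), high) for v in col])
    means = [sum(c) / n for c in cols]
    p = len(cols)
    return [[sum((cols[a][i] - means[a]) * (cols[b][i] - means[b]) for i in range(n))
             for b in range(p)] for a in range(p)]

def robust_t1(rows1, rows2, gamma=0.20):
    """Equation 14 for p = 2 outcome variables (2 x 2 inverse written out by hand)."""
    h1 = len(rows1) - 2 * n_trim(len(rows1), gamma)
    h2 = len(rows2) - 2 * n_trim(len(rows2), gamma)
    m1, m2 = trimmed_means(rows1, gamma), trimmed_means(rows2, gamma)
    q1, q2 = winsorized_sscp(rows1, gamma), winsorized_sscp(rows2, gamma)
    scale = (1 / h1 + 1 / h2) / (h1 + h2 - 2)
    v = [[scale * (q1[a][b] + q2[a][b]) for b in range(2)] for a in range(2)]
    d0, d1 = m1[0] - m2[0], m1[1] - m2[1]
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    # d' V^{-1} d for a symmetric 2 x 2 matrix V
    return (d0 * d0 * v[1][1] - 2 * d0 * d1 * v[0][1] + d1 * d1 * v[0][0]) / det

T1_ROBUST = robust_t1(GROUP1, GROUP2)           # 20% symmetric trimming
T1_LEAST_SQUARES = robust_t1(GROUP1, GROUP2, 0.0)  # no trimming
print(round(T1_ROBUST, 1), round(T1_LEAST_SQUARES, 1))  # 59.0 6.1
```

With no trimming the function reduces to the usual Hotelling statistic, which is why the same code returns the least-squares value of 6.1 from Table 3.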
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators when the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators
was similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24 and
n2 = 36). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm.
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:
Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified separately. The RUN T2MULT code invokes the
program and generates output. Observe that each line of code ends with a semicolon. Also, it is
necessary that these lines of code follow the FINISH statement that concludes the program
module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) T². We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program, with
PTRIM=.20, is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling's (1931) T² test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and Van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics - Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T² tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics - Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 86, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 21, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of statistics, Vol. 1 (pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the $T_0^2$ test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics - Simulation and Computation, 26,
1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics - Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate
Behavioral Research, 21, 169-186.
Appendix
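Numeric Formulas for Alternatives to Hotelling's (1931) T² Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then
$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF}, \tag{A1}$$
where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and
$$f_2 = \frac{\operatorname{tr}(\mathbf{G}_1^2) + (\operatorname{tr}\mathbf{G}_1)^2}{\displaystyle\sum_{j=1}^{2} \frac{1}{n_j - 1}\left\{ \operatorname{tr}\!\left[(\bar{w}_j \mathbf{S}_j)^2\right] + \left[\operatorname{tr}(\bar{w}_j \mathbf{S}_j)\right]^2 \right\}}. \tag{A2}$$
In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1 \mathbf{S}_1 + \bar{w}_2 \mathbf{S}_2$. The test statistic FBF is
compared to the critical value F[νBF1, νBF2], where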
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and G2 = w1S1 + w2S2
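James (1954) Second Order
The test statistic T2 of equation 8 is compared to the critical value $\chi^2_p (A + \chi^2_p B) + q$,
where $\chi^2_p$ is the 1 − α percentile point of the χ² distribution with p df,
$$A = 1 + \frac{1}{2p} \left[ \frac{(\operatorname{tr}\mathbf{A}^{-1}\mathbf{A}_1)^2}{n_1 - 1} + \frac{(\operatorname{tr}\mathbf{A}^{-1}\mathbf{A}_2)^2}{n_2 - 1} \right], \tag{A4}$$
$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and
$$B = \frac{1}{p(p+2)} \left[ \frac{\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \tfrac{1}{2}(\operatorname{tr}\mathbf{A}^{-1}\mathbf{A}_1)^2}{n_1 - 1} + \frac{\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \tfrac{1}{2}(\operatorname{tr}\mathbf{A}^{-1}\mathbf{A}_2)^2}{n_2 - 1} \right]. \tag{A5}$$
The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).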
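Johansen (1980)
Let FJ = T2/c2, where c2 = p + 2C − 6C/(p + 2) and
$$C = \frac{1}{2} \sum_{j=1}^{2} \frac{\operatorname{tr}\!\left[(\mathbf{A}^{-1}\mathbf{A}_j)^2\right] + \left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2}{n_j - 1}. \tag{A6}$$
The test statistic FJ is compared to the critical value F[p, νJ], where νJ = p(p + 2)/(3C).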
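The Johansen computations can be checked against the least-squares results in Table 3 with a short Python sketch. It is illustrative only and rests on two assumptions that should be verified against Johansen (1980): C is computed from A⁻¹Aj with Aj = Sj/nj, and the scaling constant is taken as c2 = p + 2C − 6C/(p + 2), the form that reproduces the tabled values FJ = 2.3 and ν2 = 6.9. The helper functions are hypothetical and handle only p = 2 outcome variables.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
P = 2  # number of outcome variables; inv2 below is specific to p = 2

def mean_and_cov(rows):
    """Least-squares mean vector and covariance matrix for one group."""
    n, p = len(rows), len(rows[0])
    m = [sum(r[q] for r in rows) / n for q in range(p)]
    s = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1) for b in range(p)]
         for a in range(p)]
    return m, s

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

def johansen(rows1, rows2):
    sizes = [len(rows1), len(rows2)]
    (m1, s1), (m2, s2) = mean_and_cov(rows1), mean_and_cov(rows2)
    a_j = [[[v / n for v in row] for row in s] for s, n in zip((s1, s2), sizes)]  # A_j = S_j / n_j
    a = [[a_j[0][r][c] + a_j[1][r][c] for c in range(P)] for r in range(P)]       # A = A_1 + A_2
    a_inv = inv2(a)
    d = [[m1[q] - m2[q]] for q in range(P)]                                       # column vector
    t2 = matmul(matmul([[row[0] for row in d]], a_inv), d)[0][0]                  # equation 8
    c = 0.0
    for aj, n in zip(a_j, sizes):
        m = matmul(a_inv, aj)
        c += (trace(matmul(m, m)) + trace(m) ** 2) / (n - 1)
    c *= 0.5                                                                      # equation A6
    c2 = P + 2 * c - 6 * c / (P + 2)
    return t2, t2 / c2, P * (P + 2) / (3 * c)

T2, F_J, NU_J = johansen(GROUP1, GROUP2)
print(round(T2, 1), round(F_J, 1), round(NU_J, 1))  # 5.0 2.3 6.9
```

For robust estimators, the trimmed means and Winsorized covariance matrices would be substituted for the least-squares quantities; that substitution is not shown here.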
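Kim (1992)
The K procedure is based on the test statistic
$$F_K = \frac{\nu_K\, (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T \mathbf{V}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)}{c\, m\, f_1}, \tag{A7}$$
where $\mathbf{V} = r^2 \mathbf{A}_1 + \mathbf{A}_2 + 2r\, \mathbf{A}_2^{1/2} (\mathbf{A}_2^{-1/2} \mathbf{A}_1 \mathbf{A}_2^{-1/2})^{1/2} \mathbf{A}_2^{1/2}$,
$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8}$$
$$m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9}$$
$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1 \mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1 \mathbf{A}_2^{-1}|^{1/(2p)}$, and | | is the
determinant. The test statistic FK is compared to the critical value F[m, νK], where νK = f1 − p + 1,
$$\frac{1}{f_1} = \sum_{j=1}^{2} \frac{1}{n_j - 1} \left( \frac{b_j}{T_2} \right)^2, \tag{A10}$$
and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T \mathbf{V}^{-1} \mathbf{A}_j \mathbf{V}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.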
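Nel and van der Merwe (1986)
Let
$$F_{NV} = \frac{\nu_N}{p f_2}\, T_2, \tag{A11}$$
where νN = f2 − p + 1 and
$$f_2 = \frac{\operatorname{tr}(\mathbf{A}^2) + (\operatorname{tr}\mathbf{A})^2}{\displaystyle\sum_{j=1}^{2} \frac{1}{n_j - 1} \left[ \operatorname{tr}(\mathbf{A}_j^2) + (\operatorname{tr}\mathbf{A}_j)^2 \right]}. \tag{A12}$$
The FNV statistic is compared to the critical value F[p, νN].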
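Equations A11 and A12 can be verified against the least-squares row for the NV procedure in Table 3 with the following Python sketch. It is an illustration under the assumption that Aj = Sj/nj and A = A1 + A2, as defined for the James procedure; the helper names are hypothetical and the closed-form inverse restricts the code to p = 2.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
P = 2  # number of outcome variables; inv2 below is specific to p = 2

def mean_and_cov(rows):
    n, p = len(rows), len(rows[0])
    m = [sum(r[q] for r in rows) / n for q in range(p)]
    s = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1) for b in range(p)]
         for a in range(p)]
    return m, s

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

def nel_van_der_merwe(rows1, rows2):
    sizes = [len(rows1), len(rows2)]
    (m1, s1), (m2, s2) = mean_and_cov(rows1), mean_and_cov(rows2)
    a_j = [[[v / n for v in row] for row in s] for s, n in zip((s1, s2), sizes)]
    a = [[a_j[0][r][c] + a_j[1][r][c] for c in range(P)] for r in range(P)]
    d = [[m1[q] - m2[q]] for q in range(P)]
    t2 = matmul(matmul([[row[0] for row in d]], inv2(a)), d)[0][0]   # equation 8
    num = trace(matmul(a, a)) + trace(a) ** 2
    den = sum((trace(matmul(aj, aj)) + trace(aj) ** 2) / (n - 1) for aj, n in zip(a_j, sizes))
    f2 = num / den                                                   # equation A12
    nu_n = f2 - P + 1
    return f2, nu_n, nu_n / (P * f2) * t2                            # equation A11

F2, NU_N, F_NV = nel_van_der_merwe(GROUP1, GROUP2)
print(round(NU_N, 1), round(F_NV, 1))  # 4.4 2.0
```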
Yao (1965)
The statistic FY is referred to the critical value F[p, νK], where
$$F_Y = \frac{\nu_K}{p f_1}\, T_2, \tag{A13}$$
f1 is given by equation A10, and νK again equals f1 − p + 1.
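A Python sketch of the Yao computation follows. It assumes that, for this procedure, the matrix in the definition of bj is A = A1 + A2 (so that b1 + b2 = T2); that assumption is not stated explicitly in the text but reproduces the least-squares values ν2 = 6.1 and FY = 2.1 in Table 3. The helper names are hypothetical and the code handles only p = 2 outcome variables.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
P = 2  # number of outcome variables; inv2 below is specific to p = 2

def mean_and_cov(rows):
    n, p = len(rows), len(rows[0])
    m = [sum(r[q] for r in rows) / n for q in range(p)]
    s = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1) for b in range(p)]
         for a in range(p)]
    return m, s

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def yao(rows1, rows2):
    sizes = [len(rows1), len(rows2)]
    (m1, s1), (m2, s2) = mean_and_cov(rows1), mean_and_cov(rows2)
    a_j = [[[v / n for v in row] for row in s] for s, n in zip((s1, s2), sizes)]
    a = [[a_j[0][r][c] + a_j[1][r][c] for c in range(P)] for r in range(P)]
    d = [[m1[q] - m2[q]] for q in range(P)]
    v = matmul(inv2(a), d)                       # A^{-1} (Ybar_1 - Ybar_2), a column vector
    v_t = [[row[0] for row in v]]
    t2 = matmul([[row[0] for row in d]], v)[0][0]            # equation 8
    b = [matmul(matmul(v_t, aj), v)[0][0] for aj in a_j]     # b_j with V taken to be A (assumption)
    f1 = 1.0 / sum((bj / t2) ** 2 / (n - 1) for bj, n in zip(b, sizes))  # equation A10
    nu_k = f1 - P + 1
    return f1, nu_k, nu_k / (P * f1) * t2                    # equation A13

F1, NU_K, F_Y = yao(GROUP1, GROUP2)
print(round(NU_K, 1), round(F_Y, 1))  # 6.1 2.1
```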
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of a matrix.
2. The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set
Group   Subject   Yi1   Yi2
1       1           1    51
1       2          28    48
1       3          12    49
1       4          13    51
1       5          13    52
1       6          11    47
2       1          19    46
2       2          18    48
2       3          18    50
2       4          21    50
2       5          19    45
2       6          20    46
2       7          22    48
2       8          18    49
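Table 2. Summary Statistics for Least-Squares and Robust Estimators
Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$, $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$, $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$
Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$, $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$, $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$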
Table 3. Hypothesis Test Results for Multivariate Example Data Set
Procedure   Test Statistic               df                   p-value/Critical Value (CV)   Decision re: Null Hypothesis
Least-Squares Estimators
T²          T1 = 6.1, FT = 2.8           ν1 = 2, ν2 = 11      p = .106                      Fail to Reject
BF          TBF = 9.1, FBF = 3.7         ν1 = 4, ν2 = 4.4     p = .116                      Fail to Reject
J2          T2 = 5.0                     ν1 = 2               CV = 14.2                     Fail to Reject
J           T2 = 5.0, FJ = 2.3           ν1 = 2, ν2 = 6.9     p = .175                      Fail to Reject
K           FK = 2.5                     ν1 = 1.5, ν2 = 6.1   p = .164                      Fail to Reject
NV          T2 = 5.0, FNV = 2.0          ν1 = 2, ν2 = 4.4     p = .237                      Fail to Reject
Y           T2 = 5.0, FY = 2.1           ν1 = 2, ν2 = 6.1     p = .198                      Fail to Reject
Robust Estimators
T²          T1 = 59.0, FT = 25.8         ν1 = 2, ν2 = 7       p = .001                      Reject
BF          TBF = 131.2, FBF = 56.2      ν1 = 5, ν2 = 6.0     p = .001                      Reject
J2          T2 = 65.2                    ν1 = 2               CV = 13.3                     Reject
J           T2 = 65.2, FJ = 29.5         ν1 = 2, ν2 = 6.3     p = .001                      Reject
K           FK = 28.1                    ν1 = 2.0, ν2 = 6.6   p = .001                      Reject
NV          T2 = 65.2, FNV = 27.9        ν1 = 2, ν2 = 6.0     p = .001                      Reject
Y           T2 = 65.2, FY = 28.3         ν1 = 2, ν2 = 6.6     p = .001                      Reject
Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
Multivariate Tests of Means
10
smallest standard deviations of the groups was 16 the true rate of Type I errors for α = 05 was
11 When the ratio of the group standard deviations was increased to 32 the Type I error rate
was 21 more than four times the nominal level of significance For positive pairings of group
sizes and covariance matrices such that the group with the largest sample size also exhibits the
largest element values of j the T2 procedure tends to produce a conservative test In fact the
error rate may be substantially below the nominal level of significance For example Hopkins
and Clay observed Type I error rates of 02 and 01 respectively for α = 05 for positive pairings
for the two standard deviation values noted previously These liberal and conservative results for
normally distributed data have been demonstrated in a number of studies (Everitt 1979 Hakstian
et al 1979 Holloway amp Dunn 1967 Hopkins amp Clay 1963 Ito amp Schull 1964 Zwick 1986)
for both moderate and large degrees of covariance heterogeneity
When the assumption of multivariate normality is violated the performance of
Hotellingrsquos (1931) T2 test depends on both the degree of departure from a multivariate normal
distribution and the nature of the research design The earliest research on Hotellingrsquos T2 when
the data are non-normal suggested that tests of the null hypothesis for two groups were relatively
insensitive to departures from this assumption (eg Hopkins amp Clay 1963) This may be true
when the data are only moderately non-normal However Everitt (1979) showed that this test
procedure can become quite conservative when the distribution is skewed or when outliers are
present in the tails of the distribution particularly when the design is unbalanced (see also
Zwick 1986)
Tests for Mean Equality when Assumptions are not Satisfied
Both parametric and nonparametric alternatives to the T2 test have been proposed in the
literature Applied researchers often regard nonparametric procedures as appealing alternatives
Multivariate Tests of Means
11
because they rely on rank scores which are typically perceived as being easy to conceptualize
and interpret However these procedures test hypotheses about equality of distributions rather
than equality of means They are therefore sensitive to covariance heterogeneity because
distributions with unequal variances will necessarily result in rejection of the null hypothesis
Zwick (1986) showed that nonparametric alternatives to Hotellingrsquos (1931) T2 could control the
Type I error rate when the data were sampled from non-normal distributions and covariances
were equal Not surprisingly when covariances were unequal these procedures produced biased
results particularly when group sizes were unequal
Covariance heterogeneity There are several parametric alternatives to Hotellingrsquos (1931)
T2 test These include the Brown-Forsythe (Brown amp Forsythe 1974 [BF]) James (1954) first
and second order (J1 amp J2) Johansen (1980 [J]) Kim (1992 [K]) Nel and Van der Merwe
(1986 [NV]) and Yao (1965 [Y]) procedures The BF J1 J2 and J procedures have also been
generalized to multivariate designs containing more than two groups of subject
The J1 J2 J NV and Y procedures are all obtained from the same test statistic
21
1
2
2
1
1T
212 YYSS
YYnn
T (8)
The J NV and Y procedures approximate the distribution of T2 differently because the df for
these four procedures are computed using different formulas The J1 and J2 procedures each use
a different critical value to assess statistical significance However they both rely on large-
sample theory regarding the distribution of the test statistic in equation 8 What this means is that
when sample sizes are sufficiently large the T2 statistic approximately follows a chi-squared (χ2)
distribution For both procedures this test statistic is referred to an ldquoadjustedrdquo χ2
critical value If
the test statistic exceeds that critical value the null hypothesis of equation 2 is rejected The
critical value for the J1 procedure is slightly smaller than the one for the J2 procedure As a
Multivariate Tests of Means
12
result the J1 procedure generally produces larger Type I error rates than the J2 procedure and
therefore is not often recommended (de la Ray amp Nel 1993) While the J2 procedure may offer
better Type I error control it is computationally complex The critical value for J2 is described in
the Appendix along with the F-statistic conversions and df computations for the J NV and Y
procedures
The K procedure is based on an F statistic It is more complex than preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed1 For completeness the formula used to compute the K test statistic and its df are
found in the Appendix
The BF procedure (see also Mehrotra 1997) relies on a test statistic that differs slightly
from the one presented in equation 8
11 21
1
22
11T
21BF YYSSYYN
n
N
nT (9)
As one can see the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8 Again for completeness the numeric solutions
for the BF F statistic and df are found in the Appendix
Among the BF J2 J K NV and Y tests there appears to be no one best choice in all
data-analytic situations when the data are normally distributed although a comprehensive
comparison of all of these procedures has not yet been conducted Factors such as the degree of
covariance heterogeneity total sample size the degree of imbalance of the group sizes and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences Christensen and Rencher (1997) noted in their extensive comparison among the J2
J K NV and Y procedures that the J and Y procedures could occasionally result in inflated
Multivariate Tests of Means
13
Type I error rates for negative pairings of group sizes and covariance matrices These liberal
tendencies were exacerbated as the number of outcome variables increased The authors
recommended the K procedure overall observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates However the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative Type I error rates as low as 02 for α = 05 were reported when p =
10 n1 = 30 and n2 = 20
For multivariate non-normal distributions Algina Oshima and Tang (1991) showed that
the J1 J2 J and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed For a lognormal distribution which has skewness
of 6182 they observed many instances in which empirical Type I error rates of all of these
procedures were more than four times the nominal level of significance Wilcox (1995) found
that the J test produced excessive Type I errors when sample sizes were small (ie n1 = 12 and
n2 = 18) and the data were generated from non-normal distributions the K procedure became
conservative when the skewness was 618 For larger group sizes (ie n1 = 24 and n2 = 36) the J
procedure provided acceptable control of Type I errors when the data were only moderately non-
normal but for the maximum skewness considered it also became conservative Fouladi and
Yockey (2002) found that the degree of departure from a multivariate normal distribution was a
less important predictor of Type I error performance than sample size Across the range of
conditions which they examined the Y test produced the greatest average Type I error rates and
the NV procedure the smallest Error rates were only slightly influenced by the degree of
skewness or kurtosis of the data however these authors looked at only very modest departures
from a normal distribution the maximum degree of skewness considered was 75
Multivariate Tests of Means
14
Non-normality For univariate designs a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores andor a skewed distribution (Keselman
Kowalchuk amp Lix 1998 Lix amp Keselman 1998) There are a number of robust estimators that
have been proposed in the literature among these the trimmed mean has received a great deal of
attention because of its good theoretical properties ease of computation and ease of
interpretation (Wilcox 1995a) The trimmed mean is obtained by removing (ie censoring) the
most extreme scores in the distribution Hence one removes the effects of the most extreme
scores which have the tendency to ldquoshiftrdquo the mean in their direction
One should recognize at the outset that, while robust estimators are insensitive to
departures from a normal distribution, procedures based on them test a different null hypothesis
than procedures based on least-squares estimators. The null hypothesis is about equality of
trimmed population means. In other words, one is testing a hypothesis that focuses on the bulk
of the population rather than the entire population. Thus, if one subscribes to the position
that inferences pertaining to robust parameters are more valid than inferences pertaining to the
usual least-squares parameters, then procedures based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \cdots \le Y_{(n_j)j}$ represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then let $g_j = [\gamma n_j]$,
where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the
distribution and $[x]$ is the greatest integer $\le x$. The effective sample size for the jth group is
defined as $h_j = n_j - 2g_j$. The sample trimmed mean

$$\bar{Y}_{tj} = \frac{1}{h_j} \sum_{i=g_j+1}^{n_j-g_j} Y_{(i)j} \tag{10}$$

is computed by censoring the $g_j$ smallest and the $g_j$ largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
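The computation in equation 10 can be sketched in a few lines of Python. This is a minimal illustration and not the SAS/IML module described later; the function name `trimmed_mean` and the argument name `gamma` are hypothetical, and the data are the group 1 scores on the first outcome variable from Table 1.

```python
import math

def trimmed_mean(values, gamma=0.20):
    """Return (trimmed mean, g, h) for one group on one outcome variable."""
    ordered = sorted(values)          # Y(1) <= Y(2) <= ... <= Y(n)
    n = len(ordered)
    g = math.floor(gamma * n)         # observations censored in each tail
    h = n - 2 * g                     # effective sample size
    kept = ordered[g:n - g]           # drop the g smallest and g largest scores
    return sum(kept) / h, g, h

# Group 1, first outcome variable (Table 1)
group1_y1 = [1, 28, 12, 13, 13, 11]
mean_t, g, h = trimmed_mean(group1_y1)
print(mean_t, g, h)   # 12.25 1 4
```

With 20% trimming, one score is censored in each tail, leaving four scores whose mean is 12.25, which is reported to one decimal place as 12.2.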
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed,

$$\bar{Y}_{wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Z_{ij}, \tag{11}$$

where

$$Z_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j}. \end{cases}$$

The Winsorized mean is obtained by replacing the $g_j$ smallest values with the next most extreme
value and the $g_j$ largest values with the next most extreme value. The Winsorized variance for
the jth group on a single outcome variable, $s^2_{wj}$, is

$$s^2_{wj} = \frac{\sum_{i=1}^{n_j} (Z_{ij} - \bar{Y}_{wj})^2}{n_j - 1}, \tag{12}$$
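and the standard error of the trimmed mean is $\sqrt{(n_j - 1)\,s^2_{wj} / [h_j(h_j - 1)]}$. The Winsorized
covariance for the outcome variables $q$ and $q'$ ($q, q' = 1, \ldots, p$) is

$$s_{wjqq'} = \frac{\sum_{i=1}^{n_j} (Z_{ijq} - \bar{Y}_{wjq})(Z_{ijq'} - \bar{Y}_{wjq'})}{n_j - 1}, \tag{13}$$

and the Winsorized covariance matrix for the jth group is

$$\mathbf{S}_{wj} = \begin{bmatrix} s^2_{wj1} & s_{wj12} & \cdots & s_{wj1p} \\ & & \ddots & \vdots \\ & & & s^2_{wjp} \end{bmatrix}.$$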
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are

   1   11   12   13   13   28

and with 20% trimming, $g_1 = [6 \times .20] = 1$. The scores of 1 and 28 are removed and the mean of
the remaining scores is computed, which produces $\bar{Y}_{t1} = 12.2$. Table 2 contains the vectors of
trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:

   11   11   12   13   13   13

The Winsorized mean, $\bar{Y}_{w1}$, is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean when rounded to one decimal place, these two estimators will not, as a
rule, produce an equivalent result. Table 2 contains the Winsorized covariance matrices for the
two groups.
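The Winsorizing step and the Winsorized covariance matrix of equations 11 to 13 can be sketched as follows. This is an illustrative Python sketch under the assumption that each outcome variable is Winsorized separately while the scores stay attached to their subjects; the function names `winsorize` and `winsorized_cov` are hypothetical and are not part of the SAS/IML program.

```python
import math

def winsorize(values, gamma=0.20):
    """Replace the g smallest and g largest scores by the nearest retained score."""
    ordered = sorted(values)
    n = len(ordered)
    g = math.floor(gamma * n)
    low, high = ordered[g], ordered[n - g - 1]
    # Subject order is preserved so that pairs of outcomes stay aligned.
    return [min(max(v, low), high) for v in values]

def winsorized_cov(columns, gamma=0.20):
    """Winsorized covariance matrix (list of rows) from a list of outcome columns."""
    z = [winsorize(col, gamma) for col in columns]
    n = len(z[0])
    means = [sum(col) / n for col in z]            # Winsorized means (equation 11)
    return [[sum((a - ma) * (b - mb) for a, b in zip(za, zb)) / (n - 1)
             for zb, mb in zip(z, means)]
            for za, ma in zip(z, means)]

# Group 1 from Table 1: one list per outcome variable
group1 = [[1, 28, 12, 13, 13, 11], [51, 48, 49, 51, 52, 47]]
s_w1 = winsorized_cov(group1)
print(sorted(winsorize(group1[0])))                    # [11, 11, 12, 13, 13, 13]
print([[round(v, 1) for v in row] for row in s_w1])    # [[1.0, 0.3], [0.3, 2.3]]
```

The rounded matrix agrees with the Winsorized covariance matrix reported for the first group in Table 2.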
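A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators, the T1 statistic of equation 5 becomes

$$T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^{\mathsf{T}} \left[ \left( \frac{1}{h_1} + \frac{1}{h_2} \right) \mathbf{S}_w \right]^{-1} (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2}), \tag{14}$$

where

$$\mathbf{S}_w = \frac{(n_1 - 1)\,\mathbf{S}_{w1} + (n_2 - 1)\,\mathbf{S}_{w2}}{(h_1 - 1) + (h_2 - 1)}. \tag{15}$$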
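As a check on equations 14 and 15, the following Python sketch computes the statistic for the example data of Table 1. It is a sketch under stated assumptions: it handles only p = 2 outcome variables (the 2 x 2 inverse is written out by hand), and the function names `robust_summary` and `pooled_t1` are hypothetical. Setting `gamma` to 0 recovers the usual least-squares statistic.

```python
import math

def robust_summary(columns, gamma):
    """Return (h, trimmed means, Winsorized SSCP matrix) for one group."""
    n = len(columns[0])
    g = math.floor(gamma * n)          # scores censored in each tail
    h = n - 2 * g                      # effective sample size
    means, z = [], []
    for col in columns:
        ordered = sorted(col)
        means.append(sum(ordered[g:n - g]) / h)
        low, high = ordered[g], ordered[n - g - 1]
        z.append([min(max(v, low), high) for v in col])
    centers = [sum(col) / n for col in z]
    # The SSCP matrix equals (n - 1) times the Winsorized covariance matrix.
    sscp = [[sum((a - ca) * (b - cb) for a, b in zip(za, zb))
             for zb, cb in zip(z, centers)]
            for za, ca in zip(z, centers)]
    return h, means, sscp

def pooled_t1(group1, group2, gamma):
    """Equation 14 with the pooled matrix of equation 15 (p = 2 only)."""
    h1, m1, c1 = robust_summary(group1, gamma)
    h2, m2, c2 = robust_summary(group2, gamma)
    scale = (1 / h1 + 1 / h2) / ((h1 - 1) + (h2 - 1))
    v = [[scale * (x + y) for x, y in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
    d = [a - b for a, b in zip(m1, m2)]
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    return (v[1][1] * d[0] ** 2 - 2 * v[0][1] * d[0] * d[1] + v[0][0] * d[1] ** 2) / det

group1 = [[1, 28, 12, 13, 13, 11], [51, 48, 49, 51, 52, 47]]
group2 = [[19, 18, 18, 21, 19, 20, 22, 18], [46, 48, 50, 50, 45, 46, 48, 49]]
print(round(pooled_t1(group1, group2, 0.20), 1))   # 59.0
print(round(pooled_t1(group1, group2, 0), 1))      # 6.1
```

Both values agree with the T1 entries reported for Hotelling's procedure in Table 3 (59.0 with robust estimators, 6.1 with least-squares estimators).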
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators and the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators
was similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24
and n2 = 36). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that
is, the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so
that there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic, and either a p-value or critical value. These
results can be produced for both least-squares estimators and robust estimators with separate
calls to the program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:

   Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
   NX={6 8};
   PTRIM=0;
   ALPHA=.05;
   RUN T2MULT;
   QUIT;

The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; separate trimming
proportions for the right and left tails are not specified. The RUN T2MULT code invokes the
program and generates output. Observe that each line of code ends with a semi-colon. Also, it is
necessary that these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) T². We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
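The contrast between the two sets of results can be reproduced with a short Python sketch of the statistic that the J2, J, NV, and Y procedures share, the quadratic form in the mean difference weighted by the inverse of A1 + A2. Two points are assumptions rather than statements from the text: for robust estimators A_j is taken to be (n_j − 1)S_wj/[h_j(h_j − 1)], the matrix analogue of the squared standard error of the trimmed mean, and the sketch handles only p = 2 outcome variables. The function names are hypothetical.

```python
import math

def robust_summary(columns, gamma):
    """Return (h, trimmed means, Winsorized SSCP matrix) for one group."""
    n = len(columns[0])
    g = math.floor(gamma * n)          # scores censored in each tail
    h = n - 2 * g                      # effective sample size
    means, z = [], []
    for col in columns:
        ordered = sorted(col)
        means.append(sum(ordered[g:n - g]) / h)
        low, high = ordered[g], ordered[n - g - 1]
        z.append([min(max(v, low), high) for v in col])
    centers = [sum(col) / n for col in z]
    sscp = [[sum((a - ca) * (b - cb) for a, b in zip(za, zb))
             for zb, cb in zip(z, centers)]
            for za, ca in zip(z, centers)]
    return h, means, sscp

def t2(group1, group2, gamma):
    """d' (A1 + A2)^-1 d for two outcome variables."""
    h1, m1, c1 = robust_summary(group1, gamma)
    h2, m2, c2 = robust_summary(group2, gamma)
    # Assumed A_j = (n_j - 1) S_wj / [h_j (h_j - 1)]; with gamma = 0 this is S_j / n_j.
    a = [[x / (h1 * (h1 - 1)) + y / (h2 * (h2 - 1)) for x, y in zip(r1, r2)]
         for r1, r2 in zip(c1, c2)]
    d = [u - v for u, v in zip(m1, m2)]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return (a[1][1] * d[0] ** 2 - 2 * a[0][1] * d[0] * d[1] + a[0][0] * d[1] ** 2) / det

group1 = [[1, 28, 12, 13, 13, 11], [51, 48, 49, 51, 52, 47]]
group2 = [[19, 18, 18, 21, 19, 20, 22, 18], [46, 48, 50, 50, 45, 46, 48, 49]]
print(round(t2(group1, group2, 0), 1))      # 5.0
print(round(t2(group1, group2, 0.20), 1))   # 65.2
```

The two scores of 1 and 28 in the first group inflate the least-squares variance of the first outcome variable to 74.8; once they are trimmed and Winsorized, the same mean difference yields a statistic more than ten times larger.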
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article we have reviewed the shortcomings of Hotelling's (1931) T² test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and Van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b), because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T² tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of Statistics, Vol. 1 (pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26,
1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size. Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate
Behavioral Research, 21, 169-186.
Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) T² Test
Brown and Forsythe (1974)
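The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then

$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF}, \tag{A1}$$

where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and

$$f_2 = \frac{\operatorname{tr}(\mathbf{G}_1^2) + (\operatorname{tr}\mathbf{G}_1)^2}{\dfrac{1}{n_1 - 1}\left\{ \operatorname{tr}[(\bar{w}_1\mathbf{S}_1)^2] + [\operatorname{tr}(\bar{w}_1\mathbf{S}_1)]^2 \right\} + \dfrac{1}{n_2 - 1}\left\{ \operatorname{tr}[(\bar{w}_2\mathbf{S}_2)^2] + [\operatorname{tr}(\bar{w}_2\mathbf{S}_2)]^2 \right\}}. \tag{A2}$$

In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1\mathbf{S}_1 + \bar{w}_2\mathbf{S}_2$. The test statistic $F_{BF}$ is
compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where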
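$$\nu_{BF1} = \frac{\operatorname{tr}(\mathbf{G}_1^2) + (\operatorname{tr}\mathbf{G}_1)^2}{\operatorname{tr}(\mathbf{G}_2^2) + (\operatorname{tr}\mathbf{G}_2)^2 + \cdots} \tag{A3}$$

and $\mathbf{G}_2 = w_1\mathbf{S}_1 + w_2\mathbf{S}_2$. [The remaining denominator terms of equation A3, four trace terms in the
weighted matrices $\mathbf{S}_1$ and $\mathbf{S}_2$, are not legible in this transcript.]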
James (1954) Second Order
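The test statistic $T_2$ of equation 8 is compared to the critical value $\chi^2_p (A + \chi^2_p B) + q$,
where $\chi^2_p$ is the 1 − α percentile point of the χ² distribution with p df,

$$A = 1 + \frac{1}{2p}\left\{ \frac{1}{n_1 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1)\right]^2 + \frac{1}{n_2 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2)\right]^2 \right\}, \tag{A4}$$

$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and

$$B = \frac{1}{p(p+2)}\left\{ \frac{1}{n_1 - 1}\left[ \operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \frac{1}{2}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1)\right]^2 \right] + \frac{1}{n_2 - 1}\left[ \operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \frac{1}{2}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2)\right]^2 \right] \right\}. \tag{A5}$$

The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).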
Johansen (1980)
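Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and

$$C = \frac{1}{2}\sum_{j=1}^{2} \frac{1}{n_j - 1}\left\{ \operatorname{tr}\left[(\mathbf{A}^{-1}\mathbf{A}_j)^2\right] + \left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2 \right\}. \tag{A6}$$

The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.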
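The Johansen computations can be checked against Table 3 with the following Python sketch for the least-squares estimators. Two choices here are assumptions: the divisor in c2 is taken to be p + 2, which is the value that reproduces the F_J = 2.3 reported in Table 3 for these data, and the code is written for p = 2 outcome variables only. All function names are hypothetical.

```python
def cov_over_n(columns):
    """Return (n, means, A_j) with A_j = S_j / n_j for one group."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    a = [[sum((x - mx) * (y - my) for x, y in zip(cx, cy)) / ((n - 1) * n)
          for cy, my in zip(columns, means)]
         for cx, mx in zip(columns, means)]
    return n, means, a

def matmul(x, y):
    """Product of two 2 x 2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(x):
    return x[0][0] + x[1][1]

group1 = [[1, 28, 12, 13, 13, 11], [51, 48, 49, 51, 52, 47]]
group2 = [[19, 18, 18, 21, 19, 20, 22, 18], [46, 48, 50, 50, 45, 46, 48, 49]]
p = 2

n1, m1, a1 = cov_over_n(group1)
n2, m2, a2 = cov_over_n(group2)
a = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
a_inv = [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]
d = [u - v for u, v in zip(m1, m2)]
t2 = sum(d[i] * a_inv[i][j] * d[j] for i in range(2) for j in range(2))

# Equation A6: C = (1/2) sum_j {tr[(A^-1 A_j)^2] + [tr(A^-1 A_j)]^2} / (n_j - 1)
c = 0.0
for n, aj in ((n1, a1), (n2, a2)):
    m = matmul(a_inv, aj)
    c += (trace(matmul(m, m)) + trace(m) ** 2) / (2 * (n - 1))

c2 = p + 2 * c - 6 * c / (p + 2)    # divisor p + 2 is an assumption (see lead-in)
nu_j = p * (p + 2) / (3 * c)
f_j = t2 / c2
print(round(t2, 1), round(f_j, 1), round(nu_j, 1))   # 5.0 2.3 6.9
```

The three printed values match the T2, F_J, and ν2 entries for the J procedure under least-squares estimators in Table 3.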
Kim (1992)
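The K procedure is based on the test statistic

$$F_K = \frac{\nu_K}{c\,m\,f_1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}\,\mathbf{V}^{-1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2), \tag{A7}$$

where $\mathbf{V} = r^2\mathbf{A}_2 + \mathbf{A}_1 + 2r\,\mathbf{A}_2^{1/2}\left(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2}\right)^{1/2}\mathbf{A}_2^{1/2}$,

$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8}$$

$$m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9}$$

$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1\mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1\mathbf{A}_2^{-1}|^{1/(2p)}$, and | | is the
determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$, where $\nu_K = f_1 - p + 1$,

$$\frac{1}{f_1} = \sum_{j=1}^{2} \frac{1}{n_j - 1}\left(\frac{b_j}{T_2}\right)^2, \tag{A10}$$

and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}\,\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.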
Nel and van der Merwe (1986)
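Let

$$F_{NV} = \frac{\nu_N}{p f_2}\, T_2, \tag{A11}$$

where $\nu_N = f_2 - p + 1$ and

$$f_2 = \frac{\operatorname{tr}(\mathbf{A}^2) + (\operatorname{tr}\mathbf{A})^2}{\sum_{j=1}^{2} \dfrac{1}{n_j - 1}\left[ \operatorname{tr}(\mathbf{A}_j^2) + (\operatorname{tr}\mathbf{A}_j)^2 \right]}. \tag{A12}$$

The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.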
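Equations A11 and A12 can be checked against Table 3 with the short Python sketch below, again for least-squares estimators and p = 2 outcome variables. The helper names `cov_over_n` and `q` are hypothetical, and A_j = S_j/n_j and A = A1 + A2 are used as defined for the James procedure.

```python
def cov_over_n(columns):
    """Return (n, means, A_j) with A_j = S_j / n_j for one group."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    a = [[sum((x - mx) * (y - my) for x, y in zip(cx, cy)) / ((n - 1) * n)
          for cy, my in zip(columns, means)]
         for cx, mx in zip(columns, means)]
    return n, means, a

def q(m):
    """tr(M^2) + (tr M)^2 for a 2 x 2 matrix M."""
    tr_of_square = m[0][0] ** 2 + 2 * m[0][1] * m[1][0] + m[1][1] ** 2
    return tr_of_square + (m[0][0] + m[1][1]) ** 2

group1 = [[1, 28, 12, 13, 13, 11], [51, 48, 49, 51, 52, 47]]
group2 = [[19, 18, 18, 21, 19, 20, 22, 18], [46, 48, 50, 50, 45, 46, 48, 49]]
p = 2

n1, m1, a1 = cov_over_n(group1)
n2, m2, a2 = cov_over_n(group2)
a = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(a1, a2)]

# T2: quadratic form in the mean difference with A^-1 (2 x 2 inverse by hand)
d = [u - v for u, v in zip(m1, m2)]
det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
t2 = (a[1][1] * d[0] ** 2 - 2 * a[0][1] * d[0] * d[1] + a[0][0] * d[1] ** 2) / det

f2 = q(a) / (q(a1) / (n1 - 1) + q(a2) / (n2 - 1))   # equation A12
nu_n = f2 - p + 1
f_nv = nu_n / (p * f2) * t2                         # equation A11
print(round(f2, 1), round(nu_n, 1), round(f_nv, 1))   # 5.4 4.4 2.0
```

The denominator df of 4.4 and the F statistic of 2.0 agree with the NV row for least-squares estimators in Table 3.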
Yao (1965)
The statistic

$$F_Y = \frac{\nu_K}{p f_1}\, T_2 \tag{A13}$$

is referred to the critical value $F[p, \nu_K]$, where $f_1$ is given by equation A10 and $\nu_K$ again equals
$f_1 - p + 1$.
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of a matrix.
2. The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set

Group   Subject   Yi1   Yi2
1       1           1    51
1       2          28    48
1       3          12    49
1       4          13    51
1       5          13    52
1       6          11    47
2       1          19    46
2       2          18    48
2       3          18    50
2       4          21    50
2       5          19    45
2       6          20    46
2       7          22    48
2       8          18    49
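Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$        $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ & 3.9 \end{bmatrix}$        $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ & 3.6 \end{bmatrix}$

Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$        $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ & 2.3 \end{bmatrix}$        $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ & 3.0 \end{bmatrix}$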
Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure   Test Statistic              df                     p-value/Critical value (CV)   Decision re: Null Hypothesis

Least-Squares Estimators
T²          T1 = 6.1,  FT = 2.8         ν1 = 2,   ν2 = 11      p = .106                      Fail to Reject
BF          TBF = 9.1, FBF = 3.7        ν1 = 4,   ν2 = 4.4     p = .116                      Fail to Reject
J2          T2 = 5.0                    ν1 = 2                 CV = 14.2                     Fail to Reject
J           T2 = 5.0,  FJ = 2.3         ν1 = 2,   ν2 = 6.9     p = .175                      Fail to Reject
K           FK = 2.5                    ν1 = 1.5, ν2 = 6.1     p = .164                      Fail to Reject
NV          T2 = 5.0,  FNV = 2.0        ν1 = 2,   ν2 = 4.4     p = .237                      Fail to Reject
Y           T2 = 5.0,  FY = 2.1         ν1 = 2,   ν2 = 6.1     p = .198                      Fail to Reject

Robust Estimators
T²          T1 = 59.0,   FT = 25.8      ν1 = 2,   ν2 = 7       p = .001                      Reject
BF          TBF = 131.2, FBF = 56.2     ν1 = 5,   ν2 = 6.0     p = .001                      Reject
J2          T2 = 65.2                   ν1 = 2                 CV = 13.3                     Reject
J           T2 = 65.2,   FJ = 29.5      ν1 = 2,   ν2 = 6.3     p = .001                      Reject
K           FK = 28.1                   ν1 = 2.0, ν2 = 6.6     p = .001                      Reject
NV          T2 = 65.2,   FNV = 27.9     ν1 = 2,   ν2 = 6.0     p = .001                      Reject
Y           T2 = 65.2,   FY = 28.3      ν1 = 2,   ν2 = 6.6     p = .001                      Reject

Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
Multivariate Tests of Means
11
because they rely on rank scores which are typically perceived as being easy to conceptualize
and interpret However these procedures test hypotheses about equality of distributions rather
than equality of means They are therefore sensitive to covariance heterogeneity because
distributions with unequal variances will necessarily result in rejection of the null hypothesis
Zwick (1986) showed that nonparametric alternatives to Hotellingrsquos (1931) T2 could control the
Type I error rate when the data were sampled from non-normal distributions and covariances
were equal Not surprisingly when covariances were unequal these procedures produced biased
results particularly when group sizes were unequal
Covariance heterogeneity There are several parametric alternatives to Hotellingrsquos (1931)
T2 test These include the Brown-Forsythe (Brown amp Forsythe 1974 [BF]) James (1954) first
and second order (J1 amp J2) Johansen (1980 [J]) Kim (1992 [K]) Nel and Van der Merwe
(1986 [NV]) and Yao (1965 [Y]) procedures The BF J1 J2 and J procedures have also been
generalized to multivariate designs containing more than two groups of subject
The J1 J2 J NV and Y procedures are all obtained from the same test statistic
21
1
2
2
1
1T
212 YYSS
YYnn
T (8)
The J NV and Y procedures approximate the distribution of T2 differently because the df for
these four procedures are computed using different formulas The J1 and J2 procedures each use
a different critical value to assess statistical significance However they both rely on large-
sample theory regarding the distribution of the test statistic in equation 8 What this means is that
when sample sizes are sufficiently large the T2 statistic approximately follows a chi-squared (χ2)
distribution For both procedures this test statistic is referred to an ldquoadjustedrdquo χ2
critical value If
the test statistic exceeds that critical value the null hypothesis of equation 2 is rejected The
critical value for the J1 procedure is slightly smaller than the one for the J2 procedure As a
Multivariate Tests of Means
12
result the J1 procedure generally produces larger Type I error rates than the J2 procedure and
therefore is not often recommended (de la Ray amp Nel 1993) While the J2 procedure may offer
better Type I error control it is computationally complex The critical value for J2 is described in
the Appendix along with the F-statistic conversions and df computations for the J NV and Y
procedures
The K procedure is based on an F statistic It is more complex than preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed1 For completeness the formula used to compute the K test statistic and its df are
found in the Appendix
The BF procedure (see also Mehrotra 1997) relies on a test statistic that differs slightly
from the one presented in equation 8
11 21
1
22
11T
21BF YYSSYYN
n
N
nT (9)
As one can see the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8 Again for completeness the numeric solutions
for the BF F statistic and df are found in the Appendix
Among the BF J2 J K NV and Y tests there appears to be no one best choice in all
data-analytic situations when the data are normally distributed although a comprehensive
comparison of all of these procedures has not yet been conducted Factors such as the degree of
covariance heterogeneity total sample size the degree of imbalance of the group sizes and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences Christensen and Rencher (1997) noted in their extensive comparison among the J2
J K NV and Y procedures that the J and Y procedures could occasionally result in inflated
Multivariate Tests of Means
13
Type I error rates for negative pairings of group sizes and covariance matrices These liberal
tendencies were exacerbated as the number of outcome variables increased The authors
recommended the K procedure overall observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates However the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative Type I error rates as low as 02 for α = 05 were reported when p =
10 n1 = 30 and n2 = 20
For multivariate non-normal distributions Algina Oshima and Tang (1991) showed that
the J1 J2 J and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed For a lognormal distribution which has skewness
of 6182 they observed many instances in which empirical Type I error rates of all of these
procedures were more than four times the nominal level of significance Wilcox (1995) found
that the J test produced excessive Type I errors when sample sizes were small (ie n1 = 12 and
n2 = 18) and the data were generated from non-normal distributions the K procedure became
conservative when the skewness was 618 For larger group sizes (ie n1 = 24 and n2 = 36) the J
procedure provided acceptable control of Type I errors when the data were only moderately non-
normal but for the maximum skewness considered it also became conservative Fouladi and
Yockey (2002) found that the degree of departure from a multivariate normal distribution was a
less important predictor of Type I error performance than sample size Across the range of
conditions which they examined the Y test produced the greatest average Type I error rates and
the NV procedure the smallest Error rates were only slightly influenced by the degree of
skewness or kurtosis of the data however these authors looked at only very modest departures
from a normal distribution the maximum degree of skewness considered was 75
Multivariate Tests of Means
14
Non-normality For univariate designs a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores andor a skewed distribution (Keselman
Kowalchuk amp Lix 1998 Lix amp Keselman 1998) There are a number of robust estimators that
have been proposed in the literature among these the trimmed mean has received a great deal of
attention because of its good theoretical properties ease of computation and ease of
interpretation (Wilcox 1995a) The trimmed mean is obtained by removing (ie censoring) the
most extreme scores in the distribution Hence one removes the effects of the most extreme
scores which have the tendency to ldquoshiftrdquo the mean in their direction
One should recognize at the outset that while robust estimators are insensitive to
departures from a normal distribution they test a different null hypothesis than least-squares
estimators The null hypothesis is about equality of trimmed population means In other words
one is testing a hypothesis that focuses on the bulk of the population rather than the entire
population Thus if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters then procedures
based on robust estimators should be adopted
To illustrate the computation of the trimmed mean let Y(1)j Y(2)j jn jY )( represent
the ordered observations for the jth group on a single outcome variable In other words one
begins by ordering the observations for each group from smallest to largest Then let gj = [ nj]
where represents the proportion of observations that are to be trimmed in each tail of the
distribution and [x] is the greatest integer x The effective sample size for the jth group is
defined as hj = nj ndash 2gj The sample trimmed mean
Multivariate Tests of Means
15
jj
j
gn
gi(i)j
jj
Yh
Y1
t
1 (10)
is computed by censoring the gj smallest and the gj largest observations The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups
A fixed proportion of the observations is trimmed from each tail of the distribution 20 percent
trimming is generally recommended (Wilcox 1995a)
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen 1974) and is used to obtain the diagonal elements of the group
covariance matrix To obtain the Winsorized variance the sample Winsorized mean is first
computed
1
1
w
jn
i
ij
j
j Zn
Y (11)
where
)(
)1(
)1()1(
if
if
if
jgnij )jg(n
)jg(nijjgij
jgijjgij
jjjj
jjj
jj
YYY
YYYY
YYYZ
The Winsorized mean is obtained by replacing the gj smallest values with the next most extreme
value and the gj largest values with the next most extreme value The Winsorized variance for
the jth group on a single outcome variable 2
wjs is
Multivariate Tests of Means
16
1
2
1w
2w
j
n
ijij
jn
YZ
s
j
(12)
and the standard error of the trimmed mean is 11 2w jjjj hhsn The Winsorized
covariance for the outcome variables q and q (q q = 1 hellip p) is
1
1w
j
qwjqij
n
iwjqijq
qjqn
YZYZ
s
j
(13)
and the Winsorized covariance matrix for the jth group is
s2
w1w
1w12w
2
1w
w
jpjp
pjjj
j
s
sss
S
To illustrate we return to the data set of Table 1 For the first outcome variable for the
first group the ordered observations are
28131312111
and with 20 trimming g1 = [6 x 20] = 1 The scores of 1 and 28 are removed and the mean of
the remaining scores is computed which produces 212t1Y Table 2 contains the vectors of
trimmed means for the two groups
To Winsorize the data set for the first group on the first outcome variable the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores
producing the following set of ordered observations
131313121111
Multivariate Tests of Means
17
The Winsorized mean w1Y is 122 While in this example the Winsorized mean has the same
value as the trimmed mean these two estimators will not as a rule produce an equivalent result
Table 2 contains the Winsorized covariance matrices for the two groups
A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF J2 J K NV or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox 1995b) For example with robust
estimators the T1 statistic of equation 5 becomes
11
t2t1
1
21
w
T
t2t1t1 YYSYYhh
T (14)
where
1
1
1
12w
2
2
1w
1
1w SSS
h
n
h
n (15)
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators when the data followed a multivariate non-
normal distribution The Type I error performance of the J procedure with robust estimators was
similar to that of the K procedure when sample sizes were sufficiently large (ie n1 = 24 and n2
= 36) More importantly however there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts this was
observed both for heavy-tailed (ie extreme values in the tails) and skewed distributions The
differences in power were as great as 60 percentage points which represents a substantial
difference in the ability to detect outcome effects
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic, and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:
Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified separately. The RUN T2MULT code invokes the program and
generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that
these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) $T^2$. We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling's (1931) $T^2$ test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions of several solutions to the multivariate Behrens-Fisher problem. South African Statistical Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and two-sample T2 tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on means under conditions of heterogeneous correlation structure and varied multivariate distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T2 procedure and the assumption of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between Myers-Briggs psychological traits and use of course objectives in anatomy and physiology. Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data. Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T2. Journal of the American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T2 and homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2, 360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah (Ed.), Handbook of Statistics, Vol. 1 (pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the $T_0^2$ test in multivariate analysis of variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika, 79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58, 409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61, 165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T2. Multivariate Behavioral Research, 21, 169-186.
Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) $T^2$ Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then
$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\,T_{BF}, \tag{A1}$$
where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and
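$$f_2 = \frac{[\operatorname{tr}(\mathbf{G}_1)]^2 + \operatorname{tr}(\mathbf{G}_1^2)}{\dfrac{1}{n_1 - 1}\left\{[\operatorname{tr}(\bar{w}_1\mathbf{S}_1)]^2 + \operatorname{tr}[(\bar{w}_1\mathbf{S}_1)^2]\right\} + \dfrac{1}{n_2 - 1}\left\{[\operatorname{tr}(\bar{w}_2\mathbf{S}_2)]^2 + \operatorname{tr}[(\bar{w}_2\mathbf{S}_2)^2]\right\}}. \tag{A2}$$
In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1\mathbf{S}_1 + \bar{w}_2\mathbf{S}_2$. The test statistic $F_{BF}$ is
compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where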
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and $\mathbf{G}_2 = w_1\mathbf{S}_1 + w_2\mathbf{S}_2$.
James (1954) Second Order
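The test statistic $T_2$ of equation 8 is compared to the critical value $\chi^2_p(A + \chi^2_p B) + q$,
where $\chi^2_p$ is the 1 – α percentile point of the $\chi^2$ distribution with p df,
$$A = 1 + \frac{1}{2p}\left\{\frac{1}{n_1 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1)\right]^2 + \frac{1}{n_2 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2)\right]^2\right\}, \tag{A4}$$
$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and
$$B = \frac{1}{p(p + 2)}\left\{\frac{1}{n_1 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \frac{1}{2}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1)\right]^2\right] + \frac{1}{n_2 - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \frac{1}{2}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2)\right]^2\right]\right\}. \tag{A5}$$
The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).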
Johansen (1980)
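Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and
$$C = \frac{1}{2}\sum_{j=1}^{2}\frac{1}{n_j - 1}\left\{\operatorname{tr}\left[(\mathbf{A}^{-1}\mathbf{A}_j)^2\right] + \left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2\right\}. \tag{A6}$$
The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.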
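The Johansen conversion can be illustrated with the Table 1 data and least-squares estimators. The sketch below is written for p = 2 outcome variables only and is not the authors' SAS/IML code; the helper names are hypothetical. It divides 6C by p + 2, the form given by Johansen (1980), which is an assumption that reproduces the tabled values.

```python
def mean(col):
    return sum(col) / len(col)

def cov(cols):
    """Least-squares covariance matrix (divisor n - 1), one column per outcome."""
    n = len(cols[0])
    m = [mean(c) for c in cols]
    return [[sum((a[i] - ma) * (b[i] - mb) for i in range(n)) / (n - 1)
             for b, mb in zip(cols, m)] for a, ma in zip(cols, m)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(m):
    return m[0][0] + m[1][1]

group1 = [[1, 28, 12, 13, 13, 11], [51, 48, 49, 51, 52, 47]]
group2 = [[19, 18, 18, 21, 19, 20, 22, 18], [46, 48, 50, 50, 45, 46, 48, 49]]
p = 2
sizes = [len(group1[0]), len(group2[0])]

# A_j = S_j / n_j and A = A_1 + A_2
A_j = [[[s / n for s in row] for row in cov(g)]
       for g, n in zip((group1, group2), sizes)]
A = [[A_j[0][r][c] + A_j[1][r][c] for c in range(2)] for r in range(2)]
A_inv = inv2(A)

# T2: quadratic form in the mean difference with A^{-1}
d = [mean(a) - mean(b) for a, b in zip(group1, group2)]
T2 = sum(d[r] * A_inv[r][c] * d[c] for r in range(2) for c in range(2))

C = 0.0
for Aj, n in zip(A_j, sizes):
    M = matmul(A_inv, Aj)
    C += (trace(matmul(M, M)) + trace(M) ** 2) / (n - 1)
C *= 0.5

c2 = p + 2 * C - 6 * C / (p + 2)
F_J = T2 / c2
nu_J = p * (p + 2) / (3 * C)
print(round(T2, 1), round(F_J, 1), round(nu_J, 1))  # 5.0 2.3 6.9
```

These values match the J row for least-squares estimators in Table 3.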
Kim (1992)
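The K procedure is based on the test statistic
$$F_K = \frac{\nu_K\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)}{c\,m\,f_1}, \tag{A7}$$
where $\mathbf{V} = \mathbf{A}_1 + r^2\mathbf{A}_2 + 2r\,\mathbf{A}_2^{1/2}\left(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2}\right)^{1/2}\mathbf{A}_2^{1/2}$,
$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8}$$
$$m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9}$$
$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1\mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1\mathbf{A}_2^{-1}|^{1/(2p)}$, and | | is the
determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$, where $\nu_K = f_1 - p + 1$,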
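$$\frac{1}{f_1} = \sum_{j=1}^{2}\frac{1}{n_j - 1}\left(\frac{b_j}{T_2}\right)^2, \tag{A10}$$
and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.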
Nel and van der Merwe (1986)
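Let
$$F_{NV} = \frac{\nu_N}{p f_2}\,T_2, \tag{A11}$$
where $\nu_N = f_2 - p + 1$ and
$$f_2 = \frac{\operatorname{tr}(\mathbf{A}^2) + [\operatorname{tr}(\mathbf{A})]^2}{\sum_{j=1}^{2}\dfrac{1}{n_j - 1}\left\{\operatorname{tr}(\mathbf{A}_j^2) + [\operatorname{tr}(\mathbf{A}_j)]^2\right\}}. \tag{A12}$$
The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.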
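A short sketch of the Nel and van der Merwe conversion for the Table 1 data with least-squares estimators follows. It assumes p = 2 outcome variables and uses hypothetical helper names. The p-value line relies on a closed form that holds only when the numerator df equals 2, namely P(F > x) = (1 + 2x/ν)^(-ν/2); this shortcut is introduced here for illustration and is not part of the authors' program.

```python
def mean(col):
    return sum(col) / len(col)

def cov(cols):
    n = len(cols[0])
    m = [mean(c) for c in cols]
    return [[sum((a[i] - ma) * (b[i] - mb) for i in range(n)) / (n - 1)
             for b, mb in zip(cols, m)] for a, ma in zip(cols, m)]

def trace_terms(m):
    """tr(M^2) + [tr(M)]^2 for a symmetric 2 x 2 matrix M."""
    tr_sq = m[0][0] ** 2 + 2 * m[0][1] * m[1][0] + m[1][1] ** 2
    return tr_sq + (m[0][0] + m[1][1]) ** 2

group1 = [[1, 28, 12, 13, 13, 11], [51, 48, 49, 51, 52, 47]]
group2 = [[19, 18, 18, 21, 19, 20, 22, 18], [46, 48, 50, 50, 45, 46, 48, 49]]
p = 2
sizes = [len(group1[0]), len(group2[0])]

A_j = [[[s / n for s in row] for row in cov(g)]
       for g, n in zip((group1, group2), sizes)]
A = [[A_j[0][r][c] + A_j[1][r][c] for c in range(2)] for r in range(2)]

# T2 = d' A^{-1} d, written out for a symmetric 2 x 2 matrix
d = [mean(a) - mean(b) for a, b in zip(group1, group2)]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
T2 = (A[1][1] * d[0] ** 2 - 2 * A[0][1] * d[0] * d[1] + A[0][0] * d[1] ** 2) / det

f2 = trace_terms(A) / sum(trace_terms(Aj) / (n - 1) for Aj, n in zip(A_j, sizes))
nu_N = f2 - p + 1
F_NV = nu_N / (p * f2) * T2

# Upper-tail probability of F[2, nu_N]; valid because the numerator df is 2.
p_value = (1 + 2 * F_NV / nu_N) ** (-nu_N / 2)
print(round(f2, 1), round(nu_N, 1), round(F_NV, 1), round(p_value, 2))  # 5.4 4.4 2.0 0.24
```

The statistic, df, and p-value agree with the NV row for least-squares estimators in Table 3 to the precision reported there.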
Yao (1965)
The statistic $F_Y$ is referred to the critical value $F[p, \nu_K]$,
$$F_Y = \frac{\nu_K}{p f_1}\,T_2, \tag{A13}$$
where $f_1$ is given by equation A10 and $\nu_K$ again equals $f_1 - p + 1$.
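The Yao conversion can be sketched for the Table 1 data with least-squares estimators. This is an illustration for p = 2 only, with hypothetical helper names. It assumes that the quantities b_j are formed with the inverse of A = A1 + A2, as in Yao's (1965) derivation; that assumption reproduces the df reported in Table 3, but it is not confirmed to be what the authors' program does.

```python
def mean(col):
    return sum(col) / len(col)

def cov(cols):
    n = len(cols[0])
    m = [mean(c) for c in cols]
    return [[sum((a[i] - ma) * (b[i] - mb) for i in range(n)) / (n - 1)
             for b, mb in zip(cols, m)] for a, ma in zip(cols, m)]

def quad(u, m):
    """Quadratic form u' M u for a 2 x 2 matrix M."""
    return sum(u[r] * m[r][c] * u[c] for r in range(2) for c in range(2))

group1 = [[1, 28, 12, 13, 13, 11], [51, 48, 49, 51, 52, 47]]
group2 = [[19, 18, 18, 21, 19, 20, 22, 18], [46, 48, 50, 50, 45, 46, 48, 49]]
p = 2
sizes = [len(group1[0]), len(group2[0])]

A_j = [[[s / n for s in row] for row in cov(g)]
       for g, n in zip((group1, group2), sizes)]
A = [[A_j[0][r][c] + A_j[1][r][c] for c in range(2)] for r in range(2)]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

d = [mean(a) - mean(b) for a, b in zip(group1, group2)]
u = [A_inv[r][0] * d[0] + A_inv[r][1] * d[1] for r in range(2)]  # A^{-1} d
T2 = d[0] * u[0] + d[1] * u[1]

# 1/f1 = sum over groups of (b_j / T2)^2 / (n_j - 1), with b_j = u' A_j u
f1 = 1 / sum((quad(u, Aj) / T2) ** 2 / (n - 1) for Aj, n in zip(A_j, sizes))
nu_K = f1 - p + 1
F_Y = nu_K / (p * f1) * T2
print(round(f1, 1), round(nu_K, 1), round(F_Y, 1))  # 7.1 6.1 2.1
```

With these data the two b_j values sum to T2, so the weights b_j / T2 act as each group's share of the statistic.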
Footnotes
¹The sum of the eigenvalues of a matrix is called the trace of a matrix.
²The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set
Group Subject Yi1 Yi2
1 1 1 51
1 2 28 48
1 3 12 49
1 4 13 51
1 5 13 52
1 6 11 47
2 1 19 46
2 2 18 48
2 3 18 50
2 4 21 50
2 5 19 45
2 6 20 46
2 7 22 48
2 8 18 49
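Table 2. Summary Statistics for Least-Squares and Robust Estimators
Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$, $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$, $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$
Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$, $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$, $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$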
Table 3. Hypothesis Test Results for Multivariate Example Data Set
Procedure   Test Statistic             df                   p-value/Critical Value (CV)   Decision re: Null Hypothesis
Least-Squares Estimators
T2          T1 = 6.1; FT = 2.8         ν1 = 2; ν2 = 11      p = .106                      Fail to Reject
BF          TBF = 9.1; FBF = 3.7       ν1 = 4; ν2 = 4.4     p = .116                      Fail to Reject
J2          T2 = 5.0                   ν1 = 2               CV = 14.2                     Fail to Reject
J           T2 = 5.0; FJ = 2.3         ν1 = 2; ν2 = 6.9     p = .175                      Fail to Reject
K           FK = 2.5                   ν1 = 1.5; ν2 = 6.1   p = .164                      Fail to Reject
NV          T2 = 5.0; FNV = 2.0        ν1 = 2; ν2 = 4.4     p = .237                      Fail to Reject
Y           T2 = 5.0; FY = 2.1         ν1 = 2; ν2 = 6.1     p = .198                      Fail to Reject
Robust Estimators
T2          T1 = 59.0; FT = 25.8       ν1 = 2; ν2 = 7       p = .001                      Reject
BF          TBF = 131.2; FBF = 56.2    ν1 = 5; ν2 = 6.0     p = .001                      Reject
J2          T2 = 65.2                  ν1 = 2               CV = 13.3                     Reject
J           T2 = 65.2; FJ = 29.5       ν1 = 2; ν2 = 6.3     p = .001                      Reject
K           FK = 28.1                  ν1 = 2.0; ν2 = 6.6   p = .001                      Reject
NV          T2 = 65.2; FNV = 27.9      ν1 = 2; ν2 = 6.0     p = .001                      Reject
Y           T2 = 65.2; FY = 28.3       ν1 = 2; ν2 = 6.6     p = .001                      Reject
Note. T2 = Hotelling's (1931) $T^2$; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
result, the J1 procedure generally produces larger Type I error rates than the J2 procedure and
therefore is not often recommended (de la Rey & Nel, 1993). While the J2 procedure may offer
better Type I error control, it is computationally complex. The critical value for J2 is described in
the Appendix, along with the F-statistic conversions and df computations for the J, NV, and Y
procedures.
The K procedure is based on an F statistic. It is more complex than preceding test
procedures because eigenvalues and eigenvectors of the group covariance matrices must be
computed.¹ For completeness, the formula used to compute the K test statistic and its df are
found in the Appendix.
The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly
from the one presented in equation 8:
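$$T_{BF} = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\left[\left(1 - \frac{n_1}{N}\right)\mathbf{S}_1 + \left(1 - \frac{n_2}{N}\right)\mathbf{S}_2\right]^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2). \tag{9}$$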
As one can see, the test statistic in equation 9 weights the group covariance matrices in a
different way than the test statistic in equation 8. Again, for completeness, the numeric solutions
for the BF F statistic and df are found in the Appendix.
Among the BF, J2, J, K, NV, and Y tests, there appears to be no one best choice in all
data-analytic situations when the data are normally distributed, although a comprehensive
comparison of all of these procedures has not yet been conducted. Factors such as the degree of
covariance heterogeneity, total sample size, the degree of imbalance of the group sizes, and the
relationship between the group sizes and covariance matrices will determine which procedure
will afford the best Type I error control and maximum statistical power to detect group
differences. Christensen and Rencher (1997) noted, in their extensive comparison among the J2,
J, K, NV, and Y procedures, that the J and Y procedures could occasionally result in inflated
Type I error rates for negative pairings of group sizes and covariance matrices. These liberal
tendencies were exacerbated as the number of outcome variables increased. The authors
recommended the K procedure overall, observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates. However, the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative. Type I error rates as low as .02 for α = .05 were reported when p =
10, n1 = 30, and n2 = 20.
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that
the J1, J2, J, and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed. For a lognormal distribution, which has skewness
of 6.18,² they observed many instances in which empirical Type I error rates of all of these
procedures were more than four times the nominal level of significance. Wilcox (1995b) found
that the J test produced excessive Type I errors when sample sizes were small (i.e., n1 = 12 and
n2 = 18) and the data were generated from non-normal distributions; the K procedure became
conservative when the skewness was 6.18. For larger group sizes (i.e., n1 = 24 and n2 = 36), the J
procedure provided acceptable control of Type I errors when the data were only moderately
non-normal, but for the maximum skewness considered, it also became conservative. Fouladi and
Yockey (2002) found that the degree of departure from a multivariate normal distribution was a
less important predictor of Type I error performance than sample size. Across the range of
conditions which they examined, the Y test produced the greatest average Type I error rates and
the NV procedure the smallest. Error rates were only slightly influenced by the degree of
skewness or kurtosis of the data; however, these authors looked at only very modest departures
from a normal distribution; the maximum degree of skewness considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that
have been proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to "shift" the mean in their direction.
One should recognize at the outset that while robust estimators are insensitive to
departures from a normal distribution, they test a different null hypothesis than least-squares
estimators. The null hypothesis is about equality of trimmed population means. In other words,
one is testing a hypothesis that focuses on the bulk of the population, rather than the entire
population. Thus, if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters, then procedures
based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \dots \le Y_{(n_j)j}$ represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then, let $g_j = [\gamma n_j]$,
where γ represents the proportion of observations that are to be trimmed in each tail of the
distribution and [x] is the greatest integer ≤ x. The effective sample size for the jth group is
defined as $h_j = n_j - 2g_j$. The sample trimmed mean,
$$\bar{Y}_{tj} = \frac{1}{h_j}\sum_{i=g_j+1}^{n_j-g_j} Y_{(i)j}, \tag{10}$$
is computed by censoring the $g_j$ smallest and the $g_j$ largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed,
$$\bar{Y}_{wj} = \frac{1}{n_j}\sum_{i=1}^{n_j} Z_{ij}, \tag{11}$$
where
$$Z_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j}. \end{cases}$$
The Winsorized mean is obtained by replacing the $g_j$ smallest values with the next most extreme
value and the $g_j$ largest values with the next most extreme value. The Winsorized variance for
the jth group on a single outcome variable, $s^2_{wj}$, is
$$s^2_{wj} = \frac{\sum_{i=1}^{n_j}\left(Z_{ij} - \bar{Y}_{wj}\right)^2}{n_j - 1}, \tag{12}$$
and the standard error of the trimmed mean is $\sqrt{(n_j - 1)s^2_{wj}/[h_j(h_j - 1)]}$. The Winsorized
covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is
$$s_{wjqq'} = \frac{\sum_{i=1}^{n_j}\left(Z_{ijq} - \bar{Y}_{wjq}\right)\left(Z_{ijq'} - \bar{Y}_{wjq'}\right)}{n_j - 1}, \tag{13}$$
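and the Winsorized covariance matrix for the jth group is
$$\mathbf{S}_{wj} = \begin{bmatrix} s^2_{wj1} & s_{wj12} & \cdots & s_{wj1p} \\ \vdots & & \ddots & \vdots \\ s_{wjp1} & \cdots & & s^2_{wjp} \end{bmatrix}.$$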
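As a small worked illustration of the standard error of the trimmed mean, consider the first outcome variable for the first group of Table 1 under 20% trimming ($n_1 = 6$, $g_1 = 1$, $h_1 = 4$). The Winsorized scores are 11, 11, 12, 13, 13, 13, with mean 12.17 and sum of squared deviations 4.83; these intermediate values are computed here from the Table 1 data and are not reported in the original text.

```latex
s^2_{w1} = \frac{4.83}{6 - 1} = 0.967,
\qquad
\sqrt{\frac{(n_1 - 1)\,s^2_{w1}}{h_1(h_1 - 1)}}
  = \sqrt{\frac{5 \times 0.967}{4 \times 3}}
  = \sqrt{0.403}
  \approx 0.63 .
```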
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are
1, 11, 12, 13, 13, 28
and with 20 trimming g1 = [6 x 20] = 1 The scores of 1 and 28 are removed and the mean of
the remaining scores is computed which produces 212t1Y Table 2 contains the vectors of
trimmed means for the two groups
To Winsorize the data set for the first group on the first outcome variable the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores
producing the following set of ordered observations
131313121111
Multivariate Tests of Means
17
The Winsorized mean w1Y is 122 While in this example the Winsorized mean has the same
value as the trimmed mean these two estimators will not as a rule produce an equivalent result
Table 2 contains the Winsorized covariance matrices for the two groups
A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF J2 J K NV or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox 1995b) For example with robust
estimators the T1 statistic of equation 5 becomes
11
t2t1
1
21
w
T
t2t1t1 YYSYYhh
T (14)
where
1
1
1
12w
2
2
1w
1
1w SSS
h
n
h
n (15)
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators when the data followed a multivariate non-
normal distribution The Type I error performance of the J procedure with robust estimators was
similar to that of the K procedure when sample sizes were sufficiently large (ie n1 = 24 and n2
= 36) More importantly however there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts this was
observed both for heavy-tailed (ie extreme values in the tails) and skewed distributions The
differences in power were as great as 60 percentage points which represents a substantial
difference in the ability to detect outcome effects
Multivariate Tests of Means
18
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously that is
the BF J2 J K NV and Y procedures The module is written in the SAS language (SAS
Institute Inc 1999a) The IML (Interactive Matrix Language) component of SAS is required to
run this program This program can be used with either the PC or UNIX versions of SAS it was
generated using SAS version 82 The program can be downloaded from Lisa Lixrsquos website
httphomeccumanitobaca~lixlm
In order to run the program the data set group sizes proportion of trimming and
nominal level of significance α must be input It is assumed that the data set is complete so that
there are no missing values for any of the subjects on the outcome variables The program
generates as output the summary statistics for each group (ie means and covariance matrices)
For each test procedure the relevant T andor F statistics are produced along with the numerator
(ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value These results
can be produced for both least squares estimators and robust estimators with separate calls to the
program
To produce results for the example data of Table 1 with least-squares estimators the
following data input lines are required
Y=1 51 28 48 12 49 13 51 13 52 11 47 19 46 18 48 18 50 21 50 19 45 20 46 22 48 18
49
NX=6 8
PTRIM=0
ALPHA=05
RUN T2MULT
QUIT
Multivariate Tests of Means
19
The first line is used to specify the data set Y Notice that a comma separates the series of
measurements for each subject and parentheses enclose the data set The next line of code
specifies the group sizes Again parentheses enclose the element values No comma is required
to separate the two elements The next line of code specifies PTRIM the proportion of trimming
that will occur in each tail of the distribution If PTRIM=0 then no observations are trimmed or
Winsorized If PTRIM gt 0 then the proportion specified is the proportion of observations that
are trimmedWinsorized To produce the recommended 20 trimming PTRIM=20 Note that a
symmetric trimming approach is automatically assumed in the program trimming proportions
for the right and left tails are not specified The RUN T2MULT code invokes the program and
generates output Observe that each line of code ends with a semi-colon Also it is necessary that
these lines of code follow the FINISH statement that concludes the program module
Table 3 contains the output produced by the SASIML program for each test statistic for
the example data set For comparative purposes the program produces the results for Hotellingrsquos
(1931) T2 We do not recommend however that the results for this procedure be reported The
output for least-squares estimators is provided first A second invocation of the program with
PTRIM=20 is required to produce the results for robust estimators As noted previously the
program will output a T statistic andor an F statistic along with the df and p-value or critical
value This information is used to either reject or fail to reject the null hypothesis
As Table 3 reveals when least-squares estimators are adopted all of the test procedures
fail to reject the null hypothesis of equality of multivariate means One would conclude that there
is no difference between the two groups on the multivariate means However when robust
estimators are adopted all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means leading to the conclusion that the two groups do differ on the
Multivariate Tests of Means
20
multivariate means These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs
Conclusions and Recommendations
Although Hunter and Schmidt (1995) argue against the use of tests of statistical
significance their observation that ldquomethods of data analysis used in research have a major
effect on research progressrdquo (p 425) is certainly valid in the current discussion Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied If these assumptions are not
satisfied erroneous conclusions regarding the nature or presence of intervention effects may be
made
Type I error rates for negative pairings of group sizes and covariance matrices. These liberal
tendencies were exacerbated as the number of outcome variables increased. The authors
recommended the K procedure overall, observing that it offered the greatest statistical power
among those procedures that never produced inflated Type I error rates. However, the authors
report a number of situations of covariance heterogeneity in which the K procedure could
become quite conservative; Type I error rates as low as .02 for α = .05 were reported when p =
10, n1 = 30, and n2 = 20.
For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that
the J1, J2, J, and Y procedures could not control Type I error rates when the underlying
population distributions were highly skewed. For a lognormal distribution, which has skewness
of 6.18 (see Footnote 2), they observed many instances in which empirical Type I error rates of
all of these procedures were more than four times the nominal level of significance. Wilcox
(1995b) found that the J test produced excessive Type I errors when sample sizes were small
(i.e., n1 = 12 and n2 = 18) and the data were generated from non-normal distributions; the K
procedure became conservative when the skewness was 6.18. For larger group sizes (i.e., n1 = 24
and n2 = 36), the J procedure provided acceptable control of Type I errors when the data were
only moderately non-normal, but for the maximum skewness considered it also became
conservative. Fouladi and Yockey (2002) found that the degree of departure from a multivariate
normal distribution was a less important predictor of Type I error performance than sample size.
Across the range of conditions which they examined, the Y test produced the greatest average
Type I error rates and the NV procedure the smallest. Error rates were only slightly influenced by
the degree of skewness or kurtosis of the data; however, these authors looked at only very modest
departures from a normal distribution: the maximum degree of skewness considered was .75.
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that
have been proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to "shift" the mean in their direction.
One should recognize at the outset that, while robust estimators are insensitive to
departures from a normal distribution, they test a different null hypothesis than least-squares
estimators. The null hypothesis is about equality of trimmed population means. In other words,
one is testing a hypothesis that focuses on the bulk of the population rather than the entire
population. Thus, if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters, then procedures
based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \dots \le Y_{(n_j)j}$
represent the ordered observations for the jth group on a single outcome variable. In other words,
one begins by ordering the observations for each group from smallest to largest. Then, let
$g_j = [\gamma n_j]$, where $\gamma$ represents the proportion of observations that are to be trimmed in
each tail of the distribution and $[x]$ is the greatest integer $\le x$. The effective sample size for
the jth group is defined as $h_j = n_j - 2g_j$. The sample trimmed mean

$$\bar{Y}_{tj} = \frac{1}{h_j} \sum_{i=g_j+1}^{n_j-g_j} Y_{(i)j} \tag{10}$$

is computed by censoring the $g_j$ smallest and the $g_j$ largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
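The computation in equation 10 can be sketched in a few lines of Python. This is only an illustration and not the authors' SAS/IML module; the function name `trimmed_mean` and the small epsilon that guards against floating-point round-off in the greatest-integer step are assumptions introduced here.

```python
import math

def trimmed_mean(scores, gamma=0.20):
    """Return (trimmed mean, g, h) for one group on one outcome variable."""
    ordered = sorted(scores)                # Y_(1)j <= ... <= Y_(nj)j
    n = len(ordered)
    g = math.floor(gamma * n + 1e-9)        # g_j = [gamma * n_j]; epsilon is an assumption
    h = n - 2 * g                           # effective sample size h_j
    if h <= 0:
        raise ValueError("gamma trims away every observation")
    return sum(ordered[g:n - g]) / h, g, h  # mean of the h_j central scores

group1_y1 = [1, 28, 12, 13, 13, 11]         # Table 1, group 1, first outcome
print(trimmed_mean(group1_y1))              # (12.25, 1, 4)
```

With 20 percent trimming and six observations, one score is censored in each tail, so the mean is taken over the four central scores.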
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed,

$$\bar{Y}_{wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Z_{ij} \tag{11}$$

where

$$Z_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j} \end{cases}$$

The Winsorized mean is obtained by replacing the $g_j$ smallest values with the next most extreme
value and the $g_j$ largest values with the next most extreme value. The Winsorized variance for
the jth group on a single outcome variable, $s^2_{wj}$, is

$$s^2_{wj} = \frac{\sum_{i=1}^{n_j} (Z_{ij} - \bar{Y}_{wj})^2}{n_j - 1} \tag{12}$$

and the standard error of the trimmed mean is $\sqrt{(n_j - 1)\, s^2_{wj} / [h_j (h_j - 1)]}$. The Winsorized
covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is

$$s_{wjqq'} = \frac{\sum_{i=1}^{n_j} (Z_{ijq} - \bar{Y}_{wjq})(Z_{ijq'} - \bar{Y}_{wjq'})}{n_j - 1} \tag{13}$$

and the Winsorized covariance matrix for the jth group is

$$\mathbf{S}_{wj} = \begin{bmatrix} s^2_{wj1} & s_{wj12} & \cdots & s_{wj1p} \\ \vdots & \vdots & & \vdots \\ s_{wjp1} & s_{wjp2} & \cdots & s^2_{wjp} \end{bmatrix}$$

To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are
1, 11, 12, 13, 13, 28
and with 20% trimming, g1 = [6 × .20] = 1. The scores of 1 and 28 are removed and the mean of
the remaining scores is computed, which produces $\bar{Y}_{t1} = 12.2$. Table 2 contains the vectors of
trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:
11, 11, 12, 13, 13, 13
The Winsorized mean $\bar{Y}_{w1}$ is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result.
Table 2 contains the Winsorized covariance matrices for the two groups.
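The Winsorizing step and equations 11 to 13 can be sketched as follows. This is a hedged illustration in plain Python, not the authors' SAS/IML code; the function names are hypothetical, and the sketch assumes each outcome variable is Winsorized separately within a group while the subject order is kept, so that scores stay paired for the covariance.

```python
import math

def winsorize(scores, gamma=0.20):
    """Pull the g smallest and g largest scores in to the next most extreme value."""
    ordered = sorted(scores)
    n = len(ordered)
    g = math.floor(gamma * n + 1e-9)          # g_j = [gamma * n_j]
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(v, low), high) for v in scores]   # Z_ij, original subject order

def winsorized_cov(rows, gamma=0.20):
    """rows: one list of p outcome scores per subject. Returns (means, matrix)."""
    n, p = len(rows), len(rows[0])
    cols = [winsorize([row[q] for row in rows], gamma) for q in range(p)]
    means = [sum(col) / n for col in cols]            # equation 11
    cov = [[sum((cols[a][i] - means[a]) * (cols[b][i] - means[b]) for i in range(n))
            / (n - 1) for b in range(p)] for a in range(p)]   # equations 12 and 13
    return means, cov

group1 = [[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]]  # Table 1
means, cov = winsorized_cov(group1)
print([round(m, 1) for m in means])
print([[round(v, 1) for v in row] for row in cov])
```

Clamping every score to the interval between the (g + 1)th smallest and (g + 1)th largest order statistics is equivalent to the three-case definition of Z given with equation 11.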
A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators the T1 statistic of equation 5 becomes

$$T_{t1} = \left( \frac{1}{h_1} + \frac{1}{h_2} \right)^{-1} (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^T \mathbf{S}_w^{-1} (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2}) \tag{14}$$

where

$$\mathbf{S}_w = \frac{(n_1 - 1)\mathbf{S}_{w1} + (n_2 - 1)\mathbf{S}_{w2}}{h_1 + h_2 - 2} \tag{15}$$

Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators when the data followed a multivariate non-
normal distribution. The Type I error performance of the J procedure with robust estimators was
similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24 and n2
= 36). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
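As a rough check on the robust T1 statistic, the sketch below computes it for the Table 1 data in plain Python. It is an illustration under stated assumptions rather than the authors' program: it handles only p = 2 outcomes (so the matrix inverse can be written out explicitly), the function names are hypothetical, and the pooled Winsorized matrix is taken to be the sum of the two groups' Winsorized sums of squares and cross-products divided by h1 + h2 - 2.

```python
import math

GROUP1 = [[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]]
GROUP2 = [[19, 46], [18, 48], [18, 50], [21, 50], [19, 45], [20, 46], [22, 48], [18, 49]]

def robust_summary(rows, gamma):
    """Return h, the trimmed mean vector, and (n - 1) * S_w for one group (p = 2)."""
    n = len(rows)
    g = math.floor(gamma * n + 1e-9)
    h = n - 2 * g
    t_means, z_cols = [], []
    for q in range(2):
        col = [row[q] for row in rows]
        ordered = sorted(col)
        t_means.append(sum(ordered[g:n - g]) / h)              # trimmed mean
        low, high = ordered[g], ordered[n - g - 1]
        z_cols.append([min(max(v, low), high) for v in col])   # Winsorized scores
    z_means = [sum(z) / n for z in z_cols]
    sscp = [[sum((z_cols[a][i] - z_means[a]) * (z_cols[b][i] - z_means[b])
                 for i in range(n)) for b in range(2)] for a in range(2)]
    return h, t_means, sscp

def robust_t1(rows1, rows2, gamma=0.20):
    h1, m1, sscp1 = robust_summary(rows1, gamma)
    h2, m2, sscp2 = robust_summary(rows2, gamma)
    s = [[(sscp1[a][b] + sscp2[a][b]) / (h1 + h2 - 2) for b in range(2)] for a in range(2)]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    # d' S^{-1} d written out for a symmetric 2 x 2 matrix
    quad = (s[1][1] * d[0] ** 2 - 2 * s[0][1] * d[0] * d[1] + s[0][0] * d[1] ** 2) / det
    return quad / (1 / h1 + 1 / h2)

print(round(robust_t1(GROUP1, GROUP2), 1))        # robust estimators
print(round(robust_t1(GROUP1, GROUP2, 0.0), 1))   # no trimming: usual pooled T1
```

Setting the trimming proportion to zero reduces the sketch to the least-squares pooled statistic, which gives a convenient way to compare the two sets of estimators on the same data.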
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm.
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:
Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48,
18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified. The RUN T2MULT code invokes the program and
generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that
these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) T2. We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article we have reviewed the shortcomings of Hotelling's (1931) T2 test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual least-
squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics - Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and two-
sample T2 tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics - Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T2 procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 86, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T2. Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 21, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T2 and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of statistics, Vol. 1 (pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the $T_0^2$ test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized Behrens-
Fisher problem. Communications in Statistics - Simulation and Computation, 26, 1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics - Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and self-
reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T2. Multivariate
Behavioral Research, 21, 169-186.
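Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) T2 Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then

$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF} \tag{A1}$$

where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and

$$f_2 = \frac{\mathrm{tr}(\mathbf{G}_1^2) + (\mathrm{tr}\,\mathbf{G}_1)^2}{\dfrac{1}{n_1 - 1}\left[ \mathrm{tr}\{(\bar{w}_1 \mathbf{S}_1)^2\} + \{\mathrm{tr}(\bar{w}_1 \mathbf{S}_1)\}^2 \right] + \dfrac{1}{n_2 - 1}\left[ \mathrm{tr}\{(\bar{w}_2 \mathbf{S}_2)^2\} + \{\mathrm{tr}(\bar{w}_2 \mathbf{S}_2)\}^2 \right]} \tag{A2}$$

In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1 \mathbf{S}_1 + \bar{w}_2 \mathbf{S}_2$. The test statistic
$F_{BF}$ is compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where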
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and G2 = w1S1 + w2S2
James (1954) Second Order
The test statistic T2 of equation 8 is compared to the critical value $\chi^2_p (A + \chi^2_p B) + q$,
where $\chi^2_p$ is the 1 - α percentile point of the χ2 distribution with p df,

$$A = 1 + \frac{1}{2p} \sum_{j=1}^{2} \frac{\left[ \mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j) \right]^2}{n_j - 1} \tag{A4}$$

$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and

$$B = \frac{1}{p(p+2)} \sum_{j=1}^{2} \frac{1}{n_j - 1} \left\{ \mathrm{tr}\left[ (\mathbf{A}^{-1}\mathbf{A}_j)^2 \right] + \frac{1}{2} \left[ \mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j) \right]^2 \right\} \tag{A5}$$

The constant q is based on a lengthy formula which has not been reproduced here; it can be
found in equation 67 of James (1954).
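Johansen (1980)
Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and

$$C = \frac{1}{2} \sum_{j=1}^{2} \frac{1}{n_j - 1} \left\{ \mathrm{tr}\left[ (\mathbf{A}^{-1}\mathbf{A}_j)^2 \right] + \left[ \mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j) \right]^2 \right\} \tag{A6}$$

The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.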
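The Johansen computation can be sketched in plain Python for the least-squares estimators of the Table 1 data. This is an illustration, not the authors' SAS/IML module: it is restricted to p = 2 so that the matrix algebra can be written out by hand, the helper names are hypothetical, and it assumes the divisor p + 2 in the expression for c2, following the general Welch-James form given by Johansen (1980).

```python
GROUP1 = [[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]]
GROUP2 = [[19, 46], [18, 48], [18, 50], [21, 50], [19, 45], [20, 46], [22, 48], [18, 49]]
P = 2  # number of outcome variables handled by this sketch

def mean_cov(rows):
    """Least-squares mean vector and covariance matrix for one group."""
    n = len(rows)
    mean = [sum(r[q] for r in rows) / n for q in range(P)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(P)] for a in range(P)]
    return n, mean, cov

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(P)) for j in range(P)] for i in range(P)]

def inverse2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def trace(m):
    return m[0][0] + m[1][1]

def johansen(rows1, rows2):
    """Return (T2, F_J, nu_J) for two independent groups."""
    n1, mean1, s1 = mean_cov(rows1)
    n2, mean2, s2 = mean_cov(rows2)
    a1 = [[v / n1 for v in row] for row in s1]         # A_1 = S_1 / n_1
    a2 = [[v / n2 for v in row] for row in s2]         # A_2 = S_2 / n_2
    a_inv = inverse2([[a1[i][j] + a2[i][j] for j in range(P)] for i in range(P)])
    d = [mean1[q] - mean2[q] for q in range(P)]
    t2 = sum(d[i] * a_inv[i][j] * d[j] for i in range(P) for j in range(P))
    c = 0.0
    for a_j, n_j in ((a1, n1), (a2, n2)):              # equation A6
        m = matmul(a_inv, a_j)
        c += 0.5 * (trace(matmul(m, m)) + trace(m) ** 2) / (n_j - 1)
    c2 = P + 2 * c - 6 * c / (P + 2)                   # assumed divisor: p + 2
    return t2, t2 / c2, P * (P + 2) / (3 * c)

t2, f_j, nu_j = johansen(GROUP1, GROUP2)
print(round(t2, 1), round(f_j, 1), round(nu_j, 1))
```

Under these assumptions the sketch should agree, after rounding, with the least-squares J row of Table 3.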
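Kim (1992)
The K procedure is based on the test statistic

$$F_K = \frac{\nu_K\, (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T \mathbf{V}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)}{c\, m\, f_1} \tag{A7}$$

where

$$\mathbf{V} = \mathbf{A}_1 + r^2 \mathbf{A}_2 + 2r\, \mathbf{A}_2^{1/2} \left( \mathbf{A}_2^{-1/2} \mathbf{A}_1 \mathbf{A}_2^{-1/2} \right)^{1/2} \mathbf{A}_2^{1/2}$$

$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l} \tag{A8}$$

$$m = \frac{\left( \sum_{l=1}^{p} h_l \right)^2}{\sum_{l=1}^{p} h_l^2} \tag{A9}$$

$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1 \mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1 \mathbf{A}_2^{-1}|^{1/(2p)}$,
and | | is the determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$,
where $\nu_K = f_1 - p + 1$,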
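The constants c and m of equations A8 and A9 can be sketched for the least-squares estimators of the Table 1 data as follows. The sketch is plain Python and only an illustration: it assumes p = 2, so the eigenvalues of the 2 x 2 product matrix can be obtained from its trace and determinant, and the helper names are hypothetical rather than part of the authors' program.

```python
import math

GROUP1 = [[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]]
GROUP2 = [[19, 46], [18, 48], [18, 50], [21, 50], [19, 45], [20, 46], [22, 48], [18, 49]]
P = 2  # number of outcome variables handled by this sketch

def scaled_cov(rows):
    """Return A_j = S_j / n_j for one group."""
    n = len(rows)
    mean = [sum(r[q] for r in rows) / n for q in range(P)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1) / n
             for b in range(P)] for a in range(P)]

def kim_constants(rows1, rows2):
    """Return (eigenvalues d_l, r, c, m) for two groups."""
    a1, a2 = scaled_cov(rows1), scaled_cov(rows2)
    det2 = a2[0][0] * a2[1][1] - a2[0][1] * a2[1][0]
    a2_inv = [[a2[1][1] / det2, -a2[0][1] / det2], [-a2[1][0] / det2, a2[0][0] / det2]]
    prod = [[sum(a1[i][k] * a2_inv[k][j] for k in range(P)) for j in range(P)]
            for i in range(P)]                               # A_1 A_2^{-1}
    tr = prod[0][0] + prod[1][1]
    det = prod[0][0] * prod[1][1] - prod[0][1] * prod[1][0]
    root = math.sqrt(tr * tr - 4 * det)                      # real for positive definite A_j
    d = [(tr + root) / 2, (tr - root) / 2]                   # eigenvalues d_l
    r = det ** (1 / (2 * P))                                 # r = |A_1 A_2^{-1}|^{1/(2p)}
    h = [(dl + 1) / (math.sqrt(dl) + r) ** 2 for dl in d]
    sum_h, sum_h2 = sum(h), sum(x * x for x in h)
    return d, r, sum_h2 / sum_h, sum_h ** 2 / sum_h2         # c (A8) and m (A9)

d, r, c, m = kim_constants(GROUP1, GROUP2)
print([round(x, 2) for x in d], round(r, 2), round(c, 2), round(m, 1))
```

The value of m is the numerator df of the K test, so it can be compared against the ν1 entry reported for the K procedure under least-squares estimators.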
Multivariate Tests of Means
28
2
1j
2
21
1
1
jj b
T
nf (A10)
and 21
1-1-T
21 YYVAVYY jjb
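The constants c and m of equations A8 and A9 can be sketched in a few lines of Python. This is a minimal illustration restricted to p = 2, not the authors' SAS/IML module. The helper names (`inv2`, `matmul2`, `kim_c_m`) are hypothetical, and the covariance matrices are recomputed from the Table 1 data on the assumption that $\mathbf{A}_j = \mathbf{S}_j/n_j$ with least-squares estimators.

```python
import math

def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(x, y):
    """Product of two 2 x 2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def kim_c_m(a1, a2):
    """Return (c, m) of equations A8 and A9 for p = 2 outcome variables."""
    p = 2
    prod = matmul2(a1, inv2(a2))                      # A1 * A2^{-1}
    trace = prod[0][0] + prod[1][1]
    det = prod[0][0] * prod[1][1] - prod[0][1] * prod[1][0]
    disc = math.sqrt(trace * trace - 4.0 * det)
    d = [(trace + disc) / 2.0, (trace - disc) / 2.0]  # eigenvalues d_l
    r = det ** (1.0 / (2 * p))                        # |A1 A2^{-1}|^{1/(2p)}
    h = [(dl + 1.0) / (math.sqrt(dl) + r) ** 2 for dl in d]
    sum_h = sum(h)
    sum_h2 = sum(x * x for x in h)
    return sum_h2 / sum_h, sum_h ** 2 / sum_h2

# Least-squares covariance matrices recomputed from Table 1 (assumed inputs).
S1 = [[74.8, -7.0], [-7.0, 58.0 / 15.0]]
S2 = [[15.875 / 7.0, -0.25 / 7.0], [-0.25 / 7.0, 25.5 / 7.0]]
A1 = [[v / 6 for v in row] for row in S1]
A2 = [[v / 8 for v in row] for row in S2]
c, m = kim_c_m(A1, A2)
print(round(c, 2), round(m, 2))
```

For these inputs m is roughly 1.5, in line with the numerator df reported for the K procedure with least-squares estimators in Table 3.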
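Nel and van der Merwe (1986)
Let
$$F_{NV} = \frac{\nu_N}{p f_2}\, T_2, \tag{A11}$$
where $\nu_N = f_2 - p + 1$ and
$$f_2 = \frac{\mathrm{tr}(\bar{\mathbf{A}}^2) + (\mathrm{tr}\,\bar{\mathbf{A}})^2}{\sum_{j=1}^{2}\dfrac{1}{n_j - 1}\left[\mathrm{tr}(\mathbf{A}_j^2) + (\mathrm{tr}\,\mathbf{A}_j)^2\right]}. \tag{A12}$$
The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.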
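As a numeric check on equations A11 and A12, the following Python sketch evaluates the NV statistic for the Table 1 data with least-squares estimators. It is an illustration for p = 2 only. The function names are hypothetical, and it assumes that $T_2$ of equation 8 is the quadratic form $(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}(\mathbf{A}_1 + \mathbf{A}_2)^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$ with $\mathbf{A}_j = \mathbf{S}_j/n_j$, an assumption that reproduces the value of 5.0 shown in Table 3.

```python
def tr_terms(m):
    """Return tr(m^2) + (tr m)^2 for a 2 x 2 matrix."""
    tr_m2 = m[0][0] ** 2 + 2.0 * m[0][1] * m[1][0] + m[1][1] ** 2
    return tr_m2 + (m[0][0] + m[1][1]) ** 2

def nel_van_der_merwe(d, a1, a2, n1, n2):
    """Return (T2, F_NV, nu_N) for p = 2; d is the vector of mean differences."""
    p = 2
    a = [[a1[i][j] + a2[i][j] for j in range(2)] for i in range(2)]  # A-bar
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    # Quadratic form d' A-bar^{-1} d written out for the 2 x 2 case.
    t2 = (a[1][1] * d[0] ** 2 - (a[0][1] + a[1][0]) * d[0] * d[1]
          + a[0][0] * d[1] ** 2) / det
    f2 = tr_terms(a) / (tr_terms(a1) / (n1 - 1) + tr_terms(a2) / (n2 - 1))
    nu = f2 - p + 1
    return t2, nu * t2 / (p * f2), nu

# Least-squares summary statistics recomputed from Table 1 (assumed inputs).
S1 = [[74.8, -7.0], [-7.0, 58.0 / 15.0]]
S2 = [[15.875 / 7.0, -0.25 / 7.0], [-0.25 / 7.0, 25.5 / 7.0]]
A1 = [[v / 6 for v in row] for row in S1]
A2 = [[v / 8 for v in row] for row in S2]
diff = [13.0 - 19.375, 298.0 / 6.0 - 47.75]
t2, f_nv, nu_n = nel_van_der_merwe(diff, A1, A2, 6, 8)
print(round(t2, 1), round(f_nv, 1), round(nu_n, 1))
```

The rounded results agree with the NV row for least-squares estimators in Table 3.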
Yao (1965)
The statistic $F_Y$ is referred to the critical value $F[p, \nu_K]$, with
$$F_Y = \frac{\nu_K}{p f_1}\, T_2, \tag{A13}$$
where $f_1$ is given by equation A10 and $\nu_K$ again equals $f_1 - p + 1$.
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of a matrix.
2. The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set
Group Subject Yi1 Yi2
1 1 1 51
1 2 28 48
1 3 12 49
1 4 13 51
1 5 13 52
1 6 11 47
2 1 19 46
2 2 18 48
2 3 18 50
2 4 21 50
2 5 19 45
2 6 20 46
2 7 22 48
2 8 18 49
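Table 2. Summary Statistics for Least-Squares and Robust Estimators
Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$, $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$, $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$
Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$, $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$, $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$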
Table 3. Hypothesis Test Results for Multivariate Example Data Set
Procedure   Test Statistic               df                    p-value/Critical value (CV)   Decision re Null Hypothesis
Least-Squares Estimators
T²          T1 = 6.1; FT = 2.8           ν1 = 2; ν2 = 11       p = .106                      Fail to Reject
BF          TBF = 9.1; FBF = 3.7         ν1 = 4; ν2 = 4.4      p = .116                      Fail to Reject
J2          T2 = 5.0                     ν1 = 2                CV = 14.2                     Fail to Reject
J           T2 = 5.0; FJ = 2.3           ν1 = 2; ν2 = 6.9      p = .175                      Fail to Reject
K           FK = 2.5                     ν1 = 1.5; ν2 = 6.1    p = .164                      Fail to Reject
NV          T2 = 5.0; FNV = 2.0          ν1 = 2; ν2 = 4.4      p = .237                      Fail to Reject
Y           T2 = 5.0; FY = 2.1           ν1 = 2; ν2 = 6.1      p = .198                      Fail to Reject
Robust Estimators
T²          T1 = 59.0; FT = 25.8         ν1 = 2; ν2 = 7        p = .001                      Reject
BF          TBF = 131.2; FBF = 56.2      ν1 = 5; ν2 = 6.0      p = .001                      Reject
J2          T2 = 65.2                    ν1 = 2                CV = 13.3                     Reject
J           T2 = 65.2; FJ = 29.5         ν1 = 2; ν2 = 6.3      p = .001                      Reject
K           FK = 28.1                    ν1 = 2.0; ν2 = 6.6    p = .001                      Reject
NV          T2 = 65.2; FNV = 27.9        ν1 = 2; ν2 = 6.0      p = .001                      Reject
Y           T2 = 65.2; FY = 28.3         ν1 = 2; ν2 = 6.6      p = .001                      Reject
Note. T² = Hotelling's (1931) T2; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
Non-normality. For univariate designs, a test procedure that is robust to the biasing
effects of non-normality may be obtained by adopting estimators of location and scale that are
insensitive to the presence of extreme scores and/or a skewed distribution (Keselman,
Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). A number of robust estimators have been
proposed in the literature; among these, the trimmed mean has received a great deal of
attention because of its good theoretical properties, ease of computation, and ease of
interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the
most extreme scores in the distribution. Hence, one removes the effects of the most extreme
scores, which have the tendency to "shift" the mean in their direction.
One should recognize at the outset that, while robust estimators are insensitive to
departures from a normal distribution, they test a different null hypothesis than least-squares
estimators. The null hypothesis is about equality of trimmed population means. In other words,
one is testing a hypothesis that focuses on the bulk of the population rather than the entire
population. Thus, if one subscribes to the position that inferences pertaining to robust parameters
are more valid than inferences pertaining to the usual least-squares parameters, then procedures
based on robust estimators should be adopted.
To illustrate the computation of the trimmed mean, let $Y_{(1)j} \le Y_{(2)j} \le \cdots \le Y_{(n_j)j}$ represent
the ordered observations for the jth group on a single outcome variable. In other words, one
begins by ordering the observations for each group from smallest to largest. Then let $g_j = [\gamma n_j]$,
where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the
distribution and [x] is the greatest integer ≤ x. The effective sample size for the jth group is
defined as $h_j = n_j - 2g_j$. The sample trimmed mean
$$\bar{Y}_{tj} = \frac{1}{h_j}\sum_{i=g_j+1}^{n_j-g_j} Y_{(i)j} \tag{10}$$
is computed by censoring the $g_j$ smallest and the $g_j$ largest observations. The most extreme scores
for each group of subjects are trimmed independently of the extreme scores for all other groups.
A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent
trimming is generally recommended (Wilcox, 1995a).
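The computation in equation 10 can be sketched in Python as follows. This is a minimal illustration, not the authors' SAS/IML module; the function name `trimmed_mean` and its `prop` argument are hypothetical.

```python
def trimmed_mean(values, prop=0.20):
    """Mean of the scores left after censoring g scores in each tail."""
    n = len(values)
    # g = [prop * n], the greatest integer <= prop * n. The tiny offset guards
    # against floating-point products that fall just below a whole number.
    g = int(prop * n + 1e-9)
    h = n - 2 * g                       # effective sample size
    if h <= 0:
        raise ValueError("trimming proportion leaves no observations")
    kept = sorted(values)[g:n - g]      # drop the g smallest and g largest
    return sum(kept) / h

group1_y1 = [1, 28, 12, 13, 13, 11]     # Table 1, group 1, first outcome
print(trimmed_mean(group1_y1))          # 12.25, shown as 12.2 in Table 2
```

Setting `prop=0` leaves the data untouched and returns the usual least-squares mean.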
The Winsorized variance is the theoretically correct measure of scale that corresponds to
the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group
covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first
computed,
$$\bar{Y}_{wj} = \frac{1}{n_j}\sum_{i=1}^{n_j} Z_{ij}, \tag{11}$$
where
$$Z_{ij} = \begin{cases} Y_{(g_j+1)j} & \text{if } Y_{ij} \le Y_{(g_j+1)j} \\ Y_{ij} & \text{if } Y_{(g_j+1)j} < Y_{ij} < Y_{(n_j-g_j)j} \\ Y_{(n_j-g_j)j} & \text{if } Y_{ij} \ge Y_{(n_j-g_j)j}. \end{cases}$$
The Winsorized mean is obtained by replacing the $g_j$ smallest values with the next most extreme
value and the $g_j$ largest values with the next most extreme value. The Winsorized variance for
the jth group on a single outcome variable, $s^2_{wj}$, is
$$s^2_{wj} = \frac{\sum_{i=1}^{n_j}\left(Z_{ij} - \bar{Y}_{wj}\right)^2}{n_j - 1}, \tag{12}$$
and the standard error of the trimmed mean is $\sqrt{(n_j - 1)\,s^2_{wj}/[h_j(h_j - 1)]}$. The Winsorized
covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is
$$s_{wjqq'} = \frac{\sum_{i=1}^{n_j}\left(Z_{ijq} - \bar{Y}_{wjq}\right)\left(Z_{ijq'} - \bar{Y}_{wjq'}\right)}{n_j - 1}, \tag{13}$$
and the Winsorized covariance matrix for the jth group is
$$\mathbf{S}_{wj} = \begin{bmatrix} s^2_{wj1} & s_{wj12} & \cdots & s_{wj1p} \\ & \ddots & & \vdots \\ & & & s^2_{wjp} \end{bmatrix}.$$
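Equations 11 through 13 translate into a short Python sketch. It is illustrative only: the function names are hypothetical, and each outcome variable is Winsorized separately while the pairing of scores within subjects is kept, which is the reading that reproduces the Winsorized covariance matrices of Table 2.

```python
import math

def winsorize(values, prop=0.20):
    """Replace the g smallest and g largest scores by the next most extreme."""
    n = len(values)
    g = int(prop * n + 1e-9)                      # g = [prop * n]
    ordered = sorted(values)
    low, high = ordered[g], ordered[n - g - 1]    # Y_(g+1) and Y_(n-g)
    return [min(max(v, low), high) for v in values]

def winsorized_cov(columns, prop=0.20):
    """Winsorized covariance matrix; `columns` holds one list per outcome."""
    z = [winsorize(col, prop) for col in columns]
    n = len(z[0])
    means = [sum(col) / n for col in z]           # Winsorized means
    p = len(z)
    return [[sum((z[q][i] - means[q]) * (z[r][i] - means[r])
                 for i in range(n)) / (n - 1)
             for r in range(p)] for q in range(p)]

def trimmed_mean_se(values, prop=0.20):
    """Standard error of the trimmed mean for a single outcome variable."""
    n = len(values)
    g = int(prop * n + 1e-9)
    h = n - 2 * g
    s2w = winsorized_cov([values], prop)[0][0]
    return math.sqrt((n - 1) * s2w / (h * (h - 1)))

# Group 1 of Table 1: first and second outcome variables.
y1 = [1, 28, 12, 13, 13, 11]
y2 = [51, 48, 49, 51, 52, 47]
for row in winsorized_cov([y1, y2]):
    print([round(v, 1) for v in row])
```

The printed rows agree with the Winsorized covariance matrix for the first group in Table 2.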
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are
1 11 12 13 13 28
and with 20% trimming, $g_1$ = [6 × .20] = 1. The scores of 1 and 28 are removed and the mean of
the remaining scores is computed, which produces $\bar{Y}_{t1} = 12.2$. Table 2 contains the vectors of
trimmed means for the two groups.
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:
11 11 12 13 13 13
The Winsorized mean $\bar{Y}_{w1}$ is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result.
Table 2 contains the Winsorized covariance matrices for the two groups.
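A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators, the T1 statistic of equation 5 becomes
$$T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^{T}\left[\left(\frac{1}{h_1} + \frac{1}{h_2}\right)\mathbf{S}_w\right]^{-1}(\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2}), \tag{14}$$
where
$$\mathbf{S}_w = \frac{(n_1 - 1)\,\mathbf{S}_{w1} + (n_2 - 1)\,\mathbf{S}_{w2}}{(h_1 - 1) + (h_2 - 1)}. \tag{15}$$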
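Equations 14 and 15 can be checked against Table 3 with the sketch below. It is limited to p = 2 so that the matrix inverse can be written out by hand. The function names are hypothetical, and the inputs are the trimmed means and Winsorized covariance matrices recomputed from Table 1 at full precision.

```python
def quad_form_2x2(d, m):
    """Return d' m^{-1} d for a 2 x 2 matrix m and a length-2 vector d."""
    (a, b), (c, e) = m
    det = a * e - b * c
    return (e * d[0] ** 2 - (b + c) * d[0] * d[1] + a * d[1] ** 2) / det

def robust_t1(mean1, mean2, sw1, sw2, n1, n2, h1, h2):
    """T1 statistic with trimmed means and a pooled Winsorized covariance."""
    pooled = [[((n1 - 1) * sw1[r][c] + (n2 - 1) * sw2[r][c]) / (h1 + h2 - 2)
               for c in range(2)] for r in range(2)]          # equation 15
    scale = 1.0 / h1 + 1.0 / h2
    scaled = [[scale * v for v in row] for row in pooled]
    diff = [mean1[k] - mean2[k] for k in range(2)]
    return quad_form_2x2(diff, scaled)                         # equation 14

# Robust summary statistics recomputed from Table 1 with 20% trimming.
yt1 = [12.25, 49.75]
yt2 = [115.0 / 6.0, 287.0 / 6.0]
sw1 = [[29.0 / 30.0, 4.0 / 15.0], [4.0 / 15.0, 34.0 / 15.0]]
sw2 = [[11.5 / 7.0, -0.75 / 7.0], [-0.75 / 7.0, 20.875 / 7.0]]
t1 = robust_t1(yt1, yt2, sw1, sw2, n1=6, n2=8, h1=4, h2=6)
print(round(t1, 1))
```

The result rounds to the T1 value listed for robust estimators in Table 3.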
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators and the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators was
similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24 and n2
= 36). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:
Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48,
18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified. The RUN T2MULT code invokes the program and
generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that
these lines of code follow the FINISH statement that concludes the program module.
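The role of the two inputs Y and NX can be illustrated outside SAS with a short Python sketch. It only mirrors the layout described above: rows of Y are subjects, and the group sizes in NX say how many consecutive rows belong to each group. The function name `split_groups` is hypothetical and is not part of the T2MULT module.

```python
def split_groups(rows, sizes):
    """Split stacked subject rows into groups of the given sizes."""
    if sum(sizes) != len(rows):
        raise ValueError("group sizes must add up to the number of rows")
    groups, start = [], 0
    for n in sizes:
        groups.append(rows[start:start + n])
        start += n
    return groups

# Example data of Table 1: one (Yi1, Yi2) pair per subject.
Y = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47),
     (19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46),
     (22, 48), (18, 49)]
NX = [6, 8]

for group in split_groups(Y, NX):
    n = len(group)
    means = [sum(row[k] for row in group) / n for k in range(2)]
    print(n, [round(m, 3) for m in means])
```

The group means printed here are the least-squares means that appear, rounded, in Table 2.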
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) T2. We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate trimmed means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling's (1931) T2 test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T2 tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T2 procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T2. Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T2 and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of statistics, Vol. 1 (pp. 199-236). North-Holland: New York.
Ito, K., & Schull, W. J. (1964). On the robustness of the $T_0^2$ test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T2. Multivariate
Behavioral Research, 21, 169-186.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T² tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 86, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 21, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of statistics, Vol. 1 (pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size? Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate
Behavioral Research, 21, 169-186.
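Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) T² Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then
\[ F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF}, \tag{A1} \]
where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and
\[ f_2 = \frac{\mathrm{tr}^2(\mathbf{G}_1) + \mathrm{tr}(\mathbf{G}_1^2)}
{\dfrac{1}{n_1 - 1}\left[\mathrm{tr}^2(\bar{w}_1\mathbf{S}_1) + \mathrm{tr}(\bar{w}_1^2\mathbf{S}_1^2)\right]
+ \dfrac{1}{n_2 - 1}\left[\mathrm{tr}^2(\bar{w}_2\mathbf{S}_2) + \mathrm{tr}(\bar{w}_2^2\mathbf{S}_2^2)\right]}. \tag{A2} \]
In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1\mathbf{S}_1 + \bar{w}_2\mathbf{S}_2$. The test statistic $F_{BF}$ is
compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where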
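[Equation A3: $\nu_{BF1}$ is given as a ratio of trace terms in $\mathbf{G}_1$, $\mathbf{G}_2$, $\mathbf{S}_1$, and $\mathbf{S}_2$; the expression is not legible in this copy. See Mehrotra (1997) and Vallejo, Fidalgo, & Fernandez (2001).] (A3)
and $\mathbf{G}_2 = w_1\mathbf{S}_1 + w_2\mathbf{S}_2$.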
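James (1954) Second Order
The test statistic $T_2$ of equation 8 is compared to the critical value $\chi^2_p (A + \chi^2_p B) + q$,
where $\chi^2_p$ is the 1 – α percentile point of the χ² distribution with p df,
\[ A = 1 + \frac{1}{2p}\left[\frac{\mathrm{tr}^2(\mathbf{A}^{-1}\mathbf{A}_1)}{n_1 - 1}
+ \frac{\mathrm{tr}^2(\mathbf{A}^{-1}\mathbf{A}_2)}{n_2 - 1}\right], \tag{A4} \]
$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and
\[ B = \frac{1}{p(p + 2)}\left[
\frac{\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \tfrac{1}{2}\mathrm{tr}^2(\mathbf{A}^{-1}\mathbf{A}_1)}{n_1 - 1}
+ \frac{\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \tfrac{1}{2}\mathrm{tr}^2(\mathbf{A}^{-1}\mathbf{A}_2)}{n_2 - 1}
\right]. \tag{A5} \]
The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).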
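Johansen (1980)
Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and
\[ C = \frac{1}{2}\sum_{j=1}^{2}\frac{1}{n_j - 1}
\left\{\mathrm{tr}\left[(\mathbf{A}^{-1}\mathbf{A}_j)^2\right] + \mathrm{tr}^2(\mathbf{A}^{-1}\mathbf{A}_j)\right\}. \tag{A6} \]
The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.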
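The Johansen computation can be checked by hand for the Table 1 data. The following is a minimal Python sketch, not the authors' SAS/IML module; the function names (`mean_and_cov`, `johansen`) are hypothetical, the helpers are written for p = 2 outcome variables only, and the sketch assumes $c_2 = p + 2C - 6C/(p + 2)$ with least-squares estimators.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(x, y):
    """Product of two 2 x 2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(m):
    return m[0][0] + m[1][1]

def mean_and_cov(rows):
    """Mean vector and least-squares covariance matrix of (y1, y2) rows."""
    n = len(rows)
    mean = [sum(r[q] for r in rows) / n for q in range(2)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(2)] for a in range(2)]
    return mean, cov

# Example data of Table 1.
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def johansen(groups, p=2):
    """Return (T2, F_J, nu_J) for two groups, following equation A6."""
    stats = [mean_and_cov(g) for g in groups]
    sizes = [len(g) for g in groups]
    A_j = [[[s / n for s in row] for row in cov]
           for (_, cov), n in zip(stats, sizes)]          # A_j = S_j / n_j
    A = [[A_j[0][i][k] + A_j[1][i][k] for k in range(2)] for i in range(2)]
    A_inv = inv2(A)
    d = [stats[0][0][q] - stats[1][0][q] for q in range(2)]
    T2 = sum(d[i] * A_inv[i][k] * d[k] for i in range(2) for k in range(2))
    C = 0.0
    for Aj, n in zip(A_j, sizes):
        M = matmul(A_inv, Aj)
        C += (trace(matmul(M, M)) + trace(M) ** 2) / (n - 1)
    C *= 0.5
    c2 = p + 2 * C - 6 * C / (p + 2)
    return T2, T2 / c2, p * (p + 2) / (3 * C)

T2, F_J, nu_J = johansen([group1, group2])
print(round(T2, 1), round(F_J, 1), round(nu_J, 1))
```

Under these assumptions the three values should land close to the least-squares J row of Table 3; a mismatch would point to a transcription problem in either the data or the formula.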
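Kim (1992)
The K procedure is based on the test statistic
\[ F_K = \frac{\nu_K}{c\, m\, f_1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\,\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2), \tag{A7} \]
where $\mathbf{V} = r^2\mathbf{A}_1 + \mathbf{A}_2 + 2r\,\mathbf{A}_2^{1/2}\left(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2}\right)^{1/2}\mathbf{A}_2^{1/2}$,
\[ c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8} \]
\[ m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9} \]
$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1\mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1\mathbf{A}_2^{-1}|^{1/(2p)}$,
and | | is the
determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$, where $\nu_K = f_1 - p + 1$,
\[ \frac{1}{f_1} = \sum_{j=1}^{2}\frac{1}{n_j - 1}\left(\frac{b_j}{T_2}\right)^2, \tag{A10} \]
and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\,\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.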
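The quantities $h_l$, c, and m depend only on the eigenvalues of $\mathbf{A}_1\mathbf{A}_2^{-1}$, so they are easy to compute for two outcome variables. The sketch below is illustrative Python, not the authors' SAS/IML code; `kim_c_and_m` is a hypothetical name, the eigenvalues are obtained from the 2 x 2 characteristic equation (so p = 2 is assumed), and c is taken to be $\sum h_l^2 / \sum h_l$, which is itself an assumption about the garbled equation A8.

```python
import math

def mean_and_cov(rows):
    """Mean vector and least-squares covariance matrix of (y1, y2) rows."""
    n = len(rows)
    mean = [sum(r[q] for r in rows) / n for q in range(2)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(2)] for a in range(2)]
    return mean, cov

# Example data of Table 1.
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def kim_c_and_m(groups, p=2):
    """Return (c, m) of equations A8 and A9 for two groups and p = 2."""
    A1, A2 = ([[s / len(g) for s in row] for row in mean_and_cov(g)[1]]
              for g in groups)                      # A_j = S_j / n_j
    det2 = A2[0][0] * A2[1][1] - A2[0][1] * A2[1][0]
    A2_inv = [[A2[1][1] / det2, -A2[0][1] / det2],
              [-A2[1][0] / det2, A2[0][0] / det2]]
    M = [[sum(A1[i][k] * A2_inv[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]                         # M = A1 * inverse(A2)
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = math.sqrt(tr * tr - 4 * det)
    d = [(tr + root) / 2, (tr - root) / 2]          # eigenvalues of M
    r = det ** (1 / (2 * p))
    h = [(dl + 1) / (math.sqrt(dl) + r) ** 2 for dl in d]
    sum_h = sum(h)
    sum_h2 = sum(x * x for x in h)
    return sum_h2 / sum_h, sum_h ** 2 / sum_h2

c, m = kim_c_and_m([group1, group2])
print(round(c, 3), round(m, 1))
```

With least-squares estimators, m is the numerator df reported for the K procedure, so it can be compared with the ν1 entry in the K row of Table 3.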
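Nel and van der Merwe (1986)
Let
\[ F_{NV} = \frac{\nu_N}{p f_2}\, T_2, \tag{A11} \]
where $\nu_N = f_2 - p + 1$ and
\[ f_2 = \frac{\mathrm{tr}(\mathbf{A}^2) + \mathrm{tr}^2(\mathbf{A})}
{\sum_{j=1}^{2}\dfrac{1}{n_j - 1}\left[\mathrm{tr}(\mathbf{A}_j^2) + \mathrm{tr}^2(\mathbf{A}_j)\right]}. \tag{A12} \]
The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.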
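Equations A11 and A12 can be evaluated directly from the group covariance matrices. The following Python sketch is an illustration under stated assumptions (two groups, p = 2, least-squares estimators); `nel_van_der_merwe` and the helper names are hypothetical and are not taken from the authors' program.

```python
def mean_and_cov(rows):
    """Mean vector and least-squares covariance matrix of (y1, y2) rows."""
    n = len(rows)
    mean = [sum(r[q] for r in rows) / n for q in range(2)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(2)] for a in range(2)]
    return mean, cov

def trace_terms(m):
    """tr(m^2) + tr(m)^2 for a 2 x 2 matrix."""
    tr_of_square = m[0][0] ** 2 + 2 * m[0][1] * m[1][0] + m[1][1] ** 2
    return tr_of_square + (m[0][0] + m[1][1]) ** 2

# Example data of Table 1.
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def nel_van_der_merwe(groups, p=2):
    """Return (T2, F_NV, nu_N) following equations A11 and A12."""
    stats = [mean_and_cov(g) for g in groups]
    sizes = [len(g) for g in groups]
    A_j = [[[s / n for s in row] for row in cov]
           for (_, cov), n in zip(stats, sizes)]          # A_j = S_j / n_j
    A = [[A_j[0][i][k] + A_j[1][i][k] for k in range(2)] for i in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    d1 = stats[0][0][0] - stats[1][0][0]
    d2 = stats[0][0][1] - stats[1][0][1]
    # Quadratic form d' A^{-1} d written out for a symmetric 2 x 2 matrix.
    T2 = (A[1][1] * d1 * d1 - 2 * A[0][1] * d1 * d2 + A[0][0] * d2 * d2) / det
    f2 = trace_terms(A) / sum(trace_terms(Aj) / (n - 1)
                              for Aj, n in zip(A_j, sizes))
    nu_N = f2 - p + 1
    return T2, nu_N / (p * f2) * T2, nu_N

T2, F_NV, nu_N = nel_van_der_merwe([group1, group2])
print(round(T2, 1), round(F_NV, 1), round(nu_N, 1))
```

The rounded values can be compared with the least-squares NV row of Table 3.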
Yao (1965)
The statistic $F_Y$ is referred to the critical value $F[p, \nu_K]$,
\[ F_Y = \frac{\nu_K}{p f_1}\, T_2, \tag{A13} \]
where $f_1$ is given by equation A10 and $\nu_K$ again equals $f_1 - p + 1$.
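A short numerical sketch of the Yao statistic follows. It is illustrative Python rather than the authors' code, `yao` is a hypothetical name, and it makes one assumption that should be read carefully: in the $b_j$ of equation A10 the matrix V is taken to be $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, as in Yao's original approximation, so that $b_1 + b_2 = T_2$. The helpers handle p = 2 only.

```python
def mean_and_cov(rows):
    """Mean vector and least-squares covariance matrix of (y1, y2) rows."""
    n = len(rows)
    mean = [sum(r[q] for r in rows) / n for q in range(2)]
    cov = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
            for b in range(2)] for a in range(2)]
    return mean, cov

def quad(u, m):
    """Quadratic form u' m u for a 2-vector u and 2 x 2 matrix m."""
    return sum(u[i] * m[i][k] * u[k] for i in range(2) for k in range(2))

# Example data of Table 1.
group1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
group2 = [(19, 46), (18, 48), (18, 50), (21, 50),
          (19, 45), (20, 46), (22, 48), (18, 49)]

def yao(groups, p=2):
    """Return (F_Y, nu) with V assumed equal to A = A1 + A2."""
    stats = [mean_and_cov(g) for g in groups]
    sizes = [len(g) for g in groups]
    A_j = [[[s / n for s in row] for row in cov]
           for (_, cov), n in zip(stats, sizes)]          # A_j = S_j / n_j
    A = [[A_j[0][i][k] + A_j[1][i][k] for k in range(2)] for i in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    A_inv = [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]
    d = [stats[0][0][q] - stats[1][0][q] for q in range(2)]
    u = [sum(A_inv[i][k] * d[k] for k in range(2)) for i in range(2)]
    T2 = sum(d[i] * u[i] for i in range(2))               # d' A^{-1} d
    inv_f1 = sum((quad(u, Aj) / T2) ** 2 / (n - 1)
                 for Aj, n in zip(A_j, sizes))            # equation A10
    f1 = 1 / inv_f1
    nu = f1 - p + 1
    return nu / (p * f1) * T2, nu

F_Y, nu_Y = yao([group1, group2])
print(round(F_Y, 1), round(nu_Y, 1))
```

Under that assumption the output can be set against the least-squares Y row of Table 3.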
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of a matrix.
2. The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set
Group  Subject  Yi1  Yi2
1      1          1   51
1      2         28   48
1      3         12   49
1      4         13   51
1      5         13   52
1      6         11   47
2      1         19   46
2      2         18   48
2      3         18   50
2      4         21   50
2      5         19   45
2      6         20   46
2      7         22   48
2      8         18   49
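Table 2. Summary Statistics for Least-Squares and Robust Estimators
Least-Squares Estimators
$\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$, $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$
$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$, $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$
Robust Estimators
$\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$, $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$
$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$, $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$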
Table 3. Hypothesis Test Results for Multivariate Example Data Set
Procedure  Test statistic              df                    p-value / critical value (CV)  Decision re: null hypothesis
Least-Squares Estimators
T²         T1 = 6.1, FT = 2.8          ν1 = 2, ν2 = 11       p = .106                       Fail to reject
BF         TBF = 9.1, FBF = 3.7        ν1 = 4, ν2 = 4.4      p = .116                       Fail to reject
J2         T2 = 5.0                    ν1 = 2                CV = 14.2                      Fail to reject
J          T2 = 5.0, FJ = 2.3          ν1 = 2, ν2 = 6.9      p = .175                       Fail to reject
K          FK = 2.5                    ν1 = 1.5, ν2 = 6.1    p = .164                       Fail to reject
NV         T2 = 5.0, FNV = 2.0         ν1 = 2, ν2 = 4.4      p = .237                       Fail to reject
Y          T2 = 5.0, FY = 2.1          ν1 = 2, ν2 = 6.1      p = .198                       Fail to reject
Robust Estimators
T²         T1 = 59.0, FT = 25.8        ν1 = 2, ν2 = 7        p = .001                       Reject
BF         TBF = 131.2, FBF = 56.2     ν1 = 5, ν2 = 6.0      p = .001                       Reject
J2         T2 = 65.2                   ν1 = 2                CV = 13.3                      Reject
J          T2 = 65.2, FJ = 29.5        ν1 = 2, ν2 = 6.3      p = .001                       Reject
K          FK = 28.1                   ν1 = 2.0, ν2 = 6.6    p = .001                       Reject
NV         T2 = 65.2, FNV = 27.9       ν1 = 2, ν2 = 6.0      p = .001                       Reject
Y          T2 = 65.2, FY = 28.3        ν1 = 2, ν2 = 6.6      p = .001                       Reject
Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
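\[ s^2_{wj} = \frac{1}{n_j - 1}\sum_{i=1}^{n_j}\left(Z_{ij} - \bar{Y}_{wj}\right)^2, \tag{12} \]
and the standard error of the trimmed mean is $\sqrt{(n_j - 1)\,s^2_{wj}/[h_j(h_j - 1)]}$. The Winsorized
covariance for the outcome variables q and q′ (q, q′ = 1, …, p) is
\[ s_{wjqq'} = \frac{1}{n_j - 1}\sum_{i=1}^{n_j}\left(Z_{ijq} - \bar{Y}_{wjq}\right)\left(Z_{ijq'} - \bar{Y}_{wjq'}\right), \tag{13} \]
and the Winsorized covariance matrix for the jth group is
\[ \mathbf{S}_{wj} = \begin{bmatrix}
s^2_{wj1} & s_{wj12} & \cdots & s_{wj1p} \\
\vdots & & \ddots & \vdots \\
s_{wjp1} & \cdots & & s^2_{wjp}
\end{bmatrix}. \]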
To illustrate, we return to the data set of Table 1. For the first outcome variable for the
first group, the ordered observations are
1, 11, 12, 13, 13, 28
and with 20% trimming, g1 = [6 × .20] = 1. The scores of 1 and 28 are removed and the mean of
the remaining scores is computed, which produces $\bar{Y}_{t1} = 12.2$. Table 2 contains the vectors of
trimmed means for the two groups.
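The trimming step just described can be sketched in a few lines of Python. This is an illustration, not part of the authors' SAS/IML program; `trimmed_mean` is a hypothetical helper name, and symmetric trimming with g = [n × proportion] scores removed from each tail is assumed.

```python
import math

def trimmed_mean(values, ptrim=0.20):
    """Symmetric trimmed mean: drop the g smallest and g largest scores."""
    ordered = sorted(values)
    g = math.floor(ptrim * len(ordered))   # scores removed from each tail
    kept = ordered[g:len(ordered) - g]
    return sum(kept) / len(kept)

# First outcome variable, Table 1.
group1_y1 = [1, 28, 12, 13, 13, 11]
group2_y1 = [19, 18, 18, 21, 19, 20, 22, 18]

print(trimmed_mean(group1_y1))             # mean of 11, 12, 13, 13
print(round(trimmed_mean(group2_y1), 1))
```

For the first group the four retained scores average 12.25, which Table 2 reports to one decimal place.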
To Winsorize the data set for the first group on the first outcome variable, the largest and
smallest values in the set of ordered observations are replaced by the next most extreme scores,
producing the following set of ordered observations:
11, 11, 12, 13, 13, 13
The Winsorized mean $\bar{Y}_{w1}$ is 12.2. While in this example the Winsorized mean has the same
value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result.
Table 2 contains the Winsorized covariance matrices for the two groups.
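The Winsorizing step and the Winsorized covariance matrix can be sketched as follows. This is illustrative Python under the assumption that each outcome variable is Winsorized separately and that sums of squares and cross-products are divided by n − 1; the names `winsorize` and `covariance` are hypothetical and do not come from the authors' program.

```python
import math

def winsorize(values, ptrim=0.20):
    """Replace the g lowest and g highest scores by the nearest retained score."""
    ordered = sorted(values)
    n = len(ordered)
    g = math.floor(ptrim * n)
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(v, low), high) for v in values]

def covariance(x, y):
    """Sample covariance with divisor n - 1 (the variance when x is y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Group 1 of Table 1, both outcome variables in subject order.
y1 = [1, 28, 12, 13, 13, 11]
y2 = [51, 48, 49, 51, 52, 47]

z1, z2 = winsorize(y1), winsorize(y2)
winsorized_mean = sum(z1) / len(z1)
s_w1 = [[covariance(z1, z1), covariance(z1, z2)],
        [covariance(z2, z1), covariance(z2, z2)]]

print(sorted(z1), round(winsorized_mean, 1))
print([[round(v, 1) for v in row] for row in s_w1])
```

The rounded matrix can be compared with the Winsorized covariance matrix for the first group in Table 2.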
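A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust
estimators the T1 statistic of equation 5 becomes
\[ T_{t1} = (\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2})^{T}\left[\left(\frac{1}{h_1} + \frac{1}{h_2}\right)\mathbf{S}_w\right]^{-1}(\bar{\mathbf{Y}}_{t1} - \bar{\mathbf{Y}}_{t2}), \tag{14} \]
where
\[ \mathbf{S}_w = \frac{(n_1 - 1)\mathbf{S}_{w1} + (n_2 - 1)\mathbf{S}_{w2}}{(h_1 - 1) + (h_2 - 1)}. \tag{15} \]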
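Equations 14 and 15 can be put together in a short computation. The sketch below is illustrative Python, not the authors' SAS/IML module: `robust_t1` and its helpers are hypothetical names, two outcome variables are assumed, and $h_j = n_j - 2g_j$ is taken as the number of scores left after trimming. Setting the trimming proportion to 0 reduces the computation to the least-squares T1 statistic.

```python
import math

def trimmed_mean(values, ptrim):
    ordered = sorted(values)
    g = math.floor(ptrim * len(ordered))
    kept = ordered[g:len(ordered) - g]
    return sum(kept) / len(kept)

def winsorize(values, ptrim):
    ordered = sorted(values)
    g = math.floor(ptrim * len(ordered))
    low, high = ordered[g], ordered[-g - 1]
    return [min(max(v, low), high) for v in values]

def sscp(cols):
    """Sums of squares and cross-products about the column means."""
    n = len(cols[0])
    means = [sum(c) / n for c in cols]
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb))
             for cb, mb in zip(cols, means)]
            for ca, ma in zip(cols, means)]

# Table 1 data, one list per outcome variable within each group.
group1 = ([1, 28, 12, 13, 13, 11], [51, 48, 49, 51, 52, 47])
group2 = ([19, 18, 18, 21, 19, 20, 22, 18], [46, 48, 50, 50, 45, 46, 48, 49])

def robust_t1(groups, ptrim=0.20):
    """T_t1 of equation 14, using the pooled matrix of equation 15 (p = 2)."""
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    h, tmeans = [], []
    for cols in groups:
        n = len(cols[0])
        h.append(n - 2 * math.floor(ptrim * n))
        tmeans.append([trimmed_mean(c, ptrim) for c in cols])
        ss = sscp([winsorize(c, ptrim) for c in cols])   # equals (n_j - 1) S_wj
        for i in range(2):
            for k in range(2):
                pooled[i][k] += ss[i][k]
    denom = (h[0] - 1) + (h[1] - 1)
    scale = 1 / h[0] + 1 / h[1]
    (a, b), (c, d) = [[scale * v / denom for v in row] for row in pooled]
    det = a * d - b * c
    d1 = tmeans[0][0] - tmeans[1][0]
    d2 = tmeans[0][1] - tmeans[1][1]
    # Quadratic form with the inverse of [[a, b], [c, d]] written out.
    return (d * d1 * d1 - (b + c) * d1 * d2 + a * d2 * d2) / det

print(round(robust_t1([group1, group2], 0.20), 1))
print(round(robust_t1([group1, group2], 0.0), 1))
```

The two printed values can be compared with the T1 entries for robust and least-squares estimators in Table 3.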
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators when the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators was
similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24 and n2
= 36). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:
Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified. The RUN T2MULT code invokes the program and
generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that
these lines of code follow the FINISH statement that concludes the program module.
Multivariate Tests of Means
17
The Winsorized mean w1Y is 122 While in this example the Winsorized mean has the same
value as the trimmed mean these two estimators will not as a rule produce an equivalent result
Table 2 contains the Winsorized covariance matrices for the two groups
A test which is robust to the biasing effects of both multivariate non-normality and
covariance heterogeneity can be obtained by using one of the BF J2 J K NV or Y test
procedures and substituting the trimmed means and the Winsorized covariance matrix for the
least-squares mean and covariance matrix (see Wilcox 1995b) For example with robust
estimators the T1 statistic of equation 5 becomes
11
t2t1
1
21
w
T
t2t1t1 YYSYYhh
T (14)
where
1
1
1
12w
2
2
1w
1
1w SSS
h
n
h
n (15)
Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized
covariances were substituted for the usual estimators when the data followed a multivariate
non-normal distribution. The Type I error performance of the J procedure with robust estimators was
similar to that of the K procedure when sample sizes were sufficiently large (i.e., n1 = 24 and n2
= 36). More importantly, however, there was a dramatic improvement in power when the test
procedures with robust estimators were compared to their least-squares counterparts; this was
observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The
differences in power were as great as 60 percentage points, which represents a substantial
difference in the ability to detect outcome effects.
Computer Program to Obtain Numeric Solutions
Appendix B contains a module of programming code that will produce numeric results
using least-squares and robust estimators with the test procedures enumerated previously, that is,
the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS
Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to
run this program. This program can be used with either the PC or UNIX versions of SAS; it was
generated using SAS version 8.2. The program can be downloaded from Lisa Lix's website,
http://home.cc.umanitoba.ca/~lixlm.
In order to run the program, the data set, group sizes, proportion of trimming, and
nominal level of significance, α, must be input. It is assumed that the data set is complete, so that
there are no missing values for any of the subjects on the outcome variables. The program
generates as output the summary statistics for each group (i.e., means and covariance matrices).
For each test procedure, the relevant T and/or F statistics are produced, along with the numerator
(ν1) and denominator (ν2) df for the F statistic, and either a p-value or critical value. These results
can be produced for both least-squares estimators and robust estimators with separate calls to the
program.
To produce results for the example data of Table 1 with least-squares estimators, the
following data input lines are required:
Y={1 51, 28 48, 12 49, 13 51, 13 52, 11 47, 19 46, 18 48, 18 50, 21 50, 19 45, 20 46, 22 48, 18 49};
NX={6 8};
PTRIM=0;
ALPHA=.05;
RUN T2MULT;
QUIT;
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject and braces enclose the data set. The next line of code
specifies the group sizes. Again, braces enclose the element values. No comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that
are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a
symmetric trimming approach is automatically assumed in the program; trimming proportions
for the right and left tails are not specified. The RUN T2MULT code invokes the program and
generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that
these lines of code follow the FINISH statement that concludes the program module.
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) T2. We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program with
PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
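As an illustration of how a T statistic, its F statistic, and the df fit together, the following Python sketch applies the standard conversion for Hotelling's T2 to the rounded T1 values in Table 3. It assumes that the program uses this conversion and that, for robust estimators, the effective group sizes after trimming (4 and 6 here) take the place of the group sizes. The function names are hypothetical, and the closed-form p-value holds only when the numerator df equals 2.

```python
def hotelling_f(t_stat, size1, size2, p):
    """Convert a pooled-covariance T statistic to F; returns (F, nu1, nu2)."""
    nu2 = size1 + size2 - p - 1
    return t_stat * nu2 / ((size1 + size2 - 2) * p), p, nu2

def p_value_two_numerator_df(f_stat, nu2):
    """Upper-tail probability of F(2, nu2); this closed form requires nu1 = 2."""
    return (1 + 2 * f_stat / nu2) ** (-nu2 / 2)

# Least-squares estimators: n1 = 6, n2 = 8, T1 = 6.1.
f_ls, _, nu2_ls = hotelling_f(6.1, 6, 8, 2)
# Robust estimators: effective sizes 4 and 6 (assumed), T1 = 59.0.
f_rob, _, nu2_rob = hotelling_f(59.0, 4, 6, 2)

print(round(f_ls, 1), nu2_ls, round(f_rob, 1), nu2_rob)
print(round(p_value_two_numerator_df(f_ls, nu2_ls), 2))
```

The F values and denominator df agree with the T2 rows of Table 3. The p-value computed from the rounded T1 is about .10, slightly below the tabled .106, which was presumably obtained from the unrounded statistic.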
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling's (1931) T2 test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale less influenced by the presence of extreme scores in the tails of a distribution. Robust
estimators based on the concepts of trimming and Winsorizing result in the most extreme scores
either being removed or replaced by less extreme scores. To facilitate the adoption of the robust
test procedures by applied researchers, we have presented a computer program that can be used
to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b), because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T2 tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T2 procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T2. Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T2 and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of Statistics, Vol. 1 (pp. 199-236). North-Holland, New York.
Ito, K., & Schull, W. J. (1964). On the robustness of the $T_0^2$ test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size. Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T2. Multivariate
Behavioral Research, 21, 169-186.
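Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) T2 Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then

$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF}, \tag{A1}$$

where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and

$$f_2 = \frac{\mathrm{tr}(\mathbf{G}_1^2) + [\mathrm{tr}(\mathbf{G}_1)]^2}{\dfrac{1}{n_1 - 1}\left\{ \mathrm{tr}[(\bar{w}_1 \mathbf{S}_1)^2] + [\mathrm{tr}(\bar{w}_1 \mathbf{S}_1)]^2 \right\} + \dfrac{1}{n_2 - 1}\left\{ \mathrm{tr}[(\bar{w}_2 \mathbf{S}_2)^2] + [\mathrm{tr}(\bar{w}_2 \mathbf{S}_2)]^2 \right\}}. \tag{A2}$$

In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1 \mathbf{S}_1 + \bar{w}_2 \mathbf{S}_2$. The test statistic $F_{BF}$ is
compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where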
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and $\mathbf{G}_2 = w_1 \mathbf{S}_1 + w_2 \mathbf{S}_2$.
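James (1954) Second Order
The test statistic T2 of equation 8 is compared to the critical value $\chi^2_p (A + \chi^2_p B) + q$,
where $\chi^2_p$ is the $1 - \alpha$ percentile point of the $\chi^2$ distribution with p df,

$$A = 1 + \frac{1}{2p}\left\{ \frac{1}{n_1 - 1}\,[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_1)]^2 + \frac{1}{n_2 - 1}\,[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_2)]^2 \right\}, \tag{A4}$$

$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and

$$B = \frac{1}{p(p + 2)}\left\{ \frac{1}{n_1 - 1}\left( \mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \frac{1}{2}\,[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_1)]^2 \right) + \frac{1}{n_2 - 1}\left( \mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \frac{1}{2}\,[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_2)]^2 \right) \right\}. \tag{A5}$$

The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 67 of James (1954).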
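Johansen (1980)
Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and

$$C = \frac{1}{2} \sum_{j=1}^{2} \frac{1}{n_j - 1}\left\{ \mathrm{tr}[(\mathbf{A}^{-1}\mathbf{A}_j)^2] + [\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j)]^2 \right\}. \tag{A6}$$

The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.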
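The Johansen calculations can be sketched in Python for the least-squares case, using the Table 1 data. The sketch assumes that T2 of equation 8 is the quadratic form in the mean difference with matrix A = A1 + A2, where A_j = S_j/n_j. It uses c2 = p + 2C − 6C/(p + 2), the form given by Johansen (1980) and the one that reproduces the FJ value in Table 3. The helper names are hypothetical, the matrix routines handle only p = 2, and the closed-form p-value requires a numerator df of 2.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_a(rows):
    """Mean vector and A_j = S_j / n_j for one group (p = 2, least squares)."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    a = [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / ((n - 1) * n)
          for j in range(2)] for i in range(2)]
    return m, a

def inverse(x):
    det = x[0][0] * x[1][1] - x[0][1] * x[1][0]
    return [[x[1][1] / det, -x[0][1] / det], [-x[1][0] / det, x[0][0] / det]]

def product(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def trace(x):
    return x[0][0] + x[1][1]

def johansen(rows1, rows2, p=2):
    """Return (T2, F_J, nu_J, p-value) for two groups and p = 2 variables."""
    (m1, a1), (m2, a2) = mean_and_a(rows1), mean_and_a(rows2)
    a_inv = inverse([[a1[i][j] + a2[i][j] for j in range(2)] for i in range(2)])
    d = [m1[k] - m2[k] for k in range(2)]
    t2 = sum(d[i] * a_inv[i][j] * d[j] for i in range(2) for j in range(2))
    c_big = 0.0
    for a_j, n_j in ((a1, len(rows1)), (a2, len(rows2))):
        m = product(a_inv, a_j)                       # A^{-1} A_j
        c_big += (trace(product(m, m)) + trace(m) ** 2) / (2 * (n_j - 1))
    c2 = p + 2 * c_big - 6 * c_big / (p + 2)
    nu = p * (p + 2) / (3 * c_big)
    f_j = t2 / c2
    return t2, f_j, nu, (1 + 2 * f_j / nu) ** (-nu / 2)   # p-value valid for nu1 = 2

t2, f_j, nu_j, p_j = johansen(GROUP1, GROUP2)
print(f"T2 = {t2:.1f}, FJ = {f_j:.1f}, nu2 = {nu_j:.1f}, p = {p_j:.3f}")
```

For the example data this gives T2 = 5.0, FJ = 2.3, ν2 = 6.9, and a p-value of about .175, in agreement with the least-squares J row of Table 3.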
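Kim (1992)
The K procedure is based on the test statistic

$$F_K = \frac{\nu_K}{c\, m\, f_1}\, (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}\, \mathbf{V}^{-1}\, (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2), \tag{A7}$$

where $\mathbf{V} = \mathbf{A}_1 + r^2 \mathbf{A}_2 + 2r\, \mathbf{A}_2^{1/2} (\mathbf{A}_2^{-1/2} \mathbf{A}_1 \mathbf{A}_2^{-1/2})^{1/2} \mathbf{A}_2^{1/2}$,

$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8}$$

$$m = \frac{\left( \sum_{l=1}^{p} h_l \right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9}$$

$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1 \mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1 \mathbf{A}_2^{-1}|^{1/(2p)}$,
and | | is the determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$, where $\nu_K = f_1 - p + 1$,

$$\frac{1}{f_1} = \sum_{j=1}^{2} \frac{1}{n_j - 1} \left( \frac{b_j}{T_2} \right)^2, \tag{A10}$$

and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}\, \mathbf{V}^{-1} \mathbf{A}_j \mathbf{V}^{-1}\, (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.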
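The Kim calculations can be sketched in Python for the least-squares case under several assumptions, each of which should be checked against Kim (1992). First, V is taken to be A1 + r²A2 + 2r A2^{1/2}(A2^{-1/2}A1A2^{-1/2})^{1/2}A2^{1/2}. Instead of forming matrix square roots, the sketch uses vectors q_l satisfying A1 q_l = d_l A2 q_l, scaled so that q_l'A2 q_l = 1. With w_l = q_l'(Ȳ1 − Ȳ2), the quadratic form in V then equals the sum of w_l²/(d_l^{1/2} + r)². Second, f1 is computed in Yao's form, with A = A1 + A2 in place of V. This choice is an assumption, made because it reproduces the df shown for both K and Y in Table 3. All function names are hypothetical and the code handles only p = 2.

```python
from math import sqrt

GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_a(rows):
    """Mean vector and A_j = S_j / n_j for one group (p = 2, least squares)."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    a = [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / ((n - 1) * n)
          for j in range(2)] for i in range(2)]
    return m, a

def inverse(x):
    det = x[0][0] * x[1][1] - x[0][1] * x[1][0]
    return [[x[1][1] / det, -x[0][1] / det], [-x[1][0] / det, x[0][0] / det]]

def form(x, m, y):
    """Bilinear form x' M y for 2-vectors x, y and a 2 x 2 matrix M."""
    return sum(x[i] * m[i][j] * y[j] for i in range(2) for j in range(2))

def kim(rows1, rows2, p=2):
    """Return (F_K, m, nu_K) for two groups measured on p = 2 variables."""
    (m1, a1), (m2, a2) = mean_and_a(rows1), mean_and_a(rows2)
    n1, n2 = len(rows1), len(rows2)
    d = [m1[k] - m2[k] for k in range(2)]
    a2_inv = inverse(a2)
    # B = A2^{-1} A1 has the same eigenvalues d_l as A1 A2^{-1}.
    b = [[sum(a2_inv[i][k] * a1[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    half_tr = (b[0][0] + b[1][1]) / 2
    det_b = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    root = sqrt(half_tr ** 2 - det_b)
    r = det_b ** (1 / (2 * p))
    quad, h = 0.0, []
    for lam in (half_tr + root, half_tr - root):
        v = [b[0][1], lam - b[0][0]]            # eigenvector of B; assumes b[0][1] != 0
        w_sq = (v[0] * d[0] + v[1] * d[1]) ** 2 / form(v, a2, v)
        quad += w_sq / (sqrt(lam) + r) ** 2     # contribution to d' V^{-1} d
        h.append((lam + 1) / (sqrt(lam) + r) ** 2)
    c = sum(x * x for x in h) / sum(h)
    m = sum(h) ** 2 / sum(x * x for x in h)
    # f1 in Yao's form (A in place of V); an assumption of this sketch.
    a_inv = inverse([[a1[i][j] + a2[i][j] for j in range(2)] for i in range(2)])
    u = [a_inv[i][0] * d[0] + a_inv[i][1] * d[1] for i in range(2)]
    t2 = d[0] * u[0] + d[1] * u[1]
    f1 = 1 / sum((form(u, a_j, u) / t2) ** 2 / (n_j - 1)
                 for a_j, n_j in ((a1, n1), (a2, n2)))
    nu = f1 - p + 1
    return nu * quad / (c * m * f1), m, nu

f_k, m_k, nu_k = kim(GROUP1, GROUP2)
print(f"FK = {f_k:.1f}, nu1 = {m_k:.1f}, nu2 = {nu_k:.1f}")
```

Under these assumptions the sketch gives FK = 2.5 with df of 1.5 and 6.1, matching the least-squares K row of Table 3. The p-value is omitted because the closed form used for the other procedures applies only when the numerator df equals 2.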
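Nel and van der Merwe (1986)
Let

$$F_{NV} = \frac{\nu_N}{p f_2}\, T_2, \tag{A11}$$

where $\nu_N = f_2 - p + 1$ and

$$f_2 = \frac{\mathrm{tr}(\mathbf{A}^2) + [\mathrm{tr}(\mathbf{A})]^2}{\sum_{j=1}^{2} \dfrac{1}{n_j - 1}\left\{ \mathrm{tr}(\mathbf{A}_j^2) + [\mathrm{tr}(\mathbf{A}_j)]^2 \right\}}. \tag{A12}$$

The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.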
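Equations A11 and A12 can be sketched in Python for the least-squares case. The sketch assumes that T2 is the quadratic form in the mean difference with matrix A = A1 + A2 and A_j = S_j/n_j, and it uses the Table 1 data. The helper names are hypothetical, the matrix code covers only p = 2, and the closed-form p-value requires a numerator df of 2.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_a(rows):
    """Mean vector and A_j = S_j / n_j for one group (p = 2, least squares)."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    a = [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / ((n - 1) * n)
          for j in range(2)] for i in range(2)]
    return m, a

def inverse(x):
    det = x[0][0] * x[1][1] - x[0][1] * x[1][0]
    return [[x[1][1] / det, -x[0][1] / det], [-x[1][0] / det, x[0][0] / det]]

def spread(x):
    """tr(X^2) + [tr(X)]^2 for a 2 x 2 matrix X."""
    tr_sq = sum(x[i][k] * x[k][i] for i in range(2) for k in range(2))
    return tr_sq + (x[0][0] + x[1][1]) ** 2

def nel_van_der_merwe(rows1, rows2, p=2):
    """Return (F_NV, nu_N, p-value) for two groups and p = 2 variables."""
    (m1, a1), (m2, a2) = mean_and_a(rows1), mean_and_a(rows2)
    a = [[a1[i][j] + a2[i][j] for j in range(2)] for i in range(2)]
    a_inv = inverse(a)
    d = [m1[k] - m2[k] for k in range(2)]
    t2 = sum(d[i] * a_inv[i][j] * d[j] for i in range(2) for j in range(2))
    f2 = spread(a) / (spread(a1) / (len(rows1) - 1) + spread(a2) / (len(rows2) - 1))
    nu = f2 - p + 1
    f_nv = nu * t2 / (p * f2)
    return f_nv, nu, (1 + 2 * f_nv / nu) ** (-nu / 2)   # p-value valid for nu1 = 2

f_nv, nu_n, p_nv = nel_van_der_merwe(GROUP1, GROUP2)
print(f"FNV = {f_nv:.1f}, nu2 = {nu_n:.1f}, p = {p_nv:.3f}")
```

For the example data this gives FNV = 2.0, ν2 = 4.4, and a p-value of about .237, in agreement with the least-squares NV row of Table 3.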
Yao (1965)
The statistic $F_Y$ is referred to the critical value $F[p, \nu_K]$,

$$F_Y = \frac{\nu_K}{p f_1}\, T_2, \tag{A13}$$

where $f_1$ is given by equation A10 and $\nu_K$ again equals $f_1 - p + 1$.
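The Yao calculations can be sketched in Python for the least-squares case. Equation A10 is written in terms of V. This sketch instead assumes Yao's original form, in which A = A1 + A2 takes the place of V and the divisor is T2, the quadratic form in the mean difference with matrix A. That assumption reproduces the Y row of Table 3, but it is not taken from the SAS/IML program itself. The helper names are hypothetical, the code covers only p = 2, and the closed-form p-value requires a numerator df of 2.

```python
GROUP1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
GROUP2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]

def mean_and_a(rows):
    """Mean vector and A_j = S_j / n_j for one group (p = 2, least squares)."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    a = [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / ((n - 1) * n)
          for j in range(2)] for i in range(2)]
    return m, a

def inverse(x):
    det = x[0][0] * x[1][1] - x[0][1] * x[1][0]
    return [[x[1][1] / det, -x[0][1] / det], [-x[1][0] / det, x[0][0] / det]]

def yao(rows1, rows2, p=2):
    """Return (F_Y, nu_K, p-value) for two groups and p = 2 variables."""
    (m1, a1), (m2, a2) = mean_and_a(rows1), mean_and_a(rows2)
    a_inv = inverse([[a1[i][j] + a2[i][j] for j in range(2)] for i in range(2)])
    d = [m1[k] - m2[k] for k in range(2)]
    u = [a_inv[i][0] * d[0] + a_inv[i][1] * d[1] for i in range(2)]   # A^{-1} d
    t2 = d[0] * u[0] + d[1] * u[1]
    inv_f1 = 0.0
    for a_j, n_j in ((a1, len(rows1)), (a2, len(rows2))):
        b_j = sum(u[i] * a_j[i][j] * u[j] for i in range(2) for j in range(2))
        inv_f1 += (b_j / t2) ** 2 / (n_j - 1)
    f1 = 1 / inv_f1
    nu = f1 - p + 1
    f_y = nu * t2 / (p * f1)
    return f_y, nu, (1 + 2 * f_y / nu) ** (-nu / 2)   # p-value valid for nu1 = 2

f_y, nu_k, p_y = yao(GROUP1, GROUP2)
print(f"FY = {f_y:.1f}, nu2 = {nu_k:.1f}, p = {p_y:.3f}")
```

For the example data this gives FY = 2.1, ν2 = 6.1, and a p-value of about .198, in agreement with the least-squares Y row of Table 3.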
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of the matrix.
2. The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set

Group  Subject  Yi1  Yi2
1      1          1   51
1      2         28   48
1      3         12   49
1      4         13   51
1      5         13   52
1      6         11   47
2      1         19   46
2      2         18   48
2      3         18   50
2      4         21   50
2      5         19   45
2      6         20   46
2      7         22   48
2      8         18   49
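Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators

$$\bar{\mathbf{Y}}_1 = \begin{bmatrix} 13.0 & 49.7 \end{bmatrix}, \qquad \bar{\mathbf{Y}}_2 = \begin{bmatrix} 19.4 & 47.7 \end{bmatrix}$$

$$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}, \qquad \mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$$

Robust Estimators

$$\bar{\mathbf{Y}}_{t1} = \begin{bmatrix} 12.2 & 49.8 \end{bmatrix}, \qquad \bar{\mathbf{Y}}_{t2} = \begin{bmatrix} 19.2 & 47.8 \end{bmatrix}$$

$$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}, \qquad \mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$$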
The first line is used to specify the data set, Y. Notice that a comma separates the series of
measurements for each subject, and parentheses enclose the data set. The next line of code
specifies the group sizes. Again, parentheses enclose the element values; no comma is required
to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming
that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or
Winsorized. If PTRIM > 0, then the value specified is the proportion of observations that
are trimmed/Winsorized in each tail. To produce the recommended 20% trimming, set PTRIM=.20.
Note that a symmetric trimming approach is automatically assumed in the program; trimming
proportions for the right and left tails are not specified separately. The RUN T2MULT code
invokes the program and generates output. Observe that each line of code ends with a semi-colon.
Also, it is necessary that these lines of code follow the FINISH statement that concludes the
program module.
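To make the effect of PTRIM concrete, the following Python sketch applies symmetric 20% trimming and Winsorizing to the first outcome variable of group 1 in Table 1. It is not part of the SAS/IML program. The function names and the rule of removing g = floor(PTRIM × n) scores from each tail are assumptions of this sketch; the rule was chosen because it agrees, up to rounding, with the trimmed mean reported for group 1 in Table 2.

```python
import math

def trim_and_winsorize(values, ptrim):
    """Return (trimmed mean, Winsorized sample) for symmetric trimming.

    Assumption: g = floor(ptrim * n) observations are removed from each tail.
    """
    n = len(values)
    g = math.floor(ptrim * n)
    ordered = sorted(values)
    kept = ordered[g:n - g]                  # scores that survive trimming
    low, high = kept[0], kept[-1]
    # Winsorizing replaces each trimmed score by the nearest retained score.
    winsorized = [min(max(v, low), high) for v in values]
    return sum(kept) / len(kept), winsorized

def sample_variance(values):
    """Usual variance with divisor n - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

y11 = [1, 28, 12, 13, 13, 11]                # group 1, first outcome (Table 1)
trimmed_mean, y11_w = trim_and_winsorize(y11, 0.20)
winsorized_var = sample_variance(y11_w)
print(trimmed_mean, y11_w, round(winsorized_var, 2))
```

With a trimming proportion of 0 the function returns the ordinary mean of 13.0; with 0.20 the two extreme scores (1 and 28) no longer influence the location estimate, and the Winsorized variance is far smaller than the least-squares variance.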
Table 3 contains the output produced by the SAS/IML program for each test statistic for
the example data set. For comparative purposes, the program produces the results for Hotelling's
(1931) T². We do not recommend, however, that the results for this procedure be reported. The
output for least-squares estimators is provided first. A second invocation of the program, with
PTRIM=.20, is required to produce the results for robust estimators. As noted previously, the
program will output a T statistic and/or an F statistic, along with the df and p-value or critical
value. This information is used to either reject or fail to reject the null hypothesis.
As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures
fail to reject the null hypothesis of equality of multivariate means. One would conclude that there
is no difference between the two groups on the multivariate means. However, when robust
estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality
of multivariate trimmed means, leading to the conclusion that the two groups do differ on the
multivariate means. These results demonstrate the influence that a small number of extreme
observations can have on tests of mean equality in multivariate designs.
Conclusions and Recommendations
Although Schmidt and Hunter (1995) argue against the use of tests of statistical
significance, their observation that "methods of data analysis used in research have a major
effect on research progress" (p. 425) is certainly valid in the current discussion. Recent advances
in data-analytic techniques for multivariate data are unknown to the majority of applied health
researchers. Traditional procedures for testing multivariate hypotheses of mean equality make
specific assumptions concerning the data distribution and the group variances and covariances.
Valid tests of hypotheses of healthcare intervention effects are obtained only when the
assumptions underlying tests of statistical significance are satisfied. If these assumptions are not
satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be
made.
In this article, we have reviewed the shortcomings of Hotelling's (1931) T² test and
described a number of procedures that are insensitive to the assumption of equality of population
covariance matrices for multivariate data. Substituting robust estimators for the usual
least-squares estimators will result in test procedures that are insensitive to both covariance
heterogeneity and multivariate non-normality. Robust estimators are measures of location and
scale that are less influenced by the presence of extreme scores in the tails of a distribution.
Robust estimators based on the concepts of trimming and Winsorizing result in the most extreme
scores either being removed or replaced by less extreme scores. To facilitate the adoption of the
robust test procedures by applied researchers, we have presented a computer program that can be
used to obtain robust solutions for multivariate two-group data.
The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen
(1980), Kim (1992), Nel and van der Merwe (1986), and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data, such as the number of dependent
variables, the nature of the relationship between group sizes and covariance matrices, and the
degree of inequality of population covariance matrices. Current knowledge suggests that the Kim
(1992) procedure may be among the best choices (Wilcox, 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables. Further research is
needed, however, to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted.
Finally, we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see,
e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations.
References
Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's
tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational
Statistics, 16, 125-139.
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test
the equality of several means. Technometrics, 16, 385-389.
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in
Statistics – Simulation and Computation, 26, 1251-1273.
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant
analysis. Educational and Psychological Measurement, 56, 382-402.
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem. South African Statistical
Journal, 27, 129-148.
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and
two-sample T² tests. Journal of the American Statistical Association, 74, 48-51.
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions. Communications in Statistics – Simulation and Computation, 31, 375-400.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and
Clinical Psychology, 68, 155-165.
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption
of homogeneous covariance matrices. Psychological Bulletin, 56, 1255-1263.
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology.
Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data.
Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the
American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and
homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the
American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2,
360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah
(Ed.), Handbook of Statistics, Vol. 1 (pp. 199-236). North-Holland, New York.
Ito, K., & Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of
variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses
revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika,
79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health
care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under
heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized
Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26,
1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher
problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Author: Cary, NC.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Author: Cary, NC.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research
knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and
self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral
Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect
size. Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61,
165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate
Behavioral Research, 21, 169-186.
Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) T² Test
Brown and Forsythe (1974)
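The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then

$$F_{BF} = \frac{\nu_{BF2}}{p\,f_2}\,T_{BF}, \tag{A1}$$

where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and

$$f_2 = \frac{\operatorname{tr}(\mathbf{G}_1^2) + (\operatorname{tr}\mathbf{G}_1)^2}{\dfrac{1}{n_1 - 1}\left\{\operatorname{tr}\left[(\bar{w}_1\mathbf{S}_1)^2\right] + \left[\operatorname{tr}(\bar{w}_1\mathbf{S}_1)\right]^2\right\} + \dfrac{1}{n_2 - 1}\left\{\operatorname{tr}\left[(\bar{w}_2\mathbf{S}_2)^2\right] + \left[\operatorname{tr}(\bar{w}_2\mathbf{S}_2)\right]^2\right\}}. \tag{A2}$$

In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1\mathbf{S}_1 + \bar{w}_2\mathbf{S}_2$. The test statistic
$F_{BF}$ is compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where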
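$$\nu_{BF1} = \frac{\operatorname{tr}(\mathbf{G}_1^2) + (\operatorname{tr}\mathbf{G}_1)^2}{\operatorname{tr}(\mathbf{G}_2^2) + (\operatorname{tr}\mathbf{G}_2)^2 + \operatorname{tr}\left[(\bar{w}_1\mathbf{S}_1)^2\right] + \left[\operatorname{tr}(\bar{w}_1\mathbf{S}_1)\right]^2 + \operatorname{tr}\left[(\bar{w}_2\mathbf{S}_2)^2\right] + \left[\operatorname{tr}(\bar{w}_2\mathbf{S}_2)\right]^2} \tag{A3}$$

and $\mathbf{G}_2 = w_1\mathbf{S}_1 + w_2\mathbf{S}_2$.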
James (1954) Second Order
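The test statistic $T_2$ of equation 8 is compared to the critical value
$\chi^2_p\,(A + \chi^2_p\,B) + q$, where $\chi^2_p$ is the $1 - \alpha$ percentile point of the $\chi^2$ distribution with p df,

$$A = 1 + \frac{1}{2p}\left[\frac{1}{n_1 - 1}\left(\operatorname{tr}\mathbf{A}^{-1}\mathbf{A}_1\right)^2 + \frac{1}{n_2 - 1}\left(\operatorname{tr}\mathbf{A}^{-1}\mathbf{A}_2\right)^2\right], \tag{A4}$$

$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and

$$B = \frac{1}{p(p + 2)}\left\{\frac{1}{n_1 - 1}\left[\operatorname{tr}\left(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1\right) + \frac{1}{2}\left(\operatorname{tr}\mathbf{A}^{-1}\mathbf{A}_1\right)^2\right] + \frac{1}{n_2 - 1}\left[\operatorname{tr}\left(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2\right) + \frac{1}{2}\left(\operatorname{tr}\mathbf{A}^{-1}\mathbf{A}_2\right)^2\right]\right\}. \tag{A5}$$

The constant q is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 6.7 of James (1954).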
Johansen (1980)
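Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and

$$C = \frac{1}{2}\sum_{j=1}^{2}\frac{1}{n_j - 1}\left\{\operatorname{tr}\left[\left(\mathbf{A}^{-1}\mathbf{A}_j\right)^2\right] + \left(\operatorname{tr}\mathbf{A}^{-1}\mathbf{A}_j\right)^2\right\}. \tag{A6}$$

The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.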
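The Johansen computation can be traced on the Table 1 data with a short Python sketch. It is illustrative only and is not the authors' SAS/IML program. Two points are assumptions: the statistic T2 of equation 8 (not reproduced in this appendix) is taken to be the quadratic form in the mean difference with matrix A = S1/n1 + S2/n2, and the divisor in c2 is taken to be p + 2, as in Johansen's (1980) formula with p numerator df. The 2 × 2 helper functions are hypothetical conveniences that only handle p = 2.

```python
# Table 1 data: (Y_i1, Y_i2) for each subject
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2  # number of outcome variables; the helpers below assume p = 2

def mean_cov(rows):
    """Least-squares mean vector and covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(p)]
    s = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
          for b in range(p)] for a in range(p)]
    return m, s

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(p)) for j in range(p)]
            for i in range(p)]

def inverse(x):
    det = x[0][0] * x[1][1] - x[0][1] * x[1][0]
    return [[x[1][1] / det, -x[0][1] / det], [-x[1][0] / det, x[0][0] / det]]

def trace(x):
    return x[0][0] + x[1][1]

(m1, s1), (m2, s2) = mean_cov(g1), mean_cov(g2)
n1, n2 = len(g1), len(g2)
a1 = [[v / n1 for v in row] for row in s1]        # A_1 = S_1 / n_1
a2 = [[v / n2 for v in row] for row in s2]        # A_2 = S_2 / n_2
a = [[a1[i][j] + a2[i][j] for j in range(p)] for i in range(p)]
a_inv = inverse(a)

d = [m1[k] - m2[k] for k in range(p)]             # mean difference
t2 = sum(d[i] * a_inv[i][j] * d[j] for i in range(p) for j in range(p))

# Equation A6: C = (1/2) sum_j [tr((A^-1 A_j)^2) + (tr A^-1 A_j)^2] / (n_j - 1)
c_cap = 0.0
for a_j, n_j in ((a1, n1), (a2, n2)):
    q = matmul(a_inv, a_j)
    c_cap += (trace(matmul(q, q)) + trace(q) ** 2) / (n_j - 1)
c_cap /= 2

c2 = p + 2 * c_cap - 6 * c_cap / (p + 2)          # assumed divisor: p + 2
f_j = t2 / c2
nu_j = p * (p + 2) / (3 * c_cap)
print(round(t2, 1), round(f_j, 1), round(nu_j, 1))
```

Under these assumptions the rounded results agree with the least-squares Johansen row of Table 3 (T2 = 5.0, FJ = 2.3, second df = 6.9).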
Kim (1992)
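The K procedure is based on the test statistic

$$F_K = \frac{\nu_K}{c\,m\,f_1}\left(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2\right)^{T}\mathbf{V}^{-1}\left(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2\right), \tag{A7}$$

where $\mathbf{V} = \mathbf{A}_1 + r^2\mathbf{A}_2 + 2r\,\mathbf{A}_2^{1/2}\left(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2}\right)^{1/2}\mathbf{A}_2^{1/2}$,

$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8}$$

$$m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9}$$

$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1\mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1\mathbf{A}_2^{-1}|^{1/(2p)}$,
and | | is the determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$,
where $\nu_K = f_1 - p + 1$,

$$\frac{1}{f_1} = \sum_{j=1}^{2}\frac{1}{n_j - 1}\left(\frac{b_j}{T_2}\right)^2, \tag{A10}$$

and $b_j = \left(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2\right)^{T}\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}\left(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2\right)$.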
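The ingredients of the Kim statistic (r, the h values, c, m, the matrix V, and the quadratic form in equation A7) can be sketched in Python for the Table 1 data. This is an illustration under stated assumptions, not the authors' program: V is taken to be A1 + r²A2 + 2r A2^(1/2) (A2^(-1/2) A1 A2^(-1/2))^(1/2) A2^(1/2), the helper functions are hypothetical, and the closed-form eigenvalue and matrix square-root routines are valid only for p = 2. The sketch stops short of FK, which additionally requires f1 from equation A10.

```python
import math

g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2  # all helpers below are written for 2x2 matrices only

def mean_cov(rows):
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(p)]
    s = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
          for b in range(p)] for a in range(p)]
    return m, s

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(p)) for j in range(p)]
            for i in range(p)]

def det(x):
    return x[0][0] * x[1][1] - x[0][1] * x[1][0]

def inverse(x):
    dx = det(x)
    return [[x[1][1] / dx, -x[0][1] / dx], [-x[1][0] / dx, x[0][0] / dx]]

def sqrt_spd(x):
    """Square root of a symmetric positive definite 2x2 matrix (closed form)."""
    s = math.sqrt(det(x))
    t = math.sqrt(x[0][0] + x[1][1] + 2 * s)
    return [[(x[0][0] + s) / t, x[0][1] / t], [x[1][0] / t, (x[1][1] + s) / t]]

(m1, s1), (m2, s2) = mean_cov(g1), mean_cov(g2)
a1 = [[v / len(g1) for v in row] for row in s1]   # A_1 = S_1 / n_1
a2 = [[v / len(g2) for v in row] for row in s2]   # A_2 = S_2 / n_2

# Eigenvalues d_l of A_1 A_2^-1 from the 2x2 characteristic polynomial
e = matmul(a1, inverse(a2))
tr_e, det_e = e[0][0] + e[1][1], det(e)
disc = math.sqrt(tr_e ** 2 - 4 * det_e)
d_eig = [(tr_e + disc) / 2, (tr_e - disc) / 2]

r = det_e ** (1 / (2 * p))
h = [(dl + 1) / (math.sqrt(dl) + r) ** 2 for dl in d_eig]
c = sum(v ** 2 for v in h) / sum(h)               # equation A8
m = sum(h) ** 2 / sum(v ** 2 for v in h)          # equation A9

# V = A_1 + r^2 A_2 + 2 r A_2^(1/2) (A_2^(-1/2) A_1 A_2^(-1/2))^(1/2) A_2^(1/2)
a2_half = sqrt_spd(a2)
a2_half_inv = inverse(a2_half)
middle = sqrt_spd(matmul(matmul(a2_half_inv, a1), a2_half_inv))
x = matmul(matmul(a2_half, middle), a2_half)
v = [[a1[i][j] + r ** 2 * a2[i][j] + 2 * r * x[i][j] for j in range(p)]
     for i in range(p)]

diff = [m1[k] - m2[k] for k in range(p)]
v_inv = inverse(v)
quad = sum(diff[i] * v_inv[i][j] * diff[j] for i in range(p) for j in range(p))
print(round(m, 2), round(c, 3), round(quad, 2))
```

Under these assumptions m is close to the first df reported for the K procedure in the least-squares panel of Table 3 (1.5).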
Nel and van der Merwe (1986)
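Let

$$F_{NV} = \frac{\nu_N}{p\,f_2}\,T_2, \tag{A11}$$

where $\nu_N = f_2 - p + 1$ and

$$f_2 = \frac{\operatorname{tr}(\mathbf{A}^2) + (\operatorname{tr}\mathbf{A})^2}{\displaystyle\sum_{j=1}^{2}\frac{1}{n_j - 1}\left[\operatorname{tr}(\mathbf{A}_j^2) + (\operatorname{tr}\mathbf{A}_j)^2\right]}. \tag{A12}$$

The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.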
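Equations A11 and A12 can be checked numerically on the Table 1 data with the following Python sketch. It is illustrative only; the helper names are hypothetical, the helpers handle p = 2 only, and T2 is assumed to be the quadratic form in the mean difference with matrix A = S1/n1 + S2/n2 (equation 8 is not reproduced in this appendix).

```python
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2  # the helpers below assume 2x2 matrices

def mean_cov(rows):
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(p)]
    s = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
          for b in range(p)] for a in range(p)]
    return m, s

def inverse(x):
    det = x[0][0] * x[1][1] - x[0][1] * x[1][0]
    return [[x[1][1] / det, -x[0][1] / det], [-x[1][0] / det, x[0][0] / det]]

def trace_terms(x):
    """tr(X^2) + (tr X)^2 for a 2x2 matrix."""
    tr_x2 = sum(x[i][k] * x[k][i] for i in range(p) for k in range(p))
    return tr_x2 + (x[0][0] + x[1][1]) ** 2

(m1, s1), (m2, s2) = mean_cov(g1), mean_cov(g2)
n1, n2 = len(g1), len(g2)
a1 = [[v / n1 for v in row] for row in s1]        # A_1 = S_1 / n_1
a2 = [[v / n2 for v in row] for row in s2]        # A_2 = S_2 / n_2
a = [[a1[i][j] + a2[i][j] for j in range(p)] for i in range(p)]

d = [m1[k] - m2[k] for k in range(p)]
a_inv = inverse(a)
t2 = sum(d[i] * a_inv[i][j] * d[j] for i in range(p) for j in range(p))

# Equation A12, then A11
f2 = trace_terms(a) / (trace_terms(a1) / (n1 - 1) + trace_terms(a2) / (n2 - 1))
nu_n = f2 - p + 1
f_nv = nu_n / (p * f2) * t2
print(round(f_nv, 1), round(nu_n, 1))
```

The rounded values agree with the least-squares NV row of Table 3 (FNV = 2.0, second df = 4.4).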
Yao (1965)
The statistic $F_Y$ is referred to the critical value $F[p, \nu_K]$, with

$$F_Y = \frac{\nu_K}{p\,f_1}\,T_2, \tag{A13}$$

where $f_1$ is given by equation A10 and $\nu_K$ again equals $f_1 - p + 1$.
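A numerical sketch of equations A10 and A13 for the Table 1 data follows. It is written in Python for illustration and is not the authors' program. The key assumption is the weighting matrix inside b_j: the sketch uses A = A1 + A2 (with A_j = S_j/n_j), the weighting of Yao's approximation, because that choice gives a second df that rounds to the value reported for the Y and K procedures in the least-squares panel of Table 3. T2 is again assumed to be the quadratic form in the mean difference with matrix A, and the helpers handle p = 2 only.

```python
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50), (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2  # the helpers below assume 2x2 matrices

def mean_cov(rows):
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(p)]
    s = [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
          for b in range(p)] for a in range(p)]
    return m, s

def inverse(x):
    det = x[0][0] * x[1][1] - x[0][1] * x[1][0]
    return [[x[1][1] / det, -x[0][1] / det], [-x[1][0] / det, x[0][0] / det]]

def quad(mat, vec):
    """Quadratic form vec' mat vec."""
    return sum(vec[i] * mat[i][j] * vec[j] for i in range(p) for j in range(p))

(m1, s1), (m2, s2) = mean_cov(g1), mean_cov(g2)
n1, n2 = len(g1), len(g2)
a1 = [[v / n1 for v in row] for row in s1]        # A_1 = S_1 / n_1
a2 = [[v / n2 for v in row] for row in s2]        # A_2 = S_2 / n_2
a = [[a1[i][j] + a2[i][j] for j in range(p)] for i in range(p)]
a_inv = inverse(a)

d = [m1[k] - m2[k] for k in range(p)]
u = [sum(a_inv[i][k] * d[k] for k in range(p)) for i in range(p)]  # A^-1 (Y1 - Y2)
t2 = sum(d[i] * u[i] for i in range(p))

# Equation A10 with the assumed b_j = u' A_j u
inv_f1 = sum((quad(a_j, u) / t2) ** 2 / (n_j - 1)
             for a_j, n_j in ((a1, n1), (a2, n2)))
f1 = 1 / inv_f1
nu_k = f1 - p + 1
f_y = nu_k / (p * f1) * t2                        # equation A13
print(round(f_y, 1), round(nu_k, 1))
```

With this weighting the rounded results match the least-squares Y row of Table 3 (FY = 2.1, second df = 6.1).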
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of the matrix.
2. The skewness for the normal distribution is zero.
Table 1. Multivariate Example Data Set

Group   Subject   Yi1   Yi2
1       1           1    51
1       2          28    48
1       3          12    49
1       4          13    51
1       5          13    52
1       6          11    47
2       1          19    46
2       2          18    48
2       3          18    50
2       4          21    50
2       5          19    45
2       6          20    46
2       7          22    48
2       8          18    49
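Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators

Group 1: $\bar{\mathbf{Y}}_1 = [13.0 \;\; 49.7]$, $\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}$
Group 2: $\bar{\mathbf{Y}}_2 = [19.4 \;\; 47.7]$, $\mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$

Robust Estimators

Group 1: $\bar{\mathbf{Y}}_{t1} = [12.2 \;\; 49.8]$, $\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}$
Group 2: $\bar{\mathbf{Y}}_{t2} = [19.2 \;\; 47.8]$, $\mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$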
Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure  Test Statistic            df                  p-value/Critical value (CV)  Decision re Null Hypothesis

Least-Squares Estimators
T²         T1 = 6.1, FT = 2.8        ν1 = 2, ν2 = 11     p = .106                     Fail to Reject
BF         TBF = 9.1, FBF = 3.7      ν1 = 4, ν2 = 4.4    p = .116                     Fail to Reject
J2         T2 = 5.0                  ν1 = 2              CV = 14.2                    Fail to Reject
J          T2 = 5.0, FJ = 2.3        ν1 = 2, ν2 = 6.9    p = .175                     Fail to Reject
K          FK = 2.5                  ν1 = 1.5, ν2 = 6.1  p = .164                     Fail to Reject
NV         T2 = 5.0, FNV = 2.0       ν1 = 2, ν2 = 4.4    p = .237                     Fail to Reject
Y          T2 = 5.0, FY = 2.1        ν1 = 2, ν2 = 6.1    p = .198                     Fail to Reject

Robust Estimators
T²         T1 = 59.0, FT = 25.8      ν1 = 2, ν2 = 7      p = .001                     Reject
BF         TBF = 131.2, FBF = 56.2   ν1 = 5, ν2 = 6.0    p = .001                     Reject
J2         T2 = 65.2                 ν1 = 2              CV = 13.3                    Reject
J          T2 = 65.2, FJ = 29.5      ν1 = 2, ν2 = 6.3    p = .001                     Reject
K          FK = 28.1                 ν1 = 2.0, ν2 = 6.6  p = .001                     Reject
NV         T2 = 65.2, FNV = 27.9     ν1 = 2, ν2 = 6.0    p = .001                     Reject
Y          T2 = 65.2, FY = 28.3      ν1 = 2, ν2 = 6.6    p = .001                     Reject

Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
Multivariate Tests of Means
21
The choice among the Brown-Forsythe (1974) James (1954) second order Johansen
(1980) Kim (1992) Nel and Van der Merwe (1986) and Yao (1965) procedures with robust
estimators will depend on the characteristics of the data such as the number of dependent
variables the nature of the relationship between group sizes and covariance matrices and the
degree of inequality of population covariance matrices Current knowledge suggests that the Kim
(1992) procedure may be among the best choice (Wilcox 1995b) because it does not result in
liberal or conservative tests under many data-analytic conditions and provides good statistical
power to detect between-group differences on multiple outcome variables Further research is
needed however to provide more specific recommendations regarding the performance of these
six procedures when robust estimators are adopted
Finally we would like to note that the majority of the procedures that have been
described in this paper can be generalized to the case of more than two independent groups (see
eg Coombs amp Algina 1996) Thus applied health researchers have the opportunity to adopt
robust test procedures for a variety of multivariate data-analytic situations
Multivariate Tests of Means
22
References
Algina J Oshima T C amp Tang K L (1991) Robustness of Yaorsquos Jamesrsquo and Johansenrsquos
tests under variance-covariance heteroscedasticity and nonnormality Journal of Educational
Statistics 16 125-139
Brown M B amp Forsythe A B (1974) The small sample behavior of some statistics which test
the equality of several means Technometrics 16 385-389
Christensen W F amp Rencher A C (1997) A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem Communications in
Statistics ndash Simulation and Computation 26 1251-1273
Coombs W T amp Algina J (1996) New test statistics for MANOVAdescriptive discriminant
analysis Educational and Psychological Measurement 56 382-402
de la Rey N amp Nel D G (1993) A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem South African Statistical
Journal 27 129-148
Everitt B S (1979) A Monte Carlo investigation of the robustness of Hotellingrsquos one- and two-
sample T2 tests Journal of the American Statistical Association 74 48-51
Fouladi R T amp Yockey R D (2002) Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions Communications in Statistics ndash Simulation and Computation 31 375-400
Grissom R J (2000) Heterogeneity of variance in clinical data Journal of Consulting and
Clinical Psychology 68 155-165
Hakstian A R Roed J C amp Lind J C (1979) Two-sample T2 procedure and the assumption
of homogeneous covariance matrices Psychological Bulletin 56 1255-1263
Multivariate Tests of Means
23
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between Myers-Briggs psychological traits and use of course objectives in anatomy and physiology. Evaluation & the Health Professions, 19, 243-252.
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data. Biometrics, 38, 377-396.
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the American Statistical Association, 62, 124-136.
Hoover, D. R. (2002). Clinical trials of behavioural interventions with heterogeneous teaching subgroup effects. Statistics in Medicine, 30, 1351-1364.
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the American Statistical Association, 58, 1048-1053.
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics, 2, 360-378.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah (Ed.), Handbook of statistics (Vol. 1, pp. 199-236). New York: North-Holland.
Ito, K., & Schull, W. J. (1964). On the robustness of the T₀² test in multivariate analysis of variance when variance-covariance matrices are not equal. Biometrika, 51, 71-82.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the ratios of population variances are unknown. Biometrika, 41, 19-43.
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of squares in a weighted linear regression. Biometrika, 67, 85-92.
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses revisited: An update based on trimmed means. Psychometrika, 63, 145-163.
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika, 79, 171-176.
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health care. Evaluation & the Health Professions, 6, 465-482.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58, 409-429.
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139-1145.
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher problem. Communications in Statistics – Simulation and Computation, 15, 3719-3735.
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Cary, NC: Author.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Cary, NC: Author.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research knowledge. Evaluation & the Health Professions, 18, 408-427.
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and self-reported behavior of college students. Evaluation & the Health Professions, 24, 336-357.
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral Research, 36, 1-27.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect size. Review of Educational Research, 65, 51-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher problem via trimmed means. The Statistician, 44, 213-225.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher problem. Biometrika, 52, 139-147.
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 61, 165-170.
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate Behavioral Research, 21, 169-186.
Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) T² Test
Brown and Forsythe (1974)
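The numeric formulas presented here are based on the work of Brown and Forsythe, with
the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo,
& Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then

$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF}, \tag{A1}$$

where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and

$$f_2 = \frac{\mathrm{tr}(\mathbf{G}_1^2) + (\mathrm{tr}\,\mathbf{G}_1)^2}{\displaystyle\sum_{j=1}^{2} \frac{1}{n_j - 1}\left\{\mathrm{tr}\left[(\bar{w}_j \mathbf{S}_j)^2\right] + \left[\mathrm{tr}(\bar{w}_j \mathbf{S}_j)\right]^2\right\}}. \tag{A2}$$

In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1 \mathbf{S}_1 + \bar{w}_2 \mathbf{S}_2$. The test statistic $F_{BF}$ is
compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where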
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and $\mathbf{G}_2 = w_1 \mathbf{S}_1 + w_2 \mathbf{S}_2$.
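As a rough illustration of equations A1 and A2, the sketch below evaluates $f_2$ and $\nu_{BF2}$ for the least-squares estimators of the Table 1 data. It is only a sketch under stated assumptions: the helper names (`cov`, `tr_sq`, `spread`) are hypothetical, $N$ is taken to be $n_1 + n_2$, and $T_{BF}$ is not recomputed (equation 9 is defined in the main text) but copied from Table 3.

```python
# Table 1 data: one (Yi1, Yi2) pair per subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50),
      (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2  # number of outcome variables


def cov(rows):
    """Unbiased (n - 1 divisor) covariance matrix of 2-column data."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(p)]
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def tr(M):
    """Trace of a square matrix."""
    return sum(M[i][i] for i in range(p))


def tr_sq(M):
    """Trace of the matrix product M M."""
    return sum(M[i][k] * M[k][i] for i in range(p) for k in range(p))


def spread(M):
    """tr(M^2) + (tr M)^2, the quantity that recurs in equation A2."""
    return tr_sq(M) + tr(M) ** 2


n = [len(g1), len(g2)]
N = sum(n)                                   # assumed: N = n1 + n2
S = [cov(g1), cov(g2)]
w_bar = [1 - nj / N for nj in n]             # 1 - w_j
WS = [[[w_bar[j] * x for x in row] for row in S[j]] for j in range(2)]
G1 = [[WS[0][a][b] + WS[1][a][b] for b in range(p)] for a in range(p)]

f2 = spread(G1) / sum(spread(WS[j]) / (n[j] - 1) for j in range(2))  # A2
nu_bf2 = f2 - p + 1
T_bf = 9.1                                   # copied from Table 3
F_bf = nu_bf2 / (p * f2) * T_bf              # A1
print(round(f2, 2), round(nu_bf2, 1), round(F_bf, 1))
```

With these inputs the denominator df and F statistic should agree, to one decimal, with the least-squares BF row of Table 3.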
James (1954) Second Order
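The test statistic $T_2$ of equation 8 is compared to the critical value $\chi^2_p (A + \chi^2_p B) + q$,
where $\chi^2_p$ is the $1 - \alpha$ percentile point of the $\chi^2$ distribution with $p$ df,

$$A = 1 + \frac{1}{2p}\sum_{j=1}^{2}\frac{1}{n_j - 1}\left[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2, \tag{A4}$$

$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and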
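
$$B = \frac{1}{p(p+2)}\sum_{j=1}^{2}\frac{1}{n_j - 1}\left\{\mathrm{tr}\left[(\mathbf{A}^{-1}\mathbf{A}_j)^2\right] + \frac{1}{2}\left[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2\right\}. \tag{A5}$$

The constant $q$ is based on a lengthy formula, which has not been reproduced here; it can be
found in equation 67 of James (1954).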
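The two coefficients that enter the James critical value can be evaluated directly from the group covariance matrices. The following is a minimal sketch, assuming the least-squares estimators of the Table 1 data with $p = 2$. The helper names and the variable names `coef_A` and `coef_B` are hypothetical, and the constant $q$ is not computed, so the full critical value is not reproduced.

```python
# Table 1 data: one (Yi1, Yi2) pair per subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50),
      (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2


def cov(rows):
    """Unbiased (n - 1 divisor) covariance matrix of 2-column data."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(p)]
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def matmul(M, N):
    """Product of two p x p matrices."""
    return [[sum(M[i][k] * N[k][j] for k in range(p)) for j in range(p)]
            for i in range(p)]


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def tr(M):
    """Trace of a 2 x 2 matrix."""
    return M[0][0] + M[1][1]


groups = (g1, g2)
n = [len(g) for g in groups]
A_j = [[[x / n[j] for x in row] for row in cov(groups[j])] for j in range(2)]
A = [[A_j[0][a][b] + A_j[1][a][b] for b in range(p)] for a in range(p)]
M = [matmul(inv2(A), A_j[j]) for j in range(2)]      # A^{-1} A_j

# Equation A4
coef_A = 1 + sum(tr(M[j]) ** 2 / (n[j] - 1) for j in range(2)) / (2 * p)
# Equation A5
coef_B = sum((tr(matmul(M[j], M[j])) + 0.5 * tr(M[j]) ** 2) / (n[j] - 1)
             for j in range(2)) / (p * (p + 2))
print(round(coef_A, 4), round(coef_B, 4))
```

Both coefficients depend on the data only through $\mathbf{A}^{-1}\mathbf{A}_j$, so they are unchanged if both covariance matrices are multiplied by a common constant.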
Johansen (1980)
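Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and

$$C = \frac{1}{2}\sum_{j=1}^{2}\frac{1}{n_j - 1}\left\{\mathrm{tr}\left[(\mathbf{A}^{-1}\mathbf{A}_j)^2\right] + \left[\mathrm{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2\right\}. \tag{A6}$$

The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.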
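The Johansen calculation can be traced step by step on the Table 1 data. The sketch below makes two assumptions that are not stated in this appendix: equation 8 is taken to be $T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T \mathbf{A}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$, and the correction term uses the divisor $p + 2$, which is the value that reproduces $F_J = 2.3$ in Table 3. All helper names are hypothetical.

```python
# Table 1 data: one (Yi1, Yi2) pair per subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50),
      (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2


def mean_vec(rows):
    """Column means of 2-column data."""
    return [sum(r[k] for r in rows) / len(rows) for k in range(p)]


def cov(rows):
    """Unbiased (n - 1 divisor) covariance matrix of 2-column data."""
    n, m = len(rows), mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def matmul(M, N):
    """Product of two p x p matrices."""
    return [[sum(M[i][k] * N[k][j] for k in range(p)) for j in range(p)]
            for i in range(p)]


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def tr(M):
    """Trace of a 2 x 2 matrix."""
    return M[0][0] + M[1][1]


groups = (g1, g2)
n = [len(g) for g in groups]
A_j = [[[x / n[j] for x in row] for row in cov(groups[j])] for j in range(2)]
A = [[A_j[0][a][b] + A_j[1][a][b] for b in range(p)] for a in range(p)]
A_inv = inv2(A)

# Assumed form of equation 8: quadratic form in the mean difference.
d = [u - v for u, v in zip(mean_vec(g1), mean_vec(g2))]
T2 = sum(d[a] * A_inv[a][b] * d[b] for a in range(p) for b in range(p))

M = [matmul(A_inv, A_j[j]) for j in range(2)]        # A^{-1} A_j
C = 0.5 * sum((tr(matmul(M[j], M[j])) + tr(M[j]) ** 2) / (n[j] - 1)
              for j in range(2))                     # A6
c2 = p + 2 * C - 6 * C / (p + 2)
nu_J = p * (p + 2) / (3 * C)
F_J = T2 / c2
print(round(T2, 1), round(F_J, 1), round(nu_J, 1))
```

Under these assumptions the three printed values should match the least-squares J row of Table 3 to one decimal place.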
Kim (1992)
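The K procedure is based on the test statistic

$$F_K = \frac{\nu_K}{c\, m\, f_1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\,\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2), \tag{A7}$$

where $\mathbf{V} = \mathbf{A}_1 + r^2\mathbf{A}_2 + 2r\,\mathbf{A}_2^{1/2}\left(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2}\right)^{1/2}\mathbf{A}_2^{1/2}$,

$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8}$$

$$m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9}$$

$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the $l$th eigenvalue of $\mathbf{A}_1\mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1\mathbf{A}_2^{-1}|^{1/(2p)}$,
and $|\cdot|$ is the determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$,
where $\nu_K = f_1 - p + 1$,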
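
$$\frac{1}{f_1} = \sum_{j=1}^{2}\frac{1}{n_j - 1}\left(\frac{b_j}{T_2}\right)^2, \tag{A10}$$

and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\,\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.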
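The pieces of the K procedure can be assembled for the bivariate example as follows. This is a sketch under several assumptions: it is restricted to $p = 2$ so that matrix square roots and eigenvalues have closed forms, the weight $r^2$ is applied to $\mathbf{A}_2$ in $\mathbf{V}$ (the arrangement consistent with the definition of $h_l$), and $\nu_K = 6.1$ is copied from Table 3 rather than derived from equation A10. All function names are hypothetical.

```python
import math

# Table 1 data: one (Yi1, Yi2) pair per subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50),
      (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2


def mean_vec(rows):
    return [sum(r[k] for r in rows) / len(rows) for k in range(p)]


def cov(rows):
    n, m = len(rows), mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(p)) for j in range(p)]
            for i in range(p)]


def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def inv2(M):
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]


def sqrtm2(M):
    """Principal square root of a 2 x 2 symmetric positive definite matrix."""
    s = math.sqrt(det2(M))
    t = math.sqrt(M[0][0] + M[1][1] + 2 * s)
    return [[(M[0][0] + s) / t, M[0][1] / t], [M[1][0] / t, (M[1][1] + s) / t]]


A1 = [[x / len(g1) for x in row] for row in cov(g1)]
A2 = [[x / len(g2) for x in row] for row in cov(g2)]
A2_half = sqrtm2(A2)
A2_half_inv = inv2(A2_half)
# K is symmetric and shares its eigenvalues and determinant with A1 A2^{-1}.
K = matmul(matmul(A2_half_inv, A1), A2_half_inv)

trace_K, det_K = K[0][0] + K[1][1], det2(K)
root = math.sqrt(trace_K ** 2 - 4 * det_K)
d_l = [(trace_K + root) / 2, (trace_K - root) / 2]
r = det_K ** (1 / (2 * p))
h = [(dl + 1) / (math.sqrt(dl) + r) ** 2 for dl in d_l]
c = sum(x * x for x in h) / sum(h)                   # A8
m = sum(h) ** 2 / sum(x * x for x in h)              # A9

Q = matmul(matmul(A2_half, sqrtm2(K)), A2_half)
V = [[A1[a][b] + r * r * A2[a][b] + 2 * r * Q[a][b] for b in range(p)]
     for a in range(p)]
V_inv = inv2(V)
d = [u - v for u, v in zip(mean_vec(g1), mean_vec(g2))]
quad = sum(d[a] * V_inv[a][b] * d[b] for a in range(p) for b in range(p))

nu_K = 6.1                                           # copied from Table 3
f1 = nu_K + p - 1
F_K = nu_K / (c * m * f1) * quad                     # A7
print(round(m, 1), round(F_K, 1))
```

Because $c\,m = \sum_l h_l$, only the sum of the $h_l$ enters equation A7; $m$ matters separately as the numerator df of the reference F distribution.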
Nel and van der Merwe (1986)
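Let

$$F_{NV} = \frac{\nu_N}{p f_2}\, T_2, \tag{A11}$$

where $\nu_N = f_2 - p + 1$ and

$$f_2 = \frac{\mathrm{tr}(\mathbf{A}^2) + (\mathrm{tr}\,\mathbf{A})^2}{\displaystyle\sum_{j=1}^{2}\frac{1}{n_j - 1}\left[\mathrm{tr}(\mathbf{A}_j^2) + (\mathrm{tr}\,\mathbf{A}_j)^2\right]}. \tag{A12}$$

The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.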
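Equations A11 and A12 translate almost line for line into code. The sketch below applies them to the least-squares estimators of the Table 1 data, again assuming that equation 8 is $T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T \mathbf{A}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$. The helper names are hypothetical.

```python
# Table 1 data: one (Yi1, Yi2) pair per subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50),
      (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2


def mean_vec(rows):
    """Column means of 2-column data."""
    return [sum(r[k] for r in rows) / len(rows) for k in range(p)]


def cov(rows):
    """Unbiased (n - 1 divisor) covariance matrix of 2-column data."""
    n, m = len(rows), mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def spread(M):
    """tr(M^2) + (tr M)^2 for a p x p matrix."""
    tr_m = sum(M[i][i] for i in range(p))
    tr_m2 = sum(M[i][k] * M[k][i] for i in range(p) for k in range(p))
    return tr_m2 + tr_m ** 2


groups = (g1, g2)
n = [len(g) for g in groups]
A_j = [[[x / n[j] for x in row] for row in cov(groups[j])] for j in range(2)]
A = [[A_j[0][a][b] + A_j[1][a][b] for b in range(p)] for a in range(p)]
A_inv = inv2(A)

# Assumed form of equation 8.
d = [u - v for u, v in zip(mean_vec(g1), mean_vec(g2))]
T2 = sum(d[a] * A_inv[a][b] * d[b] for a in range(p) for b in range(p))

f2 = spread(A) / sum(spread(A_j[j]) / (n[j] - 1) for j in range(2))  # A12
nu_N = f2 - p + 1
F_NV = nu_N / (p * f2) * T2                                          # A11
print(round(f2, 2), round(nu_N, 1), round(F_NV, 1))
```

The ratio in equation A12 is unchanged when every $\mathbf{A}_j$ is multiplied by the same constant, which is why it can coincide numerically with the Brown and Forsythe $f_2$ in a two-group design.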
Yao (1965)
The statistic

$$F_Y = \frac{\nu_K}{p f_1}\, T_2 \tag{A13}$$

is referred to the critical value $F[p, \nu_K]$, where $f_1$ is given by equation A10 and $\nu_K$ again equals $f_1 - p + 1$.
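A small numerical check of the Yao procedure on the Table 1 data is sketched below. Two assumptions are made and should be read as such: equation 8 is taken to be $T_2 = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^T \mathbf{A}^{-1} (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$, and the matrix $\mathbf{V}$ in $b_j$ is replaced by $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, as in Yao's (1965) original approximation. With that choice the result agrees with the df reported for the Y row of Table 3. Helper names are hypothetical.

```python
# Table 1 data: one (Yi1, Yi2) pair per subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50),
      (19, 45), (20, 46), (22, 48), (18, 49)]
p = 2


def mean_vec(rows):
    """Column means of 2-column data."""
    return [sum(r[k] for r in rows) / len(rows) for k in range(p)]


def cov(rows):
    """Unbiased (n - 1 divisor) covariance matrix of 2-column data."""
    n, m = len(rows), mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def quad_form(x, M):
    """x' M x for a vector x and a p x p matrix M."""
    return sum(x[a] * M[a][b] * x[b] for a in range(p) for b in range(p))


groups = (g1, g2)
n = [len(g) for g in groups]
A_j = [[[x / n[j] for x in row] for row in cov(groups[j])] for j in range(2)]
A = [[A_j[0][a][b] + A_j[1][a][b] for b in range(p)] for a in range(p)]
A_inv = inv2(A)

d = [u - v for u, v in zip(mean_vec(g1), mean_vec(g2))]
T2 = quad_form(d, A_inv)                         # assumed form of equation 8
u = [sum(A_inv[a][b] * d[b] for b in range(p)) for a in range(p)]  # A^{-1} d
b = [quad_form(u, A_j[j]) for j in range(2)]     # b_j with V taken as A

f1 = 1 / sum((b[j] / T2) ** 2 / (n[j] - 1) for j in range(2))      # A10
nu_K = f1 - p + 1
F_Y = nu_K / (p * f1) * T2                                         # A13
print(round(f1, 1), round(nu_K, 1), round(F_Y, 1))
```

Under this assumption $b_1 + b_2 = T_2$, so the two ratios $b_j/T_2$ act as weights that sum to one.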
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of the matrix.
2. The skewness of the normal distribution is zero.
Table 1. Multivariate Example Data Set

Group   Subject   Yi1   Yi2
1       1           1    51
1       2          28    48
1       3          12    49
1       4          13    51
1       5          13    52
1       6          11    47
2       1          19    46
2       2          18    48
2       3          18    50
2       4          21    50
2       5          19    45
2       6          20    46
2       7          22    48
2       8          18    49
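Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators

$$\bar{\mathbf{Y}}_1 = \begin{bmatrix} 13.0 & 49.7 \end{bmatrix}, \qquad \bar{\mathbf{Y}}_2 = \begin{bmatrix} 19.4 & 47.7 \end{bmatrix}$$

$$\mathbf{S}_1 = \begin{bmatrix} 74.8 & -7.0 \\ -7.0 & 3.9 \end{bmatrix}, \qquad \mathbf{S}_2 = \begin{bmatrix} 2.3 & -0.04 \\ -0.04 & 3.6 \end{bmatrix}$$

Robust Estimators

$$\bar{\mathbf{Y}}_{t1} = \begin{bmatrix} 12.2 & 49.8 \end{bmatrix}, \qquad \bar{\mathbf{Y}}_{t2} = \begin{bmatrix} 19.2 & 47.8 \end{bmatrix}$$

$$\mathbf{S}_{w1} = \begin{bmatrix} 1.0 & 0.3 \\ 0.3 & 2.3 \end{bmatrix}, \qquad \mathbf{S}_{w2} = \begin{bmatrix} 1.6 & -0.1 \\ -0.1 & 3.0 \end{bmatrix}$$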
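The summary statistics in Table 2 can be recomputed from the raw scores in Table 1. The sketch below is illustrative only: it assumes 20% symmetric trimming applied separately to each variable, with Winsorized scores kept in their original subject order and an $n - 1$ divisor for the Winsorized covariance matrix. These choices reproduce the tabled values but are assumptions about the procedure, and the function names are hypothetical.

```python
import math

# Table 1 data: one (Yi1, Yi2) pair per subject.
g1 = [(1, 51), (28, 48), (12, 49), (13, 51), (13, 52), (11, 47)]
g2 = [(19, 46), (18, 48), (18, 50), (21, 50),
      (19, 45), (20, 46), (22, 48), (18, 49)]


def mean_vec(rows):
    """Column means of 2-column data."""
    return [sum(r[k] for r in rows) / len(rows) for k in range(2)]


def cov(rows):
    """Unbiased (n - 1 divisor) covariance matrix of 2-column data."""
    n, m = len(rows), mean_vec(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(2)] for a in range(2)]


def robust_summary(rows, prop=0.2):
    """Trimmed means and Winsorized covariance matrix (assumed 20% trimming)."""
    n = len(rows)
    g = math.floor(prop * n)             # scores removed from each tail
    trimmed_means, wins_cols = [], []
    for k in range(2):
        col = [r[k] for r in rows]
        s = sorted(col)
        lo, hi = s[g], s[n - g - 1]      # Winsorizing limits
        trimmed_means.append(sum(s[g:n - g]) / (n - 2 * g))
        # Pull extreme scores in to the limits, keeping subject order.
        wins_cols.append([min(max(v, lo), hi) for v in col])
    return trimmed_means, cov(list(zip(*wins_cols)))


for rows in (g1, g2):
    t_mean, s_w = robust_summary(rows)
    print([round(x, 1) for x in mean_vec(rows)],
          [[round(x, 2) for x in row] for row in cov(rows)])
    print([round(x, 2) for x in t_mean],
          [[round(x, 2) for x in row] for row in s_w])
```

In group 1 the scores 1 and 28 on the first variable are the ones pulled in by Winsorizing, which is why the first diagonal element drops from 74.8 in the least-squares covariance matrix to about 1.0 in the Winsorized one.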
Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure   Test Statistic              df                   p-value/Critical value (CV)   Decision re Null Hypothesis

Least-Squares Estimators
T²          T1 = 6.1, FT = 2.8          ν1 = 2, ν2 = 11      p = .106                      Fail to Reject
BF          TBF = 9.1, FBF = 3.7        ν1 = .4, ν2 = 4.4    p = .116                      Fail to Reject
J2          T2 = 5.0                    ν1 = 2               CV = 14.2                     Fail to Reject
J           T2 = 5.0, FJ = 2.3          ν1 = 2, ν2 = 6.9     p = .175                      Fail to Reject
K           FK = 2.5                    ν1 = 1.5, ν2 = 6.1   p = .164                      Fail to Reject
NV          T2 = 5.0, FNV = 2.0         ν1 = 2, ν2 = 4.4     p = .237                      Fail to Reject
Y           T2 = 5.0, FY = 2.1          ν1 = 2, ν2 = 6.1     p = .198                      Fail to Reject

Robust Estimators
T²          T1 = 59.0, FT = 25.8        ν1 = 2, ν2 = 7       p = .001                      Reject
BF          TBF = 131.2, FBF = 56.2     ν1 = .5, ν2 = 6.0    p = .001                      Reject
J2          T2 = 65.2                   ν1 = 2               CV = 13.3                     Reject
J           T2 = 65.2, FJ = 29.5        ν1 = 2, ν2 = 6.3     p = .001                      Reject
K           FK = 28.1                   ν1 = 2.0, ν2 = 6.6   p = .001                      Reject
NV          T2 = 65.2, FNV = 27.9       ν1 = 2, ν2 = 6.0     p = .001                      Reject
Y           T2 = 65.2, FY = 28.3        ν1 = 2, ν2 = 6.6     p = .001                      Reject

Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second
order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao
(1965).
Multivariate Tests of Means
22
References
Algina J Oshima T C amp Tang K L (1991) Robustness of Yaorsquos Jamesrsquo and Johansenrsquos
tests under variance-covariance heteroscedasticity and nonnormality Journal of Educational
Statistics 16 125-139
Brown M B amp Forsythe A B (1974) The small sample behavior of some statistics which test
the equality of several means Technometrics 16 385-389
Christensen W F amp Rencher A C (1997) A comparison of Type I error rates and power
levels for seven solutions to the multivariate Behrens-Fisher problem Communications in
Statistics ndash Simulation and Computation 26 1251-1273
Coombs W T amp Algina J (1996) New test statistics for MANOVAdescriptive discriminant
analysis Educational and Psychological Measurement 56 382-402
de la Rey N amp Nel D G (1993) A comparison of the significance levels and power functions
of several solutions to the multivariate Behrens-Fisher problem South African Statistical
Journal 27 129-148
Everitt B S (1979) A Monte Carlo investigation of the robustness of Hotellingrsquos one- and two-
sample T2 tests Journal of the American Statistical Association 74 48-51
Fouladi R T amp Yockey R D (2002) Type I error control of two-group multivariate tests on
means under conditions of heterogeneous correlation structure and varied multivariate
distributions Communications in Statistics ndash Simulation and Computation 31 375-400
Grissom R J (2000) Heterogeneity of variance in clinical data Journal of Consulting and
Clinical Psychology 68 155-165
Hakstian A R Roed J C amp Lind J C (1979) Two-sample T2 procedure and the assumption
of homogeneous covariance matrices Psychological Bulletin 56 1255-1263
Multivariate Tests of Means
23
Harasym P H Leong E J Lucier G E amp Lorscheider F L (1996) Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology
Evaluation amp the Health Professions 19 243-252
Hill M A amp Dixon W J (1982) Robustness in real life A study of clinical laboratory data
Biometrics 38 377-396
Holloway L N amp Dunn O J (1967) The robustness of Hotellingrsquos T2 Journal of the
American Statistical Association 62 124-136
Hoover D R (2002) Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects Statistics in Medicine 30 1351-1364
Hopkins J W amp Clay P P F (1963) Some empirical distributions of bivariate T2 and
homoscedasticity criterion M under unequal variance and leptokurtosis Journal of the
American Statistical Association 58 1048-1053
Hotelling H (1931) The generalization of studentrsquos ratio Annals of Mathematical Statistics 2
360-378
Ito P K (1980) Robustness of ANOVA and MANOVA test procedures In P R Krishnaiah
(ed) Handbook of Statistics Vol 1 (pp 199-236) North-Holland New York
Ito K amp Schull W J (1964) On the robustness of the T2
0 test in multivariate analysis of
variance when variance-covariance matrices are not equal Biometrika 51 71-82
James G S (1954) Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown Biometrika 41 19-43
Johansen S (1980) The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression Biometrika 67 85-92
Keselman H J Kowalchuk R K amp Lix L M (1998) Robust nonorthogonal analyses
Multivariate Tests of Means
24
revisited An update based on trimmed means Psychometrika 63 145-163
Kim S J (1992) A practical solution to the multivariate Behrens-Fisher problem Biometrika
79 171-176
Knapp R G amp Miller M C (1983) Monitoring simultaneously two or more indices of health
care Evaluation amp the Health Professions 6 465-482
Lix L M amp Keselman H J (1998) To trim or not to trim Tests of mean equality under
heteroscedasticity and nonnormality Educational and Psychological Measurement 58 409-
429
Mehrotra D V (1997) Improving the Brown-Forsythe solution to the generalized Behrens-
Fisher problem Communications in Statistics ndash Simulation and Computation 26 1139-1145
Nel D G amp van der Merwe C A (1986) A solution to the multivariate Behrens-Fisher
problem Communications in Statistics ndash Simulation and Computation 15 3719-3735
SAS Institute Inc (1999a) SASIML userrsquos guide Version 8 Author Cary NC
SAS Institute Inc (1999b) SASSTAT userrsquos guide Version 8 Author Cary NC
Schmidt F amp Hunter J E (1995) The impact of data-analysis methods on cumulative research
knowledge Evaluation amp the Health Professions 18 408-427
Sharmer L (2001) Evaluation of alcohol education programs on attitude knowledge and self-
reported behavior of college students Evaluation amp the Health Professions 24 336-357
Vallejo G Fidalgo A amp Fernandez P (2001) Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs Multivariate Behavioral
Research 36 1-27
Wilcox R R (1995a) ANOVA A paradigm for low power and misleading measures of effect
size Review of Educational Research 65 51-77
Multivariate Tests of Means
25
Wilcox R R (1995b) Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means The Statistician 44 213-225
Yao Y (1965) An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem Biometrika 52 139-147
Yuen K K (1974) The two-sample trimmed t for unequal population variances Biometrika 61
165-170
Zwick R (1986) Rank and normal scores alternatives to Hotellingrsquos T2 Multivariate
Behavioral Research 21 169-186
Multivariate Tests of Means
26
Appendix
Numeric Formulas for Alternatives to Hotellingrsquos (1931) T2 Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe with
the modifications to the df calculations suggested by Mehrotra (1997 see also Vallejo Fidalgo
amp Fernandez 2001) Let wj = njN and jw = 1 ndash wj Then
ν
BF
2
BF2BF T
pfF (A1)
where νBF2 = f2 ndash p + 1 TBF is given in equation 9 and
1
1
1
122
22
22
2
1122
11
1
122
12
SSSS
GG
wtrwtrn
wtrwtrn
trtrf (A2)
In equation A2 tr denotes the trace of a matrix and 22111 SSG ww The test statistic FBF is
compared to the critical value F[νBF1 νBF2] where
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and G2 = w1S1 + w2S2
James (1954) Second Order
The test statistic T2 of equation 8 is compared to the critical value 2
p (A + 2
p B) + q
where 2
p is the 1 ndash α percentile point of the χ2 distribution with p df
1
1
1
1
2
11
2
2
1-
2
2
1
1-
1
AAAA trn
trnp
A (A4)
Aj = Sjnj 21 AAA and
Multivariate Tests of Means
27
2
1
1
1
2
1
1
1
)2(
1 2
2
1-
2
1-
2
1-
2
2
1
1-
1
1-
1
1-
1
AAAAAAAAAAAA trtrn
trtrnpp
B (A5)
The constant q is based on a lengthy formula which has not been reproduced here it can be
found in equation 67 of James (1954)
Johansen (1980)
Let FJ = T2c2 where c2 = p + 2C ndash 6C(p + 1) and
2
1
1-221- 1
1
2
1
j
jj
j
trtrn
C AAAA (A6)
The test statistic FJ is compared to the critical value F[p νJ] where νJ = p(p + 2)3C
Kim (1992)
The K procedure is based on the test statistic
ν
11
21
1-T
21KK
mfcF
YYVYY (A7)
where 21
2
2121
21
21
2
21
22
2
1 2 AAAAAAAV rr
1
1
2
1 p
l
l
p
l
l
h
h
c (A8)
1
2
2
1
p
l
l
p
l
l
h
h
m (A9)
hl = (dl + 1)(dl12
+r)2 where dl is the l
th eigenvalue of 1
21AA r = | 1
21AA |1(2p)
and | | is the
determinant The test statistic FK is compared to the critical value F[m νK] where νK = f1 ndash p + 1
Multivariate Tests of Means
28
2
1j
2
21
1
1
jj b
T
nf (A10)
and 21
1-1-T
21 YYVAVYY jjb
Nel and van der Merwe (1986)
Let
ν
2
2N
NVpf
TF (A11)
where νN = f2 ndash p + 1 and
1
12
1
2222
2
j
jj
j
trtrn
trtrf AAAA (A12)
The FNV statistic is compared to the critical value F[p νN]
Yao (1965)
The statistic FY is referred to the critical value F[p νK]
ν
1
2KY
pf
TF (A13)
where f1 is given by equation A10 and νK again equals f1 ndash p + 1
Multivariate Tests of Means
29
Footnotes
1The sum of the eigenvalues of a matrix is called the trace of a matrix
2The skewness for the normal distribution is zero
Multivariate Tests of Means
30
Table 1 Multivariate Example Data Set
Group Subject Yi1 Yi2
1 1 1 51
1 2 28 48
1 3 12 49
1 4 13 51
1 5 13 52
1 6 11 47
2 1 19 46
2 2 18 48
2 3 18 50
2 4 21 50
2 5 19 45
2 6 20 46
2 7 22 48
2 8 18 49
Multivariate Tests of Means
31
Table 2 Summary Statistics for Least-Squares and Robust Estimators
Least-Squares Estimators
Robust Estimators
63
040322S
93
078741S
7490131Y 7474192Y
8492121t
Y 847219t2Y
03
1061w2S
32
30011wS
Multivariate Tests of Means
32
Table 3 Hypothesis Test Results for Multivariate Example Data Set
Procedure Test Statistic df p-valueCritical value (CV) Decision re Null
Hypothesis
Least-Squares Estimators
T2
T1 = 61
FT = 28
ν1 = 2
ν2 = 11
p = 106 Fail to Reject
BF TBF = 91
FBF = 37
ν1 = 4
ν2 = 44
p = 116 Fail to Reject
J2 T2 = 50 ν1 = 2 CV = 142 Fail to Reject
J T2 = 50
FJ = 23
ν1 = 2
ν2 = 69
p = 175 Fail to Reject
K FK = 25 ν1 = 15
ν2 = 61
p = 164 Fail to Reject
NV T2 = 50
FNV = 20
ν1 = 2
ν2 = 44
p = 237 Fail to Reject
Y T2 = 50
FY = 21
ν1 = 2
ν2 = 61
p = 198 Fail to Reject
Robust Estimators
T2
T1 = 590
FT = 258
ν1 = 2
ν2 = 7
p = 001 Reject
BF TBF = 1312
FBF = 562
ν1 = 5
ν2 = 60
p = 001 Reject
J2 T2 = 652 ν1 = 2 CV = 133 Reject
J T2 = 652
FJ = 295
ν1 = 2
ν2 = 63
p = 001 Reject
K FK = 281 ν1 = 20
ν2 = 66
p = 001 Reject
NV T2 = 652
FNV = 279
ν1 = 2
ν2 = 60
p = 001 Reject
Y T2 = 652
FY = 283
ν1 = 2
ν2 = 66
p = 001 Reject
Note T2 = Hotellingrsquos (1931) T
2 BF = Brown amp Forsythe (1974) J2 = James (1954) second
order J = Johansen (1980) K = Kim (1992) NV = Nel amp van der Merwe (1986) Y = Yao
(1965)
Multivariate Tests of Means
23
Harasym P H Leong E J Lucier G E amp Lorscheider F L (1996) Relationship between
Myers-Briggs psychological traits and use of course objectives in anatomy and physiology
Evaluation amp the Health Professions 19 243-252
Hill M A amp Dixon W J (1982) Robustness in real life A study of clinical laboratory data
Biometrics 38 377-396
Holloway L N amp Dunn O J (1967) The robustness of Hotellingrsquos T2 Journal of the
American Statistical Association 62 124-136
Hoover D R (2002) Clinical trials of behavioural interventions with heterogeneous teaching
subgroup effects Statistics in Medicine 30 1351-1364
Hopkins J W amp Clay P P F (1963) Some empirical distributions of bivariate T2 and
homoscedasticity criterion M under unequal variance and leptokurtosis Journal of the
American Statistical Association 58 1048-1053
Hotelling H (1931) The generalization of studentrsquos ratio Annals of Mathematical Statistics 2
360-378
Ito P K (1980) Robustness of ANOVA and MANOVA test procedures In P R Krishnaiah
(ed) Handbook of Statistics Vol 1 (pp 199-236) North-Holland New York
Ito K amp Schull W J (1964) On the robustness of the T2
0 test in multivariate analysis of
variance when variance-covariance matrices are not equal Biometrika 51 71-82
James G S (1954) Tests of linear hypotheses in univariate and multivariate analysis when the
ratios of population variances are unknown Biometrika 41 19-43
Johansen S (1980) The Welch-James approximation to the distribution of the residual sum of
squares in a weighted linear regression Biometrika 67 85-92
Keselman H J Kowalchuk R K amp Lix L M (1998) Robust nonorthogonal analyses
Multivariate Tests of Means
24
revisited An update based on trimmed means Psychometrika 63 145-163
Kim S J (1992) A practical solution to the multivariate Behrens-Fisher problem Biometrika
79 171-176
Knapp R G amp Miller M C (1983) Monitoring simultaneously two or more indices of health
care Evaluation amp the Health Professions 6 465-482
Lix L M amp Keselman H J (1998) To trim or not to trim Tests of mean equality under
heteroscedasticity and nonnormality Educational and Psychological Measurement 58 409-
429
Mehrotra D V (1997) Improving the Brown-Forsythe solution to the generalized Behrens-
Fisher problem Communications in Statistics ndash Simulation and Computation 26 1139-1145
Nel D G amp van der Merwe C A (1986) A solution to the multivariate Behrens-Fisher
problem Communications in Statistics ndash Simulation and Computation 15 3719-3735
SAS Institute Inc (1999a) SASIML userrsquos guide Version 8 Author Cary NC
SAS Institute Inc (1999b) SASSTAT userrsquos guide Version 8 Author Cary NC
Schmidt F amp Hunter J E (1995) The impact of data-analysis methods on cumulative research
knowledge Evaluation amp the Health Professions 18 408-427
Sharmer L (2001) Evaluation of alcohol education programs on attitude knowledge and self-
reported behavior of college students Evaluation amp the Health Professions 24 336-357
Vallejo G Fidalgo A amp Fernandez P (2001) Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs Multivariate Behavioral
Research 36 1-27
Wilcox R R (1995a) ANOVA A paradigm for low power and misleading measures of effect
size Review of Educational Research 65 51-77
Multivariate Tests of Means
25
Wilcox R R (1995b) Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means The Statistician 44 213-225
Yao Y (1965) An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem Biometrika 52 139-147
Yuen K K (1974) The two-sample trimmed t for unequal population variances Biometrika 61
165-170
Zwick R (1986) Rank and normal scores alternatives to Hotellingrsquos T2 Multivariate
Behavioral Research 21 169-186
Multivariate Tests of Means
26
Appendix
Numeric Formulas for Alternatives to Hotellingrsquos (1931) T2 Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe with
the modifications to the df calculations suggested by Mehrotra (1997 see also Vallejo Fidalgo
amp Fernandez 2001) Let wj = njN and jw = 1 ndash wj Then
ν
BF
2
BF2BF T
pfF (A1)
where νBF2 = f2 ndash p + 1 TBF is given in equation 9 and
1
1
1
122
22
22
2
1122
11
1
122
12
SSSS
GG
wtrwtrn
wtrwtrn
trtrf (A2)
In equation A2 tr denotes the trace of a matrix and 22111 SSG ww The test statistic FBF is
compared to the critical value F[νBF1 νBF2] where
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and G2 = w1S1 + w2S2
James (1954) Second Order
The test statistic T2 of equation 8 is compared to the critical value 2
p (A + 2
p B) + q
where 2
p is the 1 ndash α percentile point of the χ2 distribution with p df
1
1
1
1
2
11
2
2
1-
2
2
1
1-
1
AAAA trn
trnp
A (A4)
Aj = Sjnj 21 AAA and
Multivariate Tests of Means
27
2
1
1
1
2
1
1
1
)2(
1 2
2
1-
2
1-
2
1-
2
2
1
1-
1
1-
1
1-
1
AAAAAAAAAAAA trtrn
trtrnpp
B (A5)
The constant q is based on a lengthy formula which has not been reproduced here it can be
found in equation 67 of James (1954)
Johansen (1980)
Let FJ = T2c2 where c2 = p + 2C ndash 6C(p + 1) and
2
1
1-221- 1
1
2
1
j
jj
j
trtrn
C AAAA (A6)
The test statistic FJ is compared to the critical value F[p νJ] where νJ = p(p + 2)3C
Kim (1992)
The K procedure is based on the test statistic
ν
11
21
1-T
21KK
mfcF
YYVYY (A7)
where 21
2
2121
21
21
2
21
22
2
1 2 AAAAAAAV rr
1
1
2
1 p
l
l
p
l
l
h
h
c (A8)
1
2
2
1
p
l
l
p
l
l
h
h
m (A9)
hl = (dl + 1)(dl12
+r)2 where dl is the l
th eigenvalue of 1
21AA r = | 1
21AA |1(2p)
and | | is the
determinant The test statistic FK is compared to the critical value F[m νK] where νK = f1 ndash p + 1
Multivariate Tests of Means
28
2
1j
2
21
1
1
jj b
T
nf (A10)
and 21
1-1-T
21 YYVAVYY jjb
Nel and van der Merwe (1986)
Let
ν
2
2N
NVpf
TF (A11)
where νN = f2 ndash p + 1 and
1
12
1
2222
2
j
jj
j
trtrn
trtrf AAAA (A12)
The FNV statistic is compared to the critical value F[p νN]
Yao (1965)
The statistic FY is referred to the critical value F[p νK]
ν
1
2KY
pf
TF (A13)
where f1 is given by equation A10 and νK again equals f1 ndash p + 1
Multivariate Tests of Means
29
Footnotes
1The sum of the eigenvalues of a matrix is called the trace of a matrix
2The skewness for the normal distribution is zero
Multivariate Tests of Means
30
Table 1 Multivariate Example Data Set
Group Subject Yi1 Yi2
1 1 1 51
1 2 28 48
1 3 12 49
1 4 13 51
1 5 13 52
1 6 11 47
2 1 19 46
2 2 18 48
2 3 18 50
2 4 21 50
2 5 19 45
2 6 20 46
2 7 22 48
2 8 18 49
Multivariate Tests of Means
31
Table 2 Summary Statistics for Least-Squares and Robust Estimators
Least-Squares Estimators
Robust Estimators
63
040322S
93
078741S
7490131Y 7474192Y
8492121t
Y 847219t2Y
03
1061w2S
32
30011wS
Multivariate Tests of Means
32
Table 3 Hypothesis Test Results for Multivariate Example Data Set
Procedure Test Statistic df p-valueCritical value (CV) Decision re Null
Hypothesis
Least-Squares Estimators
T2
T1 = 61
FT = 28
ν1 = 2
ν2 = 11
p = 106 Fail to Reject
BF TBF = 91
FBF = 37
ν1 = 4
ν2 = 44
p = 116 Fail to Reject
J2 T2 = 50 ν1 = 2 CV = 142 Fail to Reject
J T2 = 50
FJ = 23
ν1 = 2
ν2 = 69
p = 175 Fail to Reject
K FK = 25 ν1 = 15
ν2 = 61
p = 164 Fail to Reject
NV T2 = 50
FNV = 20
ν1 = 2
ν2 = 44
p = 237 Fail to Reject
Y T2 = 50
FY = 21
ν1 = 2
ν2 = 61
p = 198 Fail to Reject
Robust Estimators
T2
T1 = 590
FT = 258
ν1 = 2
ν2 = 7
p = 001 Reject
BF TBF = 1312
FBF = 562
ν1 = 5
ν2 = 60
p = 001 Reject
J2 T2 = 652 ν1 = 2 CV = 133 Reject
J T2 = 652
FJ = 295
ν1 = 2
ν2 = 63
p = 001 Reject
K FK = 281 ν1 = 20
ν2 = 66
p = 001 Reject
NV T2 = 652
FNV = 279
ν1 = 2
ν2 = 60
p = 001 Reject
Y T2 = 652
FY = 283
ν1 = 2
ν2 = 66
p = 001 Reject
Note T2 = Hotellingrsquos (1931) T
2 BF = Brown amp Forsythe (1974) J2 = James (1954) second
order J = Johansen (1980) K = Kim (1992) NV = Nel amp van der Merwe (1986) Y = Yao
(1965)
Multivariate Tests of Means
24
revisited An update based on trimmed means Psychometrika 63 145-163
Kim S J (1992) A practical solution to the multivariate Behrens-Fisher problem Biometrika
79 171-176
Knapp R G amp Miller M C (1983) Monitoring simultaneously two or more indices of health
care Evaluation amp the Health Professions 6 465-482
Lix L M amp Keselman H J (1998) To trim or not to trim Tests of mean equality under
heteroscedasticity and nonnormality Educational and Psychological Measurement 58 409-
429
Mehrotra D V (1997) Improving the Brown-Forsythe solution to the generalized Behrens-
Fisher problem Communications in Statistics ndash Simulation and Computation 26 1139-1145
Nel D G amp van der Merwe C A (1986) A solution to the multivariate Behrens-Fisher
problem Communications in Statistics ndash Simulation and Computation 15 3719-3735
SAS Institute Inc (1999a) SASIML userrsquos guide Version 8 Author Cary NC
SAS Institute Inc (1999b) SASSTAT userrsquos guide Version 8 Author Cary NC
Schmidt F amp Hunter J E (1995) The impact of data-analysis methods on cumulative research
knowledge Evaluation amp the Health Professions 18 408-427
Sharmer L (2001) Evaluation of alcohol education programs on attitude knowledge and self-
reported behavior of college students Evaluation amp the Health Professions 24 336-357
Vallejo G Fidalgo A amp Fernandez P (2001) Effects of covariance heterogeneity on three
procedures for analyzing multivariate repeated measures designs Multivariate Behavioral
Research 36 1-27
Wilcox R R (1995a) ANOVA A paradigm for low power and misleading measures of effect
size Review of Educational Research 65 51-77
Multivariate Tests of Means
25
Wilcox R R (1995b) Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means The Statistician 44 213-225
Yao Y (1965) An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem Biometrika 52 139-147
Yuen K K (1974) The two-sample trimmed t for unequal population variances Biometrika 61
165-170
Zwick R (1986) Rank and normal scores alternatives to Hotellingrsquos T2 Multivariate
Behavioral Research 21 169-186
Multivariate Tests of Means
26
Appendix
Numeric Formulas for Alternatives to Hotellingrsquos (1931) T2 Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe with
the modifications to the df calculations suggested by Mehrotra (1997 see also Vallejo Fidalgo
amp Fernandez 2001) Let wj = njN and jw = 1 ndash wj Then
ν
BF
2
BF2BF T
pfF (A1)
where νBF2 = f2 ndash p + 1 TBF is given in equation 9 and
1
1
1
122
22
22
2
1122
11
1
122
12
SSSS
GG
wtrwtrn
wtrwtrn
trtrf (A2)
In equation A2 tr denotes the trace of a matrix and 22111 SSG ww The test statistic FBF is
compared to the critical value F[νBF1 νBF2] where
ν
22
2
11
22
22
2
112
22
2
1
22
1BF1
SSSSGG
GG
wtrwtrwtrwtrtrtr
trtr (A3)
and G2 = w1S1 + w2S2
James (1954) Second Order
The test statistic T2 of equation 8 is compared to the critical value 2
p (A + 2
p B) + q
where 2
p is the 1 ndash α percentile point of the χ2 distribution with p df
1
1
1
1
2
11
2
2
1-
2
2
1
1-
1
AAAA trn
trnp
A (A4)
Aj = Sjnj 21 AAA and
Multivariate Tests of Means
27
2
1
1
1
2
1
1
1
)2(
1 2
2
1-
2
1-
2
1-
2
2
1
1-
1
1-
1
1-
1
AAAAAAAAAAAA trtrn
trtrnpp
B (A5)
The constant q is based on a lengthy formula which has not been reproduced here it can be
found in equation 67 of James (1954)
Johansen (1980)
Let FJ = T2c2 where c2 = p + 2C ndash 6C(p + 1) and
2
1
1-221- 1
1
2
1
j
jj
j
trtrn
C AAAA (A6)
The test statistic FJ is compared to the critical value F[p νJ] where νJ = p(p + 2)3C
Kim (1992)
The K procedure is based on the test statistic
ν
11
21
1-T
21KK
mfcF
YYVYY (A7)
where 21
2
2121
21
21
2
21
22
2
1 2 AAAAAAAV rr
1
1
2
1 p
l
l
p
l
l
h
h
c (A8)
1
2
2
1
p
l
l
p
l
l
h
h
m (A9)
hl = (dl + 1)(dl12
+r)2 where dl is the l
th eigenvalue of 1
21AA r = | 1
21AA |1(2p)
and | | is the
determinant The test statistic FK is compared to the critical value F[m νK] where νK = f1 ndash p + 1
Multivariate Tests of Means
28
2
1j
2
21
1
1
jj b
T
nf (A10)
and 21
1-1-T
21 YYVAVYY jjb
Nel and van der Merwe (1986)
Let
ν
2
2N
NVpf
TF (A11)
where νN = f2 ndash p + 1 and
1
12
1
2222
2
j
jj
j
trtrn
trtrf AAAA (A12)
The FNV statistic is compared to the critical value F[p νN]
Yao (1965)
The statistic FY is referred to the critical value F[p νK]
ν
1
2KY
pf
TF (A13)
where f1 is given by equation A10 and νK again equals f1 ndash p + 1
Multivariate Tests of Means
29
Footnotes
1The sum of the eigenvalues of a matrix is called the trace of a matrix
2The skewness for the normal distribution is zero
Multivariate Tests of Means
30
Table 1 Multivariate Example Data Set
Group Subject Yi1 Yi2
1 1 1 51
1 2 28 48
1 3 12 49
1 4 13 51
1 5 13 52
1 6 11 47
2 1 19 46
2 2 18 48
2 3 18 50
2 4 21 50
2 5 19 45
2 6 20 46
2 7 22 48
2 8 18 49
Multivariate Tests of Means
31
Table 2 Summary Statistics for Least-Squares and Robust Estimators
Least-Squares Estimators
Robust Estimators
63
040322S
93
078741S
7490131Y 7474192Y
8492121t
Y 847219t2Y
03
1061w2S
32
30011wS
Multivariate Tests of Means
32
Table 3 Hypothesis Test Results for Multivariate Example Data Set
Procedure Test Statistic df p-valueCritical value (CV) Decision re Null
Hypothesis
Least-Squares Estimators
T2
T1 = 61
FT = 28
ν1 = 2
ν2 = 11
p = 106 Fail to Reject
BF TBF = 91
FBF = 37
ν1 = 4
ν2 = 44
p = 116 Fail to Reject
J2 T2 = 50 ν1 = 2 CV = 142 Fail to Reject
J T2 = 50
FJ = 23
ν1 = 2
ν2 = 69
p = 175 Fail to Reject
K FK = 25 ν1 = 15
ν2 = 61
p = 164 Fail to Reject
NV T2 = 50
FNV = 20
ν1 = 2
ν2 = 44
p = 237 Fail to Reject
Y T2 = 50
FY = 21
ν1 = 2
ν2 = 61
p = 198 Fail to Reject
Robust Estimators
T2
T1 = 590
FT = 258
ν1 = 2
ν2 = 7
p = 001 Reject
BF TBF = 1312
FBF = 562
ν1 = 5
ν2 = 60
p = 001 Reject
J2 T2 = 652 ν1 = 2 CV = 133 Reject
J T2 = 652
FJ = 295
ν1 = 2
ν2 = 63
p = 001 Reject
K FK = 281 ν1 = 20
ν2 = 66
p = 001 Reject
NV T2 = 652
FNV = 279
ν1 = 2
ν2 = 60
p = 001 Reject
Y T2 = 652
FY = 283
ν1 = 2
ν2 = 66
p = 001 Reject
Note T2 = Hotellingrsquos (1931) T
2 BF = Brown amp Forsythe (1974) J2 = James (1954) second
order J = Johansen (1980) K = Kim (1992) NV = Nel amp van der Merwe (1986) Y = Yao
(1965)
Multivariate Tests of Means
25
Wilcox R R (1995b) Simulation results on solutions to the multivariate Behrens-Fisher
problem via trimmed means The Statistician 44 213-225
Yao Y (1965) An approximate degrees of freedom solution to the multivariate Behrens-Fisher
problem Biometrika 52 139-147
Yuen K K (1974) The two-sample trimmed t for unequal population variances Biometrika 61
165-170
Zwick R (1986) Rank and normal scores alternatives to Hotellingrsquos T2 Multivariate
Behavioral Research 21 169-186
Multivariate Tests of Means
26
Appendix
Numeric Formulas for Alternatives to Hotelling's (1931) T^2 Test
Brown and Forsythe (1974)
The numeric formulas presented here are based on the work of Brown and Forsythe, with the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo, & Fernandez, 2001). Let $w_j = n_j/N$ and $\bar{w}_j = 1 - w_j$. Then
$$F_{BF} = \frac{\nu_{BF2}}{p f_2}\, T_{BF}, \tag{A1}$$
where $\nu_{BF2} = f_2 - p + 1$, $T_{BF}$ is given in equation 9, and
$$f_2 = \frac{\operatorname{tr}(\mathbf{G}_1^2) + [\operatorname{tr}(\mathbf{G}_1)]^2}{\displaystyle\sum_{j=1}^{2} \frac{1}{n_j - 1}\left\{\operatorname{tr}[(\bar{w}_j\mathbf{S}_j)^2] + [\operatorname{tr}(\bar{w}_j\mathbf{S}_j)]^2\right\}}. \tag{A2}$$
In equation A2, tr denotes the trace of a matrix and $\mathbf{G}_1 = \bar{w}_1\mathbf{S}_1 + \bar{w}_2\mathbf{S}_2$. The test statistic $F_{BF}$ is compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where
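[Equation A3, which gives $\nu_{BF1}$ as a ratio of trace terms in $\mathbf{G}_1$, $\mathbf{G}_2$, and the weighted $\mathbf{S}_1$ and $\mathbf{S}_2$, is not recoverable from this transcript.] (A3)
and $\mathbf{G}_2 = w_1\mathbf{S}_1 + w_2\mathbf{S}_2$.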
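The error df for the Brown and Forsythe procedure can be illustrated with the least-squares estimates from the Table 1 data. This is a minimal sketch that reads equation A2 as f2 = Q(G1) / sum_j [Q(w̄_j S_j) / (n_j - 1)], where Q(M) = tr(M^2) + [tr(M)]^2 is a shorthand introduced here (the helper `q` is not part of the article). The value T_BF = 9.1 is taken from Table 3, because equation 9 is not part of this appendix.

```python
import numpy as np

# Table 1 data; columns are (Y_i1, Y_i2).
group1 = np.array([[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]],
                  dtype=float)
group2 = np.array([[19, 46], [18, 48], [18, 50], [21, 50],
                   [19, 45], [20, 46], [22, 48], [18, 49]], dtype=float)

n = [len(group1), len(group2)]
N = sum(n)
p = group1.shape[1]
S = [np.cov(group1, rowvar=False), np.cov(group2, rowvar=False)]


def q(M):
    """Shorthand for tr(M^2) + [tr(M)]^2."""
    return np.trace(M @ M) + np.trace(M) ** 2


# Weights w_bar_j = 1 - n_j / N and the matrix G1 of equation A2.
w_bar = [1 - nj / N for nj in n]
G1 = w_bar[0] * S[0] + w_bar[1] * S[1]

f2 = q(G1) / sum(q(wb * Sj) / (nj - 1) for wb, Sj, nj in zip(w_bar, S, n))
nu_BF2 = f2 - p + 1

T_BF = 9.1  # least-squares value reported in Table 3
F_BF = nu_BF2 / (p * f2) * T_BF

print(f"f2 = {f2:.2f}, nu_BF2 = {nu_BF2:.1f}, F_BF = {F_BF:.1f}")
```

The numerator df from equation A3 is not computed here.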
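James (1954) Second Order
The test statistic $T_2$ of equation 8 is compared to the critical value $\chi^2_p(A + \chi^2_p B) + q$, where $\chi^2_p$ is the $1 - \alpha$ percentile point of the $\chi^2$ distribution with p df,
$$A = 1 + \frac{1}{2p}\sum_{j=1}^{2}\frac{1}{n_j - 1}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2, \tag{A4}$$
$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and
$$B = \frac{1}{p(p + 2)}\sum_{j=1}^{2}\frac{1}{n_j - 1}\left\{\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_j\mathbf{A}^{-1}\mathbf{A}_j) + \frac{1}{2}\left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2\right\}. \tag{A5}$$
The constant q is based on a lengthy formula, which has not been reproduced here; it can be found in equation 67 of James (1954).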
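The two coefficients of the James critical value can be computed directly from the group covariance matrices. The sketch below uses the least-squares estimates from the Table 1 data and reads equations A4 and A5 as sums over the two groups of trace terms in A^{-1}A_j, with A_j = S_j/n_j and A = A_1 + A_2. The names `A_coef` and `B_coef` are introduced here to keep the scalars apart from the matrix A. The constant q, and therefore the critical value itself, is not computed.

```python
import numpy as np

# Table 1 data; columns are (Y_i1, Y_i2).
group1 = np.array([[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]],
                  dtype=float)
group2 = np.array([[19, 46], [18, 48], [18, 50], [21, 50],
                   [19, 45], [20, 46], [22, 48], [18, 49]], dtype=float)

n = [len(group1), len(group2)]
p = group1.shape[1]

# A_j = S_j / n_j and A = A_1 + A_2.
A_j = [np.cov(g, rowvar=False) / nj for g, nj in zip((group1, group2), n)]
A = A_j[0] + A_j[1]
A_inv = np.linalg.inv(A)
P = [A_inv @ Aj for Aj in A_j]  # the products A^{-1} A_j

# Equation A4: scalar multiplier of the chi-square percentile point.
A_coef = 1 + sum(np.trace(Pj) ** 2 / (nj - 1) for Pj, nj in zip(P, n)) / (2 * p)

# Equation A5: scalar multiplier of the squared percentile point.
B_coef = sum((np.trace(Pj @ Pj) + 0.5 * np.trace(Pj) ** 2) / (nj - 1)
             for Pj, nj in zip(P, n)) / (p * (p + 2))

print(f"A = {A_coef:.4f}, B = {B_coef:.4f}")
```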
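Johansen (1980)
Let $F_J = T_2/c_2$, where $c_2 = p + 2C - 6C/(p + 2)$ and
$$C = \frac{1}{2}\sum_{j=1}^{2}\frac{1}{n_j - 1}\left\{\operatorname{tr}[(\mathbf{A}^{-1}\mathbf{A}_j)^2] + \left[\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_j)\right]^2\right\}. \tag{A6}$$
The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.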
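The Johansen quantities can be traced through numerically with the least-squares estimates from the Table 1 data. This sketch assumes T_2 is the quadratic form in (A_1 + A_2)^{-1}, with A_j = S_j/n_j, and uses the scaling constant c2 = p + 2C - 6C/(p + 2), which is Johansen's (1980) constant for two groups and the form that reproduces F_J = 2.3 and ν_2 = 6.9 in Table 3. Variable names are illustrative.

```python
import numpy as np

# Table 1 data; columns are (Y_i1, Y_i2).
group1 = np.array([[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]],
                  dtype=float)
group2 = np.array([[19, 46], [18, 48], [18, 50], [21, 50],
                   [19, 45], [20, 46], [22, 48], [18, 49]], dtype=float)

n = [len(group1), len(group2)]
p = group1.shape[1]
diff = group1.mean(axis=0) - group2.mean(axis=0)

A_j = [np.cov(g, rowvar=False) / nj for g, nj in zip((group1, group2), n)]
A = A_j[0] + A_j[1]
A_inv = np.linalg.inv(A)

# Heteroscedastic statistic: no pooling of the covariance matrices.
T2 = diff @ A_inv @ diff

# Equation A6.
C = 0.5 * sum((np.trace(A_inv @ Aj @ A_inv @ Aj) + np.trace(A_inv @ Aj) ** 2)
              / (nj - 1) for Aj, nj in zip(A_j, n))

c2 = p + 2 * C - 6 * C / (p + 2)
F_J = T2 / c2
nu_J = p * (p + 2) / (3 * C)

print(f"T2 = {T2:.1f}, C = {C:.3f}, F_J = {F_J:.1f}, nu_J = {nu_J:.1f}")
```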
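Kim (1992)
The K procedure is based on the test statistic
$$F_K = \frac{\nu_K}{c\, m\, f_1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\,\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2), \tag{A7}$$
where $\mathbf{V} = \mathbf{A}_1 + r^2\mathbf{A}_2 + 2r\,\mathbf{A}_2^{1/2}\left(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2}\right)^{1/2}\mathbf{A}_2^{1/2}$,
$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8}$$
$$m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9}$$
$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the lth eigenvalue of $\mathbf{A}_1\mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1\mathbf{A}_2^{-1}|^{1/(2p)}$, and | | is the determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$, where $\nu_K = f_1 - p + 1$,
$$\frac{1}{f_1} = \sum_{j=1}^{2}\frac{1}{n_j - 1}\left(\frac{b_j}{T_2}\right)^2, \tag{A10}$$
and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{T}\,\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.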
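The pieces of Kim's statistic can be assembled as follows for the least-squares estimates from the Table 1 data. This is a sketch under stated assumptions: V is taken as A_1 + r^2 A_2 + 2r A_2^{1/2}(A_2^{-1/2} A_1 A_2^{-1/2})^{1/2} A_2^{1/2}, matrix square roots are symmetric roots obtained by eigendecomposition (the helper `sym_sqrt` is introduced here), and f_1 = 7.1 is backed out of the ν_2 = 6.1 entry in Table 3 rather than computed from equation A10.

```python
import numpy as np

# Table 1 data; columns are (Y_i1, Y_i2).
group1 = np.array([[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]],
                  dtype=float)
group2 = np.array([[19, 46], [18, 48], [18, 50], [21, 50],
                   [19, 45], [20, 46], [22, 48], [18, 49]], dtype=float)

p = group1.shape[1]
diff = group1.mean(axis=0) - group2.mean(axis=0)
A1 = np.cov(group1, rowvar=False) / len(group1)
A2 = np.cov(group2, rowvar=False) / len(group2)


def sym_sqrt(M):
    """Symmetric square root of a symmetric positive definite matrix."""
    values, vectors = np.linalg.eigh(M)
    return (vectors * np.sqrt(values)) @ vectors.T


A2_half = sym_sqrt(A2)
A2_half_inv = np.linalg.inv(A2_half)
M = A2_half_inv @ A1 @ A2_half_inv  # similar to A1 A2^{-1}: same eigenvalues

d = np.linalg.eigvalsh(M)
r = np.linalg.det(A1 @ np.linalg.inv(A2)) ** (1 / (2 * p))
h = (d + 1) / (np.sqrt(d) + r) ** 2

c = np.sum(h ** 2) / np.sum(h)        # equation A8
m = np.sum(h) ** 2 / np.sum(h ** 2)   # equation A9

V = A1 + r ** 2 * A2 + 2 * r * A2_half @ sym_sqrt(M) @ A2_half
quad_form = diff @ np.linalg.solve(V, diff)

nu_K = 6.1            # reported in Table 3
f1 = nu_K + p - 1     # from nu_K = f1 - p + 1
F_K = nu_K / (c * m * f1) * quad_form  # equation A7

print(f"c = {c:.3f}, m = {m:.2f}, F_K = {F_K:.1f}")
```

With p = 2, m lies between 1 and 2; here it is about 1.5, the numerator df shown for K in Table 3.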
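Nel and van der Merwe (1986)
Let
$$F_{NV} = \frac{\nu_N}{p f_2}\, T_2, \tag{A11}$$
where $\nu_N = f_2 - p + 1$ and
$$f_2 = \frac{\operatorname{tr}(\mathbf{A}^2) + [\operatorname{tr}(\mathbf{A})]^2}{\displaystyle\sum_{j=1}^{2}\frac{1}{n_j - 1}\left\{\operatorname{tr}(\mathbf{A}_j^2) + [\operatorname{tr}(\mathbf{A}_j)]^2\right\}}. \tag{A12}$$
The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.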
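The Nel and van der Merwe calculation is short enough to follow end to end. The sketch below applies equations A11 and A12 to the least-squares estimates from the Table 1 data, assuming T_2 is the quadratic form in (A_1 + A_2)^{-1} with A_j = S_j/n_j. The helper `q` and the variable names are illustrative.

```python
import numpy as np

# Table 1 data; columns are (Y_i1, Y_i2).
group1 = np.array([[1, 51], [28, 48], [12, 49], [13, 51], [13, 52], [11, 47]],
                  dtype=float)
group2 = np.array([[19, 46], [18, 48], [18, 50], [21, 50],
                   [19, 45], [20, 46], [22, 48], [18, 49]], dtype=float)

n = [len(group1), len(group2)]
p = group1.shape[1]
diff = group1.mean(axis=0) - group2.mean(axis=0)

A_j = [np.cov(g, rowvar=False) / nj for g, nj in zip((group1, group2), n)]
A = A_j[0] + A_j[1]


def q(M):
    """Shorthand for tr(M^2) + [tr(M)]^2."""
    return np.trace(M @ M) + np.trace(M) ** 2


T2 = diff @ np.linalg.solve(A, diff)

f2 = q(A) / sum(q(Aj) / (nj - 1) for Aj, nj in zip(A_j, n))  # equation A12
nu_N = f2 - p + 1
F_NV = nu_N / (p * f2) * T2                                  # equation A11

print(f"T2 = {T2:.1f}, f2 = {f2:.2f}, nu_N = {nu_N:.1f}, F_NV = {F_NV:.1f}")
```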
Yao (1965)
The statistic $F_Y$ is referred to the critical value $F[p, \nu_K]$,
$$F_Y = \frac{\nu_K}{p f_1}\, T_2, \tag{A13}$$
where $f_1$ is given by equation A10 and $\nu_K$ again equals $f_1 - p + 1$.
Footnotes
1. The sum of the eigenvalues of a matrix is called the trace of the matrix.
2. The skewness for the normal distribution is zero.