Page 1: Applied Multivariate Data Analysis (Everitt/Applied Multivariate Data Analysis) || Multivariate Data and Multivariate Statistics

Applied Multivariate Data Analysis, Brian S. Everitt and Graham Dunn © 2001 Brian S. Everitt and Graham Dunn. Published 2001 by Brian S. Everitt and Graham Dunn

1 Multivariate data and multivariate statistics

1.1 Introduction

The methods used in the systematic pursuit of knowledge are very similar in all branches of science. They involve the recognition and formulation of problems, the collection of relevant empirical data through either passive observation or experimental intervention, and often the use of mathematical or statistical analysis to explore relationships in the data or to test specific hypotheses about the observations. In some areas of activity, however, such as the social, behavioural and biological sciences, there are special problems that either do not exist or at least are less common in others. For example, the complexity and ambiguity of some aspects of human behaviour create major difficulties for the psychologist when trying to draw reliable and valid inferences. These difficulties are usually not a problem for his or her counterpart working in a physics or chemistry laboratory. Consequently it has long been recognized by the social and behavioural scientists that they will generally need to employ relatively more sophisticated and complex analytical tools to investigate their data. In some cases these tools have been supplied by statisticians; in others they have been developed first by the subject-matter specialists themselves. A classic example of the latter is factor analysis, the basic concepts of which originated with the work of Spearman as early as 1904 and were later extended by Thurstone, Burt, Thompson and others in the 1920s and 1930s. It was only sometime after this that a statistician, Lawley (1941), considered the problem in a more formal way and applied the maximum likelihood approach to the estimation of the parameters of the factor analysis model, and it was only during the 1960s and 1970s that suitable algorithms were developed to find such estimates.

Routine use of many of the methods of multivariate statistical analysis had to await the dramatic revolution in data analysis brought about by the development and increasing availability of the electronic computer and associated software packages. Many of the analyses described in this text are trivially easy for the non-statistician to carry out on a personal computer once he or she has mastered the use of the required software package. Knowing what is being done (and why), however, is another matter. It is important to know which methods might be useful and which are almost certainly inappropriate. It is equally important that users of software packages fully understand the output that they produce and are able to draw valid inferences from this output. The naive use of software packages by mathematically and statistically unsophisticated researchers has its obvious pitfalls and can, and probably often does, lead to practically worthless results and misleading conclusions. It is clearly impractical and not even sensible, however, to insist that such analyses should only be carried out by qualified statisticians (and even statisticians make mistakes!). It is therefore important that both the user and the consumer of these methodologies understand enough of their characteristics (both their advantages and their pitfalls) to be able to make an informed choice about which method to use and also to be able to critically appraise their own results and those of others. This is one of the main goals of the present text.

1.2 Types of data

The data with which we are primarily concerned consist of a series of measurements or observations made on a number of subjects, patients, objects or other entities of interest. They might comprise the results of applying a battery of cognitive tests to a sample of patients with Alzheimer's disease, the taxonomic characteristics of bacteria or the relative proportions of several constituents of different types of rock (or food), for example. One special type of multivariate data set involves the collection of repeated measures of the same characteristic over time. And in a situation that might be termed doubly multivariate we might indeed have a multivariate set of characteristics that is assessed at each of several time points.

A typical multivariate data matrix, X, will have the form

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix},$$

where the typical element, $x_{ij}$, is the value of the jth variable for the ith individual. If there are several distinct groups of individuals, one of the $x_{ij}$s might be a categorical variable with values 1, 2, etc. to distinguish these groups. The number of individuals under investigation is n and the number of observations taken on each of these n individuals is p. Table 1.1 gives a hypothetical example of such a multivariate data matrix. Here n = 10, p = 7 and, for example, $x_{34} = 135$.
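To make the indexing concrete, here is a minimal sketch (with invented values, not the data of Table 1.1) of holding such an n x p data matrix as a NumPy array:

```python
# A sketch of the n x p data matrix X described above, using NumPy.
# The values are invented for illustration; rows are individuals,
# columns are variables.
import numpy as np

n, p = 10, 7
rng = np.random.default_rng(0)
X = rng.integers(1, 200, size=(n, p))  # hypothetical measurements

# The typical element x_ij is the value of the jth variable for the
# ith individual.  NumPy indexes from 0, so the text's 1-based x_34
# corresponds to X[2, 3] here.
x_34 = X[2, 3]
print(X.shape)  # (10, 7)
```

The same row/column convention (individuals as rows, variables as columns) is assumed by most statistical software.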

Table 1.1 Data matrix for a hypothetical example of 10 individuals

Individual  Sex     Age (yrs)  IQ   Depression  Health     Weight (lbs)
 1          Male    21         120  Yes         Very good  150
 2          Male    43         NK   No          Very good  160
 3          Male    22         135  No          Average    135
 4          Male    86         150  No          Very poor  140
 5          Male    60         92   Yes         Good       110
 6          Female  16         130  Yes         Good       110
 7          Female  NK         150  Yes         Very good  120
 8          Female  43         NK   Yes         Average    120
 9          Female  22         84   No          Average    105
10          Female  80         70   No          Good       100

Note: NK = not known

In many cases, as in Table 1.1, the variables measured on each of the individuals will be of different types, depending on whether they are conveying quantitative or merely qualitative information. The most common way of distinguishing these types is the following:

• Nominal - unordered categorical variables. Examples include treatment allocation, the sex of the respondent, hair colour, presence or absence of depression, and so on.

• Ordinal - where there is an ordering but no implication of distance between the different points of the scale. Examples include social class and self-perception of health (each coded from I to V, say), and educational level (no schooling, primary, secondary or tertiary education).

• Interval - where there are equal differences between successive points on the scale, but the position of zero is arbitrary. The classic example is the measurement of temperature using the Celsius or Fahrenheit scales. In some cases a variable such as a measure of depression, anxiety or intelligence, for example, might be treated as if it were interval-scaled when this, in fact, might be difficult to justify. We take a fairly pragmatic approach to such problems and frequently treat these variables as interval-scaled measures - but readers should always question whether this might be a sensible thing to do and what implications a wrong decision might have.

• Ratio - the highest level of measurement, where one can investigate the relative magnitude of scores as well as the differences between them. The position of zero is fixed. The classic example is the absolute measure of temperature (in kelvin, for example) but other common ones include age (or any other time from a fixed event), weight and length.

The qualitative information in Table 1.1 could have been presented in terms of numerical codes (as often would be the case in a multivariate data set) such that Sex = 1 for males and Sex = 2 for females, for example, or Health = 5 when very good and Health = 1 for very poor, and so on. But it is vital that both the user and consumer of these data appreciate that the same numerical codes (1, say) will convey completely different information, depending on the scale of measurement.
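As a small sketch of this point (the particular code mappings below are our own, not the book's), the same integer code supports different operations depending on the scale of measurement:

```python
# Illustrative numerical codings for qualitative variables of the kind
# shown in Table 1.1.  The mappings themselves are invented.
sex_codes = {'Male': 1, 'Female': 2}        # nominal: arbitrary labels
health_codes = {'Very poor': 1,             # ordinal: order is
                'Poor': 2,                  # meaningful, but distances
                'Average': 3,               # between codes are not
                'Good': 4,
                'Very good': 5}

# For a nominal variable, comparing code magnitudes is meaningless:
# Female (2) is not "greater than" Male (1) in any substantive sense.
# For an ordinal variable, order comparisons are legitimate:
better = health_codes['Good'] > health_codes['Average']  # True
```

Averaging ordinal codes (treating them as interval-scaled) is the pragmatic step discussed above, and it is exactly the step that should be questioned case by case.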


A further feature of Table 1.1 is that it contains missing values (NK). Age has not been recorded for individual number 7, and no IQ value is available for individuals 2 and 8. Missing observations arise for a variety of reasons, and it is important to put some effort into discovering why an observation is missing. One explanation is that such an observation might not be applicable to that individual. In a taxonomic study, for example, in which the investigator might wish to classify dinosaur fossils, 'wing length' might be an important variable. Clearly dinosaurs without wings will have missing values for this variable! In other cases the measurement might be missing by accident or because the respondent either forgot or refused to provide the information. Occasionally, one might be able to obtain the information from elsewhere or to repeat the measurement and then replace the missing value with useful information.

Missing values can cause problems for many of the methods of analysis described in this text, particularly if there are a lot of them. Although there are many ways of dealing with missing-data problems (both valid and invalid!), these are, in general, beyond the scope of this text. One method with fairly universal applicability, however, is to impute ('estimate') the missing values from a knowledge of the data that are not missing. Such imputation methods range from the very simple (replace the missing value with the mean of the values from subjects with non-missing data, for example) to the technically complex (multiple imputation acknowledging the stochastic nature of the data) and are briefly described in Appendix B. However, one should always bear in mind that the imputed values are not real measurements. We do not get something for nothing! And if there is a substantial proportion of the individuals with large amounts of missing data one should clearly question whether any form of statistical analysis is worth the bother.
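The simplest imputation method mentioned above, mean imputation, can be sketched in a few lines (the data values are invented, loosely echoing the Age and IQ columns of Table 1.1). This is shown only to fix ideas; it ignores the uncertainty that multiple imputation tries to capture.

```python
# Mean imputation: replace each missing value (NaN) by the mean of the
# observed values of the same variable.  Data are invented.
import numpy as np

X = np.array([[21., 120.],
              [43., np.nan],   # IQ not known
              [np.nan, 150.],  # age not known
              [22., 84.]])

col_means = np.nanmean(X, axis=0)   # means over the observed values only
missing = np.isnan(X)
X_imputed = np.where(missing, col_means[np.newaxis, :], X)
```

Note that the imputed entries are not real measurements, so any subsequent variance estimates computed from `X_imputed` will tend to be too small.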

1.3 Basic multivariate statistics

Readers will be familiar with the production of simple descriptive statistics from univariate data. These include sample proportions, means and standard deviations (variances), for example. In the case of pairs of measurements, readers are likely also to be familiar with bivariate correlations (such as Pearson's product-moment correlation) and, perhaps, the corresponding covariances. When we move on to consider inferential statistics (estimation and hypothesis testing) we also have to clearly distinguish, say, the value of an unknown parameter (the population mean, for example) from a statistic obtained from a sample of individuals. In this section we briefly introduce the common multivariate equivalents of the familiar univariate and bivariate summaries.

For the time being we will restrict our discussion to quantitative measurements (other situations will be dealt with when they arise in the later chapters). In order to summarize a multivariate data set we need to produce summaries for each of the variables separately and also summarize the relationships between them. In the latter case, we usually take pairs of variables at a time and look at their covariance or correlation. The quantities of interest are defined below.

1.3.1 Mean

For p variables, the population mean vector is usually represented as $\boldsymbol{\mu}' = [\mu_1, \mu_2, \ldots, \mu_p]$, where

$$\mu_i = E(x_i). \qquad (1.1)$$

An estimate of $\boldsymbol{\mu}'$, based on n p-dimensional observations, is $\bar{\mathbf{x}}' = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p]$, where $\bar{x}_i$ is the sample mean of the variable $x_i$.
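As a quick numerical sketch (with invented data), the sample mean vector is just the vector of column means of the data matrix:

```python
# The sample mean vector: column-wise means of the data matrix X,
# estimating the population mean vector mu.  Values are invented.
import numpy as np

X = np.array([[1., 2., 3.],
              [3., 4., 5.],
              [5., 6., 7.]])  # n = 3 individuals, p = 3 variables

xbar = X.mean(axis=0)        # sample mean of each variable
```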

1.3.2 Variance

The vector of population variances can be represented by $\boldsymbol{\sigma}' = [\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2]$, where

$$\sigma_i^2 = E(x_i - \mu_i)^2. \qquad (1.2)$$

An estimate of $\boldsymbol{\sigma}'$, based on n p-dimensional observations, is $\mathbf{s}' = [s_1^2, s_2^2, \ldots, s_p^2]$, where $s_i^2$ is the sample variance of $x_i$.

1.3.3 Covariance

The covariance of two variables, $x_i$ and $x_j$, is defined by

$$\operatorname{Cov}(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)]. \qquad (1.3)$$

If i = j, we note that the covariance of the variable with itself is simply its variance, and therefore there is no need to define variances and covariances independently in the multivariate case.

The covariance of $x_i$ and $x_j$ is usually denoted by $\sigma_{ij}$ (so the variance of the variable $x_i$ is often denoted by $\sigma_{ii}$ rather than $\sigma_i^2$).

With p variables, $x_1, x_2, \ldots, x_p$, there are p variances and $p(p-1)/2$ covariances. In general these quantities are arranged in a $p \times p$ symmetric matrix, $\boldsymbol{\Sigma}$, where

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix};$$

note that $\sigma_{ij} = \sigma_{ji}$. This matrix is generally known as the variance-covariance matrix or simply the covariance matrix. The matrix $\boldsymbol{\Sigma}$ is estimated by the matrix $\mathbf{S}$, given by

$$\mathbf{S} = \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'/(n-1), \qquad (1.4)$$

where $\mathbf{x}_i' = [x_{i1}, x_{i2}, \ldots, x_{ip}]$ is the vector of observations for the ith individual.
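Equation (1.4) can be computed directly and checked against NumPy's built-in estimator; the data below are invented for illustration.

```python
# Computing S from equation (1.4) and comparing with np.cov.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))   # n = 20 observations on p = 3 variables
n = X.shape[0]
xbar = X.mean(axis=0)

# S = sum_i (x_i - xbar)(x_i - xbar)' / (n - 1), vectorized as D'D/(n-1)
D = X - xbar
S = D.T @ D / (n - 1)

# np.cov expects variables in rows (hence the transpose); ddof=1 gives
# the same n - 1 divisor as equation (1.4).
same = np.allclose(S, np.cov(X.T, ddof=1))
```

The symmetry $\sigma_{ij} = \sigma_{ji}$ noted above carries over to the estimate: `S` equals its own transpose.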


1.3.4 Correlation

The covariance is often difficult to interpret because it depends on the units in which the two variables are measured; consequently, it is often standardized by dividing by the product of the standard deviations of the two variables to give a quantity called the correlation coefficient, $\rho_{ij}$, where

$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}. \qquad (1.5)$$

The correlation coefficient lies between -1 and +1 and gives a measure of the linear relationship of the variables $x_i$ and $x_j$. It is positive if high values of $x_i$ are associated with high values of $x_j$ and negative if high values of $x_i$ are associated with low values of $x_j$.

With p variables there are $p(p-1)/2$ distinct correlations, which may be arranged in a $p \times p$ matrix, $\mathbf{R}$, whose diagonal elements are unity. This matrix may be written in terms of the covariance matrix, $\boldsymbol{\Sigma}$, as follows:

$$\mathbf{R} = \mathbf{D}^{-1/2}\boldsymbol{\Sigma}\mathbf{D}^{-1/2}, \qquad (1.6)$$

where $\mathbf{D}^{-1/2} = \operatorname{diag}(1/\sqrt{\sigma_{ii}})$. In most situations we will be dealing with covariance and correlation matrices of full rank, p, so that both matrices will be non-singular (i.e. invertible).
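The standardization of equation (1.6) can be applied to a sample covariance matrix and checked against NumPy's correlation estimator; the data are invented.

```python
# The correlation matrix R = D^{-1/2} S D^{-1/2} of equation (1.6),
# applied to a sample covariance matrix S.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
S = np.cov(X.T)                                   # sample covariance matrix

d_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))   # D^{-1/2}
R = d_inv_sqrt @ S @ d_inv_sqrt

matches = np.allclose(R, np.corrcoef(X.T))        # same as NumPy's estimator
```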

1.3.5 Linear combinations of variables

Many of the methods of analysis to be described in this text involve linear combinations of the original variables, $x_1, x_2, \ldots, x_p$; that is, a variable constructed thus:

$$y = a_1 x_1 + a_2 x_2 + \cdots + a_p x_p, \qquad (1.7)$$

where $a_1, a_2, \ldots, a_p$ are a set of scalars. This can be written more simply as

$$y = \mathbf{a}'\mathbf{x}, \qquad (1.8)$$

where $\mathbf{a}' = [a_1, a_2, \ldots, a_p]$. The variable y has a mean given by

$$E(y) = \mathbf{a}'E(\mathbf{x}) = \mathbf{a}'\boldsymbol{\mu}, \qquad (1.9)$$

and variance

$$V(y) = E\{[\mathbf{a}'(\mathbf{x} - \boldsymbol{\mu})]^2\}. \qquad (1.10)$$

A little algebra shows that this can be written as

$$V(y) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}. \qquad (1.11)$$
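The identities $E(y) = \mathbf{a}'\boldsymbol{\mu}$ and $V(y) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$ hold exactly for the sample analogues $\bar{\mathbf{x}}$ and $\mathbf{S}$, which the following sketch (with invented data and coefficients) verifies numerically.

```python
# Mean and variance of a linear combination y = a'x, the sample
# versions of equations (1.9) and (1.11).
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
a = np.array([1.0, -2.0, 0.5])

y = X @ a                        # y_i = a'x_i for every individual

xbar = X.mean(axis=0)
S = np.cov(X.T)

mean_ok = np.allclose(y.mean(), a @ xbar)        # E(y) = a'mu
var_ok = np.allclose(y.var(ddof=1), a @ S @ a)   # V(y) = a' Sigma a
```

This is the calculation underlying, for example, the variance-maximizing choice of coefficients in principal components analysis (Chapter 3).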

1.4 The aims of multivariate analysis

It is often suggested that it is helpful to recognize that the analysis of data involves two separate stages. The first, particularly in new areas of research, involves data exploration in an attempt to recognize any non-random pattern or structure requiring explanation. At this stage, finding the question is often of more interest than seeking the subsequent answer, the aim of this part of the analysis being to generate possible interesting hypotheses for further study. This activity is now often described as data mining. Here, formal models designed to yield specific answers to rigidly defined questions are not required. Instead, methods are sought which allow possibly unanticipated patterns in the data to be detected, opening up a wide range of competing explanations. Such techniques are generally characterized by their emphasis on the importance of visual displays and graphical representations and by the lack of any associated stochastic model, so that questions of the statistical significance of the results are hardly ever of importance.

A confirmatory analysis becomes possible once a research worker has some well-defined hypothesis in mind. It is here that some type of statistical significance test might be considered. Such tests are well known and, although their misuse has often brought them into some disrepute, they remain of considerable importance.

In this text Chapters 2-6 and Chapter 12 describe techniques which are primarily exploratory, and Chapters 7-11 and Chapter 13 techniques which are largely confirmatory, but this division should not be regarded as much more than a convenient arrangement of the material to be presented, since any sensible investigator will realize the need for both exploratory and confirmatory techniques, and many methods will often be useful in both roles. Perhaps attempts to rigidly divide data analysis into exploratory and confirmatory parts have been misplaced, and what is really important is that research workers should have a flexible and pragmatic approach to the analysis of their data, with sufficient expertise to enable them to choose the appropriate analytical tool and use it correctly. The choice of tool, of course, depends on the aims or purpose of the analysis.

There are many reasons why we might wish to analyse a multivariate data set using multivariate methods (rather than looking at each of the variables separately using the familiar univariate methods), and we will not try to be exhaustive. Essentially, we will be searching for structure or pattern in the data, which enriches our description of what we think led to the observations or enables us to simplify our description of them. These patterns might be reflected by the correlations between various variables or the similarity of subjects as reflected by their multivariate profiles. Here the word 'profile' is synonymous with the vector, x, of the observations. In one situation we are looking for patterns of similarity between the columns of the data matrix, X, and in the other we are looking at similarities between the rows. In some situations we may be interested in both.

One obvious source of pattern arises from the fact that we may have measurements on similar groups of subjects. We may not, a priori, know what these groups are, how many there are, or which subject belongs to which group, but we use the data to explore the possibilities. This is often called unsupervised pattern recognition. Typically we might use methods such as ordination (including principal components analysis - Chapter 3 - and multidimensional scaling - Chapter 5) or cluster analysis (Chapter 6), or several different methods, to carry out such a search for group structure.

An alternative motivation might be to find sets of variables that appear to be similar (highly correlated) and postulate that the similarities arise from the fact that they appear to be indicators of the same underlying latent variable(s) or construct(s). Here, again, we might use ordination or cluster analysis, but we are more likely to use some form of exploratory factor analysis (Chapter 12). If we were able to postulate which variables were indicators of what latent variables before carrying out any analysis we might wish to test whether the data are consistent with such a measurement model. Here we might use some form of confirmatory factor analysis (Chapter 13).

If we were able to define known groups (the two sexes, for example, or the two arms of an experiment) we might then wish to know how the multivariate profiles might discriminate between them. This might involve the use of some sort of discriminant analysis (Chapter 11) or generalized linear model (Chapters 7-9). This is often referred to as supervised pattern recognition.

Perhaps one of the most difficult and interesting areas is the exploration of patterns of association between sets of multivariate measures in order to infer (or, at least, postulate) causal pathways. This has its origins in the use of multiple regression and path analysis (Chapter 8) but is one of the main roles for covariance structure modelling or structural equation modelling (Chapter 13).

Most of this text is written from the point of view that there are no rules or laws of scientific inference - that 'anything goes' (Feyerabend, 1975). This implies that we see both exploratory and confirmatory methods as two sides of the same coin. We see both methods as essentially tools for data exploration rather than as formal decision-making procedures. For this reason we do not stress the values of significance levels, but merely use them as criteria to guide a modelling process (using the term 'modelling' as a method or methods of describing the structure of a data set). We believe that in scientific research it is the skilful interpretation of evidence and subsequent development of hunches that are important, rather than a rigid adherence to a formal set of decision rules associated with significance tests (or any other criteria, for that matter). One aspect of the scientific method, however, which we do not discuss in any detail, but which is the vital component in testing the theories that come out of our data analyses, is replication. It is clearly unsafe to search for pattern in a given data set and to 'confirm' the existence of such a pattern using the same data set. We need to validate our conclusions using further data. At this point our subsequent analysis might become truly confirmatory.