

Tutorial

Chemometrics and Intelligent Laboratory Systems, 2 (1987) 29-36. Elsevier Science Publishers B.V., Amsterdam. Printed in The Netherlands.

Multivariate Data Analysis: Its Methods *

MICHEL MELLINGER

Saskatchewan Research Council, 15 Innovation Blvd., Saskatoon, Saskatchewan S7N 2X8 (Canada)

* SRC Publication No. R-851-1-A-87.

CONTENTS

1 Introduction
  1.1 Data analysis
  1.2 Multivariate data analysis
2 Multivariate data analysis methods
  2.1 Factor analysis methods
  2.2 Classification methods
  2.3 Comments on vocabulary and usage
3 Acknowledgements
References

1 INTRODUCTION

1.1 Data analysis

In this paper, multivariate data analysis methods are presented from the point of view of users asking themselves the following questions: what do multivariate data analysis techniques do? What kind of method is this author using? Multivariate data analysis methods are powerful tools for investigating large and complex data sets, such as those generated today at a rapid rate in many domains. Unfortunately, many users cannot apply these tools to their advantage, because they face great difficulty in first understanding the methods and then applying them to their own data. It is hoped that this paper will help clarify the often-confusing field of multivariate data analysis.

The process of data analysis may be explained by considering the three following concepts: facts, data, and information. A fact is 'something that has actual existence', such as a rock formation, a drill core sample, or a human being; facts make up reality. Data (singular: datum) are 'something given, some measurements used as a basis for reasoning, discussion, or calculation', such as the thickness and porosity of a rock formation, the mineralogical composition of a rock sample, or the age, sex, and weight of a human being. Information is 'knowledge obtained from the investigation or study of facts and data'; it is 'something that justifies change, that is a prerequisite to decision-making', such as the oil-bearing potential of a rock formation, the metamorphic grade of a rock, or the health of a human being.

Fig. 1. Fundamental concepts related to data analysis, and their relationships.

The relationships between these three concepts are illustrated in Fig. 1. Reality can be accessed neither directly nor fully, and one must carry out measurements in order to derive data from facts. Next, data must be investigated before information is obtained ('extracted') from them: this is the process of data analysis. Finally, decisions are made on the basis of the information obtained: decision-making can thus be seen as the ultimate purpose of data analysis.

Data analysis is carried out using various methods and techniques which have either a fairly wide or a narrow applicability, and are based either on few or on many assumptions about the mathematical properties of the data under study. For example, one field of statistics, inferential statistics, is concerned with building specific statistical models for the purpose of extrapolating (predicting) unknown properties of a population from known properties of a sample which represents this population. Another major field of statistics is descriptive statistics (see ref. 1), concerned with the detection and interpretation of data patterns, usually within large data sets, not only for the purpose of data reduction, but also for the purpose of analyzing, verifying, testing, and proving hypotheses. As stated by Benzécri [2]: "A model must be derived from the data, not the opposite. ... What is required is an accurate method which permits us to extract structures from data." Multivariate data analysis methods, reviewed in the next section, have given much power to the field of descriptive statistics, for the reasons described below.

1.2 Multivariate data analysis

For a long time, statisticians have been concerned mainly with the availability of numerous observations for a limited number of variables, aiming at validating a given dependency model or at testing specific hypotheses. The major obstacle to studying many variables simultaneously had of course been the practical limitations of computational procedures. With the advent of the electronic computer, however, this computational obstacle has disappeared, with the result that the implementation of multivariate data analysis has progressed very rapidly. It has become possible to investigate the "variable" dimension, that is, the complex relationships between many variables considered together. This is not to say that inferential statistics has lost its usefulness; but many situations ('systems') which could not be investigated successfully with inferential methods can now be studied using descriptive multivariate methods. In some fields of study, such as physics, the reductionist approach, which limits the representation of a system to some of its presumed components and then studies each component separately, is valid and produces very useful results. In other fields of study, however, such as economics or geology, one deals with a complex multi-dimensional system which must be studied as a whole, and whose components do not make much sense when they are isolated from each other; the multivariate extraction of information in such cases becomes necessary.

One intriguing aspect of the multivariate nature of some systems is that of the a priori dimensionality of the system: how many variables should be measured on a given system in order to describe its properties or behaviour adequately? There is no practical answer to that question: the dimensionality of a problem is usually limited by our limited capability in selecting and measuring all variables relevant to the description of the system, and by the somewhat arbitrary nature of this 'relevancy'. Then, can a variable that was not measured, an absent variable, still influence the data patterns obtained from those variables that were measured? It may, and this will depend upon whether the absent variable is correlated (in a generic sense) with the measured variables: if it is, then the observed data patterns will contain features that cannot be explained fully without knowledge of the absent variable; if it is not correlated, then the data patterns will not contain features resulting from that absent variable. This fact is sometimes unfortunate, when absent variables could, and should, have been measured in order to increase the information contained in the data set; but it is also fortunate, when the absent variables are precisely those whose influence we are trying to discover through the study of the data patterns.

2 MULTIVARIATE DATA ANALYSIS METHODS

Many multivariate data analysis methods exist, and users who are approaching the field of multivariate statistics commonly find it quite confusing, faced as they are with a prolific vocabulary which developed in a generally uncoordinated manner. Further comments about the vocabulary and usage of multivariate statistics will be given in Section 2.3; let us first try to bring some order to the apparent multitude of methods we must deal with.

Multivariate data analysis methods fall into only two broad categories: factor analysis methods and classification methods (Fig. 2). The purpose of methods from the former category is to calculate new variables from the initial variables, and their approach is geometrical in nature; often (but not necessarily), these new variables are used to examine the data using some kind of geometrical projection or representation. The purpose of classification methods, on the other hand, is to allocate variables (or cases) to classes in order to obtain generally homogeneous subgroups of variables (or cases), and their approach is essentially algebraic in nature. Each category of methods will now be examined in more detail.

Fig. 2. Multivariate data analysis: categories of methods.

2.1 Factor analysis methods

Factor analysis methods calculate, from the initial variables, new variables called factors, which are linear combinations of the initial variables.

Why calculate factors? The initial data table is somewhat redundant, because it contains various correlations (in a generic sense) between rows and between columns. This results from the fact that each of the measured variables may not alone explain a particular aspect of the phenomenon we are trying to investigate; it may also be that, among all the measurements that were made, several cover the same range of values for some of the variables. Factors are calculated in such a way that they take into account the correlations (in a generic sense) present in the data table, and that they are uncorrelated (mathematically speaking: orthogonal to one another). In this way, data structures become apparent in the 'factor space' (as opposed to the 'data space'); these can be interpreted more readily by the user, because such data patterns are usually more directly related to the phenomena under study than the somewhat redundant measured variables.

How are factors calculated? Firstly, the data table is transformed in a certain fashion to produce a matrix. Secondly, the eigenvalues and eigenvectors of this matrix are calculated following standard numerical procedures; each eigenvalue and its related eigenvector define a factor, and each eigenvalue measures the amount of variability in the data (in a generic sense) that is accounted for by the factor. Thirdly, the data table is transformed into the calculated factor space. The key step in this procedure is the first step above, where the data table is transformed into a matrix assumed to contain enough information about the initial data that the factors calculated from it help describe the data patterns. The reason for the variety of available factor analysis methods is found in the many variations used in this data-to-matrix transformation.

For principal components analysis [1,3,4], the matrix used to calculate the factors is the correlation matrix (in the standard statistical sense) for the variables. The variance-covariance matrix may be used instead, but only when the variables are homogeneous, i.e. when they have essentially equal variances; otherwise a notable 'scale factor' disturbs the factor space. The factors obtained from this matrix define a factor space for the variables only. Another factor space exists for the cases which, although related mathematically to the variables factor space by means of the data table (the two spaces are said to be 'dual' spaces), cannot be directly superimposed onto it; nevertheless, this may be accomplished indirectly using one convention or another, to produce what is often referred to as bi-plots. For historical reasons, the correlation coefficient has been denoted r (as in regression) since its invention, and the calculation of the variables factor space is thus known as R-mode factor analysis; when the duality relationship with the cases factor space was applied to the calculation of this new space, the new procedure was named Q-mode factor analysis.
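As an illustration, the following minimal sketch in Python (with NumPy, assumed available; the data array X is hypothetical) walks through the three steps above for principal components analysis: transform the data table into a correlation matrix, extract its eigenvalues and eigenvectors, and project the standardized data into the factor space.

```python
import numpy as np

# A minimal sketch of principal components analysis following the three
# steps above; `X` is a hypothetical cases-by-variables data table.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Step 1: transform the data table into a matrix -- here the correlation
# matrix of the variables.
R = np.corrcoef(X, rowvar=False)                  # 4 x 4 correlation matrix

# Step 2: eigenvalues and eigenvectors; each pair defines a factor, and
# each eigenvalue measures the variability accounted for by its factor.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]             # sort factors by importance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 3: transform the (standardized) data table into the factor space.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores = Z @ eigenvectors                         # cases in the factor space

print(eigenvalues / eigenvalues.sum())            # share of variability per factor
```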

For correspondence analysis [1,2,5,6], the matrix used to calculate the factors is equivalent to the table of weighted data profiles derived directly from the data table. As the profiles can be calculated for the rows or for the columns of the data table, one might expect that two factor spaces will be obtained: one for the rows and one for the columns. Fortunately, this is not the case: only one factor space is obtained, which is simultaneously the factor space for the rows and the columns. Actually, as described in ref. 1, correspondence analysis can be viewed as finding the best simultaneous representation of two data sets that comprise the rows and columns of a data matrix.
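A minimal sketch of this computation, assuming NumPy and a small hypothetical contingency table N, shows how the weighted profiles lead to a single factor space shared by rows and columns; the singular value decomposition used here is one standard way of carrying out the calculation.

```python
import numpy as np

# Hypothetical contingency table (rows x columns of counts).
N = np.array([[16.0, 4.0, 2.0],
              [ 7.0, 9.0, 5.0],
              [ 3.0, 6.0, 12.0]])

P = N / N.sum()                      # table of relative frequencies
r, c = P.sum(axis=1), P.sum(axis=0)  # row and column weights (masses)

# Weighted residuals derived from the row and column profiles.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# The factors come from the singular value decomposition of this matrix;
# rows and columns receive coordinates in the SAME factor space.
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]

print(row_coords[:, :2])             # row points on the first two factors
print(col_coords[:, :2])             # column points on the same factors
```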

Another factor analysis method is discriminant analysis [1,7,8], often (wrongly) perceived by users as being a classification method. The purpose of discriminant analysis is to find factors, although in a somewhat constrained context: the user tells the method beforehand which cases are believed to belong to which group among a set of pre-defined groups, and the method then finds those factors which discriminate best between the groups of cases. For N groups of cases defined by the user, discriminant analysis finds N - 1 factors, and derives from them N classification functions which permit the eventual allocation of new cases to one or the other of the pre-defined groups. It is this latter application of discriminant analysis that creates confusion in the minds of users who see discriminant analysis as a classification method.
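The following minimal sketch, assuming scikit-learn is available (the data set and group labels are hypothetical), shows both aspects: N = 3 pre-defined groups yield N - 1 = 2 discriminant factors, and the derived classification functions can then allocate new cases.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: three user-defined groups of cases, four variables.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(20, 4)) for m in (0.0, 2.0, 4.0)])
groups = np.repeat([0, 1, 2], 20)        # pre-defined group memberships

lda = LinearDiscriminantAnalysis(n_components=2)
factors = lda.fit_transform(X, groups)   # cases projected onto the factors
print(factors.shape)                     # (60, 2): N - 1 factors

# The classification functions then allocate new cases to groups -- the
# usage that makes users mistake the method for a classification method.
new_case = rng.normal(loc=2.0, size=(1, 4))
print(lda.predict(new_case))
```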

Two closely related factor analysis methods have proved very useful in understanding the relationships between the various factor analysis methods that were developed over time: canonical correlation analysis, developed by Hotelling [9], and multiple canonical correlation analysis, an extension of that method developed by Carroll [10]. Canonical correlation analysis considers two sets of variables, {A} and {B}, measured for the same cases (thus producing the data table ({A}, {B})) and finds two factor spaces, one for each of {A} and {B}, so that each factor from {A} is correlated as well as possible with one factor from {B}. Although this method is very powerful, its results are difficult to interpret and use, which explains the limited practical success of canonical correlation analysis. Multiple canonical correlation analysis is an extension of the previous method which correlates more than two sets of variables, for example the four sets of variables in the data table ({A}, {B}, {C}, {D}). This method is of great theoretical value and provides an excellent framework for discussing several other multivariate methods [1,11] (a code sketch of canonical correlation analysis follows the list below):

- for a data table ({A}, {B}), one has Hotelling's canonical correlation analysis;

- for a data table ({x}, {Y}), where x is a single variable and {Y} a set of variables, one has multiple regression, with x the dependent variable and {Y} the independent (explicative) variables;

- for a data table ({A}, {I}), where {A} is a set of quantitative variables and {I} is a set of indicator variables (Boolean vectors), one has discriminant analysis, with {I} being the description of group memberships as pre-defined by the user;

- for a data table ({A}, {B}), where each of {A} and {B} describes a partition of a population as tabulated in a contingency table, one has ('simple') correspondence analysis, which can thus be considered as being essentially a double discriminant analysis;

- for a data table ({I}, {J}, {K}, ...), where each set is made of one indicator variable (Boolean vector), one has correspondence analysis of complete disjunctive variables, also called by some authors 'multiple' correspondence analysis;

- finally, for a data table ({a}, {b}, {c}, ...), where only one quantitative variable is present in each set, one has principal components analysis.
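Returning to Hotelling's method, here is the promised minimal sketch, assuming scikit-learn is available; the variable sets {A} and {B} and their shared underlying factor are hypothetical.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical sets {A} and {B} measured on the same 100 cases, sharing
# one underlying factor.
rng = np.random.default_rng(2)
common = rng.normal(size=(100, 1))
A = np.hstack([common + rng.normal(scale=0.5, size=(100, 1)) for _ in range(3)])
B = np.hstack([common + rng.normal(scale=0.5, size=(100, 1)) for _ in range(2)])

cca = CCA(n_components=2)
A_factors, B_factors = cca.fit_transform(A, B)   # one factor space per set

# Each factor from {A} is correlated as well as possible with the
# corresponding factor from {B}.
for k in range(2):
    r = np.corrcoef(A_factors[:, k], B_factors[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.2f}")
```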

2.2 Classification methods

Classification methods, also known as clustering or cluster analysis methods, analyze a data table by considering only one entry at a time, and do not directly relate the two entries to each other as factor analysis methods can. In this section, only the classification of the columns of a data table is mentioned; classification of the rows of a data table follows exactly the same procedure. In this respect, it is obvious that if one were to classify first the columns and then the rows of a data table, the two results would be related in some fashion, because rows and columns are related through the contents of the data table; this is where the combination of classification and factor analysis methods proves to be very powerful, as explained in refs. 12 and 13.

Classification methods, when applied to the columns of a data table, will group these columns into classes which are non-empty, usually disjoint, sets of columns found to be similar enough to be clustered together. Two key steps are involved: (1) as for any method, the definition of a distance criterion (or similarity criterion, which has opposite properties), which will permit us to calculate how dissimilar (or similar) two columns are, and (2) the definition of an aggregation criterion, which extends the concept of distance criterion to the measuring of a distance (or similarity) between one column and an existing class of columns (i.e. several columns and not only one, as with the normal distance criterion), and which permits the classification to proceed. Many distance criteria exist which are adapted to various types of data and various types of problems. Several aggregation criteria have also been developed, the most commonly found in classification software being (under these or fairly similar names): average linkage, single linkage, complete linkage, and Ward's inertia criterion. The reader is referred to refs. 1 and 12-14 and various statistical software manuals for more details. This should suffice for the reader to realize that several dozen classification techniques exist; in practice, understanding these two criteria in a particular application will allow you to understand each particular technique. In addition, classification methods belong to one of two types: partitioning methods and hierarchical methods.

A partition of a set of elements is defined as a collection of non-empty and disjoint subsets whose union is equal to the initial set. Partitioning methods, also known as K-means methods, will thus allocate the columns of a data table each to one of a pre-defined number of classes until all columns have been classified. The user must help such a method at its start, first by specifying how many classes are required, and then by giving some initial definition of each class (e.g. by providing locations of centres of gravity, or by providing a set of class nuclei), even though the algorithm may modify the latter as it proceeds. The algorithms of partitioning methods are usually convergent (i.e. usually find a solution), but may not be optimal, because their results depend upon the user's initial choices. Their relatively easy implementation, even when very large data sets are involved (see further), is their main strength.
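A minimal partitioning sketch, assuming scikit-learn is available; the data and the user's initial class centres below are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two groups of cases described by two variables.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=m, size=(30, 2)) for m in (0.0, 5.0)])

# The user chooses the number of classes and supplies initial centres,
# which the algorithm may modify as it proceeds.
initial_centres = np.array([[0.0, 0.0], [5.0, 5.0]])
km = KMeans(n_clusters=2, init=initial_centres, n_init=1).fit(X)

print(km.labels_[:5])          # class allocated to each of the first cases
print(km.cluster_centers_)     # final centres of gravity of the classes
```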

Hierarchical methods will aggregate the columns of a data table, starting with the two most similar (least distant) columns and continuing with the remaining columns in accordance with the distance and aggregation criteria selected, until only one class is obtained, which is the set of all columns. These methods thus produce a series of nested classes which form a binary tree or dendrogram. One may note that when this tree is 'cut' at any level between the initial state (at the 'leaves': all columns separate) and the final state (at the 'root': all columns together), it yields a partition of the columns into as many classes as there are 'branches' at that level. The result of a hierarchical classification method is thus a collection of nested partitions, and the user can select the level at which a partition may be obtained, which is clearly more flexible than calculating partitions for an a priori number of classes. There are cases when partitioning methods are still of interest, however, essentially when very large data sets (i.e. larger than 5000 × 100?) are involved and must be reduced to a more manageable size before eventual further processing by hierarchical classification or other methods.
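A minimal hierarchical sketch, assuming SciPy is available; the data are hypothetical, and Euclidean distance with Ward's inertia criterion stand in for whichever distance and aggregation criteria the user selects.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical data: three groups of cases described by three variables.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=m, size=(10, 3)) for m in (0.0, 4.0, 8.0)])

distances = pdist(X, metric="euclidean")   # step 1: distance criterion
tree = linkage(distances, method="ward")   # step 2: aggregation criterion

# Cutting the dendrogram at a chosen level yields a partition with as many
# classes as there are branches at that level.
partition = fcluster(tree, t=3, criterion="maxclust")
print(partition)
```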

The domain of classification methods is actually somewhat more complex than this summary indicates, although the two types recognized here, partitioning and hierarchical, are fundamental and are the only ones generally encountered in statistical software. Other methods that can be mentioned are: classification involving overlapping clusters [15,16], and classification involving fuzzy sets [17].

2.3 Comments on vocabulary and usage

As stated earlier, a prolific vocabulary has emerged during the development of the field of multivariate statistics, which is both confusing and discouraging for new users. It is hoped that the few comments that follow will not add to that confusion.

The most common source of confusion can be traced to unclear statements made by users about the method or technique they employ and the usage they make of it. For example, when an author states that principal components analysis was used to 'classify' samples, doubts are bound to appear in a beginner's mind. It is an unfortunate fact that the human vocabulary is very limited considering the variety of concepts that human beings can generate; the additional fact that discipline in vocabulary usage is a difficult task does not help the situation. To come back to the example above, a factor analysis method (principal components analysis) is used to project sample distribution patterns onto factorial planes: this is the technical aspect. Then, because groups of samples appear on such projections, the user decides, legitimately enough, to consider such groupings as significant for the purpose of the project under study: this is the usage aspect. Perhaps the user should then state that 'sample groups were identified' or 'new samples were assigned to recognized groups', rather than 'samples were classified': this is the aspect of vocabulary discipline. I have only one piece of advice for users:

(a) try to distinguish a technique (data → information, see Fig. 1) from its usage (information → decisions, see Fig. 1), then

(b) develop a clear understanding of the type of technique used and its specifications, and

(c) appreciate the usage of this technique.

Here are some other examples of general problem areas related to vocabulary in multivariate statistics that are clarified when one uses the simple typology of methods explained in the two previous sections. 'Mapping techniques', whether linear or non-linear, are projection methods and are thus akin to the factor analysis type. 'Discriminant analysis', as explained in Section 2.1, is a factor analysis method and not a classification method. 'Supervised' classification methods assume that the user has information about the classes before the algorithm is applied, as is the case with partitioning methods for which class nuclei are given as input; unfortunately, this expression is often used, for example in chemometrics, to designate discriminant analysis or another form of sample identification in a factor space! Conversely, 'unsupervised' classification methods do not assume anything about the elements to be classified, as is the case for hierarchical classification. It is suggested that the expressions 'unsupervised classification' and 'supervised classification' be used and interpreted with extreme care, because they relate to usage more than to specific techniques. 'Modelling' consists in using mathematical formulae to describe a phenomenon, or observed correlations (in the generic sense), or any feature that can be described quantitatively: it is thus no more related to multivariate statistics than to other statistical techniques. Similarly with 'prediction', which consists essentially in extrapolating results from known cases to unknown cases. Finally, a commonly used post-processing technique in factor analysis is 'factor rotation', which can be done following various criteria: users frequently ask whether it is an appropriate procedure. Without going into the details of the battle waged by statisticians around this question, well presented in ref. 18, it is worth noting that the initial purpose of factor rotations, which may even involve oblique (i.e. correlated) factors, was to avoid data structures of general interest in favour of those of specific a priori interest to the user. In the general context of data analysis, it is safer to accept the data structures obtained without rotations and to try to interpret them for what they are; if some geometrical transformation of the factor coordinates is useful for some practical reason, such as measuring a distance along an oblique trend of cases, users can take the responsibility of making such simple calculations on their own.

A few last comments may be made which relate more specifically to the field of chemometrics as presented in this volume. The SIMCA method (Soft Independent Modelling of Class Analogy [19]) is used following the identification of subsets of cases in a principal components factor space, and consists in the local modelling of each of these subsets separately by local principal components analyses. The result is a series of local principal components models in a common initial principal components space, described by a series of parameters. The PLS method (Partial Least Squares [20]) is based on a double principal components analysis, and finds factors from one variable space which are correlated as well as possible with factors from another variable space, both being 'measured' on the same set of samples. As a method, PLS thus has a purpose similar to that of canonical correlation analysis, discussed briefly in Section 2.1.
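As a sketch of this purpose, assuming scikit-learn is available (the two variable spaces below are hypothetical, and scikit-learn's PLSRegression stands in for the PLS method discussed): factors are extracted from the two spaces, measured on the same samples, so that paired factors are as correlated as possible.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Hypothetical variable spaces measured on the same 40 samples; Y depends
# (noisily) on part of X.
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 6))
Y = X[:, :2] @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(40, 3))

pls = PLSRegression(n_components=2).fit(X, Y)
X_factors, Y_factors = pls.transform(X, Y)   # paired factor scores

for k in range(2):
    r = np.corrcoef(X_factors[:, k], Y_factors[:, k])[0, 1]
    print(f"correlation of paired PLS factors {k + 1}: {r:.2f}")
```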

3 ACKNOWLEDGEMENTS

I wish to thank the organizers and sponsors of the workshop "Multivariate Statistics for Geochemists and Geologists" for inviting me to participate. I also wish to thank the participants of this workshop for the enthusiasm they showed during the course of the workshop, which encouraged me to write this contribution; I hope it will in turn encourage them to use multivariate data analysis methods in their work. R.G. Brereton, G. Nickless, and an anonymous reviewer are gratefully acknowledged for their constructive criticisms and their suggestions for improving the manuscript.

REFERENCES

1 L. Lebart, A. Morineau and K.M. Warwick, Multivariate Descriptive Statistical Analysis: Correspondence Analysis and Related Techniques for Large Matrices, Wiley, New York, 1984, 231 pp.

2 J.P. Benzécri, L'Analyse des Données: 2. L'Analyse des Correspondances, Dunod, Paris, 1st ed. 1973, 2nd ed. 1980, 632 pp.

3 K. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical Magazine, 2, No. 11 (1901) 559-572.

4 C.R. Rao, The use and interpretation of principal component analysis in applied research, Sankhya, Series A, 26 (1964) 329-357.

5 M.O. Hill, Correspondence analysis: a neglected multivariate method, Applied Statistics, 23 (1974) 340-354.

6 M.J. Greenacre, Theory and Applications of Correspondence Analysis, Academic Press, London, 1984, 364 pp.

7 R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7 (1936) 179-188.

8 C.R. Rao, Advanced Statistical Methods in Biometric Research, Wiley, New York, 1952.

9 H. Hotelling, Relations between two sets of variates, Biometrika, 28 (1936) 129-149.

10 J.D. Carroll, A generalization of canonical correlation analysis to three or more sets of variables, Proceedings of the American Psychological Association, 1968, pp. 227-228.

11 J.M. Bouroche and G. Saporta, L'Analyse des Données, Presses Universitaires de France, Paris, 1980, 127 pp.

12 M. Jambu and M.O. Lebeaux, Cluster Analysis and Data Analysis, North-Holland, New York, 1983, 898 pp.

13 J.P. Fenelon, Qu'est-ce que l'Analyse des Données?, Lefonen, Paris, 1981, 311 pp.

14 J.P. Benzécri, L'Analyse des Données: 1. La Taxinomie, Dunod, Paris, 1st ed. 1973, 2nd ed. 1980, 625 pp.

15 N. Jardine and R. Sibson, A model for taxonomy, Mathematical Biosciences, 2 (1968) 465-482.

16 N. Jardine and R. Sibson, Mathematical Taxonomy, Wiley, New York, 1971.

17 J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.

18 J.P. Benzécri, Histoire et Préhistoire de l'Analyse des Données, Dunod, Paris, 1982, 159 pp.

19 S. Wold, Pattern recognition by means of disjoint principal components models, Pattern Recognition, 8 (1976) 127-139.

20 H. Wold, Non-linear estimation by iterative least squares procedures, in Research Papers in Statistics: Festschrift for Neyman, Wiley, New York, 1966, pp. 411-444.