# [Wiley Series in Probability and Statistics] Applied Multiway Data Analysis || Overview

Post on 09-Dec-2016


<ul><li><p>CHAPTER 2</p><p>OVERVIEW</p><p>2.1 WHAT ARE MULTIWAY DATA?<sup>1</sup></p><p>Most statistical methods are used to analyze the scores of objects (subjects, groups, etc.) on a number of variables, and the resulting data can be arranged in a two-way matrix, that is, a rectangular arrangement of rows (objects) and columns (variables). However, data are often far more complex than this. For example, the data may have been collected under a number of conditions or at several time points. When either time or conditions are considered, the data are no longer two-way, but become three-way data (see Fig. 2.1). In such a case, there is a matrix for each condition, and these matrices can be arranged next to each other to form a wide combination-mode matrix of subjects by variables × conditions. Alternatively, one may create a tall combination-mode matrix of subjects × conditions by variables. The term combination-mode is used to indicate that one of the ways of the matrix consists of the combination of two modes rather than a single one.</p><p><sup>1</sup> Portions of this chapter have been taken from Kroonenberg (2005b); © 2005 John Wiley &amp; Sons Limited. Reproduced and adapted with kind permission.</p><p>Applied Multiway Data Analysis. By Pieter M. Kroonenberg. Copyright © 2007 John Wiley &amp; Sons, Inc.</p><p>[Figure 2.1 Two-way combination-mode matrices and three-way arrays. Panels: wide matrix, three-way array, tall matrix; indices i = 1, ..., I; j = 1, ..., J; k = 1, ..., K.]</p><p>The third possibility is to arrange the set of matrices in a three-dimensional block or three-way array. When both time and conditions play a role, the data set becomes a four-way array, and including even more aspects leads to higher-way data. The collection of techniques designed to analyze multiway data is referred to as multiway methods, or in the three-way case mostly three-mode methods. 
Making sense of multiway data is the art of multiway data analysis. The term multiway array is now the standard; sometimes the term multiway tensor is used instead of array. We will, however, avoid the term tensor because it has a more specific meaning in mathematics and we are primarily concerned with data arrays rather than algebraic structures.</p><p>The words multiway and multimode can be given very precise meanings, but in this book the usage is not always very exact. The word way is considered more general, referring to the multidimensional arrangement irrespective of the content of the data, while the word mode is more specific and refers to the content of each of the ways. Thus, objects, variables, and conditions can be the modes of a three-way data array. When the same entities occur in two different ways, as is the case in a correlation matrix, the data are one-mode two-way data. When correlation matrices for the same variables are available from several different samples, one often speaks of a two-mode three-way data array, where the variables and the samples are the two modes. In this book, we generally refer to multiway and three-way data and multiway and three-way arrays. However, following historical practice we will generally refer to three-mode methods, three-mode analysis, and three-mode models, but to multiway methods, multiway analysis, and multiway models. Terms commonly used in connection with multiway models are trilinear, quadrilinear, and multilinear, referring to linearity of these models in one set of their parameters given the other sets of parameters.</p></li><li><p>Individual differences is another expression that deserves attention in the context of multiway data analysis. In standard statistical theory it is assumed that subjects are drawn from a population, and thus are exchangeable. 
Two subjects are exchangeable if they are drawn from the same population and it is irrelevant to the validity of generalizations made from the sample which of the entities is included in the sample. Entities in random samples are automatically exchangeable. In multiway analysis random samples are rare, but one hopes that the entities included in a sample are at least exchangeable with nonsampled entities from the population. On the other hand, in many multiway analyses there is not necessarily a sampling framework. Each individual or object is treated as a separate entity and its characteristics are relevant to the analysis. In some research designs this aspect of the data is more important than in others. For instance, in an agricultural experiment each variety of a crop is specifically chosen to be planted and compared with other varieties. Thus, in such studies the idea of a sample from a population is less relevant. However, in studies where the emphasis is exclusively on the development of a correlational structure over time, the specific subjects in the study may be of less interest.</p><p>2.2 WHY MULTIWAY ANALYSIS?</p><p>If there are so many statistical and data-analytic techniques for two-way data, why are these not sufficient for multiway data? The simplest answer is that two-mode methods do not respect the multiway design of the data. In itself this is not unusual since, for instance, time-series data are often analyzed as if the time mode were an unordered mode and the time sequence is only used in interpretation. However, it should be realized that multiway models introduce intricacies that might lead to interpretational difficulties. One of the objects of this book is to assist in using multiway models in practical data analysis. Thus, the book can be seen as an attempt to counter the statement by Gower (2006, p. 
128) that "many, but not all, of the multilinear models developed by psychometricians are, in my view, uninterpretable - except perhaps to their originators". Luckily, support for the position taken in this book can be had from various quarters. In analytical chemistry, for instance, researchers adopt an entirely different tone. As an example, we may quote Goicoechea, Yu, Olivieri, and Campiglia (2005, p. 2609), who do not belong to the originators: "multidimensional data formats, which combine spectral and lifetime information, offer tremendous potential for chemometric analysis. High-order data arrays are particularly useful for the quantitative analysis of complex multicomponent samples and are gaining widespread analytical acceptance."</p><p>Multiway data are supposedly collected because all ways are necessary to answer the pertinent research questions. Such research questions can be summarized in the three-way case as "Who does what to whom and when?", or more specifically, "Which groups of subjects behave differently on which variables under which conditions?" or "Which plant varieties behave in a specific manner at which locations on which attributes?". Such questions cannot be answered by means of two-mode methods, because these have no separate parameters for each of the three modes. When analyzing three-way data with two-mode methods, one has to rearrange the data as in Fig. 2.1, and this means that either the subjects and conditions are combined into a single mode (tall combination-mode matrix) or the variables and conditions are so combined (wide combination-mode matrix). Thus, two of the modes are always confounded and no independent parameters for these modes are present in the model itself, except when models are used that are specifically geared toward this situation.</p><p>In general, a multiway model uses fewer parameters for multiway data than an appropriate two-mode model. 
To what extent this is true depends very much on the specific model used. In some multiway component models low-dimensional representations are defined for all modes, which can lead to enormous reductions in parameters. Unfortunately, that does not mean that the results of a multiway analysis are automatically easier to interpret. Again, this depends on the questions asked and the models used.</p><p>An important aspect of multiway models, especially in the social and behavioral sciences, is that they allow for the analysis of individual differences in a variety of conditions. The subjects do not disappear in means, (co)variances, or correlations, and possibly higher-order moments such as the kurtosis and skewness, but they are examined in their own right. This implies that often the data set is taken as is, and not necessarily as a random sample from a larger population in which the subjects are in principle exchangeable. Naturally, this affects the generalizability, but that is considered inevitable. At the same time, however, the subjects are recognized as the data generators and are awarded a special status, for instance, when statistical stability is determined via bootstrap or jackknife procedures (see Section 9.8.2, p. 233; Section 8.8.1, p. 188; Section 8.8.2, p. 188). Furthermore, it is nearly always the contention of the researcher that similar samples are or may become available, so that at least part of the results are valid outside the context of the specific sample.</p><p>2.3 WHAT IS A MODEL?</p><p>2.3.1 Models and methods</p><p>In this book a model is a theoretical, in our case a mathematical, construct or "a simplified description of a system or complex entity, esp. one designed to facilitate descriptions and predictions" (Collins English Dictionary). A well-known example is the normal distribution, which often serves as a simplified description of a distribution of an observed variable. 
The simplification comes from the fact that the normal distribution is determined by its shape (described by a formula) and the values of its two parameters, the mean and the variance. However, to describe the empirical distribution of a variable we need to know all data values, however many there are. Thus, if we are prepared to assume that a variable is approximately normally distributed, we can use the properties of that distribution, and we can describe and interpret its shape using the parameters to describe and compare the distributions of different variables. All of this is extremely difficult to do with empirical distributions because there is no real reduction in complexity.</p><p>A central question is whether a particular model is commensurate with a particular data set, for instance, whether the empirical distribution of a variable can be adequately represented by a normal distribution with the specific mean and variance. To study this, one often defines loss functions (also called discrepancy functions), which quantify the difference between the data and the model. In the case of the normal distribution, this could take the form of the sum of (squared) differences between the observed values and those derived from the normal distribution.</p><p>In the physical sciences, a model is often determined by physical laws that describe the phenomenon under study, such as the way a pendulum swings. In such a case the interpretations of the parameters are known beforehand, and one only has to estimate from the data the values of the parameters for the case at hand.</p><p>In the social and behavioral sciences, similarly defined models are rare. Instead, the idea of a model takes on a different meaning. Models are conceived as descriptive of structural relationships between the entities in the data, and models are not grounded in a particular theory but have a more general character. 
The principal component analysis (PCA) model is an example of this. It is assumed that the patterns in the data can be described by a bilinear relationship between components for the subjects (rows) and those for the variables (columns), as described in Eq. (2.1):</p><p>$$x_{ij} = \hat{x}_{ij} + e_{ij} = \sum_{s=1}^{S} a_{is} f_{js} + e_{ij} \qquad (i = 1, \ldots, I;\; j = 1, \ldots, J), \tag{2.1}$$</p><p>where the $a_{is}$ are the coefficients of the subjects on the components of the subjects (often called scores), the $f_{js}$ are the coefficients of the variables (often called loadings), and $S$ is the number of components used to approximate the data (see Chapter 9, especially Section 9.3.1, p. 215). The relationship is called bilinear because given the coefficients of one mode the equation is linear in the other mode, and vice versa. Given a solution, not only the parameters need to be interpreted but also the structure they describe. Generally, there are no detailed laws that explain why the description of the data by a principal component model is the proper and unique way to represent the relationships in the data. However, it is possible to specify via a loss function whether the PCA model provides an adequate description of the data in terms of fit. In this situation $\hat{x}_{ij}$ will be called the structural image of the data. The description of the structure via principal components in Eq. (2.1) is then the structural representation. However, it is generally more convenient to simply use the word model in this case, keeping in mind the more formally correct designation.</p><p>Some multiway techniques described in this book do not conform to the idea of a model, because there is no single formulation of the relationships between the variables. Consequently, it is not possible to define a loss function that indicates how well the solution fits the data. Such techniques are called methods. 
They generally consist of a series of procedures or steps that have to be followed to obtain a solution, but it is not possible at the end of the computations to say how well the solution fits the data.</p><p>In two-mode analysis, many clustering procedures are methods without underlying models. An example in three-mode analysis is the procedure called STATIS (Structuration des tableaux à trois indices de la statistique, or Structuring of triple-indexed statistical tables; L'Hermier des Plantes, 1976; Lavit, 1988), which is much used in France. It consists of three separate but linked steps, but no overall loss function can be specified (see Section 5.8, p. 105).</p><p>The disadvantage of not having a loss function is that it is not possible to compare the fit of different solutions because there is no common reference point. Furthermore, solutions cannot be statistically evaluated via significance tests, should one desire to do this. The models discussed in this section are typically data-analytic models, because there is no reference to statistical properties of these models, such as the type of distributions of the variables or of the errors. When a model has an explicit error structure and distributional assumptions are made, we will use the term statistical model. Nearly all models for multiway data are data-analytic models. Notable exceptions are the models underlying the three-mode mixture method of clustering (see Chapter 16) and three-mode (common) factor analysis (see Section 4.3.5, p. 47).</p><p>2.3.2 Choice of dimensionality</p><p>Characteristic of component models is that they in fact form classes of models, because within each class we may specify the dimensionality of the solution. A principal component model as given in Eq. (2.1) fits the data differen...</p></li></ul>
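
The bilinear model of Eq. (2.1) and a least-squares loss function can be illustrated with a small sketch. This is not taken from the book: the function names, the coefficient values, and the choice of a sum-of-squares loss are assumptions made purely for illustration, and the data are idealized (generated exactly by two components, so all residuals are zero). The sketch computes the structural image for one and for two components and compares the resulting loss.

```python
# Minimal sketch of the bilinear model in Eq. (2.1) with a least-squares loss.
# All names and numbers below are hypothetical illustrations, not from the book.


def structural_image(scores, loadings, n_components):
    """Return x_hat[i][j] = sum over s < n_components of scores[i][s] * loadings[j][s]."""
    return [
        [sum(a_i[s] * f_j[s] for s in range(n_components)) for f_j in loadings]
        for a_i in scores
    ]


def least_squares_loss(data, image):
    """Sum of squared differences between the data and the structural image."""
    return sum(
        (x - x_hat) ** 2
        for row, row_hat in zip(data, image)
        for x, x_hat in zip(row, row_hat)
    )


# Hypothetical coefficients for I = 4 subjects, J = 3 variables, S = 2 components.
scores = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]  # a_is
loadings = [[2.0, 0.5], [2.0, -0.5], [1.0, 0.0]]               # f_js

# Idealized data: generated exactly by two components, so every e_ij is zero.
data = structural_image(scores, loadings, n_components=2)

# Loss when the data are approximated with one and with two components.
loss_one = least_squares_loss(data, structural_image(scores, loadings, 1))
loss_two = least_squares_loss(data, structural_image(scores, loadings, 2))
print(loss_one, loss_two)
```

With these made-up values the loss drops from 2.0 with one component to 0.0 with two, which shows how a loss function gives solutions of different dimensionality a common reference point. In real data the coefficients would have to be estimated and the residuals would not vanish.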