# [Wiley Series in Probability and Statistics] Finding Groups in Data || Introduction

Post on 20-Dec-2016

213 views

Category:

## Documents

0 download

Embed Size (px)

TRANSCRIPT

<ul><li><p>C H A P T E R 1 </p><p>Introduction </p><p>1 MOTIVATION </p><p>Cluster analysis is the art of finding groups in data. To see what is meant by this, let us look at Figure 1. It is a plot of eight objects, on which two variables were measured. For instance, the weight of an object might be displayed on the horizontal axis and its height on the vertical one. Because this example contains only two variables, we can investigate it by merely looking at the plot. </p><p>In this small data set there are clearly two distinct groups of objects, namely {TIN, TAL, KIM, ILA} and (LIE, JAC, PET, LEO}. Such groups are called clusters, and to discover them is the aim of cluster analysis. Basically, one wants to form groups in such a way that objects in the same group are similar to each other, whereas objects in different groups are as dissimilar as possible. </p><p>The classification of similar objects into groups is an important human activity. In everyday life, this is part of the learning process: A child learns to distinguish between cats and dogs, between tables and chairs, between men and women, by means of continuously improving subconscious classi- fication schemes. (This explains why cluster analysis is often considered as a branch of pattern recognition and artificial intelligence.) Classification has always played an essential role in science. In the eighteenth century, Linnaeus and Sauvages provided extensive classifications of animals, plants, minerals, and diseases (for a recent survey, see Holman, 1985). In astron- omy, Hertzsprung and Russell classified stars in various categories on the basis of two variables: their light intensity and their surface temperature. In the social sciences, one frequently classifies people with regard to their behavior and preferences. In marketing, it is often attempted to identify market segments, that is, groups of customers with similar needs. Many more examples could be given in geography (clustering of regions), medicine </p><p>1 </p><p>Finding Groups in Data: An Introduction to Cluster Analysis Leonard Kaufman and Peter J. Rousseeuw </p><p>Copyright 01990,2005 by John Wiley &amp; Sons, Inc </p></li><li><p>2 </p><p>180 </p><p>170- </p><p>160 </p><p>INTRODUCTION </p><p>H E I G H T ( C M l 8 </p><p>* </p><p>150 </p><p>1 L O . </p><p>130 </p><p>120 </p><p>110 </p><p>100 </p><p>ga </p><p>L 1; J.AC </p><p>- </p><p>* </p><p>* </p><p>- - </p><p>K IM.0 ILA - .TAL </p><p>P,ET Lgo </p><p>-I- * 10 20 30 40 50 60 70 80 </p><p>W E I G H T ( K G ) </p><p>Figure 1 A plot of eight objects. </p><p>(incidence of specific types of cancer), chemistry (classification of com- pounds), history (grouping of archeological findings), and so on. Moreover, cluster analysis can be used not only to identify a structure already present in the data, but also to impose a structure on a more or less homogeneous data set that has to be split up in a fair way, for instance when dividing a country into telephone areas. Note that cluster analysis is quite different from discriminant analysis in that it actually establishes the groups, whereas discriminant analysis assigns objects to groups that were defined in ad- vance. </p><p>In the past, clusterings were usually performed in a subjective way, by relying on the perception and judgment of the researcher. In the example of Figure 1, we used the human eye-brain system which is very well suited (through millenia of evolution) for classification in up to three dimensions. </p></li><li><p>TYPES OF DATA AND HOW TO HANDLE THEM 3 </p><p>However, the need to classify cases in more than three dimensions and the upcoming objectivity standards of modern science have given rise to so- called automatic classification procedures. Over the last 30 years, a wealth of algorithms and computer programs has been developed for cluster analysis. The reasons for this variety of methods are probably twofold. To begin with, automatic classification is a very young scientific discipline in vigorous development, as can be seen from the thousands of articles scattered over many periodicals (mostly journals of statistics, biology, psychometrics, computer science, and marketing). Nowadays, automatic classification is establishing itself as an independent scientific discipline, as witnessed by a full-fledged periodical (the Journal 01 Classi\$cation, first published in 1984) and the International Federation of Classification Soci- eties (founded in 1985). The second main reason for the diversity of algorithms is that there exists no general definition of a cluster, and in fact there are several kinds of them: spherical clusters, drawn-out clusters, linear clusters, and so on. Moreover, different applications make use of different data types, such as continuous variables, discrete variables, similarities, and dissimilarities. Therefore, one needs different clustering methods in order to adapt to the kind of application and the type of clusters sought. Cluster analysis has become known under a variety of names, such as numerical taxonomy, automatic classification, botryology (Good, 1977), and typologi- cal analysis (Chandon and Pinson, 1981). </p><p>In this book, several algorithms are provided for transforming the data, for performing cluster analysis, and for displaying the results graphically. Section 2 of this introduction discusses the various types of data and what to do with them, and Section 3 gives a brief survey of the clustering methods contained in the book, with some guidelines as to which algorithm to choose. In particular the crucial distinction between partitioning and hierarchical methods is considered. In Section 4 a schematic overview is presented. In Section 5, it is explained how to use the program DAISY to transform your data. </p><p>2 TYPES OF DATA AND HOW TO HANDLE THEM Our first objective is to study some types of data which typically occur and to investigate ways of processing the data to make them suitable for cluster analysis. </p><p>Suppose there are n objects to be clustered, which may be persons, flowers, words, countries, or whatever. Clustering algorithms typically oper- ate on either of two input structures. The first represents the objects by means of p measurements or attributes, such as height, weight, sex, color, </p></li><li><p>4 INTRODUCTION </p><p>and so on. These measurements can be arranged in an n-by-p matrix, where the rows correspond to the objects and the columns to the attributes. In Tucker's (1964) terminology such an objects-by-variables matrix is said to be wo-mode, since the row and column entities are different. The second structure is a collection of proximities that must be available for all pairs of objects. These proximities make up an n-by-n table, which is called a one-mode matrix because the row and column entities are the same set of objects. We shall consider two types of proximities, namely dissimilarities (which measure how far away two objects are from each other) and similarities (which measure how much they resemble each other). Let us now have a closer look at the types of data used in this book by considering them one by one. </p><p>2.1 Interval-Scaled Variables </p><p>In this situation the n objects are characterized by p continuous measure- ments. These values are positive or negative real numbers, such as height, weight, temperature, age, cost, ..., which follow a linear scale. For in- stance, the time interval between 1905 and 1915 was equal in length to that between 1967 and 1977. Also, it takes the same amount of energy to heat an object of -16.4"C to -12.4"C as to bring it from 352C to 39.2"C. In general it is required that intervals keep the same importance throughout the scale. </p><p>These measurements can be organized in an n-by-p matrix, where the rows correspond to the objects (or cases) and the columns correspond to the variables. When the fth measurement of the ith object is denoted by xi f (where i = 1,. . . , n and f = 1,. . . , p) this matrix looks like </p><p>n objects </p><p>P ... </p><p>... </p><p>variables X I / * * - </p><p>X i f ... </p><p>X n / . * * x.P ;I For instance, consider the following real data set. For eight people, the weight (in kilograms) and the height (in centimeters) is recorded in Table 1. In this situation, n = 8 and p = 2. One could have recorded many more variables, like age and blood pressure, but as there are only two variables in this example it is easy to make a scatterplot, which corresponds to Figure 1 </p></li><li><p>TYPES OF DATA AND HOW TO HANDLE THEM 5 </p><p>Table 1 Weight and Height of Eight People, Expressed in Kilograms and Centimeters </p><p>Weight Height Name (kg) (cm) Ilan 15 95 Jacqueline 49 156 Kim 13 95 Lieve 45 160 Leon 85 178 Peter 66 176 Talia 12 90 Tina 10 78 </p><p>given earlier. Note that the units on the vertical axis are drawn to the same size as those on the horizontal axis, even though they represent different physical concepts. The plot contains two obvious clusters, which can in this case be interpreted easily: the one consists of small children and the other of adults. </p><p>Note that other variables might have led to completely different cluster- ings. For instance, measuring the concentration of certain natural hormones might have yielded a clear-cut partition into three male and five female persons. By choosing still other variables, one might have found blood types, skin types, or many other classifications. </p><p>Let us now consider the effect of changing measurement units. If weight and height of the subjects had been expressed in pounds and inches, the results would have looked quite different. A pound equals 0.4536 kg and an inch is 2.54 cm. Therefore, Table 2 contains larger numbers in the column of weights and smaller numbers in the column of heights. Figure 2, although plotting essentially the same data as Figure 1, looks much flatter. In this figure, the relative importance of the variable weight is much larger than in Figure 1. (Note that Kim is closer to Ilan than to Talia in Figure 1, but that she is closer to Talia than to Ilan in Figure 2.) As a consequence, the two clusters are not as nicely separated as in Figure 1 because in this particular example the height of a person gives a better indication of adulthood than his or her weight. If height had been expressed in feet (1 f t = 30.48 cm), the plot would become flatter still and the variable weight would be rather dominant. </p><p>In some applications, changing the measurement units may even lead one to see a very different clustering structure. For example, the age (in years) and height (in centimeters) of four imaginary people are given in </p></li><li><p>6 INTRODUCTION </p><p>80 </p><p>60 </p><p>20 </p><p>Table 2 Weight and Height of the Same Eight People, But Now Expressed in Pounds and Inches </p><p>Name (1b) (in.) Weight Height </p><p>* </p><p>- KIM </p><p> TAL.QOILA TIN. - </p><p>Ilan Jacqueline Kim Lieve Leon Peter Talia Tina </p><p>33.1 108.0 28.7 99.2 187.4 145.5 26.5 22.0 </p><p>37.4 61.4 37.4 63.0 70.0 69.3 35.4 30.7 </p><p>Table 3 and plotted in Figure 3. It appears that {A, B ) and { C, 0) are two well-separated clusters. On the other hand, when height is expressed in feet one obtains Table 4 and Figure 4, where the obvious clusters are now {A, C} and { B, D}. This partition is completely different from the first because each subject has received another companion. (Figure 4 would have been flattened even more if age had been measured in days.) </p><p>To avoid this dependence on the choice of measurement units, one has the option of standardizing the data. This converts the original measure- ments to unitless variables. First one calculates the mean value of variable </p><p>HEIGHT (INCHES) t 8 </p><p>&amp; T LEO </p><p>Figure 2 Plot corresponding to Table 2. </p></li><li><p>TYPES OF DATA AND HOW TO HANDLE THEM 7 Table 3 Age (in years) and Height (in centimeters) of Four People </p><p>Age Height Person ( Y d (cm) A 35 190 B 40 190 C 35 160 D 40 160 </p><p>"O t 0 . </p><p>C D - 20 30 35 LO 50 A G E ( Y E A R S ) </p><p>L, . Figure 3 Plot of height (in centimeters) versus age for four people. </p><p>Table 4 Age (in years) and Height (in feet) of the Same Four People ~ </p><p>Age Height Person (Yr) (ft) </p><p>A 35 6.2 B 40 6.2 C 35 5.2 D 40 5.2 </p></li><li><p>8 INTRODUCTION </p><p>HEIGHT IFEETI </p><p>5 C. 1 3 </p><p>2 </p><p>1 </p><p>35 36 37 38 39 10 41 AGE (YEARS1 </p><p>- 1 . - . . - - - - Figure 4 Plot of height (in feet) versus age for the same four people. </p><p>f, given by 1 </p><p>n / = - (x , / + X/ + - * - + X U / ) </p><p>for each f = 1,. . . , p. Then one computes a measure of the dispersion or spread of this fth variable. Traditionally, people use the standard deviation </p><p>for this purpose. However, this measure is affected very much by the presence of outlying values. For instance, suppose that one of the xi/ has been wrongly recorded, so that it is much too large. In this case std/ will be unduly inflated, because xi / - mf is squared. Hartigan (1975, p. 299) notes that one needs a dispersion measure that is not too sensitive to outliers. Therefore, from now on we will use the mean absolute deviation </p><p>where the contribution of each measurement xi/ is proportional to the absolute value \ x i / - m/l. This measure is more robust (see, e.g., Hampel </p></li><li><p>TYPES OF DATA AND HOW TO HANDLE THEM 9 et al., 1986) in the sense that one outlying observation will not have such a large influence on sf. (Note that there exist estimates that are much more robust, but we would like to avoid a digression into the field of robust statistics.) </p><p>Let us assume that sf is nonzero (otherwise variable f is constant over all objects and must be removed). Then the standardized measurements are defined by </p><p>and sometimes called z-scores. They are unitless because both the numera- tor and the denominator are expressed in the same units. By construction, the z i f have mean value zero and their mean absolute deviation is equal to 1. When applying standardization, one forgets about the original data (1) and uses the new data matrix </p><p>variables </p><p>objects </p><p>. . . </p><p>. . . </p><p>... </p><p>... </p><p>... </p><p>... </p><p>in all subsequent computations. The advantage of using sf rather than std, in the denominator of (4) is that s/ will not be blown up so much in the case of an outlying x I / , and hence the corresponding zif will still stick out so the ith object can be recognized as an outlier by the clustering algorithm, which will typically put it in a separate cluster. [This motivation differs from that of Milligan and Cooper (1988), who recommended the use of a nonrobust denominator.] </p><p>The preceding description might convey the impression that standardiza- tion would be beneficial in all situations. However, it is merely an option that may or may not be useful in a given application. Sometimes the variables have an absolute meaning, and should not be standardized (for instance, it may happen that several variables are expressed in the same units, so they should not be divided by different s/). Often standardization dampens a clustering structure by reducing the large effects because the variables with a big contribution are divided by a large s f . </p><p>For example, let us standardize the data of Table 3. The mean age equals m1 = 37.5 and the mean absolute deviation of the first variable works out to be s1 = (2.5 + 2.5 + 2.5 + 2.5)/4 = 2.5. Therefore, standardization converts age 40 to + 1 and age 35 to - 1. Analogously, m, = 175 cm and </p></li><li><p>10 INTRODUCTION </p><p>Table 5 Standprdized Age and Height of the Sam...</p></li></ul>