[wiley series in probability and statistics] finding groups in data || appendix

Appendix

The purpose of this appendix is to provide additional information about the programs described in this book. Besides giving a better understanding of the implementation of the various techniques and algorithms, this information should allow the reader to use the programs efficiently and to adapt them to his or her particular requirements when needed. Moreover, Section 4 describes a new program called CLUSPLOT, by which a graphical representation of a partition can be obtained.

1 IMPLEMENTATION AND STRUCTURE OF THE PROGRAMS

The programs of Chapters 1 to 7 were written in Fortran. In an effort to make the programs as portable as possible we have decided to impose several restrictions:

Many statements that are not used in the same way by all compilers or that may make it difficult to modify the programs were excluded, such as COMMON, DATA, EQUIVALENCE, EXTERNAL, REAL, INTEGER, and DOUBLE PRECISION.

The only inline functions were ABS and SQRT.

Names of variables and routines were limited to five alphabetical characters. To avoid confusion, the characters I, O, 1, and 0 were not used.

The first versions of the programs were written for a mainframe computer. These were extensively tested for portability using the PFORT verifier (Ryder, 1974) and ran without problems on many different machines, using either Fortran IV or Fortran 77 compilers. However, when the IBM-PC standard for personal computers came into general use, we decided to switch to the PC and to make all the programs interactively operated. Although this was done at the expense of some of the portability, this original aim was maintained whenever possible. Adapting the programs to various compilers has proven to be very straightforward. Even the adaptations necessary to run CLARA on a parallel computer system (see Section 5.3 of Chapter 3) were minimal.

Finding Groups in Data: An Introduction to Cluster Analysis. Leonard Kaufman and Peter J. Rousseeuw. Copyright © 1990, 2005 by John Wiley & Sons, Inc.

The way the logical names of input and output files are assigned in the programs also serves the purpose of making them portable. The numbers of the input and output files are given in the main program unit, right after the dimension statements. In all programs, except DAISY and CLARA, the following statements are used:

LUA = 1   (input file containing the data set)
LUB = 2   (output file containing the clustering results)
LUC = 3   (file used for saving the data set if it was entered from the keyboard)

The program DAISY may open up to three new files: LUB contains information on the variables and the missing values, LUC is the file used for saving the data set if it was entered from the keyboard, and the dissimilarity matrix computed by the program is sent to LUD. In CLARA, which can only be used to process large data sets, the data always must be read from an existing file, so LUC is not used there. In all programs the actual numbers may of course be adapted to the user’s hardware.

The names of all input and output files are entered during the interactive dialogue at the beginning of each run. The variables used for these names are FNAMEA, FNAMEB, and FNAMEC. They consist of up to 30 characters and may contain a drive and a path. In all programs except CLARA the data may also be entered from the keyboard, provided the user types KEY in answer to the relevant question. This instructs the program to assign the characters CON to the variable FNAMEA. The name of the output file can be given as CON (if the output should be shown on screen), as PRN (if it should be directed to the printer), or the name of a disk file may be given.

2 RUNNING THE PROGRAMS

The programs described in this book all run on an IBM-PC, XT, AT or any compatible computer with at least 256 K of storage. A printer is not mandatory, but if one is available the output can be sent to it. In order to make this output easy to interpret, it was decided to make all of its lines at most 80 characters long, so that it fits on any screen or printer. The first character of any output line is always left blank.


The correspondence between programs, techniques, and chapters is shown in the following table:

Program   Purpose                                         Chapter

DAISY     Computing dissimilarities                       1
PAM       Partitioning by means of the k-medoid method    2
CLARA     Partitioning large data sets                    3
FANNY     Fuzzy partitioning                              4
TWINS     Agglomerative and divisive clustering           5 and 6
          (it combines the programs AGNES and DIANA)
MONA      Monothetic analysis of binary data              7

Each program can be run by simply typing its name and hitting the return key. This starts an interactive dialogue during which the user may select certain options by responding to a series of questions. The program checks each answer to see whether it is a valid option or specification. If this is not the case, a message is given and the question is reiterated. For example, when entering the number of objects in PAM (and in most other programs), the following dialogue is possible:

THE PRESENT VERSION OF THE PROGRAM CAN HANDLE UP TO 100 OBJECTS.
(IF MORE ARE TO BE CLUSTERED, THE ARRAYS INSIDE THE PROGRAM MUST BE ADAPTED)

HOW MANY OBJECTS ARE TO BE CLUSTERED ?
......................................
PLEASE GIVE A NUMBER BETWEEN 3 AND 100 : 2

AT LEAST 3 OBJECTS ARE NEEDED FOR CLUSTER ANALYSIS, PLEASE FORESEE MORE OBJECTS

HOW MANY OBJECTS ARE TO BE CLUSTERED ?
......................................
PLEASE GIVE A NUMBER BETWEEN 3 AND 100 : 120

NOT ALLOWED ! PLEASE ENTER YOUR CHOICE AGAIN :

HOW MANY OBJECTS ARE TO BE CLUSTERED ?
......................................
PLEASE GIVE A NUMBER BETWEEN 3 AND 100 : 12
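The checking loop behind this dialogue can be sketched as follows. This is only an illustration in Python, not the Fortran code of PAM: the function name `ask_count` and its parameters are hypothetical, the list of answers stands in for the keyboard, and the messages are shortened.

```python
def ask_count(answers, low=3, high=100):
    """Return the first acceptable object count among the typed answers.

    `answers` stands in for the keyboard. The messages that would be
    shown for the rejected answers are returned as well.
    """
    messages = []
    for raw in answers:
        try:
            value = int(raw.strip())
        except ValueError:
            # Not a whole number at all: reject and ask again.
            messages.append("NOT ALLOWED ! PLEASE ENTER YOUR CHOICE AGAIN :")
            continue
        if value < low:
            messages.append(f"AT LEAST {low} OBJECTS ARE NEEDED FOR CLUSTER ANALYSIS")
        elif value > high:
            messages.append("NOT ALLOWED ! PLEASE ENTER YOUR CHOICE AGAIN :")
        else:
            return value, messages
    raise ValueError("no valid answer was supplied")


# The three answers of the dialogue above: too small, too large, accepted.
count, messages = ask_count(["2", "120", "12"])
print(count, len(messages))
```

Passing the upper limit as a parameter mirrors the remark in the dialogue that the limit of 100 objects belongs to the present version of the program and can be adapted.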


Many options take the form of a yes/no question. Only the first character of the answer is read. Both y and Y are interpreted as yes, and both n and N are taken to mean no. All other characters will cause the question to be asked again, as in the following excerpt:

DO YOU WANT GRAPHICAL OUTPUT (SILHOUETTES) ?
PLEASE ANSWER YES OR NO : u
NOT ALLOWED ! PLEASE ENTER YOUR CHOICE AGAIN : y

Because such questions occur very often, a special subroutine was written for this purpose.
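A routine of this kind might look like the sketch below. It is written in Python for illustration and is not the subroutine used in the programs; the name `read_yes_no` is hypothetical and the list of answers again stands in for the keyboard.

```python
def read_yes_no(answers):
    """Return True for yes and False for no.

    Only the first character of each answer is inspected; any character
    other than y, Y, n, or N causes the next answer to be read.
    """
    for raw in answers:
        first = raw[:1]
        if first in ("y", "Y"):
            return True
        if first in ("n", "N"):
            return False
        # Anything else is rejected, so the question is asked again.
    raise ValueError("no valid answer was supplied")


# The excerpt above: "u" is rejected, "y" is accepted.
wants_graphics = read_yes_no(["u", "y"])
print(wants_graphics)
```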

Provisions similar to those for yes/no questions were made for other situations, such as the choice between measurement input (m) and input of dissimilarities (d). Sometimes options are numbered, such as 1, 2, 3, and 4 at the beginning of DAISY. In such a case all other characters are rejected.

When the name of an input file is entered, the program verifies whether it exists. If it cannot be found on the disk, the program states this and asks for the name once again. However, the program does not check whether a file specified for output already exists. If it does, it will be overwritten by the new file. For both input and output files, the program does verify the correctness of the file name syntax. A mistake will be signaled by the program, which will then ask for another name.

For the input of the data, the programs allow the choice between free and fixed format. The free format, which is the easiest to use, supposes that the data all consist of numbers separated by blanks. If some data stick together or if some of it is not numerical (even if these variables are not used for the clustering), the user must supply an input format. Because the Fortran variables used for the data are real, only F and E formats are allowed. The general structure of an F format is Fw.d where w denotes the total number of characters (including the decimal point) and d is the number of digits after the decimal point. If the number to be read does not contain a decimal point, it may occupy all w positions. In this case, the rightmost d digits are interpreted as following the decimal point. (However, decimal points that occur at the “wrong” place take precedence over the format specifications.) Examples of F formats are F15.7 and F8.0. E formats are written as Ew.d. Also here, w is the total number of characters in the field and d is the number of digits after the decimal point. The number may also contain an exponent that is either a sign followed by an integer or an E followed by an optional sign followed by an integer. Note that the data may also contain an exponent when an F format is used. If the same E or F format is appropriate for several variables, it can be preceded by a repetition factor (as for example in 8F10.2).
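The rule for the F format can be made concrete with a small sketch. It is written in Python rather than Fortran, the function names `read_f_field` and `read_record` are hypothetical, and exponents and blank fields are deliberately left out to keep the example short.

```python
def read_f_field(field, d):
    """Interpret one input field under an Fw.d format.

    A decimal point present in the field takes precedence. Otherwise
    the rightmost d digits are taken to follow the decimal point.
    """
    text = field.strip()
    if "." in text:
        return float(text)
    return int(text) / 10 ** d


def read_record(line, w, d, count):
    """Read `count` consecutive fields of width w, as with a repetition factor."""
    return [read_f_field(line[i * w:(i + 1) * w], d) for i in range(count)]


# Two fields of eight positions each, read as with the format 2F8.2.
values = read_record("    1250    -300", 8, 2, 2)
print(values)
```

With this rule the field `12345` read under F8.2 yields 123.45, whereas `12.345` read under the same format keeps its own decimal point.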


D                  (input of dissimilarities)
12                 (number of objects)
1                  (initial number of clusters)
3                  (final number of clusters)
Country data set   (title for output)
N                  (small output is wanted)
Y                  (graphical output is wanted)
N                  (no object labels will be entered)
Y                  (data will be read in free format)
COUNTRY.DAT        (name of the file containing the data)
COUNTRY.PAM        (name of the output file)
Y                  (all entered specifications are OK)

Figure 1 Input file to be used with the program PAM: Its use replaces the interactive dialogue.

To make it possible to disregard one or more positions on the input line, an X format may be used. For example, the format 10X ensures that 10 positions are skipped. Such a format is necessary if this field contains nonnumerical characters. Finally, one or several / make it possible to employ several input lines for the same object. Each time a / is encountered, the program moves to the next input line.

Apart from the usual interactive input of options and parameters as described in each chapter, it is also possible to use an input file containing all the answers and options that are otherwise entered by keyboard. Figure 1 shows an example of such an input file for PAM. The meaning of each line is shown between brackets.

In order to use such an input file (instead of typing the options and parameters on the keyboard), the input to the program must be redirected. This is possible by using a “less than” sign in the command. Suppose the input file depicted in Figure 1 is named OPTIONS.DAT and that it should be used with PAM. This is achieved by typing the instruction

PAM < OPTIONS.DAT

assuming that the file OPTIONS.DAT is on the current disk.
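Such an answer file is plain text with one answer per line, so it can also be produced by a short script. The Python sketch below is an illustration only: the helper `write_answer_file` is hypothetical, the answers are those of Figure 1 (with 12 objects, as in the countries example), and the file is written to a temporary folder so that nothing on the current disk is overwritten.

```python
import tempfile
from pathlib import Path


def write_answer_file(path, answers):
    """Write one answer per line, in the order the program asks its questions."""
    path = Path(path)
    path.write_text("\n".join(answers) + "\n")
    return path


answers = [
    "D",                 # input of dissimilarities
    "12",                # number of objects
    "1",                 # initial number of clusters
    "3",                 # final number of clusters
    "Country data set",  # title for output
    "N",                 # small output is wanted
    "Y",                 # graphical output is wanted
    "N",                 # no object labels will be entered
    "Y",                 # data will be read in free format
    "COUNTRY.DAT",       # name of the file containing the data
    "COUNTRY.PAM",       # name of the output file
    "Y",                 # all entered specifications are OK
]

with tempfile.TemporaryDirectory() as folder:
    path = write_answer_file(Path(folder) / "OPTIONS.DAT", answers)
    lines = path.read_text().splitlines()

print(len(lines))
```

The order of the lines matters, because the program reads them exactly as it would read the answers typed during the dialogue.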

3 ADAPTING THE PROGRAMS TO YOUR NEEDS

In the course of an application it may be necessary or desirable to modify a program. The most frequent type of modification is due to the size of the data set to be clustered. All of the programs begin by assigning values to the maximum dimensions of problems that can be processed. For example, in PAM, FANNY, AGNES, and DIANA the following upper limits are set:

MAXNN = 100   (maximum number of objects)
MAXTT = 80    (maximum number of variables in the data set)
MAXPP = 20    (maximum number of variables that can be used for the clustering)

In order to adapt these values, both the assignment statements and the DIMENSION and CHARACTER statements located at the beginning of the main unit must be changed. However, no changes are necessary in the subroutines because the upper limits are passed on as arguments in the calling sequences. It should of course be noted that a modification of the dimensions of the arrays has implications on the compilation of the programs. The characteristics of the compiler should be considered when carrying out any changes. For example, if MS Fortran is used it might be necessary to add a $LARGE metacommand at the beginning of a program to accommodate large arrays.

Sometimes the user might want to change the algorithm used by the program. This is most likely to be the case for AGNES (Chapter 5) because of the popularity of a variety of agglomerative clustering methods. As was explained in Section 5.1 of Chapter 5, a slight alteration in the subroutine AVERL makes it possible to replace the group average method by one of six other agglomerative techniques. In CLARA, the algorithm can be (slightly) modified by changing either the number or the size of the samples that are clustered. In the beginning of the main program unit the statements

NRAN = 5
NSAM = 40 + 2 * KK

are used to determine the number of samples and their size. Both statements may be altered.
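The effect of these two statements can be illustrated with a short Python sketch. The names NRAN, NSAM, and KK come from the text; the functions `clara_sample_size` and `draw_samples` are hypothetical, and the sketch draws plain random samples, which is a simplification of the sampling scheme described in Chapter 3.

```python
import random


def clara_sample_size(kk, base=40, per_cluster=2):
    """Sample size as in NSAM = 40 + 2 * KK, where KK is the number of clusters."""
    return base + per_cluster * kk


def draw_samples(n_objects, kk, nran=5, seed=0):
    """Draw `nran` random samples of object indices, each without repetition."""
    rng = random.Random(seed)
    nsam = min(clara_sample_size(kk), n_objects)
    return [sorted(rng.sample(range(n_objects), nsam)) for _ in range(nran)]


# With KK = 3 clusters each of the five samples contains 46 objects.
samples = draw_samples(n_objects=1000, kk=3)
print(len(samples), len(samples[0]))
```

Raising the number of samples or the sample size increases the computation time, so both values are a compromise that can be tuned to the data set at hand.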

Finally, the user may also change the appearance of the graphical output of all the clustering programs by modifying the following assignment statement that is to be found in the subroutine ENTR:

NUM = '0123456789*+'

Changing the last two characters * and + will alter the graphical output considerably, because these characters make up the silhouettes and the banners.


4 THE PROGRAM CLUSPLOT

The graphical output of the partitioning programs (PAM, CLARA, and FANNY) consists of the so-called silhouettes. These provide a visual representation of each cluster, together with some statistics (the average silhouette width for each cluster and for the entire data set). Apart from this representation it seemed instructive to have a display containing both the objects and the clusters. Such a drawing can picture the size and shape of the clusters and their relative position in space. (Note that this information is partially provided by the neighbor of each object, as shown in the silhouette plot.)

CLUSPLOT is a menu-driven PASCAL program that uses output from a partitioning algorithm to visualize the objects and clusters in a two-dimensional plot. The input data structure can either be a matrix of objects by measurements or a dissimilarity matrix. The program also requires the input of a clustering vector, containing the cluster number of each object. The number of clusters drawn by CLUSPLOT is the number of different entries found in the clustering vector.

Figure 2 Example of a CLUSPLOT output.


If a dissimilarity matrix is used as input, CLUSPLOT begins by converting the data to two-dimensional coordinates by means of multidimensional scaling. The same technique is used if the data consist of a matrix of measurements with more than two variables.
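The text does not say which scaling method CLUSPLOT uses. As an illustration of the general idea, the Python sketch below applies classical (Torgerson) scaling, which is one common choice and is an assumption here. The function name `classical_mds` is hypothetical, the eigenvectors are found by simple power iteration, and the dissimilarity matrix is assumed to be symmetric and close to Euclidean.

```python
import math
import random


def classical_mds(diss, dims=2, iterations=500, seed=0):
    """Return `dims` coordinates per object from a symmetric dissimilarity matrix."""
    n = len(diss)
    sq = [[diss[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering turns squared dissimilarities into inner products.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        start = math.sqrt(sum(x * x for x in v))
        v = [x / start for x in v]
        # Power iteration converges to the dominant eigenvector of b.
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflation removes the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Four objects at the corners of a 3 by 4 rectangle, given only by their distances.
points = [(-1.5, -2.0), (1.5, -2.0), (-1.5, 2.0), (1.5, 2.0)]
dissimilarities = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points]
                   for p in points]
coords = classical_mds(dissimilarities)
print(coords)
```

For data that are exactly two-dimensional, as in this example, the recovered coordinates reproduce the original distances up to rotation and reflection. For higher-dimensional data the plot is only an approximation.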

In the plot, clusters are indicated by circles or ellipses. The program has options allowing the rotation and intersection of the clusters, in order to use the screen area in an optimal way. Further options make it possible to draw the individual points, number the clusters, plot the coordinate axes, shade the clusters, and draw lines between them. Also, the plot can be saved in a file for a quick reproduction on a later occasion.

Figure 2 is a plot showing the results of PAM, obtained by clustering the 12 countries data set into three clusters. The partition itself was discussed in Section 3 of Chapter 2.