TwoStep Cluster Analysis



Cluster Analysis

The purpose of cluster analysis is to reduce a large data set to meaningful subgroups of individuals or objects. The division is accomplished on the basis of the similarity of the objects across a set of specified characteristics. Outliers are a problem with this technique, often caused by too many irrelevant variables. The sample should be representative of the population, and it is desirable to have uncorrelated factors. There are three main clustering methods: hierarchical, a treelike process appropriate for smaller data sets; nonhierarchical, which requires specification of the number of clusters a priori; and a combination of both. There are four main rules for developing clusters: the clusters should be different, they should be reachable, they should be measurable, and they should be profitable (big enough to matter). This makes cluster analysis a great tool for market segmentation.

Cluster analysis and discriminant analysis address classification issues from two varying perspectives. When considering groups of objects in a multivariate data set, two situations can arise. Given a data set containing measurements on individuals, in some cases we want to see if some natural groups or classes of individuals exist, and in other cases we want to classify the individuals according to a set of existing groups. Cluster analysis develops tools and methods for the former case: given a data matrix containing multivariate measurements on a large number of individuals (or objects), the objective is to build some natural subgroups or clusters of individuals. This is done by grouping individuals that are "similar" according to some appropriate criterion. Once the clusters are obtained, it is generally useful to describe each group using some descriptive tool from Chapters 1, 8, or 9 to create a better understanding of the differences that exist among the formulated groups.

Cluster analysis is applied in many fields, such as the natural sciences, the medical sciences, economics, and marketing. In marketing, for instance, it is useful for building and describing the different segments of a market from a survey of potential consumers. An insurance company, on the other hand, might be interested in the distinction among classes of potential customers so that it can derive optimal prices for its services. Other examples are provided below. Discriminant analysis, presented in Chapter 12, addresses the other issue of classification: it focuses on situations where the different groups are known a priori, and decision rules are provided for classifying a multivariate observation into one of the known groups.

Cluster analysis is a set of tools for building groups (clusters) from multivariate data objects. The aim is to construct groups with homogeneous properties out of heterogeneous large samples. The groups or clusters should be as homogeneous as possible, and the differences among the various groups as large as possible. Cluster analysis can be divided into two fundamental steps:

1. Choice of a proximity measure.

One checks each pair of observations (objects) for the similarity of their values. A similarity (proximity) measure is defined to measure the closeness of the objects: the closer they are, the more homogeneous they are.

2. Choice of a group-building algorithm. On the basis of the proximity measures, the objects are assigned to groups so that differences between groups become large and observations within a group become as close as possible.

The starting point of a cluster analysis is a data matrix X(n × p) with n measurements (objects) of p variables. The proximity (similarity) among objects is described by a matrix D.

The matrix D contains measures of similarity or dissimilarity among the n objects. If the values d_ij are distances, then they measure dissimilarity: the greater the distance, the less similar the objects. If the values d_ij are proximity measures, then the opposite is true, i.e., the greater the proximity value, the more similar the objects. A distance matrix, for example, could be defined by the L2-norm: d_ij = ‖x_i − x_j‖_2, where x_i and x_j denote the rows of the data matrix X. Distance and similarity are, of course, dual: if d_ij is a distance, then d′_ij = max_{i,j}{d_ij} − d_ij is a proximity measure. The nature of the observations plays an important role in the choice of proximity measure. Nominal values (like binary variables) lead in general to proximity values, whereas metric values lead (in general) to distance matrices. We first present possibilities for D in the binary case and then consider the continuous case.
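Before turning to the binary case, a quick illustration of the distance/proximity duality just described; this is a plain NumPy sketch, not part of any statistical package:

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 4.0], [8.0, 9.0]])   # toy data matrix, n = 3, p = 2

# d_ij = ||x_i - x_j||_2 for every pair of rows
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Duality: large distance = low similarity, so subtract from the maximum
proximity = D.max() - D

print(D)          # dissimilarities: 0 on the diagonal
print(proximity)  # similarities: largest on the diagonal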

Similarity of objects with binary structure

In order to measure the similarity between objects, we always compare pairs of observations (x_i, x_j), where x_i^T = (x_i1, ..., x_ip), x_j^T = (x_j1, ..., x_jp), and x_ik, x_jk ∈ {0, 1}. Obviously there are four cases: x_ik = x_jk = 1; x_ik = 0, x_jk = 1; x_ik = 1, x_jk = 0; and x_ik = x_jk = 0. Letting a1, a2, a3, and a4 denote the numbers of variables falling into each of these four cases, similarity measures take the general form

d_ij = (a1 + δ·a4) / (a1 + δ·a4 + λ(a2 + a3)),

where δ and λ are weighting factors. Table 11.2 shows some similarity measures for given weighting factors. These measures provide alternative ways of weighting mismatchings and positive (presence of a common character) or negative (absence of a common character) matchings. In principle, we could also consider the Euclidean distance. However, the disadvantage of this distance is that it treats the observations 0 and 1 in the same way. If x_ik = 1 denotes, say, knowledge of a certain language, then the contrary, x_ik = 0 (not knowing the language), may need to be treated differently.
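A small sketch of this family of measures, using the counts a1-a4 defined above; the special-case weights in the comments are the standard choices for the named coefficients:

import numpy as np

# General weighted matching coefficient: delta weights joint absences (a4),
# lam weights mismatches (a2 + a3). Special cases: Jaccard (delta=0, lam=1),
# simple matching (delta=1, lam=1), Rogers-Tanimoto (delta=1, lam=2).
def binary_similarity(xi, xj, delta=1.0, lam=1.0):
    xi, xj = np.asarray(xi), np.asarray(xj)
    a1 = np.sum((xi == 1) & (xj == 1))   # present in both
    a2 = np.sum((xi == 0) & (xj == 1))   # present in j only
    a3 = np.sum((xi == 1) & (xj == 0))   # present in i only
    a4 = np.sum((xi == 0) & (xj == 0))   # absent in both
    return (a1 + delta * a4) / (a1 + delta * a4 + lam * (a2 + a3))

print(binary_similarity([1, 0, 1, 1], [1, 1, 0, 1], delta=0, lam=1))  # Jaccard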

Cluster analysis is concerned with group identification. The goal of cluster analysis is to partition a set of observations into a distinct number of unknown groups or clusters in such a manner that all observations within a group are similar, while observations in different groups are not similar. If the data are represented as an n × p matrix Y = [y_ij], where

the goal of cluster analysis is to develop a classification scheme that will partition the rows of Y into k distinct groups (clusters). The rows of the matrix usually represent items or objects.

To uncover the groupings in the data, a measure of nearness, also called a proximity measure, needs to be defined. Two natural measures of nearness are the degree of distance or "dissimilarity" and the degree of association or "similarity" between groups. The choice of the proximity measure depends on the subject matter, scale of measurement (nominal, ordinal, interval, ratio), and type of variables (continuous, categorical) being analyzed. In many applications of cluster analysis, one begins with a proximity matrix rather than a data matrix. Given a proximity matrix of order (n × n), the entries may represent dissimilarities [d_rs] or similarities [s_rs] between the rth and sth objects. Cluster analysis is a tool for classifying objects into groups and is not concerned with the geometric representation of the objects in a low-dimensional space. To explore the dimensionality of the space, one may use multidimensional scaling.

Proximity Measures

Proximity measures are used to represent the nearness of two objects. If a proximity measure represents similarity, the value of the measure increases as two objects become more similar. Alternatively, if the proximity measure represents dissimilarity, the value of the measure decreases as two objects become more alike. Letting y_r and y_s represent two objects in a p-variate space, an example of a dissimilarity measure is the Euclidean distance between y_r and y_s. As a measure of similarity, one may use the proportion of the elements in the two vectors that match. More formally, one needs to establish a set of mathematical axioms to create dissimilarity and similarity measures.

Dissimilarity Measures

Given two objects y_r and y_s in a p-dimensional space, a dissimilarity measure satisfies the following conditions:

(1) d_rs ≥ 0 for all objects r and s;
(2) d_rs = 0 if and only if y_r = y_s;
(3) d_rs = d_sr.

Condition (3) implies that the measure is symmetric, so that the dissimilarity measure that compares y_r (object r) with y_s (object s) is the same as the comparison of object s versus object r. Condition (2) requires the measure to be zero whenever object r equals object s.

Cluster Analysis

To initiate a cluster analysis, one constructs a proximity matrix. The proximity matrix represents the strength of the relationship between pairs of rows of the data matrix Y(n × p) or pairs of columns of Y′(p × n). Algorithms designed to perform cluster analysis are usually divided into two broad classes, called hierarchical and nonhierarchical clustering methods. Generally speaking, hierarchical methods generate a sequence of cluster solutions, beginning with each object in its own cluster and ending with all objects in a single cluster.

Agglomerative Hierarchical Clustering Methods

Agglomerative hierarchical clustering methods use the elements of a proximity matrix to generate a tree diagram or dendrogram, as shown in Figure 9.3.1. To begin the process, we start with n = 5 clusters, which are the branches of the tree. Combining item 1 with item 2 reduces the number of clusters by one, from 5 to 4. Joining items 3 and 4 results in 3 clusters. Next, joining item 5 with the cluster (3, 4) results in 2 clusters. Finally, all items are combined to form a single cluster, the root of the tree.

Average Link Method

When comparing two clusters of objects R and S, the single link and complete link methods of combining clusters depend only upon a single pair of objects within each cluster. Instead of using a minimum or maximum measure, the average link method calculates the distance between two clusters as the average of the dissimilarities d_rs over all pairs r ∈ R and s ∈ S, where n_R and n_S represent the number of objects in each cluster. Hence the dissimilarities in Step 3 are replaced by an average of n_R · n_S dissimilarities between all pairs of elements r ∈ R and s ∈ S.
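For a concrete picture of the agglomerative process, here is a short sketch using SciPy's hierarchical-clustering routines (an illustration, not the SAS or SPSS implementation), reproducing the n = 5 example above with the average link method:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1]])  # 5 toy objects

D = pdist(X, metric='euclidean')    # condensed dissimilarity matrix
Z = linkage(D, method='average')    # average link agglomeration schedule

labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree at 2 clusters
print(labels)
# dendrogram(Z) would render the tree diagram described above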

Centroid Method

In the average link method, the distance between two clusters is defined as an average of dissimilarity measures. Alternatively, suppose cluster R contains n_R elements and cluster S contains n_S elements. Then the centroids for the two clusters are the mean vectors ȳ_r = (1/n_R) Σ_{i∈R} y_i and ȳ_s = (1/n_S) Σ_{i∈S} y_i,

and the square of the Euclidean distance between the two clusters is d²_rs = ‖ȳ_r − ȳ_s‖². For the centroid agglomerative process, one begins with any dissimilarity matrix D (in SAS, the distances are squared unless one uses the NOSQUARE option). Then the two most similar clusters are combined, using the weighted average of the two clusters. Letting T represent the new cluster, the centroid of T is ȳ_t = (n_R ȳ_r + n_S ȳ_s) / (n_R + n_S).

The centroid method is called the median method if an unweighted average of the centroids is used, ȳ_t = (ȳ_r + ȳ_s)/2. The median method is preferred when n_R >> n_S or n_S >> n_R. Letting the dissimilarity matrix be D = [d²_rs], where d²_rs = ‖y_r − y_s‖², suppose the elements r ∈ R and s ∈ S are combined into a cluster T.

Then, to calculate the square of the Euclidean distance between cluster T and the centroid ȳ_u of a third cluster U, the following formula may be used:

d²_tu = (n_R / n_T) d²_ru + (n_S / n_T) d²_su − (n_R n_S / n_T²) d²_rs,  where n_T = n_R + n_S.

This is a special case of a general algorithm for updating proximity measures for the single link, complete link, average link, centroid, and median methods, developed by Williams and Lance (1977).
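A sketch of that general updating recurrence follows; the coefficient values are the standard ones usually quoted for this family of methods and are supplied by me, not taken from the text (for the centroid, median, and Ward choices the recurrence applies to squared Euclidean distances):

# After merging clusters R and S into T, the dissimilarity to any other
# cluster U is  d(T,U) = aR*d(R,U) + aS*d(S,U) + b*d(R,S) + g*|d(R,U)-d(S,U)|
def update(d_ru, d_su, d_rs, n_r, n_s, n_u, method='centroid'):
    n_t = n_r + n_s
    if method == 'single':        # single link (nearest neighbor)
        a_r, a_s, b, g = 0.5, 0.5, 0.0, -0.5
    elif method == 'complete':    # complete link (furthest neighbor)
        a_r, a_s, b, g = 0.5, 0.5, 0.0, 0.5
    elif method == 'average':     # average link
        a_r, a_s, b, g = n_r / n_t, n_s / n_t, 0.0, 0.0
    elif method == 'centroid':    # weighted centroids
        a_r, a_s, b, g = n_r / n_t, n_s / n_t, -n_r * n_s / n_t**2, 0.0
    elif method == 'median':      # unweighted centroids
        a_r, a_s, b, g = 0.5, 0.5, -0.25, 0.0
    elif method == 'ward':        # incremental sum of squares
        a_r = (n_r + n_u) / (n_t + n_u)
        a_s = (n_s + n_u) / (n_t + n_u)
        b, g = -n_u / (n_t + n_u), 0.0
    return a_r * d_ru + a_s * d_su + b * d_rs + g * abs(d_ru - d_su)

print(update(4.0, 6.0, 1.0, n_r=2, n_s=3, n_u=2, method='average'))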

Ward's (Incremental Sum of Squares) Method

Given n objects with p variables, the sum of squares within clusters, where each object forms its own group, is zero. For all objects in a single group, the sum of squares within clusters (the sum of squares error) is equal to the total sum of squares.

Thus the sum of squares within clusters is between zero and SSE. Ward's method for forming clusters joins objects based upon minimizing the minimal increment in the within-cluster (error) sum of squares. At each step of the process, n(n − 1)/2 pairs of clusters are formed, and the two objects that increase the sum of squares for error least are joined. The process is continued until all objects are joined. The dendrogram is constructed based upon the minimum increase in the sum of squares for error. To see how the process works, let

SSE_r = Σ_{i∈R} ‖y_i − ȳ_r‖² and SSE_s = Σ_{i∈S} ‖y_i − ȳ_s‖² for clusters R and S. Combining clusters R and S to form cluster T, the error sum of squares for cluster T is SSE_t = Σ_{i∈T} ‖y_i − ȳ_t‖², where ȳ_t = (n_R ȳ_r + n_S ȳ_s) / (n_R + n_S). Then the incremental increase in joining R and S to form cluster T is SSE_t − (SSE_r + SSE_s). Letting SSE_t be the total sum of squares and SSE_r + SSE_s the within-cluster sum of squares, the incremental increase in the error sum of squares is no more than the between-cluster sum of squares. The incremental between-cluster sum of squares (IBCSS) is

IBCSS_rs = (n_R n_S / (n_R + n_S)) ‖ȳ_r − ȳ_s‖²   (9.38)

For clusters with one object, (9.38) becomes d²_rs/2. Hence, starting with a dissimilarity matrix D = [d²_rs], where d²_rs is the square of the Euclidean distance between objects r and s (the default in SAS for Ward's method), the two most similar objects are combined, and the new incremental sum of squares proximity matrix has elements p_rs = d²_rs/2. Combining objects r and s to form a new cluster with mean ȳ_t, the incremental increase in the error sum of squares may be calculated using the formula developed by Williams and Lance (1977) as

p_tu = ((n_R + n_U) p_ru + (n_S + n_U) p_su − n_U p_rs) / (n_R + n_S + n_U),

where n_U is the number of objects in a third cluster U.

Cluster analysis is used to categorize objects, such as patients, voters, products, institutions, countries, and cities, among others, into homogeneous groups based upon a vector of p variables. The clustering of objects into homogeneous groups depends on the scale of the variables, the algorithm used for clustering, and the criterion used to estimate the number of clusters. In general, variables with very large variances relative to others, as well as outliers, have an adverse effect on cluster analysis methods: they tend to dominate the proximity measure.

Cluster analysis is an exploratory data analysis methodology. It tries to discover how objects may or may not be combined. The analysis depends on the amount of random noise in the data, the existence of outliers in the data, the variables selected for the analysis, the proximity measure used, the spatial properties of the data, and the clustering method employed.

TwoStep Cluster Analysis

The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:

• Handling of categorical and continuous variables. By assuming variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables.

• Automatic selection of number of clusters. By comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.

• Scalability. By constructing a cluster features (CF) tree that summarizes the records, the TwoStep algorithm allows you to analyze large data files.


Example. Retail and consumer product companies regularly apply clustering techniques to data that describe their customers' buying habits, gender, age, income level, etc. These companies tailor their marketing and product development strategies to each consumer group to increase sales and build brand loyalty.

Distance Measure. This selection determines how the similarity between two clusters is computed.

• Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent. (A rough sketch of this measure appears after this list.)

• Euclidean. The Euclidean measure is the straight-line distance between two clusters. It can be used only when all of the variables are continuous.
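A heavily hedged sketch of the idea behind the log-likelihood distance, for continuous variables only, following published descriptions of the TwoStep algorithm; treat the details (in particular the use of the overall variance as a regularizer) as an approximation rather than the exact SPSS formula:

import numpy as np

# Each cluster v contributes xi_v = -N_v * sum_k 0.5*log(s2_k + s2_vk), where
# s2_k is the overall variance of variable k and s2_vk the variance within
# cluster v; the distance between clusters j and s is the drop in
# log-likelihood incurred by merging them.
def xi(cluster, overall_var):
    return -len(cluster) * np.sum(0.5 * np.log(overall_var + cluster.var(axis=0)))

def loglik_distance(cj, cs, overall_var):
    merged = np.vstack([cj, cs])
    return xi(cj, overall_var) + xi(cs, overall_var) - xi(merged, overall_var)

X = np.random.default_rng(0).normal(size=(100, 3))     # toy data
print(loglik_distance(X[:40], X[40:], X.var(axis=0)))  # small: clusters are alike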

Number of Clusters. This selection allows you to specify how the number of clusters is to be determined.

• Determine automatically. The procedure will automatically determine the best number of clusters, using the criterion specified in the Clustering Criterion group. Optionally, enter a positive integer specifying the maximum number of clusters that the procedure should consider.

• Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive integer.

Count of Continuous Variables. This group provides a summary of the continuous variable standardization specifications made in the Options dialog box. See the topic TwoStep Cluster Analysis Options for more information.

Clustering Criterion. This selection determines how the automatic clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.
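To illustrate what automatic determination involves, the sketch below scores candidate solutions with a simple Gaussian BIC over k-means fits (scikit-learn assumed) and keeps the k with the smallest value; it is a stand-in for the TwoStep criterion, not a reproduction of it:

import numpy as np
from sklearn.cluster import KMeans

def bic_for_kmeans(X, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    n, p = X.shape
    sigma2 = max(km.inertia_ / (n * p), 1e-12)          # pooled variance estimate
    loglik = -0.5 * n * p * (np.log(2 * np.pi * sigma2) + 1)
    n_params = k * p + 1                                # centers plus variance
    return -2 * loglik + n_params * np.log(n)           # smaller = better

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (50, 2)) for m in (0, 3, 6)])
best_k = min(range(1, 8), key=lambda k: bic_for_kmeans(X, k))
print(best_k)   # expected: 3 on this toy data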

Data. This procedure works with both continuous and categorical variables. Cases represent objects to be clustered, and the variables represent attributes upon which the clustering is based.

Case Order. Note that the cluster features tree and the final solution may depend on the order of cases. To minimize order effects, randomly order the cases. You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution. In situations where this is difficult due to extremely large file sizes, multiple runs with a sample of cases sorted in different random orders might be substituted.

Assumptions. The likelihood distance measure assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

Use the Bivariate Correlations procedure to test the independence of two continuous variables. Use the Crosstabs procedure to test the independence of two categorical variables. Use the Means procedure to test the independence between a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable. Use the Chi-Square Test procedure to test whether a categorical variable has a specified multinomial distribution.
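Outside SPSS, rough analogues of these checks can be sketched with SciPy; the specific tests below (Pearson correlation, chi-square independence, Shapiro-Wilk normality, chi-square goodness of fit against a uniform multinomial) are my stand-ins for the procedures named above:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x, y = rng.normal(size=200), rng.normal(size=200)       # two continuous variables
cat_a, cat_b = rng.integers(0, 3, 200), rng.integers(0, 2, 200)  # two categorical

r, p_corr = stats.pearsonr(x, y)                        # continuous independence
table = np.histogram2d(cat_a, cat_b, bins=(3, 2))[0]    # crosstab
chi2, p_ind, *_ = stats.chi2_contingency(table)         # categorical independence
w, p_norm = stats.shapiro(x)                            # normality of x
chi2_fit, p_fit = stats.chisquare(np.bincount(cat_a))   # multinomial fit (uniform)
print(p_corr, p_ind, p_norm, p_fit)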

To Obtain a TwoStep Cluster Analysis

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > TwoStep Cluster

Select one or more categorical or continuous variables.

Optionally, you can:

• Adjust the criteria by which clusters are constructed.

• Select settings for noise handling, memory allocation, variable standardization, and cluster model input.

• Request model viewer output.

• Save model results to the working file or to an external XML file.

TwoStep Cluster Analysis Options

Outlier Treatment. This group allows you to treat outliers specially during clustering if the cluster features (CF) tree fills. The CF tree is full if it cannot accept any more cases in a leaf node and no leaf node can be split.

• If you select noise handling and the CF tree fills, it will be regrown after placing cases in sparse leaves into a "noise" leaf. A leaf is considered sparse if it contains fewer than the specified percentage of cases of the maximum leaf size. After the tree is regrown, the outliers will be placed in the CF tree if possible. If not, the outliers are discarded.

• If you do not select noise handling and the CF tree fills, it will be regrown using a larger distance change threshold. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of −1 and is not included in the count of the number of clusters.

Memory Allocation. This group allows you to specify the maximum amount of memory in megabytes (MB) that the cluster algorithm should use. If the procedure exceeds this maximum, it will use the disk to store information that will not fit in memory. Specify a number greater than or equal to 4.

• Consult your system administrator for the largest value that you can specify on your system.

• The algorithm may fail to find the correct or desired number of clusters if this value is too low.

Variable standardization. The clustering algorithm works with standardized continuous variables. Any continuous variables that are not standardized should be left as variables in the To be Standardized list. To save some time and computational effort, you can select any continuous variables that you have already standardized as variables in the Assumed Standardized list.

Advanced Options

CF Tree Tuning Criteria. The following clustering algorithm settings apply specifically to the cluster features (CF) tree and should be changed with care:

• Initial Distance Change Threshold. This is the initial threshold used to grow the CF tree. If inserting a given case into a leaf of the CF tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split.

• Maximum Branches (per leaf node). The maximum number of child nodes that a leaf node can have.

• Maximum Tree Depth. The maximum number of levels that the CF tree can have.

• Maximum Number of Nodes Possible. This indicates the maximum number of CF tree nodes that could potentially be generated by the procedure, based on the function (b^(d+1) − 1) / (b − 1), where b is the maximum branches and d is the maximum tree depth. Be aware that an overly large CF tree can be a drain on system resources and can adversely affect the performance of the procedure. At a minimum, each node requires 16 bytes.
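A quick arithmetic check of that bound; the values of b and d below are arbitrary examples, not the procedure's defaults:

# Maximum node count of a full CF tree with branching b and depth d,
# plus the 16-bytes-per-node memory lower bound quoted above.
b, d = 8, 3
max_nodes = (b ** (d + 1) - 1) // (b - 1)
print(max_nodes, max_nodes * 16, "bytes minimum")   # 585 nodes, 9360 bytes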

Cluster Model Update. This group allows you to import and update a cluster model generated in a prior analysis. The input file contains the CF tree in XML format. The model will then be updated with the data in the active file. You must select the variable names in the main dialog box in the same order in which they were specified in the prior analysis. The XML file remains unaltered unless you specifically write the new model information to the same filename. See the topic TwoStep Cluster Analysis Output for more information.

If a cluster model update is specified, the options pertaining to generation of the CF tree that were specified for the original model are used. More specifically, the distance measure, noise handling, memory allocation, and CF tree tuning criteria settings for the saved model are used, and any settings for these options in the dialog boxes are ignored.

Note: When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in the model update come from the same population as the cases used to create the original model; that is, the means and variances of continuous variables and the levels of categorical variables are assumed to be the same across both sets of cases. If your new and old sets of cases come from heterogeneous populations, you should run the TwoStep Cluster Analysis procedure on the combined sets of cases for the best results.

TwoStep Cluster Analysis Output

Model Viewer Output. This group provides options for displaying the clustering results.

• Charts and tables. Displays model-related output, including tables and charts. Tables in the model view include a model summary and cluster-by-features grid. Graphical output in the model view includes a cluster quality chart, cluster sizes, variable importance, a cluster comparison grid, and cell information.

• Evaluation fields. This calculates cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the input features in the model viewer by selecting them in the Display subdialog. Fields with missing values are ignored.

Working Data File. This group allows you to save variables to the active dataset.

• Create cluster membership variable. This variable contains a cluster identification number for each case. The name of this variable is tsc_n, where n is a positive integer indicating the ordinal of the active dataset save operation completed by this procedure in a given session.

XML Files. The final cluster model and CF tree are two types of output files that can be exported in XML format.

• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish membership in one cluster from another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked or auxiliary view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check if the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings to produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating as either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B − A) / max(A, B), where A is the record's distance to its cluster center and B is the record's distance to the nearest cluster center that it doesn't belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of −1 would mean all cases are located on the cluster centers of some other cluster. A value of 0 means, on average, cases are equidistant between their own cluster center and the nearest other cluster.
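A sketch of this computation, assuming cluster centers and case labels are already in hand (hypothetical inputs; this mirrors the description above, not SPSS's internal code):

import numpy as np

def average_silhouette(X, centers, labels):
    # Distance from every record to every cluster center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    A = dists[np.arange(len(X)), labels]          # distance to own center
    other = dists.copy()
    other[np.arange(len(X)), labels] = np.inf     # mask own cluster
    B = other.min(axis=1)                         # nearest other center
    return np.mean((B - A) / np.maximum(A, B))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
print(average_silhouette(X, centers, labels))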

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, TwoStep.

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (this is blank by default). Double-click in the cell to enter a label that describes the cluster contents; for example, "Luxury car buyers".

• Description. Any description of the cluster contents (this is blank by default). Double-click in the cell to enter a description of the cluster; for example, "55+ years of age, professionals, earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value, for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Features that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields:

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields, and the mode (most frequently occurring category) with category percentage is shown for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot, which uses the same endpoints and intervals for each cluster. The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead. Again, the solid red display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and percentage of the whole).

• The size of the largest cluster (both a count and percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent (modal) category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for the selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can also reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or values before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0-1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data (a sketch computing them appears after the list):

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
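A sketch of these measures for two vectors in NumPy; p and r stand for the user-specified powers, and the two correlation-type entries are similarities rather than distances:

import numpy as np

def interval_measures(x, y, p=2, r=2):
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff = np.abs(x - y)
    return {
        "euclidean": np.sqrt(np.sum(diff ** 2)),    # default for interval data
        "sq_euclidean": np.sum(diff ** 2),
        "pearson": np.corrcoef(x, y)[0, 1],         # similarity
        "cosine": x @ y / (np.linalg.norm(x) * np.linalg.norm(y)),  # similarity
        "chebychev": diff.max(),
        "block": diff.sum(),                        # Manhattan distance
        "minkowski": np.sum(diff ** p) ** (1 / p),  # pth root, pth power
        "customized": np.sum(diff ** p) ** (1 / r), # rth root, pth power
    }

print(interval_measures([1, 2, 3, 5], [2, 4, 5, 9], p=3, r=2))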

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data (a sketch of the fourfold table underlying them appears after the list):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.
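A sketch of the fourfold table and a few of the measures above; the cell names a, b, c, d follow the text's convention (a: present on both, b and c: present on one item only, d: absent on both):

import numpy as np

def fourfold(x, y):
    x, y = np.asarray(x), np.asarray(y)
    a = np.sum((x == 1) & (y == 1))   # present on both items
    b = np.sum((x == 1) & (y == 0))   # present on first only
    c = np.sum((x == 0) & (y == 1))   # present on second only
    d = np.sum((x == 0) & (y == 0))   # absent on both
    return a, b, c, d

a, b, c, d = fourfold([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 1, 0])
n = a + b + c + d
print(np.sqrt(b + c))             # Euclidean distance
print(b * c / n ** 2)             # pattern difference
print((b + c) / (4 * n))          # variance
print(a / (a + b + c))            # Jaccard similarity
print((b + c) / (2 * a + b + c))  # Lance and Williams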

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values (a sketch implementing them appears after the list):

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
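A sketch of these standardizations for a single variable; the method labels are my own shorthand for the options listed above, and By case would apply the same formulas across a row of values instead:

import numpy as np

def standardize(v, method):
    v = np.asarray(v, float)
    if method == "z":           return (v - v.mean()) / v.std(ddof=1)
    if method == "range_-1_1":  return v / (v.max() - v.min())
    if method == "range_0_1":   return (v - v.min()) / (v.max() - v.min())
    if method == "max_1":       return v / v.max()
    if method == "mean_1":      return v / v.mean()
    if method == "sd_1":        return v / v.std(ddof=1)

print(standardize([2.0, 4.0, 6.0, 8.0], "range_0_1"))   # [0. 0.333 0.667 1.]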

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
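A rough scikit-learn analogue of this workflow, estimating centers on a sample and then classifying the full file against them (an illustration, not the SPSS QUICK CLUSTER implementation):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
full = rng.normal(size=(100_000, 4))                    # the "entire data file"
sample = full[rng.choice(len(full), 5_000, replace=False)]

# "Iterate and classify" on the sample; keep the final centers
centers = KMeans(n_clusters=4, n_init=10, random_state=0).fit(sample).cluster_centers_

# "Classify only": a single pass assigning each case to the nearest center
labels = np.argmin(np.linalg.norm(full[:, None, :] - centers[None, :, :], axis=2), axis=1)
print(np.bincount(labels))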

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.
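A bare-bones sketch of how these options interact; this is my own illustration of the described behavior, and it ignores details such as empty-cluster handling:

import numpy as np

def kmeans_iterate(X, centers, max_iter=10, crit=0.02, use_running_means=False):
    centers = centers.astype(float).copy()
    # crit is a proportion of the minimum distance between initial centers
    init_gap = min(np.linalg.norm(a - b) for i, a in enumerate(centers)
                   for b in centers[i + 1:])
    counts = np.ones(len(centers))            # approximate running-mean weights
    for _ in range(max_iter):                 # Maximum Iterations
        old = centers.copy()
        if use_running_means:                 # update after each case assigned
            for x in X:
                j = np.argmin(np.linalg.norm(centers - x, axis=1))
                counts[j] += 1
                centers[j] += (x - centers[j]) / counts[j]
        else:                                 # update after all cases assigned
            labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
            centers = np.array([X[labels == j].mean(axis=0) for j in range(len(centers))])
        if np.linalg.norm(centers - old, axis=1).max() <= crit * init_gap:
            break                             # Convergence Criterion satisfied
    return centers

X = np.random.default_rng(0).normal(size=(200, 2))
print(kmeans_iterate(X, X[:3], max_iter=10, crit=0.02))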

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.
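The same kind of proximity matrix can be produced outside the dialogs; the sketch below is an illustrative SciPy equivalent, not the Distances procedure itself, and the data values are made up.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows are cases, columns are variables.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0],
              [9.0, 1.0, 0.5]])

case_dist = squareform(pdist(X, metric="euclidean"))    # between cases
var_dist = squareform(pdist(X.T, metric="euclidean"))   # between variables
print(case_dist)
```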

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range –1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values for the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
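In symbols, for items with values $x_i$ and $y_i$, the last two measures are

$$ \text{Minkowski: } d(x,y) = \Big(\sum_i |x_i - y_i|^{p}\Big)^{1/p}, \qquad \text{Customized: } d(x,y) = \Big(\sum_i |x_i - y_i|^{p}\Big)^{1/r}, $$

so Minkowski is the special case of the customized measure with $r = p$.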

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases, b+c. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
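Written out from the fourfold table, several of these formulas are one-liners. The following Python sketch is illustrative only (the data vectors are made up), using the formulas quoted above:

```python
import numpy as np

def fourfold(x, y):
    """Counts for two binary vectors: a = both present, b and c = mismatches,
    d = both absent."""
    x, y = np.asarray(x), np.asarray(y)
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))
    return a, b, c, d

a, b, c, d = fourfold([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 1, 0])
n = a + b + c + d

euclid = np.sqrt(b + c)                      # binary Euclidean distance
sq_euclid = b + c                            # squared Euclidean (discordant cases)
pattern = b * c / n**2                       # pattern difference
variance = (b + c) / (4 * n)                 # variance
lance_williams = (b + c) / (2 * a + b + c)   # Lance and Williams
```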

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range –1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.

• Dispersion. This index has a range of -1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
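Several of these indices have simple closed forms in terms of the fourfold table (a = joint presences, b and c = mismatches, d = joint absences). The sketch below uses the standard textbook formulas; it does not reproduce the procedure's exact handling of edge cases such as empty cells, and the data are made up.

```python
import numpy as np

def binary_similarities(x, y):
    x, y = np.asarray(x), np.asarray(y)
    a = np.sum((x == 1) & (y == 1))  # joint presences
    b = np.sum((x == 1) & (y == 0))  # mismatches
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))  # joint absences
    n = a + b + c + d
    cross_ratio = (a * d) / (b * c)  # assumes no empty mismatch cells
    return {
        "russell_rao": a / n,
        "simple_matching": (a + d) / n,
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "hamann": ((a + d) - (b + c)) / n,
        "yule_q": (cross_ratio - 1) / (cross_ratio + 1),
        "yule_y": (np.sqrt(cross_ratio) - 1) / (np.sqrt(cross_ratio) + 1),
        "ochiai": a / np.sqrt((a + b) * (a + c)),
    }

print(binary_similarities([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```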

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
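Each transformation reduces to a one-line array operation. A sketch of the by-variable (per-column) versions, written directly from the definitions above; the sample data and the choice of a sample standard deviation (ddof=1) are assumptions:

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 50.0]])  # cases x variables

z_scores  = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
range_11  = X / (X.max(axis=0) - X.min(axis=0))                  # range -1 to 1
range_01  = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # range 0 to 1
max_mag_1 = X / X.max(axis=0)                                    # maximum magnitude of 1
mean_1    = X / X.mean(axis=0)
sd_1      = X / X.std(axis=0, ddof=1)
# "By case" versions apply the same formulas along axis=1 instead.
```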

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two “closest” objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured); a SciPy parallel to several of these rules is sketched after this list:

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson’s “max” method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between –1 and 1.

K-nbd. The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson’s “min” method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward’s method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.
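Most of these linkage rules have direct counterparts in SciPy's hierarchical clustering (the density methods and flexible beta do not). The snippet below is an illustrative parallel, not SYSTAT itself; note that SciPy's "complete" and "single" correspond to Johnson's max and min methods:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 cases, 3 variables

Z = linkage(X, method="ward")   # also: "single", "complete", "average",
                                # "weighted", "centroid", "median"
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree at 3 clusters
dendrogram(Z)                   # draws the tree (requires matplotlib)
```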

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) “inadmissible” clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered “admissible.” Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate the partition quality. In particular, they are used to find out the appropriate number of clusters for the given data set:

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin’s index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn’s cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
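Two of these indices have widely available open-source equivalents: scikit-learn exposes the pseudo F-ratio as the Calinski-Harabasz score and also implements the Davies-Bouldin index. A sketch, with simulated data, of scanning candidate cluster counts:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          calinski_harabasz_score(X, labels),  # pseudo F: higher is better
          davies_bouldin_score(X, labels))     # DB: lower is better
```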

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to any of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.

A cluster analysis proceeds in two fundamental steps:

1. Choice of a proximity measure. One checks each pair of observations (objects) for the similarity of their values. A similarity (proximity) measure is defined to measure the closeness of the objects. The closer they are, the more homogeneous they are.

2. Choice of group-building algorithm. On the basis of the proximity measures, the objects are assigned to groups so that differences between groups become large and observations in a group become as close as possible.

The starting point of a cluster analysis is a data matrix $X$ ($n \times p$) with $n$ measurements (objects) of $p$ variables. The proximity (similarity) among objects is described by a matrix $D$ ($n \times n$).

The matrix $D$ contains measures of similarity or dissimilarity among the $n$ objects. If the values $d_{ij}$ are distances, then they measure dissimilarity: the greater the distance, the less similar are the objects. If the values $d_{ij}$ are proximity measures, then the opposite is true, i.e., the greater the proximity value, the more similar are the objects. A distance matrix, for example, could be defined by the $L_2$-norm: $d_{ij} = \|x_i - x_j\|_2$, where $x_i$ and $x_j$ denote the rows of the data matrix $X$. Distance and similarity are, of course, dual: if $d_{ij}$ is a distance, then $d'_{ij} = \max_{i,j}\{d_{ij}\} - d_{ij}$ is a proximity measure. The nature of the observations plays an important role in the choice of proximity measure. Nominal values (like binary variables) lead in general to proximity values, whereas metric values lead (in general) to distance matrices. We first present possibilities for $D$ in the binary case and then consider the continuous case.

Similarity of objects with binary structure

In order to measure the similarity between objects, we always compare pairs of observations $(x_i, x_j)$, where $x_i^\top = (x_{i1}, \ldots, x_{ip})$, $x_j^\top = (x_{j1}, \ldots, x_{jp})$, and $x_{ik}, x_{jk} \in \{0, 1\}$. Obviously, there are four cases:

$x_{ik} = x_{jk} = 1$, counted by $a_1 = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 1)$;
$x_{ik} = 0,\ x_{jk} = 1$, counted by $a_2$;
$x_{ik} = 1,\ x_{jk} = 0$, counted by $a_3$;
$x_{ik} = x_{jk} = 0$, counted by $a_4$.

The similarity measures are then of the form

$$ d_{ij} = \frac{a_1 + \delta a_4}{a_1 + \delta a_4 + \lambda (a_2 + a_3)}, $$

where $\delta$ and $\lambda$ are weighting factors. Table 11.2 shows some similarity measures for given weighting factors. These measures provide alternative ways of weighting mismatchings and positive (presence of a common character) or negative (absence of a common character) matchings. In principle, we could also consider the Euclidean distance. However, the disadvantage of this distance is that it treats the observations 0 and 1 in the same way. If $x_{ik} = 1$ denotes, say, knowledge of a certain language, then the contrary, $x_{ik} = 0$ (not knowing the language), should eventually be treated differently.

Cluster analysis is concerned with group identification. The goal of cluster analysis is to partition a set of observations into a distinct number of unknown groups or clusters in such a manner that all observations within a group are similar, while observations in different groups are not similar. If data are represented as an $n \times p$ matrix $Y = [y_{ij}]$, the goal of cluster analysis is to develop a classification scheme that will partition the rows of $Y$ into $k$ distinct groups (clusters). The rows of the matrix usually represent items or objects.

To uncover the groupings in the data, a measure of nearness, also called a proximity measure, needs to be defined. Two natural measures of nearness are the degree of distance or “dissimilarity” and the degree of association or “similarity” between groups. The choice of the proximity measure depends on the subject matter, scale of measurement (nominal, ordinal, interval, ratio), and type of variables (continuous, categorical) being analyzed. In many applications of cluster analysis, one begins with a proximity matrix rather than a data matrix. Given the proximity matrix of order ($n \times n$), say, the entries may represent dissimilarities $[d_{rs}]$ or similarities $[s_{rs}]$ between the $r$th and $s$th objects. Cluster analysis is a tool for classifying objects into groups and is not concerned with the geometric representation of the objects in a low-dimensional space. To explore the dimensionality of the space, one may use multidimensional scaling.

Proximity Measures

Proximity measures are used to represent the nearness of two objects. If a proximity measure represents similarity, the value of the measure increases as two objects become more similar. Alternatively, if the proximity measure represents dissimilarity, the value of the measure decreases as two objects become more alike. Letting $y_r$ and $y_s$ represent two objects in a $p$-variate space, an example of a dissimilarity measure is the Euclidean distance between $y_r$ and $y_s$. As a measure of similarity, one may use the proportion of the elements in the two vectors that match. More formally, one needs to establish a set of mathematical axioms to create dissimilarity and similarity measures.

Dissimilarity Measures

Given two objects $y_r$ and $y_s$ in a $p$-dimensional space, a dissimilarity measure satisfies the following conditions:

(1) $d_{rs} \ge 0$ for all objects $r$ and $s$;
(2) $d_{rs} = 0$ if and only if $y_r = y_s$;
(3) $d_{rs} = d_{sr}$.

Condition (3) implies that the measure is symmetric, so that the dissimilarity measure that compares $y_r$ (object $r$) with $y_s$ (object $s$) is the same as the comparison for object $s$ versus object $r$. Condition (2) requires the measure to be zero whenever object $r$ equals object $s$.

Cluster Analysis

To initiate a cluster analysis, one constructs a proximity matrix. The proximity matrix represents the strength of the relationship between pairs of rows in $Y'_{p \times n}$ or the data matrix $Y_{n \times p}$. Algorithms designed to perform cluster analysis are usually divided into two broad classes, called hierarchical and nonhierarchical clustering methods. Generally speaking, hierarchical methods generate a sequence of cluster solutions beginning with $n$ clusters.

Agglomerative Hierarchical Clustering Methods

Agglomerative hierarchical clustering methods use the elements of a proximity matrix to generate a tree diagram, or dendrogram, as shown in Figure 9.3.1. To begin the process, we start with $n = 5$ clusters, which are the branches of the tree. Combining item 1 with 2 reduces the number of clusters by one, from 5 to 4. Joining items 3 and 4 results in 3 clusters. Next, joining item 5 with the cluster (3, 4) results in 2 clusters. Finally, all items are combined to form a single cluster, the root of the tree.

Average Link Method

When comparing two clusters of objects $R$ and $S$, the single link and complete link methods of combining clusters depend only upon a single pair of objects within each cluster. Instead of using a minimum or maximum measure, the average link method calculates the distance between two clusters using the average of the dissimilarities

$$ d_{RS} = \frac{1}{n_R n_S} \sum_{r \in R} \sum_{s \in S} d_{rs}, $$

where $r \in R$, $s \in S$, and $n_R$ and $n_S$ represent the number of objects in each cluster. Hence, the dissimilarities in Step 3 are replaced by an average of $n_R n_S$ dissimilarities between all pairs of elements $r \in R$ and $s \in S$.

Centroid Method

In the average link method, the distance between two clusters is defined as an average of dissimilarity measures. Alternatively, suppose cluster $R$ contains $n_R$ elements and cluster $S$ contains $n_S$ elements. Then the centroids for the two clusters are

$$ \bar{y}_r = \frac{1}{n_R} \sum_{r \in R} y_r, \qquad \bar{y}_s = \frac{1}{n_S} \sum_{s \in S} y_s, $$

and the square of the Euclidean distance between the two clusters is $d_{rs}^2 = \|\bar{y}_r - \bar{y}_s\|^2$. For the centroid agglomerative process, one begins with any dissimilarity matrix $D$ (in SAS, the distances are squared unless one uses the NOSQUARE option). Then the two most similar clusters are combined using the weighted average of the two clusters. Letting $T$ represent the new cluster, the centroid of $T$ is

$$ \bar{y}_t = \frac{n_R \bar{y}_r + n_S \bar{y}_s}{n_R + n_S}. $$

The centroid method is called the median method if an unweighted average of the centroids is used, $\bar{y}_t = (\bar{y}_r + \bar{y}_s)/2$. The median method is preferred when $n_R \gg n_S$ or $n_S \gg n_R$. Letting the dissimilarity matrix be $D = [d_{rs}^2]$, where $d_{rs}^2 = \|y_r - y_s\|^2$, suppose the elements $r \in R$ and $s \in S$ are combined into a cluster $T$ with centroid $\bar{y}_t$ as above.

Then, to calculate the square of the Euclidean distance between cluster $T$ and the centroid $\bar{y}_u$ of a third cluster $U$, the following formula may be used:

$$ d_{tu}^2 = \frac{n_R}{n_R + n_S}\, d_{ru}^2 + \frac{n_S}{n_R + n_S}\, d_{su}^2 - \frac{n_R n_S}{(n_R + n_S)^2}\, d_{rs}^2. $$

This is a special case of a general algorithm for updating proximity measures for the single link, complete link, average link, centroid, and median methods developed by Williams and Lance (1977).
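The general updating algorithm referred to here is commonly written as a single recurrence; a standard statement (supplied here for reference, as the excerpt itself omits it) is: when clusters $R$ and $S$ are merged into $T$, the distance from $T$ to any other cluster $U$ is

$$ d_{TU} = \alpha_R\, d_{RU} + \alpha_S\, d_{SU} + \beta\, d_{RS} + \gamma\, |d_{RU} - d_{SU}|, $$

with the coefficients $(\alpha_R, \alpha_S, \beta, \gamma)$ chosen to reproduce single link, complete link, average link, centroid, median, or Ward's method.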

Ward's (Incremental Sum of Squares) Method

Given $n$ objects with $p$ variables, the sum of squares within clusters, where each object forms its own group, is zero. For all objects in a single group, the sum of squares within clusters, the sum of squares error, is equal to the total sum of squares:

$$ \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \bar{y})^\top (y_i - \bar{y}). $$

Thus, the sum of squares within clusters is between zero and SSE. Ward's method for forming clusters joins objects based upon minimizing the minimal increment in the within, or error, sum of squares. At each step of the process, $n(n-1)/2$ pairs of clusters are formed, and the two objects that increase the sum of squares for error least are joined. The process is continued until all objects are joined. The dendrogram is constructed based upon the minimum increase in the sum of squares for error. To see how the process works, let

$$ \mathrm{SSE}_r = \sum_{r \in R} (y_r - \bar{y}_r)^\top (y_r - \bar{y}_r), \qquad \mathrm{SSE}_s = \sum_{s \in S} (y_s - \bar{y}_s)^\top (y_s - \bar{y}_s) $$

for clusters $R$ and $S$. Combining clusters $R$ and $S$ to form cluster $T$, the error sum of squares for cluster $T$ is

$$ \mathrm{SSE}_t = \sum_{t \in T} (y_t - \bar{y}_t)^\top (y_t - \bar{y}_t), $$

where $\bar{y}_t = (n_R \bar{y}_r + n_S \bar{y}_s) / (n_R + n_S)$. Then the incremental increase in joining $R$ and $S$ to form cluster $T$ is $\mathrm{SSE}_t - (\mathrm{SSE}_r + \mathrm{SSE}_s)$. Or, letting $\mathrm{SSE}_t$ be the total sum of squares and $\mathrm{SSE}_r + \mathrm{SSE}_s$ the within-cluster sum of squares, the incremental increase in the error sum of squares is no more than the between-cluster sum of squares. The incremental between-cluster sum of squares (IBCSS) is

$$ \mathrm{IBCSS} = \frac{n_R n_S}{n_R + n_S} \|\bar{y}_r - \bar{y}_s\|^2. $$

For clusters with one object, this becomes $d_{rs}^2/2$. Hence, starting with a dissimilarity matrix $D = [d_{rs}^2]$, where $d_{rs}^2$ is the square of the Euclidean distance (the default in SAS for Ward's method) between objects $r$ and $s$, the two most similar objects are combined, and the new incremental sum of squares proximity matrix has elements $p_{rs} = d_{rs}^2/2$. Combining objects $r$ and $s$ to form a new cluster with mean $\bar{y}_t$ as above, the incremental increase in the error sum of squares may be calculated using the formula developed by Williams and Lance (1977).

Cluster analysis is used to categorize objects, such as patients, voters, products, institutions, countries, and cities, among others, into homogeneous groups based upon a vector of p variables. The clustering of objects into homogeneous groups depends on the scale of the variables, the algorithm used for clustering, and the criterion used to estimate the number of clusters. In general, variables with very large variances relative to others, and outliers, have an adverse effect on cluster analysis methods; they tend to dominate the proximity measure. Cluster analysis is an exploratory data analysis methodology. It tries to discover how objects may or may not be combined. The analysis depends on the amount of random noise in the data, the existence of outliers in the data, the variables selected for the analysis, the proximity measure used, the spatial properties of the data, and the clustering method employed.

TwoStep Cluster Analysis

The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a dataset that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:

• Handling of categorical and continuous variables. By assuming variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables.

• Automatic selection of number of clusters. By comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.

• Scalability. By constructing a cluster features (CF) tree that summarizes the records, the TwoStep algorithm allows you to analyze large data files.

Example. Retail and consumer product companies regularly apply clustering techniques to data that describe their customers' buying habits, gender, age, income level, etc. These companies tailor their marketing and product development strategies to each consumer group to increase sales and build brand loyalty.

Distance Measure. This selection determines how the similarity between two clusters is computed.

• Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent.

• Euclidean. The Euclidean measure is the straight-line distance between two clusters. It can be used only when all of the variables are continuous.

Number of Clusters. This selection allows you to specify how the number of clusters is to be determined.

• Determine automatically. The procedure will automatically determine the best number of clusters, using the criterion specified in the Clustering Criterion group. Optionally, enter a positive integer specifying the maximum number of clusters that the procedure should consider.

• Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive integer.

Count of Continuous Variables. This group provides a summary of the continuous variable standardization specifications made in the Options dialog box. See the topic TwoStep Cluster Analysis Options for more information.

Clustering Criterion. This selection determines how the automatic clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.
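In their generic forms (the procedure computes cluster-specific versions, but the structure is the same), the two criteria trade off the model log-likelihood $L$ against the number of parameters $m$ and the number of cases $n$:

$$ \mathrm{BIC} = -2L + m \log n, \qquad \mathrm{AIC} = -2L + 2m, $$

with smaller values indicating a better solution; BIC's $\log n$ term penalizes extra clusters more heavily on large files.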

Data. This procedure works with both continuous and categorical variables. Cases represent objects to be clustered, and the variables represent attributes upon which the clustering is based.

Case Order. Note that the cluster features tree and the final solution may depend on the order of cases. To minimize order effects, randomly order the cases. You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution. In situations where this is difficult due to extremely large file sizes, multiple runs with a sample of cases sorted in different random orders might be substituted.

Assumptions. The likelihood distance measure assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

Use the Bivariate Correlations procedure to test the independence of two continuous variables. Use the Crosstabs procedure to test the independence of two categorical variables. Use the Means procedure to test the independence between a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable. Use the Chi-Square Test procedure to test whether a categorical variable has a specified multinomial distribution.

To Obtain a TwoStep Cluster Analysis

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > TwoStep Cluster

Select one or more categorical or continuous variables.

Optionally, you can:

• Adjust the criteria by which clusters are constructed.

• Select settings for noise handling, memory allocation, variable standardization, and cluster model input.

• Request model viewer output.

• Save model results to the working file or to an external XML file.

TwoStep Cluster Analysis Options

Outlier Treatment. This group allows you to treat outliers specially during clustering if the cluster features (CF) tree fills. The CF tree is full if it cannot accept any more cases in a leaf node and no leaf node can be split.

• If you select noise handling and the CF tree fills, it will be regrown after placing cases in sparse leaves into a noise leaf. A leaf is considered sparse if it contains fewer than the specified percentage of cases of the maximum leaf size. After the tree is regrown, the outliers will be placed in the CF tree if possible. If not, the outliers are discarded.

• If you do not select noise handling and the CF tree fills, it will be regrown using a larger distance change threshold. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of –1 and is not included in the count of the number of clusters.

Memory Allocation. This group allows you to specify the maximum amount of memory in megabytes (MB) that the cluster algorithm should use. If the procedure exceeds this maximum, it will use the disk to store information that will not fit in memory. Specify a number greater than or equal to 4.

• Consult your system administrator for the largest value that you can specify on your system.

• The algorithm may fail to find the correct or desired number of clusters if this value is too low.

Variable standardization. The clustering algorithm works with standardized continuous variables. Any continuous variables that are not standardized should be left as variables in the To be Standardized list. To save some time and computational effort, you can select any continuous variables that you have already standardized as variables in the Assumed Standardized list.

Advanced Options

CF Tree Tuning Criteria. The following clustering algorithm settings apply specifically to the cluster features (CF) tree and should be changed with care:

• Initial Distance Change Threshold. This is the initial threshold used to grow the CF tree. If inserting a given case into a leaf of the CF tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split.

• Maximum Branches (per leaf node). The maximum number of child nodes that a leaf node can have.

• Maximum Tree Depth. The maximum number of levels that the CF tree can have.

• Maximum Number of Nodes Possible. This indicates the maximum number of CF tree nodes that could potentially be generated by the procedure, based on the function (b^(d+1) – 1) / (b – 1), where b is the maximum branches and d is the maximum tree depth. Be aware that an overly large CF tree can be a drain on system resources and can adversely affect the performance of the procedure. At a minimum, each node requires 16 bytes.
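As a worked example (the values b = 8 and d = 3 are used purely for illustration):

$$ \frac{b^{d+1} - 1}{b - 1} = \frac{8^{4} - 1}{8 - 1} = \frac{4095}{7} = 585 \text{ nodes}, $$

which at 16 bytes per node is a minimum of roughly 9 KB for the tree.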

Cluster Model Update. This group allows you to import and update a cluster model generated in a prior analysis. The input file contains the CF tree in XML format. The model will then be updated with the data in the active file. You must select the variable names in the main dialog box in the same order in which they were specified in the prior analysis. The XML file remains unaltered unless you specifically write the new model information to the same filename. See the topic TwoStep Cluster Analysis Output for more information.

If a cluster model update is specified, the options pertaining to generation of the CF tree that were specified for the original model are used. More specifically, the distance measure, noise handling, memory allocation, and CF tree tuning criteria settings for the saved model are used, and any settings for these options in the dialog boxes are ignored.

Note: When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in the model update come from the same population as the cases used to create the original model; that is, the means and variances of continuous variables and the levels of categorical variables are assumed to be the same across both sets of cases. If your new and old sets of cases come from heterogeneous populations, you should run the TwoStep Cluster Analysis procedure on the combined sets of cases for the best results.

TwoStep Cluster Analysis Output

Model Viewer Output. This group provides options for displaying the clustering results.

• Charts and tables. Displays model-related output, including tables and charts. Tables in the model view include a model summary and a cluster-by-features grid. Graphical output in the model view includes a cluster quality chart, cluster sizes, variable importance, a cluster comparison grid, and cell information.

• Evaluation fields. This calculates cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the input features in the model viewer by selecting them in the Display subdialog. Fields with missing values are ignored.

Working Data File. This group allows you to save variables to the active dataset.

• Create cluster membership variable. This variable contains a cluster identification number for each case. The name of this variable is tsc_n, where n is a positive integer indicating the ordinal of the active dataset save operation completed by this procedure in a given session.

XML Files. The final cluster model and CF tree are two types of output files that can be exported in XML format.

• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish between membership in one cluster or another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked, or auxiliary, view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check if the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings to produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating as either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B − A) / max(A, B), where A is the record's distance to its own cluster center and B is the record's distance to the nearest cluster center that it doesn't belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of −1 would mean all cases are located on the cluster centers of some other cluster. A value of 0 means, on average, cases are equidistant between their own cluster center and the nearest other cluster.
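A closely related coefficient is available in scikit-learn; note it measures average point-to-point distances within and between clusters rather than distances to cluster centers, so the values are analogous but not identical. A quick sketch with simulated data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (40, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # near 1: well separated; near 0: overlapping
```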

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, TwoStep.

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (this is blank by default). Double-click in the cell to enter a label that describes the cluster contents; for example, "Luxury car buyers".

• Description. Any description of the cluster contents (this is blank by default). Double-click in the cell to enter a description of the cluster; for example, "55+ years of age, professionals, earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading: the most important feature is darkest; the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value; for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields.

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields, and the mode (most frequently occurring category), with category percentage, for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid, with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot, which uses the same endpoints and intervals for each cluster.

The solid red colored display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red colored display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and a percentage of the whole).

• The size of the largest cluster (both a count and a percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data, but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent/modal category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can also reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example: Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics: Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots. A scripted sketch of the same workflow follows.

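As an illustration of the workflow outside SPSS, the following Python sketch builds an agglomeration schedule, extracts a cluster membership, and draws a dendrogram with SciPy. The sample data and the choice of average linkage are assumptions made only for this example.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))            # 20 cases on 3 variables (made-up data)

    Z = linkage(X, method="average", metric="euclidean")  # agglomeration schedule
    membership = fcluster(Z, t=4, criterion="maxclust")   # membership, 4-cluster solution
    dendrogram(Z)                                         # tree of the merge history
    plt.show()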
Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or variables before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data (a computational sketch follows the list):

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

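A brief sketch of how the interval measures above can be computed with SciPy; the two vectors are invented for illustration.

    from scipy.spatial import distance

    x = [1.0, 2.0, 3.0]
    y = [2.0, 4.0, 1.0]

    distance.euclidean(x, y)          # square root of the sum of squared differences
    distance.sqeuclidean(x, y)        # sum of squared differences
    distance.chebyshev(x, y)          # maximum absolute difference
    distance.cityblock(x, y)          # block (Manhattan) distance
    distance.minkowski(x, y, p=3)     # pth root of the sum of |differences|^p
    1 - distance.cosine(x, y)         # cosine of the angle between the vectors
    1 - distance.correlation(x, y)    # Pearson correlation between the vectors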
Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data (a sketch of both measures follows):

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

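A sketch of both count-data measures, written directly from the definitions above; the frequency vectors are invented.

    import numpy as np

    def chi_square_measure(x, y):
        # Chi-square test of equality for two sets of frequencies, arranged
        # as the two rows of a 2 x k table; the measure is the root statistic.
        x, y = np.asarray(x, float), np.asarray(y, float)
        total = x.sum() + y.sum()
        ex = (x + y) * x.sum() / total      # expected counts for row 1
        ey = (x + y) * y.sum() / total      # expected counts for row 2
        return np.sqrt(((x - ex) ** 2 / ex).sum() + ((y - ey) ** 2 / ey).sum())

    def phi_square_measure(x, y):
        # Chi-square measure normalized by the square root of the combined frequency.
        x, y = np.asarray(x, float), np.asarray(y, float)
        return chi_square_measure(x, y) / np.sqrt(x.sum() + y.sum())

    print(chi_square_measure([10, 20, 30], [15, 15, 25]))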
Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data (a computational sketch follows this section):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

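Most of the binary measures above are simple functions of the fourfold table. A small sketch, with invented 0/1 vectors, that computes the table and a few of the listed measures:

    import numpy as np

    def fourfold(u, v):
        # a = joint presences, d = joint absences; b and c count cases
        # present on one item but absent on the other.
        u, v = np.asarray(u, bool), np.asarray(v, bool)
        return (np.sum(u & v), np.sum(u & ~v), np.sum(~u & v), np.sum(~u & ~v))

    u = [1, 1, 0, 1, 0, 1]
    v = [1, 0, 0, 1, 1, 1]
    a, b, c, d = fourfold(u, v)
    n = a + b + c + d

    np.sqrt(b + c)                 # binary Euclidean distance
    (a + d) / n                    # simple matching
    a / (a + b + c)                # Jaccard (joint absences excluded)
    2 * a / (2 * a + b + c)        # Dice (matches weighted double)
    (b + c) / (2 * a + b + c)      # Lance and Williams (Bray-Curtis)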
Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values (a sketch follows the list):

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

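Each transformation is a one-line rescaling; a sketch with an invented variable:

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 10.0])            # one item (made-up values)

    (x - x.mean()) / x.std(ddof=1)                 # z scores (mean 0, sd 1)
    x / (x.max() - x.min())                        # range -1 to 1: divide by the range
    (x - x.min()) / (x.max() - x.min())            # range 0 to 1
    x / x.max()                                    # maximum magnitude of 1
    x / x.mean()                                   # mean of 1
    x / x.std(ddof=1)                              # standard deviation of 1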
Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis of variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example: What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics: Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information. A scripted analogue of this workflow is sketched below.

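A scikit-learn analogue of the sample-then-classify workflow; the data, sample size, and number of clusters are invented, and scikit-learn keeps the centers in the fitted object rather than in an external file:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    full = rng.normal(size=(100_000, 4))                 # the entire data file
    idx = rng.choice(len(full), size=5_000, replace=False)

    km = KMeans(n_clusters=5, n_init=10).fit(full[idx])  # "Iterate and classify" on the sample
    labels = km.predict(full)                            # "Classify only" on the whole file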
K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case (a short F-test sketch follows the list).

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

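The descriptive univariate F tests can be reproduced with SciPy; as with the ANOVA table above, the probabilities are opportunistic, and only the relative size of F is informative. The data and cluster count are invented:

    import numpy as np
    from scipy.stats import f_oneway
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 3))
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

    for j in range(X.shape[1]):                       # one F test per clustering variable
        F, p = f_oneway(*(X[labels == k, j] for k in range(3)))
        print(f"variable {j}: F = {F:.2f}")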
Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example: Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics: Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured). A SciPy comparison of several of these methods is sketched after the list.

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between −1 and 1.

K-nbd. The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

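Several of these linkage rules have direct SciPy counterparts, which makes it easy to compare the trees they produce. This sketch, on invented data, prints the final amalgamation distance under each method; note that SciPy's "weighted" is the WPGMA weighted-average rule, which is close to, but not necessarily identical with, SYSTAT's Weighted option.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(3)
    X = rng.normal(size=(30, 2))

    for method in ("single", "complete", "average", "weighted",
                   "centroid", "median", "ward"):
        Z = linkage(X, method=method)      # centroid/median/ward assume Euclidean input
        print(method, round(Z[-1, 2], 3))  # distance at the final merge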
For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: to save cluster identifiers, or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity

Provides five validity indices to evaluate the partition quality; in particular, they are used to find the appropriate number of clusters for the given data set (a scripted sketch of two of the indices follows the list):

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.

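Two of these indices have common open-source counterparts: the pseudo F-ratio is the Calinski-Harabasz statistic, and the Davies-Bouldin index is available directly. A scikit-learn sketch on invented blob data:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        print(k,
              round(calinski_harabasz_score(X, labels), 1),   # pseudo F (larger is better)
              round(davies_bouldin_score(X, labels), 3))      # DB index (smaller is better)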
The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting one of the clusters into two (and reassigning cases) continues until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds, to a SYSTAT file.


None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster, and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to any of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.

where δ and λ are weighting factors. Table 11.2 shows some similarity measures for given weighting factors. These measures provide alternative ways of weighting mismatchings and positive (presence of a common character) or negative (absence of a common character) matchings. In principle, we could also consider the Euclidean distance. However, the disadvantage of this distance is that it treats the observations 0 and 1 in the same way. If x_ik = 1 denotes, say, knowledge of a certain language, then the contrary, x_ik = 0 (not knowing the language), should eventually be treated differently.

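The formula referred to above did not survive extraction. A plausible reconstruction of this family of binary proximity coefficients, with a_1 counting joint presences, a_4 joint absences, and a_2, a_3 the two kinds of mismatches, is

\[
T_{ij} \;=\; \frac{a_1 + \delta\, a_4}{a_1 + \delta\, a_4 + \lambda\,(a_2 + a_3)} ,
\]

so that, for example, δ = 0, λ = 1 gives the Jaccard coefficient; δ = 1, λ = 1 gives simple matching; δ = 1, λ = 2 gives Rogers and Tanimoto; and δ = 0, λ = 1/2 gives Dice.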
Cluster analysis is concerned with group identification. The goal of cluster analysis is to partition a set of observations into a distinct number of unknown groups or clusters in such a manner that all observations within a group are similar, while observations in different groups are not similar. If the data are represented as an n × p matrix Y = [y_ij], then the goal of cluster analysis is to develop a classification scheme that will partition the rows of Y into k distinct groups (clusters). The rows of the matrix usually represent items or objects.

To uncover the groupings in the data, a measure of nearness, also called a proximity measure, needs to be defined. Two natural measures of nearness are the degree of distance or "dissimilarity" and the degree of association or "similarity" between groups. The choice of the proximity measure depends on the subject matter, scale of measurement (nominal, ordinal, interval, ratio), and type of variables (continuous, categorical) being analyzed. In many applications of cluster analysis, one begins with a proximity matrix rather than a data matrix. Given the proximity matrix of order (n × n), say, the entries may represent dissimilarities [d_rs] or similarities [s_rs] between the rth and sth objects. Cluster analysis is a tool for classifying objects into groups and is not concerned with the geometric representation of the objects in a low-dimensional space. To explore the dimensionality of the space, one may use multidimensional scaling.

Proximity Measures

Proximity measures are used to represent the nearness of two objects. If a proximity measure represents similarity, the value of the measure increases as two objects become more similar. Alternatively, if the proximity measure represents dissimilarity, the value of the measure decreases as two objects become more alike. Letting y_r and y_s represent two objects in a p-variate space, an example of a dissimilarity measure is the Euclidean distance between y_r and y_s. As a measure of similarity, one may use the proportion of the elements in the two vectors that match. More formally, one needs to establish a set of mathematical axioms to create dissimilarity and similarity measures.

Dissimilarity Measures

Given two objects y_r and y_s in a p-dimensional space, a dissimilarity measure satisfies the following conditions: (1) d_rs ≥ 0 for all objects r and s; (2) d_rs = 0 if y_r = y_s; (3) d_rs = d_sr.

Condition (3) implies that the measure is symmetric, so that the dissimilarity measure that compares y_r (object r) with y_s (object s) is the same as the comparison for object s versus object r. Condition (2) requires the measure to be zero whenever object r equals object s.

Cluster Analysis

To initiate a cluster analysis, one constructs a proximity matrix. The proximity matrix represents the strength of the relationship between pairs of rows in Y′ (p × n) or the data matrix Y (n × p). Algorithms designed to perform cluster analysis are usually divided into two broad classes, called hierarchical and nonhierarchical clustering methods. Generally speaking, hierarchical methods generate a sequence of cluster solutions, beginning with n clusters.

Agglomerative Hierarchical Clustering Methods

Agglomerative hierarchical clustering methods use the elements of a proximity matrix to generate a tree diagram or dendrogram, as shown in Figure 9.3.1. To begin the process, we start with n = 5 clusters, which are the branches of the tree. Combining item 1 with 2 reduces the number of clusters by one, from 5 to 4. Joining items 3 and 4 results in 3 clusters. Next, joining item 5 with the cluster (3, 4) results in 2 clusters. Finally, all items are combined to form a single cluster, the root of the tree.

Average Link Method

When comparing two clusters of objects R and S, the single link and complete link methods of combining clusters depend only upon a single pair of objects within each cluster. Instead of using a minimum or maximum measure, the average link method calculates the distance between two clusters using the average of the dissimilarities d_rs, where r ∈ R, s ∈ S, and n_R and n_S represent the number of objects in each cluster. Hence, the dissimilarities in Step 3 are replaced by an average of n_R n_S dissimilarities between all pairs of elements r ∈ R and s ∈ S, as written out below.

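In symbols, the average-link distance just described is plausibly

\[
d_{RS} \;=\; \frac{1}{n_R\, n_S} \sum_{r \in R} \sum_{s \in S} d_{rs} ,
\]

a reconstruction from the surrounding text, since the original display was lost.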
Centroid Method

In the average link method, the distance between two clusters is defined as an average of dissimilarity measures. Alternatively, suppose cluster R contains n_R elements and cluster S contains n_S elements. Then the centroids for the two clusters are the mean vectors ȳ_r and ȳ_s, and the square of the Euclidean distance between the two clusters is d²_rs = ‖ȳ_r − ȳ_s‖². For the centroid agglomerative process, one begins with any dissimilarity matrix D (in SAS, the distances are squared unless one uses the NOSQUARE option). Then the two most similar clusters are combined using the weighted average of the two clusters. Letting T represent the new cluster, the centroid of T is ȳ_t = (n_R ȳ_r + n_S ȳ_s) / (n_R + n_S).

The centroid method is called the median method if an unweighted average of the centroids is used, ȳ_t = (ȳ_r + ȳ_s)/2. The median method is preferred when n_R ≫ n_S or n_S ≫ n_R. Letting the dissimilarity matrix be D = [d²_rs], where d²_rs = ‖y_r − y_s‖², suppose the elements r ∈ R and s ∈ S are combined into a cluster T. Then, to calculate the square of the Euclidean distance between cluster T and the centroid ȳ_u of a third cluster U, the following formula may be used:

d²_tu = (n_R d²_ru + n_S d²_su) / (n_R + n_S) − n_R n_S d²_rs / (n_R + n_S)²

This is a special case of a general algorithm for updating proximity measures for the single link, complete link, average link, centroid, and median methods developed by Williams and Lance (1977).

Ward's (Incremental Sum of Squares) Method

Given n objects with p variables, the sum of squares within clusters, where each object forms its own group, is zero. For all objects in a single group, the sum of squares within clusters, the sum of squares error, is equal to the total sum of squares.

Thus, the sum of squares within clusters is between zero and SSE. Ward's method for forming clusters joins objects based upon minimizing the minimal increment in the within, or error, sum of squares. At each step of the process, n(n−1)/2 pairs of clusters are formed, and the two objects that increase the sum of squares for error least are joined. The process is continued until all objects are joined. The dendrogram is constructed based upon the minimum increase in the sum of squares for error. To see how the process works, let SSE_r and SSE_s denote the error sums of squares for clusters R and S. Combining clusters R and S to form cluster T, the error sum of squares for cluster T is SSE_t, where ȳ_t = (n_R ȳ_r + n_S ȳ_s) / (n_R + n_S). Then the incremental increase in joining R and S to form cluster T is SSE_t − (SSE_r + SSE_s). Or, letting SSE_t be the total sum of squares and SSE_r + SSE_s the within-cluster sum of squares, the incremental increase in the error sum of squares is no more than the between-cluster sum of squares. The incremental between-cluster sum of squares (IBCSS) is given in (9.38) below; for clusters with one object, (9.38) becomes d²_rs/2. Hence, starting with a dissimilarity matrix D = [d²_rs], where d²_rs is the square of the Euclidean distance (the default in SAS for Ward's method) between objects r and s, the two most similar objects are combined, and the new incremental sum of squares proximity matrix has elements p_rs = d²_rs/2. Combining objects r and s to form a new cluster with mean ȳ_t, using (9.35), the incremental increase in the error sum of squares may be calculated using the formula developed by Williams and Lance (1977), sketched below.

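A plausible reconstruction of the missing displays, based on the surrounding definitions (the equation number follows the source's numbering):

\[
SSE_t - (SSE_r + SSE_s) \;=\; \frac{n_R\, n_S}{n_R + n_S}\,\|\bar{y}_r - \bar{y}_s\|^2 \tag{9.38}
\]

which for two single-object clusters reduces to d²_rs/2, as stated above. The corresponding updating formula for the proximity between the merged cluster T and a third cluster U takes the form

\[
p_{tu} \;=\; \frac{(n_R + n_U)\, p_{ru} + (n_S + n_U)\, p_{su} - n_U\, p_{rs}}{n_R + n_S + n_U} .
\]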
Cluster analysis is used to categorize objects, such as patients, voters, products, institutions, countries, and cities, among others, into homogeneous groups based upon a vector of p variables. The clustering of objects into homogeneous groups depends on the scale of the variables, the algorithm used for clustering, and the criterion used to estimate the number of clusters. In general, variables with very large variances relative to others, and outliers, have an adverse effect on cluster analysis methods; they tend to dominate the proximity measure. Cluster analysis is an exploratory data analysis methodology; it tries to discover how objects may or may not be combined. The analysis depends on the amount of random noise in the data, the existence of outliers in the data, the variables selected for the analysis, the proximity measure used, the spatial properties of the data, and the clustering method employed.

Two Step Cluster Analysis

The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a dataset that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques (a rough analogue of the two steps is sketched after this list):

• Handling of categorical and continuous variables. By assuming variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables.

• Automatic selection of number of clusters. By comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.

• Scalability. By constructing a cluster features (CF) tree that summarizes the records, the TwoStep algorithm allows you to analyze large data files.

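A rough open-source analogue of the two steps (not the SPSS implementation; the threshold and the range of candidate k values are arbitrary choices for this sketch): a BIRCH-style cluster features tree condenses the records into sub-cluster centers, and the condensed summaries are then clustered, with the number of clusters chosen by BIC.

    import numpy as np
    from sklearn.cluster import Birch, AgglomerativeClustering
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(4)
    X = rng.normal(size=(10_000, 3))                     # invented continuous data

    # Step 1: a CF-tree pass summarizes the records into sub-cluster centers.
    cf = Birch(threshold=0.5, n_clusters=None).fit(X)
    centers = cf.subcluster_centers_

    # Step 2: cluster the summaries; compare BIC across candidate solutions.
    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(centers).bic(centers)
           for k in range(1, 8)}
    best_k = min(bic, key=bic.get)
    labels = AgglomerativeClustering(n_clusters=best_k).fit_predict(centers)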
Example: Retail and consumer product companies regularly apply clustering techniques to data that describe their customers' buying habits, gender, age, income level, etc. These companies tailor their marketing and product development strategies to each consumer group to increase sales and build brand loyalty.

Distance Measure. This selection determines how the similarity between two clusters is computed.

• Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent.

• Euclidean. The Euclidean measure is the straight-line distance between two clusters. It can be used only when all of the variables are continuous.

Number of Clusters. This selection allows you to specify how the number of clusters is to be determined.

• Determine automatically. The procedure will automatically determine the best number of clusters, using the criterion specified in the Clustering Criterion group. Optionally, enter a positive integer specifying the maximum number of clusters that the procedure should consider.

• Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive integer.

Count of Continuous Variables. This group provides a summary of the continuous variable standardization specifications made in the Options dialog box. See the topic TwoStep Cluster Analysis Options for more information.

Clustering Criterion. This selection determines how the automatic clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.

Data. This procedure works with both continuous and categorical variables. Cases represent objects to be clustered, and the variables represent attributes upon which the clustering is based.

Case Order. Note that the cluster features tree and the final solution may depend on the order of cases. To minimize order effects, randomly order the cases. You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution. In situations where this is difficult due to extremely large file sizes, multiple runs with a sample of cases sorted in different random orders might be substituted.

Assumptions. The likelihood distance measure assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

Use the Bivariate Correlations procedure to test the independence of two continuous variables. Use the Crosstabs procedure to test the independence of two categorical variables. Use the Means procedure to test the independence between a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable. Use the Chi-Square Test procedure to test whether a categorical variable has a specified multinomial distribution. Scripted analogues of these checks are sketched below.

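For readers working outside SPSS, rough scripted analogues of these five checks (all data invented; the tests differ in detail from the named procedures):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    x = rng.normal(size=200)                          # continuous variable
    y = rng.normal(size=200)                          # continuous variable
    g = rng.integers(0, 3, size=200)                  # categorical, 3 levels
    h = rng.integers(0, 2, size=200)                  # categorical, 2 levels

    stats.pearsonr(x, y)                              # two continuous variables
    table = np.array([[np.sum((g == i) & (h == j)) for j in range(2)]
                      for i in range(3)])
    stats.chi2_contingency(table)                     # two categorical variables
    stats.f_oneway(*(x[g == i] for i in range(3)))    # continuous vs. categorical
    stats.shapiro(x)                                  # normality of a continuous variable
    stats.chisquare(np.bincount(g))                   # specified (here uniform) multinomial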
To Obtain a TwoStep Cluster Analysis

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > TwoStep Cluster

Select one or more categorical or continuous variables.

Optionally, you can:

• Adjust the criteria by which clusters are constructed.

• Select settings for noise handling, memory allocation, variable standardization, and cluster model input.

• Request model viewer output.

• Save model results to the working file or to an external XML file.

TwoStep Cluster Analysis Options

Outlier Treatment. This group allows you to treat outliers specially during clustering if the cluster features (CF) tree fills. The CF tree is full if it cannot accept any more cases in a leaf node and no leaf node can be split.

• If you select noise handling and the CF tree fills, it will be regrown after placing cases in sparse leaves into a noise leaf. A leaf is considered sparse if it contains fewer than the specified percentage of cases of the maximum leaf size. After the tree is regrown, the outliers will be placed in the CF tree if possible; if not, the outliers are discarded.

• If you do not select noise handling and the CF tree fills, it will be regrown using a larger distance change threshold. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of −1 and is not included in the count of the number of clusters.

Memory Allocation. This group allows you to specify the maximum amount of memory in megabytes (MB) that the cluster algorithm should use. If the procedure exceeds this maximum, it will use the disk to store information that will not fit in memory. Specify a number greater than or equal to 4.

• Consult your system administrator for the largest value that you can specify on your system.

• The algorithm may fail to find the correct or desired number of clusters if this value is too low.

Variable standardization. The clustering algorithm works with standardized continuous variables. Any continuous variables that are not standardized should be left as variables in the To be Standardized list. To save some time and computational effort, you can select any continuous variables that you have already standardized as variables in the Assumed Standardized list. (A sketch of z-score standardization follows.)
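A minimal sketch of the z-score standardization involved, with scikit-learn's StandardScaler as a stand-in for the procedure's internal standardization (hypothetical data):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[180.0, 75.0], [160.0, 60.0], [170.0, 68.0]])  # e.g., height, weight
X_std = StandardScaler().fit_transform(X)  # each column: mean 0, standard deviation 1
print(X_std.mean(axis=0), X_std.std(axis=0))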

Advanced Options

CF Tree Tuning Criteria. The following clustering algorithm settings apply specifically to the cluster features (CF) tree and should be changed with care.

• Initial Distance Change Threshold. This is the initial threshold used to grow the CF tree. If inserting a given case into a leaf of the CF tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split.

• Maximum Branches (per leaf node). The maximum number of child nodes that a leaf node can have.

• Maximum Tree Depth. The maximum number of levels that the CF tree can have.

• Maximum Number of Nodes Possible. This indicates the maximum number of CF tree nodes that could potentially be generated by the procedure, based on the function (b^(d+1) − 1) / (b − 1), where b is the maximum branches and d is the maximum tree depth. Be aware that an overly large CF tree can be a drain on system resources and can adversely affect the performance of the procedure. At a minimum, each node requires 16 bytes.
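As a quick check on resource use, this sketch evaluates the node bound above (the values b = 8 and d = 3 are illustrative; adjust them to your dialog settings):

def max_cf_nodes(b: int, d: int) -> int:
    # (b**(d + 1) - 1) / (b - 1): geometric-series count of nodes in a full tree
    return (b ** (d + 1) - 1) // (b - 1)

b, d = 8, 3                   # illustrative tuning values
nodes = max_cf_nodes(b, d)    # 585 nodes for b = 8, d = 3
print(nodes, nodes * 16, "bytes at the 16-bytes-per-node minimum")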

Cluster Model Update. This group allows you to import and update a cluster model generated in a prior analysis. The input file contains the CF tree in XML format. The model will then be updated with the data in the active file. You must select the variable names in the main dialog box in the same order in which they were specified in the prior analysis. The XML file remains unaltered unless you specifically write the new model information to the same filename. See the topic TwoStep Cluster Analysis Output for more information.

If a cluster model update is specified, the options pertaining to generation of the CF tree that were specified for the original model are used. More specifically, the distance measure, noise handling, memory allocation, and CF tree tuning criteria settings for the saved model are used, and any settings for these options in the dialog boxes are ignored.

Note: When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in the model update come from the same population as the cases used to create the original model; that is, the means and variances of continuous variables and the levels of categorical variables are assumed to be the same across both sets of cases. If your new and old sets of cases come from heterogeneous populations, you should run the TwoStep Cluster Analysis procedure on the combined sets of cases for the best results.

TwoStep Cluster Analysis Output

Model Viewer Output. This group provides options for displaying the clustering results.

• Charts and tables. Displays model-related output, including tables and charts. Tables in the model view include a model summary and a cluster-by-features grid. Graphical output in the model view includes a cluster quality chart, cluster sizes, variable importance, a cluster comparison grid, and cell information.

• Evaluation fields. This calculates cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the input features in the model viewer by selecting them in the Display subdialog. Fields with missing values are ignored.

Working Data File. This group allows you to save variables to the active dataset.

• Create cluster membership variable. This variable contains a cluster identification number for each case. The name of this variable is tsc_n, where n is a positive integer indicating the ordinal of the active dataset save operation completed by this procedure in a given session.

XML Files. The final cluster model and CF tree are two types of output files that can be exported in XML format.

• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish membership in one cluster from another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked, or auxiliary, view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check whether the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings to produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating as either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B − A) / max(A, B), where A is the record's distance to its own cluster center and B is the record's distance to the nearest cluster center that it doesn't belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of −1 would mean all cases are located on the cluster centers of some other cluster. A value of 0 means, on average, cases are equidistant between their own cluster center and the nearest other cluster.
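A minimal sketch of this center-based silhouette (NumPy assumed; note this is the center-based variant described above, not scikit-learn's pairwise silhouette):

import numpy as np

def center_silhouette(X, labels, centers):
    scores = []
    for x, k in zip(X, labels):
        d = np.linalg.norm(centers - x, axis=1)  # distance to every center
        a = d[k]                                 # A: own cluster center
        b = np.min(np.delete(d, k))              # B: nearest other center
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
labels = np.repeat([0, 1], 50)
centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
print(center_silhouette(X, labels, centers))  # close to 1 for well-separated data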

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, "TwoStep".

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (this is blank by default). Double-click in the cell to enter a label that describes the cluster contents, for example, "Luxury car buyers".

• Description. Any description of the cluster contents (this is blank by default). Double-click in the cell to enter a description of the cluster, for example, "55+ years of age, professionals, earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value, for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields.

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields, and the mode (most frequently occurring category), with category percentage, is shown for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid, with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot that uses the same endpoints and intervals for each cluster.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and a percentage of the whole).

• The size of the largest cluster (both a count and a percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent/modal category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.
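In pandas terms, the resulting indicator behaves like this minimal sketch (column and variable names are hypothetical; SPSS creates the filter variable natively):

import pandas as pd

df = pd.DataFrame({"case_id": range(6), "cluster": [1, 2, 1, 3, 2, 1]})
selected_clusters = {1, 3}

df["filter_var"] = df["cluster"].isin(selected_clusters).astype(int)  # 1 = keep
subset = df[df["filter_var"] == 1]  # subsequent analyses run on this subset
print(subset)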

Hierarchical Cluster Analysis. This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.
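For orientation, here is a minimal agglomerative-clustering sketch with SciPy standing in for the procedure (hypothetical data; matplotlib assumed for the dendrogram):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

Z = linkage(X, method="ward")                    # agglomeration schedule, one row per merge
labels = fcluster(Z, t=2, criterion="maxclust")  # cluster membership for a 2-cluster solution
dendrogram(Z)                                    # tree plot of the merges
plt.show()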

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or values before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data (a sketch of several of these measures, computed from the fourfold table, follows the list):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
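A small sketch of the fourfold table and a few of the measures above (NumPy assumed; x and y are two hypothetical binary items):

import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 0, 0])
y = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fourfold table: a = both present, b/c = present on one item only, d = both absent
a = int(np.sum((x == 1) & (y == 1)))
b = int(np.sum((x == 1) & (y == 0)))
c = int(np.sum((x == 0) & (y == 1)))
d = int(np.sum((x == 0) & (y == 0)))
n = a + b + c + d

euclidean = (b + c) ** 0.5            # SQRT(b+c)
jaccard = a / (a + b + c)             # joint absences excluded
dice = 2 * a / (2 * a + b + c)        # matches weighted double
simple_matching = (a + d) / n         # matches over all values
hamann = ((a + d) - (b + c)) / n      # matches minus nonmatches, over total
print(euclidean, jaccard, dice, simple_matching, hamann)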

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis. This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information. (The sketch below mirrors this sample-then-classify workflow.)
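A sketch of the sample-then-classify workflow described above, with scikit-learn's KMeans standing in for the procedure (an assumption; data are hypothetical):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
full_data = rng.normal(size=(100_000, 4))  # the "entire data file"

# "Iterate and classify" on a sample to estimate centers
sample_idx = rng.choice(len(full_data), size=5_000, replace=False)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(full_data[sample_idx])
# km.cluster_centers_ plays the role of the "Write final as" centers

# "Classify only" on the full file, using the centers from the sample
labels = km.predict(full_data)
print(np.bincount(labels))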

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers. (The sketch below turns this rule into code.)

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.
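A sketch of the convergence rule described above: a criterion of 0.02 stops iteration once no center moves more than 2% of the smallest distance between any pair of initial centers (hypothetical centers shown):

import numpy as np
from scipy.spatial.distance import pdist

initial_centers = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 5.0]])
criterion = 0.02

threshold = criterion * pdist(initial_centers).min()  # absolute movement cutoff

def converged(old_centers, new_centers):
    moves = np.linalg.norm(new_centers - old_centers, axis=1)
    return bool(moves.max() <= threshold)

print(threshold, converged(initial_centers, initial_centers + 0.01))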

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure. (A minimal sketch follows.)
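A minimal sketch of the automobile example: pairwise Euclidean distances between cases after z-scoring, using SciPy (hypothetical data and column meanings):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import zscore

#                engine size, MPG, horsepower
cars = np.array([[3.5, 22.0, 280.0],
                 [2.0, 35.0, 150.0],
                 [3.6, 21.0, 285.0]])

D = squareform(pdist(zscore(cars), metric="euclidean"))  # case-by-case distance matrix
print(np.round(D, 2))  # cars 0 and 2 come out most similar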

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz., statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured):

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between −1 and 1.

K-nbd. The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity

Provides five validity indices to evaluate the partition quality; in particular, they are used to find the appropriate number of clusters for the given data set (a sketch follows the list):

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
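A sketch of two of the validity indices named above, via scikit-learn: calinski_harabasz_score is the pseudo F-ratio, and davies_bouldin_score is the DB index (lower is better). Data and the range of cluster counts are hypothetical.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.7, (60, 2)) for m in (0, 4, 8)])  # three blobs
Z = linkage(X, method="ward")

for k in range(2, 6):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, calinski_harabasz_score(X, labels), davies_bouldin_score(X, labels))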

K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting one of the clusters into two (and reassigning cases) continues until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.
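To make the splitting strategy concrete, here is a minimal sketch in Python with NumPy (an illustrative reconstruction, not SYSTAT's code, and it assumes no cluster empties out during reassignment); the seed-selection rule and stopping test follow the description above:

```python
import numpy as np

def splitting_kmeans(X, k, max_iter=20):
    centers = [X.mean(axis=0)]                        # start with one cluster
    while len(centers) < k:
        C = np.asarray(centers)
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # seed a new cluster with the case farthest from its current center
        farthest = d[np.arange(len(X)), labels].argmax()
        centers.append(X[farthest].copy())
    C = np.asarray(centers)
    for _ in range(max_iter):                         # reassignment phase
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_C = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_C, C):                     # WSS no longer reduced
            break
        C = new_C
    return labels, C

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = splitting_kmeans(X, k=2)
```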

Algorithm Provides K-Means and K-Medians clustering options.

K-means Requests K-Means clustering.

K-medians Requests K-Medians clustering.

Groups Enter the number of desired clusters. The default number of groups is two.

Iterations Enter the maximum number of iterations. If not specified, the maximum is 20.

Distance Specifies the distance metric used to compare clusters.

Save Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k Considers the first k non-missing cases as initial seeds.

Last k Considers the last k non-missing cases as initial seeds.

Random k Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation Assigns each case to one of k partitions randomly, then computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.

Proximity Measures

Proximity measures are used to represent the nearness of two objects. If a proximity measure represents similarity, the value of the measure increases as two objects become more similar. Alternatively, if the proximity measure represents dissimilarity, the value of the measure decreases as two objects become more alike. Letting $y_r$ and $y_s$ represent two objects in a $p$-variate space, an example of a dissimilarity measure is the Euclidean distance between $y_r$ and $y_s$. As a measure of similarity, one may use the proportion of the elements in the two vectors that match. More formally, one needs to establish a set of mathematical axioms to create dissimilarity and similarity measures.

Dissimilarity Measures

Given two objects $y_r$ and $y_s$ in a $p$-dimensional space, a dissimilarity measure $d_{rs}$ satisfies the following conditions:

(1) $d_{rs} \ge 0$ for all objects $r$ and $s$;
(2) $d_{rs} = 0$ whenever $y_r = y_s$;
(3) $d_{rs} = d_{sr}$.
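A small numerical illustration (a NumPy example assumed for this document, not taken from the text) of the two kinds of proximity measure just described, Euclidean distance as a dissimilarity and the proportion of matching elements as a similarity:

```python
import numpy as np

yr = np.array([1.0, 0.0, 3.5, 2.0, 1.0])
ys = np.array([1.0, 1.0, 3.0, 2.0, 0.0])

euclidean = np.sqrt(((yr - ys) ** 2).sum())  # dissimilarity: grows with difference
match_prop = (yr == ys).mean()               # similarity: grows with agreement
print(euclidean, match_prop)                 # 1.5 0.4
```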

Condition (3) implies that the measure is symmetric, so that the dissimilarity measure that compares $y_r$ (object r) with $y_s$ (object s) is the same as the comparison for object s versus object r. Condition (2) requires the measure to be zero whenever object r equals object s.

Cluster Analysis

To initiate a cluster analysis, one constructs a proximity matrix. The proximity matrix represents the strength of the relationship between pairs of rows (objects) of the data matrix $Y_{n \times p}$. Algorithms designed to perform cluster analysis are usually divided into two broad classes, called hierarchical and nonhierarchical clustering methods. Generally speaking, hierarchical methods generate a sequence of cluster solutions, beginning with each object in its own cluster and successively merging clusters.

Agglomerative Hierarchical Clustering Methods

Agglomerative hierarchical clustering methods use the elements of a proximity matrix to generate a tree diagram, or dendrogram, as shown in Figure 9.3.1. To begin the process, we start with n = 5 clusters, which are the branches of the tree. Combining item 1 with 2 reduces the number of clusters by one, from 5 to 4. Joining items 3 and 4 results in 3 clusters. Next, joining item 5 with the cluster (3, 4) results in 2 clusters. Finally, all items are combined to form a single cluster, the root of the tree.

Average Link Method

When comparing two clusters of objects R and S, the single link and complete link methods of combining clusters depend only upon a single pair of objects within each cluster. Instead of using a minimum or maximum measure, the average link method calculates the distance between two clusters using the average of the dissimilarities between all pairs of objects,

$d_{ave}(R, S) = \frac{1}{n_R n_S} \sum_{r \in R} \sum_{s \in S} d_{rs}$

where $r \in R$, $s \in S$, and $n_R$ and $n_S$ represent the number of objects in each cluster. Hence the dissimilarities in Step 3 are replaced by an average of the $n_R n_S$ dissimilarities between all pairs of elements $r \in R$ and $s \in S$.

Centroid Method

In the average link method, the distance between two clusters is defined as an average of dissimilarity measures. Alternatively, suppose cluster R contains $n_R$ elements and cluster S contains $n_S$ elements. Then the centroids for the two clusters are

$\bar{y}_r = \frac{1}{n_R} \sum_{i \in R} y_i \qquad \bar{y}_s = \frac{1}{n_S} \sum_{i \in S} y_i$

and the square of the Euclidean distance between the two clusters is $d_{rs}^2 = \|\bar{y}_r - \bar{y}_s\|^2$. For the centroid agglomerative process, one begins with any dissimilarity matrix D (in SAS, the distances are squared unless one uses the NOSQUARE option). Then the two most similar clusters are combined using the weighted average of the two clusters. Letting T represent the new cluster, the centroid of T is

$\bar{y}_t = \frac{n_R \bar{y}_r + n_S \bar{y}_s}{n_R + n_S}$

The centroid method is called the median method if an unweighted average of the centroids is used, $\bar{y}_t = (\bar{y}_r + \bar{y}_s)/2$. The median method is preferred when $n_R \gg n_S$ or $n_S \gg n_R$. Letting the dissimilarity matrix be $D = [d_{rs}^2]$, where $d_{rs}^2 = \|y_r - y_s\|^2$, suppose the elements $r \in R$ and $s \in S$ are combined into a cluster T with centroid $\bar{y}_t$ as above.

Then to calculate the square of the Euclidean distance between cluster T and the centroid $\bar{y}_u$ of a third cluster U, the following formula may be used:

$d_{tu}^2 = \frac{n_R}{n_R + n_S}\, d_{ru}^2 + \frac{n_S}{n_R + n_S}\, d_{su}^2 - \frac{n_R n_S}{(n_R + n_S)^2}\, d_{rs}^2$

This is a special case of a general algorithm for updating proximity measures for the single link, complete link, average link, centroid, and median methods developed by Williams and Lance (1977).
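For readers who want to experiment, SciPy implements all of the linkages discussed here; the sketch below (an assumption about tooling, not part of the original text) builds a tree on toy data and cuts it into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(4, 0.5, (5, 2))])

# method may be 'single', 'complete', 'average', 'centroid', 'median', or 'ward'
Z = linkage(X, method='average')
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree at 2 clusters
# dendrogram(Z)  # draws the tree (requires matplotlib)
```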

Ward's (Incremental Sum of Squares) Method

Given n objects with p variables, the sum of squares within clusters, where each object forms its own group, is zero. For all objects in a single group, the sum of squares within clusters, the sum of squares error, is equal to the total sum of squares.

Thus the sum of squares within clusters is between zero and SSE. Ward's method for forming clusters joins objects based upon minimizing the minimal increment in the within, or error, sum of squares. At each step of the process, $n(n-1)/2$ pairs of clusters are considered, and the two clusters whose merger increases the error sum of squares least are joined. The process is continued until all objects are joined. The dendrogram is constructed based upon the minimum increase in the sum of squares for error. To see how the process works, let

$SSE_r = \sum_{i \in R} \|y_i - \bar{y}_r\|^2 \qquad SSE_s = \sum_{i \in S} \|y_i - \bar{y}_s\|^2$

for clusters R and S. Combining clusters R and S to form cluster T, the error sum of squares for cluster T is

$SSE_t = \sum_{i \in T} \|y_i - \bar{y}_t\|^2$

where $\bar{y}_t = (n_R \bar{y}_r + n_S \bar{y}_s)/(n_R + n_S)$. Then the incremental increase in joining R and S to form cluster T is $SSE_t - (SSE_r + SSE_s)$. Or, letting $SSE_t$ be the total sum of squares and $SSE_r + SSE_s$ the within-cluster sum of squares, the incremental increase in the error sum of squares is nothing more than the between-cluster sum of squares. The incremental between-cluster sum of squares (IBCSS) is

$IBCSS = \frac{n_R n_S}{n_R + n_S}\, \|\bar{y}_r - \bar{y}_s\|^2 \qquad (9.38)$

For clusters with one object each, (9.38) becomes $d_{rs}^2/2$. Hence, starting with a dissimilarity matrix $D = [d_{rs}^2]$, where $d_{rs}^2$ is the square of the Euclidean distance (the default in SAS for Ward's method) between objects r and s, the two most similar objects are combined, and the new incremental sum of squares proximity matrix has elements $p_{rs} = d_{rs}^2/2$. Combining objects r and s to form a new cluster with mean $\bar{y}_t$ using (9.35), the incremental increase in the error sum of squares may be calculated using the updating formula developed by Williams and Lance (1977).
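The identity behind (9.38) is easy to verify numerically; the following NumPy sketch (illustrative only) checks that the increase in error sum of squares from merging two clusters equals the incremental between-cluster sum of squares:

```python
import numpy as np

rng = np.random.default_rng(2)
R, S = rng.normal(0, 1, (6, 3)), rng.normal(2, 1, (4, 3))

def sse(A):
    return ((A - A.mean(axis=0)) ** 2).sum()  # within-cluster error sum of squares

increment = sse(np.vstack([R, S])) - (sse(R) + sse(S))
nR, nS = len(R), len(S)
ibcss = nR * nS / (nR + nS) * ((R.mean(axis=0) - S.mean(axis=0)) ** 2).sum()
print(np.isclose(increment, ibcss))  # True
```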

Cluster analysis is used to categorize objects such as patients, voters, products, institutions, countries, and cities, among others, into homogeneous groups based upon a vector of p variables. The clustering of objects into homogeneous groups depends on the scale of the variables, the algorithm used for clustering, and the criterion used to estimate the number of clusters. In general, variables with very large variances relative to others, as well as outliers, have an adverse effect on cluster analysis methods: they tend to dominate the proximity measure.

Cluster analysis is an exploratory data analysis methodology. It tries to discover how objects may or may not be combined. The analysis depends on the amount of random noise in the data, the existence of outliers in the data, the variables selected for the analysis, the proximity measure used, the spatial properties of the data, and the clustering method employed.

Two Step Cluster Analysis

The Two Step Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a dataset that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:

• Handling of categorical and continuous variables. By assuming variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables.

• Automatic selection of number of clusters. By comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.

• Scalability. By constructing a cluster features (CF) tree that summarizes the records, the TwoStep algorithm allows you to analyze large data files.


Example Retail and consumer product companies regularly apply clustering techniques to data that describe their customers' buying habits, gender, age, income level, etc. These companies tailor their marketing and product development strategies to each consumer group to increase sales and build brand loyalty.

Distance Measure This selection determines how the similarity between two clusters is computed.

• Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent.

• Euclidean. The Euclidean measure is the straight-line distance between two clusters. It can be used only when all of the variables are continuous.

Number of Clusters This selection allows you to specify how the number of clusters is to be determined.

• Determine automatically. The procedure will automatically determine the best number of clusters, using the criterion specified in the Clustering Criterion group. Optionally, enter a positive integer specifying the maximum number of clusters that the procedure should consider.

• Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive integer.

Count of Continuous Variables This group provides a summary of the continuous variable standardization specifications made in the Options dialog box. See the topic TwoStep Cluster Analysis Options for more information.

Clustering Criterion This selection determines how the automatic clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.
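SPSS does not publish the internals of its BIC computation for TwoStep, but the idea of choosing the number of clusters by an information criterion can be sketched with scikit-learn's Gaussian mixtures (a stand-in assumed for illustration, not the TwoStep algorithm):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.7, (100, 2)) for m in (0, 4, 8)])

bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 8)}
best_k = min(bic, key=bic.get)  # smallest BIC wins
print(best_k)                   # typically 3 for this data
```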

Data This procedure works with both continuous and categorical variables. Cases represent objects to be clustered, and the variables represent attributes upon which the clustering is based.

Case Order Note that the cluster features tree and the final solution may depend on the order of cases. To minimize order effects, randomly order the cases. You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution. In situations where this is difficult due to extremely large file sizes, multiple runs with a sample of cases sorted in different random orders might be substituted.

Assumptions The likelihood distance measure assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

Use the Bivariate Correlations procedure to test the independence of two continuous variables. Use the Crosstabs procedure to test the independence of two categorical variables. Use the Means procedure to test the independence between a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable. Use the Chi-Square Test procedure to test whether a categorical variable has a specified multinomial distribution.

To Obtain a TwoStep Cluster Analysis

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > TwoStep Cluster

Select one or more categorical or continuous variables.

Optionally, you can:

• Adjust the criteria by which clusters are constructed.

• Select settings for noise handling, memory allocation, variable standardization, and cluster model input.

• Request model viewer output.

• Save model results to the working file or to an external XML file.

TwoStep Cluster Analysis Options

Outlier Treatment This group allows you to treat outliers specially during clustering if the cluster features (CF) tree fills. The CF tree is full if it cannot accept any more cases in a leaf node and no leaf node can be split.

• If you select noise handling and the CF tree fills, it will be regrown after placing cases in sparse leaves into a noise leaf. A leaf is considered sparse if it contains fewer than the specified percentage of cases of the maximum leaf size. After the tree is regrown, the outliers will be placed in the CF tree if possible. If not, the outliers are discarded.

• If you do not select noise handling and the CF tree fills, it will be regrown using a larger distance change threshold. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of −1 and is not included in the count of the number of clusters.

Memory Allocation This group allows you to specify the maximum amount of memory in megabytes (MB) that the cluster algorithm should use. If the procedure exceeds this maximum, it will use the disk to store information that will not fit in memory. Specify a number greater than or equal to 4.

• Consult your system administrator for the largest value that you can specify on your system.

• The algorithm may fail to find the correct or desired number of clusters if this value is too low.

Variable standardization The clustering algorithm works with standardized continuous variables. Any continuous variables that are not standardized should be left as variables in the To be Standardized list. To save some time and computational effort, you can select any continuous variables that you have already standardized as variables in the Assumed Standardized list.

Advanced Options

CF Tree Tuning Criteria The following clustering algorithm settings apply specifically to the cluster features (CF) tree and should be changed with care.

• Initial Distance Change Threshold. This is the initial threshold used to grow the CF tree. If inserting a given case into a leaf of the CF tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split.

• Maximum Branches (per leaf node). The maximum number of child nodes that a leaf node can have.

• Maximum Tree Depth. The maximum number of levels that the CF tree can have.

• Maximum Number of Nodes Possible. This indicates the maximum number of CF tree nodes that could potentially be generated by the procedure, based on the function (b^(d+1) − 1) / (b − 1), where b is the maximum branches and d is the maximum tree depth. Be aware that an overly large CF tree can be a drain on system resources and can adversely affect the performance of the procedure. At a minimum, each node requires 16 bytes.
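As a worked example of the node bound, with b = 8 branches and d = 3 levels (values assumed here for illustration):

```python
b, d = 8, 3                                # maximum branches, maximum depth
max_nodes = (b ** (d + 1) - 1) // (b - 1)  # (b**(d+1) - 1) / (b - 1)
print(max_nodes, max_nodes * 16)           # 585 nodes, at least 9360 bytes
```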

Cluster Model Update This group allows you to import and update a cluster model generated in a prior analysis. The input file contains the CF tree in XML format. The model will then be updated with the data in the active file. You must select the variable names in the main dialog box in the same order in which they were specified in the prior analysis. The XML file remains unaltered unless you specifically write the new model information to the same filename. See the topic TwoStep Cluster Analysis Output for more information.

If a cluster model update is specified, the options pertaining to generation of the CF tree that were specified for the original model are used. More specifically, the distance measure, noise handling, memory allocation, and CF tree tuning criteria settings for the saved model are used, and any settings for these options in the dialog boxes are ignored.

Note: When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in the model update come from the same population as the cases used to create the original model; that is, the means and variances of continuous variables and levels of categorical variables are assumed to be the same across both sets of cases. If your new and old sets of cases come from heterogeneous populations, you should run the TwoStep Cluster Analysis procedure on the combined sets of cases for the best results.

TwoStep Cluster Analysis Output

Model Viewer Output This group provides options for displaying the clustering results.

• Charts and tables. Displays model-related output, including tables and charts. Tables in the model view include a model summary and cluster-by-features grid. Graphical output in the model view includes a cluster quality chart, cluster sizes, variable importance, cluster comparison grid, and cell information.

• Evaluation fields. This calculates cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the input features in the model viewer by selecting them in the Display subdialog. Fields with missing values are ignored.

Working Data File This group allows you to save variables to the active dataset.

• Create cluster membership variable. This variable contains a cluster identification number for each case. The name of this variable is tsc_n, where n is a positive integer indicating the ordinal of the active dataset save operation completed by this procedure in a given session.

XML Files The final cluster model and CF tree are two types of output files that can be exported in XML format.

• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish membership in one cluster rather than another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked, or auxiliary, view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check if the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings to produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating as either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B − A) / max(A, B), where A is the record's distance to its own cluster center and B is the record's distance to the nearest cluster center that it doesn't belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of −1 would mean all cases are located on the cluster centers of some other cluster. A value of 0 means, on average, cases are equidistant between their own cluster center and the nearest other cluster.
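That description translates directly into code. A NumPy sketch follows; note this is the center-based variant described above, whereas the classical silhouette of Kaufman and Rousseeuw averages distances to cluster members rather than centers:

```python
import numpy as np

def silhouette_from_centers(X, labels, centers):
    # distance from every record to every cluster center
    d = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    A = d[np.arange(len(X)), labels]           # distance to own center
    d_other = d.copy()
    d_other[np.arange(len(X)), labels] = np.inf
    B = d_other.min(axis=1)                    # distance to nearest other center
    return float(np.mean((B - A) / np.maximum(A, B)))
```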

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, TwoStep.

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (this is blank by default). Double-click in the cell to enter a label that describes the cluster contents, for example, "Luxury car buyers".

• Description. Any description of the cluster contents (this is blank by default). Double-click in the cell to enter a description of the cluster, for example, "55+ years of age, professionals, earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value, for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields.

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields, and the mode (most frequently occurring category) with category percentage is shown for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot that uses the same endpoints and intervals for each cluster.

The solid red colored display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

Again, the solid red colored display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and percentage of the whole).

• The size of the largest cluster (both a count and percentage of the whole).

• The ratio of size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data, but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent (modal) category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can also reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features Selected by default. To hide all input features, deselect the check box.

Evaluation Fields Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.
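Outside of the menus, the effect of the generated filter variable can be pictured with a small pandas analog (illustrative only; the column names here are hypothetical, not the ones SPSS creates):

```python
import pandas as pd

df = pd.DataFrame({"cluster": [1, 2, 3, 1, 2], "x": [5, 3, 8, 2, 7]})
selected = {1, 3}                                  # clusters chosen in the viewer
df["filter_"] = df["cluster"].isin(selected).astype(int)  # 1 = keep, 0 = exclude
subset = df[df["filter_"] == 1]                    # later analyses use these rows
```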

Hierarchical Cluster Analysis This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.

Hierarchical Cluster Analysis Method

Cluster Method Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russell and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values Allows you to standardize data values for either cases or values before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data (a computational sketch follows the list):

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
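Most of these measures have direct counterparts in scipy.spatial.distance; the sketch below (tooling assumed for illustration, not part of the original documentation) also writes out the "customized" measure, which has no single SciPy call:

```python
import numpy as np
from scipy.spatial.distance import (euclidean, sqeuclidean, chebyshev,
                                    cityblock, minkowski)

u, v = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.5])

d_euc = euclidean(u, v)          # square root of sum of squared differences
d_sq = sqeuclidean(u, v)         # sum of squared differences
d_cheb = chebyshev(u, v)         # maximum absolute difference
d_block = cityblock(u, v)        # sum of absolute differences (Manhattan)
d_mink = minkowski(u, v, p=3)    # pth root of sum of |diff|**p

p, r = 3, 2                      # 'customized': rth root of sum of |diff|**p
d_custom = float((np.abs(u - v) ** p).sum() ** (1.0 / r))
```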

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data (see the sketch after this list):

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
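A sketch consistent with these descriptions (an assumed implementation, not the vendor's code): the chi-square measure is taken as the square root of the chi-square statistic computed from the 2 x k table formed by the two frequency profiles, and the phi-square measure divides it by the square root of the combined frequency:

```python
import numpy as np

def chisq_measure(x, y):
    table = np.vstack([x, y]).astype(float)     # 2 x k table of frequencies
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return np.sqrt(((table - expected) ** 2 / expected).sum())

x, y = np.array([10, 20, 30]), np.array([30, 20, 10])
d_chisq = chisq_measure(x, y)
d_phisq = d_chisq / np.sqrt(x.sum() + y.sum())  # normalized by combined frequency
```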

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data (a sketch of the fourfold table and several of these measures follows the list):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n^2), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
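The sketch below (illustrative Python) builds the fourfold table used by all of these measures, with a = both present, b and c = present on one item but absent on the other, and d = both absent, then computes a few of the indices from the list:

```python
import numpy as np

def fourfold(u, v):
    u, v = np.asarray(u, bool), np.asarray(v, bool)
    a = int((u & v).sum())      # present on both items
    b = int((u & ~v).sum())     # present on first only
    c = int((~u & v).sum())     # present on second only
    d = int((~u & ~v).sum())    # absent on both
    return a, b, c, d

a, b, c, d = fourfold([1, 1, 0, 1, 0, 0, 1], [1, 0, 0, 1, 1, 0, 1])
n = a + b + c + d
euclid = (b + c) ** 0.5             # binary Euclidean distance
jaccard = a / (a + b + c)           # joint absences excluded
dice = 2 * a / (2 * a + b + c)      # matches weighted double
matching = (a + d) / n              # simple matching similarity
```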

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

Hierarchical Cluster Analysis Statistics

Agglomeration schedule Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix Gives the distances or similarities between items.

Cluster Membership Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary DataThe following dissimilarity measures are available for binary data

bull Euclidean distance Computed from a fourfold table as SQRT(b+c) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other

bull Squared Euclidean distance Computed as the number of discordant cases Its minimum value is 0 and it has no upper limit

bull Size difference An index of asymmetry It ranges from 0 to 1

bull Pattern difference Dissimilarity measure for binary data that ranges from 0 to 1 Computed from a fourfold table as bc(n2) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations

bull Variance Computed from a fourfold table as (b+c)4n where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations It ranges from 0 to 1

bull Shape This distance measure has a range of 0 to 1 and it penalizes asymmetry of mismatches

bull Lance and Williams Computed from a fourfold table as (b+c)(2a+b+c) where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other This measure has a range of 0 to 1 (Also known as the Bray-Curtis nonmetric coefficient)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
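The fourfold-table measures above translate directly into code. A hedged NumPy sketch follows; the size-difference formula (b − c)^2 / n^2 is a commonly cited form that the text does not spell out, so treat that branch as an assumption:

    import numpy as np

    def fourfold(u, v):
        """Counts (a, b, c, d) for two 0/1 vectors: a = both present,
        b and c = present on one item but absent on the other,
        d = both absent."""
        u, v = np.asarray(u, bool), np.asarray(v, bool)
        return (np.sum(u & v), np.sum(u & ~v),
                np.sum(~u & v), np.sum(~u & ~v))

    def binary_dissimilarity(u, v, measure="euclid"):
        a, b, c, d = fourfold(u, v)
        n = a + b + c + d
        if measure == "euclid":     # SQRT(b + c)
            return np.sqrt(b + c)
        if measure == "seuclid":    # number of discordant cases
            return b + c
        if measure == "size":       # (b - c)^2 / n^2 -- assumed form
            return (b - c) ** 2 / n ** 2
        if measure == "pattern":    # bc / n^2
            return b * c / n ** 2
        if measure == "variance":   # (b + c) / 4n
            return (b + c) / (4 * n)
        if measure == "lance":      # (b + c) / (2a + b + c)
            return (b + c) / (2 * a + b + c)
        raise ValueError(measure)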

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.
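Both measures are one-liners in NumPy; this sketch is illustrative only:

    import numpy as np

    def pearson_similarity(x, y):
        """Product-moment correlation between two vectors of values."""
        return float(np.corrcoef(x, y)[0, 1])

    def cosine_similarity(x, y):
        """Cosine of the angle between two vectors of values."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))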

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
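A short sketch of a few of these indices in terms of the fourfold-table counts a, b, c, d (joint presences, the two kinds of mismatches, and joint absences); the formulas in the comments are the standard textbook forms and should be checked against the procedure if exactness matters:

    import numpy as np

    def binary_similarity(u, v, measure="jaccard"):
        """A handful of the binary similarity indices defined above."""
        u, v = np.asarray(u, bool), np.asarray(v, bool)
        a = np.sum(u & v)       # present on both items
        b = np.sum(u & ~v)      # present on first only
        c = np.sum(~u & v)      # present on second only
        d = np.sum(~u & ~v)     # absent on both items
        n = a + b + c + d
        if measure == "russell_rao":      # binary dot product: a / n
            return a / n
        if measure == "simple_matching":  # matches / total: (a + d) / n
            return (a + d) / n
        if measure == "jaccard":          # joint absences excluded
            return a / (a + b + c)
        if measure == "dice":             # matches weighted double
            return 2 * a / (2 * a + b + c)
        if measure == "hamann":           # (matches - nonmatches) / total
            return ((a + d) - (b + c)) / n
        if measure == "yule_q":           # cross-ratio based: (ad-bc)/(ad+bc)
            return (a * d - b * c) / (a * d + b * c)
        raise ValueError(measure)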

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
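These transformations are easy to express directly; a minimal sketch, assuming axis 0 means "by variable" (columns) and axis 1 means "by case" (rows):

    import numpy as np

    def transform_values(m, method="z", axis=0):
        """Standardize a data matrix before computing proximities."""
        m = np.asarray(m, float)
        mean = m.mean(axis, keepdims=True)
        sd = m.std(axis, keepdims=True)
        mn = m.min(axis, keepdims=True)
        mx = m.max(axis, keepdims=True)
        if method == "z":           # mean 0, standard deviation 1
            return (m - mean) / sd
        if method == "range_pm1":   # divide by the range
            return m / (mx - mn)
        if method == "range01":     # subtract minimum, divide by range
            return (m - mn) / (mx - mn)
        if method == "maxmag1":     # divide by the maximum
            return m / mx
        if method == "mean1":       # divide by the mean
            return m / mean
        if method == "sd1":         # divide by the standard deviation
            return m / sd
        raise ValueError(method)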

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases or variables individually, or both cases and variables simultaneously. K-Clustering clusters cases only, and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured).

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; its range is between −1 and 1.

K-nbd. The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distance between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.
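Outside SYSTAT, most of these joining algorithms are also available in SciPy. A brief sketch (the data here are synthetic, and SciPy's method names cover only a subset of the list above):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 4))        # 30 cases, 4 variables

    # 'single', 'complete', 'average', 'weighted', 'centroid',
    # 'median', and 'ward' correspond to linkage methods described above.
    Z = linkage(X, method="ward")       # amalgamation schedule

    labels = fcluster(Z, t=3, criterion="maxclust")  # cut tree at 3 clusters
    dendrogram(Z)                       # tree display (needs matplotlib)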

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Provides two options: to save cluster identifiers, or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file; otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate the partition quality. In particular, they are used to find the appropriate number of clusters for the given data set.

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable to rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. They continue splitting one of the clusters into two (and reassigning cases) until a specified number of clusters is formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to one of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.
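A minimal sketch of the default (None) seeding strategy: repeatedly take the case farthest from its nearest current seed as the next seed. This is an illustrative reading of the description above, not SYSTAT's actual code:

    import numpy as np

    def split_seeds(X, k):
        """Grow k seeds by repeatedly promoting the case farthest from
        its nearest seed, starting from the overall mean (one cluster)."""
        seeds = [X.mean(axis=0)]                 # start with one cluster
        for _ in range(1, k):
            centers = np.asarray(seeds)
            # squared distance of every case to every current seed
            d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
            nearest = d.min(axis=1)              # distance to nearest seed
            seeds.append(X[np.argmax(nearest)])  # farthest case becomes a seed
        return np.asarray(seeds)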

• Two Step Cluster Analysis
• To Obtain a TwoStep Cluster Analysis
• TwoStep Cluster Analysis Options
• Advanced Options
• TwoStep Cluster Analysis Output
• The Cluster Viewer
• Transpose Clusters and Features
• Sort Features
• Sort Clusters
• Cell Contents
• Cluster Comparison View
• Navigating the Cluster Viewer
• Using the Toolbars
• Control Cluster View Display
• Filtering Records
• Hierarchical Cluster Analysis
• Hierarchical Cluster Analysis Method
• Hierarchical Cluster Analysis Measures for Interval Data
• Hierarchical Cluster Analysis Measures for Count Data
• Hierarchical Cluster Analysis Measures for Binary Data
• Hierarchical Cluster Analysis Transform Values
• Hierarchical Cluster Analysis Statistics
• Hierarchical Cluster Analysis Plots
• Hierarchical Cluster Analysis Save New Variables
• Saving New Variables
• K-Means Cluster Analysis
• K-Means Cluster Analysis Efficiency
• K-Means Cluster Analysis Iterate
• K-Means Cluster Analysis Save
• K-Means Cluster Analysis Options
• Distances
• To Obtain Distance Matrices
• Distances Dissimilarity Measures
• Distances Dissimilarity Measures for Interval Data
• Distances Dissimilarity Measures for Count Data
• Distances Dissimilarity Measures for Binary Data
• Distances Similarity Measures
• Distances Similarity Measures for Interval Data
• Distances Similarity Measures for Binary Data
• Distances Transform Values

In the average link method, the distance between two clusters is defined as an average of the dissimilarity measures between pairs of objects, one from each cluster. Alternatively, suppose cluster R contains n_R elements and cluster S contains n_S elements. Then the centroids of the two clusters are

\bar{y}_r = \frac{1}{n_R} \sum_{i \in R} y_i, \qquad \bar{y}_s = \frac{1}{n_S} \sum_{i \in S} y_i,

and the square of the Euclidean distance between the two clusters is d_{rs}^2 = \lVert \bar{y}_r - \bar{y}_s \rVert^2. For the centroid agglomerative process, one begins with any dissimilarity matrix D (in SAS the distances are squared unless one uses the NOSQUARE option). Then the two most similar clusters are combined using the weighted average of the two clusters. Letting T represent the new cluster, the centroid of T is

\bar{y}_t = \frac{n_R \bar{y}_r + n_S \bar{y}_s}{n_R + n_S}.

The centroid method is called the median method if an unweighted average of the centroids is used, \bar{y}_t = (\bar{y}_r + \bar{y}_s)/2. The median method is preferred when n_R >> n_S or n_S >> n_R. Letting the dissimilarity matrix be D = [d_{rs}^2], where d_{rs}^2 = \lVert \bar{y}_r - \bar{y}_s \rVert^2, suppose the elements r \in R and s \in S are combined into a cluster T. Then, to calculate the square of the Euclidean distance between cluster T and the centroid \bar{y}_u of a third cluster U, the following formula may be used:

d_{tu}^2 = \frac{n_R}{n_R + n_S} d_{ru}^2 + \frac{n_S}{n_R + n_S} d_{su}^2 - \frac{n_R n_S}{(n_R + n_S)^2} d_{rs}^2.

This is a special case of a general algorithm for updating proximity measures for the single link, complete link, average link, centroid, and median methods developed by Williams and Lance (1977).

Ward's (Incremental Sum of Squares) Method

Given n objects with p variables, the sum of squares within clusters is zero when each object forms its own group. With all objects in a single group, the sum of squares within clusters (the error sum of squares, SSE) is equal to the total sum of squares. Thus, the sum of squares within clusters is between zero and SSE. Ward's method for forming clusters joins objects based upon minimizing the minimal increment in the within, or error, sum of squares. At each step of the process, n(n-1)/2 pairs of clusters are formed, and the two objects that increase the sum of squares for error least are joined. The process is continued until all objects are joined. The dendrogram is constructed based upon the minimum increase in the sum of squares for error. To see how the process works, let

SSE_r = \sum_{i \in R} (y_i - \bar{y}_r)'(y_i - \bar{y}_r)

for clusters R and S (with SSE_s defined analogously). Combining clusters R and S to form cluster T, the error sum of squares for cluster T is

SSE_t = \sum_{i \in T} (y_i - \bar{y}_t)'(y_i - \bar{y}_t),

where \bar{y}_t = (n_R \bar{y}_r + n_S \bar{y}_s)/(n_R + n_S). Then the incremental increase in joining R and S to form cluster T is SSE_t - (SSE_r + SSE_s). Or, letting SSE_t be the total sum of squares and SSE_r + SSE_s the within-cluster sum of squares, the incremental increase in the error sum of squares is precisely the between-cluster sum of squares. The incremental between-cluster sum of squares (IBCSS) is

IBCSS = \frac{n_R n_S}{n_R + n_S} (\bar{y}_r - \bar{y}_s)'(\bar{y}_r - \bar{y}_s).

For clusters with one object each, this becomes d_{rs}^2/2. Hence, starting with a dissimilarity matrix D = [d_{rs}^2], where d_{rs}^2 is the square of the Euclidean distance (the default in SAS for Ward's method) between objects r and s, the two most similar objects are combined, and the new incremental sum of squares proximity matrix has elements p_{rs} = d_{rs}^2/2. Combining objects r and s to form a new cluster with mean \bar{y}_t as above, the incremental increase in the error sum of squares between the merged cluster and a third cluster U may be calculated using the formula developed by Williams and Lance (1977):

p_{tu} = \frac{(n_R + n_U) p_{ru} + (n_S + n_U) p_{su} - n_U p_{rs}}{n_R + n_S + n_U}.
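For reference, the update above is a one-line function; a sketch using the p_rs = d^2_rs / 2 initialization described in the text:

    def ward_update(p_ru, p_su, p_rs, n_r, n_s, n_u):
        """Williams-Lance update for Ward's method: proximity between the
        merged cluster T = R u S and a third cluster U.

        For singleton clusters (n_r = n_s = n_u = 1) with p = d^2 / 2,
        this reproduces the incremental sum-of-squares merge cost."""
        n_t = n_r + n_s
        return ((n_r + n_u) * p_ru + (n_s + n_u) * p_su
                - n_u * p_rs) / (n_t + n_u)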

Cluster analysis is used to categorize objects, such as patients, voters, products, institutions, countries, and cities, among others, into homogeneous groups based upon a vector of p variables. The clustering of objects into homogeneous groups depends on the scale of the variables, the algorithm used for clustering, and the criterion used to estimate the number of clusters. In general, variables with very large variances relative to others, as well as outliers, have an adverse effect on cluster analysis methods; they tend to dominate the proximity measure.

Cluster analysis is an exploratory data analysis methodology: it tries to discover how objects may or may not be combined. The analysis depends on the amount of random noise in the data, the existence of outliers in the data, the variables selected for the analysis, the proximity measure used, the spatial properties of the data, and the clustering method employed.

Two Step Cluster Analysis

The Two Step Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a dataset that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:

• Handling of categorical and continuous variables. By assuming variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables.

• Automatic selection of the number of clusters. By comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.

• Scalability. By constructing a cluster features (CF) tree that summarizes the records, the TwoStep algorithm allows you to analyze large data files.


Example: Retail and consumer product companies regularly apply clustering techniques to data that describe their customers' buying habits, gender, age, income level, etc. These companies tailor their marketing and product development strategies to each consumer group to increase sales and build brand loyalty.

Distance Measure. This selection determines how the similarity between two clusters is computed.

• Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent.

• Euclidean. The Euclidean measure is the straight-line distance between two clusters. It can be used only when all of the variables are continuous.

Number of Clusters. This selection allows you to specify how the number of clusters is to be determined.

• Determine automatically. The procedure will automatically determine the "best" number of clusters, using the criterion specified in the Clustering Criterion group. Optionally, enter a positive integer specifying the maximum number of clusters that the procedure should consider.

• Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive integer.

Count of Continuous Variables. This group provides a summary of the continuous variable standardization specifications made in the Options dialog box. See the topic TwoStep Cluster Analysis Options for more information.

Clustering Criterion. This selection determines how the automatic clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.

Data. This procedure works with both continuous and categorical variables. Cases represent objects to be clustered, and the variables represent attributes upon which the clustering is based.

Case Order. Note that the cluster features tree and the final solution may depend on the order of cases. To minimize order effects, randomly order the cases. You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution. In situations where this is difficult due to extremely large file sizes, multiple runs with a sample of cases sorted in different random orders might be substituted.

Assumptions. The likelihood distance measure assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

Use the Bivariate Correlations procedure to test the independence of two continuous variables. Use the Crosstabs procedure to test the independence of two categorical variables. Use the Means procedure to test the independence between a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable. Use the Chi-Square Test procedure to test whether a categorical variable has a specified multinomial distribution.

To Obtain a TwoStep Cluster Analysis

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > TwoStep Cluster

Select one or more categorical or continuous variables.

Optionally, you can:

• Adjust the criteria by which clusters are constructed.

• Select settings for noise handling, memory allocation, variable standardization, and cluster model input.

• Request model viewer output.

• Save model results to the working file or to an external XML file.

TwoStep Cluster Analysis Options

Outlier Treatment. This group allows you to treat outliers specially during clustering if the cluster features (CF) tree fills. The CF tree is full if it cannot accept any more cases in a leaf node and no leaf node can be split.

• If you select noise handling and the CF tree fills, it will be regrown after placing cases in sparse leaves into a "noise" leaf. A leaf is considered sparse if it contains fewer than the specified percentage of cases of the maximum leaf size. After the tree is regrown, the outliers will be placed in the CF tree if possible. If not, the outliers are discarded.

• If you do not select noise handling and the CF tree fills, it will be regrown using a larger distance change threshold. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of −1 and is not included in the count of the number of clusters.

Memory Allocation. This group allows you to specify the maximum amount of memory in megabytes (MB) that the cluster algorithm should use. If the procedure exceeds this maximum, it will use the disk to store information that will not fit in memory. Specify a number greater than or equal to 4.

• Consult your system administrator for the largest value that you can specify on your system.

• The algorithm may fail to find the correct or desired number of clusters if this value is too low.

Variable standardization. The clustering algorithm works with standardized continuous variables. Any continuous variables that are not standardized should be left as variables in the To be Standardized list. To save some time and computational effort, you can select any continuous variables that you have already standardized as variables in the Assumed Standardized list.

Advanced Options

CF Tree Tuning Criteria. The following clustering algorithm settings apply specifically to the cluster features (CF) tree and should be changed with care.

• Initial Distance Change Threshold. This is the initial threshold used to grow the CF tree. If inserting a given case into a leaf of the CF tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split.

• Maximum Branches (per leaf node). The maximum number of child nodes that a leaf node can have.

• Maximum Tree Depth. The maximum number of levels that the CF tree can have.

• Maximum Number of Nodes Possible. This indicates the maximum number of CF tree nodes that could potentially be generated by the procedure, based on the function (b^(d+1) − 1)/(b − 1), where b is the maximum branches and d is the maximum tree depth. Be aware that an overly large CF tree can be a drain on system resources and can adversely affect the performance of the procedure. At a minimum, each node requires 16 bytes.
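For example, a quick check of this bound:

    b, d = 8, 3                         # maximum branches, maximum tree depth
    max_nodes = (b ** (d + 1) - 1) // (b - 1)
    print(max_nodes)                    # 585 nodes; at 16 bytes each, about 9 KB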

Cluster Model Update. This group allows you to import and update a cluster model generated in a prior analysis. The input file contains the CF tree in XML format. The model will then be updated with the data in the active file. You must select the variable names in the main dialog box in the same order in which they were specified in the prior analysis. The XML file remains unaltered unless you specifically write the new model information to the same filename. See the topic TwoStep Cluster Analysis Output for more information.

If a cluster model update is specified, the options pertaining to generation of the CF tree that were specified for the original model are used. More specifically, the distance measure, noise handling, memory allocation, and CF tree tuning criteria settings for the saved model are used, and any settings for these options in the dialog boxes are ignored.

Note: When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in the model update come from the same population as the cases used to create the original model; that is, the means and variances of continuous variables and the levels of categorical variables are assumed to be the same across both sets of cases. If your "new" and "old" sets of cases come from heterogeneous populations, you should run the TwoStep Cluster Analysis procedure on the combined sets of cases for the best results.

TwoStep Cluster Analysis Output

Model Viewer Output. This group provides options for displaying the clustering results.

• Charts and tables. Displays model-related output, including tables and charts. Tables in the model view include a model summary and a cluster-by-features grid. Graphical output in the model view includes a cluster quality chart, cluster sizes, variable importance, a cluster comparison grid, and cell information.

• Evaluation fields. This calculates cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the input features in the model viewer by selecting them in the Display subdialog. Fields with missing values are ignored.

Working Data File. This group allows you to save variables to the active dataset.

• Create cluster membership variable. This variable contains a cluster identification number for each case. The name of this variable is tsc_n, where n is a positive integer indicating the ordinal of the active dataset save operation completed by this procedure in a given session.

XML Files. The final cluster model and the CF tree are two types of output files that can be exported in XML format.

• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records, based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish between membership in one cluster or another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked, or auxiliary, view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check if the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings to produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating as either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B − A) / max(A, B), where A is the record's distance to its own cluster center and B is the record's distance to the nearest cluster center that it doesn't belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of −1 would mean that all cases are located on the cluster centers of some other cluster. A value of 0 means that, on average, cases are equidistant between their own cluster center and the nearest other cluster.
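A direct NumPy transcription of this center-based silhouette (note that it differs from the classical silhouette, which uses average pairwise distances rather than distances to cluster centers):

    import numpy as np

    def silhouette(X, labels, centers):
        """Average of (B - A) / max(A, B) over all records, where A is the
        distance to the record's own cluster center and B is the distance
        to the nearest center of another cluster."""
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=-1)
        rows = np.arange(len(X))
        A = d[rows, labels]
        d[rows, labels] = np.inf        # mask out each record's own center
        B = d.min(axis=1)
        return float(np.mean((B - A) / np.maximum(A, B)))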

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, TwoStep.

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (this is blank by default). Double-click in the cell to enter a label that describes the cluster contents; for example, "Luxury car buyers".

• Description. Any description of the cluster contents (this is blank by default). Double-click in the cell to enter a description of the cluster; for example, "55+ years of age, professionals, earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value; for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields.

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields, and the mode (most frequently occurring category), with category percentage, for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot, which uses the same endpoints and intervals for each cluster.

The solid red display shows the cluster distribution, whilst the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and a percentage of the whole).

• The size of the largest cluster (both a count and a percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent (modal) category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can also reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model, but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.
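Outside SPSS, the same 0/1 filter flag can be derived in a few lines of pandas; the column and variable names here are hypothetical:

    import pandas as pd

    df = pd.DataFrame({"case_id": range(6),
                       "cluster": [1, 2, 1, 3, 2, 1]})  # hypothetical memberships

    selected = {1, 3}                                   # clusters chosen in the viewer
    df["filter_1"] = df["cluster"].isin(selected).astype(int)  # 1 = keep, 0 = exclude

    subset = df[df["filter_1"] == 1]                    # use this in later analyses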

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example: Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics: Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russell and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or variables before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n^2), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/(4n), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
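The same sample-then-classify workflow can be sketched with scikit-learn (an assumed open-source stand-in, not the SPSS command): estimate centers on a sample with the equivalent of Iterate and classify, then reuse those centers to classify the full file.

```python
# Sketch: fit centers on a sample, then "classify only" the full data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100_000, 4))
X_sample = X_full[rng.choice(len(X_full), 5_000, replace=False)]

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_sample)  # "Iterate and classify"
labels_full = km.predict(X_full)  # "Classify only", reusing the sample's final centers
```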

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.
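The difference between the two update rules can be sketched as follows; this is an illustrative implementation of the running-means idea, not the procedure's code.

```python
# Sketch of "use running means": the assigned cluster's center is updated
# immediately after each case, rather than once per full pass.
import numpy as np

def assign_with_running_means(X, centers):
    centers = centers.astype(float).copy()
    counts = np.ones(len(centers))            # each center starts from one seed case
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        j = np.argmin(((centers - x) ** 2).sum(axis=1))
        labels[i] = j
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]   # incremental mean update
    return labels, centers
```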

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses.

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0-1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
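Minkowski and the customized measure differ only in that the customized measure decouples the root r from the power p. A small sketch, with helper names of our own choosing:

```python
# Illustrative sketch of the Minkowski and "customized" measures.
import numpy as np

def minkowski(x, y, p):
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

def customized(x, y, p, r):
    return (np.abs(x - y) ** p).sum() ** (1.0 / r)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(minkowski(x, y, 2))      # Euclidean distance when p = 2
print(customized(x, y, 2, 1))  # squared Euclidean distance when p = 2, r = 1
```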

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0-1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.

• Dispersion. This index has a range of -1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases or variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured):

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between -1 and 1.

K-nbd. The Kth nearest neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the Kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible." Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.
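For readers who want to experiment outside SYSTAT, several of these linkage methods are available in scipy (an assumed substitute; the density and flexible-beta linkages described above are not included there):

```python
# A scipy sketch of agglomerative joining with several linkage methods.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="ward")      # also: "single", "complete", "average",
                                   # "centroid", "median", "weighted"
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
# dendrogram(Z) would draw the tree when a plotting backend is attached
```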

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers, or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix to compute Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate the partition quality. In particular, these are used to find the appropriate number of clusters for the given data set:

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation. This is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. They continue splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides K-Means and K-Medians clustering options.

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds, to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to one of k partitions randomly, then computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.


The error sums of squares SSE_r and SSE_s are computed for clusters R and S. Combining clusters R and S to form cluster T, the error sum of squares for cluster T is

SSE_t = SUM over cases i in T of (y_i - ybar_t)'(y_i - ybar_t), where ybar_t = (n_R ybar_r + n_S ybar_s) / (n_R + n_S).

Then the incremental increase in joining R and S to form cluster T is SSE_t - (SSE_r + SSE_s). Or, letting SSE_t be the total sum of squares and SSE_r + SSE_s the within-cluster sum of squares, the incremental increase in the error sum of squares is no more than the between-cluster sum of squares. The incremental between-cluster sum of squares (IBCSS) is

IBCSS = [n_R n_S / (n_R + n_S)] (ybar_r - ybar_s)'(ybar_r - ybar_s),

which, for clusters containing a single object each, reduces to d²_rs / 2. Hence, starting with a dissimilarity matrix D = [d²_rs], where d²_rs is the square of the Euclidean distance (the default in SAS for Ward's method) between objects r and s, the two most similar objects are combined, and the new incremental sum of squares proximity matrix has elements p_rs = d²_rs / 2. Combining objects r and s to form a new cluster with mean ybar_t, the incremental increase in the error sum of squares may be calculated using the recurrence formula developed by Williams and Lance (1977).
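Assuming the reconstruction above, the identity can be checked numerically; the helper below is a sketch, not textbook code.

```python
# Check: the direct increase SSE_t - (SSE_r + SSE_s) equals
# n_R*n_S/(n_R+n_S) * ||ybar_r - ybar_s||^2.
import numpy as np

def sse(Y):
    return ((Y - Y.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(1)
R, S = rng.normal(0, 1, (6, 3)), rng.normal(2, 1, (4, 3))
T = np.vstack([R, S])

direct = sse(T) - (sse(R) + sse(S))
nR, nS = len(R), len(S)
ibcss = nR * nS / (nR + nS) * ((R.mean(axis=0) - S.mean(axis=0)) ** 2).sum()
print(np.isclose(direct, ibcss))   # True
```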

Cluster analysis is used to categorize objects such as patients, voters, products, institutions, countries, and cities, among others, into homogeneous groups based upon a vector of p variables. The clustering of objects into homogeneous groups depends on the scale of the variables, the algorithm used for clustering, and the criterion used to estimate the number of clusters. In general, variables with very large variances relative to others, as well as outliers, have an adverse effect on cluster analysis methods; they tend to dominate the proximity measure.

Cluster analysis is an exploratory data analysis methodology. It tries to discover how objects may or may not be combined. The analysis depends on the amount of random noise in the data, the existence of outliers in the data, the variables selected for the analysis, the proximity measure used, the spatial properties of the data, and the clustering method employed.

Two Step Cluster Analysis

The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a dataset that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:

• Handling of categorical and continuous variables. By assuming variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables.

• Automatic selection of number of clusters. By comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.

• Scalability. By constructing a cluster features (CF) tree that summarizes the records, the TwoStep algorithm allows you to analyze large data files.


Example. Retail and consumer product companies regularly apply clustering techniques to data that describe their customers' buying habits, gender, age, income level, etc. These companies tailor their marketing and product development strategies to each consumer group to increase sales and build brand loyalty.

Distance Measure. This selection determines how the similarity between two clusters is computed.

• Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent.

• Euclidean. The Euclidean measure is the straight-line distance between two clusters. It can be used only when all of the variables are continuous.

Number of Clusters. This selection allows you to specify how the number of clusters is to be determined.

• Determine automatically. The procedure will automatically determine the best number of clusters, using the criterion specified in the Clustering Criterion group. Optionally, enter a positive integer specifying the maximum number of clusters that the procedure should consider.

• Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive integer.

Count of Continuous Variables. This group provides a summary of the continuous variable standardization specifications made in the Options dialog box. See the topic TwoStep Cluster Analysis Options for more information.

Clustering Criterion. This selection determines how the automatic clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.
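TwoStep itself is not available in open-source libraries, but the underlying idea of comparing an information criterion across candidate numbers of clusters can be sketched with a Gaussian mixture's BIC in scikit-learn (an assumed analogue, not the TwoStep algorithm):

```python
# Sketch: pick the number of clusters by minimizing BIC across k.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 8)}
best_k = min(bics, key=bics.get)   # smallest BIC wins; here best_k should be 2
```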

Data. This procedure works with both continuous and categorical variables. Cases represent objects to be clustered, and the variables represent attributes upon which the clustering is based.

Case Order. Note that the cluster features tree and the final solution may depend on the order of cases. To minimize order effects, randomly order the cases. You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution. In situations where this is difficult due to extremely large file sizes, multiple runs with a sample of cases sorted in different random orders might be substituted.

Assumptions. The likelihood distance measure assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

Use the Bivariate Correlations procedure to test the independence of two continuous variables. Use the Crosstabs procedure to test the independence of two categorical variables. Use the Means procedure to test the independence between a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable. Use the Chi-Square Test procedure to test whether a categorical variable has a specified multinomial distribution.
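Outside SPSS, rough equivalents of these checks are available in scipy.stats (assumed substitutes for the procedures named above, not the same tests in every detail):

```python
# Sketch of assumption checks with scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x, y = rng.normal(size=300), rng.normal(size=300)
cat = rng.integers(0, 3, size=300)

r, p_corr = stats.pearsonr(x, y)                   # two continuous variables
table = np.histogram2d(cat, (x > 0).astype(int), bins=(3, 2))[0]
chi2, p_ind, _, _ = stats.chi2_contingency(table)  # two categorical variables
w, p_norm = stats.shapiro(x)                       # normality of a continuous variable
chisq, p_multi = stats.chisquare(np.bincount(cat)) # specified multinomial (equal here)
```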

To Obtain a TwoStep Cluster Analysis

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > TwoStep Cluster

Select one or more categorical or continuous variables.

Optionally, you can:

• Adjust the criteria by which clusters are constructed.

• Select settings for noise handling, memory allocation, variable standardization, and cluster model input.

• Request model viewer output.

• Save model results to the working file or to an external XML file.

TwoStep Cluster Analysis Options

Outlier Treatment. This group allows you to treat outliers specially during clustering if the cluster features (CF) tree fills. The CF tree is full if it cannot accept any more cases in a leaf node and no leaf node can be split.

• If you select noise handling and the CF tree fills, it will be regrown after placing cases in sparse leaves into a noise leaf. A leaf is considered sparse if it contains fewer than the specified percentage of cases of the maximum leaf size. After the tree is regrown, the outliers will be placed in the CF tree if possible. If not, the outliers are discarded.

• If you do not select noise handling and the CF tree fills, it will be regrown using a larger distance change threshold. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of -1 and is not included in the count of the number of clusters.

Memory Allocation. This group allows you to specify the maximum amount of memory in megabytes (MB) that the cluster algorithm should use. If the procedure exceeds this maximum, it will use the disk to store information that will not fit in memory. Specify a number greater than or equal to 4.

• Consult your system administrator for the largest value that you can specify on your system.

• The algorithm may fail to find the correct or desired number of clusters if this value is too low.

Variable standardization. The clustering algorithm works with standardized continuous variables. Any continuous variables that are not standardized should be left as variables in the To be Standardized list. To save some time and computational effort, you can select any continuous variables that you have already standardized as variables in the Assumed Standardized list.

Advanced Options

CF Tree Tuning Criteria. The following clustering algorithm settings apply specifically to the cluster features (CF) tree and should be changed with care:

• Initial Distance Change Threshold. This is the initial threshold used to grow the CF tree. If inserting a given case into a leaf of the CF tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split.

• Maximum Branches (per leaf node). The maximum number of child nodes that a leaf node can have.

• Maximum Tree Depth. The maximum number of levels that the CF tree can have.

• Maximum Number of Nodes Possible. This indicates the maximum number of CF tree nodes that could potentially be generated by the procedure, based on the function (b^(d+1) - 1) / (b - 1), where b is the maximum branches and d is the maximum tree depth. Be aware that an overly large CF tree can be a drain on system resources and can adversely affect the performance of the procedure. At a minimum, each node requires 16 bytes.
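The node bound is easy to evaluate directly; b = 8 and d = 3 below are example values of our choosing, not necessarily the procedure's defaults:

```python
# Geometric-series bound on the number of CF tree nodes.
def max_cf_nodes(b, d):
    return (b ** (d + 1) - 1) // (b - 1)

print(max_cf_nodes(8, 3))   # 585 nodes; at 16 bytes each, under 10 KB minimum
```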

Cluster Model Update. This group allows you to import and update a cluster model generated in a prior analysis. The input file contains the CF tree in XML format. The model will then be updated with the data in the active file. You must select the variable names in the main dialog box in the same order in which they were specified in the prior analysis. The XML file remains unaltered unless you specifically write the new model information to the same filename. See the topic TwoStep Cluster Analysis Output for more information.

If a cluster model update is specified, the options pertaining to generation of the CF tree that were specified for the original model are used. More specifically, the distance measure, noise handling, memory allocation, and CF tree tuning criteria settings for the saved model are used, and any settings for these options in the dialog boxes are ignored.

Note: When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in the model update come from the same population as the cases used to create the original model; that is, the means and variances of continuous variables and the levels of categorical variables are assumed to be the same across both sets of cases. If your new and old sets of cases come from heterogeneous populations, you should run the TwoStep Cluster Analysis procedure on the combined sets of cases for the best results.

TwoStep Cluster Analysis Output

Model Viewer Output. This group provides options for displaying the clustering results.

• Charts and tables. Displays model-related output, including tables and charts. Tables in the model view include a model summary and cluster-by-features grid. Graphical output in the model view includes a cluster quality chart, cluster sizes, variable importance, a cluster comparison grid, and cell information.

• Evaluation fields. This calculates cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the input features in the model viewer by selecting them in the Display subdialog. Fields with missing values are ignored.

Working Data File. This group allows you to save variables to the active dataset.

• Create cluster membership variable. This variable contains a cluster identification number for each case. The name of this variable is tsc_n, where n is a positive integer indicating the ordinal of the active dataset save operation completed by this procedure in a given session.

XML Files. The final cluster model and CF tree are two types of output files that can be exported in XML format.

• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish between membership in one cluster or another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked or auxiliary view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check whether the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings to produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating as either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B - A) / max(A, B), where A is the record's distance to its own cluster center and B is the record's distance to the nearest cluster center that it doesn't belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of -1 would mean all cases are located on the cluster centers of some other cluster. A value of 0 means, on average, cases are equidistant between their own cluster center and the nearest other cluster.
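A comparable statistic can be computed outside the viewer; note that scikit-learn's silhouette_score averages distances to cluster members rather than distances to cluster centers, so the sketch below is a close analogue of the measure described above rather than an exact match:

```python
# Sketch: average silhouette (B - A) / max(A, B) over all records.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))   # near 1 for well-separated clusters
```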

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, TwoStep.

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (this is blank by default). Double-click in the cell to enter a label that describes the cluster contents; for example, "Luxury car buyers".

• Description. Any description of the cluster contents (this is blank by default). Double-click in the cell to enter a description of the cluster; for example, "55+ years of age, professionals, earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value; for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields:

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields and the mode (most frequently occurring category) with category percentage for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot, which uses the same endpoints and intervals for each cluster.

The solid red colored display shows the cluster distribution, whilst the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red colored display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and percentage of the whole).

• The size of the largest cluster (both a count and percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data, but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent/modal category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can also reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.
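The effect of the generated filter variable can be sketched in pandas (hypothetical column names, not output produced by the dialog):

```python
# Sketch: records in the selected clusters get 1, all others 0.
import pandas as pd

df = pd.DataFrame({"cluster": [1, 2, 3, 1, 2], "x": range(5)})
selected = {1, 3}
df["filter_1"] = df["cluster"].isin(selected).astype(int)
subset = df[df["filter_1"] == 1]   # subsequent analyses use only these records
```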

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.
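
For readers who want to experiment outside the procedure, a minimal sketch of the same agglomerative idea using SciPy (the data are assumed; SciPy's linkage output plays the role of the agglomeration schedule):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
    from scipy.spatial.distance import pdist

    X = np.random.default_rng(0).normal(size=(20, 3))    # hypothetical cases x variables

    d = pdist(X, metric="euclidean")                     # pairwise distances (cf. Proximities)
    Z = linkage(d, method="average")                     # agglomeration schedule: one row per merge
    membership = fcluster(Z, t=4, criterion="maxclust")  # membership for a 4-cluster solution
    dendrogram(Z)                                        # tree of the merge history (needs matplotlib)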

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or values before computing proximities (not available for binary data). Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
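
The last two measures are easy to state in code; a small sketch of the definitions above (the helper functions are hypothetical, and NumPy arrays are assumed):

    import numpy as np

    def minkowski(x, y, p):
        # pth root of the sum of absolute differences raised to the pth power
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def customized(x, y, p, r):
        # rth root of the sum of absolute differences raised to the pth power
        return np.sum(np.abs(x - y) ** p) ** (1.0 / r)

    x, y = np.array([1.0, 2.0, 4.0]), np.array([2.0, 0.0, 3.0])
    minkowski(x, y, p=1)   # block (Manhattan) distance
    minkowski(x, y, p=2)   # Euclidean distance

Setting p = r recovers Minkowski as a special case of the customized measure.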

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
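
A sketch of how these two measures can be computed, assuming the chi-square statistic is taken from the 2 x k table formed by the two frequency vectors (this is an interpretation of the definitions above, not the procedure's exact internals):

    import numpy as np
    from scipy.stats import chi2_contingency

    def chi_square_measure(x, y):
        # x, y: frequency counts over the same set of categories
        table = np.vstack([x, y])                        # 2 x k contingency table
        chi2 = chi2_contingency(table, correction=False)[0]
        return np.sqrt(chi2)

    def phi_square_measure(x, y):
        # chi-square measure normalized by the square root of the combined frequency
        return chi_square_measure(x, y) / np.sqrt(np.sum(x) + np.sum(y))

    chi_square_measure([10, 5, 2], [8, 7, 1])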

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n^2, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of -1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) when one item is used to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error when one item is used to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
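
All of these indices are functions of the fourfold (2 x 2) table. A small sketch showing the table and a few of the measures defined above (the helper functions are hypothetical; 0/1 vectors are assumed):

    import numpy as np

    def fourfold(u, v):
        # a = joint presences, d = joint absences,
        # b, c = present on one item but absent on the other
        u, v = np.asarray(u, dtype=bool), np.asarray(v, dtype=bool)
        a = int(np.sum(u & v)); b = int(np.sum(u & ~v))
        c = int(np.sum(~u & v)); d = int(np.sum(~u & ~v))
        return a, b, c, d

    def binary_euclidean(u, v):
        a, b, c, d = fourfold(u, v)
        return (b + c) ** 0.5                 # SQRT(b+c)

    def jaccard(u, v):
        a, b, c, d = fourfold(u, v)
        return a / (a + b + c)                # joint absences excluded

    def lance_williams(u, v):
        a, b, c, d = fourfold(u, v)
        return (b + c) / (2 * a + b + c)      # Bray-Curtis nonmetric coefficient

    binary_euclidean([1, 0, 1, 1], [1, 1, 0, 1])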

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
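
Each transformation is a one-liner; a sketch for a single variable, standardizing "by variable" (assumed data; the use of the sample standard deviation, ddof=1, is an assumption):

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 10.0])

    z_scores  = (x - x.mean()) / x.std(ddof=1)       # mean 0, standard deviation 1
    range_pm1 = x / (x.max() - x.min())              # range -1 to 1: divide by the range
    range_01  = (x - x.min()) / (x.max() - x.min())  # range 0 to 1
    max_mag_1 = x / x.max()                          # maximum magnitude of 1
    mean_1    = x / x.mean()                         # mean of 1
    sd_1      = x / x.std(ddof=1)                    # standard deviation of 1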

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
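
The same sample-then-classify strategy can be sketched with scikit-learn (an analogy to the Write final as / Read initial from workflow, not the procedure's implementation; the data are hypothetical):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    full = rng.normal(size=(100_000, 4))          # hypothetical full data file

    sample = full[rng.choice(len(full), size=5_000, replace=False)]
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sample)  # "iterate and classify" on the sample

    labels = km.predict(full)                     # "classify only" pass over the entire file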

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.
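
The difference between the two update rules can be shown directly; the running-means variant updates a center incrementally as each case arrives. A simplified illustration, not the procedure's exact algorithm (assumed NumPy arrays; centers is a k x p array of seeds):

    import numpy as np

    def classify_with_running_means(X, centers):
        # Assign each case to its nearest center and update that center
        # immediately (a running mean), instead of recomputing all centers
        # only after every case has been assigned.
        centers = centers.astype(float).copy()
        counts = np.ones(len(centers))        # assumes each center starts from one seed case
        labels = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            k = int(np.argmin(((centers - x) ** 2).sum(axis=1)))
            labels[i] = k
            counts[k] += 1
            centers[k] += (x - centers[k]) / counts[k]   # incremental mean update
        return labels, centers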

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex data sets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.
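
The case/variable distinction amounts to transposing the data matrix; a sketch with SciPy (assumed data):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.random.default_rng(1).normal(size=(10, 4))            # 10 cases x 4 variables

    between_cases = squareform(pdist(X, metric="euclidean"))     # 10 x 10 distance matrix
    between_vars  = squareform(pdist(X.T, metric="euclidean"))   # 4 x 4 distance matrix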

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n^2, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) when one item is used to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error when one item is used to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.

• Dispersion. This index has a range of -1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, to define how distances between clusters are measured).

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between -1 and 1.

K-nbd. The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the data set.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: to save cluster identifiers only, or to save cluster identifiers along with the data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate the partition quality. In particular, these are used to find out the appropriate number of clusters for the given data set:

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. This is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting one of the clusters into two (and reassigning cases) continues until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.
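
This default splitting scheme can be sketched as follows (a simplified illustration of farthest-case seeding with one reassignment pass per split; it ignores edge cases such as empty clusters):

    import numpy as np

    def split_until_k(X, k):
        centers = [X.mean(axis=0)]                   # start with one cluster
        labels = np.zeros(len(X), dtype=int)
        while len(centers) < k:
            C = np.array(centers)
            d = ((X - C[labels]) ** 2).sum(axis=1)   # each case's distance to its own center
            centers.append(X[np.argmax(d)])          # farthest case seeds the new cluster
            C = np.array(centers)
            labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
            centers = [X[labels == j].mean(axis=0) for j in range(len(C))]
        return np.array(centers), labels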

Algorithm. Provides the K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds, to a SYSTAT file.


None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses k non-missing cases at random (without replacement) as initial seeds.

Random segmentation. Assigns each case to one of k partitions at random. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.

TwoStep Cluster Analysis

The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:

• Handling of categorical and continuous variables. By assuming variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables.

• Automatic selection of number of clusters. By comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.

• Scalability. By constructing a cluster features (CF) tree that summarizes the records, the TwoStep algorithm allows you to analyze large data files.


Example. Retail and consumer product companies regularly apply clustering techniques to data that describe their customers' buying habits, gender, age, income level, etc. These companies tailor their marketing and product development strategies to each consumer group to increase sales and build brand loyalty.

Distance Measure. This selection determines how the similarity between two clusters is computed.

• Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent.

• Euclidean. The Euclidean measure is the straight-line distance between two clusters. It can be used only when all of the variables are continuous.

Number of Clusters. This selection allows you to specify how the number of clusters is to be determined.

• Determine automatically. The procedure will automatically determine the best number of clusters, using the criterion specified in the Clustering Criterion group. Optionally, enter a positive integer specifying the maximum number of clusters that the procedure should consider.

• Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive integer.

Count of Continuous Variables. This group provides a summary of the continuous variable standardization specifications made in the Options dialog box. See the topic TwoStep Cluster Analysis Options for more information.

Clustering Criterion. This selection determines how the automatic clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.
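
The idea of picking the number of clusters by an information criterion can be illustrated with scikit-learn's Gaussian mixture models (an analogous model-choice approach, not the TwoStep algorithm itself; the data are hypothetical):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.default_rng(2).normal(size=(500, 3))   # hypothetical continuous data

    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
           for k in range(1, 9)}
    best_k = min(bic, key=bic.get)                       # the smallest BIC wins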

Data. This procedure works with both continuous and categorical variables. Cases represent objects to be clustered, and the variables represent attributes upon which the clustering is based.

Case Order. Note that the cluster features tree and the final solution may depend on the order of cases. To minimize order effects, randomly order the cases. You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution. In situations where this is difficult due to extremely large file sizes, multiple runs with a sample of cases sorted in different random orders might be substituted.

Assumptions. The likelihood distance measure assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

Use the Bivariate Correlations procedure to test the independence of two continuous variables. Use the Crosstabs procedure to test the independence of two categorical variables. Use the Means procedure to test the independence between a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable. Use the Chi-Square Test procedure to test whether a categorical variable has a specified multinomial distribution.
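
Rough equivalents of these checks can be sketched with SciPy (analogues of the procedures named above, under assumed data):

    import numpy as np
    from scipy.stats import pearsonr, chi2_contingency, shapiro

    rng = np.random.default_rng(3)
    x, y = rng.normal(size=200), rng.normal(size=200)
    a, b = rng.integers(0, 3, size=200), rng.integers(0, 2, size=200)

    pearsonr(x, y)                   # independence of two continuous variables
    table = np.array([[np.sum((a == i) & (b == j)) for j in range(2)]
                      for i in range(3)])
    chi2_contingency(table)          # independence of two categorical variables
    shapiro(x)                       # normality of a continuous variable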

To Obtain a TwoStep Cluster Analysis

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > TwoStep Cluster

Select one or more categorical or continuous variables.

Optionally, you can:

• Adjust the criteria by which clusters are constructed.

• Select settings for noise handling, memory allocation, variable standardization, and cluster model input.

• Request model viewer output.

• Save model results to the working file or to an external XML file.

TwoStep Cluster Analysis Options

Outlier Treatment. This group allows you to treat outliers specially during clustering if the cluster features (CF) tree fills. The CF tree is full if it cannot accept any more cases in a leaf node and no leaf node can be split.

• If you select noise handling and the CF tree fills, it will be regrown after placing cases in sparse leaves into a "noise" leaf. A leaf is considered sparse if it contains fewer than the specified percentage of cases of the maximum leaf size. After the tree is regrown, the outliers will be placed in the CF tree if possible. If not, the outliers are discarded.

• If you do not select noise handling and the CF tree fills, it will be regrown using a larger distance change threshold. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of -1 and is not included in the count of the number of clusters.

Memory Allocation. This group allows you to specify the maximum amount of memory in megabytes (MB) that the cluster algorithm should use. If the procedure exceeds this maximum, it will use the disk to store information that will not fit in memory. Specify a number greater than or equal to 4.

• Consult your system administrator for the largest value that you can specify on your system.

• The algorithm may fail to find the correct or desired number of clusters if this value is too low.

Variable standardization. The clustering algorithm works with standardized continuous variables. Any continuous variables that are not standardized should be left as variables in the To be Standardized list. To save some time and computational effort, you can select any continuous variables that you have already standardized as variables in the Assumed Standardized list.

Advanced Options

CF Tree Tuning Criteria. The following clustering algorithm settings apply specifically to the cluster features (CF) tree and should be changed with care:

• Initial Distance Change Threshold. This is the initial threshold used to grow the CF tree. If inserting a given case into a leaf of the CF tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split.

• Maximum Branches (per leaf node). The maximum number of child nodes that a leaf node can have.

• Maximum Tree Depth. The maximum number of levels that the CF tree can have.

• Maximum Number of Nodes Possible. This indicates the maximum number of CF tree nodes that could potentially be generated by the procedure, based on the function (b^(d+1) - 1) / (b - 1), where b is the maximum branches and d is the maximum tree depth. Be aware that an overly large CF tree can be a drain on system resources and can adversely affect the performance of the procedure. At a minimum, each node requires 16 bytes.
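
The node-count function is the usual formula for a complete tree with branching factor b and depth d; for example (the b = 8, d = 3 values below are illustrative, not a statement of the procedure's defaults):

    def max_cf_nodes(b, d):
        # (b**(d + 1) - 1) / (b - 1): nodes in a complete tree with
        # branching factor b and depth d
        return (b ** (d + 1) - 1) // (b - 1)

    max_cf_nodes(8, 3)   # 585 nodes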

Cluster Model Update. This group allows you to import and update a cluster model generated in a prior analysis. The input file contains the CF tree in XML format. The model will then be updated with the data in the active file. You must select the variable names in the main dialog box in the same order in which they were specified in the prior analysis. The XML file remains unaltered unless you specifically write the new model information to the same filename. See the topic TwoStep Cluster Analysis Output for more information.

If a cluster model update is specified, the options pertaining to generation of the CF tree that were specified for the original model are used. More specifically, the distance measure, noise handling, memory allocation, and CF tree tuning criteria settings for the saved model are used, and any settings for these options in the dialog boxes are ignored.

Note: When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in the model update come from the same population as the cases used to create the original model; that is, the means and variances of continuous variables and the levels of categorical variables are assumed to be the same across both sets of cases. If your "new" and "old" sets of cases come from heterogeneous populations, you should run the TwoStep Cluster Analysis procedure on the combined sets of cases for the best results.

TwoStep Cluster Analysis Output

Model Viewer Output. This group provides options for displaying the clustering results.

• Charts and tables. Displays model-related output, including tables and charts. Tables in the model view include a model summary and cluster-by-features grid. Graphical output in the model view includes a cluster quality chart, cluster sizes, variable importance, cluster comparison grid, and cell information.

• Evaluation fields. This calculates cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the input features in the model viewer by selecting them in the Display subdialog. Fields with missing values are ignored.

Working Data File. This group allows you to save variables to the active dataset.

• Create cluster membership variable. This variable contains a cluster identification number for each case. The name of this variable is tsc_n, where n is a positive integer indicating the ordinal of the active dataset save operation completed by this procedure in a given session.

XML Files. The final cluster model and CF tree are two types of output files that can be exported in XML format.

• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish between membership in one cluster or another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked, or auxiliary, view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check if the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings to produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating as either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B - A) / max(A, B), where A is the record's distance to its cluster center and B is the record's distance to the nearest cluster center that it doesn't belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of -1 would mean all cases are located on the cluster centers of some other cluster. A value of 0 means, on average, cases are equidistant between their own cluster center and the nearest other cluster.
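
A related coefficient is available in scikit-learn; note that its silhouette uses average distances to the members of each cluster rather than distances to cluster centers, so it is an analogue of the measure described here, not a reimplementation:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.default_rng(4).normal(size=(300, 2))   # hypothetical data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    silhouette_score(X, labels)   # averages (b - a) / max(a, b) over all cases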

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, "TwoStep".

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (this is blank by default). Double-click in the cell to enter a label that describes the cluster contents; for example, "Luxury car buyers".

• Description. Any description of the cluster contents (this is blank by default). Double-click in the cell to enter a description of the cluster; for example, "55+ years of age, professionals, earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value; for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields.

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields, and the mode (most frequently occurring category), with category percentage, for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid, with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot that uses the same endpoints and intervals for each cluster.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and a percentage of the whole).

• The size of the largest cluster (both a count and a percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data, but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent (modal) category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for the selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model, but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis, based on the selected clusters; a small scripting sketch follows the steps below.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.
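The generated filter field is simply a 0/1 indicator. As referenced above, a rough pandas equivalent looks like the following sketch; the column names (tsc_1, filter_1) are hypothetical stand-ins for the saved membership and filter variables.

    import pandas as pd

    df = pd.DataFrame({"income": [40, 75, 52, 91],
                       "tsc_1": [1, 2, 1, 3]})       # tsc_1: saved cluster membership

    selected = [2, 3]                                 # clusters picked in the viewer
    df["filter_1"] = df["tsc_1"].isin(selected).astype(int)  # 1 = selected, 0 = excluded
    subset = df[df["filter_1"] == 1]                  # cases kept for further analysis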

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.
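For readers who want to experiment outside the dialogs, an equivalent agglomerative analysis can be sketched with scipy; this is an illustrative parallel, not the procedure's own implementation, and the toy data are made up.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 3)),        # toy data: two separated groups
                   rng.normal(5, 1, (20, 3))])

    Z = linkage(X, method="ward")                    # agglomeration schedule
    members = fcluster(Z, t=2, criterion="maxclust") # membership, 2-cluster solution
    dendrogram(Z)                                    # tree plot of the merge history
    plt.show()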

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or variables before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data; a short computational sketch follows the list:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
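Most of these measures have direct counterparts in scipy's pdist, as sketched below (illustrative only; the customized rth-root measure has no built-in scipy equivalent).

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 1.0],
                  [0.0, 2.0, 5.0]])

    euclid   = squareform(pdist(X, "euclidean"))       # sqrt of summed squared diffs
    sqeuclid = squareform(pdist(X, "sqeuclidean"))     # summed squared diffs
    cheby    = squareform(pdist(X, "chebyshev"))       # max absolute difference
    block    = squareform(pdist(X, "cityblock"))       # Manhattan / block distance
    mink     = squareform(pdist(X, "minkowski", p=3))  # pth root of summed |diff|**p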

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data; a short computational sketch follows the list:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
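In terms of the 2 × k table formed by two cases' frequency vectors, the two measures can be sketched as follows. This is an illustrative reading of the definitions above, not the procedure's exact code.

    import numpy as np

    def chisq_measure(x, y):
        """sqrt of the chi-square statistic for the 2 x k table built from
        two frequency vectors (rows = cases, columns = categories)."""
        table = np.vstack([x, y]).astype(float)
        n = table.sum()
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
        return np.sqrt(((table - expected) ** 2 / expected).sum())

    def phisq_measure(x, y):
        """Chi-square measure normalized by sqrt of the combined frequency."""
        return chisq_measure(x, y) / np.sqrt(float(np.sum(x) + np.sum(y)))

    print(chisq_measure([10, 20, 30], [15, 15, 30]))  # small worked example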

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data; a short computational sketch follows this topic:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
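Several of the fourfold-table formulas above are easy to verify directly. A short numpy sketch under the usual a/b/c/d labeling (illustrative only):

    import numpy as np

    def fourfold(x, y):
        """a = both present, b/c = present on one item only, d = both absent."""
        x, y = np.asarray(x, bool), np.asarray(y, bool)
        return (np.sum(x & y), np.sum(x & ~y), np.sum(~x & y), np.sum(~x & ~y))

    a, b, c, d = fourfold([1, 1, 0, 1, 0, 0, 1], [1, 0, 0, 1, 1, 0, 1])
    n = a + b + c + d

    euclid   = np.sqrt(b + c)             # binary Euclidean distance
    sqeuclid = b + c                      # squared Euclidean: discordant cases
    pattern  = b * c / n**2               # pattern difference
    variance = (b + c) / (4 * n)          # variance
    jaccard  = a / (a + b + c)            # similarity ratio
    dice     = 2 * a / (2 * a + b + c)    # matches weighted double
    lance_w  = (b + c) / (2 * a + b + c)  # Bray-Curtis nonmetric coefficient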

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
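The same sample-then-classify workflow can be mimicked with scikit-learn; this is a loose analogy, where fitting on the sample plays the role of Iterate and classify plus Write final as, and predict plays Classify only with Read initial from.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    X_full = rng.normal(size=(100_000, 4))        # stands in for the full data file
    sample_idx = rng.choice(len(X_full), 5_000, replace=False)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_full[sample_idx])
    centers = km.cluster_centers_                 # the "final centers" to reuse
    labels_full = km.predict(X_full)              # classify every case with them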

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.
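Stated as code, the stopping rule compares the largest center movement in an iteration against this fraction of the smallest distance between the initial centers. A minimal sketch with a hypothetical helper function, not the procedure's internals:

    import numpy as np
    from scipy.spatial.distance import pdist

    def converged(initial_centers, old_centers, new_centers, criterion=0.02):
        """True once no center moved farther than criterion times the smallest
        distance between any two initial cluster centers."""
        min_initial_gap = pdist(initial_centers).min()
        largest_shift = np.linalg.norm(new_centers - old_centers, axis=1).max()
        return largest_shift <= criterion * min_initial_gap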

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables; the sketch below illustrates the difference.
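The cases-versus-variables choice amounts to computing pairwise distances over the rows or over the columns of the data matrix. In scipy terms (an illustrative sketch, not the procedure's code):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 1.0],
                  [0.0, 2.0, 5.0],
                  [3.0, 1.0, 4.0]])               # rows = cases, columns = variables

    between_cases = squareform(pdist(X, "euclidean"))    # 4 x 4 case-by-case matrix
    between_vars  = squareform(pdist(X.T, "euclidean"))  # 3 x 3 variable-by-variable matrix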

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data; a short sketch follows the list:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.
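Both similarity measures are one-liners in numpy (illustrative, not the procedure's code):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 1.5, 3.5, 5.0])

    pearson = np.corrcoef(x, y)[0, 1]                           # product-moment correlation
    cosine = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine of the angle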

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case; a small sketch of the difference follows.
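The By variable versus By case choice is just the axis along which the statistics are taken. A numpy sketch of z-score standardization both ways (note that numpy's default is the population standard deviation; a sample-based version would pass ddof=1):

    import numpy as np

    X = np.array([[1.0, 200.0, 30.0],
                  [2.0, 180.0, 45.0],
                  [3.0, 240.0, 25.0]])

    z_by_variable = (X - X.mean(axis=0)) / X.std(axis=0)  # each column: mean 0, sd 1
    z_by_case = ((X - X.mean(axis=1, keepdims=True))
                 / X.std(axis=1, keepdims=True))          # each row: mean 0, sd 1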

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases or variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, that is, statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured). A comparative sketch follows the list of methods.

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between −1 and 1.

K-nbd. The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible." Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms. Consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix used to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree; a short sketch follows these options:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.
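The two cutting rules map roughly onto scipy's fcluster criteria: "distance" cuts at a height, and "maxclust" cuts to a target number of flat clusters (an illustrative sketch on toy data, not SYSTAT code).

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(4)
    Z = linkage(rng.normal(size=(30, 4)), method="ward")

    by_height = fcluster(Z, t=2.5, criterion="distance")  # cut at a specified distance
    by_leaves = fcluster(Z, t=4, criterion="maxclust")    # cut to a target cluster count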

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity

Provides five validity indices to evaluate the partition quality. In particular, these are used to find the appropriate number of clusters for the given data set (a computational sketch of two of these indices follows the list):

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
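Two of these indices are available in scikit-learn: the pseudo F-ratio is commonly identified with the Calinski-Harabasz index, and the DB index is Davies-Bouldin (higher pseudo F and lower DB suggest better partitions). An illustrative scan over candidate cluster counts:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(i * 4, 1, (50, 2)) for i in range(3)])  # three true groups

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k,
              round(calinski_harabasz_score(X, labels), 1),  # pseudo F: higher is better
              round(davies_bouldin_score(X, labels), 3))     # DB: lower is better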

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting one of the clusters into two (and reassigning cases) continues until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to any of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.
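In scikit-learn terms, explicit seed choices correspond to passing an init array, while Random k resembles init="random". This is a loose analogy only; SYSTAT's seeding options are its own algorithms.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 2))
    k = 3

    first_k = KMeans(n_clusters=k, init=X[:k], n_init=1).fit(X)  # "First k" seeds
    random_k = KMeans(n_clusters=k, init="random", n_init=1,
                      random_state=0).fit(X)                     # "Random k" seeds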


Case Order. Note that the cluster features tree and the final solution may depend on the order of cases. To minimize order effects, randomly order the cases. You may want to obtain several different solutions, with cases sorted in different random orders, to verify the stability of a given solution. In situations where this is difficult due to extremely large file sizes, multiple runs with a sample of cases, sorted in different random orders, might be substituted.
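A simple way to run the stability check described above is to refit on several random case orders. A pandas sketch, with the fitting step left as a placeholder:

    import pandas as pd

    df = pd.DataFrame({"x1": [1, 4, 2, 8, 5, 7],
                       "x2": [3, 1, 4, 1, 5, 9]})

    for seed in (0, 1, 2):
        shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
        # ... fit the cluster model on `shuffled` and compare the solutions ...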

Assumptions. The likelihood distance measure assumes that variables in the cluster model are independent. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should try to be aware of how well these assumptions are met.

Use the Bivariate Correlations procedure to test the independence of two continuous variables. Use the Crosstabs procedure to test the independence of two categorical variables. Use the Means procedure to test the independence between a continuous variable and a categorical variable. Use the Explore procedure to test the normality of a continuous variable. Use the Chi-Square Test procedure to test whether a categorical variable has a specified multinomial distribution.

To Obtain a TwoStep Cluster Analysis

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > TwoStep Cluster

Select one or more categorical or continuous variables.

Optionally, you can:

• Adjust the criteria by which clusters are constructed.

• Select settings for noise handling, memory allocation, variable standardization, and cluster model input.

• Request model viewer output.

• Save model results to the working file or to an external XML file.

TwoStep Cluster Analysis Options

Outlier Treatment. This group allows you to treat outliers specially during clustering if the cluster features (CF) tree fills. The CF tree is full if it cannot accept any more cases in a leaf node and no leaf node can be split.

• If you select noise handling and the CF tree fills, it will be regrown after placing cases in sparse leaves into a noise leaf. A leaf is considered sparse if it contains fewer than the specified percentage of cases of the maximum leaf size. After the tree is regrown, the outliers will be placed in the CF tree if possible. If not, the outliers are discarded.

• If you do not select noise handling and the CF tree fills, it will be regrown using a larger distance change threshold. After final clustering, values that cannot be assigned to a cluster are labeled outliers. The outlier cluster is given an identification number of −1 and is not included in the count of the number of clusters.

Memory Allocation. This group allows you to specify the maximum amount of memory in megabytes (MB) that the cluster algorithm should use. If the procedure exceeds this maximum, it will use the disk to store information that will not fit in memory. Specify a number greater than or equal to 4.

• Consult your system administrator for the largest value that you can specify on your system.

• The algorithm may fail to find the correct or desired number of clusters if this value is too low.

Variable standardization. The clustering algorithm works with standardized continuous variables. Any continuous variables that are not standardized should be left as variables in the To be Standardized list. To save some time and computational effort, you can select any continuous variables that you have already standardized as variables in the Assumed Standardized list.

Advanced Options

CF Tree Tuning Criteria. The following clustering algorithm settings apply specifically to the cluster features (CF) tree and should be changed with care:

• Initial Distance Change Threshold. This is the initial threshold used to grow the CF tree. If inserting a given case into a leaf of the CF tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split.

• Maximum Branches (per leaf node). The maximum number of child nodes that a leaf node can have.

• Maximum Tree Depth. The maximum number of levels that the CF tree can have.

• Maximum Number of Nodes Possible. This indicates the maximum number of CF tree nodes that could potentially be generated by the procedure, based on the function (b^(d+1) − 1) / (b − 1), where b is the maximum branches and d is the maximum tree depth. Be aware that an overly large CF tree can be a drain on system resources and can adversely affect the performance of the procedure. At a minimum, each node requires 16 bytes.
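The node bound is a geometric series over the tree levels. For example, with the stated function, b = 8 branches and d = 3 levels give 585 possible nodes:

    def max_cf_nodes(b, d):
        """Upper bound on CF tree nodes: (b**(d + 1) - 1) // (b - 1)."""
        return (b ** (d + 1) - 1) // (b - 1)

    print(max_cf_nodes(8, 3))  # 585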

Cluster Model Update. This group allows you to import and update a cluster model generated in a prior analysis. The input file contains the CF tree in XML format. The model will then be updated with the data in the active file. You must select the variable names in the main dialog box in the same order in which they were specified in the prior analysis. The XML file remains unaltered unless you specifically write the new model information to the same filename. See the topic TwoStep Cluster Analysis Output for more information.

If a cluster model update is specified, the options pertaining to generation of the CF tree that were specified for the original model are used. More specifically, the distance measure, noise handling, memory allocation, and CF tree tuning criteria settings for the saved model are used, and any settings for these options in the dialog boxes are ignored.

Note: When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in the model update come from the same population as the cases used to create the original model; that is, the means and variances of continuous variables and the levels of categorical variables are assumed to be the same across both sets of cases. If your new and old sets of cases come from heterogeneous populations, you should run the TwoStep Cluster Analysis procedure on the combined sets of cases for the best results.

TwoStep Cluster Analysis Output

Model Viewer Output. This group provides options for displaying the clustering results:

• Charts and tables. Displays model-related output, including tables and charts. Tables in the model view include a model summary and a cluster-by-features grid. Graphical output in the model view includes a cluster quality chart, cluster sizes, variable importance, a cluster comparison grid, and cell information.

• Evaluation fields. This calculates cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the input features in the model viewer by selecting them in the Display subdialog. Fields with missing values are ignored.

Working Data File. This group allows you to save variables to the active dataset.

• Create cluster membership variable. This variable contains a cluster identification number for each case. The name of this variable is tsc_n, where n is a positive integer indicating the ordinal of the active dataset save operation completed by this procedure in a given session.

XML Files. The final cluster model and CF tree are two types of output files that can be exported in XML format.

• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish between membership in one cluster or another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked or auxiliary view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check if the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings to produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating of either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B − A) / max(A, B), where A is the record's distance to its own cluster center and B is the record's distance to the nearest cluster center that it doesn't belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of −1 would mean that all cases are located on the cluster centers of some other cluster. A value of 0 means that, on average, cases are equidistant between their own cluster center and the nearest other cluster.
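A minimal sketch of this center-based silhouette, assuming numeric data and Euclidean distances (the procedure itself may use a different distance, for example when categorical fields are present):

    import numpy as np

    def centroid_silhouette(X, labels, centers):
        """Mean of (B - A) / max(A, B) over all records: A is the distance
        to the record's own cluster center, B the distance to the nearest
        center of any other cluster."""
        scores = []
        for x, k in zip(X, labels):
            d = np.linalg.norm(centers - x, axis=1)  # distance to each center
            a = d[k]
            b = np.delete(d, k).min()                # nearest foreign center
            scores.append((b - a) / max(a, b))
        return float(np.mean(scores))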

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used (for example, TwoStep).

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (this is blank by default). Double-click in the cell to enter a label that describes the cluster contents; for example, "Luxury car buyers".

• Description. Any description of the cluster contents (this is blank by default). Double-click in the cell to enter a description of the cluster; for example, "55+ years of age, professionals, earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value; for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields.

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields and the mode (most frequently occurring category), with category percentage, for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid, with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot, which uses the same endpoints and intervals for each cluster.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and a percentage of the whole).

• The size of the largest cluster (both a count and a percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters not only as compared with the overall data, but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distribution of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent/modal category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for the selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.
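For readers working outside the dialogs, the same kind of analysis can be sketched with SciPy (hypothetical data; the average-linkage choice and the four-cluster cut are illustrative, not defaults of the procedure):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 4))   # hypothetical shows-by-viewer-traits matrix

    Z = linkage(X, method="average")                     # agglomeration schedule
    membership = fcluster(Z, t=4, criterion="maxclust")  # 4-cluster solution
    dendrogram(Z)                                        # tree of the merges
    plt.show()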

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russell and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or variables before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data (a SciPy-based sketch follows the list):

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
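SciPy offers equivalents of most of these measures; a sketch (note that SciPy reports cosine and correlation as dissimilarities, so the similarities are one minus the returned values):

    from scipy.spatial import distance

    u, v = [1.0, 2.0, 4.0], [2.0, 1.0, 3.0]

    distance.euclidean(u, v)        # square root of sum of squared differences
    distance.sqeuclidean(u, v)      # sum of squared differences
    distance.chebyshev(u, v)        # maximum absolute difference
    distance.cityblock(u, v)        # block (Manhattan) distance
    distance.minkowski(u, v, p=3)   # pth root of sum of |differences|**p
    1 - distance.cosine(u, v)       # cosine of the angle between u and v
    1 - distance.correlation(u, v)  # Pearson correlation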

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data (a sketch of both follows):

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
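A sketch of both measures from their stated definitions, treating the two count vectors as the rows of a 2 × k table (an illustration; consult the procedure's algorithms documentation for the exact variant):

    import numpy as np

    def chisq_measure(x, y):
        """Chi-square distance between two frequency vectors x and y."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = x.sum() + y.sum()
        col = x + y                                    # column totals
        ex, ey = col * x.sum() / n, col * y.sum() / n  # expected frequencies
        return np.sqrt(((x - ex) ** 2 / ex).sum() + ((y - ey) ** 2 / ey).sum())

    def phisq_measure(x, y):
        """Phi-square: the chi-square measure normalized by sqrt(combined n)."""
        n = np.asarray(x, float).sum() + np.asarray(y, float).sum()
        return chisq_measure(x, y) / np.sqrt(n)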

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data (a worked sketch follows the list):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n^2), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
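Most of these indices are simple functions of the fourfold table. The sketch below builds the table for two hypothetical binary vectors and evaluates a few of the measures from their stated formulas:

    import numpy as np

    def fourfold(u, v):
        """a = both present, b/c = present in exactly one, d = both absent."""
        u, v = np.asarray(u, bool), np.asarray(v, bool)
        return np.sum(u & v), np.sum(u & ~v), np.sum(~u & v), np.sum(~u & ~v)

    a, b, c, d = fourfold([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 1, 0])
    n = a + b + c + d

    euclidean = np.sqrt(b + c)                  # binary Euclidean distance
    matching = (a + d) / n                      # simple matching
    jaccard = a / (a + b + c)                   # joint absences excluded
    dice = 2 * a / (2 * a + b + c)              # matches weighted double
    lance_williams = (b + c) / (2 * a + b + c)  # Bray-Curtis coefficient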

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values (a sketch follows the list):

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
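A sketch of these transformations for a single variable (assumptions: the sample standard deviation is used for z scores, and the By case variants simply apply the same functions across a row instead of a column):

    import numpy as np

    def transform(x, how):
        """Apply one of the value transformations listed above."""
        x = np.asarray(x, float)
        span = x.max() - x.min()
        return {
            "z":       (x - x.mean()) / x.std(ddof=1),  # mean 0, sd 1
            "range11": x / span,                        # range -1 to 1
            "range01": (x - x.min()) / span,            # range 0 to 1
            "max1":    x / x.max(),                     # maximum magnitude of 1
            "mean1":   x / x.mean(),                    # mean of 1
            "sd1":     x / x.std(ddof=1),               # standard deviation of 1
        }[how]

    print(transform([2.0, 4.0, 6.0], "range01"))   # [0.  0.5 1. ]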

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information. (A sketch of this two-phase workflow follows.)
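The same sample-then-classify idea, sketched outside SPSS with hypothetical data (scikit-learn's KMeans stands in for Iterate and classify; a nearest-center assignment stands in for Classify only):

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 5))   # the full data file (hypothetical)

    # Phase 1: estimate centers on a sample ("Iterate and classify" + Write final as)
    sample = X[rng.choice(len(X), size=5_000, replace=False)]
    centers = KMeans(n_clusters=4, n_init=10).fit(sample).cluster_centers_

    # Phase 2: classify every case with the saved centers ("Classify only")
    labels = cdist(X, centers).argmin(axis=1)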

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers (see the sketch after these options).

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.
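A minimal sketch of the convergence test as described above (a hypothetical helper; old_centers and new_centers are the centers before and after one complete iteration):

    import numpy as np
    from scipy.spatial.distance import pdist

    def converged(old_centers, new_centers, init_centers, criterion=0.02):
        """True when no center moved by more than `criterion` times the
        smallest distance between any pair of initial centers."""
        biggest_shift = np.linalg.norm(new_centers - old_centers, axis=1).max()
        return biggest_shift <= criterion * pdist(init_centers).min()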

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.
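Both directions of the computation can be sketched with NumPy/SciPy (hypothetical data; Euclidean distance and Pearson correlation stand in for the default measures):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 4))            # hypothetical cases-by-variables data

    D_cases = squareform(pdist(X))          # Euclidean distances between cases
    D_vars = squareform(pdist(X.T))         # Euclidean distances between variables
    R_vars = np.corrcoef(X, rowvar=False)   # Pearson similarities between variables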

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n^2), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases or variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, that is, statistical criteria, by which an appropriate number of clusters can be chosen from the Hierarchical Tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured):

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between −1 and 1.

K-nbd. The kth nearest neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details. (A sketch of the monotonicity check follows.)
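The monotonicity issue is easy to observe with SciPy's equivalents of several of these linkage rules (illustrative data; an inversion shows up as a decrease in the merge-distance column of the linkage output):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    X = np.random.default_rng(0).normal(size=(15, 3))

    for method in ["single", "complete", "average", "weighted",
                   "centroid", "median", "ward"]:
        Z = linkage(X, method=method)
        # Column 2 of Z holds the amalgamation distances; centroid and
        # median linkage can produce non-monotone merges (inversions).
        monotone = bool(np.all(np.diff(Z[:, 2]) >= 0))
        print(f"{method:9s} monotone={monotone}")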

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate the partition quality. In particular, these are used to find the appropriate number of clusters for the given data set:

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation. This is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting one of the clusters into two (and reassigning cases) continues until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides the K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to any of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.


Advanced Options

CF Tree Tuning Criteria. The following clustering algorithm settings apply specifically to the cluster features (CF) tree and should be changed with care.

• Initial Distance Change Threshold. This is the initial threshold used to grow the CF tree. If inserting a given case into a leaf of the CF tree would yield tightness less than the threshold, the leaf is not split. If the tightness exceeds the threshold, the leaf is split.

• Maximum Branches (per leaf node). The maximum number of child nodes that a leaf node can have.

• Maximum Tree Depth. The maximum number of levels that the CF tree can have.

• Maximum Number of Nodes Possible. This indicates the maximum number of CF tree nodes that could potentially be generated by the procedure, based on the function (b^(d+1) − 1) / (b − 1), where b is the maximum branches and d is the maximum tree depth (see the sketch below). Be aware that an overly large CF tree can be a drain on system resources and can adversely affect the performance of the procedure. At a minimum, each node requires 16 bytes.
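As a quick illustration of that bound, the sketch below (plain Python, not part of the procedure) evaluates the formula; the default values of 8 branches and depth 3 are assumptions based on the usual tuning defaults.

```python
# Illustrative sketch: upper bound on CF tree nodes, (b^(d+1) - 1) / (b - 1),
# where b is the maximum branches and d is the maximum tree depth.
def max_cf_tree_nodes(b: int, d: int) -> int:
    """Sum of a geometric series: 1 + b + b^2 + ... + b^d."""
    return (b ** (d + 1) - 1) // (b - 1)

# Assumed defaults of 8 branches and depth 3:
print(max_cf_tree_nodes(8, 3))        # 585 nodes
print(max_cf_tree_nodes(8, 3) * 16)   # at least 9360 bytes, at 16 bytes per node
```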

Cluster Model Update. This group allows you to import and update a cluster model generated in a prior analysis. The input file contains the CF tree in XML format. The model is then updated with the data in the active file. You must select the variable names in the main dialog box in the same order in which they were specified in the prior analysis. The XML file remains unaltered unless you specifically write the new model information to the same filename. See the topic TwoStep Cluster Analysis Output for more information.

If a cluster model update is specified, the options pertaining to generation of the CF tree that were specified for the original model are used. More specifically, the distance measure, noise handling, memory allocation, and CF tree tuning criteria settings for the saved model are used, and any settings for these options in the dialog boxes are ignored.

Note: When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in the model update come from the same population as the cases used to create the original model; that is, the means and variances of continuous variables and the levels of categorical variables are assumed to be the same across both sets of cases. If your new and old sets of cases come from heterogeneous populations, you should run the TwoStep Cluster Analysis procedure on the combined sets of cases for the best results.

TwoStep Cluster Analysis Output

Model Viewer Output. This group provides options for displaying the clustering results.

• Charts and tables. Displays model-related output, including tables and charts. Tables in the model view include a model summary and a cluster-by-features grid. Graphical output in the model view includes a cluster quality chart, cluster sizes, variable importance, a cluster comparison grid, and cell information.

• Evaluation fields. This calculates cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the input features in the model viewer by selecting them in the Display subdialog. Fields with missing values are ignored.

Working Data File. This group allows you to save variables to the active dataset.

• Create cluster membership variable. This variable contains a cluster identification number for each case. The name of this variable is tsc_n, where n is a positive integer indicating the ordinal of the active dataset save operation completed by this procedure in a given session.

XML Files. The final cluster model and the CF tree are two types of output files that can be exported in XML format.

• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish membership in one cluster from another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked, or auxiliary, view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check if the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings to produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating of either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B − A) / max(A, B), where A is the record's distance to its own cluster center and B is the record's distance to the nearest cluster center that it doesn't belong to (a sketch of this computation appears below). A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of −1 would mean that all cases are located on the cluster centers of some other cluster. A value of 0 means that, on average, cases are equidistant between their own cluster center and the nearest other cluster.

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, TwoStep.

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (this is blank by default). Double-click in the cell to enter a label that describes the cluster contents, for example, "Luxury car buyers".

• Description. Any description of the cluster contents (this is blank by default). Double-click in the cell to enter a description of the cluster, for example, "55+ years of age, professionals earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value, for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields.

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields and the mode (most frequently occurring category) with category percentage for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot that uses the same endpoints and intervals for each cluster.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and a percentage of the whole).

• The size of the largest cluster (both a count and a percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent (modal) category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.

Hierarchical Cluster Analysis. This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure.

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or variables before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data (illustrative sketches of these measures follow the list):

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data (a small sketch follows the list):

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE), using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values. Illustrative sketches of a few of these fourfold-table measures appear below.
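The sketch below builds the fourfold (2 x 2) table, with a = both present, b and c = present on one item only, and d = both absent, and computes a few of the measures defined above; the coding of "present" and the handling of other values are simplified assumptions.

```python
import math

def fourfold(x, y, present=1):
    a = sum(xi == present and yi == present for xi, yi in zip(x, y))
    b = sum(xi == present and yi != present for xi, yi in zip(x, y))
    c = sum(xi != present and yi == present for xi, yi in zip(x, y))
    d = len(x) - a - b - c                     # joint absences
    return a, b, c, d

def binary_euclidean(x, y): a, b, c, d = fourfold(x, y); return math.sqrt(b + c)
def simple_matching(x, y):  a, b, c, d = fourfold(x, y); return (a + d) / (a + b + c + d)
def jaccard(x, y):          a, b, c, d = fourfold(x, y); return a / (a + b + c)
def dice(x, y):             a, b, c, d = fourfold(x, y); return 2 * a / (2 * a + b + c)
def lance_williams(x, y):   a, b, c, d = fourfold(x, y); return (b + c) / (2 * a + b + c)
```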

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case. Sketches of these transformations appear below.

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis. This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information. A sketch of this sample-then-classify workflow appears below.

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned. A sketch contrasting the two update rules follows.

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses.

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE), using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases or variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz., statistical criteria, by which an appropriate number of clusters can be chosen from the Hierarchical Tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster. A minimal sketch of this stepwise joining appears below.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured); sketches of a few of these rules follow this list.

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between −1 and 1.

K-nbd. The kth nearest neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms. Consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate the partition quality. In particular, it is used to find out the appropriate number of clusters for the given data set.

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters is formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options; a sketch of the farthest-point seeding idea appears below.
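This is a minimal sketch of a farthest-point seeding heuristic in the spirit of the default splitting strategy described above (an illustration, not the SYSTAT implementation): each new seed is the case farthest from the centers chosen so far.

```python
import numpy as np

def split_seeds(X, k):
    centers = [X.mean(axis=0)]                # start with one cluster's center
    for _ in range(k - 1):
        # Distance from every case to its nearest existing center:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])       # farthest case seeds the next cluster
    return np.array(centers)
```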

Algorithm. Provides K-Means and K-Medians clustering options.

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number (Groups) is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to any of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.

A large CF tree can be a drain on system resources and can adversely affect the performance of the procedure. At a minimum, each node requires 16 bytes.

Cluster Model Update. This group allows you to import and update a cluster model generated in a prior analysis. The input file contains the CF tree in XML format. The model is then updated with the data in the active file. You must select the variable names in the main dialog box in the same order in which they were specified in the prior analysis. The XML file remains unaltered unless you specifically write the new model information to the same filename. See the topic TwoStep Cluster Analysis Output for more information.

If a cluster model update is specified, the options pertaining to generation of the CF tree that were specified for the original model are used. More specifically, the distance measure, noise handling, memory allocation, and CF tree tuning criteria settings for the saved model are used, and any settings for these options in the dialog boxes are ignored.

Note: When performing a cluster model update, the procedure assumes that none of the selected cases in the active dataset were used to create the original cluster model. The procedure also assumes that the cases used in the model update come from the same population as the cases used to create the original model; that is, the means and variances of continuous variables and the levels of categorical variables are assumed to be the same across both sets of cases. If your new and old sets of cases come from heterogeneous populations, you should run the TwoStep Cluster Analysis procedure on the combined sets of cases for the best results.

TwoStep Cluster Analysis Output

Model Viewer Output. This group provides options for displaying the clustering results.

• Charts and tables. Displays model-related output, including tables and charts. Tables in the model view include a model summary and a cluster-by-features grid. Graphical output in the model view includes a cluster quality chart, cluster sizes, variable importance, a cluster comparison grid, and cell information.

• Evaluation fields. Calculates cluster data for variables that were not used in cluster creation. Evaluation fields can be displayed along with the input features in the model viewer by selecting them in the Display subdialog. Fields with missing values are ignored.

Working Data File. This group allows you to save variables to the active dataset.

• Create cluster membership variable. This variable contains a cluster identification number for each case. The name of this variable is tsc_n, where n is a positive integer indicating the ordinal of the active dataset save operation completed by this procedure in a given session.

XML Files. The final cluster model and the CF tree are two types of output files that can be exported in XML format.

• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to each cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish membership in one cluster from another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked, or auxiliary, view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to check quickly whether the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings and produce a better result.

The ratings of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating of either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B - A) / max(A, B), where A is the record's distance to its own cluster center and B is the record's distance to the nearest cluster center that it does not belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of -1 would mean that all cases are located on the cluster center of some other cluster. A value of 0 means that, on average, cases are equidistant between their own cluster center and the nearest other cluster.
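For readers reproducing this outside the Viewer, a minimal sketch assuming scikit-learn is shown below. One labeled difference: scikit-learn's silhouette uses average distances to cluster members rather than distances to cluster centers, so its values can differ slightly from the centroid-based measure described above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Averages (b - a) / max(a, b) over all records: 1 good, 0 indifferent, -1 bad.
print(f"mean silhouette: {silhouette_score(X, labels):.3f}")
```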

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, TwoStep.

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (blank by default). Double-click in the cell to enter a label that describes the cluster contents; for example, "Luxury car buyers".

• Description. Any description of the cluster contents (blank by default). Double-click in the cell to enter a description of the cluster; for example, "55+ years of age, professionals, earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar showing the size percentage within the cluster, the size percentage in numeric format, and the cluster case count.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value; for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when many clusters are displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields:

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields, and the mode (most frequently occurring category) with category percentage is shown for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid, with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot that uses the same endpoints and intervals for each cluster.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are many clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over a slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and a percentage of the whole).

• The size of the largest cluster (both a count and a percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distribution of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent (modal) category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for the selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.
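Outside SPSS, the same effect is easy to reproduce. A short pandas sketch, with illustrative column names and cluster choices:

```python
import pandas as pd

df = pd.DataFrame({"cluster": [1, 2, 3, 1, 2], "income": [40, 55, 90, 42, 60]})
selected = {1, 3}                                          # clusters chosen in the viewer

df["filter_1"] = df["cluster"].isin(selected).astype(int)  # 1 = keep, 0 = exclude
subset = df[df["filter_1"] == 1]                           # subsequent analyses use this
print(subset)
```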

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.
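As a point of reference, a minimal sketch of the same workflow in SciPy (an assumption; this is not the SPSS implementation): the linkage matrix plays the role of the agglomeration schedule, and fcluster extracts membership for a chosen solution.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.random.default_rng(0).normal(size=(20, 4))     # 20 cases, 4 variables

Z = linkage(X, method="average", metric="euclidean")  # agglomeration schedule
labels = fcluster(Z, t=3, criterion="maxclust")       # membership for a 3-cluster solution

dendrogram(Z)                                         # tree of the merge history
plt.show()
print(labels)
```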

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or values before computing proximities (not available for binary data). Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
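A small sketch of the two parameterized measures above, assuming NumPy and SciPy; the vectors and the values of p and r are illustrative.

```python
import numpy as np
from scipy.spatial.distance import minkowski

x = np.array([1.0, 2.0, 4.0])
y = np.array([2.0, 0.0, 3.0])

p, r = 3, 2
d_minkowski = minkowski(x, y, p=p)                    # (sum |xi-yi|^p)^(1/p)
d_custom = np.sum(np.abs(x - y) ** p) ** (1.0 / r)    # (sum |xi-yi|^p)^(1/r)
print(d_minkowski, d_custom)
```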

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
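A hedged sketch of the chi-square measure, written out in NumPy from the standard definition: the two frequency vectors are treated as the rows of a 2 x k table, and departure from equal distributions is measured; the phi-square variant divides by the square root of the combined frequency.

```python
import numpy as np

def chi_square_measure(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    col = x + y                     # column totals of the 2 x k table
    n = col.sum()                   # combined frequency N
    e_x = col * x.sum() / n         # expected counts for row 1 under equality
    e_y = col * y.sum() / n         # expected counts for row 2 under equality
    chi2 = ((x - e_x) ** 2 / e_x).sum() + ((y - e_y) ** 2 / e_y).sum()
    return np.sqrt(chi2)

x, y = [10, 20, 30], [15, 18, 25]
chi = chi_square_measure(x, y)
phi = chi / np.sqrt(np.sum(x) + np.sum(y))  # phi-square measure
print(chi, phi)
```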

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data (a short sketch computing several of them from the fourfold table follows the list):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. A dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of -1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
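The promised sketch: all of the binary measures above are functions of the fourfold-table counts a, b, c, and d. A few of the named formulas, written out in plain NumPy as an illustration:

```python
import numpy as np

u = np.array([1, 1, 0, 1, 0, 0, 1])
v = np.array([1, 0, 0, 1, 1, 0, 0])

a = np.sum((u == 1) & (v == 1))   # joint presences
b = np.sum((u == 1) & (v == 0))   # present in u, absent in v
c = np.sum((u == 0) & (v == 1))   # absent in u, present in v
d = np.sum((u == 0) & (v == 0))   # joint absences
n = a + b + c + d

euclid = np.sqrt(b + c)                       # binary Euclidean distance
jaccard = a / (a + b + c)                     # joint absences excluded
dice = 2 * a / (2 * a + b + c)                # matches weighted double
simple_match = (a + d) / n                    # matches vs. all values
lance_williams = (b + c) / (2 * a + b + c)    # Bray-Curtis nonmetric coefficient
print(euclid, jaccard, dice, simple_match, lance_williams)
```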

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
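A short NumPy sketch of the By variable versus By case choice, shown for z scores (the other transforms substitute the range, maximum, mean, or standard deviation):

```python
import numpy as np

X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 40.0, 250.0]])

# By variable: each column gets mean 0 and standard deviation 1.
by_variable = (X - X.mean(axis=0)) / X.std(axis=0)

# By case: each row gets mean 0 and standard deviation 1.
by_case = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

print(by_variable)
print(by_case)
```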

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved before the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
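A rough analogue of this sample-then-classify workflow, assuming scikit-learn rather than SPSS (writing centers to a file is replaced by keeping the fitted model in memory):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100_000, 5))                  # the "entire data file"
sample = X_full[rng.choice(len(X_full), 5_000, replace=False)]

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(sample)  # iterate and classify on the sample
labels_full = km.predict(X_full)                                  # classify only, reusing the sample's centers
print(labels_full[:10])
```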

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.
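A brief sketch, assuming scikit-learn, of deriving both saved variables from a fitted model; transform() returns each case's distance to every center, so the row minimum is the distance to the classification center.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 3))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

membership = km.labels_ + 1                   # 1..k, matching the saved variable's range
dist_to_center = km.transform(X).min(axis=1)  # Euclidean distance to the assigned center
print(membership[:5], dist_to_center[:5])
```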

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. The first estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex data sets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.
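A compact SciPy sketch of the two directions (an assumption; SPSS computes these internally): distances between cases use the data matrix as-is, and distances between variables use its transpose.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(10, 4))   # 10 cases, 4 variables

between_cases = squareform(pdist(X, metric="euclidean"))        # 10 x 10 matrix
between_variables = squareform(pdist(X.T, metric="euclidean"))  # 4 x 4 matrix
print(between_cases.shape, between_variables.shape)
```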

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. A dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.

• Dispersion. This index has a range of -1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases or variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency-count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured); a short sketch mapping several of these methods onto an open-source implementation follows the list.

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between -1 and 1.

K-nbd. The Kth nearest neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the Kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number K; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median of the distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.
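Several of these linkage rules exist under the same names in SciPy, which can serve as a rough reference implementation (an assumption; SYSTAT's internals may differ). Flexible beta and the two density methods (K-nbd, Uniform) have no direct SciPy equivalent.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(15, 3))

for method in ["single", "complete", "average", "weighted", "centroid", "median", "ward"]:
    Z = linkage(X, method=method)   # merge schedule under this linkage rule
    print(method, Z[-1, 2])         # distance at which the last two clusters join
```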

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Provides two options: save cluster identifiers, or save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file; if not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix used to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file; otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node down the cluster tree, the color changes when the length of a node on the distance scale crosses the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.



• Export final model. The final cluster model is exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

• Export CF tree. This option allows you to save the current state of the cluster tree and update it later using newer data. See TwoStep Cluster Analysis Options for more information on reading this file.

The Cluster Viewer

Cluster models are typically used to find groups (or clusters) of similar records based on the variables examined, where the similarity between members of the same group is high and the similarity between members of different groups is low. The results can be used to identify associations that would otherwise not be apparent. For example, through cluster analysis of customer preferences, income level, and buying habits, it may be possible to identify the types of customers who are more likely to respond to a particular marketing campaign.

There are two approaches to interpreting the results in a cluster display:

• Examine clusters to determine characteristics unique to that cluster. Does one cluster contain all the high-income borrowers? Does this cluster contain more records than the others?

• Examine fields across clusters to determine how values are distributed among clusters. Does one's level of education determine membership in a cluster? Does a high credit score distinguish between membership in one cluster or another?

Using the main views and the various linked views in the Cluster Viewer, you can gain insight to help you answer these questions.

The Cluster Viewer is made up of two panels: the main view on the left and the linked or auxiliary view on the right. There are two main views:

• Model Summary (the default). See the topic Model Summary View for more information.

• Clusters. See the topic Clusters View for more information.

There are four linked/auxiliary views:

• Predictor Importance. See the topic Cluster Predictor Importance View for more information.

• Cluster Sizes (the default). See the topic Cluster Sizes View for more information.

• Cell Distribution. See the topic Cell Distribution View for more information.

• Cluster Comparison. See the topic Cluster Comparison View for more information.

The Model Summary view shows a snapshot, or summary, of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check whether the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings and produce a better result.

The results of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating of either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B − A) / max(A, B), where A is the record's distance to its own cluster center and B is the record's distance to the nearest cluster center to which it does not belong. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of −1 would mean that all cases are located on the cluster centers of some other cluster. A value of 0 means that, on average, cases are equidistant between their own cluster center and the nearest other cluster.
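A minimal sketch of this center-based silhouette computation (Python/NumPy; it assumes labels and centers produced by some clustering run):

```python
import numpy as np

def silhouette_measure(X, labels, centers):
    """Average of (B - A) / max(A, B) over all records, where A is the
    distance to the record's own center and B is the distance to the
    nearest center it does not belong to."""
    scores = []
    for x, own in zip(X, labels):
        d = np.linalg.norm(centers - x, axis=1)
        a = d[own]
        b = np.delete(d, own).min()
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```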

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, TwoStep.

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (blank by default). Double-click in the cell to enter a label that describes the cluster contents; for example, "Luxury car buyers".

• Description. Any description of the cluster contents (blank by default). Double-click in the cell to enter a description of the cluster; for example, "55+ years of age, professionals, earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value, for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features

By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features

The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters

By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields:

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields, and the mode (most frequently occurring category) with category percentage is shown for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid, with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot that uses the same endpoints and intervals for each cluster.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and a percentage of the whole).

• The size of the largest cluster (both a count and a percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent (modal) category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for the selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model, but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or variables before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data (a code sketch of these measures follows the list):

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
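A code sketch of these interval measures (Python/NumPy; p and r are the Minkowski and customized powers, and the Pearson and cosine entries are similarities rather than distances):

```python
import numpy as np

def interval_measures(x, y, p=2, r=2):
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff = np.abs(x - y)
    return {
        "euclidean": np.sqrt(np.sum(diff ** 2)),
        "squared_euclidean": np.sum(diff ** 2),
        "chebychev": diff.max(),
        "block": diff.sum(),                           # Manhattan distance
        "minkowski": np.sum(diff ** p) ** (1.0 / p),   # pth root of pth powers
        "customized": np.sum(diff ** p) ** (1.0 / r),  # rth root of pth powers
        # similarity measures (larger values mean more alike):
        "pearson": np.corrcoef(x, y)[0, 1],
        "cosine": x @ y / (np.linalg.norm(x) * np.linalg.norm(y)),
    }
```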

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data (a sketch follows the list):

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
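A sketch of one common formulation of these two measures (Python/NumPy; building the expected frequencies from the pooled margins is an assumption here, so consult the package documentation for the exact definition):

```python
import numpy as np

def chi_square_measure(x, y):
    """Chi-square distance between two frequency vectors, based on the
    chi-square test of equality for the two sets of frequencies.
    Assumes every pooled category has a nonzero total."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.sum() + y.sum()
    col = x + y                       # pooled column totals
    ex = x.sum() * col / n            # expected frequencies for x
    ey = y.sum() * col / n            # expected frequencies for y
    return np.sqrt(np.sum((x - ex) ** 2 / ex) + np.sum((y - ey) ** 2 / ey))

def phi_square_measure(x, y):
    """Chi-square measure normalized by the square root of the combined frequency."""
    n = np.sum(x) + np.sum(y)
    return chi_square_measure(x, y) / np.sqrt(n)
```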

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data (a sketch of the fourfold-table computations follows the list):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n^2), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2×2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
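The fourfold-table measures above can be sketched as follows (Python/NumPy; the cell names follow the descriptions in the list: a = both present, b and c = present on one item only, d = both absent):

```python
import numpy as np

def fourfold(u, v, present=1, absent=0):
    """Build the 2x2 table for two binary vectors; values other than
    `present`/`absent` are ignored, as described above."""
    u, v = np.asarray(u), np.asarray(v)
    keep = np.isin(u, (present, absent)) & np.isin(v, (present, absent))
    u, v = u[keep] == present, v[keep] == present
    return np.sum(u & v), np.sum(u & ~v), np.sum(~u & v), np.sum(~u & ~v)

def binary_dissimilarities(u, v):
    a, b, c, d = fourfold(u, v)
    n = a + b + c + d
    return {
        "euclidean": np.sqrt(b + c),
        "squared_euclidean": b + c,
        "pattern_difference": b * c / n ** 2,
        "variance": (b + c) / (4 * n),
        "lance_williams": (b + c) / (2 * a + b + c),
    }
```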

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values (a sketch follows the list):

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
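A sketch of these standardizations applied to a single variable or case (Python/NumPy; whether the sample or population standard deviation is used is package-specific, so ddof=1 below is an assumption):

```python
import numpy as np

def transform_values(x, method="z"):
    x = np.asarray(x, float)
    if method == "z":                # mean 0, standard deviation 1
        return (x - x.mean()) / x.std(ddof=1)
    if method == "range_-1_1":       # divide by the range
        return x / (x.max() - x.min())
    if method == "range_0_1":        # subtract the minimum, divide by the range
        return (x - x.min()) / (x.max() - x.min())
    if method == "max_1":            # divide by the maximum
        return x / x.max()
    if method == "mean_1":           # divide by the mean
        return x / x.mean()
    if method == "sd_1":             # divide by the standard deviation
        return x / x.std(ddof=1)
    raise ValueError(method)
```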

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
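The same sample-then-classify pattern can be illustrated with scikit-learn (an analogy under our own variable names, not the procedure itself):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))          # stands in for the full data file

# "Iterate and classify" on a sample to estimate the cluster centers.
sample = X[rng.choice(len(X), size=2_000, replace=False)]
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(sample)

# "Classify only" the entire file using the centers estimated from the sample.
labels = km.predict(X)
```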

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.
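To make the rule concrete, here is a small worked check (the center coordinates and per-iteration movements are invented for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist

initial_centers = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 0.0]])
criterion = 0.02

# The smallest distance between any initial centers is 5.0 here, so
# iteration stops once no center moves by more than 0.02 * 5.0 = 0.1.
threshold = criterion * pdist(initial_centers).min()

moves = np.array([0.04, 0.07, 0.02])     # per-center movement in one iteration
print(bool(np.all(moves <= threshold)))  # True: iteration would stop
```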

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.
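For intuition, the two choices differ only in whether the rows or the columns of the data matrix are compared, as this SciPy sketch (not the procedure's own code) shows:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [5.0, 1.0, 0.0]])          # rows = cases, columns = variables

between_cases = squareform(pdist(X, metric="euclidean"))
between_variables = squareform(pdist(X.T, metric="euclidean"))
```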

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n^2), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2×2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency-count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows: Rows (cases) of the data matrix are clustered.
Columns: Columns (variables) of the data matrix are clustered.
Matrix: Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, to define how distances between clusters are measured). A brief code sketch using several of these linkage methods follows the list below.

Average: Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid: Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete: Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta: Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between −1 and 1.

K-nbd: The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median: Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single: Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform: The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward: Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted: Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.
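As a rough illustration (SciPy's hierarchical clustering, not SYSTAT's implementation), several of these linkage rules can be tried as follows:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(1).normal(size=(20, 3))
d = pdist(X)                               # pairwise Euclidean distances

# method can also be "single", "complete", "centroid", "median",
# "ward", or "weighted".
tree = linkage(d, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")   # cut into two clusters
```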

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance: Specifies the distance metric used to compare clusters.
Polar: Produces a polar (circular) cluster tree.

Save: Provides two options: save cluster identifiers only, or save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file; if not specified, two clusters are identified.


The Model Summary view shows a snapshot or summary of the cluster model, including a Silhouette measure of cluster cohesion and separation that is shaded to indicate poor, fair, or good results. This snapshot enables you to quickly check whether the quality is poor, in which case you may decide to return to the modeling node to amend the cluster model settings and produce a better result.

The ratings of poor, fair, and good are based on the work of Kaufman and Rousseeuw (1990) regarding interpretation of cluster structures. In the Model Summary view, a good result equates to data that reflects Kaufman and Rousseeuw's rating of either reasonable or strong evidence of cluster structure, fair reflects their rating of weak evidence, and poor reflects their rating of no significant evidence.

The silhouette measure averages, over all records, (B − A) / max(A, B), where A is the record's distance to its own cluster center and B is the record's distance to the nearest cluster center that it does not belong to. A silhouette coefficient of 1 would mean that all cases are located directly on their cluster centers. A value of −1 would mean that all cases are located on the cluster center of some other cluster. A value of 0 means that, on average, cases are equidistant between their own cluster center and the nearest other cluster.
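As a concrete illustration, here is a minimal Python sketch of this simplified silhouette computation (assuming NumPy arrays; X, centers, and labels are illustrative names, with labels coded 0 to k−1; this is not the product's own code):

    import numpy as np

    def simplified_silhouette(X, centers, labels):
        """Average of (B - A) / max(A, B) over all records, where A is the
        distance to the record's own cluster center and B is the distance
        to the nearest center of another cluster."""
        # Distances from every case to every cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        n = X.shape[0]
        A = d[np.arange(n), labels]             # distance to own center
        d_other = d.copy()
        d_other[np.arange(n), labels] = np.inf  # mask out the own cluster
        B = d_other.min(axis=1)                 # nearest other center
        return np.mean((B - A) / np.maximum(A, B))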

The summary includes a table that contains the following information:

• Algorithm. The clustering algorithm used, for example, TwoStep.

• Input Features. The number of fields, also known as inputs or predictors.

• Clusters. The number of clusters in the solution.

The Clusters view contains a cluster-by-features grid that includes cluster names, sizes, and profiles for each cluster.

The columns in the grid contain the following information:

• Cluster. The cluster numbers created by the algorithm.

• Label. Any labels applied to each cluster (blank by default). Double-click in the cell to enter a label that describes the cluster contents, for example, "Luxury car buyers".

• Description. Any description of the cluster contents (blank by default). Double-click in the cell to enter a description of the cluster, for example, "55+ years of age, professionals earning over $100,000".

• Size. The size of each cluster as a percentage of the overall cluster sample. Each size cell within the grid displays a vertical bar that shows the size percentage within the cluster, a size percentage in numeric format, and the cluster case counts.

• Features. The individual inputs or predictors, sorted by overall importance by default. If any columns have equal sizes, they are shown in ascending sort order of the cluster numbers.

Overall feature importance is indicated by the color of the cell background shading; the most important feature is darkest, and the least important feature is unshaded. A guide above the table indicates the importance attached to each feature cell color.

When you hover your mouse over a cell, the full name/label of the feature and the importance value for the cell are displayed. Further information may be displayed, depending on the view and feature type. In the Cluster Centers view, this includes the cell statistic and the cell value, for example, "Mean: 4.32". For categorical features, the cell shows the name of the most frequent (modal) category and its percentage.

Within the Clusters view, you can select various ways to display the cluster information:

• Transpose clusters and features. See the topic Transpose Clusters and Features for more information.

• Sort features. See the topic Sort Features for more information.

• Sort clusters. See the topic Sort Clusters for more information.

• Select cell contents. See the topic Cell Contents for more information.

Transpose Clusters and Features
By default, clusters are displayed as columns and features are displayed as rows. To reverse this display, click the Transpose Clusters and Features button to the left of the Sort Features By buttons. For example, you may want to do this when you have many clusters displayed, to reduce the amount of horizontal scrolling required to see the data.

Sort Features
The Sort Features By buttons enable you to select how feature cells are displayed:

• Overall Importance. This is the default sort order. Features are sorted in descending order of overall importance, and the sort order is the same across clusters. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names.

• Within-Cluster Importance. Features are sorted with respect to their importance for each cluster. If any features have tied importance values, the tied features are listed in ascending sort order of the feature names. When this option is chosen, the sort order usually varies across clusters.

• Name. Features are sorted by name in alphabetical order.

• Data order. Features are sorted by their order in the dataset.

Sort Clusters
By default, clusters are sorted in descending order of size. The Sort Clusters By buttons enable you to sort them by name in alphabetical order or, if you have created unique labels, in alphanumeric label order instead.

Clusters that have the same label are sorted by cluster name. If clusters are sorted by label and you edit the label of a cluster, the sort order is automatically updated.

Cell Contents
The Cells buttons enable you to change the display of the cell contents for features and evaluation fields:

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields, and the mode (most frequently occurring category) with category percentage is shown for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot, which uses the same endpoints and intervals for each cluster.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and percentage of the whole).

• The size of the largest cluster (both a count and percentage of the whole).

• The ratio of the size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent (modal) category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer
The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. You can also reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records
If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.
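Outside the dialogs, the same filtering idea can be sketched in a few lines of Python with pandas (assuming cluster membership was already saved to a column; the file and column names here are hypothetical):

    import pandas as pd

    df = pd.read_csv("customers.csv")     # data with a saved cluster column
    selected = [2, 5]                     # clusters chosen in the viewer
    df["filter_1"] = df["cluster"].isin(selected).astype(int)

    # Subsequent analyses use only records where the filter variable is 1
    subset = df[df["filter_1"] == 1]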

Hierarchical Cluster Analysis
This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.
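For readers who want to experiment outside the dialogs, roughly the same analysis can be sketched with SciPy; this is an analog under illustrative data and settings, not the procedure's own implementation:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    X = np.random.rand(20, 4)      # 20 cases, 4 standardized variables

    Z = linkage(X, method="ward")  # Z plays the role of the agglomeration schedule
    dendrogram(Z)                  # tree plot (requires matplotlib)
    membership = fcluster(Z, t=3, criterion="maxclust")  # 3-cluster membership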

Hierarchical Cluster Analysis Method
Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russell and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or variables before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. These transformations are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data
The following dissimilarity measures are available for interval data (a short computational sketch follows the list):

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
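A minimal NumPy sketch of these interval measures for two vectors x and y, mirroring the definitions above (p and r stand for the user-specified powers):

    import numpy as np

    def euclid(x, y):           return np.sqrt(np.sum((x - y) ** 2))
    def sq_euclid(x, y):        return np.sum((x - y) ** 2)
    def chebychev(x, y):        return np.max(np.abs(x - y))
    def block(x, y):            return np.sum(np.abs(x - y))  # Manhattan distance
    def minkowski(x, y, p):     return np.sum(np.abs(x - y) ** p) ** (1 / p)
    def customized(x, y, p, r): return np.sum(np.abs(x - y) ** p) ** (1 / r)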

Hierarchical Cluster Analysis Measures for Count Data
The following dissimilarity measures are available for count data (a sketch of both follows the list):

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
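A hedged sketch of both measures for a pair of frequency vectors, treating them as the two rows of a 2 × k contingency table; this follows the usual chi-square-of-equality construction and should be checked against the software's algorithms documentation before being relied on:

    import numpy as np

    def chi_square_measure(x, y):
        # Expected counts under equality of the two frequency profiles
        table = np.vstack([x, y]).astype(float)
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
        return np.sqrt(np.sum((table - expected) ** 2 / expected))

    def phi_square_measure(x, y):
        n = float(np.sum(x) + np.sum(y))  # combined frequency
        return chi_square_measure(x, y) / np.sqrt(n)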

Hierarchical Cluster Analysis Measures for Binary Data
The following dissimilarity measures are available for binary data (a sketch computing several of them from the fourfold table follows the list):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
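Several of the measures above are simple functions of the fourfold (2 × 2) table counts, where a = present on both items, b and c = present on one item but absent on the other, d = absent on both, and n = a + b + c + d. A brief illustrative sketch (the selection of measures shown is ours, not an official list):

    def binary_measures(a, b, c, d):
        """Selected binary measures computed from the fourfold table."""
        n = a + b + c + d
        return {
            "euclidean":          (b + c) ** 0.5,
            "squared_euclidean":  b + c,
            "pattern_difference": (b * c) / n ** 2,
            "variance":           (b + c) / (4 * n),
            "lance_williams":     (b + c) / (2 * a + b + c),
            "simple_matching":    (a + d) / n,      # a similarity, for contrast
            "jaccard":            a / (a + b + c),  # a similarity, joint absences excluded
        }

    print(binary_measures(a=20, b=5, c=3, d=12))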

Hierarchical Cluster Analysis Transform Values
The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case. A short sketch of these standardizations follows.
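The sketch below applies each transformation to a single variable x; applied column-wise this corresponds to By variable, row-wise to By case (illustrative values only):

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 6.0, 9.0])

    z_scores  = (x - x.mean()) / x.std(ddof=1)   # mean 0, standard deviation 1
    range_pm1 = x / (x.max() - x.min())          # range -1 to 1: divide by the range
    range_01  = (x - x.min()) / (x.max() - x.min())
    max_mag_1 = x / np.abs(x).max()
    mean_1    = x / x.mean()
    sd_1      = x / x.std(ddof=1)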

Hierarchical Cluster Analysis Statistics
Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots
Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables
Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables
This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis
This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency
The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
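The same sample-then-classify workflow can be sketched with scikit-learn as a rough analog (the data and settings are hypothetical; the procedure itself exchanges centers through files or datasets rather than Python objects):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(100_000, 6)            # stands in for the full data file
    rng = np.random.default_rng(0)
    sample = X[rng.choice(len(X), 2000, replace=False)]

    # Step 1: iterate and classify on the sample to estimate cluster centers
    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sample)

    # Step 2: classify only, assigning the entire file with the estimated centers
    full_membership = km.predict(X)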

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.

K-Means Cluster Analysis Save
You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.
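In a scikit-learn analog, the two saved variables correspond to the fitted labels and the distance to the assigned center (a sketch, not the procedure's own output format):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(500, 4)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    cluster_membership = km.labels_ + 1                    # values 1 .. k
    distance_to_center = np.min(km.transform(X), axis=1)   # Euclidean distance to
                                                           # the assigned center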

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options
Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table that includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances
This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.
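As an outside-the-dialog analog, SciPy computes the same kinds of proximity matrices; a sketch with illustrative data:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.random.rand(10, 3)   # 10 cases, 3 numeric variables

    D_cases = squareform(pdist(X, metric="euclidean"))    # between cases
    D_vars  = squareform(pdist(X.T, metric="cityblock"))  # between variables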

Distances Dissimilarity Measures
From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. These transformations are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data
The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data
The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data
The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures
From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. These transformations are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data
The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data
The following similarity measures are available for binary data:

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values
The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows: Rows (cases) of the data matrix are clustered.

Columns: Columns (variables) of the data matrix are clustered.

Matrix: Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured):

Average: Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid: Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete: Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta: Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; its range is between −1 and 1.

K-nbd: The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median: Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single: Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform: The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward: Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted: Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms. Consult his paper for further details.

In addition, the following options can be specified:

Distance: Specifies the distance metric used to compare clusters.

Polar: Produces a polar (circular) cluster tree.

Save: Provides two options, either to save cluster identifiers or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix: Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.
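For reference, the Mahalanobis distance between cases x and y under covariance matrix S is sqrt((x − y)' S^(-1) (x − y)); a minimal NumPy sketch, estimating S from the data when none is supplied, as in the default described above:

    import numpy as np

    def mahalanobis(x, y, S):
        diff = x - y
        return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

    X = np.random.rand(50, 3)
    S = np.cov(X, rowvar=False)   # default: estimate the matrix from the data
    d = mahalanobis(X[0], X[1], S)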

Cut cluster tree at: You can choose the following options for cutting the cluster tree:

Height: Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes: Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by: The colors in the cluster tree can be assigned by two different methods:

Length of terminal node: As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes: Colors are assigned based on the proportion of members in a cluster.

Validity: Provides five validity indices to evaluate the partition quality. In particular, these are used to find an appropriate number of clusters for the given data set:

RMSSTD: Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F: Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square: Provides the pseudo T-square statistic for cluster assessment.

DB: Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn: Provides Dunn's cluster separation measure.

Maximum groups: Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
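Two of these indices have widely available analogs: scikit-learn exposes the Davies-Bouldin index directly, and its Calinski-Harabasz score is the usual pseudo F-ratio. A hedged sketch for comparing candidate numbers of clusters (this is not SYSTAT's own computation):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

    X = np.random.rand(200, 4)
    for k in range(2, 8):
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
        print(k,
              round(davies_bouldin_score(X, labels), 3),     # lower is better
              round(calinski_harabasz_score(X, labels), 1))  # pseudo F; higher is better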

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. They continue splitting one of the clusters into two (and reassigning cases) until a specified number of clusters is formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm: Provides K-Means and K-Medians clustering options.

K-means: Requests K-Means clustering.

K-medians: Requests K-Medians clustering.

Groups: Enter the number of desired clusters. The default number of groups is two.

Iterations: Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance: Specifies the distance metric used to compare clusters.

Save: Provides three options, to save either cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.


The initial seed options are:

None: Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster, and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k: Considers the first k non-missing cases as initial seeds.

Last k: Considers the last k non-missing cases as initial seeds.

Random k: Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation: Assigns each case to any of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.
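A minimal sketch of the farthest-case seeding idea behind the default (None) option: each new seed is the case farthest from its nearest existing center (illustrative only; the actual algorithm also reassigns cases and iterates as described above):

    import numpy as np

    def farthest_point_seeds(X, k):
        """Pick k seeds by repeatedly taking the case farthest from the
        nearest current center."""
        centers = [X.mean(axis=0)]              # start with one cluster center
        for _ in range(1, k):
            d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
            centers.append(X[np.argmax(d)])     # farthest case becomes a seed
        return np.array(centers)

    X = np.random.rand(100, 2)
    seeds = farthest_point_seeds(X, k=4)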

  • Two Step Cluster Analysis
    • To Obtain a TwoStep Cluster Analysis
      • TwoStep Cluster Analysis Options
        • Advanced Options
          • TwoStep Cluster Analysis Output
          • The Cluster Viewer
          • Transpose Clusters and Features
          • Sort Features
          • Sort Clusters
          • Cell Contents
          • Cluster Comparison View
          • Navigating the Cluster Viewer
            • Using the Toolbars
            • Control Cluster View Display
              • Filtering Records
              • Hierarchical Cluster Analysis
              • Hierarchical Cluster Analysis Method
              • Hierarchical Cluster Analysis Measures for Interval Data
              • Hierarchical Cluster Analysis Measures for Count Data
              • Hierarchical Cluster Analysis Measures for Binary Data
              • Hierarchical Cluster Analysis Transform Values
              • Hierarchical Cluster Analysis Statistics
              • Hierarchical Cluster Analysis Plots
              • Hierarchical Cluster Analysis Save New Variables
              • Saving New Variables
              • K-Means Cluster Analysis
              • K-Means Cluster Analysis Efficiency
              • K-Means Cluster Analysis Iterate
              • K-Means Cluster Analysis Save
              • K-Means Cluster Analysis Options
              • Distances
                • To Obtain Distance Matrices
                  • Distances Dissimilarity Measures
                  • Distances Dissimilarity Measures for Interval Data
                  • Distances Dissimilarity Measures for Count Data
                  • Distances Dissimilarity Measures for Binary Data
                  • Distances Similarity Measures
                  • Distances Similarity Measures for Interval Data
                  • Distances Similarity Measures for Binary Data
                  • Distances Transform Values
Page 13: TwoStep Cluster Analysis

bull Features The individual inputs or predictors sorted by overall importance by default If any columns have equal sizes they are shown in ascending sort order of the cluster numbers

Overall feature importance is indicated by the color of the cell background shading the most important feature is darkest the least important feature is unshaded A guide above the table indicates the importance attached to each feature cell color

When you hover your mouse over a cell the full namelabel of the feature and the importance value for the cell is displayed Further information may be displayed depending on the view and feature type In the Cluster Centers view this includes the cell statistic and the cell value for example ldquoMean 432rdquo For categorical features the cell shows the name of the most frequent (modal) category and its percentage

Within the Clusters view you can select various ways to display the cluster information

bull Transpose clusters and features See the topic Transpose Clusters and Features for more information

bull Sort features See the topic Sort Features for more information

bull Sort clusters See the topic Sort Clusters for more information

bull Select cell contents See the topic Cell Contents for more information

Transpose Clusters and FeaturesBy default clusters are displayed as columns and features are displayed as rows To reverse this display click the Transpose Clusters and Features button to the left of the Sort Features By buttons For example you may want to do this when you have many clusters displayed to reduce the amount of horizontal scrolling required to see the data

Sort FeaturesThe Sort Features By buttons enable you to select how feature cells are displayed

bull Overall Importance This is the default sort order Features are sorted in descending order of overall importance and sort order is the same across clusters If any features have tied importance values the tied features are listed in ascending sort order of the feature names

bull Within-Cluster Importance Features are sorted with respect to their importance for each cluster If any features have tied importance values the tied features are listed in ascending sort order of the feature names When this option is chosen the sort order usually varies across clusters

bull Name Features are sorted by name in alphabetical order

bull Data order Features are sorted by their order in the dataset

Sort ClustersBy default clusters are sorted in descending order of size The Sort Clusters By buttons enable you to sort them by name in alphabetical order or if you have created unique labels in alphanumeric label order instead

Features that have the same label are sorted by cluster name If clusters are sorted by label and you edit the label of a cluster the sort order is automatically updated

Cell Contents

The Cells buttons enable you to change the display of the cell contents for features and evaluation fields:

• Cluster Centers. By default, cells display feature names/labels and the central tendency for each cluster/feature combination. The mean is shown for continuous fields, and the mode (most frequently occurring category) with category percentage for categorical fields.

• Absolute Distributions. Shows feature names/labels and absolute distributions of the features within each cluster. For categorical features, the display shows bar charts overlaid with categories ordered in ascending order of the data values. For continuous features, the display shows a smooth density plot that uses the same endpoints and intervals for each cluster.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Relative Distributions. Shows feature names/labels and relative distributions in the cells. In general, the displays are similar to those shown for absolute distributions, except that relative distributions are displayed instead.

The solid red display shows the cluster distribution, while the paler display represents the overall data.

• Basic View. Where there are a lot of clusters, it can be difficult to see all the detail without scrolling. To reduce the amount of scrolling, select this view to change the display to a more compact version of the table.

The Cluster Sizes view shows a pie chart that contains each cluster. The percentage size of each cluster is shown on each slice; hover the mouse over each slice to display the count in that slice.

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and percentage of the whole).

• The size of the largest cluster (both a count and percentage of the whole).

• The ratio of size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters, not only as compared with the overall data but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent (modal) category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model, but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.
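The generated filter field is just a 0/1 indicator on each record. As a rough illustration outside SPSS, here is a minimal pandas sketch of the same idea; the column names "cluster" and "filter_1" are hypothetical, not part of the procedure.

```python
import pandas as pd

# Hypothetical data: each record already carries its cluster membership.
df = pd.DataFrame({"age": [23, 45, 31, 52],
                   "cluster": [1, 2, 1, 3]})

selected = {1, 3}  # clusters picked in the viewer

# The filter field: 1 for records in the selected clusters, 0 otherwise.
df["filter_1"] = df["cluster"].isin(selected).astype(int)

# Subsequent analyses would use only the flagged records.
subset = df[df["filter_1"] == 1]
print(subset)
```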

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.
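For readers who want to experiment outside the dialogs, the following is a minimal SciPy sketch of the same agglomerative workflow: compute proximities, build the agglomeration schedule, cut the tree for a chosen solution, and draw the dendrogram. It illustrates the general technique, not the SPSS implementation itself.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 cases measured on 3 variables

d = pdist(X, metric="euclidean")      # proximities between all pairs of cases
Z = linkage(d, method="average")      # agglomeration schedule (between-groups linkage)

membership = fcluster(Z, t=4, criterion="maxclust")  # membership, 4-cluster solution

dendrogram(Z)                         # tree plot of the merge history
plt.show()
```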

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are: between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are: Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are: chi-square measure and phi-square measure.

• Binary. Available alternatives are: Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or variables before computing proximities (not available for binary data). Available standardization methods are: z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are: absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data (a short sketch after this list illustrates several of them in code):

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
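As a sketch of these formulas, SciPy's distance module implements most of them directly. Note that scipy.spatial.distance.cosine returns the cosine distance (1 minus the cosine), so the similarity is recovered by subtraction.

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 1.0])

print(distance.euclidean(u, v))       # square root of sum of squared differences
print(distance.sqeuclidean(u, v))     # sum of squared differences
print(distance.chebyshev(u, v))       # maximum absolute difference
print(distance.cityblock(u, v))       # block / Manhattan distance
print(distance.minkowski(u, v, p=3))  # pth root of sum of |differences|^p
print(1 - distance.cosine(u, v))      # cosine of the angle between the vectors

# "Customized": rth root of the sum of absolute differences to the pth power.
p, r = 3, 2
print(np.sum(np.abs(u - v) ** p) ** (1 / r))
```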

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data (a short sketch after this list computes several of them from the fourfold table):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
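The fourfold-table measures above are easy to verify by hand. Here is a minimal sketch computing a few of them from two binary vectors, using the a, b, c, d cell counts defined in the text; the example vectors are made up.

```python
import numpy as np

def fourfold(x, y):
    """Cell counts for two binary vectors: a = joint presences,
    b, c = present on one item but absent on the other, d = joint absences."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    a = int(np.sum(x & y))
    b = int(np.sum(x & ~y))
    c = int(np.sum(~x & y))
    d = int(np.sum(~x & ~y))
    return a, b, c, d

a, b, c, d = fourfold([1, 1, 0, 1, 0, 0, 1], [1, 0, 0, 1, 1, 0, 1])
n = a + b + c + d

print(np.sqrt(b + c))                 # binary Euclidean distance, SQRT(b+c)
print(b * c / n**2)                   # pattern difference, bc/n^2
print((b + c) / (4 * n))              # variance, (b+c)/4n
print((b + c) / (2 * a + b + c))      # Lance and Williams (Bray-Curtis)
print(a / (a + b + c))                # Jaccard similarity (joint absences excluded)
print(2 * a / (2 * a + b + c))        # Dice: matches weighted double
```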

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values (a short sketch after this list shows each rescaling in code):

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
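A compact NumPy sketch of the six rescalings, applied to a single variable. Whether SPSS uses the sample or population standard deviation is not stated here; the sketch assumes the sample form.

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 4.0])        # values of one variable (or one case)
rng = x.max() - x.min()
sd = x.std(ddof=1)                        # sample standard deviation (assumption)

z       = (x - x.mean()) / sd             # z scores: mean 0, standard deviation 1
rng_11  = x / rng                         # range -1 to 1: divide by the range
rng_01  = (x - x.min()) / rng             # range 0 to 1
max_mag = x / np.abs(x).max()             # maximum magnitude of 1
mean_1  = x / x.mean()                    # mean of 1
sd_1    = x / sd                          # standard deviation of 1
```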

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.
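As a conceptual sketch (scikit-learn, not the SPSS procedure), the following fits a k-means solution and reproduces the two savable variables: cluster membership and each case's Euclidean distance from its own cluster center.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # 500 cases, 4 clustering variables

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

membership = km.labels_                   # final cluster membership per case
centers = km.cluster_centers_             # final cluster centers

# Euclidean distance between each case and its classification center.
dist = np.linalg.norm(X - centers[membership], axis=1)
```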

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
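The same sample-then-classify idea translates directly to other tools. A hedged scikit-learn analog: estimate centers on a sample (the Iterate and classify step), then assign the full file against those saved centers (the Classify only step).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X_full = rng.normal(size=(100_000, 5))   # stands in for the entire data file

# Step 1: determine cluster centers from a sample of cases.
idx = rng.choice(len(X_full), size=5_000, replace=False)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_full[idx])

# Step 2: classify every case using the centers estimated from the sample.
labels_full = km.predict(X_full)
```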

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.
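The difference between the default batch update and Use running means can be seen in a small sketch: the batch step recomputes centers only after all cases are assigned, while the running-means step moves a center immediately after each assignment. This is an illustrative toy implementation, not the SPSS algorithm.

```python
import numpy as np

def batch_step(X, centers):
    """Default: assign all cases, then recompute each center once."""
    labels = np.array([np.linalg.norm(centers - x, axis=1).argmin() for x in X])
    return np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                     for k in range(len(centers))])

def running_means_step(X, centers):
    """'Use running means': update a center right after each case is assigned."""
    centers = centers.copy()
    counts = np.ones(len(centers))        # each center starts from its seed case
    for x in X:
        k = np.linalg.norm(centers - x, axis=1).argmin()
        counts[k] += 1
        centers[k] += (x - centers[k]) / counts[k]   # incremental mean update
    return centers
```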

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.
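A minimal sketch of the same idea with SciPy, assuming made-up values for engine size, MPG, and horsepower; variables are standardized first so that no single variable dominates the distances.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical autos: engine size (liters), MPG, horsepower.
X = np.array([[2.0, 30.0, 130.0],
              [3.5, 22.0, 210.0],
              [2.2, 28.0, 140.0]])

Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z scores by variable

D = squareform(pdist(Xz, metric="euclidean"))       # case-by-case distance matrix
print(np.round(D, 2))
```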

Statistics. Dissimilarity (distance) measures for interval data: Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data: chi-square or phi-square; for binary data: Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data: Pearson correlation or cosine; for binary data: Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are: z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are: absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are: z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are: absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases or variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz., statistical criteria, by which an appropriate number of clusters can be chosen from the Hierarchical Tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured):

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between −1 and 1.

K-nbd. The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible." Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms. Consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity

Provides five validity indices to evaluate the partition quality. In particular, they are used to find the appropriate number of clusters for the given data set (a short sketch after this list computes two analogous indices in code):

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
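Two of these criteria have widely available analogs: scikit-learn's davies_bouldin_score is the DB index, and calinski_harabasz_score is the usual pseudo F-ratio. A sketch that scans candidate numbers of clusters on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    db = davies_bouldin_score(X, labels)        # lower is better
    pf = calinski_harabasz_score(X, labels)     # pseudo F-ratio; higher is better
    print(f"k={k}: DB={db:.3f}, pseudo F={pf:.1f}")
```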

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters is formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics. The initial seeds can be chosen from the following options:

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to any of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.

Contents

• Two Step Cluster Analysis
  • To Obtain a TwoStep Cluster Analysis
  • TwoStep Cluster Analysis Options
  • Advanced Options
  • TwoStep Cluster Analysis Output
• The Cluster Viewer
  • Transpose Clusters and Features
  • Sort Features
  • Sort Clusters
  • Cell Contents
  • Cluster Comparison View
  • Navigating the Cluster Viewer
  • Using the Toolbars
  • Control Cluster View Display
  • Filtering Records
• Hierarchical Cluster Analysis
  • Hierarchical Cluster Analysis Method
  • Hierarchical Cluster Analysis Measures for Interval Data
  • Hierarchical Cluster Analysis Measures for Count Data
  • Hierarchical Cluster Analysis Measures for Binary Data
  • Hierarchical Cluster Analysis Transform Values
  • Hierarchical Cluster Analysis Statistics
  • Hierarchical Cluster Analysis Plots
  • Hierarchical Cluster Analysis Save New Variables
  • Saving New Variables
• K-Means Cluster Analysis
  • K-Means Cluster Analysis Efficiency
  • K-Means Cluster Analysis Iterate
  • K-Means Cluster Analysis Save
  • K-Means Cluster Analysis Options
• Distances
  • To Obtain Distance Matrices
  • Distances Dissimilarity Measures
  • Distances Dissimilarity Measures for Interval Data
  • Distances Dissimilarity Measures for Count Data
  • Distances Dissimilarity Measures for Binary Data
  • Distances Similarity Measures
  • Distances Similarity Measures for Interval Data
  • Distances Similarity Measures for Binary Data
  • Distances Transform Values
Page 14: TwoStep Cluster Analysis

bull Name Features are sorted by name in alphabetical order

bull Data order Features are sorted by their order in the dataset

Sort ClustersBy default clusters are sorted in descending order of size The Sort Clusters By buttons enable you to sort them by name in alphabetical order or if you have created unique labels in alphanumeric label order instead

Features that have the same label are sorted by cluster name If clusters are sorted by label and you edit the label of a cluster the sort order is automatically updated

Cell ContentsThe Cells buttons enable you to change the display of the cell contents for features and evaluation fields

bull Cluster Centers By default cells display feature nameslabels and the central tendency for each clusterfeature combination The mean is shown for continuous fields and the mode (most frequently occurring category) with category percentage for categorical fields

bull Absolute Distributions Shows feature nameslabels and absolute distributions of the features within each cluster For categorical features the display shows bar charts overlaid with categories ordered in ascending order of the data values For continuous features the display shows a smooth density plot which use the same endpoints and intervals for each cluster

The solid red colored display shows the cluster distribution whilst the paler display represents the overall data

bull Relative Distributions Shows feature nameslabels and relative distributions in the cells In general the displays are similar to those shown for absolute distributions except that relative distributions are displayed instead

The solid red colored display shows the cluster distribution while the paler display represents the overall data

bull Basic View Where there are a lot of clusters it can be difficult to see all the detail without scrolling To reduce the amount of scrolling select this view to change the display to a more compact version of the table

The Cluster Sizes view shows a pie chart that contains each cluster The percentage size of each cluster is shown on each slice hover the mouse over each slice to display the count in that slice

Below the chart a table lists the following size information

bull The size of the smallest cluster (both a count and percentage of the whole)

bull The size of the largest cluster (both a count and percentage of the whole)

bull The ratio of size of the largest cluster to the smallest cluster

Cluster Comparison ViewShow details

The Cluster Comparison view consists of a grid-style layout with features in the rows and selected clusters in the columns This view helps you to better understand the factors that make up the clusters it also enables you to see differences between clusters not only as compared with the overall data but with each other

To select clusters for display click on the top of the cluster column in the Clusters main panel Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison

Note You can select up to five clusters for display

Clusters are shown in the order in which they were selected while the order of fields is determined by the Sort Features By option When you select Within-Cluster Importance fields are always sorted by overall importance

The background plots show the overall distributions of each features

bull Categorical features are shown as dot plots where the size of the dot indicates the most frequentmodal category for each cluster (by feature)

bull Continuous features are displayed as boxplots which show overall medians and the interquartile ranges

Overlaid on these background views are boxplots for selected clusters

bull For continuous features square point markers and horizontal lines indicate the median and interquartile range for each cluster

bull Each cluster is represented by a different color shown at the top of the view

Navigating the Cluster ViewerThe Cluster Viewer is an interactive display You can

bull Select a field or cluster to view more details

bull Compare clusters to select items of interest

bull Alter the display

bull Transpose axes

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options You can change the orientation of the display (top-down left-to-right or right-to-left) using the toolbar controls In addition you can also reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel

Control Cluster View Display

To control what is shown in the Clusters view on the main panel click the Display button the Display dialog opens

Features Selected by default To hide all input features deselect the check box

Evaluation Fields Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display none are shown by default Note This check box is unavailable if no evaluation fields are available

Cluster Descriptions Selected by default To hide all cluster description cells deselect the check box

Cluster Sizes Selected by default To hide all cluster size cells deselect the check box

Maximum Number of Categories Specify the maximum number of categories to display in charts of categorical features the default is 20

Filtering Recordsf you want to know more about the cases in a particular cluster or group of clusters you can select a subset of records for further analysis based on the selected clusters

Select the clusters in the Cluster view of the Cluster Viewer To select multiple clusters use Ctrl-click

From the menus choose

Generate gt Filter Records

Enter a filter variable name Records from the selected clusters will receive a value of 1 for this field All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status

Click OK

Hierarchical Cluster Analysis This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left You can analyze raw variables or you can choose from a variety of standardizing transformations Distance or similarity measures are generated by the Proximities procedure Statistics are displayed at each stage to help you select the best solution

Example Are there identifiable groups of television shows that attract similar audiences within each group With hierarchical cluster analysis you could cluster television shows (cases) into homogeneous groups based on viewer characteristics This can be used to identify segments for marketing Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies

Statistics Agglomeration schedule distance (or similarity) matrix and cluster membership for a single solution or a range of solutions Plots dendrograms and icicle plots

Hierarchical Cluster Analysis MethodCluster Method Available alternatives are between-groups linkage within-groups linkage nearest neighbor furthest neighbor centroid clustering median clustering and Wards method

Measure Allows you to specify the distance or similarity measure to be used in clustering Select the type of data and the appropriate distance or similarity measure

bull Interval Available alternatives are Euclidean distance squared Euclidean distance cosine Pearson correlation Chebychev block Minkowski and customized

bull Counts Available alternatives are chi-square measure and phi-square measure

bull Binary Available alternatives are Euclidean distance squared Euclidean distance size difference pattern difference variance dispersion shape simple matching phi 4-point correlation lambda Anderbergs D dice Hamann Jaccard Kulczynski 1 Kulczynski 2 Lance and Williams Ochiai Rogers and Tanimoto Russel and Rao Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Sokal and Sneath 4 Sokal and Sneath 5 Yules Y and Yules Q

Transform Values Allows you to standardize data values for either cases or values before computing proximities (not available for binary data) Available standardization methods are z scores range minus1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 and standard deviation of 1

Transform Measures Allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available alternatives are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box click Method

Hierarchical Cluster Analysis Measures for Interval DataThe following dissimilarity measures are available for interval data

bull Euclidean distance The square root of the sum of the squared differences between values for the items This is the default for interval data

bull Squared Euclidean distance The sum of the squared differences between the values for the items

bull Pearson correlation The product-moment correlation between two vectors of values

bull Cosine The cosine of the angle between two vectors of values

bull Chebychev The maximum absolute difference between the values for the items

bull Block The sum of the absolute differences between the values of the item Also known as Manhattan distance

bull Minkowski The pth root of the sum of the absolute differences to the pth power between the values for the items

bull Customized The rth root of the sum of the absolute differences to the pth power between the values for the items

Hierarchical Cluster Analysis Measures for Count DataThe following dissimilarity measures are available for count data

bull Chi-square measure This measure is based on the chi-square test of equality for two sets of frequencies This is the default for count data

bull Phi-square measure This measure is equal to the chi-square measure normalized by the square root of the combined frequency

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of -1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
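
All of these binary measures are functions of the fourfold (2 x 2) table counts a, b, c, and d. A minimal NumPy sketch, assuming 1 = present and 0 = absent, computes several of them directly:

import numpy as np

u = np.array([1, 1, 0, 1, 0, 0, 1], dtype=bool)
v = np.array([1, 0, 0, 1, 1, 0, 1], dtype=bool)

a = np.sum(u & v)    # present on both items
b = np.sum(u & ~v)   # present on the first, absent on the second
c = np.sum(~u & v)   # absent on the first, present on the second
d = np.sum(~u & ~v)  # absent on both
n = a + b + c + d

euclidean  = np.sqrt(b + c)             # SQRT(b+c)
sq_euclid  = b + c                      # number of discordant cases
pattern    = b * c / n**2               # pattern difference, bc/n^2
variance   = (b + c) / (4 * n)          # variance, (b+c)/4n
matching   = (a + d) / n                # simple matching
jaccard    = a / (a + b + c)            # joint absences excluded
dice       = 2 * a / (2 * a + b + c)    # matches weighted double
lance_will = (b + c) / (2 * a + b + c)  # Lance and Williams / Bray-Curtis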

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
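
A minimal NumPy sketch of these standardizations applied to one variable (or case) x; whether the sample or population standard deviation is used is an assumption here, and the sketch uses the sample form:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

z         = (x - x.mean()) / x.std(ddof=1)       # z scores (sample SD assumed)
range_pm1 = x / (x.max() - x.min())              # range -1 to 1: divide by the range
range_01  = (x - x.min()) / (x.max() - x.min())  # range 0 to 1
max_mag_1 = x / np.abs(x).max()                  # maximum magnitude of 1
mean_1    = x / x.mean()                         # mean of 1
sd_1      = x / x.std(ddof=1)                    # standard deviation of 1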

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.
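
For orientation, SciPy's hierarchical clustering produces a close analogue of the agglomeration schedule (its linkage matrix) and can draw the dendrogram; icicle plots have no direct SciPy counterpart. A minimal sketch, not the SPSS output itself:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(10, 3))

Z = linkage(X, method="ward")  # each row: the two clusters joined, their
print(Z)                       # distance, and the size of the new cluster

dendrogram(Z)                  # tree of the successive joins
plt.show()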

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
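
This two-pass workflow can be mimicked outside SPSS, for example with scikit-learn. A sketch under that assumption, not the Quick Cluster implementation: estimate centers on a sample, then classify the full file with the fixed centers.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 4))                     # the "entire data file"
sample = X[rng.choice(len(X), 5_000, replace=False)]  # a sample of cases

# "Iterate and classify" on the sample to estimate the centers.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sample)

# "Classify only" for the full file, reusing the estimated centers.
labels = km.predict(X)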

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses.

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.
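
The cases-versus-variables choice amounts to whether proximities are computed over the rows or the columns of the data matrix, as in this SciPy sketch (an illustration, not the Distances procedure itself):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [8.0, 7.0, 6.0],
              [9.0, 8.0, 7.0]])  # rows = cases, columns = variables

between_cases = squareform(pdist(X, "euclidean"))    # distances between cases
between_vars  = squareform(pdist(X.T, "euclidean"))  # transpose to compare variables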

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values for the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.

• Dispersion. This index has a range of -1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured):

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between -1 and 1.

K-nbd. Kth nearest neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. Uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.
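
Several of these linkages have counterparts in scipy.cluster.hierarchy; the mapping below is rough, and Flexibeta, K-nbd, and Uniform have no direct SciPy equivalent:

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(3).normal(size=(20, 2))

# single/complete/average/weighted/centroid/median/ward are rough counterparts
# of the Single, Complete, Average, Weighted, Centroid, Median, and Ward
# linkages above. Note: centroid and median linkage can produce non-monotonic
# trees (inversions), the problem discussed in the next paragraph.
for method in ["single", "complete", "average", "weighted",
               "centroid", "median", "ward"]:
    Z = linkage(X, method=method)
    print(method, Z[-1, 2])  # distance at the final amalgamation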

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible." Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms. Consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.
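
Both cutting rules have direct counterparts in SciPy's fcluster, sketched below as an analogue rather than SYSTAT's implementation:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(4).normal(size=(30, 2))
Z = linkage(X, method="ward")

by_height = fcluster(Z, t=5.0, criterion="distance")  # cut at a specified height
by_count  = fcluster(Z, t=4, criterion="maxclust")    # cut to a number of clusters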

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate the partition quality. In particular, they are used to find the appropriate number of clusters for the given data set:

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
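
Two of these indices have widely used equivalents in scikit-learn: the pseudo F-ratio is essentially the Calinski-Harabasz index, and DB is the Davies-Bouldin index. A sketch (not SYSTAT's code) that scans candidate cluster counts:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

X = np.random.default_rng(5).normal(size=(60, 3))
Z = linkage(X, method="ward")

for k in range(2, 8):  # candidate numbers of clusters
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k,
          calinski_harabasz_score(X, labels),  # pseudo F (Calinski-Harabasz)
          davies_bouldin_score(X, labels))     # Davies-Bouldin; smaller is better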

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation. This is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting one of the clusters into two (and reassigning cases) continues until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides K-Means and K-Medians clustering options.

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to any of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.
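
The default ("None") seeding just described, splitting by the case farthest from the center and then assigning each case to the nearest seed, can be sketched in a few lines of NumPy. This is a toy illustration, not SYSTAT's algorithm:

import numpy as np

X = np.random.default_rng(6).normal(size=(50, 2))

seeds = [X.mean(axis=0)]                                    # start with one center
farthest = np.argmax(np.linalg.norm(X - seeds[0], axis=1))  # case farthest from it
seeds.append(X[farthest])                                   # seed for the second cluster

# Assign each case to its nearest seed.
d = np.linalg.norm(X[:, None, :] - np.asarray(seeds)[None, :, :], axis=2)
assignment = d.argmin(axis=1)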

• Two Step Cluster Analysis
  • To Obtain a TwoStep Cluster Analysis
  • TwoStep Cluster Analysis Options
    • Advanced Options
  • TwoStep Cluster Analysis Output
  • The Cluster Viewer
    • Transpose Clusters and Features
    • Sort Features
    • Sort Clusters
    • Cell Contents
    • Cluster Comparison View
    • Navigating the Cluster Viewer
      • Using the Toolbars
      • Control Cluster View Display
    • Filtering Records
• Hierarchical Cluster Analysis
  • Hierarchical Cluster Analysis Method
  • Hierarchical Cluster Analysis Measures for Interval Data
  • Hierarchical Cluster Analysis Measures for Count Data
  • Hierarchical Cluster Analysis Measures for Binary Data
  • Hierarchical Cluster Analysis Transform Values
  • Hierarchical Cluster Analysis Statistics
  • Hierarchical Cluster Analysis Plots
  • Hierarchical Cluster Analysis Save New Variables
  • Saving New Variables
• K-Means Cluster Analysis
  • K-Means Cluster Analysis Efficiency
  • K-Means Cluster Analysis Iterate
  • K-Means Cluster Analysis Save
  • K-Means Cluster Analysis Options
• Distances
  • To Obtain Distance Matrices
  • Distances Dissimilarity Measures
  • Distances Dissimilarity Measures for Interval Data
  • Distances Dissimilarity Measures for Count Data
  • Distances Dissimilarity Measures for Binary Data
  • Distances Similarity Measures
  • Distances Similarity Measures for Interval Data
  • Distances Similarity Measures for Binary Data
  • Distances Transform Values

Below the chart, a table lists the following size information:

• The size of the smallest cluster (both a count and percentage of the whole).

• The size of the largest cluster (both a count and percentage of the whole).

• The ratio of size of the largest cluster to the smallest cluster.

Cluster Comparison View

The Cluster Comparison view consists of a grid-style layout, with features in the rows and selected clusters in the columns. This view helps you to better understand the factors that make up the clusters; it also enables you to see differences between clusters not only as compared with the overall data, but with each other.

To select clusters for display, click on the top of the cluster column in the Clusters main panel. Use either Ctrl-click or Shift-click to select or deselect more than one cluster for comparison.

Note: You can select up to five clusters for display.

Clusters are shown in the order in which they were selected, while the order of fields is determined by the Sort Features By option. When you select Within-Cluster Importance, fields are always sorted by overall importance.

The background plots show the overall distributions of each feature:

• Categorical features are shown as dot plots, where the size of the dot indicates the most frequent/modal category for each cluster (by feature).

• Continuous features are displayed as boxplots, which show overall medians and the interquartile ranges.

Overlaid on these background views are boxplots for selected clusters:

• For continuous features, square point markers and horizontal lines indicate the median and interquartile range for each cluster.

• Each cluster is represented by a different color, shown at the top of the view.

Navigating the Cluster Viewer

The Cluster Viewer is an interactive display. You can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model, but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.
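
The generated filter variable is simply a 0/1 indicator of membership in the selected clusters. An equivalent pandas sketch, with hypothetical column names:

import pandas as pd

df = pd.DataFrame({"cluster": [1, 2, 3, 1, 2, 3, 1]})  # hypothetical memberships
selected = {1, 3}                                      # clusters chosen in the viewer

df["filter_1"] = df["cluster"].isin(selected).astype(int)  # 1 = selected, 0 = excluded
subset = df[df["filter_1"] == 1]                           # cases for further analysis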

Hierarchical Cluster Analysis This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left You can analyze raw variables or you can choose from a variety of standardizing transformations Distance or similarity measures are generated by the Proximities procedure Statistics are displayed at each stage to help you select the best solution

Example Are there identifiable groups of television shows that attract similar audiences within each group With hierarchical cluster analysis you could cluster television shows (cases) into homogeneous groups based on viewer characteristics This can be used to identify segments for marketing Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies

Statistics Agglomeration schedule distance (or similarity) matrix and cluster membership for a single solution or a range of solutions Plots dendrograms and icicle plots

Hierarchical Cluster Analysis MethodCluster Method Available alternatives are between-groups linkage within-groups linkage nearest neighbor furthest neighbor centroid clustering median clustering and Wards method

Measure Allows you to specify the distance or similarity measure to be used in clustering Select the type of data and the appropriate distance or similarity measure

bull Interval Available alternatives are Euclidean distance squared Euclidean distance cosine Pearson correlation Chebychev block Minkowski and customized

bull Counts Available alternatives are chi-square measure and phi-square measure

bull Binary Available alternatives are Euclidean distance squared Euclidean distance size difference pattern difference variance dispersion shape simple matching phi 4-point correlation lambda Anderbergs D dice Hamann Jaccard Kulczynski 1 Kulczynski 2 Lance and Williams Ochiai Rogers and Tanimoto Russel and Rao Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Sokal and Sneath 4 Sokal and Sneath 5 Yules Y and Yules Q

Transform Values Allows you to standardize data values for either cases or values before computing proximities (not available for binary data) Available standardization methods are z scores range minus1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 and standard deviation of 1

Transform Measures Allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available alternatives are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box click Method

Hierarchical Cluster Analysis Measures for Interval DataThe following dissimilarity measures are available for interval data

bull Euclidean distance The square root of the sum of the squared differences between values for the items This is the default for interval data

bull Squared Euclidean distance The sum of the squared differences between the values for the items

bull Pearson correlation The product-moment correlation between two vectors of values

bull Cosine The cosine of the angle between two vectors of values

bull Chebychev The maximum absolute difference between the values for the items

bull Block The sum of the absolute differences between the values of the item Also known as Manhattan distance

bull Minkowski The pth root of the sum of the absolute differences to the pth power between the values for the items

bull Customized The rth root of the sum of the absolute differences to the pth power between the values for the items

Hierarchical Cluster Analysis Measures for Count DataThe following dissimilarity measures are available for count data

bull Chi-square measure This measure is based on the chi-square test of equality for two sets of frequencies This is the default for count data

bull Phi-square measure This measure is equal to the chi-square measure normalized by the square root of the combined frequency

Hierarchical Cluster Analysis Measures for Binary DataThe following dissimilarity measures are available for binary data

bull Euclidean distance Computed from a fourfold table as SQRT(b+c) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other

bull Squared Euclidean distance Computed as the number of discordant cases Its minimum value is 0 and it has no upper limit

bull Size difference An index of asymmetry It ranges from 0 to 1

bull Pattern difference Dissimilarity measure for binary data that ranges from 0 to 1 Computed from a fourfold table as bc(n2) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations

bull Variance Computed from a fourfold table as (b+c)4n where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations It ranges from 0 to 1

bull Dispersion This similarity index has a range of minus1 to 1

bull Shape This distance measure has a range of 0 to 1 and it penalizes asymmetry of mismatches

bull Simple matching This is the ratio of matches to the total number of values Equal weight is given to matches and nonmatches

bull Phi 4-point correlation This index is a binary analog of the Pearson correlation coefficient It has a range of minus1 to 1

bull Lambda This index is Goodman and Kruskals lambda Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Anderbergs D Similar to lambda this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Dice This is an index in which joint absences are excluded from consideration and matches are weighted double Also known as the Czekanowski or Sorensen measure

bull Hamann This index is the number of matches minus the number of nonmatches divided by the total number of items It ranges from minus1 to 1

bull Jaccard This is an index in which joint absences are excluded from consideration Equal weight is given to matches and nonmatches Also known as the similarity ratio

bull Kulczynski 1 This is the ratio of joint presences to all nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however the software assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Kulczynski 2 This index is based on the conditional probability that the characteristic is present in one item given that it is present in the other The separate values for each item acting as a predictor of the other are averaged to compute this value

bull Lance and Williams Computed from a fourfold table as (b+c)(2a+b+c) where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other This measure has a range of 0 to 1 (Also known as the Bray-Curtis nonmetric coefficient)

bull Ochiai This index is the binary form of the cosine similarity measure It has a range of 0 to 1

bull Rogers and Tanimoto This is an index in which double weight is given to nonmatches

bull Russel and Rao This is a binary version of the inner (dot) product Equal weight is given to matches and nonmatches This is the default for binary similarity data

bull Sokal and Sneath 1 This is an index in which double weight is given to matches

bull Sokal and Sneath 2 This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration

bull Sokal and Sneath 3 This is the ratio of matches to nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however the software assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Sokal and Sneath 4 This index is based on the conditional probability that the characteristic in one item matches the value in the other The separate values for each item acting as a predictor of the other are averaged to compute this value

bull Sokal and Sneath 5 This index is the squared geometric mean of conditional probabilities of positive and negative matches It is independent of item coding It has a range of 0 to 1

bull Yules Y This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals It has a range of minus1 to 1 Also known as the coefficient of colligation

bull Yules Q This index is a special case of Goodman and Kruskals gamma It is a function of the cross-ratio and is independent of the marginal totals It has a range of minus1 to 1

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Hierarchical Cluster Analysis Transform ValuesThe following alternatives are available for transforming values

bull Z scores Values are standardized to z scores with a mean of 0 and a standard deviation of 1

bull Range minus1 to 1 Each value for the item being standardized is divided by the range of the values

bull Range 0 to 1 The procedure subtracts the minimum value from each item being standardized and then divides by the range

bull Maximum magnitude of 1 The procedure divides each value for the item being standardized by the maximum of the values

bull Mean of 1 The procedure divides each value for the item being standardized by the mean of the values

bull Standard deviation of 1 The procedure divides each value for the variable or case being standardized by the standard deviation of the values

Additionally you can choose how standardization is done Alternatives are By variable or By case

Hierarchical Cluster Analysis StatisticsAgglomeration schedule Displays the cases or clusters combined at each stage the distances between the cases or clusters being combined and the last cluster level at which a case (or variable) joined the cluster

Proximity matrix Gives the distances or similarities between items

Cluster Membership Displays the cluster to which each case is assigned at one or more stages in the combination of clusters Available options are single solution and range of solutions

Hierarchical Cluster Analysis PlotsDendrogram Displays a dendrogram Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep

Icicle Displays an icicle plot including all clusters or a specified range of clusters Icicle plots display information about how cases are combined into clusters at each iteration of the analysis Orientation allows you to select a vertical or horizontal plot

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box click Plots

Hierarchical Cluster Analysis Save New VariablesCluster Membership Allows you to save cluster memberships for a single solution or a range of solutions Saved variables can then be used in subsequent analyses to explore other differences between groups

Saving New VariablesThis feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box click Save

K-Means Cluster Analysis This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics using an algorithm that can handle large numbers of cases However the algorithm requires you to specify the number of clusters You can specify initial cluster centers if you know this information You can select one of two methods for classifying cases either updating cluster centers iteratively or classifying only You can save cluster membership distance information and final cluster centers Optionally you can specify a variable whose values are used to label casewise output You can also request analysis of variance F statistics While these statistics are opportunistic (the procedure tries to form groups that do differ) the relative size of the statistics provides information about each variables contribution to the separation of the groups

Example What are some identifiable groups of television shows that attract similar audiences within each group With k-means cluster analysis you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics This process can be used to identify segments for marketing Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies

Statistics Complete solution initial cluster centers ANOVA table Each case cluster information distance from cluster center

K-Means Cluster Analysis EfficiencyThe k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases as do many clustering algorithms including the algorithm that is used by the hierarchical clustering command

For maximum efficiency take a sample of cases and select the Iterate and classify method to determine cluster centers Select Write final as Then restore the entire data file and select Classify only as the method and select Read initial from to classify the entire file using the centers that are estimated from the sample You can write to and read from a file or a dataset Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session Dataset names must conform to variable-naming rules See the topic Variable names for more information

K-Means Cluster Analysis Iterate

Note These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box

Maximum Iterations Limits the number of iterations in the k-means algorithm Iteration stops after this many iterations even if the convergence criterion is not satisfied This number must be between 1 and 999

To reproduce the algorithm used by the Quick Cluster command prior to version 50 set Maximum Iterations to 1

Convergence Criterion Determines when iteration ceases It represents a proportion of the minimum distance between initial cluster centers so it must be greater than 0 but not greater than 1 If the criterion equals 002 for example iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2 of the smallest distance between any initial cluster centers

Use running means Allows you to request that cluster centers be updated after each case is assigned If you do not select this option new cluster centers are calculated after all cases have been assigned

K-Means Cluster Analysis SaveYou can save information about the solution as new variables to be used in subsequent analyses

Cluster membership Creates a new variable indicating the final cluster membership of each case Values of the new variable range from 1 to the number of clusters

Distance from cluster center Creates a new variable indicating the Euclidean distance between each case and its classification center

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)
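For orientation, the interval measures listed above have direct counterparts in scipy.spatial.distance (a sketch under that assumption; the sample data are made up):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0], [2.0, 4.0], [5.0, 1.0]])
euclid    = squareform(pdist(X, metric='euclidean'))
sq_euclid = squareform(pdist(X, metric='sqeuclidean'))
chebychev = squareform(pdist(X, metric='chebyshev'))
block     = squareform(pdist(X, metric='cityblock'))       # Manhattan distance
minkowski = squareform(pdist(X, metric='minkowski', p=3))  # pth power, pth root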

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.
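The order of operations matters: values are standardized first, the distance measure is computed on the standardized values, and the measure transformation is applied last. A minimal illustration in Python (assumed equivalents, not the procedure itself):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import zscore

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

Xz = zscore(X, axis=0)                    # Transform Values: z scores, by variable
d = pdist(Xz, metric='euclidean')         # distance on the standardized values
d_rescaled = (d - d.min()) / (d.max() - d.min())  # Transform Measures: rescale to 0-1
d_signed = -d                             # Transform Measures: change sign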

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
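Neither measure is built into scipy under these names, so the sketch below spells out the usual definitions (an assumption based on the descriptions above: the chi-square measure is the square root of the chi-square statistic for the 2 × k table formed by the two frequency profiles, and phi-square divides that by the square root of the combined frequency):

import numpy as np

def chi_square_measure(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.sum() + y.sum()
    col = x + y                              # combined column totals
    e_x = x.sum() * col / n                  # expected counts if the two
    e_y = y.sum() * col / n                  # frequency profiles were equal
    chi2 = ((x - e_x) ** 2 / e_x).sum() + ((y - e_y) ** 2 / e_y).sum()
    return np.sqrt(chi2)

def phi_square_measure(x, y):
    n = np.asarray(x, float).sum() + np.asarray(y, float).sum()
    return chi_square_measure(x, y) / np.sqrt(n)

print(chi_square_measure([10, 20, 30], [15, 15, 30]))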

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
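All of these measures are functions of the fourfold (2 × 2) table for a pair of items. A short sketch of how the counts and a few of the formulas above fit together (illustrative only; u and v are made-up 0/1 vectors):

import numpy as np

def fourfold(u, v):
    u, v = np.asarray(u, bool), np.asarray(v, bool)
    a = int(np.sum(u & v))      # present on both items
    b = int(np.sum(u & ~v))     # present on the first item only
    c = int(np.sum(~u & v))     # present on the second item only
    d = int(np.sum(~u & ~v))    # absent on both items
    return a, b, c, d

a, b, c, d = fourfold([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 1, 0])
n = a + b + c + d
euclid         = np.sqrt(b + c)              # binary Euclidean distance
sq_euclid      = b + c                       # number of discordant cases
pattern_diff   = b * c / n ** 2              # bc/n^2
variance       = (b + c) / (4 * n)           # (b+c)/4n
lance_williams = (b + c) / (2 * a + b + c)   # Bray-Curtis nonmetric coefficient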

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
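Several of these similarity indices are again simple functions of the fourfold-table counts a, b, c, and d (see the earlier sketch). The formulas below are the standard ones and are offered as an assumption matching the verbal descriptions above:

import numpy as np

def binary_similarities(a, b, c, d):
    n = a + b + c + d
    return {
        'Russell and Rao': a / n,                        # binary dot product
        'Simple matching': (a + d) / n,                  # matches / total
        'Jaccard':         a / (a + b + c),              # joint absences excluded
        'Dice':            2 * a / (2 * a + b + c),      # matches weighted double
        'Hamann':          ((a + d) - (b + c)) / n,      # matches minus nonmatches
        "Yule's Q":        (a * d - b * c) / (a * d + b * c),
        'Ochiai':          a / np.sqrt((a + b) * (a + c)),  # binary cosine
    }

print(binary_similarities(3, 1, 1, 1))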

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
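A compact sketch of the six standardizations, with the By variable / By case choice expressed as the axis of the computation (a hypothetical helper that simply follows the verbal definitions above):

import numpy as np

def transform_values(X, method='z', by='variable'):
    X = np.asarray(X, float)
    ax = 0 if by == 'variable' else 1        # By variable: columns; By case: rows
    if method == 'z':
        return (X - X.mean(axis=ax, keepdims=True)) / X.std(axis=ax, keepdims=True)
    if method == 'range -1 to 1':
        return X / np.ptp(X, axis=ax, keepdims=True)
    if method == 'range 0 to 1':
        return (X - X.min(axis=ax, keepdims=True)) / np.ptp(X, axis=ax, keepdims=True)
    if method == 'max magnitude 1':
        return X / np.abs(X).max(axis=ax, keepdims=True)
    if method == 'mean 1':
        return X / X.mean(axis=ax, keepdims=True)
    if method == 'sd 1':
        return X / X.std(axis=ax, keepdims=True)
    raise ValueError(method)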

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured):

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between −1 and 1.

K-nbd. The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.
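Most of these joining methods have direct counterparts in scipy.cluster.hierarchy.linkage (a sketch under that assumption; flexible beta and the density methods K-nbd and Uniform have no one-line scipy equivalent):

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

Z_average  = linkage(X, method='average')
Z_centroid = linkage(X, method='centroid')   # requires raw observations
Z_complete = linkage(X, method='complete')
Z_median   = linkage(X, method='median')     # requires raw observations
Z_single   = linkage(X, method='single')
Z_ward     = linkage(X, method='ward')
Z_weighted = linkage(X, method='weighted')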

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible." Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.
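In scipy terms (an assumed parallel, continuing the earlier linkage sketch), the two cutting options correspond to the two criteria of fcluster:

from scipy.cluster.hierarchy import fcluster
# Z_ward is a linkage matrix from the earlier sketch
by_height = fcluster(Z_ward, t=5.0, criterion='distance')  # cut at a given height
by_leaves = fcluster(Z_ward, t=4, criterion='maxclust')    # cut into at most 4 groups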

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity

Provides five validity indices to evaluate the partition quality. In particular, these indices are used to find an appropriate number of clusters for the given data set.

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
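Two of these indices have well-known open-source counterparts: the pseudo F-ratio is the Calinski-Harabasz statistic, and scikit-learn also implements the Davies-Bouldin index. A sketch of scanning candidate cluster counts (assumed equivalents; the data are synthetic):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
Z = linkage(X, method='ward')

for k in range(2, 6):
    labels = fcluster(Z, t=k, criterion='maxclust')
    print(k,
          round(calinski_harabasz_score(X, labels), 1),   # pseudo F
          round(davies_bouldin_score(X, labels), 3))      # DB: smaller is better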

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation. This is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting one of the clusters into two (and reassigning cases) continues until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides K-Means and K-Medians clustering options.

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, to save cluster identifiers along with data, or to save final cluster seeds, to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to any of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.
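For comparison, scikit-learn's KMeans exposes a few analogous seeding choices; the mapping below is an illustrative assumption, not SYSTAT's implementation:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
k = 3

first_k   = KMeans(n_clusters=k, init=X[:k], n_init=1).fit(X)   # like "First k"
random_k  = KMeans(n_clusters=k, init='random', n_init=1,
                   random_state=0).fit(X)                       # like "Random k"
plus_plus = KMeans(n_clusters=k, init='k-means++', n_init=1,
                   random_state=0).fit(X)                       # well-spread seeds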

• TwoStep Cluster Analysis
  • To Obtain a TwoStep Cluster Analysis
  • TwoStep Cluster Analysis Options
    • Advanced Options
  • TwoStep Cluster Analysis Output
  • The Cluster Viewer
    • Transpose Clusters and Features
    • Sort Features
    • Sort Clusters
    • Cell Contents
    • Cluster Comparison View
    • Navigating the Cluster Viewer
      • Using the Toolbars
      • Control Cluster View Display
    • Filtering Records
• Hierarchical Cluster Analysis
  • Hierarchical Cluster Analysis Method
  • Hierarchical Cluster Analysis Measures for Interval Data
  • Hierarchical Cluster Analysis Measures for Count Data
  • Hierarchical Cluster Analysis Measures for Binary Data
  • Hierarchical Cluster Analysis Transform Values
  • Hierarchical Cluster Analysis Statistics
  • Hierarchical Cluster Analysis Plots
  • Hierarchical Cluster Analysis Save New Variables
    • Saving New Variables
• K-Means Cluster Analysis
  • K-Means Cluster Analysis Efficiency
  • K-Means Cluster Analysis Iterate
  • K-Means Cluster Analysis Save
  • K-Means Cluster Analysis Options
• Distances
  • To Obtain Distance Matrices
  • Distances Dissimilarity Measures
    • Distances Dissimilarity Measures for Interval Data
    • Distances Dissimilarity Measures for Count Data
    • Distances Dissimilarity Measures for Binary Data
  • Distances Similarity Measures
    • Distances Similarity Measures for Interval Data
    • Distances Similarity Measures for Binary Data
  • Distances Transform Values

In the Cluster Viewer, you can:

• Select a field or cluster to view more details.

• Compare clusters to select items of interest.

• Alter the display.

• Transpose axes.

Using the Toolbars

You control the information shown in both the left and right panels by using the toolbar options. You can change the orientation of the display (top-down, left-to-right, or right-to-left) using the toolbar controls. In addition, you can reset the viewer to the default settings and open a dialog box to specify the contents of the Clusters view in the main panel.

Control Cluster View Display

To control what is shown in the Clusters view on the main panel, click the Display button; the Display dialog opens.

Features. Selected by default. To hide all input features, deselect the check box.

Evaluation Fields. Choose the evaluation fields (fields not used to create the cluster model but sent to the model viewer to evaluate the clusters) to display; none are shown by default. Note: This check box is unavailable if no evaluation fields are available.

Cluster Descriptions. Selected by default. To hide all cluster description cells, deselect the check box.

Cluster Sizes. Selected by default. To hide all cluster size cells, deselect the check box.

Maximum Number of Categories. Specify the maximum number of categories to display in charts of categorical features; the default is 20.

Filtering Records

If you want to know more about the cases in a particular cluster or group of clusters, you can select a subset of records for further analysis based on the selected clusters.

Select the clusters in the Cluster view of the Cluster Viewer. To select multiple clusters, use Ctrl-click.

From the menus choose:

Generate > Filter Records

Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.
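The generated filter behaves like a simple 0/1 indicator. In pandas terms (a hypothetical illustration; df and the column names are made up):

import pandas as pd

df = pd.DataFrame({'case': range(1, 7),
                   'cluster': [1, 3, 2, 3, 1, 3]})   # saved cluster membership
selected = [3]                                       # clusters picked in the viewer

df['filter_1'] = df['cluster'].isin(selected).astype(int)  # 1 = selected, 0 = excluded
subset = df[df['filter_1'] == 1]                     # records used in later analyses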

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russell and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or variables before computing proximities (not available for binary data). Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of −1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.
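A dendrogram of the kind described here can be reproduced with scipy and matplotlib (an assumed stand-in for the SPSS output; the data are synthetic):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))

Z = linkage(X, method='average')
dendrogram(Z, orientation='right')   # horizontal orientation; 'top' gives vertical
plt.xlabel('distance')
plt.tight_layout()
plt.show()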

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics using an algorithm that can handle large numbers of cases However the algorithm requires you to specify the number of clusters You can specify initial cluster centers if you know this information You can select one of two methods for classifying cases either updating cluster centers iteratively or classifying only You can save cluster membership distance information and final cluster centers Optionally you can specify a variable whose values are used to label casewise output You can also request analysis of variance F statistics While these statistics are opportunistic (the procedure tries to form groups that do differ) the relative size of the statistics provides information about each variables contribution to the separation of the groups

Example What are some identifiable groups of television shows that attract similar audiences within each group With k-means cluster analysis you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics This process can be used to identify segments for marketing Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies

Statistics Complete solution initial cluster centers ANOVA table Each case cluster information distance from cluster center

K-Means Cluster Analysis EfficiencyThe k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases as do many clustering algorithms including the algorithm that is used by the hierarchical clustering command

For maximum efficiency take a sample of cases and select the Iterate and classify method to determine cluster centers Select Write final as Then restore the entire data file and select Classify only as the method and select Read initial from to classify the entire file using the centers that are estimated from the sample You can write to and read from a file or a dataset Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session Dataset names must conform to variable-naming rules See the topic Variable names for more information

K-Means Cluster Analysis Iterate

Note These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box

Maximum Iterations Limits the number of iterations in the k-means algorithm Iteration stops after this many iterations even if the convergence criterion is not satisfied This number must be between 1 and 999

To reproduce the algorithm used by the Quick Cluster command prior to version 50 set Maximum Iterations to 1

Convergence Criterion Determines when iteration ceases It represents a proportion of the minimum distance between initial cluster centers so it must be greater than 0 but not greater than 1 If the criterion equals 002 for example iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2 of the smallest distance between any initial cluster centers

Use running means Allows you to request that cluster centers be updated after each case is assigned If you do not select this option new cluster centers are calculated after all cases have been assigned

K-Means Cluster Analysis SaveYou can save information about the solution as new variables to be used in subsequent analyses

Cluster membership Creates a new variable indicating the final cluster membership of each case Values of the new variable range from 1 to the number of clusters

Distance from cluster center Creates a new variable indicating the Euclidean distance between each case and its classification center

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt K-Means Cluster

In the K-Means Cluster dialog box click Save

K-Means Cluster Analysis OptionsStatistics You can select the following statistics initial cluster centers ANOVA table and cluster information for each case

bull Initial cluster centers First estimate of the variable means for each of the clusters By default a number of well-spaced cases equal to the number of clusters is selected from the data Initial cluster centers are used for a first round of classification and are then updated

bull ANOVA table Displays an analysis-of-variance table which includes univariate F tests for each clustering variable The F tests are only descriptive and the resulting probabilities should not be interpreted The ANOVA table is not displayed if all cases are assigned to a single cluster

bull Cluster information for each case Displays for each case the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case Also displays Euclidean distance between final cluster centers

Missing Values Available options are Exclude cases listwise or Exclude cases pairwise

bull Exclude cases listwise Excludes cases with missing values for any clustering variable from the analysis

bull Exclude cases pairwise Assigns cases to clusters based on distances that are computed from all variables with nonmissing values

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt K-Means Cluster

In the K-Means Cluster dialog box click Options

DistancesThis procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances) either between pairs of variables or between pairs of cases These similarity or distance measures can then be used with other procedures such as factor analysis cluster analysis or multidimensional scaling to help analyze complex datasets

Example Is it possible to measure similarities between pairs of automobiles based on certain characteristics such as engine size MPG and horsepower By computing similarities between

autos you can gain a sense of which autos are similar to each other and which are different from each other For a more formal analysis you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure

Statistics Dissimilarity (distance) measures for interval data are Euclidean distance squared Euclidean distance Chebychev block Minkowski or customized for count data chi-square or phi-square for binary data Euclidean distance squared Euclidean distance size difference pattern difference variance shape or Lance and Williams Similarity measures for interval data are Pearson correlation or cosine for binary data Russel and Rao simple matching Jaccard dice Rogers and Tanimoto Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Kulczynski 1 Kulczynski 2 Sokal and Sneath 4 Hamann Lambda Anderbergs D Yules Y Yules Q Ochiai Sokal and Sneath 5 phi 4-point correlation or dispersion

To Obtain Distance Matrices

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

Select at least one numeric variable to compute distances between cases or select at least two numeric variables to compute distances between variables

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables

Distances Dissimilarity MeasuresFrom the Measure group select the alternative that corresponds to your type of data (interval count or binary) then from the drop-down list select one of the measures that corresponds to that type of data Available measures by data type are

bull Interval data Euclidean distance squared Euclidean distance Chebychev block Minkowski or customized

bull Count data Chi-square measure or phi-square measure

bull Binary data Euclidean distance squared Euclidean distance size difference pattern difference variance shape or Lance and Williams (Enter values for Present and Absent to specify which two values are meaningful Distances will ignore all other values)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities These transformations are not applicable to binary data Available

standardization methods are z scores range ndash1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 or standard deviation of 1

The Transform Measures group allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available options are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

With Dissimilarities selected click Measures

From the Measure group select the alternative that corresponds to your type of data

From the drop-down list select a measure that corresponds to that type of measure

Distances Dissimilarity Measures for Interval DataThe following dissimilarity measures are available for interval data

bull Euclidean distance The square root of the sum of the squared differences between values for the items This is the default for interval data

bull Squared Euclidean distance The sum of the squared differences between the values for the items

bull Chebychev The maximum absolute difference between the values for the items

bull Block The sum of the absolute differences between the values of the item Also known as Manhattan distance

bull Minkowski The pth root of the sum of the absolute differences to the pth power between the values for the items

bull Customized The rth root of the sum of the absolute differences to the pth power between the values for the items

Distances Dissimilarity Measures for Count DataThe following dissimilarity measures are available for count data

bull Chi-square measure This measure is based on the chi-square test of equality for two sets of frequencies This is the default for count data

bull Phi-square measure This measure is equal to the chi-square measure normalized by the square root of the combined frequency

Distances Dissimilarity Measures for Binary DataThe following dissimilarity measures are available for binary data

bull Euclidean distance Computed from a fourfold table as SQRT(b+c) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other

bull Squared Euclidean distance Computed as the number of discordant cases Its minimum value is 0 and it has no upper limit

bull Size difference An index of asymmetry It ranges from 0 to 1

bull Pattern difference Dissimilarity measure for binary data that ranges from 0 to 1 Computed from a fourfold table as bc(n2) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations

bull Variance Computed from a fourfold table as (b+c)4n where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations It ranges from 0 to 1

bull Shape This distance measure has a range of 0 to 1 and it penalizes asymmetry of mismatches

bull Lance and Williams Computed from a fourfold table as (b+c)(2a+b+c) where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other This measure has a range of 0 to 1 (Also known as the Bray-Curtis nonmetric coefficient)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Distances Similarity MeasuresFrom the Measure group select the alternative that corresponds to your type of data (interval or binary) then from the drop-down list select one of the measures that corresponds to that type of data Available measures by data type are

bull Interval data Pearson correlation or cosine

bull Binary data Russell and Rao simple matching Jaccard Dice Rogers and Tanimoto Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Kulczynski 1 Kulczynski 2 Sokal and Sneath 4 Hamann Lambda Anderbergs D Yules Y Yules Q Ochiai Sokal and Sneath 5 phi 4-point correlation or dispersion (Enter values for Present and Absent to specify which two values are meaningful Distances will ignore all other values)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities These transformations are not applicable to binary data Available standardization methods are z scores range ndash1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 and standard deviation of 1

The Transform Measures group allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available options are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances




Enter a filter variable name. Records from the selected clusters will receive a value of 1 for this field. All other records will receive a value of 0 and will be excluded from subsequent analyses until you change the filter status.

Click OK.

Hierarchical Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. You can analyze raw variables, or you can choose from a variety of standardizing transformations. Distance or similarity measures are generated by the Proximities procedure. Statistics are displayed at each stage to help you select the best solution.

Example. Are there identifiable groups of television shows that attract similar audiences within each group? With hierarchical cluster analysis, you could cluster television shows (cases) into homogeneous groups based on viewer characteristics. This can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Agglomeration schedule, distance (or similarity) matrix, and cluster membership for a single solution or a range of solutions. Plots: dendrograms and icicle plots.

Hierarchical Cluster Analysis Method

Cluster Method. Available alternatives are between-groups linkage, within-groups linkage, nearest neighbor, furthest neighbor, centroid clustering, median clustering, and Ward's method.

Measure. Allows you to specify the distance or similarity measure to be used in clustering. Select the type of data and the appropriate distance or similarity measure:

• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.

• Counts. Available alternatives are chi-square measure and phi-square measure.

• Binary. Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, Dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russell and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.

Transform Values. Allows you to standardize data values for either cases or variables before computing proximities (not available for binary data). Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

Transform Measures. Allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available alternatives are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Method.

Hierarchical Cluster Analysis Measures for Interval Data

The following dissimilarity measures are available for interval data (a short code sketch follows the list):

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
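These formulas translate directly into array arithmetic. Below is a minimal NumPy sketch of the measures above; the vectors x and y and the exponents p and r are illustrative stand-ins, not values taken from the procedure:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
y = np.array([2.0, 1.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))      # default for interval data
sq_euclidean = np.sum((x - y) ** 2)
pearson = np.corrcoef(x, y)[0, 1]              # product-moment correlation
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
chebychev = np.max(np.abs(x - y))
block = np.sum(np.abs(x - y))                  # Manhattan distance
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)
r = 2                                          # customized: rth root of pth powers
customized = np.sum(np.abs(x - y) ** p) ** (1 / r)
```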

Hierarchical Cluster Analysis Measures for Count Data

The following dissimilarity measures are available for count data (see the sketch after the list):

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
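As a hedged illustration of how these two measures relate, the sketch below forms the 2 x k table whose rows are the two frequency vectors, computes the chi-square statistic against the expected counts, and takes square roots. The function names are ours, not the procedure's:

```python
import numpy as np

def chi_square_measure(x, y):
    table = np.vstack([x, y]).astype(float)        # 2 x k table of frequencies
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return np.sqrt(((table - expected) ** 2 / expected).sum())

def phi_square_measure(x, y):
    n = np.sum(x) + np.sum(y)                      # combined frequency
    return chi_square_measure(x, y) / np.sqrt(n)   # normalized chi-square

print(chi_square_measure([10, 20, 30], [15, 15, 30]))
print(phi_square_measure([10, 20, 30], [15, 15, 30]))
```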

Hierarchical Cluster Analysis Measures for Binary Data

The following dissimilarity measures are available for binary data (several of the matching coefficients are sketched in code after the list):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of -1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; in that case, the software assigns an arbitrary value of 9999.999 when the value is undefined or greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; in that case, the software assigns an arbitrary value of 9999.999 when the value is undefined or greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
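Many of the matching coefficients above are one-line functions of the fourfold table. A minimal sketch, assuming a = joint presences, b and c = the mismatched cells, d = joint absences, and nonzero denominators (the function name is illustrative):

```python
def binary_similarities(a, b, c, d):
    n = a + b + c + d
    return {
        "russell_rao": a / n,                               # binary dot product
        "simple_matching": (a + d) / n,                     # matches over all values
        "jaccard": a / (a + b + c),                         # joint absences excluded
        "dice": 2 * a / (2 * a + b + c),                    # matches weighted double
        "rogers_tanimoto": (a + d) / (a + d + 2 * (b + c)), # nonmatches weighted double
        "hamann": ((a + d) - (b + c)) / n,                  # matches minus nonmatches
        "ochiai": a / ((a + b) * (a + c)) ** 0.5,           # binary form of cosine
        "yules_q": (a * d - b * c) / (a * d + b * c),       # cross-ratio based
    }

print(binary_similarities(a=10, b=3, c=2, d=5))
```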

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values (a standardization sketch follows the list):

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
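A compact sketch of these alternatives, assuming a cases x variables matrix X; axis=0 standardizes by variable and axis=1 by case (the function and method names are illustrative, not the dialog's own):

```python
import numpy as np

def standardize(X, method="z", axis=0):
    X = np.asarray(X, dtype=float)
    keep = {"axis": axis, "keepdims": True}
    if method == "z":           # mean 0, standard deviation 1
        return (X - X.mean(**keep)) / X.std(**keep)
    if method == "range_11":    # divide by the range
        return X / np.ptp(X, **keep)
    if method == "range_01":    # subtract minimum, divide by the range
        return (X - X.min(**keep)) / np.ptp(X, **keep)
    if method == "max_1":       # divide by the maximum
        return X / X.max(**keep)
    if method == "mean_1":      # divide by the mean
        return X / X.mean(**keep)
    if method == "sd_1":        # divide by the standard deviation
        return X / X.std(**keep)
    raise ValueError(method)
```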

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.
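For readers working outside the dialogs, a comparable agglomeration schedule and dendrogram can be produced with SciPy; this is an illustrative analogue with random data, not the procedure itself:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))           # 20 cases, 4 variables (stand-in data)

Z = linkage(X, method="average", metric="euclidean")  # agglomeration schedule
dendrogram(Z, orientation="right")     # horizontal dendrogram
plt.show()
```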

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information. (A rough analogue of this workflow is sketched below.)
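A rough analogue of this sample-then-classify workflow in scikit-learn; the names, sizes, and data are illustrative, not the procedure's own mechanism:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100_000, 5))                  # stand-in "entire file"
sample = rng.choice(len(X_full), size=2_000, replace=False)

# "Iterate and classify" on the sample to estimate cluster centers.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_full[sample])

# "Classify only" on the entire file, using the centers from the sample.
labels = km.predict(X_full)
```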

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers. (A sketch of this stopping rule follows.)

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.
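A minimal sketch of this stopping rule, assuming no cluster empties during iteration; the function and its defaults mirror the description above, not the actual implementation:

```python
import numpy as np
from scipy.spatial.distance import pdist

def run_kmeans(X, centers, criterion=0.02, max_iter=20):
    # Threshold: proportion of the minimum distance between initial centers.
    threshold = criterion * pdist(centers).min()
    for _ in range(max_iter):
        # Assign each case to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        moved = np.array([X[labels == k].mean(axis=0) for k in range(len(centers))])
        shift = np.linalg.norm(moved - centers, axis=1).max()
        centers = moved
        if shift <= threshold:   # a complete pass moved no center far enough
            break
    return centers, labels
```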

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses.

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.
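Continuing the illustrative scikit-learn fit (km, X_full) from the sketch above, these two saved variables could be reconstructed as follows:

```python
import numpy as np

membership = km.predict(X_full) + 1                    # 1 .. number of clusters
assigned = km.cluster_centers_[membership - 1]
distance = np.linalg.norm(X_full - assigned, axis=1)   # Euclidean distance to center
```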

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex data sets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables. (An illustrative distance-matrix computation is sketched below.)
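As an illustrative analogue, a proximity matrix between cases or between variables can be computed with SciPy; X here is a small stand-in cases x variables array, and passing its transpose switches the computation to variables:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 0.0],
              [4.0, 4.0, 4.0]])

between_cases = squareform(pdist(X, metric="euclidean"))
between_variables = squareform(pdist(X.T, metric="cityblock"))  # block distance
print(between_cases)
print(between_variables)
```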

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data (the fourfold-table formulas are sketched in code after the list):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
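A hedged sketch of the fourfold-table formulas above, with a = joint presences, b and c = the mismatched cells, d = joint absences, and denominators assumed nonzero (the function name is illustrative):

```python
import math

def binary_dissimilarities(a, b, c, d):
    n = a + b + c + d
    return {
        "euclidean": math.sqrt(b + c),
        "squared_euclidean": b + c,                   # number of discordant cases
        "pattern_difference": (b * c) / (n ** 2),
        "variance": (b + c) / (4 * n),
        "lance_williams": (b + c) / (2 * a + b + c),  # Bray-Curtis nonmetric
    }

print(binary_dissimilarities(a=10, b=3, c=2, d=5))
```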

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; in that case, Distances assigns an arbitrary value of 9999.999 when the value is undefined or greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; in that case, Distances assigns an arbitrary value of 9999.999 when the value is undefined or greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.

• Dispersion. This index has a range of -1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices (statistical criteria) by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.

Columns. Columns (variables) of the data matrix are clustered.

Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, to define how distances between clusters are measured):

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between -1 and 1.

K-nbd. The kth-nearest-neighbor method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the data set.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.

Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers, or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file. Otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale crosses the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate the partition quality. In particular, these are used to find an appropriate number of clusters for the given data set (a sketch using two analogous indices follows the list):

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
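As an illustrative analogue of scanning candidate cluster counts against validity criteria, the sketch below uses two related indices available in scikit-learn: the Calinski-Harabasz score (a pseudo F-ratio) and the Davies-Bouldin index. The data, linkage choice, and scan range are stand-ins, not SYSTAT's own computation:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # stand-in rectangular data

# Default maximum: square root of the number of objects.
for k in range(2, int(np.sqrt(len(X))) + 1):
    labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(X)
    print(k, calinski_harabasz_score(X, labels), davies_bouldin_score(X, labels))
```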

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. This continues, splitting one of the clusters into two (and reassigning cases), until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.

K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds, to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.

Random segmentation. Assigns each case to any of k partitions randomly. Computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.


bull Binary Available alternatives are Euclidean distance squared Euclidean distance size difference pattern difference variance dispersion shape simple matching phi 4-point correlation lambda Anderbergs D dice Hamann Jaccard Kulczynski 1 Kulczynski 2 Lance and Williams Ochiai Rogers and Tanimoto Russel and Rao Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Sokal and Sneath 4 Sokal and Sneath 5 Yules Y and Yules Q

Transform Values Allows you to standardize data values for either cases or values before computing proximities (not available for binary data) Available standardization methods are z scores range minus1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 and standard deviation of 1

Transform Measures Allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available alternatives are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box click Method

Hierarchical Cluster Analysis Measures for Interval Data. The following dissimilarity measures are available for interval data (a brief code sketch follows the list):

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Pearson correlation. The product-moment correlation between two vectors of values.

• Cosine. The cosine of the angle between two vectors of values.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
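To make these formulas concrete, here is a minimal NumPy sketch (the function name and sample vectors are invented for illustration; SPSS computes these measures internally):

import numpy as np

def interval_dissimilarities(x, y, p=2, r=2):
    # Illustrative versions of the interval measures listed above.
    d = np.abs(np.asarray(x, float) - np.asarray(y, float))
    return {
        "euclidean": np.sqrt(np.sum(d ** 2)),
        "squared_euclidean": np.sum(d ** 2),
        "chebychev": np.max(d),
        "block": np.sum(d),                         # Manhattan distance
        "minkowski": np.sum(d ** p) ** (1.0 / p),   # pth root of pth powers
        "customized": np.sum(d ** p) ** (1.0 / r),  # rth root of pth powers
    }

print(interval_dissimilarities([1.0, 4.0, 2.0], [2.0, 1.0, 2.0], p=3, r=2))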

Hierarchical Cluster Analysis Measures for Count Data. The following dissimilarity measures are available for count data (a brief code sketch follows the list):

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
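As a rough illustration of these two count-data measures, a NumPy sketch (function names invented; the expected counts come from the margins of the 2 x k table formed by stacking the two frequency vectors):

import numpy as np

def chi_square_measure(x, y):
    # Square root of the chi-square statistic for the 2 x k table of x over y.
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.sum() + y.sum()
    col = x + y                                    # column (category) totals
    ex, ey = col * x.sum() / n, col * y.sum() / n  # expected counts per row
    return np.sqrt(np.sum((x - ex) ** 2 / ex) + np.sum((y - ey) ** 2 / ey))

def phi_square_measure(x, y):
    # Chi-square measure normalized by the square root of the combined frequency.
    return chi_square_measure(x, y) / np.sqrt(np.sum(x) + np.sum(y))

print(chi_square_measure([10, 20, 30], [15, 15, 35]))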

Hierarchical Cluster Analysis Measures for Binary Data. The following dissimilarity measures are available for binary data (a brief code sketch follows the list):

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Dispersion. This similarity index has a range of -1 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
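All of these indices are built from the same fourfold (2 x 2) table: a counts joint presences, b and c count cases present on one item but absent on the other, and d counts joint absences. A hedged NumPy sketch of a few of the measures above, with invented example vectors:

import numpy as np

def fourfold(u, v):
    # Counts a, b, c, d from two 0/1 vectors.
    u, v = np.asarray(u, bool), np.asarray(v, bool)
    return np.sum(u & v), np.sum(u & ~v), np.sum(~u & v), np.sum(~u & ~v)

def binary_measures(u, v):
    a, b, c, d = fourfold(u, v)
    n = a + b + c + d
    return {
        "euclidean": np.sqrt(b + c),
        "squared_euclidean": b + c,
        "pattern_difference": b * c / n ** 2,
        "variance": (b + c) / (4 * n),
        "lance_williams": (b + c) / (2 * a + b + c),
        "simple_matching": (a + d) / n,
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
    }

print(binary_measures([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))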

Hierarchical Cluster Analysis Transform Values. The following alternatives are available for transforming values (a brief code sketch follows the list):

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
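A compact sketch of the six standardizations, applied to one numeric variable at a time (names and sample values are illustrative; SPSS applies these per variable or per case as selected):

import numpy as np

def standardize(values, method):
    v = np.asarray(values, float)
    rng = v.max() - v.min()
    if method == "z":                 # mean 0, standard deviation 1
        return (v - v.mean()) / v.std(ddof=1)
    if method == "range -1 to 1":     # divide by the range of the values
        return v / rng
    if method == "range 0 to 1":      # subtract the minimum, divide by the range
        return (v - v.min()) / rng
    if method == "max magnitude 1":   # divide by the maximum of the values
        return v / v.max()
    if method == "mean 1":            # divide by the mean of the values
        return v / v.mean()
    if method == "sd 1":              # divide by the standard deviation
        return v / v.std(ddof=1)
    raise ValueError(method)

print(standardize([2.0, 4.0, 6.0, 8.0], "range 0 to 1"))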

Hierarchical Cluster Analysis Statistics. Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots. Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.
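For readers who want to reproduce something similar outside the dialogs, a minimal SciPy sketch that builds an agglomeration schedule and draws a dendrogram (random, purely illustrative data):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

data = np.random.default_rng(0).normal(size=(12, 3))  # 12 cases, 3 variables
Z = linkage(data, method="average", metric="euclidean")
print(Z[:3])              # each row: clusters joined, their distance, new size
dendrogram(Z, orientation="top")                      # vertical orientation
plt.show()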

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables. Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables. This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis. This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency. The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
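A hedged scikit-learn analogue of this sample-then-classify workflow (the data and sizes are invented; fit() plays the role of Iterate and classify on the sample, and predict() plays the role of Classify only on the full file):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
full_data = rng.normal(size=(100000, 4))

# Estimate centers on a sample ("Iterate and classify" + "Write final as").
sample = full_data[rng.choice(len(full_data), size=5000, replace=False)]
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(sample)

# Classify every case with the saved centers ("Classify only" + "Read initial
# from"); predict() assigns cases without re-estimating the centers.
membership = km.predict(full_data)
print(np.bincount(membership))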

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers (see the sketch below).

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.
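To make the stopping rule concrete, a small NumPy sketch of the criterion as stated above (function and variable names are invented for illustration):

import numpy as np

def centers_converged(old_centers, new_centers, initial_centers, criterion=0.02):
    # True when no center moved farther than criterion times the smallest
    # distance between any pair of initial cluster centers.
    init = np.asarray(initial_centers, float)
    pair = np.sqrt(((init[:, None] - init[None, :]) ** 2).sum(-1))
    min_init = pair[pair > 0].min()
    moved = np.sqrt(((np.asarray(new_centers) - np.asarray(old_centers)) ** 2).sum(-1))
    return bool(np.all(moved <= criterion * min_init))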

K-Means Cluster Analysis Save. You can save information about the solution as new variables to be used in subsequent analyses.

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options. Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances. This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russel and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.
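A hedged SciPy version of the automobile example (made-up numbers): standardize the variables first, then compute a case-by-case Euclidean distance matrix.

import numpy as np
from scipy.spatial.distance import pdist, squareform

autos = np.array([[3.5, 24.0, 180.0],   # engine size, MPG, horsepower
                  [2.0, 33.0, 140.0],
                  [5.7, 15.0, 300.0]])

z = (autos - autos.mean(axis=0)) / autos.std(axis=0, ddof=1)  # z scores
print(squareform(pdist(z, metric="euclidean")))               # 3 x 3 matrix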

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.

Distances Dissimilarity Measures. From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data. The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Distances Dissimilarity Measures for Count Data. The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data. The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Similarity Measures. From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data. The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data. The following similarity measures are available for binary data:

• Russel and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.

• Dispersion. This index has a range of -1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values. The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five validity indices, that is, statistical criteria by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows: Rows (cases) of the data matrix are clustered.

Columns: Columns (variables) of the data matrix are clustered.

Matrix: Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured):

Average: Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid: Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete: Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta: Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; its range is between -1 and 1.

K-nbd: The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median: Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single: Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform: The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward: Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted: Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance: Specifies the distance metric used to compare clusters.

Polar: Produces a polar (circular) cluster tree.

Save: Provides two options: to save cluster identifiers, or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix: Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file; otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.
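For intuition, a NumPy sketch of the Mahalanobis distance between two observations, with the covariance matrix estimated from the data as in the default (names and data are invented):

import numpy as np

def mahalanobis(x, y, cov):
    # sqrt of (x - y)' S^-1 (x - y) for covariance matrix S.
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

data = np.random.default_rng(2).normal(size=(50, 3))
cov = np.cov(data, rowvar=False)      # computed from the data, as by default
print(mahalanobis(data[0], data[1], cov))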

Cut cluster tree at: You can choose the following options for cutting the cluster tree.

Height: Provides the option of cutting the cluster tree at a specified distance.

Leaf nodes: Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by: The colors in the cluster tree can be assigned by two different methods.

Length of terminal node: As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes: Colors are assigned based on the proportion of members in a cluster.

Validity: Provides five validity indices to evaluate the partition quality. In particular, these are used to find the appropriate number of clusters for the given data set (a short sketch of two of these indices follows the list).

RMSSTD: Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F: Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square: Provides the pseudo T-square statistic for cluster assessment.

DB: Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn: Provides Dunn's cluster separation measure.

Maximum groups: Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
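As an illustration of how such indices are scanned across candidate cluster counts, a sketch using scikit-learn's Davies-Bouldin score and a simple Dunn index (this is not SYSTAT's implementation; data and names are invented):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score

def dunn_index(data, labels):
    # Smallest between-cluster distance divided by the largest cluster diameter.
    clusters = [data[labels == k] for k in np.unique(labels)]
    sep = min(cdist(a, b).min() for i, a in enumerate(clusters)
              for b in clusters[i + 1:])
    diam = max(cdist(c, c).max() for c in clusters)
    return sep / diam

data = np.random.default_rng(3).normal(size=(60, 2))
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(data)
    print(k, davies_bouldin_score(data, labels), dunn_index(data, labels))

Lower Davies-Bouldin values and higher Dunn values suggest a better partition.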

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. This is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting one of the clusters into two (and reassigning cases) continues until a specified number of clusters is formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options (a sketch of the default splitting idea follows).
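A simplified NumPy sketch of that default seeding idea, picking each new seed as the case farthest from its nearest existing center and then assigning cases to the nearest center (the actual SYSTAT algorithm also keeps reassigning until the within-groups sum of squares stops decreasing; names here are invented):

import numpy as np

def splitting_seeds(data, k):
    data = np.asarray(data, float)
    centers = [data.mean(axis=0)]                  # start with one cluster
    while len(centers) < k:
        d = np.linalg.norm(data[:, None] - np.array(centers), axis=2).min(axis=1)
        centers.append(data[np.argmax(d)])         # farthest case becomes a seed
    labels = np.linalg.norm(data[:, None] - np.array(centers), axis=2).argmin(axis=1)
    return np.array(centers), labels

data = np.random.default_rng(4).normal(size=(30, 2))
centers, labels = splitting_seeds(data, 3)
print(centers, np.bincount(labels))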

Algorithm: Provides K-Means and K-Medians clustering options.

K-means: Requests K-Means clustering.

K-medians: Requests K-Medians clustering.

Groups: Enter the number of desired clusters. The default number of groups is two.

Iterations: Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance: Specifies the distance metric used to compare clusters.

Save: Provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds, to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None: Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k: Considers the first k non-missing cases as initial seeds.

Last k: Considers the last k non-missing cases as initial seeds.

Random k: Chooses k non-missing cases at random (without replacement) as initial seeds.

Random segmentation: Assigns each case to one of k partitions at random, then computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.

  • Two Step Cluster Analysis
    • To Obtain a TwoStep Cluster Analysis
      • TwoStep Cluster Analysis Options
        • Advanced Options
          • TwoStep Cluster Analysis Output
          • The Cluster Viewer
          • Transpose Clusters and Features
          • Sort Features
          • Sort Clusters
          • Cell Contents
          • Cluster Comparison View
          • Navigating the Cluster Viewer
            • Using the Toolbars
            • Control Cluster View Display
              • Filtering Records
              • Hierarchical Cluster Analysis
              • Hierarchical Cluster Analysis Method
              • Hierarchical Cluster Analysis Measures for Interval Data
              • Hierarchical Cluster Analysis Measures for Count Data
              • Hierarchical Cluster Analysis Measures for Binary Data
              • Hierarchical Cluster Analysis Transform Values
              • Hierarchical Cluster Analysis Statistics
              • Hierarchical Cluster Analysis Plots
              • Hierarchical Cluster Analysis Save New Variables
              • Saving New Variables
              • K-Means Cluster Analysis
              • K-Means Cluster Analysis Efficiency
              • K-Means Cluster Analysis Iterate
              • K-Means Cluster Analysis Save
              • K-Means Cluster Analysis Options
              • Distances
                • To Obtain Distance Matrices
                  • Distances Dissimilarity Measures
                  • Distances Dissimilarity Measures for Interval Data
                  • Distances Dissimilarity Measures for Count Data
                  • Distances Dissimilarity Measures for Binary Data
                  • Distances Similarity Measures
                  • Distances Similarity Measures for Interval Data
                  • Distances Similarity Measures for Binary Data
                  • Distances Transform Values
Page 19: TwoStep Cluster Analysis

bull Minkowski The pth root of the sum of the absolute differences to the pth power between the values for the items

bull Customized The rth root of the sum of the absolute differences to the pth power between the values for the items

Hierarchical Cluster Analysis Measures for Count DataThe following dissimilarity measures are available for count data

bull Chi-square measure This measure is based on the chi-square test of equality for two sets of frequencies This is the default for count data

bull Phi-square measure This measure is equal to the chi-square measure normalized by the square root of the combined frequency

Hierarchical Cluster Analysis Measures for Binary DataThe following dissimilarity measures are available for binary data

bull Euclidean distance Computed from a fourfold table as SQRT(b+c) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other

bull Squared Euclidean distance Computed as the number of discordant cases Its minimum value is 0 and it has no upper limit

bull Size difference An index of asymmetry It ranges from 0 to 1

bull Pattern difference Dissimilarity measure for binary data that ranges from 0 to 1 Computed from a fourfold table as bc(n2) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations

bull Variance Computed from a fourfold table as (b+c)4n where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations It ranges from 0 to 1

bull Dispersion This similarity index has a range of minus1 to 1

bull Shape This distance measure has a range of 0 to 1 and it penalizes asymmetry of mismatches

bull Simple matching This is the ratio of matches to the total number of values Equal weight is given to matches and nonmatches

bull Phi 4-point correlation This index is a binary analog of the Pearson correlation coefficient It has a range of minus1 to 1

bull Lambda This index is Goodman and Kruskals lambda Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Anderbergs D Similar to lambda this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Dice This is an index in which joint absences are excluded from consideration and matches are weighted double Also known as the Czekanowski or Sorensen measure

bull Hamann This index is the number of matches minus the number of nonmatches divided by the total number of items It ranges from minus1 to 1

bull Jaccard This is an index in which joint absences are excluded from consideration Equal weight is given to matches and nonmatches Also known as the similarity ratio

bull Kulczynski 1 This is the ratio of joint presences to all nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however the software assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Kulczynski 2 This index is based on the conditional probability that the characteristic is present in one item given that it is present in the other The separate values for each item acting as a predictor of the other are averaged to compute this value

bull Lance and Williams Computed from a fourfold table as (b+c)(2a+b+c) where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other This measure has a range of 0 to 1 (Also known as the Bray-Curtis nonmetric coefficient)

bull Ochiai This index is the binary form of the cosine similarity measure It has a range of 0 to 1

bull Rogers and Tanimoto This is an index in which double weight is given to nonmatches

bull Russel and Rao This is a binary version of the inner (dot) product Equal weight is given to matches and nonmatches This is the default for binary similarity data

bull Sokal and Sneath 1 This is an index in which double weight is given to matches

bull Sokal and Sneath 2 This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration

bull Sokal and Sneath 3 This is the ratio of matches to nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however the software assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Sokal and Sneath 4 This index is based on the conditional probability that the characteristic in one item matches the value in the other The separate values for each item acting as a predictor of the other are averaged to compute this value

bull Sokal and Sneath 5 This index is the squared geometric mean of conditional probabilities of positive and negative matches It is independent of item coding It has a range of 0 to 1

bull Yules Y This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals It has a range of minus1 to 1 Also known as the coefficient of colligation

bull Yules Q This index is a special case of Goodman and Kruskals gamma It is a function of the cross-ratio and is independent of the marginal totals It has a range of minus1 to 1

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Hierarchical Cluster Analysis Transform ValuesThe following alternatives are available for transforming values

bull Z scores Values are standardized to z scores with a mean of 0 and a standard deviation of 1

bull Range minus1 to 1 Each value for the item being standardized is divided by the range of the values

bull Range 0 to 1 The procedure subtracts the minimum value from each item being standardized and then divides by the range

bull Maximum magnitude of 1 The procedure divides each value for the item being standardized by the maximum of the values

bull Mean of 1 The procedure divides each value for the item being standardized by the mean of the values

bull Standard deviation of 1 The procedure divides each value for the variable or case being standardized by the standard deviation of the values

Additionally you can choose how standardization is done Alternatives are By variable or By case

Hierarchical Cluster Analysis StatisticsAgglomeration schedule Displays the cases or clusters combined at each stage the distances between the cases or clusters being combined and the last cluster level at which a case (or variable) joined the cluster

Proximity matrix Gives the distances or similarities between items

Cluster Membership Displays the cluster to which each case is assigned at one or more stages in the combination of clusters Available options are single solution and range of solutions

Hierarchical Cluster Analysis PlotsDendrogram Displays a dendrogram Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep

Icicle Displays an icicle plot including all clusters or a specified range of clusters Icicle plots display information about how cases are combined into clusters at each iteration of the analysis Orientation allows you to select a vertical or horizontal plot

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box click Plots

Hierarchical Cluster Analysis Save New VariablesCluster Membership Allows you to save cluster memberships for a single solution or a range of solutions Saved variables can then be used in subsequent analyses to explore other differences between groups

Saving New VariablesThis feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box click Save

K-Means Cluster Analysis This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics using an algorithm that can handle large numbers of cases However the algorithm requires you to specify the number of clusters You can specify initial cluster centers if you know this information You can select one of two methods for classifying cases either updating cluster centers iteratively or classifying only You can save cluster membership distance information and final cluster centers Optionally you can specify a variable whose values are used to label casewise output You can also request analysis of variance F statistics While these statistics are opportunistic (the procedure tries to form groups that do differ) the relative size of the statistics provides information about each variables contribution to the separation of the groups

Example What are some identifiable groups of television shows that attract similar audiences within each group With k-means cluster analysis you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics This process can be used to identify segments for marketing Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies

Statistics Complete solution initial cluster centers ANOVA table Each case cluster information distance from cluster center

K-Means Cluster Analysis EfficiencyThe k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases as do many clustering algorithms including the algorithm that is used by the hierarchical clustering command

For maximum efficiency take a sample of cases and select the Iterate and classify method to determine cluster centers Select Write final as Then restore the entire data file and select Classify only as the method and select Read initial from to classify the entire file using the centers that are estimated from the sample You can write to and read from a file or a dataset Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session Dataset names must conform to variable-naming rules See the topic Variable names for more information

K-Means Cluster Analysis Iterate

Note These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box

Maximum Iterations Limits the number of iterations in the k-means algorithm Iteration stops after this many iterations even if the convergence criterion is not satisfied This number must be between 1 and 999

To reproduce the algorithm used by the Quick Cluster command prior to version 50 set Maximum Iterations to 1

Convergence Criterion Determines when iteration ceases It represents a proportion of the minimum distance between initial cluster centers so it must be greater than 0 but not greater than 1 If the criterion equals 002 for example iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2 of the smallest distance between any initial cluster centers

Use running means Allows you to request that cluster centers be updated after each case is assigned If you do not select this option new cluster centers are calculated after all cases have been assigned

K-Means Cluster Analysis SaveYou can save information about the solution as new variables to be used in subsequent analyses

Cluster membership Creates a new variable indicating the final cluster membership of each case Values of the new variable range from 1 to the number of clusters

Distance from cluster center Creates a new variable indicating the Euclidean distance between each case and its classification center

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt K-Means Cluster

In the K-Means Cluster dialog box click Save

K-Means Cluster Analysis OptionsStatistics You can select the following statistics initial cluster centers ANOVA table and cluster information for each case

bull Initial cluster centers First estimate of the variable means for each of the clusters By default a number of well-spaced cases equal to the number of clusters is selected from the data Initial cluster centers are used for a first round of classification and are then updated

bull ANOVA table Displays an analysis-of-variance table which includes univariate F tests for each clustering variable The F tests are only descriptive and the resulting probabilities should not be interpreted The ANOVA table is not displayed if all cases are assigned to a single cluster

bull Cluster information for each case Displays for each case the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case Also displays Euclidean distance between final cluster centers

Missing Values Available options are Exclude cases listwise or Exclude cases pairwise

bull Exclude cases listwise Excludes cases with missing values for any clustering variable from the analysis

bull Exclude cases pairwise Assigns cases to clusters based on distances that are computed from all variables with nonmissing values

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt K-Means Cluster

In the K-Means Cluster dialog box click Options

DistancesThis procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances) either between pairs of variables or between pairs of cases These similarity or distance measures can then be used with other procedures such as factor analysis cluster analysis or multidimensional scaling to help analyze complex datasets

Example Is it possible to measure similarities between pairs of automobiles based on certain characteristics such as engine size MPG and horsepower By computing similarities between

autos you can gain a sense of which autos are similar to each other and which are different from each other For a more formal analysis you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure

Statistics Dissimilarity (distance) measures for interval data are Euclidean distance squared Euclidean distance Chebychev block Minkowski or customized for count data chi-square or phi-square for binary data Euclidean distance squared Euclidean distance size difference pattern difference variance shape or Lance and Williams Similarity measures for interval data are Pearson correlation or cosine for binary data Russel and Rao simple matching Jaccard dice Rogers and Tanimoto Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Kulczynski 1 Kulczynski 2 Sokal and Sneath 4 Hamann Lambda Anderbergs D Yules Y Yules Q Ochiai Sokal and Sneath 5 phi 4-point correlation or dispersion

To Obtain Distance Matrices

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

Select at least one numeric variable to compute distances between cases or select at least two numeric variables to compute distances between variables

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables

Distances Dissimilarity MeasuresFrom the Measure group select the alternative that corresponds to your type of data (interval count or binary) then from the drop-down list select one of the measures that corresponds to that type of data Available measures by data type are

bull Interval data Euclidean distance squared Euclidean distance Chebychev block Minkowski or customized

bull Count data Chi-square measure or phi-square measure

bull Binary data Euclidean distance squared Euclidean distance size difference pattern difference variance shape or Lance and Williams (Enter values for Present and Absent to specify which two values are meaningful Distances will ignore all other values)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities These transformations are not applicable to binary data Available

standardization methods are z scores range ndash1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 or standard deviation of 1

The Transform Measures group allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available options are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

With Dissimilarities selected click Measures

From the Measure group select the alternative that corresponds to your type of data

From the drop-down list select a measure that corresponds to that type of measure

Distances Dissimilarity Measures for Interval DataThe following dissimilarity measures are available for interval data

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
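In symbols, for two items with value vectors x and y, the measures above can be written as:

\[ d_{\text{Euclid}}(x,y) = \sqrt{\sum_i (x_i - y_i)^2}, \qquad d_{\text{SEuclid}}(x,y) = \sum_i (x_i - y_i)^2 \]
\[ d_{\text{Chebychev}}(x,y) = \max_i |x_i - y_i|, \qquad d_{\text{Block}}(x,y) = \sum_i |x_i - y_i| \]
\[ d_{\text{Minkowski}}(x,y) = \Big(\sum_i |x_i - y_i|^p\Big)^{1/p}, \qquad d_{\text{Customized}}(x,y) = \Big(\sum_i |x_i - y_i|^p\Big)^{1/r} \]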

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
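Writing E(x_i) and E(y_i) for the expected frequencies under the hypothesis that the two sets of frequencies are equal, and N for the combined total frequency, a common formulation (a sketch consistent with the descriptions above, not necessarily the procedure's exact internals) is:

\[ \text{CHISQ}(x,y) = \sqrt{\sum_i \frac{(x_i - E(x_i))^2}{E(x_i)} + \sum_i \frac{(y_i - E(y_i))^2}{E(y_i)}}, \qquad \text{PH2}(x,y) = \frac{\text{CHISQ}(x,y)}{\sqrt{N}} \]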

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n²), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
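All of the binary measures are computed from the fourfold (2 × 2) table of the two items. Writing a for joint presences, d for joint absences, b and c for the two kinds of mismatches, and n = a + b + c + d, the formulas quoted above read:

\[ \sqrt{b+c}, \qquad b+c, \qquad \frac{bc}{n^2}, \qquad \frac{b+c}{4n}, \qquad \frac{b+c}{2a+b+c} \]

for Euclidean distance, squared Euclidean distance, pattern difference, variance, and Lance and Williams, respectively.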

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.
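In symbols, for value vectors x and y:

\[ r(x,y) = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \sum_i (y_i-\bar{y})^2}}, \qquad \cos(x,y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2}} \]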

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data (fourfold-table formulas for several of these are sketched after the list):

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion. This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
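Using the same fourfold-table notation as above (a = joint presences, d = joint absences, b and c = mismatches, n = a + b + c + d), standard formulations consistent with the verbal descriptions above are, for example:

\[ S_{\text{RR}} = \frac{a}{n}, \qquad S_{\text{SM}} = \frac{a+d}{n}, \qquad S_{\text{Jaccard}} = \frac{a}{a+b+c}, \qquad S_{\text{Dice}} = \frac{2a}{2a+b+c} \]
\[ S_{\text{SS3}} = \frac{a+d}{b+c}, \qquad S_{\text{K1}} = \frac{a}{b+c}, \qquad S_{\text{Hamann}} = \frac{(a+d)-(b+c)}{n}, \qquad S_{\text{Ochiai}} = \frac{a}{\sqrt{(a+b)(a+c)}} \]
\[ Q = \frac{ad-bc}{ad+bc}, \qquad Y = \frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}, \qquad \phi = \frac{ad-bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \]

These are sketches in the usual textbook notation, not transcriptions of the procedure's internals.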

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
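These transformations are simple rescalings; a minimal sketch of the six alternatives (assuming NumPy, standardizing By variable, i.e., column-wise):

import numpy as np

def transform_values(X, method="z"):
    """Standardize the columns of X before computing proximities (sketch)."""
    X = np.asarray(X, dtype=float)
    if method == "z":          # mean 0, standard deviation 1
        return (X - X.mean(axis=0)) / X.std(axis=0)
    if method == "range_pm1":  # divide each value by the range of the values
        return X / (X.max(axis=0) - X.min(axis=0))
    if method == "range_01":   # subtract the minimum, then divide by the range
        return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    if method == "max1":       # divide by the maximum of the values
        return X / X.max(axis=0)
    if method == "mean1":      # divide by the mean of the values
        return X / X.mean(axis=0)
    if method == "sd1":        # divide by the standard deviation of the values
        return X / X.std(axis=0)
    raise ValueError(f"unknown method: {method}")

Standardizing By case instead would apply the same arithmetic along axis=1.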

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases or variables individually, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency-count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, that is, statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.
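The same agglomerative process can be sketched outside SYSTAT, for instance with SciPy (an illustration of the stepwise joining, not SYSTAT's implementation):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))      # 12 objects measured on 3 variables

Z = linkage(X, method="average")  # stepwise joining with average linkage
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) draws the tree itself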

You must select the elements of the data file to cluster (Join)

Rows Rows (cases) of the data matrix are clustered Columns Columns (variables) of the data matrix are clustered Matrix Rows and columns of the data matrix are clusteredmdashthey are permuted to bring

similar rows and columns next to one another

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured); distance formulas for the basic linkage rules are sketched after this list.

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; the range of β is between −1 and 1.

K-nbd. The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.
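For clusters A and B and a pairwise distance d, the three basic rules can be written as:

\[ d_{\text{single}}(A,B) = \min_{x \in A,\, y \in B} d(x,y), \qquad d_{\text{complete}}(A,B) = \max_{x \in A,\, y \in B} d(x,y), \qquad d_{\text{average}}(A,B) = \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x,y) \]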

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible." Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.
Polar. Produces a polar (circular) cluster tree.

Save. Save provides two options: either to save cluster identifiers or to save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file; if not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix to compute Mahalanobis distance.

Covariance matrix. Specify the covariance matrix to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file; otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.
Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate the partition quality. In particular, it is used to find out the appropriate number of clusters for the given data set:

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.
Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.
DB. Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable for rectangular data only.
Dunn. Provides Dunn's cluster separation measure.
Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
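As one example, the pseudo F-ratio for a k-cluster solution on n objects is commonly computed in the Calinski-Harabasz form (a sketch; SYSTAT's exact definition may differ in detail), where B and W are the between- and within-cluster sums of squares:

\[ \text{pseudo } F = \frac{B/(k-1)}{W/(n-k)} \]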

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm. Provides K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.
K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.
Last k. Considers the last k non-missing cases as initial seeds.
Random k. Chooses randomly (without replacement) k non-missing cases as initial seeds.
Random segmentation. Assigns each case to any of k partitions randomly, then computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.
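A minimal k-means sketch (using scikit-learn as an illustration; note that SYSTAT's seeding options above differ from scikit-learn's default k-means++ initialization):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # 100 cases measured on 4 variables

km = KMeans(n_clusters=3, n_init=10, max_iter=20, random_state=0)
labels = km.fit_predict(X)      # cluster identifier for each case
print(km.cluster_centers_)      # final cluster seeds (centroids)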

Hierarchical Cluster Analysis Measures for Binary Data (continued)

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Phi 4-point correlation. This index is a binary analog of the Pearson correlation coefficient. It has a range of −1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.

Hierarchical Cluster Analysis Statistics

Agglomeration schedule. Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix. Gives the distances or similarities between items.

Cluster Membership. Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Cluster Analysis Plots

Dendrogram. Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle. Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.

Hierarchical Cluster Analysis Save New Variables

Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
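The same sample-then-classify workflow can be sketched in scikit-learn terms (an illustration of the idea, not the SPSS implementation):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
full = rng.normal(size=(100000, 5))   # stands in for the entire data file
sample = full[rng.choice(len(full), size=2000, replace=False)]

km = KMeans(n_clusters=4, n_init=10, random_state=0)
km.fit(sample)             # "iterate and classify" on the sample only
labels = km.predict(full)  # "classify only" the entire file from those centers
print(np.bincount(labels))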

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations, even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.


• Sokal and Sneath 2: This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration.

• Sokal and Sneath 3: This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or greater than this value.

• Sokal and Sneath 4: This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 5: This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Yule's Y: This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q: This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

You may optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
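Because these binary indices are all simple functions of the fourfold (2 × 2) table, they are easy to verify by hand. A small Python sketch, assuming two 0/1 vectors, computes the table counts and the two Yule coefficients from their standard formulas (Q = (ad − bc)/(ad + bc); Y uses the square roots of ad and bc):

    import numpy as np

    def fourfold(x, y):
        # a = joint presences, b/c = mismatches, d = joint absences
        x, y = np.asarray(x, bool), np.asarray(y, bool)
        a = np.sum(x & y); b = np.sum(x & ~y)
        c = np.sum(~x & y); d = np.sum(~x & ~y)
        return a, b, c, d

    def yules_q(x, y):
        a, b, c, d = fourfold(x, y)
        return (a * d - b * c) / (a * d + b * c)

    def yules_y(x, y):
        a, b, c, d = fourfold(x, y)
        ad, bc = np.sqrt(a * d), np.sqrt(b * c)
        return (ad - bc) / (ad + bc)

    x = [1, 1, 0, 1, 0, 0, 1]
    y = [1, 0, 0, 1, 0, 1, 1]
    print(yules_q(x, y), yules_y(x, y))   # both lie in [-1, 1]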

Hierarchical Cluster Analysis Transform Values

The following alternatives are available for transforming values:

• Z scores: Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1: Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1: The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1: The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1: The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1: The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. The alternatives are By variable or By case.
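These transformations are all one-liners on a data matrix. The following Python sketch (numpy only; the axis argument plays the role of By variable versus By case) is an illustration, not the procedure's own code:

    import numpy as np

    X = np.array([[1.0, 200.0, 3.0],
                  [2.0, 150.0, 9.0],
                  [4.0, 100.0, 6.0]])

    # axis=0 standardizes each variable (column); axis=1 would standardize each case (row)
    z_scores  = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)           # Z scores
    range01   = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # range 0 to 1
    max_mag_1 = X / np.abs(X).max(axis=0)                              # maximum magnitude of 1
    mean_of_1 = X / X.mean(axis=0)                                     # mean of 1
    sd_of_1   = X / X.std(axis=0, ddof=1)                              # standard deviation of 1

Whether ddof=1 (the sample standard deviation) matches the procedure's exact convention is an assumption; the idea is unchanged either way.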

Hierarchical Cluster Analysis Statistics

Agglomeration schedule: Displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix: Gives the distances or similarities between items.

Cluster Membership: Displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.
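The agglomeration schedule has a direct analogue in scipy's linkage matrix, where each row records one merge. A hedged sketch with toy data and an arbitrarily chosen method:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    X = np.random.default_rng(1).normal(size=(6, 2))
    Z = linkage(X, method="average", metric="euclidean")

    # Each row of Z is one agglomeration step:
    # [cluster id joined, cluster id joined, merge distance, size of new cluster]
    for step, (i, j, dist, size) in enumerate(Z, start=1):
        print(step, int(i), int(j), round(dist, 3), int(size))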

Hierarchical Cluster Analysis Plots

Dendrogram: Displays a dendrogram. Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle: Displays an icicle plot, including all clusters or a specified range of clusters. Icicle plots display information about how cases are combined into clusters at each iteration of the analysis. Orientation allows you to select a vertical or horizontal plot.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Plots.
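Outside the dialog, a dendrogram of the same kind can be drawn with scipy and matplotlib; the orientation argument gives the horizontal/vertical choice mentioned above. Data and method here are placeholders:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    X = np.random.default_rng(2).normal(size=(12, 3))
    Z = linkage(X, method="complete")

    dendrogram(Z, orientation="right")   # "right" draws a horizontal tree; default is vertical
    plt.tight_layout()
    plt.show()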

Hierarchical Cluster Analysis Save New Variables

Cluster Membership: Allows you to save cluster memberships for a single solution or a range of solutions. Saved variables can then be used in subsequent analyses to explore other differences between groups.

Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.
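The saved range-of-solutions variables can be reproduced with scipy's fcluster, one column per solution. The clu* column names below are illustrative (SPSS itself uses names such as CLU4_1):

    import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(3).normal(size=(20, 4))
    Z = linkage(X, method="ward")

    # Membership for a range of solutions: 2 through 5 clusters
    members = pd.DataFrame(
        {f"clu{k}": fcluster(Z, t=k, criterion="maxclust") for k in range(2, 6)}
    )
    print(members.head())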

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example: What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics: For the complete solution: initial cluster centers, ANOVA table. For each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as many clustering algorithms do, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers, selecting Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved before the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
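The same sample-then-classify workflow can be mimicked in Python: estimate centers on a random sample, then assign every case using those fixed centers. This is a sketch of the idea, not the command's implementation:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    X_full = rng.normal(size=(100_000, 5))

    # "Iterate and classify" on a sample to estimate the centers ...
    sample = X_full[rng.choice(len(X_full), size=2_000, replace=False)]
    km = KMeans(n_clusters=4, n_init=10).fit(sample)

    # ... then "classify only" on the entire file using those centers
    labels_full = km.predict(X_full)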

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations: Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion: Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means: Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership: Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center: Creates a new variable indicating the Euclidean distance between each case and its classification center.
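Both saved variables have simple equivalents with scikit-learn: the labels give membership, and transform returns Euclidean distances to every center. (SPSS names the saved variables QCL_1 and QCL_2 by default; the mapping below is an illustrative assumption.)

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(5).normal(size=(200, 3))
    km = KMeans(n_clusters=3, n_init=10).fit(X)

    membership = km.labels_ + 1                         # values 1 .. number of clusters
    dists = km.transform(X)                             # distance of each case to every center
    dist_to_own = dists[np.arange(len(X)), km.labels_]  # distance to the classification center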

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics: You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers: First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table: Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case: Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values: Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise: Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise: Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex datasets.

Example: Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics: Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.
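The cases-versus-variables choice corresponds to whether you run a pairwise-distance routine on the data matrix or on its transpose. A minimal scipy sketch with made-up data:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.random.default_rng(6).normal(size=(8, 4))   # 8 cases, 4 variables

    between_cases = squareform(pdist(X,   metric="euclidean"))  # 8 x 8 matrix
    between_vars  = squareform(pdist(X.T, metric="euclidean"))  # 4 x 4: transpose first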

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data: Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data: Chi-square measure or phi-square measure.

• Binary data: Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance: The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance: The sum of the squared differences between the values for the items.

• Chebychev: The maximum absolute difference between the values for the items.

• Block: The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski: The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized: The rth root of the sum of the absolute differences to the pth power between the values for the items.
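Minkowski and the customized measure differ only in whether the root r equals the power p. A small sketch (scipy provides Minkowski; the customized form is easy to write directly, with p and r as illustrative parameters):

    import numpy as np
    from scipy.spatial.distance import minkowski

    x = np.array([1.0, 4.0, 2.0])
    y = np.array([2.0, 1.0, 3.0])

    d_mink = minkowski(x, y, p=3)   # pth root of the sum of |differences|**p

    def customized(x, y, p, r):
        # rth root of the sum of absolute differences raised to the pth power
        return np.sum(np.abs(x - y) ** p) ** (1.0 / r)

    print(d_mink, customized(x, y, p=3, r=3))   # equal when r == p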

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure: This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure: This measure is equal to the chi-square measure normalized by the square root of the combined frequency.

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance: Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance: Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference: An index of asymmetry. It ranges from 0 to 1.

• Pattern difference: A dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/n², where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance: Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape: This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams: Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
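Since every formula above is stated in terms of the fourfold counts a, b, c, and n, they can be checked with a few lines of numpy (toy vectors; only the measures with explicit formulas in this list are included):

    import numpy as np

    def binary_dissimilarities(x, y):
        x, y = np.asarray(x, bool), np.asarray(y, bool)
        a = np.sum(x & y)     # present on both items
        b = np.sum(x & ~y)    # present on the first item only
        c = np.sum(~x & y)    # present on the second item only
        n = x.size            # total number of observations
        return {
            "euclidean":          np.sqrt(b + c),
            "squared_euclidean":  b + c,                  # number of discordant cases
            "pattern_difference": b * c / n**2,
            "variance":           (b + c) / (4 * n),
            "lance_williams":     (b + c) / (2 * a + b + c),
        }

    print(binary_dissimilarities([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))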

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data: Pearson correlation or cosine.

• Binary data: Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range −1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0–1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation: The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine: The cosine of the angle between two vectors of values.
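Both interval similarity measures reduce to familiar linear algebra. For variables (columns), the sketch below computes the full similarity matrices with numpy:

    import numpy as np

    X = np.random.default_rng(7).normal(size=(50, 4))   # 50 cases, 4 variables

    pearson = np.corrcoef(X, rowvar=False)   # 4 x 4 correlations between variables

    Xn = X / np.linalg.norm(X, axis=0)       # normalize each column to unit length
    cosine = Xn.T @ Xn                       # cosine of the angle between variable vectors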

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russell and Rao: This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching: This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard: This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice: This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto: This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1: This is an index in which double weight is given to matches.

• Sokal and Sneath 2: This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration.

• Sokal and Sneath 3: This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or greater than this value.

• Kulczynski 1: This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or greater than this value.

• Kulczynski 2: This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4: This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann: This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from −1 to 1.

• Lambda: This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D: Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y: This index is a function of the cross-ratio for a 2 × 2 table and is independent of the marginal totals. It has a range of −1 to 1. Also known as the coefficient of colligation.

• Yule's Q: This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of −1 to 1.

• Ochiai: This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5: This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation: This index is a binary analogue of the Pearson correlation coefficient. It has a range of −1 to 1.

• Dispersion: This index has a range of −1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores: Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range −1 to 1: Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1: The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1: The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1: The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1: The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. The alternatives are By variable or By case.

CLUSTER provides three procedures for clustering: Hierarchical Clustering, K-Clustering, and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical linkage methods. The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation. The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering.

Hierarchical Clustering clusters cases, variables, or both cases and variables simultaneously; K-Clustering clusters cases only; and Additive Trees clusters a similarity or dissimilarity matrix. Several distance metrics are available with Hierarchical Clustering and K-Clustering, including metrics for binary, quantitative, and frequency-count data. Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent the values. Resampling procedures are available only in Hierarchical Clustering.

SYSTAT further provides five indices, viz. statistical criteria, by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two algorithms, KMEANS and KMEDIANS, for partitioning. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two “closest” objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.

You must select the elements of the data file to cluster (Join):

Rows: Rows (cases) of the data matrix are clustered.
Columns: Columns (variables) of the data matrix are clustered.
Matrix: Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, it defines how distances between clusters are measured):

Average: Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid: Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete: Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's “max” method.

Flexibeta: Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; its range is between −1 and 1.

K-nbd: The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median: Median linkage uses the median distance between pairs of objects in different clusters to decide how far apart they are.

Single: Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's “min” method.

Uniform: The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward: Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted: Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) “inadmissible” clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered “admissible.” Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.
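The non-monotonicity problem is easy to probe: scipy offers the same linkage methods, and an "inversion" shows up as a decrease in successive merge distances, which is exactly what produces the stray branches described above. A sketch on random data (how many inversions appear depends on the data; centroid and median linkage are the usual sources):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    X = np.random.default_rng(8).normal(size=(30, 2))

    for method in ["single", "complete", "average", "weighted",
                   "centroid", "median", "ward"]:
        Z = linkage(X, method=method)
        # Z[:, 2] holds the merge distances; a negative difference is an inversion
        inversions = int(np.sum(np.diff(Z[:, 2]) < 0))
        print(f"{method:>9}: {inversions} inversion(s)")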

In addition, the following options can be specified:

Distance: Specifies the distance metric used to compare clusters.
Polar: Produces a polar (circular) cluster tree.

Save: Provides two options: save cluster identifiers, or save cluster identifiers along with data. You can specify the number of clusters to identify for the saved file. If not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix: Specify the covariance matrix used to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file; otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at: You can choose the following options for cutting the cluster tree:

Height: Provides the option of cutting the cluster tree at a specified distance.
Leaf nodes: Provides the option of cutting the cluster tree by number of leaf nodes.
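Both cuts have direct counterparts in scipy's fcluster, which turns a hierarchical tree into flat cluster labels; the threshold values here are arbitrary:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(9).normal(size=(25, 3))
    Z = linkage(X, method="average")

    by_height = fcluster(Z, t=1.5, criterion="distance")  # cut at a specified distance
    by_count  = fcluster(Z, t=4,   criterion="maxclust")  # cut to a target number of clusters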

Color clusters by: The colors in the cluster tree can be assigned by two different methods:

Length of terminal node: As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes: Colors are assigned based on the proportion of members in a cluster.

Validity: Provides five validity indices to evaluate the partition quality. In particular, these indices can be used to find an appropriate number of clusters for the given data set:

RMSSTD: Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.
Pseudo F: Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.
Pseudo T-square: Provides the pseudo T-square statistic for cluster assessment.
DB: Provides Davies-Bouldin's index for each hierarchy of clustering. This index is applicable to rectangular data only.
Dunn: Provides Dunn's cluster separation measure.
Maximum groups: Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.

K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering Both clustering methods splits a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group

By default the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed The reassigning of cases continue until the within-groups sum of

squares can no longer be reduced The initial seeds or partitions can be chosen from a possible set of nine options

Algorithm Provides K-Means and K-Medians clustering options

K-means Requests K-Means clustering K-medians Requests K-Medians clustering

Groups Enter the number of desired clusters Default number (Groups) is two

Iterations Enter the maximum number of iterations If not stated the maximum is 20

Distance Specifies the distance metric used to compare clusters

Save Save provides three options to save either cluster identifiers cluster identifiers along with data or final cluster seeds to a SYSTAT file

Click More above for descriptions of the distance metrics

None Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally It continues splitting and reassigning the cases until k clusters are formed

First kConsiders the first k non-missing cases as initial seeds Last kConsiders the last k non-missing cases as initial seeds Random k Chooses randomly (without replacement) k non-missing cases as initial

seeds Random segmentation Assigns each case to any of k partitions randomly Computes

seeds from each initial partition taking the mean or the median of the observations whichever is applicable

  • Two Step Cluster Analysis
    • To Obtain a TwoStep Cluster Analysis
      • TwoStep Cluster Analysis Options
        • Advanced Options
          • TwoStep Cluster Analysis Output
          • The Cluster Viewer
          • Transpose Clusters and Features
          • Sort Features
          • Sort Clusters
          • Cell Contents
          • Cluster Comparison View
          • Navigating the Cluster Viewer
            • Using the Toolbars
            • Control Cluster View Display
              • Filtering Records
              • Hierarchical Cluster Analysis
              • Hierarchical Cluster Analysis Method
              • Hierarchical Cluster Analysis Measures for Interval Data
              • Hierarchical Cluster Analysis Measures for Count Data
              • Hierarchical Cluster Analysis Measures for Binary Data
              • Hierarchical Cluster Analysis Transform Values
              • Hierarchical Cluster Analysis Statistics
              • Hierarchical Cluster Analysis Plots
              • Hierarchical Cluster Analysis Save New Variables
              • Saving New Variables
              • K-Means Cluster Analysis
              • K-Means Cluster Analysis Efficiency
              • K-Means Cluster Analysis Iterate
              • K-Means Cluster Analysis Save
              • K-Means Cluster Analysis Options
              • Distances
                • To Obtain Distance Matrices
                  • Distances Dissimilarity Measures
                  • Distances Dissimilarity Measures for Interval Data
                  • Distances Dissimilarity Measures for Count Data
                  • Distances Dissimilarity Measures for Binary Data
                  • Distances Similarity Measures
                  • Distances Similarity Measures for Interval Data
                  • Distances Similarity Measures for Binary Data
                  • Distances Transform Values
Page 22: TwoStep Cluster Analysis

Additionally you can choose how standardization is done Alternatives are By variable or By case

Hierarchical Cluster Analysis StatisticsAgglomeration schedule Displays the cases or clusters combined at each stage the distances between the cases or clusters being combined and the last cluster level at which a case (or variable) joined the cluster

Proximity matrix Gives the distances or similarities between items

Cluster Membership Displays the cluster to which each case is assigned at one or more stages in the combination of clusters Available options are single solution and range of solutions

Hierarchical Cluster Analysis PlotsDendrogram Displays a dendrogram Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep

Icicle Displays an icicle plot including all clusters or a specified range of clusters Icicle plots display information about how cases are combined into clusters at each iteration of the analysis Orientation allows you to select a vertical or horizontal plot

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box click Plots

Hierarchical Cluster Analysis Save New VariablesCluster Membership Allows you to save cluster memberships for a single solution or a range of solutions Saved variables can then be used in subsequent analyses to explore other differences between groups

Saving New VariablesThis feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box click Save

K-Means Cluster Analysis This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics using an algorithm that can handle large numbers of cases However the algorithm requires you to specify the number of clusters You can specify initial cluster centers if you know this information You can select one of two methods for classifying cases either updating cluster centers iteratively or classifying only You can save cluster membership distance information and final cluster centers Optionally you can specify a variable whose values are used to label casewise output You can also request analysis of variance F statistics While these statistics are opportunistic (the procedure tries to form groups that do differ) the relative size of the statistics provides information about each variables contribution to the separation of the groups

Example What are some identifiable groups of television shows that attract similar audiences within each group With k-means cluster analysis you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics This process can be used to identify segments for marketing Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies

Statistics Complete solution initial cluster centers ANOVA table Each case cluster information distance from cluster center

K-Means Cluster Analysis EfficiencyThe k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases as do many clustering algorithms including the algorithm that is used by the hierarchical clustering command

For maximum efficiency take a sample of cases and select the Iterate and classify method to determine cluster centers Select Write final as Then restore the entire data file and select Classify only as the method and select Read initial from to classify the entire file using the centers that are estimated from the sample You can write to and read from a file or a dataset Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session Dataset names must conform to variable-naming rules See the topic Variable names for more information

K-Means Cluster Analysis Iterate

Note These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box

Maximum Iterations Limits the number of iterations in the k-means algorithm Iteration stops after this many iterations even if the convergence criterion is not satisfied This number must be between 1 and 999

To reproduce the algorithm used by the Quick Cluster command prior to version 50 set Maximum Iterations to 1

Convergence Criterion Determines when iteration ceases It represents a proportion of the minimum distance between initial cluster centers so it must be greater than 0 but not greater than 1 If the criterion equals 002 for example iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2 of the smallest distance between any initial cluster centers

Use running means Allows you to request that cluster centers be updated after each case is assigned If you do not select this option new cluster centers are calculated after all cases have been assigned

K-Means Cluster Analysis SaveYou can save information about the solution as new variables to be used in subsequent analyses

Cluster membership Creates a new variable indicating the final cluster membership of each case Values of the new variable range from 1 to the number of clusters

Distance from cluster center Creates a new variable indicating the Euclidean distance between each case and its classification center

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt K-Means Cluster

In the K-Means Cluster dialog box click Save

K-Means Cluster Analysis OptionsStatistics You can select the following statistics initial cluster centers ANOVA table and cluster information for each case

bull Initial cluster centers First estimate of the variable means for each of the clusters By default a number of well-spaced cases equal to the number of clusters is selected from the data Initial cluster centers are used for a first round of classification and are then updated

bull ANOVA table Displays an analysis-of-variance table which includes univariate F tests for each clustering variable The F tests are only descriptive and the resulting probabilities should not be interpreted The ANOVA table is not displayed if all cases are assigned to a single cluster

bull Cluster information for each case Displays for each case the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case Also displays Euclidean distance between final cluster centers

Missing Values Available options are Exclude cases listwise or Exclude cases pairwise

bull Exclude cases listwise Excludes cases with missing values for any clustering variable from the analysis

bull Exclude cases pairwise Assigns cases to clusters based on distances that are computed from all variables with nonmissing values

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt K-Means Cluster

In the K-Means Cluster dialog box click Options

DistancesThis procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances) either between pairs of variables or between pairs of cases These similarity or distance measures can then be used with other procedures such as factor analysis cluster analysis or multidimensional scaling to help analyze complex datasets

Example Is it possible to measure similarities between pairs of automobiles based on certain characteristics such as engine size MPG and horsepower By computing similarities between

autos you can gain a sense of which autos are similar to each other and which are different from each other For a more formal analysis you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure

Statistics Dissimilarity (distance) measures for interval data are Euclidean distance squared Euclidean distance Chebychev block Minkowski or customized for count data chi-square or phi-square for binary data Euclidean distance squared Euclidean distance size difference pattern difference variance shape or Lance and Williams Similarity measures for interval data are Pearson correlation or cosine for binary data Russel and Rao simple matching Jaccard dice Rogers and Tanimoto Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Kulczynski 1 Kulczynski 2 Sokal and Sneath 4 Hamann Lambda Anderbergs D Yules Y Yules Q Ochiai Sokal and Sneath 5 phi 4-point correlation or dispersion

To Obtain Distance Matrices

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

Select at least one numeric variable to compute distances between cases or select at least two numeric variables to compute distances between variables

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables

Distances Dissimilarity MeasuresFrom the Measure group select the alternative that corresponds to your type of data (interval count or binary) then from the drop-down list select one of the measures that corresponds to that type of data Available measures by data type are

bull Interval data Euclidean distance squared Euclidean distance Chebychev block Minkowski or customized

bull Count data Chi-square measure or phi-square measure

bull Binary data Euclidean distance squared Euclidean distance size difference pattern difference variance shape or Lance and Williams (Enter values for Present and Absent to specify which two values are meaningful Distances will ignore all other values)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities These transformations are not applicable to binary data Available

standardization methods are z scores range ndash1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 or standard deviation of 1

The Transform Measures group allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available options are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

With Dissimilarities selected click Measures

From the Measure group select the alternative that corresponds to your type of data

From the drop-down list select a measure that corresponds to that type of measure

Distances Dissimilarity Measures for Interval DataThe following dissimilarity measures are available for interval data

bull Euclidean distance The square root of the sum of the squared differences between values for the items This is the default for interval data

bull Squared Euclidean distance The sum of the squared differences between the values for the items

bull Chebychev The maximum absolute difference between the values for the items

bull Block The sum of the absolute differences between the values of the item Also known as Manhattan distance

bull Minkowski The pth root of the sum of the absolute differences to the pth power between the values for the items

bull Customized The rth root of the sum of the absolute differences to the pth power between the values for the items

Distances Dissimilarity Measures for Count DataThe following dissimilarity measures are available for count data

bull Chi-square measure This measure is based on the chi-square test of equality for two sets of frequencies This is the default for count data

bull Phi-square measure This measure is equal to the chi-square measure normalized by the square root of the combined frequency

Distances Dissimilarity Measures for Binary DataThe following dissimilarity measures are available for binary data

bull Euclidean distance Computed from a fourfold table as SQRT(b+c) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other

bull Squared Euclidean distance Computed as the number of discordant cases Its minimum value is 0 and it has no upper limit

bull Size difference An index of asymmetry It ranges from 0 to 1

bull Pattern difference Dissimilarity measure for binary data that ranges from 0 to 1 Computed from a fourfold table as bc(n2) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations

bull Variance Computed from a fourfold table as (b+c)4n where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations It ranges from 0 to 1

bull Shape This distance measure has a range of 0 to 1 and it penalizes asymmetry of mismatches

bull Lance and Williams Computed from a fourfold table as (b+c)(2a+b+c) where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other This measure has a range of 0 to 1 (Also known as the Bray-Curtis nonmetric coefficient)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Distances Similarity MeasuresFrom the Measure group select the alternative that corresponds to your type of data (interval or binary) then from the drop-down list select one of the measures that corresponds to that type of data Available measures by data type are

bull Interval data Pearson correlation or cosine

bull Binary data Russell and Rao simple matching Jaccard Dice Rogers and Tanimoto Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Kulczynski 1 Kulczynski 2 Sokal and Sneath 4 Hamann Lambda Anderbergs D Yules Y Yules Q Ochiai Sokal and Sneath 5 phi 4-point correlation or dispersion (Enter values for Present and Absent to specify which two values are meaningful Distances will ignore all other values)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities These transformations are not applicable to binary data Available standardization methods are z scores range ndash1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 and standard deviation of 1

The Transform Measures group allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available options are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

With Similarities selected click Measures

From the Measure group select the alternative that corresponds to your type of data

From the drop-down list select a measure that corresponds to that type of measure

Distances Similarity Measures for Interval DataThe following similarity measures are available for interval data

bull Pearson correlation The product-moment correlation between two vectors of values This is the default similarity measure for interval data

bull Cosine The cosine of the angle between two vectors of values

Distances Similarity Measures for Binary DataThe following similarity measures are available for binary data

bull Russel and Rao This is a binary version of the inner (dot) product Equal weight is given to matches and nonmatches This is the default for binary similarity data

bull Simple matching This is the ratio of matches to the total number of values Equal weight is given to matches and nonmatches

bull Jaccard This is an index in which joint absences are excluded from consideration Equal weight is given to matches and nonmatches Also known as the similarity ratio

bull Dice This is an index in which joint absences are excluded from consideration and matches are weighted double Also known as the Czekanowski or Sorensen measure

bull Rogers and Tanimoto This is an index in which double weight is given to nonmatches

bull Sokal and Sneath 1 This is an index in which double weight is given to matches

bull Sokal and Sneath 2 This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration

bull Sokal and Sneath 3 This is the ratio of matches to nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however Distances assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Kulczynski 1 This is the ratio of joint presences to all nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however Distances assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Kulczynski 2 This index is based on the conditional probability that the characteristic is present in one item given that it is present in the other The separate values for each item acting as predictor of the other are averaged to compute this value

bull Sokal and Sneath 4 This index is based on the conditional probability that the characteristic in one item matches the value in the other The separate values for each item acting as predictor of the other are averaged to compute this value

bull Hamann This index is the number of matches minus the number of nonmatches divided by the total number of items It ranges from -1 to 1

bull Lambda This index is Goodman and Kruskals lambda Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Anderbergs D Similar to lambda this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Yules Y This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals It has a range of -1 to 1 Also known as the coefficient of colligation

bull Yules Q This index is a special case of Goodman and Kruskals gamma It is a function of the cross-ratio and is independent of the marginal totals It has a range of -1 to 1

bull Ochiai This index is the binary form of the cosine similarity measure It has a range of 0 to 1

bull Sokal and Sneath 5 This index is the squared geometric mean of conditional probabilities of positive and negative matches It is independent of item coding It has a range of 0 to 1

bull Phi 4-point correlation This index is a binary analogue of the Pearson correlation coefficient It has a range of -1 to 1

bull Dispersion This index has a range of -1 to 1

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Distances Transform ValuesThe following alternatives are available for transforming values

bull Z scores Values are standardized to z scores with a mean of 0 and a standard deviation of 1

bull Range -1 to 1 Each value for the item being standardized is divided by the range of the values

bull Range 0 to 1 The procedure subtracts the minimum value from each item being standardized and then divides by the range

bull Maximum magnitude of 1 The procedure divides each value for the item being standardized by the maximum of the values

bull Mean of 1 The procedure divides each value for the item being standardized by the mean of the values

bull Standard deviation of 1 The procedure divides each value for the variable or case being standardized by the standard deviation of the values

Additionally you can choose how standardization is done Alternatives are By variable or By case

CLUSTER provides three procedures for clustering Hierarchical Clustering K-Clustering and Additive Trees The Hierarchical Clustering procedure comprises hierarchical linkage methods The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering

Hierarchical Clustering clusters cases variables individually or both cases and variables simultaneously K-Clustering clusters cases only and Additive Trees clusters a similarity or dissimilarity matrix Several distance metrics are available with Hierarchical Clustering and K-Clustering including metrics for binary quantitative and frequency count data Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram When the MATRIX option is used to cluster cases and variables SYSTAT uses a gray-scale or color spectrum to represent the values Resampling procedures are available only in Hierarchical Clustering

SYSTAT further provides five indices viz statistical criteria by which an appropriate number of clusters can be chosen from the Hierarchical Tree Options for cutting (or pruning) and coloring the hierarchical tree are also provided

In the K-Clustering procedure SYSTAT offers two algorithms KMEANS and KMEDIANS for partitioning Further SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS

Hierarchical clustering produces hierarchical clusters that are displayed in a tree Initially each object (case or variable) is considered a separate cluster SYSTAT begins by joining the two ldquoclosestrdquo objects as a cluster and continues (in a stepwise manner) joining an object with another object an object with a cluster or a cluster with another cluster until all objects are combined into one cluster

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.
Columns. Columns (variables) of the data matrix are clustered.
Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, to define how distances between clusters are measured):

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; its range is between -1 and 1.

K-nbd. The kth-nearest-neighbor method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimate; finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the dataset.

Median. Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimate; finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible." Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.
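The linkage choices above map directly onto general-purpose tools. Below is a minimal sketch in Python using SciPy (an illustration, not SYSTAT itself; the toy data and the choice of average linkage with Euclidean distance are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 cases, 3 variables (toy data)

# Amalgamate clusters stepwise; 'single', 'complete', 'average',
# 'centroid', 'median', and 'ward' mirror linkage options described above.
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree to a fixed number of leaf clusters (cf. "Leaf nodes" below).
labels = fcluster(Z, t=4, criterion="maxclust")
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) draws the tree when a plotting
# backend such as matplotlib is available.
```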

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.
Polar. Produces a polar (circular) cluster tree.

Save. Provides two options: save cluster identifiers only, or save cluster identifiers along with the data. You can specify the number of clusters to identify for the saved file; if not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix used to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file; otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.
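To make the distance itself concrete, here is a hedged sketch in Python with made-up data (not the SYSTAT dialog): the Mahalanobis distance between two cases uses the inverse of a covariance matrix, here estimated from the data as the dialog does by default.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))      # cases x variables (toy data)

S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance from the data

def mahalanobis(x, y, S_inv):
    d = x - y
    return float(np.sqrt(d @ S_inv @ d))

print(mahalanobis(X[0], X[1], S_inv))
```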

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Provides the option of cutting the cluster tree at a specified distance.
Leaf nodes. Provides the option of cutting the cluster tree by number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by one of two methods:

Length of terminal node. As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale crosses between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate partition quality; in particular, they are used to find an appropriate number of clusters for the given data set.

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides the Davies-Bouldin index for each hierarchy of clustering. This index is applicable for rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
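Outside SYSTAT, two of these indices have widely used counterparts: the pseudo F-ratio corresponds to the Calinski-Harabasz index, and DB to the Davies-Bouldin index, both available in scikit-learn. A hedged sketch of scanning candidate numbers of clusters (the toy data and the use of scikit-learn are assumptions):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0, 4, 8)])

for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    ch = calinski_harabasz_score(X, labels)   # pseudo F: larger is better
    db = davies_bouldin_score(X, labels)      # DB: smaller is better
    print(k, round(ch, 1), round(db, 3))
```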

The K-Clustering dialog box provides options for K-Means and K-Medians clustering. Both methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. This is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting one of the clusters into two (and reassigning cases) continues until the specified number of clusters is formed, and the reassignment of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a set of nine options, described below.
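A minimal sketch of the default splitting idea in Python (an illustration of the description above, not SYSTAT's exact algorithm; the empty-cluster guard is an added assumption):

```python
import numpy as np

def split_farthest(X, k):
    """Grow from 1 to k centers: seed each new cluster with the case
    farthest from its current center, then reassign and re-center."""
    centers = [X.mean(axis=0)]
    while len(centers) < k:
        C = np.asarray(centers)
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        far = d[np.arange(len(X)), nearest].argmax()   # farthest case
        centers.append(X[far].copy())
        C = np.asarray(centers)
        labels = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2).argmin(axis=1)
        # one reassignment pass; SYSTAT iterates until the within-groups
        # sum of squares stops improving
        centers = [X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                   for j in range(len(C))]
    return np.asarray(centers)

rng = np.random.default_rng(3)
print(split_farthest(rng.normal(size=(50, 2)), 3))
```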

Algorithm. Provides K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.
K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Provides three options: save cluster identifiers, cluster identifiers along with the data, or final cluster seeds, to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster, and then assigning each case optimally. It continues splitting and reassigning cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses k non-missing cases at random (without replacement) as initial seeds.

Random segmentation. Assigns each case to one of k partitions at random, then computes seeds from each initial partition by taking the mean or the median of the observations, whichever is applicable.


Saving New Variables

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > Hierarchical Cluster

In the Hierarchical Cluster Analysis dialog box, click Save.

K-Means Cluster Analysis

This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires you to specify the number of clusters. You can specify initial cluster centers if you know this information. You can select one of two methods for classifying cases: either updating cluster centers iteratively or classifying only. You can save cluster membership, distance information, and final cluster centers. Optionally, you can specify a variable whose values are used to label casewise output. You can also request analysis-of-variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.

Example. What are some identifiable groups of television shows that attract similar audiences within each group? With k-means cluster analysis, you could cluster television shows (cases) into k homogeneous groups based on viewer characteristics. This process can be used to identify segments for marketing. Or you can cluster cities (cases) into homogeneous groups so that comparable cities can be selected to test various marketing strategies.

Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster information, distance from cluster center.

K-Means Cluster Analysis Efficiency

The k-means cluster analysis command is efficient primarily because it does not compute the distances between all pairs of cases, as do many clustering algorithms, including the algorithm that is used by the hierarchical clustering command.

For maximum efficiency, take a sample of cases and select the Iterate and classify method to determine cluster centers. Select Write final as. Then restore the entire data file, select Classify only as the method, and select Read initial from to classify the entire file using the centers that are estimated from the sample. You can write to and read from a file or a dataset. Datasets are available for subsequent use in the same session but are not saved as files unless explicitly saved prior to the end of the session. Dataset names must conform to variable-naming rules. See the topic Variable names for more information.
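The same sample-then-classify workflow can be sketched with scikit-learn (an illustrative analogue, not the SPSS implementation; the sizes and variable counts are made up): estimate centers by iterating on a sample, then assign every case in the full file to the nearest of those centers.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
full = rng.normal(size=(100_000, 5))      # stands in for the entire data file
sample = full[rng.choice(len(full), 2_000, replace=False)]

# "Iterate and classify" on the sample to estimate cluster centers
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(sample)

# "Classify only" on the full file using the saved centers
labels = km.predict(full)
print(np.bincount(labels))
```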

K-Means Cluster Analysis Iterate

Note: These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box.

Maximum Iterations. Limits the number of iterations in the k-means algorithm. Iteration stops after this many iterations even if the convergence criterion is not satisfied. This number must be between 1 and 999.

To reproduce the algorithm used by the Quick Cluster command prior to version 5.0, set Maximum Iterations to 1.

Convergence Criterion. Determines when iteration ceases. It represents a proportion of the minimum distance between initial cluster centers, so it must be greater than 0 but not greater than 1. If the criterion equals 0.02, for example, iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2% of the smallest distance between any initial cluster centers.

Use running means. Allows you to request that cluster centers be updated after each case is assigned. If you do not select this option, new cluster centers are calculated after all cases have been assigned.
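Both options can be sketched in a few lines of Python (a hedged illustration of the definitions above, not SPSS's code): the stopping rule compares the largest center movement in a pass against criterion × (smallest distance between initial centers), and running means update a center immediately after each assignment.

```python
import numpy as np

def kmeans(X, centers, crit=0.02, max_iter=20, running=False):
    centers = centers.astype(float).copy()
    gaps = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    min_init = gaps[np.triu_indices(len(centers), k=1)].min()
    counts = np.ones(len(centers))
    for _ in range(max_iter):
        old = centers.copy()
        if running:                       # update after each case is assigned
            for x in X:
                j = np.linalg.norm(centers - x, axis=1).argmin()
                counts[j] += 1
                centers[j] += (x - centers[j]) / counts[j]
        else:                             # recompute after all assignments
            labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
            for j in range(len(centers)):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        # stop when no center moved by more than crit * smallest initial gap
        if np.linalg.norm(centers - old, axis=1).max() <= crit * min_init:
            break
    return centers

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
print(kmeans(X, X[:3]))
```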

K-Means Cluster Analysis Save

You can save information about the solution as new variables to be used in subsequent analyses:

Cluster membership. Creates a new variable indicating the final cluster membership of each case. Values of the new variable range from 1 to the number of clusters.

Distance from cluster center. Creates a new variable indicating the Euclidean distance between each case and its classification center.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Save.

K-Means Cluster Analysis Options

Statistics. You can select the following statistics: initial cluster centers, ANOVA table, and cluster information for each case.

• Initial cluster centers. The first estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table, which includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case. Also displays the Euclidean distance between final cluster centers.

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

Distances

This procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances), either between pairs of variables or between pairs of cases. These similarity or distance measures can then be used with other procedures, such as factor analysis, cluster analysis, or multidimensional scaling, to help analyze complex data sets.

Example. Is it possible to measure similarities between pairs of automobiles based on certain characteristics, such as engine size, MPG, and horsepower? By computing similarities between autos, you can gain a sense of which autos are similar to each other and which are different from each other. For a more formal analysis, you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure.

Statistics. Dissimilarity (distance) measures for interval data are Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square; for binary data, Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. Similarity measures for interval data are Pearson correlation or cosine; for binary data, Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion.

To Obtain Distance Matrices

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

Select at least one numeric variable to compute distances between cases, or select at least two numeric variables to compute distances between variables.

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables.
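SciPy offers an analogous computation (an illustrative stand-in for this procedure, with toy data): pdist computes distances between cases from the data matrix, and transposing the matrix yields distances between variables.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(6)
X = rng.normal(size=(10, 3))        # 10 cases, 3 numeric variables

between_cases = squareform(pdist(X, metric="euclidean"))      # 10 x 10
between_vars = squareform(pdist(X.T, metric="euclidean"))     # 3 x 3
print(between_cases.shape, between_vars.shape)
```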

Distances Dissimilarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval, count, or binary); then, from the drop-down list, select one of the measures appropriate to that type of data. Available measures, by data type, are:

• Interval data. Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized.

• Count data. Chi-square measure or phi-square measure.

• Binary data. Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, or standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0-1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
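In symbols, Minkowski distance is (Σ|x_i - y_i|^p)^(1/p), and the customized measure generalizes it to (Σ|x_i - y_i|^p)^(1/r). A small sketch (the function name is an assumption):

```python
import numpy as np

def customized(x, y, p=2.0, r=2.0):
    # (sum of |x_i - y_i|**p) ** (1/r); Minkowski is the case r == p
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / r))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
print(customized(x, y, p=2, r=2))   # Euclidean: sqrt(1 + 4) ~ 2.236
print(customized(x, y, p=1, r=1))   # block (Manhattan): 3.0
```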

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
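A hedged sketch of both measures (following the definitions above, not SPSS's code): stack the two frequency vectors into a 2 × n table, compute the usual chi-square statistic against expected counts, and take the square root; the phi-square measure then divides by the square root of the combined frequency.

```python
import numpy as np

def chisq_measure(x, y):
    table = np.vstack([x, y]).astype(float)   # 2 x n contingency table
    total = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / total
    return float(np.sqrt(((table - expected) ** 2 / expected).sum()))

def phisq_measure(x, y):
    return chisq_measure(x, y) / np.sqrt(float(np.sum(x) + np.sum(y)))

a = np.array([10, 20, 30])
b = np.array([15, 15, 30])
print(chisq_measure(a, b), phisq_measure(a, b))
```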

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. A dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n^2), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/4n, where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
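Every binary measure above is a function of the fourfold table: a = present on both items, b and c = present on one but absent on the other, d = absent on both. A hedged sketch of a few of the formulas (function names are assumptions):

```python
import numpy as np

def fourfold(u, v):
    u, v = np.asarray(u, bool), np.asarray(v, bool)
    return (int(np.sum(u & v)), int(np.sum(u & ~v)),
            int(np.sum(~u & v)), int(np.sum(~u & ~v)))

def binary_dissimilarities(u, v):
    a, b, c, d = fourfold(u, v)
    n = a + b + c + d
    return {
        "euclidean": float(np.sqrt(b + c)),
        "squared_euclidean": b + c,
        "pattern_difference": b * c / n**2,
        "variance": (b + c) / (4 * n),
        "lance_williams": (b + c) / (2 * a + b + c),
    }

print(binary_dissimilarities([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```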

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures appropriate to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores, range -1 to 1, range 0 to 1, maximum magnitude of 1, mean of 1, and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. They are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0-1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure that corresponds to that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.
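Both are a couple of lines in Python (a sketch, not the procedure's code):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

pearson = np.corrcoef(x, y)[0, 1]                         # product-moment correlation
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # angle between vectors
print(pearson, cosine)
```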

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.

• Dispersion. This index has a range of -1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
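Several of these indices, written out from the same fourfold table (a, b, c, d) used for the binary dissimilarity measures; a hedged sketch, with the caveat that the degenerate denominators discussed above are not handled:

```python
import numpy as np

def fourfold(u, v):
    u, v = np.asarray(u, bool), np.asarray(v, bool)
    return (int(np.sum(u & v)), int(np.sum(u & ~v)),
            int(np.sum(~u & v)), int(np.sum(~u & ~v)))

def binary_similarities(u, v):
    a, b, c, d = fourfold(u, v)
    n = a + b + c + d
    return {
        "russell_rao": a / n,
        "simple_matching": (a + d) / n,
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "hamann": (a + d - b - c) / n,
        "yule_q": (a * d - b * c) / (a * d + b * c),
        "ochiai": a / np.sqrt((a + b) * (a + c)),
    }

print(binary_similarities([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```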

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. The alternatives are By variable or By case.
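A hedged sketch of the six transformations for a single vector, standardizing by variable (by case would apply the same operations to rows instead; the short method keys are assumptions):

```python
import numpy as np

def transform(x, method):
    x = np.asarray(x, dtype=float)
    if method == "z":
        return (x - x.mean()) / x.std(ddof=1)     # mean 0, sd 1
    if method == "range_-1_1":
        return x / (x.max() - x.min())            # divide by the range
    if method == "range_0_1":
        return (x - x.min()) / (x.max() - x.min())
    if method == "max_1":
        return x / x.max()                        # divide by the maximum
    if method == "mean_1":
        return x / x.mean()
    if method == "sd_1":
        return x / x.std(ddof=1)
    raise ValueError(method)

v = np.array([2.0, 4.0, 6.0, 8.0])
for m in ("z", "range_-1_1", "range_0_1", "max_1", "mean_1", "sd_1"):
    print(m, transform(v, m))
```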

CLUSTER provides three procedures for clustering Hierarchical Clustering K-Clustering and Additive Trees The Hierarchical Clustering procedure comprises hierarchical linkage methods The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering

Hierarchical Clustering clusters cases variables individually or both cases and variables simultaneously K-Clustering clusters cases only and Additive Trees clusters a similarity or dissimilarity matrix Several distance metrics are available with Hierarchical Clustering and K-Clustering including metrics for binary quantitative and frequency count data Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram When the MATRIX option is used to cluster cases and variables SYSTAT uses a gray-scale or color spectrum to represent the values Resampling procedures are available only in Hierarchical Clustering

SYSTAT further provides five indices viz statistical criteria by which an appropriate number of clusters can be chosen from the Hierarchical Tree Options for cutting (or pruning) and coloring the hierarchical tree are also provided

In the K-Clustering procedure SYSTAT offers two algorithms KMEANS and KMEDIANS for partitioning Further SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS

Hierarchical clustering produces hierarchical clusters that are displayed in a tree Initially each object (case or variable) is considered a separate cluster SYSTAT begins by joining the two ldquoclosestrdquo objects as a cluster and continues (in a stepwise manner) joining an object with another object an object with a cluster or a cluster with another cluster until all objects are combined into one cluster

You must select the elements of the data file to cluster (Join)

Rows Rows (cases) of the data matrix are clustered Columns Columns (variables) of the data matrix are clustered Matrix Rows and columns of the data matrix are clusteredmdashthey are permuted to bring

similar rows and columns next to one another

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is define how distances between clusters are measured)

Average Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are

Centroid Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters

Complete Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances This method tends to produce compact globular clusters If you use a similarity or dissimilarity matrix from a SYSTAT file you get Johnsonrsquos ldquomaxrdquo method

Flexibeta Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are You can choose the value of the weight β The range of β is between ndash1 and 1

K-nbdKth nearest neighborhood method is a density linkage method The estimated density is proportional to the number of cases in the smallest sphere containing the K th

nearest neighbor A new dissimilarity matrix is then constructed using the density estimate Finally the single linkage cluster analysis is performed You can specify the number k its range is between 1 and the total number of cases in the dataset

Median Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are

Single Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters This method tends to produce long stringy clusters If you use a SYSTAT file that contains a similarity or dissimilarity matrix you get clustering via Johnsonrsquos ldquominrdquo method

Uniform Uniform Kernel method is a density linkage method The estimated density is proportional to the number of cases in a sphere of radius r A new dissimilarity matrix is then constructed using the density estimate Finally single linkage cluster analysis is performed You can choose the number r its range is the positive real line

Ward Wardrsquos method averages all distances between pairs of objects in different clusters with adjustments for covariances to decide how far apart the clusters are

Weighted Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are The weights used are proportional to the size of the cluster

For some data some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances In these cases you may see stray branches that do not connect to others If this happens you should consider Single or Complete linkage For more information on these problems see Fisher and Van Ness (1971)

These reviewers concluded that these and other problems made Centroid Average Median and Ward (as well as K-Means) ldquoinadmissiblerdquo clustering procedures In practice and in Monte Carlo simulations however they sometimes perform better than Single and Complete linkage which Fisher and Van Ness considered ldquoadmissiblerdquo Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms Consult his paper for further details

In addition the following options can be specified

Distance Specifies the distance metric used to compare clusters Polar Produces a polar (circular) cluster tree

Save Save provides two options either to save cluster identifiers or to save cluster identifiers along with data You can specify the number of clusters to identify for the saved file If not specified two clusters are identified

In the Mahalanobis Tab you can specify the covariance matrix to compute Mahalanobis distance

Covariance matrix Specify the covariance matrix to compute the Mahalanobis distance Enter the covariance matrix either through the keyboard or from a SYSTAT file Otherwise by default SYSTAT computes the matrix from the data Select a grouping variable for inter-group distance measures

Cut cluster tree at You can choose the following options for cutting the cluster tree

Height Provides the option of cutting the cluster tree at a specified distance Leaf nodes Provides the option of cutting the cluster tree by number of leaf nodes

Color clusters by The colors in the cluster tree can be assigned by two different methods

Length of terminal node As you pass from node to node in order down the cluster tree the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1)

Proportion of total nodes Colors are assigned based on the proportion of members in a cluster

ValidityProvides five validity indices to evaluate the partition quality In particular it is used to find out the appropriate number of clusters for the given data set

RMSSTD Provides root-mean-square standard deviation of the clusters at each step in hierarchical clustering

Pseudo F Provides pseudo F-ratio for the clusters at each step in hierarchical clustering Pseudo T-square Provides pseudo T-square statistic for cluster assessment DB Provides Davies-Bouldinrsquos index for each hierarchy of clustering This index is

applicable for rectangular data only Dunn Provides Dunnrsquos cluster separation measure Maximum groups Performs the computation of indices up to this specified number of

clusters The default value is square root of number of objects

K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering Both clustering methods splits a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group

By default the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed The reassigning of cases continue until the within-groups sum of

squares can no longer be reduced The initial seeds or partitions can be chosen from a possible set of nine options

Algorithm Provides K-Means and K-Medians clustering options

K-means Requests K-Means clustering K-medians Requests K-Medians clustering

Groups Enter the number of desired clusters Default number (Groups) is two

Iterations Enter the maximum number of iterations If not stated the maximum is 20

Distance Specifies the distance metric used to compare clusters

Save Save provides three options to save either cluster identifiers cluster identifiers along with data or final cluster seeds to a SYSTAT file

Click More above for descriptions of the distance metrics

None Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally It continues splitting and reassigning the cases until k clusters are formed

First kConsiders the first k non-missing cases as initial seeds Last kConsiders the last k non-missing cases as initial seeds Random k Chooses randomly (without replacement) k non-missing cases as initial

seeds Random segmentation Assigns each case to any of k partitions randomly Computes

seeds from each initial partition taking the mean or the median of the observations whichever is applicable

  • Two Step Cluster Analysis
    • To Obtain a TwoStep Cluster Analysis
      • TwoStep Cluster Analysis Options
        • Advanced Options
          • TwoStep Cluster Analysis Output
          • The Cluster Viewer
          • Transpose Clusters and Features
          • Sort Features
          • Sort Clusters
          • Cell Contents
          • Cluster Comparison View
          • Navigating the Cluster Viewer
            • Using the Toolbars
            • Control Cluster View Display
              • Filtering Records
              • Hierarchical Cluster Analysis
              • Hierarchical Cluster Analysis Method
              • Hierarchical Cluster Analysis Measures for Interval Data
              • Hierarchical Cluster Analysis Measures for Count Data
              • Hierarchical Cluster Analysis Measures for Binary Data
              • Hierarchical Cluster Analysis Transform Values
              • Hierarchical Cluster Analysis Statistics
              • Hierarchical Cluster Analysis Plots
              • Hierarchical Cluster Analysis Save New Variables
              • Saving New Variables
              • K-Means Cluster Analysis
              • K-Means Cluster Analysis Efficiency
              • K-Means Cluster Analysis Iterate
              • K-Means Cluster Analysis Save
              • K-Means Cluster Analysis Options
              • Distances
                • To Obtain Distance Matrices
                  • Distances Dissimilarity Measures
                  • Distances Dissimilarity Measures for Interval Data
                  • Distances Dissimilarity Measures for Count Data
                  • Distances Dissimilarity Measures for Binary Data
                  • Distances Similarity Measures
                  • Distances Similarity Measures for Interval Data
                  • Distances Similarity Measures for Binary Data
                  • Distances Transform Values
Page 24: TwoStep Cluster Analysis

Note These options are available only if you select the Iterate and classify method from the K-Means Cluster Analysis dialog box

Maximum Iterations Limits the number of iterations in the k-means algorithm Iteration stops after this many iterations even if the convergence criterion is not satisfied This number must be between 1 and 999

To reproduce the algorithm used by the Quick Cluster command prior to version 50 set Maximum Iterations to 1

Convergence Criterion Determines when iteration ceases It represents a proportion of the minimum distance between initial cluster centers so it must be greater than 0 but not greater than 1 If the criterion equals 002 for example iteration ceases when a complete iteration does not move any of the cluster centers by a distance of more than 2 of the smallest distance between any initial cluster centers

Use running means Allows you to request that cluster centers be updated after each case is assigned If you do not select this option new cluster centers are calculated after all cases have been assigned

K-Means Cluster Analysis SaveYou can save information about the solution as new variables to be used in subsequent analyses

Cluster membership Creates a new variable indicating the final cluster membership of each case Values of the new variable range from 1 to the number of clusters

Distance from cluster center Creates a new variable indicating the Euclidean distance between each case and its classification center

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt K-Means Cluster

In the K-Means Cluster dialog box click Save

K-Means Cluster Analysis OptionsStatistics You can select the following statistics initial cluster centers ANOVA table and cluster information for each case

bull Initial cluster centers First estimate of the variable means for each of the clusters By default a number of well-spaced cases equal to the number of clusters is selected from the data Initial cluster centers are used for a first round of classification and are then updated

bull ANOVA table Displays an analysis-of-variance table which includes univariate F tests for each clustering variable The F tests are only descriptive and the resulting probabilities should not be interpreted The ANOVA table is not displayed if all cases are assigned to a single cluster

bull Cluster information for each case Displays for each case the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify the case Also displays Euclidean distance between final cluster centers

Missing Values Available options are Exclude cases listwise or Exclude cases pairwise

bull Exclude cases listwise Excludes cases with missing values for any clustering variable from the analysis

bull Exclude cases pairwise Assigns cases to clusters based on distances that are computed from all variables with nonmissing values

This feature requires the Statistics Base option

From the menus choose

Analyze gt Classify gt K-Means Cluster

In the K-Means Cluster dialog box click Options

DistancesThis procedure calculates any of a wide variety of statistics measuring either similarities or dissimilarities (distances) either between pairs of variables or between pairs of cases These similarity or distance measures can then be used with other procedures such as factor analysis cluster analysis or multidimensional scaling to help analyze complex datasets

Example Is it possible to measure similarities between pairs of automobiles based on certain characteristics such as engine size MPG and horsepower By computing similarities between

autos you can gain a sense of which autos are similar to each other and which are different from each other For a more formal analysis you might consider applying a hierarchical cluster analysis or multidimensional scaling to the similarities to explore the underlying structure

Statistics Dissimilarity (distance) measures for interval data are Euclidean distance squared Euclidean distance Chebychev block Minkowski or customized for count data chi-square or phi-square for binary data Euclidean distance squared Euclidean distance size difference pattern difference variance shape or Lance and Williams Similarity measures for interval data are Pearson correlation or cosine for binary data Russel and Rao simple matching Jaccard dice Rogers and Tanimoto Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Kulczynski 1 Kulczynski 2 Sokal and Sneath 4 Hamann Lambda Anderbergs D Yules Y Yules Q Ochiai Sokal and Sneath 5 phi 4-point correlation or dispersion

To Obtain Distance Matrices

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

Select at least one numeric variable to compute distances between cases or select at least two numeric variables to compute distances between variables

Select an alternative in the Compute Distances group to calculate proximities either between cases or between variables

Distances Dissimilarity MeasuresFrom the Measure group select the alternative that corresponds to your type of data (interval count or binary) then from the drop-down list select one of the measures that corresponds to that type of data Available measures by data type are

bull Interval data Euclidean distance squared Euclidean distance Chebychev block Minkowski or customized

bull Count data Chi-square measure or phi-square measure

bull Binary data Euclidean distance squared Euclidean distance size difference pattern difference variance shape or Lance and Williams (Enter values for Present and Absent to specify which two values are meaningful Distances will ignore all other values)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities These transformations are not applicable to binary data Available

standardization methods are z scores range ndash1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 or standard deviation of 1

The Transform Measures group allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available options are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

With Dissimilarities selected click Measures

From the Measure group select the alternative that corresponds to your type of data

From the drop-down list select a measure that corresponds to that type of measure

Distances Dissimilarity Measures for Interval DataThe following dissimilarity measures are available for interval data

bull Euclidean distance The square root of the sum of the squared differences between values for the items This is the default for interval data

bull Squared Euclidean distance The sum of the squared differences between the values for the items

bull Chebychev The maximum absolute difference between the values for the items

bull Block The sum of the absolute differences between the values of the item Also known as Manhattan distance

bull Minkowski The pth root of the sum of the absolute differences to the pth power between the values for the items

bull Customized The rth root of the sum of the absolute differences to the pth power between the values for the items

Distances Dissimilarity Measures for Count DataThe following dissimilarity measures are available for count data

bull Chi-square measure This measure is based on the chi-square test of equality for two sets of frequencies This is the default for count data

bull Phi-square measure This measure is equal to the chi-square measure normalized by the square root of the combined frequency

Distances Dissimilarity Measures for Binary DataThe following dissimilarity measures are available for binary data

bull Euclidean distance Computed from a fourfold table as SQRT(b+c) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other

bull Squared Euclidean distance Computed as the number of discordant cases Its minimum value is 0 and it has no upper limit

bull Size difference An index of asymmetry It ranges from 0 to 1

bull Pattern difference Dissimilarity measure for binary data that ranges from 0 to 1 Computed from a fourfold table as bc(n2) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations

bull Variance Computed from a fourfold table as (b+c)4n where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations It ranges from 0 to 1

bull Shape This distance measure has a range of 0 to 1 and it penalizes asymmetry of mismatches

bull Lance and Williams Computed from a fourfold table as (b+c)(2a+b+c) where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other This measure has a range of 0 to 1 (Also known as the Bray-Curtis nonmetric coefficient)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Distances Similarity MeasuresFrom the Measure group select the alternative that corresponds to your type of data (interval or binary) then from the drop-down list select one of the measures that corresponds to that type of data Available measures by data type are

bull Interval data Pearson correlation or cosine

bull Binary data Russell and Rao simple matching Jaccard Dice Rogers and Tanimoto Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Kulczynski 1 Kulczynski 2 Sokal and Sneath 4 Hamann Lambda Anderbergs D Yules Y Yules Q Ochiai Sokal and Sneath 5 phi 4-point correlation or dispersion (Enter values for Present and Absent to specify which two values are meaningful Distances will ignore all other values)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities These transformations are not applicable to binary data Available standardization methods are z scores range ndash1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 and standard deviation of 1

The Transform Measures group allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available options are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

With Similarities selected click Measures

From the Measure group select the alternative that corresponds to your type of data

From the drop-down list select a measure that corresponds to that type of measure

Distances Similarity Measures for Interval DataThe following similarity measures are available for interval data

bull Pearson correlation The product-moment correlation between two vectors of values This is the default similarity measure for interval data

bull Cosine The cosine of the angle between two vectors of values

Distances Similarity Measures for Binary DataThe following similarity measures are available for binary data

bull Russel and Rao This is a binary version of the inner (dot) product Equal weight is given to matches and nonmatches This is the default for binary similarity data

bull Simple matching This is the ratio of matches to the total number of values Equal weight is given to matches and nonmatches

bull Jaccard This is an index in which joint absences are excluded from consideration Equal weight is given to matches and nonmatches Also known as the similarity ratio

bull Dice This is an index in which joint absences are excluded from consideration and matches are weighted double Also known as the Czekanowski or Sorensen measure

bull Rogers and Tanimoto This is an index in which double weight is given to nonmatches

bull Sokal and Sneath 1 This is an index in which double weight is given to matches

bull Sokal and Sneath 2 This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration

bull Sokal and Sneath 3 This is the ratio of matches to nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however Distances assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Kulczynski 1 This is the ratio of joint presences to all nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however Distances assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Kulczynski 2 This index is based on the conditional probability that the characteristic is present in one item given that it is present in the other The separate values for each item acting as predictor of the other are averaged to compute this value

bull Sokal and Sneath 4 This index is based on the conditional probability that the characteristic in one item matches the value in the other The separate values for each item acting as predictor of the other are averaged to compute this value

bull Hamann This index is the number of matches minus the number of nonmatches divided by the total number of items It ranges from -1 to 1

bull Lambda This index is Goodman and Kruskals lambda Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Anderbergs D Similar to lambda this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Yules Y This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals It has a range of -1 to 1 Also known as the coefficient of colligation

bull Yules Q This index is a special case of Goodman and Kruskals gamma It is a function of the cross-ratio and is independent of the marginal totals It has a range of -1 to 1

bull Ochiai This index is the binary form of the cosine similarity measure It has a range of 0 to 1

bull Sokal and Sneath 5 This index is the squared geometric mean of conditional probabilities of positive and negative matches It is independent of item coding It has a range of 0 to 1

bull Phi 4-point correlation This index is a binary analogue of the Pearson correlation coefficient It has a range of -1 to 1

bull Dispersion This index has a range of -1 to 1

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Distances Transform ValuesThe following alternatives are available for transforming values

bull Z scores Values are standardized to z scores with a mean of 0 and a standard deviation of 1

bull Range -1 to 1 Each value for the item being standardized is divided by the range of the values

bull Range 0 to 1 The procedure subtracts the minimum value from each item being standardized and then divides by the range

bull Maximum magnitude of 1 The procedure divides each value for the item being standardized by the maximum of the values

bull Mean of 1 The procedure divides each value for the item being standardized by the mean of the values

bull Standard deviation of 1 The procedure divides each value for the variable or case being standardized by the standard deviation of the values

Additionally you can choose how standardization is done Alternatives are By variable or By case

CLUSTER provides three procedures for clustering Hierarchical Clustering K-Clustering and Additive Trees The Hierarchical Clustering procedure comprises hierarchical linkage methods The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering

Hierarchical Clustering clusters cases variables individually or both cases and variables simultaneously K-Clustering clusters cases only and Additive Trees clusters a similarity or dissimilarity matrix Several distance metrics are available with Hierarchical Clustering and K-Clustering including metrics for binary quantitative and frequency count data Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram When the MATRIX option is used to cluster cases and variables SYSTAT uses a gray-scale or color spectrum to represent the values Resampling procedures are available only in Hierarchical Clustering

SYSTAT further provides five validity indices, that is, statistical criteria by which an appropriate number of clusters can be chosen from the hierarchical tree. Options for cutting (or pruning) and coloring the hierarchical tree are also provided.

In the K-Clustering procedure, SYSTAT offers two partitioning algorithms, KMEANS and KMEDIANS. Further, SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS.

Hierarchical clustering produces hierarchical clusters that are displayed in a tree. Initially, each object (case or variable) is considered a separate cluster. SYSTAT begins by joining the two "closest" objects as a cluster and continues (in a stepwise manner) joining an object with another object, an object with a cluster, or a cluster with another cluster until all objects are combined into one cluster.
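To make the stepwise joining concrete, here is a small sketch using SciPy's agglomerative routines as a stand-in for SYSTAT; each row of the linkage matrix records one merge, from single objects up to one all-inclusive cluster:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 2))      # ten cases measured on two variables

Z = linkage(X, method="average")  # the stepwise amalgamation sequence
for left, right, dist, size in Z: # each row: two clusters joined, at what distance
    print(f"join {int(left)} + {int(right)} at distance {dist:.2f} "
          f"-> cluster of size {int(size)}")

dendrogram(Z)                     # display the result as a tree
plt.show()
```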

You must select the elements of the data file to cluster (Join):

Rows. Rows (cases) of the data matrix are clustered.
Columns. Columns (variables) of the data matrix are clustered.
Matrix. Rows and columns of the data matrix are clustered; they are permuted to bring similar rows and columns next to one another.

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, to define how distances between clusters are measured):

Average. Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid. Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete. Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta. Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; its range is between -1 and 1.

K-nbd. The kth-nearest-neighborhood method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimates, and single linkage cluster analysis is performed. You can specify the number k; its range is from 1 to the total number of cases in the dataset.

Median. Median linkage uses the median of the distances between pairs of objects in different clusters to decide how far apart they are.

Single. Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform. The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimates, and single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward. Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted. Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are; the weights used are proportional to the cluster sizes.

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible." Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.
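The monotonicity problem is easy to demonstrate with SciPy's analogous linkage methods (a sketch, not SYSTAT itself): centroid and median linkage can produce merge distances that decrease at some step, which is what creates inversions (stray, non-connecting branches), while single and complete linkage never do:

```python
import numpy as np
from scipy.cluster.hierarchy import is_monotonic, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))      # 30 cases, 4 variables

for method in ["single", "complete", "average", "weighted",
               "centroid", "median", "ward"]:
    Z = linkage(X, method=method)             # hierarchical amalgamation
    print(f"{method:9s} non-decreasing merge distances: {is_monotonic(Z)}")
```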

In addition, the following options can be specified:

Distance. Specifies the distance metric used to compare clusters.
Polar. Produces a polar (circular) cluster tree.

Save. Provides two options: save cluster identifiers only, or save cluster identifiers along with the data. You can specify the number of clusters to identify for the saved file; if not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix. Specify the covariance matrix used to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file; otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.

Cut cluster tree at. You can choose the following options for cutting the cluster tree:

Height. Cuts the cluster tree at a specified distance.
Leaf nodes. Cuts the cluster tree at a specified number of leaf nodes.

Color clusters by. The colors in the cluster tree can be assigned by two different methods:

Length of terminal node. As you pass from node to node down the cluster tree, the color changes when the length of a node on the distance scale crosses the specified terminal-node length (on a scale of 0 to 1).

Proportion of total nodes. Colors are assigned based on the proportion of members in a cluster.

Validity. Provides five validity indices to evaluate partition quality. In particular, they can be used to find an appropriate number of clusters for the given data set:

RMSSTD. Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.

Pseudo F. Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.

Pseudo T-square. Provides the pseudo T-square statistic for cluster assessment.

DB. Provides the Davies-Bouldin index for each hierarchy of clustering. This index is applicable to rectangular data only.

Dunn. Provides Dunn's cluster separation measure.

Maximum groups. Computes the indices up to this specified number of clusters. The default value is the square root of the number of objects.
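As an illustration of using such indices to choose the number of clusters, the sketch below leans on scikit-learn: the Calinski-Harabasz statistic is the usual pseudo F-ratio (higher is better), and lower Davies-Bouldin values indicate better-separated clusters. This parallels the SYSTAT output rather than reproducing it:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, size=(40, 2)) for m in (0, 4, 8)])  # 3 true groups

Z = linkage(X, method="ward")
for k in range(2, 7):                                # candidate numbers of clusters
    labels = fcluster(Z, t=k, criterion="maxclust")  # cut the tree into k groups
    pseudo_f = calinski_harabasz_score(X, labels)    # pseudo F-ratio: higher is better
    db = davies_bouldin_score(X, labels)             # Davies-Bouldin: lower is better
    print(f"k={k}  pseudo-F={pseudo_f:9.1f}  DB={db:.3f}")
```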

The K-Clustering dialog box provides options for K-Means and K-Medians clustering. Both methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. This is similar to a one-way analysis of variance in which the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. They continue splitting one of the clusters into two (and reassigning cases) until the specified number of clusters is formed. The reassignment of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a set of nine options, described below.

Algorithm. Provides the K-Means and K-Medians clustering options:

K-means. Requests K-Means clustering.
K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Provides three options: save cluster identifiers, cluster identifiers along with the data, or final cluster seeds to a SYSTAT file.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster, and then assigning each case optimally. It continues splitting and reassigning cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.

Last k. Considers the last k non-missing cases as initial seeds.

Random k. Chooses k non-missing cases at random (without replacement) as initial seeds.

Random segmentation. Assigns each case to one of k partitions at random, then computes seeds from each initial partition by taking the mean or the median of its observations, whichever is applicable.
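A compact sketch of four of the seeding strategies just described, paired with a plain Lloyd-style loop; the helper names (first_k, random_segmentation, and so on) are hypothetical, not SYSTAT's, and no guard against empty clusters is included:

```python
import numpy as np

def first_k(X, k, rng=None):
    return X[:k].copy()                      # first k non-missing cases as seeds

def last_k(X, k, rng=None):
    return X[-k:].copy()                     # last k non-missing cases as seeds

def random_k(X, k, rng):
    idx = rng.choice(len(X), size=k, replace=False)  # sampled without replacement
    return X[idx].copy()

def random_segmentation(X, k, rng, center=np.mean):
    parts = rng.integers(k, size=len(X))     # assign each case to a random partition
    # use center=np.median for K-Medians-style seeds
    return np.array([center(X[parts == j], axis=0) for j in range(k)])

def k_cluster(X, k, seed_fn, iters=20):
    rng = np.random.default_rng(0)
    centers = seed_fn(X, k, rng)
    for _ in range(iters):                   # reassign until stable or iters reached
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

X = np.random.default_rng(2).normal(size=(100, 3))
labels, centers = k_cluster(X, 3, random_k)
print(np.bincount(labels))                   # cluster sizes
```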

• Initial cluster centers. First estimate of the variable means for each of the clusters. By default, a number of well-spaced cases equal to the number of clusters is selected from the data. Initial cluster centers are used for a first round of classification and are then updated.

• ANOVA table. Displays an analysis-of-variance table that includes univariate F tests for each clustering variable. The F tests are only descriptive, and the resulting probabilities should not be interpreted. The ANOVA table is not displayed if all cases are assigned to a single cluster.

• Cluster information for each case. Displays, for each case, the final cluster assignment and the Euclidean distance between the case and the cluster center used to classify it. Also displays the Euclidean distances between final cluster centers.
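For illustration, the spirit of these outputs can be reproduced with scikit-learn and SciPy (a sketch assuming two clusters and explicit initial centers, not the procedure's own code):

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

init = X[[0, 99]]                      # two well-spaced cases as initial centers
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

print("final cluster centers:\n", km.cluster_centers_)
dist = km.transform(X).min(axis=1)     # Euclidean distance of each case to its center
print("first five case-to-center distances:", np.round(dist[:5], 3))

# A descriptive ANOVA-style F test per clustering variable. The probabilities
# should not be interpreted: the clusters were chosen to maximize separation.
for j in range(X.shape[1]):
    groups = [X[km.labels_ == c, j] for c in range(2)]
    F, p = f_oneway(*groups)
    print(f"variable {j}: F = {F:.1f}")
```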

Missing Values. Available options are Exclude cases listwise or Exclude cases pairwise.

• Exclude cases listwise. Excludes cases with missing values for any clustering variable from the analysis.

• Exclude cases pairwise. Assigns cases to clusters based on distances that are computed from all variables with nonmissing values.
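One reasonable reading of pairwise exclusion, sketched below: compute each between-case distance from only the variables that are nonmissing in both cases, rescaled so that distances based on different numbers of variables stay comparable. The rescaling convention is an assumption made for illustration:

```python
import numpy as np

def pairwise_euclidean(u, v):
    """Euclidean distance over the variables nonmissing in both cases,
    scaled up in proportion to the variables that had to be skipped."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    ok = ~(np.isnan(u) | np.isnan(v))       # variables usable for this pair
    if not ok.any():
        return np.nan                       # the cases share no variables
    d2 = ((u[ok] - v[ok]) ** 2).sum()
    return np.sqrt(d2 * len(u) / ok.sum())

a = [1.0, 2.0, np.nan, 4.0]
b = [1.5, np.nan, 3.0, 1.0]
print(pairwise_euclidean(a, b))             # uses variables 0 and 3 only
```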

This feature requires the Statistics Base option.

From the menus, choose:

Analyze > Classify > K-Means Cluster

In the K-Means Cluster dialog box, click Options.

CLUSTER provides three procedures for clustering Hierarchical Clustering K-Clustering and Additive Trees The Hierarchical Clustering procedure comprises hierarchical linkage methods The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering

Hierarchical Clustering clusters cases variables individually or both cases and variables simultaneously K-Clustering clusters cases only and Additive Trees clusters a similarity or dissimilarity matrix Several distance metrics are available with Hierarchical Clustering and K-Clustering including metrics for binary quantitative and frequency count data Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram When the MATRIX option is used to cluster cases and variables SYSTAT uses a gray-scale or color spectrum to represent the values Resampling procedures are available only in Hierarchical Clustering

SYSTAT further provides five indices viz statistical criteria by which an appropriate number of clusters can be chosen from the Hierarchical Tree Options for cutting (or pruning) and coloring the hierarchical tree are also provided

In the K-Clustering procedure SYSTAT offers two algorithms KMEANS and KMEDIANS for partitioning Further SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS

Hierarchical clustering produces hierarchical clusters that are displayed in a tree Initially each object (case or variable) is considered a separate cluster SYSTAT begins by joining the two ldquoclosestrdquo objects as a cluster and continues (in a stepwise manner) joining an object with another object an object with a cluster or a cluster with another cluster until all objects are combined into one cluster

You must select the elements of the data file to cluster (Join)

Rows Rows (cases) of the data matrix are clustered Columns Columns (variables) of the data matrix are clustered Matrix Rows and columns of the data matrix are clusteredmdashthey are permuted to bring

similar rows and columns next to one another

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is define how distances between clusters are measured)

Average Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are

Centroid Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters

Complete Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances This method tends to produce compact globular clusters If you use a similarity or dissimilarity matrix from a SYSTAT file you get Johnsonrsquos ldquomaxrdquo method

Flexibeta Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are You can choose the value of the weight β The range of β is between ndash1 and 1

K-nbdKth nearest neighborhood method is a density linkage method The estimated density is proportional to the number of cases in the smallest sphere containing the K th

nearest neighbor A new dissimilarity matrix is then constructed using the density estimate Finally the single linkage cluster analysis is performed You can specify the number k its range is between 1 and the total number of cases in the dataset

Median Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are

Single Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters This method tends to produce long stringy clusters If you use a SYSTAT file that contains a similarity or dissimilarity matrix you get clustering via Johnsonrsquos ldquominrdquo method

Uniform Uniform Kernel method is a density linkage method The estimated density is proportional to the number of cases in a sphere of radius r A new dissimilarity matrix is then constructed using the density estimate Finally single linkage cluster analysis is performed You can choose the number r its range is the positive real line

Ward Wardrsquos method averages all distances between pairs of objects in different clusters with adjustments for covariances to decide how far apart the clusters are

Weighted Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are The weights used are proportional to the size of the cluster

For some data some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances In these cases you may see stray branches that do not connect to others If this happens you should consider Single or Complete linkage For more information on these problems see Fisher and Van Ness (1971)

These reviewers concluded that these and other problems made Centroid Average Median and Ward (as well as K-Means) ldquoinadmissiblerdquo clustering procedures In practice and in Monte Carlo simulations however they sometimes perform better than Single and Complete linkage which Fisher and Van Ness considered ldquoadmissiblerdquo Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms Consult his paper for further details

In addition the following options can be specified

Distance Specifies the distance metric used to compare clusters Polar Produces a polar (circular) cluster tree

Save Save provides two options either to save cluster identifiers or to save cluster identifiers along with data You can specify the number of clusters to identify for the saved file If not specified two clusters are identified

In the Mahalanobis Tab you can specify the covariance matrix to compute Mahalanobis distance

Covariance matrix Specify the covariance matrix to compute the Mahalanobis distance Enter the covariance matrix either through the keyboard or from a SYSTAT file Otherwise by default SYSTAT computes the matrix from the data Select a grouping variable for inter-group distance measures

Cut cluster tree at You can choose the following options for cutting the cluster tree

Height Provides the option of cutting the cluster tree at a specified distance Leaf nodes Provides the option of cutting the cluster tree by number of leaf nodes

Color clusters by The colors in the cluster tree can be assigned by two different methods

Length of terminal node As you pass from node to node in order down the cluster tree the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1)

Proportion of total nodes Colors are assigned based on the proportion of members in a cluster

ValidityProvides five validity indices to evaluate the partition quality In particular it is used to find out the appropriate number of clusters for the given data set

RMSSTD Provides root-mean-square standard deviation of the clusters at each step in hierarchical clustering

Pseudo F: Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.
Pseudo T-square: Provides the pseudo T-square statistic for cluster assessment.
DB: Provides the Davies-Bouldin index for each hierarchy of clustering. This index is applicable to rectangular data only.
Dunn: Provides Dunn's cluster separation measure.
Maximum groups: Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
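Two of these indices have widely used open-source counterparts: the pseudo F-ratio is commonly identified with the Calinski-Harabasz index, and DB with the Davies-Bouldin score, both available in scikit-learn. A sketch, assuming these definitions match SYSTAT's up to implementation details (the labels here come from k-means purely for brevity):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

for k in range(2, 6):   # scan candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          calinski_harabasz_score(X, labels),   # pseudo F (higher is better)
          davies_bouldin_score(X, labels))      # DB (lower is better)
```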

The K-Clustering dialog box provides options for K-Means clustering and K-Medians clustering. Both methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. This is similar to a one-way analysis of variance where the groups are unknown and the largest F-value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting one of the clusters into two (and reassigning cases) continues until the specified number of clusters is formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.

Algorithm: Provides K-Means and K-Medians clustering options.

K-means: Requests K-Means clustering.
K-medians: Requests K-Medians clustering.

Groups: Enter the number of desired clusters. The default number of groups is two.

Iterations: Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance: Specifies the distance metric used to compare clusters.

Save: Provides three options: save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics.

The choices of initial seeds or partitions include the following:

None: Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster, and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k: Considers the first k non-missing cases as initial seeds.
Last k: Considers the last k non-missing cases as initial seeds.
Random k: Chooses randomly (without replacement) k non-missing cases as initial seeds.
Random segmentation: Assigns each case to one of k partitions randomly, then computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.
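A minimal NumPy sketch of the default splitting strategy described above (assuming squared Euclidean distance; illustrative only, not SYSTAT's implementation):

```python
import numpy as np

def _assign(X, centers):
    # label each case by its nearest center (squared Euclidean distance)
    return np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)

def _update(X, centers, labels):
    # recompute each center as the mean of its members; keep empty clusters fixed
    return np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                     else centers[j] for j in range(len(centers))])

def kmeans_split(X, k, iterations=20):
    centers = X.mean(axis=0, keepdims=True)              # start with one cluster
    while len(centers) < k:
        labels = _assign(X, centers)
        d = ((X - centers[labels]) ** 2).sum(axis=1)
        centers = np.vstack([centers, X[np.argmax(d)]])  # farthest case seeds a new cluster
        for _ in range(iterations):                      # reassign until stable
            labels = _assign(X, centers)
            new = _update(X, centers, labels)
            if np.allclose(new, centers):
                break
            centers = new
    return _assign(X, centers), centers

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
labels, centers = kmeans_split(X, k=3)
```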

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores; range -1 to 1; range 0 to 1; maximum magnitude of 1; mean of 1; and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. These transformations are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0-1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Dissimilarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure appropriate for that type of data.

Distances Dissimilarity Measures for Interval Data

The following dissimilarity measures are available for interval data:

• Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

• Squared Euclidean distance. The sum of the squared differences between the values for the items.

• Chebychev. The maximum absolute difference between the values for the items.

• Block. The sum of the absolute differences between the values for the items. Also known as Manhattan distance.

• Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

• Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
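The interval measures above are all members of one power family. A small sketch (function name is my own) makes the relationship explicit:

```python
import numpy as np

def minkowski_custom(x, y, p=2.0, r=None):
    # Minkowski uses r == p; the "Customized" measure allows r != p
    r = p if r is None else r
    return float((np.abs(x - y) ** p).sum() ** (1.0 / r))

x = np.array([1.0, 2.0, 4.0])
y = np.array([2.0, 0.0, 4.0])
print(minkowski_custom(x, y, p=2.0))   # Euclidean distance
print(minkowski_custom(x, y, p=1.0))   # block (Manhattan) distance
print(float(np.abs(x - y).max()))      # Chebychev
print(float(((x - y) ** 2).sum()))     # squared Euclidean
```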

Distances Dissimilarity Measures for Count Data

The following dissimilarity measures are available for count data:

• Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies. This is the default for count data.

• Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
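A sketch of one plausible reading of these two definitions, treating x and y as the two rows of a 2 x k frequency table; this is my interpretation, and the exact formulas used by the Distances procedure may differ in detail:

```python
import numpy as np

def chi_square_measure(x, y):
    # expected counts under equality of the two frequency profiles,
    # from the margins of the 2 x k table whose rows are x and y
    n = x.sum() + y.sum()
    col = x + y
    ex = col * x.sum() / n
    ey = col * y.sum() / n
    return float(np.sqrt((((x - ex) ** 2) / ex).sum()
                         + (((y - ey) ** 2) / ey).sum()))

def phi_square_measure(x, y):
    # chi-square measure normalized by the square root of the combined frequency
    return chi_square_measure(x, y) / float(np.sqrt(x.sum() + y.sum()))

x = np.array([10.0, 5.0, 5.0])
y = np.array([6.0, 6.0, 8.0])
print(chi_square_measure(x, y), phi_square_measure(x, y))
```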

Distances Dissimilarity Measures for Binary Data

The following dissimilarity measures are available for binary data:

• Euclidean distance. Computed from a fourfold table as SQRT(b+c), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other.

• Squared Euclidean distance. Computed as the number of discordant cases. Its minimum value is 0, and it has no upper limit.

• Size difference. An index of asymmetry. It ranges from 0 to 1.

• Pattern difference. Dissimilarity measure for binary data that ranges from 0 to 1. Computed from a fourfold table as bc/(n^2), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations.

• Variance. Computed from a fourfold table as (b+c)/(4n), where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other, and n is the total number of observations. It ranges from 0 to 1.

• Shape. This distance measure has a range of 0 to 1, and it penalizes asymmetry of mismatches.

• Lance and Williams. Computed from a fourfold table as (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. This measure has a range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
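All of the binary measures above are computed from the same fourfold (2 x 2) table. A small sketch (function name is my own) of the table and of the measures whose formulas are given explicitly:

```python
import numpy as np

def fourfold(u, v):
    # a = both present, b/c = present on exactly one item, d = both absent
    a = int(np.sum((u == 1) & (v == 1)))
    b = int(np.sum((u == 1) & (v == 0)))
    c = int(np.sum((u == 0) & (v == 1)))
    d = int(np.sum((u == 0) & (v == 0)))
    return a, b, c, d

u = np.array([1, 1, 0, 1, 0, 0])
v = np.array([1, 0, 0, 1, 1, 0])
a, b, c, d = fourfold(u, v)
n = a + b + c + d

print(np.sqrt(b + c))             # binary Euclidean distance
print(b + c)                      # squared Euclidean: number of discordant cases
print(b * c / n ** 2)             # pattern difference
print((b + c) / (4 * n))          # variance
print((b + c) / (2 * a + b + c))  # Lance and Williams
```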

Distances Similarity Measures

From the Measure group, select the alternative that corresponds to your type of data (interval or binary); then, from the drop-down list, select one of the measures that corresponds to that type of data. Available measures, by data type, are:

• Interval data. Pearson correlation or cosine.

• Binary data. Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion. (Enter values for Present and Absent to specify which two values are meaningful; Distances will ignore all other values.)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities. These transformations are not applicable to binary data. Available standardization methods are z scores; range -1 to 1; range 0 to 1; maximum magnitude of 1; mean of 1; and standard deviation of 1.

The Transform Measures group allows you to transform the values generated by the distance measure. These transformations are applied after the distance measure has been computed. Available options are absolute values, change sign, and rescale to 0-1 range.

This feature requires the Statistics Base option.

From the menus choose:

Analyze > Correlate > Distances

With Similarities selected, click Measures.

From the Measure group, select the alternative that corresponds to your type of data.

From the drop-down list, select a measure appropriate for that type of data.

Distances Similarity Measures for Interval Data

The following similarity measures are available for interval data:

• Pearson correlation. The product-moment correlation between two vectors of values. This is the default similarity measure for interval data.

• Cosine. The cosine of the angle between two vectors of values.
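Both interval similarity measures take only a few lines of NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.9, 3.5, 4.2])

pearson = np.corrcoef(x, y)[0, 1]                         # product-moment correlation
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine of the angle
print(pearson, cosine)
```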

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.

• Dispersion. This index has a range of -1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
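For illustration, a few of the simpler measures written from the same fourfold counts a, b, c, d used in the sketch above; these are the standard textbook formulas, not taken verbatim from the Distances procedure:

```python
def russell_rao(a, b, c, d):
    return a / (a + b + c + d)

def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)

def jaccard(a, b, c, d):
    return a / (a + b + c)          # joint absences (d) excluded

def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)  # matches weighted double, d excluded

def hamann(a, b, c, d):
    return (a + d - b - c) / (a + b + c + d)

print(jaccard(3, 1, 1, 1), dice(3, 1, 1, 1), hamann(3, 1, 1, 1))
```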

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. Alternatives are By variable or By case.
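A compact sketch of the six standardizations (function and method names are my own; By variable corresponds to axis=0, By case to axis=1):

```python
import numpy as np

def standardize(X, method="z", axis=0):
    # axis=0 standardizes by variable (columns); axis=1 by case (rows)
    X = np.asarray(X, dtype=float)
    if method == "z":
        return (X - X.mean(axis, keepdims=True)) / X.std(axis, keepdims=True)
    if method == "range-1to1":
        return X / (X.max(axis, keepdims=True) - X.min(axis, keepdims=True))
    if method == "range0to1":
        lo = X.min(axis, keepdims=True)
        return (X - lo) / (X.max(axis, keepdims=True) - lo)
    if method == "maxmag1":
        return X / X.max(axis, keepdims=True)
    if method == "mean1":
        return X / X.mean(axis, keepdims=True)
    if method == "sd1":
        return X / X.std(axis, keepdims=True)
    raise ValueError(method)

X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
print(standardize(X, "range0to1"))
```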

Uniform Uniform Kernel method is a density linkage method The estimated density is proportional to the number of cases in a sphere of radius r A new dissimilarity matrix is then constructed using the density estimate Finally single linkage cluster analysis is performed You can choose the number r its range is the positive real line

Ward Wardrsquos method averages all distances between pairs of objects in different clusters with adjustments for covariances to decide how far apart the clusters are

Weighted Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are The weights used are proportional to the size of the cluster

For some data some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances In these cases you may see stray branches that do not connect to others If this happens you should consider Single or Complete linkage For more information on these problems see Fisher and Van Ness (1971)

These reviewers concluded that these and other problems made Centroid Average Median and Ward (as well as K-Means) ldquoinadmissiblerdquo clustering procedures In practice and in Monte Carlo simulations however they sometimes perform better than Single and Complete linkage which Fisher and Van Ness considered ldquoadmissiblerdquo Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms Consult his paper for further details

In addition the following options can be specified

Distance Specifies the distance metric used to compare clusters Polar Produces a polar (circular) cluster tree

Save Save provides two options either to save cluster identifiers or to save cluster identifiers along with data You can specify the number of clusters to identify for the saved file If not specified two clusters are identified

In the Mahalanobis Tab you can specify the covariance matrix to compute Mahalanobis distance

Covariance matrix Specify the covariance matrix to compute the Mahalanobis distance Enter the covariance matrix either through the keyboard or from a SYSTAT file Otherwise by default SYSTAT computes the matrix from the data Select a grouping variable for inter-group distance measures

Cut cluster tree at You can choose the following options for cutting the cluster tree

Height Provides the option of cutting the cluster tree at a specified distance Leaf nodes Provides the option of cutting the cluster tree by number of leaf nodes

Color clusters by The colors in the cluster tree can be assigned by two different methods

Length of terminal node As you pass from node to node in order down the cluster tree the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1)

Proportion of total nodes Colors are assigned based on the proportion of members in a cluster

ValidityProvides five validity indices to evaluate the partition quality In particular it is used to find out the appropriate number of clusters for the given data set

RMSSTD Provides root-mean-square standard deviation of the clusters at each step in hierarchical clustering

Pseudo F Provides pseudo F-ratio for the clusters at each step in hierarchical clustering Pseudo T-square Provides pseudo T-square statistic for cluster assessment DB Provides Davies-Bouldinrsquos index for each hierarchy of clustering This index is

applicable for rectangular data only Dunn Provides Dunnrsquos cluster separation measure Maximum groups Performs the computation of indices up to this specified number of

clusters The default value is square root of number of objects

K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering Both clustering methods splits a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group

By default the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed The reassigning of cases continue until the within-groups sum of

squares can no longer be reduced The initial seeds or partitions can be chosen from a possible set of nine options

Algorithm Provides K-Means and K-Medians clustering options

K-means Requests K-Means clustering K-medians Requests K-Medians clustering

Groups Enter the number of desired clusters Default number (Groups) is two

Iterations Enter the maximum number of iterations If not stated the maximum is 20

Distance Specifies the distance metric used to compare clusters

Save Save provides three options to save either cluster identifiers cluster identifiers along with data or final cluster seeds to a SYSTAT file

Click More above for descriptions of the distance metrics

None Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally It continues splitting and reassigning the cases until k clusters are formed

First kConsiders the first k non-missing cases as initial seeds Last kConsiders the last k non-missing cases as initial seeds Random k Chooses randomly (without replacement) k non-missing cases as initial

seeds Random segmentation Assigns each case to any of k partitions randomly Computes

seeds from each initial partition taking the mean or the median of the observations whichever is applicable

  • Two Step Cluster Analysis
    • To Obtain a TwoStep Cluster Analysis
      • TwoStep Cluster Analysis Options
        • Advanced Options
          • TwoStep Cluster Analysis Output
          • The Cluster Viewer
          • Transpose Clusters and Features
          • Sort Features
          • Sort Clusters
          • Cell Contents
          • Cluster Comparison View
          • Navigating the Cluster Viewer
            • Using the Toolbars
            • Control Cluster View Display
              • Filtering Records
              • Hierarchical Cluster Analysis
              • Hierarchical Cluster Analysis Method
              • Hierarchical Cluster Analysis Measures for Interval Data
              • Hierarchical Cluster Analysis Measures for Count Data
              • Hierarchical Cluster Analysis Measures for Binary Data
              • Hierarchical Cluster Analysis Transform Values
              • Hierarchical Cluster Analysis Statistics
              • Hierarchical Cluster Analysis Plots
              • Hierarchical Cluster Analysis Save New Variables
              • Saving New Variables
              • K-Means Cluster Analysis
              • K-Means Cluster Analysis Efficiency
              • K-Means Cluster Analysis Iterate
              • K-Means Cluster Analysis Save
              • K-Means Cluster Analysis Options
              • Distances
                • To Obtain Distance Matrices
                  • Distances Dissimilarity Measures
                  • Distances Dissimilarity Measures for Interval Data
                  • Distances Dissimilarity Measures for Count Data
                  • Distances Dissimilarity Measures for Binary Data
                  • Distances Similarity Measures
                  • Distances Similarity Measures for Interval Data
                  • Distances Similarity Measures for Binary Data
                  • Distances Transform Values
Page 28: TwoStep Cluster Analysis

Distances Dissimilarity Measures for Count DataThe following dissimilarity measures are available for count data

bull Chi-square measure This measure is based on the chi-square test of equality for two sets of frequencies This is the default for count data

bull Phi-square measure This measure is equal to the chi-square measure normalized by the square root of the combined frequency

Distances Dissimilarity Measures for Binary DataThe following dissimilarity measures are available for binary data

bull Euclidean distance Computed from a fourfold table as SQRT(b+c) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other

bull Squared Euclidean distance Computed as the number of discordant cases Its minimum value is 0 and it has no upper limit

bull Size difference An index of asymmetry It ranges from 0 to 1

bull Pattern difference Dissimilarity measure for binary data that ranges from 0 to 1 Computed from a fourfold table as bc(n2) where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations

bull Variance Computed from a fourfold table as (b+c)4n where b and c represent the diagonal cells corresponding to cases present on one item but absent on the other and n is the total number of observations It ranges from 0 to 1

bull Shape This distance measure has a range of 0 to 1 and it penalizes asymmetry of mismatches

bull Lance and Williams Computed from a fourfold table as (b+c)(2a+b+c) where a represents the cell corresponding to cases present on both items and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other This measure has a range of 0 to 1 (Also known as the Bray-Curtis nonmetric coefficient)

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Distances Similarity MeasuresFrom the Measure group select the alternative that corresponds to your type of data (interval or binary) then from the drop-down list select one of the measures that corresponds to that type of data Available measures by data type are

bull Interval data Pearson correlation or cosine

bull Binary data Russell and Rao simple matching Jaccard Dice Rogers and Tanimoto Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Kulczynski 1 Kulczynski 2 Sokal and Sneath 4 Hamann Lambda Anderbergs D Yules Y Yules Q Ochiai Sokal and Sneath 5 phi 4-point correlation or dispersion (Enter values for Present and Absent to specify which two values are meaningful Distances will ignore all other values)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities These transformations are not applicable to binary data Available standardization methods are z scores range ndash1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 and standard deviation of 1

The Transform Measures group allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available options are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

With Similarities selected click Measures

From the Measure group select the alternative that corresponds to your type of data

From the drop-down list select a measure that corresponds to that type of measure

Distances Similarity Measures for Interval DataThe following similarity measures are available for interval data

bull Pearson correlation The product-moment correlation between two vectors of values This is the default similarity measure for interval data

bull Cosine The cosine of the angle between two vectors of values

Distances Similarity Measures for Binary DataThe following similarity measures are available for binary data

bull Russel and Rao This is a binary version of the inner (dot) product Equal weight is given to matches and nonmatches This is the default for binary similarity data

bull Simple matching This is the ratio of matches to the total number of values Equal weight is given to matches and nonmatches

bull Jaccard This is an index in which joint absences are excluded from consideration Equal weight is given to matches and nonmatches Also known as the similarity ratio

bull Dice This is an index in which joint absences are excluded from consideration and matches are weighted double Also known as the Czekanowski or Sorensen measure

bull Rogers and Tanimoto This is an index in which double weight is given to nonmatches

bull Sokal and Sneath 1 This is an index in which double weight is given to matches

bull Sokal and Sneath 2 This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration

bull Sokal and Sneath 3 This is the ratio of matches to nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however Distances assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Kulczynski 1 This is the ratio of joint presences to all nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however Distances assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Kulczynski 2 This index is based on the conditional probability that the characteristic is present in one item given that it is present in the other The separate values for each item acting as predictor of the other are averaged to compute this value

bull Sokal and Sneath 4 This index is based on the conditional probability that the characteristic in one item matches the value in the other The separate values for each item acting as predictor of the other are averaged to compute this value

bull Hamann This index is the number of matches minus the number of nonmatches divided by the total number of items It ranges from -1 to 1

bull Lambda This index is Goodman and Kruskals lambda Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Anderbergs D Similar to lambda this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Yules Y This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals It has a range of -1 to 1 Also known as the coefficient of colligation

bull Yules Q This index is a special case of Goodman and Kruskals gamma It is a function of the cross-ratio and is independent of the marginal totals It has a range of -1 to 1

bull Ochiai This index is the binary form of the cosine similarity measure It has a range of 0 to 1

bull Sokal and Sneath 5 This index is the squared geometric mean of conditional probabilities of positive and negative matches It is independent of item coding It has a range of 0 to 1

bull Phi 4-point correlation This index is a binary analogue of the Pearson correlation coefficient It has a range of -1 to 1

bull Dispersion This index has a range of -1 to 1

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Distances Transform ValuesThe following alternatives are available for transforming values

bull Z scores Values are standardized to z scores with a mean of 0 and a standard deviation of 1

bull Range -1 to 1 Each value for the item being standardized is divided by the range of the values

bull Range 0 to 1 The procedure subtracts the minimum value from each item being standardized and then divides by the range

bull Maximum magnitude of 1 The procedure divides each value for the item being standardized by the maximum of the values

bull Mean of 1 The procedure divides each value for the item being standardized by the mean of the values

bull Standard deviation of 1 The procedure divides each value for the variable or case being standardized by the standard deviation of the values

Additionally you can choose how standardization is done Alternatives are By variable or By case

CLUSTER provides three procedures for clustering Hierarchical Clustering K-Clustering and Additive Trees The Hierarchical Clustering procedure comprises hierarchical linkage methods The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering

Hierarchical Clustering clusters cases variables individually or both cases and variables simultaneously K-Clustering clusters cases only and Additive Trees clusters a similarity or dissimilarity matrix Several distance metrics are available with Hierarchical Clustering and K-Clustering including metrics for binary quantitative and frequency count data Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram When the MATRIX option is used to cluster cases and variables SYSTAT uses a gray-scale or color spectrum to represent the values Resampling procedures are available only in Hierarchical Clustering

SYSTAT further provides five indices viz statistical criteria by which an appropriate number of clusters can be chosen from the Hierarchical Tree Options for cutting (or pruning) and coloring the hierarchical tree are also provided

In the K-Clustering procedure SYSTAT offers two algorithms KMEANS and KMEDIANS for partitioning Further SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS

Hierarchical clustering produces hierarchical clusters that are displayed in a tree Initially each object (case or variable) is considered a separate cluster SYSTAT begins by joining the two ldquoclosestrdquo objects as a cluster and continues (in a stepwise manner) joining an object with another object an object with a cluster or a cluster with another cluster until all objects are combined into one cluster

You must select the elements of the data file to cluster (Join)

Rows Rows (cases) of the data matrix are clustered Columns Columns (variables) of the data matrix are clustered Matrix Rows and columns of the data matrix are clusteredmdashthey are permuted to bring

similar rows and columns next to one another

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is define how distances between clusters are measured)

Average Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are

Centroid Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters

Complete Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances This method tends to produce compact globular clusters If you use a similarity or dissimilarity matrix from a SYSTAT file you get Johnsonrsquos ldquomaxrdquo method

Flexibeta Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are You can choose the value of the weight β The range of β is between ndash1 and 1

K-nbdKth nearest neighborhood method is a density linkage method The estimated density is proportional to the number of cases in the smallest sphere containing the K th

nearest neighbor A new dissimilarity matrix is then constructed using the density estimate Finally the single linkage cluster analysis is performed You can specify the number k its range is between 1 and the total number of cases in the dataset

Median Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are

Single Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters This method tends to produce long stringy clusters If you use a SYSTAT file that contains a similarity or dissimilarity matrix you get clustering via Johnsonrsquos ldquominrdquo method

Uniform Uniform Kernel method is a density linkage method The estimated density is proportional to the number of cases in a sphere of radius r A new dissimilarity matrix is then constructed using the density estimate Finally single linkage cluster analysis is performed You can choose the number r its range is the positive real line

Ward Wardrsquos method averages all distances between pairs of objects in different clusters with adjustments for covariances to decide how far apart the clusters are

Weighted Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are The weights used are proportional to the size of the cluster

For some data some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances In these cases you may see stray branches that do not connect to others If this happens you should consider Single or Complete linkage For more information on these problems see Fisher and Van Ness (1971)

These reviewers concluded that these and other problems made Centroid Average Median and Ward (as well as K-Means) ldquoinadmissiblerdquo clustering procedures In practice and in Monte Carlo simulations however they sometimes perform better than Single and Complete linkage which Fisher and Van Ness considered ldquoadmissiblerdquo Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms Consult his paper for further details

In addition the following options can be specified

Distance Specifies the distance metric used to compare clusters Polar Produces a polar (circular) cluster tree

Save Save provides two options either to save cluster identifiers or to save cluster identifiers along with data You can specify the number of clusters to identify for the saved file If not specified two clusters are identified

In the Mahalanobis Tab you can specify the covariance matrix to compute Mahalanobis distance

Covariance matrix Specify the covariance matrix to compute the Mahalanobis distance Enter the covariance matrix either through the keyboard or from a SYSTAT file Otherwise by default SYSTAT computes the matrix from the data Select a grouping variable for inter-group distance measures

Cut cluster tree at You can choose the following options for cutting the cluster tree

Height Provides the option of cutting the cluster tree at a specified distance Leaf nodes Provides the option of cutting the cluster tree by number of leaf nodes

Color clusters by The colors in the cluster tree can be assigned by two different methods

Length of terminal node As you pass from node to node in order down the cluster tree the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1)

Proportion of total nodes Colors are assigned based on the proportion of members in a cluster

ValidityProvides five validity indices to evaluate the partition quality In particular it is used to find out the appropriate number of clusters for the given data set

RMSSTD Provides root-mean-square standard deviation of the clusters at each step in hierarchical clustering

Pseudo F Provides pseudo F-ratio for the clusters at each step in hierarchical clustering Pseudo T-square Provides pseudo T-square statistic for cluster assessment DB Provides Davies-Bouldinrsquos index for each hierarchy of clustering This index is

applicable for rectangular data only Dunn Provides Dunnrsquos cluster separation measure Maximum groups Performs the computation of indices up to this specified number of

clusters The default value is square root of number of objects

K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering Both clustering methods splits a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group

By default the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed The reassigning of cases continue until the within-groups sum of

squares can no longer be reduced The initial seeds or partitions can be chosen from a possible set of nine options

Algorithm Provides K-Means and K-Medians clustering options

K-means Requests K-Means clustering K-medians Requests K-Medians clustering

Groups Enter the number of desired clusters Default number (Groups) is two

Iterations Enter the maximum number of iterations If not stated the maximum is 20

Distance Specifies the distance metric used to compare clusters

Save Save provides three options to save either cluster identifiers cluster identifiers along with data or final cluster seeds to a SYSTAT file

Click More above for descriptions of the distance metrics

None Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally It continues splitting and reassigning the cases until k clusters are formed

First kConsiders the first k non-missing cases as initial seeds Last kConsiders the last k non-missing cases as initial seeds Random k Chooses randomly (without replacement) k non-missing cases as initial

seeds Random segmentation Assigns each case to any of k partitions randomly Computes

seeds from each initial partition taking the mean or the median of the observations whichever is applicable

  • Two Step Cluster Analysis
    • To Obtain a TwoStep Cluster Analysis
      • TwoStep Cluster Analysis Options
        • Advanced Options
          • TwoStep Cluster Analysis Output
          • The Cluster Viewer
          • Transpose Clusters and Features
          • Sort Features
          • Sort Clusters
          • Cell Contents
          • Cluster Comparison View
          • Navigating the Cluster Viewer
            • Using the Toolbars
            • Control Cluster View Display
              • Filtering Records
              • Hierarchical Cluster Analysis
              • Hierarchical Cluster Analysis Method
              • Hierarchical Cluster Analysis Measures for Interval Data
              • Hierarchical Cluster Analysis Measures for Count Data
              • Hierarchical Cluster Analysis Measures for Binary Data
              • Hierarchical Cluster Analysis Transform Values
              • Hierarchical Cluster Analysis Statistics
              • Hierarchical Cluster Analysis Plots
              • Hierarchical Cluster Analysis Save New Variables
              • Saving New Variables
              • K-Means Cluster Analysis
              • K-Means Cluster Analysis Efficiency
              • K-Means Cluster Analysis Iterate
              • K-Means Cluster Analysis Save
              • K-Means Cluster Analysis Options
              • Distances
                • To Obtain Distance Matrices
                  • Distances Dissimilarity Measures
                  • Distances Dissimilarity Measures for Interval Data
                  • Distances Dissimilarity Measures for Count Data
                  • Distances Dissimilarity Measures for Binary Data
                  • Distances Similarity Measures
                  • Distances Similarity Measures for Interval Data
                  • Distances Similarity Measures for Binary Data
                  • Distances Transform Values
Page 29: TwoStep Cluster Analysis

Distances Similarity MeasuresFrom the Measure group select the alternative that corresponds to your type of data (interval or binary) then from the drop-down list select one of the measures that corresponds to that type of data Available measures by data type are

bull Interval data Pearson correlation or cosine

bull Binary data Russell and Rao simple matching Jaccard Dice Rogers and Tanimoto Sokal and Sneath 1 Sokal and Sneath 2 Sokal and Sneath 3 Kulczynski 1 Kulczynski 2 Sokal and Sneath 4 Hamann Lambda Anderbergs D Yules Y Yules Q Ochiai Sokal and Sneath 5 phi 4-point correlation or dispersion (Enter values for Present and Absent to specify which two values are meaningful Distances will ignore all other values)

The Transform Values group allows you to standardize data values for either cases or variables before computing proximities These transformations are not applicable to binary data Available standardization methods are z scores range ndash1 to 1 range 0 to 1 maximum magnitude of 1 mean of 1 and standard deviation of 1

The Transform Measures group allows you to transform the values generated by the distance measure They are applied after the distance measure has been computed Available options are absolute values change sign and rescale to 0ndash1 range

This feature requires the Statistics Base option

From the menus choose

Analyze gt Correlate gt Distances

With Similarities selected click Measures

From the Measure group select the alternative that corresponds to your type of data

From the drop-down list select a measure that corresponds to that type of measure

Distances Similarity Measures for Interval DataThe following similarity measures are available for interval data

bull Pearson correlation The product-moment correlation between two vectors of values This is the default similarity measure for interval data

bull Cosine The cosine of the angle between two vectors of values

Distances Similarity Measures for Binary DataThe following similarity measures are available for binary data

bull Russel and Rao This is a binary version of the inner (dot) product Equal weight is given to matches and nonmatches This is the default for binary similarity data

bull Simple matching This is the ratio of matches to the total number of values Equal weight is given to matches and nonmatches

bull Jaccard This is an index in which joint absences are excluded from consideration Equal weight is given to matches and nonmatches Also known as the similarity ratio

bull Dice This is an index in which joint absences are excluded from consideration and matches are weighted double Also known as the Czekanowski or Sorensen measure

bull Rogers and Tanimoto This is an index in which double weight is given to nonmatches

bull Sokal and Sneath 1 This is an index in which double weight is given to matches

bull Sokal and Sneath 2 This is an index in which double weight is given to nonmatches and joint absences are excluded from consideration

bull Sokal and Sneath 3 This is the ratio of matches to nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however Distances assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Kulczynski 1 This is the ratio of joint presences to all nonmatches This index has a lower bound of 0 and is unbounded above It is theoretically undefined when there are no nonmatches however Distances assigns an arbitrary value of 9999999 when the value is undefined or is greater than this value

bull Kulczynski 2 This index is based on the conditional probability that the characteristic is present in one item given that it is present in the other The separate values for each item acting as predictor of the other are averaged to compute this value

bull Sokal and Sneath 4 This index is based on the conditional probability that the characteristic in one item matches the value in the other The separate values for each item acting as predictor of the other are averaged to compute this value

bull Hamann This index is the number of matches minus the number of nonmatches divided by the total number of items It ranges from -1 to 1

bull Lambda This index is Goodman and Kruskals lambda Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Anderbergs D Similar to lambda this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Yules Y This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals It has a range of -1 to 1 Also known as the coefficient of colligation

bull Yules Q This index is a special case of Goodman and Kruskals gamma It is a function of the cross-ratio and is independent of the marginal totals It has a range of -1 to 1

bull Ochiai This index is the binary form of the cosine similarity measure It has a range of 0 to 1

bull Sokal and Sneath 5 This index is the squared geometric mean of conditional probabilities of positive and negative matches It is independent of item coding It has a range of 0 to 1

bull Phi 4-point correlation This index is a binary analogue of the Pearson correlation coefficient It has a range of -1 to 1

bull Dispersion This index has a range of -1 to 1

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Distances Transform ValuesThe following alternatives are available for transforming values

bull Z scores Values are standardized to z scores with a mean of 0 and a standard deviation of 1

bull Range -1 to 1 Each value for the item being standardized is divided by the range of the values

bull Range 0 to 1 The procedure subtracts the minimum value from each item being standardized and then divides by the range

bull Maximum magnitude of 1 The procedure divides each value for the item being standardized by the maximum of the values

bull Mean of 1 The procedure divides each value for the item being standardized by the mean of the values

bull Standard deviation of 1 The procedure divides each value for the variable or case being standardized by the standard deviation of the values

Additionally you can choose how standardization is done Alternatives are By variable or By case

CLUSTER provides three procedures for clustering Hierarchical Clustering K-Clustering and Additive Trees The Hierarchical Clustering procedure comprises hierarchical linkage methods The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering

Hierarchical Clustering clusters cases variables individually or both cases and variables simultaneously K-Clustering clusters cases only and Additive Trees clusters a similarity or dissimilarity matrix Several distance metrics are available with Hierarchical Clustering and K-Clustering including metrics for binary quantitative and frequency count data Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram When the MATRIX option is used to cluster cases and variables SYSTAT uses a gray-scale or color spectrum to represent the values Resampling procedures are available only in Hierarchical Clustering

SYSTAT further provides five indices viz statistical criteria by which an appropriate number of clusters can be chosen from the Hierarchical Tree Options for cutting (or pruning) and coloring the hierarchical tree are also provided

In the K-Clustering procedure SYSTAT offers two algorithms KMEANS and KMEDIANS for partitioning Further SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS

Hierarchical clustering produces hierarchical clusters that are displayed in a tree Initially each object (case or variable) is considered a separate cluster SYSTAT begins by joining the two ldquoclosestrdquo objects as a cluster and continues (in a stepwise manner) joining an object with another object an object with a cluster or a cluster with another cluster until all objects are combined into one cluster

You must select the elements of the data file to cluster (Join)

Rows Rows (cases) of the data matrix are clustered Columns Columns (variables) of the data matrix are clustered Matrix Rows and columns of the data matrix are clusteredmdashthey are permuted to bring

similar rows and columns next to one another

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is, define how distances between clusters are measured).

Average: Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are.

Centroid: Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters.

Complete: Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances. This method tends to produce compact, globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT file, you get Johnson's "max" method.

Flexibeta: Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. You can choose the value of the weight β; its range is between –1 and 1.

K-nbd: The Kth-nearest-neighbor method is a density linkage method. The estimated density is proportional to the number of cases in the smallest sphere containing the Kth nearest neighbor. A new dissimilarity matrix is then constructed using the density estimates. Finally, single linkage cluster analysis is performed. You can specify the number k; its range is between 1 and the total number of cases in the data set.

Median: Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are.

Single: Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters. This method tends to produce long, stringy clusters. If you use a SYSTAT file that contains a similarity or dissimilarity matrix, you get clustering via Johnson's "min" method.

Uniform: The uniform kernel method is a density linkage method. The estimated density is proportional to the number of cases in a sphere of radius r. A new dissimilarity matrix is then constructed using the density estimates. Finally, single linkage cluster analysis is performed. You can choose the number r; its range is the positive real line.

Ward: Ward's method averages all distances between pairs of objects in different clusters, with adjustments for covariances, to decide how far apart the clusters are.

Weighted: Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are. The weights used are proportional to the size of the cluster.
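
Most of these rules have direct counterparts in SciPy's linkage function (single, complete, average, weighted, centroid, median, and ward); flexible beta and the two density linkages do not, so this hedged sketch covers only the shared subset:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))    # hypothetical data: 30 cases, 3 variables

# Only the amalgamation rule changes; the data and metric stay the same.
for method in ('single', 'complete', 'average', 'weighted',
               'centroid', 'median', 'ward'):
    Z = linkage(X, method=method)
    # The last row of Z records the final merge; column 2 is its distance.
    print(f'{method:9s} final merge distance = {Z[-1, 2]:.3f}')
```

On elongated data, single linkage typically chains into long, stringy clusters while complete and Ward favor compact ones, matching the descriptions above.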

For some data, some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances. In these cases, you may see stray branches that do not connect to others. If this happens, you should consider Single or Complete linkage. For more information on these problems, see Fisher and Van Ness (1971).

These reviewers concluded that these and other problems made Centroid, Average, Median, and Ward (as well as K-Means) "inadmissible" clustering procedures. In practice and in Monte Carlo simulations, however, they sometimes perform better than Single and Complete linkage, which Fisher and Van Ness considered "admissible". Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms; consult his paper for further details.

In addition, the following options can be specified:

Distance: Specifies the distance metric used to compare clusters.
Polar: Produces a polar (circular) cluster tree.

Save: Provides two options: save cluster identifiers only, or save cluster identifiers along with the data. You can specify the number of clusters to identify for the saved file; if not specified, two clusters are identified.

In the Mahalanobis tab, you can specify the covariance matrix used to compute the Mahalanobis distance.

Covariance matrix: Specify the covariance matrix used to compute the Mahalanobis distance. Enter the covariance matrix either through the keyboard or from a SYSTAT file; otherwise, by default, SYSTAT computes the matrix from the data. Select a grouping variable for inter-group distance measures.
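
For intuition, the sketch below computes a Mahalanobis distance between two cases with NumPy and SciPy (assumed stand-ins here), estimating the covariance matrix from the data just as SYSTAT does by default when none is supplied:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))      # hypothetical data: 50 cases, 3 variables

S = np.cov(X, rowvar=False)       # covariance estimated from the data
VI = np.linalg.inv(S)             # mahalanobis() expects the inverse

d = mahalanobis(X[0], X[2], VI)
print(f'Mahalanobis distance between case 0 and case 2: {d:.3f}')
```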

Cut cluster tree at: You can choose the following options for cutting the cluster tree:

Height: Provides the option of cutting the cluster tree at a specified distance.
Leaf nodes: Provides the option of cutting the cluster tree by number of leaf nodes.
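
Both cutting criteria have analogues in SciPy's fcluster (an assumption of this sketch): criterion='distance' cuts at a specified height, and criterion='maxclust' is roughly the leaf-count cut, asking for a fixed number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))
Z = linkage(X, method='complete')

labels_by_height = fcluster(Z, t=2.5, criterion='distance')  # cut at height 2.5
labels_by_count  = fcluster(Z, t=4,   criterion='maxclust')  # ask for 4 clusters
print(labels_by_count)
```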

Color clusters by: The colors in the cluster tree can be assigned by two different methods:

Length of terminal node: As you pass from node to node in order down the cluster tree, the color changes when the length of a node on the distance scale crosses the specified terminal-node length (on a scale of 0 to 1).

Proportion of total nodes: Colors are assigned based on the proportion of members in a cluster.

Validity: Provides five validity indices to evaluate partition quality. In particular, they are used to find an appropriate number of clusters for the given data set.

RMSSTD: Provides the root-mean-square standard deviation of the clusters at each step in hierarchical clustering.
Pseudo F: Provides the pseudo F-ratio for the clusters at each step in hierarchical clustering.
Pseudo T-square: Provides the pseudo T-square statistic for cluster assessment.
DB: Provides the Davies-Bouldin index for each hierarchy of clustering. This index is applicable for rectangular data only.
Dunn: Provides Dunn's cluster separation measure.
Maximum groups: Performs the computation of indices up to this specified number of clusters. The default value is the square root of the number of objects.
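
Two of these indices have widely used open-source equivalents, which this sketch borrows from scikit-learn (an assumption, not part of SYSTAT): davies_bouldin_score for DB and calinski_harabasz_score for the usual pseudo F-ratio. Scanning them over candidate partitions is one way to pick the number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0, 5, 10)])
Z = linkage(X, method='ward')

# Evaluate each candidate partition: lower DB and higher pseudo F are better.
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion='maxclust')
    print(k, round(davies_bouldin_score(X, labels), 3),
          round(calinski_harabasz_score(X, labels), 1))
```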

The K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering. Both clustering methods split a set of objects into a selected number of groups by maximizing between-cluster variation relative to within-cluster variation. It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group.

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. Splitting continues on one of the clusters (with reassignment of cases) until a specified number of clusters is formed. The reassignment of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.
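
The sketch below reproduces the flavor of this procedure with scikit-learn's KMeans (an assumed stand-in; it initializes with k-means++ by default rather than the splitting scheme just described, and K-Medians has no direct scikit-learn counterpart). The iteration cap mirrors the default of 20 noted below:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=c, size=(40, 2)) for c in (0, 6)])

# Two groups by default; cases are reassigned to the nearest center until
# the within-groups sum of squares can no longer be reduced (or max_iter).
km = KMeans(n_clusters=2, max_iter=20, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])
print(km.cluster_centers_)
```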

Algorithm: Provides K-Means and K-Medians clustering options.

K-means: Requests K-Means clustering.
K-medians: Requests K-Medians clustering.

Groups: Enter the number of desired clusters. The default number of groups is two.

Iterations: Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance: Specifies the distance metric used to compare clusters.

Save: Provides three options: save cluster identifiers, cluster identifiers along with the data, or final cluster seeds, to a SYSTAT file.

The following choices of initial seeds or partitions are available:

None: Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k: Considers the first k non-missing cases as initial seeds.
Last k: Considers the last k non-missing cases as initial seeds.
Random k: Chooses k non-missing cases randomly (without replacement) as initial seeds.
Random segmentation: Assigns each case to one of k partitions randomly, then computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.
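
A loose mapping of these seed choices onto scikit-learn's KMeans init argument (hypothetical correspondences, not SYSTAT's implementation): init='random' resembles Random k, and an explicit array of starting centers can emulate First k or Last k; the default splitting scheme (None) and Random segmentation have no one-line equivalent here:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
k = 3

# Random k: k observations drawn at random serve as initial seeds.
km_random = KMeans(n_clusters=k, init='random', n_init=5, random_state=0).fit(X)

# First k: the first k cases passed explicitly as initial seeds (n_init=1).
km_first = KMeans(n_clusters=k, init=X[:k], n_init=1).fit(X)

print(km_random.inertia_, km_first.inertia_)
```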


Distances Similarity Measures for Interval Data (continued):

• Cosine. The cosine of the angle between two vectors of values.

Distances Similarity Measures for Binary Data

The following similarity measures are available for binary data:

• Russell and Rao. This is a binary version of the inner (dot) product. Equal weight is given to matches and nonmatches. This is the default for binary similarity data.

• Simple matching. This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches.

• Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio.

• Dice. This is an index in which joint absences are excluded from consideration and matches are weighted double. Also known as the Czekanowski or Sorensen measure.

• Rogers and Tanimoto. This is an index in which double weight is given to nonmatches.

• Sokal and Sneath 1. This is an index in which double weight is given to matches.

• Sokal and Sneath 2. This is an index in which double weight is given to nonmatches, and joint absences are excluded from consideration.

• Sokal and Sneath 3. This is the ratio of matches to nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 1. This is the ratio of joint presences to all nonmatches. This index has a lower bound of 0 and is unbounded above. It is theoretically undefined when there are no nonmatches; however, Distances assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

• Kulczynski 2. This index is based on the conditional probability that the characteristic is present in one item, given that it is present in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Sokal and Sneath 4. This index is based on the conditional probability that the characteristic in one item matches the value in the other. The separate values for each item acting as a predictor of the other are averaged to compute this value.

• Hamann. This index is the number of matches minus the number of nonmatches, divided by the total number of items. It ranges from -1 to 1.

• Lambda. This index is Goodman and Kruskal's lambda. It corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Anderberg's D. Similar to lambda, this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions). Values range from 0 to 1.

• Yule's Y. This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation.

• Yule's Q. This index is a special case of Goodman and Kruskal's gamma. It is a function of the cross-ratio and is independent of the marginal totals. It has a range of -1 to 1.

• Ochiai. This index is the binary form of the cosine similarity measure. It has a range of 0 to 1.

• Sokal and Sneath 5. This index is the squared geometric mean of conditional probabilities of positive and negative matches. It is independent of item coding. It has a range of 0 to 1.

• Phi 4-point correlation. This index is a binary analogue of the Pearson correlation coefficient. It has a range of -1 to 1.

• Dispersion. This index has a range of -1 to 1.

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent. The procedure will ignore all other values.
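
Several of these coefficients also exist in SciPy's distance module (an assumption of this sketch), though SciPy exposes them as dissimilarities rather than similarities; jaccard, for example, returns 1 minus the Jaccard similarity:

```python
import numpy as np
from scipy.spatial.distance import (jaccard, dice, russellrao,
                                    rogerstanimoto, sokalsneath,
                                    yule, hamming)

# Two binary profiles: True = characteristic present, False = absent.
u = np.array([1, 1, 0, 1, 0, 0, 1], dtype=bool)
v = np.array([1, 0, 0, 1, 0, 1, 1], dtype=bool)

for name, fn in [('Jaccard', jaccard), ('Dice', dice),
                 ('Russell-Rao', russellrao),
                 ('Rogers-Tanimoto', rogerstanimoto),
                 ('Sokal-Sneath', sokalsneath), ('Yule', yule)]:
    print(f'{name:16s} dissimilarity = {fn(u, v):.3f}')

# Simple matching similarity is the complement of the Hamming distance.
print('Simple matching similarity =', 1 - hamming(u, v))
```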

Distances Transform Values

The following alternatives are available for transforming values:

• Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

• Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

• Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

• Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

• Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

• Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Additionally, you can choose how standardization is done. The alternatives are By variable or By case.

CLUSTER provides three procedures for clustering Hierarchical Clustering K-Clustering and Additive Trees The Hierarchical Clustering procedure comprises hierarchical linkage methods The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering

Hierarchical Clustering clusters cases variables individually or both cases and variables simultaneously K-Clustering clusters cases only and Additive Trees clusters a similarity or dissimilarity matrix Several distance metrics are available with Hierarchical Clustering and K-Clustering including metrics for binary quantitative and frequency count data Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram When the MATRIX option is used to cluster cases and variables SYSTAT uses a gray-scale or color spectrum to represent the values Resampling procedures are available only in Hierarchical Clustering

SYSTAT further provides five indices viz statistical criteria by which an appropriate number of clusters can be chosen from the Hierarchical Tree Options for cutting (or pruning) and coloring the hierarchical tree are also provided

In the K-Clustering procedure SYSTAT offers two algorithms KMEANS and KMEDIANS for partitioning Further SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS

Hierarchical clustering produces hierarchical clusters that are displayed in a tree Initially each object (case or variable) is considered a separate cluster SYSTAT begins by joining the two ldquoclosestrdquo objects as a cluster and continues (in a stepwise manner) joining an object with another object an object with a cluster or a cluster with another cluster until all objects are combined into one cluster

You must select the elements of the data file to cluster (Join)

Rows Rows (cases) of the data matrix are clustered Columns Columns (variables) of the data matrix are clustered Matrix Rows and columns of the data matrix are clusteredmdashthey are permuted to bring

similar rows and columns next to one another

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is define how distances between clusters are measured)

Average Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are

Centroid Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters

Complete Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances This method tends to produce compact globular clusters If you use a similarity or dissimilarity matrix from a SYSTAT file you get Johnsonrsquos ldquomaxrdquo method

Flexibeta Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are You can choose the value of the weight β The range of β is between ndash1 and 1

K-nbdKth nearest neighborhood method is a density linkage method The estimated density is proportional to the number of cases in the smallest sphere containing the K th

nearest neighbor A new dissimilarity matrix is then constructed using the density estimate Finally the single linkage cluster analysis is performed You can specify the number k its range is between 1 and the total number of cases in the dataset

Median Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are

Single Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters This method tends to produce long stringy clusters If you use a SYSTAT file that contains a similarity or dissimilarity matrix you get clustering via Johnsonrsquos ldquominrdquo method

Uniform Uniform Kernel method is a density linkage method The estimated density is proportional to the number of cases in a sphere of radius r A new dissimilarity matrix is then constructed using the density estimate Finally single linkage cluster analysis is performed You can choose the number r its range is the positive real line

Ward Wardrsquos method averages all distances between pairs of objects in different clusters with adjustments for covariances to decide how far apart the clusters are

Weighted Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are The weights used are proportional to the size of the cluster

For some data some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances In these cases you may see stray branches that do not connect to others If this happens you should consider Single or Complete linkage For more information on these problems see Fisher and Van Ness (1971)

These reviewers concluded that these and other problems made Centroid Average Median and Ward (as well as K-Means) ldquoinadmissiblerdquo clustering procedures In practice and in Monte Carlo simulations however they sometimes perform better than Single and Complete linkage which Fisher and Van Ness considered ldquoadmissiblerdquo Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms Consult his paper for further details

In addition the following options can be specified

Distance Specifies the distance metric used to compare clusters Polar Produces a polar (circular) cluster tree

Save Save provides two options either to save cluster identifiers or to save cluster identifiers along with data You can specify the number of clusters to identify for the saved file If not specified two clusters are identified

In the Mahalanobis Tab you can specify the covariance matrix to compute Mahalanobis distance

Covariance matrix Specify the covariance matrix to compute the Mahalanobis distance Enter the covariance matrix either through the keyboard or from a SYSTAT file Otherwise by default SYSTAT computes the matrix from the data Select a grouping variable for inter-group distance measures

Cut cluster tree at You can choose the following options for cutting the cluster tree

Height Provides the option of cutting the cluster tree at a specified distance Leaf nodes Provides the option of cutting the cluster tree by number of leaf nodes

Color clusters by The colors in the cluster tree can be assigned by two different methods

Length of terminal node As you pass from node to node in order down the cluster tree the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1)

Proportion of total nodes Colors are assigned based on the proportion of members in a cluster

ValidityProvides five validity indices to evaluate the partition quality In particular it is used to find out the appropriate number of clusters for the given data set

RMSSTD Provides root-mean-square standard deviation of the clusters at each step in hierarchical clustering

Pseudo F Provides pseudo F-ratio for the clusters at each step in hierarchical clustering Pseudo T-square Provides pseudo T-square statistic for cluster assessment DB Provides Davies-Bouldinrsquos index for each hierarchy of clustering This index is

applicable for rectangular data only Dunn Provides Dunnrsquos cluster separation measure Maximum groups Performs the computation of indices up to this specified number of

clusters The default value is square root of number of objects

K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering Both clustering methods splits a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group

By default the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed The reassigning of cases continue until the within-groups sum of

squares can no longer be reduced The initial seeds or partitions can be chosen from a possible set of nine options

Algorithm Provides K-Means and K-Medians clustering options

K-means Requests K-Means clustering K-medians Requests K-Medians clustering

Groups Enter the number of desired clusters Default number (Groups) is two

Iterations Enter the maximum number of iterations If not stated the maximum is 20

Distance Specifies the distance metric used to compare clusters

Save Save provides three options to save either cluster identifiers cluster identifiers along with data or final cluster seeds to a SYSTAT file

Click More above for descriptions of the distance metrics

None Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally It continues splitting and reassigning the cases until k clusters are formed

First kConsiders the first k non-missing cases as initial seeds Last kConsiders the last k non-missing cases as initial seeds Random k Chooses randomly (without replacement) k non-missing cases as initial

seeds Random segmentation Assigns each case to any of k partitions randomly Computes

seeds from each initial partition taking the mean or the median of the observations whichever is applicable

  • Two Step Cluster Analysis
    • To Obtain a TwoStep Cluster Analysis
      • TwoStep Cluster Analysis Options
        • Advanced Options
          • TwoStep Cluster Analysis Output
          • The Cluster Viewer
          • Transpose Clusters and Features
          • Sort Features
          • Sort Clusters
          • Cell Contents
          • Cluster Comparison View
          • Navigating the Cluster Viewer
            • Using the Toolbars
            • Control Cluster View Display
              • Filtering Records
              • Hierarchical Cluster Analysis
              • Hierarchical Cluster Analysis Method
              • Hierarchical Cluster Analysis Measures for Interval Data
              • Hierarchical Cluster Analysis Measures for Count Data
              • Hierarchical Cluster Analysis Measures for Binary Data
              • Hierarchical Cluster Analysis Transform Values
              • Hierarchical Cluster Analysis Statistics
              • Hierarchical Cluster Analysis Plots
              • Hierarchical Cluster Analysis Save New Variables
              • Saving New Variables
              • K-Means Cluster Analysis
              • K-Means Cluster Analysis Efficiency
              • K-Means Cluster Analysis Iterate
              • K-Means Cluster Analysis Save
              • K-Means Cluster Analysis Options
              • Distances
                • To Obtain Distance Matrices
                  • Distances Dissimilarity Measures
                  • Distances Dissimilarity Measures for Interval Data
                  • Distances Dissimilarity Measures for Count Data
                  • Distances Dissimilarity Measures for Binary Data
                  • Distances Similarity Measures
                  • Distances Similarity Measures for Interval Data
                  • Distances Similarity Measures for Binary Data
                  • Distances Transform Values
Page 31: TwoStep Cluster Analysis

bull Hamann This index is the number of matches minus the number of nonmatches divided by the total number of items It ranges from -1 to 1

bull Lambda This index is Goodman and Kruskals lambda Corresponds to the proportional reduction of error (PRE) using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Anderbergs D Similar to lambda this index corresponds to the actual reduction of error using one item to predict the other (predicting in both directions) Values range from 0 to 1

bull Yules Y This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals It has a range of -1 to 1 Also known as the coefficient of colligation

bull Yules Q This index is a special case of Goodman and Kruskals gamma It is a function of the cross-ratio and is independent of the marginal totals It has a range of -1 to 1

bull Ochiai This index is the binary form of the cosine similarity measure It has a range of 0 to 1

bull Sokal and Sneath 5 This index is the squared geometric mean of conditional probabilities of positive and negative matches It is independent of item coding It has a range of 0 to 1

bull Phi 4-point correlation This index is a binary analogue of the Pearson correlation coefficient It has a range of -1 to 1

bull Dispersion This index has a range of -1 to 1

You can optionally change the Present and Absent fields to specify the values that indicate that a characteristic is present or absent The procedure will ignore all other values

Distances Transform ValuesThe following alternatives are available for transforming values

bull Z scores Values are standardized to z scores with a mean of 0 and a standard deviation of 1

bull Range -1 to 1 Each value for the item being standardized is divided by the range of the values

bull Range 0 to 1 The procedure subtracts the minimum value from each item being standardized and then divides by the range

bull Maximum magnitude of 1 The procedure divides each value for the item being standardized by the maximum of the values

bull Mean of 1 The procedure divides each value for the item being standardized by the mean of the values

bull Standard deviation of 1 The procedure divides each value for the variable or case being standardized by the standard deviation of the values

Additionally you can choose how standardization is done Alternatives are By variable or By case

CLUSTER provides three procedures for clustering Hierarchical Clustering K-Clustering and Additive Trees The Hierarchical Clustering procedure comprises hierarchical linkage methods The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering

Hierarchical Clustering clusters cases variables individually or both cases and variables simultaneously K-Clustering clusters cases only and Additive Trees clusters a similarity or dissimilarity matrix Several distance metrics are available with Hierarchical Clustering and K-Clustering including metrics for binary quantitative and frequency count data Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram When the MATRIX option is used to cluster cases and variables SYSTAT uses a gray-scale or color spectrum to represent the values Resampling procedures are available only in Hierarchical Clustering

SYSTAT further provides five indices viz statistical criteria by which an appropriate number of clusters can be chosen from the Hierarchical Tree Options for cutting (or pruning) and coloring the hierarchical tree are also provided

In the K-Clustering procedure SYSTAT offers two algorithms KMEANS and KMEDIANS for partitioning Further SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS

Hierarchical clustering produces hierarchical clusters that are displayed in a tree Initially each object (case or variable) is considered a separate cluster SYSTAT begins by joining the two ldquoclosestrdquo objects as a cluster and continues (in a stepwise manner) joining an object with another object an object with a cluster or a cluster with another cluster until all objects are combined into one cluster

You must select the elements of the data file to cluster (Join)

Rows Rows (cases) of the data matrix are clustered Columns Columns (variables) of the data matrix are clustered Matrix Rows and columns of the data matrix are clusteredmdashthey are permuted to bring

similar rows and columns next to one another

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is define how distances between clusters are measured)

Average Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are

Centroid Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters

Complete Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances This method tends to produce compact globular clusters If you use a similarity or dissimilarity matrix from a SYSTAT file you get Johnsonrsquos ldquomaxrdquo method

Flexibeta Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are You can choose the value of the weight β The range of β is between ndash1 and 1

K-nbdKth nearest neighborhood method is a density linkage method The estimated density is proportional to the number of cases in the smallest sphere containing the K th

nearest neighbor A new dissimilarity matrix is then constructed using the density estimate Finally the single linkage cluster analysis is performed You can specify the number k its range is between 1 and the total number of cases in the dataset

Median Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are

Single Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters This method tends to produce long stringy clusters If you use a SYSTAT file that contains a similarity or dissimilarity matrix you get clustering via Johnsonrsquos ldquominrdquo method

Uniform Uniform Kernel method is a density linkage method The estimated density is proportional to the number of cases in a sphere of radius r A new dissimilarity matrix is then constructed using the density estimate Finally single linkage cluster analysis is performed You can choose the number r its range is the positive real line

Ward Wardrsquos method averages all distances between pairs of objects in different clusters with adjustments for covariances to decide how far apart the clusters are

Weighted Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are The weights used are proportional to the size of the cluster

For some data some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances In these cases you may see stray branches that do not connect to others If this happens you should consider Single or Complete linkage For more information on these problems see Fisher and Van Ness (1971)

These reviewers concluded that these and other problems made Centroid Average Median and Ward (as well as K-Means) ldquoinadmissiblerdquo clustering procedures In practice and in Monte Carlo simulations however they sometimes perform better than Single and Complete linkage which Fisher and Van Ness considered ldquoadmissiblerdquo Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms Consult his paper for further details

In addition the following options can be specified

Distance Specifies the distance metric used to compare clusters Polar Produces a polar (circular) cluster tree

Save Save provides two options either to save cluster identifiers or to save cluster identifiers along with data You can specify the number of clusters to identify for the saved file If not specified two clusters are identified

In the Mahalanobis Tab you can specify the covariance matrix to compute Mahalanobis distance

Covariance matrix Specify the covariance matrix to compute the Mahalanobis distance Enter the covariance matrix either through the keyboard or from a SYSTAT file Otherwise by default SYSTAT computes the matrix from the data Select a grouping variable for inter-group distance measures

Cut cluster tree at You can choose the following options for cutting the cluster tree

Height Provides the option of cutting the cluster tree at a specified distance Leaf nodes Provides the option of cutting the cluster tree by number of leaf nodes

Color clusters by The colors in the cluster tree can be assigned by two different methods

Length of terminal node As you pass from node to node in order down the cluster tree the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1)

Proportion of total nodes Colors are assigned based on the proportion of members in a cluster

ValidityProvides five validity indices to evaluate the partition quality In particular it is used to find out the appropriate number of clusters for the given data set

RMSSTD Provides root-mean-square standard deviation of the clusters at each step in hierarchical clustering

Pseudo F Provides pseudo F-ratio for the clusters at each step in hierarchical clustering Pseudo T-square Provides pseudo T-square statistic for cluster assessment DB Provides Davies-Bouldinrsquos index for each hierarchy of clustering This index is

applicable for rectangular data only Dunn Provides Dunnrsquos cluster separation measure Maximum groups Performs the computation of indices up to this specified number of

clusters The default value is square root of number of objects

K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering Both clustering methods splits a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group

By default the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed The reassigning of cases continue until the within-groups sum of

squares can no longer be reduced The initial seeds or partitions can be chosen from a possible set of nine options

Algorithm Provides K-Means and K-Medians clustering options

K-means Requests K-Means clustering K-medians Requests K-Medians clustering

Groups Enter the number of desired clusters Default number (Groups) is two

Iterations Enter the maximum number of iterations If not stated the maximum is 20

Distance Specifies the distance metric used to compare clusters

Save Save provides three options to save either cluster identifiers cluster identifiers along with data or final cluster seeds to a SYSTAT file

Click More above for descriptions of the distance metrics

None Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally It continues splitting and reassigning the cases until k clusters are formed

First kConsiders the first k non-missing cases as initial seeds Last kConsiders the last k non-missing cases as initial seeds Random k Chooses randomly (without replacement) k non-missing cases as initial

seeds Random segmentation Assigns each case to any of k partitions randomly Computes

seeds from each initial partition taking the mean or the median of the observations whichever is applicable

  • Two Step Cluster Analysis
    • To Obtain a TwoStep Cluster Analysis
      • TwoStep Cluster Analysis Options
        • Advanced Options
          • TwoStep Cluster Analysis Output
          • The Cluster Viewer
          • Transpose Clusters and Features
          • Sort Features
          • Sort Clusters
          • Cell Contents
          • Cluster Comparison View
          • Navigating the Cluster Viewer
            • Using the Toolbars
            • Control Cluster View Display
              • Filtering Records
              • Hierarchical Cluster Analysis
              • Hierarchical Cluster Analysis Method
              • Hierarchical Cluster Analysis Measures for Interval Data
              • Hierarchical Cluster Analysis Measures for Count Data
              • Hierarchical Cluster Analysis Measures for Binary Data
              • Hierarchical Cluster Analysis Transform Values
              • Hierarchical Cluster Analysis Statistics
              • Hierarchical Cluster Analysis Plots
              • Hierarchical Cluster Analysis Save New Variables
              • Saving New Variables
              • K-Means Cluster Analysis
              • K-Means Cluster Analysis Efficiency
              • K-Means Cluster Analysis Iterate
              • K-Means Cluster Analysis Save
              • K-Means Cluster Analysis Options
              • Distances
                • To Obtain Distance Matrices
                  • Distances Dissimilarity Measures
                  • Distances Dissimilarity Measures for Interval Data
                  • Distances Dissimilarity Measures for Count Data
                  • Distances Dissimilarity Measures for Binary Data
                  • Distances Similarity Measures
                  • Distances Similarity Measures for Interval Data
                  • Distances Similarity Measures for Binary Data
                  • Distances Transform Values
Page 32: TwoStep Cluster Analysis

bull Standard deviation of 1 The procedure divides each value for the variable or case being standardized by the standard deviation of the values

Additionally you can choose how standardization is done Alternatives are By variable or By case

CLUSTER provides three procedures for clustering Hierarchical Clustering K-Clustering and Additive Trees The Hierarchical Clustering procedure comprises hierarchical linkage methods The K-Clustering procedure splits a set of objects into a selected number of groups by maximizing between-cluster variation and minimizing within-cluster variation The Additive Trees Clustering procedure produces a Sattath-Tversky additive tree clustering

Hierarchical Clustering clusters cases variables individually or both cases and variables simultaneously K-Clustering clusters cases only and Additive Trees clusters a similarity or dissimilarity matrix Several distance metrics are available with Hierarchical Clustering and K-Clustering including metrics for binary quantitative and frequency count data Hierarchical Clustering has ten methods for linking clusters and displays the results as a tree (dendrogram) or a polar dendrogram When the MATRIX option is used to cluster cases and variables SYSTAT uses a gray-scale or color spectrum to represent the values Resampling procedures are available only in Hierarchical Clustering

SYSTAT further provides five indices viz statistical criteria by which an appropriate number of clusters can be chosen from the Hierarchical Tree Options for cutting (or pruning) and coloring the hierarchical tree are also provided

In the K-Clustering procedure SYSTAT offers two algorithms KMEANS and KMEDIANS for partitioning Further SYSTAT provides nine methods for selecting initial seeds for both KMEANS and KMEDIANS

Hierarchical clustering produces hierarchical clusters that are displayed in a tree Initially each object (case or variable) is considered a separate cluster SYSTAT begins by joining the two ldquoclosestrdquo objects as a cluster and continues (in a stepwise manner) joining an object with another object an object with a cluster or a cluster with another cluster until all objects are combined into one cluster

You must select the elements of the data file to cluster (Join)

Rows Rows (cases) of the data matrix are clustered Columns Columns (variables) of the data matrix are clustered Matrix Rows and columns of the data matrix are clusteredmdashthey are permuted to bring

similar rows and columns next to one another

Linkage allows you to specify the type of joining algorithm used to amalgamate clusters (that is define how distances between clusters are measured)

Average Average linkage averages all distances between pairs of objects in different clusters to decide how far apart they are

Centroid Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters

Complete Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances This method tends to produce compact globular clusters If you use a similarity or dissimilarity matrix from a SYSTAT file you get Johnsonrsquos ldquomaxrdquo method

Flexibeta Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are You can choose the value of the weight β The range of β is between ndash1 and 1

K-nbdKth nearest neighborhood method is a density linkage method The estimated density is proportional to the number of cases in the smallest sphere containing the K th

nearest neighbor A new dissimilarity matrix is then constructed using the density estimate Finally the single linkage cluster analysis is performed You can specify the number k its range is between 1 and the total number of cases in the dataset

Median Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are

Single Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters This method tends to produce long stringy clusters If you use a SYSTAT file that contains a similarity or dissimilarity matrix you get clustering via Johnsonrsquos ldquominrdquo method

Uniform Uniform Kernel method is a density linkage method The estimated density is proportional to the number of cases in a sphere of radius r A new dissimilarity matrix is then constructed using the density estimate Finally single linkage cluster analysis is performed You can choose the number r its range is the positive real line

Ward Wardrsquos method averages all distances between pairs of objects in different clusters with adjustments for covariances to decide how far apart the clusters are

Weighted Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are The weights used are proportional to the size of the cluster

For some data some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances In these cases you may see stray branches that do not connect to others If this happens you should consider Single or Complete linkage For more information on these problems see Fisher and Van Ness (1971)

These reviewers concluded that these and other problems made Centroid Average Median and Ward (as well as K-Means) ldquoinadmissiblerdquo clustering procedures In practice and in Monte Carlo simulations however they sometimes perform better than Single and Complete linkage which Fisher and Van Ness considered ldquoadmissiblerdquo Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms Consult his paper for further details

In addition the following options can be specified

Distance Specifies the distance metric used to compare clusters Polar Produces a polar (circular) cluster tree

Save Save provides two options either to save cluster identifiers or to save cluster identifiers along with data You can specify the number of clusters to identify for the saved file If not specified two clusters are identified

In the Mahalanobis Tab you can specify the covariance matrix to compute Mahalanobis distance

Covariance matrix Specify the covariance matrix to compute the Mahalanobis distance Enter the covariance matrix either through the keyboard or from a SYSTAT file Otherwise by default SYSTAT computes the matrix from the data Select a grouping variable for inter-group distance measures

Cut cluster tree at You can choose the following options for cutting the cluster tree

Height Provides the option of cutting the cluster tree at a specified distance Leaf nodes Provides the option of cutting the cluster tree by number of leaf nodes

Color clusters by The colors in the cluster tree can be assigned by two different methods

Length of terminal node As you pass from node to node in order down the cluster tree the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1)

Proportion of total nodes Colors are assigned based on the proportion of members in a cluster

ValidityProvides five validity indices to evaluate the partition quality In particular it is used to find out the appropriate number of clusters for the given data set

RMSSTD Provides root-mean-square standard deviation of the clusters at each step in hierarchical clustering

Pseudo F Provides pseudo F-ratio for the clusters at each step in hierarchical clustering Pseudo T-square Provides pseudo T-square statistic for cluster assessment DB Provides Davies-Bouldinrsquos index for each hierarchy of clustering This index is

applicable for rectangular data only Dunn Provides Dunnrsquos cluster separation measure Maximum groups Performs the computation of indices up to this specified number of

clusters The default value is square root of number of objects

K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering Both clustering methods splits a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group

By default the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center It continues splitting one of the clusters into two (and reassigning cases) until a specified number of clusters are formed The reassigning of cases continue until the within-groups sum of

squares can no longer be reduced The initial seeds or partitions can be chosen from a possible set of nine options

Algorithm Provides K-Means and K-Medians clustering options

K-means Requests K-Means clustering K-medians Requests K-Medians clustering

Groups Enter the number of desired clusters Default number (Groups) is two

Iterations Enter the maximum number of iterations If not stated the maximum is 20

Distance Specifies the distance metric used to compare clusters

Save Save provides three options to save either cluster identifiers cluster identifiers along with data or final cluster seeds to a SYSTAT file

Click More above for descriptions of the distance metrics

None Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally It continues splitting and reassigning the cases until k clusters are formed

First kConsiders the first k non-missing cases as initial seeds Last kConsiders the last k non-missing cases as initial seeds Random k Chooses randomly (without replacement) k non-missing cases as initial

seeds Random segmentation Assigns each case to any of k partitions randomly Computes

seeds from each initial partition taking the mean or the median of the observations whichever is applicable

  • Two Step Cluster Analysis
    • To Obtain a TwoStep Cluster Analysis
      • TwoStep Cluster Analysis Options
        • Advanced Options
          • TwoStep Cluster Analysis Output
          • The Cluster Viewer
          • Transpose Clusters and Features
          • Sort Features
          • Sort Clusters
          • Cell Contents
          • Cluster Comparison View
          • Navigating the Cluster Viewer
            • Using the Toolbars
            • Control Cluster View Display
              • Filtering Records
              • Hierarchical Cluster Analysis
              • Hierarchical Cluster Analysis Method
              • Hierarchical Cluster Analysis Measures for Interval Data
              • Hierarchical Cluster Analysis Measures for Count Data
              • Hierarchical Cluster Analysis Measures for Binary Data
              • Hierarchical Cluster Analysis Transform Values
              • Hierarchical Cluster Analysis Statistics
              • Hierarchical Cluster Analysis Plots
              • Hierarchical Cluster Analysis Save New Variables
              • Saving New Variables
              • K-Means Cluster Analysis
              • K-Means Cluster Analysis Efficiency
              • K-Means Cluster Analysis Iterate
              • K-Means Cluster Analysis Save
              • K-Means Cluster Analysis Options
              • Distances
                • To Obtain Distance Matrices
                  • Distances Dissimilarity Measures
                  • Distances Dissimilarity Measures for Interval Data
                  • Distances Dissimilarity Measures for Count Data
                  • Distances Dissimilarity Measures for Binary Data
                  • Distances Similarity Measures
                  • Distances Similarity Measures for Interval Data
                  • Distances Similarity Measures for Binary Data
                  • Distances Transform Values
Page 33: TwoStep Cluster Analysis

Centroid Centroid linkage uses the average value of all objects in a cluster (the cluster centroid) as the reference point for distances to other objects or clusters

Complete Complete linkage uses the most distant pair of objects in two clusters to compute between-cluster distances This method tends to produce compact globular clusters If you use a similarity or dissimilarity matrix from a SYSTAT file you get Johnsonrsquos ldquomaxrdquo method

Flexibeta Flexible beta linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are You can choose the value of the weight β The range of β is between ndash1 and 1

K-nbdKth nearest neighborhood method is a density linkage method The estimated density is proportional to the number of cases in the smallest sphere containing the K th

nearest neighbor A new dissimilarity matrix is then constructed using the density estimate Finally the single linkage cluster analysis is performed You can specify the number k its range is between 1 and the total number of cases in the dataset

Median Median linkage uses the median distances between pairs of objects in different clusters to decide how far apart they are

Single Single linkage defines the distance between two objects or clusters as the distance between the two closest members of those clusters This method tends to produce long stringy clusters If you use a SYSTAT file that contains a similarity or dissimilarity matrix you get clustering via Johnsonrsquos ldquominrdquo method

Uniform Uniform Kernel method is a density linkage method The estimated density is proportional to the number of cases in a sphere of radius r A new dissimilarity matrix is then constructed using the density estimate Finally single linkage cluster analysis is performed You can choose the number r its range is the positive real line

Ward Wardrsquos method averages all distances between pairs of objects in different clusters with adjustments for covariances to decide how far apart the clusters are

Weighted Weighted average linkage uses a weighted average distance between pairs of objects in different clusters to decide how far apart they are The weights used are proportional to the size of the cluster

For some data some methods cannot produce a hierarchical tree with strictly increasing amalgamation distances In these cases you may see stray branches that do not connect to others If this happens you should consider Single or Complete linkage For more information on these problems see Fisher and Van Ness (1971)

These reviewers concluded that these and other problems made Centroid Average Median and Ward (as well as K-Means) ldquoinadmissiblerdquo clustering procedures In practice and in Monte Carlo simulations however they sometimes perform better than Single and Complete linkage which Fisher and Van Ness considered ldquoadmissiblerdquo Milligan (1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation of clustering algorithms Consult his paper for further details

In addition the following options can be specified

Distance Specifies the distance metric used to compare clusters Polar Produces a polar (circular) cluster tree

Save Save provides two options either to save cluster identifiers or to save cluster identifiers along with data You can specify the number of clusters to identify for the saved file If not specified two clusters are identified

In the Mahalanobis Tab you can specify the covariance matrix to compute Mahalanobis distance

Covariance matrix Specify the covariance matrix to compute the Mahalanobis distance Enter the covariance matrix either through the keyboard or from a SYSTAT file Otherwise by default SYSTAT computes the matrix from the data Select a grouping variable for inter-group distance measures

Cut cluster tree at You can choose the following options for cutting the cluster tree

Height Provides the option of cutting the cluster tree at a specified distance Leaf nodes Provides the option of cutting the cluster tree by number of leaf nodes

Color clusters by The colors in the cluster tree can be assigned by two different methods

Length of terminal node As you pass from node to node in order down the cluster tree the color changes when the length of a node on the distance scale changes between less than and greater than the specified length of terminal nodes (on a scale of 0 to 1)

Proportion of total nodes Colors are assigned based on the proportion of members in a cluster

ValidityProvides five validity indices to evaluate the partition quality In particular it is used to find out the appropriate number of clusters for the given data set

RMSSTD Provides root-mean-square standard deviation of the clusters at each step in hierarchical clustering

Pseudo F Provides pseudo F-ratio for the clusters at each step in hierarchical clustering Pseudo T-square Provides pseudo T-square statistic for cluster assessment DB Provides Davies-Bouldinrsquos index for each hierarchy of clustering This index is

applicable for rectangular data only Dunn Provides Dunnrsquos cluster separation measure Maximum groups Performs the computation of indices up to this specified number of

clusters The default value is square root of number of objects

K-Clustering dialog box provides the options for K-Means clustering and K-Medians clustering Both clustering methods splits a set of objects into a selected number of groups by maximizing between-cluster variation relative to the within-cluster variation It is similar to doing a one-way analysis of variance where the groups are unknown and the largest F value is sought by reassigning members to each group

By default, the algorithms start with one cluster and split it into two clusters by picking the case farthest from the center as a seed for a second cluster and assigning each case to the nearest center. They continue splitting one of the clusters into two (and reassigning cases) until the specified number of clusters is formed. The reassigning of cases continues until the within-groups sum of squares can no longer be reduced. The initial seeds or partitions can be chosen from a possible set of nine options.
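
The following is a rough NumPy sketch of that default splitting strategy, not SYSTAT's actual implementation; the function name and details are illustrative assumptions.

    # Rough sketch (NumPy): grow from one cluster to k by seeding each new
    # cluster with the case farthest from its current center, then reassign
    # cases to the nearest center until the solution stabilizes.
    import numpy as np

    def split_kmeans(X, k, iterations=20):
        centers = X.mean(axis=0, keepdims=True)      # start with one cluster
        labels = np.zeros(len(X), dtype=int)
        while len(centers) < k:
            d2 = ((X - centers[labels]) ** 2).sum(axis=1)
            centers = np.vstack([centers, X[np.argmax(d2)]])  # farthest case = seed
            for _ in range(iterations):              # reassign until stable
                labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
                new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                                else centers[j] for j in range(len(centers))])
                if np.allclose(new, centers):
                    break
                centers = new
        return centers, labels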

Algorithm. Provides K-Means and K-Medians clustering options.

K-means. Requests K-Means clustering.
K-medians. Requests K-Medians clustering.

Groups. Enter the number of desired clusters. The default number of groups is two.

Iterations. Enter the maximum number of iterations. If not stated, the maximum is 20.

Distance. Specifies the distance metric used to compare clusters.

Save. Save provides three options: to save cluster identifiers, cluster identifiers along with data, or final cluster seeds to a SYSTAT file.

Click More above for descriptions of the distance metrics.

None. Starts with one cluster and splits it into two clusters by picking the case farthest from the center as a seed for the second cluster and then assigning each case optimally. It continues splitting and reassigning the cases until k clusters are formed.

First k. Considers the first k non-missing cases as initial seeds.
Last k. Considers the last k non-missing cases as initial seeds.
Random k. Chooses k non-missing cases randomly (without replacement) as initial seeds.
Random segmentation. Assigns each case to one of k partitions randomly, then computes seeds from each initial partition, taking the mean or the median of the observations, whichever is applicable.
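
A hedged sketch of these seed-selection options in NumPy (names and behavior are illustrative assumptions, not SYSTAT syntax); passing np.median as the center function mirrors the K-Medians case.

    # Sketch (NumPy): the seed-selection options listed above. Assumes X holds
    # only non-missing cases and that random segmentation leaves no partition
    # empty.
    import numpy as np

    def initial_seeds(X, k, option="random k", center=np.mean, seed=None):
        rng = np.random.default_rng(seed)
        if option == "first k":
            return X[:k]                              # first k cases
        if option == "last k":
            return X[-k:]                             # last k cases
        if option == "random k":                      # k cases, no replacement
            return X[rng.choice(len(X), size=k, replace=False)]
        if option == "random segmentation":           # random partition, then
            part = rng.integers(0, k, size=len(X))    # mean/median per partition
            return np.array([center(X[part == j], axis=0) for j in range(k)])
        raise ValueError(f"unknown option: {option}")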
