# cluster analysis

Post on 20-Nov-2014


Cluster Analysis

- Hierarchical agglomerative cluster analysis
- Use of a created cluster variable in secondary analysis

Cluster Analysis: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University

KEY CONCEPTS: Cluster Analysis

- Research questions addressed by cluster analysis
- Cluster analysis assumptions
- Alternative names for cluster analysis
- Caveats in using cluster analysis
- Similarity/dissimilarity matrix, also called a distance matrix
  - Squared Euclidean distance
  - Euclidean distance
  - Cosine of vector variables
  - City block (Manhattan distance)
  - Chebychev distance metric
  - Distances in absolute power metric
  - Pearson product-moment correlation coefficient
  - Minkowski metric
  - Mahalanobis D²
  - Jaccard's coefficient(s)
  - Gower's coefficient
  - Simple matching coefficient
- Cluster-seeking vs. cluster-imposing methods
- Clustering algorithms
  - Hierarchical Methods
    - Agglomerative Methods
      - Single linkage (nearest neighbor)
      - Complete linkage (furthest neighbor)
      - Average linkage
      - Ward's error sum of squares
      - Centroid method
      - Median clustering
    - Divisive Methods
      - K-means clustering
      - Trace methods
      - A Splinter-Average Distance method
      - Automatic Interaction Detection (AID)
  - Non-Hierarchical Methods
    - Iterative Methods
      - Sequential threshold method
      - Parallel threshold method
      - Optimizing methods

KEY CONCEPTS (CONT.)


- Factor Analysis
  - Q-Analysis
- Density Methods
  - Multivariate probability approaches (NORMIX, NORMAP)
- Clumping Methods
- Graphic Methods
  - Glyphs & Metroglyphs
  - Fourier Series
  - Chernoff Faces
- Agglomeration Schedule
- Fusion coefficient
- Alternative ways to determine the optimal number of clusters
- Criteria: clusters as internally homogeneous and significantly different from each other
- Dendrogram
- Scaled distance
- Cluster scores
- Profiling clusters
- Using a cluster variable as an IV or DV in secondary analysis
- Sokal, Robert & Sneath, Peter, Principles of Numerical Taxonomy (1963)
- Steps in cluster analysis
  - Variable selection, construction of data base, testing assumptions
  - Selecting measure of similarity/distance
  - Selecting clustering algorithm
  - Determining number of clusters
  - Profile clusters
  - Validation


Cluster Analysis: Interdependency Technique

- Designed to group a sample of subjects into significantly different groups, based upon a number of variables
- The groups are constructed to be as different as statistically possible and as internally homogeneous as statistically possible

Assumptions

- The sample needs to be representative of the population
- Multicollinearity among the variables should be minimal
- Absence of outliers and a good N-to-k ratio (cases to variables)


Cluster Analysis by Other Names

Similar techniques have been independently developed in various fields (e.g. biology, archeology, etc.), giving rise to different names for this statistical technique:

- Cluster Analysis
- Numerical Taxonomy
- Q-Analysis
- Typology Analysis
- Classification Analysis

There are a number of different clustering techniques, depending upon:

- The procedure used to measure the similarity or distance among subjects
- The clustering algorithm used

Caveats in Using Cluster Analysis

- There is no one best way to perform a cluster analysis
- There are many methods, and most lack rigorous statistical reasoning or proofs
- Cluster analysis is used in different disciplines, which favor different techniques for:
  - Measuring the similarity or distance among subjects relative to the variables
  - The clustering algorithm used
- Different clustering techniques can produce different cluster solutions
- Cluster analysis is supposed to be cluster-seeking, but in fact it is cluster-imposing


Applications of Cluster Analysis

- Cluster analysis seeks to reduce a sample of cases to a few statistically different groups, i.e. clusters, based upon differences/similarities across a set of multiple variables
- A useful tool for constructing typologies among cases
- Example: Is each case filed with the court unique, or can cases be sorted into distinctly different types based upon the amount of the evidence, quality of the defense, complexity of the charges, etc.?
- Example: Is a murder a murder, or can cases be sorted into distinctly different types on the basis of victim/offender characteristics, circumstances, motives, etc.?


The Logic of Cluster Analysis

Step 1: Cluster analysis begins with an N x k database (N cases measured on k variables).

Step 2: Using one of several methods, an N x N matrix is created that indicates the similarity (or dissimilarity) of every case to every other case, based on the k variables.

Matrix of Dissimilarities

| Subjects | 1 | 2 | 3 | N |
|---|---|---|---|---|
| 1 | | 1.782 | 2.538 | 47.236 |
| 2 | 1.782 | | 0.821 | 39.902 |
| 3 | 2.538 | 0.821 | | 41.652 |
| N | 47.236 | 39.902 | 41.652 | |
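Steps 1 and 2 can be illustrated with a minimal sketch. It assumes squared Euclidean distance as the dissimilarity measure, and the four-case, three-variable database is hypothetical, invented only for illustration; the function names are not from the slides.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two cases across all k variables."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dissimilarity_matrix(cases):
    """Build the symmetric N x N matrix of pairwise dissimilarities."""
    n = len(cases)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = squared_euclidean(cases[i], cases[j])
            matrix[i][j] = d
            matrix[j][i] = d  # dissimilarity of i to j equals that of j to i
    return matrix


# Hypothetical N x k database: N = 4 cases, k = 3 variables.
cases = [
    [1.0, 2.0, 3.0],
    [1.5, 2.5, 3.5],
    [1.0, 2.0, 4.0],
    [9.0, 8.0, 7.0],
]

matrix = dissimilarity_matrix(cases)
for row in matrix:
    print(row)
```

As in the slide's matrix, the result is symmetric, and a case far from the others (the last one here) shows up as a row of large values.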

The Logic of Cluster Analysis (cont.)

Step 3: Using one of several clustering algorithms, the subjects are sorted into significantly different groups where:

- The subjects within each group are as homogeneous as possible, and
- The groups are as different from one another as possible
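One common way to put numbers on these two goals is to compare the within-group and between-group sums of squares. The slides do not prescribe this measure, so the sketch below is an assumption, and the two small groups are hypothetical data.

```python
def centroid(cases):
    """Mean of each variable across the cases in one group."""
    n = len(cases)
    return [sum(col) / n for col in zip(*cases)]


def sum_of_squares(cases, center):
    """Total squared Euclidean distance from each case to a center point."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(case, center)) for case in cases
    )


def within_and_between(groups):
    """Return (within-group SS, between-group SS) for a list of groups."""
    all_cases = [case for group in groups for case in group]
    grand = centroid(all_cases)
    # Homogeneity: how far cases sit from their own group centroid.
    within = sum(sum_of_squares(g, centroid(g)) for g in groups)
    # Separation: how far group centroids sit from the grand centroid.
    between = sum(
        len(g) * sum((c - m) ** 2 for c, m in zip(centroid(g), grand))
        for g in groups
    )
    return within, between


# Hypothetical solution: two groups of two cases, two variables each.
groups = [
    [[1.0, 1.0], [1.0, 3.0]],
    [[9.0, 9.0], [9.0, 11.0]],
]

within, between = within_and_between(groups)
print(within, between)
```

A small within-group value relative to the between-group value suggests a solution whose groups are internally homogeneous and well separated.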


Measures of Similarity or Difference

Cluster analysis begins by creating a matrix indicating the similarity between (or the distance between) each pair of subjects relative to the k variables in the database. There are a number of ways that this can be done.

Technique

- Squared Euclidean Distance *
- Euclidean Distance *
- Cosine of Vector Variables *
- City Block or Manhattan Distances *
- Chebychev Distance Metric *
- Distances in the Absolute Power Metric
- Pearson Correlation Coefficient *
- Mahalanobis D² *
- Minkowski Metric *
- Jaccard's Coefficient
- Gower's Coefficient
- Simple Matching Coefficient

* Available in SPSS
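A few of these measures can be sketched in a handful of lines. The definitions below are the standard textbook ones, not taken from the slides, and the function names and test vectors are illustrative assumptions. Note that the cosine is a similarity (larger means more alike), while the others are distances.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    """City block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebychev(a, b):
    """Chebychev distance: the largest absolute difference on any variable."""
    return max(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p):
    """Minkowski metric; p = 1 gives city block, p = 2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def cosine(a, b):
    """Cosine of the angle between two cases treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical cases measured on two variables.
a = [0.0, 0.0]
b = [3.0, 4.0]

print(euclidean(a, b))      # 5.0
print(city_block(a, b))     # 7.0
print(chebychev(a, b))      # 4.0
print(minkowski(a, b, 1))   # 7.0
```

The same pair of cases is 5, 7, or 4 units apart depending on the measure, which is one reason different measures can lead to different cluster solutions.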


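An Example of Squared Euclidean Distances

| Variables | Subject 1 | Subject 2 | (Si - Sj) | (Si - Sj)² |
|---|---|---|---|---|
| X1 | 18 | 19 | -1 | 1 |
| X2 | 15 | 17 | -2 | 4 |
| X3 | 9 | 10 | -1 | 1 |
| X4 | 12 | 10 | +2 | 4 |
| X5 | 0 | 1 | -1 | 1 |
| X6 | 1 | 1 | 0 | 0 |
| X7 | 9 | 8 | +1 | 1 |
| Totals | NA | NA | NA | 12 |

Squared Euclidean Distance = $\sum (S_i - S_j)^2 = 12$, summed over the seven variables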
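The arithmetic in this example can be checked with a short sketch that uses the two subjects' scores from the slide; the variable names are illustrative.

```python
# Scores of the two subjects on X1..X7, as given in the example.
subject_1 = [18, 15, 9, 12, 0, 1, 9]
subject_2 = [19, 17, 10, 10, 1, 1, 8]

# Difference on each variable, then its square.
differences = [a - b for a, b in zip(subject_1, subject_2)]
squared = [d ** 2 for d in differences]

# The squared Euclidean distance is the sum of the squared differences.
distance = sum(squared)

print(differences)  # [-1, -2, -1, 2, -1, 0, 1]
print(squared)      # [1, 4, 1, 4, 1, 0, 1]
print(distance)     # 12
```

Because each difference is squared, its sign does not affect the result; only its size does.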


A Variety of Clustering Algorithms

- There is no proven best way to cluster subjects into homogeneous groups
- Different techniques have been developed in different fields based upon different logics (e.g. biology, archeology, etc.)
- Given the same database, similar clustering results can be achieved using different clustering algorithms, but not always
- Clustering algorithms are generally classified into two broad types:
  - Hierarchical methods
  - Non-hierarchical methods


Hierarchical Clustering Algorithms

Agglomerative Methods

- Single Linkage (Nearest Neighbor) *
- Complete Linkage (Furthest Neighbor) *
- Average Linkage *
- Ward's Error Sum of Squares *
- Centroid Method *
- Median Clustering

Divisive Methods

- K-Means Clustering *
- Trace Methods
- A-Splinter-Average Distance Method
- Automatic Interaction Detection (AID)

* Available in SPSS
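Several of the agglomerative methods differ only in how they define the distance between two clusters. The sketch below uses the standard definitions, assumes Euclidean distance between cases, and takes average linkage to mean the between-groups average; the two clusters are hypothetical data.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise(cluster_a, cluster_b):
    """All distances between a case in one cluster and a case in the other."""
    return [euclidean(a, b) for a in cluster_a for b in cluster_b]


def single_linkage(cluster_a, cluster_b):
    """Nearest neighbor: distance between the two closest cases."""
    return min(pairwise(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Furthest neighbor: distance between the two most distant cases."""
    return max(pairwise(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean of all between-cluster pairwise distances."""
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / len(distances)


def centroid_distance(cluster_a, cluster_b):
    """Centroid method: distance between the two cluster means."""
    mean_a = [sum(col) / len(cluster_a) for col in zip(*cluster_a)]
    mean_b = [sum(col) / len(cluster_b) for col in zip(*cluster_b)]
    return euclidean(mean_a, mean_b)


# Hypothetical clusters of two cases each, measured on two variables.
cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[4.0, 0.0], [6.0, 0.0]]

print(single_linkage(cluster_a, cluster_b))     # 3.0
print(complete_linkage(cluster_a, cluster_b))   # 6.0
print(average_linkage(cluster_a, cluster_b))    # 4.5
print(centroid_distance(cluster_a, cluster_b))  # 4.5
```

At each step an agglomerative method merges the pair of clusters with the smallest such distance, so the choice of definition can change which clusters are joined.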


Non-hierarchical Clustering Algorithms

- Iterative Methods
  - Sequential Threshold Method
  - Parallel Threshold Method
  - Optimization Methods
- Factor Analysis
  - Q-Factor Analysis
- Density Methods
  - Multivariate Probability Approaches
    - NORMIX
    - NORMAP
- Clumping Methods
- Graphic Methods
  - Glyphs
  - Metroglyphs
  - Fourier Series
  - Chernoff Faces


An Example of a Clustering Algorithm: Ward's Error Sum of Squares Algorithm

Imagine that data on seven variables (Xk) was gathered on 70 subjects (n). Imagine further that a dissimilarity matrix was constructed indicating the differences among all pairs of subjects using squared Euclidean distances.

Step 1: Ward's algorithm begins with each of the 70 subjects in its own cluster.

Step 2: Next it finds the two subjects that are most similar and creates a cluster with two subjects. Now there are 69 clusters, one with two subjects and 68 with one subject each.

Step 3: Now it finds the next two most similar subjects and creates a two-subject cluster. Now there are 68 clusters, two with two subjects each and 66 with one subject each.

An Example of a Clustering Algorithm: Ward's Error Sum of Squares Algorithm (cont.)

- As Ward's algorithm progresses, it will begin to combine a single subject into a pre-existing cluster, and then begins to combine one pre-existing cluster with another
- This process is continued until all 70 subjects are finally combined into one cluster
- Ward's algorithm forms clusters by selecting that subject (or another cluster if comb
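The merging process described above can be sketched as follows. Because the slide text is cut off before it states the selection rule, this sketch assumes the standard Ward criterion: at each step, merge the pair of clusters whose union gives the smallest increase in the error sum of squares. The five cases are hypothetical, and the function names are illustrative.

```python
def centroid(cases):
    """Mean of each variable across the cases in a cluster."""
    n = len(cases)
    return [sum(col) / n for col in zip(*cases)]


def ward_increase(cluster_a, cluster_b):
    """Increase in the error sum of squares if the two clusters are merged."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    na, nb = len(cluster_a), len(cluster_b)
    return na * nb / (na + nb) * sq_dist


def ward_clustering(cases):
    """Merge clusters until one remains; return (members, cumulative ESS)."""
    clusters = [[i] for i in range(len(cases))]  # each subject starts alone
    schedule = []
    ess = 0.0
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                a = [cases[i] for i in clusters[x]]
                b = [cases[i] for i in clusters[y]]
                increase = ward_increase(a, b)
                if best is None or increase < best[0]:
                    best = (increase, x, y)
        increase, x, y = best
        ess += increase
        merged = clusters[x] + clusters[y]
        # Delete the higher index first so the lower index stays valid.
        del clusters[y]
        del clusters[x]
        clusters.append(merged)
        schedule.append((sorted(merged), ess))
    return schedule


# Hypothetical data: five subjects measured on two variables.
cases = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [8.0, 9.0], [20.0, 20.0]]

schedule = ward_clustering(cases)
for members, ess in schedule:
    print(members, round(ess, 2))
```

The running total printed at each step plays a role similar to the fusion coefficient in an agglomeration schedule: a large jump between consecutive steps suggests that two dissimilar clusters were forced together.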
