
Transcript
Page 1: Cluster Analysis

Cluster Analysis

Hierarchical agglomerative cluster analysis

Use of a created cluster variable in secondary analysis

Cluster Analysis: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University


Page 2: Cluster Analysis

KEY CONCEPTS

Cluster Analysis

Research questions addressed by cluster analysis
Cluster analysis assumptions
Alternative names for cluster analysis
Caveats in using cluster analysis
Similarity/dissimilarity matrix, also called a distance matrix

• Squared Euclidean distance
• Euclidean distance
• Cosine of vector variables
• City block (Manhattan distance)
• Chebychev distance metric
• Distances in absolute power metric
• Pearson product-moment correlation coefficient
• Minkowski metric
• Mahalanobis D²
• Jaccard's coefficient(s)
• Gower's coefficient
• Simple matching coefficient

Cluster-seeking vs. cluster-imposing methods
Clustering algorithms

• Hierarchical Methods
- Agglomerative Methods

Single linkage (nearest neighbor)
Complete linkage (furthest neighbor)
Average linkage
Ward's error sum of squares
Centroid method
Median clustering

- Divisive Methods
K-means clustering
Trace methods
Splinter-average distance method
Automatic Interaction Detection (AID)

• Non-Hierarchical Methods
- Iterative Methods

Sequential threshold method
Parallel threshold method
Optimizing methods



Page 3: Cluster Analysis

KEY CONCEPTS (CONT.)

Factor Analysis
Q-Analysis

Density Methods
Multivariate probability approaches (NORMIX, NORMAP)

Clumping Methods
Graphic Methods

Glyphs & Metroglyphs
Fourier Series
Chernoff Faces

Agglomeration Schedule
Fusion coefficient
Alternative ways to determine the optimal number of clusters
Criteria: clusters as internally homogeneous and significantly different from each other
Dendrogram
Scaled distance
Cluster scores
Profiling clusters
Using a cluster variable as an IV or DV in secondary analysis
Sokal, Robert & Sneath, Peter, Principles of Numerical Taxonomy (1963)
Steps in cluster analysis

Variable selection, construction of the database, testing assumptions
Selecting a measure of similarity/distance
Selecting a clustering algorithm
Determining the number of clusters
Profiling the clusters
Validation


Page 4: Cluster Analysis

Cluster Analysis

Interdependency Technique

Designed to group a sample of subjects

Into significantly different groups

Based upon a number of variables

The groups are constructed to be as statistically different from one another as possible

And as internally homogeneous as possible

Assumptions

The sample needs to be representative of the population

Multicollinearity among the variables should be minimal

Absence of outliers & good N to k ratio


Page 5: Cluster Analysis

Cluster Analysis by Other Names

Similar techniques have been developed independently in various fields (e.g. biology, archeology, etc.), giving rise to different names for this statistical technique

Cluster Analysis

Numerical Taxonomy

Q-Analysis

Typology Analysis

Classification Analysis

There are a number of different clustering techniques depending upon …

The procedure used to measure the similarity or distance among subjects

And the clustering algorithm used.


Page 6: Cluster Analysis

Caveats in Using Cluster Analysis

There is no one best way to perform a cluster analysis

There are many methods and most lack rigorous statistical reasoning or proofs

Cluster analysis is used in different disciplines, which favor different techniques for:

Measuring the similarity or distance among subjects relative to the variables

And the clustering algorithm used

Different clustering techniques can produce different cluster solutions

Cluster analysis is supposed to be “cluster-seeking”, but in fact it is “cluster-imposing”


Page 7: Cluster Analysis

Applications of Cluster Analysis

Cluster analysis seeks to reduce a sample of cases to a few statistically different groups, i.e. clusters, based upon differences/similarities across a set of multiple variables

A useful tool for constructing typologies among cases

Example

Is each case filed with the court unique, or can cases be sorted into distinctly different types based upon the amount of evidence, the quality of the defense, the complexity of the charges, etc.?

Example

Is a murder a murder, or can cases be sorted into distinctively different types on the basis of victim/offender characteristics, circumstances, motives, etc.?


Page 8: Cluster Analysis

The Logic of Cluster Analysis

Step 1 Cluster analysis begins with an N x k database

Step 2 Using one of several methods, an N x N matrix is created that indicates the similarity (or dissimilarity) of every case to every other case, based on the k variables

Matrix of Dissimilarities

Subjects | 1 | 2 | 3 | … | N
1 | - | 1.782 | 2.538 | … | 47.236
2 | 1.782 | - | 0.821 | … | 39.902
3 | 2.538 | 0.821 | - | … | 41.652
… | … | … | … | … | …
N | 47.236 | 39.902 | 41.652 | … | -


Page 9: Cluster Analysis

The Logic of Cluster Analysis (cont.)

Step 3 Using one of several clustering algorithms, the subjects are sorted into significantly different groups where …

The subjects within each group are as homogeneous as possible, and …

The groups are as different from one another as possible


Page 10: Cluster Analysis

Measures of Similarity or Difference

Cluster analysis begins by creating a matrix indicating the similarity between (or the distance between) each pair of subjects relative to the k variables in the database.

There are a number of ways that this can be done.

Squared Euclidean Distance *
Euclidean Distance *
Cosine of Vector Variables *
City Block or Manhattan Distances *
Chebychev Distance Metric *
Distances in the Absolute Power Metric
Pearson Correlation Coefficient *
Mahalanobis D² *
Minkowski Metric *
Jaccard's Coefficient
Gower's Coefficient
Simple Matching Coefficient

* Available in SPSS


Page 11: Cluster Analysis

An Example of Squared Euclidean Distances

Variable | Subject 1 | Subject 2 | (S1 - S2) | (S1 - S2)²
X1 | 18 | 19 | -1 | 1
X2 | 15 | 17 | -2 | 4
X3 | 9 | 10 | -1 | 1
X4 | 12 | 10 | +2 | 4
X5 | 0 | 1 | -1 | 1
X6 | 1 | 1 | 0 | 0
X7 | 9 | 8 | +1 | 1
Totals | NA | NA | NA | 12

Squared Euclidean Distance = Σ(S1 - S2)² = 12
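To verify the arithmetic, here is a minimal sketch in Python (the slides themselves use SPSS); the two arrays simply restate the scores from the table above.

```python
import numpy as np

# Scores of the two subjects on the seven variables, taken from the table above
subject_1 = np.array([18, 15, 9, 12, 0, 1, 9])
subject_2 = np.array([19, 17, 10, 10, 1, 1, 8])

# Squared Euclidean distance: sum the squared differences across the k variables
diff = subject_1 - subject_2          # [-1, -2, -1, 2, -1, 0, 1]
print(np.sum(diff ** 2))              # 1 + 4 + 1 + 4 + 1 + 0 + 1 = 12
```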


Page 12: Cluster Analysis

A Variety of Clustering Algorithms

There is no proven best way to cluster subjects into homogeneous groups

Different techniques have been developed in different fields based upon different logics (e.g. biology, archeology, etc.)

Given the same database, similar clustering results can be achieved using different clustering algorithms, but not always.

Clustering algorithms are generally classified into two broad types …

Hierarchical methods

Non-hierarchical methods


Page 13: Cluster Analysis

Hierarchical Clustering Algorithms

Agglomerative Methods:
Single Linkage (Nearest Neighbor) *
Complete Linkage (Furthest Neighbor) *
Average Linkage *
Ward's Error Sum of Squares *
Centroid Method *
Median Clustering

Divisive Methods:
K-Means Clustering *
Trace Methods
Splinter-Average Distance Method
Automatic Interaction Detection (AID)

* Available in SPSS


Page 14: Cluster Analysis

Non-hierarchical Clustering Algorithms

Iterative Methods

Sequential Threshold Method
Parallel Threshold Method
Optimization Methods

Factor Analysis

Q-Factor Analysis

Density Methods

Multivariate Probability Approaches (NORMIX, NORMAP)

Clumping Methods

Graphic Methods

Glyphs
Metroglyphs
Fourier Series
Chernoff Faces


Page 15: Cluster Analysis

An Example of a Clustering Algorithm: Ward's Error Sum of Squares Algorithm

Imagine that data on seven variables (Xk) was gathered on 70 subjects (n)

Imagine further that a dissimilarity matrix was constructed indicating the differences among all pairs of subjects using squared Euclidean distances

Step 1 Ward's algorithm begins with each of 70 subjects in their own cluster

Step 2 Next it finds the two subjects that are most similar and creates a cluster with two subjects

Now there are 69 clusters, one with two subjects, and 68 with one subject each

Step 3 Now it finds the next two most similar subjects and creates a two-subject cluster

Now there are 68 clusters, two with two subjects each, and 66 with one subject each


Page 16: Cluster Analysis

An Example of a Clustering Algorithm: Ward's Error Sum of Squares Algorithm (cont.)

As Ward's algorithm progresses, it begins to combine single subjects into pre-existing clusters,

And then to combine one pre-existing cluster with another

This process is continued until all 70 subjects are finally combined into one cluster

Ward's algorithm forms clusters by selecting, at each step, the subject (or cluster, when clusters are being combined) whose merger produces the smallest increase in the within-cluster sum of squares (i.e. the error sum of squares)
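Purely as an illustration, the same agglomerative logic can be run with SciPy's implementation of Ward's method. The data below are a hypothetical stand-in for the 70 x 7 matrix of standardized scores, not the slide's actual database.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical stand-in for the 70 x 7 database of standardized scores
rng = np.random.default_rng(0)
X = rng.normal(size=(70, 7))

# Ward's method: at each step, merge the pair of clusters whose fusion produces
# the smallest increase in the total within-cluster (error) sum of squares
Z = linkage(X, method="ward")

# Z has one row per merge (69 rows for 70 subjects): the two clusters joined,
# the distance at which they were joined, and the size of the resulting cluster
print(Z.shape)    # (69, 4)
```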


Page 17: Cluster Analysis

A Seven Variable Example of Cluster Analysis

The database: 70 subjects and 7 variables

The variables

Sentence in years: sentence

Number of prior convictions: pr_conv

Degree of drug dependency: dr_score

Age: age

Age at first arrest: age_firs

Educational equivalency: educ_eqv

Level of work skill: skl_indx


Page 18: Cluster Analysis

Steps in the Cluster Analysis

Step 1 Transform the seven variables to standard scores, i.e. Z-scores

Step 2 Create a dissimilarity matrix using squared Euclidean distances

Squared Euclidean Distances

Subjects | 1 | 2 | 3 | … | 70
1 | - | 1.782 | 2.538 | … | 47.236
2 | 1.782 | - | 0.821 | … | 39.902
3 | 2.538 | 0.821 | - | … | 41.652
… | … | … | … | … | …
70 | 47.236 | 39.902 | 41.652 | … | -
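A minimal sketch of these two steps in Python, using a hypothetical 70 x 7 array in place of the slide's database; SciPy's pdist/squareform produce the same kind of squared Euclidean dissimilarity matrix shown above.

```python
import numpy as np
from scipy.stats import zscore
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for the 70 x 7 database (sentence, pr_conv, dr_score, ...)
rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=3.0, size=(70, 7))

# Step 1: convert each variable to Z-scores so no single variable dominates the distances
z = zscore(data, axis=0)

# Step 2: 70 x 70 matrix of squared Euclidean distances between every pair of cases
D = squareform(pdist(z, metric="sqeuclidean"))
print(D.shape)      # (70, 70); D[i, j] is the dissimilarity of case i+1 and case j+1
```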


Page 19: Cluster Analysis

Steps in the Cluster Analysis (cont.)

Step 3 Use Ward's algorithm to cluster the 70 subjects, beginning with 70 clusters of one subject each and terminating with one cluster containing all 70 subjects

Agglomeration Schedule

Stage | Cluster 1 | Cluster 2 | Coefficient | Cluster 1 First Appears | Cluster 2 First Appears | Next Stage
1 | 62 | 63 | .255 | 0 | 0 | 40
2 | 31 | 33 | .610 | 0 | 0 | 37
3 | 2 | 3 | 1.021 | 0 | 0 | 43
4 | 7 | 8 | 1.502 | 0 | 0 | 31
5 | 29 | 30 | 1.984 | 0 | 0 | 45
6 | 14 | 15 | 2.495 | 0 | 0 | 31
7 | 52 | 67 | 3.031 | 0 | 0 | 34
8 | 18 | 19 | 3.588 | 0 | 0 | 49
9 | 46 | 47 | 4.191 | 0 | 0 | 35
10 | 27 | 28 | 4.803 | 0 | 0 | 44
11 | 36 | 40 | 5.437 | 0 | 0 | 33
12 | 9 | 13 | 6.095 | 0 | 0 | 49
13 | 48 | 49 | 6.760 | 0 | 0 | 51
14 | 32 | 38 | 7.435 | 0 | 0 | 42
15 | 20 | 21 | 8.128 | 0 | 0 | 39
16 | 22 | 64 | 8.844 | 0 | 0 | 39
17 | 35 | 39 | 9.580 | 0 | 0 | 52
18 | 5 | 12 | 10.324 | 0 | 0 | 36
19 | 23 | 24 | 11.093 | 0 | 0 | 29
20 | 57 | 59 | 11.878 | 0 | 0 | 32
21 | 37 | 43 | 12.702 | 0 | 0 | 42
22 | 6 | 10 | 13.551 | 0 | 0 | 55
23 | 1 | 4 | 14.439 | 0 | 0 | 28
24 | 11 | 45 | 15.358 | 0 | 0 | 46
25 | 41 | 44 | 16.284 | 0 | 0 | 33
26 | 55 | 56 | 17.220 | 0 | 0 | 41
27 | 51 | 66 | 18.237 | 0 | 0 | 48
28 | 1 | 50 | 19.329 | 23 | 0 | 47
29 | 17 | 23 | 20.483 | 0 | 19 | 38
30 | 54 | 69 | 21.732 | 0 | 0 | 41
31 | 7 | 14 | 23.076 | 4 | 6 | 46


Page 20: Cluster Analysis

Agglomeration Schedule (cont.)

Stage | Cluster 1 | Cluster 2 | Coefficient | Cluster 1 First Appears | Cluster 2 First Appears | Next Stage
32 | 57 | 58 | 24.425 | 20 | 0 | 53
33 | 36 | 41 | 25.784 | 11 | 25 | 40
34 | 52 | 53 | 27.173 | 7 | 0 | 51
35 | 42 | 46 | 28.626 | 0 | 9 | 58
36 | 5 | 16 | 30.251 | 18 | 0 | 54
37 | 31 | 34 | 32.018 | 2 | 0 | 62
38 | 17 | 68 | 33.905 | 29 | 0 | 59
39 | 20 | 22 | 35.806 | 15 | 16 | 57
40 | 36 | 62 | 37.855 | 33 | 1 | 56
41 | 54 | 55 | 39.918 | 30 | 26 | 50
42 | 32 | 37 | 42.118 | 14 | 21 | 52
43 | 2 | 65 | 44.428 | 3 | 0 | 47
44 | 25 | 27 | 46.758 | 0 | 10 | 45
45 | 25 | 29 | 49.344 | 44 | 5 | 59
46 | 7 | 11 | 52.395 | 31 | 24 | 54
47 | 1 | 2 | 55.709 | 28 | 43 | 63
48 | 26 | 51 | 59.223 | 0 | 27 | 61
49 | 9 | 18 | 62.772 | 12 | 8 | 57
50 | 54 | 70 | 66.383 | 41 | 0 | 65
51 | 48 | 52 | 70.076 | 13 | 34 | 60
52 | 32 | 35 | 73.798 | 42 | 17 | 58
53 | 57 | 60 | 77.659 | 32 | 0 | 65
54 | 5 | 7 | 81.736 | 36 | 46 | 55
55 | 5 | 6 | 86.189 | 54 | 22 | 64
56 | 36 | 61 | 90.955 | 40 | 0 | 66
57 | 9 | 20 | 97.853 | 49 | 39 | 60
58 | 32 | 42 | 105.430 | 52 | 35 | 62
59 | 17 | 25 | 114.736 | 38 | 45 | 67
60 | 9 | 48 | 125.105 | 57 | 51 | 61
61 | 9 | 26 | 136.517 | 60 | 48 | 63
62 | 31 | 32 | 150.461 | 37 | 58 | 68
63 | 1 | 9 | 167.695 | 47 | 61 | 64
64 | 1 | 5 | 194.756 | 63 | 55 | 66
65 | 54 | 57 | 222.045 | 50 | 53 | 67
66 | 1 | 36 | 258.210 | 64 | 56 | 68
67 | 17 | 54 | 298.955 | 59 | 65 | 69
68 | 1 | 31 | 361.556 | 66 | 62 | 69
69 | 1 | 17 | 483.000 | 68 | 67 | 0


Page 21: Cluster Analysis

Interpretation of the Agglomeration Schedule

Stage 1 Cases 62 and 63 are combined into a cluster. Now there is one cluster with two cases and 68 clusters with one case each, 69 total clusters, or 70 - 1 = 69

Coefficient The squared Euclidean distance at which these two cases were joined = 0.255, called the fusion coefficient

Next Stage The next stage at which one of these cases is joined to a cluster is Stage 40 when case 62 is joined to case 36

Stage 33 Cases 36 and 41 are joined at a distance of 25.784. At this stage 37 clusters have been formed (70 - 33 = 37)

Stage Cluster first Appears

Cluster 1 Notice that case 36 was previously joined with case 40 at Stage 11

Cluster 2 Again, notice that case 41 was previously joined with case 44 at Stage 25
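SciPy's linkage matrix carries essentially the same information as SPSS's agglomeration schedule, so a schedule-like listing can be printed from it. The sketch below again uses hypothetical data rather than the slide's database.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data and linkage matrix, as in the Ward's-method sketch above
Z = linkage(np.random.default_rng(0).normal(size=(70, 7)), method="ward")

# Each row of Z is one stage of the schedule: the two clusters combined and the
# fusion coefficient. Indices below 70 are original cases; index 70 + i refers to
# the cluster formed at stage i + 1 (the "stage cluster first appears" information).
for stage, (c1, c2, coeff, size) in enumerate(Z, start=1):
    print(f"Stage {stage:2d}: combine {int(c1):3d} and {int(c2):3d} "
          f"at coefficient {coeff:8.3f} (new cluster of {int(size)} cases)")
```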


Page 22: Cluster Analysis

Interpretation of the Agglomeration Schedule (cont.)

Next Stage The next stage at which one of these cases is joined to a cluster is Stage 40 when case 36 is joined with case 62

Stage 69 Case 1 is joined with case 17 at a squared Euclidean distance of 483.0, clearly two cases that are very dissimilar.

At Stage 69 all 70 cases have been included in a single cluster. Obviously this one cluster is a heterogeneous cluster, containing many very dissimilar cases.


Page 23: Cluster Analysis

How Do You Determine the Optimal Number of Clusters in the Final Solution?

In this example, Ward's algorithm yields solutions ranging from 70 clusters with one case each to a single cluster containing all 70 cases.

Somewhere in between these two extremes is an optimal number of clusters which best satisfies the following conditions …

The clusters are as internally homogeneous as possible (i.e. minimum within-cluster sum of squares)

And the various clusters are as different as possible

Determining the optimal number of clusters

Theory about the number of underlying groups

Ease of profiling the groups

Magnitude of change in the fusion coefficient

Dendrogram with rescaled distance measure
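Picking up the fusion-coefficient criterion from the list above: one rough check is to look at how much the coefficient grows over the final stages, since a sharp jump suggests that two quite dissimilar clusters were forced together. A sketch, again on hypothetical data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data and linkage matrix, as in the Ward's-method sketch above
Z = linkage(np.random.default_rng(0).normal(size=(70, 7)), method="ward")

fusion = Z[:, 2]          # fusion coefficients, one per stage (69 for 70 cases)
n = len(fusion) + 1       # number of cases

# The last ten stages: clusters remaining, coefficient, and its increase.
# The solution just before a sharp increase is a reasonable candidate.
for stage in range(n - 10, n):
    print(f"stage {stage:2d} -> {n - stage:2d} clusters, "
          f"coefficient {fusion[stage - 1]:8.3f}, "
          f"increase {fusion[stage - 1] - fusion[stage - 2]:8.3f}")
```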


Page 24: Cluster Analysis

What is a Dendrogram?

Hierarchical Cluster Analysis: Dendrogram using Ward Method
Rescaled Distance Cluster Combine (scale 0 to 25)

[SPSS dendrogram: a tree diagram joining the 70 cases into progressively larger clusters as the rescaled distance increases; the diagram does not reproduce legibly in plain text.]
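A dendrogram of the same kind can be drawn from a linkage matrix; the sketch below assumes matplotlib is available and again uses hypothetical data rather than the SPSS run shown in the slides.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data and linkage matrix, as in the Ward's-method sketch above
Z = linkage(np.random.default_rng(0).normal(size=(70, 7)), method="ward")

# The tree of merges: cases along the horizontal axis, the distance at which
# clusters were combined along the vertical axis
plt.figure(figsize=(12, 4))
dendrogram(Z, labels=[f"Case {i + 1}" for i in range(70)])
plt.ylabel("Distance at which clusters combine")
plt.tight_layout()
plt.show()
```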


Page 25: Cluster Analysis

What is a Dendrogram? (cont.)

The Scaled Distance

The fusion coefficient transformed to a scale ranging from 0 to 25

The Dendrogram

The dendrogram shows which cases were joined together into clusters, and at what distance, and at later stages, which clusters were joined together into larger clusters, and at what distance.

Interpretation

The point at which the "foothills" become the "mountain peaks" is probably the optimal number of clusters

Optimal Number of Clusters

A five-cluster solution appears about optimal


Page 26: Cluster Analysis

Computing a Five-Cluster Solution

Having hypothesized that a five-cluster solution may be optimal …

The next step is to compute a five-cluster solution and …

Save the cluster scores

Cluster scores

In this case, a cluster score is a number between 1 and 5 assigned to each case indicating the cluster to which a particular case has been assigned

5-Cluster Solution

This is accomplished by repeating the cluster analysis and specifying that five clusters are to be extracted and the cluster scores saved.
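With SciPy the equivalent step is to cut the tree so that five clusters remain and keep the resulting membership vector, which plays the same role as SPSS's saved cluster variable. A sketch on hypothetical data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data and linkage matrix, as in the Ward's-method sketch above
Z = linkage(np.random.default_rng(0).normal(size=(70, 7)), method="ward")

# Cut the hierarchy so that exactly five clusters remain; each case receives a
# cluster score from 1 to 5 identifying the cluster to which it was assigned
cluster_scores = fcluster(Z, t=5, criterion="maxclust")
print(cluster_scores[:10])      # scores for the first ten cases
```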


Page 27: Cluster Analysis

Saved Cluster Scores

Case | Cluster
1.0 | 1
2.0 | 1
3.0 | 1
4.0 | 1
5.0 | 1
6.0 | 1
7.0 | 1
8.0 | 1
9.0 | 1
10.0 | 1
11.0 | 1
12.0 | 1
13.0 | 1
14.0 | 1
15.0 | 1
16.0 | 1
17.0 | 2
18.0 | 1
19.0 | 1
20.0 | 1
21.0 | 1
22.0 | 1
23.0 | 2
24.0 | 2
25.0 | 2
26.0 | 1
27.0 | 2

………

46.0 | 3
47.0 | 3
48.0 | 1
49.0 | 1
50.0 | 1
51.0 | 1
52.0 | 1
53.0 | 1
54.0 | 5
55.0 | 5
56.0 | 5
57.0 | 5
58.0 | 5
59.0 | 5
60.0 | 5
61.0 | 4
62.0 | 4
63.0 | 4
64.0 | 1
65.0 | 1
66.0 | 1
67.0 | 1
68.0 | 2
69.0 | 5
70.0 | 5


Page 28: Cluster Analysis

Profiling the Five Clusters

One way to profile the characteristics of the five clusters is to compute the means of the seven variables for each of the five clusters

Ward Method: Cluster Means

Variable | Cluster 1 | Cluster 2
SENTENCE | 4.6 | 7.3
PR_CONV | 1.5 | 4.8
DR_SCORE | 7.5 | 5.7
AGE | 21.6 | 24.7
AGE_FIRS | 16.2 | 14.4
EDUC_EQV | 7.3 | 3.4
SKL_INDX | 6.0 | 2.8


Page 29: Cluster Analysis

Profiling the Five Clusters (cont.)

Ward Method: Cluster Means

Variable | Cluster 3 | Cluster 4 | Cluster 5
SENTENCE | 2.4 | 3.1 | 16.3
PR_CONV | .9 | .9 | 2.1
DR_SCORE | 3.3 | 3.0 | 8.1
AGE | 21.3 | 20.6 | 30.2
AGE_FIRS | 19.3 | 19.0 | 14.7
EDUC_EQV | 3.3 | 10.7 | 5.3
SKL_INDX | 2.5 | 8.1 | 3.8
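A sketch of this profiling step with pandas, using hypothetical data and cluster scores in place of the slide's; the column names follow the variable list on page 17.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stand-ins for the 70 x 7 raw-score database and the saved cluster scores
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=3.0, size=(70, 7))
cluster_scores = fcluster(linkage(data, method="ward"), t=5, criterion="maxclust")

cols = ["sentence", "pr_conv", "dr_score", "age", "age_firs", "educ_eqv", "skl_indx"]
df = pd.DataFrame(data, columns=cols)
df["cluster"] = cluster_scores

# Mean of each of the seven variables within each of the five clusters
print(df.groupby("cluster")[cols].mean().round(1))
```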


Page 30: Cluster Analysis

Ranking the Variable Means of the Five Clusters

Variable | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
Age | M | H | L | LL | HH
Age_Firs | M | L | HH | H | LL
Dr_Score | H | M | L | LL | HH
Educ_Eqv | H | L | LL | HH | M
Pr_Conv | M | HH | L | LL | H
Sentence | M | H | LL | L | HH
Skl_Indx | H | L | LL | HH | M

LL = lowest, L = low, M = median, H = high, HH = highest


Page 31: Cluster Analysis

Profile Descriptions of the Five Clusters

Cluster 1

Better educated drug users who are highly skilled workers, about median age

Cluster 2

Older offenders, unskilled, poorly educated with some history of drug use, career criminals serving long sentences

Cluster 3

Young 1st offenders, unskilled, poorly educated with little drug history, serving very short sentences

Cluster 4

Very young, highly educated, skilled 1st offenders serving short sentences, little history of drug use

Cluster 5

Severely drug dependent old offenders with long criminal careers serving very long sentences


Page 32: Cluster Analysis

Secondary Applications of the Results of a Cluster Analysis

Some statistical techniques, such as analysis of variance or discriminant analysis, require a priori categorical independent or dependent variables.

Cluster analysis allows us to create an empirically derived categorical variable wherein the groups or clusters are determined to be homogeneous and significantly different from each other.

Other statistical tests can then be conducted using the cluster variable as a categorical IV or DV.

Example

Do the five clusters of offenders differ significantly in the seriousness of the crime of which they were convicted? This is a one-way ANOVA problem.
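A sketch of that one-way ANOVA in Python; ser_indx and the cluster scores below are hypothetical stand-ins, so the numbers will not reproduce the SPSS output that follows.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical stand-ins: a seriousness score for each of 70 cases and their cluster scores
rng = np.random.default_rng(2)
ser_indx = rng.normal(loc=4.0, scale=2.0, size=70)
cluster_scores = np.repeat([1, 2, 3, 4, 5], 14)

# One-way ANOVA: do the five clusters differ in mean crime seriousness?
groups = [ser_indx[cluster_scores == c] for c in range(1, 6)]
F, p = f_oneway(*groups)
print(f"F = {F:.3f}, p = {p:.4f}")   # the slide's SPSS run reports F = 19.471, Sig. = .000
```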


Page 33: Cluster Analysis

Secondary Applications of the Results of a Cluster Analysis (cont.)

Univariate Analysis of Variance

Between-Subjects Factors

Ward Method | N
1 | 33
2 | 9
3 | 12
4 | 7
5 | 9

Tests of Between-Subjects Effects
Dependent Variable: SER_INDX

Source | Type III Sum of Squares | df | Mean Square | F | Sig.
Corrected Model | 152.593(a) | 4 | 38.148 | 19.471 | .000
Intercept | 853.296 | 1 | 853.296 | 435.527 | .000
CLU5_1 | 152.593 | 4 | 38.148 | 19.471 | .000
Error | 127.350 | 65 | 1.959 | |
Total | 1306.000 | 70 | | |
Corrected Total | 279.943 | 69 | | |

a. R Squared = .545 (Adjusted R Squared = .517)

Post Hoc Tests: Ward Method

Interpretation

There are significant mean differences in the crime seriousness of the offences committed by the five clusters of offenders.

Tukey's HSD test is used to determine which group mean differences are significant.
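A sketch of the same post hoc comparison with statsmodels' Tukey HSD implementation, again on hypothetical stand-ins for the seriousness index and the cluster scores.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical stand-ins, as in the ANOVA sketch above
rng = np.random.default_rng(2)
ser_indx = rng.normal(loc=4.0, scale=2.0, size=70)
cluster_scores = np.repeat([1, 2, 3, 4, 5], 14)

# Tukey's HSD: which pairs of cluster means differ significantly at alpha = .05?
tukey = pairwise_tukeyhsd(endog=ser_indx, groups=cluster_scores, alpha=0.05)
print(tukey.summary())
```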


Page 34: Cluster Analysis

Secondary Applications of the Results of a Cluster Analysis (cont.)

Multiple Comparisons
Dependent Variable: SER_INDX
Tukey HSD

(I) Ward Method | (J) Ward Method | Mean Difference (I-J) | Std. Error | Sig. | 95% CI Lower Bound | 95% CI Upper Bound
1 | 2 | -2.5152* | .5264 | .000 | -3.9921 | -1.0382
1 | 3 | 1.2348 | .4718 | .079 | -8.9081E-02 | 2.5588
1 | 4 | 1.3420 | .5825 | .157 | -.2923 | 2.9763
1 | 5 | -2.8485* | .5264 | .000 | -4.3254 | -1.3716
2 | 1 | 2.5152* | .5264 | .000 | 1.0382 | 3.9921
2 | 3 | 3.7500* | .6172 | .000 | 2.0182 | 5.4818
2 | 4 | 3.8571* | .7054 | .000 | 1.8779 | 5.8364
2 | 5 | -.3333 | .6598 | .987 | -2.1847 | 1.5181
3 | 1 | -1.2348 | .4718 | .079 | -2.5588 | 8.908E-02
3 | 2 | -3.7500* | .6172 | .000 | -5.4818 | -2.0182
3 | 4 | .1071 | .6657 | 1.000 | -1.7607 | 1.9750
3 | 5 | -4.0833* | .6172 | .000 | -5.8152 | -2.3515
4 | 1 | -1.3420 | .5825 | .157 | -2.9763 | .2923
4 | 2 | -3.8571* | .7054 | .000 | -5.8364 | -1.8779
4 | 3 | -.1071 | .6657 | 1.000 | -1.9750 | 1.7607
4 | 5 | -4.1905* | .7054 | .000 | -6.1697 | -2.2112
5 | 1 | 2.8485* | .5264 | .000 | 1.3716 | 4.3254
5 | 2 | .3333 | .6598 | .987 | -1.5181 | 2.1847
5 | 3 | 4.0833* | .6172 | .000 | 2.3515 | 5.8152
5 | 4 | 4.1905* | .7054 | .000 | 2.2112 | 6.1697

Based on observed means.
* The mean difference is significant at the .05 level.


Page 35: Cluster Analysis

Secondary Applications of the Results of a Cluster Analysis (cont.)

SER_INDX
Tukey HSD (a, b, c)

Ward Method | N | Subset 1 | Subset 2
4 | 7 | 2.1429 |
3 | 12 | 2.2500 |
1 | 33 | 3.4848 |
2 | 9 | | 6.0000
5 | 9 | | 6.3333
Sig. | | .196 | .982

Means for groups in homogeneous subsets are displayed. Based on Type III Sum of Squares. The error term is Mean Square(Error) = 1.959.
a. Uses Harmonic Mean Sample Size = 10.445.
b. The group sizes are unequal. The harmonic mean of the group sizes is used. Type I error levels are not guaranteed.
c. Alpha = .05.


Page 36: Cluster Analysis

Using the Categorical Cluster Variable as a Dependent Variable

Example

To what extent do the type of defense counsel, pretrial jail time, and time to case disposition predict differences among the five groups of offenders?

This is a discriminant analysis problem with the cluster variable as the DV. (If the cluster variable were used as the IV, this would be a MANOVA problem)

Discriminant analysis results

Three discriminant functions were extracted since there are 3 IVs, which is fewer than the 5 groups (the number of functions is g - 1 or k, whichever is smaller).

Only the 1st discriminant function is significant.

Z1 = -0.313 - 0.866(counsel) + 0.021(jail_tm) - 0.002(tm_disp)
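A sketch of this discriminant analysis with scikit-learn; the three predictors and the cluster scores below are hypothetical stand-ins, and scikit-learn's coefficients are scaled differently from SPSS's canonical coefficients, so the numbers will not match the output on the following pages.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical stand-ins for the three predictors and the five-cluster DV
rng = np.random.default_rng(3)
counsel = rng.integers(0, 3, size=70)          # type of defense counsel (coded)
jail_tm = rng.normal(60, 20, size=70)          # pretrial jail time in days
tm_disp = rng.normal(180, 50, size=70)         # time to case disposition in days
clusters = np.repeat([1, 2, 3, 4, 5], 14)      # saved cluster scores as the DV

X = np.column_stack([counsel, jail_tm, tm_disp])
lda = LinearDiscriminantAnalysis(n_components=3)   # min(k, g - 1) = min(3, 4) = 3 functions
scores = lda.fit_transform(X, clusters)            # cases projected onto the functions
print(lda.explained_variance_ratio_)               # share of discriminating variance per function
```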


Page 37: Cluster Analysis

Using the Cluster Variable as a Dependent Variable (cont.)

Discriminant

Group Statistics

Valid N (listwise) for COUNSEL, JAIL_TM, and TM_DISP in each group:

Ward Method | Unweighted N | Weighted N
1 | 33 | 33.000
2 | 9 | 9.000
3 | 12 | 12.000
4 | 7 | 7.000
5 | 9 | 9.000
Total | 70 | 70.000

Analysis 1: Summary of Canonical Discriminant Functions

Eigenvalues

Function | Eigenvalue | % of Variance | Cumulative % | Canonical Correlation
1 | .492(a) | 89.5 | 89.5 | .574
2 | .042(a) | 7.6 | 97.1 | .200
3 | .016(a) | 2.9 | 100.0 | .125

a. First 3 canonical discriminant functions were used in the analysis.

Wilks' Lambda

Test of Function(s) | Wilks' Lambda | Chi-square | df | Sig.
1 through 3 | .633 | 29.686 | 12 | .003
2 through 3 | .945 | 3.678 | 6 | .720
3 | .984 | 1.019 | 2 | .601


Page 38: Cluster Analysis

Using the Cluster Variable as a Dependent Variable (cont.)

Standardized Canonical Discriminant Function Coefficients

Variable | Function 1 | Function 2 | Function 3
COUNSEL | .549 | .863 | .523
JAIL_TM | -.627 | .807 | .607
TM_DISP | .102 | .384 | -.962

Structure Matrix

Variable | Function 1 | Function 2 | Function 3
JAIL_TM | -.867* | .488 | .103
COUNSEL | .848* | .455 | .271
TM_DISP | -.086 | .555 | -.827*

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions; variables ordered by absolute size of correlation within function.
* Largest absolute correlation between each variable and any discriminant function.

Canonical Discriminant Function Coefficients (unstandardized)

Variable | Function 1 | Function 2 | Function 3
COUNSEL | 1.235 | 1.943 | 1.176
JAIL_TM | -.016 | .020 | .015
TM_DISP | .004 | .015 | -.039
(Constant) | -.304 | -3.221 | 2.205

Functions at Group Centroids

Ward Method | Function 1 | Function 2 | Function 3
1 | .213 | -.115 | -9.76E-02
2 | -.803 | .140 | -2.89E-02
3 | .673 | .366 | 4.822E-02
4 | .618 | -.266 | .291
5 | -1.357 | -1.51E-03 | 9.600E-02

Unstandardized canonical discriminant functions evaluated at group means.


