
Cluster Analysis
Hal Whitehead, BIOL4062/5062


  • Slide 1
  • Cluster Analysis Hal Whitehead BIOL4062/5062
  • Slide 2
  • What is cluster analysis? Non-hierarchical cluster analysis: K-means. Hierarchical divisive cluster analysis. Hierarchical agglomerative cluster analysis; linkage: single, complete, average. Cophenetic correlation coefficient. Additive trees. Problems with cluster analyses.
  • Slide 3
  • Cluster analysis is classification: maximize within-cluster homogeneity (similar individuals within each cluster). It is the search for discontinuities: places to put divisions between clusters.
  • Slide 4
  • Discontinuities are generally present in taxonomy and in social organization; in community ecology, perhaps less clearly so.
  • Slide 5
  • Types of cluster analysis (working from a data, dissimilarity, or similarity matrix): non-hierarchical (K-means); hierarchical divisive (repeated K-means, network methods); hierarchical agglomerative (single linkage, average linkage, ...); additive trees.
  • Slide 6
  • Non-hierarchical clustering techniques: K-means. Uses a data matrix with Euclidean distances. Maximizes between-cluster variance for a given number of clusters, i.e., chooses clusters to maximize the F-ratio in a one-way MANOVA.
  • Slide 7
  • K-means works iteratively: 1. Choose the number of clusters. 2. Assign points to clusters, randomly or by some other clustering technique. 3. Move each point to the other clusters in turn: does between-cluster variance increase? 4. Repeat step 3 until no improvement is possible. (A sketch of this procedure follows.)
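A minimal sketch of that point-moving procedure, in Python with NumPy. The function names, the placeholder data, and the use of within-cluster sum of squares as the criterion are my assumptions (for fixed data, minimizing within-cluster sum of squares is equivalent to maximizing between-cluster variance):

```python
import numpy as np

def within_ss(X, labels, k):
    # Total within-cluster sum of squares; minimizing this maximizes
    # between-cluster variance, since the total variance is fixed.
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in range(k) if (labels == c).any())

def kmeans_moves(X, k, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))   # step 2: random initial assignment
    improved = True
    while improved:                            # step 4: repeat until no improvement
        improved = False
        for i in range(len(X)):                # step 3: try each point in each cluster
            for c in range(k):
                trial = labels.copy()
                trial[i] = c
                if within_ss(X, trial, k) < within_ss(X, labels, k):
                    labels, improved = trial, True
    return labels

X = np.random.default_rng(1).random((10, 2))   # placeholder data: 10 points in 2-D
print(kmeans_moves(X, 3))
```

In practice one would use a library routine (e.g. scikit-learn's KMeans, which runs Lloyd's algorithm, a related iterative scheme) rather than this brute-force loop.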
  • Slide 8
  • K-means with three clusters
  • Slide 9
  • Variable     Between SS   df   Within SS   df   F-ratio
    X            0.536         2   0.007        7   256.163
    Y            0.541         2   0.050        7    37.566
    ** TOTAL **  1.078         4   0.058       14
  • Slide 10
  • K-means with three clusters
    Cluster 1 of 3 contains 4 cases
      Members             Statistics
      Case     Distance   Variable  Minimum  Mean  Maximum  St.Dev.
      Case 1   0.02       X         0.41     0.45  0.49     0.04
      Case 2   0.11       Y         0.03     0.19  0.27     0.11
      Case 3   0.06
      Case 4   0.05
    Cluster 2 of 3 contains 4 cases
      Members             Statistics
      Case     Distance   Variable  Minimum  Mean  Maximum  St.Dev.
      Case 7   0.06       X         0.11     0.15  0.19     0.03
      Case 8   0.03       Y         0.61     0.70  0.77     0.07
      Case 9   0.02
      Case 10  0.06
    Cluster 3 of 3 contains 2 cases
      Members             Statistics
      Case     Distance   Variable  Minimum  Mean  Maximum  St.Dev.
      Case 5   0.01       X         0.77     0.77  0.78     0.01
      Case 6   0.01       Y         0.33     0.35  0.36     0.02
  • Slide 11
  • Disadvantages of K-means: it reaches an optimum, but not necessarily the global optimum; and the number of clusters must be chosen before the analysis. How many clusters? (A sketch of the usual workarounds follows.)
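Both problems are commonly handled by running many random restarts and scanning over candidate numbers of clusters. A hedged sketch using scikit-learn's KMeans on placeholder data (the "elbow" scan is a common heuristic, not something prescribed by these slides):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((50, 2))   # placeholder data
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=20, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares; look for an "elbow"
    # where adding another cluster stops helping much
    print(k, km.inertia_)
```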
  • Slide 12
  • Example: sperm whale codas, patterned series of clicks. A 5-click coda has four inter-click intervals (ic1-ic4), so the 5-click codas form a 681 x 4 data set.
  • Slide 13
  • 5-click codas: 93% of the variance falls in the first 2 principal components of the four inter-click intervals.
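For context, that reduction might be sketched as below; the array here is a random stand-in, since the actual 681 x 4 inter-click-interval data are not reproduced in this transcript:

```python
import numpy as np
from sklearn.decomposition import PCA

ici = np.random.default_rng(0).random((681, 4))   # stand-in for the 681 x 4 ICI matrix
pca = PCA(n_components=2)
scores = pca.fit_transform(ici)                   # coda scores on the first two PCs
print(pca.explained_variance_ratio_.sum())        # fraction of variance in 2 PCs
```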
  • Slide 14
  • 5-click codas: K-means with 10 clusters
  • Slide 15
  • Hierarchical Cluster Analysis Usually represented by: Dendrogram or tree-diagram
  • Slide 16
  • Hierarchical Cluster Analysis Hierarchical Divisive Cluster Analysis Hierarchical Agglomerative Cluster Analysis
  • Slide 17
  • Hierarchical divisive cluster analysis starts with all units in one cluster and successively splits them, by repeated use of K-means (or some other divisive technique) with two clusters. Either split, each time, the cluster with the greatest sum of squared distances, or split every cluster at each step. Hierarchical divisive methods are good techniques but, outside network analysis, are rarely used. (A sketch follows below.)
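A minimal sketch of the "split the worst cluster with 2-means" variant, assuming scikit-learn and a NumPy data matrix (the function name, stopping rule, and placeholder data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters):
    clusters = [np.arange(len(X))]     # start with all units in one cluster
    while len(clusters) < n_clusters:
        # pick the cluster with the greatest sum of squared distances to its mean
        wss = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(wss)))
        # split it in two with K-means, as described above
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[halves == 0], idx[halves == 1]]
    return clusters

X = np.random.default_rng(2).random((30, 2))   # placeholder data
print(divisive(X, 4))
```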
  • Slide 18
  • Hierarchical agglomerative cluster analysis starts with each individual unit occupying its own cluster; the clusters are then gradually merged until just one is left. These are the most common cluster analyses.
  • Slide 19
  • Hierarchical agglomerative cluster analysis works on a dissimilarity matrix (or the negative of a similarity matrix); the dissimilarities may be Euclidean, Penrose, or other distances. At each step: 1. There is a symmetric matrix of dissimilarities between clusters. 2. The two clusters with the least dissimilarity are merged. 3. The dissimilarity between the new (merged) cluster and all the others is calculated. Different techniques do step 3 in different ways:
  • Slide 20
  • Hierarchical Agglomerative Cluster Analysis

         A     B     C     D     E
    A    0
    B    0.35  0
    C    0.45  0.67  0
    D    0.11  0.45  0.57  0
    E    0.22  0.56  0.78  0.19  0

    First link A and D (smallest dissimilarity, 0.11):

         AD    B     C     E
    AD   0
    B    ?     0
    C    ?     0.67  0
    E    ?     0.56  0.78  0

    How do we calculate the new dissimilarities?
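SciPy's hierarchy module implements this merge loop for the standard update rules; a sketch on the A-E matrix above (the choice of SciPy, and of single linkage, is mine, not the slides'):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

D = np.array([[0.00, 0.35, 0.45, 0.11, 0.22],
              [0.35, 0.00, 0.67, 0.45, 0.56],
              [0.45, 0.67, 0.00, 0.57, 0.78],
              [0.11, 0.45, 0.57, 0.00, 0.19],
              [0.22, 0.56, 0.78, 0.19, 0.00]])   # the A-E dissimilarities above

Z = linkage(squareform(D), method='single')      # or 'complete', 'average', ...
dendrogram(Z, labels=list('ABCDE'))              # draw the tree
plt.show()
```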
  • Slide 21
  • Hierarchical Agglomerative Cluster Analysis: Single Linkage
    d(AD,B) = min{d(A,B), d(D,B)} = min{0.35, 0.45} = 0.35

         AD    B     C     E
    AD   0
    B    0.35  0
    C    ?     0.67  0
    E    ?     0.56  0.78  0
  • Slide 22
  • Hierarchical Agglomerative Cluster Analysis: Complete Linkage
    d(AD,B) = max{d(A,B), d(D,B)} = max{0.35, 0.45} = 0.45

         AD    B     C     E
    AD   0
    B    0.45  0
    C    ?     0.67  0
    E    ?     0.56  0.78  0
  • Slide 23
  • Hierarchical Agglomerative Cluster Analysis: Average Linkage
    d(AD,B) = mean{d(A,B), d(D,B)} = mean{0.35, 0.45} = 0.40

         AD    B     C     E
    AD   0
    B    0.40  0
    C    ?     0.67  0
    E    ?     0.56  0.78  0
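A quick check of the three update rules on the values from these slides:

```python
dAB, dDB = 0.35, 0.45          # d(A,B) and d(D,B) from the matrix above
print(min(dAB, dDB))           # single linkage   -> 0.35
print(max(dAB, dDB))           # complete linkage -> 0.45
print((dAB + dDB) / 2)         # average linkage  -> 0.40
```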
  • Slide 24
  • Hierarchical Agglomerative Cluster Analysis: Centroid Clustering (uses the data matrix, or a true distance matrix)

         V1    V2    V3
    A    0.11  0.75  0.33
    B    0.35  0.99  0.41
    C    0.45  0.67  0.22
    D    0.11  0.71  0.37
    E    0.22  0.56  0.78
    F    0.13  0.14  0.55
    G    0.55  0.90  0.21

    V1(AD) = mean{V1(A), V1(D)} (similarly for V2, V3):

         V1    V2    V3
    AD   0.11  0.73  0.35
    B    0.35  0.99  0.41
    C    0.45  0.67  0.22
    E    0.22  0.56  0.78
    F    0.13  0.14  0.55
    G    0.55  0.90  0.21
  • Slide 25
  • Hierarchical Agglomerative Cluster Analysis: Ward's Method. Minimizes the within-cluster sum of squares; similar to centroid clustering.
  • Slide 26
  • Example similarity matrix (lower triangle; diagonal = 1.00):

          1     2     4     5     9    11    12    14    15    19    20
     1   1.00
     2   0.00  1.00
     4   0.53  0.00  1.00
     5   0.18  0.05  0.00  1.00
     9   0.22  0.09  0.13  0.25  1.00
    11   0.36  0.00  0.17  0.40  0.33  1.00
    12   0.00  0.37  0.18  0.00  0.13  0.00  1.00
    14   0.74  0.00  0.30  0.20  0.23  0.17  0.00  1.00
    15   0.53  0.00  0.30  0.00  0.36  0.00  0.26  0.56  1.00
    19   0.00  0.00  0.17  0.21  0.43  0.32  0.29  0.09  0.09  1.00
    20   0.04  0.00  0.17  0.00  0.14  0.10  0.35  0.00  0.18  0.25  1.00
  • Slide 27
  • Slide 28
  • Hierarchical agglomerative clustering techniques compared. Single linkage: produces straggly clusters; not recommended if there is much experimental error; used in taxonomy; invariant to transformations of the dissimilarity measure. Complete linkage: produces tight clusters; not recommended if there is much experimental error; invariant to transformations. Average linkage, centroid, and Ward's: most likely to mimic the input clusters; not invariant to transformations of the dissimilarity measure.
  • Slide 29
  • Cophenetic Correlation Coefficient (CCC): the correlation between the original dissimilarity matrix and the dissimilarities inferred from the cluster analysis. CCC >~ 0.8 indicates a good match.
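SciPy can compute the CCC from a linkage result and the condensed form of the original dissimilarities; a sketch on the A-E example used earlier (the use of SciPy and of average linkage here is my choice):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet

# condensed form of the A-E matrix: (A,B), (A,C), (A,D), (A,E), (B,C), ...
d = np.array([0.35, 0.45, 0.11, 0.22, 0.67, 0.45, 0.56, 0.57, 0.78, 0.19])
Z = linkage(d, method='average')
ccc, _ = cophenet(Z, d)
print(ccc)   # values above roughly 0.8 suggest the tree matches the data well
```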
