Clustering: Hierarchical - Applied Multivariate Analysis
Lecturer: Darren Homrighausen, PhD
From K-means to hierarchical clustering

Recall two properties of K-means clustering:

1. It fits exactly K clusters.
2. Final clustering assignments depend on the chosen initial cluster centers.

Alternatively, we can use hierarchical clustering. This has the advantage that

1. There is no need to choose the number of clusters beforehand.
2. There is no random component (nor choice of starting point).
From K-means to hierarchical clustering

There is a catch: we need to choose a way to measure dissimilarity between clusters, called the linkage.

Given the linkage, hierarchical clustering produces a sequence of clustering assignments.
At one end, all points are in their own cluster.
At the other, all points are in one cluster.
In the middle, there are nontrivial solutions.
Agglomerative vs. divisive
Two types of hierarchical clustering algorithms:

Agglomerative: Start with each point in its own cluster. Merge until all points are in the same cluster.
(ie: bottom-up) (think of forward selection)

Divisive: Start with all points in one cluster. Repeatedly split a group into the two parts that are most dissimilar, until every point is in its own cluster.
(ie: top-down) (think of backwards selection)

Agglomerative methods are simpler, so we'll focus on them.
Agglomerative example
Given these data points, an agglomerative algorithm might decide on the following clustering sequence:
(Important: Different choices of linkage would result in different solutions)

[Scatterplot of seven points labeled 1-7 in the (x1, x2) plane]
1. {1}, {2}, {3}, {4}, {5}, {6}, {7}
2. {1, 2}, {3}, {4}, {5}, {6}, {7}
3. {1, 2}, {3}, {5}, {6}, {4, 7}
4. {1, 2, 6}, {3}, {5}, {4, 7}
5. {1, 2, 4, 6, 7}, {3}, {5}
6. {1, 2, 3, 4, 6, 7}, {5}
7. {1, 2, 3, 4, 5, 6, 7}
Reminder: What's a dendrogram?

We encountered dendrograms when we talked about classification and regression trees.

Dendrogram: A convenient graphic to display a hierarchical sequence of clustering assignments. This is simply a tree where:

• Each branch represents a group
• Each leaf (terminal) node is a singleton (ie: a group containing a single data point)
• The root node is a group containing the whole data set
• Each internal node has two daughter nodes (children), representing the groups that were merged to form it

Remember: the choice of linkage determines how we measure dissimilarity between groups.

Each internal node is drawn at a height proportional to the linkage distance between its two daughter nodes.
We can also represent the sequence of clustering assignments as a dendrogram

[Scatterplot of the seven points alongside the corresponding dendrogram, produced by hclust (*, "complete"); the vertical axis is Height]
Note that cutting the dendrogram horizontally partitions the data points into clusters.
Back to the example
[The same scatterplot and complete-linkage dendrogram as on the previous slide]
For instance, the linkage distance between the cluster {4, 7} and the cluster {1, 2, 6} is about 0.65.
Linkages
Notation: Define X1, . . . , Xn to be the data.

Let dij be the dissimilarity between each pair Xi, Xj.

At any level, a clustering assignment can be expressed by sets G = {i1, i2, . . . , ir} giving the indices of the points in that group. Define |G| to be the size of G.

Linkage: The function d(G, H) that takes two groups G, H and returns the linkage distance between them.
Agglomerative clustering, given the linkage:

• Start with each point in its own group
• Until there is only one cluster, repeatedly merge the two groups G, H that minimize d(G, H)
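The loop above can be sketched in a few lines. This is a toy Python illustration (not the R code used later in these slides); the dissimilarities are made up, and single linkage is plugged in as the group dissimilarity:

```python
from itertools import combinations

def single(G, H, d):
    # smallest dissimilarity between a point in G and a point in H
    return min(d[frozenset({i, j})] for i in G for j in H)

def agglomerate(d, n, linkage, k=1):
    """Repeatedly merge the two groups minimizing the linkage distance,
    stopping when k groups remain (k=1 runs the full hierarchy)."""
    groups = [(i,) for i in range(n)]
    while len(groups) > k:
        a, b = min(combinations(groups, 2),
                   key=lambda gh: linkage(gh[0], gh[1], d))
        groups.remove(a)
        groups.remove(b)
        groups.append(tuple(sorted(a + b)))
    return sorted(groups)

# made-up dissimilarities: points 0,1 are close, and so are points 2,3
d = {frozenset({0, 1}): 0.1, frozenset({0, 2}): 0.9,
     frozenset({0, 3}): 1.0, frozenset({1, 2}): 0.95,
     frozenset({1, 3}): 1.05, frozenset({2, 3}): 0.2}
print(agglomerate(d, 4, single, k=2))  # [(0, 1), (2, 3)]
```

Note that only the dissimilarities enter the loop; the points themselves never do.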
Single linkage

In single linkage (a.k.a. nearest-neighbor linkage), the linkage distance between G, H is the smallest dissimilarity between two points in different groups:

d_single(G, H) = min_{i ∈ G, j ∈ H} d_ij

Example: There are two clusters G and H (red and blue). The single linkage distance (i.e. d_single(G, H)) is the dissimilarity between the closest pair (length of the black line segment).
[Scatterplot of two clusters with the closest cross-cluster pair joined by a black segment labeled d(G, H)]
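A quick numeric check of the definition, as a Python sketch with made-up dissimilarities: the linkage is just a minimum over all cross-group pairs.

```python
def d_single(G, H, d):
    # d_single(G, H) = min over i in G, j in H of d_ij
    return min(d[i, j] for i in G for j in H)

# hypothetical dissimilarities between groups G = {0, 1} and H = {2, 3}
d = {(0, 2): 1.5, (0, 3): 0.4, (1, 2): 2.0, (1, 3): 0.9}
print(d_single((0, 1), (2, 3), d))  # 0.4, the closest cross-group pair
```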
Single linkage example

Here n = 60, Xi ∈ R2, dij = ||Xi − Xj||2. Cutting the tree at h = 0.8 gives the cluster assignments marked by colors.
[Single-linkage dendrogram from hclust (*, "single") and the corresponding scatterplot of the 60 points in the (X1, X2) plane, colored by cluster]
Cut interpretation: For each point Xi, there is another point Xj in the same cluster with dij ≤ 0.8 (assuming more than 1 point in the cluster). Also, no points in different clusters are closer than 0.8.
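The cut interpretation above is really a connected-components statement: the single-linkage clusters at height h are the components of the graph that joins points with dij ≤ h. A small self-contained Python sketch on toy 1-D data (not the slide's 60-point data set):

```python
def threshold_clusters(points, h):
    """Connected components of the graph with an edge when |x_i - x_j| <= h."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    # union every pair of points within distance h of each other
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= h:
                parent[find(i)] = find(j)
    comps = {}
    for i in range(len(points)):
        comps.setdefault(find(i), []).append(i)
    return sorted(comps.values())

# with h = 0.8 the points split into two chains
print(threshold_clusters([0.0, 0.7, 5.0, 5.5], 0.8))  # [[0, 1], [2, 3]]
```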
Complete linkage

In complete linkage (i.e. farthest-neighbor linkage), the dissimilarity between G, H is the largest dissimilarity between two points in different clusters:

d_complete(G, H) = max_{i ∈ G, j ∈ H} d_ij

Example: There are two clusters G and H (red and blue). The complete linkage distance (i.e. d_complete(G, H)) is the dissimilarity between the farthest pair (length of the black line segment).
[Scatterplot of two clusters with the farthest cross-cluster pair joined by a black segment labeled d(G, H)]
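The same kind of toy check for complete linkage (Python sketch, made-up dissimilarities): the only change from single linkage is a max in place of a min.

```python
def d_complete(G, H, d):
    # d_complete(G, H) = max over i in G, j in H of d_ij
    return max(d[i, j] for i in G for j in H)

# hypothetical dissimilarities between groups G = {0, 1} and H = {2, 3}
d = {(0, 2): 1.5, (0, 3): 0.4, (1, 2): 2.0, (1, 3): 0.9}
print(d_complete((0, 1), (2, 3), d))  # 2.0, the farthest cross-group pair
```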
Complete linkage example
Same data as before. Cutting the tree at h = 3.5 gives the clustering assignment.
[Complete-linkage dendrogram from hclust (*, "complete") and the corresponding scatterplot, colored by cluster]
Cut interpretation: For each point Xi, every other point Xj in the same cluster has dij ≤ 3.5.
Average linkage

In average linkage, the linkage distance between G, H is the average dissimilarity over all pairs of points in different clusters:

d_average(G, H) = (1 / (|G| · |H|)) Σ_{i ∈ G, j ∈ H} d_ij

Example: There are two clusters G and H (red and blue). The average linkage distance (i.e. d_average(G, H)) is the average dissimilarity between all points in different clusters (average of the lengths of the colored line segments).
[Scatterplot of two clusters with every cross-cluster pair joined by a colored segment]
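And the corresponding toy check for average linkage (Python sketch, made-up dissimilarities): a sum over all cross-group pairs, divided by the number of such pairs.

```python
def d_average(G, H, d):
    # d_average(G, H) = (1 / (|G| * |H|)) * sum of d_ij over cross pairs
    return sum(d[i, j] for i in G for j in H) / (len(G) * len(H))

# hypothetical dissimilarities between groups G = {0, 1} and H = {2, 3}
d = {(0, 2): 1.5, (0, 3): 0.4, (1, 2): 2.0, (1, 3): 0.9}
# average of the four cross-pair dissimilarities: 4.8 / 4 = 1.2
print(d_average((0, 1), (2, 3), d))
```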
Average linkage example
Same data as before. Cutting the tree at h = 1.75 gives the clustering assignment.
[Average-linkage dendrogram from hclust (*, "average") and the corresponding scatterplot, colored by cluster]
Cut interpretation: ??
Common properties
Single, complete, and average linkage share the following:
• They all operate on the dissimilarities dij. This means that the points we are clustering can be quite general (number of mutations on a genome, polygons, faces, whatever).
• Running agglomerative clustering with any of these linkages produces a dendrogram with no inversions.

No inversions means that the linkage distance between merged clusters only increases as we run the algorithm.

In other words, we can draw a proper dendrogram, where the height of a parent is always higher than the height of either daughter.
(We'll return to this again shortly)
Shortcomings of single and complete linkage
Single and complete linkage have practical problems:
Single linkage: Often suffers from chaining, that is, we only need a single pair of points to be close to merge two clusters. Therefore, clusters can be too spread out and not compact enough.

Complete linkage: Often suffers from crowding, that is, a point can be closer to points in other clusters than to points in its own cluster. Therefore, the clusters are compact, but not far enough apart.
Average linkage tries to strike a balance between these two.
Example of chaining and crowding

[Three scatterplots of the same 60 points: single linkage, complete linkage, and average linkage clusterings]
Shortcomings of average linkage
Average linkage isn’t perfect.
• It isn't clear what properties the resulting clusters have when we cut an average linkage tree.
• Results of average linkage clustering can change with a monotone increasing transformation of the dissimilarities (that is, if we changed the distances but maintained their ranking, the cluster solution could change).

Neither of these problems afflicts single or complete linkage.
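A tiny numeric illustration of the second point (made-up dissimilarities, Python): squaring every dij preserves their ranking, yet flips which merge average linkage prefers, while single linkage's choice is unchanged.

```python
# group G = {1, 2} is already formed; the candidate merges are
# G with {3}, or {3} with {4}
d13, d23, d34 = 1.0, 3.0, 2.1

# original dissimilarities: average linkage prefers merging G with {3}
print((d13 + d23) / 2 < d34)           # True  (2.0 < 2.1)

# squared dissimilarities: now it prefers merging {3} with {4}
print((d13**2 + d23**2) / 2 < d34**2)  # False (5.0 > 4.41)

# single linkage makes the same choice either way
print(min(d13, d23) < d34)             # True
print(min(d13, d23) ** 2 < d34**2)     # True
```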
Example of monotone increasing problem
[Four scatterplots of the 60 points: two clustered with average linkage and two with single linkage, one per dissimilarity]

Left: dij = ||Xi − Xj||2 (Euclidean distance). Right: dij = ||Xi − Xj||2^2 (squared Euclidean distance).
Hierarchical agglomerative clustering in R

There's an easy way to generate distance-based dissimilarities in R:

> x = c(1,2)
> y = c(2,1)
> dist(cbind(x,y), method='euclidean')
         1
2 1.414214
> dist(cbind(x,y), method='euclidean', diag=T, upper=T)
         1        2
1 0.000000 1.414214
2 1.414214 0.000000
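For comparison, the same pairwise Euclidean distance computed in Python (standard library only; a rough analogue of the dist call, not part of the slides):

```python
from math import dist  # Python 3.8+

x, y = (1, 2), (2, 1)   # the two rows of cbind(x, y) above
d12 = dist(x, y)        # Euclidean distance, sqrt(2)
print(round(d12, 6))    # 1.414214
```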
Hierarchical agglomerative clustering in R

The function hclust is in base R:

Delta = dist(x)
out.average = hclust(Delta, method='average')
plot(out.average)
abline(h=1.75, col='red')

[Average-linkage dendrogram with a red horizontal line at height 1.75]
cutree(out.average, k=3)
cutree(out.average, h=1.75)
Recap
Hierarchical agglomerative clustering: Start with all data points in their own groups, and repeatedly merge groups based on the linkage function. Stop when all points are in one group. (This is agglomerative; there is also divisive.)

This produces a sequence of clustering assignments, visualized by a dendrogram (i.e., a tree). Each node in the tree represents a group, and its height is proportional to the linkage distance of its daughters.

Three most common linkage functions: single, complete, and average linkage. Single linkage measures the least dissimilar pair between groups, complete linkage measures the most dissimilar pair, and average linkage measures the average dissimilarity over all pairs.
Each linkage has its strengths and weaknesses
Careful example
Careful Example
[Scatterplot of seven points and three dendrograms of them: single, complete, and average linkage]
Distance matrix (∆)
1 2 3 4 5 6 7
1 0.00 0.37 0.49 0.61 0.96 1.24 1.42
2 0.37 0.00 0.12 0.64 0.98 1.44 1.24
3 0.49 0.12 0.00 0.65 0.97 1.48 1.15
4 0.61 0.64 0.65 0.00 0.36 0.85 0.90
5 0.96 0.98 0.97 0.36 0.00 0.71 0.72
6 1.24 1.44 1.48 0.85 0.71 0.00 1.39
7 1.42 1.24 1.15 0.90 0.72 1.39 0.00
(All three merge {1} and {2, 3})
Single: 0.37
Complete: 0.49
Average: (0.37 + 0.49)/2 = 0.43

(Next agglomeration)
Single: merging {4, 5} & {1, 2, 3} at 0.61
Complete: merging {4, 5} & {6} at 0.85
Average: merging {4, 5} & {6} at (0.85 + 0.71)/2 = 0.78
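The first-step values quoted above can be checked directly from the distance matrix (Python; d(1,2) = 0.37 and d(1,3) = 0.49 are read off row 1):

```python
d12, d13 = 0.37, 0.49        # dissimilarities from point 1 to points 2 and 3

single   = min(d12, d13)     # 0.37
complete = max(d12, d13)     # 0.49
average  = (d12 + d13) / 2   # 0.43
print(single, complete, average)
```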
Another linkage

Centroid linkage is a commonly used and relatively new approach. Assume

• Xi ∈ Rp
• dij = ||Xi − Xj||2

Let XG and XH denote the group averages for G, H. Then

d_centroid(G, H) = ||XG − XH||2

Example: There are two clusters (red and blue). The centroid linkage distance (d_centroid(G, H)) is the distance between the centroids (black line segment).
[Scatterplot of two clusters with centroids XG and XH joined by a black segment]
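A minimal Python sketch of the definition, on toy coordinates (not the slide's data):

```python
from math import dist

def centroid_linkage(G, H):
    """Euclidean distance between the coordinate-wise group means."""
    mean = lambda pts: tuple(sum(c) / len(pts) for c in zip(*pts))
    return dist(mean(G), mean(H))

G = [(0.0, 0.0), (0.0, 2.0)]   # centroid (0, 1)
H = [(3.0, 0.0), (5.0, 2.0)]   # centroid (4, 1)
print(centroid_linkage(G, H))  # 4.0
```

Note the speed advantage over average linkage: each group is summarized by one point, rather than a sum over all cross-group pairs.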
Centroid linkage
Centroid linkage is
• ... quite intuitive
• ... widely used
• ... nicely analogous to K -means.
• ... very related to average linkage (and much, much faster)
However, it has a very unsavory feature: inversions.

An inversion occurs when an agglomeration happens at a smaller linkage distance than an earlier one, so the dendrogram height can decrease as merges proceed.
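A classic toy case (not from the slides): three points at the corners of an equilateral triangle with side 1. The first merge occurs at height 1, but the centroid of the merged pair sits only sqrt(3)/2 ≈ 0.87 from the remaining point, so the next merge would happen lower on the dendrogram, an inversion.

```python
from math import dist, sqrt

a, b, c = (0.0, 0.0), (1.0, 0.0), (0.5, sqrt(3) / 2)

first = dist(a, b)               # 1.0: height of the first merge
centroid_ab = (0.5, 0.0)         # centroid of the merged pair {a, b}
second = dist(centroid_ab, c)    # sqrt(3)/2: height of the second merge
print(second < first)            # True: the height decreases
```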
Centroid linkage example
Same data as before. We can't look at cutting the tree, but we can still look at a 3 cluster solution.
[Centroid-linkage dendrogram from hclust (*, "centroid") and the corresponding scatterplot of a 3-cluster solution]
Cut interpretation: Even if there are no inversions, there still is no cut interpretation.
Careful Example: Steps 1,2,3
[Scatterplot of the seven points and the centroid-linkage dendrogram from hclust (*, "centroid")]
Distance matrix (∆)
1 2 3 4 5 6 7
1 0.00 0.40 0.35 1.62 1.20 2.16 3.67
2 0.40 0.00 0.34 1.96 0.37 2.54 3.66
3 0.35 0.34 0.00 0.68 0.42 1.05 1.94
4 1.62 1.96 0.68 0.00 1.45 0.04 0.46
5 1.20 0.37 0.42 1.45 0.00 1.86 2.35
6 2.16 2.54 1.05 0.04 1.86 0.00 0.27
7 3.67 3.66 1.94 0.46 2.35 0.27 0.00
(This is squared Euclidean distance)
Centroid(4,6) = (1.68, 0.76)
Centroid(2,3) = (0.58, 1.25)
Squared centroid distance: 1.46
Careful Example: Step 4
[Scatterplot of the seven points, with merge heights 0.29, 0.31, and 0.36 marked, and the centroid-linkage dendrogram]
Distance matrix (∆)
1 2 3 4 5 6 7
1 0.00 0.40 0.35 1.62 1.20 2.16 3.67
2 0.40 0.00 0.34 1.96 0.37 2.54 3.66
3 0.35 0.34 0.00 0.68 0.42 1.05 1.94
4 1.62 1.96 0.68 0.00 1.45 0.04 0.46
5 1.20 0.37 0.42 1.45 0.00 1.86 2.35
6 2.16 2.54 1.05 0.04 1.86 0.00 0.27
7 3.67 3.66 1.94 0.46 2.35 0.27 0.00
(This is squared Euclidean distance)
Which one gets merged? ({1} and {2, 3})

method = 'centroid'
out = hclust(Delta, method=method)
rect.hclust(out, k=5,
    border=c('white','red','white','white','blue'))
Linkage Summary
Linkages summary

Linkage    No inversions?   Unchanged w/ monotone transformation?   Cut interpretation?   Notes
Single     yes              yes                                     yes                   chaining
Complete   yes              yes                                     yes                   crowding
Average    yes              no                                      no
Centroid   no               no                                      no                    inversions
Final notes:

• None of this helps determine which linkage is best
• Use the linkage that seems the most appropriate for the types of clusters you want to get
Designing a clever radio system
We have a lot of songs and dissimilarities between them (dij)
We want to build a clever radio system that takes a song specified by the user and produces a song of the "same" type.

We ask the user how "risky" he or she wants to be.

How can we use hierarchical clustering here, and with what linkage?
Linkages summary: Cut interpretations
Suppose we cut the tree at height h = 1.
Single: For each point Xi, there is another point Xj in the same cluster with dij ≤ 1 (assuming more than 1 point in the cluster). Also, no points in different clusters are closer than 1.

Complete: For each point Xi, every other point Xj in the same cluster has dij ≤ 1.
Data analysis example
Diffuse large B-cell lymphoma (DLBCL) is the most common type of non-Hodgkin's lymphoma.

It is clinically heterogeneous:

• 40% of patients respond well
• 60% of patients succumb to the disease

The researchers propose that this difference is due to unrecognized molecular heterogeneity in the tumors.

We examine the extent to which genomic-scale gene expression profiling can further the understanding of B-cell malignancies.
Data analysis example

Here, we have gene expression data at 2,000 genes for 62 cancer cells.

There are 3 cancer diagnoses: FL, CLL, DLBCL. Each corresponds to a type of malignant lymphoma.

We want to use hierarchical clustering to understand this data set better.

load('../data/alizadeh.RData')
genesT = alizadeh$x
genes = t(genesT)
Yfull = alizadeh$type
Y = as.vector(Yfull)
Y[Yfull == "DLBCL-A"] = 'DLBCL'
Y[Yfull == "DLBCL-G"] = 'DLBCL'
Y = as.factor(Y)
dist.mat = dist(genes)
PCA plot
[PCA plot of the cells on (PC1, PC2), colored by diagnosis: CLL, DLBCL, FL]
Two clear groups for FL and CLL. DLBCL somewhat appears to be 1 group, but it is much more diffuse.
PCA plot
[PCA plot of the cells on (PC1, PC2), colored by the four labels: CLL, DLBCL-A, DLBCL-G, FL]
Here are the two sub-types identified by the researchers. Let's look at their results further.
Four hierarchical cluster solutions
[Four dendrograms of the 62 samples, leaves labeled by diagnosis: single, complete, average, and centroid linkage]
Complete linkage: A closer look
[Complete-linkage dendrogram labeled by diagnosis, with 12 clusters boxed by rect.hclust]
out.com = hclust(dist.mat, method='complete')
plot(out.com, xlab='', main='', labels=Y)
rect.hclust(out.com, k=12)
out.cut = cutree(out.com, k=12)
Notice that FL and CLL are distinctly grouped, while there are many clusters inside the DLBCL type.
Data analysis example: Conclusion
We find evidence of more sub-types of distinct cancers within previously established diagnoses.

This result is quite interesting.

Clinically heterogeneous outcomes could be driven by inadequate clusterings.