DATA CLUSTERING
Dr. Alioune Ngom
School of Computer Science
University of Windsor
Winter 2013
Clustering
Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
Cluster analysis: grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications:
- to get insight into data
- as a preprocessing step
- we will use it for image segmentation
Why clustering?
A cluster is a group of related objects
In biological nets, a group of “related” genes/proteins
Application in PPI nets:
Protein function prediction
Protein complex identification
Are you familiar with Gene Ontology?
Clustering
Data clustering (Lecture 6) vs. Graph clustering
Data Clustering
Find relationships and patterns in the data
Get insights in underlying biology
Find groups of “similar” genes/proteins/samples
Deal with numerical values of biological data
They have many features (not just color)
Data Clustering
[Figure: clusters illustrating within-cluster homogeneity and between-cluster separation]
There are many possible distance metrics between objects.
Theoretical properties of a distance metric d:
- d(a,b) ≥ 0
- d(a,a) = 0
- d(a,b) = 0 ⟺ a = b
- d(a,b) = d(b,a) (symmetry)
- d(a,c) ≤ d(a,b) + d(b,c) (triangle inequality)
Example distances:
- Euclidean (L2) distance
- Manhattan (L1) distance
- Lm distance: (|x1−x2|^m + |y1−y2|^m)^(1/m)
- L∞ distance: max(|x1−x2|, |y1−y2|)
- Inner product: x1x2 + y1y2
- Correlation coefficient
For simplicity, we will concentrate on Euclidean distance.
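As a concrete illustration (not part of the original slides), a minimal NumPy sketch of these distances on two made-up 2-D points:

```python
import numpy as np

def minkowski(a, b, m):
    """L_m distance: (sum_i |a_i - b_i|^m)^(1/m)."""
    return np.sum(np.abs(a - b) ** m) ** (1.0 / m)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(a, b, 1))      # Manhattan (L1): 7.0
print(minkowski(a, b, 2))      # Euclidean (L2): 5.0
print(np.max(np.abs(a - b)))   # L-infinity: 4.0
print(a @ b)                   # inner product: 16.0
```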
Clustering Algorithms – Review
Distance/Similarity matrices:
- Clustering is based on distances, recorded in a distance/similarity matrix
- The matrix represents the distance between objects
- Only need half the matrix, since it is symmetric
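A small sketch (not from the slides) of building such a matrix with SciPy; the toy data X is made up:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(5, 3)                  # 5 objects with 3 features (toy data)
condensed = pdist(X, metric="euclidean")  # only the n(n-1)/2 upper-triangle entries
D = squareform(condensed)                 # full symmetric matrix, zeros on the diagonal
```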
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized; intra-cluster distances are minimized
Notion of a Cluster can be Ambiguous
How many clusters? The same points can be grouped into two, four, or six clusters.
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
[Figure: 8 contiguous clusters]
Types of Clusters: Density-Based
Density-based: a cluster is a dense region of points, which is separated from other regions of high density by low-density regions.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
[Figure: 6 density-based clusters]
Euclidean Density – Cell-based
The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points the cell contains.
Euclidean Density – Center-based
Euclidean density is the number of points within a specified radius of the point.
Data Structures in Clustering
Data matrix (two modes):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & \\ d(2,1) & 0 & & \\ d(3,1) & d(3,2) & 0 & \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & 0 \end{bmatrix}$$
Interval-valued variables
Standardize the data:
Calculate the mean absolute deviation
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
Calculate the standardized measurement (z-score)
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
Using the mean absolute deviation could be more robust than using the standard deviation.
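A small sketch (not from the slides) of this standardization in NumPy; the feature vector x is made up:

```python
import numpy as np

def standardize(x):
    """z-score of one feature using the mean absolute deviation s_f."""
    m = x.mean()                  # m_f
    s = np.mean(np.abs(x - m))    # mean absolute deviation, not the std
    return (x - m) / s

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(standardize(x))
```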
Similarity and Dissimilarity Between Objects
Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
Properties:
- d(i,j) ≥ 0
- d(i,j) = 0 iff i = j
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j)
Also one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.
Covariance Matrix
The set of 5 observations, measuring 3 variables, can be described by its mean vector and covariance matrix. The three variables, from left to right, are the length, width, and height of a certain object, for example. Each row vector X_row is another observation of the three variables (or components), for row = 1, …, 5.
The mean vector consists of the means of each variable. The covariance matrix consists of the variances of the variables along the main diagonal and the covariances between each pair of variables in the other matrix positions. Here 0.025 is the variance of the length variable, 0.0075 is the covariance between the length and the width variables, 0.00175 is the covariance between the length and the height variables, and 0.007 is the variance of the width variable.
$$S = \frac{1}{n-1}\sum_{row=1}^{n}(X_{row} - \bar{x})'(X_{row} - \bar{x}), \qquad s_{jk} = \frac{1}{n-1}\sum_{row=1}^{n}(X_{row,j} - \bar{x}_j)(X_{row,k} - \bar{x}_k)$$
where n = 5 for this example.
Mahalanobis Distance
$$mahalanobis(p,q) = (p - q)\,\Sigma^{-1}\,(p - q)^T$$
where Σ is the covariance matrix of the input data X:
$$\Sigma_{j,k} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$$
For the red points, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
Mahalanobis Distance
Covariance Matrix:
$$\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$$
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
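A quick NumPy check (not from the slides) that reproduces these values; note the slide's numbers are squared Mahalanobis distances, with no square root:

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
Sigma_inv = np.linalg.inv(Sigma)

def mahal2(p, q):
    """Squared Mahalanobis distance (p - q) Sigma^{-1} (p - q)^T."""
    d = p - q
    return d @ Sigma_inv @ d

A, B, C = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.5, 1.5])
print(mahal2(A, B))   # 5.0
print(mahal2(A, C))   # 4.0
```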
Cosine Similarity
If x1 and x2 are two document vectors, then
cos(x1, x2) = (x1 • x2) / (||x1|| ||x2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
Example:
x1 = 3 2 0 5 0 0 0 2 0 0
x2 = 1 0 0 0 0 0 0 1 0 2
x1 • x2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||x1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||x2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(x1, x2) = 0.3150
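Verifying the worked example in NumPy (not from the slides):

```python
import numpy as np

x1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
x2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cos = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))
print(cos)   # 5 / (6.481 * 2.449) ≈ 0.3150
```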
Correlation
Correlation measures the linear relationship between objects. To compute correlation, we standardize the data objects p and q, and then take their dot product:
$$p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$$
$$q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$$
$$\mathrm{correlation}(p,q) = p' \bullet q'$$
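A minimal sketch (not from the slides); dividing by the length makes it match Pearson correlation, because NumPy's std is the population standard deviation:

```python
import numpy as np

def correlation(p, q):
    """Standardize both objects, then take their (normalized) dot product."""
    ps = (p - p.mean()) / p.std()
    qs = (q - q.mean()) / q.std()
    return ps @ qs / len(p)

p = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([2.0, 4.0, 6.0, 8.0])
print(correlation(p, q))        # 1.0 for a perfectly linear relationship
print(np.corrcoef(p, q)[0, 1])  # agrees with NumPy's Pearson correlation
```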
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
k-means Clustering
An algorithm for partitioning (or clustering) N data points into K disjoint subsets S_j containing N_j data points so as to minimize the sum-of-squares criterion
$$J = \sum_{j=1}^{K} \sum_{n \in S_j} |x_n - \mu_j|^2$$
where x_n is a vector representing the nth data point and μ_j is the geometric centroid of the data points in S_j.
K-means Clustering – Details
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the cluster.
‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for common distance functions.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to ‘Until relatively few points change clusters’
Complexity is O( n * K * I * d )
n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
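A minimal NumPy implementation of the basic algorithm (a sketch, not the slides' code); it also reports the SSE discussed below:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Basic K-means: random initial centroids, Euclidean assignment."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # assign each point to the closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its points
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):   # most movement happens in early iterations
            break
        centroids = new
    sse = np.sum((X - centroids[labels]) ** 2)   # sum of squared error
    return labels, centroids, sse
```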
Two different K-means Clusterings
• Importance of choosing initial centroids
[Figure: the same original points clustered two ways: an optimal clustering and a sub-optimal clustering]
Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE):
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$$
For each point, the error is the distance to the nearest cluster; to get the SSE, we square these errors and sum them. Here x is a data point in cluster C_i and m_i is the representative point for cluster C_i; one can show that m_i corresponds to the center (mean) of the cluster.
Given two clusterings, we can choose the one with the smallest error.
One easy way to reduce the SSE is to increase K, the number of clusters; however, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
Solutions to Initial Centroids Problem
- Multiple runs: helps, but probability is not on your side
- Sample and use hierarchical clustering to determine initial centroids
- Select more than k initial centroids and then select among these initial centroids (e.g., the most widely separated)
- Postprocessing
- Bisecting K-means: not as susceptible to initialization issues
Handling Empty Clusters
- The basic K-means algorithm can yield empty clusters
Pre-processing and Post-processing
Pre-processing
Normalize the data
Eliminate outliers
Post-processing
Eliminate small clusters that may represent outliers
Split ‘loose’ clusters, i.e., clusters with relatively high SSE
Merge clusters that are 'close' and that have relatively low SSE
Bisecting K-means
Bisecting K-means algorithm: a variant of K-means that can produce a partitional or a hierarchical clustering (see the sketch below).
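A minimal sketch of the idea (not the slides' code), reusing the `kmeans` function from the earlier sketch; always bisecting the cluster with the largest SSE is one common policy:

```python
import numpy as np

def bisecting_kmeans(X, K, n_trials=5):
    """Repeatedly split the cluster with the largest SSE using 2-means."""
    clusters = [np.arange(len(X))]        # start with one all-inclusive cluster
    while len(clusters) < K:
        sse = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))      # cluster to bisect
        best_labels, best_sse = None, np.inf
        for trial in range(n_trials):     # keep the best of several 2-means runs
            labels, _, s = kmeans(X[idx], 2, seed=trial)
            if s < best_sse:
                best_labels, best_sse = labels, s
        clusters += [idx[best_labels == 0], idx[best_labels == 1]]
    return clusters
```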
Bisecting K-means Example
Limitations of K-means
K-means has problems when clusters have differing:
- sizes
- densities
- non-globular shapes
K-means also has problems when the data contains outliers.
Limitations of K-means: Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means: Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means: Non-globular Shapes
Original Points K-means (2 Clusters)
Overcoming K-means Limitations
Original Points K-means Clusters
One solution is to use many clusters: this finds parts of the natural clusters, which then need to be put together.
Overcoming K-means Limitations
Original Points K-means Clusters
Variations of the K-Means Method
A few variants of k-means differ in:
- selection of the initial k means
- dissimilarity calculations
- strategies to calculate cluster means
Handling categorical data: k-modes (Huang '98)
- replacing means of clusters with modes
- using new dissimilarity measures to deal with categorical objects
- using a frequency-based method to update modes of clusters
Handling a mixture of categorical and numerical data: the k-prototype method
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters.
PAM (Partitioning Around Medoids, 1987)
- starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
- works effectively for small data sets, but does not scale well to large data sets
CLARA (Kaufmann & Rousseeuw, 1990)
- draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
CLARANS (Ng & Han, 1994): randomized sampling
Focusing + spatial data structure (Ester et al., 1995)
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
[Figure: nested clusters over six points and the corresponding dendrogram]
Strengths of Hierarchical Clustering
No assumptions on the number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
Hierarchical clusterings may correspond to meaningful taxonomies, for example in the biological sciences (e.g., phylogeny reconstruction) or on the web (e.g., product catalogs).
Hierarchical Clustering
Two main types of hierarchical clustering:
Agglomerative:
- Start with the points as individual clusters
- At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
Divisive:
- Start with one, all-inclusive cluster
- At each step, split a cluster until each cluster contains a point (or there are k clusters)
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Complexity of hierarchical clustering
The distance matrix is used for deciding which clusters to merge/split.
At least quadratic in the number of data points; not usable for large datasets.
Agglomerative clustering algorithm
Most popular hierarchical clustering technique.
Basic algorithm:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the distance matrix
6. Until only a single cluster remains
The key operation is the computation of the distance between two clusters; different definitions of the distance between clusters lead to different algorithms (see the sketch below).
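In practice, SciPy implements this loop; a minimal sketch (the data here is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(10, 2)          # toy data: 10 points in 2-D
Z = linkage(X, method="single")    # also: "complete", "average", "ward"
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
```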
Input / Initial setting
Start with clusters of individual points and a distance/proximity matrix.
[Figure: points p1–p5 and their distance/proximity matrix]
Intermediate State
After some merging steps, we have some clusters.
[Figure: clusters C1–C5 and their distance/proximity matrix]
Intermediate State
Merge the two closest clusters (C2 and C5) and update the distance matrix.
[Figure: clusters C1–C5, with C2 and C5 about to be merged]
After Merging
"How do we update the distance matrix?"
[Figure: the updated matrix, with the distances from C2 ∪ C5 to C1, C3, and C4 marked '?']
Distance between two clusters
Each cluster is a set of points. How do we define the distance between two sets of points? There are lots of alternatives; it is not an easy task.
How to Define Inter-Cluster Similarity
[Figure: two clusters of points p1–p5 and their proximity matrix]
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward's Method uses squared error
Hierarchical Clustering: Comparison
[Figure: nested clusterings of six points produced by MIN, MAX, Group Average, and Ward's Method]
Distance between two clusters
The single-link distance between clusters Ci and Cj is the minimum distance between any object in Ci and any object in Cj; the distance is defined by the two most similar objects:
$$D_{sl}(C_i, C_j) = \min_{x,y}\{\, d(x,y) \mid x \in C_i,\ y \in C_j \,\}$$
Single-link clustering: example
Determined by one pair of points, i.e., by one link in the proximity graph.
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Single-link clustering: example
[Figure: nested clusters over six points and the corresponding single-link dendrogram]
Strengths of single-link clustering
Original Points Two Clusters
• Can handle non-elliptical shapes
Limitations of single-link clustering
Original Points Two Clusters
• Sensitive to noise and outliers
• It produces long, elongated clusters
Distance between two clusters
Complete-link distance between clusters Ci and Cj
is the maximum distance between any object in Ci
and any object in Cj
The distance is defined by the two most dissimilar
objects
jiyxjicl CyCxyxdCCD ,),(max, ,
Complete-link clustering: example
Distance between clusters is determined by the two most distant points in the different clusters.
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Complete-link clustering: example
[Figure: nested clusters over six points and the corresponding complete-link dendrogram]
Strengths of complete-link clustering
Original Points Two Clusters
• More balanced clusters (with equal diameter)
• Less susceptible to noise
Limitations of complete-link clustering
Original Points Two Clusters
• Tends to break large clusters
• All clusters tend to have the same diameter; small clusters are merged with larger ones
Distance between two clusters
The group average distance between clusters Ci and Cj is the average distance between any object in Ci and any object in Cj:
$$D_{avg}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{x \in C_i,\ y \in C_j} d(x,y)$$
Average-link clustering: example
Proximity of two clusters is the average of pairwise proximities between points in the two clusters.
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Average-link clustering: example
[Figure: nested clusters over six points and the corresponding average-link dendrogram]
Average-link clustering: discussion
A compromise between single and complete link.
Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular clusters.
Distance between two clusters
The centroid distance between clusters Ci and Cj is the distance between the centroid r_i of Ci and the centroid r_j of Cj:
$$D_{centroids}(C_i, C_j) = d(r_i, r_j)$$
Distance between two clusters
Ward's distance between clusters Ci and Cj is the difference between the total within-cluster sum of squares for the two clusters separately and the within-cluster sum of squares resulting from merging the two clusters into cluster Cij:
$$D_{w}(C_i, C_j) = \sum_{x \in C_{ij}} (x - r_{ij})^2 - \sum_{x \in C_i} (x - r_i)^2 - \sum_{x \in C_j} (x - r_j)^2$$
where r_i is the centroid of Ci, r_j the centroid of Cj, and r_ij the centroid of Cij.
Ward’s distance for clusters
Similar to group average and centroid distance
Less susceptible to noise and outliers
Biased towards globular clusters
Hierarchical analogue of k-means
Can be used to initialize k-means
Hierarchical Clustering: Time and Space Requirements
For a dataset X consisting of n points:
- O(n²) space: it requires storing the distance matrix
- O(n³) time in most cases: there are n steps, and at each step the size-n² distance matrix must be updated and searched
- Complexity can be reduced to O(n² log n) time for some approaches by using appropriate data structures
Divisive hierarchical clustering
Start with a single cluster composed of all data points
Split this into components
Continue recursively
Monothetic divisive methods split clusters using one variable/dimension at a time
Polythetic divisive methods make splits on the basis of all variables together
Any intercluster distance measure can be used
Computationally intensive, less widely used than agglomerative methods
Clustering Algorithms
Nearest neighbours "clustering"
Nearest neighbours "clustering": example.
Pros and cons:
1. No need to know the number of clusters to discover beforehand (different from k-means and hierarchical clustering).
2. We need to define the distance threshold.
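A minimal sketch of threshold-based nearest-neighbour clustering (not the slides' code; joining the first cluster found within the threshold is one simple policy):

```python
import numpy as np

def nn_clustering(X, threshold):
    """Assign each point to the first cluster that already contains a point
    within `threshold` of it; otherwise start a new cluster."""
    clusters = []                              # each cluster is a list of indices
    for i, x in enumerate(X):
        for c in clusters:
            if min(np.linalg.norm(x - X[j]) for j in c) <= threshold:
                c.append(i)
                break
        else:                                  # no cluster is close enough
            clusters.append([i])
    return clusters
```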
k-nearest neighbors "clustering": a classification algorithm, but we use the idea here to do clustering.
For each point v, create the cluster containing v and the top k closest points to v, e.g., based on Euclidean distance. Do this for all points v.
All of the clusters are of size k, but they can overlap.
The challenge: choosing k.
k-Nearest Neighbours (k-NN) Classification
An object is classified by a majority vote of its neighbors: it is assigned to the class most common amongst its k nearest neighbors.
Example: the test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles.
- If k = 3, it is classified to the second class (2 triangles vs. only 1 square).
- If k = 5, it is classified to the first class (3 squares vs. 2 triangles).
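A minimal k-NN classifier sketch (not from the slides); `y_train` is assumed to be a NumPy array of class labels:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]
```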
What is Classification?
The goal of data classification is to organize and categorize data into distinct classes.
- A model is first created based on the training data (learning).
- The model is then validated on the testing data.
- Finally, the model is used to classify new data: given the model, a class can be predicted for new data.
Applications: medical diagnosis, treatment effectiveness analysis, protein function prediction, interaction prediction, etc.
Classification = Learning the Model
What is Clustering?
There is no training data (objects are not labeled)
We need a notion of similarity or distance
Should we know a priori how many clusters exist?
Supervised and Unsupervised
Classification = Supervised approach
We know the class labels and the number of classes
Clustering = Unsupervised approach
We do not know the class labels and may not know the number of classes
Classification vs. Clustering
(we can compute it without the need of knowing the correct solution)
Model-based clustering
Assume the data is generated from k probability distributions.
Goal: find the distribution parameters.
Algorithm: Expectation Maximization (EM).
Output: distribution parameters and a soft assignment of points to clusters.
Model-based clustering
Assume k probability distributions with parameters (θ1,…, θk).
Given data X, compute (θ1,…, θk) such that Pr(X|θ1,…, θk) [the likelihood] or ln Pr(X|θ1,…, θk) [the log-likelihood] is maximized.
A point x ∈ X need not be generated by a single distribution; it can be generated by multiple distributions with some probability [soft clustering].
EM Algorithm
Initialize the k distribution parameters (θ1,…, θk); each distribution parameter corresponds to a cluster center.
Iterate between two steps:
- Expectation step: (probabilistically) assign points to clusters
- Maximization step: estimate the model parameters that maximize the likelihood for the given assignment of points
EM Algorithm
Initialize k cluster centers. Iterate between two steps.
Expectation step: assign points to clusters
$$\Pr(x_i \in C_k) = \frac{\Pr(x_i \mid C_k)}{\sum_j \Pr(x_i \mid C_j)}, \qquad w_k = \frac{1}{n}\sum_i \Pr(x_i \in C_k)$$
Maximization step: estimate model parameters
$$r_k = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i \Pr(x_i \in C_k)}{\sum_j \Pr(x_i \in C_j)}$$
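A minimal EM sketch for a 1-D Gaussian mixture (an illustration, not the slides' code; the data and initialization are made up, and the Gaussian normalizing constant is dropped since it cancels in the E-step):

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=50, seed=0):
    """EM for a 1-D mixture of k Gaussians (soft clustering)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k, replace=False)   # initial centers
    sigma = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each point, Pr(x_i in C_k)
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignment
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        w = nk / len(x)
    return w, mu, sigma, r
```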
MST: Divisive Hierarchical Clustering
Build the MST (Minimum Spanning Tree), as sketched below:
- Start with a tree that consists of any point
- In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not
- Add q to the tree and put an edge between p and q
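A minimal sketch (not the slides' code): Prim's construction as described above, followed by divisive clustering that drops the longest MST edges:

```python
import numpy as np

def mst_prim(X):
    """Grow the tree by repeatedly attaching the closest outside point."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    in_tree, edges = [0], []
    while len(in_tree) < n:
        out = [j for j in range(n) if j not in in_tree]
        p, q = min(((p, q) for p in in_tree for q in out), key=lambda e: D[e])
        edges.append((p, q, D[p, q]))
        in_tree.append(q)
    return edges

def mst_clusters(X, k):
    """Divisive clustering: drop the k-1 longest MST edges, keep components."""
    kept = sorted(mst_prim(X), key=lambda e: e[2])[:len(X) - k]
    parent = list(range(len(X)))               # union-find over the kept edges
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for p, q, _ in kept:
        parent[find(p)] = find(q)
    return [find(i) for i in range(len(X))]    # component label per point
```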
MST: Divisive Hierarchical Clustering
Use MST for constructing hierarchy of clusters
More on Hierarchical Clustering Methods
Major weaknesses of agglomerative clustering methods:
- they do not scale well: time complexity of at least O(n²), where n is the total number of objects
- they can never undo what was done previously
Integration of hierarchical with distance-based clustering:
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
- CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
- CHAMELEON (1999): hierarchical clustering using dynamic modeling
Density-Based Clustering Methods
Clustering based on density (a local cluster criterion), such as density-connected points.
Major features:
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as a termination condition
Several interesting studies:
- DBSCAN: Ester et al. (KDD'96)
- OPTICS: Ankerst et al. (SIGMOD'99)
- DENCLUE: Hinneburg & Keim (KDD'98)
- CLIQUE: Agrawal et al. (SIGMOD'98)
Graph-Based Clustering
Graph-based clustering uses the proximity graph:
- Start with the proximity matrix
- Consider each point as a node in a graph
- Each edge between two nodes has a weight, which is the proximity between the two points
- Initially the proximity graph is fully connected
- MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph
In the simplest case, clusters are connected components in the graph.
Graph-Based Clustering: Sparsification
Clustering may work better after sparsification. Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.
- The nearest neighbors of a point tend to belong to the same class as the point itself.
- This reduces the impact of noise and outliers and sharpens the distinction between clusters.
Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning, e.g., Chameleon and hypergraph-based clustering).
Sparsification in the Clustering Process
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters.
Why do we want to evaluate them?
- To avoid finding patterns in noise
- To compare clustering algorithms
- To compare two sets of clusters
- To compare two clusters
Clusters found in Random Data
[Figure: the same uniformly random points, and the clusters found in them by K-means, DBSCAN, and complete link]
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types.
External Index: used to measure the extent to which cluster labels match externally supplied class labels. Example: entropy
Internal Index: used to measure the goodness of a clustering structure without respect to external information. Example: sum of squared error (SSE)
Relative Index: used to compare two different clusterings or clusters. Often an external or internal index is used for this purpose, e.g., SSE or entropy
Sometimes these are referred to as criteria instead of indices
However, sometimes criterion is the general strategy and index is the numerical
measure that implements the criterion.
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are
Example: SSE
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
Example: between-cluster sum of squares
Cohesion is measured by the within-cluster sum of squares (WSS, i.e., SSE):
WSS = ∑i ∑x∈Ci (x − mi)²
Separation is measured by the between-cluster sum of squares (BSS):
BSS = ∑i |Ci| (m − mi)²
where |Ci| is the size of cluster i, mi is the centroid of Ci, and m is the overall mean
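As a minimal sketch (not from the slides; the helper name and array layout are illustrative), the two quantities follow directly from these definitions:

```python
# Sketch: cohesion (WSS) and separation (BSS) from the formulas above.
# `labels[i]` gives the cluster index of point X[i].
import numpy as np

def wss_bss(X, labels):
    X = np.atleast_2d(X.T).T                    # allow 1-D input
    m = X.mean(axis=0)                          # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                    # cluster centroid
        wss += ((Xc - mc) ** 2).sum()           # within-cluster scatter
        bss += len(Xc) * ((m - mc) ** 2).sum()  # between-cluster scatter
    return wss, bss

# The worked example on the next slide: points 1, 2, 4, 5.
X = np.array([1.0, 2.0, 4.0, 5.0])
print(wss_bss(X, np.array([0, 0, 0, 0])))  # K=1: (10.0, 0.0)
print(wss_bss(X, np.array([0, 0, 1, 1])))  # K=2: (1.0, 9.0)
```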
Internal Measures: Cohesion and Separation
Example: points 1, 2, 4, 5 on a line; the overall mean is m = 3, and for K = 2 the clusters {1, 2} and {4, 5} have centroids m1 = 1.5 and m2 = 4.5.

K=1 cluster:
WSS = (1−3)² + (2−3)² + (4−3)² + (5−3)² = 10
BSS = 4 × (3−3)² = 0
Total = 10 + 0 = 10

K=2 clusters:
WSS = (1−1.5)² + (2−1.5)² + (4−4.5)² + (5−4.5)² = 1
BSS = 2 × (3−1.5)² + 2 × (4.5−3)² = 9
Total = 1 + 9 = 10

Note that WSS + BSS equals the total sum of squares, which is constant regardless of K.
Internal Measures: Cohesion and Separation
A proximity graph based approach can also be used for
cohesion and separation.
Cluster cohesion is the sum of the weight of all links within a cluster.
Cluster separation is the sum of the weights between nodes in the cluster
and nodes outside the cluster.
[Figure: two panels illustrating cohesion (links within a cluster) and separation (links between clusters)]
Graph clustering 106
Overlapping terminology:
Clustering algorithm for graphs =
“Community detection” algorithm for networks
Community structure in networks =
Cluster structure in graphs
Partitioning vs. clustering
Overlap?
Graph clustering 107
Decompose a network into subnetworks based on
some topological properties
Usually we look for dense subnetworks
Graph clustering 108: why?
E.g., protein complexes in a PPI network
E.g., Nuclear Complexes 109
Graph clustering 110
Algorithms:
Exact: have proven solution quality and time complexity
Approximate: heuristics are used to make them efficient
Example algorithms:
Highly connected subgraphs (HCS)
Restricted neighborhood search clustering (RNSC)
Molecular Complex Detection (MCODE)
Markov Cluster Algorithm (MCL)
Highly connected subgraphs (HCS) 111
Definitions:
HCS - a subgraph with n nodes such that more than n/2 edges must be removed in order to disconnect it
A cut in a graph - partition of vertices into two non-overlapping sets
A multiway cut - partition of vertices into several disjoint sets
The cut-set - the set of edges whose end points are in different sets
Edges are said to be crossing the cut if they are in its cut-set
The size/weight of a cut - the number of edges crossing the cut
The HCS algorithm partitions the graph by finding the minimum graph cut and by repeating the process recursively until highly connected components (subgraphs) are found
Highly connected subgraphs (HCS) 112
HCS algorithm:
Input: graph G
Does G satisfy a stopping criterion?
If yes: it is declared a “kernel”
Otherwise, G is partitioned into two subgraphs, separated by a minimum weight edge cut
Recursively proceed on the two subgraphs
Output: list of kernels that are basis of possible clusters
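The recursion can be sketched with networkx as follows (an illustrative reading of the steps above, not the authors' implementation):

```python
# Sketch of the HCS recursion: split along minimum edge cuts until
# every remaining subgraph is highly connected.
import networkx as nx

def highly_connected(G):
    # Highly connected: more than n/2 edges must be removed to disconnect G.
    return nx.edge_connectivity(G) > G.number_of_nodes() / 2

def hcs(G):
    comps = list(nx.connected_components(G))
    if len(comps) > 1:                           # process each component separately
        return [k for c in comps for k in hcs(G.subgraph(c).copy())]
    if G.number_of_nodes() <= 2 or highly_connected(G):
        return [set(G.nodes())]                  # stopping criterion: a kernel
    H = G.copy()
    H.remove_edges_from(nx.minimum_edge_cut(G))  # split along a minimum edge cut
    return hcs(H)                                # recurse on the two sides
```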
Highly connected subgraphs (HCS) 114
Clusters satisfy two properties:
They are homogeneous, since the diameter of each cluster is at most 2 and each cluster is at least half as dense as a clique
They are well separated, since any non-trivial split by the algorithm happens on subgraphs that are likely to be of diameter at least 3
Running time complexity of the HCS algorithm:
Bounded by 2N · f(n,m)
N is the number of clusters found (often N << n)
f(n,m) is the time complexity of computing a minimum edge cut of a graph with n nodes and m edges
The fastest deterministic min edge cut algorithm for unweighted graphs has time complexity O(nm); for weighted graphs it is O(nm + n² log n)
More in survey chapter: N. Przulj, “Graph Theory Analysis of Protein-Protein Interactions,” a chapter in
“Knowledge Discovery in Proteomics,” edited by I. Jurisica and D. Wigle, CRC Press, 2005
Highly connected subgraphs (HCS) 115
Several heuristics used to speed it up
E.g., removing low degree nodes
If an input graph has many low-degree nodes (remember, biological networks have power-law degree distributions), one iteration of the minimum edge cut algorithm may only separate a single low-degree node from the rest of the graph, increasing computational cost while adding little information in terms of clustering
After clustering is over, singletons can be “adopted” by clusters, say by the cluster with which a singleton node has the most neighbors
Restricted neighborhood search clust. (RNSC) 116
RNSC algorithm - partitions the set of nodes in the network into
clusters by using a cost function to evaluate the partitioning
The algorithm starts with a random cluster assignment
It proceeds by reassigning nodes so as to improve the partition's score (i.e., lower its cost)
of partitions
At the same time, the algorithm keeps a list of already
explored partitions to avoid their reprocessing
Finally, the clusters are filtered based on their size, density and
functional homogeneity
A. D. King, N. Przulj and I. Jurisica, “Protein complex prediction via cost-based clustering,”
Bioinformatics, 20(17): 3013-3020, 2004.
Restricted neighborhood search clust. (RNSC) 117
A cost function to evaluate the partitioning:
Consider node v in G and clustering C of G
αv is the number of “bad connections” incident with v
A bad connection incident to v is an edge that exists between v and a node in a different cluster from v’s, or an edge that does not exist between v and a node u in the same cluster as v
The cost function is then:
Cn(G,C) = ½ ∑v∈V αv
There are other cost functions, too
Goal of each cost function: clustering in which the nodes of
a cluster are all connected to each other and there are no
other connections
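A direct, illustrative implementation of the naive cost function above (assuming `clusters` maps each node to its cluster id; not the authors' code):

```python
# Sketch: the naive cost Cn(G, C) = ½ Σv αv, counting "bad connections".
import networkx as nx

def naive_cost(G, clusters):
    cost = 0
    for v in G:
        same = [u for u in G if u != v and clusters[u] == clusters[v]]
        # edges leaving v's cluster ...
        bad = sum(1 for u in G[v] if clusters[u] != clusters[v])
        # ... plus missing edges inside v's cluster
        bad += sum(1 for u in same if not G.has_edge(u, v))
        cost += bad
    return cost / 2     # each bad pair is counted from both endpoints
```

The cost is zero exactly when every cluster is a clique with no edges to other clusters, matching the stated goal.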
Molecular Complex Detection (MCODE)
118 Step 1: node weighting
Based on the core clustering coefficient
Clustering coefficient of a node: the density of its neighborhood
A graph is called a “k-core” if the minimal degree in it is at least k
“Core clustering coefficient” of a node: the density of the highest k-core of its immediate neighborhood
It increases the weights of heavily interconnected graph regions while giving small weights to the less connected vertices, which are abundant in scale-free networks
Step 2: the algorithm traverses the weighted graph in a greedy fashion to isolate densely connected regions
Step 3: The post-processing step filters or adds proteins based on connectivity criteria
Implementation available as a Cytoscape plug-in: http://baderlab.org/Software/MCODE
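Step 1 can be sketched as follows (an illustrative reading of the weighting scheme, not the plugin's code; it assumes a simple graph without self-loops):

```python
# Sketch of MCODE's node weighting (step 1).
import networkx as nx

def core_clustering_weight(G, v):
    """weight(v) = k * density of the highest k-core of v's closed neighborhood."""
    N = G.subgraph(list(G[v]) + [v])        # immediate neighborhood incl. v
    k = max(nx.core_number(N).values())     # highest k-core present
    core = nx.k_core(N, k)
    n = core.number_of_nodes()
    density = core.number_of_edges() / (n * (n - 1) / 2) if n > 1 else 0.0
    return k * density                      # e.g., 8 * 0.8 = 6.4 in the example below
```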
Molecular Complex Detection (MCODE) 119
Example:
[Figure: an example PPI network of about 50 yeast proteins, including Pti1, Pta1, Ref2, Fip1, Yth1, Cft1, Cft2, Pfs2, Pap1, and Mpe1]
Input Network
[Figure: the same network as above, used as the algorithm's input]
Find neighbors of Pti1
[Figure: Pti1 and its neighbors Ref2, Pta1, Tif11, Fip1, Yth1, Pfs2, Cft1, Pap1, Cft2, Mpe1, Hca4, Yor179c]
Find highest k-core (8-core)
Removes low degree nodes in power-law networks
[Figure: the 8-core of the neighborhood: Ref2, Pta1, Pti1, Fip1, Yth1, Pfs2, Cft1, Pap1, Cft2, Mpe1]
Find graph density
Density = (number of edges) / (number of possible edges) = 44/55 = 0.8
[Figure: the same core subgraph]
Calculate score for Pti1
Score = highest k-core × density = 8 × 0.8 = 6.4
[Figure: core nodes shaded on a low-to-high score scale]
Repeat for entire network
[Figure: the full input network, with every node weighted by the same procedure]
Find dense regions:
Pick the highest scoring vertex
“Paint” outwards until a threshold score is reached (% of the seed node’s score)
[Figure: the painted region around the seed, shaded low-to-high by score: Ref2, Rna14, Pcf11, Pta1, Ysh1, Pti1, Fip1, Yth1, Pfs2, Cft1, Pap1, Cft2, Mpe1, Hca4, Ssu72, Yor179c]
Markov Cluster Algorithm (MCL) 127
Network flow
Imagine a graph as a network of interconnected pipes
Suppose water gets into one or more vertices (sources) from the outside, and can exit the network at certain other vertices (sinks)
Then, it will spread in the pipes and reach other nodes, until it exits at sinks
The capacities of the edges (i.e., how much the pipe can carry per unit time) and the input at the sources determine the amount of flow along every edge (i.e., how much each pipe actually carries) and the amount exiting at each sink
Markov Cluster Algorithm (MCL) 128
Graph power
The kth power of a graph G: a graph with the same set
of vertices as G and an edge between two vertices iff
there is a path of length at most k between them
The number of paths of length k between any two
nodes can be calculated by raising adjacency matrix of
G to the exponent k
Then, G’s kth power is defined as the graph whose adjacency matrix is given by the sum of the first k powers of the adjacency matrix:
A(Gᵏ) = A + A² + … + Aᵏ, with an edge (i, j) present iff the corresponding entry is nonzero (and i ≠ j)
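A small sketch of this definition (illustrative; A is a 0/1 numpy adjacency matrix):

```python
# Sketch: k-th graph power from the adjacency matrix, per the definition above.
import numpy as np

def graph_power(A, k):
    """Adjacency matrix of G^k: edge (i,j) iff a path of length <= k exists."""
    S = np.zeros_like(A)
    P = np.eye(A.shape[0], dtype=A.dtype)
    for _ in range(k):
        P = P @ A                   # P = A^i counts walks of length exactly i
        S = S + P                   # sum of the first k powers
    Ak = (S > 0).astype(int)        # keep reachability, not path counts
    np.fill_diagonal(Ak, 0)         # no self-loops
    return Ak
```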
Markov Cluster Algorithm (MCL) 129
[Figure: a graph G and its powers G² and G³]
Markov Cluster Algorithm (MCL) 130
The MCL algorithm simulates flow on a graph and computes its successive powers to increase the contrast between regions with high flow and regions with a low flow
This process can be shown to converge towards a partition of the graph into high-flow regions separated by regions of no flow
Very efficient for PPI networks
Brohee S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 2006, 7:488.
Vlasblom J, Wodak SJ: Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics 2009, 10:99.
Markov Cluster Algorithm (MCL) 131
Flow between different dense regions that are sparsely connected eventually
“evaporates,” showing cluster structure present in the input graph.
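The expansion/inflation loop at MCL's core can be sketched as follows (a minimal illustrative version with assumed parameter values; the cluster read-out at the end is one common heuristic, not the only one):

```python
# Sketch of MCL: expansion spreads flow, inflation strengthens strong currents.
import numpy as np

def mcl(A, e=2, r=2, iters=50):
    """A: 0/1 adjacency matrix; e: expansion power; r: inflation exponent."""
    M = A + np.eye(A.shape[0])                # add self-loops (standard MCL trick)
    M = M / M.sum(axis=0)                     # column-stochastic flow matrix
    for _ in range(iters):
        M = np.linalg.matrix_power(M, e)      # expansion: flow along longer paths
        M = M ** r                            # inflation: boost strong flows
        M = M / M.sum(axis=0)                 # renormalize columns
    # rows that retain flow act as cluster "attractors"
    clusters = {tuple(np.nonzero(row > 1e-6)[0]) for row in M}
    return [set(c) for c in clusters if c]
```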
Hierarchical, k-means… clustering 132
Of course, you can always cluster data using these methods with an appropriate topological distance measure:
Shortest path distances (produce many ties)
Czekanowski-Dice distance
Assigns the maximum distance value to two nodes having no common interactors
Assigns zero to nodes interacting with exactly the same set of neighbors
Forms clusters of nodes sharing a high percentage of edges
GDV-similarity
Do they satisfy all of the distance metric rules?
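As an illustration of the second measure, the Czekanowski-Dice distance can be computed directly from neighborhoods (a sketch; the convention of including a node in its own interactor set is assumed, so that identical neighborhoods give distance 0 and disjoint ones give 1):

```python
# Sketch: Czekanowski-Dice distance between two nodes of a graph.
import networkx as nx

def czekanowski_dice(G, u, v):
    Nu = set(G[u]) | {u}                         # u's interactors, incl. u itself
    Nv = set(G[v]) | {v}
    return len(Nu ^ Nv) / (len(Nu) + len(Nv))    # symmetric-difference ratio
```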
Correctness of methods 133
Clustering is used for making predictions:
E.g., protein function, involvement in disease, interaction
prediction
Other methods are used for classifying the data
(have disease or not) and making predictions
Have to evaluate the correctness of the predictions
made by the approach
A commonly used method for this is the ROC curve
Definitions (e.g., for PPIs):
A true positive (TP) interaction: an interaction exists in the cell and is discovered by an
experiment (biological or computational).
A true negative (TN) interaction: an interaction does not exist and is not discovered by an
experiment.
A false positive (FP) interaction: an interaction does not exist in the cell, but is discovered by an
experiment.
A false negative (FN) interaction: an interaction exists in the cell, but is not discovered by an
experiment.
If TP stands for true positives, FP for false positives, TN for true negatives, and FN for false negatives, then:
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Sensitivity measures the fraction of items out of all possible ones that truly exist in the biological system that our method successfully identifies (fraction of correctly classified existing items)
Specificity measures the fraction of the items out of all items that truly do not exist in the biological system for which our method correctly determines that they do not exist (fraction of
correctly classified non-existing items)
Thus, 1-Specificity measures the fraction of
all non-existing items in the system that are
incorrectly identified as existing
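These definitions translate directly into code (a sketch with illustrative counts):

```python
# Sketch: sensitivity and specificity from the four basic counts.
def rates(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)        # fraction of existing items found
    specificity = tn / (tn + fp)        # fraction of non-existing items rejected
    return sensitivity, specificity, 1 - specificity

print(rates(tp=90, tn=80, fp=20, fn=10))   # -> (0.9, 0.8, ~0.2)
```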
ROC Curve 136
Receiver Operating Characteristic (ROC) curves provide a standard measure of the ability of a test to correctly classify objects.
E.g., the biomedical field uses ROC curves extensively to assess the efficacy of diagnostic tests in discriminating between healthy and diseased individuals.
A ROC curve is a graphical plot of the true positive rate (sensitivity) vs. the false positive rate (1 − specificity) for a binary classifier system as its discrimination threshold is varied (see above for definitions).
It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test; the closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test. The area under the curve (AUC) is a measure of a test’s accuracy.
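A sketch of how an ROC curve and its AUC arise from a ranked list of scores (illustrative; in practice a library such as scikit-learn would be used):

```python
# Sketch: ROC points and AUC from scores, via the trapezoidal rule.
import numpy as np

def roc_points(scores, labels):
    order = np.argsort(-scores)              # sweep threshold from high to low
    labels = labels[order]
    tps = np.cumsum(labels)                  # true positives at each threshold
    fps = np.cumsum(1 - labels)              # false positives at each threshold
    tpr = tps / labels.sum()                 # sensitivity
    fpr = fps / (1 - labels).sum()           # 1 - specificity
    return np.r_[0, fpr], np.r_[0, tpr]      # start the curve at the origin

fpr, tpr = roc_points(np.array([.9, .8, .7, .6]), np.array([1, 1, 0, 1]))
print(np.trapz(tpr, fpr))                    # AUC (here 2/3)
```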
ROC curve example:
Embed nodes of a PPI network into 3-D Euclidean unit box
(use MDS – knowledge of MDS not required in this class, see reference in the footer, if interested)
Like in GEO, choose a radius r to determine node connectivity
Vary r between 0 and √3 (the diagonal of the box):
r = 0 gives a graph with no edges (TP = 0, FP = 0)
r = √3 gives a complete graph (all possible edges, FN = TN = 0)
For each r in [0, √3]: measure TP, TN, FP, FN
compute sensitivity and 1 − specificity
draw the point
The set of these points is the ROC curve
Note:
For r=0, sensitivity=0 and 1-specificity=0, since TP=0, FP=0 (no edges)
For r=sqrt(3), sensitivity=1 and 1-specificity=1 (or 100%), since FN=0, TN=0
D. J. Higham, M. Rasajski, N. Przulj, “Fitting a Geometric Graph to a Protein-Protein Interaction Network”, Bioinformatics, 24(8), 1093-1099, 2008.
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
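The sweep over r can be sketched as follows (illustrative; `coords` stands for the MDS embedding of the nodes, and `true_adj` for the PPI network's adjacency matrix):

```python
# Sketch: trace an ROC point for each radius r, comparing the r-graph's
# edges against the true PPI edges.
import numpy as np

def roc_for_radii(coords, true_adj, radii):
    D = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    iu = np.triu_indices(len(coords), k=1)      # each node pair once
    truth = true_adj[iu].astype(bool)
    points = []
    for r in radii:
        pred = D[iu] < r                        # predict an edge iff distance < r
        tp = np.sum(pred & truth); fp = np.sum(pred & ~truth)
        fn = np.sum(~pred & truth); tn = np.sum(~pred & ~truth)
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (1-spec, sens)
    return points
```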
Precision and recall 138
Information about true negatives is often not available
Precision - a measure of exactness
Recall - a measure of completeness
E.g., given that we produce n cancer gene predictions
Precision is the number of known cancer genes in our n predictions, divided by n
Recall is the number of known cancer genes in our n predictions divided by the total number of known cancer genes
F-score – measures test accuracy; the (balanced) F1 score is the harmonic mean of precision and recall (in [0, 1]): F1 = 2 · precision · recall / (precision + recall)
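A sketch of these measures for the cancer-gene example (the gene names, including the deliberately fake GENE_X, are illustrative):

```python
# Sketch: precision, recall, and F1 for a set of predictions.
def precision_recall_f1(predicted, known):
    hits = len(predicted & known)                # known genes among our predictions
    p = hits / len(predicted)                    # exactness
    r = hits / len(known)                        # completeness
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean, in [0, 1]
    return p, r, f1

print(precision_recall_f1({"TP53", "BRCA1", "GENE_X"}, {"TP53", "BRCA1", "KRAS"}))
# -> (0.666..., 0.666..., 0.666...)
```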
Hypergeometric distribution 139
Probability distribution that describes the number of successes in a sequence of n draws from a finite population of size N without replacement
For draws with replacement, use binomial distribution
N – the total number of objects (e.g., nodes in a network)
m – the number of objects out of N with a given “function” (color)
n – the number of draws from N (e.g., the size of a cluster)
k – the number of objects among the n drawn that have the given function
- The probability of drawing exactly i objects with the function is P(X = i) = C(m, i) · C(N−m, n−i) / C(N, n)
- To get the enrichment p-value for a cluster of size n, sum over i = k, k+1, …, min(n, m)
- In Matlab, use 1 − hygecdf(k−1, N, m, n), since hygecdf computes the probability of drawing at most k−1 objects with the given function
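The same p-value in Python, as a sketch (scipy's hypergeom takes the population size, the number of “marked” objects, and the number of draws; the numbers below are illustrative, not from the slides):

```python
# Sketch: enrichment p-value P(X >= k) = 1 - P(X <= k-1) via the survival function.
from scipy.stats import hypergeom

N, m, n, k = 1000, 40, 25, 5        # population, marked, cluster size, hits
p_value = hypergeom.sf(k - 1, N, m, n)
print(p_value)
```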