Cluster Validation
DESCRIPTION
Cluster validation assesses the quality and reliability of clustering results. Why validate? To avoid finding clusters formed by chance, to compare clustering algorithms, and to choose clustering parameters, e.g., the number of clusters in the K-means algorithm.
University at Buffalo The State University of New York
Cluster Validation
Cluster validation: assess the quality and reliability of clustering results.
Why validation?
To avoid finding clusters formed by chance
To compare clustering algorithms
To choose clustering parameters, e.g., the number of clusters in the K-means algorithm
Clusters found in Random Data
[Figure: four panels over the unit square: Random Points, and the clusters found by K-means, DBSCAN, and Complete Link in the same random data.]
Aspects of Cluster Validation
External index: comparing the clustering results to ground truth (externally known results).
Internal index: evaluating the quality of clusters without reference to external information, using only the data itself.
Statistical framework: determining the reliability of clusters, i.e., with what confidence we can say the clusters are not formed by chance.
Comparing to Ground Truth
Notation:
N: number of objects in the data set.
P = {P1, …, Pm}: the set of “ground truth” clusters.
C = {C1, …, Cn}: the set of clusters reported by a clustering algorithm.
The “incidence matrix” is N × N (both rows and columns correspond to objects):
Pij = 1 if Oi and Oj belong to the same “ground truth” cluster in P; Pij = 0 otherwise.
Cij = 1 if Oi and Oj belong to the same cluster in C; Cij = 0 otherwise.
External Index
A pair of data objects (Oi, Oj) falls into one of the following categories:
SS: Cij = 1 and Pij = 1 (agree)
DD: Cij = 0 and Pij = 0 (agree)
SD: Cij = 1 and Pij = 0 (disagree)
DS: Cij = 0 and Pij = 1 (disagree)
Rand index:
$$\text{Rand} = \frac{\text{Agree}}{\text{Agree} + \text{Disagree}} = \frac{|SS| + |DD|}{|SS| + |SD| + |DS| + |DD|}$$
The Rand index may be dominated by DD. The Jaccard coefficient excludes DD:
$$\text{Jaccard coefficient} = \frac{|SS|}{|SS| + |SD| + |DS|}$$
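Both indices are straightforward to compute from the two label vectors; a minimal sketch (the function and variable names here are my own):

```python
def pair_counts(labels_c, labels_p):
    """Count SS, SD, DS, DD over all object pairs (i < j).

    labels_c: cluster labels reported by the algorithm (C).
    labels_p: "ground truth" labels (P).
    """
    n = len(labels_c)
    ss = sd = ds = dd = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_c = labels_c[i] == labels_c[j]
            same_p = labels_p[i] == labels_p[j]
            if same_c and same_p:
                ss += 1          # agree: together in both
            elif same_c:
                sd += 1          # disagree: together in C only
            elif same_p:
                ds += 1          # disagree: together in P only
            else:
                dd += 1          # agree: apart in both
    return ss, sd, ds, dd

def rand_index(labels_c, labels_p):
    ss, sd, ds, dd = pair_counts(labels_c, labels_p)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_coefficient(labels_c, labels_p):
    ss, sd, ds, dd = pair_counts(labels_c, labels_p)
    return ss / (ss + sd + ds)
```

For example, `rand_index([0, 0, 1, 1], [0, 0, 0, 1])` gives 0.5 while the Jaccard coefficient of the same labelings is 0.25, showing how the score changes once the DD pairs no longer count as agreement.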
Internal Index
“Ground truth” may be unavailable. Use only the data to measure cluster quality.
Measure the “homogeneity” and “separation” of clusters:
SSE: sum of squared errors.
Correlation between the clustering results and the distance matrix.
Sum of Squared Error
Homogeneity is measured by the within-cluster sum of squares:
$$WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$$
This is exactly the objective function of K-means.
Separation is measured by the between-cluster sum of squares:
$$BSS = \sum_i |C_i| \, (m - m_i)^2$$
where |Ci| is the size of cluster i, mi is the centroid of cluster i, and m is the centroid of the whole data set.
BSS + WSS = constant. A larger number of clusters tends to result in a smaller WSS.
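The two sums can be checked numerically; a sketch with NumPy (the helper name is my own):

```python
import numpy as np

def wss_bss(X, labels):
    """Within- and between-cluster sums of squares.

    X: (n_objects, n_features) array; labels: cluster id per object.
    """
    m = X.mean(axis=0)                          # centroid of the whole data set
    wss = bss = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)                    # centroid of cluster k
        wss += ((Xk - mk) ** 2).sum()           # homogeneity
        bss += len(Xk) * ((m - mk) ** 2).sum()  # separation, weighted by |Ck|
    return wss, bss

# Four 1-D points split into {1, 2} and {4, 5}: WSS = 1, BSS = 9.
X = np.array([[1.0], [2.0], [4.0], [5.0]])
wss, bss = wss_bss(X, np.array([0, 0, 1, 1]))
```

However the labels partition these points, wss + bss stays at 10, the total sum of squares about the overall centroid.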
Sum of Squared Error
Example: four one-dimensional points 1, 2, 4, 5; overall centroid m = 3. With K = 2, the clusters are {1, 2} and {4, 5} with centroids m1 = 1.5 and m2 = 4.5.

K = 1:
$$WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$$
$$BSS = 4 \times (3-3)^2 = 0$$
Total = 10.

K = 2:
$$WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$$
$$BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$$
Total = 10.

K = 4:
$$WSS = (1-1)^2 + (2-2)^2 + (4-4)^2 + (5-5)^2 = 0$$
$$BSS = 1 \times (3-1)^2 + 1 \times (3-2)^2 + 1 \times (3-4)^2 + 1 \times (3-5)^2 = 10$$
Total = 10.
Sum of Squared Error
SSE can also be used to estimate the number of clusters.
[Figure: the SSE of K-means clusterings plotted against K (K = 2 to 30), alongside the corresponding data set; the knee of the SSE curve suggests a good choice for the number of clusters.]
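The knee heuristic is easy to reproduce with a small hand-rolled K-means (Lloyd's algorithm with random restarts; all names here are my own, and a library implementation would serve equally well):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_sse(X, k, n_iter=50, n_restarts=5):
    """Lloyd's algorithm with random restarts; returns the best (lowest) SSE."""
    best = np.inf
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(n_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(axis=0)
        best = min(best, ((X - centers[labels]) ** 2).sum())
    return best

# Three well-separated blobs: SSE should drop sharply up to K = 3
# and only marginally beyond it (the "knee" of the curve).
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2))
               for c in [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]])
sse = {k: kmeans_sse(X, k) for k in range(1, 7)}
```

On this data, sse[1] and sse[2] are orders of magnitude larger than sse[3], while sse[4] through sse[6] barely improve on it.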
Internal Measures: SSE
SSE curve for a more complicated data set
SSE of clusters found using K-means
Correlation with Distance Matrix
Distance matrix: Dij is the distance between objects Oi and Oj.
Incidence matrix: Cij = 1 if Oi and Oj belong to the same cluster, Cij = 0 otherwise.
Compute the correlation between the two matrices. Since both are symmetric, only n(n-1)/2 entries need to be calculated.
A strong negative correlation between distance and incidence indicates good clustering: objects in the same cluster should be close to each other.
Given the distance matrix D = {d11, d12, …, dnn} and the incidence matrix C = {c11, c12, …, cnn}, the correlation r between D and C is given by
$$r = \frac{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})^2}\,\sqrt{\sum_{i,j=1}^{n} (c_{ij} - \bar{c})^2}}$$
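A sketch of this computation, restricted to the n(n-1)/2 distinct pairs (the function name is my own):

```python
import numpy as np

def distance_incidence_correlation(X, labels):
    """Pearson correlation between the distance matrix D and the
    incidence matrix C, over the n(n-1)/2 distinct object pairs."""
    n = len(X)
    iu = np.triu_indices(n, k=1)                 # upper triangle, i < j
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    C = (labels[:, None] == labels[None, :]).astype(float)
    return np.corrcoef(D[iu], C[iu])[0, 1]

# Two tight, well-separated clusters: same-cluster pairs have C = 1 and
# small distance, so the correlation is strongly negative.
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
r = distance_incidence_correlation(X, np.array([0] * 5 + [1] * 5))
```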
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.
[Figure: two scatter plots in the unit square. Left: data with well-separated clusters, Corr = -0.9235. Right: random points, Corr = -0.5810.]
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect it visually.
[Figure: left, a scatter plot of well-separated clusters in the unit square; right, the 100 × 100 pairwise similarity matrix sorted by cluster label, showing bright diagonal blocks.]
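Sorting the rows and columns by cluster label is an argsort away; a sketch (the 1/(1 + distance) similarity is an assumed choice, not specified on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def sorted_similarity_matrix(X, labels):
    """Pairwise similarity matrix with rows/columns ordered by cluster label."""
    order = np.argsort(labels, kind="stable")    # group objects by cluster
    Xo = X[order]
    D = np.sqrt(((Xo[:, None, :] - Xo[None, :, :]) ** 2).sum(axis=2))
    return 1.0 / (1.0 + D)                       # similarity in (0, 1]

# Two tight blobs with labels given in reverse order: after sorting, the
# matrix shows bright diagonal blocks for well-separated clusters.
X = np.vstack([rng.normal((0, 0), 0.1, (20, 2)),
               rng.normal((5, 5), 0.1, (20, 2))])
S = sorted_similarity_matrix(X, np.array([1] * 20 + [0] * 20))
```

Plotting S as a heat map (e.g. with matplotlib's imshow) reproduces the block structure shown on the slide.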
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by K-means, and the corresponding sorted similarity matrix; the diagonal blocks are faint.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by complete link, and the corresponding sorted similarity matrix; again the block structure is weak.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by DBSCAN, and the corresponding sorted similarity matrix.]
Reliability of Clusters
We need a framework to interpret any measure. For example, if our evaluation measure has the value 10, is that good, fair, or poor?
Statistics provides such a framework for cluster validity: the more “atypical” a clustering result is compared to results on random data, the more likely it represents valid structure in the data.
Statistical Framework for SSE
Example: compare an SSE of 0.005 against that of three clusters found in random data. Histogram: SSE of three-cluster clusterings of 500 sets of random data points of size 100, distributed over the range 0.2 – 0.8 for x and y values.
[Figure: left, the histogram of SSE values over the 500 random data sets, concentrated between roughly 0.016 and 0.034; right, the clustered data set with SSE = 0.005, well below any SSE obtained on random data.]
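This comparison is easy to reproduce by Monte Carlo; a sketch (the absolute SSE scale depends on how it is normalized, so only the relative comparison with the observed value matters):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_sse(X, k=3, n_iter=20):
    """Plain Lloyd's algorithm; returns the final SSE."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return ((X - centers[labels]) ** 2).sum()

# Reference distribution: SSE of K = 3 clusterings of 500 random data
# sets, each of 100 points uniform over [0.2, 0.8] x [0.2, 0.8].
ref = [kmeans_sse(rng.uniform(0.2, 0.8, size=(100, 2))) for _ in range(500)]

# An observed SSE far below anything seen on random data has an
# empirical p-value of (essentially) zero.
observed = 0.005
p_value = np.mean([s <= observed for s in ref])
```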
Statistical Framework for Correlation
Correlation of incidence and distance matrices for the K-means clusterings of the following two data sets.
[Figure: the same two data sets as before, with Corr = -0.9235 and Corr = -0.5810, together with the histogram of correlations obtained from clusterings of random data.]
Hyper-geometric Distribution
Given that M of the N genes in the data set are associated with term T, if we randomly draw n genes from the data set, what is the probability that exactly m of the selected n genes are associated with T?
$$\Pr(m \mid N, M, n) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}$$
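With Python's math.comb this probability is a one-liner (the function name is my own):

```python
from math import comb

def hypergeom_pmf(m, N, M, n):
    """Pr(m | N, M, n): probability that exactly m of n genes drawn at
    random (without replacement) from N are associated with term T,
    when M of the N genes are associated with T."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)
```

For instance, with N = 10 genes of which M = 3 carry term T, drawing n = 2 genes yields exactly one T-gene with probability C(3,1)·C(7,1)/C(10,2) = 21/45.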
P-Value
Based on the hypergeometric distribution, the probability of having m or more genes associated with T among the n drawn from N can be calculated by summing the probabilities of having m, m+1, …, min(M, n) such genes. The p-value of over-representation is therefore
$$p = \sum_{i=m}^{\min(M,n)} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$
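The p-value follows by summing the tail of the distribution; a sketch (the function name is my own):

```python
from math import comb

def overrepresentation_pvalue(m, N, M, n):
    """p-value: probability of drawing at least m genes associated with
    term T when n genes are drawn at random from N, M of which have T."""
    return sum(comb(M, i) * comb(N - M, n - i)
               for i in range(m, min(M, n) + 1)) / comb(N, n)
```

If SciPy is available, scipy.stats.hypergeom.sf(m - 1, N, M, n) should give the same value, though note that SciPy's argument order (population size, number of successes, number of draws) differs from the notation used here.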