Cluster Validation
DESCRIPTION
Cluster validation assesses the quality and reliability of clustering results. Why validate? To avoid finding clusters formed by chance, to compare clustering algorithms, and to choose clustering parameters, e.g., the number of clusters in the K-means algorithm.
University at Buffalo The State University of New York
Cluster Validation
Cluster validation: assess the quality and reliability of clustering results.
Why validation?
To avoid finding clusters formed by chance
To compare clustering algorithms
To choose clustering parameters, e.g., the number of clusters in the K-means algorithm
Clusters found in Random Data
[Figure: four panels over the unit square: Random Points, and the clusters found by K-means, DBSCAN, and Complete Link in the same random data.]
Aspects of Cluster Validation
External index: comparing the clustering results to ground truth (externally known results).
Internal index: evaluating the quality of clusters without reference to external information, using only the data itself.
Statistical framework: determining the reliability of clusters, i.e., with what confidence we can say the clusters are not formed by chance.
Comparing to Ground Truth
Notation:
N: number of objects in the data set.
P = {P1, …, Pm}: the set of “ground truth” clusters.
C = {C1, …, Cn}: the set of clusters reported by a clustering algorithm.
The “incidence matrix” is N × N (both rows and columns correspond to objects):
Pij = 1 if Oi and Oj belong to the same “ground truth” cluster in P; Pij = 0 otherwise.
Cij = 1 if Oi and Oj belong to the same cluster in C; Cij = 0 otherwise.
External Index
A pair of data objects (Oi, Oj) falls into one of the following categories:
SS: Cij = 1 and Pij = 1 (agree)
DD: Cij = 0 and Pij = 0 (agree)
SD: Cij = 1 and Pij = 0 (disagree)
DS: Cij = 0 and Pij = 1 (disagree)
Rand index:
$$\text{Rand} = \frac{\text{Agree}}{\text{Agree} + \text{Disagree}} = \frac{|SS| + |DD|}{|SS| + |SD| + |DS| + |DD|}$$
The Rand index may be dominated by DD. The Jaccard coefficient excludes DD:
$$\text{Jaccard coefficient} = \frac{|SS|}{|SS| + |SD| + |DS|}$$
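Both indices are straightforward to compute from the two label vectors; a minimal sketch (the function and variable names here are my own):

```python
def pair_counts(labels_c, labels_p):
    """Count SS, SD, DS, DD over all object pairs (i < j).

    labels_c: cluster labels reported by the algorithm (C).
    labels_p: "ground truth" labels (P).
    """
    n = len(labels_c)
    ss = sd = ds = dd = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_c = labels_c[i] == labels_c[j]
            same_p = labels_p[i] == labels_p[j]
            if same_c and same_p:
                ss += 1          # agree: together in both
            elif same_c:
                sd += 1          # disagree: together in C only
            elif same_p:
                ds += 1          # disagree: together in P only
            else:
                dd += 1          # agree: apart in both
    return ss, sd, ds, dd

def rand_index(labels_c, labels_p):
    ss, sd, ds, dd = pair_counts(labels_c, labels_p)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_coefficient(labels_c, labels_p):
    ss, sd, ds, dd = pair_counts(labels_c, labels_p)
    return ss / (ss + sd + ds)
```

For example, `rand_index([0, 0, 1, 1], [0, 0, 0, 1])` gives 0.5 while the Jaccard coefficient of the same labelings is 0.25, showing how the score changes once the DD pairs no longer count as agreement.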
Internal Index
“Ground truth” may be unavailable. Use only the data to measure cluster quality.
Measure the “homogeneity” and “separation” of clusters:
SSE: sum of squared errors.
Correlation between the clustering results and the distance matrix.
Sum of Squared Error
Homogeneity is measured by the within-cluster sum of squares:
$$WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2$$
This is exactly the objective function of K-means.
Separation is measured by the between-cluster sum of squares:
$$BSS = \sum_i |C_i| \, (m - m_i)^2$$
where |Ci| is the size of cluster i, mi is the centroid of cluster i, and m is the centroid of the whole data set.
BSS + WSS = constant. A larger number of clusters tends to result in a smaller WSS.
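The two sums can be checked numerically; a sketch with NumPy (the helper name is my own):

```python
import numpy as np

def wss_bss(X, labels):
    """Within- and between-cluster sums of squares.

    X: (n_objects, n_features) array; labels: cluster id per object.
    """
    m = X.mean(axis=0)                          # centroid of the whole data set
    wss = bss = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)                    # centroid of cluster k
        wss += ((Xk - mk) ** 2).sum()           # homogeneity
        bss += len(Xk) * ((m - mk) ** 2).sum()  # separation, weighted by |Ck|
    return wss, bss

# Four 1-D points split into {1, 2} and {4, 5}: WSS = 1, BSS = 9.
X = np.array([[1.0], [2.0], [4.0], [5.0]])
wss, bss = wss_bss(X, np.array([0, 0, 1, 1]))
```

However the labels partition these points, wss + bss stays at 10, the total sum of squares about the overall centroid.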
Sum of Squared Error
Example: four one-dimensional points 1, 2, 4, 5; overall centroid m = 3. With K = 2, the clusters are {1, 2} and {4, 5} with centroids m1 = 1.5 and m2 = 4.5.

K = 1:
$$WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10$$
$$BSS = 4 \times (3-3)^2 = 0$$
Total = 10.

K = 2:
$$WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1$$
$$BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9$$
Total = 10.

K = 4:
$$WSS = (1-1)^2 + (2-2)^2 + (4-4)^2 + (5-5)^2 = 0$$
$$BSS = 1 \times (3-1)^2 + 1 \times (3-2)^2 + 1 \times (3-4)^2 + 1 \times (3-5)^2 = 10$$
Total = 10.
Sum of Squared Error
SSE can also be used to estimate the number of clusters.
[Figure: the SSE of K-means clusterings plotted against K (K = 2 to 30), alongside the corresponding data set; the knee of the SSE curve suggests a good choice for the number of clusters.]
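The knee heuristic is easy to reproduce with a small hand-rolled K-means (Lloyd's algorithm with random restarts; all names here are my own, and a library implementation would serve equally well):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_sse(X, k, n_iter=50, n_restarts=5):
    """Lloyd's algorithm with random restarts; returns the best (lowest) SSE."""
    best = np.inf
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(n_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(axis=0)
        best = min(best, ((X - centers[labels]) ** 2).sum())
    return best

# Three well-separated blobs: SSE should drop sharply up to K = 3
# and only marginally beyond it (the "knee" of the curve).
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2))
               for c in [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]])
sse = {k: kmeans_sse(X, k) for k in range(1, 7)}
```

On this data, sse[1] and sse[2] are orders of magnitude larger than sse[3], while sse[4] through sse[6] barely improve on it.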
Internal Measures: SSE
SSE curve for a more complicated data set
SSE of clusters found using K-means
Correlation with Distance Matrix
Distance matrix: Dij is the distance between objects Oi and Oj.
Incidence matrix: Cij = 1 if Oi and Oj belong to the same cluster, Cij = 0 otherwise.
Compute the correlation between the two matrices. Since both are symmetric, only n(n-1)/2 entries need to be calculated.
A strong negative correlation between distance and incidence indicates good clustering: objects in the same cluster should be close to each other.
Given the distance matrix D = {d11, d12, …, dnn} and the incidence matrix C = {c11, c12, …, cnn}, the correlation r between D and C is given by
$$r = \frac{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i,j=1}^{n} (d_{ij} - \bar{d})^2}\,\sqrt{\sum_{i,j=1}^{n} (c_{ij} - \bar{c})^2}}$$
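A sketch of this computation, restricted to the n(n-1)/2 distinct pairs (the function name is my own):

```python
import numpy as np

def distance_incidence_correlation(X, labels):
    """Pearson correlation between the distance matrix D and the
    incidence matrix C, over the n(n-1)/2 distinct object pairs."""
    n = len(X)
    iu = np.triu_indices(n, k=1)                 # upper triangle, i < j
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    C = (labels[:, None] == labels[None, :]).astype(float)
    return np.corrcoef(D[iu], C[iu])[0, 1]

# Two tight, well-separated clusters: same-cluster pairs have C = 1 and
# small distance, so the correlation is strongly negative.
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
r = distance_incidence_correlation(X, np.array([0] * 5 + [1] * 5))
```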
Measuring Cluster Validity Via Correlation
Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.
[Figure: two scatter plots in the unit square. Left: data with well-separated clusters, Corr = -0.9235. Right: random points, Corr = -0.5810.]
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect it visually.
[Figure: left, a scatter plot of well-separated clusters in the unit square; right, the 100 × 100 pairwise similarity matrix sorted by cluster label, showing bright diagonal blocks.]
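Sorting the rows and columns by cluster label is an argsort away; a sketch (the 1/(1 + distance) similarity is an assumed choice, not specified on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def sorted_similarity_matrix(X, labels):
    """Pairwise similarity matrix with rows/columns ordered by cluster label."""
    order = np.argsort(labels, kind="stable")    # group objects by cluster
    Xo = X[order]
    D = np.sqrt(((Xo[:, None, :] - Xo[None, :, :]) ** 2).sum(axis=2))
    return 1.0 / (1.0 + D)                       # similarity in (0, 1]

# Two tight blobs with labels given in reverse order: after sorting, the
# matrix shows bright diagonal blocks for well-separated clusters.
X = np.vstack([rng.normal((0, 0), 0.1, (20, 2)),
               rng.normal((5, 5), 0.1, (20, 2))])
S = sorted_similarity_matrix(X, np.array([1] * 20 + [0] * 20))
```

Plotting S as a heat map (e.g. with matplotlib's imshow) reproduces the block structure shown on the slide.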
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by K-means, and the corresponding sorted similarity matrix; the diagonal blocks are faint.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by complete link, and the corresponding sorted similarity matrix; again the block structure is weak.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figure: random points clustered by DBSCAN, and the corresponding sorted similarity matrix.]
Reliability of Clusters
We need a framework to interpret any measure. For example, if our evaluation measure has the value 10, is that good, fair, or poor?
Statistics provides such a framework for cluster validity: the more “atypical” a clustering result is compared to results on random data, the more likely it represents valid structure in the data.
Statistical Framework for SSE
Example: compare an SSE of 0.005 against that of three clusters found in random data. Histogram: SSE of three-cluster clusterings of 500 sets of random data points of size 100, distributed over the range 0.2 – 0.8 for x and y values.
[Figure: left, the histogram of SSE values over the 500 random data sets, concentrated between roughly 0.016 and 0.034; right, the clustered data set with SSE = 0.005, well below any SSE obtained on random data.]
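This comparison is easy to reproduce by Monte Carlo; a sketch (the absolute SSE scale depends on how it is normalized, so only the relative comparison with the observed value matters):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_sse(X, k=3, n_iter=20):
    """Plain Lloyd's algorithm; returns the final SSE."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return ((X - centers[labels]) ** 2).sum()

# Reference distribution: SSE of K = 3 clusterings of 500 random data
# sets, each of 100 points uniform over [0.2, 0.8] x [0.2, 0.8].
ref = [kmeans_sse(rng.uniform(0.2, 0.8, size=(100, 2))) for _ in range(500)]

# An observed SSE far below anything seen on random data has an
# empirical p-value of (essentially) zero.
observed = 0.005
p_value = np.mean([s <= observed for s in ref])
```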
Statistical Framework for Correlation
Correlation of incidence and distance matrices for the K-means clusterings of the following two data sets.
[Figure: the same two data sets as before, with Corr = -0.9235 and Corr = -0.5810, together with the histogram of correlations obtained from clusterings of random data.]
Hyper-geometric Distribution
Given that M of the N genes in the data set are associated with term T, if we randomly draw n genes from the data set, what is the probability that exactly m of the selected n genes are associated with T?
$$\Pr(m \mid N, M, n) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}$$
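With Python's math.comb this probability is a one-liner (the function name is my own):

```python
from math import comb

def hypergeom_pmf(m, N, M, n):
    """Pr(m | N, M, n): probability that exactly m of n genes drawn at
    random (without replacement) from N are associated with term T,
    when M of the N genes are associated with T."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)
```

For instance, with N = 10 genes of which M = 3 carry term T, drawing n = 2 genes yields exactly one T-gene with probability C(3,1)·C(7,1)/C(10,2) = 21/45.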
P-Value
Based on the hypergeometric distribution, the probability of having m or more genes associated with T among the n drawn from N can be calculated by summing the probabilities of having m, m+1, …, min(M, n) such genes. The p-value of over-representation is therefore
$$p = \sum_{i=m}^{\min(M,n)} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$
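The p-value follows by summing the tail of the distribution; a sketch (the function name is my own):

```python
from math import comb

def overrepresentation_pvalue(m, N, M, n):
    """p-value: probability of drawing at least m genes associated with
    term T when n genes are drawn at random from N, M of which have T."""
    return sum(comb(M, i) * comb(N - M, n - i)
               for i in range(m, min(M, n) + 1)) / comb(N, n)
```

If SciPy is available, scipy.stats.hypergeom.sf(m - 1, N, M, n) should give the same value, though note that SciPy's argument order (population size, number of successes, number of draws) differs from the notation used here.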